encoding_rs 0.8.33

A Gecko-oriented implementation of the Encoding Standard
Documentation
This document contains notes about various ideas that for one reason or another
are not being actively pursued.

## Next byte is non-ASCII after ASCII optimization

The current plan for a SIMD-accelerated inner loop for handling ASCII bytes
makes no use of the bit of information that if the buffers didn't end but the
ASCII loop exited, the next byte will not be an ASCII byte.

## Handling ASCII with table lookups when decoding single-byte to UTF-16

Both uconv and ICU outperform encoding_rs when decoding single-byte to UTF-16.
unconv doesn't even do anything fancy to manually unroll the loop (see below).
Both handle even the ASCII range using table lookup. That is, there's no branch
for checking if we're in the lower or upper half of the encoding.

However, adding SIMD acceleration for the ASCII half will likely be a bigger
win than eliminating the branch to decide ASCII vs. non-ASCII.

## Manual loop unrolling for single-byte encodings

ICU currently outperforms encoding_rs (by over x2!) when decoding a single-byte
encoding to UTF-16. This appears to be thanks to manually unrolling the
conversion loop by 16. See [ucnv_MBCSSingleToBMPWithOffsets][1].

[1]: https://ssl.icu-project.org/repos/icu/icu/tags/release-55-1/source/common/ucnvmbcs.cpp

Notably, none of the single-byte encodings have bytes that'd decode to the
upper half of BMP. Therefore, if the unmappable marker has the highest bit set
instead of being zero, the check for unmappables within a 16-character stride
can be done either by ORing the BMP characters in the stride together and
checking the high bit or by loading the upper halves of the BMP charaters
in a `u8x8` register and checking the high bits using the `_mm_movemask_epi8`
/ `pmovmskb` SSE2 instruction.

## After non-ASCII, handle ASCII punctuation without SIMD

Since the failure mode of SIMD ASCII acceleration involves wasted aligment
checks and a wasted SIMD read when the next code unit is non-ASCII and non-Latin
scripts have runs of non-ASCII even if ASCII spaces and punctuation is used,
consider handling the next two or three bytes following non-ASCII as non-SIMD
before looping back to the SIMD mode. Maybe move back to SIMD ASCII faster if
there's ASCII that's not space or punctuation. Maybe with the "space or
punctuation" check in place, this code can be allowed to be in place even for
UTF-8 and Latin single-byte (i.e. not having different code for Latin and
non-Latin single-byte).

## Prefer maintaining aligment

Instead of returning to acceleration directly after non-ASCII, consider
continuing to the alignment boundary without acceleration.

## Read from SIMD lanes instead of RAM (cache) when ASCII check fails

When the SIMD ASCII check fails, the data has already been read from memory.
Test whether it's faster to read the data by lane from the SIMD register than
to read it again from RAM (cache).

## Use Level 2 Hanzi and Level 2 Kanji ordering

These two are ordered by radical and then by stroke count, so in principle,
they should be mostly Unicode-ordered, although at least Level 2 Hanzi isn't
fully Unicode-ordered. Is "mostly" good enough for encode accelelation?

## Create a `divmod_94()` function

Experiment with a function that computes `(i / 94, i % 94)` more efficiently
than generic code.

## Align writes on Aarch64

On [Cortex-A57](https://stackoverflow.com/questions/45714535/performance-of-unaligned-simd-load-store-on-aarch64/45938112#45938112
), it might be a good idea to move the destination into 16-byte alignment.

## Unalign UTF-8 validation on Aarch64

Currently, Aarch64 runs the generic ALU UTF-8 validation code that aligns
reads. That's probably unnecessary on Aarch64. (SIMD was slower than ALU!)

## Table-driven UTF-8 validation

When there are at least four bytes left, read all four. With each byte
index into tables corresponding to magic values indexable by byte in
each position.

In the value read from the table indexed by lead byte, encode the
following in 16 bits: advance 2 bits (2, 3 or 4 bytes), 9 positional
bits one of which is set to indicate the type of lead byte (8 valid
types, in the 8 lowest bits, and invalid, ASCII would be tenth type),
and the mask for extracting the payload bits from the lead byte
(for conversion to UTF-16 or UTF-32).

In the tables indexable by the trail bytes, in each positions
corresponding byte the lead byte type, store 1 if the trail is
invalid given the lead and 0 if valid given the lead.

Use the low 8 bits of the of the 16 bits read from the first
table to mask (bitwise AND) one positional bit from each of the 
three other values. Bitwise OR the results together with the
bit that is 1 if the lead is invalid. If the result is zero,
the sequence is valid. Otherwise it's invalid.

Use the advance to advance. In the conversion to UTF-16 or
UTF-32 case, use the mast for extracting the meaningful
bits from the lead byte to mask them from the lead. Shift
left by 6 as many times as the advance indicates, etc.