reed-solomon-16 0.1.0

Reed-Solomon GF(2^16) erasure coding with O(n log n) complexity.

Algorithm documentation.

As I don't understand the algorithm fully myself,
I'll just document the parts which I do understand.

# Shard

- Reed-Solomon `GF(2^16)` erasure coding works on 16-bit elements ([`GfElement`]).
- A **shard** is a byte array which is interpreted as an array of [`GfElement`]s.

A naive implementation could e.g. require shards to be a multiple of **2 bytes**
and then interpret each byte pair as the low/high parts of a single [`GfElement`]:

```text
[ low_0, high_0, low_1, high_1, ...]
```

However, that approach isn't good for SIMD optimizations.
Instead, shards are required to be a multiple of **64 bytes**.
In each 64-byte block the first 32 bytes are the low parts of 32 [`GfElement`]s
and the last 32 bytes are the high parts of those same 32 [`GfElement`]s.

```text
[ low_0, low_1, ..., low_31, high_0, high_1, ..., high_31 ]
```

A shard then consists of one or more of these 64-byte blocks:

```text
// -------- first 64-byte block --------- | --------- second 64-byte block ---------- | ...
[ low_0, ..., low_31, high_0, ..., high_31, low_32, ..., low_63, high_32, ..., high_63, ... ]
```
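
For illustration, here is a minimal sketch of reading element `i` out of this
layout. It assumes [`GfElement`] is a `u16` and that the shard length has
already been validated; `gf_element` is a hypothetical helper, not part of
this crate's API.

```rust
/// Reads element `i` from a shard laid out in 64-byte blocks
/// (32 low bytes followed by 32 high bytes per block).
fn gf_element(shard: &[u8], i: usize) -> u16 {
    debug_assert_eq!(shard.len() % 64, 0);
    let block = i / 32; // each 64-byte block holds 32 elements
    let pos = i % 32;
    let low = shard[block * 64 + pos];
    let high = shard[block * 64 + 32 + pos];
    u16::from(low) | (u16::from(high) << 8)
}
```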

## Original shards and recovery shards

- The data which is going to be protected by Reed-Solomon erasure coding
  is split into equal-sized **original shards**.
    - **`original_count`** is the number of original shards.
- Additional **recovery shards** of the same size are then created,
  containing recovery data so that the original data can be fully restored
  from any set of **`original_count`** shards, original or recovery.
    - **`recovery_count`** is the number of recovery shards.

The algorithm supports any combination of
1 - 32768 original shards with 1 - 32768 recovery shards.
Up to 65535 original or recovery shards is also possible with the following limitations:

| `original_count` | `recovery_count` |
| ---------------- | ---------------- |
| `<= 2^16 - 2^n`  | `<= 2^n`         |
| `<= 61440`       | `<= 4096`        |
| `<= 57344`       | `<= 8192`        |
| `<= 49152`       | `<= 16384`       |
| **`<= 32768`**   | **`<= 32768`**   |
| `<= 16384`       | `<= 49152`       |
| `<= 8192`        | `<= 57344`       |
| `<= 4096`        | `<= 61440`       |
| `<= 2^n`         | `<= 2^16 - 2^n`  |
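
The table boils down to a simple rule: one count, rounded up to a power of two,
plus the other count must fit within the 2^16 positions of the field.
A minimal sketch of that check (`is_supported` is a hypothetical helper,
not this crate's API):

```rust
/// Checks whether a combination of shard counts is supported,
/// following the table above.
fn is_supported(original_count: usize, recovery_count: usize) -> bool {
    if original_count == 0 || recovery_count == 0 {
        return false;
    }
    if original_count > 65535 || recovery_count > 65535 {
        return false;
    }
    // High rate: `original_count <= 2^16 - 2^n` where `recovery_count <= 2^n`.
    let high = original_count + recovery_count.next_power_of_two() <= 65536;
    // Low rate: the mirror image of the high-rate condition.
    let low = recovery_count + original_count.next_power_of_two() <= 65536;
    high || low
}
```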

# Rate

Encoding and decoding both have two variations:

- **High rate** refers to having more original shards than recovery shards.
    - High rate must be used when there are over 32768 original shards.
    - High rate encoding uses **chunks** of `recovery_count.next_power_of_two()` shards.
- **Low rate** refers to having more recovery shards than original shards.
    - Low rate must be used when there are over 32768 recovery shards.
    - Low rate encoding uses **chunks** of `original_count.next_power_of_two()` shards.
- Because of padding, either rate can be used when there are
  at most 32768 original shards and at most 32768 recovery shards.
    - High rate and low rate are not [^1] compatible with each other,
      i.e. decoding must use the same rate that encoding used.
    - With multiple chunks the "correct" rate is generally faster in encoding
      and no slower in decoding.
    - With a single chunk the "wrong" rate is generally faster in decoding
      if `original_count` and `recovery_count` differ a lot.

[^1]: They seem to be compatible with a single chunk. However, I don't quite
    understand why, and I don't recommend relying on this.
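
Picking the rate under these rules is just a comparison; here is a sketch
(the `Rate` enum is hypothetical, and when both counts are at most 32768
and equal, either rate works):

```rust
/// Hypothetical helper picking the "correct" rate as described above.
enum Rate {
    High,
    Low,
}

fn choose_rate(original_count: usize, recovery_count: usize) -> Rate {
    if original_count > 32768 {
        Rate::High // mandatory: over 32768 original shards
    } else if recovery_count > 32768 {
        Rate::Low // mandatory: over 32768 recovery shards
    } else if original_count >= recovery_count {
        Rate::High // generally faster in encoding, no slower in decoding
    } else {
        Rate::Low
    }
}
```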

## Benchmarks

- These benchmarks are from `cargo bench rate`
  and use a similar setup to the [main benchmarks],
  except with the maximum possible shard loss.

| original : recovery | Chunks  | HighRateEncoder (MiB/s) | LowRateEncoder (MiB/s) | HighRateDecoder (MiB/s) | LowRateDecoder (MiB/s) |
| ------------------- | ------- | ----------------------- | ---------------------- | ----------------------- | ---------------------- |
| 1024 : 1024         | 1x 1024 | 175                     | 176                    | 76                      | 75                     |
| 1024 : 1025 (Low)   | 2x 1024 | 140                     | **153**                | 47                      | **59**                 |
| 1025 : 1024 (High)  | 2x 1024 | **152**                 | 132                    | **60**                  | 46                     |
| 1024 : 2048 (Low)   | 2x 1024 | 157                     | **169**                | 70                      | 70                     |
| 2048 : 1024 (High)  | 2x 1024 | **167**                 | 151                    | 69                      | 68                     |
| 1025 : 1025         | 1x 2048 | 125                     | 126                    | 44                      | 43                     |
| 1025 : 2048 (Low)   | 1x 2048 | 144                     | 144                    | **65** **!!!**          | 53                     |
| 2048 : 1025 (High)  | 1x 2048 | 144                     | 145                    | 53                      | **62** **!!!**         |
| 2048 : 2048         | 1x 2048 | 156                     | 157                    | 70                      | 69                     |

[main benchmarks]: crate#benchmarks

# Encoding

Encoding takes original shards as input and generates recovery shards.

## High rate encoding

- Encoding is done in **chunks** of `recovery_count.next_power_of_two()` shards.
- Original shards are split into chunks and the last chunk
  is padded with zero-filled shards if needed.
- Recovery shards fit into a single chunk,
  which is padded with unused shards if needed.
- The recovery chunk is generated with the following algorithm:

```text
recovery_chunk = FFT(
    IFFT(original_chunk_0, skew_0) xor
    IFFT(original_chunk_1, skew_1) xor
    ...
)
```

This is implemented in [`HighRateEncoder`].
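
As a rough structural sketch (not the crate's actual implementation): the
FFT/IFFT transforms live in the engine, so they are taken as closures here,
and only the chunking and xor-accumulation are spelled out. `Shard` and
`Chunk` are placeholder types.

```rust
type Shard = Vec<u8>;
type Chunk = Vec<Shard>;

/// Sketch of high rate encoding: xor-accumulate the IFFT of every
/// original chunk (each with its own skew), then run a single FFT.
fn encode_high_rate(
    original_chunks: &[Chunk],
    ifft: impl Fn(&Chunk, usize) -> Chunk, // (chunk, skew index) -> transform
    fft: impl Fn(&Chunk) -> Chunk,
) -> Chunk {
    let shard_bytes = original_chunks[0][0].len();
    let chunk_size = original_chunks[0].len();
    let mut acc: Chunk = vec![vec![0u8; shard_bytes]; chunk_size];
    for (skew, chunk) in original_chunks.iter().enumerate() {
        // IFFT(original_chunk_i, skew_i), xor'ed into the accumulator.
        let transformed = ifft(chunk, skew);
        for (dst, src) in acc.iter_mut().zip(transformed.iter()) {
            for (d, s) in dst.iter_mut().zip(src.iter()) {
                *d ^= *s;
            }
        }
    }
    // recovery_chunk = FFT(accumulated chunk)
    fft(&acc)
}
```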

## Low rate encoding

- Encoding is done in **chunks** of `original_count.next_power_of_two()` shards.
- Original shards fit into a single chunk,
  which is padded with zero-filled shards if needed.
- Recovery shards are generated in chunks and the last chunk
  is padded with unused shards if needed.
- Recovery chunks are generated with the following algorithm:

```text
recovery_chunk_0 = FFT( IFFT(original_chunk), skew_0 )
recovery_chunk_1 = FFT( IFFT(original_chunk), skew_1 )
...
```

This is implemented in [`LowRateEncoder`].
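
The corresponding sketch for low rate, reusing the `Chunk` alias and closure
assumptions from the high rate sketch above: the IFFT of the single original
chunk is computed once, and each recovery chunk is then one FFT with its own
skew.

```rust
/// Sketch of low rate encoding: one IFFT, then one FFT per recovery chunk.
fn encode_low_rate(
    original_chunk: &Chunk,
    recovery_chunk_count: usize,
    ifft: impl Fn(&Chunk) -> Chunk,
    fft: impl Fn(&Chunk, usize) -> Chunk, // (chunk, skew index) -> recovery
) -> Vec<Chunk> {
    // IFFT(original_chunk) is shared by every recovery chunk.
    let transformed = ifft(original_chunk);
    // recovery_chunk_i = FFT(transformed, skew_i)
    (0..recovery_chunk_count)
        .map(|skew| fft(&transformed, skew))
        .collect()
}
```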

# Decoding

**TODO**


[`GfElement`]: crate::engine::GfElement
[`HighRateEncoder`]: crate::rate::HighRateEncoder
[`LowRateEncoder`]: crate::rate::LowRateEncoder