//! [![github]](https://github.com/Boshen/unicode-id-start) [![crates-io]](https://crates.io/crates/unicode-id-start) [![docs-rs]](https://docs.rs/unicode-id-start)
//!
//! [github]: https://img.shields.io/badge/github-8da0cb?style=for-the-badge&labelColor=555555&logo=github
//! [crates-io]: https://img.shields.io/badge/crates.io-fc8d62?style=for-the-badge&labelColor=555555&logo=rust
//! [docs-rs]: https://img.shields.io/badge/docs.rs-66c2a5?style=for-the-badge&labelColor=555555&logo=docs.rs
//!
//! <br>
//!
//! Implementation of [Unicode Standard Annex #31][tr31] for determining which
//! `char` values are valid in programming language identifiers.
//!
//! [tr31]: https://www.unicode.org/reports/tr31/
//!
//! This crate is a better optimized implementation of the older `unicode-id`
//! crate. This crate uses less static storage, and is able to classify both
//! ASCII and non-ASCII codepoints with better performance, 2&ndash;10&times;
//! faster than `unicode-id`.
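//!
//! For example, a lexer might combine the two functions exported by this crate
//! like this (a minimal sketch; `is_identifier` is illustrative and not part of
//! the API):
//!
//! ```rust
//! // Scan a string as an identifier: ID_Start for the first character,
//! // ID_Continue for the rest.
//! fn is_identifier(s: &str) -> bool {
//!     let mut chars = s.chars();
//!     match chars.next() {
//!         Some(first) if unicode_id_start::is_id_start(first) => {
//!             chars.all(unicode_id_start::is_id_continue)
//!         }
//!         _ => false,
//!     }
//! }
//!
//! assert!(is_identifier("Москва"));
//! assert!(!is_identifier("x²"));
//! ```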
//!
//! <br>
//!
//! ## Comparison of performance
//!
//! The following table shows a comparison between five Unicode identifier
//! implementations.
//!
//! - `unicode-id-start` is this crate, a fork of [`unicode-ident`];
//! - [`unicode-id`] is a fork of [`unicode-xid`], a widely used crate run by the
//!   "unicode-rs" org;
//! - `ucd-trie` and `fst` are two data structures supported by the
//!   [`ucd-generate`] tool;
//! - [`roaring`] is a Rust implementation of Roaring bitmap.
//!
//! The *static storage* column shows the total size of `static` tables that the
//! crate bakes into your binary, measured in 1000s of bytes.
//!
//! The remaining columns show the **cost per call** to evaluate whether a
//! single `char` has the ID\_Start or ID\_Continue Unicode property,
//! comparing across different ratios of ASCII to non-ASCII codepoints in the
//! input data.
//!
//! [`unicode-ident`]: https://github.com/dtolnay/unicode-ident
//! [`unicode-xid`]: https://github.com/unicode-rs/unicode-xid
//! [`unicode-id`]: https://github.com/Boshen/unicode-id
//! [`ucd-generate`]: https://github.com/BurntSushi/ucd-generate
//! [`roaring`]: https://github.com/RoaringBitmap/roaring-rs
//!
//! | | static storage | 0% nonascii | 1% | 10% | 100% nonascii |
//! |---|---|---|---|---|---|
//! | **`unicode-id-start`** | 9.68 K | 0.96 ns | 0.95 ns | 1.09 ns | 1.55 ns |
//! | **`unicode-id`** | 11.23 K | 1.88 ns | 2.14 ns | 3.48 ns | 15.63 ns |
//! | **`ucd-trie`** | 9.93 K | 1.29 ns | 1.28 ns | 1.36 ns | 2.15 ns |
//! | **`fst`** | 131 K | 55.1 ns | 54.9 ns | 53.2 ns | 28.5 ns |
//! | **`roaring`** | 66.1 K | 2.78 ns | 3.09 ns | 3.37 ns | 4.70 ns |
//!
//! Source code for the benchmark is provided in the *bench* directory of this
//! repo and may be repeated by running `cargo criterion`.
//!
//! <br>
//!
//! ## Comparison of data structures
//!
//! #### unicode-id
//!
//! The `unicode-id` crate uses a sorted array of character ranges, and does a
//! binary search to look up whether a given character lands inside one of those
//! ranges.
//!
//! ```rust
//! # const _: &str = stringify! {
//! static ID_Continue_table: [(char, char); 763] = [
//!     ('\u{30}', '\u{39}'),  // 0-9
//!     ('\u{41}', '\u{5a}'),  // A-Z
//! # "
//!     …
//! # "
//!     ('\u{e0100}', '\u{e01ef}'),
//! ];
//! # };
//! ```
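//!
//! Membership is then decided by a binary search over those inclusive ranges.
//! A sketch of that lookup (illustrative only; `lookup` is a hypothetical name,
//! not `unicode-id`'s actual code):
//!
//! ```rust
//! // Binary search a sorted array of inclusive (low, high) ranges.
//! fn lookup(table: &[(char, char)], ch: char) -> bool {
//!     table
//!         .binary_search_by(|&(low, high)| {
//!             if high < ch {
//!                 core::cmp::Ordering::Less
//!             } else if low > ch {
//!                 core::cmp::Ordering::Greater
//!             } else {
//!                 core::cmp::Ordering::Equal
//!             }
//!         })
//!         .is_ok()
//! }
//! ```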
//!
//! The static storage used by this data structure scales with the number of
//! contiguous ranges of identifier codepoints in Unicode. Every table entry
//! consumes 8 bytes, because it consists of a pair of 32-bit `char` values.
//!
//! In some ranges of the Unicode codepoint space, this is quite a sparse
//! representation &ndash; there are some ranges where tens of thousands of
//! adjacent codepoints are all valid identifier characters. In other places,
//! the representation is quite inefficient. A character like `µ` (U+00B5)
//! which is surrounded by non-identifier codepoints consumes 64 bits in the
//! table, while it would be just 1 bit in a dense bitmap.
//!
//! On a system with 64-byte cache lines, binary searching the table touches 7
//! cache lines on average. Each cache line fits only 8 table entries.
//! Additionally, the branching performed during the binary search is probably
//! mostly unpredictable to the branch predictor.
//!
//! Overall, the crate ends up being about 10&times; slower on non-ASCII input
//! compared to the fastest crate.
//!
//! A potential improvement would be to pack the table entries more compactly.
//! Rust's `char` type is a 21-bit integer padded to 32 bits, which means every
//! table entry is holding 22 bits of wasted space, adding up to 3.9 K. They
//! could instead fit every table entry into 6 bytes, leaving out some of the
//! padding, for a 25% improvement in space used. With some cleverness it may be
//! possible to fit in 5 bytes or even 4 bytes by storing a low char and an
//! extent, instead of low char and high char. I don't expect that performance
//! would improve much but this could be the most efficient for space across all
//! the libraries, needing only about 7 K to store.
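//!
//! A sketch of what such a packed entry could look like (hypothetical; nothing
//! like this exists in `unicode-id` today):
//!
//! ```rust
//! // Two 21-bit codepoints stored in 3 little-endian bytes each:
//! // 6 bytes per entry instead of 8.
//! #[repr(C)]
//! struct PackedRange {
//!     low: [u8; 3],
//!     high: [u8; 3],
//! }
//! ```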
//!
//! #### ucd-trie
//!
//! Their data structure is a compressed trie set specifically tailored for
//! Unicode codepoints. The design is credited to Raph Levien in
//! [rust-lang/rust#33098].
//!
//! [rust-lang/rust#33098]: https://github.com/rust-lang/rust/pull/33098
//!
//! ```rust
//! pub struct TrieSet {
//!     tree1_level1: &'static [u64; 32],
//!     tree2_level1: &'static [u8; 992],
//!     tree2_level2: &'static [u64],
//!     tree3_level1: &'static [u8; 256],
//!     tree3_level2: &'static [u8],
//!     tree3_level3: &'static [u64],
//! }
//! ```
//!
//! It represents codepoint sets using a trie to achieve prefix compression. The
//! final states of the trie are embedded in leaves or "chunks", where each
//! chunk is a 64-bit integer. Each bit position of the integer corresponds to
//! whether a particular codepoint is in the set or not. These chunks are not
//! just a compact representation of the final states of the trie, but are also
//! a form of suffix compression. In particular, if multiple ranges of 64
//! contiguous codepoints have the same Unicode properties, then they all map to
//! the same chunk in the final level of the trie.
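//!
//! The final step of a lookup therefore reduces to indexing one chunk and
//! testing one bit. A simplified sketch of that step (illustrative; the real
//! `ucd-trie` traversal first walks the upper trie levels to find the chunk
//! index):
//!
//! ```rust
//! // Given the index of a 64-bit leaf chunk, test one codepoint's bit.
//! fn in_leaf_chunk(leaves: &[u64], chunk_index: usize, cp: u32) -> bool {
//!     leaves[chunk_index] >> (cp % 64) & 1 != 0
//! }
//! ```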
//!
//! Being tailored for Unicode codepoints, this trie is partitioned into three
//! disjoint sets: tree1, tree2, tree3. The first set corresponds to codepoints
//! \[0, 0x800), the second \[0x800, 0x10000) and the third \[0x10000,
//! 0x110000). These partitions conveniently correspond to the space of 1 or 2
//! byte UTF-8 encoded codepoints, 3 byte UTF-8 encoded codepoints and 4 byte
//! UTF-8 encoded codepoints, respectively.
//!
//! Lookups in this data structure are significantly more efficient than binary
//! search. A lookup touches either 1, 2, or 3 cache lines based on which of the
//! trie partitions is being accessed.
//!
//! One possible performance improvement would be for `ucd-trie` to expose a way
//! to query based on a UTF-8 encoded string, returning the Unicode property
//! corresponding to the first character in the string. Without such an API, the
//! caller is required to tokenize their UTF-8 encoded input data into `char`,
//! hand the `char` into `ucd-trie`, only for `ucd-trie` to undo that work by
//! converting back into the variable-length representation for trie traversal.
//!
//! #### fst
//!
//! Uses a [finite state transducer][fst]. This representation is built into
//! [ucd-generate] but I am not aware of any advantage over the `ucd-trie`
//! representation. In particular `ucd-trie` is optimized for storing Unicode
//! properties while `fst` is not.
//!
//! [fst]: https://github.com/BurntSushi/fst
//! [ucd-generate]: https://github.com/BurntSushi/ucd-generate
//!
//! As far as I can tell, the main thing that causes `fst` to have large size
//! and slow lookups for this use case relative to `ucd-trie` is that it does
//! not specialize for the fact that only 21 of the 32 bits in a `char` are
//! meaningful. There are some dense arrays in the structure with large ranges
//! that could never possibly be used.
//!
//! #### roaring
//!
//! The `roaring` crate is a pure-Rust implementation of [Roaring Bitmap], a data
//! structure designed for storing sets of 32-bit unsigned integers.
//!
//! [Roaring Bitmap]: https://roaringbitmap.org/about/
//!
//! Roaring bitmaps are compressed bitmaps which tend to outperform conventional
//! compressed bitmaps such as WAH, EWAH or Concise. In some instances, they can
//! be hundreds of times faster and they often offer significantly better
//! compression.
//!
//! In this use case the performance was reasonably competitive but still
//! substantially slower than the Unicode-optimized crates. Meanwhile the
//! compression was significantly worse, requiring 6&times; as much storage for
//! the data structure.
//!
//! I also benchmarked the [`croaring`] crate which is an FFI wrapper around the
//! C reference implementation of Roaring Bitmap. This crate was consistently
//! about 15% slower than pure-Rust `roaring`, which could just be FFI overhead.
//! I did not investigate further.
//!
//! [`croaring`]: https://crates.io/crates/croaring
//!
//! #### unicode-id-start
//!
//! This crate is most similar to the `ucd-trie` library, in that it's based on
//! bitmaps stored in the leaves of a trie representation, achieving both prefix
//! compression and suffix compression.
//!
//! The key differences are:
//!
//! - Uses a single 2-level trie, rather than 3 disjoint partitions of different
//!   depth each.
//! - Uses significantly larger chunks: 512 bits rather than 64 bits.
//! - Compresses the ID\_Start and ID\_Continue properties together
//!   simultaneously, rather than duplicating identical trie leaf chunks across
//!   the two.
//!
//! The following diagrams show the ID\_Start and ID\_Continue Unicode boolean
//! properties in uncompressed form, in row-major order:
//!
//! <table>
//! <tr><th>ID_Start</th><th>ID_Continue</th></tr>
//! <tr>
//! <td><img alt="ID_Start bitmap" width="256" src="https://user-images.githubusercontent.com/1940490/168647353-c6eeb922-afec-49b2-9ef5-c03e9d1e0760.png"></td>
//! <td><img alt="ID_Continue bitmap" width="256" src="https://user-images.githubusercontent.com/1940490/168647367-f447cca7-2362-4d7d-8cd7-d21c011d329b.png"></td>
//! </tr>
//! </table>
//!
//! Uncompressed, these would take 140 K to store, which is beyond what would be
//! reasonable. However, as you can see there is a large degree of similarity
//! between the two bitmaps and across the rows, which lends well to
//! compression.
//!
//! This crate stores one 512-bit "row" of the above bitmaps in the leaf level
//! of a trie, and a single additional level to index into the leaves. It turns
//! out there are 124 unique 512-bit chunks across the two bitmaps so 7 bits are
//! sufficient to index them.
//!
//! The chunk size of 512 bits is selected as the size that minimizes the total
//! size of the data structure. A smaller chunk, like 256 or 128 bits, would
//! achieve better deduplication but require a larger index. A larger chunk
//! would increase redundancy in the leaf bitmaps. 512 bit chunks are the
//! optimum for total size of the index plus leaf bitmaps.
//!
//! In fact since there are only 124 unique chunks, we can use an 8-bit index
//! with a spare bit to index at the half-chunk level. This achieves an
//! additional 8.5% compression by eliminating redundancies between the second
//! half of any chunk and the first half of any other chunk. Note that this is
//! not the same as using chunks which are half the size, because it does not
//! necessitate raising the size of the trie's first level.
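//!
//! Putting the pieces together, a lookup reads one index byte from the trie and
//! one byte from the leaf bitmaps. A sketch of the arithmetic, mirroring the
//! functions at the bottom of this file (with `chunk_size` standing in for
//! `CHUNK`, the 64-byte row size):
//!
//! ```rust
//! // trie: one 8-bit entry per 512-bit row, addressing the leaf bytes at
//! // half-chunk (32-byte) granularity. leaf: the deduplicated bitmap bytes.
//! fn lookup(trie: &[u8], leaf: &[u8], chunk_size: usize, ch: char) -> bool {
//!     let row = ch as usize / 8 / chunk_size;          // which 512-bit row
//!     let start = trie[row] as usize * chunk_size / 2; // row's start in leaf bytes
//!     let byte = start + ch as usize / 8 % chunk_size; // byte within that row
//!     leaf[byte] >> (ch as u32 % 8) & 1 != 0           // bit within the byte
//! }
//! ```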
//!
//! In contrast to binary search or the `ucd-trie` crate, performing lookups in
//! this data structure is straight-line code with no need for branching.

#![no_std]
#![doc(html_root_url = "https://docs.rs/unicode-id-start/1.1.0")]
#![allow(clippy::doc_markdown, clippy::must_use_candidate)]

#[rustfmt::skip]
mod tables;

use crate::tables::{ASCII_CONTINUE, ASCII_START, CHUNK, LEAF, TRIE_CONTINUE, TRIE_START};

/// Check whether `ch` has the Unicode ID_Start property, with an ASCII fast path.
#[inline]
pub fn is_id_start(ch: char) -> bool {
    if ch.is_ascii() {
        return ASCII_START.0[ch as usize];
    }
    is_id_start_unicode(ch)
}

/// Check whether `ch` has the ID_Start property using the Unicode tables only (no ASCII fast path).
#[inline]
pub fn is_id_start_unicode(ch: char) -> bool {
    let chunk = *TRIE_START.0.get(ch as usize / 8 / CHUNK).unwrap_or(&0);
    let offset = chunk as usize * CHUNK / 2 + ch as usize / 8 % CHUNK;
    unsafe { LEAF.0.get_unchecked(offset) }.wrapping_shr(ch as u32 % 8) & 1 != 0
}

/// Check whether `ch` has the Unicode ID_Continue property, with an ASCII fast path.
#[inline]
pub fn is_id_continue(ch: char) -> bool {
    if ch.is_ascii() {
        return ASCII_CONTINUE.0[ch as usize];
    }
    is_id_continue_unicode(ch)
}

/// Check whether `ch` has the ID_Continue property using the Unicode tables only (no ASCII fast path).
#[inline]
pub fn is_id_continue_unicode(ch: char) -> bool {
    let chunk = *TRIE_CONTINUE.0.get(ch as usize / 8 / CHUNK).unwrap_or(&0);
    let offset = chunk as usize * CHUNK / 2 + ch as usize / 8 % CHUNK;
    unsafe { LEAF.0.get_unchecked(offset) }.wrapping_shr(ch as u32 % 8) & 1 != 0
}
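
// A few illustrative sanity checks (a sketch, not the crate's test suite;
// expected values follow the ID_Start / ID_Continue definitions in UAX #31).
#[cfg(test)]
mod tests {
    use super::{is_id_continue, is_id_start};

    #[test]
    fn ascii_and_non_ascii() {
        assert!(is_id_start('a'));
        assert!(!is_id_start('1')); // digits may continue, but not start, an identifier
        assert!(is_id_continue('1'));
        assert!(is_id_start('λ')); // U+03BB GREEK SMALL LETTER LAMDA
        assert!(!is_id_continue(' '));
    }
}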