1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372
/*! Crate `fst` is a library for efficiently storing and searching ordered sets or maps where the keys are byte strings. A key design goal of this crate is to support storing and searching *very large* sets or maps (i.e., billions). This means that much effort has gone in to making sure that all operations are memory efficient. Sets and maps are represented by a finite state machine, which acts as a form of compression on common prefixes and suffixes in the keys. Additionally, finite state machines can be efficiently queried with automata (like regular expressions or Levenshtein distance for fuzzy queries) or lexicographic ranges. To read more about the mechanics of finite state transducers, including a bibliography for algorithms used in this crate, see the docs for the [`raw::Fst`](raw/struct.Fst.html) type. # Installation Simply add a corresponding entry to your `Cargo.toml` dependency list: ```plain [dependencies] fst = "0.4" ``` The examples in this documentation will show the rest. # Overview of types and modules This crate provides the high level abstractions---namely sets and maps---in the top-level module. The `set` and `map` sub-modules contain types specific to sets and maps, such as range queries and streams. The `raw` module permits direct interaction with finite state transducers. Namely, the states and transitions of a transducer can be directly accessed with the `raw` module. */ #![cfg_attr( feature = "levenshtein", doc = r##" # Example: fuzzy query This example shows how to create a set of strings in memory, and then execute a fuzzy query. Namely, the query looks for all keys within an edit distance of `1` of `foo`. (Edit distance is the number of character insertions, deletions or substitutions required to get from one string to another. In this case, a character is a Unicode codepoint.) This requires the `levenshtein` feature in this crate to be enabled. It is not enabled by default. ```rust use fst::{IntoStreamer, Streamer, Set}; use fst::automaton::Levenshtein; # fn main() { example().unwrap(); } fn example() -> Result<(), Box<dyn std::error::Error>> { // A convenient way to create sets in memory. let keys = vec!["fa", "fo", "fob", "focus", "foo", "food", "foul"]; let set = Set::from_iter(keys)?; // Build our fuzzy query. let lev = Levenshtein::new("foo", 1)?; // Apply our fuzzy query to the set we built. let mut stream = set.search(lev).into_stream(); let keys = stream.into_strs()?; assert_eq!(keys, vec!["fo", "fob", "foo", "food"]); Ok(()) } ``` **Warning**: Levenshtein automatons use a lot of memory The construction of Levenshtein automatons should be consider "proof of concept" quality. Namely, they do just enough to be *correct*. But they haven't had any effort put into them to be memory conscious. Note that an error will be returned if a Levenshtein automaton gets too big (tens of MB in heap usage). "## )] /*! # Example: stream to a file and memory map it for searching This shows how to create a `MapBuilder` that will stream construction of the map to a file. Notably, this will never store the entire transducer in memory. Instead, only constant memory is required during construction. For the search phase, we use the [`memmap`](https://crates.io/memmap) crate to make the file available as a `&[u8]` without necessarily reading it all into memory (the operating system will automatically handle that for you). ```rust,no_run # fn example() -> Result<(), fst::Error> { use std::fs::File; use std::io; use fst::{IntoStreamer, Streamer, Map, MapBuilder}; use memmap::Mmap; // This is where we'll write our map to. let mut wtr = io::BufWriter::new(File::create("map.fst")?); // Create a builder that can be used to insert new key-value pairs. let mut build = MapBuilder::new(wtr)?; build.insert("bruce", 1).unwrap(); build.insert("clarence", 2).unwrap(); build.insert("stevie", 3).unwrap(); // Finish construction of the map and flush its contents to disk. build.finish()?; // At this point, the map has been constructed. Now we'd like to search it. // This creates a memory map, which enables searching the map without loading // all of it into memory. let mmap = unsafe { Mmap::map(&File::open("map.fst")?)? }; let map = Map::new(mmap)?; // Query for keys that are greater than or equal to clarence. let mut stream = map.range().ge("clarence").into_stream(); let kvs = stream.into_str_vec()?; assert_eq!(kvs, vec![ ("clarence".to_owned(), 2), ("stevie".to_owned(), 3), ]); # Ok(()) # } # example().unwrap(); ``` # Example: case insensitive search We can perform case insensitive search on a set using a regular expression. We can use the [`regex-automata`](https://docs.rs/regex-automata) crate to compile a regular expression into an automaton: ```ignore use fst::{IntoStreamer, Set}; use regex_automata::dense; // regex-automata crate with 'transducer' feature fn main() -> Result<(), Box<dyn std::error::Error>> { let set = Set::from_iter(&["FoO", "Foo", "fOO", "foo"])?; let pattern = r"(?i)foo"; // Setting 'anchored' is important, otherwise the regex can match anywhere // in the key. This would cause the regex to iterate over every key in the // FST set. let dfa = dense::Builder::new().anchored(true).build(pattern).unwrap(); let keys = set.search(&dfa).into_stream().into_strs()?; assert_eq!(keys, vec!["FoO", "Foo", "fOO", "foo"]); println!("{:?}", keys); Ok(()) } ``` Note that for this to work, the `regex-automata` crate must be compiled with the `transducer` feature enabled: ```toml [dependencies] fst = "0.4" regex-automata = { version = "0.1.9", features = ["transducer"] } ``` # Example: searching multiple sets efficiently Since queries can search a transducer without reading the entire data structure into memory, it is possible to search *many* transducers very quickly. This crate provides efficient set/map operations that allow one to combine multiple streams of search results. Each operation only uses memory proportional to the number of streams. The example below shows how to find all keys that start with `B` or `G`. The example below uses sets, but the same operations are available on maps too. ```rust use fst::automaton::{Automaton, Str}; use fst::set; use fst::{IntoStreamer, Set, Streamer}; # fn main() { example().unwrap(); } fn example() -> Result<(), Box<dyn std::error::Error>> { let set1 = Set::from_iter(&["AC/DC", "Aerosmith"])?; let set2 = Set::from_iter(&["Bob Seger", "Bruce Springsteen"])?; let set3 = Set::from_iter(&["George Thorogood", "Golden Earring"])?; let set4 = Set::from_iter(&["Kansas"])?; let set5 = Set::from_iter(&["Metallica"])?; // Create the matcher. We can reuse it to search all of the sets. let matcher = Str::new("B") .starts_with() .union(Str::new("G").starts_with()); // Build a set operation. All we need to do is add a search result stream // for each set and ask for the union. (Other operations, like intersection // and difference are also available.) let mut stream = set::OpBuilder::new() .add(set1.search(&matcher)) .add(set2.search(&matcher)) .add(set3.search(&matcher)) .add(set4.search(&matcher)) .add(set5.search(&matcher)) .union(); // Now collect all of the keys. Alternatively, you could build another set // here using `SetBuilder::extend_stream`. let mut keys = vec![]; while let Some(key) = stream.next() { keys.push(String::from_utf8(key.to_vec())?); } assert_eq!(keys, vec![ "Bob Seger", "Bruce Springsteen", "George Thorogood", "Golden Earring", ]); Ok(()) } ``` # Memory usage An important advantage of using finite state transducers to represent sets and maps is that they can compress very well depending on the distribution of keys. The smaller your set/map is, the more likely it is that it will fit into memory. If it's in memory, then searching it is faster. Therefore, it is important to do what we can to limit what actually needs to be in memory. This is where automata shine, because they can be queried in their compressed state without loading the entire data structure into memory. This means that one can store a set/map created by this crate on disk and search it without actually reading the entire set/map into memory. This use case is served well by *memory maps*, which lets one assign the entire contents of a file to a contiguous region of virtual memory. Indeed, this crate encourages this mode of operation. Both sets and maps can be constructed from anything that provides an `AsRef<[u8]>` implementation, which any memory map should. This is particularly important for long running processes that use this crate, since it enables the operating system to determine which regions of your finite state transducers are actually in memory. Of course, there are downsides to this approach. Namely, navigating a transducer during a key lookup or a search will likely follow a pattern approximating random access. Supporting random access when reading from disk can be very slow because of how often `seek` must be called (or, in the case of memory maps, page faults). This is somewhat mitigated by the prevalence of solid state drives where seek time is eliminated. Nevertheless, solid state drives are not ubiquitous and it is possible that the OS will not be smart enough to keep your memory mapped transducers in the page cache. In that case, it is advisable to load the entire transducer into your process's memory (e.g., calling `Set::new` with a `Vec<u8>`). # Streams Searching a set or a map needs to provide some way to iterate over the search results. Idiomatic Rust calls for something satisfying the `Iterator` trait to be used here. Unfortunately, this is not possible to do efficiently because the `Iterator` trait does not permit values emitted by the iterator to borrow from the iterator. Borrowing from the iterator is required in our case because keys and values are constructed *during iteration*. Namely, if we were to use iterators, then every key would need its own allocation, which could be quite costly. Instead, this crate provides a `Streamer`, which can be thought of as a streaming iterator. Namely, a stream in this crate maintains a single key buffer and lends it out on each iteration. For more details, including important limitations, see the `Streamer` trait. # Quirks There's no doubt about it, finite state transducers are a specialty data structure. They have a host of restrictions that don't apply to other similar data structures found in the standard library, such as `BTreeSet` and `BTreeMap`. Here are some of them: 1. Sets can only contain keys that are byte strings. 2. Maps can also only contain keys that are byte strings, and its values are limited to unsigned 64 bit integers. (The restriction on values may be relaxed some day.) 3. Creating a set or a map requires inserting keys in lexicographic order. Often, keys are not already sorted, which can make constructing large sets or maps tricky. One way to do it is to sort pieces of the data and build a set/map for each piece. This can be parallelized trivially. Once done, they can be merged together into one big set/map if desired. A somewhat simplistic example of this procedure can be seen in `fst-bin/src/merge.rs` from the root of this crate's repository. */ #![deny(missing_docs)] #[cfg(all(feature = "levenshtein", doctest))] doc_comment::doctest!("../README.md"); pub use crate::automaton::Automaton; pub use crate::error::{Error, Result}; pub use crate::map::{Map, MapBuilder}; pub use crate::set::{Set, SetBuilder}; pub use crate::stream::{IntoStreamer, Streamer}; mod bytes; mod error; #[path = "automaton/mod.rs"] mod inner_automaton; #[path = "map.rs"] mod inner_map; #[path = "set.rs"] mod inner_set; pub mod raw; mod stream; /// Automaton implementations for finite state transducers. /// /// This module defines a trait, `Automaton`, with several implementations /// including, but not limited to, union, intersection and complement. pub mod automaton { pub use crate::inner_automaton::*; } /// Map operations implemented by finite state transducers. /// /// This API provided by this sub-module is close in spirit to the API /// provided by /// [`std::collections::BTreeMap`](https://doc.rust-lang.org/stable/std/collections/struct.BTreeMap.html). /// /// # Overview of types /// /// `Map` is a read only interface to pre-constructed sets. `MapBuilder` is /// used to create new sets. (Once a set is created, it can never be modified.) /// `Stream`, `Keys` and `Values` are streams that originated from a map. /// `StreamBuilder` builds range queries. `OpBuilder` collects a set of streams /// and executes set operations like `union` or `intersection` on them with the /// option of specifying a merge strategy for a map's values. The rest of the /// types are streams for set operations. pub mod map { pub use crate::inner_map::*; } /// Set operations implemented by finite state transducers. /// /// This API provided by this sub-module is close in spirit to the API /// provided by /// [`std::collections::BTreeSet`](https://doc.rust-lang.org/stable/std/collections/struct.BTreeSet.html). /// The principle difference, as with everything else in this crate, is that /// operations are performed on streams of byte strings instead of generic /// iterators. Another difference is that most of the set operations (union, /// intersection, difference and symmetric difference) work on multiple sets at /// the same time, instead of two. /// /// # Overview of types /// /// `Set` is a read only interface to pre-constructed sets. `SetBuilder` is /// used to create new sets. (Once a set is created, it can never be modified.) /// `Stream` is a stream of values that originated from a set (analogous to an /// iterator). `StreamBuilder` builds range queries. `OpBuilder` collects a set /// of streams and executes set operations like `union` or `intersection` on /// them. The rest of the types are streams for set operations. pub mod set { pub use crate::inner_set::*; }