# Logos

_Create ridiculously fast Lexers._

**Logos** has two goals:

+ To make it easy to create a Lexer, so you can focus on more complex problems.
+ To make the generated Lexer faster than anything you'd write by hand.

To achieve those, **Logos**:

+ Combines all token definitions into a single [deterministic state machine](https://en.wikipedia.org/wiki/Deterministic_finite_automaton).
+ Optimizes branches into [lookup tables](https://en.wikipedia.org/wiki/Lookup_table) or [jump tables](https://en.wikipedia.org/wiki/Branch_table).
+ Prevents [backtracking](https://en.wikipedia.org/wiki/ReDoS) inside token definitions.
+ [Unwinds loops](https://en.wikipedia.org/wiki/Loop_unrolling), and batches reads to minimize bounds checking.
+ Does all of that heavy lifting at compile time.

## Example

```rust
use logos::Logos;

#[derive(Logos, Debug, PartialEq)]
enum Token {
    // Tokens can be literal strings, of any length.
    #[token("fast")]
    Fast,

    #[token(".")]
    Period,

    // Or regular expressions.
    #[regex("[a-zA-Z]+")]
    Text,

    // Logos requires one token variant to handle errors;
    // it can be named anything you wish.
    #[error]
    // We can also use this variant to define whitespace,
    // or any other matches we wish to skip.
    #[regex(r"[ \t\n\f]+", logos::skip)]
    Error,
}

fn main() {
    let mut lex = Token::lexer("Create ridiculously fast Lexers.");

    assert_eq!(lex.next(), Some(Token::Text));
    assert_eq!(lex.span(), 0..6);
    assert_eq!(lex.slice(), "Create");

    assert_eq!(lex.next(), Some(Token::Text));
    assert_eq!(lex.span(), 7..19);
    assert_eq!(lex.slice(), "ridiculously");

    assert_eq!(lex.next(), Some(Token::Fast));
    assert_eq!(lex.span(), 20..24);
    assert_eq!(lex.slice(), "fast");

    assert_eq!(lex.next(), Some(Token::Text));
    assert_eq!(lex.slice(), "Lexers");
    assert_eq!(lex.span(), 25..31);

    assert_eq!(lex.next(), Some(Token::Period));
    assert_eq!(lex.span(), 31..32);
    assert_eq!(lex.slice(), ".");

    assert_eq!(lex.next(), None);
}
```

### Callbacks

**Logos** can also call arbitrary functions whenever a pattern is matched, which can be used to put data into a variant:

```rust
use logos::{Logos, Lexer};

// Note: callbacks can return `Option` or `Result`
fn kilo(lex: &mut Lexer<Token>) -> Option<u64> {
    let slice = lex.slice();
    let n: u64 = slice[..slice.len() - 1].parse().ok()?; // skip 'k'
    Some(n * 1_000)
}

fn mega(lex: &mut Lexer<Token>) -> Option<u64> {
    let slice = lex.slice();
    let n: u64 = slice[..slice.len() - 1].parse().ok()?; // skip 'm'
    Some(n * 1_000_000)
}

#[derive(Logos, Debug, PartialEq)]
enum Token {
    #[regex(r"[ \t\n\f]+", logos::skip)]
    #[error]
    Error,

    // Callbacks can use closure syntax, or refer
    // to a function defined elsewhere.
    //
    // Each pattern can have its own callback.
    #[regex("[0-9]+", |lex| lex.slice().parse())]
    #[regex("[0-9]+k", kilo)]
    #[regex("[0-9]+m", mega)]
    Number(u64),
}

fn main() {
    let mut lex = Token::lexer("5 42k 75m");

    assert_eq!(lex.next(), Some(Token::Number(5)));
    assert_eq!(lex.slice(), "5");

    assert_eq!(lex.next(), Some(Token::Number(42_000)));
    assert_eq!(lex.slice(), "42k");

    assert_eq!(lex.next(), Some(Token::Number(75_000_000)));
    assert_eq!(lex.slice(), "75m");

    assert_eq!(lex.next(), None);
}
```

**Logos** can handle callbacks with the following return types:

| Return type                       | Produces                                           |
|-----------------------------------|----------------------------------------------------|
| `()`                              | `Token::Unit`                                      |
| `bool`                            | `Token::Unit` **or** `<Token as Logos>::ERROR`     |
| `Result<(), _>`                   | `Token::Unit` **or** `<Token as Logos>::ERROR`     |
| `T`                               | `Token::Value(T)`                                  |
| `Option<T>`                       | `Token::Value(T)` **or** `<Token as Logos>::ERROR` |
| `Result<T, _>`                    | `Token::Value(T)` **or** `<Token as Logos>::ERROR` |
| [`Skip`](./struct.Skip.html)      | _skips matched input_                              |
| [`Filter<T>`](./enum.Filter.html) | `Token::Value(T)` **or** _skips matched input_     |

Callbacks can also be used to perform more specialized lexing in places where regular expressions are too limiting. For specifics look at [`Lexer::remainder`](./struct.Lexer.html#method.remainder) and [`Lexer::bump`](./struct.Lexer.html#method.bump).
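As an illustration of the last row of the table, here is a minimal sketch (the `even_only` callback is a made-up name for this example) of a callback returning [`Filter`](./enum.Filter.html), which lets a single pattern emit a value for some matches and silently drop others:

```rust
use logos::{Filter, Lexer, Logos};

// Emit even numbers, skip odd ones entirely.
fn even_only(lex: &mut Lexer<Token>) -> Filter<u64> {
    match lex.slice().parse() {
        Ok(n) if n % 2 == 0 => Filter::Emit(n),
        _ => Filter::Skip,
    }
}

#[derive(Logos, Debug, PartialEq)]
enum Token {
    #[regex(r"[ \t\n\f]+", logos::skip)]
    #[error]
    Error,

    #[regex("[0-9]+", even_only)]
    Even(u64),
}

fn main() {
    let mut lex = Token::lexer("1 2 3 4");

    // The odd numbers are matched, but filtered out of the token stream.
    assert_eq!(lex.next(), Some(Token::Even(2)));
    assert_eq!(lex.next(), Some(Token::Even(4)));
    assert_eq!(lex.next(), None);
}
```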
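Those two methods make it possible to hand-roll lexing for constructs that are awkward as a single regex. The sketch below (the `block_comment` helper is hypothetical, not part of **Logos**) matches only the opening `/*` of a block comment, then searches the remainder for the closing `*/` and bumps the token over it:

```rust
use logos::{Lexer, Logos};

// Find the closing `*/` in the remaining input and extend the
// current token over it. Returning `false` produces the `#[error]`
// variant, so an unterminated comment lexes as an error.
fn block_comment(lex: &mut Lexer<Token>) -> bool {
    match lex.remainder().find("*/") {
        Some(end) => {
            lex.bump(end + 2); // +2 to include the `*/` itself
            true
        }
        None => false,
    }
}

#[derive(Logos, Debug, PartialEq)]
enum Token {
    #[regex(r"[ \t\n\f]+", logos::skip)]
    #[error]
    Error,

    #[token("/*", block_comment)]
    Comment,

    #[regex("[a-zA-Z]+")]
    Text,
}

fn main() {
    let mut lex = Token::lexer("hello /* sneaky */ world");

    assert_eq!(lex.next(), Some(Token::Text));
    assert_eq!(lex.slice(), "hello");

    assert_eq!(lex.next(), Some(Token::Comment));
    assert_eq!(lex.slice(), "/* sneaky */");

    assert_eq!(lex.next(), Some(Token::Text));
    assert_eq!(lex.slice(), "world");

    assert_eq!(lex.next(), None);
}
```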
#[regex("[0-9]+", |lex| lex.slice().parse())] #[regex("[0-9]+k", kilo)] #[regex("[0-9]+m", mega)] Number(u64), } fn main() { let mut lex = Token::lexer("5 42k 75m"); assert_eq!(lex.next(), Some(Token::Number(5))); assert_eq!(lex.slice(), "5"); assert_eq!(lex.next(), Some(Token::Number(42_000))); assert_eq!(lex.slice(), "42k"); assert_eq!(lex.next(), Some(Token::Number(75_000_000))); assert_eq!(lex.slice(), "75m"); assert_eq!(lex.next(), None); } ``` Logos can handle callbacks with following return types: | Return type | Produces | |-----------------------------------|----------------------------------------------------| | `()` | `Token::Unit` | | `bool` | `Token::Unit` **or** `::ERROR` | | `Result<(), _>` | `Token::Unit` **or** `::ERROR` | | `T` | `Token::Value(T)` | | `Option` | `Token::Value(T)` **or** `::ERROR` | | `Result` | `Token::Value(T)` **or** `::ERROR` | | [`Skip`](./struct.Skip.html) | _skips matched input_ | | [`Filter`](./enum.Filter.html) | `Token::Value(T)` **or** _skips matched input_ | Callbacks can be also used to do perform more specialized lexing in place where regular expressions are too limiting. For specifics look at [`Lexer::remainder`](./struct.Lexer.html#method.remainder) and [`Lexer::bump`](./struct.Lexer.html#method.bump). ## Token disambiguation Rule of thumb is: + Longer beats shorter. + Specific beats generic. If any two definitions could match the same input, like `fast` and `[a-zA-Z]+` in the example above, it's the longer and more specific definition of `Token::Fast` that will be the result. This is done by comparing numeric priority attached to each definition. Every consecutive, non-repeating single byte adds 2 to the priority, while every range or regex class adds 1. Loops or optional blocks are ignored, while alternations count the shortest alternative: + `[a-zA-Z]+` has a priority of 1 (lowest possible), because at minimum it can match a single byte to a class. + `foobar` has a priority of 12. + `(foo|hello)(bar)?` has a priority of 6, `foo` being it's shortest possible match.