1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370
//! Trie construction.
use crate::ir::{lower_rule, ExprSequence, PatternInst, PatternSequence};
use crate::log;
use crate::sema::{RuleId, TermEnv, TermId, TypeEnv};
use std::collections::BTreeMap;
/// Construct the tries for each term.
pub fn build_tries(typeenv: &TypeEnv, termenv: &TermEnv) -> BTreeMap<TermId, TrieNode> {
let mut builder = TermFunctionsBuilder::new(typeenv, termenv);
builder.build();
log!("builder: {:?}", builder);
builder.finalize()
}
/// One "input symbol" for the decision tree that handles matching on
/// a term. Each symbol represents one step: we either run a match op,
/// or we finish the match.
///
/// Note that in the original Peepmatic scheme, the input-symbol to
/// the FSM was specified slightly differently. The automaton
/// responded to alphabet symbols that corresponded only to match
/// results, and the "extra state" was used at each automaton node to
/// represent the op to run next. This extra state differentiated
/// nodes that would otherwise be merged together by
/// deduplication. That scheme works well enough, but the "extra
/// state" is slightly confusing and diverges slightly from a pure
/// automaton.
///
/// Instead, here, we imagine that the user of the automaton/trie can
/// query the possible transition edges out of the current state. Each
/// of these edges corresponds to one possible match op to run. After
/// running a match op, we reach a new state corresponding to
/// successful matches up to that point.
///
/// However, it's a bit more subtle than this. Consider the
/// prioritization problem. We want to give the DSL user the ability
/// to change the order in which rules apply, for example to have a
/// tier of "fallback rules" that apply only if more custom rules do
/// not match.
///
/// A somewhat simplistic answer to this problem is "more specific
/// rule wins". However, this implies the existence of a total
/// ordering of linearized match sequences that may not fully capture
/// the intuitive meaning of "more specific". Consider three left-hand
/// sides:
///
/// - (A _ _)
/// - (A (B _) _)
/// - (A _ (B _))
///
/// Intuitively, the first is the least specific. Given the input `(A
/// (B 1) (B 2))`, we can say for sure that the first should not be
/// chosen, because either the second or third would match "more" of
/// the input tree. But which of the second and third should be
/// chosen? A "lexicographic ordering" rule would say that we sort
/// left-hand sides such that the `(B _)` sub-pattern comes before the
/// wildcard `_`, so the second rule wins. But that is arbitrarily
/// privileging one over the other based on the order of the
/// arguments.
///
/// Instead, we can accept explicit priorities from the user to allow
/// either choice. So we need a data structure that can associate
/// matching inputs *with priorities* to outputs.
///
/// Next, we build a decision tree rather than an FSM. Why? Because
/// we're compiling to a structured language, Rust, and states become
/// *program points* rather than *data*, we cannot easily support a
/// DAG structure. In other words, we are not producing a FSM that we
/// can interpret at runtime; rather we are compiling code in which
/// each state corresponds to a sequence of statements and
/// control-flow that branches to a next state, we naturally need
/// nesting; we cannot codegen arbitrary state transitions in an
/// efficient manner. We could support a limited form of DAG that
/// reifies "diamonds" (two alternate paths that reconverge), but
/// supporting this in a way that lets the output refer to values from
/// either side is very complex (we need to invent phi-nodes), and the
/// cases where we want to do this rather than invoke a sub-term (that
/// is compiled to a separate function) are rare. Finally, note that
/// one reason to deduplicate nodes and turn a tree back into a DAG --
/// "output-suffix sharing" as some other instruction-rewriter
/// engines, such as Peepmatic, do -- is not done, because all
/// "output" occurs at leaf nodes; this is necessary because we do not
/// want to start invoking external constructors until we are sure of
/// the match. Some of the code-sharing advantages of the "suffix
/// sharing" scheme can be obtained in a more flexible and
/// user-controllable way (with less understanding of internal
/// compiler logic needed) by factoring logic into different internal
/// terms, which become different compiled functions. This is likely
/// to happen anyway as part of good software engineering practice.
///
/// We prepare for codegen by building a "prioritized trie", where the
/// trie associates input strings with priorities to output values.
/// Each input string is a sequence of match operators followed by an
/// "end of match" token, and each output is a sequence of ops that
/// build the output expression. Each input-output mapping is
/// associated with a priority. The goal of the trie is to generate a
/// decision-tree procedure that lets us execute match ops in a
/// deterministic way, eventually landing at a state that corresponds
/// to the highest-priority matching rule and can produce the output.
///
/// To build this trie, we construct nodes with edges to child nodes;
/// each edge consists of (i) one input token (a `PatternInst` or
/// EOM), and (ii) the priority of rules along this edge. We do not
/// merge rules of different priorities, because the logic to do so is
/// complex and error-prone, necessitating "splits" when we merge
/// together a set of rules over a priority range but later introduce
/// a new possible match op in the "middle" of the range. (E.g., match
/// op A at prio 10, B at prio 5, A at prio 0.) In fact, a previous
/// version of the ISLE compiler worked this way, but in practice the
/// complexity was unneeded.
///
/// To add a rule to this trie, we perform the usual trie-insertion
/// logic, creating edges and subnodes where necessary. A new edge is
/// necessary whenever an edge does not exist for the (priority,
/// symbol) tuple.
///
/// Note that this means that multiple edges with a single match-op
/// may exist, with different priorities.
#[derive(Clone, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub enum TrieSymbol {
/// Run a match operation to continue matching a LHS.
Match {
/// The match operation to run.
op: PatternInst,
},
/// We successfully matched a LHS.
EndOfMatch,
}
impl TrieSymbol {
fn is_eom(&self) -> bool {
match self {
TrieSymbol::EndOfMatch => true,
_ => false,
}
}
}
/// A priority.
pub type Prio = i64;
/// An edge in our term trie.
#[derive(Clone, Debug)]
pub struct TrieEdge {
/// The priority for this edge's sub-trie.
pub prio: Prio,
/// The match operation to perform for this edge.
pub symbol: TrieSymbol,
/// This edge's sub-trie.
pub node: TrieNode,
}
/// A node in the term trie.
#[derive(Clone, Debug)]
pub enum TrieNode {
/// One or more patterns could match.
///
/// Maybe one pattern already has matched, but there are more (higher
/// priority and/or same priority but more specific) patterns that could
/// still match.
Decision {
/// The child sub-tries that we can match from this point on.
edges: Vec<TrieEdge>,
},
/// The successful match of an LHS pattern, and here is its RHS expression.
Leaf {
/// The priority of this rule.
prio: Prio,
/// The RHS expression to evaluate upon a successful LHS pattern match.
output: ExprSequence,
},
/// No LHS pattern matches.
Empty,
}
impl TrieNode {
fn is_empty(&self) -> bool {
matches!(self, &TrieNode::Empty)
}
fn insert(
&mut self,
prio: Prio,
mut input: impl Iterator<Item = TrieSymbol>,
output: ExprSequence,
) -> bool {
// Take one input symbol. There must be *at least* one, EOM if
// nothing else.
let op = input
.next()
.expect("Cannot insert into trie with empty input sequence");
let is_last = op.is_eom();
// If we are empty, turn into a decision node.
if self.is_empty() {
*self = TrieNode::Decision { edges: vec![] };
}
// We must be a decision node.
let edges = match self {
&mut TrieNode::Decision { ref mut edges } => edges,
_ => panic!("insert on leaf node!"),
};
// Now find or insert the appropriate edge.
let edge = edges
.iter()
.position(|edge| edge.symbol == op && edge.prio == prio)
.unwrap_or_else(|| {
edges.push(TrieEdge {
prio,
symbol: op,
node: TrieNode::Empty,
});
edges.len() - 1
});
let edge = &mut edges[edge];
if is_last {
if !edge.node.is_empty() {
// If a leaf node already exists at an overlapping
// prio for this op, there are two competing rules, so
// we can't insert this one.
return false;
}
edge.node = TrieNode::Leaf { prio, output };
true
} else {
edge.node.insert(prio, input, output)
}
}
/// Sort edges by priority.
pub fn sort(&mut self) {
match self {
TrieNode::Decision { edges } => {
// Sort by priority, highest integer value first; then
// by trie symbol.
edges.sort_by_cached_key(|edge| (-edge.prio, edge.symbol.clone()));
for child in edges {
child.node.sort();
}
}
_ => {}
}
}
/// Get a pretty-printed version of this trie, for debugging.
pub fn pretty(&self) -> String {
let mut s = String::new();
pretty_rec(&mut s, self, "");
return s;
fn pretty_rec(s: &mut String, node: &TrieNode, indent: &str) {
match node {
TrieNode::Decision { edges } => {
s.push_str(indent);
s.push_str("TrieNode::Decision:\n");
let new_indent = indent.to_owned() + " ";
for edge in edges {
s.push_str(indent);
s.push_str(&format!(
" edge: prio = {:?}, symbol: {:?}\n",
edge.prio, edge.symbol
));
pretty_rec(s, &edge.node, &new_indent);
}
}
TrieNode::Empty | TrieNode::Leaf { .. } => {
s.push_str(indent);
s.push_str(&format!("{:?}\n", node));
}
}
}
}
}
/// Builder context for one function in generated code corresponding
/// to one root input term.
///
/// A `TermFunctionBuilder` can correspond to the matching
/// control-flow and operations that we execute either when evaluating
/// *forward* on a term, trying to match left-hand sides against it
/// and transforming it into another term; or *backward* on a term,
/// trying to match another rule's left-hand side against an input to
/// produce the term in question (when the term is used in the LHS of
/// the calling term).
#[derive(Debug)]
struct TermFunctionBuilder {
trie: TrieNode,
}
impl TermFunctionBuilder {
fn new() -> Self {
TermFunctionBuilder {
trie: TrieNode::Empty,
}
}
fn add_rule(&mut self, prio: Prio, pattern_seq: PatternSequence, expr_seq: ExprSequence) {
let symbols = pattern_seq
.insts
.into_iter()
.map(|op| TrieSymbol::Match { op })
.chain(std::iter::once(TrieSymbol::EndOfMatch));
self.trie.insert(prio, symbols, expr_seq);
}
fn sort_trie(&mut self) {
self.trie.sort();
}
}
#[derive(Debug)]
struct TermFunctionsBuilder<'a> {
typeenv: &'a TypeEnv,
termenv: &'a TermEnv,
builders_by_term: BTreeMap<TermId, TermFunctionBuilder>,
}
impl<'a> TermFunctionsBuilder<'a> {
fn new(typeenv: &'a TypeEnv, termenv: &'a TermEnv) -> Self {
log!("typeenv: {:?}", typeenv);
log!("termenv: {:?}", termenv);
Self {
builders_by_term: BTreeMap::new(),
typeenv,
termenv,
}
}
fn build(&mut self) {
for rule in 0..self.termenv.rules.len() {
let rule = RuleId(rule);
let prio = self.termenv.rules[rule.index()].prio.unwrap_or(0);
let (pattern, expr) = lower_rule(self.typeenv, self.termenv, rule);
let root_term = self.termenv.rules[rule.index()].lhs.root_term().unwrap();
log!(
"build:\n- rule {:?}\n- pattern {:?}\n- expr {:?}",
self.termenv.rules[rule.index()],
pattern,
expr
);
self.builders_by_term
.entry(root_term)
.or_insert_with(|| TermFunctionBuilder::new())
.add_rule(prio, pattern.clone(), expr.clone());
}
for builder in self.builders_by_term.values_mut() {
builder.sort_trie();
}
}
fn finalize(self) -> BTreeMap<TermId, TrieNode> {
let functions_by_term = self
.builders_by_term
.into_iter()
.map(|(term, builder)| (term, builder.trie))
.collect::<BTreeMap<_, _>>();
functions_by_term
}
}