#[cfg(feature = "std")] use core::fmt; #[cfg(feature = "std")] use core::iter; use core::mem; use core::slice; #[cfg(feature = "std")] use byteorder::{BigEndian, LittleEndian}; use byteorder::{ByteOrder, NativeEndian}; #[cfg(feature = "std")] use regex_syntax::ParserBuilder; use classes::ByteClasses; #[cfg(feature = "std")] use determinize::Determinizer; use dfa::DFA; #[cfg(feature = "std")] use error::{Error, Result}; #[cfg(feature = "std")] use minimize::Minimizer; #[cfg(feature = "std")] use nfa::{self, NFA}; #[cfg(feature = "std")] use sparse::SparseDFA; use state_id::{dead_id, StateID}; #[cfg(feature = "std")] use state_id::{ next_state_id, premultiply_overflow_error, write_state_id_bytes, }; /// The size of the alphabet in a standard DFA. /// /// Specifically, this length controls the number of transitions present in /// each DFA state. However, when the byte class optimization is enabled, /// then each DFA maps the space of all possible 256 byte values to at most /// 256 distinct equivalence classes. In this case, the number of distinct /// equivalence classes corresponds to the internal alphabet of the DFA, in the /// sense that each DFA state has a number of transitions equal to the number /// of equivalence classes despite supporting matching on all possible byte /// values. const ALPHABET_LEN: usize = 256; /// Masks used in serialization of DFAs. pub(crate) const MASK_PREMULTIPLIED: u16 = 0b0000_0000_0000_0001; pub(crate) const MASK_ANCHORED: u16 = 0b0000_0000_0000_0010; /// A dense table-based deterministic finite automaton (DFA). /// /// A dense DFA represents the core matching primitive in this crate. That is, /// logically, all DFAs have a single start state, one or more match states /// and a transition table that maps the current state and the current byte of /// input to the next state. A DFA can use this information to implement fast /// searching. In particular, the use of a dense DFA generally makes the trade /// off that match speed is the most valuable characteristic, even if building /// the regex may take significant time *and* space. As such, the processing /// of every byte of input is done with a small constant number of operations /// that does not vary with the pattern, its size or the size of the alphabet. /// If your needs don't line up with this trade off, then a dense DFA may not /// be an adequate solution to your problem. /// /// In contrast, a [sparse DFA](enum.SparseDFA.html) makes the opposite /// trade off: it uses less space but will execute a variable number of /// instructions per byte at match time, which makes it slower for matching. /// /// A DFA can be built using the default configuration via the /// [`DenseDFA::new`](enum.DenseDFA.html#method.new) constructor. Otherwise, /// one can configure various aspects via the /// [`dense::Builder`](dense/struct.Builder.html). /// /// A single DFA fundamentally supports the following operations: /// /// 1. Detection of a match. /// 2. Location of the end of the first possible match. /// 3. Location of the end of the leftmost-first match. /// /// A notable absence from the above list of capabilities is the location of /// the *start* of a match. In order to provide both the start and end of a /// match, *two* DFAs are required. This functionality is provided by a /// [`Regex`](struct.Regex.html), which can be built with its basic /// constructor, [`Regex::new`](struct.Regex.html#method.new), or with /// a [`RegexBuilder`](struct.RegexBuilder.html). 
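/// For example, a brief sketch of using `Regex` to recover both the start and
/// the end of a match (this assumes the `Regex::find` method documented on
/// `Regex`, which reports the match offsets as a `(start, end)` pair):
///
/// ```
/// use regex_automata::Regex;
///
/// # fn example() -> Result<(), regex_automata::Error> {
/// let re = Regex::new("foo[0-9]+")?;
/// // Unlike a single DFA, a Regex reports both offsets of the match.
/// assert_eq!(Some((0, 8)), re.find(b"foo12345"));
/// # Ok(()) }; example().unwrap()
/// ```
///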
/// /// # State size /// /// A `DenseDFA` has two type parameters, `T` and `S`. `T` corresponds to /// the type of the DFA's transition table while `S` corresponds to the /// representation used for the DFA's state identifiers as described by the /// [`StateID`](trait.StateID.html) trait. This type parameter is typically /// `usize`, but other valid choices provided by this crate include `u8`, /// `u16`, `u32` and `u64`. The primary reason for choosing a different state /// identifier representation than the default is to reduce the amount of /// memory used by a DFA. Note though, that if the chosen representation cannot /// accommodate the size of your DFA, then building the DFA will fail and /// return an error. /// /// While the reduction in heap memory used by a DFA is one reason for choosing /// a smaller state identifier representation, another possible reason is for /// decreasing the serialization size of a DFA, as returned by /// [`to_bytes_little_endian`](enum.DenseDFA.html#method.to_bytes_little_endian), /// [`to_bytes_big_endian`](enum.DenseDFA.html#method.to_bytes_big_endian) /// or /// [`to_bytes_native_endian`](enum.DenseDFA.html#method.to_bytes_native_endian). /// /// The type of the transition table is typically either `Vec<S>` or `&[S]`, /// depending on where the transition table is stored. /// /// # Variants /// /// This DFA is defined as a non-exhaustive enumeration of different types of /// dense DFAs. All of these dense DFAs use the same internal representation /// for the transition table, but they vary in how the transition table is /// read. A DFA's specific variant depends on the configuration options set via /// [`dense::Builder`](dense/struct.Builder.html). The default variant is /// `PremultipliedByteClass`. /// /// # The `DFA` trait /// /// This type implements the [`DFA`](trait.DFA.html) trait, which means it /// can be used for searching. For example: /// /// ``` /// use regex_automata::{DFA, DenseDFA}; /// /// # fn example() -> Result<(), regex_automata::Error> { /// let dfa = DenseDFA::new("foo[0-9]+")?; /// assert_eq!(Some(8), dfa.find(b"foo12345")); /// # Ok(()) }; example().unwrap() /// ``` /// /// The `DFA` trait also provides an assortment of other lower level methods /// for DFAs, such as `start_state` and `next_state`. While these are correctly /// implemented, it is an anti-pattern to use them in performance sensitive /// code on the `DenseDFA` type directly. Namely, each implementation requires /// a branch to determine which type of dense DFA is being used. Instead, /// this branch should be pushed up a layer in the code since walking the /// transitions of a DFA is usually a hot path. If you do need to use these /// lower level methods in performance critical code, then you should match on /// the variants of this DFA and use each variant's implementation of the `DFA` /// trait directly. #[derive(Clone, Debug)] pub enum DenseDFA<T: AsRef<[S]>, S: StateID> { /// A standard DFA that does not use premultiplication or byte classes. Standard(Standard<T, S>), /// A DFA that shrinks its alphabet to a set of equivalence classes instead /// of using all possible byte values. Any two bytes belong to the same /// equivalence class if and only if they can be used interchangeably /// anywhere in the DFA while never discriminating between a match and a /// non-match. /// /// This type of DFA can result in significant space reduction with a very /// small match time performance penalty. 
ByteClass(ByteClass<T, S>), /// A DFA that premultiplies all of its state identifiers in its /// transition table. This saves an instruction per byte at match time /// which improves search performance. /// /// The only downside of premultiplication is that it may prevent one from /// using a smaller state identifier representation than you otherwise /// could. Premultiplied(Premultiplied<T, S>), /// The default configuration of a DFA, which uses byte classes and /// premultiplies its state identifiers. PremultipliedByteClass(PremultipliedByteClass<T, S>), /// Hints that destructuring should not be exhaustive. /// /// This enum may grow additional variants, so this makes sure clients /// don't count on exhaustive matching. (Otherwise, adding a new variant /// could break existing code.) #[doc(hidden)] __Nonexhaustive, } impl<T: AsRef<[S]>, S: StateID> DenseDFA<T, S> { /// Return the internal DFA representation. /// /// All variants share the same internal representation. fn repr(&self) -> &Repr<T, S> { match *self { DenseDFA::Standard(ref r) => &r.0, DenseDFA::ByteClass(ref r) => &r.0, DenseDFA::Premultiplied(ref r) => &r.0, DenseDFA::PremultipliedByteClass(ref r) => &r.0, DenseDFA::__Nonexhaustive => unreachable!(), } } } #[cfg(feature = "std")] impl DenseDFA<Vec<usize>, usize> { /// Parse the given regular expression using a default configuration and /// return the corresponding DFA. /// /// The default configuration uses `usize` for state IDs, premultiplies /// them and reduces the alphabet size by splitting bytes into equivalence /// classes. The DFA is *not* minimized. /// /// If you want a non-default configuration, then use the /// [`dense::Builder`](dense/struct.Builder.html) /// to set your own configuration. /// /// # Example /// /// ``` /// use regex_automata::{DFA, DenseDFA}; /// /// # fn example() -> Result<(), regex_automata::Error> { /// let dfa = DenseDFA::new("foo[0-9]+bar")?; /// assert_eq!(Some(11), dfa.find(b"foo12345bar")); /// # Ok(()) }; example().unwrap() /// ``` pub fn new(pattern: &str) -> Result<DenseDFA<Vec<usize>, usize>> { Builder::new().build(pattern) } } #[cfg(feature = "std")] impl<S: StateID> DenseDFA<Vec<S>, S> { /// Create a new empty DFA that never matches any input. /// /// # Example /// /// In order to build an empty DFA, callers must provide a type hint /// indicating their choice of state identifier representation. /// /// ``` /// use regex_automata::{DFA, DenseDFA}; /// /// # fn example() -> Result<(), regex_automata::Error> { /// let dfa: DenseDFA<Vec<usize>, usize> = DenseDFA::empty(); /// assert_eq!(None, dfa.find(b"")); /// assert_eq!(None, dfa.find(b"foo")); /// # Ok(()) }; example().unwrap() /// ``` pub fn empty() -> DenseDFA<Vec<S>, S> { Repr::empty().into_dense_dfa() } } impl<T: AsRef<[S]>, S: StateID> DenseDFA<T, S> { /// Cheaply return a borrowed version of this dense DFA. Specifically, the /// DFA returned always uses `&[S]` for its transition table while keeping /// the same state identifier representation. 
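///
/// # Example
///
/// A small sketch of what this enables (the assertion only relies on the
/// `DFA` trait being implemented for the borrowed form as well):
///
/// ```
/// use regex_automata::{DFA, DenseDFA};
///
/// # fn example() -> Result<(), regex_automata::Error> {
/// let owned = DenseDFA::new("foo[0-9]+")?;
/// // Borrow the transition table instead of copying it.
/// let borrowed = owned.as_ref();
/// assert_eq!(Some(8), borrowed.find(b"foo12345"));
/// # Ok(()) }; example().unwrap()
/// ```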
pub fn as_ref<'a>(&'a self) -> DenseDFA<&'a [S], S> { match *self { DenseDFA::Standard(ref r) => { DenseDFA::Standard(Standard(r.0.as_ref())) } DenseDFA::ByteClass(ref r) => { DenseDFA::ByteClass(ByteClass(r.0.as_ref())) } DenseDFA::Premultiplied(ref r) => { DenseDFA::Premultiplied(Premultiplied(r.0.as_ref())) } DenseDFA::PremultipliedByteClass(ref r) => { let inner = PremultipliedByteClass(r.0.as_ref()); DenseDFA::PremultipliedByteClass(inner) } DenseDFA::__Nonexhaustive => unreachable!(), } } /// Return an owned version of this dense DFA. Specifically, the DFA /// returned always uses `Vec<S>` for its transition table while keeping /// the same state identifier representation. /// /// Effectively, this returns a dense DFA whose transition table lives /// on the heap. #[cfg(feature = "std")] pub fn to_owned(&self) -> DenseDFA<Vec<S>, S> { match *self { DenseDFA::Standard(ref r) => { DenseDFA::Standard(Standard(r.0.to_owned())) } DenseDFA::ByteClass(ref r) => { DenseDFA::ByteClass(ByteClass(r.0.to_owned())) } DenseDFA::Premultiplied(ref r) => { DenseDFA::Premultiplied(Premultiplied(r.0.to_owned())) } DenseDFA::PremultipliedByteClass(ref r) => { let inner = PremultipliedByteClass(r.0.to_owned()); DenseDFA::PremultipliedByteClass(inner) } DenseDFA::__Nonexhaustive => unreachable!(), } } /// Returns the memory usage, in bytes, of this DFA. /// /// The memory usage is computed based on the number of bytes used to /// represent this DFA's transition table. This corresponds to heap memory /// usage. /// /// This does **not** include the stack size used up by this DFA. To /// compute that, use `std::mem::size_of::<DenseDFA>()`. pub fn memory_usage(&self) -> usize { self.repr().memory_usage() } } /// Routines for converting a dense DFA to other representations, such as /// sparse DFAs, smaller state identifiers or raw bytes suitable for persistent /// storage. #[cfg(feature = "std")] impl<T: AsRef<[S]>, S: StateID> DenseDFA<T, S> { /// Convert this dense DFA to a sparse DFA. /// /// This is a convenience routine for `to_sparse_sized` that fixes the /// state identifier representation of the sparse DFA to the same /// representation used for this dense DFA. /// /// If the chosen state identifier representation is too small to represent /// all states in the sparse DFA, then this returns an error. In most /// cases, if a dense DFA is constructable with `S` then a sparse DFA will /// be as well. However, it is not guaranteed. /// /// # Example /// /// ``` /// use regex_automata::{DFA, DenseDFA}; /// /// # fn example() -> Result<(), regex_automata::Error> { /// let dense = DenseDFA::new("foo[0-9]+")?; /// let sparse = dense.to_sparse()?; /// assert_eq!(Some(8), sparse.find(b"foo12345")); /// # Ok(()) }; example().unwrap() /// ``` pub fn to_sparse(&self) -> Result<SparseDFA<Vec<u8>, S>> { self.to_sparse_sized() } /// Convert this dense DFA to a sparse DFA. /// /// Using this routine requires supplying a type hint to choose the state /// identifier representation for the resulting sparse DFA. /// /// If the chosen state identifier representation is too small to represent /// all states in the sparse DFA, then this returns an error.
/// /// # Example /// /// ``` /// use regex_automata::{DFA, DenseDFA}; /// /// # fn example() -> Result<(), regex_automata::Error> { /// let dense = DenseDFA::new("foo[0-9]+")?; /// let sparse = dense.to_sparse_sized::<u8>()?; /// assert_eq!(Some(8), sparse.find(b"foo12345")); /// # Ok(()) }; example().unwrap() /// ``` pub fn to_sparse_sized<A: StateID>( &self, ) -> Result<SparseDFA<Vec<u8>, A>> { self.repr().to_sparse_sized() } /// Create a new DFA whose match semantics are equivalent to this DFA, /// but attempt to use `u8` for the representation of state identifiers. /// If `u8` is insufficient to represent all state identifiers in this /// DFA, then this returns an error. /// /// This is a convenience routine for `to_sized::<u8>()`. pub fn to_u8(&self) -> Result<DenseDFA<Vec<u8>, u8>> { self.to_sized() } /// Create a new DFA whose match semantics are equivalent to this DFA, /// but attempt to use `u16` for the representation of state identifiers. /// If `u16` is insufficient to represent all state identifiers in this /// DFA, then this returns an error. /// /// This is a convenience routine for `to_sized::<u16>()`. pub fn to_u16(&self) -> Result<DenseDFA<Vec<u16>, u16>> { self.to_sized() } /// Create a new DFA whose match semantics are equivalent to this DFA, /// but attempt to use `u32` for the representation of state identifiers. /// If `u32` is insufficient to represent all state identifiers in this /// DFA, then this returns an error. /// /// This is a convenience routine for `to_sized::<u32>()`. #[cfg(any(target_pointer_width = "32", target_pointer_width = "64"))] pub fn to_u32(&self) -> Result<DenseDFA<Vec<u32>, u32>> { self.to_sized() } /// Create a new DFA whose match semantics are equivalent to this DFA, /// but attempt to use `u64` for the representation of state identifiers. /// If `u64` is insufficient to represent all state identifiers in this /// DFA, then this returns an error. /// /// This is a convenience routine for `to_sized::<u64>()`. #[cfg(target_pointer_width = "64")] pub fn to_u64(&self) -> Result<DenseDFA<Vec<u64>, u64>> { self.to_sized() } /// Create a new DFA whose match semantics are equivalent to this DFA, but /// attempt to use `A` for the representation of state identifiers. If `A` /// is insufficient to represent all state identifiers in this DFA, then /// this returns an error. /// /// An alternative way to construct such a DFA is to use /// [`dense::Builder::build_with_size`](dense/struct.Builder.html#method.build_with_size). /// In general, using the builder is preferred since it will use the given /// state identifier representation throughout determinization (and /// minimization, if done), and thereby using less memory throughout the /// entire construction process. However, these routines are necessary /// in cases where, say, a minimized DFA could fit in a smaller state /// identifier representation, but the initial determinized DFA would not. pub fn to_sized<A: StateID>(&self) -> Result<DenseDFA<Vec<A>, A>> { self.repr().to_sized().map(|r| r.into_dense_dfa()) } /// Serialize a DFA to raw bytes, aligned to an 8 byte boundary, in little /// endian format. /// /// If the state identifier representation of this DFA has a size different /// than 1, 2, 4 or 8 bytes, then this returns an error. All /// implementations of `StateID` provided by this crate satisfy this /// requirement. 
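///
/// # Example
///
/// A sketch of serializing a DFA (the conversion to `u16` is only there to
/// keep the serialized form small and platform independent):
///
/// ```
/// use regex_automata::DenseDFA;
///
/// # fn example() -> Result<(), regex_automata::Error> {
/// let dfa = DenseDFA::new("foo[0-9]+")?.to_u16()?;
/// let bytes = dfa.to_bytes_little_endian()?;
/// assert!(!bytes.is_empty());
/// # Ok(()) }; example().unwrap()
/// ```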
pub fn to_bytes_little_endian(&self) -> Result<Vec<u8>> { self.repr().to_bytes::<LittleEndian>() } /// Serialize a DFA to raw bytes, aligned to an 8 byte boundary, in big /// endian format. /// /// If the state identifier representation of this DFA has a size different /// than 1, 2, 4 or 8 bytes, then this returns an error. All /// implementations of `StateID` provided by this crate satisfy this /// requirement. pub fn to_bytes_big_endian(&self) -> Result<Vec<u8>> { self.repr().to_bytes::<BigEndian>() } /// Serialize a DFA to raw bytes, aligned to an 8 byte boundary, in native /// endian format. Generally, it is better to pick an explicit endianness /// using either `to_bytes_little_endian` or `to_bytes_big_endian`. This /// routine is useful in tests where the DFA is serialized and deserialized /// on the same platform. /// /// If the state identifier representation of this DFA has a size different /// than 1, 2, 4 or 8 bytes, then this returns an error. All /// implementations of `StateID` provided by this crate satisfy this /// requirement. pub fn to_bytes_native_endian(&self) -> Result<Vec<u8>> { self.repr().to_bytes::<NativeEndian>() } } impl<'a, S: StateID> DenseDFA<&'a [S], S> { /// Deserialize a DFA with a specific state identifier representation. /// /// Deserializing a DFA using this routine will never allocate heap memory. /// This is also guaranteed to be a constant time operation that does not /// vary with the size of the DFA. /// /// The bytes given should be generated by the serialization of a DFA with /// either the /// [`to_bytes_little_endian`](enum.DenseDFA.html#method.to_bytes_little_endian) /// method or the /// [`to_bytes_big_endian`](enum.DenseDFA.html#method.to_bytes_big_endian) /// method, depending on the endianness of the machine you are /// deserializing this DFA from. /// /// If the state identifier representation is `usize`, then deserialization /// is dependent on the pointer size. For this reason, it is best to /// serialize DFAs using a fixed size representation for your state /// identifiers, such as `u8`, `u16`, `u32` or `u64`. /// /// # Panics /// /// The bytes given should be *trusted*. In particular, if the bytes /// are not a valid serialization of a DFA, or if the given bytes are /// not aligned to an 8 byte boundary, or if the endianness of the /// serialized bytes is different than the endianness of the machine that /// is deserializing the DFA, then this routine will panic. Moreover, it is /// possible for this deserialization routine to succeed even if the given /// bytes do not represent a valid serialized dense DFA. /// /// # Safety /// /// This routine is unsafe because it permits callers to provide an /// arbitrary transition table with possibly incorrect transitions. While /// the various serialization routines will never return an incorrect /// transition table, there is no guarantee that the bytes provided here /// are correct. While deserialization does many checks (as documented /// above in the panic conditions), this routine does not check that the /// transition table is correct. Given an incorrect transition table, it is /// possible for the search routines to access out-of-bounds memory because /// of explicit bounds check elision. /// /// # Example /// /// This example shows how to serialize a DFA to raw bytes, deserialize it /// and then use it for searching. Note that we first convert the DFA to /// using `u16` for its state identifier representation before serializing /// it.
While this isn't strictly necessary, it's good practice in order to /// decrease the size of the DFA and to avoid platform specific pitfalls /// such as differing pointer sizes. /// /// ``` /// use regex_automata::{DFA, DenseDFA}; /// /// # fn example() -> Result<(), regex_automata::Error> { /// let initial = DenseDFA::new("foo[0-9]+")?; /// let bytes = initial.to_u16()?.to_bytes_native_endian()?; /// let dfa: DenseDFA<&[u16], u16> = unsafe { /// DenseDFA::from_bytes(&bytes) /// }; /// /// assert_eq!(Some(8), dfa.find(b"foo12345")); /// # Ok(()) }; example().unwrap() /// ``` pub unsafe fn from_bytes(buf: &'a [u8]) -> DenseDFA<&'a [S], S> { Repr::from_bytes(buf).into_dense_dfa() } } #[cfg(feature = "std")] impl<S: StateID> DenseDFA<Vec<S>, S> { /// Minimize this DFA in place. /// /// This is not part of the public API. It is only exposed to allow for /// more granular external benchmarking. #[doc(hidden)] pub fn minimize(&mut self) { self.repr_mut().minimize(); } /// Return a mutable reference to the internal DFA representation. fn repr_mut(&mut self) -> &mut Repr<Vec<S>, S> { match *self { DenseDFA::Standard(ref mut r) => &mut r.0, DenseDFA::ByteClass(ref mut r) => &mut r.0, DenseDFA::Premultiplied(ref mut r) => &mut r.0, DenseDFA::PremultipliedByteClass(ref mut r) => &mut r.0, DenseDFA::__Nonexhaustive => unreachable!(), } } } impl<T: AsRef<[S]>, S: StateID> DFA for DenseDFA<T, S> { type ID = S; #[inline] fn start_state(&self) -> S { self.repr().start_state() } #[inline] fn is_match_state(&self, id: S) -> bool { self.repr().is_match_state(id) } #[inline] fn is_dead_state(&self, id: S) -> bool { self.repr().is_dead_state(id) } #[inline] fn is_match_or_dead_state(&self, id: S) -> bool { self.repr().is_match_or_dead_state(id) } #[inline] fn is_anchored(&self) -> bool { self.repr().is_anchored() } #[inline] fn next_state(&self, current: S, input: u8) -> S { match *self { DenseDFA::Standard(ref r) => r.next_state(current, input), DenseDFA::ByteClass(ref r) => r.next_state(current, input), DenseDFA::Premultiplied(ref r) => r.next_state(current, input), DenseDFA::PremultipliedByteClass(ref r) => { r.next_state(current, input) } DenseDFA::__Nonexhaustive => unreachable!(), } } #[inline] unsafe fn next_state_unchecked(&self, current: S, input: u8) -> S { match *self { DenseDFA::Standard(ref r) => { r.next_state_unchecked(current, input) } DenseDFA::ByteClass(ref r) => { r.next_state_unchecked(current, input) } DenseDFA::Premultiplied(ref r) => { r.next_state_unchecked(current, input) } DenseDFA::PremultipliedByteClass(ref r) => { r.next_state_unchecked(current, input) } DenseDFA::__Nonexhaustive => unreachable!(), } } // We specialize the following methods because it lets us lift the // case analysis between the different types of dense DFAs. Instead of // doing the case analysis for every transition, we do it once before // searching. 
#[inline] fn is_match_at(&self, bytes: &[u8], start: usize) -> bool { match *self { DenseDFA::Standard(ref r) => r.is_match_at(bytes, start), DenseDFA::ByteClass(ref r) => r.is_match_at(bytes, start), DenseDFA::Premultiplied(ref r) => r.is_match_at(bytes, start), DenseDFA::PremultipliedByteClass(ref r) => { r.is_match_at(bytes, start) } DenseDFA::__Nonexhaustive => unreachable!(), } } #[inline] fn shortest_match_at(&self, bytes: &[u8], start: usize) -> Option<usize> { match *self { DenseDFA::Standard(ref r) => r.shortest_match_at(bytes, start), DenseDFA::ByteClass(ref r) => r.shortest_match_at(bytes, start), DenseDFA::Premultiplied(ref r) => { r.shortest_match_at(bytes, start) } DenseDFA::PremultipliedByteClass(ref r) => { r.shortest_match_at(bytes, start) } DenseDFA::__Nonexhaustive => unreachable!(), } } #[inline] fn find_at(&self, bytes: &[u8], start: usize) -> Option<usize> { match *self { DenseDFA::Standard(ref r) => r.find_at(bytes, start), DenseDFA::ByteClass(ref r) => r.find_at(bytes, start), DenseDFA::Premultiplied(ref r) => r.find_at(bytes, start), DenseDFA::PremultipliedByteClass(ref r) => r.find_at(bytes, start), DenseDFA::__Nonexhaustive => unreachable!(), } } #[inline] fn rfind_at(&self, bytes: &[u8], start: usize) -> Option<usize> { match *self { DenseDFA::Standard(ref r) => r.rfind_at(bytes, start), DenseDFA::ByteClass(ref r) => r.rfind_at(bytes, start), DenseDFA::Premultiplied(ref r) => r.rfind_at(bytes, start), DenseDFA::PremultipliedByteClass(ref r) => { r.rfind_at(bytes, start) } DenseDFA::__Nonexhaustive => unreachable!(), } } } /// A standard dense DFA that does not use premultiplication or byte classes. /// /// Generally, it isn't necessary to use this type directly, since a `DenseDFA` /// can be used for searching directly. One possible reason why one might want /// to use this type directly is if you are implementing your own search /// routines by walking a DFA's transitions directly. In that case, you'll want /// to use this type (or any of the other DFA variant types) directly, since /// they implement `next_state` more efficiently. #[derive(Clone, Debug)] pub struct Standard<T: AsRef<[S]>, S: StateID>(Repr<T, S>); impl<T: AsRef<[S]>, S: StateID> DFA for Standard<T, S> { type ID = S; #[inline] fn start_state(&self) -> S { self.0.start_state() } #[inline] fn is_match_state(&self, id: S) -> bool { self.0.is_match_state(id) } #[inline] fn is_dead_state(&self, id: S) -> bool { self.0.is_dead_state(id) } #[inline] fn is_match_or_dead_state(&self, id: S) -> bool { self.0.is_match_or_dead_state(id) } #[inline] fn is_anchored(&self) -> bool { self.0.is_anchored() } #[inline] fn next_state(&self, current: S, input: u8) -> S { let o = current.to_usize() * ALPHABET_LEN + input as usize; self.0.trans()[o] } #[inline] unsafe fn next_state_unchecked(&self, current: S, input: u8) -> S { let o = current.to_usize() * ALPHABET_LEN + input as usize; *self.0.trans().get_unchecked(o) } } /// A dense DFA that shrinks its alphabet. /// /// Alphabet shrinking is achieved by using a set of equivalence classes /// instead of using all possible byte values. Any two bytes belong to the same /// equivalence class if and only if they can be used interchangeably anywhere /// in the DFA while never discriminating between a match and a non-match. /// /// This type of DFA can result in significant space reduction with a very /// small match time performance penalty. /// /// Generally, it isn't necessary to use this type directly, since a `DenseDFA` /// can be used for searching directly. 
One possible reason why one might want /// to use this type directly is if you are implementing your own search /// routines by walking a DFA's transitions directly. In that case, you'll want /// to use this type (or any of the other DFA variant types) directly, since /// they implement `next_state` more efficiently. #[derive(Clone, Debug)] pub struct ByteClass<T: AsRef<[S]>, S: StateID>(Repr<T, S>); impl<T: AsRef<[S]>, S: StateID> DFA for ByteClass<T, S> { type ID = S; #[inline] fn start_state(&self) -> S { self.0.start_state() } #[inline] fn is_match_state(&self, id: S) -> bool { self.0.is_match_state(id) } #[inline] fn is_dead_state(&self, id: S) -> bool { self.0.is_dead_state(id) } #[inline] fn is_match_or_dead_state(&self, id: S) -> bool { self.0.is_match_or_dead_state(id) } #[inline] fn is_anchored(&self) -> bool { self.0.is_anchored() } #[inline] fn next_state(&self, current: S, input: u8) -> S { let input = self.0.byte_classes().get(input); let o = current.to_usize() * self.0.alphabet_len() + input as usize; self.0.trans()[o] } #[inline] unsafe fn next_state_unchecked(&self, current: S, input: u8) -> S { let input = self.0.byte_classes().get_unchecked(input); let o = current.to_usize() * self.0.alphabet_len() + input as usize; *self.0.trans().get_unchecked(o) } } /// A dense DFA that premultiplies all of its state identifiers in its /// transition table. /// /// This saves an instruction per byte at match time which improves search /// performance. /// /// The only downside of premultiplication is that it may prevent one from /// using a smaller state identifier representation than you otherwise could. /// /// Generally, it isn't necessary to use this type directly, since a `DenseDFA` /// can be used for searching directly. One possible reason why one might want /// to use this type directly is if you are implementing your own search /// routines by walking a DFA's transitions directly. In that case, you'll want /// to use this type (or any of the other DFA variant types) directly, since /// they implement `next_state` more efficiently. #[derive(Clone, Debug)] pub struct Premultiplied<T: AsRef<[S]>, S: StateID>(Repr<T, S>); impl<T: AsRef<[S]>, S: StateID> DFA for Premultiplied<T, S> { type ID = S; #[inline] fn start_state(&self) -> S { self.0.start_state() } #[inline] fn is_match_state(&self, id: S) -> bool { self.0.is_match_state(id) } #[inline] fn is_dead_state(&self, id: S) -> bool { self.0.is_dead_state(id) } #[inline] fn is_match_or_dead_state(&self, id: S) -> bool { self.0.is_match_or_dead_state(id) } #[inline] fn is_anchored(&self) -> bool { self.0.is_anchored() } #[inline] fn next_state(&self, current: S, input: u8) -> S { let o = current.to_usize() + input as usize; self.0.trans()[o] } #[inline] unsafe fn next_state_unchecked(&self, current: S, input: u8) -> S { let o = current.to_usize() + input as usize; *self.0.trans().get_unchecked(o) } } /// The default configuration of a dense DFA, which uses byte classes and /// premultiplies its state identifiers. /// /// Generally, it isn't necessary to use this type directly, since a `DenseDFA` /// can be used for searching directly. One possible reason why one might want /// to use this type directly is if you are implementing your own search /// routines by walking a DFA's transitions directly. In that case, you'll want /// to use this type (or any of the other DFA variant types) directly, since /// they implement `next_state` more efficiently. 
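///
/// # Example
///
/// As a sketch of what "walking the transitions directly" looks like, here is
/// a hand-rolled shortest match search. For brevity it goes through the
/// top-level `DenseDFA`; in a real hot path one would first match on the
/// variant and run the same loop on the concrete variant type:
///
/// ```
/// use regex_automata::{DFA, DenseDFA};
///
/// # fn example() -> Result<(), regex_automata::Error> {
/// let dfa = DenseDFA::new("foo[0-9]+")?;
/// let haystack = b"foo12345";
///
/// let mut state = dfa.start_state();
/// let mut end = None;
/// for (i, &b) in haystack.iter().enumerate() {
///     state = dfa.next_state(state, b);
///     if dfa.is_match_state(state) {
///         // The shortest match ends just after the byte at position i.
///         end = Some(i + 1);
///         break;
///     }
///     if dfa.is_dead_state(state) {
///         break;
///     }
/// }
/// assert_eq!(Some(4), end);
/// # Ok(()) }; example().unwrap()
/// ```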
#[derive(Clone, Debug)] pub struct PremultipliedByteClass<T: AsRef<[S]>, S: StateID>(Repr<T, S>); impl<T: AsRef<[S]>, S: StateID> DFA for PremultipliedByteClass<T, S> { type ID = S; #[inline] fn start_state(&self) -> S { self.0.start_state() } #[inline] fn is_match_state(&self, id: S) -> bool { self.0.is_match_state(id) } #[inline] fn is_dead_state(&self, id: S) -> bool { self.0.is_dead_state(id) } #[inline] fn is_match_or_dead_state(&self, id: S) -> bool { self.0.is_match_or_dead_state(id) } #[inline] fn is_anchored(&self) -> bool { self.0.is_anchored() } #[inline] fn next_state(&self, current: S, input: u8) -> S { let input = self.0.byte_classes().get(input); let o = current.to_usize() + input as usize; self.0.trans()[o] } #[inline] unsafe fn next_state_unchecked(&self, current: S, input: u8) -> S { let input = self.0.byte_classes().get_unchecked(input); let o = current.to_usize() + input as usize; *self.0.trans().get_unchecked(o) } } /// The internal representation of a dense DFA. /// /// This representation is shared by all DFA variants. #[derive(Clone)] #[cfg_attr(not(feature = "std"), derive(Debug))] pub(crate) struct Repr<T, S> { /// Whether the state identifiers in the transition table have been /// premultiplied or not. /// /// Premultiplied identifiers means that instead of your matching loop /// looking something like this: /// /// state = dfa.start /// for byte in haystack: /// next = dfa.transitions[state * len(alphabet) + byte] /// if dfa.is_match(next): /// return true /// return false /// /// it can instead look like this: /// /// state = dfa.start /// for byte in haystack: /// next = dfa.transitions[state + byte] /// if dfa.is_match(next): /// return true /// return false /// /// In other words, we save a multiplication instruction in the critical /// path. This turns out to be a decent performance win. The cost of using /// premultiplied state ids is that they can require a bigger state id /// representation. premultiplied: bool, /// Whether this DFA can only match at the beginning of input or not. /// /// When true, a match should only be reported if it begins at the 0th /// index of the haystack. anchored: bool, /// The initial start state ID. start: S, /// The total number of states in this DFA. Note that a DFA always has at /// least one state---the dead state---even the empty DFA. In particular, /// the dead state always has ID 0 and is correspondingly always the first /// state. The dead state is never a match state. state_count: usize, /// States in a DFA have a *partial* ordering such that a match state /// always precedes any non-match state (except for the special dead /// state). /// /// `max_match` corresponds to the last state that is a match state. This /// encoding has two critical benefits. Firstly, we are not required to /// store any additional per-state information about whether it is a match /// state or not. Secondly, when searching with the DFA, we can do a single /// comparison with `max_match` for each byte instead of two comparisons /// for each byte (one testing whether it is a match and the other testing /// whether we've reached a dead state). 
Namely, to determine the status /// of the next state, we can do this: /// /// next_state = transition[cur_state * alphabet_len + cur_byte] /// if next_state <= max_match: /// // next_state is either dead (no-match) or a match /// return next_state != dead max_match: S, /// A set of equivalence classes, where a single equivalence class /// represents a set of bytes that never discriminate between a match /// and a non-match in the DFA. Each equivalence class corresponds to /// a single letter in this DFA's alphabet, where the maximum number of /// letters is 256 (each possible value of a byte). Consequently, the /// number of equivalence classes corresponds to the number of transitions /// for each DFA state. /// /// The only time the number of equivalence classes is fewer than 256 is /// if the DFA's kind uses byte classes. If the DFA doesn't use byte /// classes, then this vector is empty. byte_classes: ByteClasses, /// A contiguous region of memory representing the transition table in /// row-major order. The representation is dense. That is, every state has /// precisely the same number of transitions. The maximum number of /// transitions is 256. If a DFA has been instructed to use byte classes, /// then the number of transitions can be much less. /// /// In practice, T is either Vec<S> or &[S]. trans: T, } #[cfg(feature = "std")] impl<S: StateID> Repr<Vec<S>, S> { /// Create a new empty DFA with singleton byte classes (every byte is its /// own equivalence class). pub fn empty() -> Repr<Vec<S>, S> { Repr::empty_with_byte_classes(ByteClasses::singletons()) } /// Create a new empty DFA with the given set of byte equivalence classes. /// An empty DFA never matches any input. pub fn empty_with_byte_classes( byte_classes: ByteClasses, ) -> Repr<Vec<S>, S> { let mut dfa = Repr { premultiplied: false, anchored: true, start: dead_id(), state_count: 0, max_match: S::from_usize(0), byte_classes, trans: vec![], }; // Every state ID repr must be able to fit at least one state. dfa.add_empty_state().unwrap(); dfa } /// Sets whether this DFA is anchored or not. pub fn anchored(mut self, yes: bool) -> Repr<Vec<S>, S> { self.anchored = yes; self } } impl<T: AsRef<[S]>, S: StateID> Repr<T, S> { /// Convert this internal DFA representation to a DenseDFA based on its /// transition table access pattern. pub fn into_dense_dfa(self) -> DenseDFA<T, S> { match (self.premultiplied, self.byte_classes().is_singleton()) { // no premultiplication, no byte classes (false, true) => DenseDFA::Standard(Standard(self)), // no premultiplication, yes byte classes (false, false) => DenseDFA::ByteClass(ByteClass(self)), // yes premultiplication, no byte classes (true, true) => DenseDFA::Premultiplied(Premultiplied(self)), // yes premultiplication, yes byte classes (true, false) => { DenseDFA::PremultipliedByteClass(PremultipliedByteClass(self)) } } } fn as_ref<'a>(&'a self) -> Repr<&'a [S], S> { Repr { premultiplied: self.premultiplied, anchored: self.anchored, start: self.start, state_count: self.state_count, max_match: self.max_match, byte_classes: self.byte_classes().clone(), trans: self.trans(), } } #[cfg(feature = "std")] fn to_owned(&self) -> Repr<Vec<S>, S> { Repr { premultiplied: self.premultiplied, anchored: self.anchored, start: self.start, state_count: self.state_count, max_match: self.max_match, byte_classes: self.byte_classes().clone(), trans: self.trans().to_vec(), } } /// Return the starting state of this DFA. /// /// All searches using this DFA must begin at this state. 
There is exactly /// one starting state for every DFA. A starting state may be a dead state /// or a matching state or neither. pub fn start_state(&self) -> S { self.start } /// Returns true if and only if the given identifier corresponds to a match /// state. pub fn is_match_state(&self, id: S) -> bool { id <= self.max_match && id != dead_id() } /// Returns true if and only if the given identifier corresponds to a dead /// state. pub fn is_dead_state(&self, id: S) -> bool { id == dead_id() } /// Returns true if and only if the given identifier could correspond to /// either a match state or a dead state. If this returns false, then the /// given identifier does not correspond to either a match state or a dead /// state. pub fn is_match_or_dead_state(&self, id: S) -> bool { id <= self.max_match_state() } /// Returns the maximum identifier for which a match state can exist. /// /// More specifically, the return identifier always corresponds to either /// a match state or a dead state. Namely, either /// `is_match_state(returned)` or `is_dead_state(returned)` is guaranteed /// to be true. pub fn max_match_state(&self) -> S { self.max_match } /// Returns true if and only if this DFA is anchored. pub fn is_anchored(&self) -> bool { self.anchored } /// Return the byte classes used by this DFA. pub fn byte_classes(&self) -> &ByteClasses { &self.byte_classes } /// Returns an iterator over all states in this DFA. /// /// This iterator yields a tuple for each state. The first element of the /// tuple corresponds to a state's identifier, and the second element /// corresponds to the state itself (comprised of its transitions). /// /// If this DFA is premultiplied, then the state identifiers are in /// turn premultiplied as well, making them usable without additional /// modification. #[cfg(feature = "std")] pub fn states(&self) -> StateIter<T, S> { let it = self.trans().chunks(self.alphabet_len()); StateIter { dfa: self, it: it.enumerate() } } /// Return the total number of states in this DFA. Every DFA has at least /// 1 state, even the empty DFA. #[cfg(feature = "std")] pub fn state_count(&self) -> usize { self.state_count } /// Return the number of elements in this DFA's alphabet. /// /// If this DFA doesn't use byte classes, then this is always equivalent /// to 256. Otherwise, it is guaranteed to be some value less than or equal /// to 256. pub fn alphabet_len(&self) -> usize { self.byte_classes().alphabet_len() } /// Returns the memory usage, in bytes, of this DFA. pub fn memory_usage(&self) -> usize { self.trans().len() * mem::size_of::<S>() } /// Convert the given state identifier to the state's index. The state's /// index corresponds to the position in which it appears in the transition /// table. When a DFA is NOT premultiplied, then a state's identifier is /// also its index. When a DFA is premultiplied, then a state's identifier /// is equal to `index * alphabet_len`. This routine reverses that. #[cfg(feature = "std")] pub fn state_id_to_index(&self, id: S) -> usize { if self.premultiplied { id.to_usize() / self.alphabet_len() } else { id.to_usize() } } /// Return this DFA's transition table as a slice. fn trans(&self) -> &[S] { self.trans.as_ref() } /// Create a sparse DFA from the internal representation of a dense DFA. 
#[cfg(feature = "std")] pub fn to_sparse_sized<A: StateID>( &self, ) -> Result<SparseDFA<Vec<u8>, A>> { SparseDFA::from_dense_sized(self) } /// Create a new DFA whose match semantics are equivalent to this DFA, but /// attempt to use `A` for the representation of state identifiers. If `A` /// is insufficient to represent all state identifiers in this DFA, then /// this returns an error. #[cfg(feature = "std")] pub fn to_sized<A: StateID>(&self) -> Result<Repr<Vec<A>, A>> { // Check that this DFA can fit into A's representation. let mut last_state_id = self.state_count - 1; if self.premultiplied { last_state_id *= self.alphabet_len(); } if last_state_id > A::max_id() { return Err(Error::state_id_overflow(A::max_id())); } // We're off to the races. The new DFA is the same as the old one, // but every state identifier in its transition table is re-encoded // in the new representation. let mut new = Repr { premultiplied: self.premultiplied, anchored: self.anchored, start: A::from_usize(self.start.to_usize()), state_count: self.state_count, max_match: A::from_usize(self.max_match.to_usize()), byte_classes: self.byte_classes().clone(), trans: vec![dead_id::<A>(); self.trans().len()], }; for (i, id) in new.trans.iter_mut().enumerate() { *id = A::from_usize(self.trans()[i].to_usize()); } Ok(new) } /// Serialize a DFA to raw bytes, aligned to an 8 byte boundary. /// /// If the state identifier representation of this DFA has a size different /// than 1, 2, 4 or 8 bytes, then this returns an error. All /// implementations of `StateID` provided by this crate satisfy this /// requirement. #[cfg(feature = "std")] pub(crate) fn to_bytes<A: ByteOrder>(&self) -> Result<Vec<u8>> { let label = b"rust-regex-automata-dfa\x00"; assert_eq!(24, label.len()); let trans_size = mem::size_of::<S>() * self.trans().len(); let size = // For human readable label. label.len() // endianness check, must be equal to 0xFEFF for native endian + 2 // For version number. + 2 // Size of state ID representation, in bytes. // Must be 1, 2, 4 or 8. + 2 // For DFA misc options. + 2 // For start state. + 8 // For state count. + 8 // For max match state. + 8 // For byte class map. + 256 // For transition table. + trans_size; // sanity check, this can be updated if need be assert_eq!(312 + trans_size, size); // This must always pass. It checks that the transition table is at // a properly aligned address.
assert_eq!(0, (size - trans_size) % 8); let mut buf = vec![0; size]; let mut i = 0; // write label for &b in label { buf[i] = b; i += 1; } // endianness check A::write_u16(&mut buf[i..], 0xFEFF); i += 2; // version number A::write_u16(&mut buf[i..], 1); i += 2; // size of state ID let state_size = mem::size_of::<S>(); if ![1, 2, 4, 8].contains(&state_size) { return Err(Error::serialize(&format!( "state size of {} not supported, must be 1, 2, 4 or 8", state_size ))); } A::write_u16(&mut buf[i..], state_size as u16); i += 2; // DFA misc options let mut options = 0u16; if self.premultiplied { options |= MASK_PREMULTIPLIED; } if self.anchored { options |= MASK_ANCHORED; } A::write_u16(&mut buf[i..], options); i += 2; // start state A::write_u64(&mut buf[i..], self.start.to_usize() as u64); i += 8; // state count A::write_u64(&mut buf[i..], self.state_count as u64); i += 8; // max match state A::write_u64(&mut buf[i..], self.max_match.to_usize() as u64); i += 8; // byte class map for b in (0..256).map(|b| b as u8) { buf[i] = self.byte_classes().get(b); i += 1; } // transition table for &id in self.trans() { write_state_id_bytes::<A, _>(&mut buf[i..], id); i += state_size; } assert_eq!(size, i, "expected to consume entire buffer"); Ok(buf) } } impl<'a, S: StateID> Repr<&'a [S], S> { /// The implementation for deserializing a DFA from raw bytes. unsafe fn from_bytes(mut buf: &'a [u8]) -> Repr<&'a [S], S> { assert_eq!( 0, buf.as_ptr() as usize % mem::align_of::<S>(), "DenseDFA starting at address {} is not aligned to {} bytes", buf.as_ptr() as usize, mem::align_of::<S>() ); // skip over label match buf.iter().position(|&b| b == b'\x00') { None => panic!("could not find label"), Some(i) => buf = &buf[i + 1..], } // check that current endianness is same as endianness of DFA let endian_check = NativeEndian::read_u16(buf); buf = &buf[2..]; if endian_check != 0xFEFF { panic!( "endianness mismatch, expected 0xFEFF but got 0x{:X}. \ are you trying to load a DenseDFA serialized with a \ different endianness?", endian_check, ); } // check that the version number is supported let version = NativeEndian::read_u16(buf); buf = &buf[2..]; if version != 1 { panic!( "expected version 1, but found unsupported version {}", version, ); } // read size of state let state_size = NativeEndian::read_u16(buf) as usize; if state_size != mem::size_of::<S>() { panic!( "state size of DenseDFA ({}) does not match \ requested state size ({})", state_size, mem::size_of::<S>(), ); } buf = &buf[2..]; // read miscellaneous options let opts = NativeEndian::read_u16(buf); buf = &buf[2..]; // read start state let start = S::from_usize(NativeEndian::read_u64(buf) as usize); buf = &buf[8..]; // read state count let state_count = NativeEndian::read_u64(buf) as usize; buf = &buf[8..]; // read max match state let max_match = S::from_usize(NativeEndian::read_u64(buf) as usize); buf = &buf[8..]; // read byte classes let byte_classes = ByteClasses::from_slice(&buf[..256]); buf = &buf[256..]; let len = state_count * byte_classes.alphabet_len(); let len_bytes = len * state_size; assert!( len_bytes <= buf.len(), "insufficient transition table bytes, \ expected at least {} but only have {}", len_bytes, buf.len() ); assert_eq!( 0, buf.as_ptr() as usize % mem::align_of::<S>(), "DenseDFA transition table is not properly aligned" ); // SAFETY: This is the only actually unsafe thing in this entire // routine. The key things we need to worry about here are alignment // and size. The two asserts above should cover both conditions.
let trans = slice::from_raw_parts(buf.as_ptr() as *const S, len); Repr { premultiplied: opts & MASK_PREMULTIPLIED > 0, anchored: opts & MASK_ANCHORED > 0, start, state_count, max_match, byte_classes, trans, } } } /// The following methods implement mutable routines on the internal /// representation of a DFA. As such, we must fix the first type parameter to /// a `Vec<S>` since a generic `T: AsRef<[S]>` does not permit mutation. We /// can get away with this because these methods are internal to the crate and /// are exclusively used during construction of the DFA. #[cfg(feature = "std")] impl<S: StateID> Repr<Vec<S>, S> { pub fn premultiply(&mut self) -> Result<()> { if self.premultiplied || self.state_count <= 1 { return Ok(()); } let alpha_len = self.alphabet_len(); premultiply_overflow_error( S::from_usize(self.state_count - 1), alpha_len, )?; for id in (0..self.state_count).map(S::from_usize) { for (_, next) in self.get_state_mut(id).iter_mut() { *next = S::from_usize(next.to_usize() * alpha_len); } } self.premultiplied = true; self.start = S::from_usize(self.start.to_usize() * alpha_len); self.max_match = S::from_usize(self.max_match.to_usize() * alpha_len); Ok(()) } /// Minimize this DFA using Hopcroft's algorithm. /// /// This cannot be called on a premultiplied DFA. pub fn minimize(&mut self) { assert!(!self.premultiplied, "can't minimize premultiplied DFA"); Minimizer::new(self).run(); } /// Set the start state of this DFA. /// /// Note that a start state cannot be set on a premultiplied DFA. Instead, /// DFAs should first be completely constructed and then premultiplied. pub fn set_start_state(&mut self, start: S) { assert!(!self.premultiplied, "can't set start on premultiplied DFA"); assert!(start.to_usize() < self.state_count, "invalid start state"); self.start = start; } /// Set the maximum state identifier that could possibly correspond to a /// match state. /// /// Callers must uphold the invariant that any state identifier less than /// or equal to the identifier given is either a match state or the special /// dead state (which always has identifier 0 and whose transitions all /// lead back to itself). /// /// This cannot be called on a premultiplied DFA. pub fn set_max_match_state(&mut self, id: S) { assert!(!self.premultiplied, "can't set match on premultiplied DFA"); assert!(id.to_usize() < self.state_count, "invalid max match state"); self.max_match = id; } /// Add the given transition to this DFA. Both the `from` and `to` states /// must already exist. /// /// This cannot be called on a premultiplied DFA. pub fn add_transition(&mut self, from: S, byte: u8, to: S) { assert!(!self.premultiplied, "can't add trans to premultiplied DFA"); assert!(from.to_usize() < self.state_count, "invalid from state"); assert!(to.to_usize() < self.state_count, "invalid to state"); let class = self.byte_classes().get(byte); let offset = from.to_usize() * self.alphabet_len() + class as usize; self.trans[offset] = to; } /// Add an empty state (a state where all transitions lead to a dead state) /// and return its identifier. The identifier returned is guaranteed to /// not point to any other existing state. /// /// If adding a state would exhaust the state identifier space (given by /// `S`), then this returns an error. In practice, this means that the /// state identifier representation chosen is too small. /// /// This cannot be called on a premultiplied DFA.
pub fn add_empty_state(&mut self) -> Result<S> { assert!(!self.premultiplied, "can't add state to premultiplied DFA"); let id = if self.state_count == 0 { S::from_usize(0) } else { next_state_id(S::from_usize(self.state_count - 1))? }; let alphabet_len = self.alphabet_len(); self.trans.extend(iter::repeat(dead_id::<S>()).take(alphabet_len)); // This should never panic, since state_count is a usize. The // transition table size would have run out of room long ago. self.state_count = self.state_count.checked_add(1).unwrap(); Ok(id) } /// Return a mutable representation of the state corresponding to the given /// id. This is useful for implementing routines that manipulate DFA states /// (e.g., swapping states). /// /// This cannot be called on a premultiplied DFA. pub fn get_state_mut(&mut self, id: S) -> StateMut<S> { assert!(!self.premultiplied, "can't get state in premultiplied DFA"); let alphabet_len = self.alphabet_len(); let offset = id.to_usize() * alphabet_len; StateMut { transitions: &mut self.trans[offset..offset + alphabet_len], } } /// Swap the two states given in the transition table. /// /// This routine does not do anything to check the correctness of this /// swap. Callers must ensure that other states pointing to id1 and id2 are /// updated appropriately. /// /// This cannot be called on a premultiplied DFA. pub fn swap_states(&mut self, id1: S, id2: S) { assert!(!self.premultiplied, "can't swap states in premultiplied DFA"); let o1 = id1.to_usize() * self.alphabet_len(); let o2 = id2.to_usize() * self.alphabet_len(); for b in 0..self.alphabet_len() { self.trans.swap(o1 + b, o2 + b); } } /// Truncate the states in this DFA to the given count. /// /// This routine does not do anything to check the correctness of this /// truncation. Callers must ensure that other states pointing to truncated /// states are updated appropriately. /// /// This cannot be called on a premultiplied DFA. pub fn truncate_states(&mut self, count: usize) { assert!(!self.premultiplied, "can't truncate in premultiplied DFA"); let alphabet_len = self.alphabet_len(); self.trans.truncate(count * alphabet_len); self.state_count = count; } /// This routine shuffles all match states in this DFA---according to the /// given map---to the beginning of the DFA such that every non-match state /// appears after every match state. (With one exception: the special dead /// state remains as the first state.) The given map should have length /// exactly equivalent to the number of states in this DFA. /// /// The purpose of doing this shuffling is to avoid the need to store /// additional state to determine whether a state is a match state or not. /// It also enables a single conditional in the core matching loop instead /// of two. /// /// This updates `self.max_match` to point to the last matching state as /// well as `self.start` if the starting state was moved. 
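    ///
    /// A small illustration of the resulting layout, where `*` marks a match
    /// state (state names are made up for this example):
    ///
    /// ```text
    /// before: [dead, s1, s2*, s3, s4*]
    /// after:  [dead, s4*, s2*, s3, s1]    max_match = 2
    /// ```
    ///
    /// After the shuffle, the matching loop can classify a state with a
    /// single comparison, `id <= max_match`, with the dead state singled out
    /// separately by `id == dead_id()`.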
pub fn shuffle_match_states(&mut self, is_match: &[bool]) { assert!( !self.premultiplied, "cannot shuffle match states of premultiplied DFA" ); assert_eq!(self.state_count, is_match.len()); if self.state_count <= 1 { return; } let mut first_non_match = 1; while first_non_match < self.state_count && is_match[first_non_match] { first_non_match += 1; } let mut swaps: Vec<S> = vec![dead_id(); self.state_count]; let mut cur = self.state_count - 1; while cur > first_non_match { if is_match[cur] { self.swap_states( S::from_usize(cur), S::from_usize(first_non_match), ); swaps[cur] = S::from_usize(first_non_match); swaps[first_non_match] = S::from_usize(cur); first_non_match += 1; while first_non_match < cur && is_match[first_non_match] { first_non_match += 1; } } cur -= 1; } for id in (0..self.state_count).map(S::from_usize) { for (_, next) in self.get_state_mut(id).iter_mut() { if swaps[next.to_usize()] != dead_id() { *next = swaps[next.to_usize()]; } } } if swaps[self.start.to_usize()] != dead_id() { self.start = swaps[self.start.to_usize()]; } self.max_match = S::from_usize(first_non_match - 1); } } #[cfg(feature = "std")] impl<T: AsRef<[S]>, S: StateID> fmt::Debug for Repr<T, S> { fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result { fn state_status<T: AsRef<[S]>, S: StateID>( dfa: &Repr<T, S>, id: S, ) -> &'static str { if id == dead_id() { if dfa.is_match_state(id) { "D*" } else { "D " } } else if id == dfa.start_state() { if dfa.is_match_state(id) { ">*" } else { "> " } } else { if dfa.is_match_state(id) { " *" } else { " " } } } writeln!(f, "DenseDFA(")?; for (id, state) in self.states() { let status = state_status(self, id); writeln!(f, "{}{:06}: {:?}", status, id.to_usize(), state)?; } writeln!(f, ")")?; Ok(()) } } /// An iterator over all states in a DFA. /// /// This iterator yields a tuple for each state. The first element of the /// tuple corresponds to a state's identifier, and the second element /// corresponds to the state itself (comprised of its transitions). /// /// If this DFA is premultiplied, then the state identifiers are in turn /// premultiplied as well, making them usable without additional modification. /// /// `'a` corresponding to the lifetime of original DFA, `T` corresponds to /// the type of the transition table itself and `S` corresponds to the state /// identifier representation. #[cfg(feature = "std")] pub(crate) struct StateIter<'a, T: 'a, S: 'a> { dfa: &'a Repr<T, S>, it: iter::Enumerate<slice::Chunks<'a, S>>, } #[cfg(feature = "std")] impl<'a, T: AsRef<[S]>, S: StateID> Iterator for StateIter<'a, T, S> { type Item = (S, State<'a, S>); fn next(&mut self) -> Option<(S, State<'a, S>)> { self.it.next().map(|(id, chunk)| { let state = State { transitions: chunk }; let id = if self.dfa.premultiplied { id * self.dfa.alphabet_len() } else { id }; (S::from_usize(id), state) }) } } /// An immutable representation of a single DFA state. /// /// `'a` correspondings to the lifetime of a DFA's transition table and `S` /// corresponds to the state identifier representation. #[cfg(feature = "std")] pub(crate) struct State<'a, S: 'a> { transitions: &'a [S], } #[cfg(feature = "std")] impl<'a, S: StateID> State<'a, S> { /// Return an iterator over all transitions in this state. This yields /// a number of transitions equivalent to the alphabet length of the /// corresponding DFA. /// /// Each transition is represented by a tuple. The first element is /// the input byte for that transition and the second element is the /// transitions itself. 
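    ///
    /// A sketch of how the iterator is typically consumed (marked `ignore`
    /// because `State` values are only produced by crate-internal
    /// iterators):
    ///
    /// ```ignore
    /// for (input, next_id) in state.transitions() {
    ///     if next_id == dead_id() {
    ///         // no transition: a search seeing `input` here would fail
    ///     }
    /// }
    /// ```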
pub fn transitions(&self) -> StateTransitionIter<S> { StateTransitionIter { it: self.transitions.iter().enumerate() } } /// Return an iterator over a sparse representation of the transitions in /// this state. Only non-dead transitions are returned. /// /// The "sparse" representation in this case corresponds to a sequence of /// triples. The first two elements of the triple comprise an inclusive /// byte range while the last element corresponds to the transition taken /// for all bytes in the range. /// /// This is somewhat more condensed than the classical sparse /// representation (where you have an element for every non-dead /// transition), but in practice, checking if a byte is in a range is very /// cheap and using ranges tends to conserve quite a bit more space. pub fn sparse_transitions(&self) -> StateSparseTransitionIter<S> { StateSparseTransitionIter { dense: self.transitions(), cur: None } } } #[cfg(feature = "std")] impl<'a, S: StateID> fmt::Debug for State<'a, S> { fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result { let mut transitions = vec![]; for (start, end, next_id) in self.sparse_transitions() { let line = if start == end { format!("{} => {}", escape(start), next_id.to_usize()) } else { format!( "{}-{} => {}", escape(start), escape(end), next_id.to_usize(), ) }; transitions.push(line); } write!(f, "{}", transitions.join(", "))?; Ok(()) } } /// An iterator over all transitions in a single DFA state. This yields /// a number of transitions equivalent to the alphabet length of the /// corresponding DFA. /// /// Each transition is represented by a tuple. The first element is the input /// byte for that transition and the second element is the transitions itself. #[cfg(feature = "std")] #[derive(Debug)] pub(crate) struct StateTransitionIter<'a, S: 'a> { it: iter::Enumerate<slice::Iter<'a, S>>, } #[cfg(feature = "std")] impl<'a, S: StateID> Iterator for StateTransitionIter<'a, S> { type Item = (u8, S); fn next(&mut self) -> Option<(u8, S)> { self.it.next().map(|(i, &id)| (i as u8, id)) } } /// An iterator over all transitions in a single DFA state using a sparse /// representation. /// /// Each transition is represented by a triple. The first two elements of the /// triple comprise an inclusive byte range while the last element corresponds /// to the transition taken for all bytes in the range. #[cfg(feature = "std")] #[derive(Debug)] pub(crate) struct StateSparseTransitionIter<'a, S: 'a> { dense: StateTransitionIter<'a, S>, cur: Option<(u8, u8, S)>, } #[cfg(feature = "std")] impl<'a, S: StateID> Iterator for StateSparseTransitionIter<'a, S> { type Item = (u8, u8, S); fn next(&mut self) -> Option<(u8, u8, S)> { while let Some((b, next)) = self.dense.next() { let (prev_start, prev_end, prev_next) = match self.cur { Some(t) => t, None => { self.cur = Some((b, b, next)); continue; } }; if prev_next == next { self.cur = Some((prev_start, b, prev_next)); } else { self.cur = Some((b, b, next)); if prev_next != dead_id() { return Some((prev_start, prev_end, prev_next)); } } } if let Some((start, end, next)) = self.cur.take() { if next != dead_id() { return Some((start, end, next)); } } None } } /// A mutable representation of a single DFA state. /// /// `'a` correspondings to the lifetime of a DFA's transition table and `S` /// corresponds to the state identifier representation. 
#[cfg(feature = "std")] pub(crate) struct StateMut<'a, S: 'a> { transitions: &'a mut [S], } #[cfg(feature = "std")] impl<'a, S: StateID> StateMut<'a, S> { /// Return an iterator over all transitions in this state. This yields /// a number of transitions equivalent to the alphabet length of the /// corresponding DFA. /// /// Each transition is represented by a tuple. The first element is the /// input byte for that transition and the second element is a mutable /// reference to the transition itself. pub fn iter_mut(&mut self) -> StateTransitionIterMut<S> { StateTransitionIterMut { it: self.transitions.iter_mut().enumerate() } } } #[cfg(feature = "std")] impl<'a, S: StateID> fmt::Debug for StateMut<'a, S> { fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result { fmt::Debug::fmt(&State { transitions: self.transitions }, f) } } /// A mutable iterator over all transitions in a DFA state. /// /// Each transition is represented by a tuple. The first element is the /// input byte for that transition and the second element is a mutable /// reference to the transition itself. #[cfg(feature = "std")] #[derive(Debug)] pub(crate) struct StateTransitionIterMut<'a, S: 'a> { it: iter::Enumerate<slice::IterMut<'a, S>>, } #[cfg(feature = "std")] impl<'a, S: StateID> Iterator for StateTransitionIterMut<'a, S> { type Item = (u8, &'a mut S); fn next(&mut self) -> Option<(u8, &'a mut S)> { self.it.next().map(|(i, id)| (i as u8, id)) } } /// A builder for constructing a deterministic finite automaton from regular /// expressions. /// /// This builder permits configuring several aspects of the construction /// process such as case insensitivity, Unicode support and various options /// that impact the size of the generated DFA. In some cases, options (like /// performing DFA minimization) can come with a substantial additional cost. /// /// This builder always constructs a *single* DFA. As such, this builder can /// only be used to construct regexes that either detect the presence of a /// match or find the end location of a match. A single DFA cannot produce both /// the start and end of a match. For that information, use a /// [`Regex`](struct.Regex.html), which can be similarly configured using /// [`RegexBuilder`](struct.RegexBuilder.html). #[cfg(feature = "std")] #[derive(Clone, Debug)] pub struct Builder { parser: ParserBuilder, nfa: nfa::Builder, anchored: bool, minimize: bool, premultiply: bool, byte_classes: bool, reverse: bool, longest_match: bool, } #[cfg(feature = "std")] impl Builder { /// Create a new DenseDFA builder with the default configuration. pub fn new() -> Builder { let mut nfa = nfa::Builder::new(); // This is enabled by default, but we set it here anyway. Since we're // building a DFA, shrinking the NFA is always a good idea. nfa.shrink(true); Builder { parser: ParserBuilder::new(), nfa, anchored: false, minimize: false, premultiply: true, byte_classes: true, reverse: false, longest_match: false, } } /// Build a DFA from the given pattern. /// /// If there was a problem parsing or compiling the pattern, then an error /// is returned. pub fn build(&self, pattern: &str) -> Result<DenseDFA<Vec<usize>, usize>> { self.build_with_size::<usize>(pattern) } /// Build a DFA from the given pattern using a specific representation for /// the DFA's state IDs. /// /// If there was a problem parsing or compiling the pattern, then an error /// is returned. /// /// The representation of state IDs is determined by the `S` type /// parameter. 
In general, `S` is usually one of `u8`, `u16`, `u32`, `u64` /// or `usize`, where `usize` is the default used for `build`. The purpose /// of specifying a representation for state IDs is to reduce the memory /// footprint of a DFA. /// /// When using this routine, the chosen state ID representation will be /// used throughout determinization and minimization, if minimization /// was requested. Even if the minimized DFA can fit into the chosen /// state ID representation but the initial determinized DFA cannot, /// then this will still return an error. To get a minimized DFA with a /// smaller state ID representation, first build it with a bigger state ID /// representation, and then shrink the size of the DFA using one of its /// conversion routines, such as /// [`DenseDFA::to_u16`](enum.DenseDFA.html#method.to_u16). pub fn build_with_size<S: StateID>( &self, pattern: &str, ) -> Result<DenseDFA<Vec<S>, S>> { self.build_from_nfa(&self.build_nfa(pattern)?) } /// An internal only (for now) API for building a dense DFA directly from /// an NFA. pub(crate) fn build_from_nfa<S: StateID>( &self, nfa: &NFA, ) -> Result<DenseDFA<Vec<S>, S>> { if self.longest_match && !self.anchored { return Err(Error::unsupported_longest_match()); } let mut dfa = if self.byte_classes { Determinizer::new(nfa) .with_byte_classes() .longest_match(self.longest_match) .build() } else { Determinizer::new(nfa).longest_match(self.longest_match).build() }?; if self.minimize { dfa.minimize(); } if self.premultiply { dfa.premultiply()?; } Ok(dfa.into_dense_dfa()) } /// Builds an NFA from the given pattern. pub(crate) fn build_nfa(&self, pattern: &str) -> Result<NFA> { let hir = self.parser.build().parse(pattern).map_err(Error::syntax)?; Ok(self.nfa.build(&hir)?) } /// Set whether matching must be anchored at the beginning of the input. /// /// When enabled, a match must begin at the start of the input. When /// disabled, the DFA will act as if the pattern started with a `.*?`, /// which enables a match to appear anywhere. /// /// By default this is disabled. pub fn anchored(&mut self, yes: bool) -> &mut Builder { self.anchored = yes; self.nfa.anchored(yes); self } /// Enable or disable the case insensitive flag by default. /// /// By default this is disabled. It may alternatively be selectively /// enabled in the regular expression itself via the `i` flag. pub fn case_insensitive(&mut self, yes: bool) -> &mut Builder { self.parser.case_insensitive(yes); self } /// Enable verbose mode in the regular expression. /// /// When enabled, verbose mode permits insigificant whitespace in many /// places in the regular expression, as well as comments. Comments are /// started using `#` and continue until the end of the line. /// /// By default, this is disabled. It may be selectively enabled in the /// regular expression by using the `x` flag regardless of this setting. pub fn ignore_whitespace(&mut self, yes: bool) -> &mut Builder { self.parser.ignore_whitespace(yes); self } /// Enable or disable the "dot matches any character" flag by default. /// /// By default this is disabled. It may alternatively be selectively /// enabled in the regular expression itself via the `s` flag. pub fn dot_matches_new_line(&mut self, yes: bool) -> &mut Builder { self.parser.dot_matches_new_line(yes); self } /// Enable or disable the "swap greed" flag by default. /// /// By default this is disabled. It may alternatively be selectively /// enabled in the regular expression itself via the `U` flag. 
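    ///
    /// For example, with this flag enabled, `a+` is matched lazily (as if it
    /// were written `a+?`) while `a+?` is matched greedily. Note that
    /// greediness is irrelevant when `longest_match` is enabled, since that
    /// mode always reports the longest possible match.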
pub fn swap_greed(&mut self, yes: bool) -> &mut Builder { self.parser.swap_greed(yes); self } /// Enable or disable the Unicode flag (`u`) by default. /// /// By default this is **enabled**. It may alternatively be selectively /// disabled in the regular expression itself via the `u` flag. /// /// Note that unless `allow_invalid_utf8` is enabled (it's disabled by /// default), a regular expression will fail to parse if Unicode mode is /// disabled and a sub-expression could possibly match invalid UTF-8. pub fn unicode(&mut self, yes: bool) -> &mut Builder { self.parser.unicode(yes); self } /// When enabled, the builder will permit the construction of a regular /// expression that may match invalid UTF-8. /// /// When disabled (the default), the builder is guaranteed to produce a /// regex that will only ever match valid UTF-8 (otherwise, the builder /// will return an error). pub fn allow_invalid_utf8(&mut self, yes: bool) -> &mut Builder { self.parser.allow_invalid_utf8(yes); self.nfa.allow_invalid_utf8(yes); self } /// Set the nesting limit used for the regular expression parser. /// /// The nesting limit controls how deep the abstract syntax tree is allowed /// to be. If the AST exceeds the given limit (e.g., with too many nested /// groups), then an error is returned by the parser. /// /// The purpose of this limit is to act as a heuristic to prevent stack /// overflow when building a finite automaton from a regular expression's /// abstract syntax tree. In particular, construction currently uses /// recursion. In the future, the implementation may stop using recursion /// and this option will no longer be necessary. /// /// This limit is not checked until the entire AST is parsed. Therefore, /// if callers want to put a limit on the amount of heap space used, then /// they should impose a limit on the length, in bytes, of the concrete /// pattern string. In particular, this is viable since the parser will /// limit itself to heap space proportional to the lenth of the pattern /// string. /// /// Note that a nest limit of `0` will return a nest limit error for most /// patterns but not all. For example, a nest limit of `0` permits `a` but /// not `ab`, since `ab` requires a concatenation AST item, which results /// in a nest depth of `1`. In general, a nest limit is not something that /// manifests in an obvious way in the concrete syntax, therefore, it /// should not be used in a granular way. pub fn nest_limit(&mut self, limit: u32) -> &mut Builder { self.parser.nest_limit(limit); self } /// Minimize the DFA. /// /// When enabled, the DFA built will be minimized such that it is as small /// as possible. /// /// Whether one enables minimization or not depends on the types of costs /// you're willing to pay and how much you care about its benefits. In /// particular, minimization has worst case `O(n*k*logn)` time and `O(k*n)` /// space, where `n` is the number of DFA states and `k` is the alphabet /// size. In practice, minimization can be quite costly in terms of both /// space and time, so it should only be done if you're willing to wait /// longer to produce a DFA. In general, you might want a minimal DFA in /// the following circumstances: /// /// 1. You would like to optimize for the size of the automaton. This can /// manifest in one of two ways. Firstly, if you're converting the /// DFA into Rust code (or a table embedded in the code), then a minimal /// DFA will translate into a corresponding reduction in code size, and /// thus, also the final compiled binary size. 
Secondly, if you are /// building many DFAs and putting them on the heap, you'll be able to /// fit more if they are smaller. Note though that building a minimal /// DFA itself requires additional space; you only realize the space /// savings once the minimal DFA is constructed (at which point, the /// space used for minimization is freed). /// 2. You've observed that a smaller DFA results in faster match /// performance. Naively, this isn't guaranteed since there is no /// inherent difference between matching with a bigger-than-minimal /// DFA and a minimal DFA. However, a smaller DFA may make use of your /// CPU's cache more efficiently. /// 3. You are trying to establish an equivalence between regular /// languages. The standard method for this is to build a minimal DFA /// for each language and then compare them. If the DFAs are equivalent /// (up to state renaming), then the languages are equivalent. /// /// This option is disabled by default. pub fn minimize(&mut self, yes: bool) -> &mut Builder { self.minimize = yes; self } /// Premultiply state identifiers in the DFA's transition table. /// /// When enabled, state identifiers are premultiplied to point to their /// corresponding row in the DFA's transition table. That is, given the /// `i`th state, its corresponding premultiplied identifier is `i * k` /// where `k` is the alphabet size of the DFA. (The alphabet size is at /// most 256, but is in practice smaller if byte classes is enabled.) /// /// When state identifiers are not premultiplied, then the identifier of /// the `i`th state is `i`. /// /// The advantage of premultiplying state identifiers is that is saves /// a multiplication instruction per byte when searching with the DFA. /// This has been observed to lead to a 20% performance benefit in /// micro-benchmarks. /// /// The primary disadvantage of premultiplying state identifiers is /// that they require a larger integer size to represent. For example, /// if your DFA has 200 states, then its premultiplied form requires /// 16 bits to represent every possible state identifier, where as its /// non-premultiplied form only requires 8 bits. /// /// This option is enabled by default. pub fn premultiply(&mut self, yes: bool) -> &mut Builder { self.premultiply = yes; self } /// Shrink the size of the DFA's alphabet by mapping bytes to their /// equivalence classes. /// /// When enabled, each DFA will use a map from all possible bytes to their /// corresponding equivalence class. Each equivalence class represents a /// set of bytes that does not discriminate between a match and a non-match /// in the DFA. For example, the pattern `[ab]+` has at least two /// equivalence classes: a set containing `a` and `b` and a set containing /// every byte except for `a` and `b`. `a` and `b` are in the same /// equivalence classes because they never discriminate between a match /// and a non-match. /// /// The advantage of this map is that the size of the transition table can /// be reduced drastically from `#states * 256 * sizeof(id)` to /// `#states * k * sizeof(id)` where `k` is the number of equivalence /// classes. As a result, total space usage can decrease substantially. /// Moreover, since a smaller alphabet is used, compilation becomes faster /// as well. /// /// The disadvantage of this map is that every byte searched must be /// passed through this map before it can be used to determine the next /// transition. This has a small match time performance cost. /// /// This option is enabled by default. 
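    ///
    /// A sketch of the effect on the per-byte transition lookup (the
    /// variable names here are illustrative, not fields of this crate's
    /// types):
    ///
    /// ```ignore
    /// // Without byte classes: each state row has 256 columns.
    /// let next = trans[current * 256 + byte as usize];
    ///
    /// // With byte classes: the byte is first mapped to its equivalence
    /// // class, and each state row has only `alphabet_len` columns.
    /// let class = byte_classes.get(byte) as usize;
    /// let next = trans[current * alphabet_len + class];
    /// ```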
pub fn byte_classes(&mut self, yes: bool) -> &mut Builder { self.byte_classes = yes; self } /// Reverse the DFA. /// /// A DFA reversal is performed by reversing all of the concatenated /// sub-expressions in the original pattern, recursively. The resulting /// DFA can be used to match the pattern starting from the end of a string /// instead of the beginning of a string. /// /// Generally speaking, a reversed DFA is most useful for finding the start /// of a match, since a single forward DFA is only capable of finding the /// end of a match. This start of match handling is done for you /// automatically if you build a [`Regex`](struct.Regex.html). pub fn reverse(&mut self, yes: bool) -> &mut Builder { self.reverse = yes; self.nfa.reverse(yes); self } /// Find the longest possible match. /// /// This is distinct from the default leftmost-first match semantics in /// that it treats all NFA states as having equivalent priority. In other /// words, the longest possible match is always found and it is not /// possible to implement non-greedy match semantics when this is set. That /// is, `a+` and `a+?` are equivalent when this is enabled. /// /// In particular, a practical issue with this option at the moment is that /// it prevents unanchored searches from working correctly, since /// unanchored searches are implemented by prepending an non-greedy `.*?` /// to the beginning of the pattern. As stated above, non-greedy match /// semantics aren't supported. Therefore, if this option is enabled and /// an unanchored search is requested, then building a DFA will return an /// error. /// /// This option is principally useful when building a reverse DFA for /// finding the start of a match. If you are building a regex with /// [`RegexBuilder`](struct.RegexBuilder.html), then this is handled for /// you automatically. The reason why this is necessary for start of match /// handling is because we want to find the earliest possible starting /// position of a match to satisfy leftmost-first match semantics. When /// matching in reverse, this means finding the longest possible match, /// hence, this option. /// /// By default this is disabled. pub fn longest_match(&mut self, yes: bool) -> &mut Builder { // There is prior art in RE2 that shows how this can support unanchored // searches. Instead of treating all NFA states as having equivalent // priority, we instead group NFA states into sets, and treat members // of each set as having equivalent priority, but having greater // priority than all following members of different sets. We then // essentially assign a higher priority to everything over the prefix // `.*?`. self.longest_match = yes; self } /// Apply best effort heuristics to shrink the NFA at the expense of more /// time/memory. /// /// This may be exposed in the future, but for now is exported for use in /// the `regex-automata-debug` tool. #[doc(hidden)] pub fn shrink(&mut self, yes: bool) -> &mut Builder { self.nfa.shrink(yes); self } } #[cfg(feature = "std")] impl Default for Builder { fn default() -> Builder { Builder::new() } } /// Return the given byte as its escaped string form. 
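///
/// For example, `escape(b'a')` returns `"a"`, while `escape(b'\n')` returns
/// the two character string `"\n"` (a backslash followed by `n`).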
#[cfg(feature = "std")]
fn escape(b: u8) -> String {
    use std::ascii;

    String::from_utf8(ascii::escape_default(b).collect::<Vec<_>>()).unwrap()
}

#[cfg(all(test, feature = "std"))]
mod tests {
    use super::*;

    #[test]
    fn errors_when_converting_to_smaller_dfa() {
        let pattern = r"\w{10}";
        let dfa = Builder::new()
            .byte_classes(false)
            .anchored(true)
            .premultiply(false)
            .build_with_size::<u16>(pattern)
            .unwrap();
        assert!(dfa.to_u8().is_err());
    }

    #[test]
    fn errors_when_determinization_would_overflow() {
        let pattern = r"\w{10}";
        let mut builder = Builder::new();
        builder.byte_classes(false).anchored(true).premultiply(false);
        // using u16 is fine
        assert!(builder.build_with_size::<u16>(pattern).is_ok());
        // ... but u8 results in overflow (because there are >256 states)
        assert!(builder.build_with_size::<u8>(pattern).is_err());
    }

    #[test]
    fn errors_when_premultiply_would_overflow() {
        let pattern = r"[a-z]";
        let mut builder = Builder::new();
        builder.byte_classes(false).anchored(true).premultiply(false);
        // without premultiplication is OK
        assert!(builder.build_with_size::<u8>(pattern).is_ok());
        // ... but with premultiplication overflows u8
        builder.premultiply(true);
        assert!(builder.build_with_size::<u8>(pattern).is_err());
    }

    // let data = ::std::fs::read_to_string("/usr/share/dict/words").unwrap();
    // let mut words: Vec<&str> = data.lines().collect();
    // println!("{} words", words.len());
    // words.sort_by(|w1, w2| w1.len().cmp(&w2.len()).reverse());
    // let pattern = words.join("|");
    // print_automata_counts(&pattern);
    // print_automata(&pattern);

    // print_automata(r"[01]*1[01]{5}");
    // print_automata(r"X(.?){0,8}Y");
    // print_automata_counts(r"\p{alphabetic}");
    // print_automata(r"a*b+|cdefg");
    // print_automata(r"(..)*(...)*");

    // let pattern = r"\p{any}*?\p{Other_Uppercase}";
    // let pattern = r"\p{any}*?\w+";
    // print_automata_counts(pattern);
    // print_automata_counts(r"(?-u:\w)");

    // let pattern = r"\p{Greek}";
    // let pattern = r"zZzZzZzZzZ";
    // let pattern = grapheme_pattern();
    // let pattern = r"\p{Ideographic}";
    // let pattern = r"\w{10}"; // 51784 --> 41264
    // let pattern = r"\w"; // 5182
    // let pattern = r"a*";
    // print_automata(pattern);
    // let (_, _, dfa) = build_automata(pattern);
}
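
// A sketch of the serialization round trip implemented by `to_bytes` and
// `from_bytes` above. This is kept as a comment rather than a test because
// the exact public wrapper names on `DenseDFA` are assumed here, not taken
// from this file:
//
//     let dfa = Builder::new().anchored(true).build("foo[0-9]+").unwrap();
//     let bytes = dfa.to_bytes_native_endian().unwrap();
//     // SAFETY: `bytes` was produced by the same build of this crate, so
//     // the label, endianness, version and state size checks all pass.
//     let dfa2: DenseDFA<&[usize], usize> =
//         unsafe { DenseDFA::from_bytes(&bytes) };
//     assert_eq!(Some(6), dfa2.find(b"foo123"));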