This repo contains an implementation of a regular expression library using BuildIt.
We currently support the following types of matches:
- full match that checks if the regex exactly matches the text; example code is given in `./samples/sample1.cpp`
- partial match with binary output, with an option to extract the first match (`./samples/sample2.cpp`)
- all partial matches returned as a list of strings (`./samples/sample3.cpp`); the output of all partial matches is the same as the output of repeatedly applying the PCRE or RE2 `FindAndConsume` function, which gives non-overlapping leftmost-longest matches

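
The three match types can be illustrated with a minimal sketch that uses `std::regex` from the C++ standard library as a stand-in. This is not this library's API: the helper names `full_match`, `partial_match`, and `all_matches` are hypothetical, and `std::regex` tries alternatives in order rather than picking the leftmost-longest match, so results may differ for patterns with alternation.

```cpp
#include <regex>
#include <string>
#include <vector>

// Full match: the pattern must cover the entire text.
bool full_match(const std::string& pattern, const std::string& text) {
    return std::regex_match(text, std::regex(pattern));
}

// Partial match: the pattern may match anywhere in the text.
// If `first` is non-null, it receives the first match found.
bool partial_match(const std::string& pattern, const std::string& text,
                   std::string* first = nullptr) {
    const std::regex re(pattern);
    std::smatch m;
    if (!std::regex_search(text, m, re)) return false;
    if (first) *first = m.str();
    return true;
}

// All partial matches: scan left to right, resuming after each match,
// so the returned matches never overlap.
std::vector<std::string> all_matches(const std::string& pattern,
                                     const std::string& text) {
    const std::regex re(pattern);
    std::vector<std::string> out;
    for (std::sregex_iterator it(text.begin(), text.end(), re), end; it != end; ++it)
        out.push_back(it->str());
    return out;
}
```

For example, with the pattern `ab+`, `full_match` accepts `abbb` but rejects `xabbb`, while `partial_match` on `xabbby` succeeds and extracts `abbb`.
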
We support the following operators and expressions.
| Expression | Description |
|---|---|
| `.` | any character |
| `[xyz]`, `[^xyz]` | character class |
| `[a-z]`, `[^a-z]` | character range |
| `x?` | zero or one x |
| `x+` | one or more x |
| `x*` | zero or more x |
| `(x\|y)` | x or y |
| `x{n}` | x repeated n times |
| `x{n,m}` | x repeated between n and m times inclusive |
| `\d`, `\w`, `\s`, `\D`, `\W` | escaped character classes |

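
To make the operator semantics concrete, here is a small sketch that checks a few pattern/text pairs as full matches. It uses `std::regex` (ECMAScript grammar) purely as a reference for the same syntax, not this library; the `Case` struct, `kCases` table, and `count_mismatches` helper are hypothetical names introduced for illustration.

```cpp
#include <regex>
#include <string>
#include <vector>

// One full-match expectation: does `pattern` match all of `text`?
struct Case {
    std::string pattern;
    std::string text;
    bool expected;
};

const std::vector<Case> kCases = {
    {".", "q", true},               // any character
    {"[^xyz]", "x", false},         // negated character class
    {"[a-z]", "m", true},           // character range
    {"ab?", "a", true},             // zero or one b
    {"ab+", "a", false},            // one or more b
    {"ab*", "a", true},             // zero or more b
    {"(cat|dog)", "dog", true},     // alternation
    {"x{3}", "xx", false},          // exactly three x
    {"x{2,3}", "xxx", true},        // upper bound is inclusive
    {"x{2,3}", "xxxx", false},      // more than m repetitions
    {R"(\d\s\w)", "7 a", true},     // digit, whitespace, word character
    {R"(\D)", "7", false},          // non-digit
};

// Counts the cases where std::regex disagrees with the expectation.
int count_mismatches() {
    int mismatches = 0;
    for (const Case& c : kCases) {
        const bool got = std::regex_match(c.text, std::regex(c.pattern));
        if (got != c.expected) ++mismatches;
    }
    return mismatches;
}
```
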
We have a couple of flag options that affect the way the code is generated:
- specifying the number of interleaving parts for partial matches
- splitting the code generation on `|` characters
- grouping multiple consecutive states into one
- `ignore_case` to match both upper and lowercase
- `greedy` - set to true to prefer longer partial matches

These options can be set using the `RegexOptions` struct as shown in `./samples/sample2.cpp`.
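
The effect of the `ignore_case` and `greedy` options on a partial match can be sketched with `std::regex` as an analogy. This does not use `RegexOptions`; the `first_match` helper is hypothetical, case-insensitivity is expressed with the `std::regex::icase` flag, and the non-greedy behaviour is expressed with the lazy quantifier `+?`, which is `std::regex` syntax and not listed among the operators above.

```cpp
#include <regex>
#include <string>

// Returns the first partial match of `pattern` in `text`,
// or an empty string if there is no match.
std::string first_match(const std::string& pattern, const std::string& text,
                        bool ignore_case = false) {
    std::regex::flag_type flags = std::regex::ECMAScript;
    if (ignore_case) flags = flags | std::regex::icase;
    const std::regex re(pattern, flags);
    std::smatch m;
    return std::regex_search(text, m, re) ? m.str() : std::string();
}
```

With this helper, `first_match("a+", "caaat")` returns the longer match `aaa`, while the lazy form `first_match("a+?", "caaat")` stops at `a`. Likewise, `first_match("hello", "Say HELLO", true)` finds `HELLO`, whereas the case-sensitive call finds nothing.
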
To compile the code, run `make` from the root directory. To run the sample1 code, for example, run `./build/sample1`.
- The main code is in `./src` and `./include`.
- Testing code is in `./test`.
- Code for measuring performance is in `./benchmarks`.

- To build Hyperscan follow steps 2 and 3 from here.
- Use one of the scripts in `./benchmarks/hyperscan/tools/hsbench/scripts` to create a corpus SQLite database.
- Add the regex patterns to a file following this format.
- From the hyperscan build directory run `build/bin/hsbench -e <pattern_file> -c <corpus.db>`. More directions are available here.

- To build RE2 run `make` in the `./benchmarks/re2/` directory.

To run the timing experiments on the Twain dataset, run `./build/preformance` in the `./benchmarks` directory.
- Corpus: Project Gutenberg: Complete Works of Mark Twain
- Patterns: available in `./benchmarks/data/twain_patterns.txt`; taken from this paper