Add support for unbounded look-behind expressions #1266
Open
Multimodcrafter wants to merge 66 commits into rust-lang:master from epfl-systemf:captureless-lookbehinds
This is the first step to supporting captureless lookbehind assertions.
Not recursing into the inner expression of a lookaround is correct under the current assumption that lookarounds cannot contain capture groups. But once that restriction is lifted, the resulting bug could be very subtle to find. Instead, we can already do the filtering and accept that it is a no-op for now.
This makes it consistent with parser's ErrorKind::UnsupportedLookAround.
We require two VM instructions, 'CheckLookaround' and 'WriteLookaround', to be able to track the state of lookaround expressions at the current position in the haystack. Both instructions access a new 'lookaround' vector of booleans, which contains one entry per lookaround expression in the regex.
These changes implement the compilation of lookaround assertions from HIR to NFA. Subexpressions of lookaround assertions are patched to a top level reverse union. This is necessary so that the NFA will explore the innermost subexpression first and thereby make sure that all subexpression results are available when they need to be checked. I.e. any `WriteLookaround` state must be visited before any `CheckLookaround` state with the same index.
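The ordering requirement can be illustrated with a toy model. This is only a sketch, not the code from these commits: the enum `LookInst`, its field names (`index`, `positive`) and the helper `run_position` are assumptions made up for illustration. It processes the states visited at one haystack position in order and shows that a check only sees the result of a write that was visited before it.

```rust
/// Toy stand-ins for the two instructions (hypothetical field names).
#[derive(Clone, Copy)]
enum LookInst {
    /// Records that lookaround `index` holds at the current position.
    WriteLookaround { index: usize },
    /// Passes if the recorded result for `index` equals `positive`.
    CheckLookaround { index: usize, positive: bool },
}

/// Visits `states` in order at a single haystack position and returns,
/// for every check encountered, whether it passed.
fn run_position(states: &[LookInst], num_lookarounds: usize) -> Vec<bool> {
    // One boolean per lookaround expression, reset for each position.
    let mut lookaround = vec![false; num_lookarounds];
    let mut passed = Vec::new();
    for inst in states {
        match *inst {
            LookInst::WriteLookaround { index } => lookaround[index] = true,
            LookInst::CheckLookaround { index, positive } => {
                passed.push(lookaround[index] == positive)
            }
        }
    }
    passed
}
```

If the write for index 0 is visited first, a positive check on index 0 passes. If the same two states are visited in the opposite order, the check reads the stale `false` and fails, which is the situation the reverse union is meant to rule out.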
The machinery necessary to perform the parallel lookbehind checking should only be compiled in when there is actually a lookbehind expression in the regex. This restores compilation to the expected outputs for regexes without lookbehind expressions.
bartekpacia approved these changes May 15, 2025
As an example consider the regex
(?<=Title:\s+)\w+
which would match the following strings (matches underlined with `~`):
But does not match:
No heading
title: bad case
Title:nospace
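To make the intended semantics concrete, here is a hand-written stand-in for this one regex. It is a sketch only and does not use the `regex` crate or this PR's API: the function `find_after_title` is hypothetical, it is restricted to ASCII, and it approximates `\s` with `u8::is_ascii_whitespace`. It tracks, in a single left-to-right pass, whether the text ending at the current position matches `Title:\s+`, and reports the first run of word characters starting at such a position.

```rust
/// Hypothetical ASCII-only emulation of `(?<=Title:\s+)\w+`.
/// Returns the byte range of the leftmost match, if any.
fn find_after_title(haystack: &str) -> Option<(usize, usize)> {
    const PREFIX: &[u8] = b"Title:";
    let bytes = haystack.as_bytes();
    let is_word = |b: u8| b.is_ascii_alphanumeric() || b == b'_';
    // True when the text ending at the current position matches `Title:\s+`.
    let mut behind_ok = false;
    for at in 0..bytes.len() {
        if at > 0 {
            let prev = bytes[at - 1];
            // The look-behind holds here if the previous byte is whitespace
            // and, before it, we either were already inside the whitespace
            // run or had just finished reading "Title:".
            behind_ok = prev.is_ascii_whitespace()
                && (behind_ok || bytes[..at - 1].ends_with(PREFIX));
        }
        if behind_ok && is_word(bytes[at]) {
            // Greedy `\w+`: extend over all following word bytes.
            let len = bytes[at..].iter().take_while(|&&b| is_word(b)).count();
            return Some((at, at + len));
        }
    }
    None
}
```

With this sketch, `find_after_title("Title:  Hello world")` returns `Some((8, 13))`, the span of `Hello` (an input chosen here for illustration), while the three non-matching strings listed above all return `None`.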
What
This PR implements the streaming algorithm from
Linear Matching of JavaScript Regular Expressions (Section 4.4)
for unbounded look-behinds. The same algorithm has been
implemented and merged into V8.
The addition of look-around expressions to this crate was mentioned previously
in #1153.
This PR adds support for positive and negative look-behinds with arbitrary
nesting, subject to the following limitations.
Limitations
Capture groups are supported only outside of look-arounds. With the current
capture group semantics, no linear-time algorithm is known that would allow
capture groups inside of look-arounds. Look-behinds are currently not supported
in other engines or with prefilters enabled, although both could be implemented.
Look-aheads could also be implemented, at the cost of additional memory.
How
We implemented the streaming algorithm presented in Section 4.4 of the paper
mentioned above. The algorithm works by running the sub-automata for any
look-behind expressions in parallel to the main automaton. This is achieved by
compiling the look-behind expressions as usual but storing their start states
separately, not reachable from the main automaton.
Instead of a `match` state, the sub-automata for look-behinds have a
`WriteLookAround` state. This state causes the current position in the haystack
to be recorded in a global look-around table.
The main automaton (and the sub-automata in the case of nested look-behinds) can
then read from this table by means of a `CheckLookAround` instruction and
compare the stored index with the current position in the haystack. These states
work as conditional epsilon transitions, similar to the already supported "look"
assertions (e.g. `^`, `\b`, `$`).
`PikeVM`'s cache has been expanded to preserve good performance of single-match
searches (stop the look-around threads once the main automaton finishes) and of
all-matches searches (remember the look-around states when resuming a search to
prevent having to rescan the haystack from the beginning).
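The following is a minimal, self-contained sketch of the idea described in this section, not the PikeVM code from this PR. The instruction set `Inst`, the functions `epsilon_closure` and `is_match`, and the hand-assembled program for `(?<=a+)b` are all assumptions made for illustration. The sketch handles only non-nested look-behinds, has no captures, and only reports whether some match exists. At every haystack position it first advances the look-behind threads, so that the table is up to date, and only then advances the main threads.

```rust
#[derive(Clone, Copy)]
enum Inst {
    Byte { byte: u8, next: usize },
    Split { a: usize, b: usize },
    /// End of a look-behind sub-automaton: record the current position.
    WriteLookAround { index: usize },
    /// Conditional epsilon transition on the look-around table.
    CheckLookAround { index: usize, positive: bool, next: usize },
    Match,
}

/// Follows epsilon transitions from `pc` at position `at`, pushing every
/// `Byte` or `Match` state that is reached onto `out`.
fn epsilon_closure(
    prog: &[Inst], pc: usize, at: usize,
    table: &mut [Option<usize>], seen: &mut [bool], out: &mut Vec<usize>,
) {
    if seen[pc] {
        return;
    }
    seen[pc] = true;
    match prog[pc] {
        Inst::Split { a, b } => {
            epsilon_closure(prog, a, at, table, seen, out);
            epsilon_closure(prog, b, at, table, seen, out);
        }
        Inst::WriteLookAround { index } => table[index] = Some(at),
        Inst::CheckLookAround { index, positive, next } => {
            // The look-behind holds iff its sub-automaton wrote `at`.
            if (table[index] == Some(at)) == positive {
                epsilon_closure(prog, next, at, table, seen, out);
            }
        }
        Inst::Byte { .. } | Inst::Match => out.push(pc),
    }
}

/// Unanchored search: true if the main automaton matches anywhere.
/// `behind_starts[i]` is the start state of look-behind `i`.
fn is_match(prog: &[Inst], main_start: usize, behind_starts: &[usize], haystack: &[u8]) -> bool {
    let mut table: Vec<Option<usize>> = vec![None; behind_starts.len()];
    let (mut behind, mut main): (Vec<usize>, Vec<usize>) = (Vec::new(), Vec::new());
    for at in 0..=haystack.len() {
        let mut seen = vec![false; prog.len()];
        let (mut behind_now, mut main_now) = (Vec::new(), Vec::new());
        // Look-behind threads first, restarted at every position.
        for &pc in behind.iter().chain(behind_starts) {
            epsilon_closure(prog, pc, at, &mut table, &mut seen, &mut behind_now);
        }
        // Main threads afterwards, so their checks see this position's writes.
        for &pc in main.iter().chain(std::iter::once(&main_start)) {
            epsilon_closure(prog, pc, at, &mut table, &mut seen, &mut main_now);
        }
        if main_now.iter().any(|&pc| matches!(prog[pc], Inst::Match)) {
            return true;
        }
        // Consume one byte in both thread lists.
        let step = |threads: &[usize]| -> Vec<usize> {
            threads
                .iter()
                .filter_map(|&pc| match prog[pc] {
                    Inst::Byte { byte, next } if haystack.get(at) == Some(&byte) => Some(next),
                    _ => None,
                })
                .collect()
        };
        behind = step(&behind_now);
        main = step(&main_now);
    }
    false
}

/// Hand-assembled program for `(?<=a+)b` (or `(?<!a+)b` if `positive` is
/// false). States 0..=2 are the look-behind, states 3..=5 the main automaton.
fn example_program(positive: bool) -> Vec<Inst> {
    vec![
        Inst::Byte { byte: b'a', next: 1 },
        Inst::Split { a: 0, b: 2 },
        Inst::WriteLookAround { index: 0 },
        Inst::CheckLookAround { index: 0, positive, next: 4 },
        Inst::Byte { byte: b'b', next: 5 },
        Inst::Match,
    ]
}
```

For example, `is_match(&example_program(true), 3, &[0], b"xaab")` is true, while the same call on `b"xb"` is false. Each haystack byte is processed once and every state is visited at most once per position, so the running time is proportional to the haystack length times the program size.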
Testing
We have added unit tests for the new functionality to the individual test
modules, covering the new parsing, translation, and compilation features. We
have also added integration tests in the form of a new TOML file. All engines
apart from the PikeVM reject look-behind expressions, so tests containing
look-around expressions are filtered out for engines other than the PikeVM and
the Meta engine.
Future Work
We would love to get feedback on the implementation.
The next steps are to work on the current limitations, namely to implement
support in more engines and to enable prefilters. Additionally, support for
look-aheads could be implemented if the additional memory cost is acceptable.
We are open to discussing any of the above.
Performance
We forked `rebar` and added a new engine definition (`rust/regex-lookbehind`)
for our fork of `regex`. We added this new engine definition to all benchmarks
where `rust/regex` was already present. Furthermore, we added some benchmark
definitions to measure the performance of the look-behind algorithm.
We ran the full suite of benchmarks twice and merged the results. They are
available in our rebar fork (`results_full_combined.csv`).
Results without look-behinds
The results from all benchmarks without look-behinds show that our changes do not
introduce a significant slowdown for regexes that were already supported:
Note: We noticed a discrepancy across multiple runs of up to 1.51 when comparing
the current version of `rust/regex`:
Due to this result, we conclude that, despite the highest speedup ratio being
1.57 when comparing both engines across both runs, the results of all individual
benchmarks further strengthen the claim that our changes do not significantly
impact performance.
Full benchmark comparison (without look-behinds)
Results with look-behinds
To get an estimate for performance of "real-world regexes" using look-behinds,
we extracted all regexes that contain look-behind expressions from the `snort`
ruleset. We chose this as a source of regexes because it has been used as a
benchmark for look-arounds before in Efficient Matching of Regular Expressions
with Lookaround Assertions.
Unfortunately, this ruleset is licensed in a way that prohibits us from
distributing it. See the reproduction section below to learn where to get the
ruleset from and how to extract the regexes.
Furthermore, we wrote a couple of very simple benchmarks to demonstrate that
our implementation respects linearity.
We chose to compare our implementation to `python/re`, as this engine is readily
available, hence easy to benchmark, and used ubiquitously. Note, however, that
`python/re` only supports fixed-width (hence bounded-length) look-behinds, while
our implementation supports unbounded ones as well.
Look-behind benchmark comparison
A few things to note:
- `snort-0` and `snort-4` are the only ones where there is an opportunity for
  prefiltering based on a prefix literal, which we have not implemented yet.
  This explains the huge difference in speedup compared to all other regexes.
- The speedup ratio between `python/re` and `rust/*` is similar to the values
  seen in other benchmarks (e.g. `imported/sherlock/everything-greedy-nl`,
  `curated/08-words/long-russian`). We therefore conclude that the baseline
  performance for regexes with look-behinds is reasonable.
- The `linear-haystack` benchmarks show that our algorithm indeed runs in
  linear time.
How to reproduce
Please follow these instructions to reproduce our results:
1. Download snapshot `3200` of the rules in the "Registered" column.
2. Clone our `rebar` fork.
3. Place `snortrules-snapshot-3200` in the root of the cloned repo.
4. See `benchmark_lookbehind.sh` for the prerequisites. If you are on a
   debian/ubuntu system, you can install them easily by running
   `./benchmark_lookbehind.sh --install` (requires root privileges).
5. Run `./benchmark_lookbehind.sh` to run the benchmark.
6. The results are written to `results_full.csv` and `results_lookbehind.csv`,
   which are placed in the directory containing the rebar fork.
Acknowledgements
This was a joint effort by @shilangyu and @Multimodcrafter, supervised by Aurèle Barrière and Clément Pit-Claudel at EPFL's SYSTEMF.