Introduce a size limit for regexes before compilation #34171
```rust
/// `MAX_REGEX_SIZE_AFTER_COMPILATION`) would prevent excessive resource usage, this doesn't seem to
/// be the case. Since we compile regexes in envd, we need strict limits to prevent envd OOMs.
/// See <https://github.com/MaterializeInc/database-issues/issues/9907> for an example.
const MAX_REGEX_SIZE_BEFORE_COMPILATION: usize = 1 * 1024 * 1024;
```
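The pre-compilation guard this constant enables can be sketched roughly as follows. This is an illustration of the idea, not the PR's actual code; `check_regex_size` is a hypothetical helper name:

```rust
/// Maximum allowed size of a regex pattern string, checked before compilation.
const MAX_REGEX_SIZE_BEFORE_COMPILATION: usize = 1 * 1024 * 1024; // 1 MiB

/// Hypothetical pre-compilation guard: reject oversized patterns up front,
/// before the regex engine has a chance to use excessive memory while
/// parsing and compiling the pattern.
fn check_regex_size(pattern: &str) -> Result<(), String> {
    if pattern.len() > MAX_REGEX_SIZE_BEFORE_COMPILATION {
        return Err(format!(
            "regex pattern is {} bytes, exceeding the maximum of {} bytes",
            pattern.len(),
            MAX_REGEX_SIZE_BEFORE_COMPILATION
        ));
    }
    Ok(())
}
```

The point of checking before compilation (rather than relying only on a post-compilation size limit) is that the compiler itself can allocate large intermediate structures while processing a huge pattern.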
Should we document this constant in our docs?
Is it a potential regression that we would stop accepting a query that works fine at the moment?
Yes, I'll add it in the docs. Edit: Done.
Unfortunately, this could indeed introduce a regression. I'm hoping that the 1 million limit is big enough that nobody has a regex this big at the moment. If somebody happens to have such a big regex in their catalog, the upgrade-check would fail, in which case we'd cut a new RC with a bigger limit or revert this change. I'd say this risk is acceptable in cloud. In self-managed, running into this would be more troublesome, because there would have to be some back-and-forth with the user, but considering the very low chance, maybe it's acceptable?
(Guarding this with a feature flag would unfortunately be very hard at this point in the code.)
This is what Postgres' docs say about big regexes:

> No particular limit is imposed on the length of REs in this implementation. However, programs intended to be highly portable should not employ REs longer than 256 bytes, as a POSIX-compliant implementation can refuse to accept such REs.
In practice, Postgres seems to fail at lengths of tens of thousands on regexes that look like the ones causing us trouble in https://github.com/MaterializeInc/database-issues/issues/9907:
```
postgres=# SELECT 'aaaaaaaaaaa' ~ repeat('vxx.0.0-rc.4 (5b079d80c)', '10000');
ERROR: invalid regular expression: regular expression is too complex
postgres=# SELECT 'aaaaaaaaaaa' ~ repeat('vxx.0.0-rc.4 (5b079d80c)', '4000');
ERROR: invalid regular expression: regular expression is too complex
postgres=# SELECT 'aaaaaaaaaaa' ~ repeat('vxx.0.0-rc.4 (5b079d80c)', '2000');
ERROR: invalid regular expression: regular expression is too complex
postgres=# SELECT 'aaaaaaaaaaa' ~ repeat('vxx.0.0-rc.4 (5b079d80c)', '1000');
 ?column?
----------
 f
(1 row)
```
I also did some quick googling, and couldn't find any case of someone talking about a regex longer than tens of thousands of characters. Claude also says:

> When developers discuss "large" regexes in production environments, they're typically talking about patterns measured in hundreds or low thousands of characters, not millions.
We can simply check this for existing views using mzexplore---should be quick to export data and run some jq queries.
Unfortunately we can do this only for cloud, and I'm not too worried about cloud. In self-managed it would be somewhat more of a hassle if we need a patch release.
Ah, right. Feature flag it (or disable the limit in unsafe mode)? Too much trouble to ask the self-managed users?
Feature flag it
Unfortunately we can't feature flag it, because we don't have access to feature flag values in scalar function evaluations. Flags would have to be wired in from super far. (And the extra context being passed around scalar evaluation might even have a non-trivial performance impact.)
disable the limit in unsafe mode
Unsafe mode enables too much unsafe stuff, so we'd like to never tell a customer to turn it on. It's more a testing thing than a user-facing escape hatch.
Too much trouble to ask the self-managed users?
Well, I'd say it's acceptable, considering that the risk of a self-managed user already having such a big regex is quite low.
Okay, you've convinced me!
Also, fix the regex size limit after compilation to a constant that we control, to prevent surprises if a future version of the Regex crate changes its default.
Thanks for the review! This is ready for another review round. I guess the main question is whether we are ok with accepting the risk of customers running into the new limits during an upgrade. I'm leaning towards yes.
mgree
left a comment
Looks good to me! A few notes/thoughts, but nothing that should prevent merging this.
```diff
- description: Replicate the string `n` times.
+ description: Replicate the string `n` times. The maximum length of the result string is 100 MiB.

  - signature: 'replace(s: str, f: str, r: str) -> str'
```
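The 100 MiB output cap for `repeat` can be enforced before allocating anything, along these lines. This is a sketch under the documented limit, not Materialize's actual implementation; `checked_repeat` is an illustrative name:

```rust
/// Hypothetical 100 MiB cap on the output of certain string functions,
/// matching the limit documented for `repeat` above.
const MAX_STRING_BYTES: usize = 100 * 1024 * 1024;

/// Illustrative version of `repeat` that refuses to build the result if it
/// would exceed the cap. The output size is computed up front, so no
/// oversized allocation ever happens.
fn checked_repeat(s: &str, n: usize) -> Result<String, String> {
    let total = s
        .len()
        .checked_mul(n)
        .ok_or_else(|| "requested length overflows usize".to_string())?;
    if total > MAX_STRING_BYTES {
        return Err(format!(
            "result would be {} bytes, exceeding the {} byte limit",
            total, MAX_STRING_BYTES
        ));
    }
    Ok(s.repeat(n))
}
```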
The `replace` function could also be inflationary... do we want limits on all such string functions? (Seems like: `concat`, `concat_ws`, `decode`, and possibly `encode` [a string under the limit in UTF-8 might not be in UTF-32].) If we're trying to enforce this invariant globally, also `string_agg` and anything coming from JSON.
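One way such a limit could look for `replace`: compute an upper bound on the output size from the number of matches before doing the substitution. This is a sketch, not the actual Materialize code; `checked_replace` and the constant are illustrative:

```rust
/// Hypothetical 100 MiB cap on string-function output size.
const MAX_STRING_BYTES: usize = 100 * 1024 * 1024;

/// Illustrative `replace` that bounds the output size before allocating.
/// When the replacement is longer than the needle, each match grows the
/// string, so we count matches first and bail out early if needed.
fn checked_replace(s: &str, from: &str, to: &str) -> Result<String, String> {
    // `matches` counts non-overlapping occurrences, the same set that
    // `str::replace` substitutes, so this is an exact output size.
    let occurrences = s.matches(from).count();
    // Matched bytes are removed, replacement bytes are added; the
    // subtraction cannot underflow because the matches are disjoint
    // substrings of `s`.
    let estimated =
        s.len() - occurrences * from.len() + occurrences.saturating_mul(to.len());
    if estimated > MAX_STRING_BYTES {
        return Err(format!(
            "result would be {} bytes, exceeding the {} byte limit",
            estimated, MAX_STRING_BYTES
        ));
    }
    Ok(s.replace(from, to))
}
```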
```rust
///
/// Note: This number is mentioned in our user-facing docs in the "String operators" section of the
/// function reference.
const MAX_REGEX_SIZE_AFTER_COMPILATION: usize = 10 * 1024 * 1024;
```
NB that this is in fact the default NFA size limit already (but good to document and enforce!): https://docs.rs/regex/latest/src/regex/builders.rs.html#52
If our goal is to use less memory/prevent OOMs, we might want a smaller limit.
The main purpose of this PR is to prevent envd OOMs by checking the size of a regex pattern before attempting to compile it. See https://github.com/MaterializeInc/database-issues/issues/9907 and the PR's code comments for more details. (Such big regexes should be extremely rare in practice, but this problem is blocking SQLsmith at the moment.)
Additionally:

- Fixed the regex size limit after compilation to a constant that we control, to prevent surprises if a future version of the `Regex` crate changes its default.
- Renamed the limit to `MAX_STRING_BYTES` to better indicate that this is about the output size of certain specific string functions, not a string size limit in general.

Motivation
Tips for reviewer
Checklist
If this PR evolves a `$T ⇔ Proto$T` mapping (possibly in a backwards-incompatible way), then it is tagged with a `T-proto` label.