Introduce a size limit for regexes before compilation #34171
```rust
/// `MAX_REGEX_SIZE_AFTER_COMPILATION`) would prevent excessive resource usage, this doesn't seem to
/// be the case. Since we compile regexes in envd, we need strict limits to prevent envd OOMs.
/// See <https://github.com/MaterializeInc/database-issues/issues/9907> for an example.
const MAX_REGEX_SIZE_BEFORE_COMPILATION: usize = 1 * 1024 * 1024;
```
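The pre-compilation guard this constant enables can be sketched roughly as follows. This is an illustration of the idea, not the PR's actual code; `check_regex_size` is a hypothetical helper name:

```rust
/// Maximum allowed size of a regex pattern string, checked before compilation.
const MAX_REGEX_SIZE_BEFORE_COMPILATION: usize = 1 * 1024 * 1024; // 1 MiB

/// Hypothetical pre-compilation guard: reject oversized patterns up front,
/// before the regex engine has a chance to use excessive memory while
/// parsing and compiling the pattern.
fn check_regex_size(pattern: &str) -> Result<(), String> {
    if pattern.len() > MAX_REGEX_SIZE_BEFORE_COMPILATION {
        return Err(format!(
            "regex pattern is {} bytes, exceeding the maximum of {} bytes",
            pattern.len(),
            MAX_REGEX_SIZE_BEFORE_COMPILATION
        ));
    }
    Ok(())
}
```

The point of checking before compilation (rather than relying only on a post-compilation size limit) is that the compiler itself can allocate large intermediate structures while processing a huge pattern.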
Should we document this constant in our docs?
Is it a potential regression that we would stop accepting a query that works fine at the moment?
Yes, I'll add it in the docs. Edit: Done.
Unfortunately, this could indeed introduce a regression. I'm hoping that the 1 million limit is big enough that nobody has a regex this big at the moment. If somebody happens to have such a big regex in their catalog, the upgrade-check would fail, in which case we'd cut a new RC with a bigger limit or revert this change. I'd say this risk is acceptable in cloud. In self-managed, running into this would be more troublesome, because there would have to be some back-and-forth with the user, but considering the very low chance, maybe it's acceptable?
(Guarding this with a feature flag would unfortunately be very hard at this point in the code.)
This is what Postgres' docs say about big regexes:

> No particular limit is imposed on the length of REs in this implementation. However, programs intended to be highly portable should not employ REs longer than 256 bytes, as a POSIX-compliant implementation can refuse to accept such REs.
In practice, Postgres seems to fail at lengths of tens of thousands on regexes that look like the ones causing us trouble in https://github.com/MaterializeInc/database-issues/issues/9907:
```
postgres=# SELECT 'aaaaaaaaaaa' ~ repeat('vxx.0.0-rc.4 (5b079d80c)', '10000');
ERROR: invalid regular expression: regular expression is too complex
postgres=# SELECT 'aaaaaaaaaaa' ~ repeat('vxx.0.0-rc.4 (5b079d80c)', '4000');
ERROR: invalid regular expression: regular expression is too complex
postgres=# SELECT 'aaaaaaaaaaa' ~ repeat('vxx.0.0-rc.4 (5b079d80c)', '2000');
ERROR: invalid regular expression: regular expression is too complex
postgres=# SELECT 'aaaaaaaaaaa' ~ repeat('vxx.0.0-rc.4 (5b079d80c)', '1000');
 ?column?
----------
 f
(1 row)
```
I also did some quick googling, and couldn't find any case of someone talking about a regex longer than tens of thousands of characters. Claude also says:

> When developers discuss "large" regexes in production environments, they're typically talking about patterns measured in hundreds or low thousands of characters, not millions.
We can simply check this for existing views using mzexplore---should be quick to export data and run some jq queries.
Unfortunately we can do this only for cloud, and I'm not too worried about cloud. In self-managed it would be somewhat more of a hassle if we need a patch release.
Ah, right. Feature flag it (or disable the limit in unsafe mode)? Too much trouble to ask the self-managed users?
Feature flag it
Unfortunately we can't feature flag it, because we don't have access to feature flag values in scalar function evaluations. Flags would have to be wired in from super far. (And the extra context being passed around scalar evaluation might even have a non-trivial performance impact.)
disable the limit in unsafe mode
Unsafe mode enables too much unsafe stuff, so we'd like to never tell a customer to turn it on. It's more a testing thing than a user-facing escape hatch.
Too much trouble to ask the self-managed users?
Well, I'd say it's acceptable, considering that the risk of a self-managed user already having such a big regex is quite low.
Okay, you've convinced me!
Also, fix the regex size limit after compilation to a constant that we control, to prevent surprises if a future version of the Regex crate changes its default.
Thanks for the review! This is ready for another review round. I guess the main question is whether we are ok with accepting the risk of customers running into the new limits during an upgrade. I'm leaning towards yes.
mgree
left a comment
Looks good to me! A few notes/thoughts, but nothing that should prevent merging this.
```diff
- description: Replicate the string `n` times.
+ description: Replicate the string `n` times. The maximum length of the result string is 100 MiB.

  - signature: 'replace(s: str, f: str, r: str) -> str'
```
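The 100 MiB output cap for `repeat` can be enforced before allocating anything, along these lines. This is a sketch under the documented limit, not Materialize's actual implementation; `checked_repeat` is an illustrative name:

```rust
/// Hypothetical 100 MiB cap on the output of certain string functions,
/// matching the limit documented for `repeat` above.
const MAX_STRING_BYTES: usize = 100 * 1024 * 1024;

/// Illustrative version of `repeat` that refuses to build the result if it
/// would exceed the cap. The output size is computed up front, so no
/// oversized allocation ever happens.
fn checked_repeat(s: &str, n: usize) -> Result<String, String> {
    let total = s
        .len()
        .checked_mul(n)
        .ok_or_else(|| "requested length overflows usize".to_string())?;
    if total > MAX_STRING_BYTES {
        return Err(format!(
            "result would be {} bytes, exceeding the {} byte limit",
            total, MAX_STRING_BYTES
        ));
    }
    Ok(s.repeat(n))
}
```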
The `replace` function could also be inflationary... do we want limits on all such string functions? (Seems like: `concat`, `concat_ws`, `decode`, and possibly `encode` [a string under the limit in UTF-8 might not be in UTF-32].) If we're trying to enforce this invariant globally, also `string_agg` and anything coming from JSON.
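One way such a limit could look for `replace`: compute an upper bound on the output size from the number of matches before doing the substitution. This is a sketch, not the actual Materialize code; `checked_replace` and the constant are illustrative:

```rust
/// Hypothetical 100 MiB cap on string-function output size.
const MAX_STRING_BYTES: usize = 100 * 1024 * 1024;

/// Illustrative `replace` that bounds the output size before allocating.
/// When the replacement is longer than the needle, each match grows the
/// string, so we count matches first and bail out early if needed.
fn checked_replace(s: &str, from: &str, to: &str) -> Result<String, String> {
    // `matches` counts non-overlapping occurrences, the same set that
    // `str::replace` substitutes, so this is an exact output size.
    let occurrences = s.matches(from).count();
    // Matched bytes are removed, replacement bytes are added; the
    // subtraction cannot underflow because the matches are disjoint
    // substrings of `s`.
    let estimated =
        s.len() - occurrences * from.len() + occurrences.saturating_mul(to.len());
    if estimated > MAX_STRING_BYTES {
        return Err(format!(
            "result would be {} bytes, exceeding the {} byte limit",
            estimated, MAX_STRING_BYTES
        ));
    }
    Ok(s.replace(from, to))
}
```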
```rust
///
/// Note: This number is mentioned in our user-facing docs in the "String operators" section of the
/// function reference.
const MAX_REGEX_SIZE_AFTER_COMPILATION: usize = 10 * 1024 * 1024;
```
NB that this is in fact the default NFA size limit already (but good to document and enforce!): https://docs.rs/regex/latest/src/regex/builders.rs.html#52
If our goal is to use less memory/prevent OOMs, we might want a smaller limit.
The main purpose of this PR is to prevent envd OOMs by checking the size of a regex pattern before attempting to compile it. See https://github.com/MaterializeInc/database-issues/issues/9907 and the PR's code comments for more details. (Such big regexes should be extremely rare in practice, but this problem is blocking SQLsmith at the moment.)
Additionally:

- Fixed the regex size limit after compilation to a constant that we control, to prevent surprises if a future version of the `Regex` crate changes its default.
- Renamed the limit to `MAX_STRING_BYTES` to better indicate that this is about the output size of certain specific string functions, not a string size limit in general.

Motivation
Tips for reviewer
Checklist
If this PR evolves a `$T ⇔ Proto$T` mapping (possibly in a backwards-incompatible way), then it is tagged with a `T-proto` label.