
Conversation

@ggevay (Contributor) commented Nov 17, 2025

The main purpose of this PR is to prevent envd OOMs by checking the size of a regex pattern before attempting to compile it. See https://github.com/MaterializeInc/database-issues/issues/9907 and the PR's code comments for more details. (Such big regexes should be extremely rare in practice, but this problem is blocking SQLsmith at the moment.)

Additionally:

  • I made the compiled regex size limit explicit, to prevent surprises if a future version of the Regex crate changes its default.
  • I renamed MAX_STRING_BYTES to better indicate that this is about the output size of certain specific string functions, not a string size limit in general.
  • Edit: Added all these limits to the user-facing docs.

Motivation

Tips for reviewer

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

@ggevay ggevay added the A-ADAPTER Topics related to the ADAPTER layer label Nov 17, 2025
@ggevay ggevay force-pushed the regex-size-limit branch 2 times, most recently from 28ca561 to b382343 Compare November 17, 2025 15:27
/// `MAX_REGEX_SIZE_AFTER_COMPILATION`) would prevent excessive resource usage, this doesn't seem to
/// be the case. Since we compile regexes in envd, we need strict limits to prevent envd OOMs.
/// See <https://github.com/MaterializeInc/database-issues/issues/9907> for an example.
const MAX_REGEX_SIZE_BEFORE_COMPILATION: usize = 1 * 1024 * 1024;
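The check that this constant enables can be sketched as follows. This is a minimal illustration, not Materialize's actual code; `check_pattern_size` is a hypothetical helper name:

```rust
/// Hypothetical stand-in for the limit discussed above.
const MAX_REGEX_SIZE_BEFORE_COMPILATION: usize = 1024 * 1024;

/// Reject oversized patterns before handing them to the regex compiler,
/// so a pathological pattern cannot OOM the compiling process (envd).
fn check_pattern_size(pattern: &str) -> Result<(), String> {
    if pattern.len() > MAX_REGEX_SIZE_BEFORE_COMPILATION {
        Err(format!(
            "regex pattern is {} bytes, exceeding the {}-byte limit",
            pattern.len(),
            MAX_REGEX_SIZE_BEFORE_COMPILATION
        ))
    } else {
        Ok(())
    }
}
```

The point of checking the source length up front is that even building the pattern's parse tree can use memory proportional to the input, so the guard has to run before any compilation work starts.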
Member
Should we document this constant in our docs?

Is it a potential regression that we'd no longer accept a query that works fine at the moment?

@ggevay (Contributor, Author) Nov 18, 2025

Yes, I'll add it in the docs. Edit: Done.

Unfortunately, this could indeed introduce a regression. I'm hoping that the 1 million limit is big enough that nobody has a regex this big at the moment. If somebody happens to have such a big regex in their catalog, then the upgrade check would fail, in which case we'd make a new RC with a bigger limit or with this change reverted. I'd say this risk is acceptable in cloud. In self-managed, running into this would be a bit more troublesome, because there would have to be some back-and-forth with the user, but considering the very low chance, maybe it's acceptable?

(Guarding this with a feature flag would unfortunately be very hard at this point in the code.)

This is what Postgres's docs say about big regexes:

No particular limit is imposed on the length of REs in this implementation. However, programs intended to be highly portable should not employ REs longer than 256 bytes, as a POSIX-compliant implementation can refuse to accept such REs.

In practice, Postgres seems to fail at lengths of tens of thousands on regexes that look like the ones causing us trouble in https://github.com/MaterializeInc/database-issues/issues/9907:

postgres=# SELECT 'aaaaaaaaaaa' ~ repeat('vxx.0.0-rc.4 (5b079d80c)', '10000');
ERROR:  invalid regular expression: regular expression is too complex
postgres=# SELECT 'aaaaaaaaaaa' ~ repeat('vxx.0.0-rc.4 (5b079d80c)', '4000');
ERROR:  invalid regular expression: regular expression is too complex
postgres=# SELECT 'aaaaaaaaaaa' ~ repeat('vxx.0.0-rc.4 (5b079d80c)', '2000');
ERROR:  invalid regular expression: regular expression is too complex
postgres=# SELECT 'aaaaaaaaaaa' ~ repeat('vxx.0.0-rc.4 (5b079d80c)', '1000');
 ?column? 
----------
 f
(1 row)

I also did some quick googling, and couldn't find any case of someone talking about a regex longer than tens of thousands. Claude also says

When developers discuss "large" regexes in production environments, they're typically talking about patterns measured in hundreds or low thousands of characters, not millions.

Contributor

We can simply check this for existing views using mzexplore; it should be quick to export data and run some jq queries.

@ggevay (Contributor, Author) Nov 19, 2025

Unfortunately we can do this only for cloud, and I'm not too worried about cloud. In self-managed it would be somewhat more of a hassle if we need a patch release.

Contributor

Ah, right. Feature flag it (or disable the limit in unsafe mode)? Too much trouble to ask the self-managed users?

@ggevay (Contributor, Author)

Feature flag it

Unfortunately we can't feature flag it, because we don't have access to feature flag values in scalar function evaluations. Flags would have to be wired in from super far. (And the extra context being passed around scalar evaluation might even have a non-trivial performance impact.)

disable the limit in unsafe mode

Unsafe mode enables too much unsafe stuff, so we'd like to never tell a customer to turn it on. It's more a testing thing than a user-facing escape hatch.

Too much trouble to ask the self-managed users?

Well, I'd say it's acceptable, considering that the risk of a self-managed user already having such a big regex is quite low.

Contributor

Okay, you've convinced me!

@ggevay ggevay marked this pull request as ready for review November 17, 2025 15:57
@ggevay ggevay requested a review from a team as a code owner November 17, 2025 15:57
@ggevay ggevay requested a review from a team as a code owner November 18, 2025 13:10
Also, fix the regex size limit after compilation to a constant that we
control, to prevent surprises if a future version of the Regex crate
changes its default.
@ggevay (Contributor, Author) commented Nov 18, 2025

Thanks for the review!

This is ready for another review round. I guess the main question is whether we are ok with accepting the risk of customers running into the new limits during an upgrade. I'm leaning towards yes.

@mgree (Contributor) left a comment

Looks good to me! A few notes/thoughts, but nothing that should prevent merging this.

description: Replicate the string `n` times.
description: Replicate the string `n` times. The maximum length of the result string is 100 MiB.

- signature: 'replace(s: str, f: str, r: str) -> str'
Contributor

The replace function could also be inflationary... do we want limits on all such string functions? (Seems like: concat, concat_ws, decode, and possibly encode [a string under the limit in UTF-8 might not be under it in UTF-32].) If we're trying to enforce this invariant globally, also string_agg and anything coming from JSON.
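An output-size guard for inflationary string functions can be sketched in Rust as follows. The names are hypothetical and the 100 MiB figure comes from the docs snippet for `repeat`; this is an illustration of the pattern, not Materialize's actual implementation:

```rust
/// Hypothetical cap on the output size of string-producing functions
/// (the docs above mention 100 MiB for `repeat`).
const MAX_STRING_FUNCTION_OUTPUT_BYTES: usize = 100 * 1024 * 1024;

/// A guarded `repeat`: compute the would-be output size up front and
/// refuse before allocating anything. `replace`, `concat`, etc. would
/// need analogous checks, since they can also inflate their inputs.
fn repeat_guarded(s: &str, n: usize) -> Result<String, String> {
    // `checked_mul` also catches the pathological case where the
    // product overflows `usize`.
    let out_len = s.len().checked_mul(n).ok_or("length overflow")?;
    if out_len > MAX_STRING_FUNCTION_OUTPUT_BYTES {
        return Err(format!("output would be {out_len} bytes, over the limit"));
    }
    Ok(s.repeat(n))
}
```

The essential property is that the size is computed from the input lengths alone, so the function fails cheaply instead of allocating a huge buffer first.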

///
/// Note: This number is mentioned in our user-facing docs at the "String operators" in the function
/// reference.
const MAX_REGEX_SIZE_AFTER_COMPILATION: usize = 10 * 1024 * 1024;
Contributor

NB that this is in fact the default NFA size limit already (but good to document and enforce!) https://docs.rs/regex/latest/src/regex/builders.rs.html#52

If our goal is to use less memory/prevent OOMs, we might want a smaller limit.

