Introduce a size limit for regexes before compilation #34171
@@ -21,6 +21,23 @@ use serde::de::Error as DeError;
 use serde::ser::SerializeStruct;
 use serde::{Deserialize, Deserializer, Serialize, Serializer, de};

+/// The maximum size of a regex after compilation.
+/// This is the same as the `Regex` crate's default at the time of writing.
+///
+/// Note: This number is mentioned in our user-facing docs, in the "String operators" section of
+/// the function reference.
+const MAX_REGEX_SIZE_AFTER_COMPILATION: usize = 10 * 1024 * 1024;
Contributor

NB that this is in fact the default NFA size limit already (but good to document and enforce!): https://docs.rs/regex/latest/src/regex/builders.rs.html#52 If our goal is to use less memory/prevent OOMs, we might want a smaller limit.

Contributor (Author)

Yes, I just made the compiled regex size limit explicit.

Well, a 10 MiB compiled regex (if it really doesn't go above that) should be a walk in the park; our envd memory limit is in the GBs. Btw., I also checked that making this limit even super small does not prevent the original issue, unfortunately. I'm planning to send a bug report to the crate when I have time, because their docs kind of make it sound like they have good DoS prevention, so maybe this is something they'd want to fix. (As we also discussed on Zoom.)
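[Editor's note: for context on what `size_limit` actually guards, here is a minimal standalone sketch, closely following the `regex` crate's own documentation example for `RegexBuilder::size_limit`; the deliberately tiny 10 KiB cap exists only to force the failure.]

```rust
use regex::RegexBuilder;

fn main() {
    // `\w` is Unicode-aware, so even this short pattern compiles to a
    // surprisingly large program; a 10 KiB cap makes it fail with
    // `Error::CompiledTooBig`.
    let result = RegexBuilder::new(r"\w{20}")
        .size_limit(10 * 1024)
        .build();

    match result {
        Ok(_) => println!("compiled within the size limit"),
        Err(regex::Error::CompiledTooBig(limit)) => {
            println!("rejected: compiled size would exceed {limit} bytes");
        }
        Err(e) => println!("other compilation error: {e}"),
    }
}
```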
+
+/// We also need a separate limit for the size of regexes before compilation. Even though the
+/// `Regex` crate promises that using its `size_limit` option (which we set to the other limit,
+/// `MAX_REGEX_SIZE_AFTER_COMPILATION`) would prevent excessive resource usage, this doesn't seem
+/// to be the case. Since we compile regexes in envd, we need strict limits to prevent envd OOMs.
+/// See <https://github.com/MaterializeInc/database-issues/issues/9907> for an example.
+///
+/// Note: This number is mentioned in our user-facing docs, in the "String operators" section of
+/// the function reference.
+const MAX_REGEX_SIZE_BEFORE_COMPILATION: usize = 1 * 1024 * 1024;
Member

Should we document this constant in our docs? Is it a potential regression that we don't accept a query that works fine at the moment?

Contributor (Author)

Yes, I'll add it in the docs. Edit: Done.

Unfortunately, this could indeed introduce a regression. I'm hoping that the 1 MiB limit is big enough that nobody has a regex this big at the moment. If somebody happens to have such a big regex in their catalog, then the upgrade check would fail, in which case we'd make a new RC with a bigger limit or revert this change. I'd say this risk is acceptable in cloud. In self-managed, running into this would be a bit more troublesome, because there would have to be some back-and-forth with the user, but considering the very low chance, maybe it's acceptable? (Guarding this with a feature flag would unfortunately be very hard at this point in the code.)

This is what Postgres' docs say about big regexes:

In practice, Postgres seems to fail at lengths of tens of thousands of characters on regexes that look like the ones causing us trouble in https://github.com/MaterializeInc/database-issues/issues/9907. I also did some quick googling and couldn't find any case of someone talking about a regex longer than tens of thousands of characters. Claude also says:

Contributor

We can simply check this for existing views using mzexplore; should be quick to export data and run some

Contributor (Author)

Unfortunately, we can do this only for cloud, and I'm not too worried about cloud. In self-managed, it would be somewhat more of a hassle if we needed a patch release.

Contributor

Ah, right. Feature flag it (or disable the limit in unsafe mode)? Too much trouble to ask of the self-managed users?

Contributor (Author)

Unfortunately, we can't feature flag it, because we don't have access to feature flag values in scalar function evaluations; the flags would have to be wired in from quite far away. (And the extra context being passed around scalar evaluation might even have a non-trivial performance impact.)

Unsafe mode enables too much unsafe stuff, so we'd like to never tell a customer to turn it on. It's more a testing thing than a user-facing escape hatch.

Well, I'd say the risk is acceptable, considering that the chance of a self-managed user already having such a big regex is quite low.

Contributor

Okay, you've convinced me!
+
 /// A hashable, comparable, and serializable regular expression type.
 ///
 /// The [`regex::Regex`] type, the de facto standard regex type in Rust, does

@@ -58,7 +75,7 @@ impl Regex {
     /// A simple constructor for the default setting of `dot_matches_new_line: true`.
     /// See <https://www.postgresql.org/docs/current/functions-matching.html#POSIX-MATCHING-RULES>
     /// "newline-sensitive matching"
-    pub fn new(pattern: &str, case_insensitive: bool) -> Result<Regex, Error> {
+    pub fn new(pattern: &str, case_insensitive: bool) -> Result<Regex, RegexCompilationError> {
         Self::new_dot_matches_new_line(pattern, case_insensitive, true)
     }
@@ -67,10 +84,16 @@ impl Regex {
         pattern: &str,
         case_insensitive: bool,
         dot_matches_new_line: bool,
-    ) -> Result<Regex, Error> {
+    ) -> Result<Regex, RegexCompilationError> {
+        if pattern.len() > MAX_REGEX_SIZE_BEFORE_COMPILATION {
+            return Err(RegexCompilationError::PatternTooLarge {
+                pattern_size: pattern.len(),
+            });
+        }
         let mut regex_builder = RegexBuilder::new(pattern);
         regex_builder.case_insensitive(case_insensitive);
         regex_builder.dot_matches_new_line(dot_matches_new_line);
+        regex_builder.size_limit(MAX_REGEX_SIZE_AFTER_COMPILATION);
         Ok(Regex {
             case_insensitive,
             dot_matches_new_line,
@@ -86,6 +109,36 @@ impl Regex {
     }
 }

+/// Error type for regex compilation failures.
+#[derive(Debug, Clone)]
+pub enum RegexCompilationError {
+    /// Wrapper for the regex crate's `Error` type.
+    RegexError(Error),
+    /// The regex pattern size exceeds `MAX_REGEX_SIZE_BEFORE_COMPILATION`.
+    PatternTooLarge { pattern_size: usize },
+}
+
+impl fmt::Display for RegexCompilationError {
+    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
+        match self {
+            RegexCompilationError::RegexError(e) => write!(f, "{}", e),
+            RegexCompilationError::PatternTooLarge { pattern_size } => write!(
+                f,
+                "regex pattern too large ({} bytes, max {} bytes)",
+                pattern_size, MAX_REGEX_SIZE_BEFORE_COMPILATION
+            ),
+        }
+    }
+}
+
+impl From<Error> for RegexCompilationError {
+    fn from(e: Error) -> Self {
+        RegexCompilationError::RegexError(e)
+    }
+}
+
 impl PartialEq<Regex> for Regex {
     fn eq(&self, other: &Regex) -> bool {
         self.pattern() == other.pattern()
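[Editor's note: taken together, the diff implements a two-stage guard. The following self-contained sketch is not code from this PR: `build_guarded` is a hypothetical stand-in for `Regex::new_dot_matches_new_line`, and the constants mirror the ones above; it shows the same shape against the plain `regex` crate.]

```rust
use regex::RegexBuilder;

// These mirror the constants in the diff above (editor's sketch, not the
// actual module).
const MAX_REGEX_SIZE_AFTER_COMPILATION: usize = 10 * 1024 * 1024;
const MAX_REGEX_SIZE_BEFORE_COMPILATION: usize = 1024 * 1024;

/// Two-stage guard, shaped like the diff: reject oversized pattern text
/// before compilation, then let the regex crate cap the compiled size.
fn build_guarded(pattern: &str) -> Result<regex::Regex, String> {
    if pattern.len() > MAX_REGEX_SIZE_BEFORE_COMPILATION {
        return Err(format!(
            "regex pattern too large ({} bytes, max {} bytes)",
            pattern.len(),
            MAX_REGEX_SIZE_BEFORE_COMPILATION
        ));
    }
    RegexBuilder::new(pattern)
        .size_limit(MAX_REGEX_SIZE_AFTER_COMPILATION)
        .build()
        .map_err(|e| format!("invalid regular expression: {e}"))
}

fn main() {
    // A pattern over 1 MiB is rejected before any compilation work happens.
    let huge = "a".repeat(MAX_REGEX_SIZE_BEFORE_COMPILATION + 1);
    assert!(build_guarded(&huge).is_err());
    assert!(build_guarded(r"foo.*bar").is_ok());
}
```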
The `replace` function could also be inflationary... do we want limits on all such string functions? (Seems like: `concat`, `concat_ws`, `decode`, and possibly `encode` [a string under the limit in UTF-8 might not be in UTF-32].) If we're trying to enforce this invariant globally, also `string_agg` and anything coming from JSON.

Right, opened an issue: https://github.com/MaterializeInc/database-issues/issues/9916
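[Editor's note: to make the "inflationary" point concrete, a small illustrative sketch in Rust rather than SQL; a `replace`-style operation can blow a small input far past any limit enforced on inputs alone.]

```rust
fn main() {
    // 1 KiB of input...
    let input = "a".repeat(1024);
    // ...replaced so that every byte becomes 1 KiB: roughly 1 MiB of output.
    // An input-size limit alone therefore does not bound the output size.
    let replacement = "b".repeat(1024);
    let output = input.replace('a', &replacement);
    println!("{} bytes -> {} bytes", input.len(), output.len());
}
```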