-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Editorial: Formally disambiguate the non-Annex-B grammar #1727
base: main
Are you sure you want to change the base?
Editorial: Formally disambiguate the non-Annex-B grammar #1727
Conversation
[+U] `u` TrailSurrogate | ||
[+U] `u` NonSurrogate | ||
[+U] RegExpUnicodeSurrogatePair | ||
[+U] [lookahead ∉ RegExpUnicodeSurrogatePair] `u` Hex4Digits |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RegExpUnicodeSurrogatePair
generates a set of terminal sequences, each of length 11. This doesn't fit with the current definition of lookahead-restrictions.
[+U] `u` TrailSurrogate | ||
[+U] `u` NonSurrogate | ||
[+U] RegExpUnicodeSurrogatePair | ||
[+U] [lookahead ∉ RegExpUnicodeSurrogatePair] `u` Hex4Digits |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lookahead ∉ RegExpUnicodeSurrogatePair
seems a little suspicious to me, since RegExpUnicodeSurrogatePair
describes a set of sequences of terminals rather than just a set of terminals. As far as I know the only other place where there are sequences of length greater than 1 in a lookahead-restriction-set is the async function
restriction in ExpressionStatement, and there the sequence is of length precisely two (and also it's written out more explicitly).
Could this instead be written as
[+U] `u` LeadSurrogate [lookahead ≠ `\u` TrailSurrogate]
[+U] `u` LeadSurrogate `\u` TrailSurrogate
[+U] `u` TrailSurrogate
[+U] `u` NonSurrogate
? That feels clearer to me, if it's equivalent. (And if it's not, I'm confused.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is invalid (see below), but even with refactoring would still be a lookahead of six code points. And it's not more clear to me, but I would be willing to switch to it if there's consensus.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, in the ES grammars, a lookahead-constraint either:
(a) occurs at the end of a right-hand-side, or
(b) occurs before a nonterminal, where that nonterminal derives phrases that begin with the disallowed sequences (i.e. the constraint is 'scoped' to that nonterminal).
So the right-hand-side:
[lookahead ∉ RegExpUnicodeSurrogatePair] `u` Hex4Digits
would be quite unusual, in that the lookahead-constraint has to "look through" the terminal u
and nonterminal HexDigits, and then look past them to a potential u
Hex4Digits following. For this reason, I prefer @bakkot's suggestion:
LeadSurrogate [lookahead != `\u` TrailSurrogate]
as it eliminates the "look through" and winds up with a fairly standard end-of-RHS constraint. (But yes, the nature of its lookahead-sequence would require tweaking 5.1.5 Grammar Notation.)
Also, I think I'd prefer it to come after the Lead+Trail right-hand-side:
[+U] `u` LeadSurrogate `\u` TrailSurrogate
[+U] `u` LeadSurrogate [lookahead != `\u` TrailSurrogate]
(The other thing I like about this solution is that just those two lines make it really obvious why the lookahead-constraint is needed.)
There are currently three flavors of negative lookahead assertions, all defined in Grammar Notation (emphasis mine):
So But addressing the more substantive point, I agree with @jmdyck that the intent is to bound lookahead, in particular limiting it to two terminals (corresponding with an LR(2) grammar). However, that is already not the case for two reasons. One of them has to do with the preservation of It is worth noting, but not necessarily compelling, that this applies to the Regular Expression grammar rather than to the syntactic grammar. Nevertheless, I believe that these particulars should be formally expressed where possible, even if that means admitting that ECMAScript is not as friendly to parse as it might otherwise appear to be (:roll_eyes:). I am willing to update the Grammar Notation section accordingly if that is the consensus. But an alternative does exist to expressing this requirement with a lookahead—introduction of a "longest expansion" rule analogous to the "longest input element" rule for lexical scanning. I'm not sure what exactly that would look like, but would be willing to try something out. |
I have a feeling that would be a bad idea: too much chance of unintended consequences. (Might depend on the exact formulation, though.) Currently, I think using a lookahead-restriction is the best option, along with any necessary changes to the Grammar Notation section. |
My own preference, revisiting this, is to relax the length bound on lookahead restrictions from being bounded at 2 to being bounded at 4 or (if 4 does not suffice) 6, ideally with that relaxation scoped specifically to the RegExp grammar. This might look like
(Additions in bold, removals ~struck through.) This also has the advantage of being explicit that the restriction on length of lookaheads is a property of all lookaheads, not just a particular way of writing them. Thoughts? |
@bakkot: Those wording changes still wouldn't allow |
3d0c24c
to
7a79833
Compare
@waldemarhorwat recently expressed objections to moving the Annex B regular expression syntax into the main grammar because of its order-dependent productions. However, I found an example of the same already in the main grammar, for surrogate pairs in codepoint-based "Unicode" regular expressions. This PR inserts a lookahead assertion to correct that ambiguity, and also applies the same treatment to replace some NumericLiteral prose with a formal assertion.
Closes #969.