-
Notifications
You must be signed in to change notification settings - Fork 70
Add support for character classes [...]
#250
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Add support for character classes [...]
#250
Conversation
Also, fix the quantifiers more expressive. I.e., now it supports: {,4}, {4}, {1,3}, {1,} instead of just {1,3} and {1,}
@StefanosChaliasos thanks for this contribution! A bit busy today but I'll try to review this at some point in the evening. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like some dead code was included accidentally? Otherwise the logic itself seems fine.
if is_negated: | ||
expanded_content = expanded_content[1:] # Remove ^ from the content | ||
|
||
return cls(match.group(), expanded_content, is_negated) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like there are some missing fields that are used in the constructor? Shouldn't this line throw an exception?
EDIT: Based on the coverage report, this line isn't being hit at all.
greek_nfa = NFA.from_regex("[Ͱ-Ͽ]+", input_symbols=input_symbols) | ||
cyrillic_nfa = NFA.from_regex("[Ѐ-ӿ]+", input_symbols=input_symbols) | ||
|
||
latin_samples = ["¡", "£", "Ā", "ŕ", "ƿ"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use the \u...
notation? This will make these tests easier to maintain (albiet less elegant in the editor).
self.counter = counter | ||
|
||
@classmethod | ||
def from_match(cls: Type[Self], match: re.Match) -> Self: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like the logic here heavily overlaps with the process_char_class
function. Could one of these be made to call the other?
) | ||
|
||
lexer.register_token( | ||
character_class_factory, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would personally prefer to use the from_match
syntax the way the other token types are registered, but either syntax is fine. But it seems like the from_match
in the new token class isn't being called at all.
@@ -577,3 +691,50 @@ def parse_regex(regexstr: str, input_symbols: AbstractSet[str]) -> NFARegexBuild | |||
postfix = tokens_to_postfix(tokens_with_concats) | |||
|
|||
return parse_postfix_tokens(postfix) | |||
|
|||
|
|||
def process_char_class(class_str: str) -> Tuple[bool, Set[str]]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Might be good to have a couple of small test cases for this function independently to aid in debugging later, but won't make any hard requests for this.
Will go over everything tomorrow. Thanks a lot for the feedback. |
We also added more complex tests
I did some more changes, can you review the new ones. Basically I added support for shorthand (e.g., '\d') and I tokenised whitespace. I need to add more tests and polish the code. I'll change the PR as a draft until done. |
if "\\s" in regex: | ||
additional_symbols.update(WHITESPACE_CHARS) | ||
if "\\S" in regex: | ||
additional_symbols.update(NON_WHITESPACE_CHARS) | ||
if "\\d" in regex: | ||
additional_symbols.update(DIGIT_CHARS) | ||
if "\\D" in regex: | ||
additional_symbols.update(NON_DIGIT_CHARS) | ||
if "\\w" in regex: | ||
additional_symbols.update(WORD_CHARS) | ||
if "\\W" in regex: | ||
additional_symbols.update(NON_WORD_CHARS) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@StefanosChaliasos Can you please refactor this to use a dict
-based lookup table? That would make this much less repetitive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree this could make things much cleaner, especially since this can be done in a loop 👍🏽
from automata.regex.parser import ( | ||
DIGIT_CHARS, | ||
NON_DIGIT_CHARS, | ||
NON_WHITESPACE_CHARS, | ||
NON_WORD_CHARS, | ||
WHITESPACE_CHARS, | ||
WORD_CHARS, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@StefanosChaliasos Can you please keep all imports at the top of the file? There's no particular need for the tighter scoping here, IMO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think ruff will complain about the imports.
additional_symbols.update(WORD_CHARS) | ||
pos += 2 | ||
continue | ||
elif class_content[pos + 1] in "S": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@StefanosChaliasos What is the intention of using in
here as opposed to ==
? If the right-hand side is just a single character, the only difference that seems to make is permitting class_content[pos + 1]
to be empty string (in addition to the character itself). In other words:
"S" in "S" True
"" in "S" # True
continue | ||
|
||
# Handle escape sequence in character class | ||
from automata.regex.parser import _handle_escape_sequences |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@StefanosChaliasos Can you also please move this import to the top of the file?
@@ -24,10 +25,21 @@ | |||
validate_tokens, | |||
) | |||
|
|||
# Add these at the top of the file to define our shorthand character sets | |||
ASCII_PRINTABLE_CHARS = frozenset(string.printable) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@eliotwrobson The implication here is that only ASCII characters are deemed as printable characters, but how would that work given that #233 just added support for Unicode characters?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this only gets used when adding characters from the use of some special character classes? It might be good to have a test case that uses a non-printable character.
Hey, @StefanosChaliasos! I left some additional comments on the PR—apologies if they seem nitpicky, but just wanting to maintain solid code quality and consistency for this project. |
Thanks for the review, I will address the comments once I find some time |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few more comments. @StefanosChaliasos also, we just updated the develop branch to use UV, so be sure to do a rebase before adding more changes. It would be awesome if we could close this out soon, since along with the switch to UV, we have a couple of feature request PRs that could be part of a new release.
@@ -24,10 +25,21 @@ | |||
validate_tokens, | |||
) | |||
|
|||
# Add these at the top of the file to define our shorthand character sets | |||
ASCII_PRINTABLE_CHARS = frozenset(string.printable) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this only gets used when adding characters from the use of some special character classes? It might be good to have a test case that uses a non-printable character.
if "\\s" in regex: | ||
additional_symbols.update(WHITESPACE_CHARS) | ||
if "\\S" in regex: | ||
additional_symbols.update(NON_WHITESPACE_CHARS) | ||
if "\\d" in regex: | ||
additional_symbols.update(DIGIT_CHARS) | ||
if "\\D" in regex: | ||
additional_symbols.update(NON_DIGIT_CHARS) | ||
if "\\w" in regex: | ||
additional_symbols.update(WORD_CHARS) | ||
if "\\W" in regex: | ||
additional_symbols.update(NON_WORD_CHARS) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree this could make things much cleaner, especially since this can be done in a loop 👍🏽
from automata.regex.parser import ( | ||
DIGIT_CHARS, | ||
NON_DIGIT_CHARS, | ||
NON_WHITESPACE_CHARS, | ||
NON_WORD_CHARS, | ||
WHITESPACE_CHARS, | ||
WORD_CHARS, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think ruff will complain about the imports.
while pos < len(class_content): | ||
if class_content[pos] == "\\" and pos + 1 < len(class_content): | ||
# Check for shorthand classes in character classes | ||
if class_content[pos + 1] == "s": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you might be able to use a lookup table from a dictionary here? Just use the character as a key and a tuple of additional symbols and position increment as the value.
|
||
# Add all characters in the range to input symbols | ||
for i in range(ord(start_char), ord(end_char) + 1): | ||
class_symbols.add(chr(i)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this can be done with a python update
call instead of a loop.
"&": "&", | ||
} | ||
|
||
if char in escape_map: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if char in escape_map: | |
return escape_map.get(char, char) |
Thanks for the additional comments, will fix everything by the end of the week. Got busy with other stuff :) |
Fixes #249
It also makes the quantifiers more expressive:
I.e., now it supports: {,4}, {4}, {1,3}, {1,} instead of just {1,3} and {1,}