Add support for character classes `[...]` #250

StefanosChaliasos · 2025-03-06T09:48:35Z

Fixes #249

It also makes the quantifiers more expressive: it now supports `{,4}`, `{4}`, `{1,3}`, and `{1,}` instead of just `{1,3}` and `{1,}`.
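To make the four quantifier forms concrete, here is a minimal sketch of how they could map to a lower and upper bound. The `QUANTIFIER_RE` pattern and the `parse_quantifier` helper are hypothetical names used only for illustration; this is not the parser code from the PR.

```python
import re
from typing import Optional, Tuple

# Hypothetical pattern: optional lower bound, optional comma, optional upper bound.
QUANTIFIER_RE = re.compile(r"\{(\d*)(,?)(\d*)\}")


def parse_quantifier(text: str) -> Tuple[int, Optional[int]]:
    """Return (lower, upper); upper is None when the quantifier is unbounded."""
    match = QUANTIFIER_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"invalid quantifier: {text!r}")
    lower_str, comma, upper_str = match.groups()
    if not lower_str and not upper_str:
        # Rejects "{}" and "{,}", which carry no bound at all.
        raise ValueError(f"invalid quantifier: {text!r}")

    lower = int(lower_str) if lower_str else 0
    upper: Optional[int]
    if not comma:
        upper = lower           # {4}: exactly four repetitions
    elif upper_str:
        upper = int(upper_str)  # {1,3} or {,4}
    else:
        upper = None            # {1,}: no upper bound
    if upper is not None and lower > upper:
        raise ValueError(f"lower bound exceeds upper bound: {text!r}")
    return lower, upper
```

Under these assumptions, `{,4}` is read as "zero to four", which matches the behavior of Python's own `re` module for the same syntax.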

eliotwrobson · 2025-03-06T16:38:08Z

@StefanosChaliasos thanks for this contribution! A bit busy today but I'll try to review this at some point in the evening.

coveralls · 2025-03-06T16:40:12Z

coverage: 98.008% (-1.6%) from 99.613%
when pulling df53edc on StefanosChaliasos:add-support-for-charclass
into 1bdf9b7 on caleb531:develop.

eliotwrobson

It looks like some dead code was included accidentally? Otherwise the logic itself seems fine.

eliotwrobson · 2025-03-06T19:00:48Z

automata/regex/parser.py

+        if is_negated:
+            expanded_content = expanded_content[1:]  # Remove ^ from the content
+
+        return cls(match.group(), expanded_content, is_negated)


It looks like there are some missing fields that are used in the constructor? Shouldn't this line throw an exception?

EDIT: Based on the coverage report, this line isn't being hit at all.

eliotwrobson · 2025-03-06T19:05:00Z

tests/test_regex.py

+        greek_nfa = NFA.from_regex("[Ͱ-Ͽ]+", input_symbols=input_symbols)
+        cyrillic_nfa = NFA.from_regex("[Ѐ-ӿ]+", input_symbols=input_symbols)
+
+        latin_samples = ["¡", "£", "Ā", "ŕ", "ƿ"]


Can we use the \u... notation? This will make these tests easier to maintain (albeit less elegant in the editor).
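For reference, a small sketch of what the escaped form of these test strings could look like. The code points below are the ones the literal characters in the diff correspond to; the variable names are only illustrative.

```python
# Literal range endpoints and their \u-escaped equivalents are identical strings.
greek_literal = "[Ͱ-Ͽ]+"
greek_escaped = "[\u0370-\u03ff]+"  # Greek and Coptic block

cyrillic_literal = "[Ѐ-ӿ]+"
cyrillic_escaped = "[\u0400-\u04ff]+"  # Cyrillic block

latin_literal = ["¡", "£", "Ā", "ŕ", "ƿ"]
latin_escaped = ["\u00a1", "\u00a3", "\u0100", "\u0155", "\u01bf"]

assert greek_literal == greek_escaped
assert cyrillic_literal == cyrillic_escaped
assert latin_literal == latin_escaped
```

The escaped form makes the range boundaries explicit, which helps when an editor font renders the glyphs ambiguously.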

eliotwrobson · 2025-03-06T19:06:24Z

automata/regex/parser.py

+        self.counter = counter
+
+    @classmethod
+    def from_match(cls: Type[Self], match: re.Match) -> Self:


It seems like the logic here heavily overlaps with the process_char_class function. Could one of these be made to call the other?

eliotwrobson · 2025-03-06T19:09:12Z

automata/regex/parser.py

+        )
+
+    lexer.register_token(
+        character_class_factory,


Would personally prefer to use the from_match syntax the way the other token types are registered, but either syntax is fine. But it seems like the from_match in the new token class isn't being called at all.

eliotwrobson · 2025-03-06T19:10:07Z

automata/regex/parser.py

@@ -577,3 +691,50 @@ def parse_regex(regexstr: str, input_symbols: AbstractSet[str]) -> NFARegexBuild
    postfix = tokens_to_postfix(tokens_with_concats)

    return parse_postfix_tokens(postfix)
+
+
+def process_char_class(class_str: str) -> Tuple[bool, Set[str]]:


Nit: Might be good to have a couple of small test cases for this function independently to aid in debugging later, but won't make any hard requests for this.
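A rough sketch of what such small, independent test cases might look like. The `process_char_class` body below is a simplified stand-in written only so the example runs on its own: it assumes the argument is the class content without the surrounding brackets and handles just a leading `^` and simple `a-b` ranges. The real function in the PR may behave differently.

```python
from typing import Set, Tuple


def process_char_class(class_str: str) -> Tuple[bool, Set[str]]:
    """Simplified stand-in (assumption): only a leading ^ and a-b ranges."""
    is_negated = class_str.startswith("^")
    content = class_str[1:] if is_negated else class_str
    chars: Set[str] = set()
    pos = 0
    while pos < len(content):
        if pos + 2 < len(content) and content[pos + 1] == "-":
            # Expand a range such as a-c into its member characters.
            start, end = content[pos], content[pos + 2]
            chars.update(map(chr, range(ord(start), ord(end) + 1)))
            pos += 3
        else:
            chars.add(content[pos])
            pos += 1
    return is_negated, chars


def test_simple_range() -> None:
    assert process_char_class("a-c") == (False, {"a", "b", "c"})


def test_negated_class() -> None:
    assert process_char_class("^xy") == (True, {"x", "y"})


def test_trailing_hyphen_is_literal() -> None:
    assert process_char_class("a-") == (False, {"a", "-"})
```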

StefanosChaliasos · 2025-03-06T19:21:33Z

Will go over everything tomorrow. Thanks a lot for the feedback.

StefanosChaliasos · 2025-03-07T08:54:24Z

I made some more changes; can you review the new ones? Basically, I added support for shorthands (e.g., '\d') and I tokenized whitespace. I still need to add more tests and polish the code, so I'll mark the PR as a draft until that's done.

caleb531 · 2025-03-14T19:53:54Z

automata/fa/nfa.py

+        if "\\s" in regex:
+            additional_symbols.update(WHITESPACE_CHARS)
+        if "\\S" in regex:
+            additional_symbols.update(NON_WHITESPACE_CHARS)
+        if "\\d" in regex:
+            additional_symbols.update(DIGIT_CHARS)
+        if "\\D" in regex:
+            additional_symbols.update(NON_DIGIT_CHARS)
+        if "\\w" in regex:
+            additional_symbols.update(WORD_CHARS)
+        if "\\W" in regex:
+            additional_symbols.update(NON_WORD_CHARS)


@StefanosChaliasos Can you please refactor this to use a dict-based lookup table? That would make this much less repetitive.

cc @eliotwrobson

caleb531 · 2025-03-14T19:54:43Z

automata/fa/nfa.py

+        from automata.regex.parser import (
+            DIGIT_CHARS,
+            NON_DIGIT_CHARS,
+            NON_WHITESPACE_CHARS,
+            NON_WORD_CHARS,
+            WHITESPACE_CHARS,
+            WORD_CHARS,
+        )


@StefanosChaliasos Can you please keep all imports at the top of the file? There's no particular need for the tighter scoping here, IMO.

cc @eliotwrobson

caleb531 · 2025-03-14T19:58:45Z

automata/fa/nfa.py

+                        additional_symbols.update(WORD_CHARS)
+                        pos += 2
+                        continue
+                    elif class_content[pos + 1] in "S":


@StefanosChaliasos What is the intention of using `in` here as opposed to `==`? If the right-hand side is just a single character, the only difference that seems to make is permitting class_content[pos + 1] to be the empty string (in addition to the character itself). In other words:

"S" in "S"  # True
"" in "S"  # True

caleb531 · 2025-03-14T19:59:30Z

automata/fa/nfa.py

+                        continue
+
+                    # Handle escape sequence in character class
+                    from automata.regex.parser import _handle_escape_sequences


@StefanosChaliasos Can you also please move this import to the top of the file?

caleb531 · 2025-03-14T20:02:09Z

automata/regex/parser.py

@@ -24,10 +25,21 @@
    validate_tokens,
 )

+# Add these at the top of the file to define our shorthand character sets
+ASCII_PRINTABLE_CHARS = frozenset(string.printable)


@eliotwrobson The implication here is that only ASCII characters are deemed as printable characters, but how would that work given that #233 just added support for Unicode characters?

caleb531 · 2025-03-14T20:05:16Z

Hey, @StefanosChaliasos! I left some additional comments on the PR—apologies if they seem nitpicky, but just wanting to maintain solid code quality and consistency for this project.

StefanosChaliasos · 2025-03-14T21:04:29Z

Thanks for the review, I will address the comments once I find some time

eliotwrobson

A few more comments. @StefanosChaliasos also, we just updated the develop branch to use UV, so be sure to do a rebase before adding more changes. It would be awesome if we could close this out soon, since along with the switch to UV, we have a couple of feature request PRs that could be part of a new release.

eliotwrobson · 2025-05-06T21:22:34Z

automata/regex/parser.py

@@ -24,10 +25,21 @@
    validate_tokens,
 )

+# Add these at the top of the file to define our shorthand character sets
+ASCII_PRINTABLE_CHARS = frozenset(string.printable)


I think this only gets used when adding characters from the use of some special character classes? It might be good to have a test case that uses a non-printable character.
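As a quick illustration of the concern, the following checks show which characters fall outside a set built from `string.printable`. The constant name mirrors the diff; the specific sample characters are chosen here for illustration only.

```python
import string

ASCII_PRINTABLE_CHARS = frozenset(string.printable)

# string.printable covers digits, ASCII letters, punctuation, and whitespace.
assert len(ASCII_PRINTABLE_CHARS) == 100
assert "a" in ASCII_PRINTABLE_CHARS
assert "\n" in ASCII_PRINTABLE_CHARS

# Control characters and all non-ASCII characters are excluded.
assert "\x00" not in ASCII_PRINTABLE_CHARS
assert "\u03b1" not in ASCII_PRINTABLE_CHARS  # Greek small letter alpha
```

A test that feeds one of the excluded characters through a negated shorthand would make the intended behavior for Unicode input explicit.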

eliotwrobson · 2025-05-06T21:22:58Z

automata/fa/nfa.py

+        if "\\s" in regex:
+            additional_symbols.update(WHITESPACE_CHARS)
+        if "\\S" in regex:
+            additional_symbols.update(NON_WHITESPACE_CHARS)
+        if "\\d" in regex:
+            additional_symbols.update(DIGIT_CHARS)
+        if "\\D" in regex:
+            additional_symbols.update(NON_DIGIT_CHARS)
+        if "\\w" in regex:
+            additional_symbols.update(WORD_CHARS)
+        if "\\W" in regex:
+            additional_symbols.update(NON_WORD_CHARS)


I agree this could make things much cleaner, especially since this can be done in a loop 👍🏽
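One possible shape for the dict-based lookup, as a sketch only. The character sets below are simplified ASCII stand-ins built from the `string` module, and `SHORTHAND_SYMBOLS` and `shorthand_symbols` are hypothetical names; the real constants live in `automata/regex/parser.py` and may be defined differently.

```python
import string
from typing import Dict, FrozenSet, Set

# Stand-in character sets (assumption: ASCII only, for illustration).
PRINTABLE = frozenset(string.printable)
WHITESPACE_CHARS = frozenset(string.whitespace)
DIGIT_CHARS = frozenset(string.digits)
WORD_CHARS = frozenset(string.ascii_letters + string.digits + "_")

# Hypothetical lookup table: shorthand escape -> symbols it contributes.
SHORTHAND_SYMBOLS: Dict[str, FrozenSet[str]] = {
    "\\s": WHITESPACE_CHARS,
    "\\S": PRINTABLE - WHITESPACE_CHARS,
    "\\d": DIGIT_CHARS,
    "\\D": PRINTABLE - DIGIT_CHARS,
    "\\w": WORD_CHARS,
    "\\W": PRINTABLE - WORD_CHARS,
}


def shorthand_symbols(regex: str) -> Set[str]:
    """Collect the symbols implied by every shorthand present in the regex."""
    additional_symbols: Set[str] = set()
    for shorthand, chars in SHORTHAND_SYMBOLS.items():
        if shorthand in regex:
            additional_symbols.update(chars)
    return additional_symbols
```

The loop keeps the same substring check as the original chain of `if` statements, so behavior is unchanged while adding a new shorthand becomes a one-line table entry.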

eliotwrobson · 2025-05-06T21:23:15Z

automata/fa/nfa.py

+        from automata.regex.parser import (
+            DIGIT_CHARS,
+            NON_DIGIT_CHARS,
+            NON_WHITESPACE_CHARS,
+            NON_WORD_CHARS,
+            WHITESPACE_CHARS,
+            WORD_CHARS,
+        )


Yes, I think ruff will complain about the imports.

eliotwrobson · 2025-05-06T21:24:13Z

automata/fa/nfa.py

+            while pos < len(class_content):
+                if class_content[pos] == "\\" and pos + 1 < len(class_content):
+                    # Check for shorthand classes in character classes
+                    if class_content[pos + 1] == "s":


I think you might be able to use a lookup table from a dictionary here? Just use the character as a key and a tuple of additional symbols and position increment as the value.
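A sketch of that idea, assuming a simplified scanner that ignores ranges. `CLASS_SHORTHANDS` and `collect_class_symbols` are hypothetical names, and the character sets are ASCII stand-ins rather than the constants from the PR.

```python
import string
from typing import Dict, FrozenSet, Set, Tuple

# Hypothetical table: escaped character -> (symbols to add, position increment).
CLASS_SHORTHANDS: Dict[str, Tuple[FrozenSet[str], int]] = {
    "s": (frozenset(string.whitespace), 2),
    "d": (frozenset(string.digits), 2),
    "w": (frozenset(string.ascii_letters + string.digits + "_"), 2),
}


def collect_class_symbols(class_content: str) -> Set[str]:
    """Scan the inside of a character class; ranges are not handled here."""
    symbols: Set[str] = set()
    pos = 0
    while pos < len(class_content):
        if class_content[pos] == "\\" and pos + 1 < len(class_content):
            entry = CLASS_SHORTHANDS.get(class_content[pos + 1])
            if entry is not None:
                chars, step = entry
                symbols.update(chars)
                pos += step
                continue
            # In this sketch any other escaped character stands for itself.
            symbols.add(class_content[pos + 1])
            pos += 2
            continue
        symbols.add(class_content[pos])
        pos += 1
    return symbols
```

Because the lookup uses `dict.get` on a single character, it also sidesteps the `in` versus `==` question raised earlier in the review.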

eliotwrobson · 2025-05-06T21:25:16Z

automata/fa/nfa.py

+
+                        # Add all characters in the range to input symbols
+                        for i in range(ord(start_char), ord(end_char) + 1):
+                            class_symbols.add(chr(i))


I think this can be done with a python update call instead of a loop.
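For example, the loop could collapse into a single `set.update` call along these lines. The variable names follow the diff; the sample values are illustrative.

```python
class_symbols = {"_"}
start_char, end_char = "a", "e"

# Add every character in the inclusive range in one call.
class_symbols.update(map(chr, range(ord(start_char), ord(end_char) + 1)))
```

`set.update` accepts any iterable, so the `map` object can be passed directly without building an intermediate list.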

eliotwrobson · 2025-05-06T21:26:18Z

automata/regex/parser.py

+        "&": "&",
+    }
+
+    if char in escape_map:


Suggested change: replace `if char in escape_map:` with `return escape_map.get(char, char)`.
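A minimal sketch of the pattern behind this suggestion. Only the `"&": "&"` entry comes from the diff; the other map entries and the function name `handle_escape` are assumptions made for the example.

```python
def handle_escape(char: str) -> str:
    """Translate an escaped character, falling back to the character itself."""
    escape_map = {
        "n": "\n",  # assumed entry, for illustration
        "t": "\t",  # assumed entry, for illustration
        "&": "&",   # entry visible in the diff
    }
    # dict.get with a default replaces the membership test plus fallback branch.
    return escape_map.get(char, char)
```

With the default argument, unknown characters pass through unchanged, so no separate `if char in escape_map:` branch is needed.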

StefanosChaliasos · 2025-05-07T05:41:16Z

Thanks for the additional comments, will fix everything by the end of the week. Got busy with other stuff :)

StefanosChaliasos added 4 commits March 6, 2025 11:46

Add support for character classes [...]

7d8c4e1

Also, fix the quantifiers more expressive. I.e., now it supports: {,4}, {4}, {1,3}, {1,} instead of just {1,3} and {1,}

Support class characters when no input_symbols are given

7c291b6

Lint fixes and one more test

db5b6eb

Fixx issue with reserved characters inside character class

298f559

eliotwrobson changed the base branch from main to develop March 6, 2025 18:52

StefanosChaliasos and others added 2 commits March 6, 2025 21:08

Add support for escaped characters and properly handle special chars …

ce0a5ac

…like ('\')

Add missing annotation

00ac161

eliotwrobson requested changes Mar 6, 2025

Merge branch 'add-support-for-charclass' of https://github.com/Stefan…

227acca

…osChaliasos/automata into pr/250

StefanosChaliasos added 2 commits March 7, 2025 08:54

Add support for shorthands

f400c34

Allow reserved chars in input symbols and tokenize spaces

df53edc

We also added more complex tests

StefanosChaliasos marked this pull request as draft March 7, 2025 08:54

caleb531 reviewed Mar 14, 2025

eliotwrobson reviewed May 6, 2025
