Support capture groups with the RE/flex regex matcher? #95

Open
larkwiot opened this issue Dec 29, 2020 · 4 comments
Labels
enhancement New feature or request question A technical question that has or needs clarification

Comments

@larkwiot

Hello,

First off, I'm a huge fan of RE-flex for my projects and ugrep is by far my most valuable and most used tool at my workplace. I deal with a lot of text.

Out of all that usage, I consistently miss the capture group feature. I know it is supported with other regex libraries that RE-flex (and therefore ugrep) can use, but that is not the same as RE-flex supporting it directly. I love ugrep and RE-flex's speed and features, but it always hurts me that I have to compile with PCRE2 mode (-P) to get capture groups with ugrep and use Boost.Regex when directly using RE-flex. They're such a common thing to need--why does RE-flex not support them? I'd like to use it and only it so that I do not have to bother with Boost or PCRE2, and also for the performance benefits.

I know POSIX and other compatibility concerns are at play, but I don't see how supporting capture groups would violate them. The docs say that RE-flex supports lazy quantifiers, which are not part of POSIX, and it also supports non-capturing groups, but not capture groups. Why is this? It seems like it would be easy to add support for capturing groups considering that non-capturing groups are supported. It makes me think some design decision has already been made that rules them out, but I don't understand enough of the library to know what it is.

Thanks in advance for your patience if I am making an incorrect assumption or am unaware of some functionality that would do this. I'm also sorry if this has already been addressed in another issue or in the docs, but I could not find it here, in ugrep's GitHub project, or in the docs.

@genivia-inc
Member

genivia-inc commented Dec 29, 2020

Thank you for your feedback!

Note that Flex does not support group captures and backreferences.

I have group capture on the list of things to add to RE/flex.

The RE/flex matcher does identify which pattern among alternations matched, kind of like a "global group capture". For example, this|is|an|example will return 3 for a match of an. The same holds for (this)|(is)|(an)|(example), but we don't need parentheses with the RE/flex matcher to capture them. This is of course different with the PCRE2 and Boost.Regex matchers, which need grouping with parentheses for captures.
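To make the "which alternative matched" idea concrete, here is a minimal sketch that emulates it with the standard library's std::regex rather than with RE/flex. The function name matched_alternative is hypothetical, and the parentheses are needed here only because std::regex, like PCRE2 and Boost.Regex, reports alternatives through capture groups.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Returns the 1-based index of the alternative that matched the whole input,
// or 0 if no alternative matched. This uses std::regex (not RE/flex) purely
// to illustrate the idea of a "global group capture".
std::size_t matched_alternative(const std::string& input) {
  static const std::regex pattern("(this)|(is)|(an)|(example)");
  std::smatch m;
  if (!std::regex_match(input, m, pattern))
    return 0;
  // m[0] is the whole match; m[1]..m[4] are the four alternatives
  for (std::size_t i = 1; i < m.size(); ++i)
    if (m[i].matched)
      return i;
  return 0;
}
```

Calling matched_alternative("an") yields 3, which mirrors the index described above for a match of an.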

There are a few ways to support group captures in POSIX matching, which is a relatively recent research topic. It is not trivial, because POSIX DFAs translated from regex patterns do not encode position information from the original regex, so capturing parentheses get lost. My student and I researched alternative ways to implement capturing groups and we came up with a new method called staDFA, which we compared to TDFA. Eventually I will use one of these methods, and I will work on this soon.
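As a rough illustration of why parentheses get lost, consider a hand-built DFA for a+b+, which is the same automaton one would build for (a+)(b+). This is only a sketch with hypothetical names (step, longest_match) and is not how RE/flex represents its automata. The run tracks just a state number and the end of the longest accepted prefix, and nothing in the states or transitions says where a group opened or closed.

```cpp
#include <cstddef>
#include <string>

// Hypothetical hand-built DFA for a+b+ (the same automaton as for (a+)(b+)).
// States: 0 = start, 1 = reading a's, 2 = reading b's (accepting), 3 = dead.
int step(int state, char c) {
  switch (state) {
    case 0: return c == 'a' ? 1 : 3;
    case 1: return c == 'a' ? 1 : (c == 'b' ? 2 : 3);
    case 2: return c == 'b' ? 2 : 3;
    default: return 3;
  }
}

// Returns the length of the longest prefix of input accepted by the DFA,
// or 0 if no prefix matches. Only the state and the last accepting position
// are tracked; group boundaries are not recorded anywhere.
std::size_t longest_match(const std::string& input) {
  int state = 0;
  std::size_t last_accept = 0;
  for (std::size_t i = 0; i < input.size(); ++i) {
    state = step(state, input[i]);
    if (state == 3)
      break;                 // dead state: no longer match is possible
    if (state == 2)
      last_accept = i + 1;   // remember the longest accepted prefix so far
  }
  return last_accept;
}
```

For this tiny pattern the group boundary could be guessed from the transition between states 1 and 2, but in general, once a DFA is built from ambiguous or overlapping subpatterns, that correspondence is no longer available, which is presumably why the methods mentioned above add extra bookkeeping to the automaton.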

@genivia-inc genivia-inc added the enhancement New feature or request label Dec 29, 2020
@larkwiot
Author

Would it be possible to add capture groups more easily without POSIX support, for those of us who do not need it? Or even just in ugrep?

@genivia-inc
Member

genivia-inc commented Feb 20, 2022

This needs some clarification. With the PCRE matcher in RE/flex you can use capturing groups. Use named captures (?<name>pattern), backreference them with \g{name}, and extract the matched subpattern in C++ with:

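        // needs <cstring> for strcmp and <utility> for std::pair
        const char *name = "name"; // group name to look up, as in (?<name>pattern)
        std::pair<const char*,size_t> subpattern(NULL, 0); // subpattern in the input and its size
        std::pair<size_t,const char*> id = matcher.group_id(); // first matched group: (number, name or NULL)
        // skip matched groups that are unnamed or that have a different name
        while (id.first != 0 && (id.second == NULL || strcmp(id.second, name) != 0))
          id = matcher.group_next_id();
        if (id.first != 0)
          subpattern = matcher[id.first]; // found (name was matched)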

IMHO there is not a significant need (or any need whatsoever) to use group captures in lexical analyzers.

Same for ugrep. You can use group captures (both numeric and named) with option -P for Perl matching.

@larkwiot
Author

If you're replying to me, I mentioned that I do not want to use PCRE to get captures. I opened this issue in the hope of encouraging native support for captures.

I agree with your view that lexical analyzers do not need captures, but most of my use cases are not in lexical analyzers, so I need captures.

@genivia-inc genivia-inc changed the title Why does RE-flex not support native capture groups Support capture groups with the RE/flex regex matcher? Aug 8, 2024
@genivia-inc genivia-inc added the question A technical question that has or needs clarification label Aug 8, 2024