Support capture groups with the RE/flex regex matcher? #95

larkwiot · 2020-12-29T15:59:34Z

Hello,

First off, I'm a huge fan of RE-flex for my projects and ugrep is by far my most valuable and most used tool at my workplace. I deal with a lot of text.

Out of all that usage, I consistently miss the capture group feature. I know it is supported by other regex libraries that RE-flex (and therefore ugrep) can use, but that is not the same as RE-flex supporting it directly. I love the speed and features of ugrep and RE-flex, but it always hurts that I have to fall back on PCRE2 mode (-P) to get capture groups with ugrep, and on Boost.Regex when using RE-flex directly. Capture groups are such a common thing to need, so why does RE-flex not support them? I'd like to use RE-flex alone so that I do not have to bother with Boost or PCRE2, and also for the performance benefits.

I know POSIX and other compatibility concerns are at play, but I don't see how supporting capture groups would violate them. The docs say that RE-flex supports lazy quantifiers, which are not part of POSIX, and it also supports non-capturing groups, but not capture groups. Why is this? It seems like it would be easy to add capturing groups given that non-capturing groups are already supported. That makes me think some earlier design decision rules them out, but I don't understand enough of the library to know what it is.

Thanks in advance for your patience if I am making an incorrect assumption or am unaware of some functionality that would do this. I'm also sorry if this has already been addressed in another issue or in the docs, but I could not find it here, in ugrep's GitHub project, or in the docs.

genivia-inc · 2020-12-29T17:53:15Z

Thank you for your feedback!

Note that Flex does not support group captures and backreferences.

I have group capture on the list of things to add to RE/flex.

The RE/flex matcher does identify which pattern among the alternations matched, somewhat like a "global group capture". For example, this|is|an|example will return 3 for a match of an. The same holds for (this)|(is)|(an)|(example), but the RE/flex matcher does not need parentheses to capture them. This is of course different with the PCRE2 and Boost.Regex matchers, which need grouping with parentheses for captures.

There are a few ways to support group captures in POSIX matching, which is a more recent research topic. It is not trivial, because POSIX DFAs translated from regex patterns do not encode the position information of the original regex, so the capturing parentheses get lost. My student and I researched alternative ways to implement capturing groups and came up with a new method called staDFA, which we compared to TDFA. Eventually I will use one of these methods, and I will work on this soon.

larkwiot · 2020-12-29T18:35:33Z

Would it be possible to add capture groups more easily without POSIX support, for those of us who do not need it? Or even just in ugrep?

genivia-inc · 2022-02-20T17:00:09Z

This needs some clarification. With the PCRE matcher in RE/flex you can use capturing groups. Use named captures of the form (?<name>pattern), backreference them with \g{name}, and extract the matched subpattern in C++ with:

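        // requires <cstring> for strcmp and <utility> for std::pair
        const char *name = "name"; // group name to look up, as in (?<name>pattern)
        std::pair<const char*,size_t> subpattern(NULL, 0); // subpattern in the input and its size
        std::pair<size_t,const char*> id = matcher.group_id(); // first group that matched: (number, name)
        // skip matched groups that are unnamed or have a different name
        while (id.first != 0 && (id.second == NULL || strcmp(id.second, name) != 0))
          id = matcher.group_next_id();
        if (id.first != 0)
          subpattern = matcher[id.first]; // found (name was matched)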

IMHO there is not a significant need (or any need whatsoever) to use group captures in lexical analyzers.

Same for ugrep. You can use group captures (both numeric and named) with option -P for Perl matching.

larkwiot · 2022-02-20T17:57:57Z

If you're replying to me, I mentioned that I do not want to use PCRE to get captures. I opened this issue in the hope of encouraging native support for captures.

I agree with your view that lexical analyzers do not need captures, but most of my use cases are not in lexical analyzers, so I need captures.

genivia-inc added the enhancement New feature or request label Dec 29, 2020

genivia-inc mentioned this issue Jun 26, 2021

Question about performance and syntax Genivia/ugrep#141

Closed

genivia-inc changed the title ~~Why does RE-flex not support native capture groups~~ Support capture groups with the RE/flex regex matcher? Aug 8, 2024

genivia-inc added the question A technical question that has or needs clarification label Aug 8, 2024
