Subroutines breaking capture tokenizing inside of referenced capture group #164

RedCMD · 2022-01-02T12:09:37Z

When a subroutine is called on a capture group via \\g<1>, the call will remove all the previous tokens from capture groups that aren't rechecked in the subroutine.

Create a syntax highlighting extension with this code:

{
	"$schema": "https://raw.githubusercontent.com/martinring/tmlanguage/master/tmlanguage.json",
	"name": "Subroutines Syntax",
	"scopeName": "source.redcmd.syntax.subroutines",
	"patterns": [
		{ "include": "#subroutines" }
	],
	"repository": {
		"subroutines": {
			"match": "((a)|(b)|(c)|(d))-\\g<1>",
			"captures": {
				"2": { "name": "strong variable.other.constant" },
				"3": { "name": "strong keyword.control" },
				"4": { "name": "strong support.type" },
				"5": { "name": "strong constant.character.escape" }
			}
		}
	}
}

The expected outcome is that it highlights all text in the format [abcd]-[abcd]:

a-a
a-b
a-c
a-d
b-a
b-b
b-c
b-d
c-a
c-b
c-c
c-d
d-a
d-b
d-c
d-d

Like so:

But instead, all tokens attached to capture groups that are not rematched in the subroutine call (because they fail there) get purged.
This affects capture groups 2 to 5.


RedCMD · 2022-01-13T03:59:48Z

Another way to see it is to create a highlighter like this:

"match": "(A)(B)(C)(D)(E)(F)(G)(H)(I)(J)\\g<6>?(K)(L)(M)(N)(O)(P)",
"captures": {
	"1":  { "name": "markup.underline invalid" },
	"2":  { "name": "markup.underline string.regexp" },
	"3":  { "name": "markup.underline string" },
	"4":  { "name": "markup.underline constant.character.escape" },
	"5":  { "name": "markup.underline support.function" },
	"6":  { "name": "markup.underline constant.numeric" },
	"7":  { "name": "markup.underline comment" },
	"8":  { "name": "markup.underline support.type" },
	"9":  { "name": "markup.underline variable" },
	"10": { "name": "markup.underline variable.other.constant" },
	"11": { "name": "markup.underline keyword" },
	"12": { "name": "markup.underline punctuation.definition.list.begin.markdown" },
	"13": { "name": "markup.underline header" },
	"14": { "name": "markup.underline constant.regexp" },
	"15": { "name": "markup.underline keyword.control" },
	"16": { "name": "markup.underline punctuation.definition.tag" }
}

and a test file containing: ABCDEFGHIJKLMNOP
It should then colour the letters like so:

This does not trigger the subroutine \\g<6> (which is optional) and thus works fine.

But if you insert an F in between J and K, the call is made and breaks all tokenization ((F)(G)(H)(I)(J)) between (F) (group 6) and the caller \\g<6>.

This is extremely annoying when you have to copy and paste large amounts of the same regex over and over again instead of just being able to call back into the existing pattern.
And you can't just set the code off to the side and never have it run:
the subroutine call will still manage to break itself.
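One way to avoid hand-copying the regex is to generate the rule from a single shared fragment. The following is only a sketch under assumptions: it is written in Python, the names `LETTER`, `SCOPES` and `build_rule` are hypothetical, and it assumes a build step exists that writes the grammar JSON. It is not necessarily what any existing workaround does. It inlines a second copy of the group in place of \\g<1>, so the copy gets its own capture numbers (7 to 10) and its own scope entries.

```python
import json
import re

# Hypothetical shared fragment: written once, spliced in twice.
LETTER = "(a)|(b)|(c)|(d)"

# Scope names taken from the grammar in this issue, in group order.
SCOPES = [
    "strong variable.other.constant",
    "strong keyword.control",
    "strong support.type",
    "strong constant.character.escape",
]


def build_rule():
    """Build a match rule where the subroutine call is replaced by a copy."""
    # Groups 1-5 belong to the first copy, groups 6-10 to the second copy.
    match = "(" + LETTER + ")-(" + LETTER + ")"
    captures = {}
    for first_inner_group in (2, 7):
        for i, scope in enumerate(SCOPES):
            captures[str(first_inner_group + i)] = {"name": scope}
    return {"match": match, "captures": captures}


rule = build_rule()
print(json.dumps(rule, indent="\t"))
```

The cost is that the captures table doubles in size, but no capture slot is shared between the two halves of the match, so nothing can be overwritten.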


RedCMD · 2023-11-05T07:45:24Z

dup:
#127
#208

More work towards #16. If we wanted to capture the `{` `}` delimiters with some scopes, I think we might run into microsoft/vscode-textmate#164 & related issues. But for now we aren't highlighting them so it's fine. Co-authored-by: RedCMD

slevithan · 2025-01-19T00:16:26Z

> The call will remove all the previous tokens from capture groups that aren't rechecked in the subroutine.

The above line is a nice concise description, but the meaning is subtle and probably not obvious to all readers, since:

- The behavior for both nonparticipating capturing groups and subroutines can be very complex/nuanced, and varies across regex flavors.
- You didn't call out that it definitely must do this for captures that participate via the subroutine; the only question is about captures that didn't re-participate.

Also, it might help to describe the behavior outside of the context of TM grammars, and just focus on the subpattern results (and what they should be).

Context: Unlike the subroutines from PCRE/Perl/Regex+ (which are arguably easier to reason about and more useful), Oniguruma subroutines replace the captured values of groups they reference (and captures created within the contents of groups they reference) if the subroutine occurs to the right of the referenced group. In other words, ((.))\g<1> and \g<1>((.)) not only match the same strings, but the returned subpattern matches are also the same and in both cases will be the second character matched. In other other words, if tested against the target string 'ab', capture 0 (the overall match) is ab, and captures 1 and 2 are b (there is no capture 3 or 4 since subroutines don't add additional capture indexes).

Thus you can think of any captures formed directly/indirectly by any number of subroutines as sharing the capture slots of the original capturing groups. And whichever capture (that's part of a set with subroutine/s) that last participated in the match overwrites the captured value in the shared slots. (In fact it gets hairier than this in edge cases with references to indirectly created duplicate named groups, but let's leave that aside.)
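To make the shared-slot model concrete, here is a minimal sketch in Python. It is an emulation, not Oniguruma itself: Python's `re` has no \g<1> subroutine call, so the call is expanded by hand into a second copy of the group, and the hypothetical `SLOT_OF_GROUP` table maps each expanded group back to the slot it would share. The same expansion stands in for both ((.))\g<1> and \g<1>((.)), since in both the copy that matches second writes last.

```python
import re

# Hand-expanded stand-in for ((.))\g<1> and for \g<1>((.)).
EXPANDED = r"((.))((.))"

# Expanded group number -> shared capture slot (an assumption of this sketch).
SLOT_OF_GROUP = {1: 1, 2: 2, 3: 1, 4: 2}


def shared_slots(text):
    """Return [capture 0, capture 1, capture 2] under last-writer-wins."""
    m = re.fullmatch(EXPANDED, text)
    if m is None:
        return None
    slots = {0: m.group(0)}
    # Groups are visited left to right, so a later participant overwrites
    # an earlier one that shares its slot.
    for group, slot in SLOT_OF_GROUP.items():
        value = m.group(group)
        if value is not None:
            slots[slot] = value
    return [slots.get(i) for i in range(3)]


print(shared_slots("ab"))  # ['ab', 'b', 'b']
```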

@RedCMD, the behavior you're describing would make perfect sense (and not be a bug) if the more-recently-participating subroutine always overwrote the captured values for all captures within its contents. E.g., when ((a)|(b)|(c)|(d))-\g<1> matched a-b, in the end groups 2, 4, and 5 would be nonparticipating, which with JS RegExps would make their subpattern value on match results undefined, and in vscode-oniguruma would make their start and end captureIndices use the value 4294967295.

But, it sounds like you're saying that although subroutines should continue to overwrite values for captures within their contents that participate in the match (you didn't state that part, but it is required to be correct), captures that don't participate via the subroutine match should not overwrite previously captured values from captures that did participate.

Let me show an example of an expected match result in JS RegExp terms since that's easier to show... From my understanding of the problem as you described it, you're saying that when ((a)|(b)|(c)|(d))-\g<1> matches a-b, the match result array should be ['a-b', 'b', 'a', 'b', undefined, undefined], and not what you're getting, which is ['a-b', 'b', undefined, 'b', undefined, undefined]. Is that right?
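The two candidate results can be reproduced with a small emulation. This is a sketch under assumptions: it uses Python's `re`, which has no subroutine calls, so \g<1> is expanded by hand into a second copy of group 1 (groups 6 to 10), and `None` plays the role of JS `undefined`. The function name `captures` and the flag `purge_nonparticipating` are hypothetical; the flag selects which of the two behaviors discussed above is applied when the slots are merged.

```python
import re

# \g<1> expanded by hand: groups 1-5 are the original group, 6-10 the call.
EXPANDED = r"((a)|(b)|(c)|(d))-((a)|(b)|(c)|(d))"
SLOTS = 5  # capture slots 1..5 of the original pattern


def captures(text, purge_nonparticipating):
    """Merge the call's captures into the original slots under one policy."""
    m = re.fullmatch(EXPANDED, text)
    if m is None:
        return None
    result = [m.group(0)]
    for slot in range(1, SLOTS + 1):
        original = m.group(slot)          # captured by the group itself
        via_call = m.group(slot + SLOTS)  # captured through the call
        if via_call is not None or purge_nonparticipating:
            # The call overwrites the slot, even with "did not participate".
            result.append(via_call)
        else:
            # The call did not participate here: keep the earlier capture.
            result.append(original)
    return result


print(captures("a-b", purge_nonparticipating=True))
# ['a-b', 'b', None, 'b', None, None]
print(captures("a-b", purge_nonparticipating=False))
# ['a-b', 'b', 'a', 'b', None, None]
```

The first result corresponds to the array reported as the actual behavior, the second to the array described as expected.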

I just tested via the vscode-oniguruma wrapper, and it gives the results that you expect, which I assume means you're right that this is a bug in vscode-textmate. But are you certain that the bug is in vscode-textmate and not in vscode-oniguruma? As I explained above, either behavior could be considered correct. I'm not easily able to test with Oniguruma in C code directly; but yeah, since the potential bug is very subtle, it should probably first be verified that this is handled differently within Oniguruma (via direct use in C) and/or in alternative TextMate grammar engines.

jtbandes mentioned this issue Nov 5, 2023

Repository works only when it defined at top level of grammar file #140

Open

jtbandes added a commit to jtbandes/swift-tmlanguage that referenced this issue Nov 5, 2023

Fix backreference and subpattern highlighting in VS Code

efe7774

Workaround for microsoft/vscode-textmate#164 and similar issues

jtbandes mentioned this issue Nov 5, 2023

Update Swift grammar and upstream repository microsoft/vscode#197470

Merged

RedCMD referenced this issue in RedCMD/TmLanguage-Syntax-Highlighter Nov 19, 2023

Fix character class range bug and improve \\x{}&\\o{} code points

f48d1bd

This was referenced Oct 5, 2024

Scopes on Recursive Regex Cause Problems #208

Open

multiply applied capture groups seems to ignore some captures #127

Open

jtbandes mentioned this issue Jan 18, 2025

Remove conditionals and use recursion for regex InterpolatedCallout jtbandes/swift-tmlanguage#20

Merged
