Subroutines breaking capture tokenizing inside of referenced capture group #164

Open
RedCMD opened this issue Jan 2, 2022 · 3 comments


RedCMD commented Jan 2, 2022

When trying to call a subroutine on a capture group via \\g<1>, the call will remove all the previous tokens from capture groups that aren't rechecked in the subroutine.

To reproduce, create a syntax highlighting extension with this grammar:

{
	"$schema": "https://raw.githubusercontent.com/martinring/tmlanguage/master/tmlanguage.json",
	"name": "Subroutines Syntax",
	"scopeName": "source.redcmd.syntax.subroutines",
	"patterns": [
		{ "include": "#subroutines" }
	],
	"repository": {
		"subroutines": {
			"match": "((a)|(b)|(c)|(d))-\\g<1>",
			"captures": {
				"2": { "name": "strong variable.other.constant" },
				"3": { "name": "strong keyword.control" },
				"4": { "name": "strong support.type" },
				"5": { "name": "strong constant.character.escape" }
			}
		}
	}
}

image

The expected outcome is that all text of the form [abcd]-[abcd] is highlighted:

a-a
a-b
a-c
a-d
b-a
b-b
b-c
b-d
c-a
c-b
c-c
c-d
d-a
d-b
d-c
d-d

Like so:
image

Instead, the subroutine call purges the tokens of every capture group (among groups 2 to 5) that is not rematched during the call, i.e. every alternative that fails inside the subroutine.
image


RedCMD commented Jan 13, 2022

Another way to see it is to create a highlighter like this:
image

"match": "(A)(B)(C)(D)(E)(F)(G)(H)(I)(J)\\g<6>?(K)(L)(M)(N)(O)(P)",
"captures": {
	"1":  { "name": "markup.underline invalid" },
	"2":  { "name": "markup.underline string.regexp" },
	"3":  { "name": "markup.underline string" },
	"4":  { "name": "markup.underline constant.character.escape" },
	"5":  { "name": "markup.underline support.function" },
	"6":  { "name": "markup.underline constant.numeric" },
	"7":  { "name": "markup.underline comment" },
	"8":  { "name": "markup.underline support.type" },
	"9":  { "name": "markup.underline variable" },
	"10": { "name": "markup.underline variable.other.constant" },
	"11": { "name": "markup.underline keyword" },
	"12": { "name": "markup.underline punctuation.definition.list.begin.markdown" },
	"13": { "name": "markup.underline header" },
	"14": { "name": "markup.underline constant.regexp" },
	"15": { "name": "markup.underline keyword.control" },
	"16": { "name": "markup.underline punctuation.definition.tag" }
}

and a test file containing: ABCDEFGHIJKLMNOP
It should then colour the letters like so:
image
This input does not trigger the subroutine call \\g<6> (which is optional), so it works fine.

But if you insert an F between J and K (ABCDEFGHIJFKLMNOP), the call is made and breaks the tokenization of every group ((F)(G)(H)(I)(J)) between (F) (group 6) and the caller \\g<6>.
image
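One way to see what is unusual about the second input is to look at where each capture slot ends up pointing. The sketch below is only an emulation, not a diagnosis of vscode-textmate: JS RegExp has no \g<6>, so the optional call is inlined as an extra group and folded back into slot 6 by hand, on the assumption (described later in this thread) that an Oniguruma subroutine to the right of its group overwrites that group's capture. The names `inlined`, `slotStarts`, and `isNonDecreasing` are hypothetical.

```typescript
// Emulation only: the optional \g<6> call is inlined as JS group 11.
const inlined = /(A)(B)(C)(D)(E)(F)(G)(H)(I)(J)(F)?(K)(L)(M)(N)(O)(P)/;

// Start offset held by each of the 16 capture slots, or -1 if the slot is empty.
function slotStarts(text: string): number[] {
  const m = inlined.exec(text);
  if (m === null) return [];
  const starts: number[] = [];
  let offset = m.index;
  // The pattern is a plain concatenation of groups, so each group starts
  // where the previous participating group ended.
  for (let g = 1; g < m.length; g++) {
    const value: string | undefined = m[g];
    starts.push(value === undefined ? -1 : offset);
    offset += value === undefined ? 0 : value.length;
  }
  // Remove the inlined call (JS group 11, array index 10) and, if it matched,
  // let it overwrite slot 6 (array index 5). This overwrite is the assumption.
  const callStart = starts.splice(10, 1)[0] ?? -1;
  if (callStart !== -1) starts[5] = callStart;
  return starts;
}

function isNonDecreasing(xs: number[]): boolean {
  let prev = -Infinity;
  for (const x of xs) {
    if (x < prev) return false;
    prev = x;
  }
  return true;
}

console.log(isNonDecreasing(slotStarts("ABCDEFGHIJKLMNOP")));  // true
console.log(isNonDecreasing(slotStarts("ABCDEFGHIJFKLMNOP"))); // false
```

Under this model, once the call fires, slot 6 points at offset 10 while slots 7 to 10 still point at offsets 6 to 9, so the captures are no longer in left-to-right order. Whether that ordering is what trips up the tokenizer is not established here.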

This is extremely annoying when you have to copy and paste large amounts of the same regex over and over again instead of just being able to call back into it.
And you can't just set the code off to the side and never have it run:
the subroutine call will still manage to break itself.


RedCMD commented Nov 5, 2023

dup:
#127
#208

RedCMD referenced this issue in RedCMD/TmLanguage-Syntax-Highlighter Nov 19, 2023
jtbandes added a commit to jtbandes/swift-tmlanguage that referenced this issue Jan 18, 2025

More work towards #16

If we wanted to capture the `{` `}` delimiters with some scopes, I think
we might run into
microsoft/vscode-textmate#164 & related
issues. But for now we aren't highlighting them so it's fine.

---------

Co-authored-by: RedCMD <[email protected]>

slevithan commented Jan 19, 2025

The call will remove all the previous tokens from capture groups that aren't rechecked in the subroutine.

The above line is a nice concise description, but the meaning is subtle and probably not obvious to all readers, since...

  • The behavior for both nonparticipating capturing groups and subroutines can be very complex/nuanced, and varies across regex flavors.
  • You didn't call out that it definitely must do this for captures that participate via the subroutine; the only question is about captures that didn't re-participate.

Also, it might help to describe the behavior outside of the context of TM grammars, and just focus on the subpattern results (and what they should be).

Context: Unlike the subroutines from PCRE/Perl/Regex+ (which are arguably easier to reason about and more useful), Oniguruma subroutines replace the captured values of groups they reference (and captures created within the contents of groups they reference) if the subroutine occurs to the right of the referenced group. In other words, ((.))\g<1> and \g<1>((.)) not only match the same strings, but the returned subpattern matches are also the same and in both cases will be the second character matched. In other other words, if tested against the target string 'ab', capture 0 (the overall match) is ab, and captures 1 and 2 are b (there is no capture 3 or 4 since subroutines don't add additional capture indexes).

Thus you can think of any captures formed directly/indirectly by any number of subroutines as sharing the capture slots of the original capturing groups. And whichever capture (that's part of a set with subroutine/s) that last participated in the match overwrites the captured value in the shared slots. (In fact it gets hairier than this in edge cases with references to indirectly created duplicate named groups, but let's leave that aside.)

@RedCMD, the behavior you're describing would make perfect sense (and not be a bug) if the more-recently-participating subroutine always overwrote the captured values for all captures within its contents. E.g., when ((a)|(b)|(c)|(d))-\g<1> matched a-b, in the end groups 2, 4, and 5 would be nonparticipating, which with JS RegExps would make their subpattern value on match results undefined, and in vscode-oniguruma would make their start and end captureIndices use the value 4294967295.
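To make the two representations of a nonparticipating group concrete, here is a minimal sketch that converts start/end pairs into JS-style subpattern values. The `CaptureIndex` interface is an assumed local shape modeled on the start and end values mentioned above, not an import from vscode-oniguruma, and the `sample` data is hand-written for the "overwrite everything" scenario rather than real engine output.

```typescript
// Assumed local shape; not imported from vscode-oniguruma.
interface CaptureIndex {
  start: number;
  end: number;
}

// 4294967295 is 2**32 - 1, the value reported above for nonparticipating captures.
const NONPARTICIPATING = 0xffffffff;

// Map each capture to its substring, or undefined if it did not participate.
function toSubpatterns(text: string, captures: CaptureIndex[]): (string | undefined)[] {
  return captures.map(({ start, end }) =>
    start === NONPARTICIPATING || end === NONPARTICIPATING
      ? undefined
      : text.slice(start, end)
  );
}

// Hand-written sample for 'a-b' in which groups 2, 4, and 5 are nonparticipating.
const sample: CaptureIndex[] = [
  { start: 0, end: 3 },
  { start: 2, end: 3 },
  { start: NONPARTICIPATING, end: NONPARTICIPATING },
  { start: 2, end: 3 },
  { start: NONPARTICIPATING, end: NONPARTICIPATING },
  { start: NONPARTICIPATING, end: NONPARTICIPATING },
];

console.log(toSubpatterns("a-b", sample));
// [ 'a-b', 'b', undefined, 'b', undefined, undefined ]
```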

But, it sounds like you're saying that although subroutines should continue to overwrite values for captures within their contents that participate in the match (you didn't state that part, but it is required to be correct), captures that don't participate via the subroutine match should not overwrite previously captured values from captures that did participate.

Let me show an example of an expected match result in JS RegExp terms since that's easier to show... From my understanding of the problem as you described it, you're saying that when ((a)|(b)|(c)|(d))-\g<1> matches a-b, the match result array should be ['a-b', 'b', 'a', 'b', undefined, undefined], and not what you're getting, which is ['a-b', 'b', undefined, 'b', undefined, undefined]. Is that right?
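The two candidate results can be reproduced with a small emulation. JS RegExp has no subroutines, so this sketch inlines the call as a second copy of group 1 (JS groups 6 to 10) and then folds those values back into slots 1 to 5 under either policy. The `inlined` pattern, the `mergeSlots` helper, and its `keepStale` flag are hypothetical names for illustration; nothing here calls Oniguruma.

```typescript
type Capture = string | undefined;

// Emulation only: \g<1> is written out as a second copy of group 1.
const inlined = /((a)|(b)|(c)|(d))-((a)|(b)|(c)|(d))/;

// Fold the inlined call's captures (groups 6..10) into the shared slots 1..5.
// keepStale = true:  a slot the call did not rematch keeps its earlier value.
// keepStale = false: a slot the call did not rematch is cleared.
function mergeSlots(m: RegExpExecArray, keepStale: boolean): Capture[] {
  const original: Capture[] = m.slice(1, 6);
  const viaCall: Capture[] = m.slice(6, 11);
  const merged = original.map((prev, i) => {
    const next = viaCall[i];
    if (next !== undefined) return next; // participating captures always overwrite
    return keepStale ? prev : undefined;
  });
  return [m[0], ...merged];
}

const match = inlined.exec("a-b");
if (match !== null) {
  console.log(mergeSlots(match, true));  // [ 'a-b', 'b', 'a', 'b', undefined, undefined ]
  console.log(mergeSlots(match, false)); // [ 'a-b', 'b', undefined, 'b', undefined, undefined ]
}
```

With `keepStale` set to true the output is the array described above as expected; with false it is the array described as what is currently observed.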

I just tested via the vscode-oniguruma wrapper, and it gives the results that you expect, which I assume means you're right that this is a bug in vscode-textmate. But are you certain that the bug is in vscode-textmate and not in vscode-oniguruma? As I explained above, either behavior could be considered correct. I'm not easily able to test with Oniguruma in C code directly; but yeah, since the potential bug is very subtle, it should probably first be verified that this is handled differently within Oniguruma (via direct use in C) and/or in alternative TextMate grammar engines.
