regexp/syntax: [a-y] parses as [a-z] with (?i) #73456

Closed
rgmz opened this issue Apr 21, 2025 · 8 comments
Labels
BugReport Issues describing a possible bug in the Go implementation. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. WorkingAsIntended Issues describing something that is working as it is supposed to.

Comments

@rgmz

rgmz commented Apr 21, 2025

Go version

go version go1.24.2 linux/amd64

Output of go env in your module/workspace:

`go env` output
AR='ar'
CC='gcc'
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_ENABLED='1'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
CXX='g++'
GCCGO='gccgo'
GO111MODULE='on'
GOAMD64='v1'
GOARCH='amd64'
GOAUTH='netrc'
GOBIN=''
GOCACHE='/home/me/.cache/go-build'
GOCACHEPROG=''
GODEBUG=''
GOENV='/home/me/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFIPS140='off'
GOFLAGS=''
GOGCCFLAGS='-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build3280803028=/tmp/go-build -gno-record-gcc-switches'
GOHOSTARCH='amd64'
GOHOSTOS='linux'
GOINSECURE=''
GOMOD='/home/me/dev/github.com/gitleaks/gitleaks/go.mod'
GOMODCACHE='/home/me/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPATH='/home/rich/go'
GOPRIVATE=''
GOPROXY='direct'
GOROOT='/usr/lib/golang'
GOSUMDB='off'
GOTELEMETRY='local'
GOTELEMETRYDIR='/home/me/.config/go/telemetry'
GOTMPDIR=''
GOTOOLCHAIN='local'
GOTOOLDIR='/usr/lib/golang/pkg/tool/linux_amd64'
GOVCS=''
GOVERSION='go1.24.2'
GOWORK=''
PKG_CONFIG='pkg-config'

What did you do?

I was writing test cases to detect "strange" character class ranges (e.g., a user writes =-_ and inadvertently creates a range). There appears to be a bug in how syntax.Parse handles character ranges in conjunction with (?i).

https://go.dev/play/p/2p8l3zJi-4T

Note: the [a-y] in the title stands for any range other than a-z.

What did you see happen?

The pattern (?i)[a-l0-9=-_] is parsed with the range a-z rather than the expected a-l.

Pattern is: [0-9=-_a-zſK]
Pattern is: [0-9=A-L_a-lK]
Pattern is: [0-9=-_a-l]
Pattern is: [0-9=A-L_a-l]

What did you expect to see?

All patterns are printed with the range a-l.

Pattern is: [0-9=-_a-lſK]
Pattern is: [0-9=A-L_a-lK]
Pattern is: [0-9=-_a-l]
Pattern is: [0-9=A-L_a-l]
@gabyhelp gabyhelp added the BugReport Issues describing a possible bug in the Go implementation. label Apr 21, 2025
@Hiworle

Hiworle commented Apr 21, 2025

This may not be a bug. In the first example, (?i)[a-l0-9=-_] is parsed as [0-9=-_a-zſK] because, in ASCII, the range from = (61) to _ (95) happens to include A (65) through Z (90) (in JS, /[0-9=-_a-z]/i.test('z') also returns true). You might replace - with \- to get the correct result:
https://go.dev/play/p/761NDru9plV
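
To make the difference concrete, here is a minimal sketch (not the code behind either playground link; the helper name `matches` is made up for illustration) that compares the unescaped and escaped forms of the class against a few single-character inputs:

```go
package main

import (
	"fmt"
	"regexp"
)

// matches is a hypothetical helper: it reports whether s contains a match
// for pattern. The inputs below are single characters, so this amounts to
// asking whether the character is a member of the class.
func matches(pattern, s string) bool {
	return regexp.MustCompile(pattern).MatchString(s)
}

func main() {
	// Code points of the range endpoints and of the letters between them.
	fmt.Printf("= is %d, A is %d, Z is %d, _ is %d\n", '=', 'A', 'Z', '_')

	unescaped := `(?i)[a-l0-9=-_]` // "=-_" is read as a range from = to _
	escaped := `(?i)[a-l0-9=\-_]`  // "=", "-" and "_" are three literals

	for _, s := range []string{"z", "Z", "-", "l"} {
		fmt.Printf("%q: unescaped=%v escaped=%v\n",
			s, matches(unescaped, s), matches(escaped, s))
	}
}
```

With the unescaped form, "z" is accepted because Z lies inside the =-_ range and (?i) folds it to lowercase, while a literal "-" is not accepted at all. With the escaped form the situation is reversed.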

@rgmz
Author

rgmz commented Apr 21, 2025

This may not be a bug. In the first example, (?i)[a-l0-9=-_] is parsed as [0-9=-_a-zſK] because, in ASCII, the range from = (61) to _ (95) happens to include A (65) through Z (90) (in JS, /[0-9=-_a-z]/i.test('z') also returns true).

I considered that. I'd expect [a-l0-9=-_] (no (?i)) to behave similarly, yet it exhibits different behaviour.

You might replace - with \- to get the correct result: https://go.dev/play/p/761NDru9plV

- is important for the context of the issue.

@Hiworle

Hiworle commented Apr 21, 2025

The difference is whether it includes l-z.

@rgmz
Author

rgmz commented Apr 21, 2025

The difference is whether it includes l-z.

I don't follow.

It seems unexpected that (?i)[a-l0-9=-_] => [0-9=-_a-zſK] and [a-lA-L0-9=-_] => [0-9=-_a-l] produce different outputs — ignoring the ſK symbols added for case folding.

@adonovan
Member

Consider the minimal example (?i)[=-_]. The last three chars are not literals--the middle one is a metacharacter--so the whole is parsed as a range of the form p-q. In this case, = U+003D is the start and _ U+005F is the end. That interval includes the uppercase ASCII letters A-Z. So, the case folding of (?i) augments the pattern to include a-zſK as well.

It may be confusing that without (?i), the pattern is printed as just [=-_], making the interval (which includes A-Z) hard to see. But I think it is working as intended.
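
One way to make the interval visible is to look at the parsed rune ranges instead of the printed pattern. The following is only a sketch: `classRanges` and `contains` are hypothetical helpers, and it assumes the pattern parses to a single character class, whose Rune field holds lo/hi pairs.

```go
package main

import (
	"fmt"
	"regexp/syntax"
)

// classRanges parses pattern, which is assumed to be a single character
// class, and returns its inclusive [lo, hi] rune ranges.
func classRanges(pattern string) ([][2]rune, error) {
	re, err := syntax.Parse(pattern, syntax.Perl)
	if err != nil {
		return nil, err
	}
	if re.Op != syntax.OpCharClass {
		return nil, fmt.Errorf("%q did not parse to a character class", pattern)
	}
	var out [][2]rune
	for i := 0; i+1 < len(re.Rune); i += 2 {
		out = append(out, [2]rune{re.Rune[i], re.Rune[i+1]})
	}
	return out, nil
}

// contains reports whether r falls inside any of the ranges.
func contains(ranges [][2]rune, r rune) bool {
	for _, rg := range ranges {
		if rg[0] <= r && r <= rg[1] {
			return true
		}
	}
	return false
}

func main() {
	for _, p := range []string{`[=-_]`, `(?i)[=-_]`} {
		ranges, err := classRanges(p)
		if err != nil {
			fmt.Println(err)
			continue
		}
		fmt.Printf("%s:", p)
		for _, rg := range ranges {
			fmt.Printf(" U+%04X-U+%04X", rg[0], rg[1])
		}
		// 'A' is inside the interval either way; 'a' only after folding.
		fmt.Printf(" (has A: %v, has a: %v)\n",
			contains(ranges, 'A'), contains(ranges, 'a'))
	}
}
```

Printing the endpoints as code points shows that the =-_ interval covers A-Z even without (?i), which the printed form [=-_] hides.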

@JunyangShao JunyangShao added the WorkingAsIntended Issues describing something that is working as it is supposed to. label Apr 21, 2025
@JunyangShao
Contributor

JunyangShao commented Apr 21, 2025

@rgmz Does Alan's comment answer your question? If it's WAI feel free to close this issue, thank you 👍
@matloob

@JunyangShao JunyangShao added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Apr 21, 2025
@adonovan adonovan closed this as not planned Apr 21, 2025
@rgmz
Author

rgmz commented Apr 26, 2025

Thanks for your insights, @adonovan!

5 participants