Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: nil dereference when storing field of non-nil struct value #71857

Open
chenjie199234 opened this issue Feb 20, 2025 · 24 comments
Open
Assignees
Labels
BugReport Issues describing a possible bug in the Go implementation. compiler/runtime Issues related to the Go compiler and/or runtime. Critical A critical problem that affects the availability or correctness of production systems built using Go NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@chenjie199234
Copy link

chenjie199234 commented Feb 20, 2025

Go version

1.24.0

go env
set AR=ar
set CC=gcc
set CGO_CFLAGS=-O2 -g
set CGO_CPPFLAGS=
set CGO_CXXFLAGS=-O2 -g
set CGO_ENABLED=1
set CGO_FFLAGS=-O2 -g
set CGO_LDFLAGS=-O2 -g
set CXX=g++
set GCCGO=gccgo
set GO111MODULE=
set GOAMD64=v1
set GOARCH=amd64
set GOAUTH=netrc
go: stripping unprintable or unescapable characters from %"GOBIN"%
set GOBIN=C:\Users\������\Desktop\Go\bin
go: stripping unprintable or unescapable characters from %"GOCACHE"%
set GOCACHE=C:\Users\������\Desktop\Go\cache
set GOCACHEPROG=
set GODEBUG=
go: stripping unprintable or unescapable characters from %"GOENV"%
set GOENV=C:\Users\������\AppData\Roaming\go\env
set GOEXE=.exe
set GOEXPERIMENT=
set GOFIPS140=off
set GOFLAGS=
go: stripping unprintable or unescapable characters from %"GOGCCFLAGS"%
set GOGCCFLAGS=-m64 -mthreads -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=C:\Users\������\AppData\Local\Temp\go-build2224281176=/tmp/go-build -gno-record-gcc-switches
set GOHOSTARCH=amd64
set GOHOSTOS=windows
set GOINSECURE=
go: stripping unprintable or unescapable characters from %"GOMOD"%
set GOMOD=C:\Users\������\Desktop\project\gate\go.mod
go: stripping unprintable or unescapable characters from %"GOMODCACHE"%
set GOMODCACHE=C:\Users\������\Desktop\Go\pkg\mod
set GONOPROXY=
set GONOSUMDB=
set GOOS=windows
go: stripping unprintable or unescapable characters from %"GOPATH"%
set GOPATH=C:\Users\������\Desktop\Go
set GOPRIVATE=
set GOPROXY=direct
set GOROOT=C:\Program Files\Go
set GOSUMDB=off
set GOTELEMETRY=local
go: stripping unprintable or unescapable characters from %"GOTELEMETRYDIR"%
set GOTELEMETRYDIR=C:\Users\������\AppData\Roaming\go\telemetry
set GOTMPDIR=
set GOTOOLCHAIN=auto
set GOTOOLDIR=C:\Program Files\Go\pkg\tool\windows_amd64
set GOVCS=
set GOVERSION=go1.24.0
go: stripping unprintable or unescapable characters from %"GOWORK"%
set GOWORK=C:\Users\������\Desktop\project\go.work
set PKG_CONFIG=pkg-config

What did you do?

in my project,there has a very strange panic when update from 1.23.6 to 1.24.0

package main

import (
	"context"
	"fmt"
	"io"
	"strconv"
	"sync/atomic"

	"github.com/chenjie199234/Corelib/cerror"
	"github.com/chenjie199234/Corelib/container/list"
	"github.com/chenjie199234/Corelib/monitor"
)

type Msg struct {
	H     *MsgHeader
	B     *MsgBody
	WithB bool
}

type MsgBody struct {
	Body  []byte
	Error *cerror.Error
}

type MsgHeader struct {
	Callid    uint64
	Path      string
	Type      int
	Deadline  int64
	Metadata  map[string]string
	Tracedata map[string]string
	Traildata map[string]string
}

type rw struct {
	callid     uint64
	path       string
	metadata   map[string]string
	traceddata map[string]string
	deadline   int64
	reader     *list.BlockList[*MsgBody]
	sender     func(context.Context, *Msg) error
	status     atomic.Int32 //use right 4 bit,the bit from left to right:peer_read_status,peer_send_status,self_read_status,self_send_status
	e          error
}

func newrw(callid uint64, path string, deadline int64, md, td map[string]string, sender func(context.Context, *Msg) error) *rw {
	tmp := &rw{
		callid:     callid,
		path:       path,
		metadata:   md,
		traceddata: td,
		deadline:   deadline,
		reader:     list.NewBlockList[*MsgBody](),
		sender:     sender,
	}
	tmp.status.Store(0b1111)
	return tmp

}

func (this *rw) init(body *MsgBody) error {
	return this.sender(context.Background(), &Msg{
		H: &MsgHeader{
			Callid:    this.callid,
			Path:      this.path,
			Type:      1,
			Deadline:  this.deadline,
			Metadata:  this.metadata,
			Tracedata: this.traceddata,
		},
		B:     body,
		WithB: body != nil,
	})
}

func (this *rw) send(body *MsgBody) error {
	if this.status.Load()&0b0001 == 0 {
		if this.e != nil {
			return this.e
		}
		return cerror.ErrCanceled
	}
	if this.status.Load()&0b1000 == 0 {
		return io.EOF
	}
	if e := this.sender(context.Background(), &Msg{
		H: &MsgHeader{
			Callid: this.callid,
			Path:   this.path,
			Type:   2,
		},
		B:     body,
		WithB: body != nil,
	}); e != nil {
		return e
	}
	if body.Error != nil {
		//if we send error to peer,means we stop send
		this.status.And(0b1110)
	}
	return nil
}

func (this *rw) closesend() error {
	if old := this.status.And(0b1110); old&0b0001 == 0 {
		return nil
	}
	return this.sender(context.Background(), &Msg{
		H: &MsgHeader{
			Callid: this.callid,
			Path:   this.path,
			Type:   3,
		},
	})
}

func (this *rw) closerecv() error {
	if old := this.status.And(0b1101); old&0b0010 == 0 {
		return nil
	}
	this.reader.Close()
	return this.sender(context.Background(), &Msg{
		H: &MsgHeader{
			Callid: this.callid,
			Path:   this.path,
			Type:   4,
		},
	})
}

func (this *rw) closerecvsend(trail bool, err error) error {
	if old := this.status.And(0b1100); old&0b0011 == 0 {
		return nil
	}
	this.e = err
	this.reader.Close()
	m := &Msg{
		H: &MsgHeader{
			Callid: this.callid,
			Path:   this.path,
			Type:   5,
		},
	}
	if trail {
		m.H.Traildata = map[string]string{"Cpu-Usage": strconv.FormatFloat(monitor.LastUsageCPU, 'g', 10, 64)}
	}
	return this.sender(context.Background(), m)

}
func (this *rw) recv() ([]byte, error) {
	if this.status.Load()&0b0010 == 0 {
		if this.e != nil {
			return nil, this.e
		}
		return nil, cerror.ErrCanceled
	}
	m, e := this.reader.Pop(context.Background())
	if e != nil {
		if e == list.ErrClosed {
			if this.e != nil {
				return nil, this.e
			}
			if this.status.Load()&0b0100 == 0 {
				return nil, io.EOF
			}
			if this.status.Load()&0b0010 == 0 {
				return nil, cerror.ErrCanceled
			}
			return nil, cerror.ErrClosed
		} else {
			//this is impossible
			return nil, cerror.Convert(e)
		}
	}
	//fix interface nil problem
	if m.Error == nil || m.Error.Code == 0 {
		return m.Body, nil
	}
	//if we read error from peer,means peer stop send
	this.status.And(0b1011)
	this.reader.Close()
	return m.Body, m.Error
}
func (this *rw) cache(m *MsgBody) error {
	_, e := this.reader.Push(m)
	if e == list.ErrClosed {
		e = cerror.ErrClosed
	}
	return e
}
func main() {
	rw := newrw(101, "/test", 0, nil, nil, func(ctx context.Context, m *Msg) error {
		return nil
	})
	rw.closerecvsend(true, nil)
	fmt.Println("finish")
	select {}
}
module demo

go 1.23.0

require github.com/chenjie199234/Corelib v0.0.133-0.20250221101755-b7a3a9d25afa

What did you see happen?

same code
works fine in 1.23.6
panic in 1.24.0

What did you expect to see?

works fine

@Jorropo
Copy link
Member

Jorropo commented Feb 20, 2025

Do you know if this is not a race condition ?

@mknyszek
Copy link
Contributor

If you could post at least the error message as text, that would be helpful. Images are harder for us to work with in general.

Are you able to reproduce this? Are you able to share a small code example that reproduces the issue? Both of these would be very helpful to making progress. As @Jorropo suggests, try running your code under the race detector. If you see more unexpected crashes in your program past the upgrade, posting any additional crashes would also be helpful.

Unfortunately without more information it's hard for us to determine what the problem could be. Many changes go into each release. Since this is in your code, if it is an issue with Go, my guess would be either a bug in the compiler or some kind of GC liveness bug (though we haven't seen anything like that since the release).

@mknyszek mknyszek added WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Feb 20, 2025
@mknyszek mknyszek changed the title strange panic in 1.24.0,doesn't exist in 1.23.6 cmd/compile: nil dereference when storing field of non-nil struct value Feb 20, 2025
@mknyszek mknyszek added this to the Backlog milestone Feb 20, 2025
@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Feb 20, 2025
@chenjie199234
Copy link
Author

@Jorropo this shouldn't be a race
i tested in both 1.24.0 and 1.23.6 for more then 30 times
100% panic in 1.24.0
0 panic in 1.23.6

@mknyszek
Copy link
Contributor

mknyszek commented Feb 20, 2025

@Jorropo this shouldn't be a race i tested in both 1.24.0 and 1.23.6 for more then 30 times 100% panic in 1.24.0 0 panic in 1.23.6

Unfortunately, data race bugs really can manifest in this way. Unrelated synchronization in the standard library and runtime can mask data race issues, and benign changes can similarly unmask them. Confirming that your code does not fail the race detector just provides everyone a much higher confidence that it isn't the root cause. We're not saying the problem is in your code necessarily, we just want to rule out possible explanations with high confidence.

@seankhliao seankhliao added WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. and removed WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. labels Feb 20, 2025
@chenjie199234
Copy link
Author

@mknyszek
i didn't set nil to the rw in any place in my code.
if i can call the function,it should not be gc.
the panic is very strange,i even don't known how to start to write the example code.

@gabyhelp gabyhelp added the BugReport Issues describing a possible bug in the Go implementation. label Feb 20, 2025
@seankhliao seankhliao added WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. and removed WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. labels Feb 20, 2025
@mknyszek
Copy link
Contributor

I understand this is a surprising issue, but if we don't approach this systematically we won't be able to identify a root cause. I am not aware of any other recent issues that have similar symptoms, though perhaps someone else is.

We can dig deeper ourselves if you're able to provide some way to reproduce. If you cannot, our courses of action are limited. We can suggest more diagnostics and debugging modes (like the race detector, GODEBUG=cgocheck=1 if you're using cgo, GODEBUG=asyncpreemptoff=1, etc.) to try and narrow it down. We can also ask you to try and narrow down the problem yourself by bisecting between Go 1.23 (5e8a731) and Go 1.24 (3901409) and provide instructions on how to rebuild the Go toolchain.

@chenjie199234
Copy link
Author

@seankhliao @mknyszek
these logs in the image is the business log.
only the logs with red line and cycle is the debug log.
i don't known what i can do to help you to figure out what is happenning.
i'm very confuse the whole day,and it's midnight in my timezone now,i'm really tired.

please give me some guide to output some useful logs to help us find out the problem.
i will follow the guide tomorrow.thx very much,o7

@randall77
Copy link
Contributor

rw, is indeed not nil, it is the address 0x54. That address is clearly bad.

You should try and trace back to find out where that bad value came from. Try adding some clauses like if uintptr(unsafe.Pointer(rw)) < 1000 { panic(...) } to the callers.

@prattmic
Copy link
Member

rw, is indeed not nil, it is the address 0x54. That address is clearly bad.

0x54 should be the address of this + offsetof(e) not this itself. If the offset of e is 0x54, then this could be nil. Though 0x54 is not my count of the offset of e. In fact, shouldn't e be 8-byte aligned? So it definitely looks odd.

@Jorropo

This comment has been minimized.

@prattmic
Copy link
Member

@chenjie199234 It's unrelated to this bug, but it looks like go env may have a bug in printing your C:\User\... directory path. Could you provide more details on #71863.

@mknyszek
Copy link
Contributor

Unlikely, but maybe the line number in the output is wrong, and this offset makes sense for some other structure? (Maybe not this, but this.reader?) 🤷

@chenjie199234
Copy link
Author

chenjie199234 commented Feb 21, 2025

@mknyszek @prattmic @Jorropo @randall77
i update the issue's info
still can't find the reason
:(
the panic line is right
if i comment this.e = err
then panic on this.reader
if i comment this.e = err and this.reader then panic on this.callid

the addr is broken. when the panic happens,the addr is always 0xc and i can't find the reason

@randall77
Copy link
Contributor

the addr is broken. when the panic happens,the addr is always 0xc and i can't find the reason

This I believe is the crux of the problem. You have to figure out where that address came from.
Without a reproducer that we can run, we cannot help you.

@chenjie199234
Copy link
Author

chenjie199234 commented Feb 22, 2025

@randall77
this is very diffcult to reproduce.because even in my case,if i add some useless code and rebuild the project,the panic gone!
this means,the compile result will impact the panic
and i checked all my codes,there has no place to assign something to the val

i'm trying to make a smallest demo which can reproduce this panic

@chenjie199234
Copy link
Author

@mknyszek @prattmic @randall77 @seankhliao
hello guys,i reproduced it
the code i put it in the issue's content
i'm trying to make it more simple

@seankhliao seankhliao removed the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Feb 22, 2025
@seankhliao
Copy link
Member

thanks for the reproducer
a bisect points to CL 594738: cmd/compile: make sync/atomic AND/OR operations intrinsic on amd64

cc @randall77

@randall77
Copy link
Contributor

Ok, thanks for the reproducer.

This looks like a bug in the register allocator when compiling (*rw).closerecvsend, where it uses the same register for two different things simultaneously. Those two things being the value being compare-and-swapped (0b1100), and the this argument. That's where the 0xc pointer comes from.

I think I have a 1-line fix.

diff --git a/src/cmd/compile/internal/ssa/regalloc.go b/src/cmd/compile/internal/ssa/regalloc.go
index 1b7bcb2b1d..d794098b9d 100644
--- a/src/cmd/compile/internal/ssa/regalloc.go
+++ b/src/cmd/compile/internal/ssa/regalloc.go
@@ -1677,6 +1677,7 @@ func (s *regAllocState) regalloc(f *Func) {
                                }
                                tmpReg = s.allocReg(m, &tmpVal)
                                s.nospill |= regMask(1) << tmpReg
+                               s.tmpused |= regMask(1) << tmpReg
                        }
 
                        // Now that all args are in regs, we're ready to issue the value itself.

I still need to figure out a reasonable test for it.

@gopherbot please open a backport issue for 1.24.

@randall77 randall77 self-assigned this Feb 22, 2025
@randall77 randall77 added the Critical A critical problem that affects the availability or correctness of production systems built using Go label Feb 22, 2025
@randall77 randall77 modified the milestones: Backlog, Go1.24.1, Go1.25 Feb 22, 2025
@gopherbot
Copy link
Contributor

Backport issue(s) opened: #71904 (for 1.24).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases.

@chenjie199234
Copy link
Author

@randall77 @seankhliao
the strange thing is:if u add two println(this) like the below code,the panic will gone.
this means,different code has different case.does this only effect the atomic.And function?

func (this *rw) closerecvsend(trail bool, err error) error {
	println(this)
	if old := this.status.And(0b1100); old&0b0011 == 0 {
		return nil
	}
	println(this)
	this.e = err
	this.reader.Close()
	m := &Msg{
		H: &MsgHeader{
			Callid: this.callid,
			Path:   this.path,
			Type:   5,
		},
	}
	if trail {
		m.H.Traildata = map[string]string{"Cpu-Usage": strconv.FormatFloat(monitor.LastUsageCPU, 'g', 10, 64)}
	}
	return this.sender(context.Background(), m)
}

@randall77
Copy link
Contributor

The printlns cause spills which change how the register allocation proceeds.

It does not affect only sync.atomic.And. It would include, I think, sync.atomic.{And64,And32,Or64,Or32}, conditional moves for floats, and float->bool ops (everything declared as needIntTemp in https://github.com/golang/go/blob/master/src/cmd/compile/internal/ssa/_gen/AMD64Ops.go).

@randall77
Copy link
Contributor

And also https://github.com/golang/go/blob/master/src/cmd/compile/internal/ssa/_gen/ARM64Ops.go, which includes the arm64 sync/atomic.{And64,And32,Or64,Or32,And8,Or8}.

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/651221 mentions this issue: cmd/compile: ensure we don't reuse temporary register

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
BugReport Issues describing a possible bug in the Go implementation. compiler/runtime Issues related to the Go compiler and/or runtime. Critical A critical problem that affects the availability or correctness of production systems built using Go NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests

8 participants