
Revert "fix(redisotel): fix the situation of reporting spans multiple times (#3168)" #3433


Open
wants to merge 1 commit into master

Conversation

@AllanGuigou commented Jul 15, 2025

This reverts commit 5314a57 (PR) which removed the default tracing hook. We've observed that since v9.8.0 our clients no longer emit traces. I believe this is due to a race condition between node creation and tracing instrumentation: before v9.8.0 there was always a default tracing hook, whereas in the latest version tracing hooks are only installed on new nodes, so any nodes that already exist when tracing is instrumented are left untraced.

I've proposed another change, #3432, that adds tracing hooks for existing nodes. That requires a breaking change to include context in the params, so I think the revert makes sense in the short term.

cc @ndyakov and @flc1125 since you all were involved in the PR I'm proposing we revert.

@AllanGuigou marked this pull request as ready for review July 15, 2025 16:30
@flc1125 (Contributor) commented Jul 16, 2025

Very sorry for the impact caused.

I noticed that a new PR has been submitted to address it. If you need any assistance, please feel free to let me know.

@flc1125 (Contributor) commented Jul 16, 2025

Additionally, could you provide an example that demonstrates your issue? In my application, it did indeed generate duplicate spans.


I executed the same code in different versions, and here are the results. Code section:

package otel

import (
	"testing"

	"github.com/redis/go-redis/extra/redisotel/v9"
	"github.com/redis/go-redis/v9"
	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	"go.opentelemetry.io/otel/sdk/trace/tracetest"
)

func TestOTel(t *testing.T) {
	sr := tracetest.NewSpanRecorder()
	provider := sdktrace.NewTracerProvider(
		sdktrace.WithSpanProcessor(sr),
	)
	t.Cleanup(func() {
		_ = provider.Shutdown(t.Context())
	})

	client := redis.NewClusterClient(&redis.ClusterOptions{
		Addrs: []string{
			"localhost:7001",
			"localhost:7002",
			"localhost:7003",
		},
	})
	require.NoError(t, redisotel.InstrumentTracing(client, redisotel.WithTracerProvider(provider)))

	assert.Len(t, sr.Ended(), 0)
	client.Ping(t.Context())
	for _, span := range sr.Ended() {
		t.Logf("Span: %s, Name: %s, Status: %s", span.SpanContext().TraceID(), span.Name(), span.Status().Code.String())
	}
}
Results before modification

version:

github.com/redis/go-redis/extra/redisotel/v9@… github.com/redis/go-redis/extra/rediscmd/v9@…
github.com/redis/go-redis/extra/redisotel/v9@… github.com/redis/go-redis/v9@…

output:

=== RUN   TestOTel
    redis_test.go:35: Span: 7a3ace6aabf474bb8e3eb05a6e412d80, Name: redis.dial, Status: Unset
    redis_test.go:35: Span: 7a3ace6aabf474bb8e3eb05a6e412d80, Name: hello, Status: Unset
    redis_test.go:35: Span: 7a3ace6aabf474bb8e3eb05a6e412d80, Name: redis.pipeline client, Status: Unset
    redis_test.go:35: Span: 7a3ace6aabf474bb8e3eb05a6e412d80, Name: cluster slots, Status: Unset
    redis_test.go:35: Span: 7a3ace6aabf474bb8e3eb05a6e412d80, Name: redis.dial, Status: Unset
    redis_test.go:35: Span: 7a3ace6aabf474bb8e3eb05a6e412d80, Name: hello, Status: Unset
    redis_test.go:35: Span: 7a3ace6aabf474bb8e3eb05a6e412d80, Name: redis.pipeline client, Status: Unset
    redis_test.go:35: Span: 7a3ace6aabf474bb8e3eb05a6e412d80, Name: ping, Status: Unset
    redis_test.go:35: Span: 7a3ace6aabf474bb8e3eb05a6e412d80, Name: ping, Status: Unset
--- PASS: TestOTel (0.01s)

We can see that the single PING produced two ping spans.

Results after modification

version:

github.com/redis/go-redis/extra/redisotel/v9@… github.com/redis/go-redis/extra/rediscmd/v9@…
github.com/redis/go-redis/extra/redisotel/v9@… github.com/redis/go-redis/v9@…

output:

=== RUN   TestOTel
    redis_test.go:35: Span: 4054ddec746bbbe5b196bbf067e63762, Name: redis.dial, Status: Unset
    redis_test.go:35: Span: 4054ddec746bbbe5b196bbf067e63762, Name: hello, Status: Unset
    redis_test.go:35: Span: 4054ddec746bbbe5b196bbf067e63762, Name: redis.pipeline client, Status: Unset
    redis_test.go:35: Span: 4054ddec746bbbe5b196bbf067e63762, Name: cluster slots, Status: Unset
    redis_test.go:35: Span: f86277d82fce825f7bebad7c0d3254f7, Name: redis.dial, Status: Unset
    redis_test.go:35: Span: f86277d82fce825f7bebad7c0d3254f7, Name: hello, Status: Unset
    redis_test.go:35: Span: f86277d82fce825f7bebad7c0d3254f7, Name: redis.pipeline client, Status: Unset
    redis_test.go:35: Span: f86277d82fce825f7bebad7c0d3254f7, Name: ping, Status: Unset
--- PASS: TestOTel (0.01s)

And here, the ping span is recorded only once.

@ndyakov (Member) commented Jul 16, 2025

@monkey92t feel free to review this and weigh in.

@ndyakov requested review from Copilot and monkey92t July 16, 2025 10:09
Copilot AI (Contributor) left a comment

Pull Request Overview

This PR reverts a previous fix that removed default tracing hooks from ClusterClient and Ring Redis clients. The revert addresses a race condition where existing nodes lose tracing capabilities when only new nodes receive tracing hooks during instrumentation.

  • Restores default tracing hooks for ClusterClient and Ring types to ensure all nodes have tracing
  • Maintains the OnNewNode callbacks for new node instrumentation
  • Provides a short-term solution while a more comprehensive fix is being developed

@@ -30,6 +30,8 @@ func InstrumentTracing(rdb redis.UniversalClient, opts ...TracingOption) error {
rdb.AddHook(newTracingHook(connString, opts...))
return nil
case *redis.ClusterClient:
rdb.AddHook(newTracingHook("", opts...))
Copilot AI commented Jul 16, 2025

Adding a default tracing hook with empty connection string followed by OnNewNode hooks will result in duplicate spans for the same operations. The original PR #3168 that this reverts was specifically fixing this multiple span reporting issue. This revert reintroduces the same problem it was meant to solve.


@@ -38,6 +40,8 @@ func InstrumentTracing(rdb redis.UniversalClient, opts ...TracingOption) error {
})
return nil
case *redis.Ring:
rdb.AddHook(newTracingHook("", opts...))
Copilot AI commented Jul 16, 2025

Adding a default tracing hook with empty connection string followed by OnNewNode hooks will result in duplicate spans for the same operations. The original PR #3168 that this reverts was specifically fixing this multiple span reporting issue. This revert reintroduces the same problem it was meant to solve.

Suggested change: remove the line `rdb.AddHook(newTracingHook("", opts...))`


@AllanGuigou (Author) commented Jul 16, 2025

@flc1125 I've modified your example to include a ping before instrumenting tracing, which results in one span or none. Yesterday I was unclear how this was occurring in my actual application; we don't call PING or any other commands before instrumenting tracing.

EDIT: I did just find that we call ForEachShard before instrumenting telemetry, and if I'm reading the code correctly that would cause nodes to get initialized. I can't reproduce this in a unit test, though, so this may be an "us" problem; I'm working on shipping a change to re-order our setup to verify.

	err := client.ForEachShard(t.Context(), func(ctx context.Context, shard *redis.Client) error {
		t.Log("some other setup")
		return nil
	})

func TestOTel(t *testing.T) {
	sr := tracetest.NewSpanRecorder()
	provider := sdktrace.NewTracerProvider(
		sdktrace.WithSpanProcessor(sr),
	)
	t.Cleanup(func() {
		_ = provider.Shutdown(t.Context())
	})

	client := redis.NewClusterClient(&redis.ClusterOptions{
		Addrs: []string{
			"localhost:7001",
			"localhost:7002",
			"localhost:7003",
		},
	})

	client.Ping(t.Context()) // hard-coded but if there is a race between setting up tracing and setting up nodes then there will be no spans

	require.NoError(t, redisotel.InstrumentTracing(client, redisotel.WithTracerProvider(provider)))

	assert.Len(t, sr.Ended(), 0)
	client.Ping(t.Context())

	for _, span := range sr.Ended() {
		t.Logf("Span: %s, Name: %s, Status: %s ", span.SpanContext().TraceID(), span.Name(), span.Status().Code.String())
	}
}
Results from v9.7.3
=== RUN   TestOTel
    /workspaces/tmp/main_test.go:39: Span: 98defb52a187159d54baefe5768c93f5, Name: ping, Status: Unset
--- PASS: TestOTel (0.00s)

Results from v9.8.0
=== RUN   TestOTel
--- PASS: TestOTel (0.00s)

@AllanGuigou (Author) commented Jul 16, 2025

I did just find that we call ForEachShard before instrumenting telemetry, and if I'm reading the code correctly that would cause nodes to get initialized. I can't reproduce this in a unit test, though, so this may be an "us" problem; I'm working on shipping a change to re-order our setup to verify.

Verified that tracing is flowing again after making sure InstrumentTracing runs first in our setup, but there might still be a gap. I know that a cluster can have a proxy in front of the nodes, and I believe it can handle some commands like CLUSTER NODES itself, in which case those commands would not be traced. Is that correct? Wondering if the revert is still justified to support those cases.
