When resizing AWS OpenSearch cluster Jaeger declares ES VPC Domain endpoint dead for a long time #3816

mcooknu · 2022-07-16T14:20:17Z


Each time I resize an ES Domain (for example, adding more EBS space or more data nodes), the cluster replicates the desired state, copies shards over, and so on. There may be some impact (or even a total outage) during this time, but what I'm trying to understand is how Jaeger declares a node dead. Jaeger seems to consider the ES Domain VPC endpoint dead for much, much longer than it is actually unreachable according to the AWS GUI.

Is Jaeger sniffing the ES Domain endpoint, or is this off by default?
Would sniffing help it return to service sooner?
How long does Jaeger wait between retries to check whether the ES Domain endpoint is still dead?
What does "dead" mean: is it a connection timeout, or does an ES query have to complete successfully?
Does anyone have advice on performing this kind of ES Domain upgrade while keeping Jaeger in service?
Does simply adding more data nodes prevent this problem and, if so, is there any guidance?
Is there any guidance from the Jaeger community or developers on ES Domain sizing for 'typical' span rates?

Setting aside the ES Domain resize and its potential performance hit, would you expect Jaeger to remain connected to the domain and, if ES can respond fast enough, still attempt GUI actions like "Find Traces"?

These are the kinds of errors that appear in the Jaeger Query pod logs every few minutes, for over an hour, while the AWS Domain is being resized larger.

{"level":"info","ts":1657979512.318126,"caller":"zapgrpc/zapgrpc.go:129","msg":"elastic: https://vpc-gdo-jaeger-testing-dev-XXXXXXXXX.us-east-1.es.amazonaws.com:443 is dead"}
{"level":"error","ts":1657979512.3182418,"caller":"app/http_handler.go:487","msg":"HTTP handler, Internal Server Error","error":"search services failed: no available connection: no Elasticsearch node available","stacktrace":"github.com/jaegertracing/jaeger/cmd/query/app.(*APIHandler).handleError\n\tgithub.com/jaegertracing/jaeger/cmd/query/app/http_handler.go:487\ngithub.com/jaegertracing/jaeger/cmd/query/app.(*APIHandler).search\n\tgithub.com/jaegertracing/jaeger/cmd/query/app/http_handler.go:236\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2046\ngithub.com/opentracing-contrib/go-stdlib/nethttp.MiddlewareFunc.func5\n\tgithub.com/opentracing-contrib/[email protected]/nethttp/server.go:154\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2046\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2046\ngithub.com/gorilla/mux.(*Router).ServeHTTP\n\tgithub.com/gorilla/[email protected]/mux.go:210\ngithub.com/jaegertracing/jaeger/cmd/query/app.additionalHeadersHandler.func1\n\tgithub.com/jaegertracing/jaeger/cmd/query/app/additional_headers_handler.go:28\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2046\ngithub.com/gorilla/handlers.CompressHandlerLevel.func1\n\tgithub.com/gorilla/[email protected]/compress.go:141\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2046\ngithub.com/gorilla/handlers.recoveryHandler.ServeHTTP\n\tgithub.com/gorilla/[email protected]/recovery.go:78\nnet/http.serverHandler.ServeHTTP\n\tnet/http/server.go:2878\nnet/http.(*conn).serve\n\tnet/http/server.go:1929"}
{"level":"info","ts":1657979512.3181875,"caller":"zapgrpc/zapgrpc.go:129","msg":"elastic: all 1 nodes marked as dead; resurrecting them to prevent deadlock"}
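
To make the pattern in these three log lines easier to discuss, here is a minimal sketch of the behavior they seem to describe. It is only a reading of the messages, not the Elasticsearch client's actual code: the names `Node`, `Pool`, `mark_dead`, `pick` and `NoNodeAvailable` are hypothetical, and the URL is a placeholder. The idea is that a failed request marks the endpoint dead, the next request finds no live node and fails, and the pool then resurrects everything so that a later request can try again. With a single VPC endpoint, every failure would make the whole pool unavailable for at least one request.

```python
from dataclasses import dataclass, field


class NoNodeAvailable(Exception):
    """Raised when every known node is currently marked dead."""


@dataclass
class Node:
    url: str
    dead: bool = False


@dataclass
class Pool:
    """Hypothetical connection pool; an assumption, not the real client."""
    nodes: list
    log: list = field(default_factory=list)
    _next: int = 0

    def mark_dead(self, node):
        # Called by the caller after a request to this node has failed.
        if not node.dead:
            node.dead = True
            self.log.append(f"{node.url} is dead")

    def pick(self):
        # Round-robin over the nodes, skipping any that are marked dead.
        count = len(self.nodes)
        for offset in range(count):
            node = self.nodes[(self._next + offset) % count]
            if not node.dead:
                self._next = (self._next + offset + 1) % count
                return node
        # No live node: fail this request, but resurrect all nodes so
        # that the next request gets a chance to retry.
        self.log.append(
            f"all {count} nodes marked as dead; "
            "resurrecting them to prevent deadlock"
        )
        for node in self.nodes:
            node.dead = False
        raise NoNodeAvailable("no Elasticsearch node available")


if __name__ == "__main__":
    pool = Pool([Node("https://es.example.invalid:443")])  # placeholder URL
    endpoint = pool.pick()
    pool.mark_dead(endpoint)          # simulate one failed request
    try:
        pool.pick()                   # fails: the only node is dead
    except NoNodeAvailable as err:
        print("request failed:", err)
    print(pool.pick().url)            # succeeds again after resurrection
    print(pool.log)
```

If the real client behaves roughly like this, the "dead" messages repeating every few minutes would mean that requests keep failing after each resurrection, rather than that the endpoint is being held dead for a long fixed interval. That is an assumption to confirm against the client's source and health-check settings.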

Jaeger v1.28.0
Commit 514a0cc
Build 2021-11-06T05:31:39Z
Jaeger UI v1.18.0
