I've noticed that each time I resize an ES Domain (for example, adding more EBS space or more Data Nodes), the cluster basically replicates the desired state and copies over shards etc. There may be some impact (or a total outage) during this time, but what I'm trying to understand is how Jaeger declares a node dead. Jaeger seems to consider the ES Domain VPC endpoint dead for much longer than it is actually unreachable according to the AWS GUI.
Is Jaeger sniffing the ES Domain endpoint, or is sniffing off by default?
Would sniffing help Jaeger return to service sooner?
How long does Jaeger wait between retries to check whether the ES Domain endpoint is still dead?
What does "dead" mean: is it a connection timeout, or does an ES query have to complete successfully?
Does anyone have advice on performing this kind of ES Domain upgrade while keeping Jaeger in service?
Does simply adding more Data Nodes prevent this problem, and if so, is there any guidance?
Is there any guidance from the Jaeger community or developers on ES Domain sizing for 'typical' span rates?
Notwithstanding the ES Domain resize and its potential performance hit, would you expect Jaeger to remain connected to the domain and, if ES can respond fast enough, to attempt GUI actions like "Find Traces"?
These are the kinds of errors that appear in the Jaeger Query pod logs every few minutes, for over an hour, while the AWS Domain is being resized:
{"level":"info","ts":1657979512.318126,"caller":"zapgrpc/zapgrpc.go:129","msg":"elastic: https://vpc-gdo-jaeger-testing-dev-XXXXXXXXX.us-east-1.es.amazonaws.com:443 is dead"}
{"level":"error","ts":1657979512.3182418,"caller":"app/http_handler.go:487","msg":"HTTP handler, Internal Server Error","error":"search services failed: no available connection: no Elasticsearch node available","stacktrace":"github.com/jaegertracing/jaeger/cmd/query/app.(*APIHandler).handleError\n\tgithub.com/jaegertracing/jaeger/cmd/query/app/http_handler.go:487\ngithub.com/jaegertracing/jaeger/cmd/query/app.(*APIHandler).search\n\tgithub.com/jaegertracing/jaeger/cmd/query/app/http_handler.go:236\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2046\ngithub.com/opentracing-contrib/go-stdlib/nethttp.MiddlewareFunc.func5\n\tgithub.com/opentracing-contrib/[email protected]/nethttp/server.go:154\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2046\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2046\ngithub.com/gorilla/mux.(*Router).ServeHTTP\n\tgithub.com/gorilla/[email protected]/mux.go:210\ngithub.com/jaegertracing/jaeger/cmd/query/app.additionalHeadersHandler.func1\n\tgithub.com/jaegertracing/jaeger/cmd/query/app/additional_headers_handler.go:28\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2046\ngithub.com/gorilla/handlers.CompressHandlerLevel.func1\n\tgithub.com/gorilla/[email protected]/compress.go:141\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2046\ngithub.com/gorilla/handlers.recoveryHandler.ServeHTTP\n\tgithub.com/gorilla/[email protected]/recovery.go:78\nnet/http.serverHandler.ServeHTTP\n\tnet/http/server.go:2878\nnet/http.(*conn).serve\n\tnet/http/server.go:1929"}
{"level":"info","ts":1657979512.3181875,"caller":"zapgrpc/zapgrpc.go:129","msg":"elastic: all 1 nodes marked as dead; resurrecting them to prevent deadlock"}