Kafka Consumer dies after a reconnect following a network hiccup #1408
Comments
We are experiencing the same thing (I mentioned in #1264 as well) - ever since we switched from Kafka's Java SDK to alpakka-kafka, some of our consumers sporadically die after a network hiccup. Others are fine. Around this time we see "Completing" in the logs - it looks like the Stream completed for those consumers (and never restarted). Our typical setup (Play Framework):
@mrubin thanks for adding the observations from your systems. To me it looks like you have a similar problem to the one we observe.
@seglo do you see anything interesting / that stands out in the above configuration for the alpakka-kafka consumer? Thank you
We implemented a workaround in the meantime: we watch the stream completion via the draining control and fail the Kubernetes liveness probe so that the pod gets restarted.
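A minimal sketch of that kind of workaround, assuming a DrainingControl named `control` and an application-level health flag (both names are illustrative, not the actual code):

```scala
import java.util.concurrent.atomic.AtomicBoolean
import scala.concurrent.ExecutionContext
import akka.kafka.scaladsl.Consumer.DrainingControl

// Flag consulted by the HTTP endpoint that backs the Kubernetes liveness probe.
val consumerAlive = new AtomicBoolean(true)

def watchStream(control: DrainingControl[_])(implicit ec: ExecutionContext): Unit =
  // streamCompletion completes once the consumer stream terminates, whether it
  // failed or completed "successfully" (which it should never do while the app runs).
  control.streamCompletion.onComplete { _ =>
    // Mark the pod unhealthy; the liveness endpoint now returns a non-2xx status,
    // so Kubernetes restarts the pod and with it the consumer.
    consumerAlive.set(false)
  }
```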
Faced similar issue with:
After printing these logs, the Kafka consumer stopped consuming messages. It just left the consumer group and did not rejoin.
Same issue here.
FTR: so far we haven't found a better solution than restarting the whole app. Unfortunately, due to the upcoming license changes to Akka, we won't dig deeper into that.
FYI I have no plans to upgrade Akka to releases under Lightbend's new license, but I understand the concern. Replacing Akka would be a huge effort, so I have hopes that a viable Akka fork (such as the Apache Akka fork that's beginning to gain attention) is found. In the meantime, Lightbend has committed to releasing security patches for Akka 2.6 until September of next year, which buys this and other OSS projects built on Akka some time to pivot.
FYI I lost interest in this issue because we don't use Akka anymore. I would be fine with closing it, but I'll leave it open for now as somebody else might be affected by this problem.
I believe that we are experiencing this issue. We have not found a suitable config setup that might avoid it. We are running in k8s with Kafka, the Strimzi operator, and Akka 2.6. Please keep the issue open. I am trying to crash the responsible actor using stream completion detection in the draining control, but haven't managed to find a reliable way. @L7R7 Would you share a snippet of your approach to the workaround, please?
@tpanagos We migrated away from Akka to a different stack a while ago, so I don't recall the details. But I can dig through the git history next week and see if I can provide you with some details.
@tpanagos From what I can see in our git history, we combined this approach: https://doc.akka.io/docs/alpakka-kafka/current/consumer.html#draining-control with a message to the Supervisor that stops the whole application. So basically a
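The exact mechanism isn't spelled out above; a sketch of the general idea, using Akka's CoordinatedShutdown directly instead of a dedicated supervisor message (a simplification, not the original code):

```scala
import akka.actor.{ActorSystem, CoordinatedShutdown}
import akka.kafka.scaladsl.Consumer.DrainingControl

def stopAppWhenStreamDies(control: DrainingControl[_])(implicit system: ActorSystem): Unit = {
  import system.dispatcher

  control.streamCompletion.onComplete { result =>
    system.log.error("Kafka consumer stream terminated unexpectedly: {}", result)
    // Shut the whole application down so the orchestrator (e.g. Kubernetes) restarts it.
    CoordinatedShutdown(system).run(CoordinatedShutdown.UnknownReason)
  }
}
```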
Versions used
akka version: 2.6.15
akka-stream-kafka version: 2.1.1
scala version: 2.13.6
Expected Behavior
We're using a `committableSource` to consume messages from Kafka. We read the messages, parse the JSON payload, and persist it into a database. We're also using the Draining Control for graceful shutdown.
I was hoping that reconnects after network glitches would be handled properly by either akka-stream-kafka or the underlying Kafka client library.
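For reference, a minimal sketch of the kind of setup described above (topic, group id, bootstrap servers, and the parse/persist helpers are placeholders, not our actual code):

```scala
import akka.Done
import akka.actor.ActorSystem
import akka.kafka.{CommitterSettings, ConsumerSettings, Subscriptions}
import akka.kafka.scaladsl.{Committer, Consumer}
import akka.kafka.scaladsl.Consumer.DrainingControl
import org.apache.kafka.common.serialization.StringDeserializer
import scala.concurrent.Future

implicit val system: ActorSystem = ActorSystem("consumer-example")
import system.dispatcher

// Placeholders for the real JSON parsing and database write.
def parseJson(raw: String): String = raw
def persist(doc: String): Future[Done] = Future.successful(Done)

val consumerSettings =
  ConsumerSettings(system, new StringDeserializer, new StringDeserializer)
    .withBootstrapServers("kafka:9092")
    .withGroupId("example-group")

// committableSource emits each record together with its committable offset;
// DrainingControl combines the stream's Control with its completion Future.
val control: DrainingControl[Done] =
  Consumer
    .committableSource(consumerSettings, Subscriptions.topics("example-topic"))
    .mapAsync(1) { msg =>
      persist(parseJson(msg.record.value())).map(_ => msg.committableOffset)
    }
    .toMat(Committer.sink(CommitterSettings(system)))(DrainingControl.apply)
    .run()
```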
Actual Behavior
The consumer works as expected most of the time. However, it (very sporadically) dies after a short network hiccup (we see error logs from our database connection as well, so I'm pretty sure it's a short network issue that's resolved by a reconnect after a short moment).
The first thing I see in the logs is that the consumer fails to do a successful commit against the broker, then loses the assigned partitions and has to rejoin the consumer group. It does so (as I'd expect), but shortly after that it leaves the consumer group. To me, it looks as if the Kafka consumer is shutting down while the service itself keeps running. I can't see a reason why the consumer should be shutting down.
At this point I'm not sure if it's something we're missing on our side (should we use something other than `committableSource`? Are we missing some configuration? ...), or if it's a bug, or just bad luck. Or do we have to take care of such problems ourselves by restarting the Kafka source in these cases, as mentioned in this issue?
What's even more confusing is the fact that we have two consumer groups that read from the same topic and do similar things when consuming the messages from Kafka. We observed this behavior twice over a couple of weeks, once for each consumer group, and both times the other group was consuming without any issues.
Relevant logs
I'm putting the (almost) full JSON from our logs here, so it's clear when the logs were made, by which logger, and with what message (I replaced sensitive stuff with `...`). I'll try to group them logically (in the way that makes sense to me, at least):
consumer fails to commit and loses partition assignments after network issues
consumer starts re-joining
consumer re-joins successfully
consumer shuts down and leaves the group
Consumer config
We're consuming rather large messages, so `fetch.max.bytes` is set to something unusual, but I don't think that's relevant here.
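The full config didn't make it into this report; a minimal sketch of how that property might be set on the consumer settings (the value is an assumption, not our actual config):

```scala
// Assumed value: raise fetch.max.bytes well above the 50 MiB default to
// accommodate large messages. `consumerSettings` refers to the ConsumerSettings
// instance from the sketch under "Expected Behavior" above.
val largeMessageConsumerSettings =
  consumerSettings.withProperty("fetch.max.bytes", (100 * 1024 * 1024).toString)
```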
Reproducible Test Case
I'm currently not able to reproduce the behavior. It happened to us twice over the course of several weeks, and I think it's triggered by a network glitch, which is hard to simulate.