docs(self-hosted): provide more insights on troubleshooting kafka #15131
Conversation
Turns out most of our self-hosted users have never touched Kafka before, so it's a good idea to introduce them to how Kafka works. Also added how to increase consumer replicas if they're lagging behind.
2. Receive consumers list:

```diff
- docker compose run --rm kafka kafka-consumer-groups --bootstrap-server kafka:9092 --list
+ docker compose exec kafka kafka-consumer-groups --bootstrap-server kafka:9092 --list
```
Potential bug: The change from `docker compose run` to `docker compose exec` will cause commands to fail, as the troubleshooting guide requires containers to be stopped first.

- Description: The troubleshooting documentation for Kafka offset resets instructs users to first stop consumer containers. However, subsequent steps were changed from `docker compose run --rm kafka` to `docker compose exec kafka`. The `exec` command requires the target container to be running. In a troubleshooting scenario where the `kafka` container may be stopped or unhealthy, or if the user has stopped it as part of the procedure, these commands will fail with an error like "container is not running". This breaks the documented recovery workflow, preventing users from resetting Kafka offsets.
- Suggested fix: Revert the commands from `docker compose exec kafka` back to `docker compose run --rm kafka`. The `run` command creates a new container for the command, which works regardless of whether the main `kafka` service container is running.

severity: 0.7, confidence: 0.95
No, we're not stopping the Kafka container.
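For readers following this thread, the practical difference between the two invocations is only where the command runs. A quick sketch, assuming the `kafka` service from the self-hosted `docker-compose.yml`:

```shell
# Executes the tool inside the already-running `kafka` container;
# errors out if that container is not running.
docker compose exec kafka kafka-consumer-groups --bootstrap-server kafka:9092 --list

# Creates a throwaway container from the kafka image just for this command;
# it still needs the broker at kafka:9092 to be reachable.
docker compose run --rm kafka kafka-consumer-groups --bootstrap-server kafka:9092 --list
```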
```yaml
services:
  events-consumer:
    deploy:
      replicas: 3
```

This will increase the number of consumers for the `ingest-consumer` consumer group to 3.
Potential bug: The recommended scaling method using `deploy.replicas` is silently ignored in standalone Docker Compose, meaning no scaling will actually occur.

- Description: The documentation suggests scaling Kafka consumers using the `deploy.replicas` key in a `docker-compose.override.yml` file. However, the `deploy` key is only effective in Docker Swarm mode. Self-hosted Sentry installations use standalone Docker Compose, which silently ignores this configuration. As a result, users following these instructions will not actually scale their consumers, and the underlying performance issues like consumer lag will persist, despite the user believing they have applied a fix.
- Suggested fix: Remove the instructions for using `deploy.replicas`. Replace them with the correct method for scaling services in standalone Docker Compose, which typically involves defining additional, uniquely named service entries in the `docker-compose.override.yml` file.

severity: 0.8, confidence: 0.98
> However, the deploy key is only effective in Docker Swarm mode.

Wrong. It works on Docker Compose.
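For context (not part of the PR): recent Docker Compose v2 releases do apply `deploy.replicas` outside Swarm when the service is (re)created, and `--scale` is an ad-hoc alternative. A sketch reusing the `events-consumer` service name from the snippet above:

```shell
# Recreate the service so deploy.replicas from docker-compose.override.yml takes effect
docker compose up -d events-consumer

# Or scale on the fly without editing the override file
docker compose up -d --scale events-consumer=3
```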
Co-authored-by: Kevin Pfeifer <[email protected]>
This looks waaaay better than the old version but I'm not qualified to give proper feedback. Still unblocking as I think it is miles better than whatever we have currently.
This section is aimed for those who have Kafka problems, but are not yet familiar with Kafka. At a high level, it is a message broker which stores message in a log (or in an easier language: very similar to an array) format. It receives messages from producers that aimed to a specific topic, and then sends them to consumers that are subscribed to that topic. The consumers can then process the messages.

This happens where Kafka and the consumers get out of sync. Possible reasons are:

On the inside, when a message enters a topic, it would be written to a certain partition. You can think partition as physical boxes that stores messages for a specific topic, each topic will have their own separate & dedicated partitions. In a distributed Kafka setup, each partition might be stored on a different machine/node, but if you only have a single Kafka instance, then all the partitions are stored on the same machine.
This paragraph has a bit of overlap with the next one and does not add much to understanding kafka IMO, so I think we could remove it completely.
I don't mind leaving this here. At least I want to emphasize the existence of "partition"
Co-authored-by: Joris Bayer <[email protected]>
```log
Exception: KafkaError{code=OFFSET_OUT_OF_RANGE,val=1,str="Broker: Offset out of range"}
```

This section is aimed for those who have Kafka problems, but are not yet familiar with Kafka. At a high level, Kafka is a message broker which stores messages in a log (or in an easier language: very similar to an array) format. It receives messages from producers that write to a specific topic, and then sends them to consumers that are subscribed to that topic. The consumers can then process the messages.
If you want to be pedantic, Kafka doesn't send messages to consumers. Consumers poll Kafka and fetch new messages.
Too pedantic and not so beginner friendly 😅
1. Running out of disk space or memory
2. Having a sustained event spike that causes very long processing times, causing Kafka to drop messages as they go past the retention time
3. Date/time out of sync issues due to a restart or suspend/resume cycle

When a producer sends a message to a topic, it will either stick to a certain partition number (example: partition 1, partition 2, etc.) or it will randomly choose a partition. A consumer will then subscribe to a topic and will automatically be assigned to one or more partitions by Kafka. The consumer will then start receiving messages from the assigned partitions. One very important aspect to note is that **the number of consumers within a consumer group must not exceed the number of partition for a given topic**. If you have more consumers than number of partitions, then the consumers will be hanging with no messages to consume.
If we use semantic partitioning, we don't generally assign partition numbers to messages. Instead we use 'keyed messages', which define how messages are grouped (by key) but how those keys map to partitions is up to Kafka.
Adding on this, if messages don't have partition keys, Kafka will assign the message to a partition via round-robin (so technically not randomly).
Good point. Will update these.
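To make the keyed vs. round-robin behaviour concrete, here is an illustrative sketch using the console producer bundled with Kafka (the topic name `test-topic` is invented for the example, not one of Sentry's topics):

```shell
# Without keys, the producer spreads messages across partitions (round-robin-style)
docker compose run --rm kafka kafka-console-producer --bootstrap-server kafka:9092 --topic test-topic

# With keys ("key:value" per line), all messages sharing a key land in the same partition
docker compose run --rm kafka kafka-console-producer --bootstrap-server kafka:9092 --topic test-topic \
  --property parse.key=true --property key.separator=:
```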
Each messages in a topic will then have an "offset" (number), this would easily translates to "index" in an array. The offset will be used by the consumer to track where it is in the log, and what's the last message it has consumed. If the consumer is not able to keep up with the producer, it will start to lag behind. Most of the times, we want "lag" to be as low as possible, meaning we don't want to have so many unprocessed messages. The easy solution would be adding more partitions and increasing the number of consumers.
Do you want to mention that offsets are scoped to a partition, and that each partition in a topic will have the same offset numbers?
Yes I should be mentioning that
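As a rough illustration of per-partition offsets and lag (the `ingest-consumer` group name is borrowed from the docs above; substitute whichever group you are inspecting):

```shell
# Reports, per partition: CURRENT-OFFSET (consumer position),
# LOG-END-OFFSET (newest offset in the partition), and LAG (their difference)
docker compose run --rm kafka kafka-consumer-groups --bootstrap-server kafka:9092 \
  --group ingest-consumer --describe
```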
I could be wrong, but I think if kafka runs out of disk space the service crashes (which wouldn't cause `offset out of range` on consumers).
Technically speaking, you're right. But after a disk out of space incident, there will potentially be a massive offset out of range error.
Perhaps this should be made clearer.
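If the disk-space scenario is suspected, a quick (illustrative; volume names may differ per setup) check from the host:

```shell
# Free space on the filesystems mounted into the kafka container
docker compose exec kafka df -h

# Sizes of Docker volumes (the Kafka data volume is typically named sentry-kafka in self-hosted)
docker system df -v
```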
On the inside, when a message enters a topic, it would be written to a certain partition. You can think partition as physical boxes that stores messages for a specific topic, each topic will have their own separate & dedicated partitions. In a distributed Kafka setup, each partition might be stored on a different machine/node, but if you only have a single Kafka instance, then all the partitions are stored on the same machine.
Suggested change: On the inside, when a message enters a topic, it will be written to a certain partition. You can think of a partition as a physical box that stores messages for a specific topic. In a distributed Kafka setup, each partition might be stored on a different machine/node, but if you only have a single Kafka instance, then all the partitions are stored on the same machine.
When a producer sends a message to a topic, it will either stick to a certain partition number (example: partition 1, partition 2, etc.) or it will randomly choose a partition. A consumer will then subscribe to a topic and will automatically be assigned to one or more partitions by Kafka. The consumer will then start receiving messages from the assigned partitions. One very important aspect to note is that **the number of consumers within a consumer group must not exceed the number of partition for a given topic**. If you have more consumers than number of partitions, then the consumers will be hanging with no messages to consume.
Suggested change: When a producer sends a message to a topic, it will either stick to a certain partition number (example: partition 1, partition 2, etc.) or it will randomly choose a partition. A consumer will then subscribe to a topic and will automatically be assigned to one or more partitions by Kafka. The consumer will then start receiving messages from the assigned partitions. **Important to note: the number of consumers cannot exceed the number of partitions**. If you have more consumers than partitions, the extra consumers will receive no messages.
Each messages in a topic will then have an "offset" (number), this would easily translates to "index" in an array. The offset will be used by the consumer to track where it is in the log, and what's the last message it has consumed. If the consumer is not able to keep up with the producer, it will start to lag behind. Most of the times, we want "lag" to be as low as possible, meaning we don't want to have so many unprocessed messages. The easy solution would be adding more partitions and increasing the number of consumers.
Suggested change: Each message in a topic will have an "offset" (number). You can think of this like an "index" in an array. The offset will be used by the consumer to track where it is in the log, and what's the last message it has consumed. If the consumer is not able to keep up with the producer, it will start to lag behind. Most of the time, we want "lag" to be as low as possible. The easiest solution to lagging is adding more partitions and increasing the number of consumers.
Suggested change: append "Learn more about [lagging](/self-hosted/troubleshooting/kafka/#consumers-lagging-behind)." at the end of the paragraph.
There is some wording that could use cleaning up. I left a few suggestions. Overall, I love the additional information about partitions and consumers.
Co-authored-by: Shannon Anahata <[email protected]>
1. Set offset to latest and execute:

```diff
- docker compose run --rm kafka kafka-consumer-groups --bootstrap-server kafka:9092 --all-groups --all-topics --reset-offsets --to-latest --execute
+ docker compose exec kafka kafka-consumer-groups --bootstrap-server kafka:9092 --all-groups --all-topics --reset-offsets --to-latest --execute
```
Bug: Kafka Commands Fail When Container Stopped

The Kafka consumer group commands now use `docker compose exec kafka` instead of `docker compose run --rm kafka`. This change assumes the Kafka container is running, which can be problematic during troubleshooting (e.g., for "Offset Out Of Range" errors) when the container might be stopped or unhealthy. The previous `run --rm` approach was more robust as it could execute commands even if the main container was down.
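A related note for anyone following the reset procedure: `kafka-consumer-groups` also supports a dry run, so the effect can be previewed before committing with `--execute` (a sketch, using the same command as above):

```shell
# Preview what the reset would do; no offsets are actually changed
docker compose run --rm kafka kafka-consumer-groups --bootstrap-server kafka:9092 \
  --all-groups --all-topics --reset-offsets --to-latest --dry-run
```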