Conversation

Collaborator

@aldy505 aldy505 commented Oct 4, 2025

DESCRIBE YOUR PR

Turns out most of our self-hosted users have never touched Kafka before, so it's a good idea to introduce them to how Kafka works.

Also added how to increase consumer replicas if they're lagging behind.

IS YOUR CHANGE URGENT?

Help us prioritize incoming PRs by letting us know when the change needs to go live.

  • Urgent deadline (GA date, etc.):
  • Other deadline:
  • None: Not urgent, can wait up to 1 week+

SLA

  • Teamwork makes the dream work, so please add a reviewer to your PRs.
  • Please give the docs team up to 1 week to review your PR unless you've added an urgent due date to it.
    Thanks in advance for your help!

PRE-MERGE CHECKLIST

Make sure you've checked the following before merging your changes:

  • Checked Vercel preview for correctness, including links
  • PR was reviewed and approved by any necessary SMEs (subject matter experts)
  • PR was reviewed and approved by a member of the Sentry docs team

LEGAL BOILERPLATE

Look, I get it. The entity doing business as "Sentry" was incorporated in the State of Delaware in 2015 as Functional Software, Inc. and is gonna need some rights from me in order to utilize my contributions in this here PR. So here's the deal: I retain all rights, title and interest in and to my contributions, and by keeping this boilerplate intact I confirm that Sentry can use, modify, copy, and redistribute my contributions, under Sentry's choice of terms.


vercel bot commented Oct 4, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Preview | Comments | Updated (UTC) |
| --- | --- | --- | --- | --- |
| develop-docs | Ready | Preview | Comment | Oct 8, 2025 0:14am |

1 Skipped Deployment

| Project | Deployment | Preview | Comments | Updated (UTC) |
| --- | --- | --- | --- | --- |
| sentry-docs | Ignored | Preview |  | Oct 8, 2025 0:14am |

@aldy505 aldy505 requested review from hubertdeng123 and BYK October 4, 2025 03:31
2. Retrieve the list of consumer groups:
```diff
-docker compose run --rm kafka kafka-consumer-groups --bootstrap-server kafka:9092 --list
+docker compose exec kafka kafka-consumer-groups --bootstrap-server kafka:9092 --list
```
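Once the group names are listed, a natural follow-up (a sketch, not part of this PR) is to inspect a group's per-partition lag with `--describe`; the `ingest-consumer` group name below is only an example:

```shell
# Show, for each partition the group consumes: the committed offset
# (CURRENT-OFFSET), the newest offset in the partition (LOG-END-OFFSET),
# and their difference (LAG). "ingest-consumer" is an example group name;
# substitute one returned by --list.
docker compose exec kafka kafka-consumer-groups \
  --bootstrap-server kafka:9092 \
  --describe --group ingest-consumer
```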
Contributor
Potential bug: The change from docker compose run to docker compose exec will cause commands to fail, as the troubleshooting guide requires containers to be stopped first.
  • Description: The troubleshooting documentation for Kafka offset resets instructs users to first stop consumer containers. However, subsequent steps were changed from docker compose run --rm kafka to docker compose exec kafka. The exec command requires the target container to be running. In a troubleshooting scenario where the kafka container may be stopped or unhealthy, or if the user has stopped it as part of the procedure, these commands will fail with an error like "container is not running". This breaks the documented recovery workflow, preventing users from resetting Kafka offsets.

  • Suggested fix: Revert the commands from docker compose exec kafka back to docker compose run --rm kafka. The run command creates a new container for the command, which works regardless of whether the main kafka service container is running.
    severity: 0.7, confidence: 0.95


Collaborator Author

No, we're not stopping the Kafka container.

Comment on lines +167 to +172
```yaml
services:
  events-consumer:
    deploy:
      replicas: 3
```
This will increase the number of consumers for the `ingest-consumer` consumer group to 3.
Contributor

Potential bug: The recommended scaling method using deploy.replicas is silently ignored in standalone Docker Compose, meaning no scaling will actually occur.
  • Description: The documentation suggests scaling Kafka consumers using the deploy.replicas key in a docker-compose.override.yml file. However, the deploy key is only effective in Docker Swarm mode. Self-hosted Sentry installations use standalone Docker Compose, which silently ignores this configuration. As a result, users following these instructions will not actually scale their consumers, and the underlying performance issues like consumer lag will persist, despite the user believing they have applied a fix.

  • Suggested fix: Remove the instructions for using deploy.replicas. Replace them with the correct method for scaling services in standalone Docker Compose, which typically involves defining additional, uniquely named service entries in the docker-compose.override.yml file.
    severity: 0.8, confidence: 0.98


Collaborator Author

> However, the deploy key is only effective in Docker Swarm mode.

Wrong. It works on Docker Compose.
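A minimal way to verify this locally (a sketch assuming standalone Docker Compose v2, which honors `deploy.replicas` on `docker compose up` outside Swarm mode):

```shell
# Hypothetical override file next to docker-compose.yml; the service name
# "events-consumer" matches the example in the docs change above.
cat > docker-compose.override.yml <<'EOF'
services:
  events-consumer:
    deploy:
      replicas: 3
EOF

# Recreate the service; Compose v2 starts 3 containers for it.
docker compose up -d events-consumer

# Confirm that three replicas are running.
docker compose ps events-consumer
```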

aldy505 and others added 2 commits October 4, 2025 20:37
Co-authored-by: Kevin Pfeifer <[email protected]>
Co-authored-by: Kevin Pfeifer <[email protected]>
Member

@BYK BYK left a comment


This looks waaaay better than the old version but I'm not qualified to give proper feedback. Still unblocking as I think it is miles better than whatever we have currently.

This section is aimed for those who have Kafka problems, but are not yet familiar with Kafka. At a high level, it is a message broker which stores message in a log (or in an easier language: very similar to an array) format. It receives messages from producers that aimed to a specific topic, and then sends them to consumers that are subscribed to that topic. The consumers can then process the messages.

This happens where Kafka and the consumers get out of sync. Possible reasons are:
On the inside, when a message enters a topic, it would be written to a certain partition. You can think partition as physical boxes that stores messages for a specific topic, each topic will have their own separate & dedicated partitions. In a distributed Kafka setup, each partition might be stored on a different machine/node, but if you only have a single Kafka instance, then all the partitions are stored on the same machine.
Member

This paragraph has a bit of overlap with the next one and does not add much to understanding kafka IMO, so I think we could remove it completely.

Collaborator Author

I don't mind leaving this here. At least I want to emphasize the existence of "partition"

aldy505 and others added 2 commits October 7, 2025 17:17
```log
Exception: KafkaError{code=OFFSET_OUT_OF_RANGE,val=1,str="Broker: Offset out of range"}
```
This section is aimed for those who have Kafka problems, but are not yet familiar with Kafka. At a high level, Kafka is a message broker which stores messages in a log (or in an easier language: very similar to an array) format. It receives messages from producers that write to a specific topic, and then sends them to consumers that are subscribed to that topic. The consumers can then process the messages.
Member

If you want to be pedantic, Kafka doesn't send messages to consumers. Consumers poll kafka and fetch new messages.

Collaborator Author

Too pedantic and not so beginner friendly 😅

1. Running out of disk space or memory
2. Having a sustained event spike that causes very long processing times, causing Kafka to drop messages as they go past the retention time
3. Date/time out of sync issues due to a restart or suspend/resume cycle
When a producer sends a message to a topic, it will either stick to a certain partition number (example: partition 1, partition 2, etc.) or it will randomly choose a partition. A consumer will then subscribe to a topic and will automatically be assigned to one or more partitions by Kafka. The consumer will then start receiving messages from the assigned partitions. One very important aspect to note is that **the number of consumers within a consumer group must not exceed the number of partition for a given topic**. If you have more consumers than number of partitions, then the consumers will be hanging with no messages to consume.
Member

If we use semantic partitioning, we don't generally assign partition numbers to messages. Instead we use 'keyed messages', which define how messages are grouped (by key) but how those keys map to partitions is up to Kafka.

Member

Adding on this, if messages don't have partition keys, kafka will assign the message to a partition via round-robin (so technically not randomly)

Collaborator Author

Good point. Will update these.

3. Date/time out of sync issues due to a restart or suspend/resume cycle
When a producer sends a message to a topic, it will either stick to a certain partition number (example: partition 1, partition 2, etc.) or it will randomly choose a partition. A consumer will then subscribe to a topic and will automatically be assigned to one or more partitions by Kafka. The consumer will then start receiving messages from the assigned partitions. One very important aspect to note is that **the number of consumers within a consumer group must not exceed the number of partition for a given topic**. If you have more consumers than number of partitions, then the consumers will be hanging with no messages to consume.

Each messages in a topic will then have an "offset" (number), this would easily translates to "index" in an array. The offset will be used by the consumer to track where it is in the log, and what's the last message it has consumed. If the consumer is not able to keep up with the producer, it will start to lag behind. Most of the times, we want "lag" to be as low as possible, meaning we don't want to have so many unprocessed messages. The easy solution would be adding more partitions and increasing the number of consumers.
Member

Do you want to mention that offsets are scoped to a partition, and that each partition in a topic will have the same offset numbers?

Collaborator Author

Yes I should be mentioning that


This happens where Kafka and the consumers get out of sync. Possible reasons are:

1. Running out of disk space or memory
Member

I could be wrong, but I think if kafka runs out of disk space the service crashes (which wouldn't cause offset out of range on consumers)

Collaborator Author

Technically speaking, you're right. But after a disk out of space incident, there will potentially be a massive offset out of range error.

Perhaps this should be made clearer.


This section is aimed for those who have Kafka problems, but are not yet familiar with Kafka. At a high level, Kafka is a message broker which stores messages in a log (or in an easier language: very similar to an array) format. It receives messages from producers that write to a specific topic, and then sends them to consumers that are subscribed to that topic. The consumers can then process the messages.

This happens where Kafka and the consumers get out of sync. Possible reasons are:
On the inside, when a message enters a topic, it would be written to a certain partition. You can think partition as physical boxes that stores messages for a specific topic, each topic will have their own separate & dedicated partitions. In a distributed Kafka setup, each partition might be stored on a different machine/node, but if you only have a single Kafka instance, then all the partitions are stored on the same machine.
Contributor

Suggested change
On the inside, when a message enters a topic, it would be written to a certain partition. You can think partition as physical boxes that stores messages for a specific topic, each topic will have their own separate & dedicated partitions. In a distributed Kafka setup, each partition might be stored on a different machine/node, but if you only have a single Kafka instance, then all the partitions are stored on the same machine.
On the inside, when a message enters a topic, it will be written to a certain partition. You can think of a partition as a physical box that stores messages for a specific topic. In a distributed Kafka setup, each partition might be stored on a different machine/node, but if you only have a single Kafka instance, then all the partitions are stored on the same machine.

1. Running out of disk space or memory
2. Having a sustained event spike that causes very long processing times, causing Kafka to drop messages as they go past the retention time
3. Date/time out of sync issues due to a restart or suspend/resume cycle
When a producer sends a message to a topic, it will either stick to a certain partition number (example: partition 1, partition 2, etc.) or it will randomly choose a partition. A consumer will then subscribe to a topic and will automatically be assigned to one or more partitions by Kafka. The consumer will then start receiving messages from the assigned partitions. One very important aspect to note is that **the number of consumers within a consumer group must not exceed the number of partition for a given topic**. If you have more consumers than number of partitions, then the consumers will be hanging with no messages to consume.
Contributor

Suggested change
When a producer sends a message to a topic, it will either stick to a certain partition number (example: partition 1, partition 2, etc.) or it will randomly choose a partition. A consumer will then subscribe to a topic and will automatically be assigned to one or more partitions by Kafka. The consumer will then start receiving messages from the assigned partitions. One very important aspect to note is that **the number of consumers within a consumer group must not exceed the number of partition for a given topic**. If you have more consumers than number of partitions, then the consumers will be hanging with no messages to consume.
When a producer sends a message to a topic, it will either stick to a certain partition number (example: partition 1, partition 2, etc.) or it will randomly choose a partition. A consumer will then subscribe to a topic and will automatically be assigned to one or more partitions by Kafka. The consumer will then start receiving messages from the assigned partitions. **Important to note: the number of consumers cannot exceed the number of partitions**. If you have more consumers than partitions, the extra consumers will receive no messages.

3. Date/time out of sync issues due to a restart or suspend/resume cycle
When a producer sends a message to a topic, it will either stick to a certain partition number (example: partition 1, partition 2, etc.) or it will randomly choose a partition. A consumer will then subscribe to a topic and will automatically be assigned to one or more partitions by Kafka. The consumer will then start receiving messages from the assigned partitions. One very important aspect to note is that **the number of consumers within a consumer group must not exceed the number of partition for a given topic**. If you have more consumers than number of partitions, then the consumers will be hanging with no messages to consume.

Each messages in a topic will then have an "offset" (number), this would easily translates to "index" in an array. The offset will be used by the consumer to track where it is in the log, and what's the last message it has consumed. If the consumer is not able to keep up with the producer, it will start to lag behind. Most of the times, we want "lag" to be as low as possible, meaning we don't want to have so many unprocessed messages. The easy solution would be adding more partitions and increasing the number of consumers.
Contributor

Suggested change
Each messages in a topic will then have an "offset" (number), this would easily translates to "index" in an array. The offset will be used by the consumer to track where it is in the log, and what's the last message it has consumed. If the consumer is not able to keep up with the producer, it will start to lag behind. Most of the times, we want "lag" to be as low as possible, meaning we don't want to have so many unprocessed messages. The easy solution would be adding more partitions and increasing the number of consumers.
Each message in a topic will have an "offset" (number). You can think of this like an "index" in an array. The offset will be used by the consumer to track where it is in the log, and what's the last message it has consumed. If the consumer is not able to keep up with the producer, it will start to lag behind. Most of the time, we want "lag" to be as low as possible. The easiest solution to lagging is adding more partitions and increasing the number of consumers.

3. Date/time out of sync issues due to a restart or suspend/resume cycle
When a producer sends a message to a topic, it will either stick to a certain partition number (example: partition 1, partition 2, etc.) or it will randomly choose a partition. A consumer will then subscribe to a topic and will automatically be assigned to one or more partitions by Kafka. The consumer will then start receiving messages from the assigned partitions. One very important aspect to note is that **the number of consumers within a consumer group must not exceed the number of partition for a given topic**. If you have more consumers than number of partitions, then the consumers will be hanging with no messages to consume.

Each messages in a topic will then have an "offset" (number), this would easily translates to "index" in an array. The offset will be used by the consumer to track where it is in the log, and what's the last message it has consumed. If the consumer is not able to keep up with the producer, it will start to lag behind. Most of the times, we want "lag" to be as low as possible, meaning we don't want to have so many unprocessed messages. The easy solution would be adding more partitions and increasing the number of consumers.
Contributor

Suggested change
Each messages in a topic will then have an "offset" (number), this would easily translates to "index" in an array. The offset will be used by the consumer to track where it is in the log, and what's the last message it has consumed. If the consumer is not able to keep up with the producer, it will start to lag behind. Most of the times, we want "lag" to be as low as possible, meaning we don't want to have so many unprocessed messages. The easy solution would be adding more partitions and increasing the number of consumers.
Each messages in a topic will then have an "offset" (number), this would easily translates to "index" in an array. The offset will be used by the consumer to track where it is in the log, and what's the last message it has consumed. If the consumer is not able to keep up with the producer, it will start to lag behind. Most of the times, we want "lag" to be as low as possible, meaning we don't want to have so many unprocessed messages. The easy solution would be adding more partitions and increasing the number of consumers. Learn more about [lagging](/self-hosted/troubleshooting/kafka/#consumers-lagging-behind).

Contributor

@sfanahata sfanahata left a comment


There is some wording that could use cleaning up. I left a few suggestions. Overall, I love the additional information about partitions and consumers.

1. Set offset to latest and execute:
```diff
-docker compose run --rm kafka kafka-consumer-groups --bootstrap-server kafka:9092 --all-groups --all-topics --reset-offsets --to-latest --execute
+docker compose exec kafka kafka-consumer-groups --bootstrap-server kafka:9092 --all-groups --all-topics --reset-offsets --to-latest --execute
```
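A cautious variant worth noting (not part of this PR): `kafka-consumer-groups` accepts `--dry-run` in place of `--execute`, which prints the offsets that would be reset without applying them:

```shell
# Preview the reset: prints the target offset per topic-partition
# but commits nothing. Swap --dry-run for --execute to apply.
docker compose exec kafka kafka-consumer-groups \
  --bootstrap-server kafka:9092 \
  --all-groups --all-topics \
  --reset-offsets --to-latest --dry-run
```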

Bug: Kafka Commands Fail When Container Stopped

The Kafka consumer group commands now use docker compose exec kafka instead of docker compose run --rm kafka. This change assumes the Kafka container is running, which can be problematic during troubleshooting (e.g., for "Offset Out Of Range" errors) when the container might be stopped or unhealthy. The previous run --rm approach was more robust as it could execute commands even if the main container was down.

