docs(self-hosted): provide more insights on troubleshooting kafka #15131
Conversation
Turns out most of our self-hosted users have never touched Kafka before, so it's a good idea to introduce them to how Kafka works. Also added how to increase consumer replicas if they're lagging behind.
2. Receive consumers list:

```diff
- docker compose run --rm kafka kafka-consumer-groups --bootstrap-server kafka:9092 --list
+ docker compose exec kafka kafka-consumer-groups --bootstrap-server kafka:9092 --list
```
Potential bug: The change from `docker compose run` to `docker compose exec` will cause commands to fail, as the troubleshooting guide requires containers to be stopped first.

- Description: The troubleshooting documentation for Kafka offset resets instructs users to first stop consumer containers. However, subsequent steps were changed from `docker compose run --rm kafka` to `docker compose exec kafka`. The `exec` command requires the target container to be running. In a troubleshooting scenario where the `kafka` container may be stopped or unhealthy, or if the user has stopped it as part of the procedure, these commands will fail with an error like "container is not running". This breaks the documented recovery workflow, preventing users from resetting Kafka offsets.
- Suggested fix: Revert the commands from `docker compose exec kafka` back to `docker compose run --rm kafka`. The `run` command creates a new container for the command, which works regardless of whether the main `kafka` service container is running.

severity: 0.7, confidence: 0.95
No, we're not stopping the Kafka container.
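For readers following this thread, the practical difference between the two invocations is only where the command runs. A quick sketch, assuming the `kafka` service from the self-hosted `docker-compose.yml`:

```shell
# Executes the tool inside the already-running `kafka` container;
# errors out if that container is not running.
docker compose exec kafka kafka-consumer-groups --bootstrap-server kafka:9092 --list

# Creates a throwaway container from the kafka image just for this command;
# it still needs the broker at kafka:9092 to be reachable.
docker compose run --rm kafka kafka-consumer-groups --bootstrap-server kafka:9092 --list
```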
```yaml
services:
  events-consumer:
    deploy:
      replicas: 3
```

This will increase the number of consumers for the `ingest-consumer` consumer group to 3.
Potential bug: The recommended scaling method using `deploy.replicas` is silently ignored in standalone Docker Compose, meaning no scaling will actually occur.

- Description: The documentation suggests scaling Kafka consumers using the `deploy.replicas` key in a `docker-compose.override.yml` file. However, the `deploy` key is only effective in Docker Swarm mode. Self-hosted Sentry installations use standalone Docker Compose, which silently ignores this configuration. As a result, users following these instructions will not actually scale their consumers, and the underlying performance issues like consumer lag will persist, despite the user believing they have applied a fix.
- Suggested fix: Remove the instructions for using `deploy.replicas`. Replace them with the correct method for scaling services in standalone Docker Compose, which typically involves defining additional, uniquely named service entries in the `docker-compose.override.yml` file.

severity: 0.8, confidence: 0.98
> However, the deploy key is only effective in Docker Swarm mode.

Wrong. It works on Docker Compose.
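For context (not part of the PR): recent Docker Compose v2 releases do apply `deploy.replicas` outside Swarm when the service is (re)created, and `--scale` is an ad-hoc alternative. A sketch reusing the `events-consumer` service name from the snippet above:

```shell
# Recreate the service so deploy.replicas from docker-compose.override.yml takes effect
docker compose up -d events-consumer

# Or scale on the fly without editing the override file
docker compose up -d --scale events-consumer=3
```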
Co-authored-by: Kevin Pfeifer <[email protected]>
This looks waaaay better than the old version but I'm not qualified to give proper feedback. Still unblocking as I think it is miles better than whatever we have currently.
This section is aimed for those who have Kafka problems, but are not yet familiar with Kafka. At a high level, it is a message broker which stores message in a log (or in an easier language: very similar to an array) format. It receives messages from producers that aimed to a specific topic, and then sends them to consumers that are subscribed to that topic. The consumers can then process the messages.

This happens where Kafka and the consumers get out of sync. Possible reasons are:

On the inside, when a message enters a topic, it would be written to a certain partition. You can think partition as physical boxes that stores messages for a specific topic, each topic will have their own separate & dedicated partitions. In a distributed Kafka setup, each partition might be stored on a different machine/node, but if you only have a single Kafka instance, then all the partitions are stored on the same machine.
This paragraph has a bit of overlap with the next one and does not add much to understanding kafka IMO, so I think we could remove it completely.
I don't mind leaving this here. At least I want to emphasize the existence of "partition"
Co-authored-by: Joris Bayer <[email protected]>
```log
Exception: KafkaError{code=OFFSET_OUT_OF_RANGE,val=1,str="Broker: Offset out of range"}
```

This section is aimed for those who have Kafka problems, but are not yet familiar with Kafka. At a high level, Kafka is a message broker which stores messages in a log (or in an easier language: very similar to an array) format. It receives messages from producers that write to a specific topic, and then sends them to consumers that are subscribed to that topic. The consumers can then process the messages.
If you want to be pedantic, Kafka doesn't send messages to consumers. Consumers poll Kafka and fetch new messages.
Too pedantic and not so beginner friendly 😅
1. Running out of disk space or memory
2. Having a sustained event spike that causes very long processing times, causing Kafka to drop messages as they go past the retention time
3. Date/time out of sync issues due to a restart or suspend/resume cycle

When a producer sends a message to a topic, it will either stick to a certain partition number (example: partition 1, partition 2, etc.) or it will randomly choose a partition. A consumer will then subscribe to a topic and will automatically be assigned to one or more partitions by Kafka. The consumer will then start receiving messages from the assigned partitions. One very important aspect to note is that **the number of consumers within a consumer group must not exceed the number of partition for a given topic**. If you have more consumers than number of partitions, then the consumers will be hanging with no messages to consume.
If we use semantic partitioning, we don't generally assign partition numbers to messages. Instead we use 'keyed messages', which define how messages are grouped (by key) but how those keys map to partitions is up to Kafka.
Adding on this, if messages don't have partition keys, Kafka will assign the message to a partition via round-robin (so technically not randomly).
Good point. Will update these.
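To make the keyed vs. round-robin behaviour concrete, here is an illustrative sketch using the console producer bundled with Kafka (the topic name `test-topic` is invented for the example, not one of Sentry's topics):

```shell
# Without keys, the producer spreads messages across partitions (round-robin-style)
docker compose run --rm kafka kafka-console-producer --bootstrap-server kafka:9092 --topic test-topic

# With keys ("key:value" per line), all messages sharing a key land in the same partition
docker compose run --rm kafka kafka-console-producer --bootstrap-server kafka:9092 --topic test-topic \
  --property parse.key=true --property key.separator=:
```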
Each messages in a topic will then have an "offset" (number), this would easily translates to "index" in an array. The offset will be used by the consumer to track where it is in the log, and what's the last message it has consumed. If the consumer is not able to keep up with the producer, it will start to lag behind. Most of the times, we want "lag" to be as low as possible, meaning we don't want to have so many unprocessed messages. The easy solution would be adding more partitions and increasing the number of consumers.
Do you want to mention that offsets are scoped to a partition, and that each partition in a topic will have the same offset numbers?
Yes I should be mentioning that
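As a rough illustration of per-partition offsets and lag (the `ingest-consumer` group name is borrowed from the docs above; substitute whichever group you are inspecting):

```shell
# Reports, per partition: CURRENT-OFFSET (consumer position),
# LOG-END-OFFSET (newest offset in the partition), and LAG (their difference)
docker compose run --rm kafka kafka-consumer-groups --bootstrap-server kafka:9092 \
  --group ingest-consumer --describe
```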
I could be wrong, but I think if kafka runs out of disk space the service crashes (which wouldn't cause `offset out of range` on consumers).
Technically speaking, you're right. But after a disk out of space incident, there will potentially be a massive offset out of range error.
Perhaps this should be made clearer.
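If the disk-space scenario is suspected, a quick (illustrative; volume names may differ per setup) check from the host:

```shell
# Free space on the filesystems mounted into the kafka container
docker compose exec kafka df -h

# Sizes of Docker volumes (the Kafka data volume is typically named sentry-kafka in self-hosted)
docker system df -v
```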
On the inside, when a message enters a topic, it would be written to a certain partition. You can think partition as physical boxes that stores messages for a specific topic, each topic will have their own separate & dedicated partitions. In a distributed Kafka setup, each partition might be stored on a different machine/node, but if you only have a single Kafka instance, then all the partitions are stored on the same machine.
Suggested change: On the inside, when a message enters a topic, it will be written to a certain partition. You can think of a partition as a physical box that stores messages for a specific topic. In a distributed Kafka setup, each partition might be stored on a different machine/node, but if you only have a single Kafka instance, then all the partitions are stored on the same machine.
When a producer sends a message to a topic, it will either stick to a certain partition number (example: partition 1, partition 2, etc.) or it will randomly choose a partition. A consumer will then subscribe to a topic and will automatically be assigned to one or more partitions by Kafka. The consumer will then start receiving messages from the assigned partitions. One very important aspect to note is that **the number of consumers within a consumer group must not exceed the number of partition for a given topic**. If you have more consumers than number of partitions, then the consumers will be hanging with no messages to consume.
Suggested change: When a producer sends a message to a topic, it will either stick to a certain partition number (example: partition 1, partition 2, etc.) or it will randomly choose a partition. A consumer will then subscribe to a topic and will automatically be assigned to one or more partitions by Kafka. The consumer will then start receiving messages from the assigned partitions. **Important to note: the number of consumers cannot exceed the number of partitions**. If you have more consumers than partitions, the extra consumers will receive no messages.
Each messages in a topic will then have an "offset" (number), this would easily translates to "index" in an array. The offset will be used by the consumer to track where it is in the log, and what's the last message it has consumed. If the consumer is not able to keep up with the producer, it will start to lag behind. Most of the times, we want "lag" to be as low as possible, meaning we don't want to have so many unprocessed messages. The easy solution would be adding more partitions and increasing the number of consumers.
Suggested change: Each message in a topic will have an "offset" (number). You can think of this like an "index" in an array. The offset will be used by the consumer to track where it is in the log, and what's the last message it has consumed. If the consumer is not able to keep up with the producer, it will start to lag behind. Most of the time, we want "lag" to be as low as possible. The easiest solution to lagging is adding more partitions and increasing the number of consumers.
Suggested change: append "Learn more about [lagging](/self-hosted/troubleshooting/kafka/#consumers-lagging-behind)." at the end of the paragraph.
There is some wording that could use cleaning up. I left a few suggestions. Overall, I love the additional information about partitions and consumers.
Co-authored-by: Shannon Anahata <[email protected]>
1. Set offset to latest and execute:

```diff
- docker compose run --rm kafka kafka-consumer-groups --bootstrap-server kafka:9092 --all-groups --all-topics --reset-offsets --to-latest --execute
+ docker compose exec kafka kafka-consumer-groups --bootstrap-server kafka:9092 --all-groups --all-topics --reset-offsets --to-latest --execute
```
Bug: Kafka Commands Fail When Container Stopped

The Kafka consumer group commands now use `docker compose exec kafka` instead of `docker compose run --rm kafka`. This change assumes the Kafka container is running, which can be problematic during troubleshooting (e.g., for "Offset Out Of Range" errors) when the container might be stopped or unhealthy. The previous `run --rm` approach was more robust as it could execute commands even if the main container was down.
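A related note for anyone following the reset procedure: `kafka-consumer-groups` also supports a dry run, so the effect can be previewed before committing with `--execute` (a sketch, using the same command as above):

```shell
# Preview what the reset would do; no offsets are actually changed
docker compose run --rm kafka kafka-consumer-groups --bootstrap-server kafka:9092 \
  --all-groups --all-topics --reset-offsets --to-latest --dry-run
```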