snuba-subscription-consumer-* containers are failing continuously #6137

Open
sree-warrier opened this issue Jul 20, 2024 · 14 comments

sree-warrier commented Jul 20, 2024

Self-Hosted Version

23.11.2

CPU Architecture

x86_64

Docker Version

NA

Docker Compose Version

NA

Steps to Reproduce

The following containers are crashing continuously. Are these services used for alerting? We are a little confused about what these services do.

snuba-subscription-consumer-events
snuba-subscription-consumer-metrics
snuba-subscription-consumer-transactions

Logs:

2024-07-20 15:57:07,088 Initializing Snuba...
2024-07-20 15:57:10,884 Snuba initialization took 3.7952772620010364s
{"module": "builtins", "event": "Checking Clickhouse connections", "severity": "info", "timestamp": "2024-07-20T15:57:10.897290Z"}
2024-07-20 15:57:10,966 New partitions assigned: {Partition(topic=Topic(name='snuba-commit-log'), index=0): 0, Partition(topic=Topic(name='snuba-commit-log'), index=1): 0, Partition(topic=Topic(name='snuba-commit-log'), index=2): 0, Partition(topic=Topic(name='snuba-commit-log'), index=3): 0, Partition(topic=Topic(name='snuba-commit-log'), index=4): 0}
2024-07-20 15:57:10,979 Caught exception, shutting down...
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/processor.py", line 294, in run
    self._run_once()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/processor.py", line 382, in _run_once
    self.__processing_strategy.submit(message)
  File "/usr/src/snuba/snuba/subscriptions/scheduler_processing_strategy.py", line 240, in submit
    self.__next_step.submit(message)
  File "/usr/src/snuba/snuba/subscriptions/combined_scheduler_executor.py", line 275, in submit
    tasks.extend([task for task in entity_scheduler[tick.partition].find(tick)])
KeyError: 2
2024-07-20 15:57:10,981 Closing <snuba.subscriptions.scheduler_consumer.CommitLogTickConsumer object at 0x7cba41de5f70>...
2024-07-20 15:57:10,983 Partitions to revoke: [Partition(topic=Topic(name='snuba-commit-log'), index=0), Partition(topic=Topic(name='snuba-commit-log'), index=1), Partition(topic=Topic(name='snuba-commit-log'), index=2), Partition(topic=Topic(name='snuba-commit-log'), index=3), Partition(topic=Topic(name='snuba-commit-log'), index=4)]
2024-07-20 15:57:10,983 Partition revocation complete.
2024-07-20 15:57:10,987 Processor terminated
Traceback (most recent call last):
  File "/usr/local/bin/snuba", line 33, in <module>
    sys.exit(load_entry_point('snuba', 'console_scripts', 'snuba')())
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/usr/src/snuba/snuba/cli/subscriptions_scheduler_executor.py", line 153, in subscriptions_scheduler_executor
    processor.run()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/processor.py", line 294, in run
    self._run_once()
  File "/usr/local/lib/python3.8/site-packages/arroyo/processing/processor.py", line 382, in _run_once
    self.__processing_strategy.submit(message)
  File "/usr/src/snuba/snuba/subscriptions/scheduler_processing_strategy.py", line 240, in submit
    self.__next_step.submit(message)
  File "/usr/src/snuba/snuba/subscriptions/combined_scheduler_executor.py", line 275, in submit
    tasks.extend([task for task in entity_scheduler[tick.partition].find(tick)])
KeyError: 2

The alerting system was working fine. We then made a few changes to the Kafka partitions, after which only these 3 containers went down.

  • Initially increased the Kafka partition count for ingest-events and events from 1 to 5 for scale testing
  • Only these 3 services went down, with the error above
  • Followed the steps in the issue "'Number of Errors' alert rules not triggering" (self-hosted#2067)
  • Tried clearing all lag and offsets; that still did not work
  • Recreated the topics as per the solution in that issue; that did not work either
  • Deleted all existing alerts and recreated them; alerts now work, but the containers are still in a failed state

We are now a little confused about what these services do. Which service is currently serving the alerts?

We suspect a partition mismatch (please correct us if this is unrelated), so we have increased every topic to 5 partitions. Reviewing the topic configs now, three topics (snuba-commit-log, events-subscription-results, and ingest-monitors) have a ReplicationFactor of 3, while all other topics have a ReplicationFactor of 1. All other configs remain the same.
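
The `KeyError: 2` in the traceback is consistent with this kind of mismatch. Below is a minimal sketch of the failure mode, not Snuba's actual code: it assumes a per-partition lookup table is built for one partition count while ticks arrive from a commit log topic with more partitions. The names `build_schedulers` and `find_tasks` are hypothetical.

```python
from typing import Dict, List


def build_schedulers(partition_count: int) -> Dict[int, List[str]]:
    """Build one placeholder scheduler entry per known partition index."""
    return {
        index: [f"subscription-for-partition-{index}"]
        for index in range(partition_count)
    }


def find_tasks(schedulers: Dict[int, List[str]], tick_partition: int) -> List[str]:
    """Look up the scheduler for the partition a tick came from.

    Raises KeyError when the tick's partition index has no scheduler.
    """
    return schedulers[tick_partition]


# Hypothetical setup: schedulers exist for 1 partition only,
# but a tick arrives from partition index 2 of the commit log.
schedulers = build_schedulers(partition_count=1)

try:
    find_tasks(schedulers, tick_partition=2)
    failure = None
except KeyError as exc:
    failure = f"KeyError: {exc}"
```

In this sketch, any tick from a partition index outside the range the table was built for fails on the lookup, which would crash the process on every restart until the two counts agree.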

Also, when listing the consumer groups, we see that the following have no active members:

Consumer group 'snuba-transactions-subscriptions-consumers' has no active members.
Consumer group 'snuba-events-subscriptions-consumers' has no active members.
Consumer group 'sentry-commit-log-6e1d91f6451a11ef8ad962551908ad8e' has no active members.
Consumer group 'nuba-metrics-subscriptions-consumers' has no active members.
Consumer group 'sentry-commit-log-12e82a30451a11efb933c2a760684d4c' has no active members.

Let us know if any other information is needed.

Expected Result

NA

Actual Result

NA

Event ID

No response

@IanWoodard IanWoodard transferred this issue from getsentry/self-hosted Jul 23, 2024
@mcannizz mcannizz self-assigned this Sep 13, 2024
@untitaker
Member

I think this may be a duplicate of #5855 (comment)

chipzzz commented Dec 31, 2024

Any updates on this? I was able to fix snuba-subscription-consumer-transactions by recreating the corresponding topic, but that did not work for snuba-subscription-consumer-metrics.

@getsantry getsantry bot moved this to Waiting for: Product Owner in GitHub Issues with 👀 3 Dec 31, 2024
chipzzz commented Dec 31, 2024

@mcannizz

chipzzz commented Jan 2, 2025

Is this simply because Sentry is currently not polling this data, as it's not needed? Is that why Kafka reports no active members? Even so, the crash loop should be fixed.

UPDATE:
However, once I reset the offsets (or whenever the pod is not crashing), the group does have active members, and the problem persists.

@untitaker
Member

Please refer to the issue comment I linked above and ensure you are not running the commit-log topic with more than one partition.
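
One way to check this is to scan the output of a topic describe command for commit-log topics with more than one partition. The sketch below assumes summary lines of the form `Topic: <name> ... PartitionCount: <n> ...`, as printed by Kafka's `kafka-topics.sh --describe`; the exact format can vary between Kafka versions, and the sample text and the second topic name are illustrative, not output from this deployment.

```python
import re
from typing import Dict, List

# Assumed summary-line format; per-partition detail lines contain
# "Partition:" rather than "PartitionCount:" and are skipped.
SUMMARY_LINE = re.compile(r"Topic:\s*(\S+).*?PartitionCount:\s*(\d+)")


def partition_counts(describe_output: str) -> Dict[str, int]:
    """Extract {topic name: partition count} from describe output."""
    counts: Dict[str, int] = {}
    for line in describe_output.splitlines():
        match = SUMMARY_LINE.search(line)
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts


def multi_partition_commit_logs(counts: Dict[str, int]) -> List[str]:
    """Return commit-log topics that have more than one partition."""
    return sorted(
        name for name, count in counts.items()
        if "commit-log" in name and count > 1
    )


# Illustrative sample only.
sample = (
    "Topic: snuba-commit-log\tPartitionCount: 5\tReplicationFactor: 3\tConfigs: \n"
    "\tTopic: snuba-commit-log\tPartition: 0\tLeader: 1\tReplicas: 1,2,3\tIsr: 1,2,3\n"
    "Topic: example-commit-log\tPartitionCount: 1\tReplicationFactor: 1\tConfigs: \n"
)

offenders = multi_partition_commit_logs(partition_counts(sample))
```

Any topic name that ends up in `offenders` is a candidate for the single-partition advice above.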

@seborys40

I have had all commit-log topics set to 1 partition and 1 replica all along, and they are all defined in topic_partition_counts and set to 1. This problem appeared while I was doing routine operations such as editing topic_partition_counts and updating basic configs.

I can't get past this: events is fine, but metrics and transactions are not.

@getsantry getsantry bot moved this to Waiting for: Product Owner in GitHub Issues with 👀 3 Jan 3, 2025
chipzzz commented Jan 3, 2025

Sorry, I was logged in to another account.
Correction: events is fine and transactions is fine, but metrics is not.

chipzzz commented Jan 3, 2025

@untitaker, is this at all related to the removal of the beta metrics feature from Sentry? We recently removed the metrics beta feature. I wonder if some components can now be removed from the deployment, although I see the metrics topics are still receiving data.

What exactly are these responsible for?

  • sentry-snuba-subscription-consumer-metrics
  • sentry-snuba-subscription-consumer-transactions
  • sentry-snuba-subscription-consumer-events

Member

untitaker commented Jan 3, 2025

@chipzzz metrics is for release health (crashed sessions etc in releases tab), generic-metrics is for the beta metrics feature you mention, transactions is for Performance product in general, events is for errors

generally, deployments with "subscription" in the name are for alerts. if you don't need alerts on crashed sessions/performance data/errors respectively, you can just remove those deployments

if you have further questions like this I suggest filing a separate issue from this one, which should be focused on the bugs IMO
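
The mapping described in this comment can be written down as a small lookup. This is only a sketch: the deployment names are the three listed in the question above, the descriptions paraphrase the comment, and `removable_deployments` is a hypothetical helper.

```python
from typing import Dict, List, Set

# Which alert data each subscription consumer deployment serves,
# paraphrased from the comment above.
ALERT_DATA_BY_DEPLOYMENT: Dict[str, str] = {
    "sentry-snuba-subscription-consumer-metrics": "crashed sessions",
    "sentry-snuba-subscription-consumer-transactions": "performance data",
    "sentry-snuba-subscription-consumer-events": "errors",
}


def removable_deployments(alerts_needed: Set[str]) -> List[str]:
    """Return deployments whose alert data is not in `alerts_needed`."""
    return sorted(
        name for name, data in ALERT_DATA_BY_DEPLOYMENT.items()
        if data not in alerts_needed
    )


# Example: only error alerts are wanted.
candidates = removable_deployments({"errors"})
```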

chipzzz commented Jan 3, 2025

@untitaker, I am still using all of the aforementioned except beta metrics. It's still unclear what else could be causing this.

@getsantry getsantry bot moved this to Waiting for: Product Owner in GitHub Issues with 👀 3 Jan 3, 2025
chipzzz commented Jan 7, 2025

This may be related to this
getsentry/self-hosted#3106 (comment)

chipzzz commented Jan 7, 2025

For reference including #2666

chipzzz commented Jan 7, 2025

chipzzz commented Jan 8, 2025

Resolved the issue.

These consumers

  • sentry-snuba-subscription-consumer-metrics
  • sentry-snuba-subscription-consumer-transactions
  • sentry-snuba-subscription-consumer-events

also depend on other topics, not just the commit-log topics. These are:

  • snuba-metrics
  • transactions
  • events

In my UAT environment, these topics had an increased number of partitions, but I did not have a matching number of consumers to consume from all of them, hence the KeyError.

So the number of partitions must match the number of corresponding consumers. I had tested with other topics and consumers, but was not aware that the snuba-metrics, transactions, and events topics were also involved.

However, I am not sure how this problem surfaced, as it had always been set up this way.
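
A minimal sketch of the check this implies, assuming you can obtain both the partition counts the consumers are configured for (for example from `topic_partition_counts`, mentioned earlier in the thread) and the actual counts reported by Kafka. The function name and all values below are hypothetical.

```python
from typing import Dict, List, Tuple


def partition_mismatches(
    expected: Dict[str, int], actual: Dict[str, int]
) -> List[Tuple[str, int, int]]:
    """Return (topic, expected, actual) for topics whose counts differ.

    Topics missing from `actual` are skipped rather than reported.
    """
    return [
        (topic, expected[topic], actual[topic])
        for topic in sorted(expected)
        if topic in actual and expected[topic] != actual[topic]
    ]


# Hypothetical values: what the consumers expect vs. what Kafka has.
expected_counts = {"events": 1, "transactions": 1, "snuba-metrics": 1}
actual_counts = {"events": 1, "transactions": 1, "snuba-metrics": 3}

mismatches = partition_mismatches(expected_counts, actual_counts)
```

An empty `mismatches` list means the three topics named above agree with the configured counts; any entry points at a topic worth inspecting before restarting the subscription consumers.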
