Replies: 6 comments 4 replies
-
Regarding Question 1, I have some thoughts:
-
hi, the
-
@tomsun28 Thank you for your response. Yes, I also lean toward your perspective, and I will organize this part of the content in the coming days. Regarding Question 2, I'm considering whether we need to incorporate cardinality detection functionality. I've noted that Prometheus already has a related query solution (#11945), and I've observed that VictoriaMetrics provides cardinality detection capabilities as well. Got any good ideas?
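For context, one place where such cardinality statistics are already queryable is Prometheus' `/api/v1/status/tsdb` endpoint, which reports head-block series counts and the top metric/label cardinalities. A minimal, illustrative Java sketch of fetching it (the Prometheus address is an assumption, not part of this proposal):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class TsdbCardinalityProbe {

    // Assumed local Prometheus address; adjust to the actual deployment.
    private static final String STATUS_TSDB_URL = "http://localhost:9090/api/v1/status/tsdb";

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(STATUS_TSDB_URL)).GET().build();

        // The response JSON contains headStats (numSeries, numLabelPairs, ...) and
        // top-N lists such as seriesCountByMetricName and seriesCountByLabelValuePair,
        // which is the kind of data a cardinality-detection view would surface.
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```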
-
+1, I agree that we should focus on deep integration.
-
After some thought, perhaps we could collect this kind of cardinality statistic for alert labels more appropriately by using HertzBeat's own metrics (#3641), through cardinality sampling before each alert evaluation (a rough sketch of one possible implementation approach is given at the end of this comment).

BTW, why discuss this issue on its own? Here's a real example: if I currently have the following metrics (10,000 REST APIs, 20 jobs, 10 environments), then this rule could theoretically generate 200,000 tag combinations. When the number of tag combinations becomes extremely large, it leads to the so-called high-cardinality problem. Alert systems, or internal alert modules, maintain a state (firing/resolved) for each unique tag combination, and that is the source of high-cardinality real-time alerts.

Although most of these interfaces follow the pattern /item/query/{itemId}, such issues typically stem from underlying metric problems. While the root cause is not difficult to identify, giving users analytical capabilities as a preventive measure can effectively avert these issues in advance. I therefore believe this is an easily overlooked yet highly impactful issue for monitoring system performance and maintainability (state explosion, push/aggregation pressure, query and display degradation, aggregation analysis, storage pressure, etc.). With this capability in place, users could even perform additional downsampling, throttling, or discarding when the sampled cardinality becomes abnormally high.
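Here is a minimal sketch of what such pre-alert cardinality sampling could look like. The class and method names are hypothetical (this is not HertzBeat's actual API); it simply counts the distinct label-combination fingerprints a rule has produced, so the value can be exposed through HertzBeat's own metrics and checked before each evaluation:

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical helper: tracks how many distinct label combinations each
 * alert rule has produced, so the count can be reported as a HertzBeat
 * metric and consulted before creating new alert instances.
 */
public class AlertLabelCardinalitySampler {

    // ruleId -> set of label-combination fingerprints seen so far
    private final Map<Long, Set<String>> fingerprintsByRule = new ConcurrentHashMap<>();

    /** Record one label combination and return the rule's current cardinality. */
    public int sample(long ruleId, Map<String, String> labels) {
        // Sort the keys so the same combination always yields the same fingerprint.
        String fingerprint = new TreeMap<>(labels).toString();
        Set<String> seen = fingerprintsByRule.computeIfAbsent(ruleId, id -> ConcurrentHashMap.newKeySet());
        seen.add(fingerprint);
        return seen.size();
    }

    /** Current distinct label-combination count for a rule. */
    public int cardinality(long ruleId) {
        Set<String> seen = fingerprintsByRule.get(ruleId);
        return seen == null ? 0 : seen.size();
    }

    public static void main(String[] args) {
        AlertLabelCardinalitySampler sampler = new AlertLabelCardinalitySampler();
        sampler.sample(1L, Map.of("uri", "/item/query/1", "job", "demo"));
        sampler.sample(1L, Map.of("uri", "/item/query/2", "job", "demo"));
        System.out.println(sampler.cardinality(1L)); // 2 distinct label combinations
    }
}
```

A calculator could call sample(...) before generating an alert and then downsample, throttle, or drop once the returned value crosses a configured threshold.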
-
@Duansg You mean we observe its alert sampling status through HertzBeat's own metric data? +1 👍
-
As described in issue #3828, I reviewed the relevant logic and found the following issues:
1. In MetricsRealTimeAlertCalculator, if a metric contains multiple rows of values and no tags are set for the current metric, identical fingerprints may be generated. This can lead to alert information being overwritten or abnormal alert statuses.
2. In MetricsPeriodicAlertCalculator, if the calculation returns multiple result entries that include high-cardinality label values (e.g., trace_id, user_id, create_time), this not only generates a large volume of alert messages but also leads to abnormal alert statuses.

In Question 1, if information overwriting and state confusion occur, I consider that unacceptable: I cannot even correctly distinguish valid exception messages, which may lead to misunderstandings for users.
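To make the Question 1 issue concrete, here is an illustrative sketch (not HertzBeat's actual code; the names and the fingerprint function are assumptions) showing how rows collapse onto one fingerprint when a multi-row metric carries no row-level tags, so later alerts overwrite earlier ones:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FingerprintCollisionDemo {

    /** Simplified stand-in: the fingerprint is derived only from the shared labels. */
    static String fingerprint(Map<String, String> commonLabels) {
        return Integer.toHexString(commonLabels.hashCode());
    }

    public static void main(String[] args) {
        // One monitor, one metric, two value rows, and no per-row tags configured.
        Map<String, String> commonLabels = Map.of("instance", "demo-app", "metrics", "cpu");
        List<String> rows = List.of("core0 usage=95", "core1 usage=10");

        // Pending alerts keyed by fingerprint: the second row silently replaces the first.
        Map<String, String> pendingAlerts = new HashMap<>();
        for (String row : rows) {
            pendingAlerts.put(fingerprint(commonLabels), "alert for " + row);
        }
        System.out.println(pendingAlerts); // only one entry survives
    }
}
```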
In Question 2, high cardinality may lead to increased resource consumption in the TSDB, because it typically implies a large number of active time series. This can result in excessive memory usage, increased write latency, and other issues.
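As a small illustration of why labels such as trace_id are problematic (the numbers and label names below are made up), every new trace_id value produces another unique label set, and therefore another alert state / active time series to keep around:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.UUID;

public class HighCardinalityLabelDemo {
    public static void main(String[] args) {
        Set<Map<String, String>> withTraceId = new HashSet<>();
        Set<Map<String, String>> withoutTraceId = new HashSet<>();

        // Simulate 10,000 periodic calculation result rows for the same query.
        for (int i = 0; i < 10_000; i++) {
            String traceId = UUID.randomUUID().toString();
            withTraceId.add(Map.of("app", "demo", "trace_id", traceId));
            withoutTraceId.add(Map.of("app", "demo"));
        }

        // Every row becomes its own alert state when trace_id is kept as a label...
        System.out.println("states with trace_id:    " + withTraceId.size());    // 10000
        // ...but collapses to a single state when it is dropped or aggregated away.
        System.out.println("states without trace_id: " + withoutTraceId.size()); // 1
    }
}
```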
You are welcome to join the discussion.