Replies: 6 comments 4 replies
-
Regarding Question 1, I have some thoughts:
-
hi, the
-
@tomsun28 Thank you for your response. Yes, I also lean toward your perspective, and I will organize this part of the content in the coming days. Regarding Question 2, I'm considering whether we need to incorporate cardinality detection functionality. I've noted that Prometheus already has a related query solution (#11945), and I've observed that VictoriaMetrics provides cardinality detection capabilities as well. Got any good ideas?
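For context, one place where such cardinality statistics are already queryable is Prometheus' `/api/v1/status/tsdb` endpoint, which reports head-block series counts and the top metric/label cardinalities. A minimal, illustrative Java sketch of fetching it (the Prometheus address is an assumption, not part of this proposal):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class TsdbCardinalityProbe {

    // Assumed local Prometheus address; adjust to the actual deployment.
    private static final String STATUS_TSDB_URL = "http://localhost:9090/api/v1/status/tsdb";

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(STATUS_TSDB_URL)).GET().build();

        // The response JSON contains headStats (numSeries, numLabelPairs, ...) and
        // top-N lists such as seriesCountByMetricName and seriesCountByLabelValuePair,
        // which is the kind of data a cardinality-detection view would surface.
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```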
-
+1, I agree that we should focus on deep integration.
-
After some thought, perhaps we could collect this kind of cardinality statistic for alert labels more appropriately by using HertzBeat's own metrics (#3641), through cardinality sampling before each alert evaluation (a rough sketch of one possible implementation approach is given at the end of this comment).

BTW, why discuss this issue on its own? Here's a real example: if I currently have the following metrics (10,000 REST APIs, 20 jobs, 10 environments), then this rule could theoretically generate 200,000 tag combinations. When the number of tag combinations becomes extremely large, it leads to the so-called high-cardinality problem. Alert systems, or internal alert modules, maintain a state (firing/resolved) for each unique tag combination, and that is the source of high-cardinality real-time alerts.

Although most of these interfaces follow the pattern /item/query/{itemId}, such issues typically stem from underlying metric problems. While the root cause is not difficult to identify, giving users analytical capabilities as a preventive measure can effectively avert these issues in advance. I therefore believe this is an easily overlooked yet highly impactful issue for monitoring system performance and maintainability (state explosion, push/aggregation pressure, query and display degradation, aggregation analysis, storage pressure, etc.). With this capability in place, users could even perform additional downsampling, throttling, or discarding when the sampled cardinality becomes abnormally high.
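Here is a minimal sketch of what such pre-alert cardinality sampling could look like. The class and method names are hypothetical (this is not HertzBeat's actual API); it simply counts the distinct label-combination fingerprints a rule has produced, so the value can be exposed through HertzBeat's own metrics and checked before each evaluation:

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical helper: tracks how many distinct label combinations each
 * alert rule has produced, so the count can be reported as a HertzBeat
 * metric and consulted before creating new alert instances.
 */
public class AlertLabelCardinalitySampler {

    // ruleId -> set of label-combination fingerprints seen so far
    private final Map<Long, Set<String>> fingerprintsByRule = new ConcurrentHashMap<>();

    /** Record one label combination and return the rule's current cardinality. */
    public int sample(long ruleId, Map<String, String> labels) {
        // Sort the keys so the same combination always yields the same fingerprint.
        String fingerprint = new TreeMap<>(labels).toString();
        Set<String> seen = fingerprintsByRule.computeIfAbsent(ruleId, id -> ConcurrentHashMap.newKeySet());
        seen.add(fingerprint);
        return seen.size();
    }

    /** Current distinct label-combination count for a rule. */
    public int cardinality(long ruleId) {
        Set<String> seen = fingerprintsByRule.get(ruleId);
        return seen == null ? 0 : seen.size();
    }

    public static void main(String[] args) {
        AlertLabelCardinalitySampler sampler = new AlertLabelCardinalitySampler();
        sampler.sample(1L, Map.of("uri", "/item/query/1", "job", "demo"));
        sampler.sample(1L, Map.of("uri", "/item/query/2", "job", "demo"));
        System.out.println(sampler.cardinality(1L)); // 2 distinct label combinations
    }
}
```

A calculator could call sample(...) before generating an alert and then downsample, throttle, or drop once the returned value crosses a configured threshold.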
-
@Duansg You mean we observe its alert sampling status through HertzBeat's own metric data? +1 👍
-
As described in issue #3828, I reviewed the relevant logic and found the following issues:
1. In MetricsRealTimeAlertCalculator, if a metric contains multiple rows of values and no tags are set for the current metric, identical fingerprints may be generated. This can lead to alert information being overwritten or abnormal alert statuses.
2. In MetricsPeriodicAlertCalculator, if the calculation returns multiple result entries that include high-cardinality label values (e.g., trace_id, user_id, create_time), this not only generates a large volume of alert messages but also leads to abnormal alert statuses.

In Question 1, if information overwriting and state confusion occur, I consider that unacceptable: I cannot even correctly distinguish valid exception messages, which may lead to misunderstandings for users.
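To make the Question 1 issue concrete, here is an illustrative sketch (not HertzBeat's actual code; the names and the fingerprint function are assumptions) showing how rows collapse onto one fingerprint when a multi-row metric carries no row-level tags, so later alerts overwrite earlier ones:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FingerprintCollisionDemo {

    /** Simplified stand-in: the fingerprint is derived only from the shared labels. */
    static String fingerprint(Map<String, String> commonLabels) {
        return Integer.toHexString(commonLabels.hashCode());
    }

    public static void main(String[] args) {
        // One monitor, one metric, two value rows, and no per-row tags configured.
        Map<String, String> commonLabels = Map.of("instance", "demo-app", "metrics", "cpu");
        List<String> rows = List.of("core0 usage=95", "core1 usage=10");

        // Pending alerts keyed by fingerprint: the second row silently replaces the first.
        Map<String, String> pendingAlerts = new HashMap<>();
        for (String row : rows) {
            pendingAlerts.put(fingerprint(commonLabels), "alert for " + row);
        }
        System.out.println(pendingAlerts); // only one entry survives
    }
}
```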
In Question 2, high cardinality may lead to increased resource consumption in the TSDB, because it typically implies a large number of active time series. This can result in excessive memory usage, increased write latency, and other issues.
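As a small illustration of why labels such as trace_id are problematic (the numbers and label names below are made up), every new trace_id value produces another unique label set, and therefore another alert state / active time series to keep around:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.UUID;

public class HighCardinalityLabelDemo {
    public static void main(String[] args) {
        Set<Map<String, String>> withTraceId = new HashSet<>();
        Set<Map<String, String>> withoutTraceId = new HashSet<>();

        // Simulate 10,000 periodic calculation result rows for the same query.
        for (int i = 0; i < 10_000; i++) {
            String traceId = UUID.randomUUID().toString();
            withTraceId.add(Map.of("app", "demo", "trace_id", traceId));
            withoutTraceId.add(Map.of("app", "demo"));
        }

        // Every row becomes its own alert state when trace_id is kept as a label...
        System.out.println("states with trace_id:    " + withTraceId.size());    // 10000
        // ...but collapses to a single state when it is dropped or aggregated away.
        System.out.println("states without trace_id: " + withoutTraceId.size()); // 1
    }
}
```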
You are welcome to join the discussion.