Skip to content

Generic labels in metrics, usage in events and q/tx#816

Closed
xmariachi wants to merge 7 commits intodevfrom
diego/engn-3570-monitoring-onchain-add-labels-refactor
Closed

Generic labels in metrics, usage in events and q/tx#816
xmariachi wants to merge 7 commits intodevfrom
diego/engn-3570-monitoring-onchain-add-labels-refactor

Conversation

@xmariachi
Copy link
Contributor

@xmariachi xmariachi commented May 6, 2025

Purpose of Changes and their Description

  • Metrics methods refactor to provide labels generically
  • Modified events, tx and queries metrics to send labels where appropriate (eg topicId, nonce, address, etc).

Are these changes tested and documented?

  • If tested, please describe how. If not, why tests are not needed. -- Unit tests updated, tested with integration tests.
  • If documented, please describe where. If not, describe why docs are not needed. -- Added to the emissions module README (also making it worthwhile - previous one was incorrect based on a copy from simd)
  • Added to Unreleased section of CHANGELOG.md?

@xmariachi xmariachi changed the title Add labels to events with metrics refactor Generic labels in metrics, usage in events and q/tx May 6, 2025
@xmariachi xmariachi marked this pull request as ready for review May 6, 2025 20:23
Copy link
Contributor

@amimart amimart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey I've let some remarks,

My main concern is about metrics, having labels that can carry a lot of different values can be dangerous for the node's memory, I even think it is a bad idea to have topic ids in metrics..

If we really need such labels I think we should consider implementing a metrics exporter that would expose them by watching the chain's state


labels := map[string]string{
"topic_id": strconv.FormatUint(msg.TopicId, 10),
"address": msg.Sender,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should avoid labels with high cardinality, I think setting address as label would create too much metric entries which can impacts node performance

Comment on lines +59 to +65
labels := map[string]string{
"address": msg.ReputerValueBundle.ValueBundle.Reputer,
"topic_id": strconv.FormatUint(msg.ReputerValueBundle.ValueBundle.TopicId, 10),
"nonce": strconv.FormatUint(uint64(msg.ReputerValueBundle.ValueBundle.ReputerRequestNonce.ReputerNonce.BlockHeight), 10),
"blockHeight": strconv.FormatInt(blockHeight, 10),
}
defer metrics.RecordMetrics("InsertReputerPayload", time.Now(), &err, labels)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have multiple remarks here:

  • The time.Now() being executed here would provide wrong latency measurement;
  • The address, nonce, blockHeight introduce a very high cardinality, that can be dangerous;

I think we should have only one defer statement at the beginning, and if we need to evaluate vars at execution we can use defer func() {} instead


moduleParams, err := ms.k.GetParams(ctx)
if err != nil {
defer metrics.RecordMetrics("InsertReputerPayload", time.Now(), &err, map[string]string{"error": err.Error()})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having error messages as label can creates a lot of metric entries and are hard to use (e.g. error msgs can contains values creating multiple error labels for the same one), it's preferable to use error codes instead

Comment on lines +18 to +22
labels := map[string]string{
"worker_address": req.WorkerAddress,
"topic_id": strconv.FormatUint(req.TopicId, 10),
}
defer metrics.RecordMetrics("GetWorkerLatestInferenceByTopicId", time.Now(), &err, labels)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it needed to have labels on queries? I understand the need of labels when it's related to a mutation in the state but I'm not sure it's relevant for queries

Comment on lines +72 to +74
if labels == nil {
labels = make(map[string]string)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if labels == nil {
labels = make(map[string]string)
}
if labels == nil {
labels = make(map[string]string)
}

We can read from a nil map so let it be nil

return
}
metrics.IncrProducerEventCount(metrics.INFERER_SCORE_EVENT)
metrics.IncrProducerEventCountWithLabels(metrics.INFERER_SCORE_EVENT, map[string]string{"topic_id": fmt.Sprintf("%d", topicId)})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe better to use strconv.FormatUint as it is done in other places

@xmariachi
Copy link
Contributor Author

Hey I've let some remarks,

My main concern is about metrics, having labels that can carry a lot of different values can be dangerous for the node's memory, I even think it is a bad idea to have topic ids in metrics..

If we really need such labels I think we should consider implementing a metrics exporter that would expose them by watching the chain's state

Thanks for the review. Yeah you're totally right about the high cardinality. I'm reviewing this. TopicId would be very useful, but the concern stays and we may just want to get that from offchain data only, in which case this PR may just be discarded.

Copy link

@felipead felipead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Diego,

I agree with Arnaud's argument on the node's memory.

Can we do a simpler PR just with the TopicID and the essentials for monitoring mainnet?

I am afraid the metrics exported, although the ideal solution, will not be ready on time.

@xmariachi
Copy link
Contributor Author

It was decided to move out from adding cardinality to the node and use offchain data for topicId-based alerts. Closing PR.

@xmariachi xmariachi closed this May 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants