NETOBSERV-2504: Add DNS name to metrics #2361

Open
jotak wants to merge 1 commit into netobserv:main from jotak:dns-metrics

@jotak
Member

@jotak jotak commented Jan 22, 2026

Description

  • Create a new counter metric, "*_dns_packets_total", carrying the DNS
    name label.
  • Move the response code label out of the latency histogram and onto that
    new metric. This reduces the metric cardinality: for each label
    combination, a histogram exposes one series per bucket (plus sum and
    count), whereas a counter exposes a single series.
  • Make the scrape interval configurable. (I wanted to check whether
    changing the FLP scrape period from 15s to 30s had a visible impact on
    Prometheus resource usage. The results weren't conclusive, so I didn't
    change the defaults; however, it's still something we can open up for
    configuration.)
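
To make the cardinality argument concrete, here is a rough back-of-the-envelope sketch. It assumes classic Prometheus histograms and a fully populated label space; the label counts and bucket count are hypothetical, not taken from the NetObserv metric definitions or from any measurement. The last lines apply the usual approximation that ingested samples per second equal the number of active series divided by the scrape interval.

```python
def histogram_series_count(label_combinations: int, finite_buckets: int) -> int:
    """Series exposed by a classic histogram: per label combination, one series
    per finite bucket, one for the +Inf bucket, plus _sum and _count."""
    return label_combinations * (finite_buckets + 3)


def counter_series_count(label_combinations: int) -> int:
    """A counter exposes exactly one series per label combination."""
    return label_combinations


def samples_per_second(series: int, scrape_interval_seconds: float) -> float:
    """Approximate ingestion rate: each series yields one sample per scrape."""
    return series / scrape_interval_seconds


# Hypothetical label space, for illustration only.
namespaces, response_codes, dns_names, buckets = 50, 4, 8, 10

# Response code on the latency histogram (layout before this change).
current = histogram_series_count(namespaces * response_codes, buckets)

# DNS name added to that same histogram (the approach put on hold).
naive = histogram_series_count(namespaces * response_codes * dns_names, buckets)

# Histogram keeps only the namespace; counter carries response code and DNS name.
split = (histogram_series_count(namespaces, buckets)
         + counter_series_count(namespaces * response_codes * dns_names))

print(current, naive, split)
print(samples_per_second(split, 15), samples_per_second(split, 30))
```

With these made-up numbers the split layout ends up below the current one, but that depends on how many distinct DNS names are observed: a large enough name set makes the counter alone exceed the current total.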

Dependencies

n/a

Checklist

If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.

  • Is this PR backed with a JIRA ticket? If so, make sure it is written as a title prefix (in general, PRs affecting the NetObserv/Network Observability product should be backed with a JIRA ticket - especially if they bring user facing changes).
  • Does this PR require product documentation?
    • If so, make sure the JIRA epic is labeled with "documentation" and provides a description relevant for doc writers, such as use cases or scenarios. Any required step to activate or configure the feature should be documented there, such as new CRD knobs.
  • Does this PR require a product release notes entry?
    • If so, fill in "Release Note Text" in the JIRA.
  • Is there anything else the QE team should know before testing? E.g: configuration changes, environment setup, etc.
    • If so, make sure it is described in the JIRA ticket.
  • QE requirements (check 1 from the list):
    • Standard QE validation, with pre-merge tests unless stated otherwise.
    • Regression tests only (e.g. refactoring with no user-facing change).
    • No QE (e.g. trivial change with high reviewer's confidence, or per agreement with the QE team).

openshift-ci bot commented Jan 22, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign jotak for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jotak jotak added the ok-to-test label Jan 22, 2026
@github-actions

New images:

  • quay.io/netobserv/network-observability-operator:c75c6dd
  • quay.io/netobserv/network-observability-operator-bundle:v0.0.0-sha-c75c6dd
  • quay.io/netobserv/network-observability-operator-catalog:v0.0.0-sha-c75c6dd

They will expire after two weeks.

To deploy this build:

# Direct deployment, from operator repo
IMAGE=quay.io/netobserv/network-observability-operator:c75c6dd make deploy

# Or using operator-sdk
operator-sdk run bundle quay.io/netobserv/network-observability-operator-bundle:v0.0.0-sha-c75c6dd

Or as a Catalog Source:

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: netobserv-dev
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/netobserv/network-observability-operator-catalog:v0.0.0-sha-c75c6dd
  displayName: NetObserv development catalog
  publisher: Me
  updateStrategy:
    registryPoll:
      interval: 1m

@github-actions github-actions bot removed the ok-to-test label Jan 22, 2026
@jotak jotak added the ok-to-test label Jan 22, 2026
@jotak jotak changed the title from "Add DNS name to metrics" to "NETOBSERV-2504: Add DNS name to metrics" Jan 22, 2026
@openshift-ci-robot
Collaborator

openshift-ci-robot commented Jan 22, 2026

@jotak: This pull request references NETOBSERV-2504 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.22.0" version, but no target version was set.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@github-actions

New images:

  • quay.io/netobserv/network-observability-operator:d9d80be
  • quay.io/netobserv/network-observability-operator-bundle:v0.0.0-sha-d9d80be
  • quay.io/netobserv/network-observability-operator-catalog:v0.0.0-sha-d9d80be

They will expire after two weeks.

To deploy this build:

# Direct deployment, from operator repo
IMAGE=quay.io/netobserv/network-observability-operator:d9d80be make deploy

# Or using operator-sdk
operator-sdk run bundle quay.io/netobserv/network-observability-operator-bundle:v0.0.0-sha-d9d80be

Or as a Catalog Source:

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: netobserv-dev
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/netobserv/network-observability-operator-catalog:v0.0.0-sha-d9d80be
  displayName: NetObserv development catalog
  publisher: Me
  updateStrategy:
    registryPoll:
      interval: 1m

@jotak
Member Author

jotak commented Jan 22, 2026

/hold
That's definitely increasing the cardinality.
But there may still be a way to have that. Currently we use the DNS latency metric for several purposes (error codes and now qname). That's not ideal because it's a histogram metric with a bunch of buckets, so the cardinality is multiplied by the number of buckets.
We could instead create a new counter for error codes and qnames. I think the cardinality would be lower, but that requires some refactoring on the console side.
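
As an illustration of the bucket multiplication described above, the following sketch lists the series that a classic Prometheus histogram and a counter would expose for the same label values. The metric names, label names, label values and bucket bounds are all hypothetical placeholders, not the ones NetObserv uses.

```python
from itertools import product


def histogram_series(name, labels, bucket_bounds):
    """List every series a classic histogram exposes for the given label values."""
    keys = sorted(labels)
    series = []
    for combo in product(*(labels[k] for k in keys)):
        base = ",".join(f'{k}="{v}"' for k, v in zip(keys, combo))
        sep = "," if base else ""
        # One series per finite bucket, plus the implicit +Inf bucket.
        for le in [*map(str, bucket_bounds), "+Inf"]:
            series.append(f'{name}_bucket{{{base}{sep}le="{le}"}}')
        # Plus the running sum and the observation count.
        series.append(f"{name}_sum{{{base}}}")
        series.append(f"{name}_count{{{base}}}")
    return series


def counter_series(name, labels):
    """List every series a counter exposes: one per label combination."""
    keys = sorted(labels)
    return [
        f"{name}{{" + ",".join(f'{k}="{v}"' for k, v in zip(keys, combo)) + "}"
        for combo in product(*(labels[k] for k in keys))
    ]


# Hypothetical label values and bucket bounds, for illustration only.
labels = {"namespace": ["a", "b"], "rcode": ["NoError", "NXDomain"]}
as_histogram = histogram_series("example_dns_latency_seconds", labels, [0.005, 0.01, 0.025])
as_counter = counter_series("example_dns_packets_total", labels)

print(len(as_histogram), len(as_counter))
print(as_histogram[0])
print(as_counter[0])
```

Each extra label value on the histogram adds a full set of bucket, sum and count series, while on the counter it adds a single series.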

@jotak
Member Author

jotak commented Jan 23, 2026

Doing some cardinality tests: splitting into two metrics improves the cardinality even compared to the current state (as I'm also moving error_code out of the latency metric, hence trading a histogram for a counter).

[Screenshot from 2026-01-23 10-00-44]

(Those are count({__name__=~"netobserv_namespace_dns_latency.*"}) and count(netobserv_namespace_dns_packets_total).)
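
For anyone wanting to repeat this comparison outside the Prometheus UI, here is a minimal sketch that builds instant-query URLs for the two expressions above and reads the series count out of a response. It relies on the standard Prometheus HTTP API endpoint /api/v1/query; the base URL is a hypothetical placeholder, and the sample payload is a made-up response used only to exercise the parser, not a measurement from this PR.

```python
import json
from urllib.parse import urlencode

# The two expressions quoted in the comment above.
LATENCY_QUERY = 'count({__name__=~"netobserv_namespace_dns_latency.*"})'
PACKETS_QUERY = "count(netobserv_namespace_dns_packets_total)"


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a Prometheus instant-query URL for the given expression."""
    query_string = urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + query_string


def series_count(payload: str) -> int:
    """Extract the number returned by a count(...) instant query."""
    body = json.loads(payload)
    if body.get("status") != "success":
        raise ValueError("query failed: " + str(body.get("error")))
    result = body["data"]["result"]
    # count() over an empty selection returns an empty vector.
    if not result:
        return 0
    # Each vector element holds a [timestamp, "value"] pair.
    return int(float(result[0]["value"][1]))


# Hypothetical Prometheus address; replace with a real one to run the queries.
BASE_URL = "http://prometheus.example:9090"
packets_url = instant_query_url(BASE_URL, PACKETS_QUERY)
latency_url = instant_query_url(BASE_URL, LATENCY_QUERY)

# Made-up response body in the instant-query vector format.
sample = ('{"status": "success", "data": {"resultType": "vector", '
          '"result": [{"metric": {}, "value": [1769158844, "42"]}]}}')
print(packets_url)
print(series_count(sample))
```

Fetching the URLs (for example with urllib.request.urlopen) and passing each response body to series_count would give the two numbers being compared.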

@github-actions github-actions bot removed the ok-to-test label Jan 23, 2026
@openshift-ci-robot
Collaborator

openshift-ci-robot commented Jan 23, 2026

@jotak: This pull request references NETOBSERV-2504 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.22.0" version, but no target version was set.

