diff --git a/proposals/0071-Entity/01-context.md b/proposals/0071-Entity/01-context.md new file mode 100644 index 0000000..0e3863a --- /dev/null +++ b/proposals/0071-Entity/01-context.md @@ -0,0 +1,455 @@ +# Supporting Entities in Prometheus + +## Abstract + +This proposal introduces native support for **Entities** in Prometheus—a first-class representation of the things that produce telemetry, distinct from the telemetry they produce. + +Today, Prometheus relies on Info-type metrics to represent metadata about monitored objects: gauges with an `_info` suffix, a constant value of `1`, and labels containing the metadata. But this approach is fundamentally flawed: **the thing that produces metrics is not itself a metric**. A Kubernetes pod, a service instance, or a host are entities with their own identity and lifecycle—they should not be stored as time series with sample values. + +This conflation forces users to rely on verbose `group_left` joins to attach metadata to metrics, creates storage inefficiency for constant values, and loses the semantic distinction between what identifies an entity and what describes it. + +By introducing Entities as a native concept, Prometheus can provide cleaner query ergonomics, optimized storage for metadata, explicit lifecycle management, and proper semantics that distinguish between identifying labels (what makes an entity unique) and descriptive labels (additional context about that entity). + +This proposal also aligns with Prometheus's commitment to being the default store for OpenTelemetry metrics. OpenTelemetry's Entity model provides a well-defined structure for representing monitored objects, and native Entity support enables seamless translation between OTel Entities and Prometheus. + +--- + +## Terminology + +Before diving into the problem and proposed solution, let's establish a shared vocabulary: + +#### Info Metric + +A metric that exposes metadata about a monitored entity rather than a measurement. In current Prometheus convention, these are gauges with a constant value of `1` and labels containing the metadata. Examples include `node_uname_info`, `kube_pod_info`, and `target_info`. + +``` +build_info{version="1.2.3", revision="abc123", goversion="go1.21"} 1 +``` + +#### Entity + +An **Entity** represents a distinct object of interest that produces or is associated with telemetry. Unlike Info metrics, Entities are not metrics—they are first-class objects with their own identity, labels, and lifecycle. + +In OpenTelemetry, an entity is an object of interest that produces telemetry data. Entities represent things like services, hosts, containers, or Kubernetes pods. Each entity has a type (e.g., `k8s.pod`, `host`, `service`) and a set of attributes that describe it. + +This proposal adopts the Entity concept as the native Prometheus representation for what was previously expressed through Info metric conventions. + +#### Resource Attributes + +In OpenTelemetry, **resource attributes** are key-value pairs that describe the entity producing telemetry. These attributes are attached to all telemetry (metrics, logs, traces) from that entity. When OTel metrics are exported to Prometheus, resource attributes typically become labels on a `target_info` metric. + +#### Identifying Labels + +**Identifying labels** uniquely distinguish one entity from another of the same type. 
These labels: +- Must remain constant for the lifetime of the entity +- Together form a unique identifier for the entity +- Are required to identify which entity produced the telemetry + +Examples: +- `k8s.pod.uid` or (`k8s.pod.name`,`k8s.namespace.name`) for a Kubernetes pod +- `host.id` for a host +- `service.instance.id` for a service instance + +#### Descriptive Labels + +**Descriptive labels** provide additional context about an entity but do not serve to uniquely identify it. These labels: +- May change during the entity's lifetime +- Provide useful metadata for querying and visualization +- Are optional and supplementary + +Examples: +- `k8s.pod.label.app_name` (pods labels can change) +- `host.name` (hostnames can change) +- `service.version` (versions change with deployments) + +--- + +## Problem Statement + +### Entities Are Not Metrics + +At the heart of the Info metric pattern lies a conceptual mismatch: **the thing that produces metrics is not itself a metric**. + +Consider a Kubernetes pod. It has an identity (namespace, UID), labels that describe it (name, node, pod labels), a lifecycle (creation time, termination), and it produces telemetry (CPU usage, memory consumption, request counts). The pod is the *source* of metrics—it is not *a* metric. + +Yet today, we represent this pod as a metric: + +```promql +kube_pod_info{namespace="production", pod="api-server-7b9f5", uid="550e8400", node="worker-2"} 1 +``` + +This representation has several conceptual problems: + +1. **The value is meaningless**: The `1` carries no information. It exists only because Prometheus's data model requires a numeric value. +2. **Identity is conflated with data**: All labels are treated equally. There's no distinction between `uid` (which identifies the pod) and `node` (which describes where it's running and could change). +3. **Lifecycle is implicit**: When a pod is deleted and recreated with the same name, Prometheus sees label churn. There's no explicit representation of "this entity ended, a new one began." +4. **Correlation requires workarounds**: To associate the pod's metadata with its metrics, users must write complex `group_left` joins—essentially reconstructing a relationship that should be built into the data model. + +The Prometheus data model was designed for metrics: measurements that change over time, represented as (timestamp, value) pairs with identifying labels. Entities don't fit this model. They have: + +- **Stable identity** (not a stream of values) +- **Mutable descriptions** (labels that change independently of any "sample") +- **Explicit lifecycle** (creation and termination events) +- **Correlation relationships** (many metrics belong to one entity) + +**This proposal introduces Entities as a first-class concept in Prometheus**, separate from metrics, with their own storage, lifecycle management, and query semantics. Info metrics will continue to work for backward compatibility, but new instrumentation and the OTel integration can use proper Entity semantics. + +### The Current Workaround: Info Metrics as Gauges + +Prometheus does not have a native Entity type. Instead, users follow a convention: create a gauge with an `_info` suffix, set its value to `1`, and encode metadata as labels. + +```promql +node_uname_info{nodename="server-1", release="5.15.0", version="#1 SMP", machine="x86_64"} 1 +``` + +While OpenMetrics formally defines an Info type, the Prometheus text exposition format does not support it. 
This means: +- Info metrics consume storage for a constant value (`1`) that carries no information +- There's no semantic distinction between info metrics and regular gauges +- Query engines cannot optimize for the unique characteristics of metadata + +### Joining Info Metrics Requires `group_left` + +The most common use case for info metrics is attaching their labels to other metrics. For example, adding Kubernetes pod metadata to container CPU metrics: + +```promql +container_cpu_usage_seconds_total + * on(namespace, pod) group_left(node, created_by_kind, created_by_name) + kube_pod_info +``` + +This pattern has several problems: + +1. **Verbose**: Every query that needs pod metadata must include the full `group_left` clause. Dashboards with dozens of panels repeat this join logic everywhere. +2. **Error-Prone**: The `on()` clause must list exactly the right matching labels. Miss one, and the join fails silently or produces incorrect results. List too many, and you get "many-to-many matching not allowed" errors. +3. **Confusing Semantics**: The `group_left` modifier is one of the most confusing aspects of PromQL for new users. "Many-to-one matching" and "group modifiers" require significant mental overhead to understand and use correctly. +4. **Fragile to label changes**: If `kube_pod_info` adds a new label, existing queries may break. If a label is removed, dashboards silently lose data. There's no contract about which labels are stable identifiers vs. which are descriptive metadata. + +### No Distinction Between Identifying and Descriptive Labels + +Current info metrics treat all labels equally. There's no way to express that some labels are stable identifiers while others are mutable metadata: + +```promql +kube_pod_info{ + namespace="production", # Identifying: part of pod identity + pod="api-server-7b9f5", # Identifying: part of pod identity + uid="abc-123-def", # Identifying: globally unique + node="worker-2", # Descriptive: can change if rescheduled + created_by_kind="Deployment", # Descriptive: additional context + created_by_name="api-server" # Descriptive: additional context +} 1 +``` + +This lack of distinction causes problems: +- Queries cannot reliably join on "the identity" of an entity +- OTel Entities cannot be accurately translated (OTel's identifying vs descriptive attributes map to our identifying vs descriptive labels) + +### Storage and Lifecycle Are Not Optimized + +Info metrics are stored like any other time series, despite their unique characteristics: +- The value is always `1`—storing it repeatedly wastes space +- Metadata changes infrequently, but samples are scraped every interval +- Staleness handling treats info metrics like measurements, not metadata + +--- + +## Motivation + +### Prometheus's Commitment to OpenTelemetry + +In March 2024, Prometheus announced its commitment to being the default store for OpenTelemetry metrics. This includes: +- Native OTLP ingestion +- UTF-8 support for metric and label names +- Native support for resource attributes + +OpenTelemetry's data model distinguishes between **metric attributes** (dimensions on individual metrics) and **resource attributes** (properties of the entity producing metrics). Currently, Prometheus flattens resource attributes into `target_info` labels, losing the semantic distinction. + +Native Entity support is a important step toward proper resource attribute handling. 
+ +### The Entity Model + +OpenTelemetry's Entity model provides a structured way to represent monitored objects: + +``` +Entity { + type: "k8s.pod" + identifying_attributes: { + "k8s.namespace.name": "production", + "k8s.pod.uid": "abc-123-def" + } + descriptive_attributes: { + "k8s.pod.name": "api-server-7b9f5", + "k8s.node.name": "worker-2", + "k8s.deployment.name": "api-server" + } +} +``` + +This model enables: +- Clear semantics about what identifies an entity +- Lifecycle management (entities can be created, updated, deleted) +- Correlation across telemetry signals (metrics, logs, traces) + +Prometheus can benefit from similar semantics. In this proposal, OTel's "identifying attributes" map to Prometheus identifying labels, and OTel's "descriptive attributes" map to descriptive labels. + +### Users Already Rely on Info Metrics + +Info metrics are a well-established pattern in the Prometheus ecosystem: + +| Metric | Source | Labels | +|--------|--------|--------| +| `node_uname_info` | Node Exporter | `nodename`, `release`, `version`, `machine`, `sysname` | +| `kube_pod_info` | kube-state-metrics | `namespace`, `pod`, `uid`, `node`, `created_by_*`, etc. | +| `kube_node_info` | kube-state-metrics | `node`, `kernel_version`, `os_image`, `container_runtime_version` | +| `target_info` | OTel SDK | All resource attributes | +| `build_info` | Various | `version`, `revision`, `branch`, `goversion` | + +These metrics are used in thousands of dashboards and alerts. Introducing native Entities improves the ergonomics and semantics while maintaining the utility users depend on. + +--- + +## Use Cases + +### Enriching Metrics with Producer Metadata + +A common need in observability is to enrich metrics with information about what produced them. When analyzing CPU usage, you often want to know which version of the software is running, what node a container is scheduled on, or what deployment owns a pod. This context transforms raw numbers into actionable insights. + +**The Problem:** + +Today, this requires complex `group_left` joins between metrics and info metrics: + +```promql +sum by (namespace, pod, node) ( + rate(container_cpu_usage_seconds_total{namespace="production"}[5m]) + * on(namespace, pod) group_left(node) + kube_pod_info +) +``` + +This pattern appears everywhere: adding `build_info` labels to application metrics, enriching host metrics with `node_uname_info`, correlating service metrics with `target_info` from OTel. Every query must: + +- Know which labels to match on (`namespace`, `pod`, `job`, `instance`, etc.) +- Explicitly list which metadata labels to bring in +- Handle edge cases when labels change (pod rescheduling, version upgrades) + + +Users should be able to say "give me this metric, enriched with information about its producer" without writing complex joins. The query engine should understand the relationship between metrics and the entities that produced them. + +With native Entity support, the query engine knows which labels identify an entity and which describe it. Enrichment becomes automatic or requires minimal syntax—no need to manually specify join keys or enumerate which labels to include. 
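+
+As an illustration only (the concrete query syntax is specified in [06-querying.md](./06-querying.md)), the query above could shrink to its aggregation alone, with the `node` label supplied by the correlated `k8s.pod` entity instead of an explicit join:
+
+```promql
+# Hypothetical entity-aware form: no group_left, no manually listed join keys.
+# `node` is a descriptive label of the k8s.pod entity this metric correlates with,
+# so the engine can attach it automatically.
+sum by (namespace, pod, node) (
+  rate(container_cpu_usage_seconds_total{namespace="production"}[5m])
+)
+```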
+ +### OpenTelemetry Resource Translation + +**Current State:** + +When OTel metrics are exported to Prometheus, resource attributes become labels on `target_info`: + +```promql +target_info{ + job="otel-collector", + instance="collector-1:8888", + service_name="payment-service", + service_version="2.1.0", + service_instance_id="i-abc123", + deployment_environment="production", + host_name="prod-vm-42", + host_id="550e8400-e29b-41d4-a716-446655440000" +} 1 +``` + +To use these attributes with application metrics: + +```promql +http_request_duration_seconds_bucket + * on(job, instance) group_left(service_name, service_version, deployment_environment) + target_info +``` + +**Pain Points:** +- OTel distinguishes identifying vs. descriptive attributes; Prometheus loses this +- Entity lifecycle (creation, updates) is not represented +- Every query must know the OTel schema to write correct joins + +**Desired State:** + +Native translation of OTel Entities to Prometheus Entities, where OTel's identifying attributes (like `k8s_pod_uid`) become identifying labels, and OTel's descriptive attributes (like `k8s_pod_annotation_created_by`, `k8s_pod_status`) become descriptive labels. This would preserve the semantic richness of the OTel data model and enable better query ergonomics. + +### Collection Architectures: Direct Scraping vs. Gateways + +Prometheus deployments follow two main patterns for collecting metrics, and this proposal must support both. + +**Direct Scraping** + +In direct scraping, Prometheus discovers and scrapes each target individually. Service Discovery provides accurate metadata about each target, because the target *is* the entity producing metrics. + +``` +┌─────────────┐ +│ Service A │◀────┐ +│ (pod-xyz) │ │ +└─────────────┘ │ + │ scrape ┌───────────┐ +┌─────────────┐ ├──────────▶│ │ +│ Service B │◀────┤ │Prometheus │ +│ (pod-abc) │ │ │ │ +└─────────────┘ │ └───────────┘ + │ +┌─────────────┐ │ +│ Service C │◀────┘ +│ (pod-def) │ +└─────────────┘ +``` + +Here, Kubernetes SD knows that `pod-xyz` runs Service A with specific labels, resource limits, and node placement. This metadata accurately describes the entity producing metrics—SD-derived entities work well. + +**Gateway and Federation** + +In gateway architectures, metrics flow through an intermediary before reaching Prometheus. The intermediary aggregates metrics from multiple sources. + +``` +┌───────────┐ ┌───────────┐ ┌───────────┐ +│ Service A │────▶│ │ │ │ +│ │push │ OTel │──────▶│Prometheus │ +├───────────┤ │ Collector │scrape │ │ +│ Service B │────▶│ │ │ │ +│ │ │(gateway) │ │ │ +├───────────┤ │ │ │ │ +│ Service C │────▶│ │ │ │ +└───────────┘ └───────────┘ └───────────┘ +``` + +Here, SD only sees the OTel Collector—not Services A, B, or C. Any SD-derived metadata would describe the collector, not the actual metric producers. The same applies to Prometheus federation and pushgateway patterns. + +| What SD Sees | What Actually Produced Telemetry | +|--------------|----------------------------------| +| `otel-collector-pod-xyz` | `payment-service`, `auth-service`, `user-service` | +| `prometheus-federation-1` | Hundreds of scraped targets from regional Prometheus | +| `pushgateway-xyz` | Various batch jobs and short-lived processes | +| `kube-state-metrics-0` | Workloads running in K8s and K8s API itself | + +**Supporting Both Models** + +This proposal must support both architectures: + +1. **Direct scraping**: Entity information can be derived from Service Discovery metadata, since SD accurately describes each target. 
+2. **Gateway/federation**: Entity information must be embedded in the exposition format to travel with the metrics through intermediaries. + +Users choose the appropriate approach for their architecture. See [Service Discovery](./04-service-discovery.md) for configuration details. + +--- + +## Goals + +This proposal aims to achieve the following: + +### 1. Define Entity as a Native Concept + +Prometheus should recognize Entities as a distinct concept with their own semantics, separate from metrics. Entities represent the things that produce telemetry, not the telemetry itself. + +### 2. Support Identifying and Descriptive Label Semantics + +Entities should allow declaring which labels are identifying (forming the entity's identity) and which are descriptive (providing additional context that may change over time). + +### 3. Improve Query Ergonomics + +Reduce or eliminate the need for `group_left` when attaching entity labels to related metrics. The common case should be simple. + +### 4. Optimize Storage for Metadata + +Entities store string labels and change infrequently. Storage and ingestion should be optimized for this pattern, rather than treating them as time series with constant values. + +### 5. Enable OTel Entity Translation + +Provide a natural mapping between OpenTelemetry Entities and Prometheus Entities, translating OTel's identifying and descriptive attributes to Prometheus's identifying and descriptive labels. + +### 6. Support Both Direct and Gateway Collection Models + +Entity information must work correctly whether Prometheus scrapes targets directly (where SD metadata is accurate) or through intermediaries like OTel Collector or federation. + +--- + +## Non-Goals + +The following are explicitly out of scope for this proposal: + +### Changing behavior for existing `*_info` Gauges + +This proposal defines new semantics for Entities. Existing gauges with `_info` suffix will continue to work as gauges and joins will continue to work. Migration or automatic conversion is not in scope. + +### Complete OTel Data Model Parity + +This proposal focuses on Entities. Full parity with OTel's data model (exemplars, exponential histograms, etc.) is addressed elsewhere. + +--- + +## Related Work + +### OpenMetrics Specification + +OpenMetrics 1.0 (November 2020) formally defines the Info metric type. The specification describes Info as "used to expose textual information which SHOULD NOT change during process lifetime." + +- [OpenMetrics 1.0 Specification](https://prometheus.io/docs/specs/om/open_metrics_spec/) +- [OpenMetrics 2.0 Draft](https://prometheus.io/docs/specs/om/open_metrics_spec_2_0/) + +### The `info()` PromQL Function + +Prometheus 2.x introduced an experimental `info()` function in PromQL to simplify joins between metrics and info metrics. Instead of writing verbose `group_left` queries, users can write: + +```promql +info(rate(http_requests_total[5m])) +``` + +This automatically enriches the result with labels from `target_info`. The function reduces boilerplate and makes queries more readable. + +However, the current implementation hardcodes `job` and `instance` as identifying labels—the labels used to correlate metrics with their info series. This works for `target_info` but fails for other entity types like `kube_pod_info` (which uses `namespace` and `pod`) or `kube_node_info` (which uses `node`). The community is actively discussing improvements to make the function more flexible. 
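+
+To make the limitation concrete, compare the two cases (illustrative queries only):
+
+```promql
+# Works today: info() enriches the result with target_info labels, matched on the
+# hardcoded job/instance identifying labels.
+info(rate(http_requests_total[5m]))
+
+# Does not help here: kube_pod_info is keyed by namespace/pod, so the join still
+# has to be written out by hand.
+rate(container_cpu_usage_seconds_total[5m])
+  * on(namespace, pod) group_left(node)
+    kube_pod_info
+```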
+ +More fundamentally, `info()` still operates on info metrics—it makes joins easier but doesn't change the underlying model where entity information is encoded as a metric with a constant value. Native Entity support would allow the query engine to understand entity relationships directly, making enrichment automatic without needing explicit function calls or hardcoded identifying labels. + +- [PromQL info() function documentation](https://prometheus.io/docs/prometheus/latest/querying/functions/#info) + +### OpenTelemetry Entity Data Model + +OpenTelemetry defines Entities as "objects of interest associated with produced telemetry." The data model specifies: +- Entity types and their schemas +- Identifying vs. descriptive attributes +- Entity lifecycle events + +- [OTel Entities Data Model](https://opentelemetry.io/docs/specs/otel/entities/data-model/) +- [Resource and Entity Semantic Conventions](https://opentelemetry.io/docs/specs/semconv/how-to-write-conventions/resource-and-entities/) + +### OpenTelemetry Prometheus Compatibility + +OpenTelemetry provides specifications for bidirectional conversion between OTel and Prometheus formats: +- Resource attributes → `target_info` labels +- Metric attributes → metric labels +- Handling of Info and StateSet types + +- [Prometheus and OpenMetrics Compatibility](https://opentelemetry.io/docs/specs/otel/compatibility/prometheus_and_openmetrics/) +- [Prometheus Exporter Specification](https://opentelemetry.io/docs/specs/otel/metrics/sdk_exporters/prometheus/) + +### Prometheus Commitment to OpenTelemetry + +In March 2024, Prometheus announced plans to be the default store for OpenTelemetry metrics: +- OTLP ingestion +- UTF-8 metric and label name support +- Native resource attribute support + +As of late 2024, most of this work has been implemented: OTLP ingestion is generally available in Prometheus 3.0 and UTF-8 support for metric and label names is complete. The notable exception is **native support for resource attributes**—which is precisely what this proposal aims to address through proper Entity semantics. + +- [Prometheus Commitment to OpenTelemetry](https://prometheus.io/blog/2024/03/14/commitment-to-opentelemetry/) + +--- + +## What's Next + +This document establishes the context and motivation for native Entity support in Prometheus. The following documents detail the implementation: + +- **[Exposition Formats](./02-exposition-formats.md)**: How entities are represented in text and protobuf formats +- **[SDK](./03-sdk.md)**: How Prometheus client libraries support entities +- **[Service Discovery](./04-service-discovery.md)**: How entities relate to Prometheus targets and discovered metadata +- **[Storage](./05-storage.md)**: How entities are stored efficiently in the TSDB +- **[Querying](./06-querying.md)**: PromQL extensions for working with entities +- **[Web UI and APIs](./07-web-ui-and-apis.md)**: How entities are displayed and accessed +- **Remote Write (TBD)**: Protocol changes for transmitting entities over remote write +- **Alerting (TBD)**: How entities interact with alerting rules and Alertmanager + +--- + +*This proposal is a work in progress. 
Feedback from Prometheus maintainers, users, and the broader observability community is welcome.* diff --git a/proposals/0071-Entity/02-exposition-formats.md b/proposals/0071-Entity/02-exposition-formats.md new file mode 100644 index 0000000..20d89f7 --- /dev/null +++ b/proposals/0071-Entity/02-exposition-formats.md @@ -0,0 +1,588 @@ +# Exposition Formats + +## Abstract + +This document specifies how Prometheus exposition formats should be extended to support the Entity concept introduced in [01-context.md](./01-context.md). It covers syntax additions to the text format and new protobuf message definitions. + +The goal is to enable first-class representation of entities—the things that produce telemetry—while maintaining backward compatibility with existing scrapers that don't understand entities. + +--- + +## The Entity Concept + +An **Entity** represents a distinct object of interest that produces or is described by telemetry. Examples include: + +| Component | Description | +|-----------|-------------| +| **Type** | The entity type this instance belongs to (e.g., `k8s.pod`) | +| **Identifying Labels** | Labels that uniquely identify this entity instance. Must remain constant for the entity's lifetime. | +| **Descriptive Labels** | Additional context about the entity. May change over time. | + +Examples of entities: + +- A Kubernetes pod (`k8s.pod`) identified by namespace and UID +- A host or node (`k8s.node`) identified by node UID +- A service instance (`service`) identified by namespace, name, and instance ID + +--- + +## Text Format + +### New Syntax Elements + +| Element | Syntax | Description | +|---------|--------|-------------| +| Entity type declaration | `# ENTITY_TYPE ` | Declares an entity type for subsequent entities | +| Identifying labels | `# ENTITY_IDENTIFYING ...` | Lists which labels form the identity | +| Entity instance | `{}` | An entity instance (no value) | + +### Complete Example + +``` +# ENTITY_TYPE k8s.pod +# ENTITY_IDENTIFYING k8s.namespace.name k8s.pod.uid +k8s.pod{k8s.namespace.name="default",k8s.pod.uid="550e8400-e29b-41d4-a716-446655440000",k8s.pod.name="nginx-7b9f5"} +k8s.pod{k8s.namespace.name="default",k8s.pod.uid="660e8400-e29b-41d4-a716-446655440001",k8s.pod.name="redis-cache-0"} +k8s.pod{k8s.namespace.name="kube-system",k8s.pod.uid="770e8400-e29b-41d4-a716-446655440002",k8s.pod.name="coredns-5dd5756b68-abcde"} + +# ENTITY_TYPE k8s.node +# ENTITY_IDENTIFYING k8s.node.uid +k8s.node{k8s.node.uid="node-uid-001",k8s.node.name="worker-1",k8s.node.os="linux",k8s.node.kernel="5.15.0"} +k8s.node{k8s.node.uid="node-uid-002",k8s.node.name="worker-2",k8s.node.os="linux",k8s.node.kernel="5.15.0"} + +# ENTITY_TYPE service +# ENTITY_IDENTIFYING service.namespace service.name service.instance.id +service{service.namespace="production",service.name="payment-service",service.instance.id="i-abc123",service.version="2.1.0"} + +--- + +# TYPE container_cpu_usage_seconds counter +# HELP container_cpu_usage_seconds Total CPU usage in seconds +# This metric correlates with BOTH k8s.pod and k8s.node entities +# (it contains the identifying labels of both) +container_cpu_usage_seconds_total{k8s.namespace.name="default",k8s.pod.uid="550e8400-e29b-41d4-a716-446655440000",k8s.node.uid="node-uid-001",container="nginx"} 1234.5 +container_cpu_usage_seconds_total{k8s.namespace.name="default",k8s.pod.uid="660e8400-e29b-41d4-a716-446655440001",k8s.node.uid="node-uid-002",container="redis"} 567.8 + +# TYPE http_requests counter +# HELP http_requests Total HTTP requests 
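+# This metric correlates with the service entity declared above: it carries all of
+# that entity's identifying labels (service.namespace, service.name, service.instance.id)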
+http_requests_total{service.namespace="production",service.name="payment-service",service.instance.id="i-abc123",method="GET",status="200"} 9999 + +# EOF +``` + +### Parsing Rules + +1. `# ENTITY_TYPE` starts a new entity family block +2. `# ENTITY_IDENTIFYING` must follow `# ENTITY_TYPE` before any entity instances +3. Entity instances (lines matching `{...}` with no value) are ONLY valid after an `# ENTITY_TYPE` declaration. A line like `foo{bar="baz"}` without a preceding entity type declaration is a parse error. +4. Entity instances MUST contain all identifying labels declared in `# ENTITY_IDENTIFYING` +5. The entity type name in the instance line MUST match the declared `# ENTITY_TYPE` + +### Entity Section Ordering + +**All entities MUST appear at the beginning of the scrape response, before any metrics.** The entity section ends with a `---` delimiter on its own line. + +This ordering requirement exists for practical reasons: when Prometheus parses a metric, it needs to immediately correlate that metric with any relevant entities. If entities could appear anywhere in the response, Prometheus would need to either buffer all metrics until the entire response is parsed, or make a second pass through the data. Both approaches add complexity and memory overhead. + +By requiring entities first, the parser can process the exposition in a single pass. When it encounters a metric, all potentially correlated entities are already in memory and correlation can happen immediately. + +If no entities are present, the `---` delimiter may be omitted. If entities are present but metrics appear before the `---` delimiter (or without one), the scrape fails with a parse error. + +--- + +## Protobuf Format + +### New Message Definitions + +```protobuf +syntax = "proto2"; + +package io.prometheus.client; + +// EntityFamily groups entities of the same type +message EntityFamily { + // Entity type name (e.g., "k8s.pod", "service", "build") + required string type = 1; + + // Names of labels that form the unique identity + repeated string identifying_label_names = 2; + + // Entity instances of this type + repeated Entity entity = 3; +} + +// Entity represents a single entity instance +message Entity { + // All labels (both identifying and descriptive) + repeated LabelPair label = 1; +} +``` + +### Integration with Existing Messages + +The existing `MetricFamily` structure remains unchanged. A new top-level message wraps both: + +```protobuf +// MetricPayload is the top-level message for scrape responses +// that include both entities and metrics +message MetricPayload { + // Entity families + repeated EntityFamily entity_family = 1; + + repeated MetricFamily metric_family = 2; +} +``` + +### Content-Type + +For protobuf with entity support: + +``` +application/vnd.google.protobuf;proto=io.prometheus.client.MetricPayload;encoding=delimited +``` + +For protobuf with entity support, the `proto` parameter changes from `MetricFamily` to `MetricPayload` to indicate the new top-level message type. + +--- + +## Entity-Metric Correlation + +### How Correlation Works + +Entities correlate with metrics through **shared identifying labels**: + +- If a metric has labels that match ALL identifying labels of an entity (same names, same values), that metric is associated with that entity. +- A single metric can correlate with multiple entities (of different types) if it contains the identifying labels of each. 
+ +**Example:** + +``` +# ENTITY_TYPE k8s.pod +# ENTITY_IDENTIFYING k8s.namespace.name k8s.pod.uid +k8s.pod{k8s.namespace.name="default",k8s.pod.uid="550e8400",k8s.pod.name="nginx"} + +--- + +# This metric correlates with the entity above (has both identifying labels) +container_cpu_usage_seconds_total{k8s.namespace.name="default",k8s.pod.uid="550e8400",container="app"} 1234.5 +``` + +Correlation is computed at ingestion time when Prometheus parses the exposition format. See [05-storage.md](./05-storage.md#correlation-index) for how Prometheus builds and maintains these correlations in storage. + +### Conflict Detection + +When a metric correlates with an entity, the query engine enriches the metric's labels with the entity's descriptive labels (see [06-querying.md](./06-querying.md)). This creates the possibility of label conflicts—a metric might have a label with the same name as an entity's descriptive label. + +A conflict occurs when: +- A metric correlates with an entity (has all identifying labels) +- The metric has a label with the same name as one of the entity's descriptive labels +- The values differ + +``` +┌─────────────────────────────────────────────────────────────────────────────┐ +│ Label Conflict Detection │ +└─────────────────────────────────────────────────────────────────────────────┘ + +Entity (k8s.pod) Metric (my_metric) +┌─────────────────────────────────┐ ┌─────────────────────────────────┐ +│ Identifying Labels: │ │ Labels: │ +│ k8s.namespace.name = "default"│◄─────────►│ k8s.namespace.name = "default"│ ✓ Match +│ k8s.pod.uid = "abc-123" │◄─────────►│ k8s.pod.uid = "abc-123" │ ✓ Match +├─────────────────────────────────┤ ├─────────────────────────────────┤ +│ Descriptive Labels: │ │ │ +│ version = "2.0" │◄────╳────►│ version = "1.0" │ ✗ CONFLICT! +│ k8s.pod.name = "nginx" │ │ │ +└─────────────────────────────────┘ │ Value: 42 │ + └─────────────────────────────────┘ + +Correlation established via matching identifying labels, +but "version" exists in both with different values → Scrape fails! +``` + +**Example conflict in exposition format:** +``` +# ENTITY_TYPE k8s.pod +# ENTITY_IDENTIFYING k8s.namespace.name k8s.pod.uid +k8s.pod{k8s.namespace.name="default",k8s.pod.uid="abc-123",version="2.0"} + +--- + +# This metric has k8s.pod identifying labels, so it correlates with the entity. +# But it also has a "version" label that conflicts with the entity's "version" label! +my_metric{k8s.namespace.name="default",k8s.pod.uid="abc-123",version="1.0"} 42 +``` + +When a conflict is detected during scrape, **the scrape fails with an error**. + +Note that **identifying labels cannot conflict** because they must be present on the metric for correlation to occur—if the metric has the same label name with a different value, it simply won't correlate with that entity. + +--- + +## Technical Implementation + +This section provides detailed implementation guidance for parsing entities and integrating with the scrape loop. The implementation should align with the storage layer defined in [05-storage.md](./05-storage.md). 
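+
+As a reference point for the parser and scrape-loop changes below, here is a minimal sketch of the `checkConflicts()` step that appears in the data flow summary later in this document. The signature is an assumption; `entityCacheEntry` is the cache type defined in the Scrape Loop Integration section, and `labels` is the existing Prometheus labels package.
+
+```go
+// checkConflicts enforces the rule above: for every entity a sample correlates
+// with, none of the entity's descriptive labels may appear on the metric with a
+// different value. The caller fails the scrape if an error is returned.
+func checkConflicts(metricLabels labels.Labels, correlated []*entityCacheEntry) error {
+	for _, ce := range correlated {
+		var conflict error
+		ce.descriptiveLabels.Range(func(dl labels.Label) {
+			if v := metricLabels.Get(dl.Name); v != "" && v != dl.Value {
+				conflict = fmt.Errorf("label %q conflicts with entity descriptive label: metric has %q, entity has %q",
+					dl.Name, v, dl.Value)
+			}
+		})
+		if conflict != nil {
+			return conflict
+		}
+	}
+	return nil
+}
+```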
+ +### Parser Interface Extensions + +The existing `Parser` interface in `model/textparse/interface.go` needs new methods and entry types to handle entities: + +#### New Entry Types + +The `Entry` type is extended with new values for entity handling: + +```go +// Current Entry types (model/textparse/interface.go:206-213) +const ( + EntryInvalid Entry = -1 + EntryType Entry = 0 + EntryHelp Entry = 1 + EntrySeries Entry = 2 + EntryComment Entry = 3 + EntryUnit Entry = 4 + EntryHistogram Entry = 5 + + // New entity entry types + EntryEntityType Entry = 6 // # ENTITY_TYPE + EntryEntityIdentifying Entry = 7 // # ENTITY_IDENTIFYING ... + EntryEntity Entry = 8 // {} (no value) + EntryEntityDelimiter Entry = 9 // --- (marks end of entity section) +) +``` + +When the parser encounters `---`, it returns `EntryEntityDelimiter`. After this point, any entity declarations are a parse error—all entities must appear before the delimiter. + +#### New Parser Methods + +```go +// Parser interface additions +type Parser interface { + // ... existing methods (Series, Histogram, Help, Type, Unit, etc.) ... + + // EntityType returns the entity type name from an ENTITY_TYPE declaration. + // Must only be called after Next() returned EntryEntityType. + // The returned byte slice becomes invalid after the next call to Next. + EntityType() []byte + + // EntityIdentifying returns the list of identifying label names. + // Must only be called after Next() returned EntryEntityIdentifying. + // The returned slice becomes invalid after the next call to Next. + EntityIdentifying() [][]byte + + // EntityLabels writes the entity labels into the passed labels. + // Must only be called after Next() returned EntryEntity. + // All labels (both identifying and descriptive) are included. + EntityLabels(l *labels.Labels) +} +``` + +### Scrape Loop Integration + +The scrape loop in `scrape/scrape.go` needs significant changes to process entities alongside metrics. + +#### Entity Cache + +Extend `scrapeCache` to track entities similar to how it tracks series: + +```go +// Entity cache entry (analogous to cacheEntry for series) +type entityCacheEntry struct { + ref storage.EntityRef + lastIter uint64 + hash uint64 + identifyingLabels labels.Labels + descriptiveLabels labels.Labels +} + +type scrapeCache struct { + // ... existing fields (series, droppedSeries, seriesCur, seriesPrev, metadata) ... + + // Entity parsing state (reset each scrape) + currentEntityType string + currentIdentifyingNames []string + + // Entity tracking (persists across scrapes) + entities map[string]*entityCacheEntry // key: hash of identifying attrs + entityCur map[storage.EntityRef]*entityCacheEntry + entityPrev map[storage.EntityRef]*entityCacheEntry +} + +func newScrapeCache(metrics *scrapeMetrics) *scrapeCache { + return &scrapeCache{ + // ... existing initialization ... + entities: map[string]*entityCacheEntry{}, + entityCur: map[storage.EntityRef]*entityCacheEntry{}, + entityPrev: map[storage.EntityRef]*entityCacheEntry{}, + } +} +``` + +#### Entity Processing in append() + +The main append loop in `scrapeLoop.append()` is extended: + +```go +func (sl *scrapeLoop) append(app storage.Appender, b []byte, contentType string, ts time.Time) (total, added, seriesAdded int, err error) { + defTime := timestamp.FromTime(ts) + + // ... existing parser creation ... + + var ( + // ... existing variables ... 
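+		// Counts of entity instances seen and successfully appended in this scrape,
+		// tracked alongside the existing series totals.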
+ entitiesTotal int + entitiesAdded int + ) + +loop: + for { + et, err := p.Next() + if err != nil { + if errors.Is(err, io.EOF) { + err = nil + } + break + } + + switch et { + case textparse.EntryEntityType: + sl.cache.currentEntityType = string(p.EntityType()) + sl.cache.currentIdentifyingNames = nil + continue + + case textparse.EntryEntityIdentifying: + names := p.EntityIdentifying() + sl.cache.currentIdentifyingNames = make([]string, len(names)) + for i, name := range names { + sl.cache.currentIdentifyingNames[i] = string(name) + } + continue + + case textparse.EntryEntity: + entitiesTotal++ + if err := sl.processEntity(app, p, defTime); err != nil { + sl.l.Debug("Entity processing error", "err", err) + // Depending on error type, may break or continue + if isEntityLimitError(err) { + break loop + } + continue + } + entitiesAdded++ + continue + + case textparse.EntryType: + // ... existing handling ... + case textparse.EntryHelp: + // ... existing handling ... + case textparse.EntrySeries, textparse.EntryHistogram: + // ... existing metric handling ... + // ADD: conflict detection before appending + } + } + + // Update stale markers for both series AND entities + if err == nil { + err = sl.updateStaleMarkers(app, defTime) + sl.updateEntityStaleMarkers(app, defTime) + } + + return total, added, seriesAdded, err +} +``` + +#### Entity Processing Method + +```go +func (sl *scrapeLoop) processEntity(app storage.Appender, p textparse.Parser, ts int64) error { + var allLabels labels.Labels + p.EntityLabels(&allLabels) + + // Validate: all identifying labels must be present + identifying, descriptive := sl.splitEntityLabels(allLabels) + if len(identifying) != len(sl.cache.currentIdentifyingNames) { + return fmt.Errorf("entity missing required identifying labels: expected %v", + sl.cache.currentIdentifyingNames) + } + + // Check entity limit + if sl.entityLimit > 0 && len(sl.cache.entities) >= sl.entityLimit { + return errEntityLimit + } + + hash := identifying.Hash() + hashKey := fmt.Sprintf("%s:%d", sl.cache.currentEntityType, hash) + + // Check cache for existing entity + ce, cached := sl.cache.entities[hashKey] + if cached { + ce.lastIter = sl.cache.iter + + // Check if descriptive labels changed + if !labels.Equal(ce.descriptiveLabels, descriptive) { + ce.descriptiveLabels = descriptive + // Will trigger a WAL write via AppendEntity + } + } + + // Call storage appender + ref, err := app.AppendEntity( + sl.cache.currentEntityType, + identifying, + descriptive, + ts, + ) + if err != nil { + return err + } + + // Update cache + if !cached { + ce = &entityCacheEntry{ + ref: ref, + lastIter: sl.cache.iter, + hash: hash, + identifyingLabels: identifying, + descriptiveLabels: descriptive, + } + sl.cache.entities[hashKey] = ce + } else { + ce.ref = ref + } + + sl.cache.entityCur[ref] = ce + return nil +} + +func (sl *scrapeLoop) splitEntityLabels(allLabels labels.Labels) (labels.Labels, labels.Labels) { + identifyingSet := make(map[string]struct{}) + for _, name := range sl.cache.currentIdentifyingNames { + identifyingSet[name] = struct{}{} + } + + var identifying, descriptive labels.Labels + allLabels.Range(func(l labels.Label) { + if _, ok := identifyingSet[l.Name]; ok { + identifying = append(identifying, l) + } else { + descriptive = append(descriptive, l) + } + }) + + return identifying, descriptive +} +``` + +#### Entity Staleness + +Entity staleness works similarly to series staleness, but marks entities as dead rather than writing StaleNaN: + +```go +func (sl *scrapeLoop) 
updateEntityStaleMarkers(app storage.Appender, ts int64) error { + for ref, ce := range sl.cache.entityPrev { + if _, ok := sl.cache.entityCur[ref]; ok { + continue // Entity still present + } + + // Entity disappeared - mark it dead + // The storage layer handles this by setting endTime + if err := app.MarkEntityDead(ref, ts); err != nil { + sl.l.Debug("Error marking entity dead", "ref", ref, "err", err) + } + + // Remove from cache + for hashKey, e := range sl.cache.entities { + if e.ref == ref { + delete(sl.cache.entities, hashKey) + break + } + } + } + + return nil +} + +func (c *scrapeCache) entityIterDone(flush bool) { + // Swap current and previous (same pattern as series) + c.entityPrev, c.entityCur = c.entityCur, c.entityPrev + clear(c.entityCur) +} +``` + +### Scrape Configuration + +New configuration options in `config/config.go`: + +```go +type ScrapeConfig struct { + // ... existing fields ... + + // EnableEntityScraping enables parsing of entity declarations. + // Default: false for backward compatibility. + EnableEntityScraping bool `yaml:"enable_entity_scraping,omitempty"` + + // EntityLimit is the maximum number of entities per scrape target. + // 0 means no limit. + EntityLimit int `yaml:"entity_limit,omitempty"` +} +``` + +### Data Flow Summary + +``` +┌───────────────────────────────────────────────────────────────────────────────┐ +│ Scrape Data Flow │ +└───────────────────────────────────────────────────────────────────────────────┘ + + Target /metrics Prometheus Scrape Loop + ┌─────────────────┐ ┌─────────────────────────────────────────┐ + │ # ENTITY_TYPE │ │ │ + │ # ENTITY_IDENT │ ──HTTP GET──► │ 1. Create Parser (textparse.New) │ + │ entity{...} │ │ │ + │ │ │ 2. Loop: p.Next() │ + │ # TYPE metric │ │ ├─ EntryEntityType → cache type │ + │ metric{...} 123 │ │ ├─ EntryEntityIdent → cache names │ + │ # EOF │ │ ├─ EntryEntity → processEntity() │ + └─────────────────┘ │ │ └─ app.AppendEntity() │ + │ ├─ EntrySeries → checkConflicts() │ + │ │ └─ app.Append() │ + │ └─ EntryHistogram → ... │ + │ │ + │ 3. updateStaleMarkers() │ + │ ├─ Series: Write StaleNaN │ + │ └─ Entities: app.MarkEntityDead() │ + │ │ + │ 4. app.Commit() │ + │ ├─ Write WAL records │ + │ ├─ Update Head structures │ + │ └─ Build correlation index │ + └─────────────────────────────────────────┘ + │ + ▼ + ┌─────────────────────────────────────────┐ + │ Storage (TSDB) │ + │ ┌─────────────┐ ┌─────────────────┐ │ + │ │ WAL Records │ │ Head Block │ │ + │ │ - Series │ │ - memSeries │ │ + │ │ - Samples │ │ - memEntity │ │ + │ │ - Entities │ │ - Correlation │ │ + │ └─────────────┘ │ Index │ │ + │ └─────────────────┘ │ + └─────────────────────────────────────────┘ +``` + +In the [storage](04-storage.md) document, we go over the correlation index, WAL and memEntity struct in greater details. + +--- + +## Related Documents + +- [01-context.md](./01-context.md) - Problem statement and motivation +- [03-sdk.md](./03-sdk.md) - How Prometheus client libraries support entities +- [04-service-discovery.md](./04-service-discovery.md) - How entities relate to Prometheus targets +- [05-storage.md](./05-storage.md) - How entities are stored in the TSDB +- [06-querying.md](./06-querying.md) - PromQL extensions for working with entities +- [07-web-ui-and-apis.md](./07-web-ui-and-apis.md) - How entities are displayed and accessed + +--- + +*This proposal is a work in progress. 
Feedback is welcome.* + diff --git a/proposals/0071-Entity/03-sdk.md b/proposals/0071-Entity/03-sdk.md new file mode 100644 index 0000000..5eff283 --- /dev/null +++ b/proposals/0071-Entity/03-sdk.md @@ -0,0 +1,417 @@ +# SDK Support for Entities + +## Abstract + +This document specifies how Prometheus client libraries should be extended to support the Entity concept. Using client_golang as the reference implementation, we define new types, interfaces, and patterns that enable applications to declare entities alongside metrics while maintaining backward compatibility with existing instrumentation code. + +The design prioritizes simplicity for the common case—an application instrumenting itself as a single entity—while providing flexibility for advanced scenarios like exporters that expose metrics for multiple entities. + +--- + +## Design Principles + +Before diving into implementation details, it's worth understanding the key design decisions that shaped this proposal. + +**Entities are not collectors.** In client_golang, metrics are managed through the Collector interface, which combines description and collection into a single abstraction. We considered making entities follow this pattern, but entities have fundamentally different characteristics: they represent the "things" that produce telemetry, not the telemetry itself. An entity like "this Kubernetes pod" cuts across multiple collectors (process metrics, Go runtime metrics, application metrics). Tying entities to collectors would create awkward ownership questions and unnecessary coupling. + +**The EntityRegistry is global and separate from the metric Registry.** This separation reflects the conceptual difference between "what is producing telemetry" (entities) and "what telemetry is being produced" (metrics). Making the EntityRegistry global (via `DefaultEntityRegistry`) enables validation at metric registration time—if a metric references a non-existent entity ref, registration fails immediately rather than silently producing invalid output at scrape time. + +**Descriptive labels are mutable, identifying labels are not.** An entity's identity (its type plus identifying labels) is immutable—changing it would make it a different entity. But descriptive labels like version numbers or human-readable names can change during the entity's lifetime. The API reflects this: `SetDescriptiveLabels()` atomically replaces all descriptive labels, while identifying labels are set only at construction. + +--- + +## Entity Types + +### Entity + +The `Entity` type represents a single entity instance: + +```go +type Entity struct { + ref uint64 // Assigned by EntityRegistry + entityType string // e.g., "service", "k8s.pod" + identifyingLabels Labels // Immutable after creation + descriptiveLabels Labels // Mutable via SetDescriptiveLabels + mtx sync.RWMutex // Protects descriptiveLabels +} + +// EntityOpts configures a new Entity +type EntityOpts struct { + Type string // Required: entity type name + Identifying Labels // Required: labels that uniquely identify this instance + Descriptive Labels // Optional: additional context labels +} + +// NewEntity creates an entity. 
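+// The identifying labels in opts are fixed for the entity's lifetime; descriptive
+// labels can be replaced later via SetDescriptiveLabels.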
+func NewEntity(opts EntityOpts) *Entity + +// Ref returns the entity's reference (0 if not yet registered) +func (e *Entity) Ref() uint64 + +// Type returns the entity type +func (e *Entity) Type() string + +// IdentifyingLabels returns a copy of the identifying labels +func (e *Entity) IdentifyingLabels() Labels + +// DescriptiveLabels returns a copy of the current descriptive labels +func (e *Entity) DescriptiveLabels() Labels + +// SetDescriptiveLabels atomically replaces all descriptive labels +func (e *Entity) SetDescriptiveLabels(labels Labels) +``` + +### EntityRegistry + +The `EntityRegistry` is a **global singleton**, similar to `prometheus.DefaultRegisterer`. This ensures that metrics can validate entity refs at registration time—if a metric references a non-existent entity, registration fails immediately rather than at scrape time. + +```go +// Global EntityRegistry instance +var DefaultEntityRegistry = NewEntityRegistry() + +type EntityRegistry struct { + mtx sync.RWMutex + byHash map[uint64]*Entity // hash(type+identifying) → Entity + byRef map[uint64]*Entity // ref → Entity + refCounter uint64 // Auto-increments on Register +} + + +// Register adds an entity and assigns its ref. +// Returns error if an entity with the same type+identifying labels exists. +func (er *EntityRegistry) Register(e *Entity) error + +// Unregister removes an entity by ref +func (er *EntityRegistry) Unregister(ref uint64) bool + +// Lookup finds an entity by type and identifying labels, returns its ref +func (er *EntityRegistry) Lookup(entityType string, identifying Labels) (ref uint64, found bool) + +// Get retrieves an entity by ref +func (er *EntityRegistry) Get(ref uint64) *Entity + +// Gather collects entities and metrics together into a MetricPayload. +// Only entities referenced by the gathered metrics are included. +func (er *EntityRegistry) Gather(gatherers ...Gatherer) (*dto.MetricPayload, error) +``` + +--- + +## Metric Integration + +Metrics declare their entity associations through the `EntityRefs` field in their options. This field contains the refs of entities that the metric correlates with. + +### Updated Metric Options + +```go +type CounterOpts struct { + Namespace string + Subsystem string + Name string + Help string + ConstLabels Labels + + // EntityRefs lists the refs of entities this metric correlates with. + // Obtain refs via Entity.Ref() after registering with EntityRegistry. + EntityRefs []uint64 +} + +// Same pattern for GaugeOpts, HistogramOpts, SummaryOpts, etc. +``` + +### Validation at Registration + +When a metric with `EntityRefs` is registered, the metric registry validates that all referenced entity refs exist in the global `DefaultEntityRegistry`. 
This catches configuration errors immediately: + +```go +// This works: entity is registered first +serviceEntity := prometheus.NewEntity(prometheus.EntityOpts{...}) +prometheus.RegisterEntity(serviceEntity) // Uses DefaultEntityRegistry + +counter := prometheus.NewCounter(prometheus.CounterOpts{ + Name: "requests_total", + EntityRefs: []uint64{serviceEntity.Ref()}, +}) +prometheus.MustRegister(counter) // Validates that serviceEntity.Ref() exists + +// This fails: entity ref doesn't exist +badCounter := prometheus.NewCounter(prometheus.CounterOpts{ + Name: "bad_counter", + EntityRefs: []uint64{999}, // No entity with this ref +}) +prometheus.MustRegister(badCounter) // PANIC: unknown entity ref 999 +``` + +### Usage Example + +```go +// Create and register entity +serviceEntity := prometheus.NewEntity(prometheus.EntityOpts{ + Type: "service", + Identifying: prometheus.Labels{ + "service.namespace": "production", + "service.name": "payment-api", + "service.instance.id": os.Getenv("INSTANCE_ID"), + }, + Descriptive: prometheus.Labels{ + "service.version": "1.0.0", + }, +}) +prometheus.RegisterEntity(serviceEntity) + +// Create metric that correlates with the entity +requestDuration := prometheus.NewHistogram(prometheus.HistogramOpts{ + Name: "http_request_duration_seconds", + Help: "HTTP request latency", + Buckets: prometheus.DefBuckets, + EntityRefs: []uint64{serviceEntity.Ref()}, +}) +prometheus.MustRegister(requestDuration) + +// Later: update descriptive labels during rolling deploy +serviceEntity.SetDescriptiveLabels(prometheus.Labels{ + "service.version": "2.0.0", +}) +``` + +### Multiple Entity Correlations + +A single metric can correlate with multiple entities. This is useful when a metric describes something that spans entity boundaries: + +```go +// Register both pod and node entities +podEntity := prometheus.NewEntity(prometheus.EntityOpts{ + Type: "k8s.pod", + Identifying: prometheus.Labels{ + "k8s.namespace.name": "default", + "k8s.pod.uid": "abc-123", + }, +}) +nodeEntity := prometheus.NewEntity(prometheus.EntityOpts{ + Type: "k8s.node", + Identifying: prometheus.Labels{ + "k8s.node.uid": "node-456", + }, +}) +entityRegistry.Register(podEntity) +entityRegistry.Register(nodeEntity) + +// Container CPU correlates with both pod AND node +containerCPU := prometheus.NewCounter(prometheus.CounterOpts{ + Name: "container_cpu_usage_seconds_total", + Help: "Total CPU usage by container", + EntityRefs: []uint64{podEntity.Ref(), nodeEntity.Ref()}, +}) +``` + +--- + +## Gathering and Exposition + +The `EntityRegistry.Gather()` method is the central coordination point. It accepts metric gatherers as arguments and returns a complete `dto.MetricPayload` containing both entities and metrics. This design enforces that entities are never gathered in isolation—they only make sense alongside their correlated metrics. + +### How Gather Works + +```go +func (er *EntityRegistry) Gather(gatherers ...Gatherer) (*dto.MetricPayload, error) { + // 1. Gather metrics from all provided gatherers + var allMetrics []*dto.MetricFamily + referencedRefs := make(map[uint64]struct{}) + + for _, g := range gatherers { + mfs, err := g.Gather() + if err != nil { + return nil, err + } + allMetrics = append(allMetrics, mfs...) + + // Track which entity refs are actually used by metrics + for _, mf := range mfs { + for _, ref := range mf.GetEntityRefs() { + referencedRefs[ref] = struct{}{} + } + } + } + + // 2. 
Only include entities that are referenced by at least one metric + // Orphan entities (not referenced by any metric) are excluded + entityFamilies := er.collectReferencedEntities(referencedRefs) + + // 3. Return complete payload + // - All metrics are included (with or without entity refs) + // - Only referenced entities are included + return &dto.MetricPayload{ + EntityFamily: entityFamilies, + MetricFamily: allMetrics, + }, nil +} +``` + +This filtering ensures that: +- **Metrics without entities** are still exposed +- **Entities without metrics** are excluded +- **Only the entities actually needed** are transmitted, reducing payload size + +### HTTP Handler Updates + +The promhttp package needs a handler that works with `EntityRegistry.Gather()`: + +```go +// HandlerFor creates an HTTP handler that exposes entities and metrics together +func HandlerFor(er *EntityRegistry, gatherers []Gatherer, opts HandlerOpts) http.Handler { + return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + payload, err := er.Gather(gatherers...) + if err != nil { + // error handling... + } + + contentType := expfmt.NegotiateIncludingOpenMetrics(r.Header) + w.Header().Set("Content-Type", string(contentType)) + + enc := expfmt.NewPayloadEncoder(w, contentType) + enc.EncodePayload(payload) + }) +} +``` + +### Usage Example + +```go +func main() { + // Register entity (uses global DefaultEntityRegistry) + serviceEntity := prometheus.NewEntity(prometheus.EntityOpts{...}) + prometheus.RegisterEntity(serviceEntity) + + // Register metrics + counter := prometheus.NewCounter(prometheus.CounterOpts{ + Name: "requests_total", + EntityRefs: []uint64{serviceEntity.Ref()}, + }) + prometheus.MustRegister(counter) + + // Expose via HTTP - uses global registries + http.Handle("/metrics", promhttp.Handler()) // Enhanced to use DefaultEntityRegistry + http.ListenAndServe(":8080", nil) +} +``` + +For custom registries, pass them explicitly: + +```go +entityReg := prometheus.NewEntityRegistry() +metricReg := prometheus.NewRegistry() + +http.Handle("/metrics", promhttp.HandlerFor(entityReg, []prometheus.Gatherer{metricReg}, promhttp.HandlerOpts{})) +``` + +--- + +## Changes to Supporting Libraries + +Implementing entity support requires coordinated changes across multiple repositories. + +### client_model + +The protobuf definitions need new message types: + +```protobuf +// EntityFamily groups entities of the same type +message EntityFamily { + required string type = 1; + repeated string identifying_label_names = 2; + repeated Entity entity = 3; +} + +// Entity represents a single entity instance +message Entity { + repeated LabelPair label = 1; // All labels (identifying + descriptive) +} + +// MetricPayload is the top-level message for combined exposition +message MetricPayload { + repeated EntityFamily entity_family = 1; + repeated MetricFamily metric_family = 2; +} +``` + +### common/expfmt + +The exposition format library needs encoder support for `MetricPayload`: + +```go +// PayloadEncoder encodes a complete MetricPayload +type PayloadEncoder interface { + EncodePayload(payload *dto.MetricPayload) error +} + +// NewPayloadEncoder creates an encoder for the combined format +func NewPayloadEncoder(w io.Writer, format Format) PayloadEncoder +``` + +For the text format, the encoder writes the payload in order: entity declarations first, then the `---` delimiter, then metric families. For the protobuf format, the encoder marshals the `MetricPayload` message directly. 
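+
+As a rough sketch of the text-format path, assuming the extended `dto` types defined above, the existing `expfmt.MetricFamilyToText` helper, and the usual `fmt`/`io`/`strings` imports (the struct name, method body, and label escaping are illustrative, not a final implementation):
+
+```go
+type textPayloadEncoder struct {
+	w io.Writer
+}
+
+func (e *textPayloadEncoder) EncodePayload(p *dto.MetricPayload) error {
+	// 1. Entity section: type declaration, identifying label names, then instances.
+	for _, ef := range p.GetEntityFamily() {
+		fmt.Fprintf(e.w, "# ENTITY_TYPE %s\n", ef.GetType())
+		fmt.Fprintf(e.w, "# ENTITY_IDENTIFYING %s\n", strings.Join(ef.GetIdentifyingLabelNames(), " "))
+		for _, ent := range ef.GetEntity() {
+			pairs := make([]string, 0, len(ent.GetLabel()))
+			for _, lp := range ent.GetLabel() {
+				// NOTE: label value escaping is simplified here.
+				pairs = append(pairs, fmt.Sprintf("%s=%q", lp.GetName(), lp.GetValue()))
+			}
+			fmt.Fprintf(e.w, "%s{%s}\n", ef.GetType(), strings.Join(pairs, ","))
+		}
+	}
+
+	// 2. Delimiter ending the entity section (omitted when there are no entities).
+	if len(p.GetEntityFamily()) > 0 {
+		fmt.Fprintln(e.w, "---")
+	}
+
+	// 3. Metric families, reusing the existing text encoding.
+	for _, mf := range p.GetMetricFamily() {
+		if _, err := expfmt.MetricFamilyToText(e.w, mf); err != nil {
+			return err
+		}
+	}
+	return nil
+}
+```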
+ +### client_golang + +The changes described in this document: +- New `Entity` and `EntityRegistry` types +- `EntityRegistry.Gather()` that accepts metric gatherers and returns `*dto.MetricPayload` +- Updated metric options with `EntityRefs` field +- Updated promhttp handlers + +--- + +## Backward Compatibility + +The design maintains full backward compatibility: + +**Existing metrics continue to work.** The `EntityRefs` field is optional. Metrics without entity associations work exactly as before—they simply don't correlate with any entity. + +**Existing registries are unaffected.** The metric `Registry` type is unchanged. Entity support is additive through the separate `EntityRegistry`. + +**Existing HTTP handlers work.** The standard `promhttp.Handler()` continues to expose metrics without entities. Applications opt into entity support by using the new `HandlerFor()` that accepts an `EntityRegistry`. + +**Gradual adoption is possible.** Applications can add entity support incrementally—register an entity, update a few metrics to reference it, and the rest continue working unchanged. + +--- + +## Advanced: Dynamic Entity Associations + +The design presented above works well for applications that instrument themselves, where entities are known at startup and metrics have fixed entity associations. However, some use cases require dynamic associations. + +### Exporters with Many Entities + +Exporters like kube-state-metrics expose metrics for thousands of entities (pods, nodes, deployments). Each metric sample correlates with a different entity based on its label values. For these cases, we propose a per-sample entity association: + +```go +// GaugeVec with per-sample entity support +podInfo := prometheus.NewGaugeVec(prometheus.GaugeVecOpts{ + Name: "kube_pod_info", + VariableLabels: []string{"pod_name", "node"}, +}) + +// When recording, specify which entity this sample correlates with +podInfo.WithEntityRef(podEntities[pod.UID].Ref()). + WithLabelValues("nginx", "node-1"). + Set(1) +``` + +This API extension is optional and can be added in a future iteration once the core entity support is stable. + +--- + +## Open Questions + +Several aspects of this design warrant community feedback: + +**promauto integration.** How should the promauto convenience package handle entities? + +**Entity unregistration and metrics.** If an entity is unregistered while metrics still reference it, what should happen? Options: prevent unregistration while referenced, allow it and have Gather skip the missing entity, or error at gather time. + +--- + +## Related Documents + +- [01-context.md](./01-context.md) — Problem statement and entity concept +- [02-exposition-formats.md](./02-exposition-formats.md) — Wire format for entities +- [05-storage.md](./05-storage.md) — How Prometheus stores entities + diff --git a/proposals/0071-Entity/04-service-discovery.md b/proposals/0071-Entity/04-service-discovery.md new file mode 100644 index 0000000..d59a32f --- /dev/null +++ b/proposals/0071-Entity/04-service-discovery.md @@ -0,0 +1,793 @@ +# Service Discovery and Entities + +## Abstract + +This document specifies how Prometheus Service Discovery (SD) integrates with the Entity concept introduced in this proposal. SD already collects rich metadata about scrape targets—metadata that naturally maps to entity labels. 
This document provides a comprehensive technical specification for deriving entities from SD metadata, including implementation details and resolution of the interaction between relabeling, entity generation, and metric correlation. + +The document also addresses **attribute mapping standards**—how `__meta_*` labels translate to entity type names and attribute names. Rather than prescribing a specific convention, this document presents the available options (OpenTelemetry semantic conventions, Prometheus-native conventions, etc.) and their trade-offs. Standardized, non-customizable mappings are essential for enabling ecosystem-wide interoperability; the specific convention choice is left as an open decision for the Prometheus community. + +Entities can come from two sources: the **exposition format** (embedded in scraped data) or **Service Discovery** (derived from target metadata). Each approach has trade-offs, and users choose based on their architecture. + +--- + +## Background: How Service Discovery Works + +### Discovery Manager Architecture + +The Discovery Manager (`discovery/manager.go`) coordinates all service discovery mechanisms: + +```go +type Manager struct { + // providers keeps track of SD providers + providers []*Provider + + // targets maps (setName, providerName) -> source -> TargetGroup + targets map[poolKey]map[string]*targetgroup.Group + + // syncCh sends updates to the scrape manager + syncCh chan map[string][]*targetgroup.Group +} +``` + +Each `Provider` wraps a `Discoverer` that implements: + +```go +type Discoverer interface { + // Run sends TargetGroups through the channel when changes occur + Run(ctx context.Context, up chan<- []*targetgroup.Group) +} +``` + +### Target Group Structure + +The fundamental unit of discovery is the `targetgroup.Group`: + +```go +// From discovery/targetgroup/targetgroup.go +type Group struct { + // Targets is a list of targets identified by a label set. + // Each target is uniquely identifiable by its address label. + Targets []model.LabelSet + + // Labels is a set of labels common across all targets in the group. + Labels model.LabelSet + + // Source is an identifier that describes this group of targets. + Source string +} +``` + +**Key insight**: SD mechanisms populate `__meta_*` labels into these `LabelSet` objects. These labels contain the raw metadata that will become entity attributes. + +### Label Flow: Discovery to Scrape + +The complete flow from discovery to metric labels: + +``` +┌─────────────────────────────────────────────────────────────────────────────┐ +│ Service Discovery Flow │ +└─────────────────────────────────────────────────────────────────────────────┘ + + 1. DISCOVERY PHASE + ┌─────────────────────────────────────────────────────────────────────────┐ + │ Kubernetes API / AWS API / Consul / etc. │ + │ │ │ + │ ▼ │ + │ ┌─────────────────────────────────────────────────────────────────────┐ │ + │ │ Discoverer.Run() builds targetgroup.Group with: │ │ + │ │ │ │ + │ │ Targets[0] = { │ │ + │ │ __address__: "10.0.0.1:8080" │ │ + │ │ __meta_kubernetes_namespace: "production" │ │ + │ │ __meta_kubernetes_pod_name: "nginx-7b9f5" │ │ + │ │ __meta_kubernetes_pod_uid: "550e8400-e29b-..." │ │ + │ │ __meta_kubernetes_pod_node_name: "worker-1" │ │ + │ │ __meta_kubernetes_pod_phase: "Running" │ │ + │ │ ... 
│ │ + │ │ } │ │ + │ │ │ │ + │ │ Labels = { │ │ + │ │ __meta_kubernetes_namespace: "production" (group-level) │ │ + │ │ } │ │ + │ └─────────────────────────────────────────────────────────────────────┘ │ + └─────────────────────────────────────────────────────────────────────────┘ + │ + ▼ + 2. SCRAPE MANAGER RECEIVES TARGET GROUPS + ┌─────────────────────────────────────────────────────────────────────────┐ + │ scrapePool.Sync(tgs []*targetgroup.Group) │ + │ │ │ + │ ▼ │ + │ TargetsFromGroup() → PopulateLabels() │ + └─────────────────────────────────────────────────────────────────────────┘ + │ + ▼ + 3. LABEL POPULATION (scrape/target.go:PopulateLabels) + ┌─────────────────────────────────────────────────────────────────────────┐ + │ a) Merge target labels + group labels │ + │ b) Add scrape config defaults (job, __scheme__, __metrics_path__, etc.) │ + │ c) Apply relabel_configs │ + │ d) Delete all __meta_* labels │ + │ e) Default instance to __address__ │ + │ │ + │ Result: Target with final label set │ + │ {job="kubernetes-pods", instance="10.0.0.1:8080", namespace="prod"} │ + └─────────────────────────────────────────────────────────────────────────┘ + │ + ▼ + 4. SCRAPE LOOP + ┌─────────────────────────────────────────────────────────────────────────┐ + │ HTTP GET target → Parse metrics → Apply metric_relabel_configs │ + │ → Append to storage with final labels │ + └─────────────────────────────────────────────────────────────────────────┘ +``` + +**Critical observation**: The `__meta_*` labels are deleted in step 3d. With entity support, we intercept these labels *before* deletion to generate entities. + +--- + +## Entity Sources + +Entities can originate from two sources, each suited to different deployment patterns: + +### Source 1: Service Discovery + +When Prometheus scrapes targets directly, SD metadata accurately describes the entity producing metrics: + +| SD Mechanism | What It Discovers | Entity It Can Generate | +|--------------|-------------------|------------------------| +| Kubernetes pod SD | Pods | `k8s.pod` | +| Kubernetes node SD | Nodes | `k8s.node` | +| Kubernetes service SD | Services | `k8s.service` | +| EC2 SD | EC2 instances | `host`, `cloud.instance` | +| Azure VM SD | Azure VMs | `host`, `cloud.instance` | +| GCE SD | GCE instances | `host`, `cloud.instance` | +| Consul SD | Services | `service` | + +**When to use**: Direct scraping where the target IS the entity. + +### Source 2: Exposition Format + +When metrics flow through intermediaries, SD sees the intermediary, not the actual sources: + +``` +┌───────────┐ ┌───────────┐ ┌───────────┐ +│ Service A │────▶│ OTel │◀─────▶│Prometheus │ +│ (pod-xyz) │push │ Collector │scrape │ │ +└───────────┘ │ │ │ SD sees: │ +┌───────────┐ │ (pod-abc) │ │ pod-abc │ +│ Service B │────▶│ │ │ │ +└───────────┘ └───────────┘ └───────────┘ + │ + Entity info must travel │ + WITH the metrics ─────────┘ +``` + +**When to use**: Gateways, federation, pushgateway, kube-state-metrics. + +See [01-context.md](./01-context.md#collection-architectures-direct-scraping-vs-gateways) for detailed use cases. + +--- + +## Configuration + +### ScrapeConfig Extension + +The `ScrapeConfig` struct in `config/config.go` is extended: + +```go +type ScrapeConfig struct { + // ... existing fields ... + + // EntityFromSD controls SD-derived entity generation. + // When true, Prometheus generates entities from __meta_* labels + // according to the built-in mappings for each SD type. 
+ // Default: false (for backward compatibility) + EntityFromSD bool `yaml:"entity_from_sd,omitempty"` + + // EntityLimit is the maximum number of distinct entities per target. + // A single target may correlate with multiple entities (e.g., pod + node). + // 0 means no limit. + EntityLimit int `yaml:"entity_limit,omitempty"` +} +``` + +### Configuration Examples + +```yaml +scrape_configs: + # Direct scraping with entity generation enabled + - job_name: 'kubernetes-pods' + kubernetes_sd_configs: + - role: pod + entity_from_sd: true + + # Gateway pattern - entities come from exposition format + - job_name: 'otel-collector' + static_configs: + - targets: ['otel-collector:8889'] + entity_from_sd: false # Default + + # Federation - entities flow through metrics + - job_name: 'federate' + honor_labels: true + metrics_path: '/federate' + static_configs: + - targets: ['prometheus-regional:9090'] + entity_from_sd: false +``` + +--- + +## Attribute Mapping Standards + +A critical design decision for SD-derived entities is how `__meta_*` labels translate to entity type names and attribute names. This section outlines the requirements, available options, and trade-offs for establishing a mapping standard. + +### The Problem + +Service Discovery mechanisms produce `__meta_*` labels with provider-specific naming: + +``` +__meta_kubernetes_pod_uid +__meta_kubernetes_namespace +__meta_ec2_instance_id +__meta_azure_machine_id +``` + +These must be transformed into entity attributes. The key questions are: + +1. **Entity type names**: What should we call the entity? (`k8s.pod`? `kubernetes_pod`? `pod`?) +2. **Attribute names**: How should attributes be named? (`k8s.pod.uid`? `pod_uid`? `uid`?) +3. **Which labels become identifying vs. descriptive?** + +The answers to these questions affect: +- **Correlation**: Metrics and entities must share the same identifying label names and values +- **Interoperability**: Other systems querying Prometheus data need predictable attribute names +- **Ecosystem alignment**: Conventions should facilitate integration with dashboards, alerting, and other tools + +### Design Requirements + +Whatever convention is chosen, the mapping must satisfy these requirements: + +1. **Deterministic**: Given the same `__meta_*` labels, the resulting entity attributes must always be identical +2. **Complete**: All meaningful metadata should be captured—useful information should not be silently dropped +3. **Unambiguous**: Each `__meta_*` label maps to exactly one attribute; no conflicts or overlaps +4. **Stable**: Once established, mappings should not change without a clear migration path + +### Available Options + +#### Option 1: OpenTelemetry Semantic Conventions + +Adopt attribute names from [OpenTelemetry Semantic Conventions](https://opentelemetry.io/docs/specs/semconv/), which define standardized names for resource attributes across the industry. + +**Example mappings:** + +| SD Label | OTel-style Entity Attribute | +|----------|----------------------------| +| `__meta_kubernetes_pod_uid` | `k8s.pod.uid` | +| `__meta_kubernetes_namespace` | `k8s.namespace.name` | +| `__meta_ec2_instance_id` | `host.id` | +| `__meta_ec2_instance_type` | `host.type` | +| `__meta_azure_machine_id` | `host.id` | +| `__meta_gce_project` | `cloud.account.id` | + +**Advantages:** +- Industry-wide standardization enables correlation across tools (Grafana, OTel Collector, etc.) 
+- Reduces cognitive load for teams already using OTel conventions +- Future-proofs Prometheus for deeper OTel integration +- Extensive documentation and community support + +**Disadvantages:** +- Not all conventions are stable; Kubernetes conventions are currently "Experimental" and may change +- Introduces dot-separated names (e.g., `k8s.pod.uid`) which differ from Prometheus's traditional underscore convention +- Requires Prometheus to track and potentially adapt to external convention changes + +**Stability considerations:** + +If OTel conventions are adopted, Prometheus should consider: +- Only adopting conventions that have reached **Stable** status +- For widely-used Experimental conventions (like Kubernetes), accepting the risk with clear user documentation +- Establishing a migration strategy for when conventions change + +#### Option 2: Prometheus-Native Conventions + +Define Prometheus-specific conventions that align with existing Prometheus naming patterns (lowercase, underscore-separated). + +**Example mappings:** + +| SD Label | Prometheus-style Entity Attribute | +|----------|----------------------------------| +| `__meta_kubernetes_pod_uid` | `kubernetes_pod_uid` | +| `__meta_kubernetes_namespace` | `kubernetes_namespace` | +| `__meta_ec2_instance_id` | `ec2_instance_id` | +| `__meta_ec2_instance_type` | `ec2_instance_type` | +| `__meta_azure_machine_id` | `azure_machine_id` | +| `__meta_gce_project` | `gce_project` | + +**Advantages:** +- Consistent with existing Prometheus label naming conventions +- Full control over naming without external dependencies +- No risk of upstream convention changes +- Simpler—direct transformation from `__meta_*` labels + +**Disadvantages:** +- No industry standardization; correlation with OTel-based systems requires translation +- Prometheus would need to define and maintain its own convention documentation +- May diverge from where the broader observability ecosystem is heading +- Less intuitive for teams already using OTel conventions + +#### Option 3: Minimal Transformation + +Strip the `__meta_` prefix and SD-type prefix, keeping attribute names close to the original. + +**Example mappings:** + +| SD Label | Minimal Entity Attribute | +|----------|-------------------------| +| `__meta_kubernetes_pod_uid` | `pod_uid` | +| `__meta_kubernetes_namespace` | `namespace` | +| `__meta_ec2_instance_id` | `instance_id` | +| `__meta_ec2_instance_type` | `instance_type` | +| `__meta_azure_machine_id` | `machine_id` | +| `__meta_gce_project` | `project` | + +**Advantages:** +- Simplest transformation logic +- Shortest attribute names +- Easy to understand and predict + +**Disadvantages:** +- No namespace to distinguish provider-specific attributes +- Poor interoperability with any external standard + +### Identifying vs. Descriptive Label Classification + +Beyond naming, each mapping must classify labels as **identifying** (immutable, define identity) or **descriptive** (mutable, provide context). This classification must be: + +1. **Consistent with the data source**: If the underlying resource uses a UID for identity, so should the entity +2. **Globally unique when combined**: Identifying labels together must uniquely identify one entity +3. 
**Stable over the entity's lifetime**: Identifying label values must not change + +### SD Mechanisms Without Entity Mappings + +The following SD mechanisms do not generate entities automatically because they lack sufficient metadata to construct meaningful entities: + +| SD Mechanism | Reason | +|--------------|--------| +| `static_configs` | No metadata—just addresses | +| `file_sd_configs` | User-defined, no standard schema | +| `http_sd_configs` | User-defined, no standard schema | +| `dns_sd_configs` | Only provides addresses | + +Users requiring entities from these sources should embed entity information in the exposition format (see [02-exposition-formats.md](./02-exposition-formats.md)). + +### Non-Customizable by Design + +**Attribute mappings are not user-configurable.** This is intentional: + +1. **Standardization requires consistency**: If every deployment uses different attribute names, the benefits of entities (correlation, interoperability, ecosystem tooling) are lost +2. **Ecosystem tooling depends on predictability**: Dashboards, alerting rules, and integrations assume specific attribute names +3. **Reduced cognitive load**: Users don't need to understand or maintain mapping configurations +4. **Simpler implementation**: No configuration parsing, validation, or per-scrape-config mapping logic + +Users who need different attribute names can transform data downstream (e.g., in recording rules or remote write pipelines), but the source of truth in Prometheus uses the standard mappings. + +### Open Decision + +This proposal does not prescribe which naming convention Prometheus should adopt. The choice between OTel alignment, Prometheus-native conventions, or another approach should be made by the Prometheus community based on: + +- Strategic direction for OTel integration +- Compatibility requirements with existing tooling +- Long-term maintenance considerations +- Community feedback + +The implementation will be straightforward once a convention is chosen—the technical complexity is in the entity infrastructure, not the naming. + +--- + +## Implementation Details + +### Entity Generation in the Scrape Pipeline + +Entity generation happens during target creation, before `__meta_*` labels are discarded: + +```go +// In scrape/target.go - modified PopulateLabels +func PopulateLabels(lb *labels.Builder, cfg *config.ScrapeConfig, + tLabels, tgLabels model.LabelSet) (labels.Labels, []*Entity, error) { + PopulateDiscoveredLabels(lb, cfg, tLabels, tgLabels) + + // NEW: Generate entities from __meta_* labels BEFORE relabeling + var entities []*Entity + if cfg.EntityFromSD { + entities = generateEntitiesFromMeta(lb, cfg) + } + + // Apply relabeling (existing behavior) + keep := relabel.ProcessBuilder(lb, cfg.RelabelConfigs...) + if !keep { + return labels.EmptyLabels(), nil, nil + } + + // ... rest of existing validation ... + + // Delete __meta_* labels (existing behavior) + lb.Range(func(l labels.Label) { + if strings.HasPrefix(l.Name, model.MetaLabelPrefix) { + lb.Del(l.Name) + } + }) + + // ... rest of existing code ... + + return res, entities, nil +} + +// generateEntitiesFromMeta extracts entities based on SD-specific mappings +func generateEntitiesFromMeta(lb *labels.Builder, cfg *config.ScrapeConfig) []*Entity { + var entities []*Entity + + // Detect SD type from __meta_* prefix + // Kubernetes: __meta_kubernetes_* + // EC2: __meta_ec2_* + // etc. 
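+    //
+    // For each detected provider, a generate*Entity helper applies the
+    // standard, non-customizable attribute mapping: stable identifiers
+    // (e.g. the pod UID) become identifying labels, other meaningful
+    // metadata becomes descriptive labels.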
+ + if hasKubernetesLabels(lb) { + if entity := generateK8sPodEntity(lb); entity != nil { + entities = append(entities, entity) + } + if entity := generateK8sNodeEntity(lb); entity != nil { + entities = append(entities, entity) + } + // ... other K8s entity types + } + + if hasEC2Labels(lb) { + if entity := generateHostEntityFromEC2(lb); entity != nil { + entities = append(entities, entity) + } + } + + // ... other SD types + + return entities +} +``` + +### Target Structure Extension + +The `Target` struct is extended to hold generated entities: + +```go +// In scrape/target.go +type Target struct { + labels labels.Labels + scrapeConfig *config.ScrapeConfig + tLabels model.LabelSet + tgLabels model.LabelSet + + // NEW: Entities generated from SD metadata + sdEntities []*Entity + + // ... existing fields ... +} +``` + +### Entity Transmission to Storage + +When a target is scraped, its SD-derived entities are appended alongside metrics: + +```go +// In scrape/scrape.go - within scrapeLoop.append() +func (sl *scrapeLoop) append(app storage.Appender, b []byte, + contentType string, ts time.Time) (...) { + defTime := timestamp.FromTime(ts) + + // NEW: Append SD-derived entities for this target + if sl.sdEntities != nil { + for _, entity := range sl.sdEntities { + if _, err := app.AppendEntity( + entity.Type, + entity.IdentifyingLabels, + entity.DescriptiveLabels, + defTime, + ); err != nil { + sl.l.Debug("Error appending SD entity", "type", entity.Type, "err", err) + } + } + } + + // ... existing metric parsing and appending ... +} +``` + +### Data Flow Diagram + +``` +┌─────────────────────────────────────────────────────────────────────────────┐ +│ Entity Generation Data Flow │ +└─────────────────────────────────────────────────────────────────────────────┘ + +┌───────────────┐ ┌───────────────┐ ┌───────────────┐ +│ Kubernetes │ │ EC2 │ │ Consul │ +│ API │ │ API │ │ API │ +└───────┬───────┘ └───────┬───────┘ └───────┬───────┘ + │ │ │ + ▼ ▼ ▼ +┌───────────────────────────────────────────────────────────────────────────┐ +│ Discovery Manager │ +│ ┌─────────────────────────────────────────────────────────────────────┐ │ +│ │ targetgroup.Group │ │ +│ │ Targets: [ { __meta_kubernetes_pod_uid: "abc", ... } ] │ │ +│ │ Labels: { __meta_kubernetes_namespace: "prod" } │ │ +│ └─────────────────────────────────────────────────────────────────────┘ │ +└───────────────────────────────────────────────────────────────────────────┘ + │ + ▼ +┌───────────────────────────────────────────────────────────────────────────┐ +│ Scrape Manager │ +│ │ +│ scrapePool.Sync(tgs) → TargetsFromGroup() → PopulateLabels() │ +│ │ │ +│ ┌────────────────────┴────────────────────┐ │ +│ │ │ │ +│ ▼ ▼ │ +│ ┌─────────────────────────┐ ┌─────────────────────────┐ │ +│ │ Entity Generation │ │ Label Processing │ │ +│ │ (from __meta_* labels)│ │ (relabel_configs) │ │ +│ │ │ │ │ │ +│ │ IF entity_from_sd: │ │ 1. Apply relabel rules │ │ +│ │ Extract identifying │ │ 2. Delete __meta_* │ │ +│ │ Extract descriptive │ │ 3. 
Set instance default│ │ +│ │ Create Entity struct │ │ │ │ +│ └───────────┬─────────────┘ └──────────┬──────────────┘ │ +│ │ │ │ +│ │ ┌──────────────────────────────┘ │ +│ │ │ │ +│ ▼ ▼ │ +│ ┌─────────────────────────────────────────────────────────────────────┐ │ +│ │ Target │ │ +│ │ │ │ +│ │ labels: { job="k8s-pods", instance="10.0.0.1:8080", ns="prod" } │ │ +│ │ │ │ +│ │ sdEntities: [ │ │ +│ │ Entity{ │ │ +│ │ type: "k8s.pod", │ │ +│ │ identifyingLabels: {namespace="prod", pod_uid="abc-123"} │ │ +│ │ descriptiveLabels: {pod_name="nginx", node_name="worker-1"} │ │ +│ │ } │ │ +│ │ ] │ │ +│ └─────────────────────────────────────────────────────────────────────┘ │ +└───────────────────────────────────────────────────────────────────────────┘ + │ + ▼ +┌───────────────────────────────────────────────────────────────────────────┐ +│ Scrape Loop │ +│ │ +│ For each scrape: │ +│ 1. HTTP GET target │ +│ 2. Parse exposition format │ +│ 3. Extract exposition-format entities (if any) │ +│ 4. Merge SD entities + exposition entities │ +│ 5. app.AppendEntity() for each entity │ +│ 6. app.Append() for each metric (with correlation via shared labels) │ +│ 7. app.Commit() │ +└───────────────────────────────────────────────────────────────────────────┘ + │ + ▼ +┌───────────────────────────────────────────────────────────────────────────┐ +│ Storage (TSDB) │ +│ │ +│ ┌─────────────────────┐ ┌─────────────────────┐ │ +│ │ Entity Storage │ │ Series Storage │ │ +│ │ │ │ │ │ +│ │ memEntity │◄──►│ memSeries │ │ +│ │ stripeEntities │ │ stripeSeries │ │ +│ │ EntityMemPostings │ │ postings │ │ +│ │ │ │ │ │ +│ │ Correlation Index │────┤ │ │ +│ └─────────────────────┘ └─────────────────────┘ │ +└───────────────────────────────────────────────────────────────────────────┘ +``` + +--- + +## Relabeling and Entities + +This section specifies how relabeling interacts with entity generation. + +### Principle: Entities Are Generated Before Relabeling + +Entity generation uses the **raw** `__meta_*` labels before any relabeling is applied. This ensures: + +1. **Predictability**: Entity structure is consistent regardless of user relabeling rules +2. **Correctness**: Identifying labels match the actual resource identity +3. **Simplicity**: Users don't need to coordinate relabeling with entity generation + +### relabel_configs Do Not Affect Entity Labels + +```yaml +scrape_configs: + - job_name: 'kubernetes-pods' + kubernetes_sd_configs: + - role: pod + entity_from_sd: true + relabel_configs: + # This ONLY affects metric labels, NOT entity labels + - source_labels: [__meta_kubernetes_namespace] + target_label: ns # Metric label becomes "ns" + # Entity attribute uses the standard mapping (unchanged) +``` + +**Rationale**: Entity identifying labels are derived from `__meta_*` labels using the standard mapping, independent of `relabel_configs`. This ensures entity structure is predictable regardless of user relabeling rules. + +### metric_relabel_configs and Entity Labels + +`metric_relabel_configs` operates on metrics **after** they're scraped but **before** correlation happens. Entity-enriched labels (descriptive labels added during query) are **not** subject to `metric_relabel_configs`. + +```yaml +scrape_configs: + - job_name: 'kubernetes-pods' + entity_from_sd: true + metric_relabel_configs: + # This drops metrics, but entities remain + - source_labels: [__name__] + regex: 'go_.*' + action: drop +``` + +### honor_labels Interaction + +When `honor_labels: true`, labels from the scraped payload take precedence over target labels. 
This affects correlation: + +```yaml +scrape_configs: + - job_name: 'federate' + honor_labels: true + entity_from_sd: false # Entities come from federated metrics +``` + +If `entity_from_sd: true` with `honor_labels: true`: +- SD-derived entities are still generated +- Correlation uses the **final** metric labels (which may come from the payload) +- This could cause correlation mismatches if payload labels differ from SD labels + +**Recommendation**: When using `honor_labels: true`, set `entity_from_sd: false` and rely on exposition-format entities. + +--- + +## Conflict Resolution + +> **TODO**: This section needs further design work. When entities come from both SD and the exposition format for the same scrape, we need to define: +> - How to detect that two entities refer to the same resource +> - Whether to merge, prefer one source, or treat them as distinct +> - How to handle conflicting descriptive labels +> - Edge cases around timing and ordering +> +> This interacts with the exposition format design in [02-exposition-formats.md](./02-exposition-formats.md) and needs to be addressed holistically. + +--- + +## Entity Lifecycle with SD + +### Entity Creation + +An SD-derived entity is created when a target with matching `__meta_*` labels first appears in discovery. + +### Entity Updates + +When a target is re-discovered (on each SD refresh) and `entity_from_sd: true`: +1. Entity identifying labels are checked against existing entities +2. If entity exists, descriptive labels are compared +3. If descriptive labels changed, a new snapshot is recorded (see [05-storage.md](./05-storage.md)) + +### Entity Staleness + +When a target disappears from SD: + +1. **Immediate behavior**: The target's scrape loop is stopped +2. **Entity marking**: The SD-derived entity associated with that target receives an `endTime` timestamp +3. **Grace period**: Entities remain queryable for historical analysis + +**Implementation**: + +```go +// In scrape/scrape.go - when target is removed +func (sp *scrapePool) sync(targets []*Target) { + // ... existing target diff logic ... + + // For removed targets, mark their entities as potentially stale + for fingerprint, loop := range sp.loops { + if _, ok := uniqueLoops[fingerprint]; !ok { + // Target removed + if loop.sdEntities != nil { + for _, entity := range loop.sdEntities { + // Don't immediately mark dead - other targets might use same entity + sp.entityRefCounts[entity.Hash()]-- + if sp.entityRefCounts[entity.Hash()] == 0 { + // No more targets reference this entity + app.MarkEntityDead(entity.Ref, timestamp.FromTime(time.Now())) + } + } + } + loop.stop() + } + } +} +``` + +### Entity Deduplication + +Multiple targets may correlate with the same entity (e.g., multiple containers in a pod). The entity is only created once: + +```go +// Entity identity is determined by type + identifying labels +func entityHash(entityType string, identifyingLabels labels.Labels) uint64 { + h := fnv.New64a() + h.Write([]byte(entityType)) + identifyingLabels.Range(func(l labels.Label) { + h.Write([]byte(l.Name)) + h.Write([]byte(l.Value)) + }) + return h.Sum64() +} +``` + +When the same entity is discovered from multiple targets: +- First discovery creates the entity +- Subsequent discoveries update `lastSeen` timestamp +- Descriptive labels are merged (last write wins for conflicts) + +--- + +## Open Questions Resolved + +### Q: Entity deduplication across multiple discovery mechanisms + +**Answer**: Entities are deduplicated by their identifying labels. 
If Kubernetes pod SD and endpoints SD both discover the same pod, only one entity is stored. The entity's descriptive labels are updated from whichever source provides the most recent data. + +### Q: SD entity lifecycle when target disappears + +**Answer**: When the last target referencing an entity disappears from SD, the entity's `endTime` is set to the current timestamp. The entity remains in storage for historical queries until retention deletes it. + +## Open Questions + +### Q: Which naming convention should Prometheus adopt for entity attributes? + +This proposal presents the available options (OTel semantic conventions, Prometheus-native, minimal transformation) and their trade-offs, but does not prescribe a specific choice. The decision should be made by the Prometheus community considering: + +- Strategic alignment with OpenTelemetry +- Existing ecosystem tooling and dashboards +- Long-term maintenance burden +- Community preferences + +### Q: How should Prometheus handle OTel conventions that are not yet stable? + +If OTel semantic conventions are chosen, Prometheus must decide how to handle conventions that haven't reached "Stable" status (e.g., Kubernetes conventions are currently "Experimental"). Options include: + +1. **Strict stability requirement**: Only adopt stable conventions; define Prometheus-specific names for unstable areas +2. **Pragmatic adoption**: Adopt widely-used experimental conventions with clear documentation about potential future changes +3. **Hybrid approach**: Use stable OTel conventions where available, Prometheus-native names elsewhere + +### Q: Should entity types be namespaced by SD mechanism? + +When multiple SD mechanisms can discover similar resources (e.g., EC2, Azure, GCE all discover "hosts"), should entity types be: + +- **Generic**: `host` (requires merging semantics across providers) +- **Provider-specific**: `ec2.instance`, `azure.vm`, `gce.instance` (clearer provenance, no collision risk) +- **Hierarchical**: `host` with `cloud.provider` as an identifying label + +--- + +## Related Documents + +- [01-context.md](./01-context.md) - Problem statement, motivation, and use cases +- [02-exposition-formats.md](./02-exposition-formats.md) - How entities are represented in wire formats +- [05-storage.md](./05-storage.md) - How entities are stored in the TSDB +- [06-querying.md](./06-querying.md) - PromQL extensions for working with entities +- [07-web-ui-and-apis.md](./07-web-ui-and-apis.md) - How entities are displayed and accessed + +--- + +*This proposal is a work in progress. Feedback is welcome.* + diff --git a/proposals/0071-Entity/05-storage.md b/proposals/0071-Entity/05-storage.md new file mode 100644 index 0000000..b98e2df --- /dev/null +++ b/proposals/0071-Entity/05-storage.md @@ -0,0 +1,998 @@ +# Entity Storage + +> **Recommended Approach**: This document describes the correlation-based storage design, which we recommend for initial implementation due to its incremental nature and backward compatibility. An alternative design that fundamentally changes how series identity works is described in [05b-storage-entity-native.md](05b-storage-entity-native.md). + +## Abstract + +This document specifies how Prometheus stores entities reliably and efficiently. Entities represent the things that produce telemetry (pods, nodes, services) and need different storage semantics than traditional time series: they have immutable identifying labels, mutable descriptive labels that change over time, and lifecycle boundaries (creation and deletion). 
This document covers the in-memory structures, Write-Ahead Log integration, block persistence, and the correlation index that links entities to their associated metrics. + +## Background + +### Current Prometheus Storage Architecture + +Prometheus uses a time series database (TSDB) optimized for append-heavy workloads with the following key components: + +**Head Block**: The in-memory component that stores the most recent data. New samples are appended here first. The Head contains: +- `memSeries`: In-memory representation of each time series, holding recent samples in chunks +- `stripeSeries`: A sharded map for concurrent access to series by ID or label hash +- `MemPostings`: An inverted index mapping label name/value pairs to series references + +**Write-Ahead Log (WAL)**: Ensures durability by writing all incoming data to disk before acknowledging. On crash recovery, the WAL is replayed to reconstruct the Head. WAL records include: +- Series records (new series with their labels) +- Sample records (timestamp + value for a series) +- Metadata records (type, unit, help for metrics) +- Exemplar and histogram records + +**Persistent Blocks**: Periodically, the Head is compacted into immutable blocks stored on disk. Each block contains: +- Chunk files (compressed time series data) +- Index file (label index, postings lists, series metadata) +- Meta file (time range, stats) + +**Appender Interface**: The primary interface for writing data to storage: + +```go +type Appender interface { + Append(ref SeriesRef, l labels.Labels, t int64, v float64) (SeriesRef, error) + Commit() error + Rollback() error + // ... other methods for histograms, exemplars, metadata +} +``` + +The scrape loop uses Appender to write scraped metrics. Each scrape creates an Appender, appends all samples, then calls Commit() to atomically persist everything to the WAL. + +### Why Entities Need Different Storage + +Entities differ from time series in fundamental ways: + +| Aspect | Time Series | Entities | +|--------|-------------|----------| +| Identity | Labels (all mutable in theory) | Identifying labels (immutable) | +| Values | Numeric samples over time | String labels (descriptive) | +| Cardinality | High (many series per entity) | Lower (one entity, many series) | +| Lifecycle | Implicit (staleness) | Explicit (start/end timestamps) | +| Correlation | Self-contained | Links to multiple series | + +These differences motivate a dedicated storage approach rather than trying to fit entities into the existing series model. + +## Entity Data Model + +### The memEntity Structure + +Each entity in memory is represented by the following structure: + +```go +type memEntity struct { + // Immutable after creation - no lock needed for these fields + ref EntityRef // Unique identifier (uint64, auto-incrementing) + entityType string // e.g., "k8s.pod", "service", "k8s.node" + identifyingLabels labels.Labels // Immutable labels that define identity + + // Lifecycle timestamps + startTime int64 // When this entity incarnation was created + endTime int64 // When deleted (0 if still alive) + + // Mutable - requires lock + sync.Mutex + descriptiveSnapshots []labelSnapshot // Historical descriptive labels + lastSeen int64 // Last scrape timestamp (for staleness checking) +} + +type labelSnapshot struct { + timestamp int64 + labels labels.Labels +} +``` + +### Identifying vs Descriptive Labels + +**Identifying Labels** define what an entity *is*. 
They are immutable for the lifetime of an entity incarnation: + +``` +Entity Type: k8s.pod +Identifying Labels: + - k8s.namespace.name = "production" + - k8s.pod.uid = "550e8400-e29b-41d4-a716-446655440000" +``` + +Two entities with the same identifying labels are considered the same entity (within their lifecycle bounds). + +**Descriptive Labels** provide additional context that may change over time: + +``` +Descriptive Labels (at t1): + - k8s.pod.name = "nginx-7b9f5" + - k8s.node.name = "worker-1" + - k8s.pod.status = "Running" + +Descriptive Labels (at t2, pod migrated): + - k8s.pod.name = "nginx-7b9f5" + - k8s.node.name = "worker-2" ← changed + - k8s.pod.status = "Running" +``` + +### Snapshot Storage for Descriptive Labels + +Descriptive labels are stored as complete snapshots at each change point. When new descriptive labels arrive: + +1. Compare with the most recent snapshot +2. If different, append a new snapshot with current timestamp +3. If identical, update `lastSeen` but don't create new snapshot + +``` +descriptiveSnapshots: [ + { t1, {name="nginx-7b9f5", node="worker-1", status="Running"} }, + { t5, {name="nginx-7b9f5", node="worker-2", status="Running"} }, // node changed + { t9, {name="nginx-7b9f5", node="worker-2", status="Terminating"} }, // status changed +] +``` + +**Why snapshots instead of an event log?** + +An event log (storing only deltas) would save storage space but impose query-time costs. To answer "what were the descriptive labels at time T?", a query would need to: +1. Find all change events before T +2. Replay them to reconstruct the state + +With snapshots, the query simply finds the latest snapshot where `snapshot.timestamp <= T`. + +### Entity Lifecycle + +Each entity has explicit lifecycle boundaries: + +``` +┌─────────────────────────────────────────────────────────────────────┐ +│ Entity Lifecycle │ +├─────────────────────────────────────────────────────────────────────┤ +│ │ +│ startTime endTime │ +│ │ │ │ +│ ▼ ▼ │ +│ ┌──────────────────────────────────────────────────────┐ │ +│ │ Entity is "alive" │ │ +│ │ - Correlates with metrics in this time range │ │ +│ │ - Descriptive labels tracked │ │ +│ └──────────────────────────────────────────────────────┘ │ +│ │ +│ Before startTime: Entity doesn't exist │ +│ After endTime: Entity is "dead" (historical only) │ +│ endTime == 0: Entity is currently alive │ +│ │ +└─────────────────────────────────────────────────────────────────────┘ +``` + +**Entity Staleness** + +An entity's `endTime` is determined by staleness, similar to series staleness: +- Each scrape updates `lastSeen` timestamp +- If `now - lastSeen > staleness_threshold`, entity is marked dead +- `endTime` is set to `lastSeen + staleness_threshold` + +**Entity Reincarnation** + +The same identifying labels can appear again after an entity ends: + +``` +Timeline: + t1: Entity A created (ref=1, identifying={pod.uid="abc"}, startTime=t1) + t5: Entity A deleted (ref=1, endTime=t5) + t10: Entity B created (ref=2, identifying={pod.uid="abc"}, startTime=t10) +``` + +Entity A and Entity B have the same identifying labels but different EntityRefs and non-overlapping lifecycles. At any point in time, at most one entity with a given set of identifying labels should be alive. + +## Storage Components + +### In-Memory Structures + +#### Entity Storage in Head + +The Head block is extended with entity storage: + +```go +type Head struct { + // ... existing fields ... 
+ + // Entity storage + entities *stripeEntities // All entities by ref or identifying attrs hash + entityPostings *EntityMemPostings // Inverted index for entity labels + + // Correlation index + seriesToEntities map[HeadSeriesRef][]EntityRef + entitiesToSeries map[EntityRef][]HeadSeriesRef + correlationMtx sync.RWMutex + + lastEntityID atomic.Uint64 // For generating EntityRefs +} +``` + +#### stripeEntities + +Similar to `stripeSeries`, provides sharded concurrent access to entities: + +```go +type stripeEntities struct { + size int + series []map[EntityRef]*memEntity + hashes []map[uint64][]*memEntity // hash(identifyingAttrs) -> entities + locks []sync.RWMutex +} + +// Get entity by ref +func (s *stripeEntities) getByRef(ref EntityRef) *memEntity + +// Get entity by identifying labels (may return multiple for historical) +func (s *stripeEntities) getByIdentifyingLabels(hash uint64, lbls labels.Labels) []*memEntity + +func (s *stripeEntities) getAliveByIdentifyingLabels(hash uint64, lbls labels.Labels) *memEntity +``` + +#### EntityMemPostings + +An inverted index mapping label name/value pairs to entity references: + +```go +type EntityMemPostings struct { + mtx sync.RWMutex + m map[string]map[string][]EntityRef // label name -> label value -> entity refs +} + +// Example contents: +// "k8s.namespace.name" -> "production" -> [EntityRef(1), EntityRef(5), EntityRef(12)] +// "k8s.node.name" -> "worker-1" -> [EntityRef(1), EntityRef(3)] +``` + +This enables efficient lookups like "find all entities in namespace X" or "find all entities on node Y". + +#### Correlation Index + +The correlation index maintains the many-to-many relationship between series and entities: + +```go +// Series -> Entities: "which entities does this series correlate with?" +seriesToEntities map[HeadSeriesRef][]EntityRef + +// Entities -> Series: "which series are associated with this entity?" +entitiesToSeries map[EntityRef][]HeadSeriesRef +``` + +**Building the correlation at ingestion time:** + +When a new series is created: +``` +series.labels = {__name__="container_cpu", k8s.namespace.name="prod", k8s.pod.uid="abc", k8s.node.uid="xyz"} + +For each registered entity type: + k8s.pod: requires {k8s.namespace.name, k8s.pod.uid} + → series has both → find entity with these identifying attrs + → if found and alive: add to correlation index + + k8s.node: requires {k8s.node.uid} + → series has this → find entity with this identifying attr + → if found and alive: add to correlation index + +Result: seriesToEntities[series.ref] = [podEntityRef, nodeEntityRef] +``` + +When a new entity is created: +``` +entity.identifyingAttrs = {k8s.namespace.name="prod", k8s.pod.uid="abc"} + +Find all series whose labels contain ALL of entity's identifying attrs: + → Use postings index: intersect(postings["k8s.namespace.name"]["prod"], + postings["k8s.pod.uid"]["abc"]) + → For each matching series: add to correlation index +``` + +**Correlation and Entity Lifecycle** + +When an entity becomes stale (endTime set), it remains in the correlation index. This preserves historical correlations for queries over past time ranges. The query layer filters based on timestamp overlap between the query range and entity lifecycle. + +### Write-Ahead Log + +#### New WAL Record Type + +A single new record type captures all entity state: + +```go +const ( + // ... existing types ... 
+ Entity Type = 11 // Entity record +) + +type RefEntity struct { + Ref EntityRef + EntityType string + IdentifyingLabels []labels.Label + DescriptiveLabels []labels.Label + StartTime int64 + EndTime int64 // 0 if alive + Timestamp int64 // When this record was written +} +``` + +#### Record Encoding + +Entity records follow the same encoding pattern as other WAL records: + +``` +┌───────────┬──────────┬────────────┬──────────────┐ +│ type <1b> │ len <2b> │ CRC32 <4b> │ data │ +└───────────┴──────────┴────────────┴──────────────┘ +``` + +The data section for an Entity record: + +``` +┌─────────────────────────────────────────────────────────────────────┐ +│ Entity Record Data │ +├─────────────────────────────────────────────────────────────────────┤ +│ ref <8b, big-endian> │ +│ entityType │ +│ numIdentifyingLabels │ +│ ┌─ name │ +│ └─ value │ +│ ... repeated for each identifying label │ +│ numDescriptiveLabels │ +│ ┌─ name │ +│ └─ value │ +│ ... repeated for each descriptive label │ +│ startTime │ +│ endTime │ +│ timestamp │ +└─────────────────────────────────────────────────────────────────────┘ +``` + +#### When Entity Records Are Written + +Entity records are written to WAL in these situations: + +1. **New entity created**: Full record with startTime set, endTime=0 +2. **Descriptive labels changed**: Full record with updated labels and new timestamp +3. **Entity marked dead**: Full record with endTime set + +Writing full records (not deltas) simplifies replay and allows any single record to fully describe entity state at that point. + +#### WAL Replay Behavior + +On startup, entity records are replayed to reconstruct the Head's entity state: + +```go +func (h *Head) replayEntityRecord(rec RefEntity) error { + existing := h.entities.getByRef(rec.Ref) + + if existing == nil { + // New entity - create it + entity := &memEntity{ + ref: rec.Ref, + entityType: rec.EntityType, + identifyingLabels: rec.IdentifyingLabels, + startTime: rec.StartTime, + endTime: rec.EndTime, + } + if len(rec.DescriptiveLabels) > 0 { + entity.descriptiveSnapshots = []labelSnapshot{ + {timestamp: rec.Timestamp, labels: rec.DescriptiveLabels}, + } + } + h.entities.set(entity) + } else { + // Update existing entity + existing.Lock() + existing.endTime = rec.EndTime + if len(rec.DescriptiveLabels) > 0 { + // Check if labels changed from last snapshot + if shouldAddSnapshot(existing, rec.DescriptiveLabels) { + existing.descriptiveSnapshots = append( + existing.descriptiveSnapshots, + labelSnapshot{timestamp: rec.Timestamp, labels: rec.DescriptiveLabels}, + ) + } + } + existing.Unlock() + } + + // Update lastEntityID if needed + if uint64(rec.Ref) > h.lastEntityID.Load() { + h.lastEntityID.Store(uint64(rec.Ref)) + } + + return nil +} +``` + +The correlation index is rebuilt after all WAL records are replayed, by iterating all entities and series and computing correlations. + +### Block Persistence + +When the Head is compacted into a persistent block, entities must also be persisted. 
+ +#### Entity Index in Blocks + +Each block includes an entity index alongside the existing series index: + +``` +Block Directory Structure: + block-ulid/ + ├── chunks/ # Chunk files (existing) + ├── index # Series index (existing) + ├── entities # Entity index (new) + ├── meta.json # Block metadata (extended) + └── tombstones # Deletion markers (existing) +``` + +The entity index file structure: + +``` +┌─────────────────────────────────────────────────────────────────────┐ +│ Entity Index File │ +├─────────────────────────────────────────────────────────────────────┤ +│ Magic Number (4 bytes) │ +│ Version (1 byte) │ +├─────────────────────────────────────────────────────────────────────┤ +│ Symbol Table │ +│ - All unique strings (entity types, attr names, attr values) │ +├─────────────────────────────────────────────────────────────────────┤ +│ Entity Table │ +│ For each entity: │ +│ - EntityRef │ +│ - EntityType (symbol ref) │ +│ - IdentifyingLabels (symbol ref pairs) │ +│ - StartTime, EndTime │ +│ - DescriptiveSnapshots offset (pointer to snapshots section) │ +├─────────────────────────────────────────────────────────────────────┤ +│ Descriptive Snapshots Section │ +│ For each entity's snapshots: │ +│ - Number of snapshots │ +│ - For each snapshot: timestamp, labels (symbol ref pairs) │ +├─────────────────────────────────────────────────────────────────────┤ +│ Entity Postings │ +│ - Inverted index: (label_name, label_value) -> [EntityRefs] │ +├─────────────────────────────────────────────────────────────────────┤ +│ Table of Contents │ +│ CRC32 │ +└─────────────────────────────────────────────────────────────────────┘ +``` + +#### Compaction Behavior + +During compaction: + +1. **Entity Selection**: Include entities whose lifecycle overlaps with the block's time range + ``` + Include entity if: entity.startTime < block.maxTime AND + (entity.endTime == 0 OR entity.endTime > block.minTime) + ``` + +2. **Snapshot Filtering**: Only include descriptive snapshots within the block's time range + +3. **Deduplication**: If compacting multiple blocks, entities with the same EntityRef are merged, keeping all unique snapshots + +#### Entity Retention + +Entities follow the same retention policy as series data. Prometheus deletes blocks based on `RetentionDuration` (time-based) or `MaxBytes` (size-based). When blocks are deleted, entities are handled as follows: + +**Retention Rule**: An entity persists as long as **any block overlapping its lifecycle** exists. + +``` +Block Timeline: + Block 1 Block 2 Block 3 Block 4 + [t0, t1] [t1, t2] [t2, t3] [t3, t4] + +Entity A: ████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ + startTime=t0 endTime=t1.5 + (lifecycle spans Block 1 and Block 2) + +Entity B: ░░░░░░░░░░░░████████████████████████████████ + startTime=t1.2 endTime=0 (still alive) + (lifecycle spans Block 2, Block 3, Block 4, Head) + +When Block 1 and Block 2 are deleted due to retention: +- Entity A is deleted (no remaining blocks contain its lifecycle) +- Entity B persists (Block 3, Block 4, Head still overlap its lifecycle) +``` + +This ensures historical queries can always resolve entity correlations for the data that remains. + +#### Head Entity Garbage Collection + +The Head block periodically runs garbage collection to remove entities that are no longer needed in memory. This mirrors how series GC works in `Head.gc()`. + +**GC Eligibility**: An entity in the Head is eligible for garbage collection when: +1. The entity is dead (`endTime != 0`), AND +2. 
The entity's entire lifecycle is before `Head.MinTime()` (fully compacted to blocks) + +```go +func (h *Head) gcEntities() map[EntityRef]struct{} { + mint := h.MinTime() + deleted := make(map[EntityRef]struct{}) + + h.entities.iter(func(entity *memEntity) { + // Only consider dead entities + if entity.endTime == 0 { + return // Still alive, keep in Head + } + + // If the entity's entire lifecycle is before Head's minTime, + // it has been fully compacted to blocks and can be removed + if entity.endTime < mint { + deleted[entity.ref] = struct{}{} + } + }) + + // Remove from entity storage + for ref := range deleted { + entity := h.entities.getByRef(ref) + h.entities.delete(ref) + h.entityPostings.Delete(ref, entity.identifyingLabels) + } + + // Clean up correlation index + h.correlationMtx.Lock() + for ref := range deleted { + // Remove entity from all series correlations + for _, seriesRef := range h.entitiesToSeries[ref] { + h.seriesToEntities[seriesRef] = removeEntityRef( + h.seriesToEntities[seriesRef], ref) + } + delete(h.entitiesToSeries, ref) + } + h.correlationMtx.Unlock() + + return deleted +} +``` + +**Integration with Head.gc()**: Entity GC runs alongside series GC during `truncateMemory()`: + +```go +func (h *Head) truncateSeriesAndChunkDiskMapper(caller string) error { + // ... existing series GC ... + actualInOrderMint, minOOOTime, minMmapFile := h.gc() + + // Entity GC + deletedEntities := h.gcEntities() + h.metrics.entitiesRemoved.Add(float64(len(deletedEntities))) + + // ... rest of truncation ... +} +``` + +## Ingestion Flow + +### Extended Appender Interface + +The Appender interface is extended to support entity ingestion: + +```go +type Appender interface { + // ... existing methods ... + + // AppendEntity adds or updates an entity. + // Returns the EntityRef (existing or newly assigned). 
+    AppendEntity(
+        entityType string,
+        identifyingLabels labels.Labels,
+        descriptiveLabels labels.Labels,
+        timestamp int64,
+    ) (EntityRef, error)
+}
+```
+
+### headAppender Implementation
+
+```go
+func (a *headAppender) AppendEntity(
+    entityType string,
+    identifyingLabels labels.Labels,
+    descriptiveLabels labels.Labels,
+    timestamp int64,
+) (EntityRef, error) {
+
+    // Validate inputs
+    if entityType == "" {
+        return 0, fmt.Errorf("entity type cannot be empty")
+    }
+    if len(identifyingLabels) == 0 {
+        return 0, fmt.Errorf("identifying labels cannot be empty")
+    }
+
+    // Sort labels for consistent hashing
+    sort.Sort(identifyingLabels)
+    sort.Sort(descriptiveLabels)
+
+    hash := identifyingLabels.Hash()
+
+    // Check for existing alive entity
+    entity := a.head.entities.getAliveByIdentifyingLabels(hash, identifyingLabels)
+
+    if entity == nil {
+        // Create new entity
+        ref := EntityRef(a.head.lastEntityID.Inc())
+        entity = &memEntity{
+            ref:               ref,
+            entityType:        entityType,
+            identifyingLabels: identifyingLabels,
+            startTime:         timestamp,
+            endTime:           0,
+            lastSeen:          timestamp,
+        }
+
+        if len(descriptiveLabels) > 0 {
+            entity.descriptiveSnapshots = []labelSnapshot{
+                {timestamp: timestamp, labels: descriptiveLabels},
+            }
+        }
+
+        // Stage for commit
+        a.pendingEntities = append(a.pendingEntities, entity)
+        a.pendingEntityRecords = append(a.pendingEntityRecords, RefEntity{
+            Ref:               ref,
+            EntityType:        entityType,
+            IdentifyingLabels: identifyingLabels,
+            DescriptiveLabels: descriptiveLabels,
+            StartTime:         timestamp,
+            EndTime:           0,
+            Timestamp:         timestamp,
+        })
+
+        return ref, nil
+    }
+
+    // Update existing entity
+    entity.Lock()
+    entity.lastSeen = timestamp
+
+    // Check if descriptive labels changed
+    needsSnapshot := false
+    if len(entity.descriptiveSnapshots) == 0 {
+        needsSnapshot = len(descriptiveLabels) > 0
+    } else {
+        lastSnapshot := entity.descriptiveSnapshots[len(entity.descriptiveSnapshots)-1]
+        needsSnapshot = !labels.Equal(lastSnapshot.labels, descriptiveLabels)
+    }
+
+    if needsSnapshot {
+        entity.descriptiveSnapshots = append(entity.descriptiveSnapshots, labelSnapshot{
+            timestamp: timestamp,
+            labels:    descriptiveLabels,
+        })
+
+        // Stage WAL record for changed labels
+        a.pendingEntityRecords = append(a.pendingEntityRecords, RefEntity{
+            Ref:               entity.ref,
+            EntityType:        entity.entityType,
+            IdentifyingLabels: entity.identifyingLabels,
+            DescriptiveLabels: descriptiveLabels,
+            StartTime:         entity.startTime,
+            EndTime:           0,
+            Timestamp:         timestamp,
+        })
+    }
+
+    entity.Unlock()
+    return entity.ref, nil
+}
+```
+
+### Commit and Rollback
+
+**Commit** persists all pending entities to WAL and updates indexes:
+
+```go
+func (a *headAppender) Commit() error {
+    // ... existing commit logic for samples ...
+
+    // Write entity records to WAL
+    if len(a.pendingEntityRecords) > 0 {
+        if err := a.logEntities(); err != nil {
+            return err
+        }
+    }
+
+    // Add new entities to Head
+    for _, entity := range a.pendingEntities {
+        a.head.entities.set(entity)
+        a.head.entityPostings.Add(entity.ref, entity.identifyingLabels)
+
+        // Build correlations with existing series
+        a.head.buildEntityCorrelations(entity)
+    }
+
+    // Clear pending state
+    a.pendingEntities = a.pendingEntities[:0]
+    a.pendingEntityRecords = a.pendingEntityRecords[:0]
+
+    return nil
+}
+```
+
+**Rollback** discards all pending changes:
+
+```go
+func (a *headAppender) Rollback() error {
+    // ... existing rollback logic ...
+ + // Simply discard pending entities - they were never added to Head + a.pendingEntities = a.pendingEntities[:0] + a.pendingEntityRecords = a.pendingEntityRecords[:0] + + return nil +} +``` + +### Correlation Index Updates + +When building correlations for a new entity: + +```go +func (h *Head) buildEntityCorrelations(entity *memEntity) { + // Find all series that have ALL of the entity's identifying labels + var postingsLists []Postings + + entity.identifyingLabels.Range(func(l labels.Label) { + postingsLists = append(postingsLists, h.postings.Get(l.Name, l.Value)) + }) + + // Intersect all postings lists + intersection := Intersect(postingsLists...) + + h.correlationMtx.Lock() + defer h.correlationMtx.Unlock() + + for intersection.Next() { + seriesRef := intersection.At() + + // Add bidirectional correlation + h.seriesToEntities[seriesRef] = append(h.seriesToEntities[seriesRef], entity.ref) + h.entitiesToSeries[entity.ref] = append(h.entitiesToSeries[entity.ref], seriesRef) + } +} +``` + +When a new series is created, correlations are built similarly by finding all alive entities whose identifying labels are a subset of the series labels. + +## Query Support + +This section provides an overview of how storage exposes entities for queries. Detailed query semantics are covered in the Querying document. + +### Storage Query Interface + +```go +type EntityQuerier interface { + // Get entity by ref + Entity(ref EntityRef) (*Entity, error) + + // Find entities by type and/or labels + Entities(ctx context.Context, entityType string, matchers ...*labels.Matcher) (EntitySet, error) + + // Get entities correlated with a series at a specific time + EntitiesForSeries(seriesRef SeriesRef, timestamp int64) ([]EntityRef, error) + + // Get series correlated with an entity + SeriesForEntity(entityRef EntityRef) ([]SeriesRef, error) + + // Get descriptive labels at a point in time + DescriptiveLabelsAt(entityRef EntityRef, timestamp int64) (labels.Labels, error) +} +``` + +### Time-Range Filtering + +Queries specify a time range `[mint, maxt]`. Entity results are filtered by lifecycle: + +```go +func (e *memEntity) isAliveAt(t int64) bool { + return e.startTime <= t && (e.endTime == 0 || e.endTime > t) +} + +func (e *memEntity) overlapsRange(mint, maxt int64) bool { + return e.startTime < maxt && (e.endTime == 0 || e.endTime > mint) +} +``` + +### Descriptive Label Lookup + +To get descriptive labels at a specific timestamp: + +```go +func (e *memEntity) descriptiveLabelsAt(t int64) labels.Labels { + if !e.isAliveAt(t) { + return labels.EmptyLabels() + } + + snapshots := e.descriptiveSnapshots + if len(snapshots) == 0 { + return labels.EmptyLabels() + } + + // Binary search: find the first snapshot where timestamp > t + // Then the snapshot we want is at index i-1 + i := sort.Search(len(snapshots), func(i int) bool { + return snapshots[i].timestamp > t + }) + + if i == 0 { + // All snapshots are after time t + return labels.EmptyLabels() + } + + return snapshots[i-1].labels +} +``` + +## Remote Write Considerations + +Entities need to be transmitted over Prometheus remote write protocol. 
This requires extending the protobuf definitions: + +```protobuf +message EntityWriteRequest { + repeated Entity entities = 1; +} + +message Entity { + string entity_type = 1; + repeated Label identifying_labels = 2; + repeated Label descriptive_labels = 3; + int64 start_time_ms = 4; + int64 end_time_ms = 5; // 0 if alive + int64 timestamp_ms = 6; // When this state was observed +} +``` + +Key considerations for remote write: + +1. **Incremental Updates**: Only send entity records when state changes (new entity, attrs changed, entity died) +2. **Receiver Reconciliation**: Receivers must handle out-of-order entity records and merge appropriately +3. **Correlation Rebuild**: Receivers rebuild correlation indexes locally based on their series data + +Detailed remote write protocol changes are specified in a separate document. + +## Trade-offs and Design Decisions + +### Separate Entity Storage vs Embedding in Series + +**Decision**: Separate storage structure for entities + +**Rationale**: +- Entities have different access patterns (lookup by identifying labels vs. time-range queries) +- Many-to-many relationship with series doesn't fit the one-to-one series model +- Entity lifecycle (explicit start/end) differs from series staleness +- Descriptive labels are string-valued, not numeric samples + +**Trade-off**: Additional complexity in storage layer, but cleaner semantics and better query performance. + +### Snapshots vs Event Log for Descriptive Labels + +**Decision**: Store complete snapshots at each change point + +**Rationale**: +- Point-in-time queries are common ("what was this pod's node at time T?") +- Snapshots enable O(log n) lookup via binary search +- Event log would require O(n) replay to reconstruct state +- Descriptive labels change infrequently, limiting snapshot count + +**Trade-off**: Higher storage per change, but faster queries and simpler implementation. + +### Correlation at Ingestion Time vs Query Time + +**Decision**: Build correlation index at ingestion time + +**Rationale**: +- Queries should be fast; correlation lookup is O(1) with pre-built index +- Ingestion can afford extra work; it's already doing label processing +- Correlation relationships are stable (based on immutable identifying labels) + +**Trade-off**: Ingestion overhead for maintaining correlation index, but significantly faster queries. + +### Single WAL Record Type vs Multiple + +**Decision**: Single comprehensive entity record type + +**Rationale**: +- Simplifies WAL encoding/decoding logic +- Any single record fully describes entity state (no partial records) +- Replay is straightforward—each record is self-contained +- Matches pattern used for series (full labels in each Series record) + +**Trade-off**: Slightly larger WAL records, but simpler and more robust. + +## Open Questions / Future Work + +### Retention Alignment + +How exactly should entity retention align with block retention? 
+- Current proposal: entities persist while any block containing their lifecycle exists +- May need refinement based on operational experience + +### Memory Management + +Long-running Prometheus instances may accumulate many historical entities: +- Consider memory-mapped entity storage for historical entities +- Investigate entity compaction/summarization for very old data + +### Federation and Multi-Prometheus + +When multiple Prometheus instances scrape the same entities: +- Entity deduplication across instances +- Consistent EntityRef assignment (or ref translation) +- Correlation index consistency + +### Entity Type Registry + +Should Prometheus maintain a registry of known entity types with their identifying label schemas? +- Would enable validation at ingestion time +- Could optimize correlation index building +- Trade-off: flexibility vs. consistency + +--- + +## TODO: Memory and WAL Replay Performance + +This section requires further investigation and benchmarking: + +### Memory Concerns + +- **Entity memory footprint estimation**: We need to quantify the memory cost per entity, including the `memEntity` struct, descriptive snapshots, and correlation index entries. This will help users estimate memory requirements based on expected entity counts. + +- **Impact on existing memory settings**: How do entity storage requirements interact with `--storage.tsdb.head-chunks-*` and other memory-related flags? Should there be dedicated entity memory limits? + +- **Memory-mapped entity storage**: For Prometheus instances with very long uptimes and high entity churn, historical entities may accumulate. Investigate whether memory-mapping historical entities (similar to mmapped chunks) could reduce memory pressure. + +- **Correlation index memory scaling**: The bidirectional correlation maps (`seriesToEntities` and `entitiesToSeries`) could become large with high series and entity counts. Consider more memory-efficient data structures (e.g., roaring bitmaps) if benchmarks show this is a bottleneck. + +### WAL Replay Performance + +- **Correlation index rebuild time**: The current proposal rebuilds the correlation index after WAL replay by iterating all entities and series. For large Prometheus instances (millions of series, thousands of entities), this could significantly increase startup time. + +- **Incremental correlation during replay**: Instead of rebuilding correlations after replay, could we store correlation state in the WAL or maintain it incrementally during replay? This would trade WAL size for faster startup. + +- **Checkpointing correlation state**: Consider extending WAL checkpointing to include entity and correlation state, reducing the amount of replay needed on restart. + +- **Benchmark targets**: We should establish performance targets (e.g., "WAL replay should not increase by more than 10% with 10,000 entities") and validate them through benchmarks. + +These topics need benchmarking with realistic workloads before finalizing the implementation approach. 
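+
+The entity memory footprint bullet above asks for a per-entity cost estimate. As a placeholder until real benchmarks exist, a back-of-envelope sketch is shown below; every constant in it is an assumption, not a measured value:
+
+```go
+// estimateEntityBytes is a rough, illustrative per-entity memory estimate.
+// The overhead constants are assumptions to be replaced by measurements.
+func estimateEntityBytes(numIdentifying, numDescriptive, numSnapshots, avgLabelBytes int) int {
+	const structOverhead = 112 // memEntity fields, mutex, slice headers (assumed)
+	labelSet := func(n int) int {
+		return n * (2*avgLabelBytes + 32) // name + value strings plus per-label overhead (assumed)
+	}
+	return structOverhead +
+		labelSet(numIdentifying) +
+		numSnapshots*(8+labelSet(numDescriptive)) // 8 bytes per snapshot timestamp
+}
+```
+
+Under these assumptions, an entity with 2 identifying labels, 10 descriptive labels, 3 snapshots, and ~20-byte label strings costs roughly 2.4 KiB, so 10,000 such entities would be on the order of 25 MB before counting correlation index entries.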
+
+---
+
+## TODO: Columnar Storage Strategies
+
+This section outlines potential optimizations for entity label storage that warrant further exploration:
+
+### Background
+
+Descriptive labels are fundamentally different from time series samples:
+- They are **string-valued**, not numeric
+- They change **infrequently** (entity metadata doesn't update every scrape)
+- They are often **queried together** (users typically want all labels of an entity, not just one)
+- They benefit from **compression** due to repetitive patterns (many pods have similar labels)
+
+These characteristics suggest that columnar storage techniques, commonly used in analytical databases, might offer significant benefits.
+
+### Areas to Explore
+
+- **Column-oriented label storage**: Instead of storing all labels for a snapshot together (row-oriented), store each label name as a column with its values across entities. This could improve compression and enable efficient filtering by specific labels.
+
+- **Dictionary encoding**: Entity labels often have low cardinality (e.g., `k8s.pod.status.phase` has only a few possible values). Dictionary encoding could dramatically reduce storage for descriptive labels.
+
+- **Run-length encoding for temporal data**: When descriptive labels don't change across many snapshots, run-length encoding could eliminate redundant storage.
+
+- **Label projection pushdown**: When queries only need specific entity labels (e.g., `sum by (k8s.node.name)`), the storage layer could avoid reading unnecessary labels.
+
+- **Separate label storage files**: Similar to how chunks are stored separately from the index, entity labels could have dedicated storage with a format optimized for their access patterns.
+
+### Trade-offs to Consider
+
+- Implementation complexity vs. storage/query benefits
+- Read vs. write optimization (columnar is typically better for reads)
+- Memory overhead of maintaining multiple storage formats
+- Compatibility with existing TSDB compaction and retention logic
+
+This is a potential future optimization and not required for the initial implementation.
+
+---
+
+## What's Next
+
+- [Querying](06-querying.md): How PromQL is extended to query entities and correlations
+- [Web UI and APIs](07-web-ui-and-apis.md): HTTP API endpoints and UI for entity exploration
+
diff --git a/proposals/0071-Entity/05b-storage-entity-native.md b/proposals/0071-Entity/05b-storage-entity-native.md
new file mode 100644
index 0000000..c740410
--- /dev/null
+++ b/proposals/0071-Entity/05b-storage-entity-native.md
@@ -0,0 +1,934 @@
+# Storage Design: Entity-Native Model
+
+> **Alternative Approach**: This document describes an alternative storage design where series identity is based only on metric labels, with samples grouped into "streams" by entity. While this approach offers stronger alignment with OpenTelemetry's data model and addresses cardinality at a fundamental level, it requires significant changes to Prometheus's core architecture. We recommend the correlation-based approach described in [05-storage.md](05-storage.md) for initial implementation, as it can be built incrementally on the existing TSDB without breaking backward compatibility. This entity-native design remains valuable as a potential future evolution once entities prove their value in production.
+
+## Executive Summary
+
+This document proposes a fundamental redesign of Prometheus's storage model to natively support Entities as first-class concepts, separate from metric identity.
The key insight is that **metric identity** (what is being measured) should be separate from **entity identity** (what is being measured about). + +### Core Idea + +``` +Series: + labels: {__name__="http_requests_total", method="GET", status="200"} # metric labels only + data: [ + { + entityRefs: [podRef1, nodeRef1, serviceRef1], + samples: [{t: 1000, v: 100}, {t: 1015, v: 120}] + }, + { + entityRefs: [podRef2, nodeRef3], + samples: [{t: 1000, v: 1020}, {t: 1015, v: 1203}] + } + ] +``` + +This model separates: +- **What** is being measured → Series labels (metric name + metric-specific labels) +- **About what** it's being measured → Entity references (linking to entity storage) + +--- + +## Part 1: Current TSDB Model (Reference) + +Before diving into the new model, let's understand the current Prometheus TSDB architecture. + +### Current Series Identity + +In the current model, a series is uniquely identified by its **complete label set**: + +```go +type memSeries struct { + ref chunks.HeadSeriesRef // Unique identifier (auto-incrementing) + lset labels.Labels // Complete label set (includes ALL labels) + headChunks *memChunk // In-memory samples + mmappedChunks []*mmappedChunk // Memory-mapped chunks on disk + // ... +} +``` + +**Example:** These are THREE different series in current Prometheus: +``` +http_requests_total{method="GET", status="200", pod="nginx-abc"} # Series 1 +http_requests_total{method="GET", status="200", pod="nginx-def"} # Series 2 +http_requests_total{method="GET", status="200", pod="nginx-xyz"} # Series 3 +``` + +### Current Flow + +``` +Scrape → Labels → Hash(Labels) → getOrCreate(hash, labels) → memSeries → Append Sample +``` + +The hash of ALL labels determines series identity: + +```go +func (a *appender) getOrCreate(l labels.Labels) (series *memSeries, created bool) { + hash := l.Hash() // Hash of ALL labels + + series = a.series.GetByHash(hash, l) + if series != nil { + return series, false + } + + ref := chunks.HeadSeriesRef(a.nextRef.Inc()) + series = &memSeries{ref: ref, lset: l} + a.series.Set(hash, series) + return series, true +} +``` + +### Current Index Structure + +The postings index maps `label_name=label_value` → list of series refs: + +``` +Postings Index: + method="GET" → [1, 2, 3, 5, 8, ...] + status="200" → [1, 3, 5, 7, 9, ...] + pod="nginx-abc" → [1, 4, 7, ...] + pod="nginx-def" → [2, 5, 8, ...] +``` + +Query `http_requests_total{method="GET", status="200"}` intersects posting lists. + +--- + +## Part 2: Entity-Native Storage Model + +### Core Concepts + +#### 1. Metric Labels vs Entity Labels + +**Metric Labels** describe the measurement itself: +- `method="GET"` - HTTP method being measured +- `status="200"` - Response status being counted +- `le="0.5"` - Histogram bucket boundary +- `quantile="0.99"` - Summary quantile + +**Entity Labels** describe what the measurement is about: +- `k8s.pod.uid="abc-123"` - Which pod +- `k8s.node.name="worker-1"` - Which node +- `service.name="api-gateway"` - Which service + +#### 2. New Series Definition + +```go +// New: Series identity = metric name + metric labels only +type memSeries struct { + ref SeriesRef // Unique identifier + metricName string // e.g., "http_requests_total" + labels labels.Labels // Metric-specific labels only (method, status, etc.) 
+ + // Multiple data streams, one per entity combination + streams []*dataStream // Samples grouped by entity +} + +// A stream of samples from a specific entity combination +type dataStream struct { + entityRefs []EntityRef // Which entities this stream is from + headChunk *memChunk // Current in-memory chunk + mmappedChunks []*mmappedChunk // Historical chunks + + // Staleness tracking per stream + lastSeen int64 // Last sample timestamp +} +``` + +#### 3. Entity Storage (Separate) + +```go +type memEntity struct { + ref EntityRef // Unique identifier (auto-incrementing) + entityType string // e.g., "k8s.pod", "k8s.node", "service" + identifyingLabels labels.Labels // Immutable: what makes this entity unique + + // Mutable descriptive labels with history + sync.Mutex + descriptiveSnapshots []labelSnapshot + + // Lifecycle + startTime int64 // When this entity incarnation started + endTime int64 // 0 if still alive +} + +type labelSnapshot struct { + timestamp int64 + labels labels.Labels +} +``` + +### Visual Representation + +``` +┌─────────────────────────────────────────────────────────────────────────────┐ +│ SERIES STORAGE │ +├─────────────────────────────────────────────────────────────────────────────┤ +│ Series 1: http_requests_total{method="GET", status="200"} │ +│ ┌─────────────────────────────────────────────────────────────────────────┐│ +│ │ Stream A: entityRefs=[pod:abc, node:worker-1, svc:api] ││ +│ │ Chunks: [(t=1000,v=100), (t=1015,v=120), ...] ││ +│ ├─────────────────────────────────────────────────────────────────────────┤│ +│ │ Stream B: entityRefs=[pod:def, node:worker-2] ││ +│ │ Chunks: [(t=1000,v=1020), (t=1015,v=1203), ...] ││ +│ ├─────────────────────────────────────────────────────────────────────────┤│ +│ │ Stream C: entityRefs=[pod:xyz, node:worker-1, svc:api] ││ +│ │ Chunks: [(t=1020,v=5), (t=1035,v=15), ...] ← Pod rescheduled here ││ +│ └─────────────────────────────────────────────────────────────────────────┘│ +├─────────────────────────────────────────────────────────────────────────────┤ +│ Series 2: http_requests_total{method="POST", status="201"} │ +│ ┌─────────────────────────────────────────────────────────────────────────┐│ +│ │ Stream A: entityRefs=[pod:abc, node:worker-1, svc:api] ││ +│ │ Chunks: [(t=1000,v=50), (t=1015,v=55), ...] 
││ +│ └─────────────────────────────────────────────────────────────────────────┘│ +└─────────────────────────────────────────────────────────────────────────────┘ + +┌─────────────────────────────────────────────────────────────────────────────┐ +│ ENTITY STORAGE │ +├─────────────────────────────────────────────────────────────────────────────┤ +│ Entity: k8s.pod (ref=pod:abc) │ +│ Identifying: {k8s.pod.uid="abc-123"} │ +│ Descriptive @ t=1000: {k8s.pod.name="nginx-abc", version="1.0"} │ +│ Descriptive @ t=2000: {k8s.pod.name="nginx-abc", version="1.1"} │ +├─────────────────────────────────────────────────────────────────────────────┤ +│ Entity: k8s.node (ref=node:worker-1) │ +│ Identifying: {k8s.node.uid="node-uid-001"} │ +│ Descriptive @ t=0: {k8s.node.name="worker-1", region="us-east-1"} │ +├─────────────────────────────────────────────────────────────────────────────┤ +│ Entity: service (ref=svc:api) │ +│ Identifying: {service.name="api-gateway", service.namespace="prod"} │ +│ Descriptive @ t=1000: {service.version="2.0", deployment="blue"} │ +│ Descriptive @ t=3000: {service.version="2.1", deployment="green"} │ +└─────────────────────────────────────────────────────────────────────────────┘ +``` + +--- + +## Part 3: In-Memory Structures + +### 3.1 Series Storage + +```go +type Head struct { + // Series storage - sharded for concurrent access + series *stripeSeries + + // Entity storage - separate sharded structure + entities *stripeEntities + + // Index structures + metricPostings *MetricPostings // metric labels → series refs + entityPostings *EntityPostings // entity refs → (series ref, stream index) + + // ... existing fields (WAL, chunks, etc.) +} + +// stripeSeries holds series by SeriesRef and by metric label hash +type stripeSeries struct { + size int + series []map[SeriesRef]*memSeries // Sharded by ref + hashes []seriesHashmap // Sharded by metric label hash + locks []stripeLock +} + +// seriesHashmap - only uses metric labels for lookup +type seriesHashmap struct { + unique map[uint64]*memSeries + conflicts map[uint64][]*memSeries +} +``` + +### 3.2 Series Lookup + +```go +// Key change: Series lookup only uses metric labels +func (a *appender) getOrCreateSeries(metricLabels labels.Labels) (*memSeries, bool) { + hash := metricLabels.Hash() // Hash of METRIC labels only + + series := a.series.GetByHash(hash, metricLabels) + if series != nil { + return series, false + } + + ref := SeriesRef(a.nextSeriesRef.Inc()) + series = &memSeries{ + ref: ref, + metricName: metricLabels.Get(labels.MetricName), + labels: metricLabels.WithoutEmpty(), + streams: make([]*dataStream, 0), + } + a.series.Set(hash, series) + return series, true +} +``` + +### 3.3 Stream Management + +```go +// Find or create a data stream for the given entity combination +func (s *memSeries) getOrCreateStream(entityRefs []EntityRef) (*dataStream, bool) { + s.Lock() + defer s.Unlock() + + // Look for existing stream with same entity combination + for _, stream := range s.streams { + if entityRefsEqual(stream.entityRefs, entityRefs) { + return stream, false + } + } + + // Create new stream + stream := &dataStream{ + entityRefs: entityRefs, + headChunk: nil, + lastSeen: 0, + } + s.streams = append(s.streams, stream) + return stream, true +} + +func entityRefsEqual(a, b []EntityRef) bool { + if len(a) != len(b) { + return false + } + // Sort and compare - entity refs are unordered + sortedA := sortEntityRefs(a) + sortedB := sortEntityRefs(b) + for i := range sortedA { + if sortedA[i] != sortedB[i] { + return false + 
} + } + return true +} +``` + +### 3.4 Entity Storage + +```go +// stripeEntities holds entities by EntityRef and by identifying label hash +type stripeEntities struct { + size int + entities []map[EntityRef]*memEntity // Sharded by ref + hashes []entityHashmap // Sharded by (type + identifying labels) hash + locks []stripeLock +} + +func (a *appender) getOrCreateEntity( + entityType string, + identifyingLabels labels.Labels, + descriptiveLabels labels.Labels, + timestamp int64, +) (*memEntity, bool) { + // Hash of type + identifying labels + hash := hashEntityIdentity(entityType, identifyingLabels) + + entity := a.entities.GetByHash(hash, entityType, identifyingLabels) + if entity != nil { + // Update descriptive labels if changed + entity.updateDescriptive(descriptiveLabels, timestamp) + return entity, false + } + + ref := EntityRef(a.nextEntityRef.Inc()) + entity = &memEntity{ + ref: ref, + entityType: entityType, + identifyingLabels: identifyingLabels, + startTime: timestamp, + descriptiveSnapshots: []labelSnapshot{{ + timestamp: timestamp, + labels: descriptiveLabels, + }}, + } + a.entities.Set(hash, entity) + return entity, true +} +``` + +--- + +## Part 4: Index Structures + +### 4.1 Metric Postings Index + +Maps metric labels to series refs (similar to current postings, but only for metric labels): + +```go +type MetricPostings struct { + mtx sync.RWMutex + // label name → label value → series refs + m map[string]map[string][]SeriesRef +} + +// Add a series to the postings index +func (p *MetricPostings) Add(ref SeriesRef, lset labels.Labels) { + p.mtx.Lock() + defer p.mtx.Unlock() + + lset.Range(func(l labels.Label) { + if p.m[l.Name] == nil { + p.m[l.Name] = make(map[string][]SeriesRef) + } + p.m[l.Name][l.Value] = append(p.m[l.Name][l.Value], ref) + }) +} + +// Get series refs for a label pair +func (p *MetricPostings) Get(name, value string) []SeriesRef { + p.mtx.RLock() + defer p.mtx.RUnlock() + + if p.m[name] == nil { + return nil + } + return p.m[name][value] +} +``` + +### 4.2 Entity Postings Index + +Maps entity labels to (series, stream) pairs: + +```go +type EntityPostings struct { + mtx sync.RWMutex + + // entityRef → list of (seriesRef, streamIndex) + byEntity map[EntityRef][]streamLocation + + // For reverse lookup: entity label → entity refs + byLabel map[string]map[string][]EntityRef +} + +type streamLocation struct { + seriesRef SeriesRef + streamIndex int +} + +// Register that a stream uses an entity +func (p *EntityPostings) AddStreamEntity( + seriesRef SeriesRef, + streamIndex int, + entityRef EntityRef, +) { + p.mtx.Lock() + defer p.mtx.Unlock() + + loc := streamLocation{seriesRef: seriesRef, streamIndex: streamIndex} + p.byEntity[entityRef] = append(p.byEntity[entityRef], loc) +} + +// Find all streams that use a specific entity +func (p *EntityPostings) GetStreamsByEntity(entityRef EntityRef) []streamLocation { + p.mtx.RLock() + defer p.mtx.RUnlock() + + return p.byEntity[entityRef] +} +``` + +### 4.3 Combined Query Flow + +```go +// Query: http_requests_total{method="GET", k8s.pod.name="nginx-abc"} +func (q *querier) Select(matchers ...*labels.Matcher) SeriesSet { + var metricMatchers, entityMatchers []*labels.Matcher + + for _, m := range matchers { + if isEntityLabel(m.Name) { + entityMatchers = append(entityMatchers, m) + } else { + metricMatchers = append(metricMatchers, m) + } + } + + // Step 1: Find series by metric labels + seriesRefs := q.metricPostings.PostingsForMatchers(metricMatchers...) 
+ + // Step 2: If entity matchers, filter streams + if len(entityMatchers) > 0 { + // Find entities that match + entityRefs := q.findMatchingEntities(entityMatchers) + + // Find streams that use these entities + return q.filterStreamsByEntities(seriesRefs, entityRefs) + } + + // Return all streams from matching series + return q.allStreamsFromSeries(seriesRefs) +} +``` + +--- + +## Part 5: Ingestion Flow + +### 5.1 Scrape Processing + +```go +func (a *appender) Append( + metricLabels labels.Labels, // Only metric-specific labels + entityRefs []EntityRef, // Pre-resolved entity references + timestamp int64, + value float64, +) error { + // Step 1: Get or create series (by metric labels only) + series, seriesCreated := a.getOrCreateSeries(metricLabels) + + // Step 2: Get or create stream (by entity combination) + stream, streamCreated := series.getOrCreateStream(entityRefs) + + // Step 3: Append sample to stream + if err := stream.append(timestamp, value); err != nil { + return err + } + + // Step 4: Update entity postings if new stream + if streamCreated { + streamIdx := len(series.streams) - 1 + for _, entityRef := range entityRefs { + a.entityPostings.AddStreamEntity(series.ref, streamIdx, entityRef) + } + } + + // Record for WAL + a.pendingSamples = append(a.pendingSamples, pendingSample{ + seriesRef: series.ref, + streamIndex: len(series.streams) - 1, + timestamp: timestamp, + value: value, + }) + + return nil +} +``` + +### 5.2 Entity Resolution During Scrape + +```go +// During scrape, labels are split into metric vs entity +func (sl *scrapeLoop) processMetrics( + metrics []parsedMetric, + entities []parsedEntity, +) error { + app := sl.appender() + + // First, resolve all entities from this scrape + entityRefMap := make(map[string]EntityRef) + for _, e := range entities { + entity, _ := app.getOrCreateEntity( + e.Type, + e.IdentifyingLabels, + e.DescriptiveLabels, + sl.timestamp, + ) + entityRefMap[entityKey(e.Type, e.IdentifyingLabels)] = entity.ref + } + + // Then, process metrics with entity references + for _, m := range metrics { + metricLabels, entityTypes := splitLabels(m.Labels) + + // Resolve entity refs for this metric + var entityRefs []EntityRef + for _, et := range entityTypes { + key := entityKeyFromMetric(et, m.Labels) + if ref, ok := entityRefMap[key]; ok { + entityRefs = append(entityRefs, ref) + } + } + + if err := app.Append(metricLabels, entityRefs, m.Timestamp, m.Value); err != nil { + return err + } + } + + return app.Commit() +} +``` + +--- + +## Part 6: WAL Format + +### 6.1 New Record Types + +```go +const ( + // Existing types + RecordSeries Type = 1 + RecordSamples Type = 2 + RecordTombstones Type = 3 + RecordExemplars Type = 4 + RecordMetadata Type = 6 + + // New types for entity-native model + RecordEntity Type = 20 // Entity definition + RecordEntityUpdate Type = 21 // Descriptive label update + RecordStream Type = 22 // New stream in a series + RecordStreamSamples Type = 23 // Samples for a specific stream +) +``` + +### 6.2 Entity Record + +``` +┌────────────────────────────────────────────────────────────────┐ +│ type = 20 <1b> │ +├────────────────────────────────────────────────────────────────┤ +│ ┌────────────────────────────────────────────────────────────┐ │ +│ │ entityRef <8b> │ │ +│ ├────────────────────────────────────────────────────────────┤ │ +│ │ len(entityType) │ │ +│ │ entityType │ │ +│ ├────────────────────────────────────────────────────────────┤ │ +│ │ n = len(identifyingLabels) │ │ +│ │ identifyingLabels │ │ +│ 
├────────────────────────────────────────────────────────────┤ │ +│ │ m = len(descriptiveLabels) │ │ +│ │ descriptiveLabels │ │ +│ ├────────────────────────────────────────────────────────────┤ │ +│ │ startTime <8b> │ │ +│ └────────────────────────────────────────────────────────────┘ │ +│ . . . │ +└────────────────────────────────────────────────────────────────┘ +``` + +### 6.3 Series Record + +``` +┌────────────────────────────────────────────────────────────────┐ +│ type = 1 <1b> │ +├────────────────────────────────────────────────────────────────┤ +│ ┌────────────────────────────────────────────────────────────┐ │ +│ │ seriesRef <8b> │ │ +│ ├────────────────────────────────────────────────────────────┤ │ +│ │ n = len(metricLabels) │ │ +│ │ metricLabels │ │ +│ └────────────────────────────────────────────────────────────┘ │ +│ . . . │ +└────────────────────────────────────────────────────────────────┘ +``` + +### 6.4 Stream Record + +``` +┌────────────────────────────────────────────────────────────────┐ +│ type = 22 <1b> │ +├────────────────────────────────────────────────────────────────┤ +│ ┌────────────────────────────────────────────────────────────┐ │ +│ │ seriesRef <8b> │ │ +│ │ streamIndex │ │ +│ ├────────────────────────────────────────────────────────────┤ │ +│ │ n = len(entityRefs) │ │ +│ │ entityRef_0 <8b> │ │ +│ │ ... │ │ +│ │ entityRef_n <8b> │ │ +│ └────────────────────────────────────────────────────────────┘ │ +│ . . . │ +└────────────────────────────────────────────────────────────────┘ +``` + +### 6.5 Stream Samples Record + +``` +┌────────────────────────────────────────────────────────────────┐ +│ type = 23 <1b> │ +├────────────────────────────────────────────────────────────────┤ +│ ┌────────────────────────────────────────────────────────────┐ │ +│ │ seriesRef <8b> │ │ +│ │ streamIndex │ │ +│ │ baseTimestamp <8b> │ │ +│ ├────────────────────────────────────────────────────────────┤ │ +│ │ timestamp_delta │ │ +│ │ value <8b> │ │ +│ │ ... │ │ +│ └────────────────────────────────────────────────────────────┘ │ +└────────────────────────────────────────────────────────────────┘ +``` + +--- + +## Part 7: Block Format + +### 7.1 Block Directory Structure + +``` +/ +├── meta.json +├── index +├── chunks/ +│ ├── 000001 +│ ├── 000002 +│ └── ... +├── entities/ # NEW: Entity storage +│ ├── index # Entity index +│ └── snapshots/ # Descriptive label snapshots +│ ├── 000001 +│ └── ... +└── tombstones +``` + +### 7.2 Modified Series Index Format + +``` +Series Entry: +┌──────────────────────────────────────────────────────────────────────────┐ +│ len │ +├──────────────────────────────────────────────────────────────────────────┤ +│ labels count │ +│ ┌──────────────────────────────────────────────────────────────────────┐ │ +│ │ ref(metric_label_name) │ │ +│ │ ref(metric_label_value) │ │ +│ │ ... │ │ +│ └──────────────────────────────────────────────────────────────────────┘ │ +├──────────────────────────────────────────────────────────────────────────┤ +│ streams count │ +│ ┌──────────────────────────────────────────────────────────────────────┐ │ +│ │ Stream 0: │ │ +│ │ entity_refs count │ │ +│ │ entityRef_0 <8b> │ │ +│ │ ... │ │ +│ │ chunks count │ │ +│ │ chunk entries... │ │ +│ ├──────────────────────────────────────────────────────────────────────┤ │ +│ │ Stream 1: │ │ +│ │ ... 
│ │ +│ └──────────────────────────────────────────────────────────────────────┘ │ +├──────────────────────────────────────────────────────────────────────────┤ +│ CRC32 <4b> │ +└──────────────────────────────────────────────────────────────────────────┘ +``` + +### 7.3 Entity Index Format + +``` +┌────────────────────────────┬─────────────────────┐ +│ magic(0xENT1D700) <4b> │ version(1) <1 byte> │ +├────────────────────────────┴─────────────────────┤ +│ ┌──────────────────────────────────────────────┐ │ +│ │ Symbol Table │ │ +│ ├──────────────────────────────────────────────┤ │ +│ │ Entity Types │ │ +│ ├──────────────────────────────────────────────┤ │ +│ │ Entities │ │ +│ ├──────────────────────────────────────────────┤ │ +│ │ Entity Label Postings │ │ +│ ├──────────────────────────────────────────────┤ │ +│ │ Postings Offset Table │ │ +│ ├──────────────────────────────────────────────┤ │ +│ │ TOC │ │ +│ └──────────────────────────────────────────────┘ │ +└──────────────────────────────────────────────────┘ + +Entity Entry: +┌──────────────────────────────────────────────────────────────────────────┐ +│ entityRef <8b> │ +├──────────────────────────────────────────────────────────────────────────┤ +│ ref(entityType) │ +├──────────────────────────────────────────────────────────────────────────┤ +│ identifyingLabels count │ +│ ┌──────────────────────────────────────────────────────────────────────┐ │ +│ │ ref(label_name) │ │ +│ │ ref(label_value) │ │ +│ │ ... │ │ +│ └──────────────────────────────────────────────────────────────────────┘ │ +├──────────────────────────────────────────────────────────────────────────┤ +│ startTime │ +│ endTime (0 if still alive at block max time) │ +├──────────────────────────────────────────────────────────────────────────┤ +│ snapshot_file_ref (reference to descriptive snapshots) │ +├──────────────────────────────────────────────────────────────────────────┤ +│ CRC32 <4b> │ +└──────────────────────────────────────────────────────────────────────────┘ +``` + +--- + +## Part 8: Query Execution + +### 8.1 Query Result Model + +```go +// A query result is now a series with potentially multiple streams +type SeriesResult struct { + Labels labels.Labels // Metric labels + Streams []StreamResult // One per entity combination +} + +type StreamResult struct { + EntityRefs []EntityRef // Which entities this stream is from + Samples []Sample // The actual samples + + // Resolved entity labels (computed at query time) + entityLabels labels.Labels +} +``` + +### 8.2 Entity Label Resolution + +```go +func (q *querier) resolveEntityLabels( + entityRefs []EntityRef, + timestamp int64, +) labels.Labels { + builder := labels.NewBuilder(nil) + + for _, ref := range entityRefs { + entity := q.entities.GetByRef(ref) + if entity == nil { + continue + } + + // Add identifying labels + entity.identifyingLabels.Range(func(l labels.Label) { + builder.Set(l.Name, l.Value) + }) + + // Add descriptive labels at the given timestamp + descriptive := entity.DescriptiveLabelsAt(timestamp) + descriptive.Range(func(l labels.Label) { + builder.Set(l.Name, l.Value) + }) + } + + return builder.Labels() +} +``` + +### 8.3 PromQL Integration + +```go +// When PromQL asks for a vector at time T: +func (q *querier) Select(ctx context.Context, matchers ...*labels.Matcher) storage.SeriesSet { + metricMatchers, entityMatchers := splitMatchers(matchers) + + // Find matching series by metric labels + seriesRefs := q.metricPostings.PostingsForMatchers(ctx, metricMatchers...) 
+ + // Build result set + var results []storage.Series + + for seriesRefs.Next() { + series := q.series.GetByRef(seriesRefs.At()) + + for streamIdx, stream := range series.streams { + // Check if stream's entities match entity matchers + if len(entityMatchers) > 0 { + entityLabels := q.resolveEntityLabels(stream.entityRefs, q.maxTime) + if !matchAll(entityLabels, entityMatchers) { + continue + } + } + + // Create a "virtual series" for this stream + results = append(results, &virtualSeries{ + metricLabels: series.labels, + entityRefs: stream.entityRefs, + chunks: stream.chunks, + querier: q, + }) + } + } + + return newSeriesSet(results) +} + +// virtualSeries represents a single stream as a series +type virtualSeries struct { + metricLabels labels.Labels + entityRefs []EntityRef + chunks []chunks.Meta + querier *querier +} + +func (s *virtualSeries) Labels() labels.Labels { + // Merge metric labels with entity labels + builder := labels.NewBuilder(s.metricLabels) + + entityLabels := s.querier.resolveEntityLabels(s.entityRefs, s.querier.maxTime) + entityLabels.Range(func(l labels.Label) { + builder.Set(l.Name, l.Value) + }) + + return builder.Labels() +} +``` + +--- + +## Part 9: Migration and Compatibility + +### 9.1 Feature Flag + +```yaml +# prometheus.yml +storage: + tsdb: + entity_native_storage: true # Enable new storage model +``` + +### 9.2 Backward Compatibility Mode + +When `entity_native_storage: false` (default): +- Behave exactly like current Prometheus +- All labels treated as metric labels +- Single stream per series + +When `entity_native_storage: true`: +- Entity labels are separated based on configuration/conventions +- Multiple streams per series possible +- Entity storage enabled + +### 9.3 Migration Strategy + +1. **Phase 1: Dual Write** + - New data written in new format + - Old blocks remain readable + - Query merges old and new formats + +2. **Phase 2: Background Conversion** + - Old blocks gradually converted during compaction + - No service interruption + +3. **Phase 3: Full Migration** + - All data in new format + - Old format support can be deprecated + +--- + +## Part 10: Trade-offs and Considerations + +### Benefits + +| Aspect | Improvement | +|--------|-------------| +| **Cardinality** | Series count = metric × metric_label_values (not × entities) | +| **Entity Changes** | Pod restart = new stream, not new series | +| **Storage Efficiency** | Entity labels stored once, not per-series | +| **Query Flexibility** | Natural entity-aware queries | +| **OTel Alignment** | Matches OTLP's resource/metric model | + +### Challenges + +| Aspect | Challenge | Mitigation | +|--------|-----------|------------| +| **Complexity** | Significant codebase changes | Phased rollout, feature flags | +| **Query Performance** | Entity label resolution overhead | Caching, pre-computation | +| **Index Size** | Additional entity postings | Efficient encoding, memory mapping | +| **Compatibility** | Breaking change for remote write | Version negotiation, adapters | + +### Open Questions + +1. **Stream Identity**: Should stream identity be based on sorted entity refs or preserve order? + +2. **Staleness**: Per-stream staleness vs per-series staleness? + +3. **Remote Write**: How to encode streams in the remote write protocol? + +4. **Recording Rules**: How do recording rule results handle entity association? + +5. **Exemplars**: Should exemplars be per-stream or per-series? 
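+
+### Note: Splitting Metric and Entity Labels
+
+Section 9.2 states that entity labels are separated "based on configuration/conventions", and the query paths in Parts 4 and 8 rely on `isEntityLabel` and `splitMatchers` without defining them. The sketch below is one minimal way to do this, assuming a configurable prefix list (for example fed from an entity type registry); the prefixes themselves are illustrative, not part of the proposal:
+
+```go
+import (
+	"strings"
+
+	"github.com/prometheus/prometheus/model/labels"
+)
+
+// entityLabelPrefixes is assumed configuration: label names with these
+// prefixes are treated as entity labels rather than metric labels.
+var entityLabelPrefixes = []string{"k8s.", "host.", "service."}
+
+func isEntityLabel(name string) bool {
+	for _, p := range entityLabelPrefixes {
+		if strings.HasPrefix(name, p) {
+			return true
+		}
+	}
+	return false
+}
+
+// splitMatchers partitions query matchers into metric matchers and entity matchers.
+func splitMatchers(ms []*labels.Matcher) (metric, entity []*labels.Matcher) {
+	for _, m := range ms {
+		if isEntityLabel(m.Name) {
+			entity = append(entity, m)
+		} else {
+			metric = append(metric, m)
+		}
+	}
+	return metric, entity
+}
+```
+
+How the prefix list is populated (static configuration, scrape metadata, or an entity type registry) is itself an open question.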
+ +--- diff --git a/proposals/0071-Entity/06-querying.md b/proposals/0071-Entity/06-querying.md new file mode 100644 index 0000000..bb6cf14 --- /dev/null +++ b/proposals/0071-Entity/06-querying.md @@ -0,0 +1,555 @@ +# Querying: Entity-Aware PromQL + +## Abstract + +This document specifies how Prometheus's query engine extends to support native entity awareness. The core principle is **automatic enrichment**: when querying metrics, correlated entity labels (both identifying and descriptive) are automatically included in results without requiring explicit join operations. A new **pipe operator** (`|`) enables filtering metrics by entity correlation using familiar syntax consistent with the exposition format. + +## Background + +### Current PromQL Value Types + +PromQL expressions evaluate to one of four value types: + +| Type | Description | Example | +|------|-------------|---------| +| **Scalar** | Single floating-point number | `42`, `3.14` | +| **String** | Simple string literal | `"hello"` | +| **Instant Vector** | Set of time series, each with one sample at a single timestamp | `http_requests_total{job="api"}` | +| **Range Vector (Matrix)** | Set of time series, each with multiple samples over a time range | `http_requests_total{job="api"}[5m]` | + +Functions have specific type signatures: + +``` +rate(Matrix) → Vector +sum(Vector) → Vector +scalar(Vector) → Scalar (single-element vector only) +``` + +### Current Query Execution Model + +When Prometheus executes a PromQL query: + +1. **Parsing**: Query string → Abstract Syntax Tree (AST) +2. **Preparation**: For each VectorSelector, call `querier.Select()` with label matchers +3. **Evaluation**: Traverse AST, evaluate functions and operators +4. **Result**: Return typed value (Scalar, Vector, or Matrix) + +The query engine interacts with storage through the `Querier` interface: + +```go +type Querier interface { + Select(ctx context.Context, sortSeries bool, hints *SelectHints, + matchers ...*labels.Matcher) SeriesSet + LabelValues(ctx context.Context, name string, ...) ([]string, error) + LabelNames(ctx context.Context, ...) ([]string, error) + Close() error +} +``` + +--- + +## Automatic Enrichment + +### How It Works + +When the query engine evaluates a VectorSelector or MatrixSelector, it automatically enriches each series with labels from correlated entities. + +**Query:** +```promql +container_cpu_usage_seconds_total{k8s.namespace.name="production"} +``` + +**Before enrichment (raw series from storage):** +``` +container_cpu_usage_seconds_total{ + container="nginx", + k8s.namespace.name="production", + k8s.pod.uid="abc-123", + k8s.node.uid="node-001" +} 1234.5 +``` + +**After enrichment (returned to user):** +``` +container_cpu_usage_seconds_total{ + # Original metric labels + container="nginx", + + # Identifying labels (correlation keys, already on series) + k8s.namespace.name="production", + k8s.pod.uid="abc-123", + k8s.node.uid="node-001", + + # Descriptive labels from k8s.pod entity + k8s.pod.name="nginx-7b9f5", + k8s.pod.status.phase="Running", + k8s.pod.start_time="2024-01-15T10:30:00Z", + + # Descriptive labels from k8s.node entity + k8s.node.name="worker-1", + k8s.node.os="linux", + k8s.node.kernel.version="5.15.0" +} 1234.5 +``` + +### Enrichment Algorithm + +```go +func (ev *evaluator) enrichSeries( + ctx context.Context, + series storage.Series, + timestamp int64, +) labels.Labels { + originalLabels := series.Labels() + + // 1. 
Find correlated entities via storage index + entityRefs := ev.entityQuerier.EntitiesForSeries(series.Ref()) + + if len(entityRefs) == 0 { + return originalLabels // No entities, return as-is + } + + // 2. Build enriched label set + builder := labels.NewBuilder(originalLabels) + + for _, entityRef := range entityRefs { + entity := ev.entityQuerier.GetEntity(entityRef) + + // Get descriptive labels at the sample timestamp + descriptiveLabels := entity.DescriptiveLabelsAt(timestamp) + + // Merge into result + descriptiveLabels.Range(func(l labels.Label) { + builder.Set(l.Name, l.Value) + }) + } + + return builder.Labels() +} +``` + +--- + +## Filtering by Entity Labels + +Since entity labels appear as labels in query results, standard PromQL label matchers work: + +### By Identifying Labels + +```promql +# Filter by pod UID (identifying) +container_cpu_usage_seconds_total{k8s.pod.uid="abc-123"} +``` + +This is efficient because identifying labels are stored on the series and indexed. + +### By Descriptive Labels + +```promql +# Filter by pod name (descriptive) +container_cpu_usage_seconds_total{k8s.pod.name="nginx-7b9f5"} + +# Filter by node OS (descriptive) +container_memory_usage_bytes{k8s.node.os="linux"} + +# Regex matching on descriptive labels +http_requests_total{service.version=~"2\\..*"} +``` + +**Query Execution for Descriptive Label Filters:** + +1. Select all series that might match (based on metric name and any indexed labels) +2. For each series, look up correlated entities +3. Get descriptive labels at evaluation timestamp +4. Apply the filter: keep series where enriched labels match + +## Aggregation by Entity Labels + +Standard PromQL aggregation works with entity labels: + +```promql +# Sum CPU by node name (descriptive label) +sum by (k8s.node.name) (container_cpu_usage_seconds_total) + +# Average memory by service version +avg by (service.version) (process_resident_memory_bytes) + +# Count requests by pod status +count by (k8s.pod.status.phase) (rate(http_requests_total[5m])) +``` + +### Aggregation Semantics + +Aggregation happens **after** enrichment: + +``` +1. Select series matching the selector +2. Enrich each series with entity labels +3. Group by the specified labels (which may include entity labels) +4. Apply aggregation function +``` + +**Example:** + +```promql +sum by (k8s.node.name) (container_cpu_usage_seconds_total) +``` + +``` +Step 1 - Select series: + container_cpu{pod_uid="a", node_uid="n1"} 10 + container_cpu{pod_uid="b", node_uid="n1"} 20 + container_cpu{pod_uid="c", node_uid="n2"} 30 + +Step 2 - Enrich with entity labels: + container_cpu{..., k8s.node.name="worker-1"} 10 + container_cpu{..., k8s.node.name="worker-1"} 20 + container_cpu{..., k8s.node.name="worker-2"} 30 + +Step 3 - Group by k8s.node.name: + Group "worker-1": [10, 20] + Group "worker-2": [30] + +Step 4 - Sum: + {k8s.node.name="worker-1"} 30 + {k8s.node.name="worker-2"} 30 +``` + +--- + +## Range Queries and Temporal Semantics + +### The Challenge + +Descriptive labels can change over time. When querying a range, which label values should be used? + +**Example scenario:** +- Pod `abc-123` runs on `worker-1` from T0 to T5 +- Pod migrates to `worker-2` at T5 +- Query: `container_cpu_usage_seconds_total{k8s.pod.uid="abc-123"}[10m]` + +### Solution: Point-in-Time Label Resolution + +Each sample is enriched with the descriptive labels **that were valid at that sample's timestamp**. 
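+
+A sketch of how a range selector could apply this rule is shown below, using the `labels` and `promql` packages. It assumes an `entityLabelsAt` helper that resolves the identifying and descriptive labels of the series' correlated entities at a given timestamp (built on the `DescriptiveLabelsAt` lookup from the storage document). The grouping behavior, where a new output group starts when the resolved labels change, matches the example that follows:
+
+```go
+// enrichRange attaches point-in-time entity labels to each sample and starts a
+// new output series whenever the resolved label set changes (e.g. a pod moving nodes).
+func enrichRange(metricLabels labels.Labels, points []promql.FPoint,
+	entityLabelsAt func(ts int64) labels.Labels) []promql.Series {
+
+	var out []promql.Series
+	for _, p := range points {
+		b := labels.NewBuilder(metricLabels)
+		entityLabelsAt(p.T).Range(func(l labels.Label) { b.Set(l.Name, l.Value) })
+		lset := b.Labels()
+
+		// New label set at this timestamp: start a new output group.
+		if len(out) == 0 || !labels.Equal(out[len(out)-1].Metric, lset) {
+			out = append(out, promql.Series{Metric: lset})
+		}
+		out[len(out)-1].Floats = append(out[len(out)-1].Floats, p)
+	}
+	return out
+}
+```
+
+Applying this to the scenario above: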
+ +```promql +container_cpu_usage_seconds_total{k8s.pod.uid="abc-123"}[10m] +``` + +**Returns:** +``` +# Samples before migration (T0-T4) have worker-1 +container_cpu{k8s.pod.uid="abc-123", k8s.node.name="worker-1"} 100 @T0 +container_cpu{k8s.pod.uid="abc-123", k8s.node.name="worker-1"} 110 @T1 +container_cpu{k8s.pod.uid="abc-123", k8s.node.name="worker-1"} 120 @T2 +container_cpu{k8s.pod.uid="abc-123", k8s.node.name="worker-1"} 130 @T3 +container_cpu{k8s.pod.uid="abc-123", k8s.node.name="worker-1"} 140 @T4 + +# Samples after migration (T5+) have worker-2 +container_cpu{k8s.pod.uid="abc-123", k8s.node.name="worker-2"} 150 @T5 +container_cpu{k8s.pod.uid="abc-123", k8s.node.name="worker-2"} 160 @T6 +... +``` + +### Implications for Range Functions + +Functions like `rate()` operate on the raw sample values, but the returned instant vector has enriched labels: + +```promql +rate(container_cpu_usage_seconds_total{k8s.pod.uid="abc-123"}[5m]) +``` + +For rate calculation: +- Uses sample values regardless of label changes +- The result is enriched with labels **at the evaluation timestamp** + +### Series Identity Across Label Changes + +**Important:** Descriptive label changes do NOT create new series. The series identity is defined by: +- Metric name +- Original metric labels +- Entity identifying labels (correlation keys) + +Descriptive labels are metadata that "rides along" with samples, not part of series identity. + +--- + +## The Entity Type Filter Operator + +Automatic enrichment means entity labels appear as labels in query results, so standard label matchers handle most filtering needs: + +```promql +# Filter by entity label - just use label matchers +container_cpu_usage_seconds_total{k8s.pod.name="nginx"} +container_cpu_usage_seconds_total{k8s.pod.status.phase="Running"} +``` + +However, there's one thing label matchers **cannot** do: filter by entity type existence. The pipe operator (`|`) fills this gap. + +### Syntax + +```promql +vector_expr | entity_type_expr +``` + +Where `entity_type_expr` can be: +- A single entity type: `k8s.pod` +- Negated: `!k8s.pod` +- Combined with `and`: `k8s.pod and k8s.node` +- Combined with `or`: `k8s.pod or service` +- Grouped: `(k8s.pod and k8s.node) or service` + +### When to Use + +The pipe operator answers the question: **"Is this metric correlated with an entity of this type?"** + +```promql +# Metrics that ARE correlated with any pod entity +container_cpu_usage_seconds_total | k8s.pod + +# Metrics that ARE correlated with any node entity +container_memory_usage_bytes | k8s.node + +# Metrics that ARE correlated with any service entity +http_requests_total | service +``` + +### Negation with `!` + +Use `!` before an entity type to negate it: + +```promql +# Metrics NOT correlated with any pod +container_cpu_usage_seconds_total | !k8s.pod + +# Metrics NOT correlated with any service +http_requests_total | !service +``` + +### Combining Entity Type Filters + +Use `and`/`or` keywords to combine entity type filters: + +```promql +# Metrics correlated with BOTH a pod AND a node +container_cpu_usage_seconds_total | k8s.pod and k8s.node + +# Metrics correlated with a pod OR a service +container_cpu_usage_seconds_total | k8s.pod or service + +# Metrics correlated with a pod but NOT a node +container_cpu_usage_seconds_total | k8s.pod and !k8s.node +``` + +Operator precedence follows standard rules: `!` (not) binds tightest, then `and`, then `or`. 
Use parentheses for clarity: + +```promql +# Explicit grouping +container_cpu | (k8s.pod and k8s.node) or service +``` + +### All Metrics for an Entity Type + +To get all metrics correlated with a specific entity type, omit the metric selector: + +```promql +# All metrics correlated with any pod + | k8s.pod + +# Equivalent to: +{__name__=~".+"} | k8s.pod +``` + +This is useful for exploring what metrics are available for a given entity type. + +### Combining with Label Matchers + +For label filtering, use label matchers (simpler and familiar). Use the pipe operator only when you need entity type filtering: + +```promql +# Filter by label: use label matcher +container_cpu_usage_seconds_total{k8s.pod.name="nginx"} + +# Filter by entity type existence: use pipe +container_cpu_usage_seconds_total | k8s.pod + +# Both: label matcher for label, pipe for type +container_cpu_usage_seconds_total{k8s.namespace.name="production"} | k8s.pod | k8s.node +``` + +--- + +## Query Engine Implementation + +### Extended Querier Interface + +```go +// EntityQuerier provides entity lookup capabilities +type EntityQuerier interface { + // Get entities correlated with a series + EntitiesForSeries(ref storage.SeriesRef) []EntityRef + + // Get entity by reference + GetEntity(ref EntityRef) Entity + + Close() error +} + +// Entity represents a single entity +type Entity interface { + Ref() EntityRef + Type() string + IdentifyingLabels() labels.Labels + DescriptiveLabelsAt(timestamp int64) labels.Labels + StartTime() int64 + EndTime() int64 +} +``` + +### Parser Changes + +New AST nodes for entity type filtering: + +```go +// EntityTypeFilter represents a pipe expression: vector | entity_type_expr +type EntityTypeFilter struct { + Expr Expr // Left side (vector expression) + EntityTypeExpr EntityTypeExpr // Right side (entity type boolean expression) + PosRange posrange.PositionRange +} + +func (*EntityTypeFilter) Type() ValueType { return ValueTypeVector } + +// EntityTypeExpr is an interface for entity type expressions +type EntityTypeExpr interface { + // Matches returns true if the given set of entity types satisfies this expression + Matches(entityTypes map[string]bool) bool +} + +// EntityTypeName represents a single entity type: k8s.pod +type EntityTypeName struct { + Name string // e.g., "k8s.pod", "service" + Negated bool // true for !k8s.pod +} + +// EntityTypeAnd represents: k8s.pod and k8s.node +type EntityTypeAnd struct { + Left, Right EntityTypeExpr +} + +// EntityTypeOr represents: k8s.pod or service +type EntityTypeOr struct { + Left, Right EntityTypeExpr +} +``` + +### Pipe Operator Evaluation + +```go +func (ev *evaluator) evalEntityTypeFilter( + ctx context.Context, + vector Vector, + typeExpr EntityTypeExpr, +) Vector { + var result Vector + + for _, sample := range vector { + // Get all entity types correlated with this series + seriesEntityRefs := ev.entityQuerier.EntitiesForSeries(sample.SeriesRef) + entityTypes := make(map[string]bool) + for _, ref := range seriesEntityRefs { + entity := ev.entityQuerier.GetEntity(ref) + entityTypes[entity.Type()] = true + } + + // Evaluate the entity type expression + if typeExpr.Matches(entityTypes) { + result = append(result, sample) + } + } + + return result +} + +// Example Matches implementations: + +func (e *EntityTypeName) Matches(types map[string]bool) bool { + has := types[e.Name] + if e.Negated { + return !has + } + return has +} + +func (e *EntityTypeAnd) Matches(types map[string]bool) bool { + return e.Left.Matches(types) && e.Right.Matches(types) 
+} + +func (e *EntityTypeOr) Matches(types map[string]bool) bool { + return e.Left.Matches(types) || e.Right.Matches(types) +} +``` + +### Query Execution Flow + +``` +┌─────────────────────────────────────────────────────────────────────────┐ +│ Query Execution Flow │ +└─────────────────────────────────────────────────────────────────────────┘ + + ┌─────────────────────────────┐ + │ PromQL String │ + │ │ + │ cpu | k8s.pod and k8s.node │ + └─────────────┬───────────────┘ + │ + ▼ + ┌─────────────────────────────┐ + │ Parser │ + │ │ + │ - VectorSelector │ + │ - EntityTypeFilter │◄── NEW + │ - EntityTypeExpr (and/or/!) │ + └─────────────┬───────────────┘ + │ + ▼ + ┌─────────────────────────────┐ + │ AST │ + │ │ + │ EntityTypeFilter { │ + │ Expr: cpu │ + │ TypeExpr: And { │ + │ Left: "k8s.pod" │ + │ Right: "k8s.node" │ + │ } │ + │ } │ + └─────────────┬───────────────┘ + │ + ▼ +┌────────────────────────────────────────────────────────────────────────┐ +│ Evaluator │ +│ │ +│ 1. Evaluate left side (VectorSelector) │ +│ - querier.Select() → SeriesSet │ +│ - Enrich with entity labels │ +│ - Result: enriched Vector │ +│ │ +│ 2. Evaluate EntityTypeFilter │ +│ - For each series, get correlated entity types │ +│ - Evaluate boolean expression against those types │ +│ - Keep series where expression evaluates to true │ +│ - Result: filtered Vector │ +│ │ +└────────────────────────────────────────────────────────────────────────┘ + │ + ▼ + ┌─────────────────────────────┐ + │ Result │ + │ │ + │ Vector/Matrix │ + └─────────────────────────────┘ +``` + +--- + +The next document will cover [Web UI and APIs](./07-web-ui-and-apis.md), detailing how these capabilities are exposed in Prometheus's user interface and HTTP APIs. diff --git a/proposals/0071-Entity/07-web-ui-and-apis.md b/proposals/0071-Entity/07-web-ui-and-apis.md new file mode 100644 index 0000000..ca05e96 --- /dev/null +++ b/proposals/0071-Entity/07-web-ui-and-apis.md @@ -0,0 +1,566 @@ +# Web UI and APIs + +## Abstract + +This document specifies how Prometheus's HTTP API and Web UI should be extended to support entity-aware querying. The key principle is **progressive disclosure**: query results display entity context prominently while keeping the interface familiar for users who don't need entity details. + +The wireframe below illustrates the concept—entity labels are displayed separately from metric labels, making it easy to understand the context of each time series. + +![Wireframe showing query results with entity labels separated from metric labels](./wireframes/Wireframe%20-%20Simple%20idea%20-%20Complete%20flow.png) + +--- + +## Background + +### Current API Response Structure + +Today, the `/api/v1/query` endpoint returns results like: + +```json +{ + "status": "success", + "data": { + "resultType": "vector", + "result": [ + { + "metric": { + "__name__": "container_cpu_usage_seconds_total", + "container": "nginx", + "namespace": "production", + "pod": "nginx-7b9f5" + }, + "value": [1234567890, "1234.5"] + } + ] + } +} +``` + +All labels are in a flat `metric` object. 
There's no distinction between: +- Labels that identify the metric itself (e.g., `container`, `method`) +- Labels that identify the entity producing the metric (e.g., `k8s.pod.uid`, `k8s.node.uid`) +- Labels that describe the entity (e.g., `k8s.pod.name`, `k8s.node.os`) + +### Current UI Display + +The Prometheus UI displays all labels together: + +``` +container_cpu_usage_seconds_total{container="nginx", namespace="production", pod="nginx-7b9f5", ...} +``` + +This becomes unwieldy when entity labels are added through enrichment—users see a long list of labels without understanding which provide entity context. + +--- + +## API Changes + +### Query Response Enhancement + +The query endpoints (`/api/v1/query`, `/api/v1/query_range`) should return entity context alongside the metric: + +```json +{ + "status": "success", + "data": { + "resultType": "vector", + "result": [ + { + "metric": { + "__name__": "container_cpu_usage_seconds_total", + "container": "nginx" + }, + "entities": [ + { + "type": "k8s.pod", + "identifyingLabels": { + "k8s.namespace.name": "production", + "k8s.pod.uid": "abc-123" + }, + "descriptiveLabels": { + "k8s.pod.name": "nginx-7b9f5", + "k8s.pod.status.phase": "Running" + } + }, + { + "type": "k8s.node", + "identifyingLabels": { + "k8s.node.uid": "node-001" + }, + "descriptiveLabels": { + "k8s.node.name": "worker-1", + "k8s.node.os": "linux" + } + } + ], + "value": [1234567890, "1234.5"] + } + ] + } +} +``` + +**Key changes:** + +| Field | Description | +|-------|-------------| +| `metric` | Only the original metric labels (not entity labels) | +| `entities` | Array of correlated entities with their labels | +| `entities[].type` | Entity type (e.g., "k8s.pod", "service") | +| `entities[].identifyingLabels` | Immutable labels that identify the entity | +| `entities[].descriptiveLabels` | Mutable labels describing the entity | + +### Backward Compatibility + +For backward compatibility, a query parameter controls the response format: + +``` +GET /api/v1/query?query=...&entity_info=true +``` + +| Parameter | Behavior | +|-----------|----------| +| `entity_info=true` | Returns structured entity information | +| `entity_info=false` (default) | Returns flat labels (current behavior, entity labels merged in) | + +When `entity_info=false` (default), all entity labels appear in the `metric` object as they do today with automatic enrichment. This ensures existing tooling continues to work. 
+
+### Response Type Definitions
+
+```typescript
+// Enhanced query result with entity context
+interface EnhancedInstantSample {
+  metric: Record<string, string>;  // Original metric labels only
+  entities?: EntityContext[];      // Correlated entities (if entity_info=true)
+  value?: [number, string];
+  histogram?: [number, Histogram];
+}
+
+interface EntityContext {
+  type: string;  // e.g., "k8s.pod"
+  identifyingLabels: Record<string, string>;
+  descriptiveLabels: Record<string, string>;
+}
+
+// When entity_info=false (default), use existing format
+interface LegacyInstantSample {
+  metric: Record<string, string>;  // All labels merged (metric + entity labels)
+  value?: [number, string];
+  histogram?: [number, Histogram];
+}
+```
+
+---
+
+## New Entity Endpoints
+
+### List Entity Types
+
+```
+GET /api/v1/entities/types
+```
+
+Returns all known entity types in the system:
+
+```json
+{
+  "status": "success",
+  "data": [
+    {
+      "type": "k8s.pod",
+      "identifyingLabels": ["k8s.namespace.name", "k8s.pod.uid"],
+      "count": 1523
+    },
+    {
+      "type": "k8s.node",
+      "identifyingLabels": ["k8s.node.uid"],
+      "count": 12
+    },
+    {
+      "type": "service",
+      "identifyingLabels": ["service.namespace", "service.name", "service.instance.id"],
+      "count": 89
+    }
+  ]
+}
+```
+
+### Get Entity Type Schema
+
+```
+GET /api/v1/entities/types/{type}
+```
+
+Returns detailed schema for an entity type:
+
+```json
+{
+  "status": "success",
+  "data": {
+    "type": "k8s.pod",
+    "identifyingLabels": ["k8s.namespace.name", "k8s.pod.uid"],
+    "knownDescriptiveLabels": [
+      "k8s.pod.name",
+      "k8s.pod.status.phase",
+      "k8s.pod.start_time",
+      "k8s.pod.ip",
+      "k8s.pod.owner.kind",
+      "k8s.pod.owner.name"
+    ],
+    "activeEntityCount": 1523,
+    "correlatedSeriesCount": 45230
+  }
+}
+```
+
+### List Entities
+
+```
+GET /api/v1/entities?type=k8s.pod&match[]={k8s.namespace.name="production"}
+```
+
+Returns entities matching the criteria:
+
+```json
+{
+  "status": "success",
+  "data": [
+    {
+      "type": "k8s.pod",
+      "identifyingLabels": {
+        "k8s.namespace.name": "production",
+        "k8s.pod.uid": "abc-123"
+      },
+      "descriptiveLabels": {
+        "k8s.pod.name": "nginx-7b9f5",
+        "k8s.pod.status.phase": "Running"
+      },
+      "startTime": 1700000000,
+      "endTime": 0,
+      "correlatedSeriesCount": 42
+    }
+  ]
+}
+```
+
+**Query parameters:**
+
+| Parameter | Description |
+|-----------|-------------|
+| `type` | Entity type to query (required) |
+| `match[]` | Label matchers for filtering entity labels (can specify multiple) |
+| `start` | Start of time range (for historical queries) |
+| `end` | End of time range |
+| `limit` | Maximum entities to return |
+
+### Get Entity Details
+
+```
+GET /api/v1/entities/{type}/{encoded_identifying_attrs}
+```
+
+The identifying labels are URL-encoded as a label set:
+
+```
+GET /api/v1/entities/k8s.pod/k8s.namespace.name%3D%22production%22%2Ck8s.pod.uid%3D%22abc-123%22
+```
+
+Returns detailed information about a specific entity:
+
+```json
+{
+  "status": "success",
+  "data": {
+    "type": "k8s.pod",
+    "identifyingLabels": {
+      "k8s.namespace.name": "production",
+      "k8s.pod.uid": "abc-123"
+    },
+    "descriptiveLabels": {
+      "k8s.pod.name": "nginx-7b9f5",
+      "k8s.pod.status.phase": "Running"
+    },
+    "startTime": 1700000000,
+    "endTime": 0,
+    "descriptiveHistory": [
+      {
+        "timestamp": 1700000000,
+        "labels": {
+          "k8s.pod.name": "nginx-7b9f5",
+          "k8s.pod.status.phase": "Pending"
+        }
+      },
+      {
+        "timestamp": 1700000030,
+        "labels": {
+          "k8s.pod.name": "nginx-7b9f5",
+          "k8s.pod.status.phase": "Running"
+        }
+      }
+    ],
+    "correlatedSeries": [
+      "container_cpu_usage_seconds_total",
"container_memory_usage_bytes", + "container_network_receive_bytes_total" + ] + } +} +``` + +### Get Correlated Metrics for Entity + +``` +GET /api/v1/entities/{type}/{encoded_identifying_attrs}/metrics +``` + +Returns all metric names correlated with a specific entity: + +```json +{ + "status": "success", + "data": [ + { + "name": "container_cpu_usage_seconds_total", + "seriesCount": 3, + "labels": ["container"] + }, + { + "name": "container_memory_usage_bytes", + "seriesCount": 3, + "labels": ["container"] + } + ] +} +``` + +--- + +## Web UI Changes + +### Query Results Display + +Based on the wireframe concept, query results should display entity context prominently but separately from metric labels. + +**Current display:** +``` +container_cpu_usage_seconds_total{container="nginx", k8s.namespace.name="production", k8s.pod.uid="abc-123", k8s.pod.name="nginx-7b9f5", k8s.node.uid="node-001", k8s.node.name="worker-1", ...} 1234.5 +``` + +**Enhanced display:** + +``` +┌─────────────────────────────────────────────────────────────────────────────┐ +│ container_cpu_usage_seconds_total{container="nginx"} 1234.5 │ +│ │ +│ Entities: │ +│ k8s.pod │ +│ k8s.namespace.name="production", k8s.pod.uid="abc-123" │ +│ k8s.pod.name="nginx-7b9f5", k8s.pod.status.phase="Running" │ +│ │ +│ k8s.node │ +│ k8s.node.uid="node-001" │ +│ k8s.node.name="worker-1", k8s.node.os="linux" │ +└─────────────────────────────────────────────────────────────────────────────┘ +``` + +### UI Components + +**1. SeriesName Enhancement** + +The `SeriesName` component should accept entity context: + +```typescript +interface SeriesNameProps { + labels: Record; + entities?: EntityContext[]; + format: boolean; + showEntities?: boolean; // Toggle entity display +} +``` + +**2. EntityBadge Component** + +A new component for displaying entity information: + +```typescript +interface EntityBadgeProps { + entity: EntityContext; + expanded?: boolean; + onToggle?: () => void; +} +``` + +Displays entity type with expandable labels: + +``` +┌─────────────────────────────────────────────┐ +│ 📦 k8s.pod [▼] │ +│ k8s.namespace.name="production" │ +│ k8s.pod.uid="abc-123" │ +│ ───────────────────────────── │ +│ k8s.pod.name="nginx-7b9f5" │ +│ k8s.pod.status.phase="Running" │ +└─────────────────────────────────────────────┘ +``` + +**3. Collapsible Entity Section** + +For tables with many results, entities can be collapsed by default: + +```typescript +interface DataTableProps { + data: InstantQueryResult; + showEntities: boolean; + entityDisplayMode: 'collapsed' | 'expanded' | 'inline'; +} +``` + +### New Pages + +**1. Entity Explorer Page** + +A dedicated page for browsing entities: + +``` +/entities + ├── List all entity types + ├── Filter by type + ├── Search by labels + └── Click to see entity details + +/entities/{type} + ├── List all entities of type + ├── Filter by identifying/descriptive labels + └── Click to see entity details + +/entities/{type}/{id} + ├── Entity details + ├── Label history timeline + ├── Correlated metrics list + └── Quick query links +``` + +**2. 
Entity Type Schema Page** + +Shows the schema for an entity type: + +``` +/entities/types/{type} + ├── Identifying labels list + ├── Known descriptive labels + ├── Entity count statistics + └── Related entity types +``` + +### Graph View Integration + +When viewing graphs, entity context can be shown on hover: + +``` +┌───────────────────────────────────────────────────────────────┐ +│ Graph │ +│ ╱╲ ╱╲ │ +│ ╱ ╲ ╱ ╲ ╱╲ │ +│ ╱ ╲╱ ╲ ╱ ╲ │ +│ ╱ ╲╱ ╲ │ +│ ╱ ╲ │ +├───────────────────────────────────────────────────────────────┤ +│ Hovering: container_cpu_usage_seconds_total{container="nginx"}│ +│ │ +│ 📦 k8s.pod: nginx-7b9f5 (production) │ +│ 🖥️ k8s.node: worker-1 │ +│ │ +│ Value: 1234.5 @ 2024-01-15 10:30:00 │ +└───────────────────────────────────────────────────────────────┘ +``` + +### Settings + +New user preferences for entity display: + +```typescript +interface EntityDisplaySettings { + // Show entity information in query results + showEntitiesInResults: boolean; + + // Default display mode + entityDisplayMode: 'collapsed' | 'expanded' | 'inline'; + + // Show identifying vs descriptive separation + separateIdentifyingLabels: boolean; + + // Entity types to always show + pinnedEntityTypes: string[]; +} +``` + +--- + +## Implementation Considerations + +### API Response Size + +Adding entity context increases response size. Mitigations: + +1. **Optional via query parameter**: `entity_info=true` to opt-in +2. **Compression**: gzip reduces impact significantly +3. **Pagination**: Limit results and paginate large responses +4. **Streaming**: Consider streaming for very large result sets + +### Frontend Performance + +With potentially many entities per series: + +- Lazy load entity details on expand +- Virtualize long lists +- Use `entity_info=false` for performance-critical views +- Progressive loading for entity explorer + +--- + +## API Summary + +| Endpoint | Method | Description | +|----------|--------|-------------| +| `/api/v1/query` | GET/POST | Query with optional `entity_info=true` | +| `/api/v1/query_range` | GET/POST | Range query with optional `entity_info=true` | +| `/api/v1/entities/types` | GET | List all entity types | +| `/api/v1/entities/types/{type}` | GET | Get entity type schema | +| `/api/v1/entities` | GET | List entities with filters | +| `/api/v1/entities/{type}/{id}` | GET | Get specific entity details | +| `/api/v1/entities/{type}/{id}/metrics` | GET | Get metrics for entity | + +--- + +## UI Summary + +| Feature | Description | +|---------|-------------| +| Enhanced SeriesName | Shows entities separately from labels | +| EntityBadge | Compact entity display with expand | +| Entity Explorer | Browse and search entities | +| Graph hover | Shows entity context on hover | +| Settings | Control entity display preferences | + +--- + +## Migration Path + +**Phase 1: API additions** +- Add `entity_info` parameter (default false) +- Add new `/api/v1/entities/*` endpoints +- Existing behavior unchanged + +**Phase 2: UI enhancements** +- Add EntityBadge component +- Enhance SeriesName with entity support +- Add Entity Explorer page + +**Phase 3: Default behavior** +- Consider making `entity_info=true` the default +- Deprecation warnings for flat-label-only usage + +--- + +*This proposal is a work in progress. 
Feedback on API design and UI mockups is welcome.* + diff --git a/proposals/0071-Entity/wireframes/Wireframe - Simple idea - Complete flow.png b/proposals/0071-Entity/wireframes/Wireframe - Simple idea - Complete flow.png new file mode 100644 index 0000000..2496463 Binary files /dev/null and b/proposals/0071-Entity/wireframes/Wireframe - Simple idea - Complete flow.png differ