update approx_topk page and create /sql-functions folder #91

Merged: 2 commits on Jul 14, 2025
1 change: 1 addition & 0 deletions docs/.pages
@@ -22,3 +22,4 @@ nav:
- Telemetry: telemetry.md
- zPlane: zplane.md
- Work Group: work_group.md
- SQL Functions: sql-functions
Binary file added docs/images/approx-topk-distinct.png
Binary file added docs/images/approx-topk-with-filter.png
Binary file added docs/images/approx-topk.png
7 changes: 7 additions & 0 deletions docs/sql-functions/.pages
@@ -0,0 +1,7 @@
nav:

- SQL Functions Overview: index.md
- Full-Text Search Functions: full-text-search.md
- Array Functions: array.md
- Aggregate Functions: aggregate.md
- Approximate Aggregate Functions: approximate-aggregate
37 changes: 37 additions & 0 deletions docs/sql-functions/aggregate.md
@@ -0,0 +1,37 @@
Aggregate functions compute a single result from a set of input values. For usage of standard SQL aggregate functions such as `COUNT`, `SUM`, `AVG`, `MIN`, and `MAX`, refer to [PostgreSQL documentation](https://www.postgresql.org/docs/).

---

### `histogram`
**Description:** <br>
Use the `histogram()` function to divide your time-based log data into fixed intervals and apply aggregate functions such as `COUNT()` or `SUM()` to analyze time-series patterns. This helps visualize trends over time and supports meaningful comparisons.<br><br>
**Syntax:** <br>
```sql
histogram(timestamp_field)
histogram(timestamp_field, 'interval')
```

- `timestamp_field`: A valid timestamp field, such as `_timestamp`.
- `interval`: Optional. A fixed time interval in readable units, such as '30 seconds', '1 minute', '15 minutes', or '1 hour'.

**Histogram with aggregate function** <br>
```sql
SELECT histogram(_timestamp, '30 seconds') AS key, COUNT(*) AS num
FROM "default"
GROUP BY key
ORDER BY key
```
**Expected Output**: <br>

This query divides the log data into 30-second intervals.
Each row in the result shows:

- **`key`**: The start time of the 30-second bucket.
- **`num`**: The count of log records that fall within that time bucket.
<br>
![histogram](./images/sql-reference/histogram.png)

!!! note
    - If you do not specify an interval, the backend automatically determines a suitable value.
    - To ensure consistent bucket sizes and avoid unexpected behavior, it is recommended that you always define the interval explicitly.
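Conceptually, each event's timestamp is floored to the start of its fixed interval to form the bucket key. The following is a minimal Python sketch of that bucketing, assuming `_timestamp` values are microseconds since the epoch; the actual backend implementation may differ.

```python
from datetime import datetime, timezone

def histogram_bucket(ts_micros: int, interval_secs: int) -> str:
    # Floor the timestamp to the start of its fixed interval,
    # mirroring what histogram(_timestamp, '30 seconds') groups by.
    interval_micros = interval_secs * 1_000_000
    bucket_start = ts_micros - (ts_micros % interval_micros)
    return datetime.fromtimestamp(bucket_start / 1_000_000, tz=timezone.utc).isoformat()

# Two events 10 seconds apart land in the same 30-second bucket.
t0 = 1_752_451_230_000_000  # hypothetical _timestamp in microseconds
print(histogram_bucket(t0, 30) == histogram_bucket(t0 + 10_000_000, 30))  # → True
```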

4 changes: 4 additions & 0 deletions docs/sql-functions/approximate-aggregate/.pages
@@ -0,0 +1,4 @@
nav:

- Overview: index.md
- approx_topk: approx-topk.md
59 changes: 59 additions & 0 deletions docs/sql-functions/approximate-aggregate/approx-topk-distinct.md
@@ -0,0 +1,59 @@
This page provides instructions on using the `approx_topk_distinct()` function.
If you only need to find the top K most frequently occurring values in a field, refer to the [approx_topk()](../approx-topk/) function.

## What is `approx_topk_distinct()`?
The `approx_topk_distinct()` function returns an approximate list of the top K values from one field (`field1`) ranked by the number of distinct values they have in another field (`field2`). It is designed to handle large-scale, high-cardinality datasets efficiently by combining two algorithms:

- **HyperLogLog**: Used to estimate the number of distinct values in `field2` per `field1`.
- **Space-Saving**: Used to select the top K `field1` values with the highest estimated distinct counts.

Because both algorithms are probabilistic and the computation is distributed across multiple query nodes, the results are approximate.
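As a mental model, the combination can be sketched in Python with exact sets standing in for the HyperLogLog sketches. This is illustrative only; the real function estimates these counts probabilistically and merges them across query nodes.

```python
from collections import defaultdict

def topk_distinct_exact(rows, k):
    # Reference model of approx_topk_distinct(field1, field2, K):
    # exact sets stand in for the per-field1 HyperLogLog sketches.
    distinct = defaultdict(set)  # field1 -> set of observed field2 values
    for f1, f2 in rows:
        distinct[f1].add(f2)
    # Space-Saving would pick the top K approximately; here we sort exactly.
    ranked = sorted(distinct.items(), key=lambda kv: len(kv[1]), reverse=True)
    return [{"value": f1, "count": len(vals)} for f1, vals in ranked[:k]]

rows = [("10.0.0.5", "ua1"), ("10.0.0.5", "ua2"),
        ("192.168.1.100", "ua1"), ("10.0.0.5", "ua1")]
print(topk_distinct_exact(rows, 1))  # → [{'value': '10.0.0.5', 'count': 2}]
```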

---

## Query Syntax

```sql
SELECT approx_topk_distinct(field1, field2, K) FROM "stream_name"
```
Here:

- `field1`: The field to group by and return top results for.
- `field2`: The field whose distinct values are counted per `field1`.
- `K`: The number of top results to return.
- `stream_name`: The stream containing the data.

**Example**
```sql
SELECT approx_topk_distinct(clientip, user_agent, 5) FROM "demo1"
```
This query returns an approximate list of the top 5 `clientip` values that have the highest number of distinct `user_agent` values in the `demo1` stream.

**Note:** The result is returned as an array of objects, where each object includes the value of `field1` and its corresponding distinct count based on `field2`.

```json
{
  "item": [
    { "value": "192.168.1.100", "count": 1450 },
    { "value": "203.0.113.50", "count": 1170 },
    { "value": "10.0.0.5", "count": 1160 },
    { "value": "198.51.100.75", "count": 1040 },
    { "value": "172.16.0.10", "count": 1010 }
  ]
}
```

### Use `approx_topk_distinct` With `unnest`
To convert the nested array into individual rows for easier readability or further processing, use the `unnest()` function.

```sql
SELECT item.value as clientip, item.count as distinct_user_agent_count
FROM (
SELECT unnest(approx_topk_distinct(clientip, user_agent, 5)) as item
FROM "demo1"
)
ORDER BY distinct_user_agent_count DESC
```
**Result**
![approx_topk_distinct](../../images/approx-topk-distinct.png)
172 changes: 172 additions & 0 deletions docs/sql-functions/approximate-aggregate/approx-topk.md
@@ -0,0 +1,172 @@
This page provides instructions on using the `approx_topk()` function and explains its performance benefits compared to the traditional `GROUP BY` method.

## What is `approx_topk`?
The `approx_topk()` function returns an approximate list of the top K most frequently occurring values in a specified field. It uses the Space-Saving algorithm, a memory-efficient approach designed for high-cardinality data and distributed processing, providing significant [performance benefits](#performance-comparison).

> To find the top K values based on the number of distinct values in another field, use the [approx_topk_distinct() function](../approx-topk-distinct/).
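For intuition, the Space-Saving algorithm keeps only a fixed number of counters and, when they are all taken, evicts the smallest counter and lets the newcomer inherit its count plus one. A minimal single-node Python sketch follows; it is illustrative only, as the production implementation is distributed and more elaborate.

```python
def space_saving(stream, capacity):
    # Track at most `capacity` counters; evicting the minimum bounds memory
    # while possibly overestimating the counts of some values.
    counters = {}
    for value in stream:
        if value in counters:
            counters[value] += 1
        elif len(counters) < capacity:
            counters[value] = 1
        else:
            victim = min(counters, key=counters.get)
            count = counters.pop(victim)
            counters[value] = count + 1  # newcomer inherits evicted count + 1
    return sorted(counters.items(), key=lambda kv: kv[1], reverse=True)

stream = ["a"] * 5 + ["b"] * 3 + ["c", "d", "a"]
print(space_saving(stream, 3))  # → [('a', 6), ('b', 3), ('d', 2)]
```

Note how `d` displaced the rarer `c` and inherited its count, which is why results are approximate upper bounds rather than exact frequencies.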

---

## Query Syntax
```sql
SELECT approx_topk(field_name, K) FROM "stream_name"
```
Here:

- `field_name`: The field for which top values should be retrieved.
- `K`: The number of top values to return.
- `stream_name`: The stream containing the data.

**Example**
```sql
SELECT approx_topk(clientip, 10) FROM "demo1"
```
This query returns an approximate list of the top 10 most frequently occurring values in the `clientip` field from the `demo1` stream.

**Result of `approx_topk`** <br>
The result is returned as an array of objects, where each object includes the value and its corresponding count. For example:

```json
{
  "item": [
    { "value": "192.168.1.100", "count": 2650 },
    { "value": "10.0.0.5", "count": 2230 },
    { "value": "203.0.113.50", "count": 2210 },
    { "value": "198.51.100.75", "count": 1979 },
    { "value": "172.16.0.10", "count": 1939 }
  ]
}
```

### Use `approx_topk` With `unnest`
To convert these nested results into individual rows, use the `unnest()` function.

```sql
SELECT item.value as clientip, item.count as request_count
FROM (
SELECT unnest(approx_topk(clientip, 20)) as item
FROM "demo1"
)
ORDER BY request_count DESC
```
**Result of `approx_topk()` with `unnest()`**
This provides a flat output as shown below:

```json
{ "value": "192.168.1.100", "count": 2650 }
{ "value": "10.0.0.5", "count": 2230 }
{ "value": "203.0.113.50", "count": 2210 }
...
```

---

## `GROUP BY` Versus `approx_topk`

### How `GROUP BY` Works
The traditional way to find the top values in a field is by using a `GROUP BY` query combined with `ORDER BY` and `LIMIT`. <br>
For example:

```sql
SELECT clientip AS x_axis_1, COUNT(*) AS y_axis_1
FROM cdn_production
GROUP BY x_axis_1
ORDER BY y_axis_1 DESC
LIMIT 10
```
This query counts how many times each unique `clientip` appears and returns the **top 10** based on that count.

??? info "Why Traditional `GROUP BY` Breaks in Large Datasets"
    In large datasets with high-cardinality fields, the query is executed across multiple querier nodes. Each node uses multiple CPU cores to process the data. The data is split into partitions, and each core handles a subset of partitions.

    Consider the following scenario:

    - The dataset contains `3 million` unique client IPs.
    - The query runs on `60` querier nodes.
    - Each node uses `60` CPU cores, with each core processing one partition.

    This results in:

    `3 million` values × `60` nodes × `60` cores or partitions = `10.8 billion` data entries being processed in memory.

    This level of memory usage can overwhelm the system and cause failures.

    **Typical Failure Message** <br>
    ```
    Resources exhausted: Failed to allocate additional 63232256 bytes for GroupedHashAggregateStream[20] with 0 bytes already allocated for this reservation - 51510301 bytes remain available for the total pool
    ```
    ![Typical Failure Message](../../images/approx-top-k-error-in-traditional-method.png)

    This is a common limitation of using traditional `GROUP BY` with high-cardinality fields in large environments.
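The worst-case arithmetic in the scenario above can be checked directly, assuming every partition's hash table ends up holding an entry for each unique value:

```python
unique_values = 3_000_000   # unique client IPs in the dataset
querier_nodes = 60
cores_per_node = 60         # one partition per core

# Worst case: each partition materializes a counter per unique value.
entries_in_memory = unique_values * querier_nodes * cores_per_node
print(f"{entries_in_memory:,}")  # → 10,800,000,000
```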

### How `approx_topk` Works
When you run a query using `approx_topk()`, each query node processes a subset of the dataset and computes its local approximate top K values.
Each node sends up to `max(K * 10, 1000)` values to the leader node rather than just **K** values. This provides buffer capacity to prevent missing globally frequent values that may not appear in the **local top K** lists of individual nodes.

Despite this optimization, `approx_topk()` still returns approximate results because the function uses a probabilistic algorithm and the query execution is distributed across nodes.

!!! note
    This method improves performance and reduces memory usage, especially in production-scale environments. It is a trade-off between precision and efficiency. See the performance comparison in the following section.
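A sketch of this two-phase scheme: each node forwards up to `max(K * 10, 1000)` local candidates, and the leader sums the candidate counts and keeps the global top K. The merge code is illustrative Python; the actual implementation may differ.

```python
from collections import Counter

def candidates_per_node(k: int) -> int:
    # Each node forwards max(K * 10, 1000) local candidates, not just K.
    return max(k * 10, 1000)

def merge_local_topk(local_lists, k):
    # Leader-side merge: sum candidate counts across nodes, keep global top K.
    merged = Counter()
    for candidates in local_lists:
        for value, count in candidates:
            merged[value] += count
    return merged.most_common(k)

print(candidates_per_node(20))   # → 1000
print(candidates_per_node(150))  # → 1500
print(merge_local_topk([[("a", 5), ("b", 2)], [("b", 4), ("c", 1)]], 2))  # → [('b', 6), ('a', 5)]
```

The oversized candidate buffer is what reduces the chance that a globally frequent value is dropped because it narrowly missed some node's local top K.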

---

### Performance Comparison

When querying high-cardinality fields such as `clientip` in large datasets, performance becomes critical. This section compares the execution performance of a traditional `GROUP BY` query with a query that uses the `approx_topk()` function.

**Use Case**<br>
You want to identify the top 20 most frequent client IP addresses in the `demo1` stream based on request volume.

**Query 1: Using `GROUP BY` and `LIMIT`**<br>
```sql
SELECT clientip as "x_axis_1", count(_timestamp) as "y_axis_1"
FROM "demo1"
GROUP BY x_axis_1
ORDER BY y_axis_1 DESC
LIMIT 20
```

**Query 2: Using `approx_topk()`**
```sql
SELECT item.value as clientip, item.count as request_count
FROM (
SELECT unnest(approx_topk(clientip, 20)) as item
FROM "demo1"
)
ORDER BY request_count DESC
```

**Results**
<br>
![Performance Difference Between `GROUP BY` and `approx_topk()`](../../images/approx-topk.png)
<br>
Both queries were run against the same dataset using OpenObserve dashboards. Here are the observed query durations from the browser developer tools:

- The `GROUP BY` query without `approx_topk` took **1.46 seconds** to complete.
- The query using `approx_topk` completed in **692 milliseconds**.

This demonstrates that **approx_topk** executed more than twice as fast in this scenario, delivering a performance improvement of **over 50 percent**.
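Worked out from the observed durations (figures specific to this test; results vary by dataset and cluster):

```python
group_by_secs = 1.46        # GROUP BY query duration
approx_topk_secs = 0.692    # approx_topk query duration

speedup = group_by_secs / approx_topk_secs
reduction_pct = (group_by_secs - approx_topk_secs) / group_by_secs * 100
print(f"{speedup:.2f}x faster, {reduction_pct:.1f}% less time")  # → 2.11x faster, 52.6% less time
```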

---

## Limitations

The following are the known limitations of the `approx_topk()` function:

- Results are approximate and not guaranteed to be exact, so the function is not recommended when exact accuracy is critical for analysis or reporting.
- Accuracy depends on how the data is distributed across partitions.

---

## Frequently Asked Questions
**Q.** Can I use a `WHERE` clause with `approx_topk()`? <br>
**A.** Yes. You can apply a `WHERE` clause before calling the `approx_topk()` function to filter the dataset. This limits the scope of the top K calculation to only the matching records.

```sql
SELECT item.value as clientip, item.count as request_count
FROM (
SELECT unnest(approx_topk(clientip, 5)) as item
FROM "demo1"
WHERE status = 401
)
ORDER BY request_count DESC
```
<br>
![WHERE clause with approx_topk](../../images/approx-topk-with-filter.png)
5 changes: 5 additions & 0 deletions docs/sql-functions/approximate-aggregate/index.md
@@ -0,0 +1,5 @@
OpenObserve provides the following approximate aggregate functions designed for high-cardinality data analysis at scale.

Learn more:

- [approx_topk](../approximate-aggregate/approx-topk/)