update approx_topk page and create /sql-functions folder #91

Merged: 2 commits on Jul 14, 2025
1 change: 1 addition & 0 deletions docs/.pages
@@ -22,3 +22,4 @@ nav:
- Telemetry: telemetry.md
- zPlane: zplane.md
- Work Group: work_group.md
- SQL Functions: sql-functions
Binary file added docs/images/approx-topk-distinct.png
Binary file added docs/images/approx-topk-with-filter.png
Binary file added docs/images/approx-topk.png
7 changes: 7 additions & 0 deletions docs/sql-functions/.pages
@@ -0,0 +1,7 @@
nav:

- SQL Functions Overview: index.md
- Full-Text Search Functions: full-text-search.md
- Array Functions: array.md
- Aggregate Functions: aggregate.md
- Approximate Aggregate Functions: approximate-aggregate
37 changes: 37 additions & 0 deletions docs/sql-functions/aggregate.md
@@ -0,0 +1,37 @@
Aggregate functions compute a single result from a set of input values. For usage of standard SQL aggregate functions such as `COUNT`, `SUM`, `AVG`, `MIN`, and `MAX`, refer to [PostgreSQL documentation](https://www.postgresql.org/docs/).

---

### `histogram`
**Description:** <br>
Use the `histogram()` function to divide your time-based log data into fixed intervals and apply aggregate functions such as `COUNT()` or `SUM()` to analyze time-series patterns. This helps visualize trends over time and supports meaningful comparisons.<br><br>
**Syntax:** <br>
```sql
histogram(timestamp_field)
histogram(timestamp_field, 'interval')
```

- `timestamp_field`: A valid timestamp field, such as `_timestamp`.
- `interval`: Optional. A fixed time interval in readable units, such as '30 seconds', '1 minute', '15 minutes', or '1 hour'.

**Histogram with aggregate function** <br>
```sql
SELECT histogram(_timestamp, '30 seconds') AS key, COUNT(*) AS num
FROM "default"
GROUP BY key
ORDER BY key
```
**Expected Output**: <br>

This query divides the log data into 30-second intervals.
Each row in the result shows:

- **`key`**: The start time of the 30-second bucket.
- **`num`**: The count of log records that fall within that time bucket.
<br>
![histogram](./images/sql-reference/histogram.png)

!!! note
    - If you do not specify an interval, the backend automatically determines a suitable value.
    - To ensure consistent bucket sizes and avoid unexpected behavior, it is recommended that you always define the interval explicitly.
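Conceptually, each event's timestamp is floored to the start of its fixed interval to form the bucket key. The following is a minimal Python sketch of that bucketing, assuming `_timestamp` values are microseconds since the epoch; the actual backend implementation may differ.

```python
from datetime import datetime, timezone

def histogram_bucket(ts_micros: int, interval_secs: int) -> str:
    # Floor the timestamp to the start of its fixed interval,
    # mirroring what histogram(_timestamp, '30 seconds') groups by.
    interval_micros = interval_secs * 1_000_000
    bucket_start = ts_micros - (ts_micros % interval_micros)
    return datetime.fromtimestamp(bucket_start / 1_000_000, tz=timezone.utc).isoformat()

# Two events 10 seconds apart land in the same 30-second bucket.
t0 = 1_752_451_230_000_000  # hypothetical _timestamp in microseconds
print(histogram_bucket(t0, 30) == histogram_bucket(t0 + 10_000_000, 30))  # → True
```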

4 changes: 4 additions & 0 deletions docs/sql-functions/approximate-aggregate/.pages
@@ -0,0 +1,4 @@
nav:

- Overview: index.md
- approx_topk: approx-topk.md
59 changes: 59 additions & 0 deletions docs/sql-functions/approximate-aggregate/approx-topk-distinct.md
@@ -0,0 +1,59 @@
This page provides instructions on using the `approx_topk_distinct()` function.
If you only need to find the top K most frequently occurring values in a field, refer to the [approx_topk()](../approx-topk/) function.

## What is `approx_topk_distinct()`?
The `approx_topk_distinct()` function returns an approximate list of the top K values from one field (`field1`) ranked by the number of distinct values they have in another field (`field2`). It is designed to handle large-scale, high-cardinality datasets efficiently by combining two algorithms:

- **HyperLogLog**: Used to estimate the number of distinct values in `field2` per `field1`.
- **Space-Saving**: Used to select the top K `field1` values with the highest estimated distinct counts.

Because both algorithms are probabilistic and the computation is distributed across multiple query nodes, the results are approximate.
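As a mental model, the combination can be sketched in Python with exact sets standing in for the HyperLogLog sketches. This is illustrative only; the real function estimates these counts probabilistically and merges them across query nodes.

```python
from collections import defaultdict

def topk_distinct_exact(rows, k):
    # Reference model of approx_topk_distinct(field1, field2, K):
    # exact sets stand in for the per-field1 HyperLogLog sketches.
    distinct = defaultdict(set)  # field1 -> set of observed field2 values
    for f1, f2 in rows:
        distinct[f1].add(f2)
    # Space-Saving would pick the top K approximately; here we sort exactly.
    ranked = sorted(distinct.items(), key=lambda kv: len(kv[1]), reverse=True)
    return [{"value": f1, "count": len(vals)} for f1, vals in ranked[:k]]

rows = [("10.0.0.5", "ua1"), ("10.0.0.5", "ua2"),
        ("192.168.1.100", "ua1"), ("10.0.0.5", "ua1")]
print(topk_distinct_exact(rows, 1))  # → [{'value': '10.0.0.5', 'count': 2}]
```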

---

## Query Syntax

```sql
SELECT approx_topk_distinct(field1, field2, K) FROM "stream_name"
```
Here:

- `field1`: The field to group by and return top results for.
- `field2`: The field whose distinct values are counted per `field1`.
- `K`: The number of top results to return.
- `stream_name`: The stream containing the data.

**Example**
```sql
SELECT approx_topk_distinct(clientip, user_agent, 5) FROM "demo1"
```
This query returns an approximate list of the top 5 `clientip` values that have the highest number of distinct `user_agent` values in the `demo1` stream.

**Note:** The result is returned as an array of objects, where each object includes the value of `field1` and its corresponding distinct count based on `field2`.

```json
{
  "item": [
    { "value": "192.168.1.100", "count": 1450 },
    { "value": "203.0.113.50", "count": 1170 },
    { "value": "10.0.0.5", "count": 1160 },
    { "value": "198.51.100.75", "count": 1040 },
    { "value": "172.16.0.10", "count": 1010 }
  ]
}
```

### Use `approx_topk_distinct` With `unnest`
To convert the nested array into individual rows for easier readability or further processing, use the `unnest()` function.

```sql
SELECT item.value as clientip, item.count as distinct_user_agent_count
FROM (
SELECT unnest(approx_topk_distinct(clientip, user_agent, 5)) as item
FROM "demo1"
)
ORDER BY distinct_user_agent_count DESC
```
**Result**
![approx_topk_distinct](../../images/approx-topk-distinct.png)
172 changes: 172 additions & 0 deletions docs/sql-functions/approximate-aggregate/approx-topk.md
@@ -0,0 +1,172 @@
This page provides instructions on using the `approx_topk()` function and explains its performance benefits compared to the traditional `GROUP BY` method.

## What is `approx_topk`?
The `approx_topk()` function returns an approximate list of the top K most frequently occurring values in a specified field. It uses the Space-Saving algorithm, a memory-efficient approach designed for high-cardinality data and distributed processing, providing significant [performance benefits](#performance-comparison).

> To find the top K values based on the number of distinct values in another field, use the [approx_topk_distinct() function](../approx-topk-distinct/).
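For intuition, the Space-Saving algorithm keeps only a fixed number of counters and, when they are all taken, evicts the smallest counter and lets the newcomer inherit its count plus one. A minimal single-node Python sketch follows; it is illustrative only, as the production implementation is distributed and more elaborate.

```python
def space_saving(stream, capacity):
    # Track at most `capacity` counters; evicting the minimum bounds memory
    # while possibly overestimating the counts of some values.
    counters = {}
    for value in stream:
        if value in counters:
            counters[value] += 1
        elif len(counters) < capacity:
            counters[value] = 1
        else:
            victim = min(counters, key=counters.get)
            count = counters.pop(victim)
            counters[value] = count + 1  # newcomer inherits evicted count + 1
    return sorted(counters.items(), key=lambda kv: kv[1], reverse=True)

stream = ["a"] * 5 + ["b"] * 3 + ["c", "d", "a"]
print(space_saving(stream, 3))  # → [('a', 6), ('b', 3), ('d', 2)]
```

Note how `d` displaced the rarer `c` and inherited its count, which is why results are approximate upper bounds rather than exact frequencies.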

---

## Query Syntax
```sql
SELECT approx_topk(field_name, K) FROM "stream_name"
```
Here:

- `field_name`: The field for which top values should be retrieved.
- `K`: The number of top values to return.
- `stream_name`: The stream containing the data.

**Example**
```sql
SELECT approx_topk(clientip, 10) FROM "demo1"
```
This query returns an approximate list of the top 10 most frequently occurring values in the `clientip` field from the `demo1` stream.

**Result of `approx_topk`** <br>
The result is returned as an array of objects, where each object includes the value and its corresponding count. For example:

```json
{
  "item": [
    { "value": "192.168.1.100", "count": 2650 },
    { "value": "10.0.0.5", "count": 2230 },
    { "value": "203.0.113.50", "count": 2210 },
    { "value": "198.51.100.75", "count": 1979 },
    { "value": "172.16.0.10", "count": 1939 }
  ]
}
```

### Use `approx_topk` With `unnest`
To convert these nested results into individual rows, use the `unnest()` function.

```sql
SELECT item.value as clientip, item.count as request_count
FROM (
SELECT unnest(approx_topk(clientip, 20)) as item
FROM "demo1"
)
ORDER BY request_count DESC
```
**Result of `approx_topk()` with `unnest()`**
This provides a flat output as shown below:

```json
{ "value": "192.168.1.100", "count": 2650 }
{ "value": "10.0.0.5", "count": 2230 }
{ "value": "203.0.113.50", "count": 2210 }
...
```

---

## `GROUP BY` Versus `approx_topk`

### How `GROUP BY` Works
The traditional way to find the top values in a field is by using a `GROUP BY` query combined with `ORDER BY` and `LIMIT`. <br>
For example:

```sql
SELECT clientip AS x_axis_1, COUNT(*) AS y_axis_1
FROM cdn_production
GROUP BY x_axis_1
ORDER BY y_axis_1 DESC
LIMIT 10
```
This query counts how many times each unique `clientip` appears and returns the **top 10** based on that count.

??? info "Why Traditional `GROUP BY` Breaks in Large Datasets"
    In large datasets with high-cardinality fields, the query is executed across multiple querier nodes. Each node uses multiple CPU cores to process the data. The data is split into partitions, and each core handles a subset of partitions.

    Consider the following scenario:

    - The dataset contains `3 million` unique client IPs.
    - The query runs on `60` querier nodes.
    - Each node uses `60` CPU cores, with each core processing one partition.

    This results in:

    `3 million` values × `60` nodes × `60` cores or partitions = `10.8 billion` data entries being processed in memory.

    This level of memory usage can overwhelm the system and cause failures.

    **Typical Failure Message** <br>
    ```
    Resources exhausted: Failed to allocate additional 63232256 bytes for GroupedHashAggregateStream[20] with 0 bytes already allocated for this reservation - 51510301 bytes remain available for the total pool
    ```
    ![Typical Failure Message](../../images/approx-top-k-error-in-traditional-method.png)

    This is a common limitation of using traditional `GROUP BY` with high-cardinality fields in large environments.
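The worst-case arithmetic in the scenario above can be checked directly, assuming every partition's hash table ends up holding an entry for each unique value:

```python
unique_values = 3_000_000   # unique client IPs in the dataset
querier_nodes = 60
cores_per_node = 60         # one partition per core

# Worst case: each partition materializes a counter per unique value.
entries_in_memory = unique_values * querier_nodes * cores_per_node
print(f"{entries_in_memory:,}")  # → 10,800,000,000
```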

### How `approx_topk` Works
When you run a query using `approx_topk()`, each query node processes a subset of the dataset and computes its local approximate top K values.
Each node sends up to `max(K * 10, 1000)` values to the leader node rather than just **K** values. This provides buffer capacity to prevent missing globally frequent values that may not appear in the **local top K** lists of individual nodes.

Despite this optimization, `approx_topk()` still returns approximate results because the function uses a probabilistic algorithm and the query execution is distributed across nodes.

!!! note
    This method improves performance and reduces memory usage, especially in production-scale environments. It is a trade-off between precision and efficiency. See the performance comparison in the following section.
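A sketch of this two-phase scheme: each node forwards up to `max(K * 10, 1000)` local candidates, and the leader sums the candidate counts and keeps the global top K. The merge code is illustrative Python; the actual implementation may differ.

```python
from collections import Counter

def candidates_per_node(k: int) -> int:
    # Each node forwards max(K * 10, 1000) local candidates, not just K.
    return max(k * 10, 1000)

def merge_local_topk(local_lists, k):
    # Leader-side merge: sum candidate counts across nodes, keep global top K.
    merged = Counter()
    for candidates in local_lists:
        for value, count in candidates:
            merged[value] += count
    return merged.most_common(k)

print(candidates_per_node(20))   # → 1000
print(candidates_per_node(150))  # → 1500
print(merge_local_topk([[("a", 5), ("b", 2)], [("b", 4), ("c", 1)]], 2))  # → [('b', 6), ('a', 5)]
```

The oversized candidate buffer is what reduces the chance that a globally frequent value is dropped because it narrowly missed some node's local top K.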

---

### Performance Comparison

When querying high-cardinality fields such as `clientip` in large datasets, performance becomes critical. This section compares the execution performance of a traditional `GROUP BY` query with a query that uses the `approx_topk()` function.

**Use Case**<br>
You want to identify the top 20 most frequent client IP addresses in the `demo1` stream based on request volume.

**Query 1: Using `GROUP BY` and `LIMIT`**<br>
```sql
SELECT clientip as "x_axis_1", count(_timestamp) as "y_axis_1"
FROM "demo1"
GROUP BY x_axis_1
ORDER BY y_axis_1 DESC
LIMIT 20
```

**Query 2: Using `approx_topk()`**
```sql
SELECT item.value as clientip, item.count as request_count
FROM (
SELECT unnest(approx_topk(clientip, 20)) as item
FROM "demo1"
)
ORDER BY request_count DESC
```

**Results**
<br>
![Performance Difference Between `GROUP BY` and `approx_topk()`](../../images/approx-topk.png)
<br>
Both queries were run against the same dataset using OpenObserve dashboards. Here are the observed query durations from the browser developer tools:

- The `GROUP BY` query without `approx_topk` took **1.46 seconds** to complete.
- The query using `approx_topk` completed in **692 milliseconds**.

This demonstrates that **approx_topk** executed more than twice as fast in this scenario, delivering a performance improvement of **over 50 percent**.
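Worked out from the observed durations (figures specific to this test; results vary by dataset and cluster):

```python
group_by_secs = 1.46        # GROUP BY query duration
approx_topk_secs = 0.692    # approx_topk query duration

speedup = group_by_secs / approx_topk_secs
reduction_pct = (group_by_secs - approx_topk_secs) / group_by_secs * 100
print(f"{speedup:.2f}x faster, {reduction_pct:.1f}% less time")  # → 2.11x faster, 52.6% less time
```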

---

## Limitations

The following are the known limitations of the `approx_topk()` function:

- Results are approximate and not guaranteed to be exact, so the function is not recommended when exact accuracy is critical for analysis or reporting.
- Accuracy depends on how the data is distributed across partitions.

---

## Frequently Asked Questions
**Q.** Can I use a `WHERE` clause with `approx_topk()`? <br>
**A.** Yes. You can apply a `WHERE` clause before calling the `approx_topk()` function to filter the dataset. This limits the scope of the top K calculation to only the matching records.

```sql
SELECT item.value as clientip, item.count as request_count
FROM (
SELECT unnest(approx_topk(clientip, 5)) as item
FROM "demo1"
WHERE status = 401
)
ORDER BY request_count DESC
```
<br>
![WHERE clause with approx_topk](../../images/approx-topk-with-filter.png)
5 changes: 5 additions & 0 deletions docs/sql-functions/approximate-aggregate/index.md
@@ -0,0 +1,5 @@
OpenObserve provides the following approximate aggregate functions designed for high-cardinality data analysis at scale.

Learn more:

- [approx_topk](../approximate-aggregate/approx-topk/)