update approx_topk page and create /sql-functions folder #91
base: main
…dividual function pages under /sql-functions folder
### `histogram`
**Syntax**: `histogram(field, 'duration')`
**Description:** <br>
Use the `histogram` function to divide your time-based log data into time buckets of a fixed duration and then apply aggregate functions such as `COUNT()` or `SUM()` to those intervals.
This helps in visualizing time-series trends and performing meaningful comparisons over time. <br><br>
`histogram` also supports `histogram(field)` without an interval; in that case the backend calculates the interval automatically. Also, we commonly call this argument the interval, not the duration.
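To make the bucketing idea concrete, here is a minimal Python sketch of fixed-interval bucketing followed by a count per bucket. It only illustrates the concept and is not the backend's implementation: the helper name `bucket_counts` and the sample timestamps are hypothetical, and the auto-calculated interval used by `histogram(field)` is not modeled.

```python
from collections import Counter
from datetime import datetime, timezone


def bucket_counts(timestamps, interval_seconds):
    """Count events per fixed-width time bucket.

    Each bucket is keyed by its start time in epoch seconds.
    (Hypothetical helper, for illustration only.)
    """
    counts = Counter()
    for ts in timestamps:
        epoch = int(ts.timestamp())
        # Round down to the start of the interval this event falls into.
        bucket_start = epoch - (epoch % interval_seconds)
        counts[bucket_start] += 1
    return dict(sorted(counts.items()))


# Made-up sample events: two in the first minute, one in the second.
events = [
    datetime(2024, 1, 1, 0, 0, 5, tzinfo=timezone.utc),
    datetime(2024, 1, 1, 0, 0, 40, tzinfo=timezone.utc),
    datetime(2024, 1, 1, 0, 1, 10, tzinfo=timezone.utc),
]
per_minute = bucket_counts(events, 60)
```

With a 60-second interval, the three sample events fall into two buckets, which mirrors applying `COUNT()` per interval.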
Consider the following scenario:

- Dataset contains `3 million` unique client IPs.
- Query runs using `60` CPU cores.
This line should say that the query runs using `60` querier nodes, and each node has `60` CPU cores. Each core handles one partition, so the total is:
`3M * 60 (nodes) * 60 (cores/partitions)`
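The product above can be worked out with a few lines of Python. Reading the result as a worst-case number of entries tracked across all partitions (every partition seeing every unique value) is an assumption about what the product measures; the variable names are illustrative only.

```python
# Figures taken from the scenario and the comment above.
unique_values = 3_000_000   # unique client IPs in the dataset
querier_nodes = 60          # querier nodes running the query
cores_per_node = 60         # CPU cores per node, one partition per core

partitions = querier_nodes * cores_per_node
# Assumption: worst case, each partition sees every unique value.
worst_case_entries = unique_values * partitions
```

That gives 3,600 partitions and 10.8 billion entries in the worst case.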
When you run a query using `approx_topk()`, each query node processes a subset of the dataset and computes its local approximate top K values. These local top K values are sent to the leader node. The leader node merges them to generate the final approximate result.

Because each node sends only its local top K values, the final result may miss values that are frequent across the entire dataset but do not appear in the top K list of any single node.
Each node does not send only its top K values to the leader; it actually sends `max(k*10, 1000)` values.
That means when you ask for the top 10, each node sends 1000 values to the leader.
Why not just the top 10?
Because an item that is not in the top 10 on one node may be in the top 10 on another node, so we need some extra capacity to correct the final value.
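The effect of the `max(k*10, 1000)` candidate limit can be sketched in Python as follows. This is a simplified illustration, not the actual implementation: it uses exact counts where the real function is approximate, and the function names and sample data are hypothetical.

```python
from collections import Counter


def candidates_per_node(k):
    """Number of values each node sends to the leader, per the comment above."""
    return max(k * 10, 1000)


def node_candidates(local_counts, k):
    """Keep only a node's most frequent values, up to the candidate limit."""
    limit = candidates_per_node(k)
    return Counter(dict(Counter(local_counts).most_common(limit)))


def merge_top_k(per_node_counts, k):
    """Leader step: sum the candidates from every node, then take the top k."""
    merged = Counter()
    for local in per_node_counts:
        merged.update(node_candidates(local, k))  # adds counts per key
    return merged.most_common(k)


# Made-up data: "2.2.2.2" is not the top value on node_a but is the global top.
node_a = {"1.1.1.1": 5, "2.2.2.2": 3}
node_b = {"2.2.2.2": 4, "3.3.3.3": 1}
top_1 = merge_top_k([node_a, node_b], 1)
```

In this example `2.2.2.2` still wins the merged top 1 because each node sends more candidates than just its local top value, which is the extra capacity described above.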
### `str_match`

**Syntax**: `str_match(field, 'value')` <br>
**Alias**: `match_field(field, 'value')`
`str_match` has an alias, `match_field`, and `str_match_ignore_case` has an alias, `match_field_ignore_case`.
No description provided.