update approx_topk page and create /sql-functions folder #91

Open
wants to merge 1 commit into
base: main
from

Conversation

DebashisBorgohainO2
Contributor

No description provided.

…dividual function pages under /sql-functions folder
Comment on lines +5 to +9
### `histogram`
**Syntax**: `histogram(field, 'duration')`
**Description:** <br>
Use the `histogram` function to divide your time-based log data into time buckets of a fixed duration and then apply aggregate functions such as `COUNT()` or `SUM()` to those intervals.
This helps in visualizing time-series trends and performing meaningful comparisons over time. <br><br>
Contributor

`histogram` also supports `histogram(field)` without an interval; in that case the backend calculates the interval automatically.

Also, we commonly call this parameter an interval, not a duration.

Consider the following scenario:

- Dataset contains `3 million` unique client IPs.
- Query runs using `60` CPU cores.
Contributor

This line should say that the query runs on `60` querier nodes.

Each node has 60 CPU cores, and each core handles one partition, so the total is:

3m * 60 (nodes) * 60 (cores/partitions)
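
To make the size of that product concrete, here is a minimal Python sketch of the arithmetic. The variable names are illustrative only, and reading the result as a worst-case count of tracked values assumes that every partition may see every unique client IP.

```python
# Figures taken from the review comment above; names are hypothetical.
unique_client_ips = 3_000_000   # unique values in the dataset
querier_nodes = 60              # nodes running the query
cores_per_node = 60             # one partition per CPU core

# Total number of partitions across the cluster.
partitions = querier_nodes * cores_per_node

# Assumption: in the worst case each partition tracks every unique value.
worst_case_values = unique_client_ips * partitions

print(f"{partitions:,} partitions")
print(f"{worst_case_values:,} values tracked in the worst case")
```

With these numbers the product is 3,600 partitions and 10.8 billion values.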

Comment on lines +99 to +101
When you run a query using `approx_topk()`, each query node processes a subset of the dataset and computes its local approximate top K values. These local top K values are sent to the leader node. The leader node merges them to generate the final approximate result.

Because each node sends only its local top K values, the final result may miss values that are frequent across the entire dataset but do not appear in the top K list of any single node.
Contributor

Each node sends more than its top K values to the leader. It actually sends `max(k*10, 1000)` values.

This means that when you ask for the top 10, each node sends 1000 values to the leader.

Why not send only the top 10?

Because an item may fall outside the top 10 on one node but be in the top 10 on another node. The extra capacity lets the leader correct the final values.
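
The effect of the larger candidate list can be illustrated with a small Python sketch. It is not OpenObserve's implementation: it uses exact counts instead of approximate ones, and the helper names `candidates_per_node` and `merge_top_k` are hypothetical. Only the `max(k*10, 1000)` rule comes from the comment above.

```python
from collections import Counter
from typing import List, Optional, Tuple


def candidates_per_node(k: int) -> int:
    """Number of values each node sends to the leader, per the review comment."""
    return max(k * 10, 1000)


def merge_top_k(
    node_counts: List[Counter], k: int, limit: Optional[int] = None
) -> List[Tuple[str, int]]:
    """Merge each node's local top `limit` values and return the global top k.

    `limit` defaults to candidates_per_node(k); pass limit=k to see what
    happens when nodes send only their local top k.
    """
    if limit is None:
        limit = candidates_per_node(k)
    merged: Counter = Counter()
    for counts in node_counts:
        # Only the node's local top `limit` entries reach the leader.
        merged.update(dict(counts.most_common(limit)))
    return merged.most_common(k)


# "x" is second on both nodes but the most frequent value overall.
nodes = [
    Counter({"a": 10, "x": 9}),
    Counter({"b": 10, "x": 9}),
]

print(merge_top_k(nodes, k=1))           # nodes send max(1 * 10, 1000) values
print(merge_top_k(nodes, k=1, limit=1))  # nodes send only their local top 1
```

With the larger candidate list the leader sees both counts for `x` and ranks it first with a total of 18. With `limit=1` the leader never receives `x` and returns a less frequent value instead.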

Comment on lines +5 to +8
### `str_match`

**Syntax**: `str_match(field, 'value')` <br>
**Alias**: `match_field(field, 'value')`
Contributor

`str_match` has an alias, `match_field`, and `str_match_ignore_case` has an alias, `match_field_ignore_case`.
