[Feature] Add new agg/window function 'approx_top_k' #29643

Merged 7 commits on Sep 4, 2023

@liuyehcf liuyehcf commented Aug 22, 2023

Fix #25684

Space Saving Algorithm

The Space Saving algorithm is commonly used to estimate the top-K most frequent items in a data stream with limited memory. To implement it as a two-stage aggregate function in a distributed database management system (DBMS), the aggregation is split into two main phases:

  1. Local Aggregation (First-stage aggregate on each node)
  2. Global Aggregation (Second-stage aggregate on a single node)

Here's how you can design and execute the two-stage aggregation:

Local Aggregation (First-stage): Each node will maintain a list of counters based on the Space Saving Algorithm:

  1. For each incoming item in the stream:
    1. If the item is already in the list of counters, increment its count.
    2. If the item is not in the list and there is space available, add it to the list with a count of 1.
    3. If the item is not in the list and there is no space available, find the item with the smallest count, replace it with the new item, and set the new item's count to that smallest count plus 1.
  2. At the end of this phase, each node will have its local top-K counters.
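
The local steps above can be sketched roughly as follows. This is a minimal illustration, not the code added by this PR: `SpaceSavingSketch`, `add`, and `top` are hypothetical names, items are assumed to be strings, and the smallest counter is found with a linear scan for brevity. When a counter is evicted, the new item starts at the evicted count plus 1, which is how the Space Saving algorithm is usually described.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical illustration type; not the PR's ApproxTopKState.
class SpaceSavingSketch {
public:
    explicit SpaceSavingSketch(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }

    // Process one incoming item from the stream.
    void add(const std::string& item) {
        auto found = counters_.find(item);
        if (found != counters_.end()) {
            // Case 1: the item is already tracked, so bump its count.
            ++found->second;
        } else if (counters_.size() < capacity_) {
            // Case 2: there is a free slot, so start tracking with count 1.
            counters_.emplace(item, 1);
        } else {
            // Case 3: no free slot. Evict the smallest counter; the new
            // item takes over its count, incremented by 1.
            auto victim = std::min_element(
                counters_.begin(), counters_.end(),
                [](const auto& a, const auto& b) { return a.second < b.second; });
            const std::int64_t inherited = victim->second;
            counters_.erase(victim);
            counters_.emplace(item, inherited + 1);
        }
    }

    // Return up to k counters, highest count first (ties broken by item).
    std::vector<std::pair<std::string, std::int64_t>> top(std::size_t k) const {
        std::vector<std::pair<std::string, std::int64_t>> out(counters_.begin(),
                                                              counters_.end());
        std::sort(out.begin(), out.end(), [](const auto& a, const auto& b) {
            if (a.second != b.second) return a.second > b.second;
            return a.first < b.first;
        });
        if (out.size() > k) out.resize(k);
        return out;
    }

private:
    std::size_t capacity_;
    std::unordered_map<std::string, std::int64_t> counters_;
};
```

Because an evicted counter's count is inherited, the counts of items that entered through case 3 are upper bounds rather than exact frequencies.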

Global Aggregation (Second-stage): After the local aggregation phase, the intermediate counters from all nodes will be sent to a particular aggregation node. On this node:

  1. For each counter from the nodes:
    1. If the item is already in the global list of counters, add the local count to the global count.
    2. If the item is not in the global list and there is space available, add it to the global list with its local count.
    3. If the item is not in the global list and there is no space available, compare its local count with the smallest global counter. If the local count is greater, replace that smallest global counter with the new item and its count. Otherwise, discard the local counter.
  2. Once all the local counters have been processed, the global list will contain the estimated top-K frequent items across all the nodes.
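
A rough sketch of the global merge steps above, under the same caveats: `GlobalCounters`, `LocalCounters`, and `merge_local_counters` are hypothetical names chosen for illustration, items are assumed to be strings, and this is not the PR's actual merge code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical types: the global list on the aggregation node, and the
// intermediate counters received from one node.
using GlobalCounters = std::unordered_map<std::string, std::int64_t>;
using LocalCounters = std::vector<std::pair<std::string, std::int64_t>>;

// Fold one node's local counters into the global list.
void merge_local_counters(GlobalCounters& global, const LocalCounters& local,
                          std::size_t capacity) {
    assert(capacity > 0);
    for (const auto& entry : local) {
        auto found = global.find(entry.first);
        if (found != global.end()) {
            // Case 1: already in the global list, so add the local count.
            found->second += entry.second;
            continue;
        }
        if (global.size() < capacity) {
            // Case 2: free slot, so insert the item with its local count.
            global.emplace(entry.first, entry.second);
            continue;
        }
        // Case 3: the list is full. Replace the smallest global counter
        // only if the local count is strictly greater; otherwise discard.
        auto smallest = std::min_element(
            global.begin(), global.end(),
            [](const auto& a, const auto& b) { return a.second < b.second; });
        if (entry.second > smallest->second) {
            global.erase(smallest);
            global.emplace(entry.first, entry.second);
        }
    }
}
```

Note that with this rule the result can depend on the order in which local counters arrive once the global list is full, so the sketch takes the local counters as an ordered vector.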

Description

Please refer to approx_top_k.md in the change list for more information.

Examples

MySQL > SELECT APPROX_TOP_K(L_LINESTATUS) FROM lineitem;
+-------------------------------------------------------------+
| approx_top_k(L_LINESTATUS)                                  |
+-------------------------------------------------------------+
| [{"item":"O","count":3004998},{"item":"F","count":2996217}] |
+-------------------------------------------------------------+
MySQL > SELECT APPROX_TOP_K(L_LINENUMBER) FROM lineitem GROUP BY L_RETURNFLAG;
+-------------------------------------------------------------------------------------------------------------------------------------+
| approx_top_k(L_LINENUMBER)                                                                                                          |
+-------------------------------------------------------------------------------------------------------------------------------------+
| [{"item":1,"count":761151},{"item":2,"count":652280},{"item":3,"count":543265},{"item":4,"count":434834},{"item":5,"count":326135}] |
| [{"item":1,"count":368853},{"item":2,"count":316830},{"item":3,"count":263950},{"item":4,"count":211270},{"item":5,"count":158495}] |
| [{"item":1,"count":369996},{"item":2,"count":316718},{"item":3,"count":264179},{"item":4,"count":210911},{"item":5,"count":158657}] |
+-------------------------------------------------------------------------------------------------------------------------------------+

Limitations

  1. For now, only BOOLEAN, STRING, numerical types, floating point types, and date-related types are supported.
  2. NULL values are excluded from the calculation.
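
As a small illustration of the second limitation, the sketch below skips NULL rows before they reach any counter. Everything here is an assumption made for the example: `NullableColumn` and `count_non_null` are hypothetical names, the column is modeled as string values plus a parallel null mask, and exact counting stands in for the approximate counters to keep the sketch short.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical nullable column: values plus a parallel null mask.
struct NullableColumn {
    std::vector<std::string> values;
    std::vector<bool> is_null;
};

// Count occurrences of each non-NULL value; NULL rows are skipped entirely.
std::map<std::string, std::int64_t> count_non_null(const NullableColumn& col) {
    assert(col.values.size() == col.is_null.size());
    std::map<std::string, std::int64_t> counts;
    for (std::size_t i = 0; i < col.values.size(); ++i) {
        if (col.is_null[i]) continue;  // NULL never becomes an item
        ++counts[col.values[i]];
    }
    return counts;
}
```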

What type of PR is this:

  • BugFix
  • Feature
  • Enhancement
  • Refactor
  • UT
  • Doc
  • Tool

Does this PR entail a change in behavior?

  • Yes, this PR will result in a change in behavior.
  • No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

  • Interface/UI changes: syntax, type conversion, expression evaluation, display information
  • Parameter changes: default values, similar parameters but with different default values
  • Policy changes: use new policy to replace old one, functionality automatically enabled
  • Feature removed
  • Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

  • I have added test cases for my bug fix or my new feature
  • This pr needs user documentation (for new or modified features or behaviors)
    • I have added documentation for my new feature or new function

Bugfix cherry-pick branch check:

  • I have checked the version labels which the pr will be auto-backported to the target branch
    • 3.1
    • 3.0
    • 2.5
    • 2.4

@liuyehcf liuyehcf force-pushed the approx_top_k branch 2 times, most recently from 11f6d4a to affe7e0 Compare August 22, 2023 06:25
@wanpengfei-git wanpengfei-git added the documentation Improvements or additions to documentation label Aug 22, 2023
@liuyehcf liuyehcf force-pushed the approx_top_k branch 4 times, most recently from a086a12 to 21b567c Compare August 22, 2023 08:11
@liuyehcf liuyehcf changed the title [Feature] Add new window function 'approx_top_k' [Feature] Add new agg/window function 'approx_top_k' Aug 23, 2023
@liuyehcf liuyehcf force-pushed the approx_top_k branch 4 times, most recently from 0042d56 to baa4c5e Compare August 23, 2023 09:27
@liuyehcf liuyehcf force-pushed the approx_top_k branch 4 times, most recently from f8cad83 to 29ad60a Compare August 28, 2023 06:45
@liuyehcf liuyehcf force-pushed the approx_top_k branch 2 times, most recently from 4ce7e38 to c6a14fe Compare September 2, 2023 00:24
Signed-off-by: liuyehcf <[email protected]>
be/src/exec/analytor.cpp (outdated, resolved)
Comment on lines +33 to +35
struct ApproxTopKState {
    using CppType = RunTimeCppType<LT>;
    using ColumnType = RunTimeColumnType<LT>;

@fzhedu fzhedu Sep 4, 2023

Should we implement a general version for complex types, such as array, map, and struct?

Contributor Author

This is beyond the scope of this PR. Maybe we will support semi-structured types.

Signed-off-by: liuyehcf <[email protected]>
Signed-off-by: liuyehcf <[email protected]>
@sonarcloud

sonarcloud bot commented Sep 4, 2023

SonarCloud Quality Gate failed.

  • Bugs: 2 (rating C)
  • Vulnerabilities: 0 (rating A)
  • Security Hotspots: 0 (rating A)
  • Code Smells: 126 (rating B)
  • Coverage: 0.0%
  • Duplication: 0.8%

@wanpengfei-git
Collaborator

[FE Incremental Coverage Report]

😍 pass : 95 / 98 (96.94%)

file detail

| path | covered_line | new_line | coverage | not_covered_line_detail |
|---|---|---|---|---|
| com/starrocks/analysis/AnalyticExpr.java | 3 | 4 | 75.00% | [273] |
| com/starrocks/common/util/ExprUtil.java | 16 | 18 | 88.89% | [20, 40] |
| com/starrocks/catalog/FunctionSet.java | 37 | 37 | 100.00% | [] |
| com/starrocks/sql/analyzer/DecimalV3FunctionAnalyzer.java | 4 | 4 | 100.00% | [] |
| com/starrocks/sql/optimizer/transformer/WindowTransformer.java | 4 | 4 | 100.00% | [] |
| com/starrocks/analysis/AnalyticWindow.java | 1 | 1 | 100.00% | [] |
| com/starrocks/sql/analyzer/AnalyticAnalyzer.java | 2 | 2 | 100.00% | [] |
| com/starrocks/sql/analyzer/FunctionAnalyzer.java | 28 | 28 | 100.00% | [] |

Contributor

@silverbullet233 silverbullet233 left a comment

LGTM

@wanpengfei-git
Collaborator

[BE Incremental Coverage Report]

😞 fail : 5 / 273 (01.83%)

file detail

| path | covered_line | new_line | coverage | not_covered_line_detail |
|---|---|---|---|---|
| src/exprs/agg/nullable_aggregate.h | 0 | 5 | 00.00% | [164, 170, 171, 173, 211] |
| src/exprs/agg/approx_top_k.h | 0 | 244 | 00.00% | [43, 46, 61, 62, 63, 64, 65, 66, 67, 69, 70, 73, 74, 75, 81, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 98, 99, 100, 101, 102, 104, 105, 106, 107, 108, 109, 110, 111, 112, 117, 118, 119, 120, 123, 124, 126, 127, 129, 130, 131, 132, 134, 135, 136, 137, 138, 139, 140, 141, 143, 144, 145, 146, 151, 153, 155, 156, 157, 158, 162, 171, 172, 173, 176, 178, 179, 180, 185, 186, 187, 188, 189, 190, 191, 192, 194, 195, 196, 197, 198, 201, 202, 203, 205, 206, 209, 210, 212, 213, 214, 219, 220, 221, 224, 226, 230, 232, 233, 234, 236, 251, 252, 253, 254, 256, 257, 258, 260, 262, 263, 264, 265, 266, 267, 270, 271, 272, 274, 275, 276, 279, 280, 281, 284, 285, 286, 288, 289, 290, 292, 293, 295, 296, 298, 299, 300, 301, 304, 307, 308, 309, 310, 311, 312, 313, 315, 316, 319, 320, 323, 324, 325, 328, 330, 331, 332, 333, 336, 337, 338, 341, 343, 344, 345, 347, 348, 349, 351, 352, 353, 354, 355, 357, 359, 362, 364, 365, 366, 367, 368, 373, 374, 376, 377, 378, 380, 382, 383, 384, 385, 387, 390, 391, 393, 396, 398, 399, 401, 403, 404, 406, 407, 409, 410, 411, 415, 416, 417, 418, 419, 421, 422, 425, 426, 428, 431, 432, 435, 436, 437, 440, 442, 448, 451, 452, 453, 454, 458, 460, 461, 464] |
| src/exec/analytor.cpp | 0 | 19 | 00.00% | [205, 206, 207, 306, 468, 469, 470, 471, 472, 525, 526, 528, 530, 531, 532, 533, 534, 536, 537] |
| src/exprs/agg/factory/aggregate_factory.hpp | 2 | 2 | 100.00% | [] |
| src/exprs/agg/factory/aggregate_resolver_approx.cpp | 3 | 3 | 100.00% | [] |

@liuyehcf liuyehcf enabled auto-merge (squash) September 4, 2023 05:42
@liuyehcf liuyehcf merged commit 43968b7 into StarRocks:main Sep 4, 2023
28 of 30 checks passed
@liuyehcf
Contributor Author

liuyehcf commented Sep 4, 2023

@Mergifyio backport branch-3.0

@liuyehcf
Contributor Author

liuyehcf commented Sep 4, 2023

@Mergifyio backport branch-3.1

@mergify
Contributor

mergify bot commented Sep 4, 2023

backport branch-3.0

✅ Backports have been created

@mergify
Contributor

mergify bot commented Sep 4, 2023

backport branch-3.1

✅ Backports have been created

mergify bot pushed a commit that referenced this pull request Sep 4, 2023
* [Feature] Add new window function 'approx_top_k'

Signed-off-by: liuyehcf <[email protected]>

* update 1

Signed-off-by: liuyehcf <[email protected]>

* update 2

Signed-off-by: liuyehcf <[email protected]>

* update 3

Signed-off-by: liuyehcf <[email protected]>

* update 4

Signed-off-by: liuyehcf <[email protected]>

* update 5

Signed-off-by: liuyehcf <[email protected]>

* update 6

Signed-off-by: liuyehcf <[email protected]>

---------

Signed-off-by: liuyehcf <[email protected]>
(cherry picked from commit 43968b7)

# Conflicts:
#	be/src/exprs/agg/factory/aggregate_factory.hpp
#	fe/fe-core/src/main/java/com/starrocks/analysis/AnalyticExpr.java
#	fe/fe-core/src/main/java/com/starrocks/catalog/FunctionSet.java
#	test/common/sql/ssb/create.sql
#	test/common/sql/tpcds/create.sql
#	test/common/sql/tpch/create.sql
mergify bot pushed a commit that referenced this pull request Sep 4, 2023
* [Feature] Add new window function 'approx_top_k'

Signed-off-by: liuyehcf <[email protected]>

* update 1

Signed-off-by: liuyehcf <[email protected]>

* update 2

Signed-off-by: liuyehcf <[email protected]>

* update 3

Signed-off-by: liuyehcf <[email protected]>

* update 4

Signed-off-by: liuyehcf <[email protected]>

* update 5

Signed-off-by: liuyehcf <[email protected]>

* update 6

Signed-off-by: liuyehcf <[email protected]>

---------

Signed-off-by: liuyehcf <[email protected]>
(cherry picked from commit 43968b7)

# Conflicts:
#	be/src/exprs/agg/factory/aggregate_factory.hpp
#	fe/fe-core/src/main/java/com/starrocks/analysis/AnalyticExpr.java
#	fe/fe-core/src/main/java/com/starrocks/catalog/FunctionSet.java
#	fe/fe-core/src/test/java/com/starrocks/sql/plan/AggregateTest.java
#	test/common/sql/ssb/create.sql
#	test/common/sql/tpcds/create.sql
#	test/common/sql/tpch/create.sql
liuyehcf added a commit that referenced this pull request Sep 4, 2023
#30357)

* [Feature] Add new agg/window function 'approx_top_k' (#29643)

* [Feature] Add new window function 'approx_top_k'

Signed-off-by: liuyehcf <[email protected]>

* update 1

Signed-off-by: liuyehcf <[email protected]>

* update 2

Signed-off-by: liuyehcf <[email protected]>

* update 3

Signed-off-by: liuyehcf <[email protected]>

* update 4

Signed-off-by: liuyehcf <[email protected]>

* update 5

Signed-off-by: liuyehcf <[email protected]>

* update 6

Signed-off-by: liuyehcf <[email protected]>

---------

Signed-off-by: liuyehcf <[email protected]>
(cherry picked from commit 43968b7)

# Conflicts:
#	be/src/exprs/agg/factory/aggregate_factory.hpp
#	fe/fe-core/src/main/java/com/starrocks/analysis/AnalyticExpr.java
#	fe/fe-core/src/main/java/com/starrocks/catalog/FunctionSet.java
#	fe/fe-core/src/test/java/com/starrocks/sql/plan/AggregateTest.java
#	test/common/sql/ssb/create.sql
#	test/common/sql/tpcds/create.sql
#	test/common/sql/tpch/create.sql

* solve conflict

Signed-off-by: liuyehcf <[email protected]>

* fix fe ut

---------

Signed-off-by: liuyehcf <[email protected]>
Co-authored-by: liuyehcf <[email protected]>
liuyehcf added a commit that referenced this pull request Sep 4, 2023
#30356)

* [Feature] Add new agg/window function 'approx_top_k' (#29643)

* [Feature] Add new window function 'approx_top_k'

Signed-off-by: liuyehcf <[email protected]>

* update 1

Signed-off-by: liuyehcf <[email protected]>

* update 2

Signed-off-by: liuyehcf <[email protected]>

* update 3

Signed-off-by: liuyehcf <[email protected]>

* update 4

Signed-off-by: liuyehcf <[email protected]>

* update 5

Signed-off-by: liuyehcf <[email protected]>

* update 6

Signed-off-by: liuyehcf <[email protected]>

---------

Signed-off-by: liuyehcf <[email protected]>
(cherry picked from commit 43968b7)

# Conflicts:
#	be/src/exprs/agg/factory/aggregate_factory.hpp
#	fe/fe-core/src/main/java/com/starrocks/analysis/AnalyticExpr.java
#	fe/fe-core/src/main/java/com/starrocks/catalog/FunctionSet.java
#	test/common/sql/ssb/create.sql
#	test/common/sql/tpcds/create.sql
#	test/common/sql/tpch/create.sql

* solve conflict

Signed-off-by: liuyehcf <[email protected]>

* fix fe ut

* fix be compile

---------

Signed-off-by: liuyehcf <[email protected]>
Co-authored-by: liuyehcf <[email protected]>
Jay-ju pushed a commit to Jay-ju/starrocks that referenced this pull request Sep 7, 2023
* [Feature] Add new window function 'approx_top_k'

Signed-off-by: liuyehcf <[email protected]>

* update 1

Signed-off-by: liuyehcf <[email protected]>

* update 2

Signed-off-by: liuyehcf <[email protected]>

* update 3

Signed-off-by: liuyehcf <[email protected]>

* update 4

Signed-off-by: liuyehcf <[email protected]>

* update 5

Signed-off-by: liuyehcf <[email protected]>

* update 6

Signed-off-by: liuyehcf <[email protected]>

---------

Signed-off-by: liuyehcf <[email protected]>
Labels
documentation Improvements or additions to documentation
Development

Successfully merging this pull request may close these issues.

[function]approx_top_k
6 participants