Shard query splitting: Split queries by stream shards #715

Merged
merged 53 commits into from
Sep 11, 2024

Conversation

matyax
Contributor

@matyax matyax commented Aug 22, 2024

Query splitting by stream shards.

Query splitting was introduced in Loki to optimize queries over long intervals and high volumes of data. It divides a large request into smaller sub-requests, then combines and displays the results as they arrive.

This approach, inspired by time-based query splitting, takes advantage of the internal __stream_shard__ label, which represents how data is spread across different sources that can be queried individually.

The main entry point of this module is runShardSplitQuery(), which prepares the query for execution and passes it to splitQueriesByStreamShard() to begin the querying loop.

splitQueriesByStreamShard() has the following structure:

  • Creates and returns an Observable to which the UI will subscribe
  • Requests the __stream_shard__ values of the selected service:
    • If there are no shard values, it falls back to the standard querying approach of the data source in runNonSplitRequest()
    • If there are shards:
      • It groups the shards into an array of arrays of shard numbers with groupShardRequests()
      • It begins the querying loop with runNextRequest()
  • runNextRequest() will send a query using the nth (cycle) shard group, and has the following internal structure:
    • adjustTargetsFromResponseState() will filter out log query targets that have already received the requested maxLines
    • interpolateShardingSelector() will update the stream selector with the current shard numbers
    • After query execution:
      • If the response is successful:
        • It will add new data to the response with combineResponses()
        • nextRequest() will use the current cycle and the total groups to determine the next request or complete execution with done()
      • If the response is unsuccessful:
        • If there are retry attempts left, it will retry the current cycle; otherwise it continues with the next cycle
        • If the returned error is "Maximum series reached", it will not retry
  • Once all request groups have been executed, it completes with done()
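
As a rough illustration of two steps in the loop above (adding the shard matcher to the stream selector, and deciding whether another cycle remains), here is a minimal sketch. The helper names buildShardMatcher, addShardMatcher and nextCycle are hypothetical and are not the functions in this PR; the sketch also assumes a plain stream selector of the form {label="value"}.

```typescript
// Hypothetical helpers for illustration only; names and signatures are
// assumptions and do not mirror the implementation in this PR.

/** Builds a matcher such as __stream_shard__=~"1|2|3" for one shard group. */
function buildShardMatcher(shards: number[]): string {
  return `__stream_shard__=~"${shards.join('|')}"`;
}

/** Appends the shard matcher to a stream selector like {service_name="api"}. */
function addShardMatcher(selector: string, shards: number[]): string {
  // Strip the surrounding braces so the matcher can be appended inside them.
  const inner = selector.trim().replace(/^\{/, '').replace(/\}$/, '').trim();
  const matcher = buildShardMatcher(shards);
  return inner.length > 0 ? `{${inner}, ${matcher}}` : `{${matcher}}`;
}

/** Returns the index of the next shard group, or undefined once all groups have run. */
function nextCycle(cycle: number, totalGroups: number): number | undefined {
  const next = cycle + 1;
  return next < totalGroups ? next : undefined;
}
```

With these helpers, addShardMatcher('{service_name="api"}', [1, 2]) yields {service_name="api", __stream_shard__=~"1|2"}, and an undefined result from nextCycle() corresponds to the point where the loop would call done().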

Key features

  • Shards retrieved by a /values call including the selected service.
  • Starts with low volume shards, intertwined with higher volume shards.
  • Groups shards based on the current time interval and available shards.
  • Sends queries grouping shards in a selector __stream_shard__=~"shardn1|shardn2|...|shardn".
  • For short intervals (< 24 hours), groups shards in queries. For longer intervals, queries individual shards.
  • Recombines responses without assuming any particular order, including overlapping values.
  • Retries failed requests up to 3 times, with exponential backoff (a waiting time of 1500 ms * 2^failed attempts). Does not retry if the response is "maximum series reached".
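
The retry rule in the last item can be sketched as follows. This is only an illustration under stated assumptions: the names retryDelayMs and shouldRetry are hypothetical, failedAttempts is assumed to count failures of the current cycle starting at 1, and the "maximum series reached" check is assumed to be a case-insensitive match on the error message.

```typescript
// Illustrative sketch; constant and function names are assumptions,
// not the identifiers used in this PR.
const BASE_DELAY_MS = 1500;
const MAX_RETRIES = 3;

/** Waiting time before the next retry: 1500 ms * 2^failedAttempts. */
function retryDelayMs(failedAttempts: number): number {
  return BASE_DELAY_MS * Math.pow(2, failedAttempts);
}

/**
 * Decides whether the current cycle should be retried.
 * Assumes failedAttempts counts the failures so far for this cycle (first failure = 1).
 */
function shouldRetry(errorMessage: string, failedAttempts: number): boolean {
  // Retrying cannot help when the series limit was hit, so skip it.
  if (errorMessage.toLowerCase().includes('maximum series reached')) {
    return false;
  }
  return failedAttempts <= MAX_RETRIES;
}
```

Under these assumptions the three retries wait 3000 ms, 6000 ms and 12000 ms; if the count starts at 0 instead, the delays would be 1500 ms, 3000 ms and 6000 ms.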

Important

Requires the exploreLogsShardSplitting feature flag enabled

Closes #690

@matyax matyax marked this pull request as ready for review August 22, 2024 16:39
@matyax matyax requested a review from a team as a code owner August 22, 2024 16:39
@gtk-grafana
Contributor

should this be in draft for now?

@matyax
Contributor Author

matyax commented Aug 23, 2024

100%. Just wanted the build for publishing.

@matyax matyax force-pushed the matyax/shard-splitting branch 5 times, most recently from 4706de5 to 3a5df48 Compare August 29, 2024 11:09
@matyax
Contributor Author

matyax commented Sep 3, 2024

@grafana/observability-logs If you want to start the review while I work on the coverage, that'd be great. I don't plan to add any significant change to the code for now.

@matyax
Contributor Author

matyax commented Sep 3, 2024

Coverage added 😤

project-words.txt Outdated
@matyax matyax changed the title Shard splitting: WIP Shard query splitting: Split queries by stream shards Sep 4, 2024
@matyax matyax requested review from a team and gtk-grafana September 4, 2024 15:24
@@ -0,0 +1,1403 @@
import { DataFrame, DataFrameType, DataQueryResponse, Field, FieldType, QueryResultMetaStat } from '@grafana/data';

@@ -0,0 +1,281 @@
import {

@@ -0,0 +1,194 @@
import { of } from 'rxjs';

@@ -0,0 +1,264 @@
import { Observable, Subscriber, Subscription } from 'rxjs';

@matyax matyax force-pushed the matyax/shard-splitting branch 2 times, most recently from e14bb19 to 6c34669 Compare September 10, 2024 19:23
@gtk-grafana
Contributor

gtk-grafana commented Sep 10, 2024

Can we prevent "no data" from showing up (and keep the panel in the loading state) when it's streaming but the shards we've queried so far don't have data?
image

@matyax
Contributor Author

matyax commented Sep 10, 2024

Definitely. What I wanted to try was to create the initial frame with fields with empty values, and see if that gets rid of that behavior. I did a quick test, but it will require an update to the merging code and some extra complexity, so I'd like to follow that up independently. And/or check the viz to see what it is that triggers the display of No Data.

@gtk-grafana
Contributor

From what I remember, the streaming state (LoadingState.Streaming) expects dataframes. I think setting the initial state to LoadingState.Loading fixes 90% of this without any other refactor.

@matyax
Contributor Author

matyax commented Sep 10, 2024

Fantastic, I'll give it a try. The reason why we used streaming with splitting was so that the panel updates the viz as new data appears, but we can start with loading and change to streaming easily.

@gtk-grafana
Contributor

gtk-grafana commented Sep 10, 2024

Yeah I think it should be "loading" state, until we get a non-empty response. We shouldn't update the "series" prop until we get a non-empty response, or the request is done.

But easier said than done :P If it's too much work we can punt on it, as this is under a feature flag, but this is the only real issue I'm seeing right now in my testing.

A good way to see how big of a UX change this is, is by enabling auto-refresh on this branch vs main

@matyax matyax enabled auto-merge (squash) September 11, 2024 16:25
@matyax matyax merged commit c9eacbd into main Sep 11, 2024
4 checks passed
Development

Successfully merging this pull request may close these issues.

Query splitting: split by shard in Explore Logs
2 participants