KAFKA-19019: Add support for remote storage fetch for share groups #12

Open
wants to merge 16 commits into
base: trunk

arvi18

@arvi18 arvi18 commented Apr 21, 2025

What

This PR adds support for remote storage fetch for share groups.

Limitation

Remote storage fetch for consumer groups has a limitation: a fetch
request can perform a remote fetch for only a single topic partition.
Since the logic of share fetch requests is largely based on how consumer
groups work, we follow similar logic in implementing remote storage
fetch. This limitation should be addressed as part of KAFKA-19133, which
should let us perform remote fetches for multiple topic partitions in a
single share fetch request.

Testing

I followed the AK documentation to test my code locally (by adopting
LocalTieredStorage.java), and also covered the change with unit tests.

Summary by CodeRabbit

  • New Features

    • Added support for remote storage fetches in share fetch operations, enabling asynchronous remote data retrieval alongside local log reads.
  • Bug Fixes

    • Improved error handling and completion logic for share fetches involving remote storage, ensuring robust operation in failure scenarios.
  • Tests

    • Introduced comprehensive tests covering remote storage fetch integration, including success, failure, and edge cases for delayed share fetch logic.


coderabbitai bot commented Apr 21, 2025

Walkthrough

The changes introduce remote storage fetch support to the DelayedShareFetch logic in Kafka. The main class is refactored to handle both local log fetches and asynchronous remote fetches, including new fields, methods, and helper records to manage remote fetch state and exceptions. The control flow is updated to branch based on whether a remote fetch is in progress, has failed, or is unnecessary. The test suite is significantly expanded with new tests that cover various scenarios involving remote fetch initiation, completion, error handling, and lock management, using extensive mocking to simulate different outcomes.
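
To make the three-way branch concrete, here is a minimal sketch of how such a decision could be modelled. The class, enum, and method names are hypothetical, and the order in which the conditions are checked is an assumption; the walkthrough does not state how DelayedShareFetch orders them.

```java
import java.util.Optional;

// Hypothetical model of the branch described above; not the actual DelayedShareFetch code.
class CompletionPathSketch {
    enum Path { COMPLETE_WITH_ERROR, COMPLETE_REMOTE_FETCH, COMPLETE_LOCAL_FETCH }

    // Assumed ordering: a recorded remote failure wins, then a pending remote fetch,
    // and otherwise the operation only needs a local log read.
    static Path choose(Optional<Exception> remoteStorageFetchException,
                       Optional<String> remoteFetchOpt) {
        if (remoteStorageFetchException.isPresent()) {
            return Path.COMPLETE_WITH_ERROR;
        }
        if (remoteFetchOpt.isPresent()) {
            return Path.COMPLETE_REMOTE_FETCH;
        }
        return Path.COMPLETE_LOCAL_FETCH;
    }

    public static void main(String[] args) {
        System.out.println(choose(Optional.empty(), Optional.empty()));
        System.out.println(choose(Optional.empty(), Optional.of("topicA-0")));
        System.out.println(choose(Optional.of(new RuntimeException("remote read failed")),
                                  Optional.of("topicA-0")));
    }
}
```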

Changes

| File(s) | Change Summary |
|---|---|
| core/src/main/java/kafka/server/share/DelayedShareFetch.java | Refactored to add support for remote storage fetches: new fields, constructor, and helper records; updated onComplete() and tryComplete() logic; added remote fetch lifecycle management; improved error handling and lock management. |
| core/src/test/java/kafka/server/share/DelayedShareFetchTest.java | Added multiple new tests covering remote fetch scenarios, including successful and failed fetches, broker state changes, and concurrency; updated builder and helper methods; improved verification of partition acquisition. |

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant DelayedShareFetch
    participant ReplicaManager
    participant RemoteLogManager

    Client->>DelayedShareFetch: Initiate share fetch
    alt Remote fetch required
        DelayedShareFetch->>ReplicaManager: Schedule remote fetch
        ReplicaManager->>RemoteLogManager: Start remote fetch task
        RemoteLogManager-->>ReplicaManager: Remote fetch result (async)
        ReplicaManager-->>DelayedShareFetch: Remote fetch completion
        DelayedShareFetch->>Client: Complete fetch (with remote data)
    else Only local fetch required
        DelayedShareFetch->>ReplicaManager: Read from local log
        ReplicaManager-->>DelayedShareFetch: Local log data
        DelayedShareFetch->>Client: Complete fetch (with local data)
    end
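
The two branches of the diagram can be imitated with a small sketch built on CompletableFuture. This is only an illustration of the callback-driven shape: the class and method names are hypothetical, the remote read is simulated with supplyAsync, and nothing here calls Kafka's ReplicaManager or RemoteLogManager.

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical illustration of the async-remote vs. local paths; not Kafka code.
class AsyncRemoteFetchSketch {
    static String fetch(boolean remoteRequired) {
        CompletableFuture<String> response = new CompletableFuture<>();
        if (remoteRequired) {
            // Simulated remote read that finishes on another thread.
            CompletableFuture<String> remoteFetchResult =
                CompletableFuture.supplyAsync(() -> "remote data");
            // The callback completes the pending response once the remote result arrives.
            remoteFetchResult.thenAccept(data -> response.complete("completed with " + data));
        } else {
            // Local path: data is available immediately.
            response.complete("completed with local data");
        }
        // Blocking here is only so the demo can return a value.
        return response.join();
    }

    public static void main(String[] args) {
        System.out.println(fetch(true));
        System.out.println(fetch(false));
    }
}
```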

Poem

In the warren of Kafka, a new path we fetch,
With hops to remote logs, our reach does stretch.
Local or distant, the bytes now appear,
Async and clever, the code’s crystal clear.
Tests bound like bunnies, ensuring no glitch—
Remote or local, the fetches are rich!
🐇✨

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ast-grep (0.31.1)
core/src/test/java/kafka/server/share/DelayedShareFetchTest.java

@visz11
Collaborator

visz11 commented Apr 21, 2025

@coderabbitai full review


coderabbitai bot commented Apr 21, 2025

✅ Actions performed

Full review triggered.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (3)
core/src/main/java/kafka/server/share/DelayedShareFetch.java (2)

724-733: Pass an immutable copy when queueing follow‑up actions

topicIdPartitions is passed directly to the lambda queued via replicaManager.addToActionQueue.
Because the original Set is typically a mutable LinkedHashSet owned by the caller, later mutations
(e.g. clear()) will change the contents seen by the queued runnable, producing stale or empty work.

-replicaManager.addToActionQueue(() -> topicIdPartitions.forEach(topicIdPartition ->
+Set<TopicIdPartition> tpCopy = new LinkedHashSet<>(topicIdPartitions);
+replicaManager.addToActionQueue(() -> tpCopy.forEach(topicIdPartition ->
     replicaManager.completeDelayedShareFetchRequest(
         new DelayedShareFetchGroupKey(shareFetch.groupId(),
                                       topicIdPartition.topicId(),
                                       topicIdPartition.partition()))));
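
The hazard can be reproduced in isolation. The sketch below uses a hypothetical in-memory queue (ActionQueueDemo is not ReplicaManager's API) and plain strings in place of TopicIdPartition, and compares a deferred action that captures the caller's set with one that captures a copy.

```java
import java.util.ArrayDeque;
import java.util.LinkedHashSet;
import java.util.Queue;
import java.util.Set;

// Hypothetical stand-in for an action queue; not Kafka's ReplicaManager API.
class ActionQueueDemo {
    private final Queue<Runnable> actions = new ArrayDeque<>();

    void add(Runnable action) {
        actions.add(action);
    }

    void runAll() {
        Runnable action;
        while ((action = actions.poll()) != null) {
            action.run();
        }
    }

    // Queues an action over the caller's set, clears the set, then drains the queue.
    // Returns how many partitions the deferred action actually saw.
    static int partitionsSeen(boolean defensiveCopy) {
        ActionQueueDemo queue = new ActionQueueDemo();
        Set<String> partitions = new LinkedHashSet<>(Set.of("topicA-0", "topicA-1"));
        int[] seen = {0};
        Set<String> captured = defensiveCopy ? new LinkedHashSet<>(partitions) : partitions;
        queue.add(() -> captured.forEach(p -> seen[0]++));
        partitions.clear(); // the caller mutates its set before the queue runs
        queue.runAll();
        return seen[0];
    }

    public static void main(String[] args) {
        System.out.println("shared set:     " + partitionsSeen(false)); // prints 0
        System.out.println("defensive copy: " + partitionsSeen(true));  // prints 2
    }
}
```

Whether the caller in DelayedShareFetch ever mutates the set after queueing is not shown in this thread; the copy is cheap insurance either way.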

594-597: Expose remoteFetch as Optional instead of nullable

remoteFetch() collapses the Optional and returns null when absent, re‑introducing a nullable
contract that callers must special‑case.

-RemoteFetch remoteFetch() {
-    return remoteFetchOpt.orElse(null);
-}
+Optional<RemoteFetch> remoteFetch() {
+    return remoteFetchOpt;
+}

This keeps the null‑free invariant and simplifies call sites (ifPresent, orElse …).
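
A small sketch of what call sites could look like with the Optional-returning accessor. RemoteFetchHolder and its simplified RemoteFetch record are hypothetical stand-ins, not the classes from the PR.

```java
import java.util.Optional;

// Hypothetical holder mirroring the suggested accessor shape.
class RemoteFetchHolder {
    // Simplified stand-in for DelayedShareFetch.RemoteFetch.
    record RemoteFetch(String topicPartition) {}

    private final Optional<RemoteFetch> remoteFetchOpt;

    RemoteFetchHolder(Optional<RemoteFetch> remoteFetchOpt) {
        this.remoteFetchOpt = remoteFetchOpt;
    }

    // Returns the Optional as-is instead of collapsing it to null.
    Optional<RemoteFetch> remoteFetch() {
        return remoteFetchOpt;
    }

    // A call site that needs no null check.
    String describe() {
        return remoteFetch()
            .map(rf -> "remote fetch pending for " + rf.topicPartition())
            .orElse("no remote fetch");
    }

    public static void main(String[] args) {
        System.out.println(new RemoteFetchHolder(Optional.empty()).describe());
        System.out.println(
            new RemoteFetchHolder(Optional.of(new RemoteFetch("topicA-0"))).describe());
    }
}
```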

core/src/test/java/kafka/server/share/DelayedShareFetchTest.java (1)

1760-1764: remoteFetch field in test‑builder is never set

The builder keeps a final Optional<DelayedShareFetch.RemoteFetch> remoteFetch = Optional.empty() and
always passes that to DelayedShareFetch. All tests therefore exercise only the production code path
that creates the RemoteFetch internally.

If you intended to inject a pre‑fabricated RemoteFetch for deterministic testing, expose a
withRemoteFetch(...) setter; otherwise drop the field to avoid confusion.
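
If injection is the intent, the setter could look roughly like the sketch below. All types here are hypothetical simplifications; the real test builder in DelayedShareFetchTest takes many more parameters that are not shown in this thread.

```java
import java.util.Optional;

// Hypothetical, heavily simplified builder; not the one in DelayedShareFetchTest.
class DelayedFetchBuilderSketch {
    record RemoteFetch(String topicPartition) {}
    record DelayedFetch(Optional<RemoteFetch> remoteFetch) {}

    static class Builder {
        // Defaults to empty so existing tests keep exercising the internal creation path.
        private Optional<RemoteFetch> remoteFetch = Optional.empty();

        Builder withRemoteFetch(RemoteFetch remoteFetch) {
            this.remoteFetch = Optional.of(remoteFetch);
            return this;
        }

        DelayedFetch build() {
            return new DelayedFetch(remoteFetch);
        }
    }

    public static void main(String[] args) {
        System.out.println(new Builder().build());
        System.out.println(new Builder().withRemoteFetch(new RemoteFetch("topicA-0")).build());
    }
}
```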

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ed79212 and 25788b3.

📒 Files selected for processing (2)
  • core/src/main/java/kafka/server/share/DelayedShareFetch.java (11 hunks)
  • core/src/test/java/kafka/server/share/DelayedShareFetchTest.java (11 hunks)
🔇 Additional comments (1)
core/src/main/java/kafka/server/share/DelayedShareFetch.java (1)

682-708:

❌ Incorrect review comment

Potential data race on remoteFetchOpt / remoteStorageFetchException

remoteFetchOpt and remoteStorageFetchException are mutated in tryComplete (thread‑pool threads) and
read in maybeCompletePendingRemoteFetch as well as onComplete without any synchronisation or volatile
declaration. DelayedOperation.tryComplete(...) may run concurrently on multiple threads, so these two
fields can be observed in a half‑initialised state causing:

  • NullPointerException (remoteFetchOpt.get() when another thread cleared it),
  • lost wake‑up (one thread completes the future while another still sees !isDone()).

Guard the two fields with the inherited lock or mark them volatile/AtomicReference.


🏁 Script executed:

#!/bin/bash
grep -n "class DelayedShareFetch" core/src/main/java/kafka/server/share/DelayedShareFetch.java
grep -n "remoteFetchOpt" core/src/main/java/kafka/server/share/DelayedShareFetch.java
grep -n "remoteStorageFetchException" core/src/main/java/kafka/server/share/DelayedShareFetch.java
grep -n "tryComplete" core/src/main/java/kafka/server/share/DelayedShareFetch.java
grep -n "maybeCompletePendingRemoteFetch" core/src/main/java/kafka/server/share/DelayedShareFetch.java
grep -n "onComplete" core/src/main/java/kafka/server/share/DelayedShareFetch.java

Length of output: 3971


#!/bin/bash
# Locate DelayedOperation class
find core/src -type f -name "DelayedOperation.java"

# Display the beginning to inspect locking and structure
DELAYED_OP_PATH=$(find core/src -type f -name "DelayedOperation.java" | head -n1)
echo "=== $DELAYED_OP_PATH ==="
sed -n '1,200p' "$DELAYED_OP_PATH"

# Show where tryComplete is implemented in DelayedOperation
grep -R "tryComplete" -n "$DELAYED_OP_PATH"
sed -n '200,350p' "$DELAYED_OP_PATH"

# Inspect usage of 'lock' in DelayedOperation
grep -R "lock" -n "$DELAYED_OP_PATH"

#!/bin/bash
set -e

FILE="core/src/main/java/kafka/server/share/DelayedShareFetch.java"

echo "=== maybeCompletePendingRemoteFetch (lines 660-740) ==="
sed -n '660,740p' "$FILE"

echo
echo "=== tryComplete (lines 270-360) ==="
sed -n '270,360p' "$FILE"

echo
echo "=== onComplete (lines 180-260) ==="
sed -n '180,260p' "$FILE"

echo
echo "Searching for lock usage in DelayedShareFetch:"
grep -R "lock" -n "$FILE"

#!/bin/bash
set -e

# 1. Show mutations of remoteFetchOpt
FILE="core/src/main/java/kafka/server/share/DelayedShareFetch.java"
echo "=== remoteFetchOpt assignments and updates (lines 140-240) ==="
sed -n '140,240p' "$FILE"

echo
# 2. Show where remoteStorageFetchException is set
echo "=== remoteStorageFetchException mutations (lines 600-700) ==="
sed -n '600,700p' "$FILE"

echo
# 3. Inspect DelayedOperation for lock usage
OP_PATH=$(find core/src -type f -name "DelayedOperation.java" | head -n1)
echo "=== DelayedOperation file: $OP_PATH (first 200 lines) ==="
sed -n '1,200p' "$OP_PATH"
echo
echo "=== 'lock' declarations and usage in DelayedOperation ==="
grep -n "lock" "$OP_PATH"

No data race – fields are already guarded by DelayedOperation’s lock

All accesses and mutations of remoteFetchOpt and remoteStorageFetchException occur inside the tryComplete, maybeCompletePendingRemoteFetch, and onComplete methods, which DelayedOperation invokes under its internal ReentrantLock. Since these methods (and their helper calls) run only while holding that lock, there is no risk of observing a half‑initialized state or a lost wake‑up. Please disregard the synchronization suggestion.

Likely an incorrect or invalid review comment.
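
The locking discipline described above can be illustrated with a standalone sketch. It assumes, as the bot's conclusion does, that every access to the mutable fields happens while one ReentrantLock is held; LockedOperationSketch is hypothetical and does not reproduce DelayedOperation's actual code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical operation whose mutable state is only touched under one lock.
class LockedOperationSketch {
    private final ReentrantLock lock = new ReentrantLock();
    private Optional<String> remoteFetchOpt = Optional.empty(); // guarded by lock
    private boolean completed = false;                          // guarded by lock

    // Entry point used by all threads: the state transition always runs under the lock.
    boolean safeTryComplete() {
        lock.lock();
        try {
            return !completed && tryComplete();
        } finally {
            lock.unlock();
        }
    }

    // First call starts the "remote fetch"; the next call completes the operation.
    private boolean tryComplete() {
        if (remoteFetchOpt.isEmpty()) {
            remoteFetchOpt = Optional.of("topicA-0");
            return false;
        }
        completed = true;
        return true;
    }

    // Calls safeTryComplete once from each thread and counts successful completions.
    static int completionsAcrossThreads(int threadCount) {
        LockedOperationSketch op = new LockedOperationSketch();
        AtomicInteger completions = new AtomicInteger();
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < threadCount; i++) {
            Thread t = new Thread(() -> {
                if (op.safeTryComplete()) {
                    completions.incrementAndGet();
                }
            });
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return completions.get();
    }

    public static void main(String[] args) {
        System.out.println("completions: " + completionsAcrossThreads(8)); // prints 1
    }
}
```

Because the fields are plain (non-volatile), the guarantee holds only as long as no code path reads or writes them outside the lock.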

Comment on lines +657 to +663
remoteFetchTask = replicaManager.remoteLogManager().get().asyncRead(
    remoteStorageFetchInfo,
    result -> {
        remoteFetchResult.complete(result);
        replicaManager.completeDelayedShareFetchRequest(new DelayedShareFetchGroupKey(shareFetch.groupId(), remoteFetchTopicIdPartition.topicId(), remoteFetchTopicIdPartition.partition()));
    }
);

⚠️ Potential issue

Handle absence of RemoteLogManager to avoid NoSuchElementException

replicaManager.remoteLogManager().get() is called without checking whether the Option is empty.
If tiered storage is disabled, or temporarily unavailable because the broker is transitioning to offline, this call throws NoSuchElementException, which short‑circuits the fetch and surfaces as an unhandled runtime exception.

-remoteFetchTask = replicaManager.remoteLogManager().get().asyncRead(
+if (replicaManager.remoteLogManager().isEmpty()) {
+    remoteStorageFetchException = Optional.of(
+        new IllegalStateException("RemoteLogManager not available for remote fetch"));
+    throw remoteStorageFetchException.get();
+}
+remoteFetchTask = replicaManager.remoteLogManager().get().asyncRead(

Fail fast with a meaningful error or return an error‐code for the partition instead of crashing.

Committable suggestion skipped: line range outside the PR's diff.
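
The fail-fast idea can be sketched without Kafka on the classpath. In the example below, java.util.Optional stands in for the Option returned by replicaManager.remoteLogManager(), and the RemoteLogManager interface and scheduleRemoteRead method are hypothetical.

```java
import java.util.Optional;

// Hypothetical guard around an optional remote log manager; not Kafka's API.
class RemoteReadGuard {
    // Minimal stand-in for the real RemoteLogManager.
    interface RemoteLogManager {
        String asyncRead(String fetchInfo);
    }

    // Fails with a descriptive exception instead of a bare NoSuchElementException.
    static String scheduleRemoteRead(Optional<RemoteLogManager> remoteLogManager,
                                     String fetchInfo) {
        RemoteLogManager manager = remoteLogManager.orElseThrow(() ->
            new IllegalStateException("RemoteLogManager not available for remote fetch"));
        return manager.asyncRead(fetchInfo);
    }

    public static void main(String[] args) {
        System.out.println(scheduleRemoteRead(Optional.of(info -> "scheduled " + info), "topicA-0"));
        try {
            scheduleRemoteRead(Optional.empty(), "topicA-0");
        } catch (IllegalStateException e) {
            System.out.println("failed fast: " + e.getMessage());
        }
    }
}
```

A caller would still need to translate the exception into a per-partition error response; that part is outside this sketch.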

Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants