-
Notifications
You must be signed in to change notification settings - Fork 0
KAFKA-19019: Add support for remote storage fetch for share groups #12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: trunk
Are you sure you want to change the base?
Conversation
…line build failure
This comment was marked as resolved.
This comment was marked as resolved.
@coderabbitai full review |
✅ Actions performedFull review triggered. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (3)
core/src/main/java/kafka/server/share/DelayedShareFetch.java (2)
724-733
: Pass an immutable copy when queueing follow‑up actions
topicIdPartitions
is passed directly to the lambda queued viareplicaManager.addToActionQueue
.
Because the originalSet
is typically a mutableLinkedHashSet
owned by the caller, later mutations
(e.g.clear()
) will change the contents seen by the queued runnable, producing stale or empty work.-replicaManager.addToActionQueue(() -> topicIdPartitions.forEach(topicIdPartition -> +Set<TopicIdPartition> tpCopy = new LinkedHashSet<>(topicIdPartitions); +replicaManager.addToActionQueue(() -> tpCopy.forEach(topicIdPartition -> replicaManager.completeDelayedShareFetchRequest( new DelayedShareFetchGroupKey(shareFetch.groupId(), topicIdPartition.topicId(), topicIdPartition.partition()))));
594-597
: ExposeremoteFetch
asOptional
instead of nullable
remoteFetch()
collapses theOptional
and returnsnull
when absent, re‑introducing a nullable
contract that callers must special‑case.-RemoteFetch remoteFetch() { - return remoteFetchOpt.orElse(null); -} +Optional<RemoteFetch> remoteFetch() { + return remoteFetchOpt; +}This keeps the null‑free invariant and simplifies call sites (
ifPresent
,orElse
…).core/src/test/java/kafka/server/share/DelayedShareFetchTest.java (1)
1760-1764
:remoteFetch
field in test‑builder is never setThe builder keeps a
final Optional<DelayedShareFetch.RemoteFetch> remoteFetch = Optional.empty()
and
always passes that toDelayedShareFetch
. All tests therefore exercise only the production code path
that creates theRemoteFetch
internally.If you intended to inject a pre‑fabricated
RemoteFetch
for deterministic testing, expose a
withRemoteFetch(...)
setter; otherwise drop the field to avoid confusion.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
core/src/main/java/kafka/server/share/DelayedShareFetch.java
(11 hunks)core/src/test/java/kafka/server/share/DelayedShareFetchTest.java
(11 hunks)
🔇 Additional comments (1)
core/src/main/java/kafka/server/share/DelayedShareFetch.java (1)
682-708
:❌ Incorrect review comment
Potential data race on
remoteFetchOpt
/remoteStorageFetchException
remoteFetchOpt
andremoteStorageFetchException
are mutated intryComplete
(thread‑pool threads) and
read inmaybeCompletePendingRemoteFetch
as well asonComplete
without any synchronisation orvolatile
declaration.DelayedOperation.tryComplete(...)
may run concurrently on multiple threads, so these two
fields can be observed in a half‑initialised state causing:
NullPointerException
(remoteFetchOpt.get()
when another thread cleared it),- lost wake‑up (one thread completes the future while another still sees
!isDone()
).Guard the two fields with the inherited
lock
or mark themvolatile
/AtomicReference
.
🏁 Script executed:
#!/bin/bash grep -n "class DelayedShareFetch" core/src/main/java/kafka/server/share/DelayedShareFetch.java grep -n "remoteFetchOpt" core/src/main/java/kafka/server/share/DelayedShareFetch.java grep -n "remoteStorageFetchException" core/src/main/java/kafka/server/share/DelayedShareFetch.java grep -n "tryComplete" core/src/main/java/kafka/server/share/DelayedShareFetch.java grep -n "maybeCompletePendingRemoteFetch" core/src/main/java/kafka/server/share/DelayedShareFetch.java grep -n "onComplete" core/src/main/java/kafka/server/share/DelayedShareFetch.javaLength of output: 3971
#!/bin/bash # Locate DelayedOperation class find core/src -type f -name "DelayedOperation.java" # Display the beginning to inspect locking and structure DELAYED_OP_PATH=$(find core/src -type f -name "DelayedOperation.java" | head -n1) echo "=== $DELAYED_OP_PATH ===" sed -n '1,200p' "$DELAYED_OP_PATH" # Show where tryComplete is implemented in DelayedOperation grep -R "tryComplete" -n "$DELAYED_OP_PATH" sed -n '200,350p' "$DELAYED_OP_PATH" # Inspect usage of 'lock' in DelayedOperation grep -R "lock" -n "$DELAYED_OP_PATH"
#!/bin/bash set -e FILE="core/src/main/java/kafka/server/share/DelayedShareFetch.java" echo "=== maybeCompletePendingRemoteFetch (lines 660-740) ===" sed -n '660,740p' "$FILE" echo echo "=== tryComplete (lines 270-360) ===" sed -n '270,360p' "$FILE" echo echo "=== onComplete (lines 180-260) ===" sed -n '180,260p' "$FILE" echo echo "Searching for lock usage in DelayedShareFetch:" grep -R "lock" -n "$FILE"
#!/bin/bash set -e # 1. Show mutations of remoteFetchOpt FILE="core/src/main/java/kafka/server/share/DelayedShareFetch.java" echo "=== remoteFetchOpt assignments and updates (lines 140-240) ===" sed -n '140,240p' "$FILE" echo # 2. Show where remoteStorageFetchException is set echo "=== remoteStorageFetchException mutations (lines 600-700) ===" sed -n '600,700p' "$FILE" echo # 3. Inspect DelayedOperation for lock usage OP_PATH=$(find core/src -type f -name "DelayedOperation.java" | head -n1) echo "=== DelayedOperation file: $OP_PATH (first 200 lines) ===" sed -n '1,200p' "$OP_PATH" echo echo "=== 'lock' declarations and usage in DelayedOperation ===" grep -n "lock" "$OP_PATH"
No data race – fields are already guarded by DelayedOperation’s lock
All accesses and mutations of
remoteFetchOpt
andremoteStorageFetchException
occur inside overriddentryComplete
,maybeCompletePendingRemoteFetch
, andonComplete
methods, which DelayedOperation invokes under its internal ReentrantLock. Since these methods (and their helper calls) run only while holding that lock, there is no risk of observing a half‑initialized state or lost wake‑up. Please disregard the synchronization suggestion.Likely an incorrect or invalid review comment.
/refacto-test |
Refacto is reviewing this PR. Please wait for the review comments to be posted. |
private void cancelRemoteFetchTask() { | ||
boolean cancelled = remoteFetchOpt.get().remoteFetchTask().cancel(false); | ||
if (!cancelled) { | ||
log.debug("Remote fetch task for RemoteStorageFetchInfo: {} could not be cancelled and its isDone value is {}", | ||
remoteFetchOpt.get().remoteFetchInfo(), remoteFetchOpt.get().remoteFetchTask().isDone()); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potential Resource Leakage in Remote Fetch Task
The cancelRemoteFetchTask method attempts to cancel a remote fetch task but only logs a debug message if cancellation fails. There's no fallback mechanism to ensure resources are released if the task cannot be cancelled. This could lead to resource leakage if multiple cancellation attempts fail, potentially causing memory leaks or thread exhaustion under high load.
private void cancelRemoteFetchTask() { | |
boolean cancelled = remoteFetchOpt.get().remoteFetchTask().cancel(false); | |
if (!cancelled) { | |
log.debug("Remote fetch task for RemoteStorageFetchInfo: {} could not be cancelled and its isDone value is {}", | |
remoteFetchOpt.get().remoteFetchInfo(), remoteFetchOpt.get().remoteFetchTask().isDone()); | |
} | |
} | |
private void cancelRemoteFetchTask() { | |
boolean cancelled = remoteFetchOpt.get().remoteFetchTask().cancel(false); | |
if (!cancelled) { | |
log.debug("Remote fetch task for RemoteStorageFetchInfo: {} could not be cancelled and its isDone value is {}", | |
remoteFetchOpt.get().remoteFetchInfo(), remoteFetchOpt.get().remoteFetchTask().isDone()); | |
// If cancellation failed and task is done, ensure we clean up any resources | |
if (remoteFetchOpt.get().remoteFetchTask().isDone()) { | |
try { | |
// Force get() to ensure any exceptions are handled and resources released | |
remoteFetchOpt.get().remoteFetchTask().get(0, TimeUnit.MILLISECONDS); | |
} catch (Exception e) { | |
// Expected if task completed exceptionally or timed out | |
log.debug("Forced completion of remote fetch task: {}", e.getMessage()); | |
} | |
} | |
} | |
} |
RemoteLogReadResult remoteLogReadResult = remoteFetch.remoteFetchResult().get(); | ||
if (remoteLogReadResult.error.isPresent()) { | ||
Throwable error = remoteLogReadResult.error.get(); | ||
// If there is any error for the remote fetch topic partition, we populate the error accordingly. | ||
shareFetchPartitionData.add( | ||
new ShareFetchPartitionData( | ||
remoteFetch.topicIdPartition(), | ||
partitionsAcquired.get(remoteFetch.topicIdPartition()), | ||
ReplicaManager.createLogReadResult(error).toFetchPartitionData(false) | ||
) | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Insufficient Error Handling in Remote Fetch Processing
The code handles remote fetch errors by simply wrapping them in a LogReadResult, but doesn't properly classify or sanitize the error information. This could potentially leak sensitive information about the system's internal structure or state through error messages. Additionally, there's no differentiation between different types of errors (e.g., transient network issues vs. permission problems), which could lead to incorrect handling of security-related failures.
RemoteLogReadResult remoteLogReadResult = remoteFetch.remoteFetchResult().get(); | |
if (remoteLogReadResult.error.isPresent()) { | |
Throwable error = remoteLogReadResult.error.get(); | |
// If there is any error for the remote fetch topic partition, we populate the error accordingly. | |
shareFetchPartitionData.add( | |
new ShareFetchPartitionData( | |
remoteFetch.topicIdPartition(), | |
partitionsAcquired.get(remoteFetch.topicIdPartition()), | |
ReplicaManager.createLogReadResult(error).toFetchPartitionData(false) | |
) | |
); | |
RemoteLogReadResult remoteLogReadResult = remoteFetch.remoteFetchResult().get(); | |
if (remoteLogReadResult.error.isPresent()) { | |
Throwable error = remoteLogReadResult.error.get(); | |
// Classify and sanitize the error before propagating | |
Errors kafkaError = Errors.UNKNOWN_SERVER_ERROR; | |
if (error instanceof TimeoutException) { | |
kafkaError = Errors.REQUEST_TIMED_OUT; | |
} else if (error instanceof KafkaStorageException) { | |
kafkaError = Errors.KAFKA_STORAGE_ERROR; | |
} else if (error instanceof NotLeaderOrFollowerException) { | |
kafkaError = Errors.NOT_LEADER_OR_FOLLOWER; | |
} else if (error instanceof UnknownTopicOrPartitionException) { | |
kafkaError = Errors.UNKNOWN_TOPIC_OR_PARTITION; | |
} | |
// Log the detailed error but return a sanitized error to the client | |
log.debug("Remote fetch error for partition {}: {}", remoteFetch.topicIdPartition(), error.getMessage()); | |
shareFetchPartitionData.add( | |
new ShareFetchPartitionData( | |
remoteFetch.topicIdPartition(), | |
partitionsAcquired.get(remoteFetch.topicIdPartition()), | |
ReplicaManager.createLogReadResult(kafkaError.exception()).toFetchPartitionData(false) | |
) | |
); |
private boolean maybeCompletePendingRemoteFetch() { | ||
boolean canComplete = false; | ||
|
||
TopicIdPartition topicIdPartition = remoteFetchOpt.get().topicIdPartition(); | ||
try { | ||
replicaManager.getPartitionOrException(topicIdPartition.topicPartition()); | ||
} catch (KafkaStorageException e) { // Case a | ||
log.debug("TopicPartition {} is in an offline log directory, satisfy {} immediately", topicIdPartition, shareFetch.fetchParams()); | ||
canComplete = true; | ||
} catch (UnknownTopicOrPartitionException e) { // Case b | ||
log.debug("Broker no longer knows of topicPartition {}, satisfy {} immediately", topicIdPartition, shareFetch.fetchParams()); | ||
canComplete = true; | ||
} catch (NotLeaderOrFollowerException e) { // Case c | ||
log.debug("Broker is no longer the leader or follower of topicPartition {}, satisfy {} immediately", topicIdPartition, shareFetch.fetchParams()); | ||
canComplete = true; | ||
} | ||
|
||
if (canComplete || remoteFetchOpt.get().remoteFetchResult().isDone()) { // Case d | ||
boolean completedByMe = forceComplete(); | ||
// If invocation of forceComplete is not successful, then that means the request is already completed | ||
// hence release the acquired locks. | ||
if (!completedByMe) { | ||
releasePartitionLocks(partitionsAcquired.keySet()); | ||
} | ||
return completedByMe; | ||
} else | ||
return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potential Race Condition in Remote Fetch Completion
There's a potential race condition in the remote fetch completion logic. The code first checks if the remote fetch can complete due to various conditions, then separately checks if the remoteFetchResult is done. Between these checks, the state could change, leading to inconsistent behavior. Additionally, there's no synchronization mechanism to ensure that only one thread completes the remote fetch operation, which could lead to multiple threads attempting to process the same result or release the same locks.
private boolean maybeCompletePendingRemoteFetch() { | |
boolean canComplete = false; | |
TopicIdPartition topicIdPartition = remoteFetchOpt.get().topicIdPartition(); | |
try { | |
replicaManager.getPartitionOrException(topicIdPartition.topicPartition()); | |
} catch (KafkaStorageException e) { // Case a | |
log.debug("TopicPartition {} is in an offline log directory, satisfy {} immediately", topicIdPartition, shareFetch.fetchParams()); | |
canComplete = true; | |
} catch (UnknownTopicOrPartitionException e) { // Case b | |
log.debug("Broker no longer knows of topicPartition {}, satisfy {} immediately", topicIdPartition, shareFetch.fetchParams()); | |
canComplete = true; | |
} catch (NotLeaderOrFollowerException e) { // Case c | |
log.debug("Broker is no longer the leader or follower of topicPartition {}, satisfy {} immediately", topicIdPartition, shareFetch.fetchParams()); | |
canComplete = true; | |
} | |
if (canComplete || remoteFetchOpt.get().remoteFetchResult().isDone()) { // Case d | |
boolean completedByMe = forceComplete(); | |
// If invocation of forceComplete is not successful, then that means the request is already completed | |
// hence release the acquired locks. | |
if (!completedByMe) { | |
releasePartitionLocks(partitionsAcquired.keySet()); | |
} | |
return completedByMe; | |
} else | |
return false; | |
private boolean maybeCompletePendingRemoteFetch() { | |
boolean canComplete = false; | |
TopicIdPartition topicIdPartition = remoteFetchOpt.get().topicIdPartition(); | |
try { | |
replicaManager.getPartitionOrException(topicIdPartition.topicPartition()); | |
} catch (KafkaStorageException e) { // Case a | |
log.debug("TopicPartition {} is in an offline log directory, satisfy {} immediately", topicIdPartition, shareFetch.fetchParams()); | |
canComplete = true; | |
} catch (UnknownTopicOrPartitionException e) { // Case b | |
log.debug("Broker no longer knows of topicPartition {}, satisfy {} immediately", topicIdPartition, shareFetch.fetchParams()); | |
canComplete = true; | |
} catch (NotLeaderOrFollowerException e) { // Case c | |
log.debug("Broker is no longer the leader or follower of topicPartition {}, satisfy {} immediately", topicIdPartition, shareFetch.fetchParams()); | |
canComplete = true; | |
} | |
// Atomically check if we can complete and if so, attempt completion | |
synchronized(this) { | |
// Re-check conditions inside synchronized block to prevent race conditions | |
if ((canComplete || remoteFetchOpt.get().remoteFetchResult().isDone()) && !isCompleted()) { // Case d | |
boolean completedByMe = forceComplete(); | |
// If invocation of forceComplete is not successful, then that means the request is already completed | |
// hence release the acquired locks. | |
if (!completedByMe) { | |
releasePartitionLocks(partitionsAcquired.keySet()); | |
} | |
return completedByMe; | |
} else | |
return false; | |
} |
} catch (InterruptedException | ExecutionException e) { | ||
log.error("Exception occurred in completing remote fetch {} for delayed share fetch request {}", remoteFetchOpt.get(), e); | ||
handleExceptionInCompletingRemoteStorageShareFetchRequest(acquiredNonRemoteFetchTopicPartitionData.keySet(), e); | ||
} catch (Exception e) { | ||
log.error("Unexpected error in processing delayed share fetch request", e); | ||
handleExceptionInCompletingRemoteStorageShareFetchRequest(acquiredNonRemoteFetchTopicPartitionData.keySet(), e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unchecked Exception in Remote Storage Fetch Implementation
The code catches and handles InterruptedException and ExecutionException specifically, but then also has a generic catch for all other exceptions. This is problematic because it could mask serious security issues like SecurityExceptions or AccessControlExceptions that should be handled differently. Additionally, the thread's interrupted status is not restored when catching InterruptedException, which is a common cause of thread management issues that can lead to resource leaks or deadlocks.
} catch (InterruptedException | ExecutionException e) { | |
log.error("Exception occurred in completing remote fetch {} for delayed share fetch request {}", remoteFetchOpt.get(), e); | |
handleExceptionInCompletingRemoteStorageShareFetchRequest(acquiredNonRemoteFetchTopicPartitionData.keySet(), e); | |
} catch (Exception e) { | |
log.error("Unexpected error in processing delayed share fetch request", e); | |
handleExceptionInCompletingRemoteStorageShareFetchRequest(acquiredNonRemoteFetchTopicPartitionData.keySet(), e); | |
} catch (InterruptedException e) { | |
// Restore the interrupted status | |
Thread.currentThread().interrupt(); | |
log.error("Thread interrupted while completing remote fetch {} for delayed share fetch request", remoteFetchOpt.get()); | |
handleExceptionInCompletingRemoteStorageShareFetchRequest(acquiredNonRemoteFetchTopicPartitionData.keySet(), e); | |
} catch (ExecutionException e) { | |
log.error("Execution exception in completing remote fetch {} for delayed share fetch request: {}", remoteFetchOpt.get(), e.getCause()); | |
handleExceptionInCompletingRemoteStorageShareFetchRequest(acquiredNonRemoteFetchTopicPartitionData.keySet(), e); | |
} catch (SecurityException | AccessControlException e) { | |
// Handle security-related exceptions differently | |
log.error("Security exception in remote fetch operation: {}", e.getMessage()); | |
// Consider additional security-specific handling here | |
handleExceptionInCompletingRemoteStorageShareFetchRequest(acquiredNonRemoteFetchTopicPartitionData.keySet(), e); | |
} catch (Exception e) { | |
log.error("Unexpected error in processing delayed share fetch request", e); | |
handleExceptionInCompletingRemoteStorageShareFetchRequest(acquiredNonRemoteFetchTopicPartitionData.keySet(), e); |
Remote Storage Fetch Implementation Lacks Proper Error Handling👍 Well Done
📌 Files Processed
📝 Additional Comments
|
/refacto-test-arvi |
Refacto is reviewing this PR. Please wait for the review comments to be posted. |
/refacto-test-arvi |
Refacto is reviewing this PR. Please wait for the review comments to be posted. |
Code Review: Security, Performance and Maintainability Improvements👍 Well Done
📌 Files Processed
📝 Additional Comments
|
/refacto-test-arvi |
Refacto is reviewing this PR. Please wait for the review comments to be posted. |
Code Review: Remote Storage Fetch Operations👍 Well Done
📌 Files Processed
📝 Additional Comments
|
private void cancelRemoteFetchTask() { | ||
boolean cancelled = remoteFetchOpt.get().remoteFetchTask().cancel(false); | ||
if (!cancelled) { | ||
log.debug("Remote fetch task for RemoteStorageFetchInfo: {} could not be cancelled and its isDone value is {}", | ||
remoteFetchOpt.get().remoteFetchInfo(), remoteFetchOpt.get().remoteFetchTask().isDone()); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remote Fetch Cancellation
Missing null check before accessing remoteFetchOpt.get(). If remoteFetchOpt is empty, NullPointerException will be thrown causing request processing failure.
Standards
- ISO-IEC-25010-Reliability-Fault-Tolerance
- ISO-IEC-25010-Functional-Correctness-Appropriateness
- DbC-Precondition
- SRE-Error-Handling
*/ | ||
private void processRemoteFetchOrException( | ||
TopicPartitionRemoteFetchInfo topicPartitionRemoteFetchInfo | ||
) { | ||
TopicIdPartition remoteFetchTopicIdPartition = topicPartitionRemoteFetchInfo.topicIdPartition(); | ||
RemoteStorageFetchInfo remoteStorageFetchInfo = topicPartitionRemoteFetchInfo.logReadResult().info().delayedRemoteStorageFetch.get(); | ||
|
||
Future<Void> remoteFetchTask; | ||
CompletableFuture<RemoteLogReadResult> remoteFetchResult = new CompletableFuture<>(); | ||
try { | ||
remoteFetchTask = replicaManager.remoteLogManager().get().asyncRead( | ||
remoteStorageFetchInfo, | ||
result -> { | ||
remoteFetchResult.complete(result); | ||
replicaManager.completeDelayedShareFetchRequest(new DelayedShareFetchGroupKey(shareFetch.groupId(), remoteFetchTopicIdPartition.topicId(), remoteFetchTopicIdPartition.partition())); | ||
} | ||
); | ||
} catch (Exception e) { | ||
// Throw the error if any in scheduling the remote fetch task. | ||
remoteStorageFetchException = Optional.of(e); | ||
throw e; | ||
} | ||
remoteFetchOpt = Optional.of(new RemoteFetch(remoteFetchTopicIdPartition, topicPartitionRemoteFetchInfo.logReadResult(), remoteFetchTask, remoteFetchResult, remoteStorageFetchInfo)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Resource Cleanup Risk
Missing null check before accessing remoteFetchOpt.get(). NullPointerException risk if remoteFetchOpt is empty, causing resource leaks due to unreleased partition locks.
Standards
- ISO-IEC-25010-Reliability-Fault-Tolerance
- ISO-IEC-25010-Functional-Correctness-Appropriateness
- DbC-Precondition
- SRE-Error-Handling
/refacto-test-arvi |
Refacto is reviewing this PR. Please wait for the review comments to be posted. |
Code Review: Remote Storage Fetch Implementation👍 Well Done
📌 Files Processed
📝 Additional Comments
|
boolean cancelled = remoteFetchOpt.get().remoteFetchTask().cancel(false); | ||
if (!cancelled) { | ||
log.debug("Remote fetch task for RemoteStorageFetchInfo: {} could not be cancelled and its isDone value is {}", | ||
remoteFetchOpt.get().remoteFetchInfo(), remoteFetchOpt.get().remoteFetchTask().isDone()); | ||
} | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remote Fetch Cancellation
The cancelRemoteFetchTask method logs but doesn't handle the case when cancellation fails. This could lead to resource leaks or inconsistent state if the task continues executing after the request is completed, potentially causing system reliability issues.
Standards
- ISO-IEC-25010-Reliability-Fault-Tolerance
- ISO-IEC-25010-Functional-Correctness-Appropriateness
- SRE-Error-Handling
} catch (InterruptedException | ExecutionException e) { | ||
log.error("Exception occurred in completing remote fetch {} for delayed share fetch request {}", remoteFetchOpt.get(), e); | ||
handleExceptionInCompletingRemoteStorageShareFetchRequest(acquiredNonRemoteFetchTopicPartitionData.keySet(), e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unhandled InterruptedException Risk
InterruptedException is caught but thread interruption status is not restored. This can lead to lost interruption signals, causing threads to continue executing when they should terminate, potentially leading to resource leaks or system instability.
Standards
- ISO-IEC-25010-Reliability-Fault-Tolerance
- ISO-IEC-25010-Functional-Correctness-Appropriateness
- SRE-Error-Handling
RemoteFetch remoteFetch() { | ||
return remoteFetchOpt.orElse(null); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Null Check Missing
The remoteFetch() method returns null when remoteFetchOpt is empty, but callers might not expect null returns. This could lead to NullPointerExceptions in test code. Consider returning Optional instead to enforce null-safety at compile time.
Standards
- ISO-IEC-25010-Reliability-Fault-Tolerance
- ISO-IEC-25010-Functional-Correctness-Appropriateness
- DbC-Resource-Mgmt
/refacto-test-arvi |
Refacto is reviewing this PR. Please wait for the review comments to be posted. |
Code Review: Remote Storage Fetch Implementation👍 Well Done
📌 Files Processed
📝 Additional Comments
|
private void cancelRemoteFetchTask() { | ||
boolean cancelled = remoteFetchOpt.get().remoteFetchTask().cancel(false); | ||
if (!cancelled) { | ||
log.debug("Remote fetch task for RemoteStorageFetchInfo: {} could not be cancelled and its isDone value is {}", | ||
remoteFetchOpt.get().remoteFetchInfo(), remoteFetchOpt.get().remoteFetchTask().isDone()); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Incomplete Remote Task Cancellation
The cancelRemoteFetchTask method only logs when cancellation fails but doesn't handle the failure case. This could lead to orphaned tasks continuing execution after the request completes, potentially causing resource leaks or inconsistent state.
Standards
- ISO-IEC-25010-Reliability-Fault-Tolerance
- SRE-Resource-Management
- ISO-IEC-25010-Functional-Correctness-Appropriateness
boolean canComplete = false; | ||
|
||
TopicIdPartition topicIdPartition = remoteFetchOpt.get().topicIdPartition(); | ||
try { | ||
replicaManager.getPartitionOrException(topicIdPartition.topicPartition()); | ||
} catch (KafkaStorageException e) { // Case a | ||
log.debug("TopicPartition {} is in an offline log directory, satisfy {} immediately", topicIdPartition, shareFetch.fetchParams()); | ||
canComplete = true; | ||
} catch (UnknownTopicOrPartitionException e) { // Case b | ||
log.debug("Broker no longer knows of topicPartition {}, satisfy {} immediately", topicIdPartition, shareFetch.fetchParams()); | ||
canComplete = true; | ||
} catch (NotLeaderOrFollowerException e) { // Case c | ||
log.debug("Broker is no longer the leader or follower of topicPartition {}, satisfy {} immediately", topicIdPartition, shareFetch.fetchParams()); | ||
canComplete = true; | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing Lock Release
If an unexpected exception occurs in getPartitionOrException, partition locks may not be released properly. The method should include a catch-all exception handler to ensure locks are always released, preventing potential deadlocks in the system.
Standards
- ISO-IEC-25010-Reliability-Fault-Tolerance
- SRE-Resource-Management
- DbC-Resource-Mgmt
} catch (InterruptedException | ExecutionException e) { | ||
log.error("Exception occurred in completing remote fetch {} for delayed share fetch request {}", remoteFetchOpt.get(), e); | ||
handleExceptionInCompletingRemoteStorageShareFetchRequest(acquiredNonRemoteFetchTopicPartitionData.keySet(), e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unhandled Interruption Exception
InterruptedException is caught but the thread's interrupted status is not restored. This violates the interruption contract and can lead to missed interruption signals, potentially causing threads to continue running when they should terminate.
Standards
- ISO-IEC-25010-Reliability-Maturity
- SRE-Error-Handling
- DbC-Exception-Handling
* Note - This function should only be called when we know that there is remote fetch. | ||
*/ | ||
private void cancelRemoteFetchTask() { | ||
boolean cancelled = remoteFetchOpt.get().remoteFetchTask().cancel(false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remote Fetch NPE Risk
Direct call to remoteFetchOpt.get() without checking isPresent() first could cause NullPointerException if remoteFetchOpt is empty. This method should validate the optional contains a value before accessing it to prevent runtime failures.
Standards
- ISO-IEC-25010-Reliability-Fault-Tolerance
- ISO-IEC-25010-Functional-Correctness-Appropriateness
- DbC-Precondition
/refacto-test-arvi |
Refacto is reviewing this PR. Please wait for the review comments to be posted. |
Code Review: Remote Storage Fetch Implementation👍 Well Done
📌 Files Processed
📝 Additional Comments
|
} catch (InterruptedException | ExecutionException e) { | ||
log.error("Exception occurred in completing remote fetch {} for delayed share fetch request {}", remoteFetchOpt.get(), e); | ||
handleExceptionInCompletingRemoteStorageShareFetchRequest(acquiredNonRemoteFetchTopicPartitionData.keySet(), e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exception Handling Gap
The InterruptedException is caught but the thread's interrupted status is not restored, violating interrupt handling protocol. This can lead to missed interruption signals and prevent proper thread shutdown.
Standards
- ISO-IEC-25010-Reliability-Recoverability
- ISO-IEC-25010-Functional-Correctness-Appropriateness
- SRE-Error-Handling
if (!cancelled) { | ||
log.debug("Remote fetch task for RemoteStorageFetchInfo: {} could not be cancelled and its isDone value is {}", | ||
remoteFetchOpt.get().remoteFetchInfo(), remoteFetchOpt.get().remoteFetchTask().isDone()); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remote Fetch Cancellation
The remote fetch task cancellation doesn't handle the case where cancellation fails but the task is already done. This could lead to resource leaks or inconsistent state if the task completed between the cancel check and the isDone check.
Standards
- ISO-IEC-25010-Reliability-Fault-Tolerance
- ISO-IEC-25010-Functional-Correctness-Appropriateness
- SRE-Error-Handling
try { | ||
List<ShareFetchPartitionData> shareFetchPartitionData = new ArrayList<>(); | ||
int readableBytes = 0; | ||
if (remoteFetchOpt.get().remoteFetchResult().isDone()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Null Check Missing
The code assumes remoteFetchOpt.get() is always present without checking, which could lead to NullPointerException if remoteFetchOpt is empty. This would cause unexpected failures during remote fetch operations.
Standards
- ISO-IEC-25010-Reliability-Fault-Tolerance
- ISO-IEC-25010-Functional-Correctness-Appropriateness
- DbC-Precondition
if (canComplete || remoteFetchOpt.get().remoteFetchResult().isDone()) { // Case d | ||
boolean completedByMe = forceComplete(); | ||
// If invocation of forceComplete is not successful, then that means the request is already completed | ||
// hence release the acquired locks. | ||
if (!completedByMe) { | ||
releasePartitionLocks(partitionsAcquired.keySet()); | ||
} | ||
return completedByMe; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potential Race Condition
Potential race condition exists where remoteFetchResult().isDone() could change between check and action. If another thread completes the operation between the isDone check and forceComplete call, inconsistent state may occur.
Standards
- ISO-IEC-25010-Reliability-Maturity
- ISO-IEC-25010-Functional-Correctness-Appropriateness
- SRE-Concurrency-Control
What
This PR adds the support for remote storage fetch for share groups.
Limitation
There is a limitation in remote storage fetch for consumer groups that
we can only perform remote fetch for a single topic partition in a fetch
request. Since, the logic of share fetch requests is largely based on
how consumer
groups work, we are following similar logic in implementing remote
storage fetch. However, this problem
should be addressed as part of KAFKA-19133 which should help us perform
fetch for multiple remote fetch topic partition in a single share fetch
request.
Testing
I have followed the AK
documentation
to test my code locally (by adopting
LocalTieredStorage.java
) and withthe help of unit tests.
Summary by CodeRabbit
New Features
Bug Fixes
Tests