KAFKA-19760: RecordTooLargeExceptions in group coordinator when offsets.topic.compression.codec is used #20653
Conversation
Thanks for the patch! I left a few comments.
Thanks for updating the PR!
```java
}

@Test
public void testLargeCompressibleRecordTriggersFlushAndSucceeds() throws Exception {
```
Could we add a variant of this test that would try to pack the large record into the same batch as write1 under the old implementation? 3x of the max batch size is too large for that and we'd need something like 0.75 of the max batch size. We can use @ParameterizedTest, @ValueSource and add a double parameter for the fraction of the max batch size to use.
Added a variant but separated it out into a different test because they behave slightly differently. When the record is smaller than batch size but won't fit in the current batch, it flushes the current batch but then doesn't trigger a second flush on the new batch. The 3x batch size triggers a second flush as well so having two tests to check both behaviors is good.
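For illustration, here is a minimal sketch of the parameterized shape suggested above, assuming JUnit 5; the class name, constants, and assertion are placeholders rather than the test that was actually added:

```java
import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.ValueSource;

import static org.junit.jupiter.api.Assertions.assertTrue;

// Sketch only: not the real test from the PR.
class LargeRecordFlushSketchTest {

    private static final int MAX_BATCH_SIZE = 1024 * 1024; // assumed value for illustration

    // The double parameter is the fraction of the max batch size used for the
    // "large" record: 0.75 exercises the case where the record would have been
    // packed into the partially full batch under the old implementation, 3.0 the
    // case where it exceeds the max batch size outright.
    @ParameterizedTest
    @ValueSource(doubles = {0.75, 3.0})
    void largeRecordSizedAsFractionOfMaxBatchSize(double fraction) {
        int recordSize = (int) (MAX_BATCH_SIZE * fraction);

        // In the real tests, a small write happens first so the current batch is
        // partially full, then a record of recordSize bytes is appended and the
        // flush behavior is asserted. Here we only demonstrate the parameter shape.
        assertTrue(recordSize > 0);
    }
}
```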
```java
int estimatedSizeUpperBound = AbstractRecords.estimateSizeInBytes(
    currentBatch.builder.magic(),
    CompressionType.NONE,
    recordsToAppend
);
```
It is kind of annoying that we compute the size twice, especially since estimatedSize is just estimatedSizeUpperBound scaled by a fixed factor. Could we combine them?
Otherwise, I wonder whether we should just remove the check on estimatedSize below and rely on the check from the log layer. What do you think?
I don't think we can combine them. But we can remove the estimatedSize check and rely on the log layer. Then when we have an overly large atomic write, we will 1) flush the current batch, 2) write a new batch, 3) flush it immediately, which fails. The downside is that we will do a bunch of extra work for oversized writes.
@izzyharker what do you think about removing the estimatedSize check?
It's probably fine either way? If the issue is doing the size check twice, we could apply the estimated compression ratio to the uncompressed size as an estimate rather than making two method calls.
The estimatedSize is likely wrong anyway. I am for removing it. @squah-confluent Is it ok for you?
Yes, I'm fine with removing the check against estimatedSize entirely.
Sounds good.
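For readers following along, a minimal self-contained sketch of where this thread landed, assuming the names from the quoted snippets (isAtomic, estimatedSizeUpperBound); it is not the exact code that was merged:

```java
// Sketch only: names echo the snippets quoted above, not the merged code.
final class BatchSizeCheckSketch {

    /**
     * Decide whether the current batch must be flushed before an atomic append.
     * estimatedSizeUpperBound is the uncompressed estimate from
     * AbstractRecords.estimateSizeInBytes(magic, CompressionType.NONE, records).
     * There is no longer a second, compression-adjusted estimatedSize check;
     * a genuinely oversized write is left to be rejected by the log layer when
     * its batch is flushed.
     */
    static boolean shouldFlushBeforeAtomicAppend(
        boolean isAtomic,
        int estimatedSizeUpperBound,
        int bytesRemainingInCurrentBatch
    ) {
        return isAtomic && estimatedSizeUpperBound > bytesRemainingInCurrentBatch;
    }
}
```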
```java
// If flushing fails, we don't catch the exception in order to let
// the caller fail the current operation.
maybeFlushCurrentBatch(currentTimeMs);
if (isAtomic && !currentBatch.builder.hasRoomFor(0)) {
```
I am not sure I understand the isAtomic here. Let's say that we have two records that must be written atomically, and we still have space in the batch for others. Why do we force a flush?
We don't force a flush. We only flush if we took the atomic path and uncompressed batch size >= max.message.size, which means the next atomic write will flush the current batch before writing.
Ok. Would it be possible to put !currentBatch.builder.hasRoomFor(0) within maybeFlushCurrentBatch or do we really need to rely on isAtomic?
I don't see the harm in moving it to maybeFlushCurrentBatch.
Force-pushed the …server_errors_iharker branch from 9bfdfdf to d17a9ff.
lgtm, thanks
KAFKA-19760: RecordTooLargeExceptions in group coordinator when offsets.topic.compression.codec is used (#20653)

The group coordinator has been having issues with unknown errors. The theory is that this is caused by optimistic compression estimates which cause unchecked batch overflows when trying to write.

This PR adds a check for uncompressed record size to flush batches more eagerly and avoid overfilling partially-full batches. This should make the group coordinator errors less frequent.

Also added tests to ensure this change does not impact desired behavior for large compressible records.

Reviewers: Sean Quah <[email protected]>, David Jacot <[email protected]>
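To illustrate the failure mode described above, here is a small self-contained example with made-up sizes and compression ratios, showing how an optimistic estimate can make a record appear to fit a partially full batch that it actually overflows:

```java
// Illustrative numbers only; not taken from Kafka's estimator.
public final class OptimisticEstimateExample {
    public static void main(String[] args) {
        int roomLeftInBatch = 100 * 1024;        // partially full batch
        int uncompressedRecordSize = 300 * 1024; // incoming record

        double optimisticRatio = 0.3;            // estimator's guess
        double actualRatio = 0.5;                // what the codec really achieves

        int estimatedSize = (int) (uncompressedRecordSize * optimisticRatio); // 92160 bytes: appears to fit
        int actualSize = (int) (uncompressedRecordSize * actualRatio);        // 153600 bytes: overflows

        System.out.printf("estimated=%d fits=%b%n", estimatedSize, estimatedSize <= roomLeftInBatch);
        System.out.printf("actual=%d fits=%b%n", actualSize, actualSize <= roomLeftInBatch);

        // Checking the uncompressed size (300 KiB against 100 KiB of room) flushes
        // the partially full batch first, which is what this change does.
    }
}
```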
Merged to trunk and 4.1. I was not able to cherry-pick it to 4.0 due to some conflicts. @izzyharker Could you please open a PR for the 4.0 branch?
KAFKA-19760: RecordTooLargeExceptions in group coordinator when offsets.topic.compression.codec is used (4.0) (#20715)

This PR backports the change from #20653 to 4.0.

The group coordinator has been having issues with unknown errors. The theory is that this is caused by optimistic compression estimates which cause unchecked batch overflows when trying to write.

This PR adds a check for uncompressed record size to flush batches more eagerly and avoid overfilling partially-full batches. This should make the group coordinator errors less frequent.

Also added tests to ensure this change does not impact desired behavior for large compressible records.

Reviewers: Sean Quah <[email protected]>, David Jacot <[email protected]>
```diff
 private void maybeFlushCurrentBatch(long currentTimeMs) {
     if (currentBatch != null) {
-        if (currentBatch.builder.isTransactional() || (currentBatch.appendTimeMs - currentTimeMs) >= appendLingerMs) {
+        if (currentBatch.builder.isTransactional() || (currentBatch.appendTimeMs - currentTimeMs) >= appendLingerMs || !currentBatch.builder.hasRoomFor(0)) {
```
@majialoong and I were discussing the condition (currentBatch.appendTimeMs - currentTimeMs) >= appendLingerMs. The correct version seems to be (currentTimeMs - currentBatch.appendTimeMs) >= appendLingerMs. Or we can remove it, since a lingerTimeoutTask already exists.
WDYT?
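A quick self-contained check of the point above, with made-up timestamps:

```java
// Illustrative values only.
public final class LingerConditionCheck {
    public static void main(String[] args) {
        long appendTimeMs = 1_000;  // when the current batch was started
        long currentTimeMs = 1_020; // "now"
        long appendLingerMs = 10;

        // Condition as written: appendTimeMs - currentTimeMs is <= 0 once time has
        // moved forward, so with a positive linger it never triggers a flush here.
        boolean asWritten = (appendTimeMs - currentTimeMs) >= appendLingerMs;  // -20 >= 10 -> false

        // Suggested correction: flush once the batch has lingered long enough.
        boolean corrected = (currentTimeMs - appendTimeMs) >= appendLingerMs;  // 20 >= 10 -> true

        System.out.println("as written: " + asWritten + ", corrected: " + corrected);
    }
}
```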