[FEAT] 2PC implementation #132
Open
gaurav7261 wants to merge 1 commit into
Author
@mzitnik @kurnoolsaketh please review
Contributor
@gaurav7261
Summary
Adds an exactly-once ClickHouse sink for Flink, closing the gap called out in the README ("Currently the sink does not support exactly-once semantics"). The sink is built on the Flink 2.0 `Sink` + `SupportsCommitter` + `CommittingSinkWriter` APIs and relies on ClickHouse's `insert_deduplication_token` for idempotent commits. Also ports the same design to the `flink-connector-clickhouse-1.17` module using the Flink 1.17 `TwoPhaseCommittingSink` API.
This is the eBay block-aggregator / ClickLoad style "deterministic block reconstruction + CH dedup" pattern, adapted to Flink 2PC.
Problem
The existing `ClickHouseAsyncSink` is built on Flink's `AsyncSinkBase`. Under failure:
- Records are dropped (`numOfDroppedRecords` increments, `resultHandler.completeExceptionally` is called). That is data loss, not at-least-once.
- `ASYNC_OPERATIONS=true` on the insert means CH acks before data is durable, so a client ack cannot be taken as "landed".
Fixing these in place would require reshaping the async framework. A cleaner option is a second sink that uses Flink's native 2PC path.
Design
The sink follows the same deterministic-block-reconstruction principle used by eBay's Block Aggregator and ClickHouse's ClickLoad script, but plugged into Flink's 2PC protocol rather than into Kafka consumer state or a staging table.
Flow
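As a rough illustration of the principle in the Design paragraph, the sketch below models the flow in memory: a prepare step chunks a buffer and stamps each chunk with a token, and a commit step inserts into a fake table that ignores tokens it has already seen. It is not the PR's code. `FlowSketch`, `Chunk`, `FakeTable`, and the simplified token are all hypothetical names and choices made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical in-memory model of the 2PC flow; none of these names come from the PR.
final class FlowSketch {

    /** One unit handed from writer to committer: rows plus their dedup token. */
    static final class Chunk {
        final List<String> rows;
        final String token;

        Chunk(List<String> rows, String token) {
            this.rows = rows;
            this.token = token;
        }
    }

    /** Stand-in for a table that drops inserts whose token was already seen. */
    static final class FakeTable {
        private final Set<String> seenTokens = new HashSet<>();
        final List<String> rows = new ArrayList<>();

        /** Returns true if rows were written, false if the insert was deduplicated. */
        boolean insert(Chunk chunk) {
            if (!seenTokens.add(chunk.token)) {
                return false; // same token as an earlier insert: treated as a retry
            }
            rows.addAll(chunk.rows);
            return true;
        }
    }

    /** Pre-commit step: split the buffer into fixed-size chunks and stamp each one. */
    static List<Chunk> prepareCommit(List<String> buffer, int subtask, long checkpointId, int chunkSize) {
        List<Chunk> chunks = new ArrayList<>();
        int seq = 0;
        for (int i = 0; i < buffer.size(); i += chunkSize) {
            int end = Math.min(i + chunkSize, buffer.size());
            List<String> rows = new ArrayList<>(buffer.subList(i, end));
            // Simplified token (assumption): position plus a deterministic content hash.
            String token = subtask + "-" + checkpointId + "-" + seq + "-"
                    + Integer.toHexString(rows.hashCode());
            chunks.add(new Chunk(rows, token));
            seq++;
        }
        return chunks;
    }

    /** Commit step: returns how many chunks were actually written. Safe to repeat. */
    static int commit(List<Chunk> chunks, FakeTable table) {
        int written = 0;
        for (Chunk chunk : chunks) {
            if (table.insert(chunk)) {
                written++;
            }
        }
        return written;
    }

    public static void main(String[] args) {
        List<String> buffer = Arrays.asList("a", "b", "c", "d", "e");
        FakeTable table = new FakeTable();
        System.out.println(commit(prepareCommit(buffer, 0, 7L, 2), table)); // 3
        // Replaying the same records after a restart rebuilds the same tokens.
        System.out.println(commit(prepareCommit(buffer, 0, 7L, 2), table)); // 0
    }
}
```

The second commit writes nothing because the replayed buffer is chunked the same way and so yields the same tokens. The model therefore depends on the replay being deterministic.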
Token formula
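A minimal sketch of a token that combines position and content, assuming SHA-256 over the payload and a `subtask-checkpoint-sequence-hash` layout. The PR's exact formula is not reproduced here. The class name `TokenSketch`, the field order, the separator, and the hash choice are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper; layout and hash choice are assumptions, not the PR's formula.
final class TokenSketch {

    /** Builds "<subtask>-<checkpoint>-<sequence>-<sha256(payload) as hex>". */
    static String token(int subtaskId, long checkpointId, int sequence, byte[] payload) {
        return subtaskId + "-" + checkpointId + "-" + sequence + "-" + sha256Hex(payload);
    }

    /** Lower-case hex encoding of the SHA-256 digest of the given bytes. */
    static String sha256Hex(byte[] data) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] payload = "1,alice\n2,bob\n".getBytes(StandardCharsets.UTF_8);
        String first = token(3, 42L, 0, payload);
        String replayed = token(3, 42L, 0, payload.clone());
        System.out.println(first.equals(replayed)); // true: same inputs, same token
    }
}
```

Because the token is a pure function of its four inputs, a restarted job that rebuilds the same chunk at the same position produces the same string.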
Position binding (subtask / ckpt / seq) + content binding (payload hash). Deterministic across restarts: same bytes always produce the same token. If Flink re-parallelizes on recovery and a different subtask sees the same records, the token still matches because the payload is identical. ClickHouse drops the retry at the `ReplicatedMergeTree` level.
Prior art
- eBay Block Aggregator: keeps its state in Kafka `__consumer_offsets`. We use the same principle but store committables in Flink checkpoint state instead.
- ClickLoad: staging table plus `ALTER TABLE ... MOVE PARTITION` for atomic EO bulk loads. Alternative path; stronger guarantees than dedup tokens but heavier. Noted as a possible future mode in Javadoc.
- `clickhouse-kafka-connect`: same deterministic-reconstruction pattern with state stored in a `KeeperMap` table; Java-side state machine in `Processing.java`.
The CH Supercharge Part 3 blog explicitly warns that pure block-hash dedup is fragile under thread-level non-determinism and window overflow. The content-bound token addresses both.
Changes
New package `sink/tpc/` in both 2.0 and 1.17 modules
- `ClickHouseSink.java`: `Sink` + `SupportsCommitter` (2.0) / `TwoPhaseCommittingSink` (1.17).
- `ClickHouseCommittable.java`: `{payload, tableName, format, deduplicationToken, recordCount}`.
- `ClickHouseCommittableSerializer.java`
- `ClickHouseCommittingWriter.java`: `prepareCommit` chunks buffer and stamps each chunk with a content-bound token.
- `ClickHouseCommitter.java`: `insert_deduplication_token`; retry classification via CH numeric error codes; client recycling after sustained connection failures; commit latency / dedup-anomaly metrics.
`ClickHouseClientConfig` (2.0 module)
Adds `public synchronized void resetClient()` so the committer can recycle the cached HTTP client after N consecutive connection failures, mirroring the fresh-client-per-attempt practice in `BlockSupportedBufferFlushTask`.
Operational features
- CH error codes `76`, `209`, `210`, `241`, `246`, `252`, `319`, `999` retryable; `60`, `62`, `81`, `497` permanent. Falls back to message heuristics for unknown codes.
- `future.get(commitTimeoutMs, MILLISECONDS)`, default 120 s, prevents stuck-commit deadlock.
- `ClickHouseClientConfig.resetClient()` triggered after 2 consecutive connection-level failures.
- `commitsSucceeded`, `commitsRetried`, `commitsFailedPermanent`, `recordsCommitted`, `bytesCommitted`, `writtenRowMismatches`, `commitLatencyMs`. `writtenRowMismatches > 0` is the ARV-analog alarm.
- `ASYNC_OPERATIONS`: `async_operations=false` at insert time so client ack means durable write.
Usage
Required ClickHouse config
The default window of 100 blocks is too small for extended retry scenarios.
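To picture why a small window matters, here is a toy model of a bounded dedup window that remembers only the most recent N tokens. It is an assumption-laden illustration, not ClickHouse's implementation: `DedupWindowSketch` and its eviction policy (oldest token first) are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

// Toy model of a bounded dedup window; not how ClickHouse implements it.
final class DedupWindowSketch {
    private final int capacity;
    private final Deque<String> order = new ArrayDeque<>();
    private final Set<String> tokens = new HashSet<>();

    DedupWindowSketch(int capacity) {
        this.capacity = capacity;
    }

    /** Returns true if the insert is accepted, false if it is dropped as a duplicate. */
    boolean offer(String token) {
        if (tokens.contains(token)) {
            return false;
        }
        tokens.add(token);
        order.addLast(token);
        if (order.size() > capacity) {
            tokens.remove(order.removeFirst()); // oldest token falls out of the window
        }
        return true;
    }

    public static void main(String[] args) {
        DedupWindowSketch window = new DedupWindowSketch(100);
        window.offer("late-retry");
        for (int i = 0; i < 100; i++) {
            window.offer("block-" + i); // 100 newer blocks arrive before the retry
        }
        // The original token has been evicted, so the retry is accepted a second time.
        System.out.println(window.offer("late-retry")); // true
    }
}
```

In this model a retry that arrives after 100 newer blocks is no longer recognized, which is the double-write scenario a larger window is meant to avoid.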
Required Flink topology
Source → sink must be a `forward` connection (same parallelism, no `rebalance` / `keyBy`) to preserve source-replay determinism. The Javadoc spells this out; sources like Kafka with fixed parallelism and file sources satisfy it.
Tests
Unit tests (no Docker, run on CI)
- `ClickHouseCommittableSerializerTest` (5 tests): V1 round-trip with normal / empty / 10 MB payloads, version rejection, version stability.
- `ClickHouseTokenDeterminismTest` (6 tests): same inputs → same token; different (subtask | checkpoint | sequence | payload) → different token; 1-bit payload flip produces different token.
All 11 pass locally.
Integration tests (requires Docker)
`ClickHouseSinkTpcTests`:
- `exactlyOnceHappyPath_allRowsCommitted`: bounded Flink job with `EXACTLY_ONCE` checkpointing; verifies row count matches input.
- `clickHouseTokenDedup_sameTokenCollapsesDuplicate`: two inserts with same token against CH testcontainer; verifies the second is dropped.
- `committableRoundTrip_isLossless`: serializer sanity in integration context.
Deliberate omissions vs. eBay Block Aggregator
Documented in `ClickHouseSink` Javadoc. Summary:
- `system.zookeeper` probe when `insert_quorum` is set: revisit if quorum errors dominate in practice.
- `TableSchemaUpdateTracker` (schema evolution): current scaffold requires a job restart on schema change. Explicitly out of scope.
Limitations / known gaps
- No `StatefulSink` for writer state; buffer is reconstructed from source replay on restart. Trade-off: simpler state at the cost of a re-run of uncommitted records (which is safe because committables were never ack'd).
Compile verification
Both modules compile against their respective Flink versions:
Notes for reviewers
- `ClickHouseClientConfig#resetClient()` is a small public API addition to the 2.0 module. Minimal blast radius (no existing callers); would also make sense in the 1.17 module if you want parity. Happy to mirror it.
- The token formula is duplicated in `ClickHouseTokenDeterminismTest` on purpose so any drift in the writer produces a test failure.
- The `writtenRowMismatches` counter is the production alert signal: non-zero means something upstream is double-producing or the dedup window has overflowed. Worth a dashboard.
- The Javadoc on `ClickHouseSink` explicitly lists the prior art (eBay Block Aggregator, ClickLoad, clickhouse-kafka-connect) and source-determinism requirement so downstream users know what they are signing up for.
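For reviewers who want a concrete picture of the retry classification listed under Operational features, a minimal sketch follows. The two code sets are the ones given in this description. Everything else is an assumption: the class name `RetryClassifierSketch`, the `Decision` enum, and in particular the fallback keywords, which stand in for whatever message heuristics the committer actually uses.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical classifier; only the two code lists come from the PR description.
final class RetryClassifierSketch {

    enum Decision { RETRY, FAIL }

    private static final Set<Integer> RETRYABLE =
            new HashSet<>(Arrays.asList(76, 209, 210, 241, 246, 252, 319, 999));
    private static final Set<Integer> PERMANENT =
            new HashSet<>(Arrays.asList(60, 62, 81, 497));

    /** Classifies a failed commit by numeric error code, then by message as a fallback. */
    static Decision classify(int errorCode, String message) {
        if (RETRYABLE.contains(errorCode)) {
            return Decision.RETRY;
        }
        if (PERMANENT.contains(errorCode)) {
            return Decision.FAIL;
        }
        // Unknown code: fall back to a message heuristic (keywords are illustrative only).
        String text = message == null ? "" : message.toLowerCase(Locale.ROOT);
        boolean looksTransient = text.contains("timeout") || text.contains("connection");
        return looksTransient ? Decision.RETRY : Decision.FAIL;
    }

    public static void main(String[] args) {
        System.out.println(classify(241, ""));                // RETRY
        System.out.println(classify(60, ""));                 // FAIL
        System.out.println(classify(12345, "Read timeout"));  // RETRY
    }
}
```

Checking the explicit code sets first keeps the message heuristic from overriding a code that is already known to be permanent.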