[server] Add threadsafe mode to venice-server which adjusts message processing order #910
base: main
Conversation
Thanks for the change! Overall this looks great; I left some comments, especially about the transient record cache (TR).
Thanks for the update. Now I am not sure I totally follow the logic, as I posted my concern in one of the replies... we should probably chat a bit offline to walk through the code change.
Also, we probably need some test coverage to prove that the old mode can fail in the race condition case and that the new mode is correct, even if my concern eventually proves invalid...
@@ -127,6 +127,7 @@ public void setUp() {
     Properties serverProperties = new Properties();
     serverProperties.setProperty(ConfigKeys.SERVER_PROMOTION_TO_LEADER_REPLICA_DELAY_SECONDS, Long.toString(1));
     serverProperties.put(ROCKSDB_PLAIN_TABLE_FORMAT_ENABLED, false);
+    serverProperties.put(SERVER_INGESTION_TASK_THREAD_SAFE_MODE, false);
Since this mode defaults to false, can we run both the true and false settings for some of the more sophisticated AAWC tests?
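For illustration, here is a rough sketch of how such tests could run under both settings, assuming TestNG as in the existing suite; the class name and the config key string below are hypothetical stand-ins:

import java.util.Properties;
import org.testng.annotations.DataProvider;
import org.testng.annotations.Test;

public class ActiveActiveWriteComputeModeTest {
  /** Runs each annotated test once with the new mode enabled and once with it disabled. */
  @DataProvider(name = "threadSafeMode")
  public Object[][] threadSafeMode() {
    return new Object[][] { { true }, { false } };
  }

  @Test(dataProvider = "threadSafeMode")
  public void testActiveActiveWriteCompute(boolean threadSafeModeEnabled) {
    Properties serverProperties = new Properties();
    // "server.ingestion.task.thread.safe.mode" stands in for the real SERVER_INGESTION_TASK_THREAD_SAFE_MODE key.
    serverProperties.put("server.ingestion.task.thread.safe.mode", threadSafeModeEnabled);
    // ... spin up the cluster with serverProperties and run the AAWC assertions ...
  }
}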
@@ -2834,6 +2885,14 @@ private int internalProcessConsumerRecord(
         LOGGER.error("Failed to record Record heartbeat with message: ", e);
       }
     } else {
+      // TODO: This is a hack. Today the code kind of does a backdoor change to a key in the leaderProducer callback.
Actually, I don't really follow this hack, can you explain a bit? I remember the issue being that the leader does not chunk the record before it is produced to VT, and it seems like this does not solve that issue? So the leader and the follower will still have different views?
Thanks for working on this! I wish I had reviewed it sooner, but better late than never, I guess... my comments are pretty minor, I think... but I have a few questions as well.
I don't see which integration tests are disabled. Is this a stale part of the commit message / PR body? Let's delete that part if it is stale...
Thanks again!
@@ -453,6 +454,8 @@ public class VeniceServerConfig extends VeniceClusterConfig {
   private final int ingestionTaskMaxIdleCount;
+
+  private final boolean threadSafeMode;
We should probably find another name for this functionality, since technically the old code is also supposed to be threadsafe... it's just that the new code is intended to make it easier to maintain thread-safety (less likely to introduce concurrency bugs)...
I guess the most significant functional change with this mode is that the leader persists changes locally prior to writing to Kafka, and as a result the TransientRecordCache becomes unnecessary. Perhaps a name along those lines might be more clear?
How about: leaderPersistsLocallyBeforeProducingToVT / leader.persists.locally.before.producing.to.vt
It's a bit of a mouthful, but seems more clear... a more concise version might be "Leader Persists Before Producing", and in day-to-day operations we might end up calling it "LPBP", for short. IDK, I'm just riffing at this point 😂
Open to other suggestions too, of course.
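Purely as a sketch of the suggested rename (not the actual VeniceServerConfig code; the key string, class name, and getter below are made up):

import java.util.Properties;

public class LeaderPersistenceConfigSketch {
  // Hypothetical config key following the suggested naming.
  public static final String LEADER_PERSISTS_LOCALLY_BEFORE_PRODUCING_TO_VT =
      "leader.persists.locally.before.producing.to.vt";

  private final boolean leaderPersistsLocallyBeforeProducingToVT;

  public LeaderPersistenceConfigSketch(Properties props) {
    // Defaults to false so existing deployments keep the current processing order.
    this.leaderPersistsLocallyBeforeProducingToVT =
        Boolean.parseBoolean(props.getProperty(LEADER_PERSISTS_LOCALLY_BEFORE_PRODUCING_TO_VT, "false"));
  }

  public boolean isLeaderPersistsLocallyBeforeProducingToVT() {
    return leaderPersistsLocallyBeforeProducingToVT;
  }
}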
-   * The below flow must be executed in a critical session for the same key:
+   * The below flow must be executed in a critical section for the same key:
Thanks 😄 ...
  if (this.syncOffsetsOnlyAfterProducing && !chunkDeprecation) {
    // sync offsets
    ingestionTask
        .maybeSyncOffsets(consumedRecord, leaderProducedRecordContext, partitionConsumptionState, subPartition);
Does the PCS instance passed here have a chance of having been modified by another thread (the processing thread)? I imagine we don't clone the PCS to make it immutable, so I wonder whether what we would be checkpointing here is guaranteed to represent the correct state, up to what was just produced, rather than up to what has been consumed by another thread?
That's a good catch, yeah it could happen.
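One way to address this, sketched with hypothetical names rather than the PR's actual code, is to capture an immutable snapshot of the offsets on the processing thread at produce time and checkpoint from that snapshot in the producer callback, so concurrent progress on the live PCS cannot leak into the checkpoint:

/** Immutable view of the offsets that were current when the record was handed to the producer. */
final class OffsetSnapshot {
  private final long localVersionTopicOffset;
  private final long upstreamOffset;

  OffsetSnapshot(long localVersionTopicOffset, long upstreamOffset) {
    this.localVersionTopicOffset = localVersionTopicOffset;
    this.upstreamOffset = upstreamOffset;
  }

  long getLocalVersionTopicOffset() {
    return localVersionTopicOffset;
  }

  long getUpstreamOffset() {
    return upstreamOffset;
  }
}

// Processing thread, before handing off to the producer (the PCS accessor names here are hypothetical):
// OffsetSnapshot snapshot = new OffsetSnapshot(pcs.getLatestProcessedLocalOffset(), pcs.getLatestProcessedUpstreamOffset());
// Producer callback thread: checkpoint from `snapshot` instead of re-reading the live, mutable PCS.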
   /**
-   * The consumer record has been put into drainer queue; the following cases will result in putting to drainer directly:
+   * The consumer record needs to be put into drainer queue; the following cases will result in putting to drainer directly:
We should delete the O/B/O reference below, right?
-import static com.linkedin.venice.schema.rmd.v1.CollectionRmdTimestamp.PUT_ONLY_PART_LENGTH_FIELD_POS;
-import static com.linkedin.venice.schema.rmd.v1.CollectionRmdTimestamp.TOP_LEVEL_COLO_ID_FIELD_POS;
-import static com.linkedin.venice.schema.rmd.v1.CollectionRmdTimestamp.TOP_LEVEL_TS_FIELD_POS;
Nitpick: why are we removing those? It makes the code a bit more verbose, and it adds trivially modified lines into the git blame, which inflates the size of an already complex PR... is there any benefit? Perhaps there is ambiguity between the constants coming from different classes? (In which case I'd be fine with keeping the change...)
-    final boolean newFieldCompletelyReplaceOldField = newFieldValue != oldFieldValue;
-    if (newFieldCompletelyReplaceOldField) {
+    if (newFieldValue != oldFieldValue) {
       oldRecord.put(oldRecordField.pos(), newFieldValue);
     }
-    return newFieldCompletelyReplaceOldField
-        ? UpdateResultStatus.COMPLETELY_UPDATED
-        : UpdateResultStatus.NOT_UPDATED_AT_ALL;
+    return UpdateResultStatus.COMPLETELY_UPDATED;
So I guess this is the change referenced in the commit message about:
DCR logic must change:
Since writes are persisted to rocksdb prior to producing to Kafka, we now must accommodate the possibility of leftover state on a leader. To address this, we add a new mode to the merge conflict resolution logic where, upon a perfect tie (on value and timestamp), we resolve to produce the repeated record. The intention is to be certain that a write which was persisted to rocksdb on the leader but not produced doesn't end up getting lost due to failing DCR.
I don't understand this, however... why would there be changes persisted in the leader but not in the VT, in cases of a perfect tie? If it is a perfect tie, then I assume the leader need not persist any changes either, right...?
-    setNewMapActiveElementAndTs(
+    setNewMapActiveElementAndTimestamp(
Nitpick... I would prefer trivial renames to be in a separate PR, to minimize the size of complex PRs... but up to you.
[server][WIP] Add threadsafe mode to venice-server which adjusts message processing order
This is an initial-phase PR. It is the minimal set of changes needed to add a mode where writes on the leader are committed to rocksdb prior to producing. This change in ordering has the following impacts:
Drainer is skipped on leaders:
In a later refactor it might be prudent to remove the drainer entirely. However, to best accommodate that, it would likely make sense to add some batching logic when flushing to rocksdb; we do not attempt that change in this PR.
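As a rough illustration of that batching idea (a sketch only; the PR does not implement this, and the real write path would go through Venice's storage engine abstraction rather than raw RocksDB):

import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.WriteBatch;
import org.rocksdb.WriteOptions;

final class BatchedFlusher {
  private final RocksDB db;
  private final WriteOptions writeOptions = new WriteOptions();
  private final WriteBatch batch = new WriteBatch();
  private final int maxBatchSize;

  BatchedFlusher(RocksDB db, int maxBatchSize) {
    this.db = db;
    this.maxBatchSize = maxBatchSize;
  }

  /** Buffer a key/value pair and flush once the batch is large enough. */
  synchronized void put(byte[] key, byte[] value) throws RocksDBException {
    batch.put(key, value);
    if (batch.count() >= maxBatchSize) {
      flush();
    }
  }

  /** Apply all buffered writes in a single RocksDB write and reset the batch. */
  synchronized void flush() throws RocksDBException {
    if (batch.count() > 0) {
      db.write(writeOptions, batch);
      batch.clear();
    }
  }
}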
DCR logic must change:
Since writes are persisted to rocksdb prior to producing to Kafka, we now must accommodate the possibility of leftover state on a leader. To address this, we add a new mode to the merge conflict resolution logic where, upon a perfect tie (on value and timestamp), we resolve to produce the repeated record. The intention is to be certain that a write which was persisted to rocksdb on the leader but not produced doesn't end up getting lost due to failing DCR.
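A minimal sketch of that tie-handling idea, with names and structure that are hypothetical rather than the actual MergeConflictResolver API: on a perfect tie of timestamp and value, the resolver can be configured to return the incoming record so it still gets produced, instead of treating it as a no-op.

/** Hypothetical tie-breaking policy for a timestamp/value tie. */
enum TieBreakPolicy {
  /** Existing behavior: a perfect tie is treated as "nothing to update", so nothing is produced. */
  DROP_ON_TIE,
  /** New mode: a perfect tie still resolves to the incoming record, so it gets (re)produced to the version topic. */
  PRODUCE_ON_TIE
}

final class TieAwareResolverSketch {
  private final TieBreakPolicy policy;

  TieAwareResolverSketch(TieBreakPolicy policy) {
    this.policy = policy;
  }

  /** Returns the value to produce, or null when the incoming write should be dropped. */
  byte[] resolve(byte[] oldValue, long oldTimestamp, byte[] newValue, long newTimestamp) {
    if (newTimestamp != oldTimestamp) {
      return newTimestamp > oldTimestamp ? newValue : null;
    }
    // Timestamps tie: compare the payloads deterministically (an assumption, not Venice's exact rule).
    int cmp = java.util.Arrays.compare(newValue, oldValue);
    if (cmp != 0) {
      return cmp > 0 ? newValue : null;
    }
    // Perfect tie on both timestamp and value: the record may already sit in rocksdb on the leader
    // without ever having been produced to Kafka, so PRODUCE_ON_TIE re-produces it instead of dropping it.
    return policy == TieBreakPolicy.PRODUCE_ON_TIE ? newValue : null;
  }
}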
Transient record cache is disabled
The transient record cache is disabled for ingestion tasks which enable this mode. This was itself one of the goals, but we should proceed with some validation. Most clusters in production see a pretty low hit rate on the transient record cache; however, there is at least one use case that gets as high as a 20% hit rate. Theoretically, we may be able to avoid taking too much of a hit here since the memory savings can be given to the rocksdb cache, but this needs vetting. If this doesn't work, then we will need to replace the transient record cache with a simple size/time-based cache.
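If a replacement does become necessary, a size/time-bounded cache is easy to sketch; Caffeine is used here purely as an example, and the bounds are placeholders that would need the vetting described above:

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import java.nio.ByteBuffer;
import java.time.Duration;

final class BoundedRecordCache {
  // Placeholder bounds; real values would come from measured hit rates and the available memory budget.
  private final Cache<ByteBuffer, byte[]> cache = Caffeine.newBuilder()
      .maximumSize(100_000)
      .expireAfterWrite(Duration.ofSeconds(30))
      .build();

  byte[] get(byte[] key) {
    return cache.getIfPresent(ByteBuffer.wrap(key));
  }

  void put(byte[] key, byte[] value) {
    // ByteBuffer provides content-based equals/hashCode, so raw byte[] keys dedupe correctly.
    cache.put(ByteBuffer.wrap(key), value);
  }
}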
There are also some cleanups here and there: getting rid of some code paths that we no longer need and cleaning up others.
NOTE: Integration tests haven't been completely added to this PR yet. Part of that is because, while switching some of the existing integration tests to this mode, some tests started failing. This needs more diagnosis; hence the WIP tag.
Resolves #XXX
How was this PR tested?
Does this PR introduce any user-facing changes?