Test fixes 3 #3976

ifesdjeen · 2025-03-12T19:43:15Z

ifesdjeen · 2025-03-12T19:46:43Z

src/java/org/apache/cassandra/config/Config.java

@@ -185,7 +185,7 @@ public static Set<String> splitCommaDelimited(String src)
    public volatile DurationSpec.IntMillisecondsBound cms_default_retry_backoff = null;
    @Deprecated(since="5.1")
    public volatile DurationSpec.IntMillisecondsBound cms_default_max_retry_backoff = null;
-    public String cms_retry_delay = "0 <= 50ms*1*attempts <= 10s,retries=10";
+    public String cms_retry_delay = "0 <= 50ms*1*attempts <= 1s,retries=10";


InProgressSequenceCoordinationTest started flaking, so I reverted the maximum retry delay to a value closer to the previous behavior.

ifesdjeen · 2025-03-12T19:47:48Z

src/java/org/apache/cassandra/service/accord/AccordConfigurationService.java

-            }
+
+            // Fetching only one epoch here since later epochs might have already been requested concurrently
+            FetchTopologies.fetch(SharedContext.Global.instance, peers, epoch, epoch)


This just makes the fetch asynchronous. I think it should have neither a positive nor a negative impact on tests, but the blocking get was unnerving, and we would often get stuck here on shutdown.
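
To illustrate the difference described above, here is a minimal sketch using plain `CompletableFuture`. It does not use the real `FetchTopologies` API; `fetchTopology`, `fetchBlocking`, and `fetchAsync` are hypothetical names introduced only for this example.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;

class AsyncFetchSketch {
    // Hypothetical stand-in for a topology fetch that completes on another thread.
    static CompletableFuture<String> fetchTopology(long epoch) {
        return CompletableFuture.supplyAsync(() -> "topology@" + epoch);
    }

    // Blocking style: the calling thread parks until the result arrives.
    // If the result never arrives (for example during shutdown), the caller stays stuck.
    static String fetchBlocking(long epoch) {
        return fetchTopology(epoch).join();
    }

    // Asynchronous style: register a callback and return immediately.
    // Failures are reported instead of being rethrown on the calling thread.
    static CompletableFuture<Void> fetchAsync(long epoch, Consumer<String> onTopology) {
        return fetchTopology(epoch).<Void>handle((topology, failure) -> {
            if (failure != null)
                System.err.println("fetch failed for epoch " + epoch + ": " + failure);
            else
                onTopology.accept(topology);
            return null;
        });
    }

    public static void main(String[] args) {
        System.out.println(fetchBlocking(1));
        fetchAsync(2, System.out::println).join(); // join here only so the demo does not exit early
    }
}
```

In the asynchronous variant the caller holds no thread while the fetch is outstanding, so a shutdown does not have to wait on it.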

ifesdjeen · 2025-03-12T19:48:05Z

src/java/org/apache/cassandra/service/accord/AccordVerbHandler.java

-                    request.process(node, fromNodeId, message.header);
-                });
-            }
+            node.withEpoch(waitForEpoch, (ignored, withEpochFailure) -> {


The CMS catch-up was unnecessary here, for the reason given in the TODO comment above.

ifesdjeen · 2025-03-12T19:48:34Z

src/java/org/apache/cassandra/service/accord/CommandsForRanges.java

@@ -102,22 +104,34 @@ private Loader newLoader(Unseekables<?> searchKeysOrRanges, RedundantBefore redu
            return new Loader(this, searchKeysOrRanges, redundantBefore, testKind, minTxnId, maxTxnId, findAsDep);
        }

+        private void updateTransitive(UnaryOperator<NavigableMap<TxnId, Ranges>> update)


This one fixes concurrent access to the transitive map.
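
One common way to make such a map safe for concurrent readers is copy-on-write: readers iterate an immutable snapshot while writers swap in a new map. The sketch below shows that general technique only; it is not the code from this patch. `Long` and `String` stand in for `TxnId` and `Ranges`, and the class and method names other than `updateTransitive` are hypothetical.

```java
import java.util.Collections;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.UnaryOperator;

class TransitiveMapSketch {
    // Readers only ever see a snapshot that is never mutated after publication.
    private final AtomicReference<NavigableMap<Long, String>> transitive =
        new AtomicReference<>(Collections.unmodifiableNavigableMap(new TreeMap<>()));

    // The update function must not mutate its input; it returns the map to publish.
    // updateAndGet retries the function if another writer published in the meantime.
    void updateTransitive(UnaryOperator<NavigableMap<Long, String>> update) {
        transitive.updateAndGet(update);
    }

    NavigableMap<Long, String> snapshot() {
        return transitive.get();
    }

    void put(long txnId, String ranges) {
        updateTransitive(current -> {
            NavigableMap<Long, String> next = new TreeMap<>(current); // copy, then edit the copy
            next.put(txnId, ranges);
            return Collections.unmodifiableNavigableMap(next);
        });
    }

    public static void main(String[] args) {
        TransitiveMapSketch sketch = new TransitiveMapSketch();
        sketch.put(1L, "ranges-a");
        // Iterating a snapshot is safe even while the map is being updated.
        for (Long txnId : sketch.snapshot().keySet())
            sketch.put(txnId + 1, "ranges-b");
        System.out.println(sketch.snapshot());
    }
}
```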

ifesdjeen · 2025-03-12T19:49:14Z

test/distributed/org/apache/cassandra/distributed/impl/AbstractCluster.java

@@ -1149,7 +1149,7 @@ public void close()
                           .collect(Collectors.toList());
        try
        {
-            FBUtilities.waitOnFutures(futures, 1L, TimeUnit.MINUTES);
+            FBUtilities.waitOnFutures(futures, instances.size(), TimeUnit.MINUTES);


It looks like on CI we sometimes do not manage to shut down within 1 minute after tests that have a lot of background work scheduled. We can probably get better with interrupts, but in the meantime this extends the wait.

What examples are you referring to? A cluster with 12 nodes having 12 minutes to shut down feels very excessive. Even a 3-node cluster having 3 minutes takes time away from other tests running in CI.
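
For reference, the trade-off in this thread is about the total budget a close call may consume. The sketch below shows one way to wait on several futures under a single shared deadline. It is not the implementation of `FBUtilities.waitOnFutures`, whose internals are not shown here; `awaitAll` is a hypothetical helper.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class ShutdownWaitSketch {
    // Hypothetical helper: every future shares one deadline, so the total wait
    // never exceeds the given timeout regardless of how many futures there are.
    static void awaitAll(List<? extends Future<?>> futures, long timeout, TimeUnit unit)
            throws InterruptedException, ExecutionException, TimeoutException {
        long deadline = System.nanoTime() + unit.toNanos(timeout);
        for (Future<?> future : futures) {
            long remaining = deadline - System.nanoTime();
            // With no time left, only already-completed futures succeed.
            future.get(Math.max(0, remaining), TimeUnit.NANOSECONDS);
        }
    }

    public static void main(String[] args) throws Exception {
        int instances = 12;
        // Fixed budget (before the change) versus one minute per instance (after it).
        System.out.println("fixed budget: 1 minute, per-instance budget: " + instances + " minutes");
        awaitAll(List.of(CompletableFuture.completedFuture("stopped")), 1, TimeUnit.SECONDS);
    }
}
```

Under this reading, scaling the timeout with `instances.size()` raises the worst case linearly with cluster size, which is the cost the reply above points at.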

belliottsmith · 2025-03-12T19:58:20Z

src/java/org/apache/cassandra/service/accord/CommandsForRanges.java

-                e.setValue(newRanges);
-            }
+            updateTransitive(transitive -> {
+                NavigableMap<TxnId, Ranges> next = new TreeMap<>();


I think it might be nice to return the original map if there are no changes. That is, keep this null until we need to edit, then copy in everything before that point and apply updates from there onwards.

pushed update/fix
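
A minimal sketch of the lazy-copy idea suggested above, assuming the update only rewrites values and never removes entries. `rewriteValues` is a hypothetical helper, not the code that was pushed.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.Objects;
import java.util.TreeMap;
import java.util.function.UnaryOperator;

class LazyCopySketch {
    // Applies `rewrite` to every value. A new map is allocated only once a value
    // actually changes; if nothing changes, the original instance is returned.
    static <K, V> NavigableMap<K, V> rewriteValues(NavigableMap<K, V> original, UnaryOperator<V> rewrite) {
        NavigableMap<K, V> next = null; // stays null until the first edit
        for (Map.Entry<K, V> entry : original.entrySet()) {
            V updated = rewrite.apply(entry.getValue());
            if (next == null) {
                if (Objects.equals(updated, entry.getValue()))
                    continue; // nothing changed so far, keep scanning
                // First change: copy every entry strictly before this key.
                next = new TreeMap<>(original.headMap(entry.getKey(), false));
            }
            next.put(entry.getKey(), updated);
        }
        return next == null ? original : next;
    }

    public static void main(String[] args) {
        NavigableMap<Integer, String> map = new TreeMap<>();
        map.put(1, "a");
        map.put(2, "b");
        map.put(3, "c");
        System.out.println(rewriteValues(map, v -> v) == map);
        System.out.println(rewriteValues(map, v -> v.equals("b") ? "B" : v));
    }
}
```

Returning the same instance when nothing changed lets a compare-and-set style caller skip publishing a new map entirely.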

- SetShardDurable should correctly set DurableBefore Majority/Universal based on the Durability parameter
- Partial compaction should update records in place to ensure truncation of discontiguous compactions do not lead to an incorrect field version being used
- Journal compaction should not rewrite fields shadowed by a newer record
- avoid String.format in Compactor hot path
- avoid string concatenation on hot path; improve segment compactor partition build efficiency
- Make sure to actually call compaction iterator
- Fix %s placeholder

Failed on seed 0x6bea128ae851724b

org.apache.cassandra.simulator.SimulationException: Failed on seed 0x6bea128ae851724b
Caused by: java.lang.AssertionError: Saw errors in node3: Unexpected exception:
ERROR [AccordExecutor[0,8]:1] node3 2025-03-10 13:11:53,851 Uncaught accord exception
java.util.ConcurrentModificationException: null
    at java.base/java.util.TreeMap$NavigableSubMap$SubMapIterator.nextEntry(TreeMap.java:1700)
    at java.base/java.util.TreeMap$NavigableSubMap$SubMapEntryIterator.next(TreeMap.java:1748)
    at java.base/java.util.TreeMap$NavigableSubMap$SubMapEntryIterator.next(TreeMap.java:1742)
    at org.apache.cassandra.service.accord.CommandsForRanges$Loader.intersects(CommandsForRanges.java:149)
    at org.apache.cassandra.service.accord.AccordTask$RangeTxnScanner.runInternal(AccordTask.java:1065)
    at org.apache.cassandra.service.accord.AccordTask$RangeTxnAndKeyScanner.runInternal(AccordTask.java:961)
    at org.apache.cassandra.service.accord.AccordTask$RangeTxnScanner.run(AccordTask.java:1053)
    at org.apache.cassandra.service.accord.AccordExecutor$PlainRunnable.run(AccordExecutor.java:1074)
    at org.apache.cassandra.service.accord.AccordExecutorAbstractLockLoop.runWithoutLock(AccordExecutorAbstractLockLoop.java:249)
    at org.apache.cassandra.concurrent.InfiniteLoopExecutor.loop(InfiniteLoopExecutor.java:125)
    at org.apache.cassandra.simulator.systems.InterceptedExecution$InterceptedThreadStart.run(InterceptedExecution.java:216)
    at
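
The exception in the trace above comes from the fail-fast iterator of a `TreeMap` sub-map view. The sketch below reproduces the same exception deterministically on a single thread, as a simplified stand-in for the concurrent case in the report: a structural change to the backing map while a sub-map is being iterated. The class and method names are hypothetical.

```java
import java.util.ConcurrentModificationException;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

class SubMapCmeSketch {
    // Returns true if iterating a sub-map view fails after the backing map is modified.
    static boolean iterateWhileModifying() {
        NavigableMap<Integer, String> map = new TreeMap<>();
        for (int i = 0; i < 10; i++)
            map.put(i, "v" + i);
        try {
            for (Map.Entry<Integer, String> entry : map.subMap(2, true, 8, true).entrySet()) {
                // Inserting a new key is a structural modification of the backing map;
                // the sub-map iterator detects it on its next call to next().
                map.put(100 + entry.getKey(), "new");
            }
            return false;
        } catch (ConcurrentModificationException expected) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("CME thrown: " + iterateWhileModifying());
    }
}
```

Note that `Map.Entry.setValue` on an existing key is not a structural modification; only insertions and removals invalidate the iterator.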

dcapwell · 2025-03-14T16:42:37Z

src/java/org/apache/cassandra/journal/Params.java

@@ -58,5 +60,5 @@ enum FailurePolicy { STOP, STOP_JOURNAL, IGNORE, DIE }
    /**
     * @return user provided version to use for key and value serialization
     */
-    int userVersion();
+    Version userVersion();


this is the journal package, but you switch to the accord version? How does this impact CEP-45 (mutation tracking)? They won't be using the accord version

this seems to have got pulled into my branch somehow, not sure if this one is my bad. Either way, good catch, I'll revert.

dcapwell · 2025-03-14T16:50:58Z

src/java/org/apache/cassandra/service/accord/AccordJournalValueSerializers.java

 import static accord.local.CommandStores.RangesForEpoch;

 // TODO (required): test with large collection values, and perhaps split out some fields if they have a tendency to grow larger
 // TODO (required): alert on metadata size
 // TODO (required): versioning
 public class AccordJournalValueSerializers
 {
-    public interface FlyweightSerializer<ENTRY, IMAGE>
+    private static final int messagingVersion = MessagingService.VERSION_40;


why add this back? This isn't the version accord journal uses

It's private and I don't see anything in this class using it, so is it dead code?

This looks to be my mistake

dcapwell · 2025-03-14T16:59:44Z

test/unit/org/apache/cassandra/io/sstable/CQLSSTableWriterConcurrencyTest.java

+        int count = errorCount.get();
+        assertThat(count).isEqualTo(0).describedAs(new Description()
+        {
+            public String value()


Suggested change

-            public String value()
+            @Override
+            public String value()

dcapwell · 2025-03-14T17:01:20Z

test/unit/org/apache/cassandra/io/sstable/CQLSSTableWriterConcurrencyTest.java

@@ -70,6 +72,7 @@ public void testConcurrentSchemaModification() throws InterruptedException, IOEx
        File[] dataDirs = new File[nThreads];
        String baseDataDir = tempFolder.newFolder().getAbsolutePath();

+        AtomicReference<String> errors = new AtomicReference<>("");


What I don't get: this is passing on trunk but failing constantly on accord. What did accord change to make this unstable?
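
For context on the `errors` reference added in this hunk, here is a minimal sketch of collecting failure messages from worker threads so that a final assertion can print them. It is not the test's actual code; the worker body and the `runWorkers` helper are hypothetical, and failures are simulated.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

class ErrorCollectionSketch {
    // Runs nThreads workers; each failure bumps errorCount and appends its message.
    static String runWorkers(int nThreads, AtomicInteger errorCount) throws InterruptedException {
        AtomicReference<String> errors = new AtomicReference<>("");
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        for (int i = 0; i < nThreads; i++) {
            final int id = i;
            pool.execute(() -> {
                try {
                    if (id % 2 == 1) // simulated failure in odd-numbered workers
                        throw new IllegalStateException("worker " + id + " failed");
                } catch (Throwable t) {
                    errorCount.incrementAndGet();
                    // Atomically append, so messages from different threads are not lost.
                    errors.accumulateAndGet(t + "\n", String::concat);
                }
            });
        }
        pool.shutdown();
        if (!pool.awaitTermination(10, TimeUnit.SECONDS))
            throw new IllegalStateException("workers did not finish in time");
        return errors.get();
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicInteger errorCount = new AtomicInteger();
        String errors = runWorkers(4, errorCount);
        System.out.println(errorCount.get() + " error(s):\n" + errors);
    }
}
```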

dcapwell · 2025-03-14T17:34:30Z

src/java/org/apache/cassandra/service/accord/AccordJournal.java

    Node node;

    enum Status { INITIALIZED, STARTING, REPLAY, STARTED, TERMINATING, TERMINATED }
    private volatile Status status = Status.INITIALIZED;

-    public AccordJournal(Params params, AccordAgent agent)


It took a while, but it did turn out that the agent is dead code; there is a lot of code to clean up to remove that link.

ifesdjeen commented Mar 12, 2025

belliottsmith reviewed Mar 12, 2025

belliottsmith and others added 12 commits March 14, 2025 14:01

fix CommandChangeTest

81f2bff

Fix accord iteration test

44ad26c

Fix FailedAckTest

085057a

Avoid TCM epoch catchup in AccordVerbHandler

4991cc3

Increase wait time during closing to avoid Unterminated threads

d4aa006

Increase timeouts, improve test stability

a32d7c6

More descriptive output from CQL test

59a4237

Shorten max CMS delay

bb0d224

Improve future handling in config service

5083d63

Address Benedict's comment

b46fcc3

ifesdjeen force-pushed the test-fixes-3 branch from 4a77aad to b46fcc3 Compare March 14, 2025 15:25

dcapwell reviewed Mar 14, 2025
