Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test fixes 3 #3976

Open
wants to merge 12 commits into
base: cep-15-accord
Choose a base branch
from
Open

Test fixes 3 #3976

wants to merge 12 commits into from

Conversation

ifesdjeen
Copy link
Contributor

Thanks for sending a pull request! Here are some tips if you're new here:

  • Ensure you have added or run the appropriate tests for your PR.
  • Be sure to keep the PR description updated to reflect all changes.
  • Write your PR title to summarize what this PR proposes.
  • If possible, provide a concise example to reproduce the issue for a faster review.
  • Read our contributor guidelines
  • If you're making a documentation change, see our guide to documentation contribution

Commit messages should follow the following format:

<One sentence description, usually Jira title or CHANGES.txt summary>

<Optional lengthier description (context on patch)>

patch by <Authors>; reviewed by <Reviewers> for CASSANDRA-#####

Co-authored-by: Name1 <email1>
Co-authored-by: Name2 <email2>

The Cassandra Jira

@@ -185,7 +185,7 @@ public static Set<String> splitCommaDelimited(String src)
public volatile DurationSpec.IntMillisecondsBound cms_default_retry_backoff = null;
@Deprecated(since="5.1")
public volatile DurationSpec.IntMillisecondsBound cms_default_max_retry_backoff = null;
public String cms_retry_delay = "0 <= 50ms*1*attempts <= 10s,retries=10";
public String cms_retry_delay = "0 <= 50ms*1*attempts <= 1s,retries=10";
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

InProgressSequenceCoordinationTest started flaking, so I reverted to behavior that was closer to previous

}

// Fetching only one epoch here since later epochs might have already been requested concurrently
FetchTopologies.fetch(SharedContext.Global.instance, peers, epoch, epoch)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just making fetch asynchronous; I think it should have no positive or negative impact on tests, but blocking get was unnerving, and we would get stuck here often on shutdown.

request.process(node, fromNodeId, message.header);
});
}
node.withEpoch(waitForEpoch, (ignored, withEpochFailure) -> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CMS catchup was unnecessary here due to comment in the TODO above.

@@ -102,22 +104,34 @@ private Loader newLoader(Unseekables<?> searchKeysOrRanges, RedundantBefore redu
return new Loader(this, searchKeysOrRanges, redundantBefore, testKind, minTxnId, maxTxnId, findAsDep);
}

private void updateTransitive(UnaryOperator<NavigableMap<TxnId, Ranges>> update)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one fixes transitive map concurrent access

@@ -1149,7 +1149,7 @@ public void close()
.collect(Collectors.toList());
try
{
FBUtilities.waitOnFutures(futures, 1L, TimeUnit.MINUTES);
FBUtilities.waitOnFutures(futures, instances.size(), TimeUnit.MINUTES);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like on CI, we sometimes do not manage to shutdown within 1 minute after tests that have a lot of background work scheduled. We can probably become better with interrupts, but meanwhile.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what examples are you referring to? a cluster with 12 nodes having 12 minutes to shutdown feels very excessive. Even 3 node cluster having 3m takes away time from other tests to run in CI

e.setValue(newRanges);
}
updateTransitive(transitive -> {
NavigableMap<TxnId, Ranges> next = new TreeMap<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it might be nice to return the original map if there's no changes. That is, keep this null until we need to edit, then insert everything before-hand and update from afterwards.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pushed update/fix

belliottsmith and others added 12 commits March 14, 2025 14:01
 - SetShardDurable should correctly set DurableBefore Majority/Universal based on the Durability parameter
 - Partial compaction should update records in place to ensure truncation of discontiguous compactions do not lead to an incorrect field version being used
 - Journal compaction should not rewrite fields shadowed by a newer record
 - avoid String.format in Compactor hot path
 - avoid string concatenation on hot path; improve segment compactor partition build efficiency
 - Make sure to actually call compaction iterator
 - Fix %s placeholder
Failed on seed 0x6bea128ae851724b-org.apache.cassandra.simulator.SimulationException: Failed on seed 0x6bea128ae851724b Caused by: java.lang.AssertionError: Saw errors in node3: Unexpected exception: ERROR [AccordExecutor[0,8]:1] node3 2025-03-10 13:11:53,851 Uncaught accord exception java.util.ConcurrentModificationException: null at java.base/java.util.TreeMap$NavigableSubMap$SubMapIterator.nextEntry(TreeMap.java:1700) at java.base/java.util.TreeMap$NavigableSubMap$SubMapEntryIterator.next(TreeMap.java:1748) at java.base/java.util.TreeMap$NavigableSubMap$SubMapEntryIterator.next(TreeMap.java:1742) at org.apache.cassandra.service.accord.CommandsForRanges$Loader.intersects(CommandsForRanges.java:149) at org.apache.cassandra.service.accord.AccordTask$RangeTxnScanner.runInternal(AccordTask.java:1065) at org.apache.cassandra.service.accord.AccordTask$RangeTxnAndKeyScanner.runInternal(AccordTask.java:961) at org.apache.cassandra.service.accord.AccordTask$RangeTxnScanner.run(AccordTask.java:1053) at org.apache.cassandra.service.accord.AccordExecutor$PlainRunnable.run(AccordExecutor.java:1074) at org.apache.cassandra.service.accord.AccordExecutorAbstractLockLoop.runWithoutLock(AccordExecutorAbstractLockLoop.java:249) at org.apache.cassandra.concurrent.InfiniteLoopExecutor.loop(InfiniteLoopExecutor.java:125) at org.apache.cassandra.simulator.systems.InterceptedExecution$InterceptedThreadStart.run(InterceptedExecution.java:216) at
@@ -58,5 +60,5 @@ enum FailurePolicy { STOP, STOP_JOURNAL, IGNORE, DIE }
/**
* @return user provided version to use for key and value serialization
*/
int userVersion();
Version userVersion();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the journal package, but you switch to the accord version? How does this impact CEP-45 (mutation tracking)? They won't be using the accord version

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems to have got pulled into my branch somehow, not sure if this one is my bad. Either way, good catch, I'll revert.

import static accord.local.CommandStores.RangesForEpoch;

// TODO (required): test with large collection values, and perhaps split out some fields if they have a tendency to grow larger
// TODO (required): alert on metadata size
// TODO (required): versioning
public class AccordJournalValueSerializers
{
public interface FlyweightSerializer<ENTRY, IMAGE>
private static final int messagingVersion = MessagingService.VERSION_40;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why add this back? This isn't the version accord journal uses

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its private and don't see anything in this class use it, so its dead code?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks to be my mistake

int count = errorCount.get();
assertThat(count).isEqualTo(0).describedAs(new Description()
{
public String value()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
public String value()
@Override
public String value()

@@ -70,6 +72,7 @@ public void testConcurrentSchemaModification() throws InterruptedException, IOEx
File[] dataDirs = new File[nThreads];
String baseDataDir = tempFolder.newFolder().getAbsolutePath();

AtomicReference<String> errors = new AtomicReference<>("");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what i don't get, this is passing on trunk but failing constantly on accord... what did accord change to make this unstable?

Node node;

enum Status { INITIALIZED, STARTING, REPLAY, STARTED, TERMINATING, TERMINATED }
private volatile Status status = Status.INITIALIZED;

public AccordJournal(Params params, AccordAgent agent)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

took awhile but it did turn out the agent is dead code; tons of code to cleanup to remove that link

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants