Test fixes 3 #3976
base: cep-15-accord
Conversation
@@ -185,7 +185,7 @@ public static Set<String> splitCommaDelimited(String src)
public volatile DurationSpec.IntMillisecondsBound cms_default_retry_backoff = null;
@Deprecated(since="5.1")
public volatile DurationSpec.IntMillisecondsBound cms_default_max_retry_backoff = null;
-public String cms_retry_delay = "0 <= 50ms*1*attempts <= 10s,retries=10";
+public String cms_retry_delay = "0 <= 50ms*1*attempts <= 1s,retries=10";
InProgressSequenceCoordinationTest started flaking, so I reverted to behavior closer to the previous default.
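The reverted default caps the linear backoff at 1s instead of 10s. A minimal sketch of how a bounded spec like `0 <= 50ms*1*attempts <= 1s,retries=10` could be evaluated; all names here are illustrative, not Cassandra's actual retry-spec parser:

```java
// Hypothetical evaluation of a bounded linear retry-backoff spec.
// Constants mirror "0 <= 50ms*1*attempts <= 1s,retries=10"; this is a
// sketch, not the real cms_retry_delay implementation.
public class RetryDelaySketch
{
    static final long MIN_MS = 0;       // lower clamp
    static final long STEP_MS = 50;     // per-attempt step
    static final long MULTIPLIER = 1;   // linear multiplier
    static final long MAX_MS = 1000;    // upper clamp (1s)
    static final int MAX_RETRIES = 10;

    // Linear backoff clamped to [MIN_MS, MAX_MS]
    static long delayMillis(int attempt)
    {
        long raw = STEP_MS * MULTIPLIER * attempt;
        return Math.max(MIN_MS, Math.min(raw, MAX_MS));
    }

    public static void main(String[] args)
    {
        for (int attempt = 1; attempt <= MAX_RETRIES; attempt++)
            System.out.println("attempt " + attempt + " -> " + delayMillis(attempt) + "ms");
    }
}
```

With the old 10s cap, 10 attempts of 50ms steps never reach the ceiling; with a 1s cap the delay saturates after attempt 20, which keeps total retry time close to the earlier behavior.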
}

// Fetching only one epoch here since later epochs might have already been requested concurrently
FetchTopologies.fetch(SharedContext.Global.instance, peers, epoch, epoch)
This is just making fetch asynchronous; I think it should have no positive or negative impact on tests, but the blocking get was unnerving, and we would often get stuck here on shutdown.
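The shape of that change can be sketched with a CompletableFuture: instead of blocking on the fetch result, attach a completion callback so a slow or dead peer cannot wedge shutdown. `fetchTopologies` below is a hypothetical stand-in for FetchTopologies.fetch, not its real signature:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Sketch: blocking get() vs. asynchronous completion callback.
public class AsyncFetchSketch
{
    // Stand-in for an async network fetch of epochs [minEpoch, maxEpoch]
    static CompletableFuture<String> fetchTopologies(List<String> peers, long minEpoch, long maxEpoch)
    {
        return CompletableFuture.supplyAsync(
            () -> "topologies " + minEpoch + ".." + maxEpoch + " from " + peers);
    }

    public static void main(String[] args)
    {
        CompletableFuture<String> fetch = fetchTopologies(List.of("node1", "node2"), 5, 5);

        // Blocking style (roughly what the old code did):
        //   String result = fetch.get();  // can hang shutdown if peers never reply
        // Asynchronous style: react when the result arrives, fail-soft on error.
        fetch.whenComplete((result, failure) -> {
            if (failure != null)
                System.err.println("fetch failed: " + failure);
            else
                System.out.println(result);
        });

        fetch.join(); // only so this demo exits after the callback has run
    }
}
```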
request.process(node, fromNodeId, message.header);
});
}
node.withEpoch(waitForEpoch, (ignored, withEpochFailure) -> {
CMS catchup was unnecessary here, per the TODO comment above.
@@ -102,22 +104,34 @@ private Loader newLoader(Unseekables<?> searchKeysOrRanges, RedundantBefore redu
return new Loader(this, searchKeysOrRanges, redundantBefore, testKind, minTxnId, maxTxnId, findAsDep);
}

private void updateTransitive(UnaryOperator<NavigableMap<TxnId, Ranges>> update)
This one fixes concurrent access to the transitive map.
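The pattern behind `updateTransitive` can be sketched as copy-on-write: all writers funnel through a single synchronized updater that swaps in a fresh snapshot, so readers iterating the old map never see it mutated underneath them. TxnId/Ranges are simplified to Long/String here; this is an illustration, not the CommandsForRanges code:

```java
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.function.UnaryOperator;

// Copy-on-write update of a volatile NavigableMap (simplified types).
public class TransitiveMapSketch
{
    private volatile NavigableMap<Long, String> transitive = new TreeMap<>();

    // Writers are serialized here; each publishes a brand-new snapshot.
    synchronized void updateTransitive(UnaryOperator<NavigableMap<Long, String>> update)
    {
        transitive = update.apply(transitive);
    }

    // Readers take the current snapshot and may iterate it safely.
    NavigableMap<Long, String> snapshot()
    {
        return transitive;
    }

    public static void main(String[] args)
    {
        TransitiveMapSketch s = new TransitiveMapSketch();
        s.updateTransitive(prev -> {
            // copy first, then mutate the copy; the live map is never touched
            NavigableMap<Long, String> next = new TreeMap<>(prev);
            next.put(1L, "ranges-for-txn-1");
            return next;
        });
        System.out.println(s.snapshot());
    }
}
```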
@@ -1149,7 +1149,7 @@ public void close()
.collect(Collectors.toList());
try
{
-    FBUtilities.waitOnFutures(futures, 1L, TimeUnit.MINUTES);
+    FBUtilities.waitOnFutures(futures, instances.size(), TimeUnit.MINUTES);
Looks like on CI we sometimes do not manage to shut down within 1 minute after tests that schedule a lot of background work. We can probably get better with interrupts, but this helps meanwhile.
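The change scales the overall shutdown wait with cluster size (one minute per instance) instead of a flat minute. A sketch of waiting on a list of futures under one shared deadline; `waitOnFutures` here is a stand-in for the idea, not FBUtilities' real implementation:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Sketch: one shared deadline for a whole list of shutdown futures.
public class ShutdownWaitSketch
{
    static void waitOnFutures(List<Future<?>> futures, long timeout, TimeUnit unit) throws Exception
    {
        long deadlineNanos = System.nanoTime() + unit.toNanos(timeout);
        for (Future<?> f : futures)
        {
            // every future draws from the same remaining budget
            long remaining = deadlineNanos - System.nanoTime();
            f.get(Math.max(remaining, 0), TimeUnit.NANOSECONDS);
        }
    }

    public static void main(String[] args) throws Exception
    {
        ExecutorService pool = Executors.newFixedThreadPool(3);
        List<Future<?>> futures = new ArrayList<>();
        for (int i = 0; i < 3; i++)
            futures.add(pool.submit(() -> {}));

        int instances = 3;
        // 1 minute per instance instead of 1 minute total
        waitOnFutures(futures, instances, TimeUnit.MINUTES);
        pool.shutdown();
        System.out.println("all instances shut down in time");
    }
}
```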
What examples are you referring to? A cluster with 12 nodes getting 12 minutes to shut down feels very excessive; even a 3-node cluster getting 3 minutes takes time away from other tests in CI.
e.setValue(newRanges);
}
updateTransitive(transitive -> {
NavigableMap<TxnId, Ranges> next = new TreeMap<>();
I think it might be nice to return the original map if there are no changes. That is, keep this null until we need to edit, then insert everything before-hand and update from afterwards.
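The suggestion above can be sketched as a lazy-copy rewrite: keep the result null until the first real change, copy every earlier (unchanged) entry at that point, then keep writing into the copy; if nothing changed, hand back the original map so no garbage is produced. Types and the `transform` parameter are hypothetical:

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.function.UnaryOperator;

// Lazy-copy rewrite of a NavigableMap: allocate only on the first change.
public class LazyCopySketch
{
    static NavigableMap<Long, String> rewrite(NavigableMap<Long, String> in,
                                              UnaryOperator<String> transform)
    {
        NavigableMap<Long, String> out = null;
        for (Map.Entry<Long, String> e : in.entrySet())
        {
            String updated = transform.apply(e.getValue());
            if (out == null)
            {
                if (updated.equals(e.getValue()))
                    continue; // no change yet, keep scanning the original
                // first change: copy everything strictly before this key
                out = new TreeMap<>(in.headMap(e.getKey(), false));
            }
            out.put(e.getKey(), updated);
        }
        return out == null ? in : out; // identical input returns the same instance
    }

    public static void main(String[] args)
    {
        NavigableMap<Long, String> map = new TreeMap<>(Map.of(1L, "a", 2L, "b"));
        // Identity transform: the very same map instance comes back
        System.out.println(rewrite(map, s -> s) == map);
    }
}
```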
pushed update/fix
- SetShardDurable should correctly set DurableBefore Majority/Universal based on the Durability parameter
- Partial compaction should update records in place to ensure truncation of discontiguous compactions does not lead to an incorrect field version being used
- Journal compaction should not rewrite fields shadowed by a newer record
- Avoid String.format in Compactor hot path
- Avoid string concatenation on hot path; improve segment compactor partition build efficiency
- Make sure to actually call compaction iterator
- Fix %s placeholder
Failed on seed 0x6bea128ae851724b
org.apache.cassandra.simulator.SimulationException: Failed on seed 0x6bea128ae851724b
Caused by: java.lang.AssertionError: Saw errors in node3: Unexpected exception: ERROR [AccordExecutor[0,8]:1] node3 2025-03-10 13:11:53,851 Uncaught accord exception
java.util.ConcurrentModificationException: null
    at java.base/java.util.TreeMap$NavigableSubMap$SubMapIterator.nextEntry(TreeMap.java:1700)
    at java.base/java.util.TreeMap$NavigableSubMap$SubMapEntryIterator.next(TreeMap.java:1748)
    at java.base/java.util.TreeMap$NavigableSubMap$SubMapEntryIterator.next(TreeMap.java:1742)
    at org.apache.cassandra.service.accord.CommandsForRanges$Loader.intersects(CommandsForRanges.java:149)
    at org.apache.cassandra.service.accord.AccordTask$RangeTxnScanner.runInternal(AccordTask.java:1065)
    at org.apache.cassandra.service.accord.AccordTask$RangeTxnAndKeyScanner.runInternal(AccordTask.java:961)
    at org.apache.cassandra.service.accord.AccordTask$RangeTxnScanner.run(AccordTask.java:1053)
    at org.apache.cassandra.service.accord.AccordExecutor$PlainRunnable.run(AccordExecutor.java:1074)
    at org.apache.cassandra.service.accord.AccordExecutorAbstractLockLoop.runWithoutLock(AccordExecutorAbstractLockLoop.java:249)
    at org.apache.cassandra.concurrent.InfiniteLoopExecutor.loop(InfiniteLoopExecutor.java:125)
    at org.apache.cassandra.simulator.systems.InterceptedExecution$InterceptedThreadStart.run(InterceptedExecution.java:216)
    at
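The failure mode in the trace is TreeMap's fail-fast submap iterator detecting a structural modification of the backing map mid-iteration. A minimal, self-contained reproduction (not Cassandra code) of the same pattern:

```java
import java.util.ConcurrentModificationException;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Reproduces the ConcurrentModificationException from iterating a TreeMap
// submap while another writer mutates the backing map.
public class CmeDemo
{
    static boolean triggersCme()
    {
        NavigableMap<Integer, String> map = new TreeMap<>();
        for (int i = 0; i < 10; i++)
            map.put(i, "v" + i);
        try
        {
            for (Map.Entry<Integer, String> e : map.subMap(2, 8).entrySet())
                map.put(100, "mutation during iteration"); // structural change to backing map
            return false; // unreachable: the iterator is fail-fast
        }
        catch (ConcurrentModificationException expected)
        {
            return true;
        }
    }

    public static void main(String[] args)
    {
        System.out.println(triggersCme());
    }
}
```

Routing all writers through a copy-on-write updater (so readers iterate a stable snapshot) removes this class of failure.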
@@ -58,5 +60,5 @@ enum FailurePolicy { STOP, STOP_JOURNAL, IGNORE, DIE }
/**
 * @return user provided version to use for key and value serialization
 */
-int userVersion();
+Version userVersion();
This is the journal package, but you switch to the accord version? How does this impact CEP-45 (mutation tracking)? They won't be using the accord version.
This seems to have gotten pulled into my branch somehow; not sure if this one is my bad. Either way, good catch, I'll revert.
import static accord.local.CommandStores.RangesForEpoch;

// TODO (required): test with large collection values, and perhaps split out some fields if they have a tendency to grow larger
// TODO (required): alert on metadata size
// TODO (required): versioning
public class AccordJournalValueSerializers
{
public interface FlyweightSerializer<ENTRY, IMAGE>
private static final int messagingVersion = MessagingService.VERSION_40;
Why add this back? This isn't the version the accord journal uses.
It's private and I don't see anything in this class using it, so it's dead code?
This looks to be my mistake
int count = errorCount.get();
assertThat(count).isEqualTo(0).describedAs(new Description()
{
    public String value()
Suggested change:
-public String value()
+@Override
+public String value()
@@ -70,6 +72,7 @@ public void testConcurrentSchemaModification() throws InterruptedException, IOEx
File[] dataDirs = new File[nThreads];
String baseDataDir = tempFolder.newFolder().getAbsolutePath();

AtomicReference<String> errors = new AtomicReference<>("");
What I don't get: this is passing on trunk but failing constantly on accord... what did accord change to make this unstable?
Node node;

enum Status { INITIALIZED, STARTING, REPLAY, STARTED, TERMINATING, TERMINATED }
private volatile Status status = Status.INITIALIZED;

public AccordJournal(Params params, AccordAgent agent)
Took a while, but it did turn out the agent is dead code; tons of code to clean up to remove that link.