This repository has been archived by the owner on Sep 21, 2024. It is now read-only.
Shard #74 (Open)
saj9191 wants to merge 22 commits into master from shard
Conversation
saj9191 force-pushed the shard branch 8 times, most recently from 4d41c90 to c3c4ddf on July 1, 2019 18:55
saj9191 force-pushed the shard branch 3 times, most recently from d7a320a to 1c6ba0d on July 3, 2019 22:09
For both the sharded and non-sharded cases, we want to correctly set up the primary logs. However, in the sharded case, we call RecoveryAsync for every shard we recover from. We don't want to move the upgraded directory and log file for every shard, so we move this final logic into a function that we can call once in both the sharded and non-sharded cases.
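A minimal sketch of the shape of this refactor; apart from RecoveryAsync, every name below (Coordinator, RecoverAsync, FinishPrimaryLogSetup, _shardsToRecoverFrom) is hypothetical:

```csharp
// Sketch only: everything except RecoveryAsync is an assumed name, not the actual Ambrosia code.
using System.Collections.Generic;
using System.Threading.Tasks;

public partial class Coordinator
{
    private readonly List<int> _shardsToRecoverFrom = new List<int>();

    public async Task RecoverAsync(bool sharded)
    {
        if (sharded)
        {
            // In the sharded case, RecoveryAsync runs once per ancestor shard...
            foreach (var shardId in _shardsToRecoverFrom)
                await RecoveryAsync(shardId);
        }
        else
        {
            await RecoveryAsync(0);
        }

        // ...but the upgraded directory and log file are moved exactly once,
        // in both the sharded and non-sharded cases.
        FinishPrimaryLogSetup();
    }

    private Task RecoveryAsync(int shardId) { /* replay from one shard's log */ return Task.CompletedTask; }

    private void FinishPrimaryLogSetup() { /* move the upgraded directory and log file */ }
}
```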
If we want to perform shard recovery in parallel, we need to ensure each shard has its own set of variables (e.g. it does not use the same log file variables). To make this easier to share with the non-sharded case, we move the variables into a class.
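One way this class could look, as a sketch; the class and field names below are illustrative, not the actual Ambrosia code:

```csharp
// Sketch only: ShardRecoveryState and its fields are hypothetical names for the
// per-shard "own set of variables" (log file state etc.) used during parallel recovery.
public sealed class ShardRecoveryState
{
    public int ShardId { get; }
    public string LogFileName { get; set; }
    public long LogFilePosition { get; set; }
    public long LastProcessedLogRecord { get; set; }

    public ShardRecoveryState(int shardId) => ShardId = shardId;
}

// The non-sharded case can reuse the same class with a single instance, while
// sharded recovery creates one instance per ancestor shard.
```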
The service name folder will change depending on the shardID. This affects where we look for log and checkpoint files. We update all directory and file lookups to use functions that account for the shardID. Note: For the initial self-connections, I use _serviceName and don't consider whether it's a shard. This is because the Sample applications are still linked to the old version of Ambrosia. We will change this when we are able to link the Samples to the new version.
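The helpers below sketch the idea of routing every log and checkpoint path through a function that accounts for the shard ID; the function names and the folder naming convention are assumptions, and only _serviceName appears in the change itself:

```csharp
// Sketch only: the function names and the "<service>-shard<id>" folder convention are assumed.
using System.IO;

public static class ShardPaths
{
    public static string ServiceFolder(string serviceName, int? shardId) =>
        shardId.HasValue ? $"{serviceName}-shard{shardId.Value}" : serviceName;

    public static string LogPath(string logRoot, string serviceName, int? shardId, long logNumber) =>
        Path.Combine(logRoot, ServiceFolder(serviceName, shardId), $"log{logNumber}");

    public static string CheckpointPath(string logRoot, string serviceName, int? shardId, long checkpointNumber) =>
        Path.Combine(logRoot, ServiceFolder(serviceName, shardId), $"checkpoint{checkpointNumber}");
}
```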
This makes it easier to switch between the sharded and non-sharded scenarios. We still need to figure out the best way to pass in the shards to recover from.
For the sharded case, what is checked changes depending on the shard.
Add functionality so the Immortal Coordinator determines which shard to send a message to. The shard is currently determined by a hash of the destination; the hash only returns shard 1 for now.
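A sketch of that routing behavior; the method name and hashing details are assumptions, but the "hash of the destination, hard-coded to shard 1 for now" behavior matches the description above:

```csharp
// Sketch only: GetDestinationShard is a hypothetical name.
public static class ShardRouter
{
    public static int GetDestinationShard(string destination, int shardCount)
    {
        // Intended behavior: pick a shard from a hash of the destination.
        var hashedShard = System.Math.Abs(destination.GetHashCode() % shardCount) + 1;

        // Placeholder for this change: the hash path exists but always resolves to shard 1.
        return 1;
    }
}
```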
For shard recovery, we need InputConnectionRecords to track the last processed ID and last processed replayable ID of ancestor shards. This map needs to be sent between peers. We only want to send longs instead of strings containing the name of the peer. To make this easier, we add a ShardID field so that we only have to parse the peer name when we make the initial connection. To make sure we don't break serialization when sharding is added, we move the replay serialization logic into a separate class so we can add unit tests. Any other serialization changes affected by shards will be moved into this class as well.
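A sketch of what the record and the extracted serialization class might look like; InputConnectionRecords and the ShardID field come from the description above, while the other names and the exact wire layout are assumptions:

```csharp
// Sketch only: ReplaySerializer, the dictionary names, and the wire format are illustrative.
using System.Collections.Generic;
using System.IO;

public sealed class InputConnectionRecord
{
    public long ShardID;   // parsed once from the peer name, at connection time
    public Dictionary<long, long> AncestorLastProcessedId = new Dictionary<long, long>();
    public Dictionary<long, long> AncestorLastProcessedReplayableId = new Dictionary<long, long>();
}

public static class ReplaySerializer
{
    // Sends only longs (shard IDs and watermarks), never peer-name strings.
    public static void WriteAncestorMap(BinaryWriter writer, Dictionary<long, long> map)
    {
        writer.Write(map.Count);
        foreach (var entry in map)
        {
            writer.Write(entry.Key);    // ancestor shard ID
            writer.Write(entry.Value);  // last processed (replayable) ID
        }
    }
}
```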
In the sharded case, we have to recover from multiple machines. We initialize the machine with a list of shards to recover from as well as a map indicating which machines each key belongs to. This also starts the merge process for the parent input records. This commit may be easier to view with -w.
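A sketch of the initialization state described here; the type and member names are hypothetical:

```csharp
// Sketch only: ShardedRecoveryInfo is an illustrative container for the two inputs
// described above (the ancestor shards to recover from, and which shard owns each key).
using System.Collections.Generic;

public sealed class ShardedRecoveryInfo
{
    public IReadOnlyList<long> ShardsToRecoverFrom { get; }
    public IReadOnlyDictionary<long, long> KeyToOwningShard { get; }

    public ShardedRecoveryInfo(IReadOnlyList<long> shardsToRecoverFrom,
                               IReadOnlyDictionary<long, long> keyToOwningShard)
    {
        ShardsToRecoverFrom = shardsToRecoverFrom;
        KeyToOwningShard = keyToOwningShard;
    }
}
```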
Create a dictionary to keep track of the ancestors associated with each shard ID. This information should not change, so we store it with service information. The ancestor information will help determine which input / output information needs to be shared with peers.
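A sketch of the ancestor dictionary; storing it with the service information reflects the description above, while the names are assumptions:

```csharp
// Sketch only: field and method names are illustrative.
using System.Collections.Generic;

public sealed class ServiceInfo
{
    // Maps each shard ID to the shards it descends from; this never changes, so it
    // can live with the rest of the service information.
    public Dictionary<long, List<long>> ShardAncestors { get; } = new Dictionary<long, List<long>>();

    public IReadOnlyList<long> AncestorsOf(long shardId) =>
        ShardAncestors.TryGetValue(shardId, out var ancestors)
            ? ancestors
            : (IReadOnlyList<long>)System.Array.Empty<long>();
}
```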
When a connection is established, before the peers send a replay message, we send our ancestor list so the peer knows which shard's input / output data to add to the replay message.
We clear the ancestor data once we know the shard has received it, since we no longer need to track it.
We need to preserve a global ordering of output records during shard recovery. If we fail before recovery completes, it's possible that the order of execution will change. To handle this, we record a global ordering of outputs. We also handle merging parent output state in this commit.
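One possible shape for the global ordering record, as a sketch; the class and member names are assumptions:

```csharp
// Sketch only: records a single global sequence number for each output so that a
// recovery that fails part-way and restarts can replay merged parent output state
// in the same order.
using System.Collections.Generic;
using System.Threading;

public sealed class GlobalOutputOrder
{
    private long _nextSequence;
    private readonly List<(long Sequence, long ShardId, long OutputId)> _order =
        new List<(long Sequence, long ShardId, long OutputId)>();

    public long Record(long shardId, long outputId)
    {
        var sequence = Interlocked.Increment(ref _nextSequence);
        lock (_order) { _order.Add((sequence, shardId, outputId)); }
        return sequence;
    }
}
```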
This shard test covers essentially the same behavior as the normal basic end-to-end test, except that the coordinators use the sharded logic. This is to ensure that the basic behavior for the sharded case matches the non-sharded case.
For sharding, we need to be able to concatenate output buffers onto different records in case a peer splits or merges.
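A generic sketch of the concatenation; this is not the actual Ambrosia buffer code, just an illustration of appending one output buffer onto another record's buffer:

```csharp
// Sketch only: appends 'incoming' (up to incomingLength bytes) onto 'existing'.
using System;

public static class OutputBuffers
{
    public static byte[] Concat(byte[] existing, byte[] incoming, int incomingLength)
    {
        var result = new byte[existing.Length + incomingLength];
        Buffer.BlockCopy(existing, 0, result, 0, existing.Length);
        Buffer.BlockCopy(incoming, 0, result, existing.Length, incomingLength);
        return result;
    }
}
```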
It's possible for _lastShuffleDest to be null, which would result in a null dereference when we try to access its length.
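A sketch of the guard; only the _lastShuffleDest name comes from the change, and its type and the surrounding class are assumed here:

```csharp
// Sketch only: _lastShuffleDest is assumed to be a string for illustration.
public sealed class ShuffleState
{
    private string _lastShuffleDest;

    public int LastShuffleDestLength()
    {
        // _lastShuffleDest can legitimately be null, so guard before reading Length.
        return _lastShuffleDest == null ? 0 : _lastShuffleDest.Length;
    }
}
```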
We don't want to update _outputs while a shard is launching. There will be an output variable for each parent that is accessed in parallel, and we need to update that variable to ensure the recovery state is correct and to prevent race conditions.