-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a utility to perform a gradual split #116
Conversation
Pull Request Test Coverage Report for Build 895
💛 - Coveralls |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a nice solution. Thank you!
I can't help but feel like we are implementing functionality that should already come out of the box :/.
As a potential alternative, from the description, it seems like the initial_delay_sec
setting can give new instances more time to initialize before they are ready to serve traffic.
Reviewable status: 0 of 1 approvals obtained (waiting on @stephen-soltesz)
cloudbuild/cloudbuild.yaml
line 18 at r2 (raw file):
- go test -v ./... # Run unit tests for environment.
Perhaps this should say build gradual-split command?
cmd/gradual-split/main.go
line 52 at r2 (raw file):
} func (a *AppAPI) ServiceUpdate(ctx context.Context, serviceID string, service *appengine.Service, mask string) (*appengine.Operation, error) {
Should this have a docstring?
cmd/gradual-split/internal/split.go
line 43 at r2 (raw file):
continue } if verFrom >= version.Id {
Is this if statement needed? The scenario seems to be covered by the one below.
cmd/gradual-split/internal/split.go
line 82 at r2 (raw file):
} case len(service.Split.Allocations) == 2: // Assume the split is already in progress.
This seems to be called once from main. Under what circumstances would the split already be in progress?
cmd/gradual-split/internal/split.go
line 98 at r2 (raw file):
// each step. PerformSplit can resume a split in progress. func PerformSplit(ctx context.Context, api AppWrapper, service *appengine.Service, opt *SplitOptions) error { // delay time.Duration, vfrom, vto string) error {
Is this needed?
cmd/gradual-split/internal/split_test.go
line 33 at r2 (raw file):
return nil, a.serviceErr } return nil, nil // a.apis.Apps.Services.Patch(project, serviceID, service).UpdateMask(mask).Do()
Is this commented out code needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree it feels strange that Flexible doesn't support gradual traffic migration between versions as Standard env does. https://cloud.google.com/appengine/docs/flexible/migrating-traffic
Do you suspect that initial_delay_sec
would solve the --promote
issue? The outage from #114 latest at least 5min at most 10min. Is the root cause that the VMs were either "not live" or "not ready" before the promote takes effect? Either way, it coul be time for explicit live and ready handlers.
Reviewable status: 0 of 1 approvals obtained (waiting on @cristinaleonr)
cloudbuild/cloudbuild.yaml
line 18 at r2 (raw file):
Previously, cristinaleonr (Cristina Leon) wrote…
Perhaps this should say build gradual-split command?
Done.
cmd/gradual-split/main.go
line 52 at r2 (raw file):
Previously, cristinaleonr (Cristina Leon) wrote…
Should this have a docstring?
Done.
cmd/gradual-split/internal/split.go
line 43 at r2 (raw file):
Previously, cristinaleonr (Cristina Leon) wrote…
Is this if statement needed? The scenario seems to be covered by the one below.
I was trying to exclude the case where "verFrom" is later or the same as the "latest" version. But that doesn't need to be in this loop. Moved after.
cmd/gradual-split/internal/split.go
line 82 at r2 (raw file):
Previously, cristinaleonr (Cristina Leon) wrote…
This seems to be called once from main. Under what circumstances would the split already be in progress?
A transient error causes the gradual-split utility to crash mid-migration, and we restart the build to continue from where it started.
cmd/gradual-split/internal/split.go
line 98 at r2 (raw file):
Previously, cristinaleonr (Cristina Leon) wrote…
Is this needed?
Thank you. No. Removed -- that was from before I added the opt
parameter.
cmd/gradual-split/internal/split_test.go
line 33 at r2 (raw file):
Previously, cristinaleonr (Cristina Leon) wrote…
Is this commented out code needed?
Removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the problem has to do with heartbeat instances abruptly disconnecting and trying to reconnect to the latest promoted instance, I now believe that a gradual migration will be more helpful than the initial_delay_sec
setting (which might just delay the reconnection even further). So, this change .
One thing to consider is that this change seems to cause the build to take ~2x as long (>30min).
Reviewable status: complete! 1 of 1 approvals obtained
We discussed at the ops-tactical, that we'd like to see a native app engine deployment with |
Obsolete |
This change adds a deployment utility to gradually split traffic to a new version. This is meant to address:
This change is