Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a utility to perform a gradual split #116

Closed
wants to merge 12 commits into from

Conversation

stephen-soltesz
Copy link
Contributor

@stephen-soltesz stephen-soltesz commented Feb 14, 2023

This change adds a deployment utility to gradually split traffic to a new version. This is meant to address:


This change is Reviewable

@coveralls
Copy link
Collaborator

coveralls commented Feb 14, 2023

Pull Request Test Coverage Report for Build 895

  • 69 of 69 (100.0%) changed or added relevant lines in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.08%) to 98.053%

Totals Coverage Status
Change from base Build 892: 0.08%
Covered Lines: 1712
Relevant Lines: 1746

💛 - Coveralls

Copy link
Contributor

@cristinaleonr cristinaleonr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a nice solution. Thank you!

I can't help but feel like we are implementing functionality that should already come out of the box :/.

As a potential alternative, from the description, it seems like the initial_delay_sec setting can give new instances more time to initialize before they are ready to serve traffic.

Reviewable status: 0 of 1 approvals obtained (waiting on @stephen-soltesz)


cloudbuild/cloudbuild.yaml line 18 at r2 (raw file):

  - go test -v ./...

# Run unit tests for environment.

Perhaps this should say build gradual-split command?


cmd/gradual-split/main.go line 52 at r2 (raw file):

}

func (a *AppAPI) ServiceUpdate(ctx context.Context, serviceID string, service *appengine.Service, mask string) (*appengine.Operation, error) {

Should this have a docstring?


cmd/gradual-split/internal/split.go line 43 at r2 (raw file):

				continue
			}
			if verFrom >= version.Id {

Is this if statement needed? The scenario seems to be covered by the one below.


cmd/gradual-split/internal/split.go line 82 at r2 (raw file):

		}
	case len(service.Split.Allocations) == 2:
		// Assume the split is already in progress.

This seems to be called once from main. Under what circumstances would the split already be in progress?


cmd/gradual-split/internal/split.go line 98 at r2 (raw file):

// each step. PerformSplit can resume a split in progress.
func PerformSplit(ctx context.Context, api AppWrapper, service *appengine.Service, opt *SplitOptions) error {
	// delay time.Duration, vfrom, vto string) error {

Is this needed?


cmd/gradual-split/internal/split_test.go line 33 at r2 (raw file):

		return nil, a.serviceErr
	}
	return nil, nil // a.apis.Apps.Services.Patch(project, serviceID, service).UpdateMask(mask).Do()

Is this commented out code needed?

Copy link
Contributor Author

@stephen-soltesz stephen-soltesz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree it feels strange that Flexible doesn't support gradual traffic migration between versions as Standard env does. https://cloud.google.com/appengine/docs/flexible/migrating-traffic

Do you suspect that initial_delay_sec would solve the --promote issue? The outage from #114 latest at least 5min at most 10min. Is the root cause that the VMs were either "not live" or "not ready" before the promote takes effect? Either way, it coul be time for explicit live and ready handlers.

Reviewable status: 0 of 1 approvals obtained (waiting on @cristinaleonr)


cloudbuild/cloudbuild.yaml line 18 at r2 (raw file):

Previously, cristinaleonr (Cristina Leon) wrote…

Perhaps this should say build gradual-split command?

Done.


cmd/gradual-split/main.go line 52 at r2 (raw file):

Previously, cristinaleonr (Cristina Leon) wrote…

Should this have a docstring?

Done.


cmd/gradual-split/internal/split.go line 43 at r2 (raw file):

Previously, cristinaleonr (Cristina Leon) wrote…

Is this if statement needed? The scenario seems to be covered by the one below.

I was trying to exclude the case where "verFrom" is later or the same as the "latest" version. But that doesn't need to be in this loop. Moved after.


cmd/gradual-split/internal/split.go line 82 at r2 (raw file):

Previously, cristinaleonr (Cristina Leon) wrote…

This seems to be called once from main. Under what circumstances would the split already be in progress?

A transient error causes the gradual-split utility to crash mid-migration, and we restart the build to continue from where it started.


cmd/gradual-split/internal/split.go line 98 at r2 (raw file):

Previously, cristinaleonr (Cristina Leon) wrote…

Is this needed?

Thank you. No. Removed -- that was from before I added the opt parameter.


cmd/gradual-split/internal/split_test.go line 33 at r2 (raw file):

Previously, cristinaleonr (Cristina Leon) wrote…

Is this commented out code needed?

Removed.

Copy link
Contributor

@cristinaleonr cristinaleonr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the problem has to do with heartbeat instances abruptly disconnecting and trying to reconnect to the latest promoted instance, I now believe that a gradual migration will be more helpful than the initial_delay_sec setting (which might just delay the reconnection even further). So, this change :lgtm:.

One thing to consider is that this change seems to cause the build to take ~2x as long (>30min).

Reviewable status: :shipit: complete! 1 of 1 approvals obtained

@stephen-soltesz
Copy link
Contributor Author

We discussed at the ops-tactical, that we'd like to see a native app engine deployment with --promote before merging this change. Ideally, we don't need this utility, if all the other changes to the HBS and Locate make it unnecessary.

@stephen-soltesz
Copy link
Contributor Author

Obsolete

@nkinkade nkinkade deleted the sandbox-soltesz-gradual-split branch August 22, 2023 22:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants