When Locate is deployed to App Engine with the --promote flag, all the heartbeat services are briefly disconnected and must re-register with the Locate server before the set of healthy servers is repopulated. This takes a surprisingly long time.
An incremental split appears to work as intended. From 19:35 to 20:10, a 10% increase every 5 minutes resulted in no visible decrease in test rates or Locate connections. The increase in traffic was due to hourly client traffic.
During the original event on 2023-02-09, it took over 1 hour for all servers to re-register (see image below). We see the same slow update in staging. @cristinaleonr suspects this is due to the heartbeat service's exponential backoff and plans to add metrics to the heartbeat service (hbs) so that we can see both the node and Locate metrics.
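To illustrate why exponential backoff could stretch re-registration out this far, here is a minimal sketch of how retry delays accumulate. The initial delay, growth factor, cap, and attempt count are made-up values for illustration; they are not the heartbeat service's actual settings, which this issue does not state.

```python
from typing import List


def backoff_delays(initial_s: float, factor: float, max_s: float, attempts: int) -> List[float]:
    """Return the wait before each retry under capped exponential backoff.

    All parameters are hypothetical; the real hbs configuration may differ.
    """
    delays = []
    delay = initial_s
    for _ in range(attempts):
        # Each wait grows by `factor` but never exceeds the cap.
        delays.append(min(delay, max_s))
        delay *= factor
    return delays


if __name__ == "__main__":
    # Illustrative values only: 1 s initial delay, doubling, capped at 1 hour.
    delays = backoff_delays(initial_s=1.0, factor=2.0, max_s=3600.0, attempts=12)
    print(f"total wait after {len(delays)} failed attempts: {sum(delays) / 60:.0f} min")
```

With these illustrative values, a client whose first 12 attempts all fail would have waited 4095 seconds (about 68 minutes) in total, so even a short outage can leave some clients waiting a long time before their next attempt. Whether this matches the observed behavior depends on the real backoff settings, which the planned metrics should help confirm.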
Also notable: during the manually split deployment to production on the 13th, we did not see disruptions to the count of available healthy servers.
Unfortunately, the App Engine flexible environment does not support automatic traffic migration (as the standard environment does).
However, we should still be able to create a tool that performs a gradual migration using Traffic Splitting.
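As a rough sketch of what such a tool might look like, the following Python builds a stepwise schedule and the matching `gcloud app services set-traffic` commands. The service name, version names, 10% step, and 5-minute pause are assumptions for illustration (the step and pause mirror the manual split described in this issue); this is not an existing tool, and the commands are executed only when a runner is passed in.

```python
import subprocess
import time
from typing import Callable, Iterator, List


def split_schedule(step_percent: int = 10) -> Iterator[int]:
    """Yield the new version's traffic share in percent for each step, ending at 100."""
    if not 0 < step_percent <= 100:
        raise ValueError("step_percent must be in (0, 100]")
    share = 0
    while share < 100:
        share = min(100, share + step_percent)
        yield share


def set_traffic_command(service: str, old_version: str, new_version: str,
                        new_percent: int) -> List[str]:
    """Build the gcloud command that gives `new_percent` of traffic to the new version."""
    if new_percent >= 100:
        splits = f"{new_version}=1"
    else:
        old_share = (100 - new_percent) / 100
        new_share = new_percent / 100
        splits = f"{old_version}={old_share:.2f},{new_version}={new_share:.2f}"
    return ["gcloud", "app", "services", "set-traffic", service,
            "--splits", splits, "--split-by", "random", "--quiet"]


def rollout(service: str, old_version: str, new_version: str,
            step_percent: int = 10, pause_s: float = 300.0,
            run: Callable[[List[str]], object] = print,
            sleep: Callable[[float], None] = time.sleep) -> None:
    """Shift traffic step by step; by default only prints the commands (dry run)."""
    for percent in split_schedule(step_percent):
        run(set_traffic_command(service, old_version, new_version, percent))
        if percent < 100:
            sleep(pause_s)  # give heartbeat connections time to settle


if __name__ == "__main__":
    # Hypothetical service and version names; dry run with no waiting.
    rollout("locate", "v1", "v2", sleep=lambda _: None)
    # To apply for real, pass e.g. run=lambda cmd: subprocess.run(cmd, check=True)
```

The new version would need to be deployed with `--no-promote` first so that it receives no traffic until the first split is applied.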
It may also be possible to improve the shutdown / warmup mechanism for Locate to "hand off" from one version to the next more gracefully.