SkyPilot v0.8.0: Faster Managed Jobs, Faster Provisioning, Digital Ocean and Vast support, DeepSeek R1 recipes and more!
We’re thrilled to release SkyPilot v0.8.0! This update makes SkyPilot faster and more robust, with major improvements to Managed Jobs, Kubernetes support, and new cloud integrations.
Highlights
- Faster Managed Jobs: 3x faster job submission, controller uses 37% less memory, and support for 2000+ concurrent jobs
- Faster Provisioning: Kubernetes provisioning is 4x faster: provisioning a GPU cluster with 200 nodes takes under 90 seconds. sky launch on existing clusters is 5x faster when using the --fast flag.
- Intermediate buckets for managed jobs: bring your own buckets to be used as intermediate storage for managed jobs.
      # ~/.sky/config.yaml
      jobs:
        bucket: s3://my-bucket
- Exciting new features in SkyServe:
  - SkyServe load balancer now supports TLS via HTTPS
  - New load_balancing_policy field to choose from multiple policies (round_robin, least_load)
  - Replicas can now expose multiple ports
- New clouds: Digital Ocean and Vast
- New LLM Recipes: DeepSeek R1 and Janus, minGPT with PyTorch Distributed
Managed Jobs
- Managed jobs scheduler has been reworked: 3x faster, uses 37% less memory and can support up to 2000 jobs running simultaneously (#4318, #4485, #4341)
- Brand new look for the managed jobs dashboard, with new filters, log download, and failover history (#4253, #4644, #4638)
- You can now bring your own bucket to act as the intermediate storage for managed jobs (#4257)
  - If no intermediate bucket is specified, we now create one bucket per job instead of one per file_mount/workdir.
- sky jobs logs has a new --sync-down flag to download logs to the local machine (#4527)
- When fetching managed jobs logs, SkyPilot will autostart the jobs controller if it is not running (#4380)
- Robustness of managed jobs is greatly improved (#4247, #4283, #4562, #4602, #4615)
Backend
- sky launch on existing clusters is 5x faster when using the --fast flag. We have reworked the provisioning logic to be more efficient when reusing clusters (#4328, #4289)
- We now use uv under the hood for a 3x faster setup phase (#4414)
- Beefed up resource leak protection (#4443, #4267)
- Skylet scheduler is 2x faster (#4264)
- New remote_identity: NO_UPLOAD option to skip uploading credentials to the remote VM (#4307)
- Other robustness improvements (#4227, #4290, #4310, #4390, #4488)
Kubernetes
- Multi-node setup is now up to 4x faster: provisioning a GPU cluster with 200 nodes takes under 90 seconds (#4297, #4240, #4393)
- TPUs (Single-host) on GKE are now supported on fixed and autoscaling node pools (#3947)
- sky check now shows enabled contexts (#4587)
- SkyPilot no longer has a dependency on lsof in k8s environments (#4304)
- sky show-gpus --cloud kubernetes now handles limited permissions gracefully (#4208)
- Both in-cluster (service account based) and kubeconfig auth are now supported concurrently (#4188)
- Custom GPU resource names are supported with the CUSTOM_GPU_RESOURCE_NAME environment variable (#4337)
- Fixed a bug with SSH on IPv6 dual stack clusters (#4497)
- Fixed a bug with L40 detection when using nvidia.com/product labels (#4511)
- pod_config specified in config.yaml is now validated before launching clusters (#4466)
- Other performance and robustness improvements (#4398, #4415, #4420, #4425, #4429, #4452, #4469, #4514, #4505, #4558, #4561, #4437)
CLI & Core interfaces
- sky logs has a new --tail parameter to stream job logs (#4241)
- sky.jobs.launch from the Python API now returns the job id (#4620)
SkyServe
- SkyServe now supports choosing a load balancing policy to be used by the service (#4439)

      service:
        load_balancing_policy: round_robin  # round_robin, least_load

  | Policy | Description |
  | --- | --- |
  | least_load | (New default) Routes requests to replicas with the lowest current load, optimizing for latency and throughput |
  | round_robin | Distributes requests evenly across all replicas in a circular order |

- Improved security with TLS support on the load balancer (#3380)
- You can now expose multiple ports on replicas: useful for running monitoring, UI or other services on the replicas (#4356)
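
To make the difference between the round_robin and least_load policies concrete, here is a minimal Python sketch. It is not SkyServe's implementation: the class names RoundRobinPolicy and LeastLoadPolicy, the select/done methods, the replica names, and the use of in-flight request counts as the measure of "load" are all assumptions made for illustration.

```python
from typing import Dict, List


class RoundRobinPolicy:
    """Hypothetical policy: hand out replicas in a fixed circular order."""

    def __init__(self, replicas: List[str]) -> None:
        self.replicas = list(replicas)
        self._next = 0  # index of the replica to return on the next call

    def select(self) -> str:
        replica = self.replicas[self._next % len(self.replicas)]
        self._next += 1
        return replica


class LeastLoadPolicy:
    """Hypothetical policy: pick the replica with the fewest in-flight requests."""

    def __init__(self, replicas: List[str]) -> None:
        # Assumption: "load" is approximated by the number of open requests.
        self.in_flight: Dict[str, int] = {r: 0 for r in replicas}

    def select(self) -> str:
        # On ties, min() returns the first replica in insertion order.
        replica = min(self.in_flight, key=self.in_flight.get)
        self.in_flight[replica] += 1
        return replica

    def done(self, replica: str) -> None:
        # Call when a request finishes so the replica's load drops again.
        self.in_flight[replica] -= 1


replicas = ["replica-1", "replica-2", "replica-3"]

rr = RoundRobinPolicy(replicas)
rr_picks = [rr.select() for _ in range(4)]

ll = LeastLoadPolicy(replicas)
first = ll.select()
second = ll.select()
ll.done(first)       # the first request completes
third = ll.select()  # the freed replica is idle again and is chosen

print(rr_picks)
print(first, second, third)
```

In this sketch, round robin ignores how busy each replica is, while least load sends the third request back to the replica that has just finished its work. That is the intuition behind least_load being a sensible default when request durations vary.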
New LLM recipes
- DeepSeek R1 and Janus
- minGPT with PyTorch Distributed
Cloud-specific enhancements
- AWS:
  - Disabled additional auto-update services for Ubuntu images with cloud-init (#4252)
  - Added an AWS assume-role option and environment variable detection (#4550)
  - Credentials are no longer uploaded when using service account auth (#4395)
  - Custom process-based auth is now supported (#4547)
  - SkyPilot now uses only the specified VPC or, if none is specified, the default VPC (#4546)
- GCP: Fixed an issue where the service account was not activated for accessing Google Cloud Storage on the controller, plus robustness improvements (#4529, #4593)
- Azure: Support image ids tagged with latest and robustness improvements (#4581, #4411, #4457)
- Fluidstack: H100 SXM5 support (#4359)
- Lambda: Added support for GH200 and new regions (us-east-2, us-south-2, us-south-3) (#4291, #4377)
- RunPod: support spot pods (#4447) and private container registries (#4287)
- OCI: Faster and new provisioner, support for SkyServe, default image has been upgraded to 22.04 LTS (#4119, #4517)
Storage
- OCI object storage is now supported (#4501)
- Fixed a bug where object stores were not being mounted when only object stores were specified in file_mounts (#4317)
Docs
- Docs have been revamped: brand new Overview page explaining core concepts (#4342), improved structuring (#4664), docs for multi-k8s (#4586), and more!
⚠️ Deprecation notice
- LocalDockerBackend is deprecated. To run locally, use sky local up to set up a local k8s cluster.
- sky spot CLI is now removed. Use sky jobs launch --use-spot to launch spot instances.
Thanks to all contributors!
New contributors: @weih1121, @clayrosenthal, @manbeardave, @bend, @nkwangleiGIT, @kristopolous, @sachiniyer, @KeplerC, @aylei, @Yisaer, @cbrownstein, @chesterli29, @sfrolich, @AlexCuadron
Many thanks to everyone who contributed to this release!
Contributors: @romilbhardwaj, @cg505, @Michaelvll, @zpoint, @HysunHe, @cblmemo, @andylizf, @concretevitamin, @KeplerC, @yika-luo, @cbrownstein, @weih1121, @nkwangleiGIT, @aylei, @clayrosenthal, @sethkimmel3, @landscapepainter, @Conless, @sfrolich, @AlexCuadron, @shashank2000, @mjibril, @asaiacai, @chesterli29, @Yisaer, @sachiniyer, @manbeardave, @bend, @kristopolous
Full Changelog: v0.7.0...v0.8.0