SkyPilot v0.8.0

@romilbhardwaj romilbhardwaj released this 12 Feb 23:13
c2c49a6

SkyPilot v0.8.0: Faster Managed Jobs, Faster Provisioning, Digital Ocean and Vast support, DeepSeek R1 recipes and more!

We’re thrilled to release SkyPilot v0.8.0! This update makes SkyPilot faster and more robust, with major improvements to Managed Jobs, Kubernetes support, and new cloud integrations.

Highlights

  • Faster Managed Jobs: 3x faster job submission, controller uses 37% less memory, and support for 2000+ concurrent jobs
  • Faster Provisioning: Kubernetes provisioning is up to 4x faster: provisioning a GPU cluster with 200 nodes takes under 90 seconds. sky launch on existing clusters is 5x faster when using the --fast flag.
  • Intermediate buckets for managed jobs: bring your own buckets to be used as intermediate storage for managed jobs.
    # ~/.sky/config.yaml
    jobs:
      bucket: s3://my-bucket
    
  • Exciting new features in SkyServe:
    • SkyServe load balancer now supports TLS via HTTPS
    • New load_balancing_policy field to choose from multiple policies (round_robin, least_load)
    • Replicas can now expose multiple ports
  • New clouds: Digital Ocean and Vast
  • New LLM Recipes: DeepSeek R1 and Janus, minGPT with PyTorch Distributed

Managed Jobs

  • Managed jobs scheduler has been reworked: 3x faster, uses 37% less memory and can support up to 2000 jobs running simultaneously (#4318, #4485, #4341)
  • Brand new look for the managed jobs dashboard, with new filters, log download, and failover history (#4253, #4644, #4638)
    [Image: Managed jobs dashboard]
  • You can now bring your own bucket to act as the intermediate storage for managed jobs (#4257)
    • If no intermediate bucket is specified, we now create one bucket per job instead of one per file_mount/workdir.
  • sky jobs logs has a new flag --sync-down to download logs to the local machine (#4527)
  • When fetching managed jobs logs, SkyPilot will autostart the jobs controller if it is not running (#4380)
  • Robustness of managed jobs is greatly improved (#4247, #4283, #4562, #4602, #4615)

Backend

  • sky launch on existing clusters is 5x faster when using the --fast flag. We have reworked the provisioning logic to be more efficient when reusing clusters (#4328, #4289)
  • We now use uv under the hood for a 3x faster setup phase (#4414)
  • Beefed up resource leak protection (#4443, #4267)
  • Skylet scheduler is 2x faster (#4264)
  • New remote_identity: NO_UPLOAD option to skip uploading credentials to the remote VM (#4307)
  • Other robustness improvements (#4227, #4290, #4310, #4390, #4488)

Kubernetes

  • Multi-node setup is now up to 4x faster: provisioning a GPU cluster with 200 nodes takes under 90 seconds (#4297, #4240, #4393)
  • Single-host TPUs on GKE are now supported on fixed and autoscaling node pools (#3947)
  • sky check now shows enabled contexts (#4587)
  • SkyPilot no longer has a dependency on lsof in k8s environments (#4304)
  • sky show-gpus --cloud kubernetes now handles limited permissions gracefully (#4208)
  • Both in-cluster (service account based) and kubeconfig auth are now supported concurrently (#4188)
  • Custom GPU resource names are supported via the CUSTOM_GPU_RESOURCE_NAME environment variable (#4337)
  • Fixed a bug with SSH on IPv6 dual stack clusters (#4497)
  • Fixed a bug with L40 detection when using nvidia.com/product labels (#4511)
  • pod_config specified in config.yaml is now validated before launching clusters (#4466)
  • Other performance and robustness improvements (#4398, #4415, #4420, #4425, #4429, #4452, #4469, #4514, #4505, #4558, #4561, #4437)

CLI & Core interfaces

  • sky logs has a new --tail parameter to stream job logs (#4241)
  • sky.jobs.launch from the Python API now returns the job id (#4620)

SkyServe

  • SkyServe now supports choosing a load balancing policy to be used by the service (#4439)

    service:
      load_balancing_policy: round_robin  # round_robin, least_load
    
    Policy       Description
    least_load   (New default) Routes requests to replicas with the lowest current load, optimizing for latency and throughput
    round_robin  Distributes requests evenly across all replicas in a circular order
  • Improved security with TLS support on the load balancer (#3380)

  • You can now expose multiple ports on replicas: useful for running monitoring, UI or other services on the replicas (#4356)
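
The two load balancing policies listed above can be illustrated with a small, self-contained Python sketch. This is not SkyServe's implementation: the class names, the select/done methods, and the use of an in-flight request count as the measure of "load" are all assumptions made for illustration.

```python
from itertools import cycle


class RoundRobinPolicy:
    """Hypothetical sketch: hands out replicas in a fixed circular order."""

    def __init__(self, replicas):
        self._order = cycle(list(replicas))

    def select(self):
        # Each call returns the next replica, wrapping around at the end.
        return next(self._order)


class LeastLoadPolicy:
    """Hypothetical sketch: picks the replica with the fewest in-flight requests."""

    def __init__(self, replicas):
        # Assumption: "load" is modeled as the number of unfinished requests.
        self._in_flight = {replica: 0 for replica in replicas}

    def select(self):
        # Ties are broken by insertion order of the replicas.
        replica = min(self._in_flight, key=self._in_flight.get)
        self._in_flight[replica] += 1
        return replica

    def done(self, replica):
        # Call when a request finishes so the replica's load drops again.
        self._in_flight[replica] -= 1
```

In this sketch, round_robin ignores how busy each replica is, while least_load steers new requests away from replicas that still hold unfinished work, which is the behavior that matters when request durations vary widely.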

New LLM recipes

  • DeepSeek R1 (#4603) and DeepSeek Janus (#4611)
  • minGPT with PyTorch Distributed (#4464)

Cloud-specific enhancements

  • AWS:
    • Disabled additional auto-update services for Ubuntu images with cloud-init (#4252)
    • Added an AWS assume-role option and environment variable detection (#4550)
    • Credentials are no longer uploaded when using service account auth (#4395)
    • Custom process based auth is now supported (#4547)
    • SkyPilot now only uses the specified VPC or the default VPC (No other VPCs are used unless specified) (#4546)
  • GCP: Fixed an issue where the service account was not activated for accessing Google Cloud Storage on the controller; robustness improvements (#4529, #4593)
  • Azure: Support for image IDs tagged with latest, plus robustness improvements (#4581, #4411, #4457)
  • Fluidstack: H100 SXM5 support (#4359)
  • Lambda: Added support for GH200 and new regions (us-east-2, us-south-2, us-south-3) (#4291, #4377)
  • RunPod: support spot pods (#4447) and private container registries (#4287)
  • OCI: New, faster provisioner; support for SkyServe; default image upgraded to 22.04 LTS (#4119, #4517)

Storage

  • OCI object storage is now supported (#4501)
  • Fixed a bug where object stores were not being mounted when only object stores were specified in file_mounts (#4317)

Docs

  • Docs have been revamped: brand new Overview page explaining core concepts (#4342), improved structuring (#4664), docs for multi-k8s (#4586), and more!

⚠️ Deprecation notice

  • LocalDockerBackend is deprecated. To run locally, use sky local up to set up a local k8s cluster.
  • sky spot CLI is now removed. Use sky jobs launch --use-spot to launch spot instances.

Thanks to all contributors!

New contributors: @weih1121, @clayrosenthal, @manbeardave, @bend, @nkwangleiGIT, @kristopolous, @sachiniyer, @KeplerC, @aylei, @Yisaer, @cbrownstein, @chesterli29, @sfrolich, @AlexCuadron

Many thanks to everyone who contributed to this release!

Contributors: @romilbhardwaj, @cg505, @Michaelvll, @zpoint, @HysunHe, @cblmemo, @andylizf, @concretevitamin, @KeplerC, @yika-luo, @cbrownstein, @weih1121, @nkwangleiGIT, @aylei, @clayrosenthal, @sethkimmel3, @landscapepainter, @Conless, @sfrolich, @AlexCuadron, @shashank2000, @mjibril, @asaiacai, @chesterli29, @Yisaer, @sachiniyer, @manbeardave, @bend, @kristopolous

Full Changelog: v0.7.0...v0.8.0