rueian
Contributor

@rueian rueian commented Mar 27, 2025

Why are these changes needed?

One of the goals of #3083 is to make the Kubernetes service names that the KubeRay controller generates for RayCluster, RayService, and RayJob more predictable. To that end, we shortened the allowed length of CR names and now validate them up front, so the controller only needs to append fixed suffixes to generate service names, without mutating or trimming the name prefixes. Mutating and trimming the prefixes makes the names less predictable.

A case mentioned in #2169 shows why #3083 avoids mutating and trimming the prefixes of service names: users tend to copy the full CR name, append the generated suffix such as -head-svc, and use the result elsewhere:

For example, if the RayCluster name is 82ac5612-3bd9-4ef9-8828-5133dfe1a1fa-raycluster-pfsxh, the head service name becomes r-4ef9-8828-5133dfe1a1fa-raycluster-pfsxh-head-svc

But when we pass the cluster name in the Job Submission service, it raises the following error

dial tcp: lookup 82ac5612-3bd9-4ef9-8828-5133dfe1a1fa-raycluster-pfsxh-head-svc.default.svc.cluster.local on 10.96.0.10:53: no such host

But it also shows another case we missed: Kubernetes requires a service name to be a DNS-1035 label; otherwise, creating the service fails with this error:

Failed creating service default/82ac5612-3bd9-4ef9-8828-5133dfe1a1fa-2fnql-head-svc, 
 Service "82ac5612-3bd9-4ef9-8828-5133dfe1a1fa-2fnql-head-svc" is invalid: metadata.name: Invalid value:
  "82ac5612-3bd9-4ef9-8828-5133dfe1a1fa-2fnql-head-svc": a DNS-1035 label must consist of lower case alphanumeric characters or '-', 
  start with an alphabetic character, 
  and end with an alphanumeric character (e.g. 'my-name',  or 'abc-123', regex used for validation is '[a-z]([-a-z0-9]*[a-z0-9])?')

This PR adds DNS-1035 validation for RayCluster, RayService, and RayJob names.
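For illustration, the DNS-1035 rule quoted in the error above can be sketched with the standard library alone. This is a minimal sketch, not the code in this PR: `isDNS1035Label` is a hypothetical helper, the regex is the one from the error message (anchored to the whole name), and the 63-character cap is the DNS label limit from RFC 1035.

```go
package main

import (
	"fmt"
	"regexp"
)

// dns1035MaxLength is the maximum length of a DNS label (RFC 1035).
const dns1035MaxLength = 63

// dns1035Label is the regex quoted in the Kubernetes error message,
// anchored so that it must match the whole name.
var dns1035Label = regexp.MustCompile(`^[a-z]([-a-z0-9]*[a-z0-9])?$`)

// isDNS1035Label is a hypothetical helper for illustration only; it is not
// the KubeRay or Kubernetes API.
func isDNS1035Label(name string) bool {
	return len(name) <= dns1035MaxLength && dns1035Label.MatchString(name)
}

func main() {
	names := []string{
		"raycluster-sample",
		"82ac5612-3bd9-4ef9-8828-5133dfe1a1fa-2fnql-head-svc", // starts with a digit
	}
	for _, name := range names {
		fmt.Printf("%-55s valid=%v\n", name, isDNS1035Label(name))
	}
}
```

The second name is the service name from the error above; it fails only because its first character is a digit.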

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@@ -24,6 +25,9 @@ func ValidateRayClusterMetadata(metadata metav1.ObjectMeta) error {
	if len(metadata.Name) > MaxRayClusterNameLength {
		return fmt.Errorf("RayCluster name should be no more than %d characters", MaxRayClusterNameLength)
	}
	if errs := validation.IsDNS1035Label(metadata.Name); len(errs) > 0 {
Member

Are we certain this won't break any existing systems using KubeRay that fail this validation? What are possible names previously used that would now fail this validation?

Contributor Author

@rueian rueian Mar 27, 2025

This PR makes any name that starts with a number or contains dots or uppercase letters fail validation up front.

Also note that #3083 makes RayCluster names longer than 53 characters fail, and RayJob/RayService names longer than 47 characters fail.

Member

Names starting with dots or upper-case letters are fine because the previous validation didn't allow them:

$ kubectl ray create cluster Sample-cluster
Error: Failed to create Ray cluster with: RayCluster.ray.io "Sample-cluster" is invalid: metadata.name: Invalid value: "Sample-cluster": a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')

But disallowing RayClusters that start with numbers seems like a breaking change:

$ kubectl ray create cluster 1-sample-cluster
Created Ray Cluster: 1-sample-cluster

Is there a really good reason why we should break support for this naming scheme? Could we adapt validation to still allow it?
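To make the scope of the breaking change concrete, here is a rough sketch that compares the two regexes quoted in the error messages in this thread. `newlyRejected` is a hypothetical helper, not part of KubeRay; it ignores length limits and only reports names that pass the RFC 1123 subdomain pattern but fail the DNS-1035 label pattern.

```go
package main

import (
	"fmt"
	"regexp"
)

// Both patterns are the regexes quoted in the error messages in this thread,
// anchored so that they must match the whole name.
var (
	rfc1123Subdomain = regexp.MustCompile(`^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$`)
	dns1035Label     = regexp.MustCompile(`^[a-z]([-a-z0-9]*[a-z0-9])?$`)
)

// newlyRejected is a hypothetical helper: it reports whether a name was
// accepted by the RFC 1123 subdomain pattern but is rejected by the
// DNS-1035 label pattern. Length limits are deliberately ignored here.
func newlyRejected(name string) bool {
	return rfc1123Subdomain.MatchString(name) && !dns1035Label.MatchString(name)
}

func main() {
	for _, name := range []string{
		"sample-cluster",   // accepted by both patterns
		"Sample-cluster",   // rejected by both patterns
		"1-sample-cluster", // leading digit
		"sample.cluster",   // contains a dot
	} {
		fmt.Printf("%-18s newly rejected=%v\n", name, newlyRejected(name))
	}
}
```

Under these two patterns, the names that change status are those that start with a digit or contain a dot.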

@rueian rueian marked this pull request as ready for review March 27, 2025 18:11
@kevin85421 kevin85421 self-assigned this Mar 27, 2025
Member

@kevin85421 kevin85421 left a comment

Would you mind opening an issue to make sure we document this breaking change in the v1.4.0 release notes?

@kevin85421 kevin85421 merged commit 59939cb into ray-project:master Apr 3, 2025
21 checks passed
@rueian
Contributor Author

rueian commented Apr 3, 2025

Would you mind opening an issue to make sure we document this breaking change in the v1.4.0 release notes?

Sure. #3271

@andrewsykim
Member

Names starting with dots or upper-case letters are fine because the previous validation didn't allow them:

$ kubectl ray create cluster Sample-cluster
Error: Failed to create Ray cluster with: RayCluster.ray.io "Sample-cluster" is invalid: metadata.name: Invalid value: "Sample-cluster": a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')

But disallowing RayClusters that start with numbers seems like a breaking change:

$ kubectl ray create cluster 1-sample-cluster
Created Ray Cluster: 1-sample-cluster

Is there a really good reason why we should break support for this naming scheme? Could we adapt validation to still allow it?

@rueian
Contributor Author

rueian commented Apr 3, 2025

Is there a really good reason why we should break support for this naming scheme?

The reason is that users tend to assume the generated service name is just {cr-name}-head-svc or {cr-name}-serve-svc and use those names in other systems. #2169 is an example. However, that assumption does not always hold in v1.3.

In v1.3, the prefix of a generated service name is trimmed if the CR name is too long, or is prefixed with r if the CR name starts with a number.
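A rough sketch of that kind of mutation follows. It is an illustration only, not the v1.3 source: `mutateName` is a hypothetical helper, and both the 50-character limit and the choice to replace (rather than prepend to) the leading digit are assumptions picked so that the sketch reproduces the example in the PR description.

```go
package main

import (
	"fmt"
	"unicode"
)

// assumedMaxLength is an assumption chosen so that the sketch reproduces the
// example in the PR description; it is not stated in this conversation.
const assumedMaxLength = 50

// mutateName is a hypothetical helper that mimics the behavior described
// above: keep only the last assumedMaxLength characters of an over-long
// name, then replace a leading digit with "r" (an assumed detail).
func mutateName(name string) string {
	if len(name) > assumedMaxLength {
		name = name[len(name)-assumedMaxLength:]
	}
	if name != "" && unicode.IsDigit(rune(name[0])) {
		name = "r" + name[1:]
	}
	return name
}

func main() {
	crName := "82ac5612-3bd9-4ef9-8828-5133dfe1a1fa-raycluster-pfsxh"
	guessed := crName + "-head-svc" // what a user would expect
	fmt.Println("guessed:", guessed)
	fmt.Println("actual: ", mutateName(guessed))
}
```

The guessed and actual names differ, which is the mismatch behind the "no such host" error in the PR description.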

We want that assumption to always hold for a better user experience, so #3083 and this PR add validation for the length limits and the DNS-1035 rules.
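With both checks in place, the service name can be derived by appending a fixed suffix. The sketch below assumes the 53-character RayCluster limit mentioned earlier in this conversation and the DNS-1035 regex from the Kubernetes error message; `headServiceName` is a hypothetical helper, not the controller's function.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
)

// maxRayClusterNameLength is the RayCluster limit quoted in this conversation.
const maxRayClusterNameLength = 53

// dns1035Label is the regex from the Kubernetes error message, anchored.
var dns1035Label = regexp.MustCompile(`^[a-z]([-a-z0-9]*[a-z0-9])?$`)

// headServiceName is a hypothetical helper: it validates the CR name first
// and then appends a fixed suffix, so the name is never mutated or trimmed.
func headServiceName(clusterName string) (string, error) {
	if len(clusterName) > maxRayClusterNameLength {
		return "", fmt.Errorf("RayCluster name should be no more than %d characters", maxRayClusterNameLength)
	}
	if !dns1035Label.MatchString(clusterName) {
		return "", errors.New("RayCluster name must be a DNS-1035 label")
	}
	return clusterName + "-head-svc", nil
}

func main() {
	for _, name := range []string{"raycluster-sample", "1-sample-cluster"} {
		svc, err := headServiceName(name)
		if err != nil {
			fmt.Printf("%s: rejected: %v\n", name, err)
			continue
		}
		fmt.Printf("%s: service %s\n", name, svc)
	}
}
```

Because invalid names are rejected before any service is created, a user who builds {cr-name}-head-svc by hand gets the same string the sketch returns.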

win5923 pushed a commit to win5923/kuberay that referenced this pull request Apr 27, 2025