[core] Get cloud provider with ray on kubernetes #51793

Open
wants to merge 6 commits into
base: master

Conversation

dayshah
Contributor

@dayshah dayshah commented Mar 28, 2025

Why are these changes needed?

On GKE

dhyey@cloudshell:~ (dhyey-dev)$ kubectl exec -it $HEAD_POD -- python -c "import requests; print(requests.get('http://metadata.google.internal/computeMetadata/v1',headers={'Metadata-Flavor': 'Google'}))"
Defaulted container "ray-head" out of: ray-head, autoscaler
<Response [200]>
dhyey@cloudshell:~ (dhyey-dev)$ kubectl exec -it $HEAD_POD -- python -c "import requests; print(requests.get('http://169.254.169.254/latest/meta-data/'))"
Defaulted container "ray-head" out of: ray-head, autoscaler
<Response [404]>
dhyey@cloudshell:~ (dhyey-dev)$ kubectl exec -it $HEAD_POD -- python -c "import requests; print(requests.get('http://169.254.169.254/metadata/instance?api-version=2021-02-01'))"
Defaulted container "ray-head" out of: ray-head, autoscaler
<Response [404]>

On Anyscale on EKS (the Google metadata request results in a ConnectionError)

>>> print(requests.get('http://169.254.169.254/latest/meta-data/'))
<Response [200]>
>>> print(requests.get('http://169.254.169.254/metadata/instance?api-version=2021-02-01'))
<Response [404]>

Note: Untested on Azure.

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@dayshah dayshah added go add ONLY when ready to merge, run all tests and removed go add ONLY when ready to merge, run all tests labels Mar 28, 2025
Signed-off-by: dayshah <[email protected]>
@dayshah dayshah added the go add ONLY when ready to merge, run all tests label Mar 28, 2025
@dayshah dayshah marked this pull request as ready for review March 28, 2025 21:07
@@ -81,6 +81,7 @@ class ClusterConfigToReport:
     max_workers: Optional[int] = None
     head_node_instance_type: Optional[str] = None
     worker_node_instance_types: Optional[List[str]] = None
+    cloud_provider_alt: Optional[str] = None
Collaborator

We cannot change the schema here without also changing the server, since the server does the schema validation. Let's discuss offline how to change the schema.

Contributor Author

Updated: added the field to UsageStatsToReport. Is that all?

Contributor Author

I also updated the schema in the test.
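
To illustrate the schema concern raised in this thread, here is a minimal sketch, assuming the server validates reports against a strict allow-list of keys. `ClusterConfigSketch`, `OLD_SERVER_KEYS`, and `unknown_keys` are hypothetical names used only for illustration; they are not Ray's actual validation code.

```python
from dataclasses import asdict, dataclass
from typing import List, Optional, Set


@dataclass
class ClusterConfigSketch:
    """Hypothetical stand-in for ClusterConfigToReport (subset of fields)."""

    max_workers: Optional[int] = None
    head_node_instance_type: Optional[str] = None
    worker_node_instance_types: Optional[List[str]] = None
    cloud_provider_alt: Optional[str] = None  # the field added in this PR


# Hypothetical server-side allow-list that predates the new field.
OLD_SERVER_KEYS: Set[str] = {
    "max_workers",
    "head_node_instance_type",
    "worker_node_instance_types",
}


def unknown_keys(payload: dict, allowed: Set[str]) -> List[str]:
    """Return the payload keys a strict validator would reject, sorted."""
    return sorted(key for key in payload if key not in allowed)


payload = asdict(ClusterConfigSketch(max_workers=4, cloud_provider_alt="gcp"))

# A server that has not been updated rejects the new key ...
rejected_by_old_server = unknown_keys(payload, OLD_SERVER_KEYS)
# ... while one whose schema includes it accepts the whole payload.
rejected_by_new_server = unknown_keys(
    payload, OLD_SERVER_KEYS | {"cloud_provider_alt"}
)
```

Under that assumption, the client and server schemas have to change together, which is why the test schema needed updating as well.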

Comment on lines 784 to 818
import requests

# Make internal metadata requests to all 3 clouds
# The requests may be rejected based on pod configuration but if it's a machine on the cloud provider it should at least be reachable.
try:
    gcp_get_res = requests.get(
        "http://metadata.google.internal/computeMetadata/v1",
        headers={"Metadata-Flavor": "Google"},
        timeout=1,
    )
    if gcp_get_res.status_code != 404:
        result.cloud_provider_alt = "gcp"
except requests.exceptions.ConnectionError:
    pass

try:
    aws_get_res = requests.get(
        "http://169.254.169.254/latest/meta-data/", timeout=1
    )
    if aws_get_res.status_code != 404:
        result.cloud_provider_alt = "aws"
except requests.exceptions.ConnectionError:
    pass

try:
    azure_get_res = requests.get(
        "http://169.254.169.254/metadata/instance?api-version=2021-02-01",
        headers={"Metadata": "true"},
        timeout=1,
    )
    if azure_get_res.status_code != 404:
        result.cloud_provider_alt = "azure"
except requests.exceptions.ConnectionError:
    pass

Collaborator

@MortalHappiness could you review this part?

Member

@MortalHappiness MortalHappiness Mar 29, 2025

@jjyao Do you mean I need to create a Kubernetes cluster on GCP and AWS and test this manually? By the way, I don't have access to Azure either.

Member

I think we should use if-else here. If the request to http://metadata.google.internal/computeMetadata/v1 succeeds, then we don't need to make requests to the other two URLs.
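
A minimal sketch of the short-circuit idea, assuming the same "any non-404 response counts as a match" rule as the snippet under review. `METADATA_ENDPOINTS`, `http_status`, and `detect_cloud_provider` are hypothetical names, not code from this PR. The sketch also catches `requests.exceptions.RequestException`, the base class that additionally covers read timeouts, whereas the snippet under review catches only `ConnectionError`; that broader catch is an assumption of this sketch.

```python
from typing import Callable, Dict, Optional, Tuple

# (provider, URL, headers), in the order the snippet under review tries them.
METADATA_ENDPOINTS: Tuple[Tuple[str, str, Dict[str, str]], ...] = (
    (
        "gcp",
        "http://metadata.google.internal/computeMetadata/v1",
        {"Metadata-Flavor": "Google"},
    ),
    ("aws", "http://169.254.169.254/latest/meta-data/", {}),
    (
        "azure",
        "http://169.254.169.254/metadata/instance?api-version=2021-02-01",
        {"Metadata": "true"},
    ),
)

# A probe takes (url, headers) and returns a status code, or None if unreachable.
Probe = Callable[[str, Dict[str, str]], Optional[int]]


def http_status(url: str, headers: Dict[str, str]) -> Optional[int]:
    """Return the HTTP status code for url, or None if the request fails."""
    import requests  # imported lazily, as in the snippet under review

    try:
        return requests.get(url, headers=headers, timeout=1).status_code
    except requests.exceptions.RequestException:
        return None


def detect_cloud_provider(probe: Probe = http_status) -> Optional[str]:
    """Probe endpoints in order and stop at the first non-404 response."""
    for provider, url, headers in METADATA_ENDPOINTS:
        status = probe(url, headers)
        if status is not None and status != 404:
            return provider  # short-circuit: skip the remaining endpoints
    return None
```

Passing the probe as a parameter keeps the selection logic testable without network access; for example, a fake probe that returns `None` for the Google URL and 200 for the AWS URL reproduces the EKS observations from the PR description.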

Member

Or should we make requests in parallel to ensure the timeout is at most 1 second? In your current implementation, the worst-case timeout is 3 seconds. Not sure if timing is critical here.
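
A minimal sketch of the parallel variant, assuming a thread pool is acceptable at this point in startup. `ENDPOINTS`, `http_status`, and `detect_cloud_provider_parallel` are hypothetical names, not code from this PR. With one thread per endpoint, the wall-clock time is roughly that of the slowest single probe rather than the sum of all three.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Optional, Tuple

Endpoint = Tuple[str, str, Dict[str, str]]  # (provider, URL, headers)

ENDPOINTS: List[Endpoint] = [
    (
        "gcp",
        "http://metadata.google.internal/computeMetadata/v1",
        {"Metadata-Flavor": "Google"},
    ),
    ("aws", "http://169.254.169.254/latest/meta-data/", {}),
    (
        "azure",
        "http://169.254.169.254/metadata/instance?api-version=2021-02-01",
        {"Metadata": "true"},
    ),
]


def http_status(url: str, headers: Dict[str, str]) -> Optional[int]:
    """Return the HTTP status code for url, or None if the request fails."""
    import requests  # imported lazily, as in the snippet under review

    try:
        return requests.get(url, headers=headers, timeout=1).status_code
    except requests.exceptions.RequestException:
        return None


def detect_cloud_provider_parallel(
    probe: Callable[[str, Dict[str, str]], Optional[int]] = http_status,
) -> Optional[str]:
    """Probe all endpoints concurrently; the first match in list order wins."""
    with ThreadPoolExecutor(max_workers=len(ENDPOINTS)) as pool:
        # pool.map returns results in the same order as ENDPOINTS.
        statuses = list(pool.map(lambda e: probe(e[1], e[2]), ENDPOINTS))
    for (provider, _url, _headers), status in zip(ENDPOINTS, statuses):
        if status is not None and status != 404:
            return provider
    return None
```

The trade-off against the sequential if/else approach is that all three requests are always sent, even when the first one would have matched.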

Contributor Author

  1. Updated to if/elif.
  2. Timing isn't critical as far as I know; it only runs once at the start of the UsageStatsHead run. I'm open to making three async requests, though.

dayshah added 4 commits March 28, 2025 14:14
Signed-off-by: dayshah <[email protected]>
Signed-off-by: dayshah <[email protected]>
Signed-off-by: dayshah <[email protected]>
Labels
go add ONLY when ready to merge, run all tests
3 participants