Skip to content

Google Cloud Storage Integration #683

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

anishxyz
Copy link

@anishxyz anishxyz commented Feb 3, 2025

Pull Request Summary

What is this PR changing? Why is this change being made? Any caveats you'd like to highlight? Link any relevant documents, links, or screenshots here if applicable.

Test Plan and Usage Guide

How did you validate that your PR works correctly? How do you run or demo the code? Provide enough detail so a reviewer can reasonably reproduce the testing procedure. Paste example command line invocations if applicable.

from model_engine_server.infra.gateways.gcs_filesystem_gateway import GCSFilesystemGateway


def get_gcs_key(owner: str, file_id: str) -> str:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I'd prefix these w/ an underscore so that no one is tempted to try and import these from outside this file, thus breaking Clean Architecture norms.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw I think the s3_file_storage_gateway also doesn't have the prefixes

"""
try:
client = self.filesystem_gateway.get_storage_client({})
bucket = client.bucket(infra_config().gcs_bucket)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this pattern was already there, but I think it'd probably make more sense to pass in the bucket into the constructor of this class. This way, there's one less dependency on the old infra_config object. @seanshi-scale @tiffzhao5 thoughts?

Could also make the argument to just pass in the bucket as an argument with every get_file call, but that's outside of the scope of this change I'd say.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with having the bucket be passed in the constructor (in addition to anything else from any configs); dependencies.py does read in from infra_config at times to figure out constructor arguments, so there's precedent already

blobs = bucket.list_blobs(prefix=prefix)
downloaded_files = []

for blob in blobs:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this is just sequentially downloading the files? Is this the recommended way? ChatGPT showed me two responses, (1) with a ThreadPoolExecutor, and (2) this one.

@@ -0,0 +1,104 @@
import os
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note: I think this is only really used for some fine tuning apis that aren't really used at this point, think it's fine to keep ofc since you'll probably need to initialize dependencies anyways, but this code probably won't really get exercised at all

Retrieve or create a Google Cloud Storage client. Could optionally
utilize environment variables or passed-in credentials.
"""
project = kwargs.get("gcp_project", os.getenv("GCP_PROJECT"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where does this env var get set? it seems analogous to AWS_PROFILE but those changes would need to be baked into any relevant k8s yamls most likely

files = [blob.name for blob in bucket.list_blobs(prefix=prefix)]
return files

def download_files(self, path: str, target_path: str, overwrite=False, **kwargs) -> List[str]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC with the current state of the code, this only gets called if you use TGI, LightLLM, TensorRT-LLM, Deepspeed as inference frameworks, so I doubt this code ends up getting exercised in practice (it's only used to download a tokenizer to count tokens on the Gateway)


for blob in blobs:
# Remove prefix and leading slash to derive local name
file_path_suffix = blob.name.replace(prefix, "").lstrip("/")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you want to replace(prefix, "", count=1)? (or something like that) just in case the prefix appears elsewhere in the string

for that matter is that also a bug in the s3 implementation?

Copy link
Contributor

@seanshi-scale seanshi-scale left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good, just a few things that looked like bugs (from the s3 implementation)

@anishxyz
Copy link
Author

anishxyz commented Feb 5, 2025

Hi! Sorry should have made this a draft pr but still need to test and work on it. I put it up mainly to coordinate with aagam

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants