Google Cloud Storage Integration #683
Conversation
from model_engine_server.infra.gateways.gcs_filesystem_gateway import GCSFilesystemGateway

def get_gcs_key(owner: str, file_id: str) -> str:
nit: I'd prefix these with an underscore so that no one is tempted to try to import them from outside this file, thus breaking Clean Architecture norms.
btw I think the s3_file_storage_gateway also doesn't have the prefixes
""" | ||
try: | ||
client = self.filesystem_gateway.get_storage_client({}) | ||
bucket = client.bucket(infra_config().gcs_bucket) |
I know this pattern was already there, but I think it'd probably make more sense to pass the bucket into the constructor of this class. This way, there's one less dependency on the old infra_config object. @seanshi-scale @tiffzhao5 thoughts?
Could also make the argument to just pass the bucket as an argument with every get_file call, but that's outside the scope of this change, I'd say.
I'm fine with having the bucket be passed in the constructor (in addition to anything else from any configs); dependencies.py does read from infra_config at times to figure out constructor arguments, so there's precedent already.
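A rough sketch of the constructor-injection idea being discussed, assuming a gateway class and a `get_storage_client` method shaped like the snippets above (class and method names here are illustrative, not the actual code):

```python
class GCSFileStorageGateway:
    """Sketch: the bucket name is injected once at construction time
    (e.g. wired up in dependencies.py), instead of each method
    reading infra_config().gcs_bucket."""

    def __init__(self, filesystem_gateway, bucket_name: str):
        self.filesystem_gateway = filesystem_gateway
        self.bucket_name = bucket_name

    def _bucket(self):
        # The storage client is still fetched per call; only the
        # bucket name is fixed up front.
        client = self.filesystem_gateway.get_storage_client({})
        return client.bucket(self.bucket_name)
```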
blobs = bucket.list_blobs(prefix=prefix)
downloaded_files = []

for blob in blobs:
Looks like this is just sequentially downloading the files? Is this the recommended way? ChatGPT showed me two responses: (1) with a ThreadPoolExecutor, and (2) this one.
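For reference, a hedged sketch of the ThreadPoolExecutor variant mentioned above. `Blob.download_to_filename` is the real google-cloud-storage API, but the helper names and the flat-basename layout are assumptions for illustration:

```python
import os
from concurrent.futures import ThreadPoolExecutor

def _download_one(blob, target_dir: str) -> str:
    # Flattening to the basename is a simplifying assumption;
    # the real gateway derives a path suffix from the prefix.
    local_path = os.path.join(target_dir, os.path.basename(blob.name))
    blob.download_to_filename(local_path)
    return local_path

def download_blobs_parallel(blobs, target_dir: str, max_workers: int = 8):
    # Each download is I/O-bound, so threads let the transfers overlap.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda b: _download_one(b, target_dir), blobs))
```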
import os
note: I think this is only really used for some fine-tuning APIs that aren't really used at this point. Think it's fine to keep, of course, since you'll probably need to initialize dependencies anyway, but this code probably won't really get exercised at all.
Retrieve or create a Google Cloud Storage client. Could optionally
utilize environment variables or passed-in credentials.
"""
project = kwargs.get("gcp_project", os.getenv("GCP_PROJECT"))
where does this env var get set? It seems analogous to AWS_PROFILE, but those changes would need to be baked into any relevant k8s yamls most likely.
files = [blob.name for blob in bucket.list_blobs(prefix=prefix)]
return files

def download_files(self, path: str, target_path: str, overwrite=False, **kwargs) -> List[str]:
IIUC, with the current state of the code, this only gets called if you use TGI, LightLLM, TensorRT-LLM, or Deepspeed as inference frameworks, so I doubt this code ends up getting exercised in practice (it's only used to download a tokenizer to count tokens on the Gateway).
for blob in blobs:
    # Remove prefix and leading slash to derive local name
    file_path_suffix = blob.name.replace(prefix, "").lstrip("/")
do you want to replace(prefix, "", 1)? (or something like that) just in case the prefix appears elsewhere in the string
for that matter is that also a bug in the s3 implementation?
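A small illustration of the bug being flagged: without a count argument, str.replace strips every occurrence of the prefix, not just the leading one. The paths here are made up for the example:

```python
prefix = "models/"
name = "models/checkpoints/models/weights.bin"

# Buggy: removes *every* occurrence of the prefix.
buggy = name.replace(prefix, "").lstrip("/")
# Fixed: the third argument limits replacement to the first occurrence.
fixed = name.replace(prefix, "", 1).lstrip("/")

print(buggy)  # checkpoints/weights.bin -- the inner "models/" is lost too
print(fixed)  # checkpoints/models/weights.bin
```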
looks good, just a few things that looked like bugs (from the s3 implementation)
Hi! Sorry, I should have made this a draft PR, but I still need to test and work on it. I put it up mainly to coordinate with aagam.
Pull Request Summary
What is this PR changing? Why is this change being made? Any caveats you'd like to highlight? Link any relevant documents, links, or screenshots here if applicable.
Test Plan and Usage Guide
How did you validate that your PR works correctly? How do you run or demo the code? Provide enough detail so a reviewer can reasonably reproduce the testing procedure. Paste example command line invocations if applicable.