SageMaker inference download locations are misconfigured, and models are downloaded twice #949

thvasilo · 2024-08-02T23:04:56Z

GSF tries to download the models into /opt/ml/gsgnn_model, as seen here https://github.com/thvasilo/graphstorm/blob/8e7c4c2e10accb114f2beccaa36ec3094d01241c/python/graphstorm/sagemaker/sagemaker_infer.py#L173

On a job with a large model (learnable embeddings included), the logs show the following disk space usage:

Filesystem      Size  Used Avail Use% Mounted on
overlay         120G   31G   90G  26% /
tmpfs            64M     0   64M   0% /dev
tmpfs           374G     0  374G   0% /sys/fs/cgroup
/dev/nvme0n1p1   70G   47G   24G  67% /usr/sbin/docker-init
/dev/nvme2n1   1008G  178G  779G  19% /tmp
shm             372G     0  372G   0% /dev/shm
/dev/nvme1n1    120G   31G   90G  26% /etc/hosts
tmpfs           374G     0  374G   0% /proc/acpi
tmpfs           374G     0  374G   0% /sys/firmware

The partition mounted under / (which I think includes /opt) has only 90GB available.

To be able to download larger datasets/models, we need to use the partition mounted under /tmp.
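As a rough sketch of the idea, the download root could be chosen by free space instead of being hard-coded. The helper name `pick_download_root` and its parameters are hypothetical and not part of GraphStorm; the only library call used is `shutil.disk_usage` from the Python standard library.

```python
import os
import shutil


def pick_download_root(candidates, required_bytes=0):
    """Return the existing candidate directory with the most free space.

    Hypothetical helper for illustration; not part of GraphStorm.
    Raises FileNotFoundError if no candidate exists, and OSError if the
    best candidate has less than `required_bytes` free.
    """
    best_path, best_free = None, -1
    for path in candidates:
        if not os.path.isdir(path):
            continue  # skip mount points that are absent in this container
        free = shutil.disk_usage(path).free
        if free > best_free:
            best_path, best_free = path, free
    if best_path is None:
        raise FileNotFoundError(f"none of {list(candidates)} exist")
    if best_free < required_bytes:
        raise OSError(
            f"{best_path} has {best_free} bytes free, need {required_bytes}"
        )
    return best_path
```

With the disk layout shown above, calling `pick_download_root(["/opt/ml", "/tmp"])` would be expected to prefer /tmp, since that partition has the most space available.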

Also, in our inference launch script we define

https://github.com/thvasilo/graphstorm/blob/8e7c4c2e10accb114f2beccaa36ec3094d01241c/sagemaker/launch/launch_infer.py#L120

That will download the model data from the provided S3 path into /opt/ml/input/data/<channel_name>, which by default for models is /opt/ml/input/data/model (see the Estimator docs).

But then here, we try to download the model again, this time into /opt/ml/gsgnn_model
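One way to avoid the second download, sketched under assumptions: reuse the channel directory when SageMaker has already populated it, and only fall back to an explicit download otherwise. The function `resolve_model_dir` and the `download_fn` callback are hypothetical names for illustration, not GraphStorm's actual API; the paths in the usage note come from the description above.

```python
import os


def resolve_model_dir(channel_dir, fallback_dir, download_fn):
    """Return a directory that contains the model, downloading at most once.

    Hypothetical helper for illustration; not part of GraphStorm.
    channel_dir:  where the SageMaker input channel would place the model.
    fallback_dir: where to download if the channel directory is empty.
    download_fn:  callable taking the target directory and fetching the model.
    """
    # If the channel directory already holds files, reuse it as-is.
    if os.path.isdir(channel_dir) and os.listdir(channel_dir):
        return channel_dir
    # Otherwise download once into the fallback location.
    os.makedirs(fallback_dir, exist_ok=True)
    download_fn(fallback_dir)
    return fallback_dir
```

For example, `resolve_model_dir("/opt/ml/input/data/model", "/tmp/gsgnn_model", download_fn)` would skip the download whenever the model channel is already populated; the /tmp/gsgnn_model fallback path is an assumption chosen to land on the larger partition.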

thvasilo added the bug and 0.4 labels Aug 2, 2024

classicsong assigned thvasilo Aug 19, 2024

classicsong added this to the 0.4 release milestone Aug 22, 2024
