
LLAMA3_1-405B-99 Docker Cmind not found issues #2105

Open
hhuo24pm opened this issue Feb 11, 2025 · 12 comments
@hhuo24pm

hhuo24pm commented Feb 11, 2025

(following the instructions at https://docs.mlcommons.org/inference/benchmarks/language/llama3_1-405b/)
mlcr run-mlperf,inference,_find-performance,_full,_r5.0-dev
--model=llama3_1-405b-99
--implementation=reference
--framework=pytorch
--category=datacenter
--scenario=Offline
--execution_mode=test
--device=cpu
--docker --quiet
--test_query_count=10

Traceback (most recent call last):
File "/home/hhremote/mlenergy2/bin/mlcr", line 8, in <module>
sys.exit(mlcr())
^^^^^^
File "/home/hhremote/mlenergy2/lib/python3.12/site-packages/mlc/main.py", line 1670, in mlcr
main()
File "/home/hhremote/mlenergy2/lib/python3.12/site-packages/mlc/main.py", line 1752, in main
res = method(run_args)
^^^^^^^^^^^^^^^^
File "/home/hhremote/mlenergy2/lib/python3.12/site-packages/mlc/main.py", line 1511, in run
return self.call_script_module_function("run", run_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/hhremote/mlenergy2/lib/python3.12/site-packages/mlc/main.py", line 1491, in call_script_module_function
result = automation_instance.run(run_args) # Pass args to the run method
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/hhremote/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 219, in run
r = self._run(i)
^^^^^^^^^^^^
File "/home/hhremote/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 281, in _run
from cmind import cli
ModuleNotFoundError: No module named 'cmind'

Error after installing cmind:

mlcr run-mlperf,inference,_find-performance,_full,_r5.0-dev
--model=llama3_1-405b-99
--implementation=reference
--framework=pytorch
--category=datacenter
--scenario=Offline
--execution_mode=test
--device=cpu
--docker --quiet
--test_query_count=10
Traceback (most recent call last):
File "/home/hhremote/mlenergy2/bin/mlcr", line 8, in <module>
sys.exit(mlcr())
^^^^^^
File "/home/hhremote/mlenergy2/lib/python3.12/site-packages/mlc/main.py", line 1670, in mlcr
main()
File "/home/hhremote/mlenergy2/lib/python3.12/site-packages/mlc/main.py", line 1752, in main
res = method(run_args)
^^^^^^^^^^^^^^^^
File "/home/hhremote/mlenergy2/lib/python3.12/site-packages/mlc/main.py", line 1511, in run
return self.call_script_module_function("run", run_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/hhremote/mlenergy2/lib/python3.12/site-packages/mlc/main.py", line 1491, in call_script_module_function
result = automation_instance.run(run_args) # Pass args to the run method
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/hhremote/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 219, in run
r = self._run(i)
^^^^^^^^^^^^
File "/home/hhremote/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 286, in _run
mlc_input = r['mlc_input']
~^^^^^^^^^^^^^
KeyError: 'mlc_input'

@arjunsuresh
Contributor

Hi @hhuo24pm, it looks like you are on an old version of the mlperf-automations repository. Can you please do mlc pull repo and share the output?
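(The ModuleNotFoundError above can also be confirmed directly. A minimal stdlib-only sketch, not part of the MLC tooling; the helper name is illustrative:)

```python
import importlib.util

def module_available(name: str) -> bool:
    """Return True if a top-level module can be imported in the
    currently active environment, without actually importing it."""
    return importlib.util.find_spec(name) is not None

# In the failing environment above, this would indicate whether
# cmind is installed for the same interpreter that runs mlcr.
print(module_available("cmind"))
```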

@arjunsuresh
Contributor

But llama3-405b is too large a model to try on CPU.

@hhuo24pm
Author

I am having similar issues trying to run other benchmarks like ResNet50. This is the result of running mlc pull repo:

mlc pull repo
[2025-02-11 17:19:37,573 main.py:1275 INFO] - Repository mlperf-automations already exists at /home/hhremote/MLC/repos/mlcommons@mlperf-automations. Pulling latest changes...
You are not currently on a branch.
Please specify which branch you want to merge with.
See git-pull(1) for details.

git pull <remote> <branch>

[2025-02-11 17:19:37,912 main.py:1754 ERROR] - Git command failed: Command '['git', '-C', '/home/hhremote/MLC/repos/mlcommons@mlperf-automations', 'pull']' returned non-zero exit status 1.
Traceback (most recent call last):
File "/home/hhremote/mlenergy3/bin/mlc", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/hhremote/mlenergy3/lib/python3.12/site-packages/mlc/main.py", line 1755, in main
raise Exception(f"""An error occurred {res}""")
Exception: An error occurred {'return': 1, 'error': "Git command failed: Command '['git', '-C', '/home/hhremote/MLC/repos/mlcommons@mlperf-automations', 'pull']' returned non-zero exit status 1."}

@arjunsuresh
Contributor

Oh, looks like it is an old dev version of mlcflow. Can you please do:

rm -rf $HOME/MLC
pip install --upgrade mlcflow
mlc pull repo mlcommons@mlperf-automations --branch=dev

@hhuo24pm
Author

hhuo24pm commented Feb 11, 2025

I removed MLC and then pulled the appropriate mlcommons mlperf-automations repo, and the previous error is seemingly resolved.
But when running ResNet50 on CPU with Docker, this is the error that follows after preprocessing the images:

ILSVRC2012_val_00002155.JPEG
ILSVRC2012_val_00000854.JPEG

[2025-02-11 17:57:02,933 module.py:5481 INFO] - ! call "postprocess" from /home/hhremote/MLC/repos/mlcommons@mlperf-automations/script/extract-file/customize.py
[2025-02-11 17:57:02,971 module.py:5481 INFO] - ! call "postprocess" from /home/hhremote/MLC/repos/mlcommons@mlperf-automations/script/get-dataset-imagenet-val/customize.py
[2025-02-11 17:57:03,115 module.py:560 INFO] - * mlcr run,docker,container
[2025-02-11 17:57:03,861 module.py:560 INFO] - * mlcr get,docker
[2025-02-11 17:57:04,015 module.py:1274 INFO] - ! load /home/hhremote/MLC/repos/local/cache/get-docker_25788506/mlc-cached-state.json

Checking existing Docker container:

docker ps --format "{{ .ID }}," --filter "ancestor=localhost/local/mlc-script-app-mlperf-inference-generic--reference--resnet50--onnxruntime--cpu--test--r5.0-dev-default--offline:ubuntu-22.04-latest" 2> /dev/null

Traceback (most recent call last):
File "/home/hhremote/mlenergy4/bin/mlcr", line 8, in <module>
sys.exit(mlcr())
^^^^^^
File "/home/hhremote/mlenergy4/lib/python3.12/site-packages/mlc/main.py", line 1670, in mlcr
main()
File "/home/hhremote/mlenergy4/lib/python3.12/site-packages/mlc/main.py", line 1752, in main
res = method(run_args)
^^^^^^^^^^^^^^^^
File "/home/hhremote/mlenergy4/lib/python3.12/site-packages/mlc/main.py", line 1511, in run
return self.call_script_module_function("run", run_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/hhremote/mlenergy4/lib/python3.12/site-packages/mlc/main.py", line 1491, in call_script_module_function
result = automation_instance.run(run_args) # Pass args to the run method
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/hhremote/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 225, in run
r = self._run(i)
^^^^^^^^^^^^
File "/home/hhremote/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 1772, in _run
r = customize_code.preprocess(ii)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/hhremote/MLC/repos/mlcommons@mlperf-automations/script/run-mlperf-inference-app/customize.py", line 286, in preprocess
r = mlc.access(ii)
^^^^^^^^^^^^^^
File "/home/hhremote/mlenergy4/lib/python3.12/site-packages/mlc/main.py", line 96, in access
result = method(self, options)
^^^^^^^^^^^^^^^^^^^^^
File "/home/hhremote/mlenergy4/lib/python3.12/site-packages/mlc/main.py", line 1508, in docker
return self.call_script_module_function("docker", run_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/hhremote/mlenergy4/lib/python3.12/site-packages/mlc/main.py", line 1493, in call_script_module_function
result = automation_instance.docker(run_args) # Pass args to the run method
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/hhremote/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 4691, in docker
return docker_run(self, i)
^^^^^^^^^^^^^^^^^^^
File "/home/hhremote/MLC/repos/mlcommons@mlperf-automations/automation/script/docker.py", line 381, in docker_run
r = self_module.action_object.access(mlc_docker_input)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/hhremote/mlenergy4/lib/python3.12/site-packages/mlc/main.py", line 96, in access
result = method(self, options)
^^^^^^^^^^^^^^^^^^^^^
File "/home/hhremote/mlenergy4/lib/python3.12/site-packages/mlc/main.py", line 1511, in run
return self.call_script_module_function("run", run_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/hhremote/mlenergy4/lib/python3.12/site-packages/mlc/main.py", line 1501, in call_script_module_function
raise ScriptExecutionError(f"Script {function_name} execution failed. Error : {error}")
mlc.main.ScriptExecutionError: Script run execution failed. Error : Unexpected error occurred with docker run:
Command 'docker ps --format "{{ .ID }}," --filter "ancestor=localhost/local/mlc-script-app-mlperf-inference-generic--reference--resnet50--onnxruntime--cpu--test--r5.0-dev-default--offline:ubuntu-22.04-latest" 2> /dev/null' returned non-zero exit status 1.

@arjunsuresh
Contributor

Looks like Docker failed. Can you please share the output of the commands below?

docker ps --format "{{ .ID }}," --filter "ancestor=localhost/local/mlc-script-app-mlperf-inference-generic--reference--resnet50--onnxruntime--cpu--test--r5.0-dev-default--offline:ubuntu-22.04-latest"
echo $?

@hhuo24pm
Author

With sudo it just returns 0; without sudo it fails with a permission error:

docker ps --format "{{ .ID }}," --filter "ancestor=localhost/local/mlc-script-app-mlperf-inference-generic--reference--resnet50--onnxruntime--cpu--test--r5.0-dev-default--offline:ubuntu-22.04-latest"
echo $?
permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Get "http://%2Fvar%2Frun%2Fdocker.sock/v1.47/containers/json?filters=%7B%22ancestor%22%3A%7B%22localhost%2Flocal%2Fmlc-script-app-mlperf-inference-generic--reference--resnet50--onnxruntime--cpu--test--r5.0-dev-default--offline%3Aubuntu-22.04-latest%22%3Atrue%7D%7D": dial unix /var/run/docker.sock: connect: permission denied
1
(mlenergy4) hhremote@hhjww-desktop:~$ sudo docker ps --format "{{ .ID }}," --filter "ancestor=localhost/local/mlc-script-app-mlperf-inference-generic--reference--resnet50--onnxruntime--cpu--test--r5.0-dev-default--offline:ubuntu-22.04-latest"
echo $?
[sudo] password for hhremote:
0

@arjunsuresh
Contributor

Oh, so that's the problem. The user hhremote is not in the docker group. Is it possible to do

sudo usermod -aG docker hhremote

After this you probably need to restart the shell to make it effective.
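(The "permission denied ... /var/run/docker.sock" message above is the daemon's UNIX socket rejecting a user outside the docker group. A small stdlib-only Python sketch that reproduces the same check; the socket path is the Docker default and may differ on some setups, and the function name is illustrative:)

```python
import os
import socket

DOCKER_SOCK = "/var/run/docker.sock"  # default Docker daemon socket path

def can_reach_docker_socket(path: str = DOCKER_SOCK) -> bool:
    """Best-effort check: can the current user connect to the
    Docker daemon's UNIX socket?"""
    if not os.path.exists(path):
        return False  # daemon not running, or a non-default socket path
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        s.connect(path)
        return True
    except PermissionError:
        return False  # typically: user not in the docker group
    except OSError:
        return False  # exists but is not a connectable socket
    finally:
        s.close()
```

After `usermod -aG docker` and a fresh login shell, this check should flip from False to True.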

@hhuo24pm
Author

hhuo24pm commented Feb 12, 2025

Thank you for helping with the diagnosis, but running ResNet50 again produced a different error:
(the configuration is the same: ResNet50, onnxruntime, Offline, Docker)

 ./run_local.sh onnxruntime resnet50 gpu --scenario Offline    --threads 2 --user_conf '/home/hhremote/MLC/repos/mlcommons@mlperf-automations/script/generate-mlperf-inference-user-conf/tmp/cd6a056c64d64984a8c064eb63c8d919.conf' --use_preprocessed_dataset --cache_dir /home/hhremote/MLC/repos/local/cache/get-preprocessed-dataset-imagenet_0f6d302a --dataset-list /home/hhremote/MLC/repos/local/cache/extract-file_6834e885/val.txt 2>&1 | tee '/home/hhremote/MLC/repos/local/cache/get-mlperf-inference-results-dir_3a62a57f/test_results/hhjww_desktop-reference-gpu-onnxruntime-v1.20.1-cu118/resnet50/offline/performance/run_1/console.out'; echo ${PIPESTATUS[0]} > exitstatus
python3 python/main.py --profile resnet50-onnxruntime --model "/home/hhremote/MLC/repos/local/cache/download-file_5b804679/resnet50_v1.onnx" --dataset-path /home/hhremote/MLC/repos/local/cache/get-preprocessed-dataset-imagenet_0f6d302a --output "/home/hhremote/MLC/repos/local/cache/get-mlperf-inference-results-dir_3a62a57f/test_results/hhjww_desktop-reference-gpu-onnxruntime-v1.20.1-cu118/resnet50/offline/performance/run_1" --scenario Offline --threads 2 --user_conf /home/hhremote/MLC/repos/mlcommons@mlperf-automations/script/generate-mlperf-inference-user-conf/tmp/cd6a056c64d64984a8c064eb63c8d919.conf --use_preprocessed_dataset --cache_dir /home/hhremote/MLC/repos/local/cache/get-preprocessed-dataset-imagenet_0f6d302a --dataset-list /home/hhremote/MLC/repos/local/cache/extract-file_6834e885/val.txt
INFO:main:Namespace(dataset='imagenet', dataset_path='/home/hhremote/MLC/repos/local/cache/get-preprocessed-dataset-imagenet_0f6d302a', dataset_list='/home/hhremote/MLC/repos/local/cache/extract-file_6834e885/val.txt', data_format=None, profile='resnet50-onnxruntime', scenario='Offline', max_batchsize=32, model='/home/hhremote/MLC/repos/local/cache/download-file_5b804679/resnet50_v1.onnx', output='/home/hhremote/MLC/repos/local/cache/get-mlperf-inference-results-dir_3a62a57f/test_results/hhjww_desktop-reference-gpu-onnxruntime-v1.20.1-cu118/resnet50/offline/performance/run_1', inputs=None, outputs=['ArgMax:0'], backend='onnxruntime', device=None, model_name='resnet50', threads=2, qps=None, cache=0, cache_dir='/home/hhremote/MLC/repos/local/cache/get-preprocessed-dataset-imagenet_0f6d302a', preprocessed_dir=None, use_preprocessed_dataset=True, accuracy=False, find_peak_performance=False, debug=False, user_conf='/home/hhremote/MLC/repos/mlcommons@mlperf-automations/script/generate-mlperf-inference-user-conf/tmp/cd6a056c64d64984a8c064eb63c8d919.conf', audit_conf='audit.config', time=None, count=None, performance_sample_count=None, max_latency=None, samples_per_query=8)
INFO:imagenet:Loading 50000 preprocessed images using 2 threads
INFO:imagenet:reduced image list, 31401 images not found
INFO:imagenet:loaded 18599 images, cache=0, already_preprocessed=True, took=0.4sec
/home/hhremote/MLC/repos/local/cache/install-python-venv_e61e4ede/mlperf/lib/python3.12/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py:115: UserWarning: Specified provider 'CUDAExecutionProvider' is not in available provider names.Available providers: 'AzureExecutionProvider, CPUExecutionProvider'
  warnings.warn(
INFO:main:starting TestScenario.Offline
Traceback (most recent call last):
  File "/home/hhremote/MLC/repos/local/cache/get-git-repo_4c38de27/inference/vision/classification_and_detection/python/main.py", line 781, in <module>
    main()
  File "/home/hhremote/MLC/repos/local/cache/get-git-repo_4c38de27/inference/vision/classification_and_detection/python/main.py", line 752, in main
    lg.StartTestWithLogSettings(sut, qsl, settings, log_settings, audit_config)
  File "/home/hhremote/MLC/repos/local/cache/get-git-repo_4c38de27/inference/vision/classification_and_detection/python/dataset.py", line 66, in load_query_samples
    self.image_list_inmemory[sample], _ = self.get_item(sample)
                                          ^^^^^^^^^^^^^^^^^^^^^
  File "/home/hhremote/MLC/repos/local/cache/get-git-repo_4c38de27/inference/vision/classification_and_detection/python/imagenet.py", line 173, in get_item
    img = np.load(dst + ".npy")
          ^^^^^^^^^^^^^^^^^^^^^
  File "/home/hhremote/MLC/repos/local/cache/install-python-venv_e61e4ede/mlperf/lib/python3.12/site-packages/numpy/lib/npyio.py", line 436, in load
    raise EOFError("No data left in file")
EOFError: No data left in file
malloc(): unsorted double linked list corrupted
./run_local.sh: line 30: 1938131 Aborted                 (core dumped) python3 python/main.py --profile resnet50-onnxruntime --model "/home/hhremote/MLC/repos/local/cache/download-file_5b804679/resnet50_v1.onnx" --dataset-path /home/hhremote/MLC/repos/local/cache/get-preprocessed-dataset-imagenet_0f6d302a --output "/home/hhremote/MLC/repos/local/cache/get-mlperf-inference-results-dir_3a62a57f/test_results/hhjww_desktop-reference-gpu-onnxruntime-v1.20.1-cu118/resnet50/offline/performance/run_1" --scenario Offline --threads 2 --user_conf /home/hhremote/MLC/repos/mlcommons@mlperf-automations/script/generate-mlperf-inference-user-conf/tmp/cd6a056c64d64984a8c064eb63c8d919.conf --use_preprocessed_dataset --cache_dir /home/hhremote/MLC/repos/local/cache/get-preprocessed-dataset-imagenet_0f6d302a --dataset-list /home/hhremote/MLC/repos/local/cache/extract-file_6834e885/val.txt
Traceback (most recent call last):
  File "/home/hhremote/mlenergy4/bin/mlcr", line 8, in <module>
    sys.exit(mlcr())
             ^^^^^^
  File "/home/hhremote/mlenergy4/lib/python3.12/site-packages/mlc/main.py", line 1670, in mlcr
    main()
  File "/home/hhremote/mlenergy4/lib/python3.12/site-packages/mlc/main.py", line 1752, in main
    res = method(run_args)
          ^^^^^^^^^^^^^^^^
  File "/home/hhremote/mlenergy4/lib/python3.12/site-packages/mlc/main.py", line 1511, in run
    return self.call_script_module_function("run", run_args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hhremote/mlenergy4/lib/python3.12/site-packages/mlc/main.py", line 1491, in call_script_module_function
    result = automation_instance.run(run_args)  # Pass args to the run method
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hhremote/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 225, in run
    r = self._run(i)
        ^^^^^^^^^^^^
  File "/home/hhremote/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 1772, in _run
    r = customize_code.preprocess(ii)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hhremote/MLC/repos/mlcommons@mlperf-automations/script/run-mlperf-inference-app/customize.py", line 286, in preprocess
    r = mlc.access(ii)
        ^^^^^^^^^^^^^^
  File "/home/hhremote/mlenergy4/lib/python3.12/site-packages/mlc/main.py", line 96, in access
    result = method(self, options)
             ^^^^^^^^^^^^^^^^^^^^^
  File "/home/hhremote/mlenergy4/lib/python3.12/site-packages/mlc/main.py", line 1511, in run
    return self.call_script_module_function("run", run_args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hhremote/mlenergy4/lib/python3.12/site-packages/mlc/main.py", line 1491, in call_script_module_function
    result = automation_instance.run(run_args)  # Pass args to the run method
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hhremote/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 225, in run
    r = self._run(i)
        ^^^^^^^^^^^^
  File "/home/hhremote/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 1842, in _run
    r = self._call_run_deps(prehook_deps, self.local_env_keys, local_env_keys_from_meta, env, state, const, const_state, add_deps_recursive,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hhremote/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 3532, in _call_run_deps
    r = script._run_deps(deps, local_env_keys, env, state, const, const_state, add_deps_recursive, recursion_spaces,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hhremote/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 3702, in _run_deps
    r = self.action_object.access(ii)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hhremote/mlenergy4/lib/python3.12/site-packages/mlc/main.py", line 96, in access
    result = method(self, options)
             ^^^^^^^^^^^^^^^^^^^^^
  File "/home/hhremote/mlenergy4/lib/python3.12/site-packages/mlc/main.py", line 1511, in run
    return self.call_script_module_function("run", run_args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hhremote/mlenergy4/lib/python3.12/site-packages/mlc/main.py", line 1491, in call_script_module_function
    result = automation_instance.run(run_args)  # Pass args to the run method
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hhremote/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 225, in run
    r = self._run(i)
        ^^^^^^^^^^^^
  File "/home/hhremote/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 1858, in _run
    r = prepare_and_run_script_with_postprocessing(
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hhremote/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 5488, in prepare_and_run_script_with_postprocessing
    r = script_automation._call_run_deps(posthook_deps, local_env_keys, local_env_keys_from_meta, env, state, const, const_state,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hhremote/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 3532, in _call_run_deps
    r = script._run_deps(deps, local_env_keys, env, state, const, const_state, add_deps_recursive, recursion_spaces,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hhremote/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 3702, in _run_deps
    r = self.action_object.access(ii)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hhremote/mlenergy4/lib/python3.12/site-packages/mlc/main.py", line 96, in access
    result = method(self, options)
             ^^^^^^^^^^^^^^^^^^^^^
  File "/home/hhremote/mlenergy4/lib/python3.12/site-packages/mlc/main.py", line 1511, in run
    return self.call_script_module_function("run", run_args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hhremote/mlenergy4/lib/python3.12/site-packages/mlc/main.py", line 1491, in call_script_module_function
    result = automation_instance.run(run_args)  # Pass args to the run method
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hhremote/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 225, in run
    r = self._run(i)
        ^^^^^^^^^^^^
  File "/home/hhremote/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 1885, in _run
    r = self._run_deps(post_deps, clean_env_keys_post_deps, env, state, const, const_state, add_deps_recursive, recursion_spaces,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hhremote/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 3702, in _run_deps
    r = self.action_object.access(ii)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hhremote/mlenergy4/lib/python3.12/site-packages/mlc/main.py", line 96, in access
    result = method(self, options)
             ^^^^^^^^^^^^^^^^^^^^^
  File "/home/hhremote/mlenergy4/lib/python3.12/site-packages/mlc/main.py", line 1511, in run
    return self.call_script_module_function("run", run_args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hhremote/mlenergy4/lib/python3.12/site-packages/mlc/main.py", line 1501, in call_script_module_function
    raise ScriptExecutionError(f"Script {function_name} execution failed. Error : {error}")
mlc.main.ScriptExecutionError: Script run execution failed. Error : MLC script failed (name = benchmark-program, return code = 34304)


^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Please file an issue at https://github.com/mlcommons/mlperf-automations/issues along with the full MLC command being run and the relevant
or full console log.

@arjunsuresh
Contributor

INFO:imagenet:Loading 50000 preprocessed images using 2 threads
INFO:imagenet:reduced image list, 31401 images not found
INFO:imagenet:loaded 18599 images, cache=0, already_preprocessed=True, took=0.4sec

Looks like the dataset download failed, as only 18599/50000 images were downloaded. Please retry the command after doing mlc rm cache --tags=dataset,imagenet -f.

@anandhu-eng are we not checking the checksum for imagenet download?
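(The EOFError in the run above is np.load hitting a zero-length or cut-off .npy file in the preprocessed cache. A stdlib-only sketch for spotting such files before a run; it assumes .npy format version 1.x and the helper name is mine — np.load remains the authoritative check:)

```python
import os
import struct

NPY_MAGIC = b"\x93NUMPY"  # 6-byte magic prefix of the .npy format

def npy_looks_truncated(path: str) -> bool:
    """Heuristic check for a damaged/truncated .npy file without numpy:
    verifies the magic string and that data exists past the header.
    Assumes format version 1.x (2-byte little-endian header length)."""
    size = os.path.getsize(path)
    if size < 10:            # magic (6) + version (2) + header length (2)
        return True
    with open(path, "rb") as f:
        if f.read(6) != NPY_MAGIC:
            return True
        f.read(2)            # version bytes, e.g. \x01\x00
        (hdr_len,) = struct.unpack("<H", f.read(2))
    # any array payload must start after the fixed preamble + header dict
    return size <= 10 + hdr_len
```

Running this over the get-preprocessed-dataset-imagenet cache directory would flag the files that made np.load raise EOFError.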

@anandhu-eng
Contributor

@arjunsuresh, yes we are. By default, the imagenet dataset is downloaded using cmutil, after which the checksum is verified in run.sh. I have run a full dataset download to verify this.
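(For reference, checksum verification of a downloaded archive reduces to something like the following — a generic stdlib sketch, not the actual run.sh logic; function names are illustrative:)

```python
import hashlib

def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through MD5 in 1 MiB chunks so large archives
    never need to fit in memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_download(path: str, expected_md5: str) -> bool:
    """Compare a downloaded file's digest against the published one."""
    return md5_of_file(path) == expected_md5
```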

@hhuo24pm
Author

INFO:imagenet:Loading 50000 preprocessed images using 2 threads
INFO:imagenet:reduced image list, 31401 images not found
INFO:imagenet:loaded 18599 images, cache=0, already_preprocessed=True, took=0.4sec

Looks like the dataset download failed, as only 18599/50000 images were downloaded. Please retry the command after doing mlc rm cache --tags=dataset,imagenet -f.

@anandhu-eng are we not checking the checksum for imagenet download?

Thank you, this did fix ResNet50 for me; I can now run it.
Unfortunately, I then ran into more issues running 3d-unet with the following configuration: MLCommons-Python, 3d-UNET-99, Edge, pytorch, CUDA, Docker.

Command:
mlcr run-mlperf,inference,_find-performance,_full,_r5.0-dev
--model=3d-unet-99
--implementation=reference
--framework=pytorch
--category=edge
--scenario=Offline
--execution_mode=test
--device=cuda
--docker --quiet
--test_query_count=50

Error message:
mlcr run-mlperf,inference,_find-performance,_full,_r5.0-dev
--model=3d-unet-99
--implementation=reference
--framework=pytorch
--category=edge
--scenario=Offline
--execution_mode=test
--device=cuda
--docker --quiet
--test_query_count=50
[2025-02-20 20:25:56,804 module.py:560 INFO] - * mlcr run-mlperf,inference,_find-performance,_full,_r5.0-dev
[2025-02-20 20:25:56,906 module.py:560 INFO] - * mlcr get,mlcommons,inference,src
[2025-02-20 20:25:56,908 module.py:1274 INFO] - ! load /home/hhremote/MLC/repos/local/cache/get-mlperf-inference-src_6c38b7d0/mlc-cached-state.json
[2025-02-20 20:25:56,934 module.py:560 INFO] - * mlcr get,mlperf,inference,results,dir,_version.r5.0-dev
[2025-02-20 20:25:56,935 module.py:1274 INFO] - ! load /home/hhremote/MLC/repos/local/cache/get-mlperf-inference-results-dir_3a62a57f/mlc-cached-state.json
[2025-02-20 20:25:56,957 module.py:560 INFO] - * mlcr install,pip-package,for-mlc-python,_package.tabulate
[2025-02-20 20:25:56,960 module.py:1274 INFO] - ! load /home/hhremote/MLC/repos/local/cache/install-pip-package-for-mlc-python_87ea312c/mlc-cached-state.json
[2025-02-20 20:25:56,981 module.py:560 INFO] - * mlcr get,mlperf,inference,utils
[2025-02-20 20:25:57,020 module.py:560 INFO] - * mlcr get,mlperf,inference,src
[2025-02-20 20:25:57,022 module.py:1274 INFO] - ! load /home/hhremote/MLC/repos/local/cache/get-mlperf-inference-src_6c38b7d0/mlc-cached-state.json
[2025-02-20 20:25:57,026 module.py:5481 INFO] - ! call "postprocess" from /home/hhremote/MLC/repos/mlcommons@mlperf-automations/script/get-mlperf-inference-utils/customize.py
Using MLCommons Inference source from /home/hhremote/MLC/repos/local/cache/get-git-repo_4c38de27/inference

Running loadgen scenario: Offline and mode: performance
[2025-02-20 20:25:57,135 module.py:560 INFO] - * mlcr build,dockerfile
[2025-02-20 20:25:57,232 module.py:560 INFO] - * mlcr get,docker
[2025-02-20 20:25:57,235 module.py:1274 INFO] - ! load /home/hhremote/MLC/repos/local/cache/get-docker_25788506/mlc-cached-state.json
mlc pull repo && mlcr --tags=app,mlperf,inference,generic,_reference,_3d-unet-99,_pytorch,_cuda,_test,_r5.0-dev_default,_offline --quiet=true --env.MLC_QUIET=yes --env.MLC_MLPERF_IMPLEMENTATION=reference --env.MLC_MLPERF_MODEL=3d-unet-99 --env.MLC_MLPERF_RUN_STYLE=test --env.MLC_MLPERF_SKIP_SUBMISSION_GENERATION=False --env.MLC_DOCKER_PRIVILEGED_MODE=True --env.MLC_MLPERF_SUBMISSION_DIVISION=open --env.MLC_MLPERF_INFERENCE_TP_SIZE=1 --env.MLC_MLPERF_SUBMISSION_SYSTEM_TYPE=edge --env.MLC_MLPERF_DEVICE=cuda --env.MLC_MLPERF_USE_DOCKER=True --env.MLC_MLPERF_BACKEND=pytorch --env.MLC_MLPERF_LOADGEN_SCENARIO=Offline --env.MLC_TEST_QUERY_COUNT=50 --env.MLC_MLPERF_FIND_PERFORMANCE_MODE=yes --env.MLC_MLPERF_LOADGEN_ALL_MODES=no --env.MLC_MLPERF_LOADGEN_MODE=performance --env.MLC_MLPERF_RESULT_PUSH_TO_GITHUB=False --env.MLC_MLPERF_SUBMISSION_GENERATION_STYLE=full --env.MLC_MLPERF_INFERENCE_VERSION=5.0-dev --env.MLC_RUN_MLPERF_INFERENCE_APP_DEFAULTS=r5.0-dev_default --env.MLC_MLPERF_SUBMISSION_CHECKER_VERSION=v5.0 --env.MLC_MLPERF_INFERENCE_SOURCE_VERSION=5.0.15 --env.MLC_MLPERF_LAST_RELEASE=v5.0 --env.MLC_MLPERF_INFERENCE_RESULTS_VERSION=r5.0-dev --env.MLC_MODEL=3d-unet-99 --env.MLC_MLPERF_LOADGEN_COMPLIANCE=no --env.MLC_MLPERF_LOADGEN_EXTRA_OPTIONS= --env.MLC_MLPERF_LOADGEN_SCENARIOS,=Offline --env.MLC_MLPERF_LOADGEN_MODES,=performance --env.MLC_OUTPUT_FOLDER_NAME=test_results --add_deps_recursive.coco2014-original.tags=_full --add_deps_recursive.coco2014-preprocessed.tags=_full --add_deps_recursive.imagenet-original.tags=_full --add_deps_recursive.imagenet-preprocessed.tags=_full --add_deps_recursive.openimages-original.tags=_full --add_deps_recursive.openimages-preprocessed.tags=_full --add_deps_recursive.openorca-original.tags=_full --add_deps_recursive.openorca-preprocessed.tags=_full --add_deps_recursive.coco2014-dataset.tags=_full --add_deps_recursive.igbh-dataset.tags=_full --add_deps_recursive.get-mlperf-inference-results-dir.tags=_version.r5.0-dev 
--add_deps_recursive.get-mlperf-inference-submission-dir.tags=_version.r5.0-dev --add_deps_recursive.mlperf-inference-nvidia-scratch-space.tags=_version.r5.0-dev --v=False --print_env=False --print_deps=False --dump_version_info=True --quiet
Dockerfile written at /home/hhremote/MLC/repos/mlcommons@mlperf-automations/script/app-mlperf-inference/dockerfiles/nvcr.io-nvidia-pytorch-24.03-py3.Dockerfile
[2025-02-20 20:25:57,347 docker.py:191 INFO] - Dockerfile generated at /home/hhremote/MLC/repos/mlcommons@mlperf-automations/script/app-mlperf-inference/dockerfiles/nvcr.io-nvidia-pytorch-24.03-py3.Dockerfile
[2025-02-20 20:25:57,432 module.py:560 INFO] - * mlcr get,docker
[2025-02-20 20:25:57,434 module.py:1274 INFO] - ! load /home/hhremote/MLC/repos/local/cache/get-docker_25788506/mlc-cached-state.json
[2025-02-20 20:25:57,449 module.py:560 INFO] - * mlcr get,mlperf,inference,submission,dir,local,_version.r5.0-dev
[2025-02-20 20:25:57,450 module.py:1274 INFO] - ! load /home/hhremote/MLC/repos/local/cache/get-mlperf-inference-submission-dir_1cf0ec51/mlc-cached-state.json
[2025-02-20 20:25:57,492 module.py:560 INFO] - * mlcr get,nvidia-docker
[2025-02-20 20:25:57,521 module.py:560 INFO] - * mlcr detect,os
[2025-02-20 20:25:57,531 module.py:5334 INFO] - ! cd /home/hhremote/MLC/repos/local/cache/get-nvidia-docker_59ebd80c
[2025-02-20 20:25:57,531 module.py:5335 INFO] - ! call /home/hhremote/MLC/repos/mlcommons@mlperf-automations/script/detect-os/run.sh from tmp-run.sh
[2025-02-20 20:25:57,556 module.py:5481 INFO] - ! call "postprocess" from /home/hhremote/MLC/repos/mlcommons@mlperf-automations/script/detect-os/customize.py
[2025-02-20 20:25:57,597 module.py:560 INFO] - * mlcr get,docker
[2025-02-20 20:25:57,598 module.py:1274 INFO] - ! load /home/hhremote/MLC/repos/local/cache/get-docker_25788506/mlc-cached-state.json
[2025-02-20 20:25:57,600 module.py:5334 INFO] - ! cd /home/hhremote/MLC/repos/local/cache/get-nvidia-docker_59ebd80c
[2025-02-20 20:25:57,600 module.py:5335 INFO] - ! call /home/hhremote/MLC/repos/mlcommons@mlperf-automations/script/get-nvidia-docker/run-ubuntu.sh from tmp-run.sh
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list && sudo apt-get update
[sudo] password for hhremote:
deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://nvidia.github.io/libnvidia-container/stable/deb/$(ARCH) /
#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://nvidia.github.io/libnvidia-container/experimental/deb/$(ARCH) /
Hit:2 https://nvidia.github.io/libnvidia-container/stable/deb/amd64 InRelease
Hit:3 http://security.ubuntu.com/ubuntu noble-security InRelease
Hit:4 https://debian.neo4j.com stable InRelease
Hit:5 https://download.docker.com/linux/ubuntu noble InRelease
Hit:6 https://brave-browser-apt-release.s3.brave.com stable InRelease
Hit:7 http://ca.archive.ubuntu.com/ubuntu noble InRelease
Ign:8 http://dl.google.com/linux/chrome-remote-desktop/deb stable InRelease
Hit:9 http://ca.archive.ubuntu.com/ubuntu noble-updates InRelease
Hit:10 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64 InRelease
Hit:11 http://ca.archive.ubuntu.com/ubuntu noble-backports InRelease
Hit:12 https://packages.microsoft.com/repos/code stable InRelease
Get:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64 InRelease [1,581 B]
Ign:13 https://ppa.launchpadcontent.net/appimagelauncher-team/stable/ubuntu noble InRelease
Hit:14 http://dl.google.com/linux/chrome-remote-desktop/deb stable Release
Hit:15 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu noble InRelease
Err:16 https://ppa.launchpadcontent.net/appimagelauncher-team/stable/ubuntu noble Release
404 Not Found [IP: 185.125.190.80 443]
Err:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64 InRelease
The following signatures couldn't be verified because the public key is not available: NO_PUBKEY A4B469963BF863CC
Reading package lists... Done
N: Skipping acquire of configured file 'main/binary-i386/Packages' as repository 'https://brave-browser-apt-release.s3.brave.com stable InRelease' doesn't support architecture 'i386'
E: The repository 'https://ppa.launchpadcontent.net/appimagelauncher-team/stable/ubuntu noble Release' does not have a Release file.
N: Updating from such a repository can't be done securely, and is therefore disabled by default.
N: See apt-secure(8) manpage for repository creation and user configuration details.
W: GPG error: https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64 InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY A4B469963BF863CC
E: The repository 'http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64 InRelease' is not signed.
N: Updating from such a repository can't be done securely, and is therefore disabled by default.
N: See apt-secure(8) manpage for repository creation and user configuration details.
Traceback (most recent call last):
File "/home/hhremote/mlenergy3/bin/mlcr", line 8, in
sys.exit(mlcr())
^^^^^^
File "/home/hhremote/mlenergy3/lib/python3.12/site-packages/mlc/main.py", line 1715, in mlcr
main()
File "/home/hhremote/mlenergy3/lib/python3.12/site-packages/mlc/main.py", line 1797, in main
res = method(run_args)
^^^^^^^^^^^^^^^^
File "/home/hhremote/mlenergy3/lib/python3.12/site-packages/mlc/main.py", line 1529, in run
return self.call_script_module_function("run", run_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/hhremote/mlenergy3/lib/python3.12/site-packages/mlc/main.py", line 1509, in call_script_module_function
result = automation_instance.run(run_args) # Pass args to the run method
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/hhremote/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 225, in run
r = self._run(i)
^^^^^^^^^^^^
File "/home/hhremote/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 1772, in _run
r = customize_code.preprocess(ii)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/hhremote/MLC/repos/mlcommons@mlperf-automations/script/run-mlperf-inference-app/customize.py", line 286, in preprocess
r = mlc.access(ii)
^^^^^^^^^^^^^^
File "/home/hhremote/mlenergy3/lib/python3.12/site-packages/mlc/main.py", line 92, in access
result = method(options)
^^^^^^^^^^^^^^^
File "/home/hhremote/mlenergy3/lib/python3.12/site-packages/mlc/main.py", line 1526, in docker
return self.call_script_module_function("docker", run_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/hhremote/mlenergy3/lib/python3.12/site-packages/mlc/main.py", line 1511, in call_script_module_function
result = automation_instance.docker(run_args) # Pass args to the run method
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/hhremote/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 4691, in docker
return docker_run(self, i)
^^^^^^^^^^^^^^^^^^^
File "/home/hhremote/MLC/repos/mlcommons@mlperf-automations/automation/script/docker.py", line 308, in docker_run
r = self_module._run_deps(
^^^^^^^^^^^^^^^^^^^^^^
File "/home/hhremote/MLC/repos/mlcommons@mlperf-automations/automation/script/module.py", line 3702, in _run_deps
r = self.action_object.access(ii)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/hhremote/mlenergy3/lib/python3.12/site-packages/mlc/main.py", line 92, in access
result = method(options)
^^^^^^^^^^^^^^^
File "/home/hhremote/mlenergy3/lib/python3.12/site-packages/mlc/main.py", line 1529, in run
return self.call_script_module_function("run", run_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/hhremote/mlenergy3/lib/python3.12/site-packages/mlc/main.py", line 1519, in call_script_module_function
raise ScriptExecutionError(f"Script {function_name} execution failed. Error : {error}")
mlc.main.ScriptExecutionError: Script run execution failed. Error : MLC script failed (name = get-nvidia-docker, return code = 256)

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Please file an issue at https://github.com/mlcommons/mlperf-automations/issues along with the full MLC command being run and the relevant
or full console log.
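
A likely reading of the failure, based on the log above: the `get-nvidia-docker` step runs a command ending in `sudo apt-get update` (see the `run-ubuntu.sh` line), and `apt-get update` returns non-zero because two stale repositories on the host are broken and unrelated to MLC: the `appimagelauncher-team` PPA (404, no Release file) and an old `cuda/repos/ubuntu1604` source with a missing GPG key (`NO_PUBKEY A4B469963BF863CC`). A minimal cleanup sketch follows; it is demonstrated against a temporary directory so it is safe to run as-is, while on the real host the entries live under `/etc/apt/sources.list.d/` (the file names below are illustrative, not the actual names on the affected machine):

```shell
#!/bin/sh
# Safe demonstration: locate apt source files referencing the broken repos.
# On a real host, set APT_DIR=/etc/apt/sources.list.d and remove the files
# the grep reports (sudo rm ...), then re-run: sudo apt-get update
APT_DIR="$(mktemp -d)"

# Simulated stale entries matching the errors in the log above (hypothetical file names):
echo 'deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64 /' \
    > "$APT_DIR/cuda-ubuntu1604.list"
echo 'deb https://ppa.launchpadcontent.net/appimagelauncher-team/stable/ubuntu noble main' \
    > "$APT_DIR/appimagelauncher.list"

# Find which source files reference the broken repositories:
grep -rl 'ubuntu1604\|appimagelauncher' "$APT_DIR"

rm -r "$APT_DIR"
```

Once `sudo apt-get update` completes cleanly on the host, re-running the original `mlcr` command should get past the `get-nvidia-docker` step.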
