Cudnn version mismatch - resnet50 tensorflow - mlc #215

Open
anandhu-eng opened this issue Feb 12, 2025 · 0 comments

Comments

@anandhu-eng
Contributor

output log:

./run_local.sh tf resnet50 gpu --scenario Offline    --threads 2 --user_conf '/root/MLC/repos/mlcommons@mlperf-automations/script/generate-mlperf-inference-user-conf/
tmp/008b42b487e843888434313954e77347.conf' --use_preprocessed_dataset --cache_dir /root/MLC/repos/local/cache/get-preprocessed-dataset-imagenet_f2fa0fec --dataset-lis
t /root/MLC/repos/local/cache/extract-file_49f3fae9/val.txt 2>&1 | tee '/mlc-mount/home/anandhu/test_results/93e5e028e03c-reference-gpu-tf-v2.18.0-cu124/resnet50/offl
ine/performance/run_1/console.out'; echo ${PIPESTATUS[0]} > exitstatus
python3 python/main.py --profile resnet50-tf --model "/root/MLC/repos/local/cache/download-file_a5ea13cc/resnet50_v1.pb" --dataset-path /root/MLC/repos/local/cache/ge
t-preprocessed-dataset-imagenet_f2fa0fec --output "/mlc-mount/home/anandhu/test_results/93e5e028e03c-reference-gpu-tf-v2.18.0-cu124/resnet50/offline/performance/run_1
" --scenario Offline --threads 2 --user_conf /root/MLC/repos/mlcommons@mlperf-automations/script/generate-mlperf-inference-user-conf/tmp/008b42b487e843888434313954e77
347.conf --use_preprocessed_dataset --cache_dir /root/MLC/repos/local/cache/get-preprocessed-dataset-imagenet_f2fa0fec --dataset-list /root/MLC/repos/local/cache/extr
act-file_49f3fae9/val.txt
INFO:main:Namespace(dataset='imagenet', dataset_path='/root/MLC/repos/local/cache/get-preprocessed-dataset-imagenet_f2fa0fec', dataset_list='/root/MLC/repos/local/cac
he/extract-file_49f3fae9/val.txt', data_format=None, profile='resnet50-tf', scenario='Offline', max_batchsize=32, model='/root/MLC/repos/local/cache/download-file_a5e
a13cc/resnet50_v1.pb', output='/mlc-mount/home/anandhu/test_results/93e5e028e03c-reference-gpu-tf-v2.18.0-cu124/resnet50/offline/performance/run_1', inputs=['input_te
nsor:0'], outputs=['ArgMax:0'], backend='tensorflow', device=None, model_name='resnet50', threads=2, qps=None, cache=0, cache_dir='/root/MLC/repos/local/cache/get-pre
processed-dataset-imagenet_f2fa0fec', preprocessed_dir=None, use_preprocessed_dataset=True, accuracy=False, find_peak_performance=False, debug=False, user_conf='/root
/MLC/repos/mlcommons@mlperf-automations/script/generate-mlperf-inference-user-conf/tmp/008b42b487e843888434313954e77347.conf', audit_conf='audit.config', time=None, c
ount=None, performance_sample_count=None, max_latency=None, samples_per_query=8)
2025-02-12 10:30:16.828618: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-poin
t round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-02-12 10:30:16.853479: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin
 cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1739356216.880204    3058 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been re
gistered
E0000 00:00:1739356216.887595    3058 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been
 registered
2025-02-12 10:30:16.915071: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-
critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 AVX512_FP16 AVX_VNNI AMX_TILE AMX_INT8 AMX_BF16 FMA, in other operations, rebuild TensorFlo
w with the appropriate compiler flags.
INFO:matplotlib.font_manager:generated new fontManager
INFO:imagenet:Loading 50000 preprocessed images using 2 threads
INFO:imagenet:loaded 50000 images, cache=0, already_preprocessed=True, took=0.9sec
WARNING:tensorflow:From /root/MLC/repos/local/cache/get-git-repo_c7f3aa29/inference/vision/classification_and_detection/python/backend_tf.py:55: FastGFile.__init__ (f
rom tensorflow.python.platform.gfile) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.gfile.GFile.
WARNING:tensorflow:From /root/venv/mlc/lib/python3.10/site-packages/tensorflow/python/tools/strip_unused_lib.py:84: extract_sub_graph (from tensorflow.python.framework.graph_util_impl) is deprecated and will be removed in a future version.
Instructions for updating:
This API was designed for TensorFlow v1. See https://www.tensorflow.org/guide/migrate for instructions on how to migrate your code to TensorFlow v2.
WARNING:tensorflow:From /root/venv/mlc/lib/python3.10/site-packages/tensorflow/python/tools/optimize_for_inference_lib.py:138: remove_training_nodes (from tensorflow.python.framework.graph_util_impl) is deprecated and will be removed in a future version.
Instructions for updating:
This API was designed for TensorFlow v1. See https://www.tensorflow.org/guide/migrate for instructions on how to migrate your code to TensorFlow v2.
I0000 00:00:1739356257.281273    3058 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 78665 MB memory:  -> device: 0, name: NVIDIA H100 80GB HBM3, pci bus id: 0000:18:00.0, compute capability: 9.0
I0000 00:00:1739356257.287068    3058 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 78665 MB memory:  -> device: 1, name: NVIDIA H100 80GB HBM3, pci bus id: 0000:2a:00.0, compute capability: 9.0
I0000 00:00:1739356257.290797    3058 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:2 with 78665 MB memory:  -> device: 2, name: NVIDIA H100 80GB HBM3, pci bus id: 0000:3a:00.0, compute capability: 9.0
I0000 00:00:1739356257.294197    3058 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:3 with 78665 MB memory:  -> device: 3, name: NVIDIA H100 80GB HBM3, pci bus id: 0000:5d:00.0, compute capability: 9.0
I0000 00:00:1739356257.298001    3058 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:4 with 78665 MB memory:  -> device: 4, name: NVIDIA H100 80GB HBM3, pci bus id: 0000:9a:00.0, compute capability: 9.0
I0000 00:00:1739356257.308591    3058 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:5 with 78665 MB memory:  -> device: 5, name: NVIDIA H100 80GB HBM3, pci bus id: 0000:ab:00.0, compute capability: 9.0
I0000 00:00:1739356257.312613    3058 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:6 with 78665 MB memory:  -> device: 6, name: NVIDIA H100 80GB HBM3, pci bus id: 0000:ba:00.0, compute capability: 9.0
I0000 00:00:1739356257.315976    3058 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:7 with 78665 MB memory:  -> device: 7, name: NVIDIA H100 80GB HBM3, pci bus id: 0000:db:00.0, compute capability: 9.0
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1739356257.774200    3058 mlir_graph_optimization_pass.cc:401] MLIR V1 optimization pass is not enabled
E0000 00:00:1739356261.281341    3599 cuda_dnn.cc:522] Loaded runtime CuDNN library: 9.0.0 but source was compiled with: 9.3.0.  CuDNN library needs to have matching major version and equal or higher minor version. If using a binary install, upgrade your CuDNN library.  If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration.
2025-02-12 10:31:01.283436: W tensorflow/core/framework/op_kernel.cc:1841] OP_REQUIRES failed at conv_ops_fused_impl.h:625 : INVALID_ARGUMENT: No DNN in stream executor.
2025-02-12 10:31:01.283474: I tensorflow/core/framework/local_rendezvous.cc:405] Local rendezvous is aborting with status: INVALID_ARGUMENT: No DNN in stream executor.
         [[{{node resnet_model/Relu}}]]
2025-02-12 10:31:01.283486: I tensorflow/core/framework/local_rendezvous.cc:405] Local rendezvous is aborting with status: INVALID_ARGUMENT: No DNN in stream executor.
         [[{{node resnet_model/Relu}}]]
         [[ArgMax/_3]]
2025-02-12 10:31:01.283510: I tensorflow/core/framework/local_rendezvous.cc:424] Local rendezvous recv item cancelled. Key hash: 5866837555468538586
Traceback (most recent call last):
  File "/root/venv/mlc/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1407, in _do_call
    return fn(*args)
  File "/root/venv/mlc/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1390, in _run_fn
    return self._call_tf_sessionrun(options, feed_dict, fetch_list,
  File "/root/venv/mlc/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1483, in _call_tf_sessionrun
    return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
  (0) INVALID_ARGUMENT: No DNN in stream executor.
         [[{{node resnet_model/Relu}}]]
         [[ArgMax/_3]]
  (1) INVALID_ARGUMENT: No DNN in stream executor.
         [[{{node resnet_model/Relu}}]]
0 successful operations.
0 derived errors ignored.
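
The cuda_dnn.cc message in the log states the compatibility rule directly: the runtime cuDNN must have the same major version as the one TensorFlow was compiled against, and an equal or higher minor version. A minimal sketch of that check, applied to the versions reported above (9.0.0 loaded, 9.3.0 compiled). The helper names `parse_cudnn_mismatch` and `runtime_is_compatible` are hypothetical and not part of TensorFlow or the MLC scripts:

```python
import re

# Matches the cuda_dnn.cc message quoted in the log above.
MISMATCH_RE = re.compile(
    r"Loaded runtime CuDNN library: (\d+)\.(\d+)\.(\d+) "
    r"but source was compiled with: (\d+)\.(\d+)\.(\d+)"
)


def parse_cudnn_mismatch(line):
    """Return ((runtime), (compiled)) version tuples, or None if no match."""
    m = MISMATCH_RE.search(line)
    if m is None:
        return None
    nums = tuple(int(g) for g in m.groups())
    return nums[:3], nums[3:]


def runtime_is_compatible(runtime, compiled):
    """Rule from the log message: same major, equal or higher minor."""
    return runtime[0] == compiled[0] and runtime[1] >= compiled[1]


# Shortened copy of the error line from the log above.
line = (
    "E0000 00:00:1739356261.281341    3599 cuda_dnn.cc:522] "
    "Loaded runtime CuDNN library: 9.0.0 but source was compiled with: 9.3.0."
)

runtime, compiled = parse_cudnn_mismatch(line)
print(runtime, compiled, runtime_is_compatible(runtime, compiled))
```

For the versions in this log the major versions match (9) but the runtime minor version (0) is lower than the compiled one (3), so the check fails, which is consistent with the "No DNN in stream executor" error that follows.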

run command:

mlcr run-mlperf,inference,_find-performance,_full,_r5.0-dev \
   --model=resnet50 \
   --implementation=reference \
   --framework=tensorflow \
   --category=datacenter \
   --scenario=Offline \
   --execution_mode=test \
   --device=cuda  \
   --docker --quiet \
   --test_query_count=5000
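
To narrow down where the 9.0.0 runtime comes from, it may help to list what is installed in the Python environment inside the container. A minimal sketch, assuming cuDNN could have been installed as the pip wheel `nvidia-cudnn-cu12` (an assumption; the library may instead come from the container image itself, in which case the wheel will show as not installed). The helper name `installed_versions` is hypothetical:

```python
from importlib import metadata

# Assumption: these are the pip distributions of interest in the venv.
CANDIDATE_PACKAGES = ["tensorflow", "nvidia-cudnn-cu12"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(CANDIDATE_PACKAGES).items():
    print(f"{name}: {version or 'not installed via pip'}")
```

If the wheel is absent or older than 9.3, the runtime library is likely being picked up from elsewhere on the library path, which would need to be checked separately.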