The recently updated system or environment doesn't have the necessary CUDA library or drivers installed for cuDNN #5080

Open
mariuslesniak opened this issue Jan 31, 2025 · 5 comments


mariuslesniak commented Jan 31, 2025

I am training a deep neural network with TensorFlow's Keras API on a T4 instance. This worked well for the last year or so until a day ago, when a CUDA-related error appeared.

Describe the current behavior
The error is:
"InvalidArgumentError: Graph execution error:

Detected at node sequential_1/bidirectional_1/forward_lstm_1/CudnnRNNV3 defined at (most recent call last)"
The rest of the error code is contained in the attached file.

Describe the expected behavior
The expected behaviour would be to use the GPU with no such error. "Dnn is not supported" indicates that the LSTM layer in the model is attempting to use the cuDNN implementation, which is optimized for NVIDIA GPUs. However, either there is no compatible GPU available in the environment or cuDNN is not properly configured. The fact that the code worked well until yesterday suggests an unaccounted-for change in the system or the environment.

What web browser you are using
I am using Chrome

Additional context
Link to a minimal, public, self-contained notebook that reproduces this issue.

  • Share the file using your GitHub account using File > Save a copy as a GitHub Gist.
  • or Share Drive notebooks using the Share button then 'Get Shareable Link'.
@mariuslesniak
Author

I have also attached some elements of the code, as per below:

Imports and setting up the environment

!pip install -Uqq fastai

import numpy as np
import pandas as pd
import os
import string
from IPython.display import FileLink
from datetime import date
from sklearn.preprocessing import StandardScaler
from keras.models import Sequential
from keras.layers import LSTM, Dense, Bidirectional, Dropout

The rest of the code ....................................

    # Creating training data set
    # --------------------------
    print('Creating the training set')
    train = df.copy()
    train.head(window_length+1)

    train.tail(window_length+1)

    train_rows = train.values.shape[0]
    train_samples = np.empty([ train_rows - factor_a * window_length, window_length, number_of_features], dtype=float)
    train_labels = np.empty([ train_rows - factor_a * window_length, number_of_features], dtype=float)
    for i in range(0, train_rows - factor_a * window_length):
        train_samples[i] = train.iloc[i : i+window_length, 0 : number_of_features]
        train_labels[i] = train.iloc[i+window_length : i+window_length+1, 0 : number_of_features]


    print('Creating scaled samples')
    scaler = StandardScaler()
    transformed_dataset = scaler.fit_transform(train.values)
    scaled_train_samples = pd.DataFrame(data=transformed_dataset, index=train.index)
    scaled_train_samples.head(window_length+1)
    x_train = np.empty([ train_rows - factor_a * window_length, window_length, number_of_features], dtype=float)
    y_train = np.empty([ train_rows - factor_a * window_length, number_of_features], dtype=float)

    for i in range(0, train_rows - factor_a * window_length):
        x_train[i] = scaled_train_samples.iloc[i : i+window_length, 0 : number_of_features]
        y_train[i] = scaled_train_samples.iloc[i+window_length : i+window_length+1, 0 : number_of_features]

    import tensorflow as tf
    from tensorflow import keras
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import LSTM, Dense, Bidirectional, Dropout
    from tensorflow.keras.optimizers import Adam
    from tensorflow.keras.metrics import mse

    physical_devices = tf.config.list_physical_devices('GPU')
    print("Num GPUs Available: ", len(physical_devices))
    try:
        tf.config.experimental.set_memory_growth(physical_devices[0], True)
    except (IndexError, ValueError, RuntimeError):
        # No GPU found, invalid device, or devices were already initialized.
        pass

    # Initialising the RNN
    model = Sequential()
    # Adding the input layer and the LSTM layer
    model.add(Bidirectional(LSTM(240,
                            input_shape = (window_length, number_of_features),
                            return_sequences = True)))
    # Adding a first Dropout layer
    model.add(Dropout(0.2))
    # Adding a second LSTM layer
    model.add(Bidirectional(LSTM(240,
                            input_shape = (window_length, number_of_features),
                            return_sequences = True)))
    # Adding a second Dropout layer
    model.add(Dropout(0.2))
    # Adding a third LSTM layer
    model.add(Bidirectional(LSTM(240,
                            input_shape = (window_length, number_of_features),
                            return_sequences = True)))
    # Adding a fourth LSTM layer
    model.add(Bidirectional(LSTM(240,
                            input_shape = (window_length, number_of_features),
                            return_sequences = False)))
    # Adding a third Dropout layer
    model.add(Dropout(0.2))
    # Adding the first output layer
    model.add(Dense(70))
    # Adding the last output layer
    model.add(Dense(number_of_features))

    model.compile(optimizer=Adam(learning_rate=0.0001), loss ='mse', metrics=['accuracy'])

    model.fit(x=x_train, y=y_train, batch_size=100, epochs=2000, verbose=2)

The rest of the code .........................................................................
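The two window-building loops above can also be expressed without an explicit Python loop. The following is only a sketch, not the author's code: it assumes `factor_a` is 1 (its value is not shown in the thread), and the helper name `make_windows` and the toy data are made up for illustration. It uses NumPy's `sliding_window_view` to build the `(samples, window_length, features)` input array and the matching next-row targets.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def make_windows(values, window_length):
    """Split a (rows, features) array into model inputs and next-row targets.

    Returns x with shape (rows - window_length, window_length, features)
    and y with shape (rows - window_length, features).
    Assumption: one target row directly follows each window (factor_a == 1).
    """
    values = np.asarray(values, dtype=float)
    if values.ndim != 2 or values.shape[0] <= window_length:
        raise ValueError("need a 2-D array with more rows than window_length")
    # sliding_window_view puts the window axis last: (n_windows, features, window_length)
    windows = sliding_window_view(values, window_length, axis=0)
    # Reorder to (n_windows, window_length, features), the layout an LSTM expects
    windows = windows.transpose(0, 2, 1)
    # The final window has no following row to predict, so drop it
    x = windows[:-1].copy()
    y = values[window_length:].copy()
    return x, y


# Toy data: 6 rows, 2 features (hypothetical values)
data = np.arange(12, dtype=float).reshape(6, 2)
x, y = make_windows(data, window_length=3)
print(x.shape, y.shape)
```

In the notebook, the same helper would presumably be called on the scaled values (for example `scaled_train_samples.values`) to obtain `x_train` and `y_train`.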

@metrizable
Contributor

@mariuslesniak Thanks for filing the issue and thanks for using Colab.

Thanks for some of the code. You mentioned:

"The rest of the error code is contained in the attached file"

Could you upload this file so that we can help troubleshoot? Also, are you able to provide a minimal reproducible example that we can run and debug? Thanks!

@mariuslesniak
Author

mariuslesniak commented Feb 1, 2025

Thanks for your message. In response, I have attached three files as follows:

  1. parametric_colab_forecasting_loop.ipynb, a cut-down version of my code that you can run to see the problem (uploaded as a .txt file)
  2. source-data-history.csv, a set of example input data
  3. predict-data-template.csv, an output data template required by the programme

In file (1), lines 28, 31, 36, 45-47, 57, 242 and 265 refer to the relevant (2) and (3) file names. In my case, files (2) and (3) were placed on Google Drive in a directory called MyDrive/Colab_files, mounted at /content/drive.

Hope this will help.

Kind regards

parametric_colab_forecasting_loop.txt

source-data-history.csv
predict-data-template.csv

@metrizable
Contributor

metrizable commented Feb 3, 2025

@mariuslesniak Thanks for the example. I was able to successfully run the code in parametric_colab_forecasting_loop.txt with your provided .csv files. I did make two modifications: 1) I lowered the cycle count limit to finish in a timely fashion, and 2) updated the code to fix the warning: "Do not pass an input_shape/input_dim argument to a layer. When using Sequential models, prefer using an Input(shape) object as the first layer in the model instead":

# Initialising the RNN
model = Sequential()
model.add(keras.Input(shape=(window_length, number_of_features)))

I invoked your sample code on a T4 GPU runtime and did not see any errors (neither the one cited in the OP, "InvalidArgumentError: Graph execution error", nor any other) in the output or in the final:

Image

It may be that your larger cycle count causes later errors, but that would seem unrelated to CUDA being configured incorrectly. Are you able to share a notebook with saved output that includes the error?
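For reference, here is one way the `keras.Input` change could look when applied to the whole model from the original post. This is a hedged sketch rather than the exact code that was run: the function name `build_model`, its `units` and `dropout` parameters, and the small sizes in the usage lines are illustrative assumptions, and the `accuracy` metric is left out because the loss is MSE.

```python
import tensorflow as tf


def build_model(window_length, number_of_features, units=240, dropout=0.2):
    """Stacked bidirectional LSTM regressor with an explicit Input layer."""
    layers = tf.keras.layers
    model = tf.keras.Sequential([
        # Declare the input shape once, instead of passing input_shape to each LSTM
        tf.keras.Input(shape=(window_length, number_of_features)),
        layers.Bidirectional(layers.LSTM(units, return_sequences=True)),
        layers.Dropout(dropout),
        layers.Bidirectional(layers.LSTM(units, return_sequences=True)),
        layers.Dropout(dropout),
        layers.Bidirectional(layers.LSTM(units, return_sequences=True)),
        # The last recurrent layer returns only the final time step's output
        layers.Bidirectional(layers.LSTM(units, return_sequences=False)),
        layers.Dropout(dropout),
        layers.Dense(70),
        layers.Dense(number_of_features),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.0001),
        loss="mse",
    )
    return model


# Hypothetical sizes, small enough to build quickly on CPU
model = build_model(window_length=7, number_of_features=3, units=8)
print(model.output_shape)
```

Only the first layer needs the shape; the later LSTM layers infer their input from the layer before them, so the repeated `input_shape` arguments in the original can simply be dropped.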

@mariuslesniak
Author

mariuslesniak commented Feb 4, 2025

Hi, I was able to do the suggested correction regarding "input_shape" and rerun the code. Unfortunately, I still have the same problem when running it in the available (latest) Jupyter Notebook. I have attached the edited code (parametric_colab_forecasting_loop.txt) as well as the resulting error (output.txt).
I used a T4 GPU and also checked the versions: TensorFlow 2.18.0, NVIDIA driver 550.54.15, and CUDA 12.4.
If I run the code in the fallback runtime version, as available via the Command Palette and the "Use fallback runtime version" command, everything seems to be working fine.
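Since the fallback runtime works and the latest one does not, it may help to capture the same version details in both runtimes and compare them. The snippet below is a sketch of such a report; the function name `describe_gpu_environment` and the report keys are made up for illustration. It relies on `tf.sysconfig.get_build_info()` for the CUDA and cuDNN versions TensorFlow was built against, and on `nvidia-smi` for the driver version, and it skips whichever of these is not available.

```python
import importlib.util
import shutil
import subprocess


def describe_gpu_environment():
    """Collect version details that are useful to paste into a bug report."""
    report = {"tensorflow": None, "gpus": [], "cuda": None, "cudnn": None, "driver": None}

    # Only import TensorFlow if it is installed, so the report works anywhere
    if importlib.util.find_spec("tensorflow") is not None:
        import tensorflow as tf

        report["tensorflow"] = tf.__version__
        report["gpus"] = [d.name for d in tf.config.list_physical_devices("GPU")]
        # Versions TensorFlow was built against (keys are absent on CPU-only builds)
        build = tf.sysconfig.get_build_info()
        report["cuda"] = build.get("cuda_version")
        report["cudnn"] = build.get("cudnn_version")

    # Driver version as reported by nvidia-smi, if the tool is on PATH
    if shutil.which("nvidia-smi"):
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
            capture_output=True, text=True, check=False,
        )
        if result.returncode == 0:
            report["driver"] = result.stdout.strip() or None
    return report


report = describe_gpu_environment()
for key, value in report.items():
    print(f"{key}: {value}")
```

Running this once in the default runtime and once in the fallback runtime, then diffing the two outputs, should show whether the TensorFlow, CUDA or cuDNN versions changed between them.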

parametric_colab_forecasting_loop.txt

output.txt

Kind regards,
