Bugs when running on Multiple A100 #426

@Mars-Cat2023

I am running "python train.py ..." on 2x A100 40GB GPUs.

I am 100% sure that both GPUs are active (the output is printed twice, out of sync).

However, I ran into the following error:

[Image: screenshot of the training output]

In the figure above, the 222222222...2 line is printed twice (lines 3 and 4 of the output) because I have 2 GPUs, but the 111111...1 line is printed only once (line 5) and the 0000...000 GOOD line is never printed. Line 6 of the output shows self.args.gradient_accumulation_steps = 0. (I don't know whether this is the correct value for self.args.gradient_accumulation_steps.)

The print statements I added to train.py (https://github.com/ReaLLMASIC/nanoGPT/blob/master/train.py) look like this:

    def train(self):  # Line 1005
        ...
        print("2222222222222222222222222222222222222222222222222222222222222222222")
        # Create progress bar  # Line 1018
        progress = Progress()  # Line 1019
        with progress:  # Line 1020
            task_id = progress.add_task("[green]Training...", total=(self.args.max_iters - self.iter_num))  # Line 1021
            while True:  # Line 1022
                ...
                print("11111111111111111111111111111111111111111")
                print(self.args.gradient_accumulation_steps)
                for micro_step in range(self.args.gradient_accumulation_steps):  # Line 1134
                    ...
                    print("0000000000000000000000000000000000000000000000000000 GOOD")
                    self.scaler.scale(loss).backward()  # Line 1162
                ...
        ...
    ...

It seems that self.scaler.scale(loss).backward() never ran, which led to this error. With self.args.gradient_accumulation_steps equal to 0, the micro_step loop body is never entered.
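A minimal sketch of why a value of 0 would give exactly this output. It is not taken from this repository's train.py. It assumes that the accumulation steps are divided by the number of DDP processes using integer division, as upstream nanoGPT does with `gradient_accumulation_steps //= ddp_world_size`; whether this fork does the same has not been checked here. The helper names `split_accumulation_steps`, `count_backward_calls`, and `checked_split` are hypothetical.

```python
def split_accumulation_steps(total_steps: int, world_size: int) -> int:
    """Hypothetical helper: share accumulation steps across ranks.

    Assumption: mirrors the integer division used in upstream nanoGPT.
    """
    return total_steps // world_size


def count_backward_calls(steps: int) -> int:
    """Count how many times the micro-step loop body would run."""
    calls = 0
    for micro_step in range(steps):
        # Stands in for self.scaler.scale(loss).backward()
        calls += 1
    return calls


def checked_split(total_steps: int, world_size: int) -> int:
    """Hypothetical guard: fail loudly instead of silently producing 0."""
    if total_steps < world_size or total_steps % world_size != 0:
        raise ValueError(
            f"gradient_accumulation_steps={total_steps} must be a positive "
            f"multiple of the number of processes ({world_size})"
        )
    return total_steps // world_size


if __name__ == "__main__":
    # Assumed scenario: 1 accumulation step requested, 2 GPUs.
    per_rank = split_accumulation_steps(1, 2)
    print(per_rank)                        # 1 // 2 == 0
    print(count_backward_calls(per_rank))  # range(0) is empty, so no backward()
    print(checked_split(2, 2))             # a multiple of the GPU count works
```

If this is the cause, passing a gradient_accumulation_steps value that is a multiple of the number of GPUs (for example 2 with 2x A100) should let the loop body, and therefore backward(), run.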
