Bugs when running on Multiple A100

I am running `python train.py ...` on 2× A100 40GB GPUs.

I am 100% sure that both GPUs are active: the output is printed twice, and the two copies are out of sync.

However, I ran into the following error:

![Image](https://github.com/user-attachments/assets/5511a703-e1f6-43b5-9749-0fcac24e70d4)

In the screenshot above, `222222222...2` is printed twice (lines 3 and 4), as expected with 2 GPUs, but `111111...1` is printed only once (line 5) and `0000...000 GOOD` is never printed. Line 6 shows `self.args.gradient_accumulation_steps=0`. (I don't know whether 0 is the correct value for `self.args.gradient_accumulation_steps`.)
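To tell which process produced which line, the prints could be tagged with the process rank. This is only a minimal sketch: `rank_tag` and `rank_print` are hypothetical helpers, not part of `train.py`, and it assumes the job is launched with `torchrun`, which sets the `RANK` and `WORLD_SIZE` environment variables for each process.

```python
import os


def rank_tag() -> str:
    """Build a prefix such as "[rank 1/2]" from the launcher's environment.

    Assumption: RANK and WORLD_SIZE are set by the launcher (e.g. torchrun).
    Falls back to a single-process tag when they are missing.
    """
    rank = os.environ.get("RANK", "0")
    world_size = os.environ.get("WORLD_SIZE", "1")
    return f"[rank {rank}/{world_size}]"


def rank_print(*args) -> None:
    """Print with the rank prefix and flush so lines are not held in a buffer."""
    print(rank_tag(), *args, flush=True)


# Hypothetical usage in place of the bare marker prints:
rank_print("reached micro-step loop")
```

With tagged output it would be clear whether the single `111111...1` line came from rank 0 or rank 1, and therefore which process stopped before reaching it.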

The prints I added to https://github.com/ReaLLMASIC/nanoGPT/blob/master/train.py look like this (the `# Line N` comments mark the positions in that file):
```
    def train(self):  # Line 1005
        ...
        print("2222222222222222222222222222222222222222222222222222222222222222222")
        # Create progress bar  # Line 1018
        progress = Progress()  # Line 1019
        with progress:  # Line 1020
            task_id = progress.add_task("[green]Training...", total=(self.args.max_iters - self.iter_num))  # Line 1021
            while True:  # Line 1022
                ...
                print("11111111111111111111111111111111111111111")
                print(self.args.gradient_accumulation_steps)
                for micro_step in range(self.args.gradient_accumulation_steps):  # Line 1134
                    ...
                    print("0000000000000000000000000000000000000000000000000000 GOOD")
                    self.scaler.scale(loss).backward()  # Line 1162
                ...
        ...
    ...
```

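It seems that `self.scaler.scale(loss).backward()` never ran, which led to this error. With `self.args.gradient_accumulation_steps` equal to 0, `range(0)` is empty, so the body of the `for micro_step` loop is skipped entirely.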
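The effect of a zero step count can be reproduced in isolation. The sketch below is not code from `train.py`: `micro_steps_run` and `per_rank_steps` are hypothetical names. The second function illustrates one possible way the value could become 0, assuming the fork keeps upstream nanoGPT's behaviour of integer-dividing `gradient_accumulation_steps` by the number of processes. I have not confirmed that the fork does this.

```python
def micro_steps_run(gradient_accumulation_steps: int) -> int:
    """Count how many times the micro-step loop body executes."""
    executed = 0
    for micro_step in range(gradient_accumulation_steps):
        executed += 1  # stands in for the forward pass and backward() call
    return executed


def per_rank_steps(configured_steps: int, world_size: int) -> int:
    """ASSUMPTION: steps are split across processes with integer division,
    as upstream nanoGPT does. Not verified against this fork."""
    return configured_steps // world_size


print(micro_steps_run(0))    # 0: the loop body, and so backward(), never runs
print(per_rank_steps(1, 2))  # 0: a configured value of 1 on 2 GPUs rounds down
print(per_rank_steps(2, 2))  # 1: a multiple of the GPU count stays positive
```

If that assumption holds, passing a `gradient_accumulation_steps` value that is a multiple of the number of GPUs would be worth trying.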

Bugs when running on Multiple A100 #426
