I am using 2× A100 40GB GPUs to run `python train.py ...`.
I am 100% sure that both GPUs are activated correctly (I can see the output printed twice, out of sync).
However, I ran into the following bug:
You can see from the figure above that `2222...2` is printed twice (Line 3 and Line 4 of the output), because I have 2 GPUs, but `1111...1` is printed only once (Line 5), and `0000...000 GOOD` is never printed at all. Line 6 of the output shows `self.args.gradient_accumulation_steps=0` (I don't know whether 0 is a correct value for `self.args.gradient_accumulation_steps`).
https://github.com/ReaLLMASIC/nanoGPT/blob/master/train.py
The debug prints I added look like this:
def train(self):  # Line 1005
    ...
    print("2222222222222222222222222222222222222222222222222222222222222222222")
    # Create progress bar  # Line 1018
    progress = Progress()  # Line 1019
    with progress:  # Line 1020
        task_id = progress.add_task("[green]Training...", total=(self.args.max_iters - self.iter_num))  # Line 1021
        while True:  # Line 1022
            ...
            print("11111111111111111111111111111111111111111")
            print(self.args.gradient_accumulation_steps)
            for micro_step in range(self.args.gradient_accumulation_steps):  # Line 1134
                ...
                print("0000000000000000000000000000000000000000000000000000 GOOD")
                self.scaler.scale(loss).backward()  # Line 1162
            ...
        ...
    ...
It seems that `self.scaler.scale(loss).backward()` never ran, which led to this error. Since `self.args.gradient_accumulation_steps` is 0, `range(0)` is empty and the micro-step loop body is skipped entirely, so neither the `0000...000 GOOD` print nor the `backward()` call is ever reached.
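A minimal sketch of the suspected failure mode, assuming this fork keeps upstream karpathy/nanoGPT's convention of dividing `gradient_accumulation_steps` by the DDP world size (I have not verified that this fork does the same, and the values below are illustrative, not taken from this fork's code):

```python
# Illustrative values, not taken from this fork's code.
gradient_accumulation_steps = 1  # e.g. a small value passed on the command line
ddp_world_size = 2               # two A100 GPUs

# Upstream nanoGPT splits accumulation steps across ranks with integer
# division; with more ranks than steps, this floors to zero.
gradient_accumulation_steps //= ddp_world_size
print(gradient_accumulation_steps)  # -> 0

# range(0) is empty, so the loop body never runs -- matching the missing
# "0000...000 GOOD" print and the backward() call never happening.
for micro_step in range(gradient_accumulation_steps):
    print("this never prints")
```

If that is what is happening here, passing a `gradient_accumulation_steps` that is a multiple of the number of GPUs (e.g. 2 or 4) should make the `0000...000 GOOD` line appear; it would also be worth printing the value before and after any world-size adjustment to confirm where it becomes 0.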
