I am trying to fine-tune Qwen2.5-VL on 16 × 80 GB GPUs using LLaMA-Factory with preprocessing_num_workers=16. However, I hit the following error and the program seems to have crashed. The error appears to come from the datasets library.
The error log is as follows:
Converting format of dataset (num_proc=16): 100%|█████████▉| 19265/19267 [11:44<00:00, 5.88 examples/s]
Converting format of dataset (num_proc=16): 100%|█████████▉| 19266/19267 [11:44<00:00, 5.02 examples/s]
Converting format of dataset (num_proc=16): 100%|██████████| 19267/19267 [11:44<00:00, 5.44 examples/s]
Converting format of dataset (num_proc=16): 100%|██████████| 19267/19267 [11:44<00:00, 27.34 examples/s]
Running tokenizer on dataset (num_proc=16): 0%| | 0/19267 [00:00<?, ? examples/s]
Invalid NAL unit size (45405 > 35540).
Invalid NAL unit size (86720 > 54856).
Invalid NAL unit size (7131 > 3225).
missing picture in access unit with size 54860
Invalid NAL unit size (48042 > 33645).
missing picture in access unit with size 3229
missing picture in access unit with size 33649
Invalid NAL unit size (86720 > 54856).
Invalid NAL unit size (48042 > 33645).
Error splitting the input into NAL units.
missing picture in access unit with size 35544
Invalid NAL unit size (45405 > 35540).
Error splitting the input into NAL units.
Error splitting the input into NAL units.
Invalid NAL unit size (8187 > 7069).
missing picture in access unit with size 7073
Invalid NAL unit size (8187 > 7069).
Error splitting the input into NAL units.
Invalid NAL unit size (7131 > 3225).
Error splitting the input into NAL units.
Invalid NAL unit size (14013 > 5998).
missing picture in access unit with size 6002
Invalid NAL unit size (14013 > 5998).
Error splitting the input into NAL units.
Invalid NAL unit size (17173 > 7231).
missing picture in access unit with size 7235
Invalid NAL unit size (17173 > 7231).
Error splitting the input into NAL units.
Invalid NAL unit size (16964 > 6055).
missing picture in access unit with size 6059
Invalid NAL unit size (16964 > 6055).
Exception in thread Thread-9 (accepter)Error splitting the input into NAL units.
:
Traceback (most recent call last):
File "/opt/conda/envs/python3.10.13/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
Running tokenizer on dataset (num_proc=16): 0%| | 0/19267 [13:22<?, ? examples/s] self.run()
File "/opt/conda/envs/python3.10.13/lib/python3.10/threading.py", line 953, in run
Invalid NAL unit size (7032 > 2927).
missing picture in access unit with size 2931
self._target(*self._args, **self._kwargs)
File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/multiprocess/managers.py", line 194, in accepter
Invalid NAL unit size (7032 > 2927).
Error splitting the input into NAL units.
t.start()
File "/opt/conda/envs/python3.10.13/lib/python3.10/threading.py", line 935, in start
Invalid NAL unit size (28973 > 6121).
missing picture in access unit with size 6125
_start_new_thread(self._bootstrap, ())Invalid NAL unit size (28973 > 6121).
RuntimeError: can't start new threadError splitting the input into NAL units.
Invalid NAL unit size (4411 > 296).
missing picture in access unit with size 300
Invalid NAL unit size (4411 > 296).
Error splitting the input into NAL units.
Invalid NAL unit size (14414 > 1471).
missing picture in access unit with size 1475
Invalid NAL unit size (14414 > 1471).
Error splitting the input into NAL units.
Invalid NAL unit size (5283 > 1792).
missing picture in access unit with size 1796
Invalid NAL unit size (5283 > 1792).
Error splitting the input into NAL units.
Invalid NAL unit size (79147 > 10042).
missing picture in access unit with size 10046
Invalid NAL unit size (79147 > 10042).
Error splitting the input into NAL units.
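The `RuntimeError: can't start new thread` in the traceback above is raised by CPython when the operating system refuses to create another thread, which commonly happens when a per-user or per-container process/thread limit is reached. Whether that is the cause here is an assumption, not something the log confirms. A minimal sketch for inspecting the relevant limits on a Linux host follows; `thread_limit_report` is a hypothetical helper name, and the cgroup path is only present on cgroup v2 systems.

```python
import os
import resource
import threading


def thread_limit_report():
    """Collect limits that often sit behind "can't start new thread"."""
    # RLIMIT_NPROC caps processes *and* threads per user on Linux.
    soft_nproc, hard_nproc = resource.getrlimit(resource.RLIMIT_NPROC)
    report = {
        "cpu_count": os.cpu_count(),
        "active_threads": threading.active_count(),
        "nproc_soft": soft_nproc,  # -1 (RLIM_INFINITY) means unlimited
        "nproc_hard": hard_nproc,
    }
    # Assumption: cgroup v2 layout. Containers often set a pids limit here.
    pids_max_path = "/sys/fs/cgroup/pids.max"
    try:
        with open(pids_max_path) as f:
            report["cgroup_pids_max"] = f.read().strip()
    except OSError:
        report["cgroup_pids_max"] = None  # file absent or unreadable
    return report


if __name__ == "__main__":
    for key, value in thread_limit_report().items():
        print(f"{key}: {value}")
```

If the reported limits are low relative to 16 worker processes (each of which may start its own decoder and manager threads), lowering `preprocessing_num_workers` or raising the limit would be a reasonable thing to try.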
Others
No response
Steps to reproduce the bug
None
Expected behavior
Expected to run successfully.
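The `Invalid NAL unit size` and `Error splitting the input into NAL units` messages in the log look like output from an FFmpeg-based H.264 decoder, which usually indicates truncated or corrupted video files. One way to rule out damaged inputs before preprocessing is to run a decode-only pass over each video with the `ffmpeg` command-line tool. The sketch below assumes `ffmpeg` is installed and on `PATH`; `build_decode_check_cmd` and `decode_errors` are hypothetical helper names, not part of LLaMA-Factory or datasets.

```python
import shutil
import subprocess


def build_decode_check_cmd(video_path):
    """Build an ffmpeg command that decodes the file and discards the output."""
    # "-v error" limits logging to errors; "-f null -" writes decoded frames nowhere.
    return ["ffmpeg", "-v", "error", "-i", video_path, "-f", "null", "-"]


def decode_errors(video_path):
    """Return ffmpeg's error output for one file; an empty string means none reported."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg executable not found on PATH")
    result = subprocess.run(
        build_decode_check_cmd(video_path),
        capture_output=True,
        text=True,
        check=False,  # a non-zero exit code is expected for broken files
    )
    return result.stderr.strip()
```

Calling `decode_errors(path)` on each video in the dataset and setting aside the files that return non-empty output would show whether the decoder warnings come from a small set of bad files.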
Environment info