
torch Distributed Data Parallel with ccl backend fails with torch 2.1.0+cpu and oneccl-bind-pt 2.1.0+cpu but works with torch 2.0.1+cpu and oneccl-bind-pt 2.0.0+cpu #53

Description

@XinyuYe-Intel

[Screenshot: error traceback from the failing run]

I use the transformers Trainer to fine-tune an LLM with Distributed Data Parallel using the ccl backend. With torch 2.1.0+cpu and oneccl-bind-pt 2.1.0+cpu it fails as shown in the screenshot above, but with torch 2.0.1+cpu and oneccl-bind-pt 2.0.0+cpu it works well.
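
For what it's worth, the ccl backend can also be exercised directly, without the Trainer. Below is a minimal sketch of such a check; the environment-variable fallbacks, the hard-coded master address/port, and the file name ccl_allreduce_check.py are my assumptions rather than part of the finetuning script referenced below.

# ccl_allreduce_check.py -- minimal sketch of a ccl-backend allreduce (assumptions noted above)
import os
import torch
import torch.distributed as dist
import oneccl_bindings_for_pytorch  # noqa: F401  (importing this registers the "ccl" backend)

# Intel MPI's mpirun exposes PMI_RANK / PMI_SIZE; fall back to them if RANK / WORLD_SIZE are unset.
os.environ.setdefault("RANK", os.environ.get("PMI_RANK", "0"))
os.environ.setdefault("WORLD_SIZE", os.environ.get("PMI_SIZE", "1"))
os.environ.setdefault("MASTER_ADDR", "172.17.0.2")  # first host from the command below
os.environ.setdefault("MASTER_PORT", "29500")       # assumed free port

dist.init_process_group(backend="ccl")

# A single allreduce exercises the same collective path that DDP gradient sync relies on.
t = torch.ones(4)
dist.all_reduce(t)
print(f"rank {dist.get_rank()}/{dist.get_world_size()}: {t.tolist()}")

dist.destroy_process_group()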

The script I used is https://github.com/intel/intel-extension-for-transformers/blob/main/intel_extension_for_transformers/neural_chat/examples/finetuning/instruction/finetune_clm.py, and the command is:

mpirun --host 172.17.0.2,172.17.0.3 -n 2 -ppn 1 -genv OMP_NUM_THREADS=48 python3 finetune_clm.py \
    --model_name_or_path mosaicml/mpt-7b-chat \
    --train_file alpaca_data.json \
    --bf16 False \
    --output_dir ./mpt_peft_finetuned_model \
    --num_train_epochs 1 \
    --max_steps 3 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 1 \
    --learning_rate 1e-4 \
    --logging_steps 1 \
    --peft lora \
    --group_by_length True \
    --dataset_concatenation \
    --do_train \
    --trust_remote_code True \
    --tokenizer_name "EleutherAI/gpt-neox-20b" \
    --use_fast_tokenizer True \
    --max_eval_samples 64 \
    --no_cuda \
    --ddp_backend ccl
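
Running the minimal ccl_allreduce_check.py sketch above under the same mpirun invocation (same hosts, -n 2 -ppn 1, OMP_NUM_THREADS=48), substituted for finetune_clm.py, should show whether the regression lies in oneccl-bind-pt 2.1.0+cpu itself or in the Trainer integration.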

Can you help investigate this issue?
