
torch Distributed Data Parallel with ccl backend fails with torch 2.1.0+cpu and oneccl-bind-pt 2.1.0+cpu but works with torch 2.0.1+cpu and oneccl-bind-pt 2.0.0+cpu #53

Description

@XinyuYe-Intel

[Screenshot: error traceback from the failing run]

I use the transformers Trainer to fine-tune an LLM with Distributed Data Parallel using the ccl backend. With torch 2.1.0+cpu and oneccl-bind-pt 2.1.0+cpu it fails as shown in the screenshot above, but with torch 2.0.1+cpu and oneccl-bind-pt 2.0.0+cpu it works well.
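
For what it's worth, the ccl backend can also be exercised directly, without the Trainer. Below is a minimal sketch of such a check; the environment-variable fallbacks, the hard-coded master address/port, and the file name ccl_allreduce_check.py are my assumptions rather than part of the finetuning script referenced below.

# ccl_allreduce_check.py -- minimal sketch of a ccl-backend allreduce (assumptions noted above)
import os
import torch
import torch.distributed as dist
import oneccl_bindings_for_pytorch  # noqa: F401  (importing this registers the "ccl" backend)

# Intel MPI's mpirun exposes PMI_RANK / PMI_SIZE; fall back to them if RANK / WORLD_SIZE are unset.
os.environ.setdefault("RANK", os.environ.get("PMI_RANK", "0"))
os.environ.setdefault("WORLD_SIZE", os.environ.get("PMI_SIZE", "1"))
os.environ.setdefault("MASTER_ADDR", "172.17.0.2")  # first host from the command below
os.environ.setdefault("MASTER_PORT", "29500")       # assumed free port

dist.init_process_group(backend="ccl")

# A single allreduce exercises the same collective path that DDP gradient sync relies on.
t = torch.ones(4)
dist.all_reduce(t)
print(f"rank {dist.get_rank()}/{dist.get_world_size()}: {t.tolist()}")

dist.destroy_process_group()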

The script I used is https://github.com/intel/intel-extension-for-transformers/blob/main/intel_extension_for_transformers/neural_chat/examples/finetuning/instruction/finetune_clm.py, and the command is:

mpirun --host 172.17.0.2,172.17.0.3 -n 2 -ppn 1 -genv OMP_NUM_THREADS=48 python3 finetune_clm.py \
    --model_name_or_path mosaicml/mpt-7b-chat \
    --train_file alpaca_data.json \
    --bf16 False \
    --output_dir ./mpt_peft_finetuned_model \
    --num_train_epochs 1 \
    --max_steps 3 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 1 \
    --learning_rate 1e-4 \
    --logging_steps 1 \
    --peft lora \
    --group_by_length True \
    --dataset_concatenation \
    --do_train \
    --trust_remote_code True \
    --tokenizer_name "EleutherAI/gpt-neox-20b" \
    --use_fast_tokenizer True \
    --max_eval_samples 64 \
    --no_cuda \
    --ddp_backend ccl
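
Running the minimal ccl_allreduce_check.py sketch above under the same mpirun invocation (same hosts, -n 2 -ppn 1, OMP_NUM_THREADS=48), substituted for finetune_clm.py, should show whether the regression lies in oneccl-bind-pt 2.1.0+cpu itself or in the Trainer integration.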

Can you help investigate this issue?
