Problem
When using `oneccl_bindings_for_pytorch` together with `intel_extension_for_pytorch` built with Intel GPU support, the ordering of the import statements matters for functionality, and this does not seem to be documented in the repository or anywhere else I have found. `intel_extension_for_pytorch` needs to be imported before `oneccl_bindings_for_pytorch`, otherwise the GPU collectives will not be recognized:
Minimum example to reproduce
Below is a minimum working example that demonstrates the error: `oneccl_bindings_for_pytorch` is imported before IPEX, and the script fails with an error saying that `allreduce` isn't implemented on backend `[xpu]`. The script is launched with `mpirun -n 4 -genvall -bootstrap ssh python ccl_test.py`.
```python
import os

import torch
import torch.distributed as dist

# note the ordering: the oneCCL bindings are imported *before* IPEX
import oneccl_bindings_for_pytorch
import intel_extension_for_pytorch as ipex

rank = int(os.environ["PMI_RANK"])
world_size = int(os.environ["PMI_SIZE"])
torch.manual_seed(rank)

os.environ["RANK"] = str(rank)
os.environ["WORLD_SIZE"] = str(world_size)
os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = "21616"

dist.init_process_group(backend="ccl")

# generate random data on XPU
data = torch.rand(16, 8, device=f"xpu:{rank}")
if dist.get_rank() == 0:
    print(f"Initializing XPU data for rank {rank}")
    print(data)
    print(f"Performing all reduce for {world_size} ranks")

dist.all_reduce(data)
dist.barrier()

if dist.get_rank() == 0:
    print("All reduce done")
    print(data)
```
The error:
```
Performing all reduce for 4 ranks
Traceback (most recent call last):
  File "ccl_test.py", line 42, in <module>
    dist.all_reduce(data)
  File ".../lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1534, in all_reduce
    work = default_pg.allreduce([tensor], opts)
RuntimeError: oneccl_bindings_for_pytorch: allreduce isn't implementd on backend [xpu].
```
This also triggers for other collectives (e.g. `allgather`). The code runs successfully if you import IPEX first, followed by the oneCCL bindings, as sketched below.
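For reference, this is the ordering that works for me; it is a minimal sketch of the top of the same script, with only the two extension imports swapped:
```python
import os

import torch
import torch.distributed as dist

# working ordering: IPEX first, then the oneCCL bindings
import intel_extension_for_pytorch as ipex
import oneccl_bindings_for_pytorch

# ... rest of the script is unchanged from the repro above
```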
Proposed solution
Please add documentation regarding this behavior: it is actually expected, since IPEX and the oneCCL bindings act on `torch` dynamically at import time, but this is not documented and may confuse users.
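Until the docs cover this, something like the following defensive check is what I have in mind for user scripts. It is purely a hypothetical sketch of my own (not an API of either package) and only relies on `sys.modules` preserving import order:
```python
import sys


def check_xpu_import_order() -> None:
    """Heuristic guard: fail fast if the oneCCL bindings were imported
    before IPEX, which leaves the XPU collectives unregistered.

    Hypothetical helper, not part of either library; call it after all
    imports have run.
    """
    modules = list(sys.modules)  # insertion-ordered in Python 3.7+
    try:
        ccl_pos = modules.index("oneccl_bindings_for_pytorch")
        ipex_pos = modules.index("intel_extension_for_pytorch")
    except ValueError:
        return  # one of the packages was never imported; nothing to check
    if ccl_pos < ipex_pos:
        raise RuntimeError(
            "Import intel_extension_for_pytorch before "
            "oneccl_bindings_for_pytorch, otherwise XPU collectives "
            "are not registered."
        )
```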