
Does Trainer(devices=1) use all CPUs? #19595

Closed
MaximilienLC opened this issue Mar 7, 2024 · 7 comments · Fixed by #20399
Labels
good first issue Good for newcomers help wanted Open to be worked on question Further information is requested ver: 2.2.x

Comments

@MaximilienLC

MaximilienLC commented Mar 7, 2024

Bug description

def _parse_cpu_cores(cpu_cores: Union[int, str, List[int]]) -> int:

Passing cpu_cores as a list of integers always raises an exception, which it shouldn't according to the Trainer documentation and this function's signature.
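For reference, a minimal sketch of a parser that would honor every annotated input type. This is not Lightning's actual implementation, and treating a list of indices as a core count is purely an assumption about what the signature might intend:

```python
from typing import List, Union

def parse_cpu_cores(cpu_cores: Union[int, str, List[int]]) -> int:
    """Hypothetical parser matching the annotated signature.
    Lightning's real _parse_cpu_cores raises on the list case instead."""
    if isinstance(cpu_cores, int):
        return cpu_cores
    if isinstance(cpu_cores, str) and cpu_cores.strip().isdigit():
        return int(cpu_cores.strip())
    if isinstance(cpu_cores, list) and all(isinstance(c, int) for c in cpu_cores):
        # Assumption: interpret a list of device indices as a core count.
        return len(cpu_cores)
    raise TypeError("`devices` selected with `CPUAccelerator` should be an int > 0.")
```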

What version are you seeing the problem on?

master

How to reproduce the bug

No response

Error messages and logs

# Error messages and logs here please

Environment

Current environment
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 2.0):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):

More info

No response

cc @Borda

@MaximilienLC MaximilienLC added bug Something isn't working needs triage Waiting to be triaged by maintainers labels Mar 7, 2024
@awaelchli
Contributor

awaelchli commented Mar 8, 2024

@MaximilienLC Do you mean this?

trainer = Trainer(
    accelerator="cpu",
    devices=[1, 3],
)

TypeError: `devices` selected with `CPUAccelerator` should be an int > 0.

This is correct. It does not make sense to select device indices on a CPU. Device indices are meant for accelerators like CUDA GPUs or TPUs.

If you select devices=1 (int) on CPU, PyTorch will already use all cores and threads when appropriate. And devices > 1 just simulates DDP on CPU; it gives no speed benefit and is only meant for debugging DDP.
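As a quick way to verify the "all cores" behavior, here is a small sketch (assuming a standard PyTorch install) that inspects PyTorch's intra-op thread pool, which is sized from the available cores:

```python
import os
import torch

# PyTorch sizes its intra-op thread pool from the available cores, so a
# single CPU "device" already spreads ops like matmul across all of them.
print(f"logical CPUs:     {os.cpu_count()}")
print(f"intra-op threads: {torch.get_num_threads()}")

# The pool can also be resized explicitly, e.g. on a shared machine:
torch.set_num_threads(2)
assert torch.get_num_threads() == 2
```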

@awaelchli awaelchli added question Further information is requested working as intended Working as intended and removed bug Something isn't working needs triage Waiting to be triaged by maintainers labels Mar 8, 2024
@awaelchli
Contributor

If there is documentation that contradicts this, please point me to it so we can update it. Thanks!

@MaximilienLC
Author

devices: The devices to use. Can be set to a positive number (int or str), a sequence of device indices

I guess this section could add that info for CPU.

So you mean devices=1 on CPU equates to using all CPUs?

@awaelchli
Contributor

So you mean devices=1 on CPU equates to using all CPUs?

Yes, PyTorch will use all CPU cores when it can parallelize the operation accordingly.

@awaelchli awaelchli changed the title Trainer argument devices will always raise an Exception with arguments accelerator=cpu and devices: list[int] Does Trainer(devices=1) use all CPUs? Mar 8, 2024
@awaelchli awaelchli reopened this Mar 8, 2024
@jneuendorf

I agree that this should not throw an exception, assuming the current documentation is correct. For example, the device could come from an environment variable:

trainer = Trainer(
    accelerator=os.environ.get("DEVICE", "cpu"),
    devices=-1,  # use all devices
)

So when deploying the code to different machines, it would break on CPU. Maybe a warning would be more appropriate, notifying the user that something is not optimally configured.

Would devices="auto" also use all devices?
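Until that is resolved, one way to guard against this in user code is a small wrapper that normalizes the devices argument before it reaches the Trainer. resolve_devices below is a hypothetical helper, not a Lightning API:

```python
import os

def resolve_devices(accelerator: str, devices):
    """Hypothetical user-side guard: CPUAccelerator only accepts a
    positive int, so collapse -1 / index lists to a plain count on CPU."""
    if accelerator == "cpu" and (devices == -1 or isinstance(devices, (list, tuple))):
        # One CPU "device" is enough; PyTorch still uses all cores.
        return 1
    return devices

accelerator = os.environ.get("DEVICE", "cpu")
devices = resolve_devices(accelerator, -1)
print(accelerator, devices)
```

On a GPU machine the helper passes devices through unchanged, so the same config works in both environments.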

@awaelchli awaelchli added good first issue Good for newcomers help wanted Open to be worked on and removed working as intended Working as intended labels Aug 3, 2024
@fisaenko

Hi! Would like to work on this one.

@awaelchli, I wanted to clarify which option is more logical here: changing the exception in the accelerator="cpu" case so that it is not thrown for devices=-1 (as pointed out above), or keeping the current behavior as correct and only adding more info to the devices section of the trainer.py docstring?

@chualanagit
Contributor

Humbly, I think the solution should be to remove List[int] as an input type for the CPU accelerator. @awaelchli, please review this PR!
