Error while training and use of pre-trained weights #35

@Shuhul24

While training the OccWorld model with the repository's training command, training completes one epoch successfully and then fails with the following error:

Traceback (most recent call last):
  File "/scratch/p24cs0005/OccWorld/train.py", line 362, in <module>
    main(0, args)
  File "/scratch/p24cs0005/OccWorld/train.py", line 327, in main
    val_miou, _ = CalMeanIou_sem._after_epoch()
  File "/scratch/p24cs0005/OccWorld/utils/metric_util.py", line 96, in _after_epoch
    dist.all_reduce(self.total_seen)
  File "/csehome/p24cs0005/miniconda3/envs/occworld/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1451, in wrapper
    return func(*args, **kwargs)
  File "/csehome/p24cs0005/miniconda3/envs/occworld/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1699, in all_reduce
    default_pg = _get_default_group()
  File "/csehome/p24cs0005/miniconda3/envs/occworld/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 707, in _get_default_group
    raise RuntimeError(
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

Why does this error occur? It appears only after the first epoch of training has finished, during evaluation.
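For reference, the traceback shows `dist.all_reduce(self.total_seen)` being called in `_after_epoch` while no default process group exists, which is what PyTorch reports when `init_process_group` was never called (for example, in a run that is not distributed). A minimal sketch of a guard is below. It assumes the metric should simply keep its local value when the run is not distributed; `all_reduce_if_distributed` is a hypothetical helper name, not part of OccWorld.

```python
import torch
import torch.distributed as dist


def all_reduce_if_distributed(tensor):
    """Sum `tensor` across ranks if a process group exists; otherwise leave it unchanged.

    Hypothetical helper for illustration; not part of the OccWorld code base.
    """
    # is_available() is checked first because is_initialized() only makes
    # sense in PyTorch builds that ship the distributed package.
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(tensor)  # in-place sum over all ranks
    return tensor


# Without a process group, the call is a no-op instead of raising RuntimeError.
total_seen = torch.tensor([3.0, 5.0])
all_reduce_if_distributed(total_seen)
print(total_seen)
```

The alternative is to launch training so that a process group is actually initialized; which of the two fits depends on how `train.py` is meant to be started.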

Also, the GitHub repo links to a pretrained weight file named latest.pth. Is it for the OccWorld model or the VQVAE model?

I ask because I tried training the VQVAE, and the weights saved in out/vqvae also include a latest.pth file. Also, if I put the downloaded latest.pth in out/occworld, can I evaluate my results by running the evaluation command?
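One way to compare the two latest.pth files would be to list the parameter-name prefixes each checkpoint contains. The sketch below is only an assumption about how that could be done: it assumes the checkpoint is a dict that may wrap its weights under a `state_dict` key, and the helper names and demo keys are made up for illustration.

```python
from collections import Counter


def unwrap_state_dict(ckpt):
    """Return the parameter dict, looking inside a 'state_dict' entry if present."""
    # Assumption: some training scripts save {"state_dict": {...}, ...}.
    if isinstance(ckpt, dict) and isinstance(ckpt.get("state_dict"), dict):
        return ckpt["state_dict"]
    return ckpt


def top_level_prefixes(keys):
    """Count parameter names by their first dotted component."""
    return Counter(key.split(".", 1)[0] for key in keys)


def inspect_checkpoint(path):
    """Load a checkpoint on CPU and summarize its parameter-name prefixes."""
    import torch  # imported lazily so the helpers above work without PyTorch

    ckpt = torch.load(path, map_location="cpu")
    return top_level_prefixes(unwrap_state_dict(ckpt).keys())


# Made-up keys for illustration only; a real checkpoint will differ.
demo = {
    "state_dict": {
        "encoder.conv.weight": 0,
        "encoder.conv.bias": 0,
        "decoder.fc.weight": 0,
    }
}
print(top_level_prefixes(unwrap_state_dict(demo).keys()))
```

Running `inspect_checkpoint` on the downloaded file and on the one in out/vqvae, then comparing the two summaries, would show whether they hold the same set of modules.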
