
Commit 4a0b0c3

Support setting the cache_size_limit parameter of dynamo in PyTorch 2.0 (#10054)
1 parent a2f33db commit 4a0b0c3

File tree

6 files changed: +115 -2 lines changed

docs/en/notes/faq.md

Lines changed: 36 additions & 0 deletions
@@ -2,6 +2,42 @@

We list some common troubles faced by many users and their corresponding solutions here. Feel free to enrich the list if you find any frequent issues and have ways to help others solve them. If the contents here do not cover your issue, please create an issue using the [provided templates](https://github.com/open-mmlab/mmdetection/blob/master/.github/ISSUE_TEMPLATE/error-report.md/) and make sure you fill in all required information in the template.

## PyTorch 2.0 Support

The vast majority of algorithms in MMDetection now support PyTorch 2.0 and its `torch.compile` function. Users only need to install MMDetection 3.0.0rc7 or a later version to enjoy this feature. If you find any unsupported algorithms during use, please feel free to give us feedback. We also welcome contributions from the community to benchmark the speed improvement brought by using the `torch.compile` function.

To enable the `torch.compile` function, simply add `--cfg-options compile=True` after `train.py` or `test.py`. For example, to enable `torch.compile` for RTMDet, you can use the following command:

```shell
# Single GPU
python tools/train.py configs/rtmdet/rtmdet_s_8xb32-300e_coco.py --cfg-options compile=True

# Single node multiple GPUs
./tools/dist_train.sh configs/rtmdet/rtmdet_s_8xb32-300e_coco.py 8 --cfg-options compile=True

# Single node multiple GPUs + AMP
./tools/dist_train.sh configs/rtmdet/rtmdet_s_8xb32-300e_coco.py 8 --cfg-options compile=True --amp
```

It is important to note that PyTorch 2.0's support for dynamic shapes is not yet mature. In most object detection algorithms, not only are the input shapes dynamic, but the loss calculation and post-processing parts are dynamic as well, which can make training slower when the `torch.compile` function is enabled. Therefore, if you wish to enable `torch.compile`, you should follow these principles:

1. Input images to the network have a fixed shape, not multi-scale.
2. Set the `torch._dynamo.config.cache_size_limit` parameter. TorchDynamo converts and caches Python bytecode, and the compiled functions are stored in the cache. When a check finds that a function needs to be recompiled, it is recompiled and cached again. However, once the number of recompilations exceeds the configured maximum (64 by default), the function is no longer cached or recompiled. As mentioned above, the loss calculation and post-processing parts of an object detection algorithm are computed dynamically and need to be recompiled on every iteration, so setting `torch._dynamo.config.cache_size_limit` to a smaller value can effectively reduce compilation time.

In MMDetection, you can set the `torch._dynamo.config.cache_size_limit` parameter through the environment variable `DYNAMO_CACHE_SIZE_LIMIT`. For example:

```shell
# Single GPU
export DYNAMO_CACHE_SIZE_LIMIT=4
python tools/train.py configs/rtmdet/rtmdet_s_8xb32-300e_coco.py --cfg-options compile=True

# Single node multiple GPUs
export DYNAMO_CACHE_SIZE_LIMIT=4
./tools/dist_train.sh configs/rtmdet/rtmdet_s_8xb32-300e_coco.py 8 --cfg-options compile=True
```

For common questions about PyTorch 2.0's dynamo, refer to the [PyTorch dynamo FAQ](https://pytorch.org/docs/stable/dynamo/faq.html).

## Installation

- Compatibility issue between MMCV and MMDetection; "ConvWS is already registered in conv layer"; "AssertionError: MMCV==xxx is used but incompatible. Please install mmcv>=xxx, \<=xxx."
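The cache-limit behavior described in the FAQ above can be sketched with a small, torch-free stand-in (the `CompileCache` class and its method names are hypothetical, for illustration only): every new input shape triggers a recompile, and once the number of compiles reaches the limit, the function falls back to eager execution, so a small limit caps the total time spent compiling under dynamic shapes.

```python
import os


# Hypothetical stand-in for dynamo's compile cache, for illustration only:
# each new input shape triggers a recompile, and once the number of
# compiles reaches cache_size_limit the function falls back to eager mode.
class CompileCache:

    def __init__(self, cache_size_limit=64):  # 64 mirrors dynamo's default
        self.cache_size_limit = cache_size_limit
        self.compiled = {}  # input shape -> compiled artifact
        self.compile_count = 0

    def run(self, fn, shape):
        if shape in self.compiled:
            return self.compiled[shape]  # cache hit: no compile cost
        if self.compile_count >= self.cache_size_limit:
            return f'eager:{fn.__name__}'  # limit reached: stop compiling
        self.compile_count += 1  # pay the compile cost once per shape
        self.compiled[shape] = f'compiled:{fn.__name__}:{shape}'
        return self.compiled[shape]


def loss(x):  # stands in for a dynamic-shape loss/post-processing function
    return x


# Mirrors the DYNAMO_CACHE_SIZE_LIMIT env-var pattern used by MMDetection.
cache = CompileCache(
    cache_size_limit=int(os.environ.get('DYNAMO_CACHE_SIZE_LIMIT', '4')))
for s in (480, 512, 544, 576, 608, 640):  # multi-scale inputs
    cache.run(loss, (640, s))
print(cache.compile_count)  # stops growing once the limit is reached
```

With a limit of 4 and six distinct shapes, only the first four shapes are compiled; the rest run eagerly, which is exactly why fixed-shape inputs plus a small `cache_size_limit` keep compile overhead bounded.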

docs/zh_cn/notes/faq.md

Lines changed: 36 additions & 0 deletions
@@ -2,6 +2,42 @@

We list some common problems encountered during use and their corresponding solutions here. If you find that some issues are missing, feel free to open a PR to enrich this list. If you cannot get help here, please create an issue using the [issue template](https://github.com/open-mmlab/mmdetection/blob/master/.github/ISSUE_TEMPLATE/error-report.md/), and fill in all required information in the template, which helps us locate the problem faster.

## PyTorch 2.0 Support

The vast majority of algorithms in MMDetection now support PyTorch 2.0 and its `torch.compile` function. Users only need to install MMDetection 3.0.0rc7 or a later version. If you find any unsupported algorithms during use, please give us feedback. We also warmly welcome community contributors to benchmark the speed improvement brought by the `torch.compile` function.

To enable the `torch.compile` function, simply add `--cfg-options compile=True` after `train.py` or `test.py`. Taking RTMDet as an example, you can enable `torch.compile` with the following command:

```shell
# Single GPU
python tools/train.py configs/rtmdet/rtmdet_s_8xb32-300e_coco.py --cfg-options compile=True

# Single node, 8 GPUs
./tools/dist_train.sh configs/rtmdet/rtmdet_s_8xb32-300e_coco.py 8 --cfg-options compile=True

# Single node, 8 GPUs + AMP mixed-precision training
./tools/dist_train.sh configs/rtmdet/rtmdet_s_8xb32-300e_coco.py 8 --cfg-options compile=True --amp
```

Note in particular that PyTorch 2.0's support for dynamic shapes is not yet mature. In most object detection algorithms, not only the input shapes but also the loss calculation and post-processing are dynamic, which can make training slower after enabling `torch.compile`. Therefore, if you want to enable `torch.compile`, you should follow these principles:

1. Input images to the network have a fixed shape, not multi-scale.
2. Set the `torch._dynamo.config.cache_size_limit` parameter. TorchDynamo converts and caches Python bytecode, and compiled functions are stored in the cache. When a check finds that a function needs to be recompiled, it is recompiled and cached again. However, once the number of recompilations exceeds the preset maximum (64), the function is no longer cached or recompiled. As mentioned above, the loss calculation and post-processing parts of an object detection algorithm are also computed dynamically and need to be recompiled on every iteration, so setting `torch._dynamo.config.cache_size_limit` to a smaller value can effectively reduce compilation time.

In MMDetection, you can set the `torch._dynamo.config.cache_size_limit` parameter through the environment variable `DYNAMO_CACHE_SIZE_LIMIT`. Taking RTMDet as an example, the command is as follows:

```shell
# Single GPU
export DYNAMO_CACHE_SIZE_LIMIT=4
python tools/train.py configs/rtmdet/rtmdet_s_8xb32-300e_coco.py --cfg-options compile=True

# Single node, 8 GPUs
export DYNAMO_CACHE_SIZE_LIMIT=4
./tools/dist_train.sh configs/rtmdet/rtmdet_s_8xb32-300e_coco.py 8 --cfg-options compile=True
```

For common questions about PyTorch 2.0's dynamo, refer to the [PyTorch dynamo FAQ](https://pytorch.org/docs/stable/dynamo/faq.html).

## Installation

- Compatibility issue between MMCV and MMDetection: "ConvWS is already registered in conv layer"; "AssertionError: MMCV==xxx is used but incompatible. Please install mmcv>=xxx, \<=xxx."

mmdet/utils/__init__.py

Lines changed: 4 additions & 2 deletions
@@ -8,7 +8,8 @@
 from .misc import (find_latest_checkpoint, get_test_pipeline_cfg,
                    update_data_root)
 from .replace_cfg_vals import replace_cfg_vals
-from .setup_env import register_all_modules, setup_multi_processes
+from .setup_env import (register_all_modules, setup_cache_size_limit_of_dynamo,
+                        setup_multi_processes)
 from .split_batch import split_batch
 from .typing_utils import (ConfigType, InstanceList, MultiConfig,
                           OptConfigType, OptInstanceList, OptMultiConfig,
@@ -21,5 +22,6 @@
     'AvoidCUDAOOM', 'all_reduce_dict', 'allreduce_grads', 'reduce_mean',
     'sync_random_seed', 'ConfigType', 'InstanceList', 'MultiConfig',
     'OptConfigType', 'OptInstanceList', 'OptMultiConfig', 'OptPixelList',
-    'PixelList', 'RangeType', 'get_test_pipeline_cfg'
+    'PixelList', 'RangeType', 'get_test_pipeline_cfg',
+    'setup_cache_size_limit_of_dynamo'
 ]

mmdet/utils/setup_env.py

Lines changed: 28 additions & 0 deletions
@@ -1,12 +1,40 @@
 # Copyright (c) OpenMMLab. All rights reserved.
 import datetime
+import logging
 import os
 import platform
 import warnings

 import cv2
 import torch.multiprocessing as mp
 from mmengine import DefaultScope
+from mmengine.logging import print_log
+from mmengine.utils import digit_version
+
+
+def setup_cache_size_limit_of_dynamo():
+    """Setup cache size limit of dynamo.
+
+    Note: Due to the dynamic shape of the loss calculation and
+    post-processing parts in the object detection algorithm, these
+    functions must be compiled every time they are run.
+    Setting a large value for torch._dynamo.config.cache_size_limit
+    may result in repeated compilation, which can slow down training
+    and testing speed. Therefore, we need to set the default value of
+    cache_size_limit smaller. An empirical value is 4.
+    """
+
+    import torch
+    if digit_version(torch.__version__) >= digit_version('2.0.0'):
+        if 'DYNAMO_CACHE_SIZE_LIMIT' in os.environ:
+            import torch._dynamo
+            cache_size_limit = int(os.environ['DYNAMO_CACHE_SIZE_LIMIT'])
+            torch._dynamo.config.cache_size_limit = cache_size_limit
+            print_log(
+                f'torch._dynamo.config.cache_size_limit is forcibly '
+                f'set to {cache_size_limit}.',
+                logger='current',
+                level=logging.WARNING)


 def setup_multi_processes(cfg):
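The version gate above relies on mmengine's `digit_version` helper. A simplified, hypothetical stand-in (the real helper also handles pre-release tags such as `2.0.0rc7`) shows the comparison it performs: turn each version string into a tuple of integers so that tuple comparison orders versions correctly.

```python
# Simplified, hypothetical stand-in for mmengine's digit_version helper;
# the real implementation also handles pre-release tags like '2.0.0rc7'.
def digit_version(version_str):
    """Turn a version string into a comparable tuple of ints.

    Non-numeric suffixes within a component are truncated, so a local
    build tag such as '2.0.0+cu118' compares like (2, 0, 0).
    """
    parts = []
    for component in version_str.split('.'):
        digits = ''
        for ch in component:
            if not ch.isdigit():
                break  # stop at the first non-digit character
            digits += ch
        if digits:
            parts.append(int(digits))
    return tuple(parts)


# The gate in setup_cache_size_limit_of_dynamo only applies the override
# when torch is at least 2.0.0, since older torch has no dynamo config.
print(digit_version('2.0.1') >= digit_version('2.0.0'))   # True
print(digit_version('1.13.1') >= digit_version('2.0.0'))  # False
```

Tuple comparison is what makes `(1, 13, 1) < (2, 0, 0)` come out right even though the string `'1.13.1'` would sort after `'2.0.0'` digit-by-digit in some naive schemes.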

tools/test.py

Lines changed: 5 additions & 0 deletions
@@ -12,6 +12,7 @@
 from mmdet.engine.hooks.utils import trigger_visualization_hook
 from mmdet.evaluation import DumpDetResults
 from mmdet.registry import RUNNERS
+from mmdet.utils import setup_cache_size_limit_of_dynamo


 # TODO: support fuse_conv_bn and format_only
@@ -65,6 +66,10 @@ def parse_args():
 def main():
     args = parse_args()

+    # Reduce the number of repeated compilations and improve
+    # testing speed.
+    setup_cache_size_limit_of_dynamo()
+
     # load config
     cfg = Config.fromfile(args.config)
     cfg.launcher = args.launcher

tools/train.py

Lines changed: 6 additions & 0 deletions
@@ -9,6 +9,8 @@
 from mmengine.registry import RUNNERS
 from mmengine.runner import Runner

+from mmdet.utils import setup_cache_size_limit_of_dynamo
+

 def parse_args():
     parser = argparse.ArgumentParser(description='Train a detector')
@@ -60,6 +62,10 @@ def parse_args():
 def main():
     args = parse_args()

+    # Reduce the number of repeated compilations and improve
+    # training speed.
+    setup_cache_size_limit_of_dynamo()
+
     # load config
     cfg = Config.fromfile(args.config)
     cfg.launcher = args.launcher
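Both entry points follow the same ordering: configure the dynamo cache limit at the very top of `main()`, before the config is loaded and any model code could be compiled. A minimal sketch of that ordering, using hypothetical stand-ins (not the real mmdet/mmengine APIs):

```python
import argparse

# Hypothetical sketch of the ordering used by tools/train.py and
# tools/test.py: the dynamo cache limit is configured first in main(),
# before the config is even loaded, so no function can be compiled under
# the wrong limit. The stand-in functions record their call order.
calls = []


def setup_cache_size_limit_of_dynamo():  # stand-in for the mmdet.utils helper
    calls.append('setup_dynamo')


def config_fromfile(path):  # stand-in for mmengine's Config.fromfile
    calls.append('load_config')
    return {'config_path': path}


def main(argv):
    parser = argparse.ArgumentParser(description='Train a detector')
    parser.add_argument('config')
    args = parser.parse_args(argv)

    # Reduce repeated compilations before anything else touches the model.
    setup_cache_size_limit_of_dynamo()

    cfg = config_fromfile(args.config)
    return cfg


main(['configs/rtmdet/rtmdet_s_8xb32-300e_coco.py'])
print(calls)  # setup runs before config loading
```

Keeping the setup call ahead of `Config.fromfile` is the design choice the diff makes in both scripts; a limit applied after compilation has started would not help the already-compiled functions.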
