Roundup of deep learning framework bugs and workarounds

mmdetection distributed training deadlock

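Symptom: with certain settings, or in certain situations, GPU memory fills up but training makes no progress. Interrupting the program shows it stopped at (pid, sts) = os.waitpid(self.pid, wait_flags)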

References: distributed all_reduce deadlocks in v1.1 · Issue #20630 · pytorch/pytorch (github.com); dist_train keep waiting when filter_empty_gt=False · Issue #2193 · open-mmlab/mmdetection (github.com)

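Cause: in short, when the GPUs synchronize through all_reduce, the results on different GPUs do not match, so the ranks wait on each other forever. The likely specific causes are:
- The number of losses differs between GPUs
- The computation graph differs between GPUs, so some parameters have a grad on one GPU and none on another
- PyTorch 1.1 uses NCCL 2.4.2, which has a known issue of hanging with long-running jobs that was fixed in 2.4.6. NVIDIA/nccl@f40ce73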
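Solutions:
- Run export NCCL_LL_THRESHOLD=0, or upgrade NCCL. This bypasses all of the factors above and lets the program keep running, but all_reduce may then produce unexpected values and lead to wrong results
- The first factor above is easy to fix, so it is skipped here
- For the second factor, find where backward is called (e.g. mmdet/core/fp16/hooks.py L65), print the names of all parameters whose grad is not None, use that to work out why the computation graphs differ, and modify the network structure accordingly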
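
The check for the second factor can be sketched roughly as below. This is only a minimal sketch, not mmdetection code: the helper names split_params_by_grad and report_grads are made up for illustration, and the only thing assumed about the model is that it exposes named_parameters() the way torch.nn.Module does, with each parameter carrying requires_grad and grad.
```python
def split_params_by_grad(model):
    """Return (names with a gradient, names without one).

    Call this right after loss.backward(). `model` is assumed to behave
    like a torch.nn.Module, i.e. it provides named_parameters().
    """
    with_grad, without_grad = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            # Frozen parameters are not expected to have a gradient.
            continue
        if param.grad is not None:
            with_grad.append(name)
        else:
            without_grad.append(name)
    return with_grad, without_grad


def report_grads(model, rank):
    """Print a per-rank summary; `rank` is only used to label the output."""
    with_grad, without_grad = split_params_by_grad(model)
    print(f"[rank {rank}] {len(with_grad)} params with grad, "
          f"{len(without_grad)} without")
    for name in without_grad:
        print(f"[rank {rank}] no grad: {name}")
    return without_grad
```
Calling report_grads(model, rank) on every process right after backward and comparing the printed lists should show which parameters receive a gradient on one GPU but not on another, which points to the branch of the network that is only executed on some GPUs.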
mmdetection RoIPool an illegal memory access was encountered

Symptom: mmdetection's built-in RoIPool randomly triggers out-of-bounds memory accesses, for example with
```
roi_layer=dict(type='RoIPool', out_size=7)
```
References: CUDA error: an illegal memory access was encountered still exists for RoiPool · Issue #2145 · open-mmlab/mmdetection (github.com); Maybe certain bugs exists in the RoIPool cuda source file "roi_pool_kernel.cu" · Issue #1007 · open-mmlab/mmdetection (github.com)

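Solutions:
- This is reportedly fixed in newer versions, so you can either copy the RoIPool code from a newer version or change the config to use torchvision's RoIPool, as follows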
```
roi_layer=dict(type='RoIPool', out_size=7, use_torchvision=True),
```
Original post: https://www.cnblogs.com/Xiaoyan-Li/p/14451547.html