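Official: https://github.com/MhLiao/DB
Zhou Jun's implementation: https://github.com/WenmuZhou/DBNet.pytorch
1. The official implementation
The official implementation installs easily if you follow the instructions. My environment is Ubuntu 16.04 + CUDA 8, so I had been using PyTorch 1.0.1 (Python 3.7), and training does run. However, inference with the trained model produces nothing: every txt file is empty, the images in the visualize folder are gray and hazy with no boxes, and the loss does not converge: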
[INFO] [2020-01-18 16:24:09,584] step: 1340, epoch: 0, loss: 4.332346, lr: 0.007000
[INFO] [2020-01-18 16:24:09,585] bce_loss: 0.568492
[INFO] [2020-01-18 16:24:09,585] thresh_loss: 0.563487
[INFO] [2020-01-18 16:24:09,586] l1_loss: 0.092640
[INFO] [2020-01-18 16:24:19,117] step: 1360, epoch: 0, loss: 4.255758, lr: 0.007000
[INFO] [2020-01-18 16:24:19,120] bce_loss: 0.544069
[INFO] [2020-01-18 16:24:19,122] thresh_loss: 0.539020
[INFO] [2020-01-18 16:24:19,124] l1_loss: 0.099640
[INFO] [2020-01-18 16:24:28,766] step: 1380, epoch: 0, loss: 4.507674, lr: 0.007000
[INFO] [2020-01-18 16:24:28,767] bce_loss: 0.560643
[INFO] [2020-01-18 16:24:28,768] thresh_loss: 0.652172
[INFO] [2020-01-18 16:24:28,768] l1_loss: 0.105229
Training on the IC15 dataset behaves the same way. I don't know where the problem is yet; I will look into it later.
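Rather than eyeballing the log, the total loss can be pulled out and compared across steps. The snippet below is only a rough sketch that assumes the log format shown above; the names parse_losses and STEP_LINE are mine, not part of the DB repository.

```python
import re

# Matches the per-step summary lines of the log above, for example:
# [INFO] [2020-01-18 16:24:09,584] step: 1340, epoch: 0, loss: 4.332346, lr: 0.007000
# Lines such as "bce_loss: ..." contain no "step:" field and are skipped.
STEP_LINE = re.compile(
    r"step:\s*(\d+),\s*epoch:\s*(\d+),\s*loss:\s*([0-9.]+),\s*lr:\s*([0-9.]+)"
)


def parse_losses(log_text):
    """Return a list of (step, total_loss) pairs found in the log text."""
    pairs = []
    for line in log_text.splitlines():
        match = STEP_LINE.search(line)
        if match:
            pairs.append((int(match.group(1)), float(match.group(3))))
    return pairs


sample_log = """\
[INFO] [2020-01-18 16:24:09,584] step: 1340, epoch: 0, loss: 4.332346, lr: 0.007000
[INFO] [2020-01-18 16:24:09,585] bce_loss: 0.568492
[INFO] [2020-01-18 16:24:19,117] step: 1360, epoch: 0, loss: 4.255758, lr: 0.007000
[INFO] [2020-01-18 16:24:28,766] step: 1380, epoch: 0, loss: 4.507674, lr: 0.007000
"""

pairs = parse_losses(sample_log)
for step, loss in pairs:
    print(step, loss)
```

Feeding a whole log file through parse_losses and plotting the pairs makes it easier to tell a flat loss from a noisy but slowly decreasing one.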
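There is also a bug: the learning rate always starts at 0.07 and the configured value has no effect. It is hard-coded and has to be changed in DB/DB-master/training/learning_rate.py.
Suspecting that the non-convergence was caused by the library versions, I installed two CUDA versions side by side on my machine (8 and 10), created a virtual environment for CUDA 10, and then hit CUDA errors: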
error in : invalid device function
RuntimeError: copy_if failed to synchronize: device-side assert triggered
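I struggled with this for a long time without finding a solution. While debugging, I noticed that the CUDA installed by conda was version 10.1, whereas my local CUDA was 10.0. I also saw
the author's reply in an issue: https://github.com/MhLiao/DB/issues/36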
‘
Make sure your CUDA path of $CUDA_HOME is the same version as your CUDA in PyTorch by the command of echo $CUDA_HOME. If not, you need to change the $CUDA_HOME by export CUDA_HOME=path-of-another-version or re-install PyTorch with the same CUDA version as in CUDA_HOME.
’
The local CUDA version and the one installed by conda must match. So I started over:
conda install numpy=1.17.4 pytorch=1.3 torchvision cudatoolkit=10.0.130 -c pytorch
With that, it works!
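The version check the author describes can also be done from Python. This is a minimal sketch under two assumptions of mine: CUDA_HOME points at a directory whose resolved name contains the version (for example cuda-10.0), and comparing major.minor is enough. The helper names are hypothetical.

```python
import os
import re


def major_minor(version):
    """Extract (major, minor) from strings like '10.0.130' or 'cuda-10.0'."""
    match = re.search(r"(\d+)\.(\d+)", version or "")
    return (int(match.group(1)), int(match.group(2))) if match else None


def same_cuda(torch_cuda, local_cuda):
    """True when both strings carry the same CUDA major.minor version."""
    a, b = major_minor(torch_cuda), major_minor(local_cuda)
    return a is not None and a == b


def check_environment():
    """Compare the CUDA version PyTorch was built with against CUDA_HOME."""
    try:
        import torch
        torch_cuda = torch.version.cuda  # None for CPU-only builds
    except ImportError:
        torch_cuda = None
    cuda_home = os.environ.get("CUDA_HOME", "")
    # Resolve symlinks such as /usr/local/cuda -> /usr/local/cuda-10.0
    resolved = os.path.basename(os.path.realpath(cuda_home)) if cuda_home else ""
    return torch_cuda, resolved, same_cuda(torch_cuda, resolved)
```

Calling check_environment() inside the activated virtual environment returns the two versions and whether they agree. If the directory name carries no version number, the comparison simply reports a mismatch, so treat a False result as a prompt to check by hand.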
But it still does not seem to converge...
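2. The unofficial implementation
I followed the installation instructions and installed everything in one go. It does run, but at startup it prints: DBNet.pytorch INFO: train with device cpu and pytorch 1.3.0
My machine does not have the CUDA 10 that PyTorch 1.3 needs, so it ran on the CPU, which is very slow.
Later I saw in a chat group that someone had gotten it working with PyTorch 1.1.0, although on CUDA 10. I installed 1.1.0 as well, and training threw all kinds of errors. I felt helpless and gave up for a while, then came back to tinker with it again.
Along the way I came to appreciate Anaconda more and more: inside a virtual environment, conda list shows the version of every installed library.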
_libgcc_mutex 0.1 main
absl-py 0.9.0 <pip>
anyconfig 0.9.10 <pip>
backcall 0.1.0 py36_0
blas 1.0 mkl
ca-certificates 2019.11.27 0
cachetools 4.0.0 <pip>
certifi 2019.11.28 py36_0
cffi 1.13.2 py36h2e261b9_0
chardet 3.0.4 <pip>
cudatoolkit 8.0 3
cycler 0.10.0 <pip>
decorator 4.4.1 py_0
freetype 2.9.1 h8a8886c_1
future 0.18.2 <pip>
google-auth 1.10.1 <pip>
google-auth-oauthlib 0.4.1 <pip>
grpcio 1.26.0 <pip>
idna 2.8 <pip>
imageio 2.6.1 <pip>
imgaug 0.3.0 <pip>
intel-openmp 2019.4 243
ipython 7.11.1 py36h39e3cac_0
ipython_genutils 0.2.0 py36_0
jedi 0.15.2 py36_0
jpeg 9b h024ee3a_2
kiwisolver 1.1.0 <pip>
ld_impl_linux-64 2.33.1 h53a641e_7
libedit 3.1.20181209 hc058e9b_0
libffi 3.2.1 hd88cf55_4
libgcc-ng 9.1.0 hdf63c60_0
libgfortran-ng 7.3.0 hdf63c60_0
libpng 1.6.37 hbc83047_0
libstdcxx-ng 9.1.0 hdf63c60_0
libtiff 4.1.0 h2733197_0
Markdown 3.1.1 <pip>
matplotlib 3.1.2 <pip>
mkl 2019.4 243
mkl-service 2.3.0 py36he904b0f_0
mkl_fft 1.0.15 py36ha843d7b_0
mkl_random 1.1.0 py36hd6b4f25_0
natsort 7.0.0 <pip>
ncurses 6.1 he6710b0_1
networkx 2.4 <pip>
ninja 1.9.0 py36hfd86e86_0
numpy 1.18.1 py36h4f9e942_0
numpy 1.17.4 <pip>
numpy-base 1.18.1 py36hde5b4d6_0
oauthlib 3.1.0 <pip>
olefile 0.46 py_0
opencv-python 4.1.2.30 <pip>
opencv-python-headless 4.1.2.30 <pip>
openssl 1.1.1d h7b6447c_3
parso 0.5.2 py_0
pexpect 4.7.0 py36_0
pickleshare 0.7.5 py36_0
Pillow 6.2.2 <pip>
pillow 7.0.0 py36hb39fc2d_0
pip 19.3.1 py36_0
Polygon3 3.0.8 <pip>
prompt_toolkit 3.0.2 py_0
protobuf 3.11.2 <pip>
ptyprocess 0.6.0 py36_0
pyasn1 0.4.8 <pip>
pyasn1-modules 0.2.8 <pip>
pyclipper 1.1.0.post3 <pip>
pycparser 2.19 py_0
pygments 2.5.2 py_0
pyparsing 2.4.6 <pip>
python 3.6.10 h0371630_0
python-dateutil 2.8.1 <pip>
pytorch 1.0.1 py3.6_cuda8.0.61_cudnn7.1.2_2 pytorch
PyWavelets 1.1.1 <pip>
PyYAML 5.2 <pip>
readline 7.0 h7b6447c_5
requests 2.22.0 <pip>
requests-oauthlib 1.3.0 <pip>
rsa 4.0 <pip>
scikit-image 0.16.2 <pip>
scipy 1.4.1 <pip>
setuptools 44.0.0 py36_0
Shapely 1.6.4.post2 <pip>
six 1.13.0 py36_0
sqlite 3.30.1 h7b6447c_0
tensorboard 2.1.0 <pip>
tensorboardX 1.8 <pip>
tk 8.6.8 hbc83047_0
torch 1.1.0 <pip>
torchvision 0.2.1 <pip>
torchvision 0.2.2 py_3 pytorch
tqdm 4.40.1 <pip>
traitlets 4.3.3 py36_0
urllib3 1.25.7 <pip>
wcwidth 0.1.7 py36_0
Werkzeug 0.16.0 <pip>
wheel 0.33.6 py36_0
xz 5.2.4 h14c3975_4
zlib 1.2.11 h7b6447c_3
zstd 1.3.7 h0b5b093_0
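The list above contains several packages twice, once from conda and once from pip (numpy, Pillow, torchvision, and pytorch/torch). Mixed installs like this may be worth ruling out when errors look version-related. Below is a small sketch that flags such duplicates in conda list output; the function name and the alias table are my own illustration, and it assumes the three- or four-column layout shown above.

```python
from collections import defaultdict

# Conda and pip sometimes use different names for the same package.
ALIASES = {"pytorch": "torch"}


def find_duplicates(conda_list_text):
    """Map package name -> [(version, source), ...] for names listed more than once."""
    seen = defaultdict(list)
    for line in conda_list_text.splitlines():
        fields = line.split()
        if len(fields) < 2 or line.startswith("#"):
            continue  # skip blank lines and the header comments conda prints
        name = fields[0].lower()
        name = ALIASES.get(name, name)
        source = "pip" if "<pip>" in fields[2:] else "conda"
        seen[name].append((fields[1], source))
    return {name: entries for name, entries in seen.items() if len(entries) > 1}


sample = """\
numpy 1.18.1 py36h4f9e942_0
numpy 1.17.4 <pip>
Pillow 6.2.2 <pip>
pillow 7.0.0 py36hb39fc2d_0
pytorch 1.0.1 py3.6_cuda8.0.61_cudnn7.1.2_2 pytorch
torch 1.1.0 <pip>
tqdm 4.40.1 <pip>
"""

dups = find_duplicates(sample)
for name, entries in sorted(dups.items()):
    print(name, entries)
```

The sketch only reports the duplicates; which copy Python actually imports depends on the environment, so removing one of the two is the safer way to find out.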
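To install a package at a specific version, run: pip install tensorboardX==1.8
Without a version number, pip installs the latest release.
You can also run pip install 'tensorboardX<1.9' to install a version lower than 1.9.
There were two main errors: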
2020-01-18 16:23:24,753 DBNet.pytorch ERROR: Traceback (most recent call last):
File "/data_1/Yang/project/2019/project/DBNet.pytorch/DBNet.pytorch-master/base/base_trainer.py", line 77, in __init__
self.writer.add_graph(self.model, dummy_input)
File "/data_1/Yang/software_install/Anaconda1105/envs/dbnet/lib/python3.6/site-packages/tensorboardX/writer.py", line 774, in add_graph
self._get_file_writer().add_graph(graph(model, input_to_model, verbose, **kwargs))
File "/data_1/Yang/software_install/Anaconda1105/envs/dbnet/lib/python3.6/site-packages/tensorboardX/pytorch_graph.py", line 292, in graph
list_of_nodes, node_stats = parse(graph, args)
File "/data_1/Yang/software_install/Anaconda1105/envs/dbnet/lib/python3.6/site-packages/tensorboardX/pytorch_graph.py", line 227, in parse
if node.debugName() == 'self':
AttributeError: 'torch._C.Value' object has no attribute 'debugName'
2020-01-18 16:23:24,753 DBNet.pytorch WARNING: add graph to tensorboard failed
2020-01-18 16:23:24,756 DBNet.pytorch INFO: train dataset has 889 samples,297 in dataloader, validate dataset has 111 samples,111 in dataloader
Traceback (most recent call last):
File "tools/train.py", line 74, in <module>
main(config)
File "tools/train.py", line 58, in main
trainer.train()
File "/data_1/Yang/project/2019/project/DBNet.pytorch/DBNet.pytorch-master/base/base_trainer.py", line 103, in train
self.epoch_result = self._train_epoch(epoch)
File "/data_1/Yang/project/2019/project/DBNet.pytorch/DBNet.pytorch-master/trainer/trainer.py", line 46, in _train_epoch
for i, batch in enumerate(self.train_loader):
File "/data_1/Yang/software_install/Anaconda1105/envs/dbnet/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 582, in __next__
return self._process_next_batch(batch)
File "/data_1/Yang/software_install/Anaconda1105/envs/dbnet/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 608, in _process_next_batch
raise batch.exc_type(batch.exc_msg)
TypeError: Traceback (most recent call last):
File "/data_1/Yang/software_install/Anaconda1105/envs/dbnet/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 99, in _worker_loop
samples = collate_fn([dataset[i] for i in batch_indices])
TypeError: 'NoneType' object is not callable
First, this one: File "/data_1/Yang/software_install/Anaconda1105/envs/dbnet/lib/python3.6/site-packages/tensorboardX/pytorch_graph.py", line 227, in parse
if node.debugName() == 'self':
AttributeError: 'torch._C.Value' object has no attribute 'debugName'
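It looked like a tensorboardX version mismatch. A quick Baidu search confirmed it: the version should be 1.8. conda list showed that I had 1.9, so I ran
pip install tensorboardX==1.8, which printed: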
Requirement already satisfied: six in /data_1/Yang/software_install/Anaconda1105/envs/dbnet/lib/python3.6/site-packages (from tensorboardX==1.8) (1.13.0)
Requirement already satisfied: protobuf>=3.2.0 in /data_1/Yang/software_install/Anaconda1105/envs/dbnet/lib/python3.6/site-packages (from tensorboardX==1.8) (3.11.2)
Requirement already satisfied: numpy in /data_1/Yang/software_install/Anaconda1105/envs/dbnet/lib/python3.6/site-packages (from tensorboardX==1.8) (1.17.4)
Requirement already satisfied: setuptools in /data_1/Yang/software_install/Anaconda1105/envs/dbnet/lib/python3.6/site-packages (from protobuf>=3.2.0->tensorboardX==1.8) (44.0.0.post20200106)
Installing collected packages: tensorboardX
Found existing installation: tensorboardX 1.9
Uninstalling tensorboardX-1.9:
Successfully uninstalled tensorboardX-1.9
Successfully installed tensorboardX-1.8
pip automatically uninstalls 1.9 and installs 1.8.
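The traceback shows tensorboardX 1.9 calling debugName() on a graph node that the installed PyTorch does not provide. My understanding, which the traceback itself does not confirm, is that older PyTorch releases exposed the same information as uniqueName(). Downgrading tensorboardX is the practical fix; the sketch below only illustrates the mismatch with stand-in classes and does not touch the real torch._C.Value.

```python
def node_name(node):
    """Return a graph node's name, trying the newer accessor first (sketch)."""
    for attr in ("debugName", "uniqueName"):
        method = getattr(node, attr, None)
        if callable(method):
            return method()
    raise AttributeError("node exposes neither debugName() nor uniqueName()")


class OldStyleValue:
    """Hypothetical stand-in for a node from an older PyTorch."""

    def uniqueName(self):
        return "self"


class NewStyleValue:
    """Hypothetical stand-in for a node from a newer PyTorch."""

    def debugName(self):
        return "self"


print(node_name(OldStyleValue()), node_name(NewStyleValue()))
```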
Training again, only the last error remained:
Traceback (most recent call last):
File "tools/train.py", line 74, in <module>
main(config)
File "tools/train.py", line 58, in main
trainer.train()
File "/data_1/Yang/project/2019/project/DBNet.pytorch/DBNet.pytorch-master/base/base_trainer.py", line 103, in train
self.epoch_result = self._train_epoch(epoch)
File "/data_1/Yang/project/2019/project/DBNet.pytorch/DBNet.pytorch-master/trainer/trainer.py", line 46, in _train_epoch
for i, batch in enumerate(self.train_loader):
File "/data_1/Yang/software_install/Anaconda1105/envs/dbnet/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 582, in __next__
return self._process_next_batch(batch)
File "/data_1/Yang/software_install/Anaconda1105/envs/dbnet/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 608, in _process_next_batch
raise batch.exc_type(batch.exc_msg)
TypeError: Traceback (most recent call last):
File "/data_1/Yang/software_install/Anaconda1105/envs/dbnet/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 99, in _worker_loop
samples = collate_fn([dataset[i] for i in batch_indices])
TypeError: 'NoneType' object is not callable
Someone answered this question on GitHub: https://github.com/WenmuZhou/DBNet.pytorch/issues/4
In DBNet.pytorch-master/data_loader/__init__.py, line 74:
    # Note: this fix needs torch to be imported in data_loader/__init__.py.
    if 'collate_fn' not in config['loader'] or config['loader']['collate_fn'] is None or len(config['loader']['collate_fn']) == 0:
        # config['loader']['collate_fn'] = None  # original line; replace it with the line below, otherwise None is passed straight through to DataLoader
        config['loader']['collate_fn'] = torch.utils.data.dataloader.default_collate
    else:
        config['loader']['collate_fn'] = eval(config['loader']['collate_fn'])()
    _dataset = get_dataset(data_path=data_path, module_name=dataset_name, transform=img_transfroms, dataset_args=dataset_args)
    sampler = None
    if distributed:
        from torch.utils.data.distributed import DistributedSampler
        # 3) use DistributedSampler
        sampler = DistributedSampler(_dataset)
        config['loader']['shuffle'] = False
        config['loader']['pin_memory'] = True
    loader = DataLoader(dataset=_dataset, sampler=sampler, **config['loader'])
    return loader
With this change, training works!
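The traceback suggests that in this PyTorch version an explicit collate_fn=None is stored as-is and later called inside the worker, hence "'NoneType' object is not callable". The same idea as the fix can be sketched as a small standalone helper. This is my own illustration rather than code from the repository: resolve_collate_fn and the registry argument are hypothetical, and the registry lookup is swapped in for the eval() call used in the original.

```python
def resolve_collate_fn(loader_cfg, default_collate, registry=None):
    """Return a copy of loader_cfg whose 'collate_fn' entry is always callable."""
    registry = registry or {}
    cfg = dict(loader_cfg)  # leave the caller's config untouched
    name = cfg.get("collate_fn")
    if not name:  # key missing, None, or empty string
        cfg["collate_fn"] = default_collate
    else:
        # Look the class up by name and instantiate it, instead of eval().
        cfg["collate_fn"] = registry[name]()
    return cfg


def fake_default(batch):
    """Stand-in for PyTorch's default collate function, used for the demo only."""
    return batch


class MyCollate:
    """Hypothetical custom collate class."""

    def __call__(self, batch):
        return list(batch)


cfg = resolve_collate_fn({"batch_size": 3, "collate_fn": None}, fake_default)
print(cfg["collate_fn"] is fake_default)
```

With PyTorch installed, torch.utils.data.dataloader.default_collate would be passed as default_collate, and the resulting dictionary can be unpacked into DataLoader the same way the excerpt does.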
Time to start training, and to try it on my own data!