• Real-time Scene Text Detection with Differentiable Binarization: Troubleshooting Notes


    Official: https://github.com/MhLiao/DB
    Implementation by Zhou Jun (WenmuZhou): https://github.com/WenmuZhou/DBNet.pytorch

    1. Official implementation

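    The official repo installs easily if you follow its instructions. My environment is Ubuntu 16.04 + CUDA 8, so I have been using PyTorch 1.0.1 (py3.7), and it does run. But the trained model predicts nothing at inference: the txt files are all empty, and the images in the visualize folder are grayish with no boxes. The loss does not converge: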

    [INFO] [2020-01-18 16:24:09,584] step:   1340, epoch:   0, loss: 4.332346, lr: 0.007000
    [INFO] [2020-01-18 16:24:09,585] bce_loss: 0.568492
    [INFO] [2020-01-18 16:24:09,585] thresh_loss: 0.563487
    [INFO] [2020-01-18 16:24:09,586] l1_loss: 0.092640
    [INFO] [2020-01-18 16:24:19,117] step:   1360, epoch:   0, loss: 4.255758, lr: 0.007000
    [INFO] [2020-01-18 16:24:19,120] bce_loss: 0.544069
    [INFO] [2020-01-18 16:24:19,122] thresh_loss: 0.539020
    [INFO] [2020-01-18 16:24:19,124] l1_loss: 0.099640
    [INFO] [2020-01-18 16:24:28,766] step:   1380, epoch:   0, loss: 4.507674, lr: 0.007000
    [INFO] [2020-01-18 16:24:28,767] bce_loss: 0.560643
    [INFO] [2020-01-18 16:24:28,768] thresh_loss: 0.652172
    [INFO] [2020-01-18 16:24:28,768] l1_loss: 0.105229
    

    Training on the IC15 dataset gives the same result. I don't know where the problem is yet; I'll look into it later.
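    To judge convergence from more than a handful of lines, the step and loss values can be pulled out of the training log and then plotted or compared. A minimal sketch, assuming the log format shown above; parse_loss_log is a hypothetical helper of mine, not part of the DB repo:

```python
import re

# Matches lines such as:
# [INFO] [2020-01-18 16:24:09,584] step:   1340, epoch:   0, loss: 4.332346, lr: 0.007000
STEP_LINE = re.compile(
    r"step:\s*(\d+),\s*epoch:\s*(\d+),\s*loss:\s*([0-9.]+),\s*lr:\s*([0-9.]+)"
)


def parse_loss_log(text):
    """Return a list of (step, epoch, loss, lr) tuples found in the log text."""
    records = []
    for line in text.splitlines():
        match = STEP_LINE.search(line)
        if match:  # sub-loss lines (bce_loss, thresh_loss, l1_loss) are skipped
            step, epoch, loss, lr = match.groups()
            records.append((int(step), int(epoch), float(loss), float(lr)))
    return records


# Excerpt of the log above, used as sample input.
SAMPLE_LOG = """\
[INFO] [2020-01-18 16:24:09,584] step:   1340, epoch:   0, loss: 4.332346, lr: 0.007000
[INFO] [2020-01-18 16:24:09,585] bce_loss: 0.568492
[INFO] [2020-01-18 16:24:19,117] step:   1360, epoch:   0, loss: 4.255758, lr: 0.007000
[INFO] [2020-01-18 16:24:19,120] bce_loss: 0.544069
[INFO] [2020-01-18 16:24:28,766] step:   1380, epoch:   0, loss: 4.507674, lr: 0.007000
"""

records = parse_loss_log(SAMPLE_LOG)
for step, epoch, loss, lr in records:
    print(step, epoch, loss, lr)
```

    Forty steps in epoch 0 say little on their own; feeding the whole log file through the same function gives a curve that is easier to judge.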

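    There is one bug: the learning rate always starts at 0.007 (the value in the log above) no matter what I set in the config. It is hard-coded, so it has to be changed in DB/DB-master/training/learning_rate.py.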
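    For reference, the DB paper describes a poly learning-rate policy: the current rate is the initial rate multiplied by (1 - iter/max_iter)^power, with an initial rate of 0.007 and power 0.9. The sketch below only illustrates that formula with the initial rate passed in as a parameter. It is not the code in learning_rate.py, and poly_lr is a name I made up:

```python
def poly_lr(initial_lr, iteration, max_iter, power=0.9):
    """Poly decay: initial_lr * (1 - iteration / max_iter) ** power."""
    if max_iter <= 0:
        raise ValueError("max_iter must be positive")
    # Clamp so the rate never goes negative or complex past the last iteration.
    progress = min(max(iteration, 0), max_iter) / max_iter
    return initial_lr * (1.0 - progress) ** power


# The initial rate is an argument here, so a config value can actually take effect.
for it in (0, 250, 500, 750, 1000):
    print(it, poly_lr(0.007, it, max_iter=1000))
```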

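    Suspecting that a version mismatch was behind the non-convergence, I installed two CUDA versions on my machine, 8 and 10, and created a virtual environment against CUDA 10. That failed too, this time with CUDA errors: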
    error in : invalid device function
    RuntimeError: copy_if failed to synchronize: device-side assert triggered
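    I struggled with this for a long time without finding a solution. While debugging I noticed that conda had installed CUDA 10.1 while my local CUDA was 10.0. I also saw
    the author's answer in this issue: https://github.com/MhLiao/DB/issues/36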

    Make sure your CUDA path of $CUDA_HOME is the same version as your CUDA in PyTorch by the command of echo $CUDA_HOME. If not, you need to change the $CUDA_HOME by export CUDA_HOME=path-of-another-version or re-install PyTorch with the same CUDA version as in CUDA_HOME.

    The local CUDA version and the one conda installs must match. So I started over:
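    The check the author describes can also be done from Python before training. A rough sketch, assuming the CUDA version can be read from the resolved $CUDA_HOME path (for example /usr/local/cuda-10.0); parse_cuda_version and cuda_versions_match are hypothetical helpers:

```python
import os
import re


def parse_cuda_version(text):
    """Extract (major, minor) from strings like '10.0.130' or '/usr/local/cuda-10.0'."""
    match = re.search(r"(\d+)\.(\d+)", text or "")
    return (int(match.group(1)), int(match.group(2))) if match else None


def cuda_versions_match(torch_cuda, cuda_home):
    """True only if both versions are known and share the same major.minor."""
    a = parse_cuda_version(torch_cuda)
    b = parse_cuda_version(cuda_home)
    return a is not None and a == b


if __name__ == "__main__":
    try:
        import torch  # only needed for the live check
        torch_cuda = torch.version.cuda  # None on CPU-only builds
    except ImportError:
        torch_cuda = None
    cuda_home = os.environ.get("CUDA_HOME")
    # Resolve symlinks such as /usr/local/cuda -> /usr/local/cuda-10.0
    resolved = os.path.realpath(cuda_home) if cuda_home else None
    print("PyTorch built with CUDA:", torch_cuda)
    print("CUDA_HOME resolves to:", resolved)
    print("match:", cuda_versions_match(torch_cuda, resolved))
```

    If the resolved path carries no version number, this reports no match, and nvcc --version is the more reliable source.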

    conda install numpy=1.17.4 pytorch=1.3 torchvision cudatoolkit=10.0.130 -c pytorch
    

    With that, it works!
    But it still doesn't seem to converge...

    2. Unofficial implementation

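    I followed the installation steps, installed everything in one go, and it did run, but at startup it printed: DBNet.pytorch INFO: train with device cpu and pytorch 1.3.0
    My machine did not have the CUDA 10 that PyTorch 1.3 needs, so it ran on the CPU. Very slow.
    Later I saw in a chat group that someone had it working with PyTorch 1.1.0, though on CUDA 10. I installed 1.1.0 as well, and training threw all sorts of errors. I gave up for a while, then came back and tinkered with it again.
    Along the way I came to appreciate Anaconda more and more: inside a virtual environment, conda list shows the version of every installed library.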

    _libgcc_mutex             0.1                        main  
    absl-py                   0.9.0                     <pip>
    anyconfig                 0.9.10                    <pip>
    backcall                  0.1.0                    py36_0  
    blas                      1.0                         mkl  
    ca-certificates           2019.11.27                    0  
    cachetools                4.0.0                     <pip>
    certifi                   2019.11.28               py36_0  
    cffi                      1.13.2           py36h2e261b9_0  
    chardet                   3.0.4                     <pip>
    cudatoolkit               8.0                           3  
    cycler                    0.10.0                    <pip>
    decorator                 4.4.1                      py_0  
    freetype                  2.9.1                h8a8886c_1  
    future                    0.18.2                    <pip>
    google-auth               1.10.1                    <pip>
    google-auth-oauthlib      0.4.1                     <pip>
    grpcio                    1.26.0                    <pip>
    idna                      2.8                       <pip>
    imageio                   2.6.1                     <pip>
    imgaug                    0.3.0                     <pip>
    intel-openmp              2019.4                      243  
    ipython                   7.11.1           py36h39e3cac_0  
    ipython_genutils          0.2.0                    py36_0  
    jedi                      0.15.2                   py36_0  
    jpeg                      9b                   h024ee3a_2  
    kiwisolver                1.1.0                     <pip>
    ld_impl_linux-64          2.33.1               h53a641e_7  
    libedit                   3.1.20181209         hc058e9b_0  
    libffi                    3.2.1                hd88cf55_4  
    libgcc-ng                 9.1.0                hdf63c60_0  
    libgfortran-ng            7.3.0                hdf63c60_0  
    libpng                    1.6.37               hbc83047_0  
    libstdcxx-ng              9.1.0                hdf63c60_0  
    libtiff                   4.1.0                h2733197_0  
    Markdown                  3.1.1                     <pip>
    matplotlib                3.1.2                     <pip>
    mkl                       2019.4                      243  
    mkl-service               2.3.0            py36he904b0f_0  
    mkl_fft                   1.0.15           py36ha843d7b_0  
    mkl_random                1.1.0            py36hd6b4f25_0  
    natsort                   7.0.0                     <pip>
    ncurses                   6.1                  he6710b0_1  
    networkx                  2.4                       <pip>
    ninja                     1.9.0            py36hfd86e86_0  
    numpy                     1.18.1           py36h4f9e942_0  
    numpy                     1.17.4                    <pip>
    numpy-base                1.18.1           py36hde5b4d6_0  
    oauthlib                  3.1.0                     <pip>
    olefile                   0.46                       py_0  
    opencv-python             4.1.2.30                  <pip>
    opencv-python-headless    4.1.2.30                  <pip>
    openssl                   1.1.1d               h7b6447c_3  
    parso                     0.5.2                      py_0  
    pexpect                   4.7.0                    py36_0  
    pickleshare               0.7.5                    py36_0  
    Pillow                    6.2.2                     <pip>
    pillow                    7.0.0            py36hb39fc2d_0  
    pip                       19.3.1                   py36_0  
    Polygon3                  3.0.8                     <pip>
    prompt_toolkit            3.0.2                      py_0  
    protobuf                  3.11.2                    <pip>
    ptyprocess                0.6.0                    py36_0  
    pyasn1                    0.4.8                     <pip>
    pyasn1-modules            0.2.8                     <pip>
    pyclipper                 1.1.0.post3               <pip>
    pycparser                 2.19                       py_0  
    pygments                  2.5.2                      py_0  
    pyparsing                 2.4.6                     <pip>
    python                    3.6.10               h0371630_0  
    python-dateutil           2.8.1                     <pip>
    pytorch                   1.0.1           py3.6_cuda8.0.61_cudnn7.1.2_2    pytorch
    PyWavelets                1.1.1                     <pip>
    PyYAML                    5.2                       <pip>
    readline                  7.0                  h7b6447c_5  
    requests                  2.22.0                    <pip>
    requests-oauthlib         1.3.0                     <pip>
    rsa                       4.0                       <pip>
    scikit-image              0.16.2                    <pip>
    scipy                     1.4.1                     <pip>
    setuptools                44.0.0                   py36_0  
    Shapely                   1.6.4.post2               <pip>
    six                       1.13.0                   py36_0  
    sqlite                    3.30.1               h7b6447c_0  
    tensorboard               2.1.0                     <pip>
    tensorboardX              1.8                       <pip>
    tk                        8.6.8                hbc83047_0  
    torch                     1.1.0                     <pip>
    torchvision               0.2.1                     <pip>
    torchvision               0.2.2                      py_3    pytorch
    tqdm                      4.40.1                    <pip>
    traitlets                 4.3.3                    py36_0  
    urllib3                   1.25.7                    <pip>
    wcwidth                   0.1.7                    py36_0  
    Werkzeug                  0.16.0                    <pip>
    wheel                     0.33.6                   py36_0  
    xz                        5.2.4                h14c3975_4  
    zlib                      1.2.11               h7b6447c_3  
    zstd                      1.3.7                h0b5b093_0  
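    One thing this listing makes visible: numpy, torchvision, and Pillow each appear twice (once from conda, once from pip), and conda's pytorch 1.0.1 sits next to pip's torch 1.1.0. Mixed installs like these can make it unclear which version actually gets imported. A small sketch for spotting same-name duplicates in conda list output; find_duplicate_packages is a hypothetical helper, and it will not catch differently named pairs such as pytorch/torch:

```python
from collections import defaultdict


def find_duplicate_packages(conda_list_text):
    """Map lower-cased package name -> [(version, source), ...] for names listed more than once."""
    seen = defaultdict(list)
    for line in conda_list_text.splitlines():
        parts = line.split()
        # Skip header/comment lines and anything without name, version, build columns.
        if len(parts) < 3 or parts[0].startswith("#"):
            continue
        name, version, build = parts[0], parts[1], parts[2]
        source = "pip" if build == "<pip>" else "conda"
        seen[name.lower()].append((version, source))
    return {name: entries for name, entries in seen.items() if len(entries) > 1}


# A few lines copied from the listing above.
SAMPLE = """\
numpy                     1.18.1           py36h4f9e942_0
numpy                     1.17.4                    <pip>
Pillow                    6.2.2                     <pip>
pillow                    7.0.0            py36hb39fc2d_0
torchvision               0.2.1                     <pip>
torchvision               0.2.2                      py_3    pytorch
tqdm                      4.40.1                    <pip>
"""

dups = find_duplicate_packages(SAMPLE)
for name, entries in sorted(dups.items()):
    print(name, entries)
```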
    

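    To install a specific version directly: pip install tensorboardX==1.8
    Without a version number, pip installs the latest release.
    You can also run pip install 'tensorboardX<1.9' to install a version below 1.9.
    There were two main errors: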

    2020-01-18 16:23:24,753 DBNet.pytorch ERROR: Traceback (most recent call last):
      File "/data_1/Yang/project/2019/project/DBNet.pytorch/DBNet.pytorch-master/base/base_trainer.py", line 77, in __init__
        self.writer.add_graph(self.model, dummy_input)
      File "/data_1/Yang/software_install/Anaconda1105/envs/dbnet/lib/python3.6/site-packages/tensorboardX/writer.py", line 774, in add_graph
        self._get_file_writer().add_graph(graph(model, input_to_model, verbose, **kwargs))
      File "/data_1/Yang/software_install/Anaconda1105/envs/dbnet/lib/python3.6/site-packages/tensorboardX/pytorch_graph.py", line 292, in graph
        list_of_nodes, node_stats = parse(graph, args)
      File "/data_1/Yang/software_install/Anaconda1105/envs/dbnet/lib/python3.6/site-packages/tensorboardX/pytorch_graph.py", line 227, in parse
        if node.debugName() == 'self':
    AttributeError: 'torch._C.Value' object has no attribute 'debugName'
    
    2020-01-18 16:23:24,753 DBNet.pytorch WARNING: add graph to tensorboard failed
    2020-01-18 16:23:24,756 DBNet.pytorch INFO: train dataset has 889 samples,297 in dataloader, validate dataset has 111 samples,111 in dataloader
    Traceback (most recent call last):
      File "tools/train.py", line 74, in <module>
        main(config)
      File "tools/train.py", line 58, in main
        trainer.train()
      File "/data_1/Yang/project/2019/project/DBNet.pytorch/DBNet.pytorch-master/base/base_trainer.py", line 103, in train
        self.epoch_result = self._train_epoch(epoch)
      File "/data_1/Yang/project/2019/project/DBNet.pytorch/DBNet.pytorch-master/trainer/trainer.py", line 46, in _train_epoch
        for i, batch in enumerate(self.train_loader):
      File "/data_1/Yang/software_install/Anaconda1105/envs/dbnet/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 582, in __next__
        return self._process_next_batch(batch)
      File "/data_1/Yang/software_install/Anaconda1105/envs/dbnet/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 608, in _process_next_batch
        raise batch.exc_type(batch.exc_msg)
    TypeError: Traceback (most recent call last):
      File "/data_1/Yang/software_install/Anaconda1105/envs/dbnet/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 99, in _worker_loop
        samples = collate_fn([dataset[i] for i in batch_indices])
    TypeError: 'NoneType' object is not callable
    
    

    First, this one: File "/data_1/Yang/software_install/Anaconda1105/envs/dbnet/lib/python3.6/site-packages/tensorboardX/pytorch_graph.py", line 227, in parse
    if node.debugName() == 'self':
    AttributeError: 'torch._C.Value' object has no attribute 'debugName'

    This looked like a tensorboardX version mismatch, and a quick Baidu search confirmed it: the version needs to be 1.8. conda list showed I had 1.9, so I ran
    pip install tensorboardX==1.8, which printed:
    Requirement already satisfied: six in /data_1/Yang/software_install/Anaconda1105/envs/dbnet/lib/python3.6/site-packages (from tensorboardX==1.8) (1.13.0)
    Requirement already satisfied: protobuf>=3.2.0 in /data_1/Yang/software_install/Anaconda1105/envs/dbnet/lib/python3.6/site-packages (from tensorboardX==1.8) (3.11.2)
    Requirement already satisfied: numpy in /data_1/Yang/software_install/Anaconda1105/envs/dbnet/lib/python3.6/site-packages (from tensorboardX==1.8) (1.17.4)
    Requirement already satisfied: setuptools in /data_1/Yang/software_install/Anaconda1105/envs/dbnet/lib/python3.6/site-packages (from protobuf>=3.2.0->tensorboardX==1.8) (44.0.0.post20200106)
    Installing collected packages: tensorboardX
    Found existing installation: tensorboardX 1.9
    Uninstalling tensorboardX-1.9:
    Successfully uninstalled tensorboardX-1.9
    Successfully installed tensorboardX-1.8

    pip automatically uninstalls 1.9 and installs 1.8.
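    To confirm from inside the environment that the downgrade took effect, the installed versions can be read from Python. A minimal sketch; installed_version and version_tuple are hypothetical helpers, and the tuple comparison is only a rough check, not a full version parser:

```python
def installed_version(package):
    """Return the installed version string of a package, or None if it is not installed."""
    try:
        from importlib.metadata import version, PackageNotFoundError  # Python 3.8+
    except ImportError:
        import pkg_resources  # fallback for older interpreters such as Python 3.6
        try:
            return pkg_resources.get_distribution(package).version
        except pkg_resources.DistributionNotFound:
            return None
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def version_tuple(text):
    """Turn '1.8' or '1.1.0.post3' into a tuple of its leading integer parts."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break  # stop at suffixes such as 'post3'
        parts.append(int(piece))
    return tuple(parts)


for name in ("tensorboardX", "torch"):
    print(name, installed_version(name))

# The pin used above: anything below 1.9 is acceptable.
print(version_tuple("1.8") < version_tuple("1.9"))
```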

    Training again, sure enough, only the last error remained.

    Traceback (most recent call last):
      File "tools/train.py", line 74, in <module>
        main(config)
      File "tools/train.py", line 58, in main
        trainer.train()
      File "/data_1/Yang/project/2019/project/DBNet.pytorch/DBNet.pytorch-master/base/base_trainer.py", line 103, in train
        self.epoch_result = self._train_epoch(epoch)
      File "/data_1/Yang/project/2019/project/DBNet.pytorch/DBNet.pytorch-master/trainer/trainer.py", line 46, in _train_epoch
        for i, batch in enumerate(self.train_loader):
      File "/data_1/Yang/software_install/Anaconda1105/envs/dbnet/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 582, in __next__
        return self._process_next_batch(batch)
      File "/data_1/Yang/software_install/Anaconda1105/envs/dbnet/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 608, in _process_next_batch
        raise batch.exc_type(batch.exc_msg)
    TypeError: Traceback (most recent call last):
      File "/data_1/Yang/software_install/Anaconda1105/envs/dbnet/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 99, in _worker_loop
        samples = collate_fn([dataset[i] for i in batch_indices])
    TypeError: 'NoneType' object is not callable

    Someone answered this on GitHub: https://github.com/WenmuZhou/DBNet.pytorch/issues/4
    In DBNet.pytorch-master/data_loader/__init__.py, line 74:
    # excerpt from inside the loader-building function; needs `import torch` at the top of the file
    if 'collate_fn' not in config['loader'] or config['loader']['collate_fn'] is None or len(config['loader']['collate_fn']) == 0:
        # config['loader']['collate_fn'] = None  # original line; it has to change, otherwise None is passed
        # straight into DataLoader and the worker ends up calling None
        config['loader']['collate_fn'] = torch.utils.data.dataloader.default_collate
    else:
        config['loader']['collate_fn'] = eval(config['loader']['collate_fn'])()

    _dataset = get_dataset(data_path=data_path, module_name=dataset_name, transform=img_transfroms, dataset_args=dataset_args)
    sampler = None
    if distributed:
        from torch.utils.data.distributed import DistributedSampler
        # 3) use DistributedSampler
        sampler = DistributedSampler(_dataset)
        config['loader']['shuffle'] = False
        config['loader']['pin_memory'] = True
    loader = DataLoader(dataset=_dataset, sampler=sampler, **config['loader'])
    return loader
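    The same fallback logic can be written without eval by looking the name up in a small registry. This is only a sketch of the idea, not the repo's code: resolve_collate_fn, PadToLongest, and stand_in_default are hypothetical names, and the default is passed in so the example runs without PyTorch installed.

```python
def resolve_collate_fn(loader_cfg, registry, default):
    """Return a callable collate function, falling back to `default` when none is configured."""
    name = loader_cfg.get("collate_fn")
    if not name:  # covers a missing key, None, and an empty string
        return default
    if name not in registry:
        raise KeyError("unknown collate_fn: %r" % (name,))
    return registry[name]()  # instantiate the class, mirroring eval(name)() above


class PadToLongest:
    """Hypothetical collate class, standing in for a project-specific one."""

    def __call__(self, samples):
        return list(samples)


def stand_in_default(samples):
    """Stand-in for torch.utils.data.dataloader.default_collate."""
    return list(samples)


REGISTRY = {"PadToLongest": PadToLongest}

for cfg in ({}, {"collate_fn": None}, {"collate_fn": ""}, {"collate_fn": "PadToLongest"}):
    fn = resolve_collate_fn(cfg, REGISTRY, stand_in_default)
    print(cfg, "->", callable(fn))
```

    In the real loader, torch.utils.data.dataloader.default_collate would be passed as default, so DataLoader never receives None.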

    With that, training finally works!
    Time to train, and then try it on my own data!

  • Original post: https://www.cnblogs.com/yanghailin/p/12209685.html