• A detailed guide to setting up a deep learning environment (tensorflow, keras) on a linux server for kaggle competitions


    This article was first published on my personal blog at https://kezunlin.me/post/6b505d27/, where the latest version is kept.

    A full guide to installing and configuring deep learning environments on a linux server.

    Quick Guide

    prepare

    tools

    • MobaXterm (for windows)
    • ssh + vscode

    on windows:
    drag files into MobaXterm to upload them to the server;
    pack them as a zip archive first

    commands

    view disk

    du -d 1 -h
    df -h
    

    gpu and cpu usage

    watch -n 1 nvidia-smi
    top 
    

    view files and count

    wc -l data.csv
    
    # count how many folders
    ls -lR | grep '^d' | wc -l
    17
    
    # count how many jpg files (escape the dot so only the .jpg extension matches)
    ls -lR | grep '\.jpg$' | wc -l
    1360
    
    # view 10 images 
    ls train | head
    ls test | head
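
    The same counts can be taken from Python, which is handy inside a notebook. This is a minimal sketch, not part of the original workflow; the helper name count_dirs_and_files is made up for illustration, and it matches the extension case-insensitively, so the numbers can differ slightly from the grep version.

```python
import os


def count_dirs_and_files(root, suffix=".jpg"):
    """Return (number of sub-directories, number of files ending in suffix) under root."""
    n_dirs = 0
    n_files = 0
    for _dirpath, dirnames, filenames in os.walk(root):
        n_dirs += len(dirnames)
        # compare in lower case so that .JPG is counted as well
        n_files += sum(1 for name in filenames if name.lower().endswith(suffix))
    return n_dirs, n_files
```

    For example, count_dirs_and_files("datasets") returns a (folders, jpg files) tuple for a datasets folder.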
    

    link datasets

    # link: ln -s <source> <link name>
    ln -s src dest
    ln -s /data_1/kezunlin/datasets/ dl4cv/datasets
    

    scp

    scp -r node17:~/dl4cv  ~/git/
    scp -r node17:~/.keras ~/
    

    tmux for background tasks

    tmux new -s notebook
    tmux ls 
    tmux attach -t notebook
    tmux detach
    

    wget download

    # wget
    # continue an interrupted download
    wget -c url 
    
    # download a large file in the background
    wget -b -c url
    tail -f wget-log
    
    # kill background wget
    pkill -9 wget
    

    tips for training a large model

    terminal 1:

    tmux new -s train
    conda activate keras
    
    time python train_alexnet.py
    

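    terminal 2:

    # detach the train session (inside tmux: ctrl+b, then d)
    tmux detach
    
    # re-attach later to check progress
    tmux attach -t train
    

    Detach before closing vscode. The training runs inside the tmux session, so it keeps running; a process started directly in the vscode terminal would exit when vscode is closed.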

    cuda driver and toolkits

    see cuda-toolkit for the cuda driver version each release requires

    the cudatoolkit version you can use depends on the installed cuda driver version.

    install nvidia-drivers

    sudo add-apt-repository ppa:graphics-drivers/ppa
    sudo apt-get update
    
    sudo apt-cache search nvidia-*
    # nvidia-384
    # nvidia-396
    sudo apt-get -y install nvidia-418
    
    # test 
    nvidia-smi
    Failed to initialize NVML: Driver/library version mismatch
    

    If nvidia-smi reports this mismatch, reboot and test again. See
    https://stackoverflow.com/questions/43022843/nvidia-nvml-driver-library-version-mismatch

    install cuda-toolkit (drivers)

    remove all previous nvidia drivers

    sudo apt-get -y purge 'nvidia-*'
    

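    download the cuda_10.1 runfile installer

    wget -b -c http://developer.download.nvidia.com/compute/cuda/10.1/Prod/local_installers/cuda_10.1.243_418.87.00_linux.run
    # wait until the background download has finished (tail -f wget-log), then run either
    sudo sh cuda_10.1.243_418.87.00_linux.run
    # or
    chmod +x cuda_10.1.243_418.87.00_linux.run
    sudo ./cuda_10.1.243_418.87.00_linux.run
    
    vim ~/.bashrc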
    # for cuda and cudnn
    export PATH=/usr/local/cuda/bin:$PATH
    export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
    

    check cuda driver version

    > cat /proc/driver/nvidia/version
    NVRM version: NVIDIA UNIX x86_64 Kernel Module  418.87.00  Thu Aug  8 15:35:46 CDT 2019
    GCC version:  gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.11) 
    
    
    > nvidia-smi
    Tue Aug 27 17:36:35 2019       
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 418.87.00    Driver Version: 418.87.00    CUDA Version: 10.1     |
    |-------------------------------+----------------------+----------------------+
    
    
    > nvidia-smi -L
    GPU 0: Quadro RTX 8000 (UUID: GPU-acb01c1b-776d-cafb-ea35-430b3580d123)
    GPU 1: Quadro RTX 8000 (UUID: GPU-df7f0fb8-1541-c9ce-e0f8-e92bccabf0ef)
    GPU 2: Quadro RTX 8000 (UUID: GPU-67024023-20fd-a522-dcda-261063332731)
    GPU 3: Quadro RTX 8000 (UUID: GPU-7f9d6a27-01ec-4ae5-0370-f0c356327913)
    
    > nvcc -V
    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2019 NVIDIA Corporation
    Built on Sun_Jul_28_19:07:16_PDT_2019
    Cuda compilation tools, release 10.1, V10.1.243
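
    The driver check can also be scripted. The sketch below parses the text of /proc/driver/nvidia/version and compares it with a minimum driver version. The helper names are hypothetical, and the 418.39 minimum for CUDA 10.1 is an assumption taken from NVIDIA's release notes; verify it for your exact toolkit build.

```python
import re

# Assumed minimum Linux driver for CUDA 10.1 (check NVIDIA's release notes).
MIN_DRIVER_FOR_CUDA_10_1 = (418, 39)


def parse_driver_version(text):
    """Extract the driver version as a tuple of ints from the NVRM version text."""
    match = re.search(r"Kernel Module\s+(\d+(?:\.\d+)+)", text)
    if match is None:
        raise ValueError("no NVIDIA driver version found")
    return tuple(int(part) for part in match.group(1).split("."))


def driver_supports(version, minimum=MIN_DRIVER_FOR_CUDA_10_1):
    """Tuples compare element by element, e.g. (418, 87, 0) >= (418, 39)."""
    return version >= minimum
```

    Pass it the contents of the file, for example parse_driver_version(open("/proc/driver/nvidia/version").read()). Other driver builds may format the line differently, in which case the function raises ValueError.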
    

    install conda

    ./Anaconda3-2019.03-Linux-x86_64.sh 
    [yes]
    [yes]
    

    config channels

    conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
    conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/
    conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge/
    conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/msys2/
    conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/menpo/
    conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch/
    
    conda config --set show_channel_urls yes
    

    install libraries

    conclusions:

    • py37/keras: conda install -y tensorflow-gpu keras==2.2.5
    • py37/torch: conda install -y pytorch torchvision
    • py36/mxnet: conda install -y mxnet

    keras 2.2.5 was released on 2019/8/23.
    It adds new Applications: ResNet101, ResNet152, ResNet50V2, ResNet101V2, ResNet152V2.

    common libraries

    conda install -y scikit-learn scikit-image pandas matplotlib pillow opencv seaborn
    pip install imutils progressbar pydot pylint
    

    install imutils with pip rather than conda, to avoid downgrading tensorflow-gpu

    py37

    cudatoolkit               10.0.130                  0    
    cudnn                     7.6.0                cuda10.0_0    
    tensorflow-gpu            1.13.1 
    

    py36

    cudatoolkit        anaconda/pkgs/main/linux-64::cudatoolkit-10.1.168-0
    cudnn              anaconda/pkgs/main/linux-64::cudnn-7.6.0-cuda10.1_0
    tensorboard        anaconda/pkgs/main/linux-64::tensorboard-1.14.0-py36hf484d3e_0
    tensorflow         anaconda/pkgs/main/linux-64::tensorflow-1.14.0-gpu_py36h3fb9ad6_0
    tensorflow-base    anaconda/pkgs/main/linux-64::tensorflow-base-1.14.0-gpu_py36he45bfe2_0
    tensorflow-estima~ anaconda/cloud/conda-forge/linux-64::tensorflow-estimator-1.14.0-py36h5ca1d4c_0
    tensorflow-gpu     anaconda/pkgs/main/linux-64::tensorflow-gpu-1.14.0-h0d30ee6_0
    

    imutils only supports python 3.6 and 3.7.
    mxnet only supports python 3.5 and 3.6.

    details

    # remove py35
    conda remove -n py35 --all
    
    conda info --envs
    
    conda create -n py37 python==3.7
    conda activate py37
    
    # common libraries
    conda install -y scikit-learn pandas pillow opencv
    pip install imutils
    
    # imutils
    conda search imutils  
    # py36 and py37
    
    # Name                       Version           Build  Channel             
    imutils                        0.5.2          py27_0  anaconda/cloud/conda-forge
    imutils                        0.5.2          py36_0  anaconda/cloud/conda-forge
    imutils                        0.5.2          py37_0  anaconda/cloud/conda-forge
    
    # tensorflow-gpu and keras
    conda install -y tensorflow-gpu keras
    
    # install pytorch
    conda install -y pytorch torchvision
    
    # install mxnet
    # method 1: pip
    pip search mxnet
    mxnet-cu80[mkl]/mxnet-cu90[mkl]/mxnet-cu91[mkl]/mxnet-cu92[mkl]/mxnet-cu100[mkl]/mxnet-cu101[mkl]
    
    # method 2: conda
    conda install mxnet
    # py35 and py36
    

    TensorFlow Object Detection API

    home page: home page

    download tensorflow models and rename models-master to tfmodels

    vim ~/.bashrc

    export PYTHONPATH=/home/kezunlin/dl4cv:/data_1/kezunlin/tfmodels/research:$PYTHONPATH
    

    source ~/.bashrc

    jupyter notebook

    conda activate py37
    conda install -y jupyter 
    

    install kernels

    python -m ipykernel install --user --name=py37
    Installed kernelspec py37 in /home/kezunlin/.local/share/jupyter/kernels/py37
    

    config for server

    python -c "import IPython;print(IPython.lib.passwd())"
    Enter password: 
    Verify password: 
    sha1:ef2fb2aacff2:4ea2998699638e58d10d594664bd87f9c3381c04
    
    jupyter notebook --generate-config
    Writing default config to: /home/kezunlin/.jupyter/jupyter_notebook_config.py
    
    vim .jupyter/jupyter_notebook_config.py
    
    c.NotebookApp.ip = '*'  
    c.NotebookApp.password = u'sha1:xxx:xxx' 
    c.NotebookApp.open_browser = False 
    c.NotebookApp.port = 8888 
    c.NotebookApp.enable_mathjax = True
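
    The password string has the form algorithm:salt:digest. As far as I understand IPython.lib.passwd, the digest is the hash of the passphrase followed by the salt. The sketch below rebuilds and checks such a string with the standard library only; make_hash and check_hash are illustrative names, not the notebook API.

```python
import hashlib
import hmac


def make_hash(passphrase, salt, algorithm="sha1"):
    """Build an 'algorithm:salt:digest' string from a passphrase and a hex salt."""
    h = hashlib.new(algorithm)
    h.update(passphrase.encode("utf-8") + salt.encode("ascii"))
    return ":".join((algorithm, salt, h.hexdigest()))


def check_hash(hashed, passphrase):
    """Return True if the passphrase reproduces the digest stored in hashed."""
    try:
        algorithm, salt, _digest = hashed.split(":", 2)
        candidate = make_hash(passphrase, salt, algorithm)
    except ValueError:
        # malformed string or unknown hash algorithm
        return False
    return hmac.compare_digest(candidate, hashed)
```

    This makes it possible to confirm that the value pasted into c.NotebookApp.password matches the password you intend to type in the browser.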
    

    run jupyter in the background

    tmux new -s notebook
    jupyter notebook
    # ctrl+b, then d: detach and leave the session running
    # ctrl+d: exit the shell and close the session
    

    open the notebook in a browser (port 8888 as configured above) and enter the password

    test

    py37

    import cv2
    cv2.__version__
    import tensorflow as tf
    import keras 
    import torch
    import torchvision
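
    To check a whole environment in one go, the imports can be wrapped in a small helper that reports versions instead of stopping at the first failure. A minimal sketch; installed_versions is a made-up name.

```python
import importlib


def installed_versions(module_names):
    """Map each module name to its __version__, or None if it cannot be imported."""
    versions = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = None
        else:
            versions[name] = getattr(module, "__version__", "unknown")
    return versions
```

    For the py37 environment, call installed_versions(["cv2", "tensorflow", "keras", "torch", "torchvision"]); a None entry marks a package that still needs installing.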
    

    cat .keras/keras.json

    {
        "epsilon": 1e-07,
        "floatx": "float32",
        "backend": "tensorflow",
        "image_data_format": "channels_last"
    }
    

    py36

    import mxnet
    

    train demo

    export

    # use CPU only
    export CUDA_VISIBLE_DEVICES=""
    
    # use gpu 0 1
    export CUDA_VISIBLE_DEVICES="0,1"
    

    code

    import os
    os.environ['CUDA_VISIBLE_DEVICES'] = "0,1"
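
    The variable only takes effect if it is set before CUDA is initialised, so in practice it should be set before importing tensorflow, keras or torch. A small sketch with a hypothetical helper, select_gpus, that builds the value from a list of GPU indices:

```python
import os


def select_gpus(gpu_ids):
    """Expose only the given GPU indices to CUDA; an empty list means CPU only."""
    value = ",".join(str(i) for i in gpu_ids)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


# Call this at the very top of train.py, before the framework imports.
select_gpus([0, 1])
```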
    

    start train

    python train.py
    

    .keras folder

    view keras models and datasets

    ls .keras/
    datasets  keras.json  models
    

    models saved to /home/kezunlin/.keras/models/
    datasets saved to /home/kezunlin/.keras/datasets/

    models lists

    xxx_kernels_notop.h5 for include_top = False
    xxx_kernels.h5 for include_top = True

    Datasets

    mnist

    cifar10

    to skip download

    wget http://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
    # keras looks for the archive under this name in its cache folder
    mv cifar-10-python.tar.gz ~/.keras/datasets/cifar-10-batches-py.tar.gz
    

    to load data

    from keras.datasets import cifar10
    (x_train, y_train), (x_test, y_test) = cifar10.load_data()
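
    Before calling load_data() it can be useful to confirm that the archive is where keras will look, otherwise the download starts again. The sketch assumes the keras 2.2.x cache layout (cifar-10-batches-py.tar.gz plus an extracted cifar-10-batches-py folder); the helper name is made up.

```python
import os


def cifar10_cache_status(cache_dir="~/.keras/datasets"):
    """Report which CIFAR-10 cache entries exist under cache_dir."""
    base = os.path.join(os.path.expanduser(cache_dir), "cifar-10-batches-py")
    return {
        "archive": os.path.exists(base + ".tar.gz"),  # the renamed download
        "extracted": os.path.isdir(base),             # created on first load
    }
```

    If "archive" is False, the file was not renamed or moved correctly.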
    

    flowers-17

    animals

    note: the panda images in this dataset are WRONG!

    counts

    ls -lR animals/cat | grep '\.jpg$' | wc -l
    1000
    ls -lR animals/dog | grep '\.jpg$' | wc -l
    1000
    ls -lR animals/panda | grep '\.jpg$' | wc -l
    1000
    

    kaggle cats vs dogs

    caltech101

    download in the background

    wget -b -c http://www.vision.caltech.edu/Image_Datasets/Caltech101/101_ObjectCategories.tar.gz
    

    Kaggle API

    install and config

    see kaggle-api

    conda activate keras
    conda install kaggle
    
    # download kaggle.json, then move it into place
    mkdir -p ~/.kaggle
    mv kaggle.json ~/.kaggle/kaggle.json
    chmod 600 ~/.kaggle/kaggle.json
    
    cat ~/.kaggle/kaggle.json
    {"username":"xxx","key":"yyy"}
    

    or by export

    export KAGGLE_USERNAME=xxx
    export KAGGLE_KEY=yyy
    

    tips

    1. go to your account page and select 'Create API Token'; kaggle.json will be downloaded.
    2. Ensure kaggle.json is in the location ~/.kaggle/kaggle.json to use the API.
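
    On a server without a browser, the credentials file can also be written from Python. A minimal sketch with a hypothetical helper, write_kaggle_credentials; the username and key are the values shown on the kaggle account page.

```python
import json
import os
import stat


def write_kaggle_credentials(username, key, config_dir="~/.kaggle"):
    """Write kaggle.json with owner-only permissions and return its path."""
    config_dir = os.path.expanduser(config_dir)
    os.makedirs(config_dir, exist_ok=True)
    path = os.path.join(config_dir, "kaggle.json")
    # a newly created file gets mode 0600 immediately
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as f:
        json.dump({"username": username, "key": key}, f)
    # also tighten the permissions of a file that already existed
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
    return path
```

    This has the same effect as the mv and chmod 600 steps above.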

    check version

    kaggle --version
    Kaggle API 1.5.5
    

    commands overview

    commands

    kaggle competitions {list, files, download, submit, submissions, leaderboard}
    kaggle datasets {list, files, download, create, version, init}
    kaggle kernels {list, init, push, pull, output, status}
    kaggle config {view, set, unset}
    

    download datasets

    kaggle competitions download -c dogs-vs-cats
    

    show leaderboard

    kaggle competitions leaderboard dogs-vs-cats --show
    teamId  teamName                           submissionDate       score    
    ------  ---------------------------------  -------------------  -------  
    71046  Pierre Sermanet                    2014-02-01 21:43:19  0.98533  
    66623  Maxim Milakov                      2014-02-01 18:20:58  0.98293  
    72059  Owen                               2014-02-01 17:04:40  0.97973  
    74563  Paul Covington                     2014-02-01 23:05:20  0.97946  
    74298  we've been in KAIST                2014-02-01 21:15:30  0.97840  
    71949  orchid                             2014-02-01 23:52:30  0.97733  
    

    set default competition

    kaggle config set --name competition --value dogs-vs-cats
    - competition is now set to: dogs-vs-cats
    
    kaggle config set --name competition --value dogs-vs-cats-redux-kernels-edition
    

    dogs-vs-cats
    dogs-vs-cats-redux-kernels-edition

    submit

    kaggle c submissions
    - Using competition: dogs-vs-cats
    - No submissions found
    
    kaggle c submit -f ./submission.csv -m "first submit"
    

    the competition has already ended, so submissions are no longer accepted.

    Nvidia-docker and containers

    install

    # on ubuntu the docker engine package is docker.io (the package named docker is unrelated)
    sudo apt-get -y install docker.io
    
    # Install nvidia-docker2 and reload the Docker daemon configuration
    sudo apt-get install -y nvidia-docker2
    sudo pkill -SIGHUP dockerd
    

    restart (optional)

    cat /etc/docker/daemon.json

    {
        "runtimes": {
            "nvidia": {
                "path": "nvidia-container-runtime",
                "runtimeArgs": []
            }
        }
    }
    
    sudo systemctl enable docker
    sudo systemctl start docker
    

    if docker fails to start with an error like

    Job for docker.service failed because the control process exited with error code.
    See "systemctl status docker.service" and "journalctl -xe" for details.

    check the contents of /etc/docker/daemon.json
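
    A quick way to check the file is to parse it and look for the nvidia runtime entry. A small sketch, assuming the layout shown above; has_nvidia_runtime is an illustrative name.

```python
import json


def has_nvidia_runtime(daemon_json_text):
    """Return True if the text is valid JSON that registers an 'nvidia' runtime."""
    try:
        config = json.loads(daemon_json_text)
    except ValueError:
        # json.JSONDecodeError is a subclass of ValueError
        return False
    if not isinstance(config, dict):
        return False
    runtimes = config.get("runtimes", {})
    return isinstance(runtimes, dict) and "nvidia" in runtimes
```

    Run it as has_nvidia_runtime(open("/etc/docker/daemon.json").read()); a False result points to a syntax error or a missing runtime entry.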

    test

    sudo docker run --runtime=nvidia --rm nvidia/cuda:10.1-base nvidia-smi
    sudo nvidia-docker run --rm nvidia/cuda:10.1-base nvidia-smi
    
    Thu Aug 29 00:11:32 2019       
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 418.87.00    Driver Version: 418.87.00    CUDA Version: 10.1     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Quadro RTX 8000     Off  | 00000000:02:00.0 Off |                  Off |
    | 43%   67C    P2   136W / 260W |  46629MiB / 48571MiB |     17%      Default |
    +-------------------------------+----------------------+----------------------+
    |   1  Quadro RTX 8000     Off  | 00000000:03:00.0 Off |                  Off |
    | 34%   54C    P0    74W / 260W |      0MiB / 48571MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    |   2  Quadro RTX 8000     Off  | 00000000:82:00.0 Off |                  Off |
    | 34%   49C    P0    73W / 260W |      0MiB / 48571MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    |   3  Quadro RTX 8000     Off  | 00000000:83:00.0 Off |                  Off |
    | 33%   50C    P0    73W / 260W |      0MiB / 48571MiB |      3%      Default |
    +-------------------------------+----------------------+----------------------+
                                                                                
    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    +-----------------------------------------------------------------------------+
    

    add your user to the docker group so that docker can be run without sudo

    command refs

    sudo nvidia-docker run --rm nvidia/cuda:10.1-base nvidia-smi
    sudo nvidia-docker run -t -i --privileged nvidia/cuda bash
    
    sudo docker run -it --name kzl -v /home/kezunlin/workspace/:/home/kezunlin/workspace nvidia/cuda
    

    History

    • 20190821: created.

  • Original post: https://www.cnblogs.com/kezunlin/p/11955538.html