• Horovod in Docker


    https://horovod.readthedocs.io/en/stable/docker.html

    Step 1: Build the image

    GPU

    $ mkdir horovod-docker-gpu
    $ wget -O horovod-docker-gpu/Dockerfile https://raw.githubusercontent.com/horovod/horovod/master/Dockerfile.gpu
    $ docker build -t horovod:latest horovod-docker-gpu
    

    CPU

    $ mkdir horovod-docker-cpu
    $ wget -O horovod-docker-cpu/Dockerfile https://raw.githubusercontent.com/horovod/horovod/master/Dockerfile.cpu
    $ docker build -t horovod:latest horovod-docker-cpu
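
The GPU and CPU builds differ only in the directory name and the Dockerfile suffix, so the three steps can be parameterized. The following is a minimal sketch, not part of the Horovod documentation: `print_build_steps` is a hypothetical helper that only prints the commands (a dry run) so they can be reviewed before running them, and the URL pattern is copied from the commands above.

```shell
#!/bin/sh
# print_build_steps FLAVOR
# Hypothetical helper: print (do not run) the build commands for "gpu" or "cpu".
print_build_steps() {
    flavor=$1
    case "$flavor" in
        gpu|cpu) ;;
        *) echo "usage: print_build_steps gpu|cpu" >&2; return 1 ;;
    esac
    # Keep the directory name and the Dockerfile suffix in sync.
    dir="horovod-docker-$flavor"
    echo "mkdir -p $dir"
    echo "wget -O $dir/Dockerfile https://raw.githubusercontent.com/horovod/horovod/master/Dockerfile.$flavor"
    echo "docker build -t horovod:latest $dir"
}

print_build_steps cpu
```

Piping the output to `sh` would execute the steps; printing first makes a mismatch between the directory name and the build context easy to spot.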
    

    Running on a single machine

    On a machine with GPUs, you can use nvidia-docker.

    $ nvidia-docker run -it horovod:latest
    root@c278c88dd552:/examples# horovodrun -np 4 -H localhost:4 python keras_mnist_advanced.py
    

    Running on multiple machines

    (1) Prerequisite for multi-machine runs: passwordless SSH login

    http://www.linuxproblem.org/art_9.html

    1. First log in on A as user a and generate a pair of authentication keys. Do not enter a passphrase:
    a@A:~> ssh-keygen -t rsa
    Generating public/private rsa key pair.
    Enter file in which to save the key (/home/a/.ssh/id_rsa): 
    Created directory '/home/a/.ssh'.
    Enter passphrase (empty for no passphrase): 
    Enter same passphrase again: 
    Your identification has been saved in /home/a/.ssh/id_rsa.
    Your public key has been saved in /home/a/.ssh/id_rsa.pub.
    The key fingerprint is:
    3e:4f:05:79:3a:9f:96:7c:3b:ad:e9:58:37:bc:37:e4 a@A
    
    2. Now use ssh to create a directory ~/.ssh as user b on B. (The directory may already exist, which is fine):
    a@A:~> ssh b@B mkdir -p .ssh
    b@B's password: 
    
    3. Finally append a's new public key to b@B:.ssh/authorized_keys and enter b's password one last time:
    a@A:~> cat .ssh/id_rsa.pub | ssh b@B 'cat >> .ssh/authorized_keys'
    b@B's password: 
    
    4. From now on you can log into B as b from A as a without password:
    a@A:~> ssh b@B
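
Before launching a multi-machine job, it can help to confirm non-interactively that key-based login really works, since a password prompt would stall the launch. The following is a minimal sketch: `check_passwordless` is a hypothetical helper, and the `SSH_BIN` variable exists only so the function can be exercised without a real remote host. `BatchMode=yes` and `ConnectTimeout` are standard OpenSSH client options.

```shell
#!/bin/sh
# check_passwordless USER@HOST
# Hypothetical helper: succeeds only if ssh can log in without prompting.
# SSH_BIN is an assumption added for testing; it defaults to the real ssh.
check_passwordless() {
    # BatchMode=yes makes ssh fail instead of asking for a password;
    # ConnectTimeout=5 gives up after five seconds on unreachable hosts.
    "${SSH_BIN:-ssh}" -o BatchMode=yes -o ConnectTimeout=5 "$1" true
}

# Example usage (b@B is the placeholder account from the steps above):
#   check_passwordless b@B && echo "passwordless login works"
```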
    

    (2) Primary worker

    host1$ nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh horovod:latest
    root@c278c88dd552:/examples# horovodrun -np 16 -H host1:4,host2:4,host3:4,host4:4 -p 12345 python keras_mnist_advanced.py
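
In the command above, `-np 16` is the total number of processes and must match the sum of the slots listed after `-H` (4 hosts with 4 slots each). The following is a minimal sketch that derives both values from a host list so they cannot drift apart. The variable names are illustrative, and the uniform slot count per host is an assumption taken from this example.

```shell
#!/bin/sh
# Host names and port are taken from the example above.
HOSTS="host1 host2 host3 host4"
SLOTS=4          # processes per host; assumed identical on every host
PORT=12345       # port on which sshd listens inside the worker containers

host_arg=""
total=0
for h in $HOSTS; do
    # Append "host:slots", adding a comma only between entries.
    host_arg="${host_arg:+$host_arg,}$h:$SLOTS"
    total=$((total + SLOTS))
done

# Print the resulting command instead of running it.
echo "horovodrun -np $total -H $host_arg -p $PORT python keras_mnist_advanced.py"
```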
    

    (3) Secondary workers:

    host2$ nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh horovod:latest \
        bash -c "/usr/sbin/sshd -p 12345; sleep infinity"
    
    host3$ nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh horovod:latest \
        bash -c "/usr/sbin/sshd -p 12345; sleep infinity"
    
    host4$ nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh horovod:latest \
        bash -c "/usr/sbin/sshd -p 12345; sleep infinity"
    

    Remote Direct Memory Access (RDMA) support

    $ nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh --cap-add=IPC_LOCK --device=/dev/infiniband horovod:latest
    root@c278c88dd552:/examples# ...
    
  • Original article: https://www.cnblogs.com/shix0909/p/13391019.html