I. Docker Network Solutions
The tools that implement cross-host container networking for Docker include Pipework, Flannel, Weave, Open vSwitch (a virtual switch), and Calico. The differences between Pipework, Weave, and Flannel are as follows:
1. Weave's approach
A special routing container is deployed on each host, and the routing containers on different hosts are connected to one another. The router intercepts all IP traffic from the ordinary containers and sends it inside UDP packets to the ordinary containers on other hosts, so containers spread across multiple hosts all see one flat network. Weave solves the networking problem, but deployment is still performed host by host.
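A minimal sketch of that model using Weave's own CLI (the peer address 10.0.0.1 is a placeholder, and the exact launch/env workflow may differ between Weave versions):

# On the first host: start the Weave router container
weave launch

# On every additional host: start the router and peer it with the first host
weave launch 10.0.0.1

# Point the local Docker client at Weave's proxy so newly started containers join the Weave network
eval $(weave env)
docker run -it --rm busybox sh    # this container now receives a Weave-managed IP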
2. Flannel's approach
Flannel is a network fabric designed by the CoreOS team for Kubernetes. In short, it gives the Docker containers created on every node in a cluster virtual IP addresses that are unique across the whole cluster, and lets those containers find each other (i.e., ping each other) by IP address. In the default Docker configuration, the Docker daemon on each node assigns container IPs on its own, so containers on different nodes can end up with the same address. Flannel's purpose is to re-plan IP address allocation for all nodes in the cluster, so that containers on different nodes receive non-overlapping addresses that "belong to the same internal network" and can talk to each other directly over those internal IPs.
Flannel is essentially an "overlay network": a network that runs on top of another network (an application-layer network). Instead of relying on the underlying IP addresses directly, it uses a mapping between IP addresses and identifiers to locate resources; concretely, the original packets are wrapped inside another network packet for routing and forwarding. Flannel currently supports UDP, VXLAN, AWS VPC, GCE routes, and other forwarding backends.
Flannel uses etcd to store its configuration and subnet allocations. When flanneld starts, it first retrieves the configuration and the list of subnets already in use, picks an available subnet, and tries to register it. etcd also stores the host IP that corresponds to each subnet. flanneld uses etcd's watch mechanism to monitor changes to everything under /coreos.com/network/subnets and maintains a routing table from that information. For performance, flannel optimizes the universal TAP/TUN device and proxies IP fragmentation between the TUN device and UDP.
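A hedged way to observe that keyspace with the etcdctl v2 client used elsewhere in this guide (the TLS flags and endpoint follow the deployment in Part II; the lease key and value shown are only examples):

# Watch subnet leases as they are registered or renewed; this is the same data flanneld watches to keep its routes current
/opt/etcd/bin/etcdctl --ca-file=ca.pem --cert-file=server.pem --key-file=server-key.pem \
  --endpoints="https://10.192.27.100:2379" \
  watch --recursive /coreos.com/network/subnets
# e.g. key /coreos.com/network/subnets/172.17.43.0-24 with a value like {"PublicIP":"10.192.27.115","BackendType":"vxlan",...}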
How Flannel works
Each host is given an IP range and subnet size. For example, the overlay network can be configured to use 10.1.0.0/16 with a /24 subnet per host, so host A might receive 10.1.15.0/24 and host B 10.1.20.0/24. Flannel uses etcd to maintain the mapping between the allocated subnets and the hosts' real IP addresses. On the data path, flannel encapsulates IP datagrams in UDP and forwards them to the remote host. UDP was chosen as the forwarding protocol because it can traverse firewalls; AWS Classic, for example, cannot forward IPoIP or GRE packets because its security groups only support TCP/UDP/ICMP. The Flannel workflow is shown in the diagram below. (The default inter-node forwarding mode is UDP; flannel uses port 8285 for UDP-encapsulated packets by default and port 8472 for VXLAN.)
A brief walkthrough of the diagram above (Flannel's workflow can be described as follows):
-> After a packet leaves the source container, it is forwarded by the host's docker0 virtual bridge to the flannel0 virtual interface. flannel0 is a point-to-point virtual device, and the flanneld daemon listens on its other end.
-> Flannel maintains an inter-node routing table via etcd; the table records the subnet assigned to each node.
-> The flanneld service on the source host UDP-encapsulates the original data and, based on its routing table, delivers it to the flanneld service on the destination node. There the data is decapsulated and enters the destination node's flannel0 interface, is then forwarded to the destination host's docker0 bridge, and is finally routed by docker0 to the target container, exactly as local container-to-container traffic would be.
That completes the end-to-end delivery of the packet. Three points are worth explaining:
1) What does UDP encapsulation mean?
The payload carried inside the UDP datagram is actually another packet, in this case an ICMP packet (i.e., a ping). The original packet is UDP-encapsulated by the flanneld service on the source node; after it is delivered to the destination node, the flanneld service on the other end restores the original packet. The Docker daemons on both sides are completely unaware that any of this happens.
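One hedged way to see the encapsulation from the outside is to capture on the host's physical interface while a container pings across nodes (the interface name em1 and the ports follow the deployment below; the exact output will differ):

# The UDP backend encapsulates on port 8285, the VXLAN backend on port 8472
tcpdump -i em1 -nn udp port 8472
# Only the UDP/VXLAN envelope between the node IPs (10.192.27.115 <-> 10.192.27.116) appears here;
# the inner ICMP packets between the 172.17.x.x container addresses never show up on em1 directly.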
2) Why does Docker on each node use a different IP address range?
This looks mysterious, but the explanation is simple: after Flannel allocates an available IP range to each node through etcd, it quietly modifies Docker's startup parameters. On a node running the Flannel service you can inspect the Docker daemon's runtime arguments (ps aux | grep docker | grep "bip") and find a parameter such as --bip=182.48.25.1/24, which restricts the IP range that containers on that node can receive. The range is allocated automatically by Flannel, which uses the records kept in etcd to guarantee that ranges never overlap.
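An illustrative check on one of the nodes deployed below (the addresses shown are the ones that node happens to receive; yours will differ):

ps aux | grep dockerd | grep bip
# root ... /usr/bin/dockerd --bip=172.17.43.1/24 --ip-masq=false --mtu=1450

# The flags come from the environment file that mk-docker-opts.sh writes out of flanneld's subnet lease:
cat /run/flannel/subnet.env
# DOCKER_OPT_BIP="--bip=172.17.43.1/24"
# DOCKER_OPT_IPMASQ="--ip-masq=false"
# DOCKER_OPT_MTU="--mtu=1450"
# DOCKER_NETWORK_OPTIONS=" --bip=172.17.43.1/24 --ip-masq=false --mtu=1450"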
3) Why is traffic on the sending node routed from docker0 to flannel0, and on the destination node from flannel0 to docker0?
Suppose a packet has to travel from a container with IP 172.17.18.2 to a container with IP 172.17.46.2. According to the sending node's routing table, the destination only matches the 172.17.0.0/16 route, so after leaving docker0 the packet is handed to flannel0. Likewise on the destination node, because the destination address belongs to a local container, it necessarily falls into the 172.17.46.0/24 route that corresponds to docker0, and so the packet is delivered to the docker0 interface.
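A hedged sketch of the routing tables that produce this behavior (UDP backend with a flannel0 TUN device, as in the explanation above; the addresses come from the example and only the relevant routes are shown):

# Sending node (local containers in 172.17.18.0/24)
ip route
# 172.17.0.0/16 dev flannel0                                          <- cross-host traffic is handed to flanneld
# 172.17.18.0/24 dev docker0 proto kernel scope link src 172.17.18.1  <- local container traffic stays on docker0

# Destination node (local containers in 172.17.46.0/24)
ip route
# 172.17.0.0/16 dev flannel0
# 172.17.46.0/24 dev docker0 proto kernel scope link src 172.17.46.1  <- the more specific route wins, so the
#                                                                        decapsulated packet goes out via docker0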
3. Pipework's approach
Pipework is a single-host tool that wraps brctl and related utilities. It solves the problem of configuring a container's virtual NIC, bridge, IP address, and so on, on the host itself, and it can be combined with other networking solutions.
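A minimal pipework sketch (the bridge name, container name, addresses, and gateway are placeholders):

# Attach an extra interface to a running container, bridge it to br1 on the host,
# and assign a static IP plus a default gateway
pipework br1 mycontainer 192.168.242.10/24@192.168.242.1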
If you only have a few containers and just want a simple, flat Layer 3 network, consider Weave.
If you have many containers and a complex environment that needs multiple subnets, consider Open vSwitch or Flannel.
Weave's overall network performance is relatively poor, while Flannel with the VXLAN backend meets most requirements, so Flannel is the usual recommendation.
II. Flannel Deployment (node01 10.192.27.115, node02 10.192.27.116)
1. Write the allocated network range into etcd for flanneld to use
# On any etcd node (here the master node), write the configuration into etcd: the key is /coreos.com/network/config and the value is the network configuration { "Network": "172.17.0.0/16", "Backend": {"Type": "vxlan"}}
[root@master01 ~]# cd /root/k8s/etcd-cert/
[root@master01 etcd-cert]# ls
ca-config.json  ca.csr  ca-csr.json  ca-key.pem  ca.pem  etcd-cert.sh  server.csr  server-csr.json  server-key.pem  server.pem

# Set the key/value pair
[root@master01 etcd-cert]# /opt/etcd/bin/etcdctl --ca-file=ca.pem --cert-file=server.pem --key-file=server-key.pem --endpoints="https://10.192.27.100:2379,https://10.192.27.115:2379,https://10.192.27.116:2379" set /coreos.com/network/config '{ "Network": "172.17.0.0/16", "Backend": {"Type": "vxlan"}}'
{ "Network": "172.17.0.0/16", "Backend": {"Type": "vxlan"}}
# The key is now set: an overlay network using the 172.17.0.0/16 range with VXLAN forwarding.
# (The default inter-node forwarding mode is UDP; flannel uses port 8285 for UDP-encapsulated packets by default and port 8472 for VXLAN.)

# Read the key/value pair back
[root@master01 etcd-cert]# /opt/etcd/bin/etcdctl --ca-file=ca.pem --cert-file=server.pem --key-file=server-key.pem --endpoints="https://10.192.27.100:2379,https://10.192.27.115:2379,https://10.192.27.116:2379" get /coreos.com/network/config
{ "Network": "172.17.0.0/16", "Backend": {"Type": "vxlan"}}
# key: /coreos.com/network/config    value: { "Network": "172.17.0.0/16", "Backend": {"Type": "vxlan"}}
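The value stored under /coreos.com/network/config accepts a few more keys than Network and Backend.Type. As a hedged sketch only, a configuration matching the 10.1.0.0/16, /24-per-host example from Part I with the UDP backend might look like this (key names per the flannel documentation; not the configuration used in this deployment):

/opt/etcd/bin/etcdctl --ca-file=ca.pem --cert-file=server.pem --key-file=server-key.pem \
  --endpoints="https://10.192.27.100:2379,https://10.192.27.115:2379,https://10.192.27.116:2379" \
  set /coreos.com/network/config \
  '{ "Network": "10.1.0.0/16", "SubnetLen": 24, "Backend": {"Type": "udp", "Port": 8285} }'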
2. Download the binary package: https://github.com/coreos/flannel/releases
[root@master01 etcd-cert]# cd ..
[root@master01 k8s]# wget https://github.com/coreos/flannel/releases/download/v0.10.0/flannel-v0.10.0-linux-amd64.tar.gz
[root@master01 k8s]# scp flannel-v0.10.0-linux-amd64.tar.gz root@10.192.27.115:~    # copy to node01
[root@master01 k8s]# scp flannel-v0.10.0-linux-amd64.tar.gz root@10.192.27.116:~    # copy to node02
3. Install Docker on both nodes
Repository installation (official docs: https://docs.docker.com):

# Install dependencies
yum install -y yum-utils device-mapper-persistent-data lvm2

# Add the Docker package repository
yum-config-manager \
    --add-repo \
    https://download.docker.com/linux/centos/docker-ce.repo

# Install Docker CE
yum install -y docker-ce

# Start the Docker service and enable it at boot
systemctl start docker
systemctl enable docker

Where do images come from? Docker Hub is the public registry maintained by Docker Inc. It hosts a large number of container images, and the Docker tooling pulls from this public registry by default.
Address: https://hub.docker.com/explore
Configure a registry mirror to speed up image pulls: https://www.daocloud.io/mirror
curl -sSL https://get.daocloud.io/daotools/set_mirror.sh | sh -s http://f1361db2.m.daocloud.io
Binary package installation    # this guide installs Docker from the static binary package
[root@node01 ~]# wget https://download.docker.com/linux/static/stable/x86_64/docker-18.09.4.tgz
[root@node01 ~]# tar -xf docker-18.09.4.tgz
[root@node01 ~]# ls docker
containerd  containerd-shim  ctr  docker  dockerd  docker-init  docker-proxy  runc
[root@node01 ~]# cp docker/* /usr/bin/
[root@node01 ~]#
[root@node02 ~]# tar -xf docker-18.09.4.tgz
[root@node02 ~]# cp docker/* /usr/bin/
[root@node02 ~]#
4. Deploy and configure Flannel
The following steps are required on both nodes (node01 10.192.27.115, node02 10.192.27.116).
# With more nodes, you can configure one node and then copy /opt/kubernetes plus the flanneld and dockerd unit files to the others
[root@node01 ~]# tar -xf flannel-v0.10.0-linux-amd64.tar.gz
[root@node01 ~]# ls
anaconda-ks.cfg  flanneld  flannel-v0.10.0-linux-amd64.tar.gz  mk-docker-opts.sh  README.md
[root@node01 ~]# mkdir -p /opt/kubernetes/{cfg,bin,ssl}
[root@node01 ~]# mv flanneld mk-docker-opts.sh /opt/kubernetes/bin
Create the flannel.sh script, which generates the flannel configuration file, the flanneld systemd unit, and the dockerd systemd unit:
#!/bin/bash

ETCD_ENDPOINTS=${1:-"http://127.0.0.1:2379"}

cat <<EOF >/opt/kubernetes/cfg/flanneld
FLANNEL_OPTIONS="--etcd-endpoints=${ETCD_ENDPOINTS} \
-etcd-cafile=/opt/etcd/ssl/ca.pem \
-etcd-certfile=/opt/etcd/ssl/server.pem \
-etcd-keyfile=/opt/etcd/ssl/server-key.pem"
EOF

cat <<EOF >/usr/lib/systemd/system/flanneld.service
[Unit]
Description=Flanneld overlay address etcd agent
After=network-online.target network.target
Before=docker.service

[Service]
Type=notify
EnvironmentFile=/opt/kubernetes/cfg/flanneld
ExecStart=/opt/kubernetes/bin/flanneld --ip-masq \$FLANNEL_OPTIONS
ExecStartPost=/opt/kubernetes/bin/mk-docker-opts.sh -k DOCKER_NETWORK_OPTIONS -d /run/flannel/subnet.env
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

cat <<EOF >/usr/lib/systemd/system/dockerd.service
[Unit]
Description=Docker Application Container Engine
Documentation=https://docs.docker.com
After=network-online.target firewalld.service
Wants=network-online.target

[Service]
Type=notify
EnvironmentFile=/run/flannel/subnet.env
ExecStart=/usr/bin/dockerd \$DOCKER_NETWORK_OPTIONS
ExecReload=/bin/kill -s HUP \$MAINPID
LimitNOFILE=infinity
LimitNPROC=infinity
LimitCORE=infinity
TimeoutStartSec=0
Delegate=yes
KillMode=process
Restart=on-failure
StartLimitBurst=3
StartLimitInterval=60s

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable flanneld
systemctl restart flanneld
systemctl enable dockerd
systemctl restart dockerd
Run the script, passing the etcd endpoints as the first argument:
[root@node01 ~]# bash flannel.sh https://10.192.27.100:2379,https://10.192.27.115:2379,https://10.192.27.116:2379
Check the process:
# flanneld and etcd authenticate each other with mutual TLS: node01's flanneld is the client, the etcd cluster is the server
[root@node01 ~]# ps -ef | grep flannel
root     28574     1  0 11:15 ?        00:00:07 /opt/kubernetes/bin/flanneld --ip-masq --etcd-endpoints=https://10.192.27.100:2379,https://10.192.27.115:2379,https://10.192.27.116:2379 -etcd-cafile=/opt/etcd/ssl/ca.pem -etcd-certfile=/opt/etcd/ssl/server.pem -etcd-keyfile=/opt/etcd/ssl/server-key.pem
root     39802 18416  0 13:44 pts/0    00:00:00 grep --color=auto flannel
[root@node01 ~]#
5. Copy the files to the other node
[root@node01 ~]# scp -r /opt/kubernetes root@10.192.27.116:/opt
[root@node01 ~]# scp /usr/lib/systemd/system/dockerd.service root@10.192.27.116:/usr/lib/systemd/system
[root@node01 ~]# scp /usr/lib/systemd/system/flanneld.service root@10.192.27.116:/usr/lib/systemd/system
[root@node02 ~]# systemctl daemon-reload
[root@node02 ~]# systemctl enable flanneld
Created symlink from /etc/systemd/system/multi-user.target.wants/flanneld.service to /usr/lib/systemd/system/flanneld.service.
[root@node02 ~]# systemctl restart flanneld
[root@node02 ~]# systemctl enable dockerd
Created symlink from /etc/systemd/system/multi-user.target.wants/dockerd.service to /usr/lib/systemd/system/dockerd.service.
[root@node02 ~]# systemctl restart dockerd
[root@node02 ~]# ps -ef | grep docker
root     22818     1  0 14:52 ?        00:00:00 /usr/bin/dockerd --bip=172.17.46.1/24 --ip-masq=false --mtu=1450
root     22826 22818  1 14:52 ?        00:00:00 containerd --config /var/run/docker/containerd/containerd.toml --log-level info
root     23012 13951  0 14:52 pts/0    00:00:00 grep --color=auto docker
[root@node02 ~]# ps -ef | grep flannel
root     22672     1  0 14:52 ?        00:00:00 /opt/kubernetes/bin/flanneld --ip-masq --etcd-endpoints=https://10.192.27.100:2379,https://10.192.27.115:2379,https://10.192.27.116:2379 -etcd-cafile=/opt/etcd/ssl/ca.pem -etcd-certfile=/opt/etcd/ssl/server.pem -etcd-keyfile=/opt/etcd/ssl/server-key.pem
root     23038 13951  0 14:53 pts/0    00:00:00 grep --color=auto flannel
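At this point both flanneld instances have registered a subnet lease in etcd, which can be double-checked from the master (a hedged check; the lease keys correspond to the subnets seen above):

[root@master01 etcd-cert]# /opt/etcd/bin/etcdctl --ca-file=ca.pem --cert-file=server.pem --key-file=server-key.pem \
  --endpoints="https://10.192.27.100:2379,https://10.192.27.115:2379,https://10.192.27.116:2379" \
  ls /coreos.com/network/subnets
/coreos.com/network/subnets/172.17.43.0-24
/coreos.com/network/subnets/172.17.46.0-24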
6. Configuration walkthrough
When the flanneld service starts, it generates the environment variables for the Flannel-managed subnet:
[root@node01 ~]# cat /run/flannel/subnet.env
DOCKER_OPT_BIP="--bip=172.17.43.1/24"
DOCKER_OPT_IPMASQ="--ip-masq=false"
DOCKER_OPT_MTU="--mtu=1450"
DOCKER_NETWORK_OPTIONS=" --bip=172.17.43.1/24 --ip-masq=false --mtu=1450"

[root@node02 ~]# cat /run/flannel/subnet.env
DOCKER_OPT_BIP="--bip=172.17.46.1/24"
DOCKER_OPT_IPMASQ="--ip-masq=false"
DOCKER_OPT_MTU="--mtu=1450"
DOCKER_NETWORK_OPTIONS=" --bip=172.17.46.1/24 --ip-masq=false --mtu=1450"
# The dockerd service unit is configured to use the Flannel-generated subnet by referencing those variables:
[root@node01 ~]# grep -v '^#' /usr/lib/systemd/system/dockerd.service
[Unit]
Description=Docker Application Container Engine
Documentation=https://docs.docker.com
After=network-online.target firewalld.service
Wants=network-online.target

[Service]
Type=notify
EnvironmentFile=/run/flannel/subnet.env                 # extra line: load the Flannel-generated environment
ExecStart=/usr/bin/dockerd $DOCKER_NETWORK_OPTIONS      # start dockerd with the Flannel network options
ExecReload=/bin/kill -s HUP $MAINPID
LimitNOFILE=infinity
LimitNPROC=infinity
LimitCORE=infinity
TimeoutStartSec=0
Delegate=yes
KillMode=process
Restart=on-failure
StartLimitBurst=3
StartLimitInterval=60s

[Install]
WantedBy=multi-user.target
[root@node01 ~]#
For comparison, the stock docker.service unit (kept here as docker.bak) looks like this:

[root@localhost system]# cat docker.bak
[Unit]
Description=Docker Application Container Engine
Documentation=https://docs.docker.com
BindsTo=containerd.service
After=network-online.target firewalld.service containerd.service
Wants=network-online.target
Requires=docker.socket

[Service]
Type=notify
# the default is not to use systemd for cgroups because the delegate issues still
# exists and systemd currently does not support the cgroup feature set required
# for containers run by docker
ExecStart=/usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock
ExecReload=/bin/kill -s HUP $MAINPID
TimeoutSec=0
RestartSec=2
Restart=always

# Note that StartLimit* options were moved from "Service" to "Unit" in systemd 229.
# Both the old, and new location are accepted by systemd 229 and up, so using the old location
# to make them work for either version of systemd.
StartLimitBurst=3

# Note that StartLimitInterval was renamed to StartLimitIntervalSec in systemd 230.
# Both the old, and new name are accepted by systemd 230 and up, so using the old name to make
# this option work for either version of systemd.
StartLimitInterval=60s

# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNOFILE=infinity
LimitNPROC=infinity
LimitCORE=infinity

# Comment TasksMax if your systemd version does not support it.
# Only systemd 226 and above support this option.
TasksMax=infinity

# set delegate yes so that systemd does not reset the cgroups of docker containers
Delegate=yes

# kill only the docker process, not all processes in the cgroup
KillMode=process

[Install]
WantedBy=multi-user.target
7. Check the network status
[root@node01 ~]# ifconfig
docker0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        inet 172.17.43.1  netmask 255.255.255.0  broadcast 172.17.43.255
        ether 02:42:96:a2:41:c6  txqueuelen 0  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

em1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.192.27.115  netmask 255.255.255.128  broadcast 10.192.27.127
        inet6 fe80::444d:ef36:fd70:9a89  prefixlen 64  scopeid 0x20<link>
        ether 80:18:44:e6:eb:dc  txqueuelen 1000  (Ethernet)
        RX packets 3905052  bytes 633862527 (604.4 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 3275346  bytes 515290623 (491.4 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
        device interrupt 81

flannel.1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1450    # the VXLAN backend creates a flannel.1 interface; with the UDP backend this would instead be a TUN device named flannel0
        inet 172.17.43.0  netmask 255.255.255.255  broadcast 0.0.0.0
        inet6 fe80::342a:5aff:feb1:ec27  prefixlen 64  scopeid 0x20<link>
        ether 36:2a:5a:b1:ec:27  txqueuelen 0  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 8 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 12096  bytes 689540 (673.3 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 12096  bytes 689540 (673.3 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
[root@node02 ~]# ifconfig
docker0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        inet 172.17.46.1  netmask 255.255.255.0  broadcast 172.17.46.255
        ether 02:42:8f:3e:f5:65  txqueuelen 0  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

em1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.192.27.116  netmask 255.255.255.128  broadcast 10.192.27.127
        inet6 fe80::fde1:f746:6309:54a2  prefixlen 64  scopeid 0x20<link>
        ether 50:9a:4c:77:36:c5  txqueuelen 1000  (Ethernet)
        RX packets 5753325  bytes 888281290 (847.1 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 5123425  bytes 644662134 (614.7 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
        device interrupt 16

flannel.1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1450
        inet 172.17.46.0  netmask 255.255.255.255  broadcast 0.0.0.0
        inet6 fe80::fc48:3dff:fe42:ab6a  prefixlen 64  scopeid 0x20<link>
        ether fe:48:3d:42:ab:6a  txqueuelen 0  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 8 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 11668  bytes 614833 (600.4 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 11668  bytes 614833 (600.4 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
Test cross-node connectivity by pinging the other node's docker0 address from each node:

[root@node01 ~]# ping 172.17.46.1
PING 172.17.46.1 (172.17.46.1) 56(84) bytes of data.
64 bytes from 172.17.46.1: icmp_seq=1 ttl=64 time=0.272 ms
64 bytes from 172.17.46.1: icmp_seq=2 ttl=64 time=0.182 ms
64 bytes from 172.17.46.1: icmp_seq=3 ttl=64 time=0.182 ms

[root@node02 ~]# ping 172.17.43.1
PING 172.17.43.1 (172.17.43.1) 56(84) bytes of data.
64 bytes from 172.17.43.1: icmp_seq=1 ttl=64 time=0.264 ms
64 bytes from 172.17.43.1: icmp_seq=2 ttl=64 time=0.213 ms
64 bytes from 172.17.43.1: icmp_seq=3 ttl=64 time=0.216 ms
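The host-to-docker0 pings above show the overlay is up; a quick hedged way to test actual container-to-container traffic is to start a throwaway container on each node and ping across (requires the busybox image; the container address is an example):

# On node01: start a container and note its address in the node's 172.17.43.0/24 subnet
[root@node01 ~]# docker run -it --rm busybox sh
/ # ip addr show eth0        # e.g. 172.17.43.2/24

# On node02: start a container and ping the node01 container over the flannel overlay
[root@node02 ~]# docker run -it --rm busybox sh
/ # ping 172.17.43.2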