Full name: Linux Virtual Server; author: Zhang Wensong (章文嵩)
VS: Virtual Server, the Director (scheduler)
RS: Real Server, a backend server
CIP: Client IP
VIP: Virtual Server IP
DIP: Director IP
RIP: Real Server IP
CIP <--> VIP <--> DIP <--> RIP
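The two components of lvs:
ipvsadm: a user-space command-line tool; the rule manager, used to manage cluster services and the Real Servers behind each cluster service;
ipvs: the core of lvs; kernel code that works on the netfilter INPUT hook and applies the rules defined through ipvsadm; mainstream Linux distributions already ship ipvs in the kernel, so the user only needs to install the management tool ipvsadm.
ipvs runs on the Director and forwards requests sent to the Virtual IP to the Real Servers;
One ipvs host can define multiple services at the same time;
An ipvs service should have at least one real server;
Installing ipvsadm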
# yum install -y ipvsadm
# rpm -ql ipvsadm
/etc/sysconfig/ipvsadm-config
/usr/lib/systemd/system/ipvsadm.service
/usr/sbin/ipvsadm
/usr/sbin/ipvsadm-restore
/usr/sbin/ipvsadm-save
The ipvsadm command
# man ipvsadm
# ipvsadm -h
Managing virtual services
ipvsadm -A|E -t|u|f service-address [-s scheduler] [-p [timeout]]
-A, --add-service:Add a virtual service.
A service address is uniquely defined by a triplet: IP address, port number, and protocol.
Alternatively, a virtual service may be defined by a firewall-mark.
service-address:
-t, tcp, vip:port
-u, udp, vip:port
-f, fwm, mark
-t, --tcp-service service-address:Use TCP service. The service-address is of the form host[:port].
-u, --udp-service service-address:Use UDP service. See the -t|--tcp-service for the description of the service-address.
-f, --fwmark-service integer:Use a firewall-mark, an integer value greater than zero, to denote a virtual service instead of an address, port and protocol.
The marking of packets with a firewall-mark is configured using the MARK target of iptables (-j MARK --set-mark).
-s, --scheduler scheduling-method:Algorithm for allocating TCP connections and UDP datagrams to real servers.
Scheduling algorithms are implemented as kernel modules.
-p, --persistent [timeout]: Specify that a virtual service is persistent.
If this option is specified, multiple requests from a client are redirected to the same real server selected for the first request.
Optionally, the timeout of persistent sessions may be specified given in seconds, otherwise the default of 300 seconds will be used.
This option may be used in conjunction with protocols such as SSL or FTP where it is important that clients consistently connect with the same real server.
-E, --edit-service:Edit a virtual service.
-D, --delete-service:Delete a virtual service, along with any associated real servers.
Managing Real Servers
ipvsadm -a|e -t|u|f service-address -r server-address [-g|i|m] [-w weight]
[packet-forwarding-method] #selects the lvs type;
-g, --gatewaying:Use gatewaying (direct routing). This is the default.
-i, --ipip:Use ipip encapsulation (tunneling).
-m, --masquerading:Use masquerading (network address translation, or NAT).
-w, --weight weight:Weight is an integer specifying the capacity of a server relative to the others in the pool.
The valid values of weight are 0 through to 65535. The default is 1.
Quiescent servers are specified with a weight of zero.
A quiescent server will receive no new jobs but still serve the existing jobs, for all scheduling algorithms distributed with the Linux Virtual Server.
Setting a quiescent server may be useful if the server is overloaded or needs to be taken out of service for maintenance.
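As an illustration of quiescing, the sketch below prints the commands one might run to drain a real server before maintenance: set its weight to 0, watch its connections fall, then remove it. The helper name drain_cmds and the addresses are hypothetical, and -d (delete a real server) is an ipvsadm option these notes do not cover. The function only prints the commands, so they can be reviewed before being run as root.

```shell
#!/bin/sh
# drain_cmds VIP:PORT RIP -- hypothetical helper; prints commands, runs nothing.
# Assumes the service is TCP and uses direct routing (-g); adjust if not.
drain_cmds() {
    svc=$1
    rip=$2
    echo "ipvsadm -e -t $svc -r $rip -g -w 0"   # quiesce: no new connections
    echo "ipvsadm -Ln -t $svc"                   # watch ActiveConn fall to 0
    echo "ipvsadm -d -t $svc -r $rip"            # then remove the real server
}

drain_cmds 172.16.100.71:80 172.16.100.8
```

Because -e restates the forwarding method, the flag (-g, -i or -m) should match the one the server was added with.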
Viewing rules
ipvsadm -L|l [options]
-L, -l, --list:List the virtual server table if no argument is specified. If a service-address is selected, list this service only.
-c, --connection:Connection output. The list command with this option will list current IPVS connections.
-n, --numeric:Numeric output. IP addresses and port numbers will be printed in numeric format rather than as host names and services respectively, which is the default.
--exact:Expand numbers.
Display the exact value of the packet and byte counters, instead of only the rounded number in K's (multiples of 1000) M's (multiples of 1000K) or G's (multiples of 1000M).
This option is only relevant for the -L command.
--stats:Output of statistics information. The list command with this option will display the statistics information of services and their servers.
--rate:Output of rate information. The list command with this option will display the rate information
(such as connections/second, bytes/second and packets/second) of services and their servers.
Clearing
-C, --clear:Clear the virtual server table. #removes all virtual services;
Saving
-S, --save:Dump the Linux Virtual Server rules to stdout in a format that can be read by -R|--restore.
ipvsadm-save - save the IPVS table to stdout
# ipvsadm -S > /PATH/TO/SOME_RULE_FILE
# ipvsadm-save > /PATH/TO/SOME_RULE_FILE
Restoring
-R, --restore:Restore Linux Virtual Server rules from stdin.
# ipvsadm -R < /PATH/TO/SOME_RULE_FILE
# ipvsadm-restore < /PATH/TO/SOME_RULE_FILE
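The file read by -R appears to hold one rule per line, written as ipvsadm options without the leading command name, which is the shape ipvsadm-save -n produces. The sketch below generates such a file for a direct-routing TCP service. The function name emit_rules and the addresses are hypothetical, and the exact format is worth checking against the output of ipvsadm-save on the target system.

```shell
#!/bin/sh
# emit_rules VIP:PORT SCHEDULER RIP:PORT... -- hypothetical helper.
# Prints rules in the style of `ipvsadm-save -n`; it does not touch the kernel.
emit_rules() {
    vip_port=$1
    sched=$2
    shift 2
    echo "-A -t $vip_port -s $sched"                # the virtual service
    for rs in "$@"; do
        echo "-a -t $vip_port -r $rs -g -w 1"       # one line per real server
    done
}

emit_rules 172.16.100.71:80 rr 172.16.100.8:80 172.16.100.9:80
```

Redirecting the output to a file and feeding that file to ipvsadm -R (as root, on a Director) would load the service in one step.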
Zeroing counters
ipvsadm -Z [-t|u|f service-address]
-Z, --zero:Zero the packet, byte and rate counters in a service or all services.
FWM: FireWall Mark
Packets are marked in netfilter, in the mangle table;
How to set a mark:
iptables -t mangle -A PREROUTING -d $vip -p $protocol --dport $port -j MARK --set-mark N #N is the mark value, an integer;
MARK:This target is used to set the Netfilter mark value associated with the packet.
It can be used in conjunction with routing based on fwmark.
If you plan on doing so, note that the mark needs to be set in the PREROUTING chain of the mangle table to affect routing. The mark field is 32 bits wide.
--set-mark value[/mask]:Zeroes out the bits given by mask and ORs value into the packet mark. If mask is omitted, 0xFFFFFFFF is assumed.
On the Director, define a cluster service based on a firewall mark:
# iptables -t mangle -A PREROUTING -d 172.16.100.71 -p tcp --dport 80 -j MARK --set-mark 6
# ipvsadm -A -f 6 -s rr
# ipvsadm -a -f 6 -r 172.16.100.8 -g
# ipvsadm -a -f 6 -r 172.16.100.9 -g
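A common reason to use a firewall mark is to treat several ports as one service, for example HTTP and HTTPS, so that with -p a client reaches the same RS on both. The sketch below prints the commands for that case. The helper name fwm_cmds, the port pair 80/443 and the 300-second timeout are assumptions for illustration; nothing is executed.

```shell
#!/bin/sh
# fwm_cmds VIP MARK RIP... -- hypothetical helper; prints commands, runs nothing.
fwm_cmds() {
    vip=$1
    mark=$2
    shift 2
    # Give packets for both ports the same mark.
    for port in 80 443; do
        echo "iptables -t mangle -A PREROUTING -d $vip -p tcp --dport $port -j MARK --set-mark $mark"
    done
    # One persistent virtual service keyed on the mark.
    echo "ipvsadm -A -f $mark -s rr -p 300"
    for rip in "$@"; do
        echo "ipvsadm -a -f $mark -r $rip -g"
    done
}

fwm_cmds 172.16.100.71 6 172.16.100.8 172.16.100.9
```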
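lvs working modes
lvs-nat
Multi-target DNAT: forwarding is done by rewriting the destination address and destination port of the request packet to the RIP and port of the selected RS;
(1) RIP and DIP must be in the same IP network and should use private addresses; the gateway of each RS must point to the DIP (so that response packets are guaranteed to pass through the VS);
(2) Both request and response packets are forwarded by the Director; under high load the Director easily becomes the performance bottleneck of the system;
(3) Port mapping is supported; the port on the VIP and the port on the RIP do not have to be the same;
(4) The VS must run Linux; the RS can run any OS;
Design points:
(1) DIP and RIP must be in the same IP network, and the gateway of the RIP must point to the DIP;
(2) Port mapping is supported;
(3) Whether shared storage is needed depends on the business requirements;
On the Director, configure the VIP as 172.16.100.6 (in a real deployment the VIP should be a public address); the DIP is 192.168.10.129;
On RS1, configure the RIP as 192.168.10.8, with the gateway pointing to the DIP;
On RS2, configure the RIP as 192.168.10.9, with the gateway pointing to the DIP;
Install httpd on both RS1 and RS2; to make the test result visible, write different content into the home page of each RS;
To demonstrate NAT port mapping, RS1's httpd can listen on a non-default port;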
# ipvsadm -A -t 172.16.100.6:80 -s rr
# ipvsadm -a -t 172.16.100.6:80 -r 192.168.10.8:8089 -m
# ipvsadm -a -t 172.16.100.6:80 -r 192.168.10.9:80 -m
# echo 1 > /proc/sys/net/ipv4/ip_forward #enable kernel IP forwarding; otherwise the Director does not forward packets between its two interfaces;
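lvs-dr: Direct Routing (the default type)
Forwarding is done by re-encapsulating the request packet in a new MAC header: the source MAC is the MAC of the interface holding the DIP, and the destination MAC is the MAC of the interface holding the RIP of the selected RS; the IP header is unchanged (CIP<-->VIP);
(1) Make sure the front-end router sends request packets whose destination IP is the VIP to the Director;
(2) The RIP of an RS can be a private or a public address; RIP and DIP are in the same network segment;
(3) The RS and the Director must be in the same physical network so that MAC-based forwarding works; the gateway of the RS must not point to the DIP;
(4) Request packets must be scheduled by the Director, but response packets must not pass through the Director;
(5) Port mapping is not supported;
Design points:
(1) One interface per host is enough, but all hosts must be in the same physical network; the RIP or DIP is configured on the physical interface and the VIP on an interface alias;
(2) The gateway of the RIP must not point to the DIP; RIP and DIP should normally be in the same network, but they are not necessarily in the same network as the VIP;
(3) On each RS, set the kernel parameters first, then configure the VIP and the route;
In the DR model every node has the VIP configured, so ARP for the VIP becomes a problem: any of them could answer an ARP broadcast for it; current Linux kernels provide kernel parameters to control ARP replies and announcements;
When a client sends a connection request to the VIP, the request must reach the VIP on the Director, not the one on a Real Server.
To resolve the address conflict, the VIP on each RS is made invisible: it is used only to accept packets whose destination address is the VIP and as the source address of response packets; hiding the VIP on the Real Servers means they do not answer ARP requests for it from the network;
So that an RS uses the VIP as the source address when it sends packets to the client, the RIP is placed on the physical NIC and the VIP on lo (a virtual interface), and a route is added so that packets for the VIP go through the lo device; this ensures that the source IP of the RS's responses to the client is the VIP.
When a packet is sent from the Director to a Real Server, only the destination MAC address changes (it becomes the MAC address of the Real Server).
After receiving the packet, the Real Server routes it to the local loopback device; the service listening on the VIP on the loopback device processes the incoming packet and sends the result out through the interface that holds the RIP, but the source address of the packet is still the VIP.
Three ways to make sure that requests for the VIP reach the Director:
(a) statically bind the VIP to the Director's MAC address on the front-end router;
(b) use arptables on the RS;
(c) change kernel parameters on the RS to limit the level of ARP replies and announcements: stop the RS from replying to ARP requests for the VIP and from announcing the VIP;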
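arp_ignore: defines the reply mode when an ARP request is received;
0: reply as long as the requested address is configured locally on any interface; the default;
Reply for any local target IP address, configured on any interface.
1: reply only if the requested target IP address is configured on the interface the request arrived on; used in the DR model;
Reply only if the target IP address is local address configured on the incoming interface.
arp_announce: defines the restriction level for announcing the host's own addresses;
0: announce any local address on any interface; the default;
Use any local address, configured on any interface.
2: always avoid announcing an address to a network it does not belong to; used in the DR model;
Always use the best local address for this target.
# less /usr/share/doc/kernel-doc-3.10.0/Documentation/networking/ip-sysctl.txt #the parameters are described in this document;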
Set the interface IP on the RS (kernel parameters first, then the VIP and the route):
# echo 1 > /proc/sys/net/ipv4/conf/all/arp_ignore
# echo 1 > /proc/sys/net/ipv4/conf/lo/arp_ignore
# echo 2 > /proc/sys/net/ipv4/conf/all/arp_announce
# echo 2 > /proc/sys/net/ipv4/conf/lo/arp_announce
# ifconfig lo:0 $vip netmask 255.255.255.255 broadcast $vip
# route add -host $vip dev lo:0 #can be skipped if in the same network segment, or if kernel forwarding is enabled;
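Values written under /proc/sys are lost on reboot. A minimal sketch of one way to keep them: each /proc/sys path corresponds to a sysctl key (slashes become dots), so the four settings can be printed as sysctl.conf-style lines. The function name proc_to_sysctl is hypothetical, and the conversion assumes the interface name itself contains no dot.

```shell
#!/bin/sh
# proc_to_sysctl PATH -- hypothetical helper: /proc/sys path -> sysctl key.
# Assumes no dot inside the interface name (e.g. a VLAN name like eth0.10).
proc_to_sysctl() {
    echo "$1" | sed -e 's|^/proc/sys/||' -e 's|/|.|g'
}

# Print the four DR settings as "key = value" lines.
for iface in all lo; do
    echo "$(proc_to_sysctl /proc/sys/net/ipv4/conf/$iface/arp_ignore) = 1"
    echo "$(proc_to_sysctl /proc/sys/net/ipv4/conf/$iface/arp_announce) = 2"
done
```

Appending these lines to /etc/sysctl.conf and running sysctl -p should apply them now and at every boot.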
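On the Director, configure the DIP as 172.16.100.6/16 (the nmtui tool is enough for this); the VIP is 172.16.100.71/32;
# ifconfig eno16777736:0 172.16.100.71/32 broadcast 172.16.100.71 up #set the VIP; the VIP only receives packets and is not used for other communication, so a 32-bit mask is used and the broadcast goes only to itself;
A broadcast address is an address used to send a packet to all hosts in a network at the same time.
In a TCP/IP network, the IP address whose host ID bits are all 1 is the broadcast address.
For example, for the network 10.1.1.0 (255.255.255.0) the broadcast address is 10.1.1.255 (255 is 11111111 in binary); a packet sent to destination 10.1.1.255 is delivered to every computer in that network.
On RS1, configure the RIP as 172.16.100.8/16, with the gateway pointing to the router 172.16.0.1; the nmtui tool is enough for this;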
# tree /proc/sys/net/ipv4/conf/
# echo 1 > /proc/sys/net/ipv4/conf/all/arp_ignore
# echo 1 > /proc/sys/net/ipv4/conf/lo/arp_ignore
# echo 2 > /proc/sys/net/ipv4/conf/all/arp_announce
# echo 2 > /proc/sys/net/ipv4/conf/lo/arp_announce
# ifconfig lo:0 172.16.100.71/32 broadcast 172.16.100.71 up #set the VIP; broadcast only to itself;
On RS2, configure the RIP as 172.16.100.9/16, with the gateway pointing to the router 172.16.0.1; the nmtui tool is enough for this;
# echo 1 > /proc/sys/net/ipv4/conf/all/arp_ignore
# echo 1 > /proc/sys/net/ipv4/conf/lo/arp_ignore
# echo 2 > /proc/sys/net/ipv4/conf/all/arp_announce
# echo 2 > /proc/sys/net/ipv4/conf/lo/arp_announce
# ifconfig lo:0 172.16.100.71/32 broadcast 172.16.100.71 up #set the VIP; broadcast only to itself;
On the Director, configure the cluster;
# ipvsadm -A -t 172.16.100.71:80 -s rr
# ipvsadm -a -t 172.16.100.71:80 -r 172.16.100.8 -g
# ipvsadm -a -t 172.16.100.71:80 -r 172.16.100.9 -g
# ipvsadm -Ln
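Before testing the DR cluster from a client, it may be worth confirming on each RS that the ARP parameters really hold the values used above, since a wrong value tends to show up only as intermittent failures. The checker below is a sketch; the function name check_arp and its optional directory argument (which lets it be tried against a copy of the tree) are hypothetical.

```shell
#!/bin/sh
# check_arp [BASE] -- hypothetical checker; BASE defaults to the live /proc tree.
# Returns 0 when all/lo have arp_ignore=1 and arp_announce=2, else 1.
check_arp() {
    base=${1:-/proc/sys/net/ipv4/conf}
    rc=0
    for iface in all lo; do
        ign=$(cat "$base/$iface/arp_ignore" 2>/dev/null || true)
        ann=$(cat "$base/$iface/arp_announce" 2>/dev/null || true)
        [ "$ign" = 1 ] || { echo "$iface: arp_ignore=${ign:-?}, want 1"; rc=1; }
        [ "$ann" = 2 ] || { echo "$iface: arp_announce=${ann:-?}, want 2"; rc=1; }
    done
    return $rc
}

if check_arp; then
    echo "ARP settings look right for lvs-dr"
else
    echo "adjust the settings before adding the VIP to lo"
fi
```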
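lvs-tun: tunnel
Forwarding method: the IP header of the request packet is not modified (source IP is the CIP, destination IP is the VIP); instead a second IP header is added outside the original one (source IP is the DIP, destination IP is the RIP of the selected RS);
(1) RIP, DIP and VIP are all public addresses;
(2) The gateway of the RS must not point to the DIP;
(3) Request packets are forwarded by the Director; response packets are sent directly to the CIP;
(4) Port mapping is not supported;
(5) The OS of the RS must support tunneling;
lvs-fullnat
Forwarding is done by rewriting both the source IP address (CIP-->DIP) and the destination IP address (VIP-->RIP) of the request packet;
(1) The VIP is a public address; RIP and DIP are private addresses and usually not in the same network, but they must be able to reach each other through routers;
(2) The request packet received by the RS has the DIP as its source IP, so the response is sent straight back to the DIP;
(3) Both request and response packets pass through the Director;
(4) Port mapping is supported;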
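lvs schedulers (scheduling algorithms)
IPVS has ten load-scheduling algorithms; depending on whether scheduling takes the current load of the backend hosts into account, they fall into two classes, static and dynamic;
Static algorithms: schedule by the algorithm alone, ignoring the current load of each node; they aim at fairness of the starting point;
RR: Round Robin;
WRR: Weighted RR, weighted round robin;
SH: Source Hashing; one way to implement session binding; requests from the same IP are always sent to the RS chosen the first time;
DH: Destination Hashing; for forward web proxies (caches), balancing requests from internal users to external servers; the hash is taken over the destination address;
Dynamic algorithms: schedule by the algorithm together with the current load of each Real Server; they aim at fairness of the outcome;
LC: Least Connections;
WLC: Weighted LC, weighted least connections; the default scheduling algorithm;
SED: Shortest Expected Delay;
NQ: Never Queue; an improved SED algorithm;
LBLC: Locality-Based Least Connection; a dynamic DH algorithm;
LBLCR: LBLC with Replication, locality-based least connections with replication;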
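To make the difference between WLC, SED and NQ concrete, here is a small simulation of how each might pick a server. It relies on the commonly cited overhead formulas, which are an assumption here and not taken from these notes: WLC uses (active*256 + inactive)/weight, SED uses (active + 1)/weight, and NQ hands the request to an idle server if there is one and otherwise falls back to SED. The function names and the "name:active:inactive:weight" input format are hypothetical; the kernel implementation differs in detail.

```shell
#!/bin/sh
# Each server is written as "name:active:inactive:weight" (hypothetical format).

# overhead ALGO ACTIVE INACTIVE -> integer overhead before weighting
overhead() {
    case $1 in
        wlc) echo $(( $2 * 256 + $3 )) ;;   # active connections dominate
        sed) echo $(( $2 + 1 )) ;;          # cost once one more connection is added
    esac
}

# pick ALGO SERVER... -> name of the server with the smallest overhead/weight
pick() {
    algo=$1
    shift
    best=""; best_over=0; best_w=1
    for s in "$@"; do
        IFS=: read -r name active inactive weight <<EOF
$s
EOF
        [ "$weight" -gt 0 ] || continue      # weight 0 means quiescent
        if [ "$algo" = nq ] && [ "$active" -eq 0 ]; then
            echo "$name"                     # NQ: an idle server wins at once
            return 0
        fi
        case $algo in
            nq) o=$(overhead sed "$active" "$inactive") ;;
            *)  o=$(overhead "$algo" "$active" "$inactive") ;;
        esac
        # Compare o/weight < best_over/best_w without division.
        if [ -z "$best" ] || [ $(( o * best_w )) -lt $(( best_over * weight )) ]; then
            best=$name; best_over=$o; best_w=$weight
        fi
    done
    echo "$best"
}

pick sed rs1:0:0:1 rs2:1:0:3     # SED prefers the heavier rs2 although rs1 is idle
pick nq  rs1:0:0:1 rs2:1:0:3     # NQ gives the idle rs1 the request first
pick wlc rs1:2:10:1 rs2:3:0:2
```

With rs1 at weight 1 and rs2 at weight 3, SED compares 1/1 with 2/3 and keeps choosing rs2, which is the behaviour NQ is meant to correct.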