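The Tudou operations team calls this setup (Spread plus Wackamole) "the poor man's Rolls-Royce." Heh! I have always used ZXTM here, but some special business requirements led me to try this architecture. I consulted the Tudou operations team's article, but there is very little related material online and what exists is vague, so I wrote this post.
1 For what these two programs do and how they compare with ZXTM, LVS, and others, see the Tudou team's blog post: http://blog.ops.tudou.com/wp/?p=188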
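2 Preparation before installation:
Note: I also want to stress how important the contents of /etc/hosts are. Before installing, be sure to configure the IPs and hostnames you intend to use, because starting spread requires a hostname. Unlike the Tudou team's article, however, I do not think the name has to come from `uname -n`; more on this below.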
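A minimal sketch of a pre-install check, assuming the node names are the ones used later in this post. The helper name `hosts_ip` and the temporary demo file are illustration only; against a real machine you would pass /etc/hosts instead.

```shell
#!/bin/sh
# hosts_ip FILE NAME: print the IP that FILE (in /etc/hosts format) maps NAME to.
# Returns non-zero when NAME has no entry. (Hypothetical helper, not part of spread.)
hosts_ip() {
    awk -v name="$2" '
        { sub(/#.*/, "") }                       # drop comments
        {
            for (i = 2; i <= NF; i++)
                if ($i == name) { print $1; found = 1; exit }
        }
        END { exit (found ? 0 : 1) }
    ' "$1"
}

# Demo on a throwaway copy of the hosts entries used in this post.
demo_hosts=$(mktemp)
cat > "$demo_hosts" <<'EOF'
192.168.9.160 test00.dongwm.com
192.168.9.161 test01.dongwm.com
192.168.9.162 test02.dongwm.com
EOF

for h in test00.dongwm.com test01.dongwm.com test02.dongwm.com; do
    if ip=$(hosts_ip "$demo_hosts" "$h"); then
        echo "$h -> $ip"
    else
        echo "$h is missing from the hosts file" >&2
    fi
done
rm -f "$demo_hosts"
```

Running the same loop with /etc/hosts on every node before starting spread should catch a missing or mistyped entry early.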
3 Install spread:
I also chose version 4.0.0, because I ran into many problems when I first tried 4.1.0. You are welcome to try 4.1.0 yourself and leave me a comment.
tar zxvf spread-src-4.0.0.tar.gz && cd spread-src-4.0.0 && ./configure && make && make install
4 Install wackamole:
Download it from http://www.cnds.jhu.edu/download/download_wackamole.cgi (fill in the requested information, then click download).
Note: three problems may come up during this process:
1 Invalid configuration `x86_64-unknown-linux-gnu': machine `x86_64-unknown' not recognized
Fix: copy two files over the ones in this directory:
cp /usr/share/libtool/config.sub .
cp /usr/share/libtool/config.guess .
2 checking size of char... configure: error: cannot compute sizeof (char)
Fix: add the lib directory of the installed spread to LD_LIBRARY_PATH. Mine was empty, so I assigned it directly:
#export LD_LIBRARY_PATH=/usr/local/lib (this is the default lib directory of the install)
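3 A later problem: when starting wackamole you may see: Starting wackamole.../usr/local/sbin/wackamole: error while loading shared libraries: libspread.so: cannot open shared object file: No such file or directory [FAILED]
Fix: this is probably because ldconfig was not run after installing spread. If that still does not help, use locate to find the directory that holds the lib file, add that directory to /etc/ld.so.conf, and run ldconfig again.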
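The two library fixes above can be sketched as follows. This is only an illustration: `add_lib_dir` and `cache_has_lib` are hypothetical helper names, and /usr/local/lib is assumed to be where spread put its libraries. The first helper appends to LD_LIBRARY_PATH instead of overwriting it, which matters when the variable is not empty; the second looks for libspread in the output of `ldconfig -p`.

```shell
#!/bin/sh
# add_lib_dir DIR: append DIR to LD_LIBRARY_PATH unless it is already listed.
add_lib_dir() {
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$1:"*) ;;                                            # already present
        *) LD_LIBRARY_PATH=${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}$1 ;;
    esac
    export LD_LIBRARY_PATH
}

# cache_has_lib NAME: read `ldconfig -p` output on stdin and succeed when a
# library whose file name starts with NAME.so is listed.
cache_has_lib() {
    grep -q "^[[:space:]]*$1\.so"
}

add_lib_dir /usr/local/lib      # assumed install location of libspread

# Only query the linker cache where ldconfig is available.
if command -v ldconfig >/dev/null 2>&1; then
    if ldconfig -p 2>/dev/null | cache_has_lib libspread; then
        echo "libspread is in the linker cache"
    else
        echo "libspread is not in the linker cache; check /etc/ld.so.conf and rerun ldconfig"
    fi
fi
```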
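For the startup scripts you can refer to the original Tudou blog post, but they contain stray HTML entities, and the spread script has problems: both starting and killing the process misbehave. I do not know whether others have hit the same issues, but I will paste my improved scripts at the end.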
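5 Configuration principles:
My test environment:
CentOS 5.5
Goal of the experiment:
Map three real IPs (192.168.9.160, 192.168.9.161, 192.168.9.162) onto three virtual IPs (192.168.9.109, 192.168.9.112, 192.168.9.113); under normal conditions each real IP holds one virtual IP.
When a failure occurs, the virtual IP automatically "floats" to another machine.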
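1 Configure spread:
spread.conf mainly defines the mapping between the real IPs and the hostnames of the machines in the group you want to virtualize. Here is my configuration.
First, my hosts file: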
#vi /etc/hosts
[cc lang='bash' width="99%" height="100%"]192.168.9.160 test00.dongwm.com
192.168.9.161 test01.dongwm.com
192.168.9.162 test02.dongwm.com[/cc]
#vi spread.conf
[cc lang='bash' width="99%" height="100%"]DaemonGroup = spread
DaemonUser = spread
EventLogFile = /usr/local/etc/spreadlog_%h.log
EventPriority = ERROR
Spread_Segment 192.168.255.255:4803 {
test00.dongwm.com 192.168.9.160
test01.dongwm.com 192.168.9.161
test02.dongwm.com 192.168.9.162
} # this is the broadcast style; there is also a multicast style of configuration
[/cc]
Note: this process must be running on every machine.
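Because the segment block is just the hosts entries with the columns swapped, it can be generated rather than typed. A minimal sketch, assuming a file that lists only the cluster nodes in /etc/hosts format; `spread_segment` is a hypothetical helper name and the indentation is cosmetic.

```shell
#!/bin/sh
# spread_segment BROADCAST PORT NODES_FILE
# Print a Spread_Segment block with one "hostname ip" line per entry in
# NODES_FILE, which uses the /etc/hosts layout ("ip hostname").
spread_segment() {
    printf 'Spread_Segment %s:%s {\n' "$1" "$2"
    awk '{ sub(/#.*/, "") } NF >= 2 { printf "    %s %s\n", $2, $1 }' "$3"
    printf '}\n'
}

# Demo with the three nodes from this post.
nodes=$(mktemp)
cat > "$nodes" <<'EOF'
192.168.9.160 test00.dongwm.com
192.168.9.161 test01.dongwm.com
192.168.9.162 test02.dongwm.com
EOF
spread_segment 192.168.255.255 4803 "$nodes"
rm -f "$nodes"
```

Generating the block from one shared node list keeps spread.conf identical on every machine, which is what each daemon needs in order to find the others.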
2 Configure wackamole
#vi wackamole.conf
The other nodes then connect to Spread like this: Spread = 4803@server.dongwm.com
SpreadRetryInterval = 5s
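Group = test # This works like a distributed messaging system: once you join this group you can hear everyone in it. The spuser program enters this mode; there, j means join and l means leave. Worth exploring if you are interested.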
Control = /var/run/wack.it
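Prefer None # This offers a way to set a preferred owner. Our workload does not need it, so it is not set; see the PDF documentation on the official site for how to configure it.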
VirtualInterfaces {
{ eth0:192.168.9.109/32 }
{ eth0:192.168.9.112/32 }
{ eth0:192.168.9.113/32 }
} # these are the virtual IPs you want to manage
Arp-Cache = 90s
Notify {
eth0:192.168.8.1/32 # this is your router's address; it is very important
arp-cache
}
balance {
AcquisitionsPerRound = all
interval = 4s
}
mature = 5s
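The VirtualInterfaces block in the configuration above has one braced entry per VIP, so it is easy to produce from a list. A small sketch under that assumption; `virtual_interfaces` is a hypothetical helper, and the /32 mask and interface name simply mirror my configuration.

```shell
#!/bin/sh
# virtual_interfaces IFACE VIP...
# Print a VirtualInterfaces block with one "{ IFACE:VIP/32 }" entry per VIP.
virtual_interfaces() {
    iface=$1
    shift
    printf 'VirtualInterfaces {\n'
    for vip in "$@"; do
        printf '    { %s:%s/32 }\n' "$iface" "$vip"
    done
    printf '}\n'
}

# The three VIPs used in this post.
virtual_interfaces eth0 192.168.9.109 192.168.9.112 192.168.9.113
```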
6 Start the services and check the logs:
1 Create the spread user; skip this step if you configured a different user.
2 Create the /var/run/spread/ directory.
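The two preparation steps can be sketched as a small helper. This is an assumption-laden illustration: `prepare_spread_dir` is a hypothetical name, the defaults are the user and directory mentioned above, and creating the account itself (for example with useradd) is left to you.

```shell
#!/bin/sh
# prepare_spread_dir [DIR] [OWNER]
# Create the runtime directory and hand it to OWNER when that account exists.
prepare_spread_dir() {
    dir=${1:-/var/run/spread}
    owner=${2:-spread}
    mkdir -p "$dir" || return 1
    if id "$owner" >/dev/null 2>&1; then
        chown "$owner" "$dir" 2>/dev/null ||
            echo "could not chown $dir to $owner (are you root?)" >&2
    else
        echo "user $owner does not exist yet; create it first" >&2
    fi
}

# Usage, as root on each node:
#   prepare_spread_dir
```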
Start spread
Note: I also start this process on every machine.
/etc/init.d/spread start
Check the listening ports:
tcp 0 0 0.0.0.0:4803 0.0.0.0:* LISTEN 18318/spread
udp 0 0 0.0.0.0:4803 0.0.0.0:* 18318/spread
udp 0 0 0.0.0.0:4804 0.0.0.0:* 18318/spread
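Start wackamole:
Check the log:
It reports that the virtual IP interfaces have come up.
Once all three servers are started:
run ifconfig
and you will see that one VIP has floated onto each server: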
eth0 Link encap:Ethernet HWaddr 00:50:56:91:00:1B
inet addr:192.168.9.162 Bcast:192.168.255.255 Mask:255.255.0.0
inet6 addr: fe80::250:56ff:fe91:1b/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:1054714 errors:0 dropped:0 overruns:0 frame:0
TX packets:356497 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:123799512 (118.0 MiB) TX bytes:94783259 (90.3 MiB)
eth0:1 Link encap:Ethernet HWaddr 00:50:56:91:00:1B
inet addr:192.168.9.112 Bcast:192.168.9.112 Mask:255.255.255.255
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
root@test00:~$ ifconfig
eth0 Link encap:Ethernet HWaddr 00:50:56:91:00:13
inet addr:192.168.9.160 Bcast:192.168.255.255 Mask:255.255.0.0
inet6 addr: fe80::250:56ff:fe91:13/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:966721 errors:0 dropped:0 overruns:0 frame:0
TX packets:343226 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:182954295 (174.4 MiB) TX bytes:67258649 (64.1 MiB)
eth0:3 Link encap:Ethernet HWaddr 00:50:56:91:00:13
inet addr:192.168.9.113 Bcast:192.168.9.113 Mask:255.255.255.255
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
eth0 Link encap:Ethernet HWaddr 00:50:56:91:00:15
inet addr:192.168.9.161 Bcast:192.168.255.255 Mask:255.255.0.0
inet6 addr: fe80::250:56ff:fe91:15/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:869456 errors:0 dropped:0 overruns:0 frame:0
TX packets:162884 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:161753343 (154.2 MiB) TX bytes:40910624 (39.0 MiB)
eth0:1 Link encap:Ethernet HWaddr 00:50:56:91:00:15
inet addr:192.168.9.109 Bcast:192.168.9.109 Mask:255.255.255.255
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
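To see at a glance which VIPs a node currently holds, the alias interfaces can be pulled out of the ifconfig output. A minimal sketch, assuming the older ifconfig layout shown above (interface name in the first column, "inet addr:" on an indented line); `held_vips` is a hypothetical helper name.

```shell
#!/bin/sh
# held_vips: read ifconfig output on stdin and print the IPv4 address of
# every alias interface (a name containing ':', such as eth0:1).
held_vips() {
    awk '
        /^[^ \t]/ { is_alias = ($1 ~ /:/) }     # a stanza starts in column 1
        is_alias && /inet addr:/ {
            sub(/.*inet addr:/, "")
            print $1
        }
    '
}

# Usage on a node:
#   ifconfig | held_vips
```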
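7 Experiment
How it works: I wrote a script that checks whether certain processes and services on the local machine are abnormal. If something is wrong, the script stops the wackamole process on that machine; once the processes and services recover, the script starts wackamole again.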
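This is not my actual script, only a minimal sketch of the idea. The names HEALTH_CMD, WACK_INIT, `is_wack_running`, and `check_once` are hypothetical, and "pidof nginx" is just a placeholder for whatever service you need to watch.

```shell
#!/bin/sh
# Assumptions for illustration: the health check is a single shell command,
# and wackamole is controlled through its init script.
: "${HEALTH_CMD:=pidof nginx}"
: "${WACK_INIT:=/etc/init.d/wackamole}"

# Succeeds while a wackamole process exists.
is_wack_running() {
    pidof wackamole >/dev/null 2>&1
}

# check_once: one watchdog pass.
# Unhealthy service + running wackamole -> stop it, so the VIP moves away.
# Healthy service + stopped wackamole   -> start it, so the node rejoins.
check_once() {
    if sh -c "$HEALTH_CMD" >/dev/null 2>&1; then
        if ! is_wack_running; then
            "$WACK_INIT" start && echo started
        fi
    else
        if is_wack_running; then
            "$WACK_INIT" stop && echo stopped
        fi
    fi
}

# Usage (for example from a loop or cron):
#   while :; do check_once; sleep 5; done
```

Stopping wackamole rather than the whole machine is enough here, because the remaining nodes notice the member leaving the group and take over its VIP.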
Here I simulate a failure, and the script kills the process:
Stopping wackamole... [  OK  ]
Run ifconfig:
eth0 Link encap:Ethernet HWaddr 00:50:56:91:00:1B
inet addr:192.168.9.162 Bcast:192.168.255.255 Mask:255.255.0.0
inet6 addr: fe80::250:56ff:fe91:1b/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:1059629 errors:0 dropped:0 overruns:0 frame:0
TX packets:359237 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:124517323 (118.7 MiB) TX bytes:95293463 (90.8 MiB)
Running ifconfig on the other two servers shows (output from 192.168.9.160):
eth0 Link encap:Ethernet HWaddr 00:50:56:91:00:13
inet addr:192.168.9.160 Bcast:192.168.255.255 Mask:255.255.0.0
inet6 addr: fe80::250:56ff:fe91:13/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:972018 errors:0 dropped:0 overruns:0 frame:0
TX packets:345715 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:184164965 (175.6 MiB) TX bytes:67683923 (64.5 MiB)
eth0:1 Link encap:Ethernet HWaddr 00:50:56:91:00:13
inet addr:192.168.9.112 Bcast:192.168.9.112 Mask:255.255.255.255
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
eth0:3 Link encap:Ethernet HWaddr 00:50:56:91:00:13
inet addr:192.168.9.113 Bcast:192.168.9.113 Mask:255.255.255.255
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
^.^ Success, the VIP floated over.
Now look at the HA delay. I kept running ping 192.168.9.112 from another server the whole time:
64 bytes from 192.168.9.112: icmp_seq=145 ttl=64 time=0.333 ms
64 bytes from 192.168.9.112: icmp_seq=146 ttl=64 time=0.414 ms
64 bytes from 192.168.9.112: icmp_seq=147 ttl=64 time=0.346 ms
64 bytes from 192.168.9.112: icmp_seq=148 ttl=64 time=0.373 ms
64 bytes from 192.168.9.112: icmp_seq=149 ttl=64 time=0.333 ms
64 bytes from 192.168.9.112: icmp_seq=150 ttl=64 time=0.313 ms
64 bytes from 192.168.9.112: icmp_seq=151 ttl=64 time=0.323 ms
64 bytes from 192.168.9.112: icmp_seq=152 ttl=64 time=0.324 ms
64 bytes from 192.168.9.112: icmp_seq=153 ttl=64 time=0.432 ms
64 bytes from 192.168.9.112: icmp_seq=154 ttl=64 time=0.510 ms
64 bytes from 192.168.9.112: icmp_seq=155 ttl=64 time=0.348 ms
64 bytes from 192.168.9.112: icmp_seq=156 ttl=64 time=0.303 ms
64 bytes from 192.168.9.112: icmp_seq=157 ttl=64 time=0.383 ms
64 bytes from 192.168.9.112: icmp_seq=158 ttl=64 time=0.365 ms
Look, no pause: the icmp_seq numbers are consecutive.
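Instead of eyeballing the sequence numbers, the gaps can be counted. A small sketch, assuming the Linux ping output format shown above; `seq_gaps` is a hypothetical helper name.

```shell
#!/bin/sh
# seq_gaps: read ping output on stdin and print how many icmp_seq numbers
# are missing between the first and the last reply seen.
seq_gaps() {
    awk '
        match($0, /icmp_seq=[0-9]+/) {
            # "icmp_seq=" is 9 characters long; the digits follow it
            seq = substr($0, RSTART + 9, RLENGTH - 9) + 0
            if (seen && seq > last + 1)
                missing += seq - last - 1
            last = seq
            seen = 1
        }
        END { print missing + 0 }
    '
}

# Usage while triggering a failover on another terminal:
#   ping -c 60 192.168.9.112 | seq_gaps
```

With the default one-second ping interval, the printed number is roughly the number of seconds during which the VIP did not answer.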
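Note: you can run the spmonitor command and choose option 0 to view the status of each node.
8 My improved startup scripts (with respect to the original authors; I have only modified them):
1 spread: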
#!/bin/sh
#
# spread This starts and stops spread
#
# chkconfig: 345 90 10
# description: This starts the spread daemon
#
# processname: spread
# config: /etc/spread.conf
# pidfile:/var/run/spread.pid
DAEMON=/usr/sbin/spread
CONFIG=/etc/spread.conf
LOG=/your/path/spread.log
HOST=`uname -n`
NAME="spread"
RETVAL=0
#Source function library.
. /etc/rc.d/init.d/functions
start() {
echo -n "Starting $NAME..."
# Run spread in the background, appending stdout and stderr to $LOG
daemon "$DAEMON -c $CONFIG >>$LOG 2>&1 &"
RETVAL=$?
[ "$RETVAL" = 0 ] && touch /var/lock/subsys/$NAME
echo
}
stop() {
echo -n "Stopping $NAME..."
killproc $DAEMON
RETVAL=$?
[ "$RETVAL" = 0 ] && rm -f /var/lock/subsys/$NAME
echo
}
case "$1" in
start)
start
;;
stop)
stop
;;
restart)
stop
start
;;
status)
status $NAME
RETVAL=$?
;;
*)
echo $"Usage: $0 {start|stop|restart|status}"
RETVAL=1
esac
exit $RETVAL
2 wackamole
#!/bin/sh
#
# wackamole This starts and stops wackamole
#
# chkconfig: 345 95 05
# description: This starts the wackamole daemon
#
# requires: spread
# processname: wackamole
# config: /etc/wackamole.conf
# pidfile:/var/run/wackamole.pid
DAEMON=/usr/sbin/wackamole
CONFIG=/etc/wackamole.conf
NAME="wackamole"
RETVAL=0
#Source function library.
. /etc/rc.d/init.d/functions
start() {
echo -n "Starting $NAME..."
daemon $DAEMON -c $CONFIG
RETVAL=$?
[ "$RETVAL" = 0 ] && touch /var/lock/subsys/$NAME
echo
}
stop() {
echo -n "Stopping $NAME..."
killproc $DAEMON
RETVAL=$?
[ "$RETVAL" = 0 ] && rm -f /var/lock/subsys/$NAME
echo
}
case "$1" in
start)
start
;;
stop)
stop
;;
restart)
stop
start
;;
status)
status $NAME
RETVAL=$?
;;
*)
echo $"Usage: $0 {start|stop|restart|status}"
RETVAL=1
esac
exit $RETVAL