ZooKeeper: A Distributed Coordination Service for Distributed Applications
ZooKeeper is a distributed, open-source coordination service for distributed applications. It exposes a simple set of primitives on top of which distributed applications can build higher-level services for synchronization, configuration maintenance, groups, and naming. It is designed to be easy to program against, and it uses a data model styled after the familiar directory tree structure of file systems.
Coordination services are notoriously hard to get right; they are especially prone to errors such as race conditions and deadlock. The goal of ZooKeeper is to relieve distributed applications of the burden of implementing coordination services from scratch.
Design Goals
ZooKeeper is simple
ZooKeeper allows distributed processes to coordinate with each other through a shared, hierarchical namespace organized much like a file system; the namespace consists of data registers that we call znodes. Unlike a file system designed for storage, ZooKeeper keeps its data in memory, which allows it to achieve high throughput and low latency.
The ZooKeeper implementation puts a premium on high performance, high availability, and strictly ordered access. The performance aspects mean it can be used in large distributed systems; the reliability aspects keep it from being a single point of failure; and the strict ordering allows clients to implement sophisticated synchronization primitives.
ZooKeeper is replicated
Like the distributed processes it coordinates, ZooKeeper itself is intended to be replicated over a set of hosts.
The servers that make up the ZooKeeper service must all know about each other. As long as a majority of the servers are available, the ZooKeeper service will be available. Clients, too, must know the full list of servers; with this list, a client creates a handle to the ZooKeeper service.
A client connects to exactly one server of the ZooKeeper service. The client maintains a TCP connection through which it sends requests, gets responses, gets watch events, and sends heartbeats. If the TCP connection breaks, the client will connect to a different server. When a client first connects to the ZooKeeper service, the server that accepts the connection establishes a session for the client; when the client later connects to another server, the session is re-established with the new server.
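As an illustration (the server names, port, and timeout below are placeholders, not taken from this document), creating a handle with the Java client might look roughly like this; the handle takes the server list and a session timeout, and the watcher passed to the constructor is notified of connection-state changes:

    import java.util.concurrent.CountDownLatch;

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    public class CreateHandle {
        public static void main(String[] args) throws Exception {
            final CountDownLatch connected = new CountDownLatch(1);

            // The connect string lists the ensemble members; the client picks one
            // server and fails over to another one if that connection breaks.
            ZooKeeper zk = new ZooKeeper(
                    "zoo1:2181,zoo2:2181,zoo3:2181",  // placeholder host names
                    15000,                             // session timeout in milliseconds
                    new Watcher() {
                        public void process(WatchedEvent event) {
                            // Connection-state changes (SyncConnected, Disconnected,
                            // Expired, ...) are delivered here.
                            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                                connected.countDown();
                            }
                        }
                    });

            connected.await();  // block until the session is established
            System.out.println("session id: 0x" + Long.toHexString(zk.getSessionId()));
            zk.close();
        }
    }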
ZooKeeper is ordered
ZooKeeper stamps each update with a number that reflects the order of all ZooKeeper transactions. We call this number the zxid (ZooKeeper Transaction Id); every update carries a unique zxid. Reads (and watches) are ordered with respect to updates: every read response is tagged with the last zxid processed by the server that served it.
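Clients can see these transaction ids through a znode's stat structure; for example, a small hypothetical helper using the Java client (the handle and path are assumed to already exist):

    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    public class ZxidInfo {
        // Print the creation and last-modification zxids of an existing znode.
        static void printZxids(ZooKeeper zk, String path) throws Exception {
            Stat stat = new Stat();
            zk.getData(path, false, stat);  // fills in the stat structure
            System.out.println("created by zxid        0x" + Long.toHexString(stat.getCzxid()));
            System.out.println("last modified by zxid  0x" + Long.toHexString(stat.getMzxid()));
        }
    }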
ZooKeeper is fast
It is especially fast in read-dominant workloads. ZooKeeper applications run on thousands of machines, and it performs best where reads outnumber writes, at ratios of around 10:1.
Data model and the hierarchical namespace
The namespace provided by ZooKeeper is much like that of a standard file system. A name is a sequence of path elements separated by a slash (/), and every znode in ZooKeeper's namespace is identified by such a path. Except for the root (/), every znode has a parent whose path is a prefix of the znode's path, with one fewer element. And much like a standard file system, a znode cannot be deleted if it has children.
Nodes and ephemeral nodes
Unlike a standard file system, a node in the ZooKeeper namespace can have data associated with it as well as children. It is as if we had a file system that allows a file to also be a directory. (ZooKeeper was designed to store coordination data: status information, configuration, location information, and so on, so the data stored at each node is usually small, in the byte-to-kilobyte range.) We use the term znode to make it clear that we are talking about ZooKeeper data nodes.
Znodes maintain a stat structure that includes version numbers and timestamps, used for cache validation and coordinated updates. Each time a znode's data or ACL changes, its version number increases; whenever a client retrieves a znode's data, it also receives the data's version.
The data stored at each znode is read and written atomically. A read returns all the data bytes associated with the znode, and a write replaces all of the data. Each node has an Access Control List (ACL) that restricts who can do what.
ZooKeeper also has the notion of ephemeral nodes. These znodes exist as long as the session that created them is active; when the session ends, the znode is deleted. Ephemeral nodes are useful when you want to implement [tbd].
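To make the version and ephemeral mechanics concrete, here is a minimal sketch using the Java client (paths and data are made up for illustration): the stat returned alongside the data carries the version, a setData with that version only succeeds if nobody updated the znode in between, and a node created as EPHEMERAL disappears when its session ends.

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    public class ZnodeBasics {
        // Assumes an already-connected handle "zk".
        static void versionAndEphemeralDemo(ZooKeeper zk) throws Exception {
            // A regular (persistent) znode holding a small piece of configuration.
            zk.create("/app1", "conf-v1".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

            // Read the data; the stat that comes back carries the current version.
            Stat stat = new Stat();
            zk.getData("/app1", false, stat);

            try {
                // Conditional update: applied only if the version still matches,
                // i.e. no other client has changed the znode since our read.
                zk.setData("/app1", "conf-v2".getBytes(), stat.getVersion());
            } catch (KeeperException.BadVersionException e) {
                // Someone else updated the znode first; re-read and retry if needed.
            }

            // An ephemeral znode: deleted automatically when this session ends.
            zk.create("/app1/alive-0001", new byte[0],
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        }
    }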
Conditional updates and watches
ZooKeeper supports the concept of watches. A client can set a watch on a znode. The watch is triggered and removed when the znode changes. When a watch is triggered, the client receives a packet notifying it that the znode has changed. And if the connection between the client and one of the ZooKeeper servers is broken (note: the connection, not the session), the client will receive a local notification. These can be used to [tbd].
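A minimal watch sketch with the Java client (the path is illustrative): the watch is left by the read call itself, and since a triggered watch is removed, it has to be re-registered if further notifications are wanted.

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    public class ConfigWatcher implements Watcher {
        private final ZooKeeper zk;

        public ConfigWatcher(ZooKeeper zk) throws Exception {
            this.zk = zk;
            readAndWatch();
        }

        // Read the znode and leave a watch on it in the same call.
        private void readAndWatch() throws Exception {
            byte[] data = zk.getData("/app1/config", this, new Stat());
            System.out.println("config is now: " + new String(data));
        }

        @Override
        public void process(WatchedEvent event) {
            if (event.getType() == Event.EventType.NodeDataChanged) {
                try {
                    // The watch was consumed when it fired; re-register by reading again.
                    readAndWatch();
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
            // An event of type None with a Disconnected state is the local
            // notification described above.
        }
    }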
Guarantees
ZooKeeper is very fast and very simple. Since its goal is to be a basis for the construction of more complicated services, such as synchronization, it provides a set of guarantees:
- Sequential Consistency – Updates from a client will be applied in the order in which they were sent.
- Atomicity – Updates either succeed or fail; there are no partial results.
- Single System Image – A client will see the same view of the service regardless of the server it connects to.
- Reliability – Once an update has been applied, it will persist from that time forward until another update overwrites it.
- Timeliness – The client's view of the system is guaranteed to be up-to-date within a certain time bound.
For more information on these guarantees, and how they can be used, see [tbd].
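One common way these guarantees are exploited (a sketch, not taken from this document; znode names are made up): because a client's updates are applied in order, a writer can publish several pieces of configuration and create a ready flag last; any reader that sees the flag is guaranteed to also see everything written before it.

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ReadyFlag {
        // Writer: publish the parts first, create the flag last. Sequential
        // consistency means the flag cannot become visible before the parts.
        // Assumes the parent znode /app1/config already exists.
        static void publish(ZooKeeper zk) throws Exception {
            zk.create("/app1/config/part-1", "a=1".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            zk.create("/app1/config/part-2", "b=2".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            zk.create("/app1/config/ready", new byte[0],
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Reader: only consume the configuration once the flag exists.
        static boolean configComplete(ZooKeeper zk) throws Exception {
            return zk.exists("/app1/config/ready", false) != null;
        }
    }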
Simple API
One of the design goals of ZooKeeper is to provide a very simple programming interface. As a result, it supports only these operations:
create
creates a node at a location in the tree
delete
deletes a node
exists
tests if a node exists at a location
get data
reads the data (a byte array) from a node
set data
writes data to a node (replacing, not appending)
get children
retrieves the list of children of a node
sync
waits for data to be propagated
For a more in-depth discussion of these operations, and how they can be used to implement higher-level operations, please refer to [tbd].
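As a rough, hypothetical walkthrough of the operations above using the Java client (paths and data are made up; sync is shown in its asynchronous form, which is the form the Java API exposes):

    import java.util.List;

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    public class ApiTour {
        // Assumes an already-connected handle "zk".
        static void tour(ZooKeeper zk) throws Exception {
            // create: a node at a location in the tree
            zk.create("/demo", "hello".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            zk.create("/demo/child-0001", new byte[0],
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

            // exists: returns the node's stat, or null if it does not exist
            Stat stat = zk.exists("/demo", false);

            // get data / set data: whole-znode read and replacing write
            byte[] data = zk.getData("/demo", false, stat);
            zk.setData("/demo", "world".getBytes(), stat.getVersion());

            // get children: the list of child names
            List<String> children = zk.getChildren("/demo", false);
            System.out.println("children of /demo: " + children + ", data was: " + new String(data));

            // sync: asynchronously waits for the client's view to catch up
            zk.sync("/demo", (rc, path, ctx) -> { /* completion callback */ }, null);

            // delete: children first, then the parent (-1 matches any version)
            zk.delete("/demo/child-0001", -1);
            zk.delete("/demo", -1);
        }
    }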
Implementation
The figure ZooKeeper Components shows the high-level components of the ZooKeeper service. With the exception of the request processor, each of the servers that make up the ZooKeeper service replicates its own copy of each of the components.
The replicated database is an in-memory database containing the entire data tree. Updates are logged to disk for recoverability, and writes are serialized to disk before they are applied to the in-memory database.
Every ZooKeeper server services clients. Clients connect to exactly one server to submit requests. Read requests are serviced from the local replica of each server's database. Requests that change the state of the service, write requests, are processed by an agreement protocol.
As part of the agreement protocol, all write requests from clients are forwarded to a single server, called the leader. The rest of the ZooKeeper servers, called followers, receive message proposals from the leader and agree upon message delivery. The messaging layer also takes care of replacing leaders on failures and syncing followers with leaders.
ZooKeeper uses a custom atomic messaging protocol. Since the messaging layer is atomic, ZooKeeper can guarantee that the local replicas never diverge. When the leader receives a write request, it calculates what the state of the system will be when the write is applied and transforms this into a transaction that captures the new state.
Uses
The programming interface to ZooKeeper is deliberately simple. With it, however, you can implement higher order operations, such as synchronization primitives, group membership, ownership, etc. Some distributed applications have used it to: [tbd: add uses from white paper and video presentation.] For more information, see [tbd]
Performance
ZooKeeper is designed to be highly performant. But is it? The results of the ZooKeeper’s development team at Yahoo! Research indicate that it is. (See ZooKeeper Throughput as the Read-Write Ratio Varies.) It is especially high performance in applications where reads outnumber writes, since writes involve synchronizing the state of all servers. (Reads outnumbering writes is typically the case for a coordination service.)
ZooKeeper Throughput as the Read-Write Ratio Varies
The figure ZooKeeper Throughput as the Read-Write Ratio Varies is a throughput graph of ZooKeeper release 3.2 running on servers with dual 2GHz Xeon processors and two SATA 15K RPM drives. One drive was used as a dedicated ZooKeeper log device. The snapshots were written to the OS drive. Write requests were 1K writes and the reads were 1K reads. “Servers” indicate the size of the ZooKeeper ensemble, the number of servers that make up the service. Approximately 30 other servers were used to simulate the clients. The ZooKeeper ensemble was configured such that leaders do not allow connections from clients.
In version 3.2 r/w performance improved by ~2x compared to the previous 3.1 release.
Benchmarks indicate that it is reliable, too. The figure Reliability in the Presence of Errors shows how a deployment responds to various failures. The events marked in the figure are the following:
- Failure and recovery of a follower
- Failure and recovery of a different follower
- Failure of the leader
- Failure and recovery of two followers
- Failure of another leader
Reliability
To show the behavior of the system over time as failures are injected we ran a ZooKeeper service made up of 7 machines. We ran the same saturation benchmark as before, but this time we kept the write percentage at a constant 30%, which is a conservative ratio of our expected workloads.
Reliability in the Presence of Errors
There are a few important observations from this graph. First, if followers fail and recover quickly, ZooKeeper is able to sustain a high throughput despite the failure. Second, and maybe more importantly, the leader election algorithm allows the system to recover fast enough to prevent throughput from dropping substantially; in our observations, ZooKeeper takes less than 200ms to elect a new leader. Third, as followers recover, ZooKeeper is able to raise throughput again once they start processing requests.
The ZooKeeper Project
ZooKeeper has been successfully used in many industrial applications. It is used at Yahoo! as the coordination and failure recovery service for Yahoo! Message Broker, which is a highly scalable publish-subscribe system managing thousands of topics for replication and data delivery. It is used by the Fetching Service for Yahoo! crawler, where it also manages failure recovery. A number of Yahoo! advertising systems also use ZooKeeper to implement reliable services.
All users and developers are encouraged to join the community and contribute their expertise. See the ZooKeeper Project on Apache for more information.