Znodes maintain a stat structure that includes version numbers for data changes and ACL changes, as well as timestamps. The version number, together with the timestamp, allows ZooKeeper to validate the cache and to coordinate updates. Each time a znode's data changes, the version number increases.
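Whenever a client retrieves data it also receives the znode's version, and an update or delete must supply the version it expects; if that version no longer matches, the operation fails. The sketch below shows this conditional update with the standard Java client; the connection string, timeout, and path are placeholder values for illustration.

```java
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ConditionalUpdate {
    public static void main(String[] args) throws Exception {
        // Placeholder ensemble address and session timeout; adjust for your setup.
        ZooKeeper zk = new ZooKeeper("127.0.0.1:2181", 30000, event -> { });

        Stat stat = new Stat();
        byte[] data = zk.getData("/app/config", false, stat);  // stat is filled in by the server
        System.out.println("read " + data.length + " bytes at data version " + stat.getVersion());

        try {
            // The update only succeeds if nobody has changed the data since we read it:
            // the supplied version must match the znode's current data version.
            zk.setData("/app/config", "new-value".getBytes(), stat.getVersion());
        } catch (KeeperException.BadVersionException e) {
            // Someone else updated the znode first; re-read and retry if appropriate.
        }

        zk.close();
    }
}
```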
Ephemeral Nodes
ZooKeeper also has the notion of ephemeral nodes. These znodes exist as long as the session that created the znode is active. When the session ends, the znode is deleted. Because of this behavior, ephemeral znodes are not allowed to have children.
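A minimal sketch of creating an ephemeral znode with the Java client (the ensemble address and paths are placeholders); the server removes the node as soon as the creating session ends:

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class EphemeralExample {
    public static void main(String[] args) throws Exception {
        // Placeholder ensemble address; /workers must already exist (parents are not auto-created).
        ZooKeeper zk = new ZooKeeper("127.0.0.1:2181", 30000, event -> { });

        // The znode lives only as long as this session: when the session closes
        // or expires, the server deletes /workers/worker-1 automatically.
        zk.create("/workers/worker-1",
                  "host:port".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE,
                  CreateMode.EPHEMERAL);

        Thread.sleep(60_000);  // while this process runs, the node remains visible
        zk.close();            // the ephemeral node is removed here
    }
}
```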
Sequence Nodes -- Unique Naming
When creating a znode you can also request that ZooKeeper append a monotonically increasing counter to the end of the path. This counter is unique to the parent znode. The counter has a format of %010d -- that is, 10 digits with 0 (zero) padding, e.g. "<path>0000000001". Note: the counter used to store the next sequence number is a signed int (4 bytes) maintained by the parent node.
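For illustration, the sketch below requests a sequential znode with the Java client (the parent path /queue and the name prefix are illustrative and assumed to exist); the create() call returns the actual path including the server-assigned counter:

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class SequenceExample {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("127.0.0.1:2181", 30000, event -> { });

        // The server appends a 10-digit, zero-padded counter that is unique
        // under the parent /queue, e.g. /queue/task-0000000001.
        String actualPath = zk.create("/queue/task-",
                                      new byte[0],
                                      ZooDefs.Ids.OPEN_ACL_UNSAFE,
                                      CreateMode.PERSISTENT_SEQUENTIAL);
        System.out.println("created " + actualPath);

        zk.close();
    }
}
```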
Time in ZooKeeper
ZooKeeper tracks time multiple ways:
- Zxid: Every change to the ZooKeeper state receives a stamp in the form of a zxid (ZooKeeper Transaction Id). This exposes the total ordering of all changes to ZooKeeper. Each change will have a unique zxid, and if zxid1 is smaller than zxid2 then zxid1 happened before zxid2.
- Version numbers: Every change to a node will cause an increase to one of the version numbers of that node. The three version numbers are version (number of changes to the data of a znode), cversion (number of changes to the children of a znode), and aversion (number of changes to the ACL of a znode).
- Ticks: When using multi-server ZooKeeper, servers use ticks to define the timing of events such as status uploads, session timeouts, connection timeouts between peers, etc.
- Real time: ZooKeeper doesn't use real time, or clock time, at all except to put timestamps into the stat structure on znode creation and znode modification.
ZooKeeper Stat Structure
The Stat structure for each znode in ZooKeeper is made up of the following fields:
- czxid: The zxid of the change that caused this znode to be created.
- mzxid: The zxid of the change that last modified this znode.
- ctime: The time in milliseconds from epoch when this znode was created.
- mtime: The time in milliseconds from epoch when this znode was last modified.
- version: The number of changes to the data of this znode.
- cversion: The number of changes to the children of this znode.
- aversion: The number of changes to the ACL of this znode.
- ephemeralOwner: The session id of the owner of this znode if the znode is an ephemeral node. If it is not an ephemeral node, it will be zero.
- dataLength: The length of the data field of this znode.
- numChildren: The number of children of this znode.
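These fields map directly onto the Java client's org.apache.zookeeper.data.Stat accessors. The sketch below (placeholder connection string and path) fetches the stat with exists() and prints some of them:

```java
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class StatExample {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("127.0.0.1:2181", 30000, event -> { });

        // exists() returns the Stat of the znode, or null if it does not exist.
        Stat stat = zk.exists("/app/config", false);
        if (stat != null) {
            System.out.println("czxid=" + stat.getCzxid());
            System.out.println("mzxid=" + stat.getMzxid());
            System.out.println("ctime=" + stat.getCtime());        // ms since epoch
            System.out.println("version=" + stat.getVersion());    // data changes
            System.out.println("cversion=" + stat.getCversion());  // child changes
            System.out.println("aversion=" + stat.getAversion());  // ACL changes
            System.out.println("ephemeralOwner=" + stat.getEphemeralOwner());
            System.out.println("dataLength=" + stat.getDataLength());
            System.out.println("numChildren=" + stat.getNumChildren());
        }

        zk.close();
    }
}
```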
ZooKeeper Sessions
A ZooKeeper client establishes a session with the ZooKeeper service by creating a handle to the service using a language binding. The handle then moves from the CONNECTING state to the CONNECTED state once a connection to a server is established. If an unrecoverable error occurs, such as session expiration, authentication failure, or the application closing the handle, the handle moves to the CLOSED state.
To create a client session the application code must provide a connection string containing a comma separated list of host:port pairs, each corresponding to a ZooKeeper server (e.g. "127.0.0.1:4545" or "127.0.0.1:3000,127.0.0.1:3001,127.0.0.1:3002"). The ZooKeeper client library will pick an arbitrary server and try to connect to it. If this connection fails, or if the client becomes disconnected from the server for any reason, the client will automatically try the next server in the list, until a connection is (re-)established.
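A minimal sketch of establishing a session with the Java client, using placeholder addresses and timeout; the Watcher passed to the constructor is notified of session state changes, and a latch is used here to wait until the handle reaches the CONNECTED state:

```java
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class SessionExample {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);

        // Comma-separated host:port list; the client picks a server and
        // fails over to the others if the connection is lost.
        ZooKeeper zk = new ZooKeeper(
                "127.0.0.1:3000,127.0.0.1:3001,127.0.0.1:3002",
                30000,                        // requested session timeout in ms
                (WatchedEvent event) -> {
                    if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                        connected.countDown();
                    }
                });

        connected.await();                    // block until the handle is CONNECTED
        System.out.println("session id: 0x" + Long.toHexString(zk.getSessionId()));

        zk.close();                           // ends the session (CLOSED)
    }
}
```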
Session expiration is managed by the ZooKeeper cluster itself, not by the client. When the ZK client establishes a session with the cluster it provides a "timeout" value detailed above. When the session times out, the cluster deletes the ephemeral nodes owned by that session and immediately notifies the clients affected by the change (anyone watching those znodes).
Example state transitions for an expired session as seen by the expired session's watcher:
1. 'connected' : session is established and client is communicating with cluster (client/server communication is operating properly)
2. .... client is partitioned from the cluster
3. 'disconnected' : client has lost connectivity with the cluster
4. .... time elapses, after 'timeout' period the cluster expires the session, nothing is seen by client as it is disconnected from cluster
5. .... time elapses, the client regains network level connectivity with the cluster
6. 'expired' : eventually the client reconnects to the cluster, it is then notified of the expiration
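A sketch of a session watcher that logs these transitions with the Java client (the log messages are illustrative); session state changes arrive as events whose type is None:

```java
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;

// A session watcher that logs the state transitions described above.
public class SessionStateLogger implements Watcher {
    @Override
    public void process(WatchedEvent event) {
        if (event.getType() != Event.EventType.None) {
            return;  // ignore znode watch events here; only session state changes
        }
        switch (event.getState()) {
            case SyncConnected:
                System.out.println("connected: session established");
                break;
            case Disconnected:
                System.out.println("disconnected: lost connectivity, session may still be alive");
                break;
            case Expired:
                System.out.println("expired: the cluster expired the session, a new handle is needed");
                break;
            default:
                System.out.println("state: " + event.getState());
        }
    }
}
```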
Once a connection to the server is successfully established (connected), there are basically two cases in which the client library generates a connection loss when a synchronous or asynchronous operation is performed and one of the following holds:
1. The application calls an operation on a session that is no longer alive/valid, or
2. The ZooKeeper client disconnects from a server when there are pending operations to that server, i.e., there is a pending asynchronous call.
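With the synchronous Java API, a connection loss surfaces as KeeperException.ConnectionLossException. The sketch below retries a read a few times after such a failure; the retry count and back-off are illustrative choices, not something ZooKeeper prescribes:

```java
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ConnectionLossRetry {
    // Retries a read a few times on connection loss; the retry count and
    // back-off are illustrative, not prescribed by ZooKeeper.
    static byte[] readWithRetry(ZooKeeper zk, String path) throws Exception {
        for (int attempt = 0; attempt < 3; attempt++) {
            try {
                return zk.getData(path, false, new Stat());
            } catch (KeeperException.ConnectionLossException e) {
                Thread.sleep(1000);  // wait for the client to reconnect, then retry
            }
        }
        throw new KeeperException.ConnectionLossException();
    }
}
```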
ZooKeeper Watches
Here is ZooKeeper's definition of a watch: a watch event is a one-time trigger, sent to the client that set the watch, which occurs when the data for which the watch was set changes. There are three key points to consider in this definition of a watch:
- One-time trigger: One watch event will be sent to the client when the data has changed. For example, if a client does a getData("/znode1", true) and later the data for /znode1 is changed or deleted, the client will get a watch event for /znode1. If /znode1 changes again, no watch event will be sent unless the client has done another read that sets a new watch.
- Sent to the client: This implies that an event is on the way to the client, but may not reach the client before the successful return code to the change operation reaches the client that initiated the change. Watches are sent asynchronously to watchers. ZooKeeper provides an ordering guarantee: a client will never see a change for which it has set a watch until it first sees the watch event.
- The data for which the watch was set: This refers to the different ways a node can change. It helps to think of ZooKeeper as maintaining two lists of watches: data watches and child watches. getData() and exists() set data watches. getChildren() sets child watches. Alternatively, it may help to think of watches being set according to the kind of data returned. getData() and exists() return information about the data of the node, whereas getChildren() returns a list of children. Thus, setData() will trigger data watches for the znode being set (assuming the set is successful). A successful create() will trigger a data watch for the znode being created and a child watch for the parent znode. A successful delete() will trigger both a data watch and a child watch (since there can be no more children) for the znode being deleted, as well as a child watch for the parent znode.
Watches are maintained locally at the ZooKeeper server to which the client is connected. This allows watches to be lightweight to set, maintain, and dispatch. When a client connects to a new server, the watch will be triggered for any session events. Watches will not be received while disconnected from a server. When a client reconnects, any previously registered watches will be reregistered and triggered if needed. In general this all occurs transparently. There is one case where a watch may be missed: a watch for the existence of a znode not yet created will be missed if the znode is created and deleted while disconnected.
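Because a watch is a one-time trigger, a client that wants to keep observing a znode must set a new watch while handling each event. The sketch below does this with the Java client; the path and the handling logic are illustrative:

```java
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

// Watches /app/config and re-registers the data watch on every event.
public class ConfigWatcher implements Watcher {
    private final ZooKeeper zk;

    ConfigWatcher(ZooKeeper zk) throws KeeperException, InterruptedException {
        this.zk = zk;
        readAndWatch();  // initial read sets the first data watch
    }

    private void readAndWatch() throws KeeperException, InterruptedException {
        // Passing 'this' as the watcher sets a one-time data watch on the znode.
        byte[] data = zk.getData("/app/config", this, new Stat());
        System.out.println("config is now: " + new String(data));
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == Event.EventType.NodeDataChanged) {
            try {
                readAndWatch();  // consume the change and set a new watch
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}
```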
Consistency Guarantees
ZooKeeper is a high performance, scalable service. Both read and write operations are designed to be fast, though reads are faster than writes. The reason for this is that in the case of reads, ZooKeeper can serve older data, which in turn is due to ZooKeeper's consistency guarantees:
- Sequential Consistency: Updates from a client will be applied in the order that they were sent.
- Atomicity: Updates either succeed or fail -- there are no partial results.
- Single System Image: A client will see the same view of the service regardless of the server that it connects to.
- Reliability: Once an update has been applied, it will persist from that time forward until a client overwrites the update. This guarantee has two corollaries:
  1. If a client gets a successful return code, the update will have been applied. On some failures (communication errors, timeouts, etc.) the client will not know if the update has applied or not. We take steps to minimize the failures, but the guarantee is only present with successful return codes. (This is called the monotonicity condition in Paxos.)
  2. Any updates that are seen by the client, through a read request or successful update, will never be rolled back when recovering from server failures.
- Timeliness: The client's view of the system is guaranteed to be up-to-date within a certain time bound (on the order of tens of seconds). Either system changes will be seen by a client within this bound, or the client will detect a service outage.
Using these consistency guarantees it is easy to build higher level functions such as leader election, barriers, queues, and read/write revocable locks solely at the ZooKeeper client (no additions needed to ZooKeeper). See Recipes and Solutions for more details.
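As a practical note on these guarantees: different clients may momentarily see different (but valid) views of the data, so when a read must reflect the latest committed writes, the Java client's asynchronous sync() can be called before reading, as in this sketch (path and latch handling are illustrative):

```java
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class SyncThenRead {
    // Brings the connected server up to date with the leader before reading,
    // so the read reflects writes committed before the sync completed.
    static byte[] syncedRead(ZooKeeper zk, String path) throws Exception {
        CountDownLatch done = new CountDownLatch(1);
        zk.sync(path, (rc, p, ctx) -> done.countDown(), null);  // sync is asynchronous
        done.await();
        return zk.getData(path, false, new Stat());
    }
}
```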