An Analysis of etcd's Core Mechanisms

Overall etcd Mechanism

etcd is a distributed, reliable key-value store suited to holding the critical data of a distributed system.

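Nodes in an etcd cluster coordinate through the Raft algorithm to reach distributed consensus. Raft elects one node as the leader, which is responsible for synchronizing and distributing data. When the leader fails, the cluster automatically elects another node as leader and resynchronizes the data.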

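etcd's high availability rests on the quorum mechanism: the cluster keeps serving requests only while more than half of its nodes are available. Quorum is used widely in distributed consensus algorithms and is not covered in further detail here.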
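The quorum rule can be made concrete with a little arithmetic. The sketch below is only an illustration; the helper names `quorum` and `faultTolerance` are hypothetical and are not part of etcd.

```go
package main

import "fmt"

// quorum returns the minimum number of members that must be available
// for a cluster of n members to keep serving requests.
func quorum(n int) int { return n/2 + 1 }

// faultTolerance returns how many members may fail while the
// remaining ones still form a quorum.
func faultTolerance(n int) int { return n - quorum(n) }

func main() {
	for _, n := range []int{1, 3, 4, 5} {
		fmt.Printf("members=%d quorum=%d tolerated failures=%d\n",
			n, quorum(n), faultTolerance(n))
	}
}
```

A 4-member cluster tolerates one failure, the same as a 3-member cluster, which is why odd cluster sizes are the usual choice.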

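Raft data updates, and therefore etcd write calls, follow a two-phase process:

Phase 1: the caller sends the request to the leader; the leader records the key-value data in its log as an uncommitted entry and uses the Raft consensus algorithm to replicate that entry to the followers; each follower persists the entry and acknowledges it.

Phase 2: once the entry has been replicated to a majority of nodes (N+1 in a cluster of 2N+1), the leader commits it locally and returns success to the client; finally, the leader asynchronously notifies the followers to commit.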
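The two phases can be sketched as a toy model. This is not etcd's Raft implementation: the `node`, `entry`, and `propose` names are hypothetical, there are no terms or elections, and `nodes[0]` is simply assumed to be a healthy leader.

```go
package main

import "fmt"

// entry is one log record; committed flips to true in phase two.
type entry struct {
	key, val  string
	committed bool
}

// node is a hypothetical cluster member that holds only a log.
type node struct {
	up  bool
	log []entry
}

// appendEntry models a member persisting an uncommitted entry and
// acknowledging it; a member that is down does not acknowledge.
func (n *node) appendEntry(e entry) bool {
	if !n.up {
		return false
	}
	n.log = append(n.log, e)
	return true
}

// propose runs both phases for one write and reports whether it committed.
// nodes[0] plays the leader and is assumed to be up.
func propose(nodes []*node, key, val string) bool {
	// Phase 1: log the entry as uncommitted on the leader and replicate it.
	acks := 0
	for _, n := range nodes {
		if n.appendEntry(entry{key: key, val: val}) {
			acks++
		}
	}
	// Phase 2: commit only if a majority holds the entry.
	if acks < len(nodes)/2+1 {
		return false
	}
	for _, n := range nodes {
		if n.up {
			n.log[len(n.log)-1].committed = true
		}
	}
	return true
}

func main() {
	cluster := []*node{{up: true}, {up: true}, {up: false}}
	fmt.Println(propose(cluster, "k", "v1")) // true: 2 of 3 acknowledged
	cluster[1].up = false
	fmt.Println(propose(cluster, "k", "v2")) // false: only 1 of 3 acknowledged
}
```

In the second call the leader still holds the entry in its log, but it stays uncommitted because no majority acknowledged it.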

etcd Core API Analysis

The APIs that etcd provides fall mainly into three groups: KV, lease, and watch. The client source code defines them as follows:

KV interface:
```
type KV interface {
	// Put puts a key-value pair into etcd.
	// Note that key,value can be plain bytes array and string is
	// an immutable representation of that bytes array.
	// To get a string of bytes, do string([]byte{0x10, 0x20}).
	Put(ctx context.Context, key, val string, opts ...OpOption) (*PutResponse, error)

	// Get retrieves keys.
	// By default, Get will return the value for "key", if any.
	// When passed WithRange(end), Get will return the keys in the range [key, end).
	// When passed WithFromKey(), Get returns keys greater than or equal to key.
	// When passed WithRev(rev) with rev > 0, Get retrieves keys at the given revision;
	// if the required revision is compacted, the request will fail with ErrCompacted .
	// When passed WithLimit(limit), the number of returned keys is bounded by limit.
	// When passed WithSort(), the keys will be sorted.
	Get(ctx context.Context, key string, opts ...OpOption) (*GetResponse, error)

	// Delete deletes a key, or optionally using WithRange(end), [key, end).
	Delete(ctx context.Context, key string, opts ...OpOption) (*DeleteResponse, error)

	// Compact compacts etcd KV history before the given rev.
	Compact(ctx context.Context, rev int64, opts ...CompactOption) (*CompactResponse, error)

	// Txn creates a transaction.
	Txn(ctx context.Context) Txn
}
```
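The main methods are Put, Get, Delete, Compact, Do, and Txn (Do is declared in the upstream interface but is omitted from the excerpt above). Put writes data to the etcd cluster, stored as a key-value pair; Get reads the data stored in etcd under a key; Delete removes data from etcd by deleting its key; Compact compacts the event history in the etcd key-value store so that it does not grow without bound; Txn processes multiple requests in a single transaction. The etcd transaction model is: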

if compare

then op

else op

commit
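To illustrate the if/then/else/commit shape, here is a minimal in-memory sketch. The `store`, `txn`, and `tryLock` names are hypothetical stand-ins, not the etcd client; in the real client the same shape is built by chaining `If`, `Then`, `Else`, and `Commit` on the value returned by `Txn`.

```go
package main

import (
	"fmt"
	"sync"
)

// store is a hypothetical in-memory stand-in for the etcd keyspace.
type store struct {
	mu   sync.Mutex
	data map[string]string
}

// txn evaluates compare under the lock and then applies exactly one branch,
// so the check and the writes happen atomically. It reports which branch ran.
func (s *store) txn(compare func(map[string]string) bool,
	then, otherwise func(map[string]string)) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	ok := compare(s.data)
	if ok {
		then(s.data)
	} else {
		otherwise(s.data)
	}
	return ok
}

// tryLock takes the "lock" key for owner only if nobody holds it yet.
func tryLock(s *store, owner string) bool {
	return s.txn(
		func(d map[string]string) bool { _, held := d["lock"]; return !held },
		func(d map[string]string) { d["lock"] = owner },
		func(d map[string]string) {}, // else branch: leave the store untouched
	)
}

func main() {
	s := &store{data: map[string]string{}}
	first := tryLock(s, "a")
	second := tryLock(s, "b")
	fmt.Println(first, second, s.data["lock"]) // true false a
}
```

Because the comparison and the write share one critical section, two callers can never both see the key as absent, which is the property that makes this pattern usable for locks and leader election.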

Lease interface:
```
type Lease interface {
	// Grant creates a new lease.
	Grant(ctx context.Context, ttl int64) (*LeaseGrantResponse, error)

	// Revoke revokes the given lease.
	Revoke(ctx context.Context, id LeaseID) (*LeaseRevokeResponse, error)

	// TimeToLive retrieves the lease information of the given lease ID.
	TimeToLive(ctx context.Context, id LeaseID, opts ...LeaseOption) (*LeaseTimeToLiveResponse, error)

	// Leases retrieves all leases.
	Leases(ctx context.Context) (*LeaseLeasesResponse, error)

	// KeepAlive keeps the given lease alive forever. If the keepalive response
	// posted to the channel is not consumed immediately, the lease client will
	// continue sending keep alive requests to the etcd server at least every
	// second until latest response is consumed.
	//
	// The returned "LeaseKeepAliveResponse" channel closes if underlying keep
	// alive stream is interrupted in some way the client cannot handle itself;
	// given context "ctx" is canceled or timed out. "LeaseKeepAliveResponse"
	// from this closed channel is nil.
	//
	// If client keep alive loop halts with an unexpected error (e.g. "etcdserver:
	// no leader") or canceled by the caller (e.g. context.Canceled), the error
	// is returned. Otherwise, it retries.
	//
	// TODO(v4.0): post errors to last keep alive message before closing
	// (see https://github.com/coreos/etcd/pull/7866)
	KeepAlive(ctx context.Context, id LeaseID) (<-chan *LeaseKeepAliveResponse, error)

	// KeepAliveOnce renews the lease once. The response corresponds to the
	// first message from calling KeepAlive. If the response has a recoverable
	// error, KeepAliveOnce will retry the RPC with a new keep alive message.
	//
	// In most of the cases, Keepalive should be used instead of KeepAliveOnce.
	KeepAliveOnce(ctx context.Context, id LeaseID) (*LeaseKeepAliveResponse, error)

	// Close releases all resources Lease keeps for efficient communication
	// with the etcd server.
	Close() error
}
```
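A lease is a common concept in distributed systems and represents a distributed lease. Typically, a lease mechanism is needed whenever a distributed system has to detect whether a node is still alive.

Grant creates a lease; the lease expires if the server receives no keepAlive within the given time to live. Revoke revokes a lease, and every key attached to it expires and is deleted. TimeToLive retrieves lease information. KeepAlive maintains a lease through a stream of keep-alive requests from client to server and a stream of keep-alive responses from server to client. To detect whether a process in a distributed system is alive, the process can create a lease and keep renewing it through KeepAlive. While everything is healthy, the lease stays valid; if the process dies, the lease eventually expires on its own. etcd allows multiple keys to be attached to the same lease, which greatly reduces the cost of refreshing lease objects.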
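The liveness-detection pattern can be sketched with a toy lessor. Everything here (`lessor`, `lease`, `grant`, `put`, `keepAlive`, `advance`) is a hypothetical in-memory model rather than the etcd API, and time is a plain counter in seconds so that the behaviour is deterministic.

```go
package main

import "fmt"

// lease is a hypothetical in-memory lease: a TTL, an expiry deadline,
// and the keys attached to it.
type lease struct {
	ttl, expiresAt int64
	keys           []string
}

// lessor owns all leases and the keys attached to them.
type lessor struct {
	now    int64 // logical clock in seconds
	nextID int64
	leases map[int64]*lease
	kv     map[string]string
}

func newLessor() *lessor {
	return &lessor{leases: map[int64]*lease{}, kv: map[string]string{}}
}

// grant creates a lease that expires ttl seconds from now.
func (l *lessor) grant(ttl int64) int64 {
	l.nextID++
	l.leases[l.nextID] = &lease{ttl: ttl, expiresAt: l.now + ttl}
	return l.nextID
}

// put stores a key and attaches it to an existing lease.
func (l *lessor) put(key, val string, id int64) {
	l.kv[key] = val
	l.leases[id].keys = append(l.leases[id].keys, key)
}

// keepAlive pushes the deadline out by one TTL; a single call covers every
// key attached to the lease. It reports false if the lease is gone.
func (l *lessor) keepAlive(id int64) bool {
	ls, ok := l.leases[id]
	if ok {
		ls.expiresAt = l.now + ls.ttl
	}
	return ok
}

// advance moves the clock forward and deletes expired leases and their keys.
func (l *lessor) advance(seconds int64) {
	l.now += seconds
	for id, ls := range l.leases {
		if ls.expiresAt <= l.now {
			for _, k := range ls.keys {
				delete(l.kv, k)
			}
			delete(l.leases, id)
		}
	}
}

func main() {
	l := newLessor()
	id := l.grant(10)
	l.put("/workers/w1/addr", "10.0.0.1", id)
	l.put("/workers/w1/state", "ready", id)

	l.advance(8)
	l.keepAlive(id) // process is alive: deadline moves to t=18
	l.advance(8)
	fmt.Println(len(l.kv)) // 2: lease still valid at t=16

	l.advance(10)          // process died: no keepAlive before t=18
	fmt.Println(len(l.kv)) // 0: both keys removed together
}
```

One `keepAlive` call refreshes both keys at once, which is the saving the paragraph above describes for keys that share a lease.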

Watch interface:
```
type Watcher interface {
	// Watch watches on a key or prefix. The watched events will be returned
	// through the returned channel. If revisions waiting to be sent over the
	// watch are compacted, then the watch will be canceled by the server, the
	// client will post a compacted error watch response, and the channel will close.
	// If the context "ctx" is canceled or timed out, returned "WatchChan" is closed,
	// and "WatchResponse" from this closed channel has zero events and nil "Err()".
	// The context "ctx" MUST be canceled, as soon as watcher is no longer being used,
	// to release the associated resources.
	//
	// If the context is "context.Background/TODO", returned "WatchChan" will
	// not be closed and block until event is triggered, except when server
	// returns a non-recoverable error (e.g. ErrCompacted).
	// For example, when context passed with "WithRequireLeader" and the
	// connected server has no leader (e.g. due to network partition),
	// error "etcdserver: no leader" (ErrNoLeader) will be returned,
	// and then "WatchChan" is closed with non-nil "Err()".
	// In order to prevent a watch stream being stuck in a partitioned node,
	// make sure to wrap context with "WithRequireLeader".
	//
	// Otherwise, as long as the context has not been canceled or timed out,
	// watch will retry on other recoverable errors forever until reconnected.
	//
	// TODO: explicitly set context error in the last "WatchResponse" message and close channel?
	// Currently, client contexts are overwritten with "valCtx" that never closes.
	// TODO(v3.4): configure watch retry policy, limit maximum retry number
	// (see https://github.com/etcd-io/etcd/issues/8980)
	Watch(ctx context.Context, key string, opts ...OpOption) WatchChan

	// RequestProgress requests a progress notify response be sent in all watch channels.
	RequestProgress(ctx context.Context) error

	// Close closes the watcher and cancels all watch requests.
	Close() error
}
```
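etcd's watch mechanism lets a client subscribe in real time to incremental data updates in etcd. A watch can target a single key or a key prefix. Watch observes events that are about to happen or have already happened. Both its input and its output are streams: the input stream creates and cancels watches, and the output stream delivers events. A single watch RPC can watch several key ranges at once and stream events for several watches, and the event history can be watched starting from the last compacted revision.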
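The combination of prefix matching, replay from a start revision, and live delivery can be sketched with a small in-memory model. The `hub`, `watcher`, and `event` types are hypothetical; the real client returns a `WatchChan` from `Watch` and talks to the server over a stream.

```go
package main

import (
	"fmt"
	"strings"
)

// event is one change to the keyspace, tagged with its revision.
type event struct {
	rev      int64
	key, val string
}

// watcher receives events whose key starts with prefix.
type watcher struct {
	prefix string
	ch     chan event
}

// hub is a hypothetical in-memory model of a watchable store.
type hub struct {
	rev      int64
	history  []event
	watchers []*watcher
}

// put records a change at the next revision and pushes it to matching watchers.
func (h *hub) put(key, val string) {
	h.rev++
	ev := event{rev: h.rev, key: key, val: val}
	h.history = append(h.history, ev)
	for _, w := range h.watchers {
		if strings.HasPrefix(key, w.prefix) {
			w.ch <- ev
		}
	}
}

// watch first replays retained events with revision >= fromRev, then
// registers the watcher for future events. The channel is buffered so that
// this single-goroutine example cannot block.
func (h *hub) watch(prefix string, fromRev int64) <-chan event {
	w := &watcher{prefix: prefix, ch: make(chan event, 16)}
	for _, ev := range h.history {
		if ev.rev >= fromRev && strings.HasPrefix(ev.key, prefix) {
			w.ch <- ev
		}
	}
	h.watchers = append(h.watchers, w)
	return w.ch
}

func main() {
	h := &hub{}
	h.put("/app/a", "1") // revision 1
	h.put("/db/x", "9")  // revision 2, different prefix
	h.put("/app/b", "2") // revision 3

	ch := h.watch("/app/", 2) // start from revision 2
	h.put("/app/a", "3")      // revision 4, delivered live

	for len(ch) > 0 {
		ev := <-ch
		fmt.Println(ev.rev, ev.key, ev.val)
	}
}
```

The watcher sees revision 3 from the replayed history and revision 4 as a live event; revision 1 is before the start revision and revision 2 does not match the prefix.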

etcd Data Versioning

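etcd's data versioning has two main parts: term denotes the leader's term, and revision denotes the version of the global data. Whenever the cluster changes leader, term increases by 1; a leader change happens when a node fails, when the leader's network has problems, or when the whole cluster is stopped and started again. Whenever data changes (create, update, or delete), revision increases by 1. Across leader terms, revision stays globally monotonically increasing, and every modification in the cluster corresponds to a unique revision, so revision can be used to support MVCC as well as watch.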

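For every key-value entry, etcd records three versions:
- create_revision is the revision at which the key-value pair was created;
- mod_revision is the revision of the most recent operation on its data;
- version is a counter of how many times the key-value pair has been modified.
Within a single leader term, every modification carries the same term value, while revision keeps increasing monotonically. After the cluster restarts, the term value for all subsequent modifications is 1 higher.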
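How the three per-key counters move relative to the global revision is easiest to see in a small model. The `store` and `keyValue` types below are hypothetical, and the global revision starts at 0 here purely for readability; a real etcd store does not necessarily start from that value.

```go
package main

import "fmt"

// keyValue mirrors the three per-key counters described above.
type keyValue struct {
	value          string
	createRevision int64 // store revision when the key was created
	modRevision    int64 // store revision of the latest change
	version        int64 // number of writes since creation
}

// store is a hypothetical in-memory keyspace with one global revision.
type store struct {
	revision int64
	keys     map[string]*keyValue
}

// put bumps the global revision and updates the key's counters.
func (s *store) put(key, value string) {
	s.revision++
	kv, ok := s.keys[key]
	if !ok {
		kv = &keyValue{createRevision: s.revision}
		s.keys[key] = kv
	}
	kv.value = value
	kv.modRevision = s.revision
	kv.version++
}

// del also consumes a revision; a later put starts a fresh key at version 1.
func (s *store) del(key string) {
	if _, ok := s.keys[key]; ok {
		s.revision++
		delete(s.keys, key)
	}
}

func main() {
	s := &store{keys: map[string]*keyValue{}}
	s.put("a", "1")           // revision 1
	s.put("b", "1")           // revision 2
	s.put("a", "2")           // revision 3
	fmt.Println(*s.keys["a"]) // {2 1 3 2}
	s.del("a")                // revision 4
	s.put("a", "3")           // revision 5
	fmt.Println(*s.keys["a"]) // {3 5 5 1}
}
```

Note that the write to "b" advances the global revision, so the second write to "a" lands at revision 3 even though it is only version 2 of that key.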

MVCC Concurrency Control in etcd

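MVCC will be familiar to most readers: MySQL's InnoDB uses it for highly concurrent data access, keeping multiple versions of the data and relying on transaction visibility rules so that each transaction sees the version it should see. etcd likewise uses MVCC for concurrency control.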

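etcd allows the same key to be modified many times, and each modification gets its own version number. etcd records the data for every modification, so one key has multiple historical versions in etcd. A query that does not specify a version number returns the latest version of the key; etcd also supports querying historical data at a specified version number.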

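Because etcd records every modification, a watch subscription can create a watcher starting from any historical point (a specified revision that has not been compacted). This sets up a data channel between the client and etcd, and etcd pushes all data changes from that revision onward. The watch mechanism guarantees that later modifications of the key are pushed to the client promptly through this channel.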
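Reading a key at a past revision can be sketched with a store that appends versions instead of overwriting them. The `mvccStore` and `versioned` types are hypothetical and ignore deletion and compaction; in the real client a historical read is requested by passing `WithRev(rev)` to `Get`.

```go
package main

import "fmt"

// versioned is one historical value of a key.
type versioned struct {
	rev int64
	val string
}

// mvccStore is a hypothetical in-memory model that keeps every version
// of every key.
type mvccStore struct {
	rev     int64
	history map[string][]versioned // per key, ordered by ascending revision
}

// put appends a new version instead of overwriting the old one.
func (s *mvccStore) put(key, val string) int64 {
	s.rev++
	s.history[key] = append(s.history[key], versioned{rev: s.rev, val: val})
	return s.rev
}

// get returns the value of key as of revision rev; rev <= 0 means latest.
func (s *mvccStore) get(key string, rev int64) (string, bool) {
	if rev <= 0 {
		rev = s.rev
	}
	versions := s.history[key]
	// Walk backwards to the newest version that is not newer than rev.
	for i := len(versions) - 1; i >= 0; i-- {
		if versions[i].rev <= rev {
			return versions[i].val, true
		}
	}
	return "", false
}

func main() {
	s := &mvccStore{history: map[string][]versioned{}}
	s.put("cfg", "v1")  // revision 1
	s.put("other", "x") // revision 2
	s.put("cfg", "v2")  // revision 3

	latest, _ := s.get("cfg", 0)
	old, _ := s.get("cfg", 2)
	fmt.Println(latest, old) // v2 v1
}
```

Reading "cfg" at revision 2 returns "v1" because that was the newest version of the key when the store as a whole was at revision 2.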

The source code shows:
```
type revision struct {
	// main is the main revision of a set of changes that happen atomically.
	main int64

	// sub is the sub revision of a change in a set of changes that happen
	// atomically. Each change has different increasing sub revision in that
	// set.
	sub int64
}

func (a revision) GreaterThan(b revision) bool {
	if a.main > b.main {
		return true
	}
	if a.main < b.main {
		return false
	}
	return a.sub > b.sub
}
```
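etcd's MVCC implementation has a revision struct. main is the transaction ID of the current operation, a globally incrementing logical timestamp; sub is the ID of the current operation within the transaction, incrementing inside the transaction and starting at 0. The GreaterThan method compares two revisions.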

etcd Storage Data Structures

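etcd persists its key-value data on disk in a B+tree (the BoltDB backend), which is mapped into memory with mmap for fast access. Alongside it, etcd keeps an in-memory B-tree index, treeIndex, that maps each key to its revisions. treeIndex is defined as follows: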
```
type treeIndex struct {
	sync.RWMutex
	tree *btree.BTree
}

func newTreeIndex() index {
	return &treeIndex{
		tree: btree.New(32),
	}
}
```
The index operations bound to the B-tree include Put, Get, Revision, Range, and Visit. Taking the Put method as an example, its source is:
```
func (ti *treeIndex) Put(key []byte, rev revision) {
	keyi := &keyIndex{key: key}

	ti.Lock()
	defer ti.Unlock()
	item := ti.tree.Get(keyi)
	if item == nil {
		keyi.put(rev.main, rev.sub)
		ti.tree.ReplaceOrInsert(keyi)
		return
	}
	okeyi := item.(*keyIndex)
	okeyi.put(rev.main, rev.sub)
}
```
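The source shows that reads and writes on the B-tree are performed while holding the lock, which keeps the data consistent under concurrent access.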
Original article: https://www.cnblogs.com/FG123/p/13632095.html