Consistency Algorithms: Paxos, Raft, ZAB

2020-02-04 19:15 ⁄ Industry · Programming

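1. The general approach to fault tolerance in distributed systems is state machine replication.

2. A more precise name for a "distributed consistency algorithm" is: a consensus algorithm for state machine replication.

3. Paxos is, strictly speaking, a consensus algorithm. Whether the system ends up consistent depends not only on reaching consensus but also on how the client behaves.

4. A distributed system has multiple nodes, so the nodes must communicate. There are two communication models: shared memory and message passing. The algorithms discussed below all assume message passing. The underlying assumption is that messages between processes may be lost, delayed, or duplicated, but are never corrupted. The algorithms below exist to let the processes in such a system agree on a value by exchanging messages.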

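I. Consistency overview

Consistency models currently used in industry

1.1. Weak consistency (eventual consistency)

DNS (Domain Name System)

Gossip (the communication protocol used by Cassandra and Redis)

1.2. Strong consistency

There are broadly two approaches:

1.2.1. Master-slave synchronization

Basic idea:

Synchronous master-slave replication:

1. The master accepts a write request.

2. The master replicates the log entry to the slaves.

3. The master waits until all slaves have responded.

Problem:

If a single node fails, the master blocks and the whole cluster becomes unavailable. Consistency is preserved, but availability drops sharply.

1.2.2. Majority (quorum)

Basic idea:

Every write must reach more than N/2 nodes, and every read must read from more than N/2 nodes.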
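The majority rule works because any two sets of more than N/2 nodes share at least one node, so a majority read always touches a node that saw the latest majority write. The following is a minimal Python sketch that checks this by brute force for small clusters; the helper names `majority` and `quorums_overlap` are illustrative and not taken from any library.

```python
from itertools import combinations


def majority(n: int) -> int:
    """Smallest node count that is strictly greater than n / 2."""
    return n // 2 + 1


def quorums_overlap(n: int) -> bool:
    """Check by brute force that every pair of majority quorums shares a node."""
    size = majority(n)
    quorums = [set(c) for c in combinations(range(n), size)]
    # An empty intersection is falsy, so all() fails if two quorums are disjoint.
    return all(a & b for a in quorums for b in quorums)


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(n, majority(n), quorums_overlap(n))
```

With 5 nodes a quorum is 3, so the cluster keeps working with up to 2 nodes down, unlike the synchronous master-slave scheme, which blocks on a single failure.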

Related algorithms:

Paxos

Raft (Multi-Paxos style)

ZAB (Multi-Paxos style)

II. Paxos

Paxos is a message-passing consensus algorithm first proposed by Leslie Lamport around 1990 (the paper was published in 1998).

Paxos variants: Basic Paxos, Multi Paxos, Fast Paxos

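2.1. Basic Paxos

2.1.1. Roles

Client: an external role that issues requests. Comparable to a citizen.

Proposer: receives Client requests, makes proposals to the cluster, and mediates when conflicts occur. Comparable to a legislator who introduces bills on behalf of citizens.

Acceptor: votes on and accepts proposals. A proposal is finally accepted only when a quorum (normally a majority) has been formed. Comparable to the legislature.

Learner: learns accepted proposals and acts as a backup; it has no effect on the consensus itself. Comparable to a recording clerk.

2.1.2. Steps and phases

1. Phase 1a: Prepare

The proposer creates a proposal numbered N, where N is greater than any proposal number this proposer has used before, and sends a Prepare request to a quorum of acceptors.

2. Phase 1b: Promise

If N is greater than every proposal number the acceptor has previously received, the acceptor promises to ignore lower-numbered proposals and reports any value it has already accepted; otherwise it rejects the request.

3. Phase 2a: Accept

If a majority has promised, the proposer sends an Accept request containing the proposal number N and the proposal value. If any promise reported a previously accepted value, the proposer must use the value with the highest proposal number among them.

4. Phase 2b: Accepted

If the acceptor has not received any proposal numbered higher than N in the meantime, it accepts the proposal value; otherwise it ignores the request.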
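The four phases above can be sketched as a small in-process Python model. This is only an illustration under simplifying assumptions: there is no network, no message loss, and no Learner, and the names `Acceptor`, `on_prepare`, `on_accept` and `propose` are made up for this example.

```python
from dataclasses import dataclass
from typing import Any, List, Optional, Tuple


@dataclass
class Acceptor:
    promised_n: int = 0                   # highest proposal number promised so far
    accepted_n: int = 0                   # number of the last accepted proposal (0 = none)
    accepted_value: Optional[Any] = None  # value of the last accepted proposal

    def on_prepare(self, n: int) -> Optional[Tuple[int, Any]]:
        """Phase 1b: promise if n is the highest number seen; None means reject."""
        if n <= self.promised_n:
            return None
        self.promised_n = n
        return (self.accepted_n, self.accepted_value)

    def on_accept(self, n: int, value: Any) -> bool:
        """Phase 2b: accept unless a higher-numbered Prepare was promised meanwhile."""
        if n < self.promised_n:
            return False
        self.promised_n = n
        self.accepted_n = n
        self.accepted_value = value
        return True


def propose(acceptors: List[Acceptor], n: int, value: Any) -> Optional[Any]:
    """Run phases 1a to 2b once; return the chosen value, or None if the round fails."""
    quorum = len(acceptors) // 2 + 1
    promises = [p for p in (a.on_prepare(n) for a in acceptors) if p is not None]
    if len(promises) < quorum:
        return None
    # If any acceptor already accepted a value, adopt the highest-numbered one.
    prev_n, prev_value = max(promises, key=lambda p: p[0])
    if prev_n > 0:
        value = prev_value
    accepted = sum(a.on_accept(n, value) for a in acceptors)
    return value if accepted >= quorum else None


if __name__ == "__main__":
    cluster = [Acceptor() for _ in range(3)]
    print(propose(cluster, 1, "A"))
    print(propose(cluster, 2, "B"))
```

In the usage at the bottom, the second proposer asks for "B" but ends up re-proposing "A", because its promises report that "A" was already accepted. This is the rule that keeps a chosen value from being overwritten.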

2.1.3. Basic flow

2.1.3.1. Normal flow

There is 1 Client, 1 Proposer, 3 Acceptors (i.e. the Quorum size is 3) and 2 Learners (represented by the 2 vertical lines).

This diagram represents the case of a first round, which is successful (i.e. no process in the network fails).

Client   Proposer      Acceptor     Learner

   |         |          | | |       | |

   X-------->|          | | |       | | Request

   |         X--------->|->|->|       | | Prepare(1)

   |         |<---------X--X--X       | | Promise(1,{Va,Vb,Vc})

   |         X--------->|->|->|       | | Accept!(1,V)

   |         |<---------X--X--X------>|->| Accepted(1,V)

   |<---------------------------------X--X Response

   |         |          | | |       | |

2.1.3.2. One Acceptor fails

In the following diagram, one of the Acceptors in the Quorum fails, so the Quorum size becomes 2. In this case, the Basic Paxos protocol still succeeds.

Client   Proposer      Acceptor     Learner

   |         |          | | |       | |

   X-------->|          | | |       | | Request

   |         X--------->|->|->|       | | Prepare(1)

   |         |          | | !       | | !! FAIL !!

   |         |<---------X--X          | | Promise(1,{Va, Vb, null})

   |         X--------->|->|          | | Accept!(1,V)

   |         |<---------X--X--------->|->| Accepted(1,V)

   |<---------------------------------X--X Response

   |         |          | |          | |

2.1.3.3. One Learner fails

In the following case, one of the (redundant) Learners fails, but the Basic Paxos protocol still succeeds.

Client Proposer         Acceptor     Learner

   |         |          | | |       | |

   X-------->|          | | |       | | Request

   |         X--------->|->|->|       | | Prepare(1)

   |         |<---------X--X--X       | | Promise(1,{Va,Vb,Vc})

   |         X--------->|->|->|       | | Accept!(1,V)

   |         |<---------X--X--X------>|->| Accepted(1,V)

   |         |          | | |       | ! !! FAIL !!

   |<---------------------------------X     Response

   |         |          | | |       |

2.1.3.4. One Proposer fails

In this case, a Proposer fails after proposing a value, but before the agreement is reached. Specifically, it fails in the middle of the Accept message, so only one Acceptor of the Quorum receives the value. Meanwhile, a new Leader (a Proposer) is elected (but this is not shown in detail). Note that there are 2 rounds in this case (rounds proceed vertically, from the top to the bottom).

Client Proposer        Acceptor     Learner

   |      |             | | |       | |

   X----->|             | | |       | | Request

   |      X------------>|->|->|       | | Prepare(1)

   |      |<------------X--X--X       | | Promise(1,{Va, Vb, Vc})

   |      |             | | |       | |

   |      |             | | |       | | !! Leader fails during broadcast !!

   |      X------------>| | |       | | Accept!(1,V)

   |      !             | | |       | |

   |         |          | | |       | | !! NEW LEADER !!

   |         X--------->|->|->|       | | Prepare(2)

   |         |<---------X--X--X       | | Promise(2,{V, null, null})

   |         X--------->|->|->|       | | Accept!(2,V)

   |         |<---------X--X--X------>|->| Accepted(2,V)

   |<---------------------------------X--X Response

   |         |          | | |       | |

2.1.4. Potential problems

2.1.4.1. Livelock (dueling proposers)

How a livelock arises:

The most complex case is when multiple Proposers believe themselves to be Leaders. For instance, the current leader may fail and later recover, but the other Proposers have already re-selected a new leader. The recovered leader has not learned this yet and attempts to begin one round in conflict with the current leader. In the diagram below, 4 unsuccessful rounds are shown, but there could be more (as suggested at the bottom of the diagram).

Client   Leader         Acceptor     Learner

|      |             | | |       | |

X----->|             | | |       | | Request

|      X------------>|->|->|       | | Prepare(1)

|      |<------------X--X--X       | | Promise(1,{null,null,null})

|      !             | | |       | | !! LEADER FAILS

|         |          | | |       | | !! NEW LEADER (knows last number was 1)

|         X--------->|->|->|       | | Prepare(2)

|         |<---------X--X--X       | | Promise(2,{null,null,null})

|      | |          | | |       | | !! OLD LEADER recovers

|      | |          | | |       | | !! OLD LEADER tries 2, denied

|      X------------>|->|->|       | | Prepare(2)

|      |<------------X--X--X       | | Nack(2)

|      | |          | | |       | | !! OLD LEADER tries 3

|      X------------>|->|->|       | | Prepare(3)

|      |<------------X--X--X       | | Promise(3,{null,null,null})

|      | |          | | |       | | !! NEW LEADER proposes, denied

|      | X--------->|->|->|       | | Accept!(2,Va)

|      | |<---------X--X--X       | | Nack(3)

|      | |          | | |       | | !! NEW LEADER tries 4

|      | X--------->|->|->|       | | Prepare(4)

|      | |<---------X--X--X       | | Promise(4,{null,null,null})

|      | |          | | |       | | !! OLD LEADER proposes, denied

|      X------------>|->|->|       | | Accept!(3,Vb)

|      |<------------X--X--X       | | Nack(4)

|      | |          | | |       | | ... and so on ...

Solution: when a conflict occurs, the Proposer waits for a random timeout (typically a few seconds) before submitting its proposal again.
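A random wait breaks the symmetry between dueling proposers, because the two are unlikely to retry at the same moment. Here is a minimal Python sketch of such a delay. The function name `backoff_delay`, the exponential growth of the upper bound, and the default values are assumptions added for illustration; the text above only calls for a random timeout.

```python
import random


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 3.0, rng=random) -> float:
    """Random wait in seconds before retrying a rejected proposal.

    The upper bound doubles with each failed attempt (an assumption of this
    sketch) and is capped so the wait never exceeds `cap` seconds.
    """
    upper = min(cap, base * (2 ** attempt))
    return rng.uniform(0, upper)


if __name__ == "__main__":
    for attempt in range(5):
        print(attempt, round(backoff_delay(attempt), 3))
```

A proposer would sleep for `backoff_delay(attempt)` after receiving a Nack and then issue a new Prepare with a higher proposal number.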

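2.1.4.2. Hard to implement, inefficient (2 RTTs)

1. Basic Paxos is notoriously difficult to understand and to implement.

2. Making a proposal and then submitting its content (the log entry) takes two round trips (RTTs), which is inefficient.

2.2. Multi Paxos

2.2.1. Roles

Fewer roles, simpler steps:

Basic Paxos suffers from livelock, and the root cause is having multiple Proposers, so Multi Paxos introduces a new concept: the Leader. Basic Paxos is also inefficient because of its two RTTs. Multi Paxos addresses this with the Leader role plus an instance number I carried in each message (the round number I is included along with each value and is incremented in each round by the same Leader; it is not random). As a result, two RTTs are needed only when a Leader is being elected, and every other request takes a single RTT.

Leader: the only Proposer; all requests must go through the Leader.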
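To make the saving concrete, the sketch below records the messages a stable Leader would send: one Prepare when it first takes over, then only Accept! messages with an increasing instance number I. It is a toy model in Python; the class name `MultiPaxosLeader` and its fields are hypothetical, and it assumes every Prepare and Accept! is answered successfully by a quorum.

```python
from typing import Any, List, Tuple


class MultiPaxosLeader:
    """Toy model of the messages sent by a stable Multi Paxos leader."""

    def __init__(self, n: int) -> None:
        self.n = n                 # proposal number held by this leader
        self.prepared = False      # True once phase 1 has been run for n
        self.next_instance = 0     # instance number I for the next request
        self.sent: List[Tuple] = []

    def submit(self, value: Any) -> int:
        """Handle one client request and return the instance number used."""
        if not self.prepared:
            # Phase 1 runs only once, when the leader is established.
            self.sent.append(("Prepare", self.n))
            self.prepared = True
        instance = self.next_instance
        self.next_instance += 1
        # Phase 2 runs for every request: a single round trip.
        self.sent.append(("Accept!", self.n, instance, value))
        return instance


if __name__ == "__main__":
    leader = MultiPaxosLeader(n=1)
    leader.submit("V")
    leader.submit("W")
    for message in leader.sent:
        print(message)
```

Only the first request pays for the Prepare round; if the leader changed, the new leader would need a higher `n` and a fresh Prepare.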

2.2.2. Basic flow

2.2.2.1. Leader election flow

1. In terms of the Basic Paxos roles:

In the following diagram, only one instance (or "execution") of the basic Paxos protocol, with an initial Leader (a Proposer), is shown. Note that a Multi-Paxos consists of several instances of the basic Paxos protocol.

Client   Proposer      Acceptor     Learner

|         |          | | |       | | --- First Request ---

X-------->|          | | |       | | Request

|         X--------->|->|->|       | | Prepare(N)

|         |<---------X--X--X       | | Promise(N,I,{Va,Vb,Vc})

|         X--------->|->|->|       | | Accept!(N,I,V)

|         |<---------X--X--X------>|->| Accepted(N,I,V)

|<---------------------------------X--X Response

|         |          | | |       | |

where V = last of (Va, Vb, Vc).

2. In terms of the Multi Paxos roles:

A common deployment of the Multi-Paxos consists in collapsing the role of the Proposers, Acceptors and Learners to "Servers".

So, in the end, there are only "Clients" and "Servers".

Client      Servers

|         | | | --- First Request ---

X-------->| | | Request

|         X->|->| Prepare(N)

|         |<-X--X Promise(N, I, {Va, Vb})

|         X->|->| Accept!(N, I, Vn)

|         X<>X<>X Accepted(N, I)

|<--------X | | Response

|         | | |

2.2.2.2. Normal request flow

1. In terms of the Basic Paxos roles:

In this case, subsequent instances of the basic Paxos protocol (represented by I+1) use the same leader, so phase 1 (of these subsequent instances of the basic Paxos protocol), which consists of the Prepare and Promise sub-phases, is skipped. Note that the Leader should be stable, i.e. it should not crash or change.

The following diagram represents such a subsequent instance, with the Proposer, Acceptor and Learner shown as separate roles.

Client   Proposer       Acceptor     Learner

|         |          | | |       | | --- Following Requests ---

X-------->|          | | |       | | Request

|         X--------->|->|->|       | | Accept!(N,I+1,W)

|         |<---------X--X--X------>|->| Accepted(N,I+1,W)

|<---------------------------------X--X Response

|         |          | | |       | |

2. In terms of the Multi Paxos roles:

In the subsequent instances of the basic Paxos protocol, with the same leader as in the previous instances of the basic Paxos protocol, phase 1 can be skipped.

Client      Servers

X-------->| | | Request

|         X->|->| Accept!(N,I+1,W)

|         X<>X<>X Accepted(N,I+1)

|<--------X | | Response

|         | | |

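III. Raft

Raft can be seen as a consensus algorithm that is simpler than Multi Paxos.

3.1. Concepts in the Raft protocol

3.1.1. Roles

Leader:

The primary node. The cluster has only one Leader at a time, and all write requests are sent to the Followers through the Leader.

Follower:

A secondary node that follows the Leader.

Candidate:

When the Leader crashes or its messages fail to arrive, the cluster has no Leader and the Followers stop receiving heartbeats. A Follower that starts campaigning for leadership becomes a Candidate. Candidate is only a transitional state and does not persist for long.

3.1.2. Term

Each Leader's period in office is identified by a unique Term.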

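3.2. Basic operations

3.2.1. Raft splits state machine replication into three subproblems

1. Leader Election

2. Log Replication

3. Safety

3.2.2. Leader Election steps

Leader Election is triggered when the cluster starts or when the Leader's heartbeats stop reaching the Followers.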

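3.2.3. Log Replication steps

1. All write requests go through the Leader.

2. The Leader sends the write to the Followers in the same message type it uses for heartbeats (AppendEntries in Raft; a heartbeat is an AppendEntries message with no entries).

3. Once the Leader has received replies from a majority, it commits the write locally and then tells the Followers to commit.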
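Step 3 amounts to finding the highest log index that is stored on a majority of nodes. A minimal Python sketch follows; the function name `commit_index` and its parameters are illustrative, with `match_index` standing for the highest index each Follower is known to have stored.

```python
from typing import List


def commit_index(leader_last_index: int, match_index: List[int]) -> int:
    """Highest log index stored on a majority of the cluster, leader included."""
    indexes = sorted(match_index + [leader_last_index], reverse=True)
    majority = len(indexes) // 2 + 1
    # The majority-th largest index is held by at least `majority` nodes.
    return indexes[majority - 1]


if __name__ == "__main__":
    # Leader has entries up to 5; four followers have acknowledged 5, 4, 2 and 1.
    print(commit_index(5, [5, 4, 2, 1]))
```

Raft additionally lets a Leader commit this way only entries from its own Term; the sketch leaves that check out.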

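3.2.4. Safety guarantees

1. Detecting Leader failure:

a. Raft uses timeouts so that Followers reliably notice a Leader crash or lost messages, which triggers a Follower to campaign for leadership.

b. The Leader must send heartbeats to the Followers; data is carried to the Followers in the same messages.

2. Split votes

If a Leader Election ends in a tie, each of the tied Candidates waits for a random timeout and then starts the next round of the election.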
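The random wait makes it likely that one Candidate times out clearly before the others and wins the next round. The Python sketch below draws a timeout per Candidate and reports who would start the next election first. The function name `first_to_time_out` is hypothetical; the 150 to 300 ms default range is the one suggested in the Raft paper.

```python
import random
from typing import Dict, List, Tuple


def first_to_time_out(
    candidates: List[str],
    rng: random.Random,
    low_ms: float = 150.0,
    high_ms: float = 300.0,
) -> Tuple[str, Dict[str, float]]:
    """Draw a random election timeout per candidate and return the earliest one."""
    timeouts = {name: rng.uniform(low_ms, high_ms) for name in candidates}
    winner = min(timeouts, key=timeouts.get)
    return winner, timeouts


if __name__ == "__main__":
    winner, timeouts = first_to_time_out(["S1", "S2"], random.Random())
    print(winner, timeouts)
```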

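3. Split brain (minority and majority partitions):

The minority partition cannot get replies from a majority, so its writes fail.

The majority partition elects a new Leader, which has a new Term of its own, and its writes succeed.

When the minority partition rejoins the majority partition, its Term is lower than the new Term, so it synchronizes its state from the majority partition.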
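The Term comparison that resolves a split brain can be sketched as follows. This is a simplified Python model: the `Node` class and `on_message` method are invented for the example, and log synchronization itself is not shown.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    term: int
    role: str = "follower"

    def on_message(self, sender_term: int) -> bool:
        """Apply the term rule; return True if the message is accepted."""
        if sender_term > self.term:
            # A higher term exists: adopt it and step down to follower.
            self.term = sender_term
            self.role = "follower"
            return True
        # Equal terms are accepted; messages from an older term are rejected.
        return sender_term == self.term


if __name__ == "__main__":
    old_leader = Node("S5", term=1, role="leader")   # was in the minority partition
    new_leader = Node("S1", term=2, role="leader")   # elected by the majority
    print(new_leader.on_message(old_leader.term), new_leader.role)
    print(old_leader.on_message(new_leader.term), old_leader.role)
```

The stale Leader's message is rejected by the new Leader, and the stale Leader steps down as soon as it sees the higher Term.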

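3.3. Consistency does not imply full correctness

3.3.1. A Client Request has three possible outcomes: success, failure, unknown (timeout)

Understanding unknown (timeout)

Scenario: the Client sends a write request and the Leader replicates the log entry to the Followers while 3 nodes in the cluster are down and 2 are alive. What is the outcome? Assume the nodes are S1, S2, S3, S4 and S5 (the Leader).

Suppose S5 and S4 are alive. When the Client issues its N-th write request, operation I, the Leader cannot get replies from a majority and operation I reaches only S4. The Leader then returns unknown to the Client, because it cannot tell whether the entry will later be written to a majority.

Outcome 1: after the Leader has replied to the Client, the failed Followers S1, S2 and S3 recover. The Leader resends the N-th write, operation I, receives replies from a majority, and commits the entry. The write ultimately succeeds.

Outcome 2: after the Leader has replied to the Client, S5 and S4 crash and S1, S2 and S3 recover. S1, S2 and S3 run a Leader Election and the cluster becomes available again. If the Client then issues its (N+1)-th request, operation I+1, and it succeeds before S5 and S4 recover, operation I stored on S5 and S4 is deleted and they copy the latest operation I+1 locally. The N-th write, operation I, has failed.

Summary: consistency has to be guaranteed jointly by the client and the consensus algorithm.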
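One common way for a client to cope with an unknown outcome is to retry with the same request ID and have the state machine ignore duplicates, so that a write which did succeed is not applied twice. The Python sketch below illustrates the idea; it is an assumption added here rather than something the text above prescribes, and the `DedupStateMachine` class is hypothetical.

```python
from typing import Dict


class DedupStateMachine:
    """Counter state machine that applies each request ID at most once."""

    def __init__(self) -> None:
        self.counter = 0
        self.applied: Dict[str, int] = {}   # request ID -> result returned earlier

    def apply(self, request_id: str, delta: int) -> int:
        """Apply an increment, or return the cached result for a repeated ID."""
        if request_id in self.applied:
            return self.applied[request_id]
        self.counter += delta
        self.applied[request_id] = self.counter
        return self.counter


if __name__ == "__main__":
    machine = DedupStateMachine()
    print(machine.apply("req-1", 5))   # first attempt, outcome unknown to the client
    print(machine.apply("req-1", 5))   # retry with the same ID, not applied again
    print(machine.counter)
```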

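IV. ZAB

ZAB stands for ZooKeeper Atomic Broadcast, the consensus protocol used inside ZooKeeper. It is largely the same as Raft.

Some terms differ:

For example, ZAB calls a leader's period in office an epoch, while Raft calls it a Term.

The implementations also differ slightly:

Raft guarantees log continuity, and its heartbeats flow from the Leader to the Followers. In ZAB it is the other way around.

V. Consistency algorithms in practice

5.1. Components that use Paxos

Chubby (Google's first engineering application of Multi Paxos)

5.2. Components that use Raft

Redis Cluster (a Raft-like failover election), etcd

5.3. Components that use ZAB

ZooKeeper (open-sourced by Yahoo)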

Appendix

References for this article:

https://raft.github.io/

https://www.bilibili.com/video/av21667358?t=3887

https://en.wikipedia.org/wiki/Paxos_(computer_science)

https://en.wikipedia.org/wiki/Raft_(computer_science)