- Raft Consensus Protocol Made Simpler [link]
- Raft Cluster Membership Change Protocol [this one]
- Raft Consensus Protocol Implementations [link]
In my first blog post about Raft — Raft Consensus Protocol Made Simpler, I promised I’d write about Raft cluster membership change. That’s too important a topic to omit. Due to various administrative reasons, new nodes may need to be added to the cluster, existing nodes may need to be removed or replaced. A mature and practical consensus protocol has to address the need to update cluster membership. In this blog post, let’s see how Raft does it. This blog post relies extensively on the knowledge built in that first blog post. It’s strongly recommended that you at least skimmed through that before diving into this one.
Cluster Membership Change
Cluster membership update is surprisingly easy once you understand the fundamentals of Raft’s log replication and leader election. Diego initially designed a more complex algorithm but later discovered it could be largely simplified if we impose a restriction: only one node change at a time — either add or remove. It’s a realistic simplification because clusters typically run in 3-node-mode, 5-node-mode, or occasionally 7-node-mode. Changes among them require only a handful of sequential one-time cluster membership updates.
To perform cluster membership updates, two new client RPCs are introduced: AddServer and RemoveServer. Clients — typically admins — use them to interact with the cluster leader for membership updates. What does the leader do when it receives an AddServer/RemoveServer request? Well, you guessed it. The leader appends that request to the log and starts replicating to followers. Here is the important part: a node updates its cluster membership knowledge as soon as an AddServer/RemoveServer is appended to its log (with a minor variance when the leader needs to remove itself — as we shall see later). That’s right, a node does not wait for that log entry to be committed (safely replicated in a majority). It starts using that information right away. If it’s AddServer, the new server will be treated as the node’s peer. If it’s RemoveServer, that server is removed from the node’s peer list. This eliminates the need of keeping a record of how much of the log is committed. It also reduces the delay of cluster membership change because the knowledge of commitment typically lags the log entries.
Why Switching Immediately Works
It may seem counterintuitive at first glance because almost any robust change in a distributed system requires “two-phases” — Paxos’ preparing and accepting a value, two phase commit algorithm, etc. The magic here comes down to the restriction we imposed earlier. The danger of cluster membership change is in that out-of-sync nodes may have different ideas about the number needed to form a majority. For example, if 2 nodes are added to a 3-node cluster in one shot, these two new nodes may conspire with one existing node to form a 3-out-of-5 majority. But the other existing two nodes which don’t know about the cluster membership change may form a 2-out-of-3 majority. Two leaders arise in this situation. See the following figure-1 as an illustration:
The genius of the one-node-at-a-time restriction is that it prevents any split brain (multiple leaders) situation. Because the two or more split groups are guaranteed to share nodes, which act as arbitrators to stop multiple leaders being elected. Suppose you have a cluster of 4 nodes and now you want to add one more. Nodes which still think there are only 4 nodes in the cluster will have to summon supports from 3 nodes to form a majority. Nodes which know there are actually 5 nodes now will also have to collect votes from 3 nodes. There are only 5 nodes in total. So the first 3-node-group and the second 3-node-group are guaranteed to have overlaps. See the following figure-2 as an illustration:
You can work out other examples starting with an even/odd number of nodes trying to add/remove one server. They all turn out to have to have overlaps. The existence of overlap prevents multiple leaders because a node only votes for one candidate in any given term. In the 4+1 example above, the first 3-node-group could elect a leader for term t, which prevents the second 3-node-group from winning in term t. The second 3-node-group could rerun and win for term t+1, in which case their leader will override the leader from the first 3-node-group. In theory, there may be back-and-forth between the two groups, but in practice the elected leader will start sending out AppendEntry fast enough to convert others to followers. Interjection: I call it AppendEntry instead of the original AppendEntries in Diego’s dissertation. You’ll have to refer to the my Raft Consensus Protocol Made Simpler to understand why that’s preferred.
Where the “two phases” concept comes in is that new cluster membership change can only be performed once the last cluster membership change is committed. Otherwise the one-at-a-time promise will break down because there may be nodes which think the cluster still consists of 3 nodes while others already know it contains 5 nodes now.
Address a Couple Loose Ends for Completeness
There are a couple other loose ends we have to address before the cluster membership protocol is complete.
First all of, nodes always respond to AppendEntry and RequestVote RPCs even if those come from nodes not part of their cluster configuration. Otherwise newly added nodes can’t sync to the leader’s log; or sometimes there won’t be enough votes to constitute a majority — for example, a fourth node is added and one existing node fails, now 3 nodes (including the new node) are required to form a majority.
Secondly, the newly added nodes need a way to quickly catch up with the log otherwise they could block the system. For example, a fourth node with empty log is added to a 3-node cluster. Then one of the existing 3 nodes fails. Now this fourth node’s lagging log will delay a majority (3 out of 4) replication because it has to catch up first before participating in replicating new log entries. See the figure-3 below for an illustration. There’re many ways to address this. We could put the fourth node in a pure observer mode for it to receive AppendEntry and catch up with the leader first before it participates in any leader election/voting. Or we could implement a more efficient log snapshot dump so that this lagging new node doesn’t have to wait for a sequential log backfill.
Thirdly, what if the to-be-removed server is the leader itself? In that case, does the leader immediately step down after it appends the cluster membership change to its own log? This is the minor variance I alluded to in the earlier section. The answer is no. In an extreme case, there are only 2 nodes in the cluster. If the leader steps down immediately before it can replicate the request to the other node, the other node will never know that the node size is reduced to 1 and therefore still thinks that two nodes are needed for a majority. In that case, that remaining node will never become a leader. Fortunately, the solution is straightforward. The leader steps down when it knows that the cluster membership change is committed, in which case a new leader can be elected without the old leader. Suppose you have a 4-node-cluster, the leader steps down once the cluster membership change is replicated in two other nodes. Now, the remaining 3 nodes will time out waiting for AppendEntry and start running for the new leader. The nodes with the new cluster membership change in its log knows only 2 are needed to form a majority. In this case, the cluster can still survive if any one of the remaining 3 nodes crashes.
Lastly, there is an edge case in removing a server. The leader knows that that server is removed and starts acting that way as soon as the leader appends the cluster membership change to its log. That means the leader will stop sending that node AppendEntry. The removed node, if still connected to others, will think that the leader is dead because it doesn’t hear any AppendEntry and will then start an election, disrupting the rest of the cluster. The solution is to modify the RequestVote RPC handling slightly: if a node still hears regular AppendEntry RPC, it denies any vote request. That way this removed node can start its election, but it won’t get any vote from the rest of the cluster.
I think that’s everything for the core of Raft that I want to share. Diego’s dissertation also talked about log compaction, client interaction, their user study, implementation and performance, etc. But I don’t think those areas are particularly novel. So I don’t plan to write about them. I do want to write something about the implementation from my own perspective, which I’ll probably get to in a future post.