The Paxos consensus algorithm, described by Leslie Lamport, is important both for its contributions to distributed computing theory and for its widespread practical deployment. However, despite this broad acceptance and implementation, it is no secret that Paxos is difficult to understand. These difficulties can be summarized by the following quotes from subject matter experts:

The dirty little secret of the NSDI community is that at most five people really, truly understand every part of Paxos.

  • NSDI reviewer


There are significant gaps between the description of the Paxos algorithm and the needs of a real-world system…the final system will be based on an unproven protocol.

  • Chubby authors


Because the Paxos algorithm is so difficult to understand and implement, Raft was designed with understandability as its number one goal. Before we jump into how the Raft algorithm addresses consensus, let's first look at why distributed systems and consensus are needed.

Why Distributed Systems & Consensus are Important

The simplest architecture for a service, whatever it may be, is a single centralized server that handles client requests. In this scenario, a client request comes in, the server processes it, and then responds appropriately. This all works well until the server crashes and can no longer respond to requests. When the likelihood and repercussions of downtime grow large enough that a more fault-tolerant system is required, it is time for a distributed system, or more specifically, a replicated state machine.

A replicated state machine consists of a number of identical, or replicated, servers that should be isolated both physically and electrically so that they can fail independently. This decreases the probability of unplanned downtime and increases confidence that there will always be a server available to respond to client requests. However, you cannot just spin up copies of the servers and allow them to respond to client requests independently, because their identical state must be maintained. For the states of the individual servers to remain in sync, there must be a protocol in place that ensures all requests go to all servers and that determines the order in which requests are received, processed, and responded to. This is generally referred to as log replication and is more easily understood with an example.

Imagine that two servers are being used to handle client requests for trading stocks. As mentioned above, all requests must go to both servers, but the client should be unaware that there are multiple servers. Picture a husband and wife making simultaneous requests against the same stock account: the husband requests the purchase of 120 more shares, and the wife requests the purchase of 50% more shares. What happens if these requests arrive at the servers in different orders and are executed immediately? Simply put, the clients could receive inconsistent results. Let's examine how.

For easy math, let's say the account contains 1000 shares. If server 1 first receives the husband's request to purchase 120 more shares and then receives the wife's request to purchase 50% more shares, the final number of shares in the account would be (1000 + 120) × 1.5 = 1680 shares.

Now imagine that server 2 first received the request to purchase 50% more shares and then received the request to purchase 120 more shares; the final balance on server 2 would be 1000 × 1.5 + 120 = 1620 shares.
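This divergence is easy to reproduce. The following is a minimal sketch in plain Go (not part of any Raft implementation) that applies the same two commands in opposite orders and ends up with two different "replica" states:

```go
package main

import "fmt"

func main() {
	buy120 := func(shares int) int { return shares + 120 }    // husband: buy 120 shares
	addHalf := func(shares int) int { return shares * 3 / 2 } // wife: buy 50% more shares

	server1 := addHalf(buy120(1000)) // server 1 applies buy120 first
	server2 := buy120(addHalf(1000)) // server 2 applies addHalf first

	fmt.Println(server1, server2) // prints 1680 1620: the replicas have diverged
}
```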

Although this is a simplified example, it shows how consistency is violated when the two servers behave individually instead of as a single replica. Not only is the married couple likely to notice the difference, they will most likely be using a competitor's service moving forward. This is why a consensus algorithm is required when dealing with a replicated state machine, and it is exactly what the Paxos and Raft algorithms solve.

The Raft Algorithm

The Raft algorithm has many similarities to Paxos, but the emphasis on understandability makes implementation more feasible for developers. Raft accomplishes this by focusing on leader election and burdening the leader with all of the responsibilities for ensuring proper log replication. Since all commits and elections require a majority vote, the minimum Raft cluster size is 3 servers, although a cluster size of 5 servers is the most common. With 3 servers in a cluster, 1 can fail and there can still be a majority, since candidates always vote for themselves. With 5 servers, two can fail and a majority vote can still be reached. The more servers that are introduced to the cluster, the lower the probability of not having a majority, but of course the tradeoff is cost.
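The majority requirement boils down to a simple calculation. Here is a quick sketch in Go (the function name is just illustrative) of how quorum size relates to cluster size:

```go
package raft

// quorum returns the number of votes needed for a majority: more than half
// of the cluster.
func quorum(clusterSize int) int {
	return clusterSize/2 + 1
}

// quorum(3) == 2, so a 3-server cluster tolerates 1 failure.
// quorum(5) == 3, so a 5-server cluster tolerates 2 failures.
```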

Each server in the cluster will be in one of three states: leader, follower, or candidate. The details of each role will be discussed as we break the algorithm into three core parts:

  1. Leader Election
  2. Log Replication
  3. Safety

Let's look at each of these core components individually.
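Throughout the discussion, it may help to picture the state each server carries. The sketch below (hypothetical Go field names, not from any particular library) captures the role and term information referenced in the rest of this post:

```go
package raft

// State is the role a server currently plays in the cluster.
type State int

const (
	Follower State = iota
	Candidate
	Leader
)

// Server holds the per-node bookkeeping discussed below.
type Server struct {
	state       State // follower, candidate, or leader
	currentTerm int   // latest election term this server has seen
	votedFor    int   // candidate this server voted for in currentTerm (-1 if none)
}
```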

Leader Election

All nodes begin in the follower state. If a follower does not hear from the leader before its election timeout expires, it converts to a candidate and sends out RequestVote remote procedure calls (RPCs) asking the other nodes to elect it leader. If it does not hear back from specific nodes, the RPCs are resent until it hears back, a majority is reached, a different leader sends an RPC or heartbeat, or the election timeout expires again.

The election timeout is set at random, typically between 150 and 300 ms, so generally there will only be one candidate at a time. In the rare case that two followers become candidates at the same time, a split vote may take place. If an odd number of nodes are voting, a majority will be reached. If one server is not responding, leaving an even number of nodes voting, a split vote can occur. Under these circumstances, no leader is elected, the election timeout expires again, and another candidate petitions to be leader. Randomly assigning the election timeout over a range of time drastically reduces the number of split votes, although they will still occur occasionally.
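A common way to implement the randomized timeout is simply to pick a fresh value in the suggested range each time the timer is reset; the snippet below is a small sketch of that idea (the function name is illustrative):

```go
package raft

import (
	"math/rand"
	"time"
)

// randomElectionTimeout picks a timeout in the suggested 150-300 ms range.
// Re-randomizing on every reset makes it unlikely that two followers time
// out, and therefore become candidates, at the same moment.
func randomElectionTimeout() time.Duration {
	return time.Duration(150+rand.Intn(151)) * time.Millisecond
}
```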

Raft keeps track of elections by using sequential terms. Terms are of arbitrary length: a term starts when a follower becomes a candidate and ends at the start of the next election. Each time there is an election, the candidate increments its term by 1. During a split vote, the candidates still increment the term by 1, but that term lasts only for the duration of the failed election; after the election timeout expires, the next election begins and the term is once again increased by 1. An example of how elections and terms are incremented can be seen in the figure below.

[Figure: election process]

Once a leader is elected, that node remains leader until the other nodes determine it to be unresponsive, which happens when their election timeouts expire, at which point another election occurs. However, the leader can reset the election timeouts of the other nodes in two ways:

  1. The leader will send AppendEntries RPCs in parallel to the other nodes when attempting to commit changes to the log (discussed in the log replication section below).
  2. The leader sends heartbeats in parallel to the other nodes; these are AppendEntries RPCs with no associated log entries. Heartbeats essentially serve as notification to the other nodes that the leader is still responsive.
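To make the heartbeat idea concrete, here is a sketch of what an AppendEntries request might carry, loosely following the field names used in the Raft paper (the Go types here are illustrative, not from any particular library):

```go
package raft

// LogEntry is described in more detail in the log replication section below.
type LogEntry struct {
	Command string
	Term    int
	Index   int
}

// AppendEntriesArgs loosely mirrors the arguments described in the Raft paper.
type AppendEntriesArgs struct {
	Term         int        // leader's current term
	LeaderID     int        // so followers know who the leader is
	PrevLogIndex int        // index of the entry immediately preceding the new ones
	PrevLogTerm  int        // term of that preceding entry
	Entries      []LogEntry // empty for a heartbeat
	LeaderCommit int        // index of the leader's latest committed entry
}

// A heartbeat is simply an AppendEntriesArgs value whose Entries slice is empty.
```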

It should now be clear how leader election works. Once a leader is elected, it must start handling log replication.

Log Replication

Log replication falls solely on the elected leader. Once a leader has been elected, it is immediately tasked with receiving client commands, appending them to the replicated logs once consensus is reached, and then responding to the client. This process is fairly straightforward when all nodes are up and running normally. However, a few caveats are introduced when the leader crashes or when nodes fall behind after becoming unresponsive for a period of time.

Before discussing the edge cases, let’s first explore how it works when all servers are running normally.

Reaching Consensus when all Servers are Responding

When the leader receives a command from the client, it first appends the command to its own log but does not yet commit the change. Each log entry contains three important pieces of information: the client command, the term number, and an integer index for its position in the log. Once the log entry has been appended to the leader's log, the leader must get confirmation from a majority of the other nodes that they too have stored the entry in their logs. It does this by sending AppendEntries RPCs in parallel to all of the nodes. Once a positive response is received from a majority, the leader knows that consensus has been reached and commits the log entry before serving a response to the client. By committing a change, the leader also commits all previously uncommitted entries in its log, including those from former leaders, since it can now confirm that they coincide with the majority of the nodes. In all subsequent heartbeats the leader includes the index of the latest commit so that there is transparency between nodes. This also allows nodes whose logs do not agree to update their logs accordingly (discussed in more detail shortly). The leader can be sure that its log is consistent with the majority of nodes at this point for two reasons:

  1. If two entries in different logs have the same index and term, then they must contain the same command. This is a valid assumption because a leader creates at most one entry with a given index in a given term, and logs are immutable, meaning log entries cannot be moved around once committed.
  2. If two entries in different logs have the same index and term, then all previous log entries are identical as well. This is valid because when the leader sends an AppendEntries RPC, it includes the index and term of the preceding log entry. If the node does not contain this preceding entry, it will not accept the new entry as valid and returns a negative response. Thus, the leader will not receive a majority of positive responses and cannot commit the change if this check is violated.
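A small sketch can tie these pieces together. Assuming the hypothetical LogEntry type above, the leader-side majority test and the follower-side consistency check from point 2 might look roughly like this (illustrative names, not a complete implementation):

```go
package raft

// LogEntry holds the three pieces of information carried by every entry.
type LogEntry struct {
	Command string // the client command
	Term    int    // the term in which the leader received the command
	Index   int    // the entry's position in the log
}

// canCommit reports whether the leader may commit an entry. acks counts the
// servers that have stored the entry, including the leader itself.
func canCommit(acks, clusterSize int) bool {
	return acks > clusterSize/2
}

// consistent is the follower-side check from point 2: the follower only
// accepts new entries if it already holds an entry matching the preceding
// index and term sent by the leader.
func consistent(log []LogEntry, prevLogIndex, prevLogTerm int) bool {
	if prevLogIndex == 0 {
		return true // nothing precedes the very first entry
	}
	for _, e := range log {
		if e.Index == prevLogIndex {
			return e.Term == prevLogTerm
		}
	}
	return false
}
```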

As long as all nodes are responsive, the logs stay consistent and these checks never fail. However, if servers are knocked offline or the leader crashes, inconsistencies can be introduced and will have to be corrected.

How the Leader Handles Inconsistencies in Log Replication

A follower becomes aware of an inconsistency between its own log and the leader's log when an AppendEntries RPC arrives and the preceding term and index do not match. At this point, the follower sends a negative response to the leader, refusing to append the entry. The leader then needs to resolve the differences in the follower's log by walking back along the log entry indexes until it reaches a point where the leader's and follower's logs agree. This is done in the following steps (sketched in code after the list):

  1. The leader sends AppendEntries RPCs to all nodes.
  2. A node responds negatively because its latest log entry's term and index do not match what was sent in the RPC.
  3. The leader keeps track of a nextIndex for each node. This is initialized to the index just after the leader's latest entry when the leader is elected.
  4. After a negative response from a node, the leader decrements its nextIndex for that node and resends the AppendEntries RPC with the new nextIndex value.
  5. This continues until a point is reached where the leader's log and the node's log match.
  6. At this point, the follower accepts the AppendEntries RPC and overwrites its log to match the leader's log.
  7. The node's log has now been updated to be in sync with the majority.
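The loop above can be condensed into a short sketch. The helper below is purely illustrative: it assumes 1-based log indexes, a sendAppendEntries callback that returns false when the follower rejects the consistency check, and it ignores networking and batching details:

```go
package raft

// LogEntry as described in the log replication section.
type LogEntry struct {
	Command string
	Term    int
	Index   int
}

// syncFollower walks nextIndex backwards, resending AppendEntries until the
// follower finds a matching preceding entry and accepts the new entries.
func syncFollower(leaderLog []LogEntry, nextIndex int,
	sendAppendEntries func(prevIndex, prevTerm int, entries []LogEntry) bool) {

	for nextIndex > 1 {
		prev := leaderLog[nextIndex-2] // entry preceding nextIndex (1-based indexes)
		if sendAppendEntries(prev.Index, prev.Term, leaderLog[nextIndex-1:]) {
			return // follower accepted; its log now matches the leader's
		}
		nextIndex-- // step 4: decrement and retry
	}
	// nextIndex reached 1: no preceding entry to match, so send the whole log.
	sendAppendEntries(0, 0, leaderLog)
}
```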

It is important to note that the leader will never change its own log and it is always the followers that amend their logs. This has to do with safety measures that are taken prior to electing a leader. These safety measures ensure that the leader’s logs are up to date. We will examine Raft safety in the next section.

Safety Regarding Leader Election

The safety measures imposed by the Raft algorithm put a restriction on leader elections. The restriction says that a candidate must have all of the committed entries that the majority of nodes have, or it cannot be elected. This rule is enforced by sending information about the candidate's log along with the RequestVote RPCs. Logs are first compared by the term of their last entry, and the log with the later term is considered more up to date. If the terms are the same, then whichever log is longer is considered more up to date. A node will not cast its vote for a candidate unless the candidate's log is at least as up to date as its own. This guarantees that a candidate with missing committed entries cannot be elected leader.
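Expressed as code, the voter's comparison is only a few lines. The sketch below (illustrative names) compares the term and index of each log's last entry exactly as described above:

```go
package raft

// candidateUpToDate reports whether a voter should consider the candidate's
// log at least as up to date as its own: compare the last entries' terms
// first, and fall back to log length when the terms are equal.
func candidateUpToDate(candLastTerm, candLastIndex, voterLastTerm, voterLastIndex int) bool {
	if candLastTerm != voterLastTerm {
		return candLastTerm > voterLastTerm
	}
	return candLastIndex >= voterLastIndex
}
```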

Bringing It All Together

Raft combines its core components of leader election, log replication, and safety to ensure that consensus is reached before responding to commands and that a consistent replicated state is always maintained among a majority of servers. However, there is one final consideration that has not yet been discussed: timing and server availability.

The Raft algorithm will be able to maintain proper leader election as long as the following holds true:

broadcastTime ≪ electionTimeout ≪ MTBF

Where:

  • broadcastTime is the average time it takes a server to send RPCs to the other servers and receive their responses.
  • electionTimeout is chosen at random from some interval (suggested to be 150-300 ms).
  • MTBF is the mean time between failures for a single node. This is assumed to be relatively long and should be satisfied in most circumstances.

This means that the distributed system cannot be fully asynchronous and must be at least eventually synchronous. It is okay if a server falls behind here and there, because it will just be considered unresponsive, and the Raft algorithm is meant to tolerate failures. However, it is recommended that the broadcastTime be an order of magnitude less than the electionTimeout to ensure that leader election functions properly; otherwise nothing can get done.

Up Next

Hopefully at this point you have a pretty good conceptual understanding of how the Raft consensus algorithm works. Topics like how cluster membership changes are achieved and how client requests are directed to the correct leader were not discussed in this post, as the aim was to convey a conceptual understanding.

In the next post we will try to implement a consensus algorithm for Docker containers based on the Raft algorithm.