Basic Paxos

This week subject is a classic in distributed computing: Paxos which is so popular that it almost became a synonym of consensus.

First let’s start by examining the problem faced by distributed system and why we need a consensus algorithm.

The problem

In the (not so) old days there was a single server serving client requests. The requests (or commands) where executed sequentially and so was the system state updated.

No consensus needed with a single server. It represents a single point of failure and doesn’t scale if traffic increases.

So far so good, but when the server failed there was simply no service available. In order to improve availability and resiliency we added more servers. As we added more servers we still needed the clients to see the whole system as a single server. That means the behaviour of the distributed system should be the same (from the clients point of view) as if there was only a single server.

When a service is provided by multiple server they need to agree on the order in which the operations are processed in order to provide a consistent state across the system,

Let’s take a simple example to understand what it means. Let’s there is a server that manages bank accounts and we have an account with 100$ in. Then we have a client who makes a deposit of 100$ and the bank credit the interests on the same account. If theses 2 operations are processed simultaneously by 2 different servers the system outcome (i.e. the account balance) may vary depending on the order of the operations processed by the server:

  • If the client deposit comes first the balance becomes 200$ then the 5% interests are credited and the final balance is 210$.
  • If the 5% interest comes first the balance becomes 105$ and then the 100$ deposit is added and the final balance is 205$.

What we observe here is that to ensure the same outcome on all the servers, the servers need to agree on a processing order. Solving this ordering problem is made by having the servers agree on every transaction to process next. For each operation there is a consensus to be made among the servers.

Basic Paxos

In this section we’re focusing on a single round of consensus. This is basic Paxos. We can then apply another round for every command to be processed by the system.

The name

Paxos is the name of a greek island and the Paxos algorithm is inspired by the way the parliament was run on this island. Basically a law needed a majority of the votes to be adopted. The algorithm relies on this same idea that a proposed value must be accepted by a majority to be chosen.

The roles

There are 3 kind of roles in Paxos:

  • The proposers: these are the ones who propose new values to agree on. Typically the servers who handle client requests.
  • The acceptors: these are the ones who vote or “accept” a value. They also keep track of the decision process.
  • The learners: these are the ones who want to know the “chosen” values.

Any participant can play one or more  roles. This is often the case that the servers play all the roles.

The assumptions

In each round:

  • One or more servers proposes a value
  • The participants agree on a single value
  • Only one value may be chosen
  • Participants never learn a value unless it has been chosen

The following assumptions are also made on the system:

  • No byzantine failures: the servers behave correctly or fail completely. The algorithm doesn’t support malicious behaviour of the participants.
  • The system is eventually synchronous (most messages are successfully delivered and the number of lost messages is bounded).

The rules

Each round comprises 2 (or 3) phases:

  • a “prepare” phase: where the proposer checks that the acceptors are going to vote for his proposal.
  • an “accept” phase: where the acceptors vote for the proposal and “choose” a value.
  • and possibly a “learning” phase: where the learners are notified of the outcome of the round. (It doesn’t influence the way a value is chosen so let’s ignore this phase for now).

Each proposal has a unique number and higher numbers take priority over lower numbers.

To avoid number collision the proposal number should include some sort of server id. (e.g. <proposal #>.<server id>).

The algorithm

a typical 2 phase exchange between proposer and acceptor where the acceptor accepts the proposal.
The proposer

The proposer must maintain and persist one value: the max proposal number. Its behaviour is defined as follow:

  1. Choose a new proposal number higher the proposer’s max proposal number.
  2. Send a “prepare” message to all the acceptors with the new proposal number.
  3. when responses received from the majority of the acceptors:
    • if it received a response with an accepted proposal number  the acceptor replaces the proposed value with the received value.
  4. Send “accept” message with the  proposal number and the value.
  5. When responses received from the majority of the acceptors:
    • If there is a rejection (response with a number higher than the current proposal number). The value is not “chosen” so the proposer updates its max proposal number and start again from 1.
    • If no rejection the value is “chosen”.
The acceptor

The acceptor must maintain 3 values:

  • min proposal number
  • accepted proposal number
  • accepted value

Its behaviour is the counter-part of the proposer behaviour:

  1. When it receives a “prepare” message
    • If the received proposal number is greater than the acceptor’s min proposal value then update its min proposal
    • Then replies with a message containing the highest accepted proposal number and its associated  value (if any).
  2. When receives an “accept” message:
    • if the proposal number is greater or equal to the acceptor’s min proposal value, it updates its min proposal and its accepted proposal number with the associated value and then replies ok.
    • Then it replies with a message containing its min proposal value.

An interesting fact is that only the proposer knows that a value has been chosen. The acceptors only knows the value it has accepted but it doesn’t know wether the value is accepted by a majority. That’s why there is an additional “learning” phase where the chosen values gets dispatch to the “learners”.

Now let’s prove that this algorithm guarantee that only one value can be chosen. There are 3 cases to consider:

A previous value has already been chosen
The first proposal has already been chosen by a majority when the second comes in. Server 5 learns the chosen value in response of its propose phase.

The second proposer finds out about the previous accepted value in response of its propose phase. It then replace its value with the already accepted one. In the end both proposal are accepted with the same value.

A previous value has already been accepted
The first proposal has already been accepted by server 3 when the second comes in. Server 3 can’t accept the second proposal and the first one is chosen.

Similarly the second proposer finds out that a previous value has already been accepted by one of the acceptors. This is why it waits for a response from the majority. It guarantees that there is at least one acceptor that has seen any previous proposal.

A previous value has only been proposed
When the second proposal comes in the first one has not yet been accepted by a majority. The first proposal is eventually discarded.

Here no value has been accepted, therefore the second proposal will succeed and the first one will fail. The initial proposer will then have to try again and the outcome will be the second value. This time only the second value is chosen (while the first one is rejected).

These are all the possible case and with all of them the outcome is only a single value.

There is another case worth mentioning because it corresponds to a livelock. It may happen when the proposers keep sending new proposals without leaving a chance to the acceptors to accept any of them. This is usually handle easily by waiting a random amount of time before sending out a new proposal. This random delay gives the acceptor a chance to accept a value and avoid the livelock.

This covers basic Paxos and concludes this blog post. Next time I’m going to focus on multi-paxos and how it can be adapted to generate a distributed log.