Leader election process
Automated leader election in Tarantool helps guarantee that
there is at most one leader at any given moment of time in a replica set.
A leader is a writable node, and all other nodes are non-writable –
they accept read-only requests exclusively.
When the election is enabled, the life cycle of
a replica set is divided into so-called
terms. Each term is described by a monotonically growing number.
After the first boot, each node has its term equal to 1. When a node sees that
it is not a leader and there is no leader available for some time in the replica
set, it increases the term and starts a new leader election round.
Leader election happens via votes. The node which started the election votes
for itself and sends vote requests to other nodes.
Upon receiving vote requests, a node votes for the first of them, and then cannot
do anything in the same term but wait for a leader being elected.
The node that collected a quorum of votes
becomes the leader
and notifies other nodes about that. Also, a split vote can happen
when no nodes received a quorum of votes. In this case,
after a random timeout,
each node increases its term and starts a new election round if no new vote
request with a greater term arrives during this time period.
Eventually, a leader is elected.
All the non-leader nodes are called followers. The nodes that start a new
election round are called candidates. The elected leader sends heartbeats to
the non-leader nodes to let them know it is alive. So if there are no heartbeats
for a period set by the replication_timeout
option, a new election starts. Terms and votes are persisted by
each instance in order to preserve certain Raft guarantees.
During the election, the nodes prefer to vote for those ones that have the
newest data. So as if an old leader managed to send something before its death
to a quorum of replicas, that data wouldn’t be lost.
When election is enabled, there must be connections
between each node pair so as it would be the full mesh topology. This is needed
because election messages for voting and other internal things need direct
connection between the nodes.
Also, if election is enabled on the node, it won’t replicate from any nodes except
the newest leader. This is done to avoid the issue when a new leader is elected,
but the old leader has somehow survived and tries to send more changes
to the other nodes.
Term numbers also work as a kind of a filter.
For example, you can be sure that if election
is enabled on two nodes and
node1 has the term number less than
node2 won’t accept any transactions from