Leader election process
Automated leader election in Tarantool helps guarantee that in a replica set
there is at most one leader at any given moment of time.
A leader is a writable node, and all other nodes are non-writable –
they accept read-only requests exclusively. This can be useful when an application
doesn’t want to support master-master replication, and it is necessary to
ensure that only one node accepts new transactions and commit them successfully.
When election is enabled, the life cycle of
a replica set is divided into so called
terms. Each term is described by a monotonically growing number.
After the first boot, each node has its term equal to 1. When a node sees that
it is not a leader and there is no leader available for some time in the replica
set, it increases the term and starts a new leader election round.
Leader election happens via votes. The node which started the election votes
for itself and sends vote requests to other nodes.
Upon receiving vote requests, a node votes for the first of them, and then cannot
do anything in the same term but wait for a leader being elected.
A node that collected a quorum of votes
becomes a leader
and notifies other nodes about that. Also a split-vote can happen
when no nodes received a quorum of votes. In this case,
after a random timeout,
each node increases its term and starts a new election round if no new vote
request with greater term is arrived during this time period.
Eventually a leader is elected.
All the non-leader nodes are called followers. The nodes that start a new
election round are called candidates. The elected leader sends heartbeats to
the non-leader nodes to let them know it is alive. So if there are no heartbeats
for a time period set by the replication_timeout
option, a new election is started. Terms and votes are persisted by
each instance in order to preserve certain Raft guarantees.
During the election, the nodes prefer to vote for those ones that have the
newest data. So as if an old leader managed to send something before its death
to a quorum of replicas, that data wouldn’t be lost.
When election is enabled, it is required
to have connections
between each node pair so as it would be the fullmesh topology. This is needed
because election messages for voting and other internal things need direct
connection between the nodes.
Also if election is enabled on the node, it won’t replicate from any nodes except
the newest leader. This is done to avoid the issue when a new leader is elected,
but the old leader still somehow survived and tries to send more changes
to the other nodes.
Term numbers also work as a kind of a filter. You can be sure that if election
is enabled on 2 nodes and
node1 has the term number less than
node2 won’t accept any transactions from