Raft(Understandable Consensus Algorithm)

고승우·2024년 7월 20일
0

Raftify

목록 보기
2/4
post-thumbnail

Introduction

Raft is a consensus algorithm for managing a replicated log. It produces a result equivalent to (multi-)Paxos, and it is as efficient as Paxos, but its structure is different from Paxos

Raft is similar in many ways to existing consensus algorithms, but it has several novel features:

  • Strong Leader: Raft uses a stronger form of leadership than other consensus algorithms.
  • Leader election: Raft uses randomized timers to elect leaders.
  • Membership changes: Raft’s mechanism for changing the set of servers in the cluster uses a new joint consensus approach where the majorities of two different configurations overlap during transitions.

Replicated state machines

The consensus algorithm manages a replicated log containing state machine commands from clients. The state machines process identical sequences of commands from the logs, so they produce the same outputs.

  • They ensure safety (never returning an incorrect result) under all non-Byzantine conditions
  • They are fully functional (available) as long as any majority of the servers are operational and can communicate with each other and with clients
  • They do not depend on timing to ensure the consistency of the logs
  • Command can complete as soon as a majority of the cluster has responded to a single round of remote procedure calls

The Raft consensus algorithm

Raft decomposes the consensus problem into three relatively independent subproblems

  • Leader eletction: A new leader must be chosen when an existing leader fails
  • Log replication: The leader must accept log entries

State

Persistent state on all servers

  • CurrentTerm: latest term server has seen (initialized to 0 on first boot, increases monotonically)
  • votedFor: candidateId that received vote in current term (or null if none)
  • log[]: log entries; each entry contains command for state machine, and term when entry was received by leader (first index is 1)

Volatile state on all servers

  • commitIndex: index of highest log entry known to be committed (initialized to 0, increases
    monotonically)
  • lastApplied: index of highest log entry applied to state machine (initialized to 0, increases monotonically)

AppendEntries RPC

Invoked by leader to replicate log entries. Also used as heartbeat

Arguments

  • term: leader's term
  • leaderId: so follower can redirect clients
  • prevLogIndex: index of log entry immediately preceding new ones
  • prevLogTerm: term of prevLogIndex entry
  • entries[]: log entries to store (empty for heartbeat; may send more than one for efficiency)
  • leaderCommit: leader’s commitIndex

Results

  • term: currentTerm, for leader to update itself
  • success: true if follower contained entry matching prevLogIndex and prevLogTerm

Receiver implementation

  1. Reply false if term < currentTerm
  2. Updates its term if term > currentTerm
  3. Reply false if log doesn’t contain an entry at prevLogIndex
    whose term matches prevLogTerm
  4. If an existing entry conflicts with a new one (same index
    but different terms), delete the existing entry and all that
    follow it
  5. Append any new entries not already in the log
  6. If leaderCommit > commitIndex, set commitIndex =
    min(leaderCommit, index of last new entry)

RequestVote RPC

Invoked by candidates to gather votes

Arguments

  • term: candidate’s term
  • candidateId: candidate requesting vote
  • lastLogIndex: index of candidate’s last log entry
  • lastLogTerm: term of candidate’s last log entry

Results

  • term: currentTerm, for candidate to update itself
  • voteGranted: true means candidate received vote

Receiver implementation

  1. Reply false if term < currentTerm
  2. Updates its term if term > currentTerm
  3. If votedFor is null or candidateId, and candidate’s log is at
    least as up-to-date as receiver’s log, grant vote

Rules for Servers

All Servers

  • If commitIndex > lastApplied: increment lastApplied, apply log[lastApplied] to state machine
  • If RPC request or response contains term T > currentTerm: set currentTerm = T, convert to follower

Follwers

  • Respond to RPCs from candidates and leaders
  • If election timeout elapses without receiving AppendEntries
    RPC from current leader
    or granting vote to candidate: convert to candidate
  • If client contacts a follower, the follower redirects it to the leader

Candidates

  • On conversion to candidate, start election:
    - Increment current Term
    • Vote for self
    • Reset election timer
    • Send RequestVote RPCs to all other servers
  • If votes received from majority of servers: become leader
  • If AppendEntries RPC received from new leader: convert to
    follower
  • If election timeout elapses: start new election

Leaders

  • Upon election: send initial empty AppendEntries RPCs (heartbeat) to each server; repeat during idle periods to prevent election timeouts
  • If command received from client: append entry to local log, respond after entry applied to state machine
  • If last log index ≥ nextIndex for a follower: send AppendEntries RPC with log entries starting at nextIndex
    - If successful: update nextIndex and matchIndex for
    follower
    - If AppendEntries fails because of log inconsistency:
    decrement nextIndex and retry
  • If there exists an N such that N > commitIndex, a majority
    of matchIndex[i] ≥ N, and log[N].term == currentTerm: set commitIndex = N

Properties that Raft guarantee

  • Election Safety: At most one leader can be elected in a given term
  • Leader Append-Only: A leader never overwrites or deletes entries in its log
  • Log Matching: If two logs contain an entry with the same index and term, then the logs are identical in all entries up through the given index
  • Leader Completeness: If a log entry is committed in a given term, then that entry will be present in the logs of the leaders for all higher-numbered terms
  • State Machine Safety: If a server has applied a log entry at a given index to its state machine, no other server will ever apply a different log entry for the same index
profile
٩( ᐛ )و 

0개의 댓글