1
Summer 2013
Replication and Replica Sets
William Zola, Member of Technical Staff, 10gen
2
Why Replication?
To keep your data safe
3
Why Replication?
To keep your data available
4
Why Replication?
Because bad things happen to good data centers
5
What is replication and why do we need it?
[Figure: replication copies Important Data to two copies of Important Data]
6
• Using replica sets for high availability
– PRIMARY, SECONDARY, and ARBITER nodes
– PRIMARY elections
• Using replica sets for disaster recovery
• Configure a replica set so there’s no single point of failure
• No-downtime maintenance
• Durability in a networked environment
Agenda
7
• Not new to database or system administration
• New to MongoDB or MongoDB replication
Audience
8
Use Cases
9
Stakeholders
14
• High Availability (automatic failover)
• Disaster Recovery
• No downtime for maintenance
– Backups
– Maintenance (index rebuilds, compaction)
• Replica Set is "transparent" to the application
• Read Scaling (extra copies to read from)
Use Cases
15
MongoDB Replication Basics
16
Replica Set Features
• A cluster of N servers
• Any (one) node can be primary
• All writes to primary
• Reads go to primary (default), optionally to a secondary
• Consensus election of primary
• Automatic failover
• Automatic recovery
[Figure: three nodes (Node 1, Node 2, Node 3); the primary receives WRITES and READS, the other nodes receive READS]
17
• Replica set is two or more nodes
Node 1
Node 2
Node 3
How MongoDB Replication works
18
• Election establishes the PRIMARY
• Data replicates from PRIMARY to SECONDARIES
Node 1
Node 2 Primary
Node 3
How MongoDB Replication works
data data
20
Planned (MAINTENANCE w/o DOWNTIME)
– Hardware upgrade
– O/S or file-system tuning
– Relocation of data to new file-system / storage
– Software upgrade
Unplanned (AUTOMATIC FAILOVER)
– Hardware failure
– Data center failure
– Region outage
– Human error
– Application corruption
Types of outage
21
Mechanics of Automatic Failover
22
• Data replicates from PRIMARY to SECONDARIES
Node 1
Node 2 Primary
Node 3
Mechanics of Automatic Failover
data data
23
• Election establishes the PRIMARY
• Data replicates from PRIMARY to SECONDARIES
• Primary might FAIL
Node 1
Node 2 Primary
Node 3
Mechanics of Automatic Failover
data data
24
Node 1 Node 3
• Automatic election of new PRIMARY if majority exists
Node 2 DOWN
negotiate new primary
✗
Mechanics of Automatic Failover
25
Node 1 Node 3
Node 2 DOWN
negotiate new primary
Mechanics of Automatic Failover
New PRIMARY elected
Primary
26
Node 1 Node 3
Node 2 RECOVERING
negotiate new primary
Primary
Mechanics of Automatic Failover
Automatic Recovery of Failed Node
Can perform full resync from secondary if necessary
27
• Once caught up, resumes syncing from primary
• Original replica set configuration is re-established
Node 1
Node 2
Node 3
Mechanics of Automatic Failover
Primary
28
Cluster Size and Rules of Failover
29
Primary Election
Primary
Secondary
Secondary
As long as a partition can see a majority (>50%) of the cluster, it will elect a primary.
A node must have a STRICT majority to be elected primary!
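The majority rule above can be sketched in a few lines of Python. This is an illustration of the arithmetic only, not MongoDB's election code; the function name `can_elect_primary` is made up for this example.

```python
def can_elect_primary(visible_votes: int, total_votes: int) -> bool:
    """Return True when the visible members form a strict majority (>50%)."""
    # Strict majority: exactly half of the votes is NOT enough.
    return 2 * visible_votes > total_votes


# The cases walked through on the following slides.
for visible, total in [(2, 3), (1, 3), (2, 4)]:
    outcome = "primary can be elected" if can_elect_primary(visible, total) else "read only mode"
    print(f"{visible} of {total} visible: {outcome}")
```

With three nodes, two visible members can elect a primary and one cannot; with four nodes, two visible members are exactly 50% and still cannot.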
30
Simple Failure
Primary
Failed Node
Secondary
66% of cluster visible. Primary is elected
Secondary
31
Failed Node
33% of cluster visible. Read only mode.
Failed Node
Secondary
Simple Failure
Secondary
Secondary
Primary
32
Network Partition
Primary
Secondary
Secondary
33
Network Partition
Primary
Secondary
Secondary
Primary
Failed Node
Secondary
66% of cluster visible.
Primary is elected
34
Secondary
Network Partition
33% visible. Read only mode.
Primary
Secondary
Failed Node
Failed Node
Secondary
35
Secondary
No “Split Brain” Problem
Primary
Secondary
A node must be elected by a strict majority of the set in order to be a primary
• Only the primary node can accept writes
• A replica set never has two primary nodes
36
Even Cluster Size
Primary
Secondary
Secondary
Secondary
37
Primary
Secondary
Secondary
Secondary
Failed Node
Secondary
Failed Node
50% of cluster visible. Read only mode.
Secondary
Even Cluster Size
38
Primary
Secondary
Failed Node
Secondary
Failed Node
50% of cluster visible. Read only mode.
Secondary
Secondary
Secondary
Even Cluster Size ✗
ODD = good
39
Types of Nodes
Regular
• Regular node holds a copy of your data
Arbiter
• Arbiter node has no data
• but it can vote! Use it to break ties
Secondary
• Secondary / all data nodes
• different priorities
• other configuration options
Primary
• A data node that won the election
40
Add an Arbiter!
Primary
Secondary
Secondary
Secondary
Arbiter
Add an arbiter node to break ties
• Odd number of votes in set
• Arbiter is lightweight – does not store data
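A short Python sketch, under the strict-majority rule stated earlier, shows why an odd number of votes is preferred. The helper names `majority` and `tolerated_failures` are illustrative only.

```python
def majority(votes: int) -> int:
    """Smallest number of votes that is strictly more than half."""
    return votes // 2 + 1


def tolerated_failures(votes: int) -> int:
    """How many voting members can be lost while a majority remains."""
    return votes - majority(votes)


for votes in (3, 4, 5):
    print(f"{votes} votes: majority is {majority(votes)}, "
          f"tolerates {tolerated_failures(votes)} failure(s)")
```

Going from three votes to four does not let the set survive any additional failure, while adding an arbiter as a fifth vote does.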
42
High Availability
44
No Downtime Maintenance
1. Take secondary out of set
2. Perform maintenance
3. Replace secondary in set
4. Wait for it to catch up
Secondary
Secondary
Secondary
Primary
45
No Downtime Maintenance
1. Take secondary out of set
2. Perform maintenance
3. Replace secondary in set
4. Wait for it to catch up
Secondary
Secondary
5. Step down the primary
(wait for new primary to be elected)
6. Repeat steps 1-4
Secondary
Primary
Primary
✓ ✓ ✓
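The six steps above can be written out as an ordered plan. The sketch below only generates the sequence of actions as text; it does not talk to MongoDB, and the node names and the function `rolling_maintenance_plan` are hypothetical.

```python
from typing import List


def maintenance_steps(node: str) -> List[str]:
    """Steps 1-4 for a single node that is currently a secondary."""
    return [
        f"take {node} out of set",
        f"perform maintenance on {node}",
        f"return {node} to set",
        f"wait for {node} to catch up",
    ]


def rolling_maintenance_plan(primary: str, secondaries: List[str]) -> List[str]:
    """Maintain every secondary first, then step down the primary and maintain it."""
    plan: List[str] = []
    for node in secondaries:
        plan += maintenance_steps(node)
    # Step 5: the primary steps down, so it becomes a secondary.
    plan.append(f"step down {primary} and wait for a new primary election")
    # Step 6: repeat steps 1-4 for the former primary.
    plan += maintenance_steps(primary)
    return plan


for number, step in enumerate(rolling_maintenance_plan("node2", ["node1", "node3"]), start=1):
    print(number, step)
```

At every point in the plan at most one node is out of the set, which is why the set keeps a majority throughout.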
46
Primary
Arbiter
Secondary
Is this a good configuration?
2 Replicas + Arbiter??
47
Primary
Arbiter
Secondary
2 Replicas + Arbiter??
1. Take secondary out of set
2. Perform maintenance
3. Primary node crashes
– Uh-oh!
– Replica set is down
– Data from the primary hasn’t been replicated
48
Use Three Data Nodes!
Primary
Secondary
Secondary
Use a minimum of three data nodes to ensure high availability
49
Avoid Single Points of Failure
51
Avoid Single Points of Failure
Primary
Secondary
Secondary
Top of rack switch
Rack falls over
52
Better
Primary
Secondary
Secondary
Loss of internet
DC burns down
53
Even Better
Secondary
Secondary
Primary
San Francisco
Dallas
54
Priorities
Secondary
Secondary
Primary
San Francisco
Dallas
Priority 1
Priority 1
Priority 0
Disaster recovery data center. Will never become primary automatically.
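As a rough illustration, the layout on this slide could be described by a replica set configuration document with a `priority` field per member. The sketch below builds such a document as a plain Python dict; the set name and hostnames are placeholders, and the helper `electable_hosts` is made up for this example.

```python
# Hypothetical set name and hostnames; only the document shape
# (_id, members, host, priority) follows the replica set configuration format.
config = {
    "_id": "my-set",
    "members": [
        {"_id": 0, "host": "sf1.example.net:27017", "priority": 1},
        {"_id": 1, "host": "sf2.example.net:27017", "priority": 1},
        # Disaster recovery node: priority 0 means it is never elected automatically.
        {"_id": 2, "host": "dallas1.example.net:27017", "priority": 0},
    ],
}


def electable_hosts(replica_set_config: dict) -> list:
    """Hosts that are allowed to become primary (priority greater than 0)."""
    return [m["host"] for m in replica_set_config["members"] if m["priority"] > 0]


print(electable_hosts(config))
```

Only the two San Francisco members are returned; the Dallas member still holds a copy of the data but stays a secondary.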
55
Even Better
Primary
Secondary
Secondary
San Francisco
Dallas
New York
Secondary
Secondary
56
Node Priority
Primary
Secondary
Secondary
Secondary
Secondary
Priority 10
Priority 10
Priority 5
Priority 5
Priority 0
Dallas
New York
San Francisco
57
Node Sizing
Primary
Secondary
Secondary
Secondary
Secondary
Priority 10
Priority 10
Priority 5
Priority 5
Priority 0
Dallas
New York
San Francisco
Nodes that can become primary should be sized equally
• RAM
• Disk
• IOPS
58
Recap
59
Replica Set Review
Primary
Secondary
Secondary
Replica set contains N nodes
• At most one node is the PRIMARY
• All writes go to the PRIMARY
• SECONDARY nodes contain up-to-date copies of the data
• SECONDARY nodes continually copy data from the PRIMARY
WRITES
60
Failover Review
Primary
Secondary
Secondary
If the PRIMARY fails, the Replica Set can elect a new PRIMARY
• A strict (>50%) majority is required for election
• The former PRIMARY will rejoin the set as a SECONDARY when it recovers
WRITES
61
Partition Review
A Network Partition prevents the nodes from communicating
• The Replica Set treats a partition as a “down node”
• A node must get a strict majority of the votes to be elected PRIMARY
• Even numbers of votes reduce availability
• Use Arbiters to break ties
• Spread your nodes across multiple data centers
Secondary
Primary
Secondary
62
Using Applications with Replica Sets
63
Application View
Application Code Here
MongoDB Driver
64
Replica Set
Under the Covers
Application Code Here
MongoDB Driver
Secondary
Secondary
Primary
Replica Set Connection:
my-set/host1:27017,host2:27017,host3:27017
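The seed list above uses the `set-name/host,host,host` form. Drivers also accept the standard `mongodb://` connection URI with a `replicaSet` option; the helper below, whose name is made up for this example, shows one way to convert between the two.

```python
def seed_list_to_uri(seed_list: str) -> str:
    """Convert 'set-name/host1:port,host2:port' into a mongodb:// URI."""
    set_name, hosts = seed_list.split("/", 1)
    return f"mongodb://{hosts}/?replicaSet={set_name}"


print(seed_list_to_uri("my-set/host1:27017,host2:27017,host3:27017"))
```

The driver uses the listed hosts only as a starting point and discovers the current primary from the set itself.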
65
Replica Set
Secondary Reads
Application Code Here
MongoDB Driver
Secondary
Secondary
Primary
Potentially Stale!
66
Replica Set
Failover
MongoDB Driver
Secondary
Secondary
Primary ✗
Connection Exception
Application Code Here
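During a failover the application sees a connection exception and has to decide what to do with the failed operation. A minimal sketch of one option, retrying until a new primary is available, is shown below. It uses Python's built-in `ConnectionError` and a fake `FlakyPrimary` class as stand-ins; a real driver raises its own exception types, and blindly retrying a write is only safe when the write is idempotent.

```python
import time


def with_retry(operation, attempts=5, delay_seconds=0.0):
    """Call operation(); on a connection error, wait and try again."""
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except ConnectionError as exc:  # stand-in for a driver's exception type
            last_error = exc
            time.sleep(delay_seconds)  # give the set time to elect a new primary
    raise last_error


class FlakyPrimary:
    """Fake server that fails a fixed number of times, then accepts writes."""

    def __init__(self, failures):
        self.failures = failures

    def write(self):
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionError("primary unreachable")
        return "ok"


print(with_retry(FlakyPrimary(failures=2).write))
```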
67
Replica Set
New Election
Application Code Here
MongoDB Driver
Secondary
Secondary
Primary
Secondary ✗
68
Durability and Replica Sets
69
• Wikipedia:
– In database systems, durability is the ACID property which guarantees that transactions that have committed will survive permanently.
Durability
70
The Lifetime of a Write Operation (single-node)
Application Code Here
MongoDB Driver
Journal
Data in RAM
Network Write
Validate Data
Update RAM
Update Journal
71
Get Last Error
Application Code Here
MongoDB Driver
Journal
Data in RAM
Network Write
getLastError command
getLastError Result
Validate Data
72
Write Concern
MongoDB Driver
Network Write
getLastError command
getLastError Result
Network Acknowledgement {w:0}
Check for Error {w:1}
Journal Sync {j:1}
76
Replica Sets and Durability
Primary
Secondary
Secondary
Secondary
Secondary
A write that has replicated to a majority of the nodes is durable
• The most up-to-date node will be elected primary
• The write will be present on that node
No guarantee of which nodes will have the write
• Use “tag sets” for finer-grained control
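The reasoning behind "majority means durable" is that any group of nodes large enough to elect a primary must overlap with the nodes that hold the write. The sketch below checks this exhaustively for a five-node set; it models the counting argument only, and the node names and function name are illustrative.

```python
from itertools import combinations


def write_survives_any_election(nodes, acknowledged_by):
    """True if every possible electing majority contains a node that has the write."""
    majority = len(nodes) // 2 + 1
    holders = set(acknowledged_by)
    # Larger groups contain a majority-sized group, so checking those is enough.
    return all(holders & set(group) for group in combinations(nodes, majority))


nodes = ["n1", "n2", "n3", "n4", "n5"]
print(write_survives_any_election(nodes, ["n1", "n2", "n3"]))  # write on a majority
print(write_survives_any_election(nodes, ["n1", "n2"]))        # write on a minority
```

A write held by three of five nodes is seen by every possible majority; a write held by only two can be missed if the other three elect a primary among themselves.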
✓
✓
✓
Durable!
77
Network Write Concern
MongoDB Driver
Network Write
getLastError command
getLastError Result
Specific Number of Nodes {w:2}
Majority of Data Nodes {w: "majority"}
Tag Set {w: "my tag set"}
Wait for timeout {w:2, wtimeout:2000}
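The options above are fields of the `getLastError` command that the driver sends after a write. A small sketch of how such a command document could be assembled is shown below; the builder function is hypothetical, and in practice the driver constructs this document from the write concern you configure.

```python
def get_last_error_command(w=None, j=False, wtimeout=None):
    """Build a getLastError command document from write concern options."""
    command = {"getLastError": 1}
    if w is not None:
        command["w"] = w  # a number of nodes, "majority", or a tag set name
    if j:
        command["j"] = True  # wait for the journal sync
    if wtimeout is not None:
        command["wtimeout"] = wtimeout  # milliseconds to wait for replication
    return command


print(get_last_error_command(w=2, wtimeout=2000))
print(get_last_error_command(w="majority", j=True))
```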
Replica Set
Primary
Secondary
Secondary
78
Wrapping it Up
79
Why Replication?
To keep your data safe and available
80
• High Availability (auto-failover)
• Disaster Recovery
• No downtime for maintenance
• Replica Set is "transparent" to the application
• Writes are durable with appropriate Write Concern
Features
81
• Easy to set up
– Try on a single machine
– Multiple nodes with different ports on a single host
• Check online documentation for RS tutorials
– http://docs.mongodb.org/manual/replication/#tutorials
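For the single-machine experiment, each node is a separate `mongod` process with its own port and data directory. The sketch below only prints the command lines; the set name, ports, and the `/data` directory are assumptions, and the processes would still have to be joined into one set afterwards (for example with `rs.initiate` in the mongo shell).

```python
def local_replica_set_commands(set_name, ports, data_root):
    """One mongod command line per node, all on the same host."""
    return [
        ["mongod", "--replSet", set_name, "--port", str(port),
         "--dbpath", f"{data_root}/{set_name}-{port}"]
        for port in ports
    ]


# Assumed values for illustration: three nodes under /data.
for command in local_replica_set_commands("my-set", [27017, 27018, 27019], "/data"):
    print(" ".join(command))
```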
Just Use It!
82
Questions?
83
Thank You!