Cross Data Center Replication for the Enterprise: Presented by Adam Williams, Iron Mountain
October 11-14, 2016 • Boston, MA
Cross Data Center Replication for the Enterprise
Adam Williams, Search Lead, Iron Mountain
Objectives
• How Iron Mountain uses cross data center replication (CDCR)
• Our experiences with CDCR
• Disaster recovery options available
• What you need to run CDCR
• How to configure CDCR
• How to keep CDCR running daily
• What's next for Solr CDCR?
Iron Mountain Solr • Record Center Project
– 140,000 worldwide users
– Went live in 2013
– Users maintain and order records stored at Iron Mountain
– 5.3 billion documents stored in 38 clouds
– Completely virtual, internally hosted infrastructure (180 VMs)
– Hosted on Tomcat
– Early adopter of Solr 4
– Currently index at 140,000 documents per min (16 indexers, 2 million per min capacity)
– 11 million avg updates per day (15 min update SLA)
– 140,000 searches avg per day
– Customers rely on Iron Mountain for essential business processes such as claims processing, financials, and medical records
Iron Mountain Solr • Business Requirement
– Maintain a disaster recovery environment capable of being fully functional within 4 hours of an event
– Data accuracy must be within 15 minutes of production
– Active / Passive replication
Short History
Worked with several committers to develop for Iron Mountain:
• Developed under SOLR-6273
• Available in Solr 6
• Iron Mountain tested the functionality in our environments
• Running in production at Iron Mountain for over a year
• Assisted with developing formal documentation posted on the Solr wiki
Documentation:
http://yonik.com/solr-cross-data-center-replication/
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=62687462
Cross Data Center Replication (CDCR)
• Replicate data to multiple data centers (source and target)
• Data is replicated to the target once it is persisted to disk in the source
• Changes are replicated in near real-time based upon settings
• Assumes source and target are identical (or blank) when CDCR is introduced
• Shard leaders send updates to target cloud leaders, which replicate within the cloud
CDCR Apache documentation
Iron Mountain Experiences with CDCR
• CDCR has provided us with peace of mind and saved us on several occasions.
• It would take us approximately 2 weeks to recreate our indexes of 5.3 billion records from scratch.
• Confidence that we have a warm backup ready in case of a disaster.
• On two occasions we had corrupt indexes in production. We restored from the backups in our DR data center, resulting in less than one hour of downtime.
• The DR system allows us to run large queries and facets for maintenance/research activities without impacting production load.
Disaster Recovery
Apple Data Center, Mesa, Arizona - May 2015: solar panels catch fire
Disaster Recovery - Why not?
• Smaller companies are less likely to have disaster recovery capability
• Economy of scale is a challenge
• Achieving a "hot standby" is costly
• Approach must be reliable and rehearsed regularly
• Disaster is not necessarily a cataclysmic event; it could be the result of malicious acts (internal or external) or corrupt data.
2012 CRN study
Disaster Recovery - Backing up Solr
Is the backup going to load?
• Ever have this happen to you?
• If an index file is not fully copied, the index can be corrupt.
• This is a challenge with hot backups and disk mirroring with Solr.
Disaster Recovery Options with Solr

Option: Index to two instances in different data centers
Actors: Index (I) to Source (S) and to Target (T) at once
Risk: Often requires additional custom dev. No guarantee that the instances are identical.

Option: Disk Mirroring
Actors: Source (S) → Target (T)
Risk: What if the entire index file is not copied? What state is the disk in at the time of an abrupt event?

Option: Regular Backups
Actors: Source (S) → Target (T)
Risk: Works if you have low-volume index updates with a controlled schedule. Managing backups, storing offsite, and retrieving quickly when needed.

Option: Cross Data Center Replication
Actors: Source (S) → Target (T)
Risk: Ability to monitor and track replication to see that it is running properly.
Advantages of CDCR

Advantage: Can be controlled by an administrator / support personnel
Comment: Does not require storage or infrastructure personnel. Can be turned off/on easily compared to turning disk replication on/off. Can be monitored for latency and accuracy while the target system is running.

Advantage: Increase in confidence that the standby system is ready
Comment: Target system is fully functional and ready at all times.

Advantage: Works across data centers
Comment: If the target is unavailable, syncing will queue and restart when it is available.

Advantage: Data backup
Comment: Full index available in a remote data center location.
What you need to run CDCR
• Disk - We added additional disk space to queue up tlogs in case the target is unavailable for periods of time.
• Bandwidth - Reliable network between the source and target with capacity for exchanging your data quickly.
• Monitoring - Ability to monitor replication for issues. A monitoring tool such as Nagios or SolarWinds.
• Testing - Must run in a test environment to determine rate of change, settings, and bandwidth.
We did not need to add additional memory or CPU for CDCR.
Required Settings on Source
-DtargetZk=<targetZkhost1>,<targetZkhost2>,<targetZkhost3>,<targetZkhost4>,<targetZkhost5>
-DsourceCollection=<source collection name>
-DtargetCollection=<target collection name>

Startup settings for the source system:
- targetZk is the host names of the DR (target) ZooKeepers
- sourceCollection is the name of the clouds in the primary data center (source)
- targetCollection is the name of the cloud in the DR data center (target)
Solrconfig Settings

<requestHandler name="/cdcr" class="solr.CdcrRequestHandler">
  <lst name="replica">
    <str name="zkHost">${targetZk}</str>
    <str name="source">${sourceCollection}</str>
    <str name="target">${targetCollection}</str>
  </lst>
  <lst name="replicator">
    <str name="threadPoolSize">8</str>
    <str name="schedule">10</str>
    <str name="batchSize">2000</str>
  </lst>
  <lst name="updateLogSynchronizer">
    <str name="schedule">1000</str>
  </lst>
</requestHandler>

<updateRequestProcessorChain name="cdcr-processor-chain">
  <processor class="solr.CdcrUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
Solrconfig settings on source:
- replicator: threadPoolSize, schedule, batchSize
- updateLogSynchronizer: schedule
Solrconfig Settings

Parameter: threadPoolSize (replicator)
Required: No | Default: 2 | Iron Mountain: 8
Description: The number of threads to use for forwarding updates. One thread per replica is recommended.

Parameter: schedule (replicator)
Required: No | Default: 10 | Iron Mountain: 10
Description: The delay in milliseconds for monitoring the update log(s).

Parameter: batchSize (replicator)
Required: No | Default: 128 | Iron Mountain: 2000
Description: The number of updates to send in one batch. The optimal size depends on the size of the documents. Large batches of large documents can increase your memory usage significantly.

Parameter: schedule (updateLogSynchronizer)
Required: No | Default: 60000 | Iron Mountain: 1000
Description: The delay in milliseconds for synchronizing the update logs.

CDCR Apache documentation
Determining Solrconfig Settings
• Approach
– Determine the average size of your documents
– Identify rate of change
– Determine network capacity
– Stand up a scaled model in a test environment
– Index documents at various rates to the source and monitor throughput
– Use the CDCR API to collect throughput / performance metrics
– Run for brief periods in production on limited collections before going full-scale
Configuring and Monitoring CDCR
The API included in CDCR functionality allows you to actively control and monitor replication.

API Entry Points (Control)
• collection/cdcr?action=STATUS: Returns the current state of CDCR.
• collection/cdcr?action=START: Starts CDCR replication.
• collection/cdcr?action=STOP: Stops CDCR replication.
• collection/cdcr?action=ENABLEBUFFER: Enables the buffering of updates.
• collection/cdcr?action=DISABLEBUFFER: Disables the buffering of updates.

API Entry Points (Monitoring)
• core/cdcr?action=QUEUES: Fetches statistics about the queue for each replica and about the update logs.
• core/cdcr?action=OPS: Fetches statistics about the replication performance (operations per second) for each replica.
• core/cdcr?action=ERRORS: Fetches statistics and other information about replication errors for each replica.
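Because these entry points are plain HTTP calls, a scheduled check is easy to script. A minimal sketch in Python: the host, port, and collection name are placeholders, and the response shape (a top-level "status" object with "process" and "buffer" keys) follows the Apache CDCR documentation but should be verified against your Solr version.

```python
import json
import urllib.request

def parse_cdcr_status(response):
    """Pull the process/buffer state out of a CDCR STATUS response body."""
    status = response.get("status", {})
    return {"process": status.get("process", "unknown"),
            "buffer": status.get("buffer", "unknown")}

def fetch_cdcr_status(host, port, collection):
    """Call the STATUS entry point and return the parsed state."""
    url = f"http://{host}:{port}/solr/{collection}/cdcr?action=STATUS&wt=json"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return parse_cdcr_status(json.load(resp))

# Example (hypothetical host/collection): alert if replication is stopped.
# state = fetch_cdcr_status("solr-source-1", 8983, "cloud1")
# if state["process"] != "started": raise an alert
```

A wrapper like this can feed a Nagios or SolarWinds check directly, since it reduces the response to two strings to compare against expected values.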
Configuring and Monitoring CDCR

Consideration: Disk Size
Why? Enough disk space is needed to store tlog files. If the target data center is offline, the system will queue tlogs until the connection to the target is restored.
Approach: Separate partition for tlogs with enough space to queue tlogs for 24 hours.
Monitor: Alert if the tlogs directory's disk is greater than 60% full.
Configuring and Monitoring CDCR

Consideration: Connectivity to target ZooKeepers
Why? If the source cannot see the target ZooKeepers, then replication is being queued.
Approach: Make sure target ZooKeepers are accessible: -DtargetZk=zk1,zk2,zk3 -DsourceCollection=cloud1
Monitor: Ping test from Solr instances to the target ZooKeepers, or a log monitor to capture errors in Solr connecting to the target.
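The ping test can be automated as a simple TCP probe from each source Solr instance to the hosts listed in -DtargetZk. A sketch assuming the default ZooKeeper client port 2181; the hostnames in the example are placeholders.

```python
import socket

def zk_reachable(host, port=2181, timeout=3.0):
    """Return True if a TCP connection to the ZooKeeper node succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, DNS failures
        return False

def unreachable_zks(hosts):
    """Return the subset of target ZooKeepers that cannot be reached."""
    return [h for h in hosts if not zk_reachable(h)]

# Example (placeholder hostnames): run periodically from each Solr node.
# down = unreachable_zks(["zk1", "zk2", "zk3"])
# if down: raise an alert -- updates queue until connectivity returns
```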
Configuring and Monitoring CDCR

Consideration: Is CDCR enabled in source?
Why? It's simple, but essential.
Approach: Check CDCR status using the built-in API:
http://<host>:<port>/solr/<collection>/cdcr?action=status
Best practice: STOP and START the source after every deployment:
http://<host>:<port>/solr/<collection>/cdcr?action=stop
http://<host>:<port>/solr/<collection>/cdcr?action=start
Monitor: Every 5 minutes and after deployments / maintenance, verify that CDCR is enabled.
Configuring and Monitoring CDCR

Consideration: Are target buffers disabled in source?
Why? Otherwise tlog files will grow on the target and not be cleaned up, which leads to large disk usage and slowness as the node scans all of the tlogs.
Approach: Disable target buffers after each deployment:
http://<host>:<port>/solr/<collection>/cdcr?action=DISABLEBUFFER
Monitor: Every 5 minutes and after deployments / maintenance, verify that source buffers are disabled.
Configuring and Monitoring CDCR

Consideration: Is CDCR working within agreed SLAs?
Why? Validate that CDCR is working end-to-end.
Approach: Add a test document to the source cloud. Then check for it on the target, timing how long it takes to read it from the target. When done, delete it.
Monitor: After every deployment and several times a day. If the document is not found in the target after 5 minutes, throw an alert.
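The end-to-end check above can be scripted: index a probe document on the source, poll the target, and time the round trip. A hedged sketch using Solr's standard JSON update and select endpoints; the base URLs, probe id, and the 5-minute SLA value are assumptions to adapt to your environment.

```python
import json
import time
import urllib.parse
import urllib.request

SLA_SECONDS = 300  # alert threshold from the slide: 5 minutes

def sla_breached(elapsed_seconds, sla_seconds=SLA_SECONDS):
    """True once the probe has taken longer than the SLA to replicate."""
    return elapsed_seconds > sla_seconds

def probe_replication(source_url, target_url, doc_id="cdcr-probe-1"):
    """Index a probe doc on the source, then poll the target for it."""
    body = json.dumps([{"id": doc_id}]).encode()
    req = urllib.request.Request(f"{source_url}/update?commit=true", data=body,
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req, timeout=10)

    start = time.monotonic()
    while True:
        q = urllib.parse.urlencode({"q": f"id:{doc_id}", "wt": "json"})
        with urllib.request.urlopen(f"{target_url}/select?{q}", timeout=10) as r:
            found = json.load(r)["response"]["numFound"] > 0
        elapsed = time.monotonic() - start
        if found or sla_breached(elapsed):
            return found, elapsed  # found=False means: throw an alert
        time.sleep(5)
```

Afterwards, delete the probe document from the source (so the deletion also replicates) to keep the clouds clean.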
Configuring and Monitoring CDCR

Consideration: Latency
Why? If there is a spike in indexing, there can be some latency.
Approach: Use an API call to determine the queue size (bytes), number of tlog files, and last update operation time:
http://<host>:<port>/solr/<collection>/cdcr?action=QUEUES
Monitor: Every 15 minutes. Alert if tlogs grow greater than 100 files and the last update time is older than an hour.
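A sketch of that alert rule, plus a helper that pulls the aggregate tlog counters out of a QUEUES response. The field names (tlogTotalCount, tlogTotalSize) follow the Apache CDCR documentation but should be treated as assumptions for your Solr version; the thresholds are the ones from the slide.

```python
def extract_queue_stats(response):
    """Aggregate tlog counters from a CDCR QUEUES response body."""
    return {"tlogTotalCount": response.get("tlogTotalCount", 0),
            "tlogTotalSize": response.get("tlogTotalSize", 0)}

def tlog_backlog_alert(tlog_count, last_update_age_seconds):
    """Alert when tlogs exceed 100 files AND the last update is over an hour old."""
    return tlog_count > 100 and last_update_age_seconds > 3600

# Example wiring (hypothetical variables):
# stats = extract_queue_stats(json_body_from_QUEUES_call)
# if tlog_backlog_alert(stats["tlogTotalCount"], age_of_last_update): alert
```

Requiring both conditions avoids false alarms during short indexing spikes, where the tlog count climbs briefly but updates are still flowing.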
Configuring and Monitoring CDCR

Consideration: Performance
Why? How many changes (adds/deletes) are being processed per second?
Approach: Use an API call to determine the average performance per second:
http://<host>:<port>/solr/<collection>/cdcr?action=OPS
Monitor: Once daily, gather performance stats, store, and review. Stats can help you optimize performance.
What's next for data center replication?
• Active / Active - Ability to replicate between the target and source
• Selective replication - Ability to sync sets of data between data centers. A master source capable of syncing select data with replicas in remote data centers.
• CDCR was developed with our needs in mind. What are the needs of the community?