GraphConnect Europe 2016 - Moving Graphs to Production at Scale - Ian Robinson
Overview
• Solution Architectures
• Hardware/Software Requirements
• HA Architecture
• Backups
• Monitoring
• Testing
Solution Architectures
Server
• Server infrastructure wraps embedded Neo4j
• Binary protocol (Bolt)
• Uniform drivers (Java, .NET, Python, JavaScript)
[Diagram: Application and Driver connect through a Load balancer to the servers over Cypher/Bolt]
Server with Procedures
• Server-side jar, called from Cypher
• Execute complex logic on server
• Close to the data
• Multiple operations per request
• Integrate with backend systems
• Graph global queries, schema introspection, etc.
[Diagram: Application and Driver connect through a Load balancer over Cypher/Bolt (a REST API connection is also shown); Procedures run inside the server]
https://github.com/neo4j-contrib/neo4j-apoc-procedures
Embedded
• Host Neo4j in application’s Java process
• Access to Neo4j’s Java APIs
[Diagram: Application calling Neo4j through Java APIs]
Hardware
CPU
• Intel Core i3 (minimum)
• Intel Core i7 (recommended)
• Neo4j scales with the number of cores
  • Requires Enterprise to scale beyond 4 cores
Disk
• SLC (single-level cell) SSD w/ SATA
• ext4 (recommended), ZFS
• Increase permitted number of open files to 40,000+
Memory
• Lots of RAM (for heap + page cache)
  • 8-12 GB heap (up to 24 GB)
  • Explicitly set page cache to (store size + 10% + headroom)
    – Otherwise defaults to 50% of RAM minus heap size (75% pre 2.3)
dbms.memory.pagecache.size=10g
neo4j.conf
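The page cache rule of thumb above can be worked through as a small calculation. This is only a sketch: the function names and the example machine (8 GB store, 32 GB RAM, 12 GB heap) are made up for illustration, and the default is read here as 50% of (RAM minus heap).

```python
def pagecache_size_gb(store_size_gb: float, headroom_gb: float = 0.0) -> float:
    """Store size + 10% + headroom, as suggested on the slide."""
    return store_size_gb * 1.10 + headroom_gb


def default_pagecache_gb(ram_gb: float, heap_gb: float) -> float:
    """Assumed reading of the 2.3+ default: 50% of (RAM - heap)."""
    return 0.5 * (ram_gb - heap_gb)


if __name__ == "__main__":
    # Hypothetical machine: 8 GB store, 1.2 GB headroom, 32 GB RAM, 12 GB heap.
    explicit = pagecache_size_gb(8, headroom_gb=1.2)
    print(f"dbms.memory.pagecache.size={explicit:.0f}g")
    print(f"default would be about {default_pagecache_gb(32, 12):.0f}g")
```

Setting the value explicitly keeps the page cache tied to the store size rather than to whatever RAM happens to be left over.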
Software
Java
• OpenJDK 8 or Oracle Java 8
• IBM JDK 8 on POWER8
• G1 garbage collector
  • Default from 2.3
  • JDK 1.7.0_71 or later
Operating System
• Linux
• HP UX
• Windows 2012
wrapper.java.additional=-XX:+UseG1GC
neo4j-wrapper.conf (pre 2.3)
EC2 Instances
• HVM (hardware virtual machine) over PV (paravirtual)
• C3 or C4 (compute-optimized)
  • E.g. c4.2xlarge (15 GiB RAM, 8 vCPU, 1000 Mbps EBS throughput)
• R3 (memory-optimized)
  • E.g. r3.xlarge (30.5 GiB RAM, 4 vCPU)
  • Not EBS-optimized by default
• Use HA clustering and online backups for increased durability
  • Distribute cluster across Availability Zones in a Region
Local Storage
• SSD or HDD
• Highest I/O performance
• Included in virtual server
• Up to 8 x 800 GB SSD (i2.8xlarge) or 24 x 2000 GB HDD (d2.8xlarge)
• Lost when EC2 instance is terminated
Elastic Block Store (EBS)
• Attached to EC2 instance via network connection
• Up to 16 TB SSD
• Persist even if EC2 instance is terminated
• Use EBS-optimized EC2 instances for dedicated throughput to EBS
• Provisioned IOPS (io1) for predictable performance
  • Up to 30 IOPS per GiB
    – E.g. 300 GiB volume, 9000 IOPS
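The io1 arithmetic above (30 IOPS per GiB, so 300 GiB allows 9000 IOPS) can be turned around to size a volume for a target IOPS figure. A minimal sketch, assuming the ratio quoted on the slide; the function names are illustrative, and current AWS limits may differ.

```python
IO1_MAX_IOPS_PER_GIB = 30  # ratio quoted on the slide; check current AWS limits


def max_provisioned_iops(volume_gib: int, ratio: int = IO1_MAX_IOPS_PER_GIB) -> int:
    """Highest IOPS that can be provisioned for a volume of the given size."""
    return volume_gib * ratio


def min_volume_gib(target_iops: int, ratio: int = IO1_MAX_IOPS_PER_GIB) -> int:
    """Smallest volume (in whole GiB) that allows the target IOPS."""
    return -(-target_iops // ratio)  # integer ceiling division


if __name__ == "__main__":
    print(max_provisioned_iops(300))
    print(min_volume_gib(9000))
```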
HA Architecture
[Diagram: three Neo4j HA instances (1, 2 and 3), each made up of Database, Transaction Propagation and Cluster Management; one instance is labelled Master]
Cluster Configuration
Joining Cluster
• ha.initial_hosts (neo4j.conf)
  • List of servers to contact when joining cluster
  • All hosts must be available when starting instance
  • For large clusters, supply only a small number of hosts, e.g. 3
Pull and Push Transactions
• ha.pull_interval=10s (off by default)
• ha.tx_push_factor=1 (default, but best efforts only)
Tuning
• ha.heartbeat_timeout=11s (default)
  • Heartbeats sent, by default, every 5s
  • Increase timeouts if pauses cause heartbeats to be delayed
  • Warning: it will take longer to discover an instance has failed
• ha.role_switch_timeout=120s (default)
  • Increase if new instances time out while catching up with master on startup
HA Role Endpoints – Useful for Load Balancing
Endpoint                          State     Status Code     Body
/db/manage/server/ha/master       Master    200 OK          true
                                  Slave     404 Not Found   false
                                  Unknown   404 Not Found   UNKNOWN
/db/manage/server/ha/slave        Master    404 Not Found   false
                                  Slave     200 OK          true
                                  Unknown   404 Not Found   UNKNOWN
/db/manage/server/ha/available    Master    200 OK          master
                                  Slave     200 OK          slave
                                  Unknown   404 Not Found   UNKNOWN
From 2.3 onwards:
dbms.security.ha_status_auth_enabled=false
neo4j.conf
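A client or health-check script could interpret the /db/manage/server/ha/available endpoint roughly as follows. This is a sketch only: the function names and the base URL in the usage note are assumptions, and the probe presumes the endpoint is reachable without authentication.

```python
import urllib.error
import urllib.request
from typing import Tuple


def role_from_available(status: int, body: str) -> str:
    """Map a response from /db/manage/server/ha/available to a role name."""
    text = body.strip().lower()
    if status == 200 and text in ("master", "slave"):
        return text
    return "unknown"


def probe(base_url: str,
          path: str = "/db/manage/server/ha/available",
          timeout: float = 2.0) -> Tuple[int, str]:
    """Fetch the endpoint; returns (0, "") if the instance is unreachable."""
    try:
        with urllib.request.urlopen(base_url + path, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8")
    except urllib.error.HTTPError as err:  # a 404 response still carries a body
        return err.code, err.read().decode("utf-8")
    except OSError:  # connection refused, timeout, DNS failure
        return 0, ""
```

For example, `role_from_available(*probe("http://10.0.1.10:7474"))` would return "master", "slave" or "unknown" for a hypothetical instance at that address.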
HA JMX Endpoint
JSON Response
• Alive?
• Role
• Last committed transaction ID
• Instances in cluster
  • Role
  • Instance ID
  • Available?
  • URI
Identify slaves falling behind
Does everyone agree on composition of cluster?
/db/manage/server/jmx/domain/org.neo4j/instance%3Dkernel%230%2Cname%3DHigh%20Availability
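The path above is the percent-encoded form of the JMX bean name `instance=kernel#0,name=High Availability`. A small sketch of building it with the standard library; the constant and function names are illustrative only.

```python
from urllib.parse import quote, unquote

JMX_BASE = "/db/manage/server/jmx/domain/org.neo4j/"
HA_BEAN = "instance=kernel#0,name=High Availability"


def jmx_path(bean_name: str = HA_BEAN) -> str:
    """Percent-encode the bean name so '=', '#', ',' and spaces survive in a URL."""
    return JMX_BASE + quote(bean_name, safe="")


if __name__ == "__main__":
    path = jmx_path()
    print(path)
    print(unquote(path[len(JMX_BASE):]))  # decodes back to the bean name
```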
Cross-DC Clusters
• Same subnet (consider using a VPN)
• Bandwidth between DCs aligned with write throughput
• Common practice: instances in secondary run as slave-only
  • Restricts master election to the primary
  • ha.slave_only=true (neo4j.conf)
• When failing over, reconfigure instances in secondary
  • ha.slave_only=false (neo4j.conf)
Scale Horizontally For High Read Throughput
[Diagram: Application in front of a single Load Balancer (e.g. HAProxy, ELB, NGINX), which fronts a Master and two Slaves]
[Diagram: Application in front of a separate Read Load Balancer and Write Load Balancer, which front a Master and two Slaves]
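The split into a read and a write load balancer can be sketched in a few lines: writes go to the master, reads rotate across the slaves. This is a simplified illustration with made-up names; a real load balancer would discover roles from the HA endpoints rather than take a fixed list.

```python
from itertools import cycle
from typing import Iterator, List


class ReadWriteRouter:
    """Sends writes to the master and spreads reads round-robin over slaves."""

    def __init__(self, master: str, slaves: List[str]) -> None:
        if not slaves:
            raise ValueError("at least one slave is needed for read routing")
        self.master = master
        self._slaves: Iterator[str] = cycle(slaves)

    def route(self, is_write: bool) -> str:
        """Return the address that should serve the request."""
        return self.master if is_write else next(self._slaves)


if __name__ == "__main__":
    # Hypothetical addresses, for illustration only.
    router = ReadWriteRouter("10.0.1.10:7474", ["10.0.1.11:7474", "10.0.1.12:7474"])
    for is_write in (True, False, False, False):
        print("write" if is_write else "read", "->", router.route(is_write))
```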
HAProxy Configuration
http://blog.armbruster-it.de/2015/08/neo4j-and-haproxy-some-best-practices-and-tricks/
Configure HAProxy as Read Load Balancer
global
    daemon
    maxconn 256

defaults
    mode http
    timeout connect 5000ms
    timeout client 50000ms
    timeout server 50000ms

frontend http-in
    bind *:80
    default_backend neo4j-slaves

backend neo4j-slaves
    option httpchk GET /db/manage/server/ha/slave
    server s1 10.0.1.10:7474 maxconn 32 check
    server s2 10.0.1.11:7474 maxconn 32 check
    server s3 10.0.1.12:7474 maxconn 32 check

listen admin
    bind *:8080
    stats enable
Health check responses from /db/manage/server/ha/slave:
Master    404 Not Found   false
Slave     200 OK          true
Unknown   404 Not Found   UNKNOWN
Improve Read Performance with Cache Sharding
[Diagram: Application sends queries through a Load Balancer to instances 1, 2 and 3]
MATCH (c:Country{name:'Australia'})...
MATCH (c:Country{name:'Zambia'})...
MATCH (c:Country{name:'Norway'})...
Cache Sharding Using Consistent Routing
[Diagram: the Load Balancer routes each query by country name: A-I to instance 1, J-R to instance 2, S-Z to instance 3]
MATCH (c:Country{name:'Australia'})... routed to 1
MATCH (c:Country{name:'Norway'})... routed to 2
MATCH (c:Country{name:'Zambia'})... routed to 3
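The consistent routing rule in the diagram (A-I to instance 1, J-R to instance 2, S-Z to instance 3) can be sketched as a small function. The function name is illustrative; in practice the rule would live in the load balancer configuration rather than in application code.

```python
def instance_for(country: str) -> int:
    """Pick an instance by the first letter of the country name."""
    first = country.strip()[:1].upper()
    if "A" <= first <= "I":
        return 1
    if "J" <= first <= "R":
        return 2
    if "S" <= first <= "Z":
        return 3
    raise ValueError(f"cannot route country name: {country!r}")


if __name__ == "__main__":
    for name in ("Australia", "Zambia", "Norway"):
        print(name, "->", instance_for(name))
```

Because the same country always lands on the same instance, each instance's page cache only has to hold its own part of the graph.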
Configure HAProxy for Cache Sharding
global
    daemon
    maxconn 256

defaults
    mode http
    timeout connect 5000ms
    timeout client 50000ms
    timeout server 50000ms

frontend http-in
    bind *:80
    default_backend neo4j-slaves

backend neo4j-slaves
    balance url_param country_code
    server s1 10.0.1.10:7474 maxconn 32
    server s2 10.0.1.11:7474 maxconn 32
    server s3 10.0.1.12:7474 maxconn 32

listen admin
    bind *:8080
    stats enable
Backups
Modes
• Full
• Incremental
  • On top of a previous backup
  • Uses logical logs to apply changes, so logs must be kept at least 2x backup interval
Consistency Check
• Part of full backup and standalone tool
• Evaluate store health
• -verify false to disable in backup
dbms.tx_log.rotation.retention_policy=7 days (default)
neo4j.conf
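The "at least 2x backup interval" rule for log retention can be checked with simple date arithmetic. A minimal sketch; the function name is an assumption, and the values mirror the 7-day default and an hourly incremental backup.

```python
from datetime import timedelta


def retention_is_sufficient(retention: timedelta, backup_interval: timedelta) -> bool:
    """Logical logs should cover at least twice the incremental backup interval."""
    return retention >= 2 * backup_interval


if __name__ == "__main__":
    default_retention = timedelta(days=7)  # dbms.tx_log.rotation.retention_policy
    print(retention_is_sufficient(default_retention, timedelta(hours=1)))
    print(retention_is_sufficient(default_retention, timedelta(days=5)))
```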
Backup Strategies
• Local or remote backups
  • If backing up to remote machine, consistency check takes place offline with respect to the database
• Backup from a dedicated slave or round robin
• Choose a schedule:
  • Full once per day, incremental every hour
• To restore from backup:
  • Stop instance
  • Replace graph.db with backup
  • Start instance
[Diagram: a Backup Server backing up A, B and C in turn]
A – full, consistency check
B – full, consistency check
C – full, consistency check
A – incremental
B – incremental
C – incremental
…
A – incremental
B – incremental
C – incremental
A – full, consistency check
B – full, consistency check
C – full, consistency check
bin/neo4j-backup \
  -from single://neo4j.example.org:20000 \
  -to /backups/201510151318263/graph.db -verify true|false
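The rotation shown above (one full backup per day for each of A, B and C, then hourly incrementals) can be generated programmatically. This is a sketch under assumptions: the function name, the tuple layout and the choice of hour 0 for the full backup are illustrative and not taken from the talk.

```python
from typing import List, Sequence, Tuple


def daily_plan(instances: Sequence[str], full_hour: int = 0) -> List[Tuple[int, str, str]]:
    """Return (hour, instance, mode) entries for one day of backups."""
    plan = []
    for hour in range(24):
        mode = "full" if hour == full_hour else "incremental"
        for name in instances:
            plan.append((hour, name, mode))
    return plan


if __name__ == "__main__":
    for hour, name, mode in daily_plan(["A", "B", "C"])[:6]:
        print(f"{hour:02d}:00 {name} - {mode}")
```

A scheduler such as cron could then invoke bin/neo4j-backup once per entry.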
Monitoring
Pull
• Metrics available via JMX and HTTP and in browser
Push
• Metrics publishing from 2.3 onwards (Enterprise)
  • Node, relationship, property counts
  • Network/cluster
  • Transactions (active, started, committed, rolled back, etc)
  • Neo4j page cache (page faults, evictions, flushes, exceptions)
  • JVM
• Published to:
  • Graphite
  • Ganglia
  • CSV
metrics.graphite.enabled=true
metrics.graphite.server=52.29.63.174:2003
metrics.prefix=neo4j-1
neo4j.conf
Collate Internal and External Views of the System
System
• collectd
Database
• Metrics
• Tail neo4j.log
HA Endpoints
• /db/manage/server/ha/master
• /db/manage/server/ha/slave
Server Latencies
• http.log
Cypher Queries
• dbms.logs.query.enabled=true
• dbms.logs.query.threshold=2s
Application metrics
• End-to-end latencies
Test at Scale
Soak Tests
• Representative dataset and queries
• Peak load and above
Verify
• Correctness
• Performance
  • Latency
  • Throughput
• Stability
Operations
• Backup
• Disaster recovery
• Replace instances
Performance Tip – Use the Cypher Query Planner
[Query plan comparison: 8,386,880 hits vs. 59,272 hits]
CREATE INDEX ON :Crime(description)
Performance Tip – Write Requests
• Align the number of concurrent write requests with the number of Neo4j server threads on the master
  • By default, number of server threads = number of CPUs reported available by the JVM
  • Configure the number of threads in neo4j.conf using org.neo4j.server.webserver.maxthreads
• Service requests from a thread pool in your application
  • Use the thread pool queue depth to apply back pressure
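Applying back pressure through a bounded thread pool might look like the following. It is a sketch, not the speaker's implementation: the class name, the timeout and the use of a semaphore to cap running plus queued requests are all assumptions, since Python's ThreadPoolExecutor has no built-in queue limit.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable


class BoundedWriter:
    """Thread pool that rejects new writes once its queue depth is reached."""

    def __init__(self, workers: int, queue_depth: int) -> None:
        self._pool = ThreadPoolExecutor(max_workers=workers)
        # One permit per request that is either running or waiting in the queue.
        self._permits = threading.BoundedSemaphore(workers + queue_depth)

    def submit(self, write: Callable[[], object], timeout: float = 1.0) -> Future:
        """Queue a write, or raise if no slot frees up within `timeout` seconds."""
        if not self._permits.acquire(timeout=timeout):
            raise RuntimeError("write queue full, apply back pressure upstream")
        try:
            future = self._pool.submit(write)
        except Exception:
            self._permits.release()
            raise
        # Give the permit back when the write finishes, whether it succeeded or not.
        future.add_done_callback(lambda _: self._permits.release())
        return future

    def close(self) -> None:
        self._pool.shutdown(wait=True)
```

Setting `workers` to match the number of server threads on the master keeps the application from sending more concurrent writes than the master will service.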
Performance Tip – Batch Writes Using a Queue
[Diagram: multiple Write requests feed a Queue; a Single Thread drains the queue and commits each Batch]
http://maxdemarzi.com/2013/09/05/scaling-writes/
http://maxdemarzi.com/2014/07/01/scaling-concurrent-writes-in-neo4j/
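The queue-and-single-thread pattern can be sketched as below. The class name, batch size and wait time are illustrative assumptions, and `commit` stands in for whatever call writes a batch to Neo4j in one transaction.

```python
import queue
import threading
from typing import Callable, List

_STOP = object()  # sentinel that tells the writer thread to finish


class BatchingWriter:
    """Collects writes on a queue; one thread commits them in batches."""

    def __init__(self, commit: Callable[[List[object]], None],
                 batch_size: int = 100, max_wait: float = 0.05) -> None:
        self._queue: "queue.Queue[object]" = queue.Queue()
        self._commit = commit
        self._batch_size = batch_size
        self._max_wait = max_wait
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def write(self, item: object) -> None:
        self._queue.put(item)

    def close(self) -> None:
        """Flush whatever is queued and stop the writer thread."""
        self._queue.put(_STOP)
        self._thread.join()

    def _run(self) -> None:
        stopping = False
        while not stopping:
            first = self._queue.get()  # block until there is work
            if first is _STOP:
                break
            batch = [first]
            while len(batch) < self._batch_size:
                try:
                    item = self._queue.get(timeout=self._max_wait)
                except queue.Empty:
                    break  # nothing more arrived in time, commit what we have
                if item is _STOP:
                    stopping = True
                    break
                batch.append(item)
            self._commit(batch)
```

Because only one thread commits, writes never contend with each other for locks, and each transaction amortises its overhead across the whole batch.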
Performance Tip – JVM
• Look for GC pauses in debug.log
  • grep blocked data/databases/graph.db/debug.log
• Caused by
  • Heap too small
  • New/survivor space too small
  • Badly written Cypher query or stored procedure
Enable GC Logging
Log will be written to logs/neo4j-gc.log
wrapper.java.additional=-Xloggc:logs/neo4j-gc.log
wrapper.java.additional=-XX:+PrintGCDetails
wrapper.java.additional=-XX:+PrintGCDateStamps
wrapper.java.additional=-XX:+PrintGCApplicationStoppedTime
wrapper.java.additional=-XX:+PrintTenuringDistribution
wrapper.java.additional=-XX:+PrintGCCause
neo4j-wrapper.conf