bigdata® Presented at Xcelerate 3/18/2014
SYSTAP, LLC
Graphs, Graph Databases,
Graph Analytics on GPUs
1 http://www.bigdata.com/blog
bigdata® Presented at Xcelerate 3/18/2014
Graph Database
• High performance, scalable
– 50B edges/node
– High-level query language
– Efficient graph traversal
– High-9s solution
• Open Source (subscriptions)
– Autodesk, EMC, market data, genomics and personalized medicine, etc.
GPU Analytics
• Extreme performance
– 5-100x faster than GraphLab
– 10,000x faster than graph databases
• DARPA funding
• Disruptive technology
– Early adopters
– Huge ROIs
• Open Source
SYSTAP™, LLC © 2006-2013 All Rights Reserved
2 http://www.bigdata.com/blog
Small Business, Founded 2006 100% Employee Owned
SYSTAP, LLC
bigdata® Presented at Xcelerate 3/18/2014
Related “Graph” Technologies
SYSTAP™, LLC © 2006-2013 All Rights Reserved
3 http://www.bigdata.com/blog
Redpoint repositions existing technology. MPGraph compares favorably with high end hardware solutions from YARC, Oracle, and SAP, but is open source and uses commodity hardware.
Pair up bigdata and MPGraph
STTR
bigdata® Presented at Xcelerate 3/18/2014
Similar models, different problems
• Graph query and graph analytics (traversal/mining)
– Related data models
– Very different computational requirements
• Many technologies are a bad match or a limited solution
– Key-value stores (Bigtable, Accumulo, Cassandra, HBase)
– Map-reduce
• Anti-pattern
– Dump all data into a “big bucket”
SYSTAP™, LLC © 2006-2013 All Rights Reserved
4 http://www.bigdata.com/blog
bigdata® Presented at Xcelerate 3/18/2014
Similar models, different problems
• Graph query and graph analytics (traversal/mining)
– Related data models
– Very different computational requirements
• Many technologies are a bad match or a limited solution
– Key-value stores (Bigtable, Accumulo, Cassandra, HBase)
– Map-reduce
• Anti-pattern
– Dump all data into a “big bucket”
Storage and computation patterns must be correctly matched for high performance.
SYSTAP™, LLC
© 2006-2013 All Rights Reserved
5 http://www.bigdata.com/blog
bigdata® Presented at Xcelerate 3/18/2014
Optimize for the right problem
• Graph Query
– Declarative query language (SPARQL)
• Query optimization is critical for performance.
– Index locality (1D partitioning, multiple indices)
• Get everything about a subject on one page of the index.
– Scale-out must flow queries over the data
• Otherwise it slams the network and the client.
– Must order and constrain joins to read as little data as possible
• As-bound vectored nested index joins (bigdata)
• Sideways information passing and merge joins (RDF3X)
SYSTAP™, LLC © 2006-2014 All Rights Reserved
6 http://www.bigdata.com/blog
bigdata® Presented at Xcelerate 3/18/2014
Vectored Query in Scale-Out
SYSTAP™, LLC © 2006-2012 All Rights Reserved
7 http://www.bigdata.com/blog
bigdata® Presented at Xcelerate 3/18/2014
Optimize for the right problem
• Storage and computation patterns must be correctly matched for high performance.
• Graph analytics:
– Parallelism – work must be distributed and balanced.
– Memory bandwidth – memory, not disk, is the bottleneck.
– 2D partitioning – O(N) communications pattern (versus O(N*N))
SYSTAP™, LLC © 2006-2014 All Rights Reserved
8 http://www.bigdata.com/blog
(Chart: BFS and PageRank benchmark results.)
bigdata® Presented at Xcelerate 3/18/2014
Accelerated Graph Analytics
SYSTAP™, LLC © 2006-2013 All Rights Reserved
9 http://www.bigdata.com/blog
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
10 http://www.bigdata.com/blog
The Semantic Web
• The Semantic Web is a stack of standards developed by the W3C for the interchange and query of metadata and graph-structured data.
– Open data
– Linked data
– Multiple sources of authority
– Self-describing data and rules
– Federation or aggregation
– And, increasingly, provenance
bigdata® Presented at Xcelerate 3/18/2014
11 http://www.bigdata.com/blog
The Standards or the Data?
TBL, 2000. S. Bratt, 2006.
bigdata® Presented at Xcelerate 3/18/2014
12 http://www.bigdata.com/blog
The data – it’s about the data
bigdata® Presented at Xcelerate 3/18/2014
13 http://www.bigdata.com/blog
The killer “big data” app
• Clouds + “Open” Data = Big Data Integration
• Critical advantages
– Fast integration cycle
– Open standards
– Integrate heterogeneous data, linked data, structured data, data at rest, and streams.
– Maintain fine-grained provenance of federated data.
– Opportunistic exploitation of data:
• Fragmented information
• Dynamic information
• Latent information (graph mining)
bigdata® Presented at Xcelerate 3/18/2014
Unifying Architecture (example)
SYSTAP™, LLC © 2006-2012 All Rights Reserved
14 http://www.bigdata.com/blog
(Diagram: heterogeneous data sources as input (streams, unstructured, semi-structured, and structured data) are federated, aggregated, and discovered into a resource-centric (Linked Data) unified data model. A data bus connects the unified compute and storage model: an update/query database (SSD) serving business logic, web clients, and peer systems with high-level SPARQL query over aggregated or federated data; GPU graph mining (“think like a vertex”, graph traversal/mining); and key-value data caches.)
bigdata® Presented at Xcelerate 3/18/2014
Road Map
• Column-wise
– Faster load and query.
– Increased data density and scaling.
– Integration point with GPU (shared data).
• Multi-node GPU
– 2D decomposition (DARPA STTR)
• Performance optimization for scale-out
– Reducing latency and increasing throughput.
– Integration point for SPARQL acceleration and the 2D GPU cluster.
• SPARQL on GPU
– Query at 3 billion edges/second.
– Same underlying library, but horizontal scaling is NOT 2D.
SYSTAP™, LLC © 2006-2013 All Rights Reserved
15 http://www.bigdata.com/blog
bigdata® Presented at Xcelerate 3/18/2014
Customers and Use Cases
SYSTAP™, LLC © 2006-2012 All Rights Reserved
16 http://www.bigdata.com/blog
SYSTAP™, LLC © 2006-2012 All Rights Reserved
17 http://www.bigdata.com/blog
Manufacturing Product Data is Heterogeneous
…and difficult to find, re-use, and share
Autodesk PLM360
bigdata® Presented at Xcelerate 3/18/2014
Hadoop / bigdata® pipeline
SYSTAP™, LLC © 2006-2013 All Rights Reserved
18 http://www.bigdata.com/blog
(Diagram: Hadoop / bigdata pipeline. A Map/Reduce layer feeds +/- statements through durable queues into an inference cloud and an HA query cloud, with bigdata journals on PFS / HDFS. Query throughput scales linearly; the inference workload is scalable and can be used for custom quads-mode inference strategies.)
bigdata® Presented at Xcelerate 3/18/2014
19 http://www.bigdata.com/blog
Federated Query & Custom Services
• Language extension for remote services: SERVICE uri { graph-pattern }
• Integrated into the vectored query engine
– Solutions are vectored into, and out of, remote end points.
– Query hints control the evaluation order:
• runFirst, runLast, runOnce, etc.
• ServiceRegistry
– Configure service end point behavior.
– Embed custom “services” (custom indices, monitor transactions, etc.).
• Example (see the sketch below): SERVICE { …. } hint:Prior hint:runFirst “true” .
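A minimal federated-query sketch, for illustration only: the remote endpoint URI is hypothetical, and the hint: prefix URI is an assumption about the query-hints namespace (the hint:Prior hint:runFirst pattern itself appears on this slide).

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX hint: <http://www.bigdata.com/queryHints#>   # assumed namespace for the hint: examples above

SELECT ?person ?name
WHERE {
  ?person foaf:knows ?friend .
  # Remote call against a hypothetical endpoint; solutions are vectored in and out.
  SERVICE <http://example.org/sparql> {
    ?friend foaf:name ?name .
  }
  # Query hint from this slide: evaluate the prior SERVICE group first.
  hint:Prior hint:runFirst "true" .
}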
bigdata® Presented at Xcelerate 3/18/2014
High Availability
• Shared-nothing architecture
– Same data on each node
– Coordinate only at commit
• Scaling
– 50 billion triples or quads
– Query throughput scales linearly
• Self-healing
– Automatic failover
– Automatic resync after disconnect
– Online single-node disaster recovery
• Online Backup
– Online snapshots (full backups)
– HA logs (incremental backups)
• Point-in-time recovery (offline)
SYSTAP™, LLC © 2006-2013 All Rights Reserved
20 http://www.bigdata.com/blog
(Diagram: an HA quorum with k=3 and size=3: one leader and two follower HAService instances.)
bigdata® Presented at Xcelerate 3/18/2014
EMC ProSphere
• Host-to-storage management solution
– No single model. Information must be combined from many device vendors. RDF is a natural solution.
– Application deployed as an appliance into data centers everywhere.
• Bundles the bigdata platform.
– 220+ engineers in US and India.
• SYSTAP provides support, custom services, feature development, and training.
21 http://www.bigdata.com/blog
bigdata® usage: EMC ProSphere
(Diagram: bigdata usage in EMC ProSphere. A Topology Service JVM and a Maps Service JVM use the SAIL/bigdata API over the bigdata RDF store (journal) and a temp store, expose REST APIs, exchange RDF/XML and GraphML, and drive the SAN topology view in the Flex UI.)
bigdata® Presented at Xcelerate 3/18/2014
23 http://www.bigdata.com/blog
Knowledge Base of Biology (KaBOB)
(Diagram: KaBOB integrates biomedical data and information from 17 databases (e.g., Entrez Gene, DIP, UniProt, GOA, GAD, HGNC, InterPro) with 12 Open Biomedical Ontologies (e.g., the Gene Ontology, Sequence Ontology, Cell Type Ontology, ChEBI, NCBI Taxonomy, and Protein Ontology) to produce biomedical knowledge and application data.)
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
24 http://www.bigdata.com/blog
bigdata® scale-out federation
(Diagram: application clients use a unified API (SPARQL XML and JSON results; RDF/XML, N-Triples, N-Quads, Turtle, TriG, and RDF/JSON interchange) against client services providing RDF data and SPARQL query management functions. Distributed index management and query is provided by the data services together with a registrar, shard locator, transaction manager, load balancer, and Zookeeper.)
bigdata® Presented at Xcelerate 3/18/2014
Kepler GK110 Die Photo
25 http://www.bigdata.com/blog
• Most complex commercial IC
– 7.1 billion transistors
– 3x gain in power efficiency
– 2,496 CUDA cores
– 1.5 MB L2 cache
bigdata® Presented at Xcelerate 3/18/2014
Graph Processing
GPUs, Graphs, and Graph Data Mining
26 http://www.bigdata.com/blog
bigdata® Presented at Xcelerate 3/18/2014
GPU Graph Processing
• Motivation – speed
– 3 out of the top 5 supercomputers are GPU clusters
– 3.3B traversed edges per second (one GPU: Merrill, 2011)
– 8.3B traversed edges per second (quad-GPU configuration: ibid)
• Goal
– Blindingly fast SPARQL query and graph data mining on GPU clusters
– 20 minutes on Accumulo => 27 milliseconds on a GPU
• Open source
– Deploy in workstations, HPC clusters, EC2, or your own data center
27 http://www.bigdata.com/blog
bigdata® Presented at Xcelerate 3/18/2014
(Chart: historical single-/double-precision peak compute rates (GFLOPS, log scale), 2002-2012, for AMD GPUs, NVIDIA GPUs, Intel CPUs, and the Intel Xeon Phi; latest Top500.)
Many Core is the Future
• Top 500 Super Computer Sites:
– 3 out of the top 5 are GPU clusters (11/2011)
– #1 and #8 (11/2012)
• CPU clock rates are stagnant.
• Simple compute units + parallelism => increased performance.
• Why? FLOPS/m2, FLOPS/$, FLOPS/W
28
bigdata® Presented at Xcelerate 3/18/2014
29
GPUs – A Game Changer for Graph Analytics?
(Chart: millions of traversed edges per second vs. average traversal depth for an NVIDIA Tesla C2050, a multicore CPU (per socket), and a sequential implementation.)
• Graphs are everywhere in data and are also a powerful data model for federation.
• GPUs may be the technology that finally delivers real-time analytics on large graphs.
– 10x speedup over CPU
– 10x DRAM bandwidth
• This is a hard problem
– Data-dependent parallelism
– Non-locality
– The PCIe bus is the bottleneck
• Significant speedup over CPU on BFS
– 3 billion edges per second on one GPU (see chart)
• Roadmap
– GPU-accelerated vertex-centric graph mining platform
– GPU-accelerated graph query
(Diagram: breadth-first search on a small graph, with vertices labeled by their BFS depth from the source.)
Breadth-First Search on Graphs: 10x Speedup on GPUs
bigdata® Presented at Xcelerate 3/18/2014
GAS – a Graph-Parallel Abstraction
• Graph-parallel, vertex-centric API a la GraphLab
• “Think like a vertex”
– Gather: collect information about my neighborhood
– Apply: update my value
– Scatter: signal adjacent vertices
• Can write all sorts of graph algorithms this way
– BFS, PageRank, Connected Components, Triangle Counting, Max Flow, etc.
bigdata® Presented at Xcelerate 3/18/2014
Triangle Counting on the Twitter Graph: 34.8 Billion Triangles
• 64 machines, 15 seconds
• Hadoop [WWW’11]: 1,636 machines, 423 minutes
S. Suri and S. Vassilvitskii, “Counting triangles and the curse of the last reducer,” WWW’11
Why? Wrong abstraction: broadcasting O(degree²) messages per vertex.
bigdata® Presented at Xcelerate 3/18/2014
GPU Speedups vs GraphLab (SSSP)
SYSTAP™, LLC © 2006-2014 All Rights Reserved
32 http://www.bigdata.com/blog
bigdata® Presented at Xcelerate 3/18/2014
Graph Mining on GPU Clusters
SYSTAP™, LLC © 2006-2012 All Rights Reserved
33 http://www.bigdata.com/blog
• 2D partitioning (aka vertex cuts)
• Minimizes the communication volume.
• Batch-parallel Gather within a row, Scatter within a column.
bigdata® Presented at Xcelerate 3/18/2014
34 http://www.bigdata.com/blog
RDF Database
Overview, new features, integration points, REST API
bigdata® Presented at Xcelerate 3/18/2014
35 http://www.bigdata.com/blog
Bigdata 1.3
• Fast, scalable, open source, standards-compliant database
– Single machine to 50B triples or quads, plus a dedicated provenance mode.
– Scales horizontally on a cluster.
– SPARQL 1.1 Query, Property Paths, Update, Federated Query, etc.
– Native RDFS+ inference.
– Vectored query engine.
– High Availability.
– RDF Graph Mining.
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
36 http://www.bigdata.com/blog
Sesame Sail & Repository APIs • Java APIs for managing and querying RDF data • Extension methods for bigdata:
– Non-‐blocking readers – RDFS+ truth maintenance – Seman7c / graph search – Change history – Custom services – Etc.
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
37 http://www.bigdata.com/blog
NanoSparqlServer
• Easy deployment
– Embedded or in a servlet container
– A Java client encapsulates remote operations.
• High-performance SPARQL end point
– Optimized for bigdata MVCC semantics (queries are non-blocking)
– Built-in resource management
– Scalable!
• Simple REST API
– SPARQL 1.1 Query and Update
– Simple and useful RESTful INSERT/UPDATE/DELETE methods
– ESTCARD exposes super-fast range counts for triple patterns
– “Explain” a query.
– Monitor / cancel running queries.
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
38 http://www.bigdata.com/blog
SPARQL Query
• High-level query language for graph data.
– Standard (W3C)
• Based on graph pattern matching.
– SELECT vars WHERE graph-pattern
• Returns a result set (table).
– CONSTRUCT template WHERE graph-pattern
• Returns a sub-graph.
– DESCRIBE uri
• Returns a graph describing that object.
• The database can optimize physical storage and joins
– Versus navigation-only APIs such as Blueprints
• Example (see also the CONSTRUCT sketch below):
SELECT ?x ?z
WHERE {
  ?x foaf:knows ?y .
  ?y foaf:knows ?z .
}
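For comparison, a minimal CONSTRUCT form of the same pattern; this is an illustrative sketch using the standard foaf vocabulary, with the prefix declaration added here for completeness:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>

# Return the inferred "friend of a friend" edges as a sub-graph rather than a table.
CONSTRUCT { ?x foaf:knows ?z }
WHERE {
  ?x foaf:knows ?y .
  ?y foaf:knows ?z .
}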
bigdata® Presented at Xcelerate 3/18/2014
39 http://www.bigdata.com/blog
BSBM 100M (Single Server)
• The graph shows a series of trials for the BSBM reduced query mix (w/o Q5).
• The metric is Query Mixes per Hour (QMpH). Higher is better.
• The 8-client curve shows JVM and disk warm-up effects. Both are hot for the 16-client curve.
• Occasional low points are GC.
• Apple Mac mini (4 cores, 16G RAM, and SSD). The machine is CPU bound at 16 clients. No IO wait.
(Chart: QMpH vs. trial number for the 8-client and 16-client runs, BSBM 100M.)
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
40 http://www.bigdata.com/blog
JVM Heap Pressure
• JVMs provide fast evaluation (rivaling hand-coded C++) through sophisticated online compilation and auto-tuning.
• However, a non-linear interaction between the application workload (object creation and retention rate) and GC running time and cycle time can steal cycles and cause application throughput to plummet.
(Diagram: application throughput versus JVM resources as the application workload and GC workload grow.)
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
41 http://www.bigdata.com/blog
Choose standard or analytic operators
• Easy to specify which
– URL query parameter or SPARQL query hint (see the sketch below)
• Java operators
– Use the managed Java heap.
– Can sometimes be faster or offer better concurrency
• E.g., distinct solutions is based on a concurrent hash map
– BUT:
• The Java heap cannot handle very large materialized data sets.
• GC overhead can steal your computer.
• Analytic operators
– Scale up gracefully
– Zero GC overhead.
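For illustration only, a sketch of requesting the analytic operators with a query hint. The hint:analytic name and the query-wide hint:Query scope are assumptions inferred from the hint: examples elsewhere in this deck, not confirmed here; the URL query parameter mentioned above is the alternative.

PREFIX hint: <http://www.bigdata.com/queryHints#>   # assumed query-hints namespace

SELECT (COUNT(DISTINCT ?o) AS ?cnt)
WHERE {
  ?s ?p ?o .
  # Assumed query-scope hint asking for the analytic (native-memory) operators.
  hint:Query hint:analytic "true" .
}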
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
42 http://www.bigdata.com/blog
SPARQL Update
• Graph Management
– CREATE, ADD, COPY, MOVE, CLEAR, DROP
• Graph Data Operations
– LOAD uri
– INSERT DATA, DELETE DATA
– DELETE/INSERT
• Can be used as a rules language, for update procedures, etc. (see the sketch below)
• DELETE/INSERT grammar:
( WITH IRIref )?
( ( DeleteClause InsertClause? ) | InsertClause )
( USING ( NAMED )? IRIref )*
WHERE GroupGraphPattern
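A minimal sketch of the DELETE/INSERT form; the ex: vocabulary is hypothetical and used only for illustration:

PREFIX ex: <http://example.org/>

# Rewrite a deprecated predicate in place for every matching statement.
DELETE { ?person ex:oldStatus ?status }
INSERT { ?person ex:status    ?status }
WHERE  { ?person ex:oldStatus ?status }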
bigdata® Presented at Xcelerate 3/18/2014
43 http://www.bigdata.com/blog
SPARQL UPDATE Extension
• Language extension for SPARQL UPDATE
– Adds durable solution sets.
• Expensive JOINs are NOT recomputed.
• Save a solution set:
INSERT INTO %solutionSet1 SELECT ?product ?reviewer WHERE { … }
• Easy to slice result sets:
SELECT ... { INCLUDE %solutionSet1 } OFFSET 0 LIMIT 1000
SELECT ... { INCLUDE %solutionSet1 } OFFSET 1000 LIMIT 1000
SELECT ... { INCLUDE %solutionSet1 } OFFSET 2000 LIMIT 1000
• Re-group or re-order results:
SELECT ... { INCLUDE %solutionSet1 } GROUP BY ?x
SELECT ... { INCLUDE %solutionSet1 } ORDER BY ASC(?x)
bigdata® Presented at Xcelerate 3/18/2014
Semantic Search
• Enable the full-text index in your KB properties:
– com.bigdata.rdf.store.AbstractTripleStore.textIndex=true
• Simple full-text search:
prefix bd: <http://www.bigdata.com/rdf/search#>
select ?s ?o {
  ?o bd:search "mike" .   # all literals with the "mike" token.
  ?s ?p ?o .              # all subjects for those literals.
}
• Lots of options (cosine relevance, rank, match all terms, etc.).
• Low-latency, web-facing applications are built by "slicing" the search index.
• See https://sourceforge.net/apps/mediawiki/bigdata/index.php?title=FullTextSearch
SYSTAP™, LLC © 2006-2013 All Rights Reserved
44 http://www.bigdata.com/blog
bigdata® Presented at Xcelerate 3/18/2014
45 http://www.bigdata.com/blog
Federated Query & Custom Services
• Language extension for remote services: SERVICE uri { graph-pattern }
• Integrated into the vectored query engine
– Solutions are vectored into, and out of, remote end points.
– Query hints control the evaluation order:
• runFirst, runLast, runOnce, etc.
• ServiceRegistry
– Configure service end point behavior.
– Embed custom “services” (custom indices, monitor transactions, etc.).
• Example: SERVICE { …. } hint:Prior hint:runFirst “true” .
bigdata® Presented at Xcelerate 3/18/2014
RDF Graph Mining
• “GAS” SERVICE – 2x faster than Neo4j, 4x faster than Titan.
• Handles typed links and link attributes efficiently.
PREFIX gas: <http://www.bigdata.com/rdf/gas#>
SELECT ?depth (count(?out) as ?cnt) {
  SERVICE gas:service {
    gas:program gas:gasClass "com.bigdata.rdf.graph.analytics.BFS" .
    gas:program gas:in <ip:/112.174.24.90> . # one or more times; specifies the initial frontier.
    gas:program gas:out ?out .               # exactly once; bound to the visited vertices.
    gas:program gas:out1 ?depth .            # exactly once; bound to the depth of the visited vertices.
    gas:program gas:maxIterations 4 .        # optional limit on breadth-first expansion.
    gas:program gas:maxVisited 2000 .        # optional limit on the #of visited vertices.
  }
}
group by ?depth order by ?depth
SYSTAP™, LLC © 2006-2014 All Rights Reserved
46 http://www.bigdata.com/blog
depth   count
1       207
2       1,985
3       9,861
4       29,366
SPARQL Query
bigdata® Presented at Xcelerate 3/18/2014
47 http://www.bigdata.com/blog
Highly Available Replication Cluster
(HAJournalServer)
(bigdata 1.3 release)
bigdata® Presented at Xcelerate 3/18/2014
HA Replication Cluster
• Same backend database (Journal + RWStore)
– Same data scale as the Journal.
– Same IO profile as the Journal.
– Same low-latency query profile as the Journal.
• REST API is mandatory
– SPARQL Query
– SPARQL UPDATE
– SPARQL Federated Query
• Can be used to cross tenant or machine boundaries.
• Writes on the leader
– Identified in Zookeeper or using the REST API.
• Query on the leader or followers
– Each query is answered 100% locally.
• Zero coordination overhead.
• Linear scaling in query throughput.
SYSTAP™, LLC © 2006-2013 All Rights Reserved
48 http://www.bigdata.com/blog
bigdata® Presented at Xcelerate 3/18/2014
High Availability
• Shared-nothing architecture
– Same data on each node
– Coordinate only at commit
• Scaling
– 50 billion triples or quads
– Query throughput scales linearly
• Self-healing
– Automatic failover
– Automatic resync after disconnect
– Online single-node disaster recovery
• Online Backup
– Online snapshots (full backups)
– HA logs (incremental backups)
• Point-in-time recovery (offline)
SYSTAP™, LLC © 2006-2014 All Rights Reserved
49 http://www.bigdata.com/blog
(Diagram: an HA quorum with k=3 and size=3: one leader and two follower HAService instances.)
bigdata® Presented at Xcelerate 3/18/2014
High Availability
50 http://www.bigdata.com/blog
(Diagram: clients write to the leader of an HA quorum (k=3, size=3); writes are replicated to the two followers.)
• Write on the leader. Read on any service.
• Writes are replicated using low-level transfers.
• The quorum is fully consistent at each commit.
bigdata® Presented at Xcelerate 3/18/2014
High Availability
51 http://www.bigdata.com/blog
(Diagram: clients read from any service in the HA quorum (k=3, size=3: one leader, two followers).)
• Write on the leader. Read on any service.
• Writes are replicated using low-level transfers.
• The quorum is fully consistent at each commit.
bigdata® Presented at Xcelerate 3/18/2014
Self-Healing
(Diagram: a failed HAService synchronizes and then rejoins the quorum (k=3, size=2 while it is down).)
• A service can fail for a variety of reasons:
– JVM down
– Machine down
– Network partition
– Zookeeper timeout
– Discovery failure
– Wrong commit point
– Severe clock skew (delta)
• The goal is to guarantee eventual consistency without allowing intermediate illegal states.
• The persistent state of the service must remain self-consistent.
bigdata® Presented at Xcelerate 3/18/2014
HA Load Balancer
SYSTAP™, LLC © 2006-2014 All Rights Reserved
53 http://www.bigdata.com/blog
(Diagram: requests enter the load balancer proxy, running in the Jetty container, and are routed to the leader or to follower HAServices.)
• Transparent proxy
– Writes are proxied to the leader.
– Reads are load-balanced over the quorum.
• Balancing uses host metrics (CPU, IO wait) and service metrics (GC time).
• Custom policies
– Tenant aware
– Low- and high-latency pools
bigdata® Presented at Xcelerate 3/18/2014
HA Deployment
SYSTAP™, LLC © 2006-2014 All Rights Reserved
54 http://www.bigdata.com/blog
• EC2
– SSD instance types (IOPS, ephemeral)
– Snapshots and logs on EBS (durable)
– Restore from EBS on instance restart
• Coming soon
– Click-start HA clusters on EC2
– Private clouds (OpenStack, chef/puppet)
(Diagram: a three-node HAService cluster.)
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2014 All Rights Reserved
55 http://www.bigdata.com/blog
HA Service Architecture
(Diagram: application clients use the unified API (SPARQL XML and JSON results; RDF/XML, N-Triples, N-Quads, Turtle, TriG, and RDF/JSON interchange) against the REST API (SPARQL, GAS, etc.) of the HA replication cluster. Each node runs a ServiceStarter, an HAJournalServer, a Lookup Service, a Class Server, jetty hosting the REST API and load balancer, and the HAGlue API over the Journal; a Zookeeper ensemble coordinates the cluster.)
bigdata® Presented at Xcelerate 3/18/2014
BSBM 100M (3-Node HA Cluster)
56 http://www.bigdata.com/blog
(Chart: query performance scales linearly with cluster size: QMpH over trials for BSBM 100M on a 3-node replication cluster of Intel Mac Minis, showing aggregate and per-node throughput.)
• 3-node, shared-nothing replication cluster
– 3x 2011 Mac Mini (4 cores, 16G RAM, and SSD).
• Query throughput scales linearly.
• CPU bound
– 70-90k QMpH on newer servers.
bigdata® Presented at Xcelerate 3/18/2014
HA DEMO
• Brief demonstration of an HA-3 cluster
• Start 3 services (A, B, C)
• Load some data (the TBL foaf crawl):
DROP ALL;
LOAD <file:/Users/bryan/Documents/workspace/BIGDATA_RELEASE_1_2_0/bigdata-rdf/src/resources/data/foaf/data-0.nq.gz>;
LOAD <file:/Users/bryan/Documents/workspace/BIGDATA_RELEASE_1_2_0/bigdata-rdf/src/resources/data/foaf/data-1.nq.gz>;
LOAD <file:/Users/bryan/Documents/workspace/BIGDATA_RELEASE_1_2_0/bigdata-rdf/src/resources/data/foaf/data-2.nq.gz>;
– Identical data on all services:
SELECT (count(*) as ?c) { ?s ?p ?o } LIMIT 1
• Self-healing
– Shut down C
– Load more data on A + B. Commit.
LOAD <file:/Users/bryan/Documents/workspace/BIGDATA_RELEASE_1_2_0/bigdata-rdf/src/resources/data/foaf/data-3.nq.gz>;
– Restart C. It resyncs and joins the quorum.
• Snapshot service.
File                 Quads       Size (gz)
timbl/data-0.nq.gz   89          2.5K
timbl/data-1.nq.gz   16516       293K
timbl/data-2.nq.gz   87250       1.2M
timbl/data-3.nq.gz   388412      5.1M
timbl/data-4.nq.gz   9405528     113M
timbl/data-5.nq.gz   93898523    1016M
timbl/data-6.nq.gz   101010423   1.2G
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
58 http://www.bigdata.com/blog
Bigdata®
Services and dynamics
bigdata® Presented at Xcelerate 3/18/2014
Related “Graph” Technologies
SYSTAP™, LLC © 2006-2013 All Rights Reserved
59 http://www.bigdata.com/blog
Redpoint repositions existing technology. MPGraph compares favorably with high end hardware solutions from YARC, Oracle, and SAP, but is open source and uses commodity hardware.
Pair up bigdata and MPGraph
STTR
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
60 http://www.bigdata.com/blog
bigdata® is a federation of services
(Diagram: application clients use a unified API (SPARQL XML and JSON results; RDF/XML, N-Triples, N-Quads, Turtle, TriG, and RDF/JSON interchange) against client services providing RDF data and SPARQL query management functions. Distributed index management and query is provided by the data services together with a registrar, shard locator, transaction manager, load balancer, and Zookeeper.)
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
61 http://www.bigdata.com/blog
Typical Software Stack
(Diagram: the application layer sits on a unified API and its implementation; API frameworks (Spring, etc.) and the Sesame framework (SAIL, RDF, SPARQL) sit above the Bigdata RDF database and Bigdata component services; HTTP, Jini, and Zookeeper provide cluster and storage management; everything runs on Java over Linux.)
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
62 http://www.bigdata.com/blog
Service Discovery (1/2)
• Services discover registrars. Registrar discovery is configurable using either unicast or multicast protocols.
• Services advertise themselves and look up other services (a).
• Clients use the shard locator to locate key-range shards for scale-out indices (b).
(Diagram: the client service, shard locator, and data services register with and look up one another via the registrar (a); the client consults the shard locator (b); all services communicate over an RMI / NIO bus.)
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
63 http://www.bigdata.com/blog
Service Discovery (2/2)
• Clients resolve shard locators to data service identifiers (a), then look up the data services in the service registrar (b).
• Data moves directly between clients and services (c).
• Service protocols are not limited to RMI. Custom NIO protocols are used for high data throughput.
• Client libraries encapsulate this for applications, including caching of service lookup and shard resolution.
(Diagram: (a) the client resolves shard locators, (b) looks up data services in the registrar, and (c) exchanges data directly with the data services; control messages and data flow over the RMI / NIO bus.)
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
64 http://www.bigdata.com/blog
Persistence Store Mechanisms
• Read/Write (RW) store
– Efficiently recycles allocation slots on the backing file
– Used by services that need persistence without sharding, HA, etc.
– Also used in the scale-up, single-machine database (~50B triples)
• Write Once, Read Many (WORM) store
– Append-only, log-structured store (aka “journal”)
– Used by the data services to absorb writes
– Plays an important role in the scale-out database architecture
• Index segment (SEG) store
– Read-optimized B+Tree files
– Generated by bulk index build operations on a data service
– Plays an important role in dynamic sharding and analytic query
RWStore Journal
WORM Journal
index segments
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
65 http://www.bigdata.com/blog
The Data Service (1/2)
(Diagram: clients issue scattered writes to and gathered reads from the data services; each data service has a journal and index segments, with periodic overflow.)
Append-only journals and read-optimized index segments are the basic building blocks.
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
66 http://www.bigdata.com/blog
The Data Service (2/2)
Behind the scenes on a data service:
• Journal
– Block append for ultra-fast writes
– Target size on disk ~200MB
• Index Segments
– 98%+ of all data on disk
– Target shard size on disk ~200MB
– Bloom filters (fast rejection)
– At most one IO per leaf access
– Multi-block IO for leaf scans
• Overflow
– Fast synchronous overflow
– Asynchronous index builds
– Key to dynamic sharding
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
67 http://www.bigdata.com/blog
Bigdata® Indices
Scale-out B+Tree and Dynamic Sharding
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
68 http://www.bigdata.com/blog
Bigdata® Indices
• Dynamically key-range partitioned B+Trees for indices
– Index entries (tuples) map unsigned byte[] keys to byte[] values.
– A “deleted” flag and timestamp are used for MVCC and dynamic sharding.
• Index partitions are distributed across the data services on a cluster
– Located by a centralized metadata service
(Diagram: a B+Tree with internal nodes holding separator keys and child references, and leaves holding keys 1-10 mapped to values v1-v10.)
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
69 http://www.bigdata.com/blog
Dynamic Key Range Par77oning
p0 split p1 p2
p2 join p3 p1
p3 move p4
Splits break down the shards as the data scale increases.
Moves redistribute shards onto exis7ng or new nodes in the cluster.
Joins merge shards when data scale decreases.
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
70 http://www.bigdata.com/blog
Dynamic Key Range Partitioning
• Initial conditions place the first shard, representing the entire index key range, on an arbitrary data service.
(Diagram: the shard locator service maps shard p0 ([], ∞) to DataService1.)
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
71 http://www.bigdata.com/blog
Dynamic Key Range Partitioning
• Writes cause the shard to grow. Eventually its size on disk exceeds a preconfigured threshold.
(Diagram: shard p0 ([], ∞) on DataService1, tracked by the shard locator service.)
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
72 http://www.bigdata.com/blog
Dynamic Key Range Partitioning
• Instead of a simple two-way split, the initial shard is “scatter-split” so that all data services can start managing data.
(Diagram: p0 is scatter-split into shards p1-p9 on DataService1, tracked by the shard locator service.)
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
73 http://www.bigdata.com/blog
Dynamic Key Range Partitioning
• The newly created shards are then dispersed around the cluster. Subsequent splits are two-way, and moves occur based on relative server load (decided by the Load Balancer Service).
(Diagram: shards p1, p2, …, p9 are moved from DataService1 onto DataService2 through DataService9, tracked by the shard locator service.)
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
74 http://www.bigdata.com/blog
Shard Evolution
• Builds generate index segments from just the old journal.
• A merge compacts the shard view into a single index segment.
(Diagram: builds of shards p0, p1, and p2 against journal0, and a merge producing p3.)
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
75 http://www.bigdata.com/blog
Shard Evolution
• Initial journal on the data service.
• Incremental build of new segments for a shard with each journal overflow.
• The shard periodically undergoes a compacting merge.
• The shard will split at 200MB.
(Diagram: over time t0 … tn, tn+1, writes go to the current journal (journal0 … journaln, journaln+1), reads span the journal plus segments seg 0 … seg n-1, builds produce seg n, and merges compact the view.)
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
76 http://www.bigdata.com/blog
Bulk Data Load
High Throughput with Dynamic Sharding
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
77 http://www.bigdata.com/blog
Bulk Data Load
• Very high data load rates
– 1B triples in under an hour (better than 300,000 triples per second on a 16-node cluster).
• Executed as a distributed job
– Read data from a file system, the web, HDFS, etc.
• The database remains available for query during the load
– Read from historical commit points with snapshot isolation.
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
78 http://www.bigdata.com/blog
Distributed Bulk Data Loader
• The application client identifies the RDF data files to be loaded.
• One client is elected as the job master and coordinates the job across the other clients.
• Writes are scattered across the data service nodes.
• Clients read directly from shared storage.
(Diagram: an application client drives a set of client services, which scatter writes across the data services.)
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
79 http://www.bigdata.com/blog
Input (2.5B triples)
(Chart: billions of told triples loaded vs. minutes; a 16-node cluster loading at 180k triples per second.)
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
80 http://www.bigdata.com/blog
Throughput (2.5B triples)
(Chart: triples per second vs. minutes for a 16-node cluster loading 2.5B triples.)
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
81 http://www.bigdata.com/blog
Disk Utilization (~56 bytes/triple)
(Chart: billions of bytes on disk under management vs. minutes.)
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
82 http://www.bigdata.com/blog
Shards over time (2.5B triples)
(Chart: total shard count on the cluster over time (minutes) for 2.5B triples.)
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
83 http://www.bigdata.com/blog
Dynamic Sharding in Action
(Chart: the number of shard split, build, and merge operations vs. minutes for 2.5B triples.)
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
84 http://www.bigdata.com/blog
Scatter Splits in Action (zoom on the first 3 minutes)
bigdata® Presented at Xcelerate 3/18/2014
Vectored Query in Scale-Out
SYSTAP™, LLC © 2006-2012 All Rights Reserved
85 http://www.bigdata.com/blog
bigdata® Presented at Xcelerate 3/18/2014
bigdata ®
Bryan Thompson SYSTAP, LLC
bryan@systap.com
86 http://www.bigdata.com/blog
http://www.systap.com/bigdata.htm http://www.bigdata.com/blog
bigdata® Presented at Xcelerate 3/18/2014
Backup Material
87 http://www.bigdata.com/blog
SYSTAP™, LLC © 2006-2014 All Rights Reserved
bigdata® Presented at Xcelerate 3/18/2014
Statement Identifiers
RDF Graphs with efficient link attributes
SYSTAP™, LLC © 2006-2012 All Rights Reserved
88 http://www.bigdata.com/blog
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
89 http://www.bigdata.com/blog
Statement Level Metadata
• Important to know where data came from in a mashup:
:mike :memberOf :SYSTAP .
dc:source <http://www.systap.com> .
• But you CANNOT say that in RDF.
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
90 http://www.bigdata.com/blog
RDF “Reification”
• Creates a “model” of the statement:
_:s1 rdf:subject :mike .
_:s1 rdf:predicate :memberOf .
_:s1 rdf:object :SYSTAP .
_:s1 rdf:type rdf:Statement .
• Then you can say:
_:s1 dc:source <http://www.systap.com> .
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
91 http://www.bigdata.com/blog
Statement Identifier Mode (SIDs)
• Special database mode
– SIDs look just like blank nodes
– Leverages the named graph of the statement
• SIDs let you do exactly what you want:
:mike :memberOf :SYSTAP _:s1 .
_:s1 dc:source <http://www.systap.com> .
• Use SIDs in SPARQL:
select ?s ?o ?source
where {
  GRAPH ?sid { ?s :memberOf ?o } .
  ?sid dc:source ?source .
}
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
92 http://www.bigdata.com/blog
Reification Done Right
• Extends the concept to support quads.
– Outcome of the Dagstuhl 2012 Semantic Data Management workshop.
– Collaborative effort with SYSTAP, OpenLink, Humboldt University, and the Karlsruhe Institute of Technology.
– W3C Member Submission in preparation.
– Harmonized with RDF model theory & SPARQL algebra.
– Efficient in index structures and queries.
• Extensions for N3, Turtle, and SPARQL are proposed.
• Interchange and query for link attributes (graph databases).
bigdata® Presented at Xcelerate 3/18/2014
Works with triples or quads
• Inline statements into statements:
<< :mike :memberOf :SYSTAP >>
  dc:source <http://www.systap.com> .
• The same syntax works for query:
select ?s ?o ?source
where {
  << ?s :memberOf ?o >> dc:source ?source .
}
• Standardized approach for:
– Link attributes (graph databases).
– Confidence measures (entity / link extractors).
– Datum-level security models.
SYSTAP™, LLC © 2006-2012 All Rights Reserved
93 http://www.bigdata.com/blog