bigdata® Presented at Xcelerate 3/18/2014
SYSTAP, LLC
Graphs, Graph Databases,
Graph Analytics on GPUs
1 http://www.bigdata.com/blog
bigdata® Presented at Xcelerate 3/18/2014
Graph Database
• High performance, scalable
– 50B edges/node
– High-level query language
– Efficient graph traversal
– High-9s solution
• Open Source (subscriptions)
– Autodesk, EMC, market data, genomics and personalized medicine, etc.
GPU Analytics
• Extreme performance
– 5-100x faster than GraphLab
– 10,000x faster than graph databases
• DARPA funding
• Disruptive technology
– Early adopters
– Huge ROIs
• Open Source
SYSTAP™, LLC © 2006-2013 All Rights Reserved
2 http://www.bigdata.com/blog
Small Business, Founded 2006 100% Employee Owned
SYSTAP, LLC
bigdata® Presented at Xcelerate 3/18/2014
Related “Graph” Technologies
SYSTAP™, LLC © 2006-2013 All Rights Reserved
3 http://www.bigdata.com/blog
Redpoint repositions existing technology. MPGraph compares favorably with high end hardware solutions from YARC, Oracle, and SAP, but is open source and uses commodity hardware.
Pair up bigdata and MPGraph
STTR
bigdata® Presented at Xcelerate 3/18/2014
Similar models, different problems
• Graph query and graph analytics (traversal/mining)
– Related data models
– Very different computational requirements
• Many technologies are a bad match or a limited solution
– Key-value stores (Bigtable, Accumulo, Cassandra, HBase)
– Map-reduce
• Anti-pattern
– Dump all data into a “big bucket”
SYSTAP™, LLC © 2006-2013 All Rights Reserved
4 http://www.bigdata.com/blog
bigdata® Presented at Xcelerate 3/18/2014
Similar models, different problems
• Graph query and graph analytics (traversal/mining)
– Related data models
– Very different computational requirements
• Many technologies are a bad match or a limited solution
– Key-value stores (Bigtable, Accumulo, Cassandra, HBase)
– Map-reduce
• Anti-pattern
– Dump all data into a “big bucket”
Storage and computation patterns must be correctly matched for high performance.
SYSTAP™, LLC
© 2006-2013 All Rights Reserved
5 http://www.bigdata.com/blog
bigdata® Presented at Xcelerate 3/18/2014
Optimize for the right problem
• Graph Query
– Declarative query language (SPARQL)
• Query optimization is critical for performance.
– Index locality (1D partitioning, multiple indices)
• Get everything about a subject on one page of the index.
– Scale-out must flow queries over the data
• Otherwise it slams the network and the client.
– Must order and constrain joins to read as little data as possible
• As-bound vectored nested index joins (bigdata)
• Sideways information passing and merge joins (RDF3X)
SYSTAP™, LLC © 2006-2014 All Rights Reserved
6 http://www.bigdata.com/blog
bigdata® Presented at Xcelerate 3/18/2014
Vectored Query in Scale-Out
SYSTAP™, LLC © 2006-2012 All Rights Reserved
7 http://www.bigdata.com/blog
bigdata® Presented at Xcelerate 3/18/2014
Optimize for the right problem
• Storage and computation patterns must be correctly matched for high performance.
• Graph analytics:
– Parallelism – work must be distributed and balanced.
– Memory bandwidth – memory, not disk, is the bottleneck.
– 2D partitioning – O(N) communications pattern (versus O(N*N))
SYSTAP™, LLC © 2006-2014 All Rights Reserved
8 http://www.bigdata.com/blog
(Chart: BFS and PageRank benchmark results.)
bigdata® Presented at Xcelerate 3/18/2014
Accelerated Graph Analytics
SYSTAP™, LLC © 2006-2013 All Rights Reserved
9 http://www.bigdata.com/blog
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
10 http://www.bigdata.com/blog
The Semantic Web
• The Semantic Web is a stack of standards developed by the W3C for the interchange and query of metadata and graph-structured data.
– Open data
– Linked data
– Multiple sources of authority
– Self-describing data and rules
– Federation or aggregation
– And, increasingly, provenance
bigdata® Presented at Xcelerate 3/18/2014
11 http://www.bigdata.com/blog
The Standards or the Data?
TBL, 2000. S. Bratt, 2006.
bigdata® Presented at Xcelerate 3/18/2014
12 http://www.bigdata.com/blog
The data – it’s about the data
bigdata® Presented at Xcelerate 3/18/2014
13 http://www.bigdata.com/blog
The killer “big data” app
• Clouds + “Open” Data = Big Data Integration
• Critical advantages
– Fast integration cycle
– Open standards
– Integrate heterogeneous data, linked data, structured data, data at rest, and streams.
– Maintain fine-grained provenance of federated data.
– Opportunistic exploitation of data:
• Fragmented information
• Dynamic information
• Latent information (graph mining)
bigdata® Presented at Xcelerate 3/18/2014
Unifying Architecture (example)
SYSTAP™, LLC © 2006-2012 All Rights Reserved
14 http://www.bigdata.com/blog
(Diagram: heterogeneous data sources as input (streams, unstructured, semi-structured, and structured data) are federated, aggregated, and discovered into a resource-centric (Linked Data) unified data model. A data bus connects the unified compute and storage model: an update/query database (SSD) serving business logic, web clients, and peer systems with high-level SPARQL query over aggregated or federated data; GPU graph mining (“think like a vertex”, graph traversal/mining); and key-value data caches.)
bigdata® Presented at Xcelerate 3/18/2014
Road Map
• Column-wise
– Faster load and query.
– Increased data density and scaling.
– Integration point with GPU (shared data).
• Multi-node GPU
– 2D decomposition (DARPA STTR)
• Performance optimization for scale-out
– Reducing latency and increasing throughput.
– Integration point for SPARQL acceleration and the 2D GPU cluster.
• SPARQL on GPU
– Query at 3 billion edges/second.
– Same underlying library, but horizontal scaling is NOT 2D.
SYSTAP™, LLC © 2006-2013 All Rights Reserved
15 http://www.bigdata.com/blog
bigdata® Presented at Xcelerate 3/18/2014
Customers and Use Cases
SYSTAP™, LLC © 2006-2012 All Rights Reserved
16 http://www.bigdata.com/blog
SYSTAP™, LLC © 2006-2012 All Rights Reserved
17 http://www.bigdata.com/blog
Manufacturing Product Data is Heterogeneous
…and difficult to find, re-use, and share
Autodesk PLM360
bigdata® Presented at Xcelerate 3/18/2014
Hadoop / bigdata® pipeline
SYSTAP™, LLC © 2006-2013 All Rights Reserved
18 http://www.bigdata.com/blog
(Diagram: Hadoop / bigdata pipeline. A Map/Reduce layer feeds +/- statements through durable queues into an inference cloud and an HA query cloud, with bigdata journals on PFS / HDFS. Query throughput scales linearly; the inference workload is scalable and can be used for custom quads-mode inference strategies.)
bigdata® Presented at Xcelerate 3/18/2014
19 http://www.bigdata.com/blog
Federated Query & Custom Services
• Language extension for remote services: SERVICE uri { graph-pattern }
• Integrated into the vectored query engine
– Solutions are vectored into, and out of, remote end points.
– Query hints control the evaluation order:
• runFirst, runLast, runOnce, etc.
• ServiceRegistry
– Configure service end point behavior.
– Embed custom “services” (custom indices, monitor transactions, etc.).
• Example (see the sketch below): SERVICE { …. } hint:Prior hint:runFirst “true” .
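A minimal federated-query sketch, for illustration only: the remote endpoint URI is hypothetical, and the hint: prefix URI is an assumption about the query-hints namespace (the hint:Prior hint:runFirst pattern itself appears on this slide).

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX hint: <http://www.bigdata.com/queryHints#>   # assumed namespace for the hint: examples above

SELECT ?person ?name
WHERE {
  ?person foaf:knows ?friend .
  # Remote call against a hypothetical endpoint; solutions are vectored in and out.
  SERVICE <http://example.org/sparql> {
    ?friend foaf:name ?name .
  }
  # Query hint from this slide: evaluate the prior SERVICE group first.
  hint:Prior hint:runFirst "true" .
}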
bigdata® Presented at Xcelerate 3/18/2014
High Availability
• Shared-nothing architecture
– Same data on each node
– Coordinate only at commit
• Scaling
– 50 billion triples or quads
– Query throughput scales linearly
• Self-healing
– Automatic failover
– Automatic resync after disconnect
– Online single-node disaster recovery
• Online Backup
– Online snapshots (full backups)
– HA logs (incremental backups)
• Point-in-time recovery (offline)
SYSTAP™, LLC © 2006-2013 All Rights Reserved
20 http://www.bigdata.com/blog
(Diagram: an HA quorum with k=3 and size=3: one leader and two follower HAService instances.)
bigdata® Presented at Xcelerate 3/18/2014
EMC ProSphere
• Host-to-storage management solution
– No single model. Information must be combined from many device vendors. RDF is a natural solution.
– Application deployed as an appliance into data centers everywhere.
• Bundles the bigdata platform.
– 220+ engineers in US and India.
• SYSTAP provides support, custom services, feature development, and training.
21 http://www.bigdata.com/blog
bigdata® usage: EMC ProSphere
(Diagram: bigdata usage in EMC ProSphere. A Topology Service JVM and a Maps Service JVM use the SAIL/bigdata API over the bigdata RDF store (journal) and a temp store, expose REST APIs, exchange RDF/XML and GraphML, and drive the SAN topology view in the Flex UI.)
bigdata® Presented at Xcelerate 3/18/2014
23 http://www.bigdata.com/blog
Knowledge Base of Biology (KaBOB)
(Diagram: KaBOB integrates biomedical data and information from 17 databases (e.g., Entrez Gene, DIP, UniProt, GOA, GAD, HGNC, InterPro) with 12 Open Biomedical Ontologies (e.g., the Gene Ontology, Sequence Ontology, Cell Type Ontology, ChEBI, NCBI Taxonomy, and Protein Ontology) to produce biomedical knowledge and application data.)
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
24 http://www.bigdata.com/blog
bigdata® scale-out federation
(Diagram: application clients use a unified API (SPARQL XML and JSON results; RDF/XML, N-Triples, N-Quads, Turtle, TriG, and RDF/JSON interchange) against client services providing RDF data and SPARQL query management functions. Distributed index management and query is provided by the data services together with a registrar, shard locator, transaction manager, load balancer, and Zookeeper.)
bigdata® Presented at Xcelerate 3/18/2014
Kepler GK110 Die Photo
25 http://www.bigdata.com/blog
• Most complex commercial IC
– 7.1 billion transistors
– 3x gain in power efficiency
– 2,496 CUDA cores
– 1.5 MB L2 cache
bigdata® Presented at Xcelerate 3/18/2014
Graph Processing
GPUs, Graphs, and Graph Data Mining
26 http://www.bigdata.com/blog
bigdata® Presented at Xcelerate 3/18/2014
GPU Graph Processing
• Motivation – speed
– 3 out of the top 5 supercomputers are GPU clusters
– 3.3B traversed edges per second (one GPU: Merrill, 2011)
– 8.3B traversed edges per second (quad-GPU configuration: ibid)
• Goal
– Blindingly fast SPARQL query and graph data mining on GPU clusters
– 20 minutes on Accumulo => 27 milliseconds on a GPU
• Open source
– Deploy in workstations, HPC clusters, EC2, or your own data center
27 http://www.bigdata.com/blog
bigdata® Presented at Xcelerate 3/18/2014
(Chart: historical single-/double-precision peak compute rates (GFLOPS, log scale), 2002-2012, for AMD GPUs, NVIDIA GPUs, Intel CPUs, and the Intel Xeon Phi; latest Top500.)
Many Core is the Future
• Top 500 Super Computer Sites:
– 3 out of the top 5 are GPU clusters (11/2011)
– #1 and #8 (11/2012)
• CPU clock rates are stagnant.
• Simple compute units + parallelism => increased performance.
• Why? FLOPS/m2, FLOPS/$, FLOPS/W
28
bigdata® Presented at Xcelerate 3/18/2014
29
GPUs – A Game Changer for Graph Analytics?
(Chart: millions of traversed edges per second vs. average traversal depth for an NVIDIA Tesla C2050, a multicore CPU (per socket), and a sequential implementation.)
• Graphs are everywhere in data and are also a powerful data model for federation.
• GPUs may be the technology that finally delivers real-time analytics on large graphs.
– 10x speedup over CPU
– 10x DRAM bandwidth
• This is a hard problem
– Data-dependent parallelism
– Non-locality
– The PCIe bus is the bottleneck
• Significant speedup over CPU on BFS
– 3 billion edges per second on one GPU (see chart)
• Roadmap
– GPU-accelerated vertex-centric graph mining platform
– GPU-accelerated graph query
(Diagram: breadth-first search on a small graph, with vertices labeled by their BFS depth from the source.)
Breadth-First Search on Graphs: 10x Speedup on GPUs
bigdata® Presented at Xcelerate 3/18/2014
GAS – a Graph-Parallel Abstraction
• Graph-parallel, vertex-centric API a la GraphLab
• “Think like a vertex”
– Gather: collect information about my neighborhood
– Apply: update my value
– Scatter: signal adjacent vertices
• Can write all sorts of graph algorithms this way
– BFS, PageRank, Connected Components, Triangle Counting, Max Flow, etc.
bigdata® Presented at Xcelerate 3/18/2014
Triangle Counting on the Twitter Graph: 34.8 Billion Triangles
• 64 machines, 15 seconds
• Hadoop [WWW’11]: 1,636 machines, 423 minutes
S. Suri and S. Vassilvitskii, “Counting triangles and the curse of the last reducer,” WWW’11
Why? Wrong abstraction: broadcasting O(degree²) messages per vertex.
bigdata® Presented at Xcelerate 3/18/2014
GPU Speedups vs GraphLab (SSSP)
SYSTAP™, LLC © 2006-2014 All Rights Reserved
32 http://www.bigdata.com/blog
bigdata® Presented at Xcelerate 3/18/2014
Graph Mining on GPU Clusters
SYSTAP™, LLC © 2006-2012 All Rights Reserved
33 http://www.bigdata.com/blog
• 2D partitioning (aka vertex cuts)
• Minimizes the communication volume.
• Batch-parallel Gather within a row, Scatter within a column.
bigdata® Presented at Xcelerate 3/18/2014
34 http://www.bigdata.com/blog
RDF Database
Overview, new features, integration points, REST API
bigdata® Presented at Xcelerate 3/18/2014
35 http://www.bigdata.com/blog
Bigdata 1.3
• Fast, scalable, open source, standards-compliant database
– Single machine to 50B triples or quads, plus a dedicated provenance mode.
– Scales horizontally on a cluster.
– SPARQL 1.1 Query, Property Paths, Update, Federated Query, etc.
– Native RDFS+ inference.
– Vectored query engine.
– High Availability.
– RDF Graph Mining.
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
36 http://www.bigdata.com/blog
Sesame Sail & Repository APIs • Java APIs for managing and querying RDF data • Extension methods for bigdata:
– Non-‐blocking readers – RDFS+ truth maintenance – Seman7c / graph search – Change history – Custom services – Etc.
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
37 http://www.bigdata.com/blog
NanoSparqlServer
• Easy deployment
– Embedded or in a servlet container
– A Java client encapsulates remote operations.
• High-performance SPARQL end point
– Optimized for bigdata MVCC semantics (queries are non-blocking)
– Built-in resource management
– Scalable!
• Simple REST API
– SPARQL 1.1 Query and Update
– Simple and useful RESTful INSERT/UPDATE/DELETE methods
– ESTCARD exposes super-fast range counts for triple patterns
– “Explain” a query.
– Monitor / cancel running queries.
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
38 http://www.bigdata.com/blog
SPARQL Query
• High-level query language for graph data.
– Standard (W3C)
• Based on graph pattern matching.
– SELECT vars WHERE graph-pattern
• Returns a result set (table).
– CONSTRUCT template WHERE graph-pattern
• Returns a sub-graph.
– DESCRIBE uri
• Returns a graph describing that object.
• The database can optimize physical storage and joins
– Versus navigation-only APIs such as Blueprints
• Example (see also the CONSTRUCT sketch below):
SELECT ?x ?z
WHERE {
  ?x foaf:knows ?y .
  ?y foaf:knows ?z .
}
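For comparison, a minimal CONSTRUCT form of the same pattern; this is an illustrative sketch using the standard foaf vocabulary, with the prefix declaration added here for completeness:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>

# Return the inferred "friend of a friend" edges as a sub-graph rather than a table.
CONSTRUCT { ?x foaf:knows ?z }
WHERE {
  ?x foaf:knows ?y .
  ?y foaf:knows ?z .
}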
bigdata® Presented at Xcelerate 3/18/2014
39 http://www.bigdata.com/blog
BSBM 100M (Single Server)
• The graph shows a series of trials for the BSBM reduced query mix (w/o Q5).
• The metric is Query Mixes per Hour (QMpH). Higher is better.
• The 8-client curve shows JVM and disk warm-up effects. Both are hot for the 16-client curve.
• Occasional low points are GC.
• Apple Mac mini (4 cores, 16G RAM, and SSD). The machine is CPU bound at 16 clients. No IO wait.
(Chart: QMpH vs. trial number for the 8-client and 16-client runs, BSBM 100M.)
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
40 http://www.bigdata.com/blog
JVM Heap Pressure
• JVMs provide fast evaluation (rivaling hand-coded C++) through sophisticated online compilation and auto-tuning.
• However, a non-linear interaction between the application workload (object creation and retention rate) and GC running time and cycle time can steal cycles and cause application throughput to plummet.
(Diagram: application throughput versus JVM resources as the application workload and GC workload grow.)
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
41 http://www.bigdata.com/blog
Choose standard or analytic operators
• Easy to specify which
– URL query parameter or SPARQL query hint (see the sketch below)
• Java operators
– Use the managed Java heap.
– Can sometimes be faster or offer better concurrency
• E.g., distinct solutions is based on a concurrent hash map
– BUT:
• The Java heap cannot handle very large materialized data sets.
• GC overhead can steal your computer.
• Analytic operators
– Scale up gracefully
– Zero GC overhead.
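For illustration only, a sketch of requesting the analytic operators with a query hint. The hint:analytic name and the query-wide hint:Query scope are assumptions inferred from the hint: examples elsewhere in this deck, not confirmed here; the URL query parameter mentioned above is the alternative.

PREFIX hint: <http://www.bigdata.com/queryHints#>   # assumed query-hints namespace

SELECT (COUNT(DISTINCT ?o) AS ?cnt)
WHERE {
  ?s ?p ?o .
  # Assumed query-scope hint asking for the analytic (native-memory) operators.
  hint:Query hint:analytic "true" .
}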
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
42 http://www.bigdata.com/blog
SPARQL Update
• Graph Management
– CREATE, ADD, COPY, MOVE, CLEAR, DROP
• Graph Data Operations
– LOAD uri
– INSERT DATA, DELETE DATA
– DELETE/INSERT
• Can be used as a rules language, for update procedures, etc. (see the sketch below)
• DELETE/INSERT grammar:
( WITH IRIref )?
( ( DeleteClause InsertClause? ) | InsertClause )
( USING ( NAMED )? IRIref )*
WHERE GroupGraphPattern
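A minimal sketch of the DELETE/INSERT form; the ex: vocabulary is hypothetical and used only for illustration:

PREFIX ex: <http://example.org/>

# Rewrite a deprecated predicate in place for every matching statement.
DELETE { ?person ex:oldStatus ?status }
INSERT { ?person ex:status    ?status }
WHERE  { ?person ex:oldStatus ?status }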
bigdata® Presented at Xcelerate 3/18/2014
43 http://www.bigdata.com/blog
SPARQL UPDATE Extension
• Language extension for SPARQL UPDATE
– Adds durable solution sets.
• Expensive JOINs are NOT recomputed.
• Save a solution set:
INSERT INTO %solutionSet1 SELECT ?product ?reviewer WHERE { … }
• Easy to slice result sets:
SELECT ... { INCLUDE %solutionSet1 } OFFSET 0 LIMIT 1000
SELECT ... { INCLUDE %solutionSet1 } OFFSET 1000 LIMIT 1000
SELECT ... { INCLUDE %solutionSet1 } OFFSET 2000 LIMIT 1000
• Re-group or re-order results:
SELECT ... { INCLUDE %solutionSet1 } GROUP BY ?x
SELECT ... { INCLUDE %solutionSet1 } ORDER BY ASC(?x)
bigdata® Presented at Xcelerate 3/18/2014
Semantic Search
• Enable the full-text index in your KB properties:
– com.bigdata.rdf.store.AbstractTripleStore.textIndex=true
• Simple full-text search:
prefix bd: <http://www.bigdata.com/rdf/search#>
select ?s ?o {
  ?o bd:search "mike" .   # all literals with the "mike" token.
  ?s ?p ?o .              # all subjects for those literals.
}
• Lots of options (cosine relevance, rank, match all terms, etc.).
• Low-latency, web-facing applications are built by "slicing" the search index.
• See https://sourceforge.net/apps/mediawiki/bigdata/index.php?title=FullTextSearch
SYSTAP™, LLC © 2006-2013 All Rights Reserved
44 http://www.bigdata.com/blog
bigdata® Presented at Xcelerate 3/18/2014
45 http://www.bigdata.com/blog
Federated Query & Custom Services
• Language extension for remote services: SERVICE uri { graph-pattern }
• Integrated into the vectored query engine
– Solutions are vectored into, and out of, remote end points.
– Query hints control the evaluation order:
• runFirst, runLast, runOnce, etc.
• ServiceRegistry
– Configure service end point behavior.
– Embed custom “services” (custom indices, monitor transactions, etc.).
• Example: SERVICE { …. } hint:Prior hint:runFirst “true” .
bigdata® Presented at Xcelerate 3/18/2014
RDF Graph Mining
• “GAS” SERVICE – 2x faster than Neo4j, 4x faster than Titan.
• Handles typed links and link attributes efficiently.
PREFIX gas: <http://www.bigdata.com/rdf/gas#>
SELECT ?depth (count(?out) as ?cnt) {
  SERVICE gas:service {
    gas:program gas:gasClass "com.bigdata.rdf.graph.analytics.BFS" .
    gas:program gas:in <ip:/112.174.24.90> . # one or more times; specifies the initial frontier.
    gas:program gas:out ?out .               # exactly once; bound to the visited vertices.
    gas:program gas:out1 ?depth .            # exactly once; bound to the depth of the visited vertices.
    gas:program gas:maxIterations 4 .        # optional limit on breadth-first expansion.
    gas:program gas:maxVisited 2000 .        # optional limit on the #of visited vertices.
  }
}
group by ?depth order by ?depth
SYSTAP™, LLC © 2006-2014 All Rights Reserved
46 http://www.bigdata.com/blog
depth   count
1       207
2       1,985
3       9,861
4       29,366
SPARQL Query
bigdata® Presented at Xcelerate 3/18/2014
47 http://www.bigdata.com/blog
Highly Available Replication Cluster
(HAJournalServer)
(bigdata 1.3 release)
bigdata® Presented at Xcelerate 3/18/2014
HA Replication Cluster
• Same backend database (Journal + RWStore)
– Same data scale as the Journal.
– Same IO profile as the Journal.
– Same low-latency query profile as the Journal.
• REST API is mandatory
– SPARQL Query
– SPARQL UPDATE
– SPARQL Federated Query
• Can be used to cross tenant or machine boundaries.
• Writes on the leader
– Identified in Zookeeper or using the REST API.
• Query on the leader or followers
– Each query is answered 100% locally.
• Zero coordination overhead.
• Linear scaling in query throughput.
SYSTAP™, LLC © 2006-2013 All Rights Reserved
48 http://www.bigdata.com/blog
bigdata® Presented at Xcelerate 3/18/2014
High Availability
• Shared-nothing architecture
– Same data on each node
– Coordinate only at commit
• Scaling
– 50 billion triples or quads
– Query throughput scales linearly
• Self-healing
– Automatic failover
– Automatic resync after disconnect
– Online single-node disaster recovery
• Online Backup
– Online snapshots (full backups)
– HA logs (incremental backups)
• Point-in-time recovery (offline)
SYSTAP™, LLC © 2006-2014 All Rights Reserved
49 http://www.bigdata.com/blog
(Diagram: an HA quorum with k=3 and size=3: one leader and two follower HAService instances.)
bigdata® Presented at Xcelerate 3/18/2014
High Availability
50 http://www.bigdata.com/blog
(Diagram: clients write to the leader of an HA quorum (k=3, size=3); writes are replicated to the two followers.)
• Write on the leader. Read on any service.
• Writes are replicated using low-level transfers.
• The quorum is fully consistent at each commit.
bigdata® Presented at Xcelerate 3/18/2014
High Availability
51 http://www.bigdata.com/blog
(Diagram: clients read from any service in the HA quorum (k=3, size=3: one leader, two followers).)
• Write on the leader. Read on any service.
• Writes are replicated using low-level transfers.
• The quorum is fully consistent at each commit.
bigdata® Presented at Xcelerate 3/18/2014
Self-Healing
(Diagram: a failed HAService synchronizes and then rejoins the quorum (k=3, size=2 while it is down).)
• A service can fail for a variety of reasons:
– JVM down
– Machine down
– Network partition
– Zookeeper timeout
– Discovery failure
– Wrong commit point
– Severe clock skew (delta)
• The goal is to guarantee eventual consistency without allowing intermediate illegal states.
• The persistent state of the service must remain self-consistent.
bigdata® Presented at Xcelerate 3/18/2014
HA Load Balancer
SYSTAP™, LLC © 2006-2014 All Rights Reserved
53 http://www.bigdata.com/blog
(Diagram: requests enter the load balancer proxy, running in the Jetty container, and are routed to the leader or to follower HAServices.)
• Transparent proxy
– Writes are proxied to the leader.
– Reads are load-balanced over the quorum.
• Balancing uses host metrics (CPU, IO wait) and service metrics (GC time).
• Custom policies
– Tenant aware
– Low- and high-latency pools
bigdata® Presented at Xcelerate 3/18/2014
HA Deployment
SYSTAP™, LLC © 2006-2014 All Rights Reserved
54 http://www.bigdata.com/blog
• EC2
– SSD instance types (IOPS, ephemeral)
– Snapshots and logs on EBS (durable)
– Restore from EBS on instance restart
• Coming soon
– Click-start HA clusters on EC2
– Private clouds (OpenStack, chef/puppet)
(Diagram: a three-node HAService cluster.)
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2014 All Rights Reserved
55 http://www.bigdata.com/blog
HA Service Architecture
(Diagram: application clients use the unified API (SPARQL XML and JSON results; RDF/XML, N-Triples, N-Quads, Turtle, TriG, and RDF/JSON interchange) against the REST API (SPARQL, GAS, etc.) of the HA replication cluster. Each node runs a ServiceStarter, an HAJournalServer, a Lookup Service, a Class Server, jetty hosting the REST API and load balancer, and the HAGlue API over the Journal; a Zookeeper ensemble coordinates the cluster.)
bigdata® Presented at Xcelerate 3/18/2014
BSBM 100M (3-Node HA Cluster)
56 http://www.bigdata.com/blog
(Chart: query performance scales linearly with cluster size: QMpH over trials for BSBM 100M on a 3-node replication cluster of Intel Mac Minis, showing aggregate and per-node throughput.)
• 3-node, shared-nothing replication cluster
– 3x 2011 Mac Mini (4 cores, 16G RAM, and SSD).
• Query throughput scales linearly.
• CPU bound
– 70-90k QMpH on newer servers.
bigdata® Presented at Xcelerate 3/18/2014
HA DEMO
• Brief demonstration of an HA-3 cluster
• Start 3 services (A, B, C)
• Load some data (the TBL foaf crawl):
DROP ALL;
LOAD <file:/Users/bryan/Documents/workspace/BIGDATA_RELEASE_1_2_0/bigdata-rdf/src/resources/data/foaf/data-0.nq.gz>;
LOAD <file:/Users/bryan/Documents/workspace/BIGDATA_RELEASE_1_2_0/bigdata-rdf/src/resources/data/foaf/data-1.nq.gz>;
LOAD <file:/Users/bryan/Documents/workspace/BIGDATA_RELEASE_1_2_0/bigdata-rdf/src/resources/data/foaf/data-2.nq.gz>;
– Identical data on all services:
SELECT (count(*) as ?c) { ?s ?p ?o } LIMIT 1
• Self-healing
– Shut down C
– Load more data on A + B. Commit.
LOAD <file:/Users/bryan/Documents/workspace/BIGDATA_RELEASE_1_2_0/bigdata-rdf/src/resources/data/foaf/data-3.nq.gz>;
– Restart C. It resyncs and joins the quorum.
• Snapshot service.
File                 Quads       Size (gz)
timbl/data-0.nq.gz   89          2.5K
timbl/data-1.nq.gz   16516       293K
timbl/data-2.nq.gz   87250       1.2M
timbl/data-3.nq.gz   388412      5.1M
timbl/data-4.nq.gz   9405528     113M
timbl/data-5.nq.gz   93898523    1016M
timbl/data-6.nq.gz   101010423   1.2G
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
58 http://www.bigdata.com/blog
Bigdata®
Services and dynamics
bigdata® Presented at Xcelerate 3/18/2014
Related “Graph” Technologies
SYSTAP™, LLC © 2006-2013 All Rights Reserved
59 http://www.bigdata.com/blog
Redpoint repositions existing technology. MPGraph compares favorably with high end hardware solutions from YARC, Oracle, and SAP, but is open source and uses commodity hardware.
Pair up bigdata and MPGraph
STTR
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
60 http://www.bigdata.com/blog
bigdata® is a federation of services
(Diagram: application clients use a unified API (SPARQL XML and JSON results; RDF/XML, N-Triples, N-Quads, Turtle, TriG, and RDF/JSON interchange) against client services providing RDF data and SPARQL query management functions. Distributed index management and query is provided by the data services together with a registrar, shard locator, transaction manager, load balancer, and Zookeeper.)
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
61 http://www.bigdata.com/blog
Typical Software Stack
(Diagram: the application layer sits on a unified API and its implementation; API frameworks (Spring, etc.) and the Sesame framework (SAIL, RDF, SPARQL) sit above the Bigdata RDF database and Bigdata component services; HTTP, Jini, and Zookeeper provide cluster and storage management; everything runs on Java over Linux.)
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
62 http://www.bigdata.com/blog
Service Discovery (1/2)
• Services discover registrars. Registrar discovery is configurable using either unicast or multicast protocols.
• Services advertise themselves and look up other services (a).
• Clients use the shard locator to locate key-range shards for scale-out indices (b).
(Diagram: the client service, shard locator, and data services register with and look up one another via the registrar (a); the client consults the shard locator (b); all services communicate over an RMI / NIO bus.)
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
63 http://www.bigdata.com/blog
Service Discovery (2/2)
• Clients resolve shard locators to data service identifiers (a), then look up the data services in the service registrar (b).
• Data moves directly between clients and services (c).
• Service protocols are not limited to RMI. Custom NIO protocols are used for high data throughput.
• Client libraries encapsulate this for applications, including caching of service lookup and shard resolution.
(Diagram: (a) the client resolves shard locators, (b) looks up data services in the registrar, and (c) exchanges data directly with the data services; control messages and data flow over the RMI / NIO bus.)
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
64 http://www.bigdata.com/blog
Persistence Store Mechanisms
• Read/Write (RW) store
– Efficiently recycles allocation slots on the backing file
– Used by services that need persistence without sharding, HA, etc.
– Also used in the scale-up, single-machine database (~50B triples)
• Write Once, Read Many (WORM) store
– Append-only, log-structured store (aka “journal”)
– Used by the data services to absorb writes
– Plays an important role in the scale-out database architecture
• Index segment (SEG) store
– Read-optimized B+Tree files
– Generated by bulk index build operations on a data service
– Plays an important role in dynamic sharding and analytic query
RWStore Journal
WORM Journal
index segments
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
65 http://www.bigdata.com/blog
The Data Service (1/2)
(Diagram: clients issue scattered writes to and gathered reads from the data services; each data service has a journal and index segments, with periodic overflow.)
Append-only journals and read-optimized index segments are the basic building blocks.
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
66 http://www.bigdata.com/blog
The Data Service (2/2)
Behind the scenes on a data service:
• Journal
– Block append for ultra-fast writes
– Target size on disk ~200MB
• Index Segments
– 98%+ of all data on disk
– Target shard size on disk ~200MB
– Bloom filters (fast rejection)
– At most one IO per leaf access
– Multi-block IO for leaf scans
• Overflow
– Fast synchronous overflow
– Asynchronous index builds
– Key to dynamic sharding
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
67 http://www.bigdata.com/blog
Bigdata® Indices
Scale-out B+Tree and Dynamic Sharding
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
68 http://www.bigdata.com/blog
Bigdata® Indices
• Dynamically key-range partitioned B+Trees for indices
– Index entries (tuples) map unsigned byte[] keys to byte[] values.
– A “deleted” flag and timestamp are used for MVCC and dynamic sharding.
• Index partitions are distributed across the data services on a cluster
– Located by a centralized metadata service
(Diagram: a B+Tree with internal nodes holding separator keys and child references, and leaves holding keys 1-10 mapped to values v1-v10.)
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
69 http://www.bigdata.com/blog
Dynamic Key Range Par77oning
p0 split p1 p2
p2 join p3 p1
p3 move p4
Splits break down the shards as the data scale increases.
Moves redistribute shards onto exis7ng or new nodes in the cluster.
Joins merge shards when data scale decreases.
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
70 http://www.bigdata.com/blog
Dynamic Key Range Partitioning
• Initial conditions place the first shard, representing the entire index key range, on an arbitrary data service.
(Diagram: the shard locator service maps shard p0 ([], ∞) to DataService1.)
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
71 http://www.bigdata.com/blog
Dynamic Key Range Partitioning
• Writes cause the shard to grow. Eventually its size on disk exceeds a preconfigured threshold.
(Diagram: shard p0 ([], ∞) on DataService1, tracked by the shard locator service.)
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
72 http://www.bigdata.com/blog
Dynamic Key Range Partitioning
• Instead of a simple two-way split, the initial shard is “scatter-split” so that all data services can start managing data.
(Diagram: p0 is scatter-split into shards p1-p9 on DataService1, tracked by the shard locator service.)
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
73 http://www.bigdata.com/blog
Dynamic Key Range Partitioning
• The newly created shards are then dispersed around the cluster. Subsequent splits are two-way, and moves occur based on relative server load (decided by the Load Balancer Service).
(Diagram: shards p1, p2, …, p9 are moved from DataService1 onto DataService2 through DataService9, tracked by the shard locator service.)
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
74 http://www.bigdata.com/blog
Shard Evolution
• Builds generate index segments from just the old journal.
• A merge compacts the shard view into a single index segment.
(Diagram: builds of shards p0, p1, and p2 against journal0, and a merge producing p3.)
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
75 http://www.bigdata.com/blog
Shard Evolution
• Initial journal on the data service.
• Incremental build of new segments for a shard with each journal overflow.
• The shard periodically undergoes a compacting merge.
• The shard will split at 200MB.
(Diagram: over time t0 … tn, tn+1, writes go to the current journal (journal0 … journaln, journaln+1), reads span the journal plus segments seg 0 … seg n-1, builds produce seg n, and merges compact the view.)
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
76 http://www.bigdata.com/blog
Bulk Data Load
High Throughput with Dynamic Sharding
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
77 http://www.bigdata.com/blog
Bulk Data Load
• Very high data load rates
– 1B triples in under an hour (better than 300,000 triples per second on a 16-node cluster).
• Executed as a distributed job
– Read data from a file system, the web, HDFS, etc.
• The database remains available for query during the load
– Read from historical commit points with snapshot isolation.
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
78 http://www.bigdata.com/blog
Distributed Bulk Data Loader
• The application client identifies the RDF data files to be loaded.
• One client is elected as the job master and coordinates the job across the other clients.
• Writes are scattered across the data service nodes.
• Clients read directly from shared storage.
(Diagram: an application client drives a set of client services, which scatter writes across the data services.)
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
79 http://www.bigdata.com/blog
Input (2.5B triples)
(Chart: billions of told triples loaded vs. minutes; a 16-node cluster loading at 180k triples per second.)
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
80 http://www.bigdata.com/blog
Throughput (2.5B triples)
(Chart: triples per second vs. minutes for a 16-node cluster loading 2.5B triples.)
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
81 http://www.bigdata.com/blog
Disk Utilization (~56 bytes/triple)
(Chart: billions of bytes on disk under management vs. minutes.)
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
82 http://www.bigdata.com/blog
Shards over time (2.5B triples)
(Chart: total shard count on the cluster over time (minutes) for 2.5B triples.)
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
83 http://www.bigdata.com/blog
Dynamic Sharding in Action
(Chart: the number of shard split, build, and merge operations vs. minutes for 2.5B triples.)
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
84 http://www.bigdata.com/blog
Scatter Splits in Action (zoom on the first 3 minutes)
bigdata® Presented at Xcelerate 3/18/2014
Vectored Query in Scale-Out
SYSTAP™, LLC © 2006-2012 All Rights Reserved
85 http://www.bigdata.com/blog
bigdata® Presented at Xcelerate 3/18/2014
bigdata ®
Bryan Thompson SYSTAP, LLC
bryan@systap.com
86 http://www.bigdata.com/blog
http://www.systap.com/bigdata.htm http://www.bigdata.com/blog
bigdata® Presented at Xcelerate 3/18/2014
Backup Material
87 http://www.bigdata.com/blog
SYSTAP™, LLC © 2006-2014 All Rights Reserved
bigdata® Presented at Xcelerate 3/18/2014
Statement Identifiers
RDF Graphs with efficient link attributes
SYSTAP™, LLC © 2006-2012 All Rights Reserved
88 http://www.bigdata.com/blog
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
89 http://www.bigdata.com/blog
Statement Level Metadata
• Important to know where data came from in a mashup:
:mike :memberOf :SYSTAP .
dc:source <http://www.systap.com> .
• But you CANNOT say that in RDF.
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
90 http://www.bigdata.com/blog
RDF “Reification”
• Creates a “model” of the statement:
_:s1 rdf:subject :mike .
_:s1 rdf:predicate :memberOf .
_:s1 rdf:object :SYSTAP .
_:s1 rdf:type rdf:Statement .
• Then you can say:
_:s1 dc:source <http://www.systap.com> .
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
91 http://www.bigdata.com/blog
Statement Identifier Mode (SIDs)
• Special database mode
– SIDs look just like blank nodes
– Leverages the named graph of the statement
• SIDs let you do exactly what you want:
:mike :memberOf :SYSTAP _:s1 .
_:s1 dc:source <http://www.systap.com> .
• Use SIDs in SPARQL:
select ?s ?o ?source
where {
  GRAPH ?sid { ?s :memberOf ?o } .
  ?sid dc:source ?source .
}
bigdata® Presented at Xcelerate 3/18/2014
SYSTAP™, LLC © 2006-2012 All Rights Reserved
92 http://www.bigdata.com/blog
Reification Done Right
• Extends the concept to support quads.
– Outcome of the Dagstuhl 2012 Semantic Data Management workshop.
– Collaborative effort with SYSTAP, OpenLink, Humboldt University, and the Karlsruhe Institute of Technology.
– W3C Member Submission in preparation.
– Harmonized with RDF model theory & SPARQL algebra.
– Efficient in index structures and queries.
• Extensions for N3, Turtle, and SPARQL are proposed.
• Interchange and query for link attributes (graph databases).
bigdata® Presented at Xcelerate 3/18/2014
Works with triples or quads
• Inline statements into statements:
<< :mike :memberOf :SYSTAP >>
  dc:source <http://www.systap.com> .
• The same syntax works for query:
select ?s ?o ?source
where {
  << ?s :memberOf ?o >> dc:source ?source .
}
• Standardized approach for:
– Link attributes (graph databases).
– Confidence measures (entity / link extractors).
– Datum-level security models.
SYSTAP™, LLC © 2006-2012 All Rights Reserved
93 http://www.bigdata.com/blog