
Piecing together large puzzles, efficiently

Scalable bulk loading into graph databases

Work in progress paper

Gabriel Campero Durand, Jingyi Ma, Marcus Pinnecke, Gunter Saake

Databases and Software Engineering Workgroup, OvGU University of Magdeburg

Agenda

● Motivation
● Background & The Graph Loading Process
● Experiments
● Conclusion & Future Work



Motivation

How can we better understand the networks that we belong to?



Motivation


● An example of a practical application: Recommendations@Pinterest


Motivation


● An example of a practical application: Dependency-driven analytics@Microsoft


Motivation


● Large graphs are ubiquitous
○ ⅕ of participants use graphs with >100 M edges
● Scalability is the main challenge
● Graph DBMSs are currently the most popular tool


Motivation


● User experience starts with data loading

● This can still be improved
○ Currently no standard scale-out solution for the process (our focus)
○ Limited handling of variable input data characteristics

bin/neo4j-import --into retail.db --id-type string \
  --nodes:Customer customers.csv --nodes products.csv \
  --nodes orders_header.csv,orders1.csv,orders2.csv \
  --relationships:CONTAINS order_details.csv \
  --relationships:ORDERED customer_orders_header.csv,orders1.csv,orders2.csv


Background


● Input data characteristics

Edge Lists, from SNAP Astro-Physics Collaboration Dataset

Implicit Entities, from SNAP Amazon Movie Reviews Dataset

Also property encodings, others…
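As a rough illustration of the edge-list input format, the following Python sketch reads a whitespace-separated edge list with '#' comment lines, the layout commonly used by SNAP files. The sample data and the function name read_edge_list are made up for illustration and are not taken from the datasets above. Note that the vertex set is never listed explicitly; it has to be derived from the edges.

```python
import io

def read_edge_list(lines):
    """Yield (source, target) pairs from an edge list.

    Lines starting with '#' are treated as comments; fields are split on whitespace.
    """
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        source, target = line.split()[:2]
        yield source, target

# Made-up sample in the same layout (not real dataset content).
sample = io.StringIO("# FromNodeId\tToNodeId\n1\t2\n1\t3\n2\t3\n")
edges = list(read_edge_list(sample))

# Vertices are implicit in an edge list, so collect them from the edges.
vertices = sorted({v for edge in edges for v in edge})
```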

Working today with large and diverse graph datasets is cumbersome


Background


● But before going any further, the single unavoidable slide :)
● Property graphs (the underlying logical model we're assuming)
○ Directed
○ Labeled
○ Attributed
○ Multi-graph
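To make the four properties concrete, here is a minimal in-memory sketch in Python. The class and field names (Vertex, Edge, PropertyGraph) are hypothetical and do not correspond to the API of any particular graph database; the sample customer and product data are invented.

```python
from dataclasses import dataclass, field

@dataclass
class Vertex:
    id: str
    label: str                                      # labeled
    properties: dict = field(default_factory=dict)  # attributed

@dataclass
class Edge:
    source: str                                     # directed: source -> target
    target: str
    label: str
    properties: dict = field(default_factory=dict)

@dataclass
class PropertyGraph:
    vertices: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)       # a list, so parallel edges are allowed

    def add_vertex(self, id, label, **properties):
        self.vertices[id] = Vertex(id, label, properties)

    def add_edge(self, source, target, label, **properties):
        # Constraint: both endpoints must already exist.
        if source not in self.vertices or target not in self.vertices:
            raise KeyError("both endpoints must exist before adding an edge")
        self.edges.append(Edge(source, target, label, properties))

g = PropertyGraph()
g.add_vertex("c1", "Customer", name="Ada")
g.add_vertex("p1", "Product", title="Puzzle")
g.add_edge("c1", "p1", "ORDERED", quantity=1)
g.add_edge("c1", "p1", "ORDERED", quantity=3)  # multi-graph: second edge between the same pair
```

The endpoint check in add_edge is one example of a constraint a loader has to respect: vertices must be written before the edges that reference them.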


The Graph Loading Process


Topology-only representations

Complete representations

● Moving data from input files to physical storage, while complying with constraints


Experiments


● The basic question we address today:

○ How much can a developer today scale out and tune the loading process, without changing database internals?


Experiments


● Setup

JanusGraph (formerly Titan)

Datasets: Wiki-RfA (10,835 V, 159,388 E) and Google-Web (875,731 V, 5,105,039 E)


Experiments


● Setup
○ JanusGraph version 0.1.1 (May 11, 2017)
○ Apache Cassandra 2.1.1
○ Commodity multi-core machine with 2 Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz processors (8 cores in total) and 251 GB of memory


Experiments


Gains from batching

● Fit more data inside a single transaction
● In general, the bigger the batch size, the faster the loading process
○ Batching works!
● However, larger batch sizes don't guarantee better performance
○ Poor use of transaction caches
○ Higher costs for failed transactions
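The idea of batching can be sketched as follows in Python. CountingStore is a hypothetical stand-in for a database client that only counts commits; it is not the JanusGraph API, and the sketch says nothing about actual load times. It only shows how the batch size controls the number of transactions.

```python
from itertools import islice

class CountingStore:
    """Hypothetical stand-in for a graph database client; it only counts commits."""
    def __init__(self):
        self.pending = []
        self.edges = []
        self.commits = 0

    def add_edge(self, source, target):
        self.pending.append((source, target))

    def commit(self):
        self.edges.extend(self.pending)
        self.pending.clear()
        self.commits += 1

def batches(items, batch_size):
    """Yield consecutive lists of at most batch_size items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

def load_in_batches(store, edges, batch_size):
    for batch in batches(edges, batch_size):
        for source, target in batch:
            store.add_edge(source, target)
        store.commit()  # one transaction per batch

edges = [(i, i + 1) for i in range(1000)]
unbatched, batched = CountingStore(), CountingStore()
load_in_batches(unbatched, edges, batch_size=1)    # 1000 commits
load_in_batches(batched, edges, batch_size=100)    # 10 commits
```

With a larger batch, a failed commit would also discard more pending work, which is one way to picture the higher cost of failed transactions.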


Experiments


Adding some parallelism

● Partition the data into chunks and load them in parallel
○ Here we report the average over the partitioning strategies.
● This consistently reduces the loading time
● Less impact than batching
○ Multiple users working on the same data incur transaction commit overhead.
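A minimal sketch of chunked parallel loading in Python, using a thread pool from the standard library. The load_chunk function is a hypothetical placeholder: a real loader would open its own transaction against the database and write the chunk, which is where the commit overhead mentioned above would show up.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_chunks(items, num_chunks):
    """Split items into num_chunks contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), num_chunks)
    chunks, start = [], 0
    for i in range(num_chunks):
        end = start + base + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks

def load_chunk(chunk):
    """Hypothetical loader; here it only reports how many edges it was given."""
    return len(chunk)

edges = [(i, i + 1) for i in range(10)]
chunks = split_into_chunks(edges, 4)

# One worker per chunk; results come back in chunk order.
with ThreadPoolExecutor(max_workers=4) as pool:
    loaded = list(pool.map(load_chunk, chunks))
```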


Experiments


A closer look at the partitioning strategies

● EE: partition edges, balance edges
● VV: partition vertices, balance vertices
● BE: partition vertices, balance edges
● DS: extension to BE, deals with skew

All achieve good balancing in these datasets
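The slides do not spell out the algorithms behind these labels, so the following Python sketch is only one possible reading of EE and BE, under the assumption that "partition vertices" means keeping all edges of a source vertex in the same partition. The function names and the greedy heuristic are illustrative choices, not the strategies as implemented in the experiments.

```python
from collections import defaultdict
import heapq

def partition_edges(edges, k):
    """EE-style reading: deal edges out round-robin, so edge counts differ by at most one."""
    parts = [[] for _ in range(k)]
    for i, edge in enumerate(edges):
        parts[i % k].append(edge)
    return parts

def partition_vertices_balance_edges(edges, k):
    """BE-style reading: keep each source vertex's edges together,
    greedily assigning vertices to the partition with the fewest edges."""
    by_source = defaultdict(list)
    for source, target in edges:
        by_source[source].append((source, target))
    heap = [(0, p) for p in range(k)]  # (edge count, partition index)
    parts = [[] for _ in range(k)]
    # Place high-degree vertices first so small ones can even out the load.
    for source in sorted(by_source, key=lambda s: -len(by_source[s])):
        count, p = heapq.heappop(heap)
        parts[p].extend(by_source[source])
        heapq.heappush(heap, (count + len(by_source[source]), p))
    return parts

# Made-up toy graph with one high-degree vertex ("a").
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("c", "d"), ("d", "a")]
ee = partition_edges(edges, 2)
be = partition_vertices_balance_edges(edges, 2)
```

On this toy input both readings give two partitions of three edges each, but only the vertex-based one keeps all edges of "a" together.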


Experiments


No big differences between them for these datasets

The only imbalance is in Wiki-RfA with VV and 2 partitions.

Distribution Across Partitions in Google Web =>


Experiments


No big differences between them for these datasets

The only imbalance is in Wiki-RfA with VV and 2 partitions.

Distribution Across Partitions in Wiki-RfA =>


Experiments


Putting it all together

Load Time Using Different Partitioning Strategies with Batch Size = 10, 100, 1000 (Wiki-RfA)

● Combination of batching and partitioning leads to degraded performance
○ In a multi-user environment, transaction commit time increases with batch size if users select the same data.
○ It also increases with more users.

No improvements over batching


Experiments


Load Time Using Different Partitioning Strategies with Batch Size = 10, 100, 1000 (Google Web)

● Combination of batching and partitioning leads to degraded performance
○ In a multi-user environment, transaction commit time increases with batch size if users select the same data.
○ It also increases with more users.

No improvements over batching


Conclusion


● Batching is the best first strategy. We've seen load times drop from 100 minutes to 1.5.
○ Small disclaimer: gains do not grow in proportion to batch sizes.
● But the combination of batching and partitioning is not straightforward and can degrade performance.
○ How can we make them work well together?
● EE, BE/DS could be the default partitioning strategy
○ But load imbalance is not the only factor affecting performance

Next, we plan further studies that extend these questions to physical storage alternatives, as part of our broader interest in supporting adaptive HTAP designs.


Thanks :)

Questions?


