Piecing together large puzzles, efficiently Scalable bulk loading into graph databases Work in progress paper Gabriel Campero Durand, Jingyi Ma, Marcus Pinnecke, Gunter Saake Databases and Software Engineering Workgroup, OvGU University of Magdeburg


Piecing together large puzzles, efficiently

Scalable bulk loading into graph databases
Work in progress paper

Gabriel Campero Durand, Jingyi Ma, Marcus Pinnecke, Gunter Saake

Databases and Software Engineering Workgroup, OvGU University of Magdeburg


Agenda

● Motivation
● Background & The Graph Loading Process
● Experiments
● Conclusion & Future Work


Motivation

How can we better understand the networks that we belong to?


Motivation


Motivation


● An example of a practical application: Recommendations@Pinterest


Motivation


● An example of a practical application: Dependency-driven analytics@Microsoft


Motivation


● Large graphs are ubiquitous
○ ⅕ of participants use graphs with >100 M edges

● Scalability is the main challenge
● Graph DBMSs are currently the most popular tool


Motivation


● User experience starts with data loading

● This can still be improved
○ Currently no standard scale-out solution for the process (our focus)
○ Limited handling of variable input data characteristics

bin/neo4j-import --into retail.db --id-type string \
    --nodes:Customer customers.csv --nodes products.csv \
    --nodes orders_header.csv,orders1.csv,orders2.csv \
    --relationships:CONTAINS order_details.csv \
    --relationships:ORDERED customer_orders_header.csv,orders1.csv,orders2.csv


Background


● Input data characteristics

Edge Lists, from SNAP Astro-Physics Collaboration Dataset

Implicit Entities, from SNAP Amazon Movie Reviews Dataset

Also property encodings, others…

Working with large and diverse graph datasets is still cumbersome today
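To make the edge-list case concrete, here is a minimal sketch of a reader for a whitespace-separated edge list with `#` comment lines, which is how SNAP distributes many of its graphs. The sample lines are made up for illustration and are not taken from the Astro-Physics dataset. The function name `parse_edge_list` is hypothetical. The point it illustrates is that vertices are only implicit in such files and must be derived from the edges before loading.

```python
from typing import Iterable, List, Set, Tuple


def parse_edge_list(lines: Iterable[str]) -> Tuple[Set[str], List[Tuple[str, str]]]:
    """Parse a whitespace-separated edge list, skipping blank and '#' lines."""
    vertices: Set[str] = set()
    edges: List[Tuple[str, str]] = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Only the first two columns are read as source and target ids.
        src, dst = line.split()[:2]
        vertices.update((src, dst))  # vertices are implicit in the file
        edges.append((src, dst))
    return vertices, edges


# Made-up sample in the same layout (header comment, then "src<TAB>dst").
sample = [
    "# FromNodeId\tToNodeId",
    "1\t2",
    "1\t3",
    "2\t3",
]
vertices, edges = parse_edge_list(sample)
```

Files with implicit entities or property encodings need a different reader, which is one reason variable input formats complicate loading.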


Background


● But before going any further, the single unavoidable slide :)
● Property graphs (the underlying logical model we’re assuming)

● Directed
● Labeled
● Attributed
● Multi-graph
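The four properties above can be sketched as a small in-memory data structure. This is an illustrative model only, not how any particular graph database stores data. All class and field names (`Vertex`, `Edge`, `PropertyGraph`) and the sample data are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Vertex:
    id: str
    label: str                                                  # labeled
    properties: Dict[str, Any] = field(default_factory=dict)   # attributed


@dataclass
class Edge:
    source: str                                                 # directed: source -> target
    target: str
    label: str
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class PropertyGraph:
    vertices: Dict[str, Vertex] = field(default_factory=dict)
    # A list (not a set keyed by endpoints) permits parallel edges: multi-graph.
    edges: List[Edge] = field(default_factory=list)

    def add_vertex(self, v: Vertex) -> None:
        self.vertices[v.id] = v

    def add_edge(self, e: Edge) -> None:
        # A typical loading constraint: both endpoints must already exist.
        if e.source not in self.vertices or e.target not in self.vertices:
            raise KeyError("both endpoints must exist before adding an edge")
        self.edges.append(e)


g = PropertyGraph()
g.add_vertex(Vertex("c1", "Customer", {"name": "Ada"}))
g.add_vertex(Vertex("p1", "Product", {"title": "Lamp"}))
g.add_edge(Edge("c1", "p1", "ORDERED", {"qty": 1}))
g.add_edge(Edge("c1", "p1", "ORDERED", {"qty": 2}))  # parallel edge is allowed
```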


The Graph Loading Process


Topology-only representations

Complete representations

● Moving data from input files to physical storage, while respecting constraints


The Graph Loading Process


Experiments


● The basic question we address today:

○ How much can a developer today scale out and tune the process, without changing database internals?


Experiments


● Setup

JanusGraph (formerly Titan)

Datasets: Wiki-RfA (10,835 V, 159,388 E) and Google-Web (875,731 V, 5,105,039 E)


Experiments


● Setup
○ JanusGraph version 0.1.1 (May 11, 2017)
○ Apache Cassandra 2.1.1
○ Commodity multi-core machine composed of 2 Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz processors (8 cores in total) with 251 GB of memory


Experiments


Gains from batching

● Fit more data inside a single transaction
● The bigger the batch size, the faster the loading process
○ Batching works!

● Larger batch sizes don't guarantee better performance
○ Poor use of transaction caches
○ Higher costs for failed transactions
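The batching idea can be sketched independently of any database: group the input into fixed-size batches and commit once per batch instead of once per element. The names `batches` and `load_in_batches` and the `commit` callback are hypothetical. In a real loader the callback would open a transaction, write the batch, and commit, which this sketch does not attempt to reproduce.

```python
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batches(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final, possibly smaller batch
        yield batch


def load_in_batches(
    items: Iterable[T], batch_size: int, commit: Callable[[List[T]], None]
) -> int:
    """Call commit once per batch and return the number of commits."""
    commits = 0
    for batch in batches(items, batch_size):
        commit(batch)  # stand-in for "one transaction per batch"
        commits += 1
    return commits


# Recording the batches in a list stands in for writing to a database.
committed: List[List[int]] = []
n_commits = load_in_batches(range(10), 4, committed.append)
```

With 10 items and a batch size of 4 this issues 3 commits instead of 10. The trade-off noted above still applies: a failed commit discards a whole batch, so larger batches make each failure more expensive.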


Experiments


Experiments


Adding some parallelism

● Partition the data into chunks and load them in parallel
○ Here we report the average across strategies.

● This consistently reduces the loading time
● Less impact than batching
○ Multiple users working on the same data introduce transaction commit overheads.
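A minimal sketch of chunked parallel loading, using only the Python standard library. The helper names `split_into_chunks` and `load_parallel` are hypothetical, and `load_chunk` stands in for a real loader that would open its own connection and transaction per chunk. The round-robin split is just one simple choice and is not one of the strategies evaluated in the talk.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def split_into_chunks(items: Sequence[T], n_chunks: int) -> List[List[T]]:
    """Round-robin split, so chunk sizes differ by at most one."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be >= 1")
    return [list(items[i::n_chunks]) for i in range(n_chunks)]


def load_parallel(
    items: Sequence[T], n_workers: int, load_chunk: Callable[[List[T]], int]
) -> int:
    """Load each chunk on its own worker and return the total items loaded."""
    chunks = split_into_chunks(items, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Each worker reports how many items it loaded.
        return sum(pool.map(load_chunk, chunks))


edges = [(i, i + 1) for i in range(10)]
# `len` is a placeholder loader that only counts the items in its chunk.
loaded = load_parallel(edges, 3, len)
```

Threads fit here on the assumption that a real loader spends most of its time waiting on the database. If workers write to overlapping vertices, the commit overhead mentioned above is to be expected.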


Experiments


Experiments


A closer look at the partitioning strategies

● EE: partition edges, balance edges
● VV: partition vertices, balance vertices
● BE: partition vertices, balance edges
● DS: extension of BE that deals with skew

All achieve good balancing in these datasets
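The slides only name the strategies, so the following is one possible reading of EE, VV, and BE, not the authors' implementation. It assumes that edges follow their source vertex when vertices are partitioned, that EE and VV deal items out round-robin, and that BE places vertices greedily by out-degree. DS is left out because its skew handling is not described here. All function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple

Edge = Tuple[Hashable, Hashable]


def _edges_by_source(edges: Sequence[Edge]) -> Dict[Hashable, List[Edge]]:
    """Group edges by source vertex, keeping first-seen vertex order."""
    grouped: Dict[Hashable, List[Edge]] = defaultdict(list)
    for edge in edges:
        grouped[edge[0]].append(edge)
    return grouped


def partition_ee(edges: Sequence[Edge], k: int) -> List[List[Edge]]:
    """EE (assumed): deal edges round-robin, so edge counts are balanced."""
    return [list(edges[i::k]) for i in range(k)]


def partition_vv(edges: Sequence[Edge], k: int) -> List[List[Edge]]:
    """VV (assumed): deal source vertices round-robin; edges follow their source."""
    grouped = _edges_by_source(edges)
    parts: List[List[Edge]] = [[] for _ in range(k)]
    for i, src in enumerate(grouped):
        parts[i % k].extend(grouped[src])
    return parts


def partition_be(edges: Sequence[Edge], k: int) -> List[List[Edge]]:
    """BE (assumed): place each vertex on the partition with the fewest edges."""
    grouped = _edges_by_source(edges)
    parts: List[List[Edge]] = [[] for _ in range(k)]
    # Placing high-degree vertices first gives the greedy choice more room.
    for src in sorted(grouped, key=lambda s: len(grouped[s]), reverse=True):
        min(parts, key=len).extend(grouped[src])
    return parts


# Made-up skewed sample: vertex "a" has four outgoing edges.
sample = [("a", "b"), ("a", "c"), ("a", "d"), ("a", "e"), ("b", "c"), ("c", "d")]
sizes_ee = [len(p) for p in partition_ee(sample, 2)]
sizes_vv = [len(p) for p in partition_vv(sample, 2)]
sizes_be = [len(p) for p in partition_be(sample, 2)]
```

On this made-up skewed sample the edge counts per partition are 3 and 3 for EE, 5 and 1 for VV, and 4 and 2 for BE, which shows how balancing vertices can leave edges unbalanced when degrees are skewed.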


Experiments


Experiments


No big differences between them for these datasets

The only imbalance appears in Wiki-RfA with VV and 2 partitions.

Distribution Across Partitions in Google Web =>


Experiments


No big differences between them for these datasets

The only imbalance appears in Wiki-RfA with VV and 2 partitions.

Distribution Across Partitions in Wiki-RfA =>


Experiments


Putting it all together

Load Time Using Different Partitioning Strategies with Batch Size = 10, 100, 1000 (Wiki-RfA)

● Combining batching and partitioning leads to degraded performance
○ In a multi-user environment, transaction commit time increases with batch size if users select the same data.
○ It also increases with more users.

No improvements over batching


Experiments


Load Time Using Different Partitioning Strategies with Batch Size = 10, 100, 1000 (Google Web)

● Combining batching and partitioning leads to degraded performance
○ In a multi-user environment, transaction commit time increases with batch size if users select the same data.
○ It also increases with more users.

No improvements over batching


Conclusion


● Batching is the best first strategy. We've seen load times drop from 100 minutes to 1.5.
○ Small disclaimer: gains do not grow in proportion to batch sizes.

● But combining batching and partitioning is not straightforward and can degrade performance.
○ How can we make them work well together?

● EE or BE/DS could be the default partitioning strategy
○ But load imbalance is not the only factor affecting performance

More studies come next: we will turn our questions toward physical storage alternatives, in line with our broader interest in supporting adaptive HTAP designs.


Thanks :)

Questions?
