Not Your Father's Database by Vida Ha

Not Your Father’s Database: How to Use Apache Spark Properly in Your Big Data Architecture Spark Summit East 2016

Transcript of Not Your Father's Database by Vida Ha

Page 1: Not Your Father's Database by Vida Ha

Not Your Father’s Database: How to Use Apache Spark Properly in Your Big Data Architecture

Spark Summit East 2016

Page 5: Not Your Father's Database by Vida Ha

About Me

2005 Mobile Web & Voice Search

2012 Reporting & Analytics

2014 Solutions Engineering

Page 6: Not Your Father's Database by Vida Ha

Is this your Spark infrastructure?

This system talks like a SQL Database…

[Diagram labels: SQL, HDFS]

Page 7: Not Your Father's Database by Vida Ha

Is this your Spark infrastructure?

But the performance is very different…

[Diagram labels: SQL, HDFS]

Page 8: Not Your Father's Database by Vida Ha

Just in Time Data Warehouse w/ Spark

HDFS

Page 10: Not Your Father's Database by Vida Ha

Just in Time Data Warehouse w/ Spark

HDFS, and more…

Page 11: Not Your Father's Database by Vida Ha

Today’s Goal

Know when to use other data stores besides file systems

Page 12: Not Your Father's Database by Vida Ha

Good: General Purpose Processing

Types of Data Sets to Store in File Systems:
• Archival Data
• Unstructured Data
• Social Media and other web datasets
• Backup copies of data stores

Page 13: Not Your Father's Database by Vida Ha

Good: General Purpose Processing

Types of workloads:
• Batch Workloads
• Ad Hoc Analysis
  – Best Practice: Use in-memory caching
• Multi-step Pipelines
• Iterative Workloads

Page 14: Not Your Father's Database by Vida Ha

Good: General Purpose Processing

Benefits:
• Inexpensive Storage
• Incredibly flexible processing
• Speed and Scale

Page 16: Not Your Father's Database by Vida Ha

Bad: Random Access

sqlContext.sql("select * from my_large_table where id=2134823")

Will this command run in Spark? Yes, but it’s not very efficient: Spark may have to go through all your files to find your row.

Page 17: Not Your Father's Database by Vida Ha

Bad: Random Access

Solution: If you frequently access your data at random, use a database.

• For traditional SQL databases, create an index on your key column.

• Key-value NoSQL stores retrieve the value for a key efficiently out of the box.
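A minimal sketch of the first option, using Python's built-in sqlite3 module as a stand-in for a traditional SQL database. The table and column names are borrowed from the query on the earlier slide; the `payload` column, the index name, and the row count are assumptions made up for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for a traditional SQL database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_large_table (id INTEGER, payload TEXT)")
conn.executemany(
    "INSERT INTO my_large_table VALUES (?, ?)",
    ((i, f"row-{i}") for i in range(10_000)),
)

# The index on the key column lets the database seek to the row
# instead of reading the whole table.
conn.execute("CREATE INDEX idx_my_large_table_id ON my_large_table (id)")


def lookup(row_id):
    """Fetch one row by key, or None if the key is absent."""
    return conn.execute(
        "SELECT id, payload FROM my_large_table WHERE id = ?", (row_id,)
    ).fetchone()


def query_plan(sql, params=()):
    """Return SQLite's textual plan for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " ".join(row[-1] for row in rows)


plan = query_plan("SELECT * FROM my_large_table WHERE id = ?", (1234,))
print(lookup(1234))
print(plan)
```

The printed plan should mention the index rather than a full scan, which is the property a file-based Spark table does not give you for single-row lookups.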

Page 18: Not Your Father's Database by Vida Ha

Bad: Frequent Inserts

sqlContext.sql("insert into TABLE myTable select fields from my2ndTable")

Each insert creates a new file:
• Inserts are reasonably fast.
• But querying will be slow…

Page 19: Not Your Father's Database by Vida Ha

Bad: Frequent Inserts

Solution:
• Option 1: Use a database to support the inserts.
• Option 2: Routinely compact your Spark SQL table files.
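The idea behind Option 2 can be sketched in plain Python, without Spark: many small files, one per insert, are rewritten as a single larger file holding the same rows. The file names, CSV layout, and helper functions below are hypothetical. In Spark itself, a common approach is to read the table and write it back with fewer partitions (for example with `coalesce` or `repartition`), which this sketch only imitates.

```python
import csv
import tempfile
from pathlib import Path


def write_small_files(table_dir, n_inserts):
    """Simulate frequent inserts: each insert lands in its own part file."""
    for i in range(n_inserts):
        with open(table_dir / f"part-{i:05d}.csv", "w", newline="") as f:
            csv.writer(f).writerow([i, f"value-{i}"])


def compact(table_dir, out_name="compacted.csv"):
    """Rewrite all part files as one file, then delete the originals."""
    parts = sorted(table_dir.glob("part-*.csv"))
    with open(table_dir / out_name, "w", newline="") as out:
        writer = csv.writer(out)
        for part in parts:
            with open(part, newline="") as f:
                writer.writerows(csv.reader(f))
    for part in parts:
        part.unlink()
    return len(parts)


def read_table(table_dir):
    """Read every row in the table directory, whatever the file layout."""
    rows = []
    for path in sorted(table_dir.glob("*.csv")):
        with open(path, newline="") as f:
            rows.extend(csv.reader(f))
    return rows


with tempfile.TemporaryDirectory() as tmp:
    table_dir = Path(tmp)
    write_small_files(table_dir, 100)
    rows_before = read_table(table_dir)
    files_before = len(list(table_dir.glob("*.csv")))
    merged = compact(table_dir)
    files_after = len(list(table_dir.glob("*.csv")))
    rows_after = read_table(table_dir)

print(files_before, "files before,", files_after, "file after")
```

A reader of the table opens one file instead of one hundred, while the rows it sees are unchanged.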

Page 20: Not Your Father's Database by Vida Ha

Good: Data Transformation/ETL

Use Spark to slice and dice your data files any way you need.

File storage is cheap: it is not an “anti-pattern” to store duplicate copies of your data.

Page 21: Not Your Father's Database by Vida Ha

Bad: Frequent/Incremental Updates

Update statements are not supported yet.

Why not?
• Random Access: Locate the row(s) in the files.
• Delete & Insert: Delete the old row and insert a new one.
• Update: File formats aren’t optimized for updating rows.

Solution: Many databases support efficient update operations.

Page 22: Not Your Father's Database by Vida Ha

Bad: Frequent/Incremental Updates

Use Case: Up-to-date, live views of your SQL tables.

[Diagram: Incremental SQL Query + Database Snapshot]

Tip: Use ClusterBy for fast joins, or bucketing with Spark 2.0.
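One reading of the diagram is that a periodic database snapshot is combined with the rows returned by an incremental query, keeping the newest version of each row. The sketch below shows that merge in plain Python; the `id` and `updated_at` column names, the sample rows, and the `live_view` helper are assumptions for illustration, not something the slides specify.

```python
def live_view(snapshot, incremental, key="id", version="updated_at"):
    """Overlay incremental rows on a snapshot, keeping the newest row per key."""
    latest = {row[key]: row for row in snapshot}
    for row in incremental:
        current = latest.get(row[key])
        # Take the incremental row if the key is new or the row is at least as recent.
        if current is None or row[version] >= current[version]:
            latest[row[key]] = row
    return sorted(latest.values(), key=lambda r: r[key])


# Hypothetical snapshot taken at time 1.
snapshot = [
    {"id": 1, "name": "alice", "updated_at": 1},
    {"id": 2, "name": "bob", "updated_at": 1},
]
# Hypothetical result of an incremental query for rows changed since time 1.
incremental = [
    {"id": 2, "name": "robert", "updated_at": 2},
    {"id": 3, "name": "carol", "updated_at": 2},
]

view = live_view(snapshot, incremental)
for row in view:
    print(row)
```

In Spark the same overlay would be expressed as a join or union on the key column, which is where the tip about fast joins applies.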

Page 23: Not Your Father's Database by Vida Ha

Good: Connecting BI Tools

Tip: Cache your tables for optimal performance.

HDFS

Page 24: Not Your Father's Database by Vida Ha

Bad: External Reporting w/ load

Too many concurrent requests will overload Spark.

HDFS

Page 25: Not Your Father's Database by Vida Ha

Bad: External Reporting w/ load

Solution: Write out to a DB to handle load.

[Diagram labels: HDFS, DB]

Page 26: Not Your Father's Database by Vida Ha

Good: Machine Learning & Data Science

Use MLlib, GraphX and Spark packages for machine learning and data science.

Benefits:
• Built-in distributed algorithms.
• In-memory capabilities for iterative workloads.
• Data cleansing, featurization, training, testing, etc.

Page 27: Not Your Father's Database by Vida Ha

Bad: Searching Content w/ load

sqlContext.sql("select * from mytable where name like '%xyz%'")

Spark will go through each row to find results.
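The row-by-row behavior is not unique to Spark. A pattern that starts with a wildcard cannot be answered by seeking in an ordinary index, so a regular SQL engine also examines every row. A small sketch with Python's built-in sqlite3 module (not Spark) shows this; the sample rows and the index name are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO mytable (name) VALUES (?)",
    [("abcxyzdef",), ("hello",), ("xyz",), ("world",)],
)
# Even with an index on the searched column, a leading wildcard
# prevents the engine from seeking directly to the matching rows.
conn.execute("CREATE INDEX idx_mytable_name ON mytable (name)")

sql = "SELECT * FROM mytable WHERE name LIKE '%xyz%'"

# The last column of each EXPLAIN QUERY PLAN row is the plan description.
plan = " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))
matches = sorted(row[1] for row in conn.execute(sql))

print(plan)
print(matches)
```

The plan reports a scan rather than a search, and every additional concurrent query repeats that scan.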

Page 28: Not Your Father's Database by Vida Ha

Thank you