Hadoop and RDBMS with Sqoop

© 2010 Quest Software, Inc. ALL RIGHTS RESERVED

Exchanging data with the Elephant: Connecting Hadoop and an RDBMS using SQOOP
Guy Harrison, Director, R&D Melbourne
www.guyharrison.net | [email protected] | @guyharrison

Description

Presentation given at Hadoop World NYC 2011. Moving data between Hadoop and RDBMS with SQOOP.

Transcript of Hadoop and RDBMS with Sqoop

Page 1: Hadoop and RDBMS with Sqoop

© 2010 Quest Software, Inc. ALL RIGHTS RESERVED

Exchanging data with the Elephant: Connecting Hadoop and an RDBMS using SQOOP

Guy Harrison, Director, R&D Melbourne
www.guyharrison.net | [email protected] | @guyharrison

Page 2: Hadoop and RDBMS with Sqoop

Introductions

Page 3: Hadoop and RDBMS with Sqoop

Page 4: Hadoop and RDBMS with Sqoop

Agenda

• RDBMS-Hadoop interoperability scenarios
• Interoperability options
• Cloudera SQOOP
• Extending SQOOP
• Quest OraOop extension for Cloudera SQOOP
• Performance comparisons
• Lessons learned and best practices

Page 5: Hadoop and RDBMS with Sqoop

Scenario #1: Reference data in RDBMS

(Diagram: reference tables (Customers, Products) live in the RDBMS, while WebLogs live in HDFS.)

Page 6: Hadoop and RDBMS with Sqoop

Scenario #2: Hadoop for off-line analytics

(Diagram: Customers and Products remain in the RDBMS; Sales History is moved to HDFS for analysis.)

Page 7: Hadoop and RDBMS with Sqoop

Scenario #3: Hadoop for RDBMS archive

(Diagram: Sales 2009 and Sales 2010 remain in the RDBMS; Sales 2008 is archived out of the RDBMS into HDFS.)

Page 8: Hadoop and RDBMS with Sqoop

Scenario #4: MapReduce results to RDBMS

(Diagram: WebLogs in HDFS are summarized by MapReduce into a WebLog Summary table in the RDBMS.)

Page 9: Hadoop and RDBMS with Sqoop

Options for RDBMS inter-op

• DBInputFormat:
  – Allows database records to be used as mapper inputs
  – BUT:
    • Not inherently scalable or efficient
    • For repeated analysis, better to stage in Hadoop
    • Tedious coding of DBWritable classes for each table
• SQOOP:
  – Open source utility provided by Cloudera
  – Configurable command line interface to copy RDBMS -> HDFS
  – Support for Hive, HBase
  – Generates Java classes for future MapReduce tasks
  – Extensible to provide optimized adaptors for specific targets
  – Bi-directional

Page 10: Hadoop and RDBMS with Sqoop

SQOOP Details

• SQOOP import:
  – Divide table into ranges using primary key max/min
  – Create mappers for each range
  – Mappers write to multiple HDFS nodes
  – Creates text or sequence files
  – Generates Java class for resulting HDFS file
  – Generates Hive definition and auto-loads into Hive
• SQOOP export:
  – Read files in HDFS directory via MapReduce
  – Bulk parallel insert into database table
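The first import step (divide the table into ranges using the primary key's max/min) can be sketched as follows. This is an illustrative reconstruction of the range-splitting idea, not Sqoop's actual Java code; the function name and values are invented.

```python
# Illustrative sketch (not Sqoop source) of how an import divides a table
# into per-mapper ranges using the primary key's min and max values.
def key_range_splits(min_key, max_key, num_mappers):
    """Return (lo, hi) bounds per mapper; each mapper imports rows with
    lo <= key < hi, so consecutive ranges share a boundary."""
    span = (max_key - min_key + 1) / num_mappers
    bounds = [min_key + round(i * span) for i in range(num_mappers + 1)]
    return list(zip(bounds, bounds[1:]))

# Four contiguous, roughly equal key ranges covering keys 1..1,000,000
splits = key_range_splits(1, 1_000_000, 4)
```

Each range then becomes one mapper's WHERE clause, and the mappers run in parallel against the source table.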

Page 11: Hadoop and RDBMS with Sqoop

SQOOP details

• SQOOP features:
  – Compatible with almost any JDBC-enabled database
  – Auto-load into Hive
  – HBase support
  – Special handling for database LOBs
  – Job management
  – Cluster configuration (jar file distribution)
  – WHERE clause support
  – Open source, and included in Cloudera distributions
• SQOOP fast paths & plug-ins:
  – Invoke mysqldump, mysqlimport for MySQL jobs
  – Similar fast paths for PostgreSQL
  – Extensibility architecture for 3rd parties (like Quest)
    • Teradata, Netezza, etc.

Page 12: Hadoop and RDBMS with Sqoop

Working with Oracle

• SQOOP approach is generic and applicable to all RDBMS
• However, for Oracle it is sub-optimal in some respects:
  – Oracle may parallelize and serialize individual mappers
  – Oracle optimizer may decline to use index range scans
  – Oracle physical storage often deliberately not in primary key order (reverse key indexes, hash partitioning, etc.)
  – Primary keys often not evenly distributed
  – Index range scans use single-block random reads
    • vs. faster multi-block table scans
  – Index range scans load into Oracle buffer cache
    • Pollutes cache, increasing IO for other users
    • Limited help to SQOOP since rows are only read once
• Luckily, SQOOP extensibility allows us to add optimizations for specific targets
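The uneven-key problem above can be made concrete with a small sketch. The key distribution and numbers are invented for illustration: when ids cluster (for example after purging historical ranges), equal key ranges yield wildly unequal per-mapper row counts.

```python
# Hypothetical key distribution: ids cluster at the low and high ends,
# leaving the middle of the key space empty.
keys = list(range(1, 1_001)) + list(range(900_001, 1_000_001))

lo, hi = min(keys), max(keys)
width = (hi - lo + 1) // 4  # four equal-width key ranges, one per mapper

# Rows each mapper would actually read under min/max range splitting
counts = [sum(lo + i * width <= k < lo + (i + 1) * width for k in keys)
          for i in range(4)]
# Two mappers get no rows at all; one mapper reads ~99% of the table
```

With such skew, most mappers finish almost immediately while one does nearly all the work, defeating the point of parallel import.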

Page 13: Hadoop and RDBMS with Sqoop

Oracle – parallelism

(Diagram: a JDBC client submits a query to the Oracle master (query coordinator), which fans the work out to Oracle PQ slaves for scan/sort and a second set of slaves for aggregation against the SALES table.)

SELECT cust_id, SUM (amount_sold)
FROM sh.sales
GROUP BY cust_id
ORDER BY 2 DESC

Page 14: Hadoop and RDBMS with Sqoop

Oracle – parallelism

(Diagram: four Hadoop mappers read the Oracle SALES table in parallel and write to HDFS.)

Page 15: Hadoop and RDBMS with Sqoop

Oracle – parallelism

(Diagram: repeat of the previous slide's four-mapper layout.)

Page 16: Hadoop and RDBMS with Sqoop

Index range scans

(Diagram: two Hadoop mappers, each with its own Oracle session, perform index range scans (ID > 0 AND ID < MAX/2; ID > MAX/2) through index blocks and the Oracle buffer cache to reach the table.)

Page 17: Hadoop and RDBMS with Sqoop

Ideal architecture

(Diagram: four Hadoop mappers, each with its own Oracle session, read disjoint portions of the SALES table directly and write to HDFS.)

Page 18: Hadoop and RDBMS with Sqoop

Quest/Cloudera OraOop for SQOOP

• Design goals:
  – Partition data based on physical storage
  – Bypass Oracle buffering
  – Bypass Oracle parallelism
  – Do not require or use indexes
  – Never read the same data block more than once
  – Support esoteric datatypes (eventually)
  – Support RAC clusters
• Availability:
  – Freely available from www.quest.com/ora-oop
  – Packaged with Cloudera Enterprise
  – Commercial support from Quest/Cloudera within Enterprise distribution
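The first design goal, partitioning on physical storage rather than key values, can be sketched conceptually. The extent list and greedy grouping below are invented for illustration and are not OraOop's actual code; the point is that splitting on block ranges gives each mapper a similar data volume regardless of key distribution.

```python
# Conceptual sketch: split on physical block ranges (extents) rather than
# key values. Extent list is an invented (start_block, block_count) layout.
extents = [(0, 8), (8, 8), (16, 128), (144, 128), (272, 1024), (1296, 1024)]

def split_by_blocks(extents, num_mappers):
    """Greedily group extents into num_mappers splits of similar block count."""
    total_blocks = sum(count for _, count in extents)
    target = total_blocks / num_mappers
    splits, current, current_blocks = [], [], 0
    for extent in extents:
        current.append(extent)
        current_blocks += extent[1]
        if current_blocks >= target and len(splits) < num_mappers - 1:
            splits.append(current)
            current, current_blocks = [], 0
    if current:
        splits.append(current)
    return splits

splits = split_by_blocks(extents, 2)
```

Because every block is assigned to exactly one split, each data block is read once, and no index is needed to drive the scan.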

Page 19: Hadoop and RDBMS with Sqoop

OraOop Throughput

(Chart: 50M row, 50 GB Oracle table to 16-node Hadoop cluster. Elapsed time (s), 0-7000, plotted against number of Hadoop mappers, 0-35, for SQOOP vs. SQOOP with OraOop.)

Page 20: Hadoop and RDBMS with Sqoop

Oracle overhead

16 mappers, 50M rows, 50 GB clustered data (pct reduction with OraOop):

  Elapsed time:         80.84
  CPU time:             89.72
  Network round trips:  98.95
  IO requests:          99.08
  IO time:              98.71

Page 21: Hadoop and RDBMS with Sqoop

Extending SQOOP

• SQOOP lets you concentrate on the RDBMS logic, not the Hadoop plumbing:
  – Extend ManagerFactory (what to handle)
  – Extend ConnManager (DB connection and metadata)
  – For imports:
    • Extend DataDrivenDBInputFormat (gets the data)
      – Data allocation (getSplits())
      – Split serialization ("io.serializations" property)
      – Data access logic (createDBRecordReader(), getSelectQuery())
      – Implement progress (nextKeyValue(), getProgress())
  – Similar procedure for extending exports
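The ManagerFactory step above (deciding "what to handle") follows an accept-then-dispatch pattern that can be illustrated with a toy sketch. The real interfaces are the Java classes named above; this Python mock and its class names are invented for illustration only.

```python
# Toy sketch of the factory dispatch pattern behind Sqoop's extension
# point: each factory inspects the connect string and claims the ones it
# can handle. (Invented Python mock; the real API is Java.)
class OracleManagerFactory:
    def accept(self, connect_string):
        return connect_string.startswith("jdbc:oracle:")

class DefaultManagerFactory:
    def accept(self, connect_string):
        return connect_string.startswith("jdbc:")

# Specialized factories are consulted before the generic fallback
factories = [OracleManagerFactory(), DefaultManagerFactory()]

def manager_for(connect_string):
    for factory in factories:
        if factory.accept(connect_string):
            return type(factory).__name__
    raise ValueError("no factory accepts " + connect_string)
```

A plug-in like OraOop thus only claims Oracle connect strings, leaving every other database to the generic code path.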

Page 22: Hadoop and RDBMS with Sqoop

SQOOP/OraOop best practices

• Use sequence files for LOBs, OR:
  – Set inline-lob-limit
• Directly control datanodes for widest destination bandwidth
  – Can't rely on mapred.max.maps.per.node
• Set number of mappers realistically
• Disable speculative execution (our default)
  – Leads to duplicate DB reads
• Set Oracle row fetch size extra high
  – Keeps the mappers streaming to HDFS
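The fetch-size advice can be demonstrated with the stdlib sqlite3 module as a stand-in database (the table and row counts are invented): each fetchmany call models a client round trip, so a larger fetch size means far fewer round trips between the database and the mapper.

```python
import sqlite3

# In-memory stand-in for the source database (invented example table)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany("INSERT INTO sales (amount) VALUES (?)",
                 [(float(i),) for i in range(10_000)])

def fetch_round_trips(fetch_size):
    """Count fetchmany calls needed to drain the result set; against a
    real client/server database each call would be a network round trip."""
    cursor = conn.execute("SELECT id, amount FROM sales")
    trips = 0
    while True:
        rows = cursor.fetchmany(fetch_size)
        trips += 1
        if not rows:
            break
    return trips

small, large = fetch_round_trips(100), fetch_round_trips(5_000)
# A 50x larger fetch size cuts fetch calls from 101 to 3
```

Fewer fetch stalls keep the mapper's write pipeline to HDFS continuously fed, which is the point of raising the Oracle row fetch size.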

Page 23: Hadoop and RDBMS with Sqoop

Conclusion

• RDBMS-Hadoop interoperability is key to Enterprise Hadoop adoption

• SQOOP provides a good general purpose tool for transferring data between any JDBC database and Hadoop

• SQOOP extensions can provide optimizations for specific targets

• Each RDBMS offers distinct tuning opportunities, so SQOOP extensions offer real value

• Try out OraOop for SQOOP!

Page 24: Hadoop and RDBMS with Sqoop

Thank You