Hadoop and RDBMS with Sqoop

© 2010 Quest Software, Inc. ALL RIGHTS RESERVED

Exchanging data with the Elephant: Connecting Hadoop and an RDBMS using SQOOP
Guy Harrison, Director, R&D Melbourne
www.guyharrison.net | [email protected] | @guyharrison

Description

Presentation given at Hadoop World NYC 2011. Moving data between Hadoop and RDBMS with SQOOP.

Transcript of Hadoop and RDBMS with Sqoop

Page 1: Hadoop and RDBMS with Sqoop

© 2010 Quest Software, Inc. ALL RIGHTS RESERVED

Exchanging data with the Elephant: Connecting Hadoop and an RDBMS using SQOOP

Guy Harrison, Director, R&D Melbourne
www.guyharrison.net | [email protected] | @guyharrison

Page 2: Hadoop and RDBMS with Sqoop

Introductions

Page 3: Hadoop and RDBMS with Sqoop

Page 4: Hadoop and RDBMS with Sqoop

Agenda

• RDBMS-Hadoop interoperability scenarios
• Interoperability options
• Cloudera SQOOP
• Extending SQOOP
• Quest OraOop extension for Cloudera SQOOP
• Performance comparisons
• Lessons learned and best practices

Page 5: Hadoop and RDBMS with Sqoop

Scenario #1: Reference data in RDBMS

(Diagram: reference tables (Customers, Products) live in the RDBMS, while WebLogs live in HDFS.)

Page 6: Hadoop and RDBMS with Sqoop

Scenario #2: Hadoop for off-line analytics

(Diagram: Customers and Products remain in the RDBMS; Sales History is moved to HDFS for analysis.)

Page 7: Hadoop and RDBMS with Sqoop

Scenario #3: Hadoop for RDBMS archive

(Diagram: Sales 2009 and Sales 2010 remain in the RDBMS; Sales 2008 is archived out of the RDBMS into HDFS.)

Page 8: Hadoop and RDBMS with Sqoop

Scenario #4: MapReduce results to RDBMS

(Diagram: WebLogs in HDFS are summarized by MapReduce into a WebLog Summary table in the RDBMS.)

Page 9: Hadoop and RDBMS with Sqoop

Options for RDBMS inter-op

• DBInputFormat:
  – Allows database records to be used as mapper inputs
  – BUT:
    • Not inherently scalable or efficient
    • For repeated analysis, better to stage in Hadoop
    • Tedious coding of DBWritable classes for each table
• SQOOP:
  – Open source utility provided by Cloudera
  – Configurable command line interface to copy RDBMS -> HDFS
  – Support for Hive, HBase
  – Generates Java classes for future MapReduce tasks
  – Extensible to provide optimized adaptors for specific targets
  – Bi-directional

Page 10: Hadoop and RDBMS with Sqoop

SQOOP Details

• SQOOP import:
  – Divide table into ranges using primary key max/min
  – Create mappers for each range
  – Mappers write to multiple HDFS nodes
  – Creates text or sequence files
  – Generates Java class for resulting HDFS file
  – Generates Hive definition and auto-loads into Hive
• SQOOP export:
  – Read files in HDFS directory via MapReduce
  – Bulk parallel insert into database table
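The first import step (divide the table into ranges using the primary key's max/min) can be sketched as follows. This is an illustrative reconstruction of the range-splitting idea, not Sqoop's actual Java code; the function name and values are invented.

```python
# Illustrative sketch (not Sqoop source) of how an import divides a table
# into per-mapper ranges using the primary key's min and max values.
def key_range_splits(min_key, max_key, num_mappers):
    """Return (lo, hi) bounds per mapper; each mapper imports rows with
    lo <= key < hi, so consecutive ranges share a boundary."""
    span = (max_key - min_key + 1) / num_mappers
    bounds = [min_key + round(i * span) for i in range(num_mappers + 1)]
    return list(zip(bounds, bounds[1:]))

# Four contiguous, roughly equal key ranges covering keys 1..1,000,000
splits = key_range_splits(1, 1_000_000, 4)
```

Each range then becomes one mapper's WHERE clause, and the mappers run in parallel against the source table.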

Page 11: Hadoop and RDBMS with Sqoop

SQOOP details

• SQOOP features:
  – Compatible with almost any JDBC-enabled database
  – Auto-load into Hive
  – HBase support
  – Special handling for database LOBs
  – Job management
  – Cluster configuration (jar file distribution)
  – WHERE clause support
  – Open source, and included in Cloudera distributions
• SQOOP fast paths & plug-ins:
  – Invoke mysqldump, mysqlimport for MySQL jobs
  – Similar fast paths for PostgreSQL
  – Extensibility architecture for 3rd parties (like Quest)
    • Teradata, Netezza, etc.

Page 12: Hadoop and RDBMS with Sqoop

Working with Oracle

• SQOOP approach is generic and applicable to all RDBMS
• However, for Oracle it is sub-optimal in some respects:
  – Oracle may parallelize and serialize individual mappers
  – Oracle optimizer may decline to use index range scans
  – Oracle physical storage often deliberately not in primary key order (reverse key indexes, hash partitioning, etc.)
  – Primary keys often not evenly distributed
  – Index range scans use single-block random reads
    • vs. faster multi-block table scans
  – Index range scans load into Oracle buffer cache
    • Pollutes cache, increasing IO for other users
    • Limited help to SQOOP since rows are only read once
• Luckily, SQOOP extensibility allows us to add optimizations for specific targets
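The uneven-key problem above can be made concrete with a small sketch. The key distribution and numbers are invented for illustration: when ids cluster (for example after purging historical ranges), equal key ranges yield wildly unequal per-mapper row counts.

```python
# Hypothetical key distribution: ids cluster at the low and high ends,
# leaving the middle of the key space empty.
keys = list(range(1, 1_001)) + list(range(900_001, 1_000_001))

lo, hi = min(keys), max(keys)
width = (hi - lo + 1) // 4  # four equal-width key ranges, one per mapper

# Rows each mapper would actually read under min/max range splitting
counts = [sum(lo + i * width <= k < lo + (i + 1) * width for k in keys)
          for i in range(4)]
# Two mappers get no rows at all; one mapper reads ~99% of the table
```

With such skew, most mappers finish almost immediately while one does nearly all the work, defeating the point of parallel import.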

Page 13: Hadoop and RDBMS with Sqoop

Oracle – parallelism

(Diagram: a JDBC client submits a query to the Oracle master (query coordinator), which fans the work out to Oracle PQ slaves for scan/sort and a second set of slaves for aggregation against the SALES table.)

SELECT cust_id, SUM (amount_sold)
FROM sh.sales
GROUP BY cust_id
ORDER BY 2 DESC

Page 14: Hadoop and RDBMS with Sqoop

Oracle – parallelism

(Diagram: four Hadoop mappers read the Oracle SALES table in parallel and write to HDFS.)

Page 15: Hadoop and RDBMS with Sqoop

Oracle – parallelism

(Diagram: repeat of the previous slide's four-mapper layout.)

Page 16: Hadoop and RDBMS with Sqoop

Index range scans

(Diagram: two Hadoop mappers, each with its own Oracle session, perform index range scans (ID > 0 AND ID < MAX/2; ID > MAX/2) through index blocks and the Oracle buffer cache to reach the table.)

Page 17: Hadoop and RDBMS with Sqoop

Ideal architecture

(Diagram: four Hadoop mappers, each with its own Oracle session, read disjoint portions of the SALES table directly and write to HDFS.)

Page 18: Hadoop and RDBMS with Sqoop

Quest/Cloudera OraOop for SQOOP

• Design goals:
  – Partition data based on physical storage
  – Bypass Oracle buffering
  – Bypass Oracle parallelism
  – Do not require or use indexes
  – Never read the same data block more than once
  – Support esoteric datatypes (eventually)
  – Support RAC clusters
• Availability:
  – Freely available from www.quest.com/ora-oop
  – Packaged with Cloudera Enterprise
  – Commercial support from Quest/Cloudera within Enterprise distribution
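The first design goal, partitioning on physical storage rather than key values, can be sketched conceptually. The extent list and greedy grouping below are invented for illustration and are not OraOop's actual code; the point is that splitting on block ranges gives each mapper a similar data volume regardless of key distribution.

```python
# Conceptual sketch: split on physical block ranges (extents) rather than
# key values. Extent list is an invented (start_block, block_count) layout.
extents = [(0, 8), (8, 8), (16, 128), (144, 128), (272, 1024), (1296, 1024)]

def split_by_blocks(extents, num_mappers):
    """Greedily group extents into num_mappers splits of similar block count."""
    total_blocks = sum(count for _, count in extents)
    target = total_blocks / num_mappers
    splits, current, current_blocks = [], [], 0
    for extent in extents:
        current.append(extent)
        current_blocks += extent[1]
        if current_blocks >= target and len(splits) < num_mappers - 1:
            splits.append(current)
            current, current_blocks = [], 0
    if current:
        splits.append(current)
    return splits

splits = split_by_blocks(extents, 2)
```

Because every block is assigned to exactly one split, each data block is read once, and no index is needed to drive the scan.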

Page 19: Hadoop and RDBMS with Sqoop

OraOop Throughput

(Chart: 50M row, 50 GB Oracle table to 16-node Hadoop cluster. Elapsed time (s), 0-7000, plotted against number of Hadoop mappers, 0-35, for SQOOP vs. SQOOP with OraOop.)

Page 20: Hadoop and RDBMS with Sqoop

Oracle overhead

16 mappers, 50M rows, 50 GB clustered data (pct reduction with OraOop):

  Elapsed time:         80.84
  CPU time:             89.72
  Network round trips:  98.95
  IO requests:          99.08
  IO time:              98.71

Page 21: Hadoop and RDBMS with Sqoop

Extending SQOOP

• SQOOP lets you concentrate on the RDBMS logic, not the Hadoop plumbing:
  – Extend ManagerFactory (what to handle)
  – Extend ConnManager (DB connection and metadata)
  – For imports:
    • Extend DataDrivenDBInputFormat (gets the data)
      – Data allocation (getSplits())
      – Split serialization ("io.serializations" property)
      – Data access logic (createDBRecordReader(), getSelectQuery())
      – Implement progress (nextKeyValue(), getProgress())
  – Similar procedure for extending exports
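The ManagerFactory step above (deciding "what to handle") follows an accept-then-dispatch pattern that can be illustrated with a toy sketch. The real interfaces are the Java classes named above; this Python mock and its class names are invented for illustration only.

```python
# Toy sketch of the factory dispatch pattern behind Sqoop's extension
# point: each factory inspects the connect string and claims the ones it
# can handle. (Invented Python mock; the real API is Java.)
class OracleManagerFactory:
    def accept(self, connect_string):
        return connect_string.startswith("jdbc:oracle:")

class DefaultManagerFactory:
    def accept(self, connect_string):
        return connect_string.startswith("jdbc:")

# Specialized factories are consulted before the generic fallback
factories = [OracleManagerFactory(), DefaultManagerFactory()]

def manager_for(connect_string):
    for factory in factories:
        if factory.accept(connect_string):
            return type(factory).__name__
    raise ValueError("no factory accepts " + connect_string)
```

A plug-in like OraOop thus only claims Oracle connect strings, leaving every other database to the generic code path.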

Page 22: Hadoop and RDBMS with Sqoop

SQOOP/OraOop best practices

• Use sequence files for LOBs, OR:
  – Set inline-lob-limit
• Directly control datanodes for widest destination bandwidth
  – Can't rely on mapred.max.maps.per.node
• Set number of mappers realistically
• Disable speculative execution (our default)
  – Leads to duplicate DB reads
• Set Oracle row fetch size extra high
  – Keeps the mappers streaming to HDFS
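The fetch-size advice can be demonstrated with the stdlib sqlite3 module as a stand-in database (the table and row counts are invented): each fetchmany call models a client round trip, so a larger fetch size means far fewer round trips between the database and the mapper.

```python
import sqlite3

# In-memory stand-in for the source database (invented example table)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany("INSERT INTO sales (amount) VALUES (?)",
                 [(float(i),) for i in range(10_000)])

def fetch_round_trips(fetch_size):
    """Count fetchmany calls needed to drain the result set; against a
    real client/server database each call would be a network round trip."""
    cursor = conn.execute("SELECT id, amount FROM sales")
    trips = 0
    while True:
        rows = cursor.fetchmany(fetch_size)
        trips += 1
        if not rows:
            break
    return trips

small, large = fetch_round_trips(100), fetch_round_trips(5_000)
# A 50x larger fetch size cuts fetch calls from 101 to 3
```

Fewer fetch stalls keep the mapper's write pipeline to HDFS continuously fed, which is the point of raising the Oracle row fetch size.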

Page 23: Hadoop and RDBMS with Sqoop

Conclusion

• RDBMS-Hadoop interoperability is key to Enterprise Hadoop adoption

• SQOOP provides a good general purpose tool for transferring data between any JDBC database and Hadoop

• SQOOP extensions can provide optimizations for specific targets

• Each RDBMS offers distinct tuning opportunities, so SQOOP extensions offer real value

• Try out OraOop for SQOOP!

Page 24: Hadoop and RDBMS with Sqoop

Thank You