How to load data from ORACLE to HDFS by using SQOOP
Bhaskara Reddy Sannapureddy, Senior Project Manager @ Infosys, +91-7702577769
SQOOP
• The tool you use for this comes as part of the Cloudera CDH4 Hadoop distribution that’s on BigDataLite, and it’s called “Sqoop”.
• “Sqoop”, short for “SQL to Hadoop”, gives you the ability to do the following Oracle data transfer tasks amongst other ones:
• Import whole tables, or whole schemas, from Oracle and other relational databases into Hadoop’s file system, HDFS
• Export data from HDFS back out to these databases – with the export and import being performed through MapReduce jobs
• Import using an arbitrary SQL SELECT statement, rather than grabbing whole tables
• Perform incremental loads, specifying a check column so that rows already imported are excluded
• Load directly into Hive tables, creating HDFS files in the background and the Hive metadata automatically
Sqoop is a command-line application for transferring data between relational databases and Hadoop. It helps in efficiently transferring bulk data between Hadoop and Oracle Database.
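As a sketch of the incremental-load capability listed above, the snippet below assembles (but does not run) an append-mode import command. The table SYSTEM.EMP, check column EMP_ID, and last value 100 are hypothetical examples, not part of the demo that follows, and the IP placeholder must be filled in for a real cluster.

```shell
# Hypothetical incremental import: only rows with EMP_ID > 100 would be fetched.
# Built as a string for illustration; a live Oracle database and Hadoop cluster
# are assumed before this could actually be executed.
SQOOP_INCR="sqoop import \
  --connect jdbc:oracle:thin:system/system@<IP address>:1521:xe \
  --username system -P \
  --table SYSTEM.EMP \
  --incremental append --check-column EMP_ID --last-value 100 \
  --target-dir /user/cloudera/emp_incr -m 1"
echo "$SQOOP_INCR"
```

On subsequent runs you would raise --last-value to the highest EMP_ID already imported, so each run picks up only the new rows.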
DOCUMENTATION
• Documentation for Sqoop as shipped with CDH4 can be found on the Cloudera website, and there are even optimisations and plugins for databases such as Oracle to enable faster, direct loads – for example OraOOP.
PRE-REQUISITES
1. Oracle Database 10g Express Edition should be installed.
2. The Oracle JDBC connector (ojdbc6_g.jar); the jar file can be downloaded from http://www.oracle.com/technetwork/database/enterprise-edition/jdbc-112010-090769.html
3. Normally you must download and install the JDBC drivers before Sqoop can use them, but BigDataLite comes with the required Oracle JDBC drivers.
DEMO 1: STEP 1
First, create the source table in Oracle. For example:

create table ACTIVITY (
  Activity_Period varchar(50),
  Operating_Airline varchar(50),
  Operating_Airline_IATA_Code varchar(50),
  Published_Airline varchar(50),
  Published_Airline_IATA_Code varchar(50),
  GEO_Summary varchar(50),
  GEO_Region varchar(50),
  Activity_Type_Code varchar(50),
  Cargo_Type_Code varchar(50),
  Cargo_Aircraft_Type varchar(50),
  Cargo_Weight_LBS varchar(50),
  Cargo_Metric_TONS varchar(50)
);
DEMO 1: STEP 2
• Import the data of the ACTIVITY table from the Oracle database into HDFS. Note that the Oracle connector jar must be present in Sqoop's lib directory, and the command should be executed from the Sqoop directory.
Syntax :
/usr/bin/sqoop import --connect jdbc:oracle:thin:system/system@<IP address>:1521:xe --username <username> -P --table <schema name>.<table name> --columns "<column names>" --target-dir <target directory path> -m 1
Example :
[cloudera@localhost sqoop]$ /usr/bin/sqoop import --connect jdbc:oracle:thin:system/system@<IP address>:1521:xe --username system -P --table system.ACTIVITY --columns "Activity_Period,Operating_Airline,Operating_Airline_IATA_Code,Published_Airline,Published_Airline_IATA_Code,GEO_Summary,GEO_Region,Activity_Type_Code,Cargo_Type_Code,Cargo_Aircraft_Type,Cargo_Weight_LBS,Cargo_Metric_TONS" --target-dir /user/cloudera/sqoop_out -m 1
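If the target is a Hive table rather than plain HDFS files (the last capability listed earlier), the same import takes a --hive-import flag. This is a sketch only: the command is assembled as a string, the Hive table name "activity" is an assumption, and the connection details follow the placeholder syntax above.

```shell
# Sketch: the same ACTIVITY import, but loading straight into a Hive table.
# Sqoop creates the HDFS files and the Hive metadata automatically.
# Built as a string for illustration; a live cluster is assumed to run it.
SQOOP_HIVE="sqoop import \
  --connect jdbc:oracle:thin:system/system@<IP address>:1521:xe \
  --username system -P \
  --table system.ACTIVITY \
  --hive-import --hive-table activity -m 1"
echo "$SQOOP_HIVE"
```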
DEMO1: SQOOP IMPORT PROCESS
14/03/21 18:22:39 INFO mapred.JobClient: Map input records=11
14/03/21 18:22:39 INFO mapred.JobClient: Map output records=11
14/03/21 18:22:39 INFO mapred.JobClient: Input split bytes=464
14/03/21 18:22:39 INFO mapred.JobClient: Spilled Records=0
14/03/21 18:22:39 INFO mapred.JobClient: CPU time spent (ms)=3430
14/03/21 18:22:39 INFO mapred.JobClient: Physical memory (bytes) snapshot=506802176
14/03/21 18:22:39 INFO mapred.JobClient: Virtual memory (bytes) snapshot=2714157056
14/03/21 18:22:39 INFO mapred.JobClient: Total committed heap usage (bytes)=506724352
14/03/21 18:22:39 INFO mapreduce.ImportJobBase: Transferred 103 bytes in 56.4649 seconds (1.8241 bytes/sec)
14/03/21 18:22:39 INFO mapreduce.ImportJobBase: Retrieved 11 records.
Above you can see Sqoop processing the command in its console output, running MapReduce jobs to bring in the data via the Oracle JDBC driver.
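The transfer rate reported in the log is simply bytes transferred divided by elapsed seconds, which you can verify from the two figures in the output:

```shell
# Verify the logged rate: 103 bytes over 56.4649 seconds, to 4 decimal places,
# matches the 1.8241 bytes/sec that Sqoop reported.
awk 'BEGIN { printf "%.4f\n", 103 / 56.4649 }'
```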
DEMO 1: RESULT IN HDFS
By default, Sqoop will put the resulting file in your user's home directory in HDFS. Let's take a look and see what's there:
[oracle@bigdatalite ~]$ hadoop fs -ls /user/oracle/ACTIVITY
Found 6 items
-rw-r--r-- 1 oracle supergroup 0 2014-03-21 18:22 /user/oracle/ACTIVITY/_SUCCESS
drwxr-xr-x - oracle supergroup 0 2014-03-21 18:21 /user/oracle/ACTIVITY/_logs
-rw-r--r-- 1 oracle supergroup 27 2014-03-21 18:22 /user/oracle/ACTIVITY/part-m-00000
-rw-r--r-- 1 oracle supergroup 17 2014-03-21 18:22 /user/oracle/ACTIVITY/part-m-00001
-rw-r--r-- 1 oracle supergroup 24 2014-03-21 18:22 /user/oracle/ACTIVITY/part-m-00002
-rw-r--r-- 1 oracle supergroup 35 2014-03-21 18:22 /user/oracle/ACTIVITY/part-m-00003
[oracle@bigdatalite ~]$ hadoop fs -cat /user/oracle/ACTIVITY/part-m-00000
1,Rate
2,Completed
3,Pause
[oracle@bigdatalite ~]$ hadoop fs -cat /user/oracle/ACTIVITY/part-m-00001
4,Start
5,Browse
What you can see there is that Sqoop has imported the data as a series of "part-m" files: CSV files, one per map task (the "m" stands for map; this import ran with no reduce phase). There are various options in the docs for specifying compression and other performance features for Sqoop imports, but the basic format is a series of CSV files, one per mapper.
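The per-mapper layout can be simulated locally. The snippet below fakes two part-m files using the record values shown in the listing above, then merges them the way hadoop fs -getmerge would on a real cluster (the local /tmp path is for illustration only).

```shell
# Simulate two mapper output files locally and merge them into one CSV.
# On a real cluster the equivalent would be:
#   hadoop fs -getmerge /user/oracle/ACTIVITY activity.csv
mkdir -p /tmp/activity_demo
printf '1,Rate\n2,Completed\n3,Pause\n' > /tmp/activity_demo/part-m-00000
printf '4,Start\n5,Browse\n' > /tmp/activity_demo/part-m-00001
cat /tmp/activity_demo/part-m-* > /tmp/activity_demo/activity.csv
wc -l < /tmp/activity_demo/activity.csv
```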
IMPORT ISSUES & SOLUTION
If you hit one of the errors below:
ERROR tool.ImportTool: Encountered IOException running import job: java.io.FileNotFoundException: File does not exist: hdfs://hostname:8020/usr/lib/sqoop/lib/ant-eclipse-1.0-jvm1.2.jar
Or
ERROR tool.ImportTool: Encountered IOException running import job: java.io.FileNotFoundException: File does not exist: hdfs://localhost:54310/usr/lib/sqoop/lib/jackson-mapper-asl-1.9.13.jar
Solution:
Your mapred-site.xml file is configured incorrectly. Check the file's content; it should look like this:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
A common mistake is to give mapred-site.xml the same content as core-site.xml. The core-site.xml file should look like this:
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://master:54310</value>
</property>
</configuration>
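A quick way to catch this misconfiguration is to grep the file for the yarn setting. The snippet below writes a correct mapred-site.xml to /tmp purely for illustration; on a real node you would point the grep at your own Hadoop configuration directory (the location varies by installation).

```shell
# Write a correctly configured mapred-site.xml (illustration only) and verify
# that the MapReduce framework is set to yarn rather than an HDFS URI.
cat > /tmp/mapred-site.xml <<'EOF'
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
EOF
if grep -q '<value>yarn</value>' /tmp/mapred-site.xml; then
  echo "mapred-site.xml OK"
else
  echo "mapred-site.xml misconfigured"
fi
```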
REFERENCES
• Sqoop User Guide:
http://sqoop.apache.org/docs/1.4.5/SqoopUserGuide.html
• Transferring Bulk Data between Oracle Database and Hadoop
Ecosystem with Sqoop
http://www.toadworld.com/platforms/oracle/w/wiki/10891.transferring-bulk-data-between-oracle-database-and-hadoop-ecosystem-with-sqoop.aspx
THANK YOU