How to load data from ORACLE to HDFS by using SQOOP
Bhaskara Reddy Sannapureddy, Senior Project Manager @ Infosys, +91-7702577769
SQOOP
• The tool you use for this comes as part of the Cloudera CDH4 Hadoop distribution that’s on BigDataLite, and it’s called “Sqoop”.
• “Sqoop”, short for “SQL to Hadoop”, gives you the ability to do the following Oracle data transfer tasks amongst other ones:
• Import whole tables, or whole schemas, from Oracle and other relational databases into Hadoop’s file system, HDFS
• Export data from HDFS back out to these databases – with the export and import being performed through MapReduce jobs
• Import using an arbitrary SQL SELECT statement, rather than grabbing whole tables
• Perform incremental loads, specifying a check column so that rows already imported are excluded
• Load directly into Hive tables, creating HDFS files in the background and the Hive metadata automatically
Sqoop is a command-line application for transferring data between relational databases and Hadoop. It helps in efficiently transferring bulk data between Hadoop and Oracle Database.
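As a sketch of the incremental-load capability listed above, the snippet below assembles (but does not run) an append-mode import command. The table SYSTEM.EMP, check column EMP_ID, and last value 100 are hypothetical examples, not part of the demo that follows, and the IP placeholder must be filled in for a real cluster.

```shell
# Hypothetical incremental import: only rows with EMP_ID > 100 would be fetched.
# Built as a string for illustration; a live Oracle database and Hadoop cluster
# are assumed before this could actually be executed.
SQOOP_INCR="sqoop import \
  --connect jdbc:oracle:thin:system/system@<IP address>:1521:xe \
  --username system -P \
  --table SYSTEM.EMP \
  --incremental append --check-column EMP_ID --last-value 100 \
  --target-dir /user/cloudera/emp_incr -m 1"
echo "$SQOOP_INCR"
```

On subsequent runs you would raise --last-value to the highest EMP_ID already imported, so each run picks up only the new rows.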
DOCUMENTATION
• Documentation for Sqoop as shipped with CDH4 can be found on the Cloudera website, and there are even optimisations and plugins for databases such as Oracle to enable faster, direct loads – for example OraOOP.
PRE-REQUISITES
1. Oracle Database 10g Express Edition should be installed.
2. The Oracle JDBC connector (ojdbc6_g.jar); the jar file can be downloaded from http://www.oracle.com/technetwork/database/enterprise-edition/jdbc-112010-090769.html
3. Normally you must download and install the JDBC drivers before Sqoop can use them, but BigDataLite comes with the required Oracle JDBC drivers.
DEMO 1: STEP 1
First, create the source table in Oracle. For example:

create table ACTIVITY (
  Activity_Period varchar(50),
  Operating_Airline varchar(50),
  Operating_Airline_IATA_Code varchar(50),
  Published_Airline varchar(50),
  Published_Airline_IATA_Code varchar(50),
  GEO_Summary varchar(50),
  GEO_Region varchar(50),
  Activity_Type_Code varchar(50),
  Cargo_Type_Code varchar(50),
  Cargo_Aircraft_Type varchar(50),
  Cargo_Weight_LBS varchar(50),
  Cargo_Metric_TONS varchar(50)
);
DEMO 1: STEP 2
• Import the data of the ACTIVITY table from the Oracle database into HDFS. Note that the Oracle connector jar must be present in Sqoop's lib directory, and the command should be executed from the Sqoop directory.
Syntax :
/usr/bin/sqoop import --connect jdbc:oracle:thin:system/system@<IP address>:1521:xe --username <username> -P --table <schema name>.<table name> --columns "<column names>" --target-dir <target directory path> -m 1
Example :
[cloudera@localhost sqoop]$ /usr/bin/sqoop import --connect jdbc:oracle:thin:system/system@<IP address>:1521:xe --username system -P --table system.ACTIVITY --columns "Activity_Period,Operating_Airline,Operating_Airline_IATA_Code,Published_Airline,Published_Airline_IATA_Code,GEO_Summary,GEO_Region,Activity_Type_Code,Cargo_Type_Code,Cargo_Aircraft_Type,Cargo_Weight_LBS,Cargo_Metric_TONS" --target-dir /user/cloudera/sqoop_out -m 1
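If the target is a Hive table rather than plain HDFS files (the last capability listed earlier), the same import takes a --hive-import flag. This is a sketch only: the command is assembled as a string, the Hive table name "activity" is an assumption, and the connection details follow the placeholder syntax above.

```shell
# Sketch: the same ACTIVITY import, but loading straight into a Hive table.
# Sqoop creates the HDFS files and the Hive metadata automatically.
# Built as a string for illustration; a live cluster is assumed to run it.
SQOOP_HIVE="sqoop import \
  --connect jdbc:oracle:thin:system/system@<IP address>:1521:xe \
  --username system -P \
  --table system.ACTIVITY \
  --hive-import --hive-table activity -m 1"
echo "$SQOOP_HIVE"
```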
DEMO1: SQOOP IMPORT PROCESS
14/03/21 18:22:39 INFO mapred.JobClient: Map input records=11
14/03/21 18:22:39 INFO mapred.JobClient: Map output records=11
14/03/21 18:22:39 INFO mapred.JobClient: Input split bytes=464
14/03/21 18:22:39 INFO mapred.JobClient: Spilled Records=0
14/03/21 18:22:39 INFO mapred.JobClient: CPU time spent (ms)=3430
14/03/21 18:22:39 INFO mapred.JobClient: Physical memory (bytes) snapshot=506802176
14/03/21 18:22:39 INFO mapred.JobClient: Virtual memory (bytes) snapshot=2714157056
14/03/21 18:22:39 INFO mapred.JobClient: Total committed heap usage (bytes)=506724352
14/03/21 18:22:39 INFO mapreduce.ImportJobBase: Transferred 103 bytes in 56.4649 seconds (1.8241 bytes/sec)
14/03/21 18:22:39 INFO mapreduce.ImportJobBase: Retrieved 11 records.
Above you can see Sqoop processing the command in its console output, running MapReduce jobs to bring in the data via the Oracle JDBC driver.
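The transfer rate reported in the log is simply bytes transferred divided by elapsed seconds, which you can verify from the two figures in the output:

```shell
# Verify the logged rate: 103 bytes over 56.4649 seconds, to 4 decimal places,
# matches the 1.8241 bytes/sec that Sqoop reported.
awk 'BEGIN { printf "%.4f\n", 103 / 56.4649 }'
```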
DEMO 1: RESULT IN HDFS
By default, Sqoop will put the resulting file in your user's home directory in HDFS. Let's take a look and see what's there:
[oracle@bigdatalite ~]$ hadoop fs -ls /user/oracle/ACTIVITY
Found 6 items
-rw-r--r-- 1 oracle supergroup 0 2014-03-21 18:22 /user/oracle/ACTIVITY/_SUCCESS
drwxr-xr-x - oracle supergroup 0 2014-03-21 18:21 /user/oracle/ACTIVITY/_logs
-rw-r--r-- 1 oracle supergroup 27 2014-03-21 18:22 /user/oracle/ACTIVITY/part-m-00000
-rw-r--r-- 1 oracle supergroup 17 2014-03-21 18:22 /user/oracle/ACTIVITY/part-m-00001
-rw-r--r-- 1 oracle supergroup 24 2014-03-21 18:22 /user/oracle/ACTIVITY/part-m-00002
-rw-r--r-- 1 oracle supergroup 35 2014-03-21 18:22 /user/oracle/ACTIVITY/part-m-00003
[oracle@bigdatalite ~]$ hadoop fs -cat /user/oracle/ACTIVITY/part-m-00000
1,Rate
2,Completed
3,Pause
[oracle@bigdatalite ~]$ hadoop fs -cat /user/oracle/ACTIVITY/part-m-00001
4,Start
5,Browse
What you can see there is that Sqoop has imported the data as a series of "part-m" files: CSV files, one per map task (the "m" stands for map; this import ran with no reduce phase). There are various options in the docs for specifying compression and other performance features for Sqoop imports, but the basic format is a series of CSV files, one per mapper.
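The per-mapper layout can be simulated locally. The snippet below fakes two part-m files using the record values shown in the listing above, then merges them the way hadoop fs -getmerge would on a real cluster (the local /tmp path is for illustration only).

```shell
# Simulate two mapper output files locally and merge them into one CSV.
# On a real cluster the equivalent would be:
#   hadoop fs -getmerge /user/oracle/ACTIVITY activity.csv
mkdir -p /tmp/activity_demo
printf '1,Rate\n2,Completed\n3,Pause\n' > /tmp/activity_demo/part-m-00000
printf '4,Start\n5,Browse\n' > /tmp/activity_demo/part-m-00001
cat /tmp/activity_demo/part-m-* > /tmp/activity_demo/activity.csv
wc -l < /tmp/activity_demo/activity.csv
```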
IMPORT ISSUES & SOLUTION
If you hit one of the errors below:
ERROR tool.ImportTool: Encountered IOException running import job: java.io.FileNotFoundException: File does not exist: hdfs://hostname:8020/usr/lib/sqoop/lib/ant-eclipse-1.0-jvm1.2.jar
Or
ERROR tool.ImportTool: Encountered IOException running import job: java.io.FileNotFoundException: File does not exist: hdfs://localhost:54310/usr/lib/sqoop/lib/jackson-mapper-asl-1.9.13.jar
Solution:
Your mapred-site.xml file is configured incorrectly. Check the file's content; it should look like this:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
A common mistake is to give mapred-site.xml the same content as core-site.xml. The core-site.xml file should look like this:
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://master:54310</value>
</property>
</configuration>
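A quick way to catch this misconfiguration is to grep the file for the yarn setting. The snippet below writes a correct mapred-site.xml to /tmp purely for illustration; on a real node you would point the grep at your own Hadoop configuration directory (the location varies by installation).

```shell
# Write a correctly configured mapred-site.xml (illustration only) and verify
# that the MapReduce framework is set to yarn rather than an HDFS URI.
cat > /tmp/mapred-site.xml <<'EOF'
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
EOF
if grep -q '<value>yarn</value>' /tmp/mapred-site.xml; then
  echo "mapred-site.xml OK"
else
  echo "mapred-site.xml misconfigured"
fi
```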
REFERENCES
• Sqoop User Guide:
http://sqoop.apache.org/docs/1.4.5/SqoopUserGuide.html
• Transferring Bulk Data between Oracle Database and Hadoop
Ecosystem with Sqoop
http://www.toadworld.com/platforms/oracle/w/wiki/10891.transferring-bulk-data-between-oracle-database-and-hadoop-ecosystem-with-sqoop.aspx
THANK YOU