Big Data Marriage of RDBMS-DWH and Hadoop & Co.

36
2014 © Trivadis BASEL BERN LAUSANNE ZÜRICH DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. HAMBURG MÜNCHEN STUTTGART WIEN BASEL BERN LAUSANNE ZÜRICH DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. HAMBURG MÜNCHEN STUTTGART WIEN Big Data Marriage of RDBMS-DWH and Hadoop & Co. Author: Jan Ott – Trivadis AG Big Data - Marriage of RDBMS-DWH and Hadoop & Co. 1

Transcript of Big Data Marriage of RDBMS-DWH and Hadoop & Co.

Page 1: Big Data Marriage of RDBMS-DWH and Hadoop & Co.

2014 © Trivadis

BASEL BERN LAUSANNE ZÜRICH DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. HAMBURG MÜNCHEN STUTTGART WIEN BASEL BERN LAUSANNE ZÜRICH DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. HAMBURG MÜNCHEN STUTTGART WIEN

Big Data

Marriage of

RDBMS-DWH

and Hadoop & Co.

Author: Jan Ott – Trivadis AG

Big Data - Marriage of RDBMS-DWH and Hadoop & Co. 1

Page 2: Big Data Marriage of RDBMS-DWH and Hadoop & Co.

2014 © Trivadis

Mit über 600 IT- und Fachexperten bei Ihnen vor Ort

2

12 Trivadis Niederlassungen mit über 600 Mitarbeitenden

200 Service Level Agreements

Mehr als 4'000 Trainingsteilnehmer

Forschungs- und Entwicklungs-budget: CHF 5.0 Mio. / EUR 4.0 Mio.

Finanziell unabhängig und nachhaltig profitabel

Erfahrung aus mehr als 1'900 Projekten pro Jahr bei über 800 Kunden

Hamburg

Düsseldorf

Frankfurt

Freiburg München

Wien

Basel Zürich Bern

Lausanne

2

Stuttgart

Brugg

03.04.2014

2

Page 3: Big Data Marriage of RDBMS-DWH and Hadoop & Co.

2014 © Trivadis

Agenda

1.  Introduction 2.  First Steps in the Big Data – Hadoop World 3.  Project 1 4.  Project 2 5.  Summary

Big Data - Marriage of RDBMS-DWH and Hadoop & Co. 3

Page 4: Big Data Marriage of RDBMS-DWH and Hadoop & Co.

2014 © Trivadis

Introduction §  A view words about Big Data – Hadoop §  First Steps in the Big Data – Hadoop World

§  Get some data into Hadoop §  Get the data into Oracle §  First performance figures

§  Project 1 §  Data in Hadoop as store only §  Transformation in DB §  Next steps

§  Project 2 §  Lambda Architecture §  Setup §  Next steps

Big Data - Marriage of RDBMS-DWH and Hadoop & Co. 4

Page 5: Big Data Marriage of RDBMS-DWH and Hadoop & Co.

2014 © Trivadis

Big Data: Introduction §  Big Data - V’s – 3, 4 or 5

§  Volume – scale of data §  Velocity – analysis of streaming data §  Variety – different form of data §  Veracity – uncertainty of data (IBM) §  Value – business value (Microsoft)

§  Hadoop and its Zoo §  HDFS – MapReduce §  HBase, Hive, Impala, … §  Zookeeper

§  NoSQL Databases §  Semantic Web §  Architecture

§  LAMBDA

Big Data - Marriage of RDBMS-DWH and Hadoop & Co. 5

Turning Data into Insights

Page 6: Big Data Marriage of RDBMS-DWH and Hadoop & Co.

2014 © Trivadis

What is Hadoop

§  a file system – HDFS §  Based on papers from Google §  Apache Open Source Project

§  Goal §  Fast §  Handles huge amount of data §  Handles unstructured to fully structured data §  Horizontally scalable §  Reliable

Big Data - Marriage of RDBMS-DWH and Hadoop & Co. 6

Page 7: Big Data Marriage of RDBMS-DWH and Hadoop & Co.

2014 © Trivadis

Agenda

1.  Introduction 2.  First Steps in the Big Data – Hadoop World 3.  Project 1 4.  Project 2 5.  Summary

Big Data - Marriage of RDBMS-DWH and Hadoop & Co. 7

Page 8: Big Data Marriage of RDBMS-DWH and Hadoop & Co.

2014 © Trivadis

First Steps in the Big Data – Hadoop World

§  It is a zoo and you can get lost §  Keep it simple

§  Get some data into Hadoop

§  Get Oracle DB to access the data in Hadoop

§  Hadoop – Java (keep it to a minimum)

§  Data small and known §  EMP table from scott/tiger

§  Get an environment that is setup §  Oracle VM – Big Data Light

§  Pick one way to get the data into Hadoop §  Hadoop shell interface

§  Pick one way to make the data accessible by Oracle §  Oracle Connectors – SQL Connector

Big Data - Marriage of RDBMS-DWH and Hadoop & Co. 8

Page 9: Big Data Marriage of RDBMS-DWH and Hadoop & Co.

2014 © Trivadis

Pre-Requisite – Environment

§  Oracle Big Data Lite §  VM §  Version 2.4.1 §  http://www.oracle.com/technetwork/database/bigdata-appliance/oracle-

bigdatalite-2104726.html

§  Contains §  Oracle Database 12c (12.1.0.1) §  Cloudera’s Distribution including Apache Hadoop (CDH4.5) §  Oracle Big Data Connectors 2.4

-  Oracle SQL Connector for HDFS 2.3.0 §  Oracle SQL Developer 4.0 §  …

§  Oracle Virtual Box

Big Data - Marriage of RDBMS-DWH and Hadoop & Co. 9

Page 10: Big Data Marriage of RDBMS-DWH and Hadoop & Co.

2014 © Trivadis

Information about the VM

§  Login §  oracle/welcome1

§  Start here §  ���file:///home/oracle/GettingStarted/StartHere.html

§  Start the Oracle DB

§  Your done preparing

§  On the side – Oracle has a Movie example

Big Data - Marriage of RDBMS-DWH and Hadoop & Co. 10

Page 11: Big Data Marriage of RDBMS-DWH and Hadoop & Co.

2014 © Trivadis

The Steps – simple – focus

Big Data - Marriage of RDBMS-DWH and Hadoop & Co. 11

SQL Query

External Table

Page 12: Big Data Marriage of RDBMS-DWH and Hadoop & Co.

2014 © Trivadis

Step 1 – Prepare the Data

§  EMP => ���/home/oracle/Desktop/demo/emp.txt §  Comma delimited §  Flat file §  No XML – focus

Big Data - Marriage of RDBMS-DWH and Hadoop & Co. 12

SQL> SELECT empno || ',' || ename || ',' || job || ',’ || mgr || ',' || hiredate || ',' || sal || ',’ || comm || ',' || deptno FROM emp ORDER BY empno;

Page 13: Big Data Marriage of RDBMS-DWH and Hadoop & Co.

2014 © Trivadis

Step 1 – Prepare the Data

Big Data - Marriage of RDBMS-DWH and Hadoop & Co. 13

7369,SMITH,CLERK,7902,17-DEC-80,800,,20 7499,ALLEN,SALESMAN,7698,20-FEB-81,1600,300,30 7521,WARD,SALESMAN,7698,22-FEB-81,1250,500,30 7566,JONES,MANAGER,7839,02-APR-81,2975,,20 7654,MARTIN,SALESMAN,7698,28-SEP-81,1250,1400,30 7698,BLAKE,MANAGER,7839,01-MAY-81,2850,,30 7782,CLARK,MANAGER,7839,09-JUN-81,2450,,10 7788,SCOTT,ANALYST,7566,09-DEC-82,3000,,20 7839,KING,PRESIDENT,,17-NOV-81,5000,,10 7844,TURNER,SALESMAN,7698,08-SEP-81,1500,0,30 7876,ADAMS,CLERK,7788,12-JAN-83,1100,,20 7900,JAMES,CLERK,7698,03-DEC-81,950,,30 7902,FORD,ANALYST,7566,03-DEC-81,3000,,20 7934,MILLER,CLERK,7782,23-JAN-82,1300,,10

Page 14: Big Data Marriage of RDBMS-DWH and Hadoop & Co.

2014 © Trivadis

Step 2 – Get the data into HDFS §  HDFS Shell Commands

§  cat §  chmod §  cp §  ls §  put §  … §  http://hadoop.apache.org/docs/r2.3.0/hadoop-project-dist/hadoop-common/

FileSystemShell.html

§  See what there is in HDFS

Big Data - Marriage of RDBMS-DWH and Hadoop & Co. 14

$ hadoop fs –ls Found 6 items drwx------ - oracle supergroup 0 2014-03-19 11:11 .Trash drwx------ - oracle supergroup 0 2014-01-24 20:15 .staging drwxr-x--- - oracle supergroup 0 2014-01-13 00:15 moviedemo drwx------ - oracle supergroup 0 2014-01-24 16:32 moviework drwxr-xr-x - oracle supergrou 0 2013-12-27 16:36 olhcache drwxr-xr-x - oracle supergroup 0 2013-12-27 16:36 temp_out_session

Page 15: Big Data Marriage of RDBMS-DWH and Hadoop & Co.

2014 © Trivadis

Step 2 – Get the data into HDFS

§  Create a new directory

Big Data - Marriage of RDBMS-DWH and Hadoop & Co. 15

$ hadoop fs –mkdir demo_hdfs $ hadoop fs –ls Found 7 items drwx------ - oracle supergroup 0 2014-03-19 11:11 .Trash drwx------ - oracle supergroup 0 2014-01-24 20:15 .staging drwxr-xr-x - oracle supergroup 0 2014-03-19 11:21 demo_hdfs drwxr-x--- - oracle supergroup 0 2014-01-13 00:15 moviedemo drwx------ - oracle supergroup 0 2014-01-24 16:32 moviework drwxr-xr-x - oracle supergroup 0 2013-12-27 16:36 olhcache drwxr-xr-x - oracle supergroup 0 2013-12-27 16:36 temp_out_session

§  Copy the file into it $ hadoop fs -put /home/oracle/Desktop/demo/emp.txt demo_hdfs $ hadoop fs -ls -R demo -rw-r--r-- 1 oracle super… 1218 2014-03-19 11:30 demo_hdfs/emp.txt

Page 16: Big Data Marriage of RDBMS-DWH and Hadoop & Co.

2014 © Trivadis

Step 2 – Get the data into HDFS

§  See what is there

Big Data - Marriage of RDBMS-DWH and Hadoop & Co. 16

$ hadoop fs -cat demo_hdfs/emp.txt 7369,SMITH,CLERK,7902,17-DEC-80,880,,20 7499,ALLEN,SALESMAN,7698,20-FEB-81,1600,300,30 7521,WARD,SALESMAN,7698,22-FEB-81,1250,500,30 7566,JONES,MANAGER,7839,02-APR-81,2975,,20 7654,MARTIN,SALESMAN,7698,28-SEP-81,1250,1400,30 7698,BLAKE,MANAGER,7839,01-MAY-81,2850,,30 7782,CLARK,MANAGER,7839,09-JUN-81,2450,,10 7788,SCOTT,ANALYST,7566,19-APR-87,3000,,20 7839,KING,PRESIDENT,,17-NOV-81,5000,,10 7844,TURNER,SALESMAN,7698,08-SEP-81,1500,0,30 7876,ADAMS,CLERK,7788,23-MAY-87,1100,,20 7900,JAMES,CLERK,7698,03-DEC-81,950,,30 7902,FORD,ANALYST,7566,03-DEC-81,3000,,20 7934,MILLER,CLERK,7782,23-JAN-82,1300,,10

§  That’s it – the data is in Hadoop – HDFS

Page 17: Big Data Marriage of RDBMS-DWH and Hadoop & Co.

2014 © Trivadis

Oracle SQL Connector §  Connects HDFS with an Oracle External Table

§  Sources can be §  Data Pump File §  Hive Table §  Delimited Text File

§  Uses Oracle Loader

§  Parallel – needs several files

§  Limits §  Read only §  No Indexes §  Always full table scan

§  Location File

§  Oracle provides – command line tool §  http://docs.oracle.com/cd/E37231_01/doc.20/e36961/sqlch.htm

Big Data - Marriage of RDBMS-DWH and Hadoop & Co. 17

Page 18: Big Data Marriage of RDBMS-DWH and Hadoop & Co.

2014 © Trivadis

Step 3 – Setup External Table to HDFS

§  Rights §  OSCH_BIN_PATH §  Default Directory

-  Log Files -  Location Files

Big Data - Marriage of RDBMS-DWH and Hadoop & Co. 18

SQL> GRANT READ, WRITE, EXECUTE ON DIRECTORY OSCH_BIN_PATH TO scott;

Grant succeeded.

SQL> CREATE OR REPLACE DIRECTORY demo_dir AS '/home/oracle/Desktop/demo';

Directory created.

SQL> GRANT READ, WRITE, EXECUTE ON DIRECTORY demo_dir TO scott;

Grant succeeded.

Page 19: Big Data Marriage of RDBMS-DWH and Hadoop & Co.

2014 © Trivadis

Step 3 – Setup External Table to HDFS

§  Create External Table – Command Line Tool

Big Data - Marriage of RDBMS-DWH and Hadoop & Co. 19

$ hadoop oracle.hadoop.exttab.ExternalTable \ -D oracle.hadoop.exttab.tableName=EMP_HDFS_EXT_TAB \ -D oracle.hadoop.exttab.columnNames=empno,ename,job,mgr,hiredate,sal,comm,deptno \ -D oracle.hadoop.exttab.locationFileCount=1 \ -D oracle.hadoop.exttab.dataPaths=demo_hdfs/*.txt \ -D oracle.hadoop.exttab.columnCount=8 \ -D oracle.hadoop.exttab.defaultDirectory=demo_dir \ -D oracle.hadoop.connection.url=jdbc:oracle:thin:@localhost:1521:orcl \ -D oracle.hadoop.connection.user=SCOTT \ -D oracle.hadoop.exttab.printStackTrace=true \ -createTable

Page 20: Big Data Marriage of RDBMS-DWH and Hadoop & Co.

2014 © Trivadis

Step 3 – Setup External Table to HDFS

Big Data - Marriage of RDBMS-DWH and Hadoop & Co. 20

CREATE TABLE "SCOTT"."EMP_HDFS_EXT_TAB” ( "EMPNO" VARCHAR2(4000), "ENAME" VARCHAR2(4000), "JOB" VARCHAR2(4000), "MGR" VARCHAR2(4000), "HIREDATE" VARCHAR2(4000), "SAL" VARCHAR2(4000), "COMM" VARCHAR2(4000), "DEPTNO" VARCHAR2(4000) ) ORGANIZATION EXTERNAL ( TYPE ORACLE_LOADER DEFAULT DIRECTORY "DEMO_DIR” ACCESS PARAMETERS ( RECORDS DELIMITED BY 0X'0A' CHARACTERSET AL32UTF8 PREPROCESSOR "OSCH_BIN_PATH":'hdfs_stream' FIELDS TERMINATED BY 0X'2C'

Page 21: Big Data Marriage of RDBMS-DWH and Hadoop & Co.

2014 © Trivadis

Step 3 – Setup External Table to HDFS

Big Data - Marriage of RDBMS-DWH and Hadoop & Co. 21

...( RECORDS DELIMITED BY 0X'0A' CHARACTERSET AL32UTF8 PREPROCESSOR "OSCH_BIN_PATH":'hdfs_stream' FIELDS TERMINATED BY 0X'2C' MISSING FIELD VALUES ARE NULL ( "EMPNO" CHAR(4000), "ENAME" CHAR(4000), "JOB" CHAR(4000), "MGR" CHAR(4000), "HIREDATE" CHAR(4000), "SAL" CHAR(4000), "COMM" CHAR(4000), "DEPTNO" CHAR(4000) ) ) LOCATION ( 'osch-20140319061232-9144-1' ) ) PARALLEL REJECT LIMIT UNLIMITED;

Page 22: Big Data Marriage of RDBMS-DWH and Hadoop & Co.

2014 © Trivadis

Step 3 – Setup External Table to HDFS

Big Data - Marriage of RDBMS-DWH and Hadoop & Co. 22

cat /home/oracle/Desktop/demo/osch* <?xml version="1.0" encoding="UTF-8" standalone="yes"?> <locationFile> <header> <version>1.0</version> <fileName>osch-20140319061232-9144-1</fileName> <createDate>2014-03-19T18:12:32</createDate> <publishDate>2014-03-19T06:12:32</publishDate> <productName>Oracle SQL Connector for HDFS Release 2.3.0 – Production</productName> <productVersion>2.3.0</productVersion> </header> <uri_list> <uri_list_item size="605" compressionCodec="">hdfs://bigdatalite.localdomain:8020/user/oracle/demo_hdfs/emp.txt</uri_list_item> </uri_list> </locationFile>

§  Creates a location file §  Filename Pattern: osch-timestamp-number-n

Page 23: Big Data Marriage of RDBMS-DWH and Hadoop & Co.

2014 © Trivadis

Step 3 – Setup External Table to HDFS

Big Data - Marriage of RDBMS-DWH and Hadoop & Co. 23

cat /home/oracle/Desktop/demo/EMP_HDFS_EXT_TAB* LOG file opened at 03/19/14 20:06:31

KUP-05007: Warning: Intra source concurrency disabled because the preprocessor option is being used.

Field Definitions for table EMP_HDFS_EXT_TAB Record format DELIMITED, delimited by 0A Data in file has same endianness as the platform Rows with all null fields are accepted

Fields in Data Source:

EMPNO CHAR (4000) Terminated by "2C” Trim whitespace same as SQL Loader ENAME CHAR (4000) Terminated by "2C” Trim whitespace same as SQL Loader ...

§  Creates log/bad files

Page 24: Big Data Marriage of RDBMS-DWH and Hadoop & Co.

2014 © Trivadis

Performance

Big Data - Marriage of RDBMS-DWH and Hadoop & Co. 24

§  It is for HUGE amounts of data §  Small files are just slow

# of Files

Size per File

Total Size # of Rows per File

Total # of Rows

Time

1 605 B 605 B 14 14 ������������00:53.17

10 605 B 6 KB 14 140 ������00:52.06

10 422 KB 4.2 MB ���10’000 ���100’000 ���������00:51.68

10 42 MB 420 MB ���1’000’000 10’000’000 ���01:00.92

50 42 MB 2.1 GB 1’000’000 50’000’000 ������01:27.81

Page 25: Big Data Marriage of RDBMS-DWH and Hadoop & Co.

2014 © Trivadis

Agenda

1.  Introduction 2.  First Steps in the Big Data – Hadoop World 3.  Project 1 4.  Project 2 5.  Summary

Big Data - Marriage of RDBMS-DWH and Hadoop & Co. 25

Page 26: Big Data Marriage of RDBMS-DWH and Hadoop & Co.

2014 © Trivadis

Project 1 – Pilot Phase

§  Move to Hadoop for delivered files §  Start collecting §  Files get copied into HDFS one to one §  No decision had to be taken

-  Schema – schema-less -  Table design - non -  …

§  Immutable Data Store - Create and Read -  No update / No delete

§  Add External Tables with ORACLE SQL Connector §  Data useable

§  Build Hadoop infrastructure

§  Next §  Analyze the Zoo

Big Data - Marriage of RDBMS-DWH and Hadoop & Co. 26

Page 27: Big Data Marriage of RDBMS-DWH and Hadoop & Co.

2014 © Trivadis

Agenda

1.  Introduction 2.  First Steps in the Big Data – Hadoop World 3.  Project 1 4.  Project 2 5.  Summary

Big Data - Marriage of RDBMS-DWH and Hadoop & Co. 27

Page 28: Big Data Marriage of RDBMS-DWH and Hadoop & Co.

2014 © Trivadis

Project 2 – Figures

§  400 – 500 Mio tweets per day

§  1 tweet contains §  Around 50 metadata pieces

-  Geo-location -  Re-tweets -  Followers

§  That is about 2 A4 pages

§  Twitter Sample Stream §  1% §  4-5 Mio tweets per day §  50 tweets per second

§  20 other streams with defined key words

§  HDFS §  1 TB every 2 months including replication

Big Data - Marriage of RDBMS-DWH and Hadoop & Co. 28

Page 29: Big Data Marriage of RDBMS-DWH and Hadoop & Co.

2014 © Trivadis

Problem – history data and actual data

§  The problem with simple batch approaches: missing data!

Big Data - Marriage of RDBMS-DWH and Hadoop & Co. 29

Batch updates at regular intervals

Data at time t1

Data at time t2

Data at time t3

Batch Data Batch Data

Batch Data

Data access at arbitrary times Time

non-batch processed data

t1 t2 t3

Page 30: Big Data Marriage of RDBMS-DWH and Hadoop & Co.

2014 © Trivadis

The Lambda Architecture

Big Data - Marriage of RDBMS-DWH and Hadoop & Co. 30

Batch layer

Speed layer

All Data (HDFS)

Pre-computed Views

(MapReduce) Batch (re) compute

Query &

Merge

Process Stream

Incremented Views

Realtime Increment

Serving layer

QFD = Query Focused Data

QFD 1 QFD 2 QFD n …

QFD 1 QFD 2 QFD n Realtime views

Batch views New Data

Stream

adapted from Kinley (2013)

Page 31: Big Data Marriage of RDBMS-DWH and Hadoop & Co.

2014 © Trivadis

The Lambda Architecture - adopted

Big Data - Marriage of RDBMS-DWH and Hadoop & Co. 31

Batch layer

Speed layer

All Data (HDFS)

Pre-computed Views

(MapReduce) Batch (re) compute

Query &

Merge REST

Process Stream

Incremented Views

Realtime Increment

Serving layer

QFD = Query Focused Data

QFD 1 QFD 2 QFD n …

QFD 1 QFD 2 QFD n Realtime views

Batch views Messaging Kafka

Client Web App

Consumer layer

Twitter API

Java APP

Hadoop

Storm

Impala

Cassandra

Page 32: Big Data Marriage of RDBMS-DWH and Hadoop & Co.

2014 © Trivadis

Project 2

§  Live streams

§  Batch computation

§  Scalable

§  Open for new sources

Big Data - Marriage of RDBMS-DWH and Hadoop & Co. 32

Page 33: Big Data Marriage of RDBMS-DWH and Hadoop & Co.

2014 © Trivadis

Agenda

1.  Introduction 2.  First Steps in the Big Data – Hadoop World 3.  Project 1 4.  Project 2 5.  Summary

Big Data - Marriage of RDBMS-DWH and Hadoop & Co. 33

Page 34: Big Data Marriage of RDBMS-DWH and Hadoop & Co.

2014 © Trivadis

Summary

§  Big Data <> Hadoop

§  Hadoop – HDFS §  File System §  Block Size – 128 MB §  Nothing for small files §  No optimization with indexes

§  A new World

§  Hadoop and its Zoo

§  Lots can be done with RDBMS

§  Start to collect now

Big Data - Marriage of RDBMS-DWH and Hadoop & Co. 34

Page 35: Big Data Marriage of RDBMS-DWH and Hadoop & Co.

2014 © Trivadis

BASEL BERN LAUSANNE ZÜRICH DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. HAMBURG MÜNCHEN STUTTGART WIEN BASEL BERN LAUSANNE ZÜRICH DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. HAMBURG MÜNCHEN STUTTGART WIEN

Questions? THANK YOU.

Trivadis AG

Jan Ott

Europa-Strasse 5 CH-8152 Glattbrugg-Zurich

Tel. +41-44-808 70 20 (reception) Fax +41-44-808 70 21

[email protected] www.trivadis.com

Big Data - Marriage of RDBMS-DWH and Hadoop & Co. 35

Page 36: Big Data Marriage of RDBMS-DWH and Hadoop & Co.

2014 © Trivadis

Sources

§  Oracle Connection to Hadoop with Oracle §  https://blogs.oracle.com/bigdataconnectors/entry/how_to_load_oracle_tables

§  Pictures §  Oracle.com §  Twitter.com §  Apache.com

§  Big Data – MEAP by Nathan Marz

Big Data - Marriage of RDBMS-DWH and Hadoop & Co. 36