Hadoop databases for oracle DBAs
Hadoop databases: Hive, Impala, Spark, Presto
For ORACLE DBAs
Session ID: 557
Prepared by: Maxym Kharchenko, Gluent
@maxymkh
April 2-6, 2017 in Las Vegas, NV USA #C17LV
Whoami
• Database Kernel developer -> ORACLE DBA -> Database Hadoop/Cloud developer
• Worked with ORACLE for the last 15 years
• OCM, ORACLE Ace alumni, Amazon alumni
• Last year: OLTP -> Hadoop
Shameless plug about my company
[Diagram: Gluent; Oracle, Teradata, NoSQL, Big Data Sources, MSSQL; App X, App Y, App Z]
Agenda
• What are Hadoop databases?
• Hive/Impala/Spark vs. ORACLE (hopefully, demo)
• Best ways to start
What is Hadoop:
• For “Big data”
• Can deal with “Unstructured” data
• Distributed
• Consists of: HDFS + MapReduce
• Requires you to write MapReduce jobs, NoSQL
Yes, but what does it all mean?
Imagine that you are Google in the early 2000s
Target Ads
• You need to query web crawler data
• Which is unbelievably huge
• These queries need to be:
  • (reasonably) Fast
  • (reasonably) Cheap
  • (reasonably) Easy to use
Let’s build a Data Warehouse
(traditional) Data warehouse
• Been there for years
• Mature and (relatively) advanced
• SQL !!!
Data Warehouse scorecard
Requirements             | RDBMS
(reasonably) Fast        |
(reasonably) Cheap       |
(reasonably) Easy to use |
Able to process data     | ¯\_(ツ)_/¯
Scaling up “Big data” ain’t cheap
• Can’t fit all of the data on a single box
• Cost is quickly getting out of hand
(cheap) Commodity systems make “big data” feasible
Solution = commodity systems
[Figure: cost labels $$$$$ and $$]
Commodity systems scorecard
Requirements             | Commodity
(reasonably) Fast        |
(reasonably) Cheap       |
(reasonably) Easy to use |
Able to process data     |
All your queries are Java classes
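To make the point concrete, here is a minimal sketch of what a simple "count per product" query turns into once it has to be written as map and reduce steps by hand. It is plain Python rather than Java and does not use any Hadoop API; the sample rows, the function names and the channel filter are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for a sales table: (prod_id, channel_id).
SALES = [(43, 5), (46, 5), (43, 5), (22, 3), (43, 3)]

def map_phase(rows, channel_id):
    """Map: emit (prod_id, 1) for every row that passes the filter."""
    for prod_id, chan in rows:
        if chan == channel_id:
            yield prod_id, 1

def shuffle(pairs):
    """Shuffle: group the emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}

def top_products(rows, channel_id, limit):
    """Rough equivalent of: group by prod_id order by count desc limit N."""
    counts = reduce_phase(shuffle(map_phase(rows, channel_id)))
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:limit]

print(top_products(SALES, channel_id=5, limit=2))
```

Even in this toy form, a one-line SQL statement becomes several functions, which is the "easy to use" problem that SQL-on-Hadoop engines set out to address.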
Google
• 2003: Google File System (GFS) paper
• 2004: Google MapReduce (MR) paper
Hadoop
• 2006: Hadoop
“Traditional Data Warehouse” vs. Hadoop
Requirements             | Hadoop | Data Warehouse
(reasonably) Fast        |        |
(reasonably) Cheap       |        |
(reasonably) Easy to use |        |
Able to process data     |        | ¯\_(ツ)_/¯
SQL on Hadoop - Hive
• 2010: Facebook releases Apache Hive
• SQL on Hadoop !
Another SQL on Hadoop - Impala
• 2012: Cloudera announces Impala
• Faster SQL on Hadoop !
And then, it exploded …
“Hadoop” vs “Relational” databases
Demo … hopefully
This is not about NoSQL :-)
Same: Tables
sql> describe sh.products;
+-----------------------+----------------+---------+
| name                  | type           | comment |
+-----------------------+----------------+---------+
| prod_id               | bigint         |         |
| prod_name             | string         |         |
| prod_desc             | string         |         |
| prod_category_id      | bigint         |         |
| prod_category_desc    | string         |         |
| supplier_id           | bigint         |         |
| prod_total_id         | decimal(38,18) |         |
| prod_src_id           | decimal(38,18) |         |
| prod_eff_from         | timestamp      |         |
| prod_eff_to           | timestamp      |         |
| prod_valid            | string         |         |
+-----------------------+----------------+---------+
Same: Running SQL queries
sql> select prod_id, count(1)
from sh.sales s, sh.channels c
where c.channel_id = s.channel_id and c.channel_desc='Catalog'
group by prod_id
order by 2 desc
limit 5;
+------------------------+----------+
| prod_id                | count(1) |
+------------------------+----------+
| 43.000000000000000000  | 5182     |
| 46.000000000000000000  | 5165     |
| 22.000000000000000000  | 5162     |
| 123.000000000000000000 | 5152     |
| 32.000000000000000000  | 5145     |
+------------------------+----------+
Fetched 5 row(s) in 3.26s
Same: Queries are optimized
sql> explain select count(1) from sh.times;
+----------------------------------------------------------+
| Explain String                                           |
+----------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=10.00MB VCores=1 |
|                                                          |
| 03:AGGREGATE [FINALIZE]                                  |
| |  output: count:merge(1)                                |
| |                                                        |
| 02:EXCHANGE [UNPARTITIONED]                              |
| |                                                        |
| 01:AGGREGATE                                             |
| |  output: count(1)                                      |
| |                                                        |
| 00:SCAN HDFS [sh.times]                                  |
|    partitions=16/16 files=32 size=500.45KB               |
+----------------------------------------------------------+
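The plan above reads bottom-up: each host scans its own files, counts locally, ships one number through the exchange, and a final step merges the partial counts. A rough Python sketch of that two-phase aggregation follows. The host names and row values are made up, and the functions only mirror the plan node names; this is not how Impala is implemented.

```python
# Hypothetical data: rows of sh.times as seen by three hosts (values are made up).
ROWS_BY_HOST = {
    "host1": ["2011-01-01", "2011-01-02"],
    "host2": ["2011-02-01"],
    "host3": ["2011-03-01", "2011-03-02", "2011-03-03"],
}

def scan(host):
    """00:SCAN HDFS: each host reads only its own share of the files."""
    return ROWS_BY_HOST[host]

def partial_count(rows):
    """01:AGGREGATE: count(1) computed locally on each host."""
    return sum(1 for _ in rows)

def exchange(partials):
    """02:EXCHANGE [UNPARTITIONED]: send every partial result to one place."""
    return list(partials)

def merge_counts(partials):
    """03:AGGREGATE [FINALIZE]: count:merge(1) adds up the partial counts."""
    return sum(partials)

partials = exchange(partial_count(scan(host)) for host in ROWS_BY_HOST)
total = merge_counts(partials)
print(partials, total)
```

Only one small number per host crosses the network, which is why this kind of plan scales with the number of hosts.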
Different: What gets optimized
• No “regular” indexes
• But many operations are distributed
[Diagram: SALES 1 + TIMES 1, SALES 2 + TIMES 2, SALES 3 + TIMES 3]
Different: Native cloud filesystem support
sql> show partitions sh.sales;
s3a://bucket1/sh/sales/time_id=2011-01 | PARQUET
s3a://bucket1/sh/sales/time_id=2011-02 | PARQUET
s3a://bucket1/sh/sales/time_id=2011-03 | PARQUET
s3a://bucket1/sh/sales/time_id=2011-04 | PARQUET
s3a://bucket1/sh/sales/time_id=2011-05 | PARQUET
s3a://bucket1/sh/sales/time_id=2011-06 | PARQUET
s3a://bucket1/sh/sales/time_id=2011-07 | PARQUET
s3a://bucket1/sh/sales/time_id=2011-08 | PARQUET
hdfs://clust1/sh/sales/time_id=2011-09 | PARQUET
hdfs://clust1/sh/sales/time_id=2011-10 | PARQUET
hdfs://clust1/sh/sales/time_id=2011-11 | PARQUET
hdfs://clust1/sh/sales/time_id=2011-12 | PARQUET
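Because the partition key is encoded in each directory name, an engine can decide which locations to read without opening any file, whether the location is on S3 or HDFS. The sketch below illustrates that idea in Python using the paths from the listing; the helper names `partition_value` and `prune` are hypothetical, and real engines get this information from the metastore rather than by parsing strings.

```python
# Partition locations as shown in the listing: eight on S3, four on HDFS.
PARTITIONS = (
    [f"s3a://bucket1/sh/sales/time_id=2011-{m:02d}" for m in range(1, 9)]
    + [f"hdfs://clust1/sh/sales/time_id=2011-{m:02d}" for m in range(9, 13)]
)

def partition_value(path, key="time_id"):
    """Extract the partition value from the last path component (key=value)."""
    last = path.rstrip("/").rsplit("/", 1)[-1]
    name, _, value = last.partition("=")
    if name != key:
        raise ValueError(f"{path} is not partitioned by {key}")
    return value

def prune(paths, low, high):
    """Keep only partitions whose value falls in [low, high].

    YYYY-MM strings compare correctly as plain strings.
    """
    return [p for p in paths if low <= partition_value(p) <= high]

# Roughly: where time_id between '2011-08' and '2011-09'
selected = prune(PARTITIONS, "2011-08", "2011-09")
for path in selected:
    print(path)
```

In this example the two surviving partitions live on different filesystems, and the query does not need to care.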
Database engine does NOT ”own” data
Different: Different engines can work with the same data files (even at the same time)
example01.dbf, sysaux01.dbf, system01.dbf, temp01.dbf, undotbs01.dbf, users01.dbf
a01_data.parq, a01_data.parq, a03_data.parq, a04_data.parq, a05_data.parq, a06_data.parq
Different: … or copies of the data files
hdfs://adhoc/a.parq, hdfs://adhoc/b.parq, hdfs://adhoc/c.parq, hdfs://adhoc/d.parq, hdfs://adhoc/e.parq, hdfs://adhoc/f.parq
hdfs://prod/a.parq, hdfs://prod/b.parq, hdfs://prod/c.parq, hdfs://prod/d.parq, hdfs://prod/e.parq, hdfs://prod/f.parq
s3://backup/a.parq, s3://backup/b.parq, s3://backup/c.parq, s3://backup/d.parq, s3://backup/e.parq, s3://backup/f.parq
Different: Open data formats
• Not proprietary – many tools can read/write
• No additional $$ for “advanced features”:
  • Columnar storage
  • Storage indexes
  • Compression
Same: “sqlplus-like” clients
> impala-shell -i 10.0.0.1
[10.0.0.1:21000] > select prod_id, count(1)
from sh.sales group by prod_id order by 2 desc limit 1;
+-----------------------+----------+
| prod_id               | count(1) |
+-----------------------+----------+
| 48.000000000000000000 | 74026    |
+-----------------------+----------+
> beeline -u 'jdbc:hive2://10.0.0.1:10000'
0: jdbc:hive2://10.0.0.1:1> select prod_id, count(1)
from sh.sales group by prod_id order by 2 desc limit 1;
Different: External dictionary
[Diagram: User data + Dictionary (SYS); User data + Dictionary (SYS); Hive Metastore]
Different: Append only, “ETL-like” DML
• Hadoop DML is more like ETL
• Data is presumed static
• ACID: some interpretation required
• Schema on read
UPDATE t SET a=12 WHERE b=1;
Table T (base): a_data.orc
Table T (base): a_data.orc + Table T (delta): b_data.orc
Compactor runs …
Table T (base): c_data.orc
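The base/delta/compactor sequence above can be sketched in a few lines of Python. This is a deliberately simplified model: the row ids, row values and function names are assumptions for illustration, and real Hive ACID tables track considerably more (transaction ids, delete deltas, and so on).

```python
# Hypothetical contents of a_data.orc: (row_id, row). Values are made up.
base = [(1, {"a": 5, "b": 1}), (2, {"a": 7, "b": 2})]

def update(base_rows, assign, predicate):
    """Write new versions of matching rows to a delta; the base is never modified."""
    return [
        (row_id, {**row, **assign})
        for row_id, row in base_rows
        if predicate(row)
    ]

def read(base_rows, delta_rows):
    """Merge on read: for each row id, the delta version wins over the base."""
    latest = dict(base_rows)
    latest.update(delta_rows)
    return sorted(latest.items())

# UPDATE t SET a=12 WHERE b=1;  -> produces b_data.orc (delta)
delta = update(base, {"a": 12}, lambda row: row["b"] == 1)

# Compactor runs: base + delta are rewritten into a new base, c_data.orc
compacted = read(base, delta)

print(delta)
print(compacted)
```

Note that readers see the updated value as soon as the delta exists; compaction only reduces the number of files that have to be merged at read time.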
Databases
Apache Hive
• “Designed” for “batch” queries (*)
• Runs on top of standard Hadoop RM: YARN
• Supports multiple “engines”: MR, Tez, Spark
• SerDes
[Diagram: Master (Hiveserver2, namenode, YARN RM); three Slave nodes (each: YARN NM, datanode)]
Apache Impala
• Designed for “quick interactive” queries
• “Data-local” execution
• In-memory processing
[Diagram: Master (statestored, namenode, catalogd); Slave A, Slave B, Slave C (each: impalad, datanode)]
Apache Spark
• “Better Hadoop” with “native”: SQL, MLlib, GraphX
• In-memory processing, based on RDDs
• Supports many cluster managers: “native”, YARN, Mesos
• Flexible programming model
[Diagram: Master (Driver); Slave A, Slave B, Slave C (each: Executor)]
Presto
• Designed for “interactive” queries
• In-memory processing
• Custom storage “plugins”: Hive, Kafka, MySQL, Postgres, …
[Diagram: Master (coordinator); Slave A, Slave B, Slave C (each: worker)]
How to start
Step 1: Google “Hadoop ecosystem”
Step 2: Try to install the simplest thing
Step 3
Step 4
Hint: Nobody builds their own Linux anymore
Choose a Hadoop distribution that suits you
Hadoop distributions
• Pre-built and pre-integrated (aka: all things work out of the box)
• Each has its own “philosophy” …
• … As well as a preferred Hadoop database
So what’s in it for me?
• It’s interesting (cool technology that hits many recent buzzwords)
• If you know ORACLE, it’s close to your skill set
• It’s promising and future oriented
Q&A