Hadoop databases for oracle DBAs
![Page 1: Hadoop databases for oracle DBAs](https://reader036.fdocuments.in/reader036/viewer/2022062412/58ef18901a28abd15e8b4689/html5/thumbnails/1.jpg)
Hadoop databases: Hive, Impala, Spark, Presto
For ORACLE DBAs
Session ID: 557
Prepared by: Maxym Kharchenko, Gluent
@maxymkh
![Page 2: Hadoop databases for oracle DBAs](https://reader036.fdocuments.in/reader036/viewer/2022062412/58ef18901a28abd15e8b4689/html5/thumbnails/2.jpg)
April 2-6, 2017 in Las Vegas, NV USA #C17LV
Whoami
• Database Kernel developer -> ORACLE DBA -> Database Hadoop/Cloud developer
• Worked with ORACLE for the last 15 years
• OCM, ORACLE Ace alumni, Amazon alumni
• Last year: OLTP -> Hadoop
![Page 3: Hadoop databases for oracle DBAs](https://reader036.fdocuments.in/reader036/viewer/2022062412/58ef18901a28abd15e8b4689/html5/thumbnails/3.jpg)
Shameless plug about my company
[Diagram labels: Gluent; Oracle, MSSQL, Teradata, NoSQL, Big Data Sources; App X, App Y, App Z]
![Page 4: Hadoop databases for oracle DBAs](https://reader036.fdocuments.in/reader036/viewer/2022062412/58ef18901a28abd15e8b4689/html5/thumbnails/4.jpg)
Agenda
• What are Hadoop databases?
• Hive/Impala/Spark vs. ORACLE (hopefully, demo)
• Best ways to start
![Page 5: Hadoop databases for oracle DBAs](https://reader036.fdocuments.in/reader036/viewer/2022062412/58ef18901a28abd15e8b4689/html5/thumbnails/5.jpg)
What is Hadoop:
• For “Big data”
• Can deal with “Unstructured” data
• Distributed
• Consists of: HDFS + MapReduce
• Requires you to write MapReduce jobs, NoSql
![Page 6: Hadoop databases for oracle DBAs](https://reader036.fdocuments.in/reader036/viewer/2022062412/58ef18901a28abd15e8b4689/html5/thumbnails/6.jpg)
Yes, but what does it all mean?
![Page 7: Hadoop databases for oracle DBAs](https://reader036.fdocuments.in/reader036/viewer/2022062412/58ef18901a28abd15e8b4689/html5/thumbnails/7.jpg)
Imagine that you are Google in the early 2000s
![Page 8: Hadoop databases for oracle DBAs](https://reader036.fdocuments.in/reader036/viewer/2022062412/58ef18901a28abd15e8b4689/html5/thumbnails/8.jpg)
Target Ads
• You need to query web crawler data
• Which is unbelievably huge
• These queries need to be:
  • (reasonably) Fast
  • (reasonably) Cheap
  • (reasonably) Easy to use
![Page 9: Hadoop databases for oracle DBAs](https://reader036.fdocuments.in/reader036/viewer/2022062412/58ef18901a28abd15e8b4689/html5/thumbnails/9.jpg)
Let’s build a Data Warehouse
![Page 10: Hadoop databases for oracle DBAs](https://reader036.fdocuments.in/reader036/viewer/2022062412/58ef18901a28abd15e8b4689/html5/thumbnails/10.jpg)
(traditional) Data warehouse
• Been there for years
• Mature and (relatively) advanced
• SQL !!!
![Page 11: Hadoop databases for oracle DBAs](https://reader036.fdocuments.in/reader036/viewer/2022062412/58ef18901a28abd15e8b4689/html5/thumbnails/11.jpg)
Data Warehouse scorecard
Requirements (RDBMS):
• (reasonably) Fast
• (reasonably) Cheap
• (reasonably) Easy to use
• Able to process data
¯\_(ツ)_/¯
![Page 12: Hadoop databases for oracle DBAs](https://reader036.fdocuments.in/reader036/viewer/2022062412/58ef18901a28abd15e8b4689/html5/thumbnails/12.jpg)
Scaling up “Big data” ain’t cheap
• Can’t fit all of the data on a single box
• Cost is quickly getting out of hand
![Page 13: Hadoop databases for oracle DBAs](https://reader036.fdocuments.in/reader036/viewer/2022062412/58ef18901a28abd15e8b4689/html5/thumbnails/13.jpg)
(cheap) Commodity systems make “big data” feasible
![Page 14: Hadoop databases for oracle DBAs](https://reader036.fdocuments.in/reader036/viewer/2022062412/58ef18901a28abd15e8b4689/html5/thumbnails/14.jpg)
Solution = commodity systems
=
$$$$$ $$
![Page 15: Hadoop databases for oracle DBAs](https://reader036.fdocuments.in/reader036/viewer/2022062412/58ef18901a28abd15e8b4689/html5/thumbnails/15.jpg)
Commodity systems scorecard
Requirements (Commodity):
• (reasonably) Fast
• (reasonably) Cheap
• (reasonably) Easy to use
• Able to process data
![Page 16: Hadoop databases for oracle DBAs](https://reader036.fdocuments.in/reader036/viewer/2022062412/58ef18901a28abd15e8b4689/html5/thumbnails/16.jpg)
All your queries are Java Classes
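To make that concrete, here is a rough sketch of a simple "count rows per product" question written as a MapReduce job rather than as SQL. It is plain Python, not Java, and runs in a single process, so it only mimics the map, shuffle and reduce phases. The sample rows and all function names are made up for illustration.

```python
from collections import defaultdict

# Hypothetical sample rows: (prod_id, channel) pairs standing in for a sales file.
SALES = [(43, "Catalog"), (46, "Catalog"), (43, "Catalog"), (22, "Internet"), (43, "Internet")]


def map_phase(row):
    """Mapper: emit (key, 1) for every row that passes the filter."""
    prod_id, channel = row
    if channel == "Catalog":
        yield prod_id, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: collapse the values for one key into a single count."""
    return key, sum(values)


def run_job(rows):
    """Chain the three phases the way a job driver would."""
    mapped = (pair for row in rows for pair in map_phase(row))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


counts = run_job(SALES)
print(counts)
```

In SQL the same question is a single GROUP BY, which is the "easy to use" gap that the rest of the talk is about.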
![Page 17: Hadoop databases for oracle DBAs](https://reader036.fdocuments.in/reader036/viewer/2022062412/58ef18901a28abd15e8b4689/html5/thumbnails/17.jpg)
Google
• 2003: Google File System (GFS) paper
• 2004: Google MapReduce (MR) paper
![Page 18: Hadoop databases for oracle DBAs](https://reader036.fdocuments.in/reader036/viewer/2022062412/58ef18901a28abd15e8b4689/html5/thumbnails/18.jpg)
Hadoop
• 2006: Hadoop
![Page 19: Hadoop databases for oracle DBAs](https://reader036.fdocuments.in/reader036/viewer/2022062412/58ef18901a28abd15e8b4689/html5/thumbnails/19.jpg)
“Traditional Data Warehouse” vs. Hadoop
Requirements (Hadoop vs. Data Warehouse):
• (reasonably) Fast
• (reasonably) Cheap
• (reasonably) Easy to use
• Able to process data
¯\_(ツ)_/¯
![Page 20: Hadoop databases for oracle DBAs](https://reader036.fdocuments.in/reader036/viewer/2022062412/58ef18901a28abd15e8b4689/html5/thumbnails/20.jpg)
SQL on Hadoop - Hive
• 2010: Facebook releases Apache Hive
• SQL on Hadoop!
![Page 21: Hadoop databases for oracle DBAs](https://reader036.fdocuments.in/reader036/viewer/2022062412/58ef18901a28abd15e8b4689/html5/thumbnails/21.jpg)
Another SQL on Hadoop - Impala
• 2012: Cloudera announces Impala
• Faster SQL on Hadoop!
![Page 22: Hadoop databases for oracle DBAs](https://reader036.fdocuments.in/reader036/viewer/2022062412/58ef18901a28abd15e8b4689/html5/thumbnails/22.jpg)
And then, it exploded …
![Page 23: Hadoop databases for oracle DBAs](https://reader036.fdocuments.in/reader036/viewer/2022062412/58ef18901a28abd15e8b4689/html5/thumbnails/23.jpg)
“Hadoop” vs “Relational” databases
Demo … hopefully
![Page 24: Hadoop databases for oracle DBAs](https://reader036.fdocuments.in/reader036/viewer/2022062412/58ef18901a28abd15e8b4689/html5/thumbnails/24.jpg)
This is not about NoSql :-)
![Page 25: Hadoop databases for oracle DBAs](https://reader036.fdocuments.in/reader036/viewer/2022062412/58ef18901a28abd15e8b4689/html5/thumbnails/25.jpg)
Same: Tables
sql> describe sh.products;
+--------------------+----------------+---------+
| name               | type           | comment |
+--------------------+----------------+---------+
| prod_id            | bigint         |         |
| prod_name          | string         |         |
| prod_desc          | string         |         |
| prod_category_id   | bigint         |         |
| prod_category_desc | string         |         |
| supplier_id        | bigint         |         |
| prod_total_id      | decimal(38,18) |         |
| prod_src_id        | decimal(38,18) |         |
| prod_eff_from      | timestamp      |         |
| prod_eff_to        | timestamp      |         |
| prod_valid         | string         |         |
+--------------------+----------------+---------+
![Page 26: Hadoop databases for oracle DBAs](https://reader036.fdocuments.in/reader036/viewer/2022062412/58ef18901a28abd15e8b4689/html5/thumbnails/26.jpg)
Same: Running SQL queries
sql> select prod_id, count(1)
from sh.sales s, sh.channels c
where c.channel_id = s.channel_id and c.channel_desc='Catalog'
group by prod_id
order by 2 desc
limit 5;
+------------------------+----------+
| prod_id                | count(1) |
+------------------------+----------+
| 43.000000000000000000  | 5182     |
| 46.000000000000000000  | 5165     |
| 22.000000000000000000  | 5162     |
| 123.000000000000000000 | 5152     |
| 32.000000000000000000  | 5145     |
+------------------------+----------+
Fetched 5 row(s) in 3.26s
![Page 27: Hadoop databases for oracle DBAs](https://reader036.fdocuments.in/reader036/viewer/2022062412/58ef18901a28abd15e8b4689/html5/thumbnails/27.jpg)
Same: Queries are optimized
sql> explain select count(1) from sh.times;
+----------------------------------------------------------+
| Explain String                                           |
+----------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=10.00MB VCores=1 |
|                                                          |
| 03:AGGREGATE [FINALIZE]                                  |
| |  output: count:merge(1)                                |
| |                                                        |
| 02:EXCHANGE [UNPARTITIONED]                              |
| |                                                        |
| 01:AGGREGATE                                             |
| |  output: count(1)                                      |
| |                                                        |
| 00:SCAN HDFS [sh.times]                                  |
|    partitions=16/16 files=32 size=500.45KB               |
+----------------------------------------------------------+
![Page 28: Hadoop databases for oracle DBAs](https://reader036.fdocuments.in/reader036/viewer/2022062412/58ef18901a28abd15e8b4689/html5/thumbnails/28.jpg)
Different: What gets optimized
• No “regular” indexes
• But many operations are distributed
[Diagram: three nodes, each holding a slice of the tables: SALES 1 / TIMES 1, SALES 2 / TIMES 2, SALES 3 / TIMES 3]
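The explain plan for count(1) shown a few slides back splits the work into a per-node aggregate, an exchange, and a final merge. A minimal Python sketch of that two-phase pattern follows. It runs in one process, and the node names and row values are hypothetical.

```python
# Hypothetical slices of one table, one list of rows per node.
NODE_SLICES = {
    "node1": ["2011-01-01", "2011-01-02", "2011-01-03"],
    "node2": ["2011-02-01", "2011-02-02"],
    "node3": ["2011-03-01"],
}


def local_count(rows):
    """Step 01 in the plan: each node counts only the rows it holds."""
    return len(rows)


def merge_counts(partials):
    """Step 03 in the plan: one node adds up the partial counts."""
    return sum(partials)


# On a real cluster the list comprehension below would run on every node in parallel,
# and step 02 (the exchange) would ship each partial count to the merging node.
partials = [local_count(rows) for rows in NODE_SLICES.values()]
total = merge_counts(partials)
print(partials, total)
```

Only the small partial counts travel between nodes, not the rows themselves, which is why a full scan without indexes can still be reasonably fast.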
![Page 29: Hadoop databases for oracle DBAs](https://reader036.fdocuments.in/reader036/viewer/2022062412/58ef18901a28abd15e8b4689/html5/thumbnails/29.jpg)
Different: Native cloud filesystem support
sql> show partitions sh.sales;
s3a://bucket1/sh/sales/time_id=2011-01 | PARQUET
s3a://bucket1/sh/sales/time_id=2011-02 | PARQUET
s3a://bucket1/sh/sales/time_id=2011-03 | PARQUET
s3a://bucket1/sh/sales/time_id=2011-04 | PARQUET
s3a://bucket1/sh/sales/time_id=2011-05 | PARQUET
s3a://bucket1/sh/sales/time_id=2011-06 | PARQUET
s3a://bucket1/sh/sales/time_id=2011-07 | PARQUET
s3a://bucket1/sh/sales/time_id=2011-08 | PARQUET
hdfs://clust1/sh/sales/time_id=2011-09 | PARQUET
hdfs://clust1/sh/sales/time_id=2011-10 | PARQUET
hdfs://clust1/sh/sales/time_id=2011-11 | PARQUET
hdfs://clust1/sh/sales/time_id=2011-12 | PARQUET
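Because each partition is just a directory with its own location, a query that filters on time_id only has to touch the matching directories, wherever they live. The sketch below illustrates that pruning step in Python. The locations mirror the listing above, while the dictionary layout and the prune function are assumptions made for illustration, not how any particular engine stores its metadata.

```python
# Rebuild the partition list from the slide: months 01-08 on S3, 09-12 on HDFS.
PARTITIONS = {}
for month in range(1, 13):
    time_id = f"2011-{month:02d}"
    root = "s3a://bucket1" if month <= 8 else "hdfs://clust1"
    PARTITIONS[time_id] = f"{root}/sh/sales/time_id={time_id}"


def prune(partitions, low, high):
    """Return only the locations whose partition key falls in [low, high]."""
    return [loc for key, loc in sorted(partitions.items()) if low <= key <= high]


# Roughly: WHERE time_id BETWEEN '2011-08' AND '2011-10'
to_scan = prune(PARTITIONS, "2011-08", "2011-10")
for location in to_scan:
    print(location)
```

Here a single query ends up reading one partition from S3 and two from HDFS, and nine of the twelve partitions are never opened.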
![Page 30: Hadoop databases for oracle DBAs](https://reader036.fdocuments.in/reader036/viewer/2022062412/58ef18901a28abd15e8b4689/html5/thumbnails/30.jpg)
Database engine does NOT “own” data
![Page 31: Hadoop databases for oracle DBAs](https://reader036.fdocuments.in/reader036/viewer/2022062412/58ef18901a28abd15e8b4689/html5/thumbnails/31.jpg)
Different: Different engines can work with the same data files (even at the same time)
example01.dbf, sysaux01.dbf, system01.dbf, temp01.dbf, undotbs01.dbf, users01.dbf
a01_data.parq, a01_data.parq, a03_data.parq, a04_data.parq, a05_data.parq, a06_data.parq
![Page 32: Hadoop databases for oracle DBAs](https://reader036.fdocuments.in/reader036/viewer/2022062412/58ef18901a28abd15e8b4689/html5/thumbnails/32.jpg)
Different: … or copies of the data files
hdfs://adhoc/a.parq, hdfs://adhoc/b.parq, hdfs://adhoc/c.parq, hdfs://adhoc/d.parq, hdfs://adhoc/e.parq, hdfs://adhoc/f.parq
hdfs://prod/a.parq, hdfs://prod/b.parq, hdfs://prod/c.parq, hdfs://prod/d.parq, hdfs://prod/e.parq, hdfs://prod/f.parq
s3://backup/a.parq, s3://backup/b.parq, s3://backup/c.parq, s3://backup/d.parq, s3://backup/e.parq, s3://backup/f.parq
![Page 33: Hadoop databases for oracle DBAs](https://reader036.fdocuments.in/reader036/viewer/2022062412/58ef18901a28abd15e8b4689/html5/thumbnails/33.jpg)
Different: Open data formats
• Not proprietary – many tools can read/write
• No additional $$ for “advanced features”:
  • Columnar storage
  • Storage indexes
  • Compression
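Storage indexes are easiest to picture as min/max statistics kept for each chunk of a column, which let a reader skip chunks that cannot contain the value it is looking for. The following is a toy, single-process Python illustration of that idea, not the actual Parquet or ORC file layout. ROW_GROUPS and scan_equals are made-up names and the values are sample data.

```python
# Hypothetical row groups of one column, each stored with its min and max value.
ROW_GROUPS = [
    {"min": 1, "max": 40, "values": [1, 7, 40, 22]},
    {"min": 41, "max": 90, "values": [41, 88, 90, 57]},
    {"min": 91, "max": 150, "values": [123, 91, 150, 99]},
]


def scan_equals(row_groups, wanted):
    """Count matches, skipping row groups whose min/max range excludes the value."""
    matches, groups_read = 0, 0
    for group in row_groups:
        if not group["min"] <= wanted <= group["max"]:
            continue  # the statistics prove the value cannot be in this group
        groups_read += 1
        matches += sum(1 for v in group["values"] if v == wanted)
    return matches, groups_read


matches, groups_read = scan_equals(ROW_GROUPS, 123)
print(matches, groups_read)
```

The lookup finds its one match after reading a single row group out of three. The benefit depends on how well the data is clustered on the filtered column.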
![Page 34: Hadoop databases for oracle DBAs](https://reader036.fdocuments.in/reader036/viewer/2022062412/58ef18901a28abd15e8b4689/html5/thumbnails/34.jpg)
Same: “sqlplus-like” clients
> impala-shell -i 10.0.0.1
[10.0.0.1:21000] > select prod_id, count(1)
from sh.sales group by prod_id order by 2 desc limit 1;
+-----------------------+----------+
| prod_id               | count(1) |
+-----------------------+----------+
| 48.000000000000000000 | 74026    |
+-----------------------+----------+
> beeline -u 'jdbc:hive2://10.0.0.1:10000'
0: jdbc:hive2://10.0.0.1:1> select prod_id, count(1)
from sh.sales group by prod_id order by 2 desc limit 1;
![Page 35: Hadoop databases for oracle DBAs](https://reader036.fdocuments.in/reader036/viewer/2022062412/58ef18901a28abd15e8b4689/html5/thumbnails/35.jpg)
Different: External dictionary
[Diagram: relational database: User data and Dictionary (SYS) together; Hadoop: User data, with the Dictionary (SYS) role played by the Hive Metastore]
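One way to picture an external dictionary: the table definition (columns, types, file location) lives in a shared catalog, the files themselves carry no engine-specific metadata, and any engine that can reach both can query the table. The Python sketch below imitates this with an in-memory dictionary and a CSV string. The METASTORE layout, the sample channel rows and both "engines" are hypothetical stand-ins, and real tables would more likely be Parquet or ORC.

```python
import csv
import io

# Hypothetical metastore entry: schema and location live outside any engine.
METASTORE = {
    "sh.channels": {
        "columns": [("channel_id", int), ("channel_desc", str)],
        "location": "channels.csv",
    }
}

# Stand-in for a file system: the raw file has no schema of its own.
FILES = {"channels.csv": "3,Direct Sales\n5,Catalog\n9,Tele Sales\n"}


def read_table(name):
    """Look the table up in the metastore and apply its schema while reading."""
    entry = METASTORE[name]
    raw = io.StringIO(FILES[entry["location"]])
    rows = []
    for record in csv.reader(raw):
        rows.append({col: cast(value) for (col, cast), value in zip(entry["columns"], record)})
    return rows


# Two "engines" that share the dictionary and the data, but not the code.
def engine_a_count(name):
    return len(read_table(name))


def engine_b_lookup(name, channel_id):
    return [r["channel_desc"] for r in read_table(name) if r["channel_id"] == channel_id]


print(engine_a_count("sh.channels"), engine_b_lookup("sh.channels", 5))
```

Types are applied only when the file is read, which is also a small-scale picture of "schema on read".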
![Page 36: Hadoop databases for oracle DBAs](https://reader036.fdocuments.in/reader036/viewer/2022062412/58ef18901a28abd15e8b4689/html5/thumbnails/36.jpg)
Different: Append only, “ETL-like” DML
• Hadoop DML is more like ETL
• Data is presumed static
• ACID: some interpretation required
• Schema on read
UPDATE t SET a=12 WHERE b=1;
[Diagram: before the update, Table T (base): a_data.orc. After the update, Table T (base): a_data.orc plus Table T (delta): b_data.orc. Compactor runs … and leaves Table T (base): c_data.orc]
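The base, delta and compactor flow can be sketched in a few lines of Python. This is a simplified model that assumes every row has a row id and that a later delta simply overrides earlier versions of a row. The data and function names are made up, and real ACID tables track considerably more, such as transaction ids and deletes.

```python
# Hypothetical contents of the base file; each row is keyed by a row id.
base = {1: {"a": 5, "b": 1}, 2: {"a": 7, "b": 2}, 3: {"a": 9, "b": 1}}
deltas = []  # each UPDATE appends a new delta; the base is never edited in place


def read():
    """Readers merge the base with every delta, later deltas winning."""
    merged = dict(base)
    for delta in deltas:
        merged.update(delta)
    return merged


def update(set_a, where_b):
    """UPDATE t SET a=<set_a> WHERE b=<where_b>, written as a new delta."""
    current = read()
    delta = {rid: {**row, "a": set_a} for rid, row in current.items() if row["b"] == where_b}
    deltas.append(delta)


def compact():
    """Compactor: fold all deltas into a fresh base and drop them."""
    global base
    base = read()
    deltas.clear()


update(set_a=12, where_b=1)
before = read()  # base + delta, merged at read time
compact()
after = read()   # new base only
print(before == after, len(deltas), base[1]["a"], base[2]["a"])
```

Readers see the same rows before and after compaction. Only the number of files they have to merge changes.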
![Page 37: Hadoop databases for oracle DBAs](https://reader036.fdocuments.in/reader036/viewer/2022062412/58ef18901a28abd15e8b4689/html5/thumbnails/37.jpg)
Databases
![Page 38: Hadoop databases for oracle DBAs](https://reader036.fdocuments.in/reader036/viewer/2022062412/58ef18901a28abd15e8b4689/html5/thumbnails/38.jpg)
Apache Hive
• “Designed” for “batch” queries (*)
• Runs on top of standard Hadoop RM: YARN
• Supports multiple “engines”: MR, TEZ, Spark
• SerDes
[Diagram: Master: Hiveserver2, namenode, YARN RM. Three slave nodes, each running: YARN NM, datanode]
![Page 39: Hadoop databases for oracle DBAs](https://reader036.fdocuments.in/reader036/viewer/2022062412/58ef18901a28abd15e8b4689/html5/thumbnails/39.jpg)
Apache Impala
• Designed for “quick interactive” queries
• “Data-local” execution
• In-memory processing
[Diagram: Master: statestored, catalogd, namenode. Slave A, Slave B, Slave C, each running: impalad, datanode]
![Page 40: Hadoop databases for oracle DBAs](https://reader036.fdocuments.in/reader036/viewer/2022062412/58ef18901a28abd15e8b4689/html5/thumbnails/40.jpg)
Apache Spark
• “Better Hadoop” with “native”: SQL, MLlib, GraphX
• In-memory processing, based on RDDs
• Supports many clusters: “native”, YARN, Mesos
• Flexible programming model
[Diagram: Master: Driver. Slave A, Slave B, Slave C, each running: Executor]
![Page 41: Hadoop databases for oracle DBAs](https://reader036.fdocuments.in/reader036/viewer/2022062412/58ef18901a28abd15e8b4689/html5/thumbnails/41.jpg)
Presto
• Designed for “interactive” queries
• In-memory processing
• Custom storage “plugins”: Hive, Kafka, MySql, Postgres, …
[Diagram: Master: coordinator. Slave A, Slave B, Slave C, each running: worker]
![Page 42: Hadoop databases for oracle DBAs](https://reader036.fdocuments.in/reader036/viewer/2022062412/58ef18901a28abd15e8b4689/html5/thumbnails/42.jpg)
How to start
![Page 43: Hadoop databases for oracle DBAs](https://reader036.fdocuments.in/reader036/viewer/2022062412/58ef18901a28abd15e8b4689/html5/thumbnails/43.jpg)
Step 1: Google “Hadoop ecosystem”
![Page 44: Hadoop databases for oracle DBAs](https://reader036.fdocuments.in/reader036/viewer/2022062412/58ef18901a28abd15e8b4689/html5/thumbnails/44.jpg)
Step 2: Try to install the simplest thing
![Page 45: Hadoop databases for oracle DBAs](https://reader036.fdocuments.in/reader036/viewer/2022062412/58ef18901a28abd15e8b4689/html5/thumbnails/45.jpg)
Step 3
![Page 46: Hadoop databases for oracle DBAs](https://reader036.fdocuments.in/reader036/viewer/2022062412/58ef18901a28abd15e8b4689/html5/thumbnails/46.jpg)
Step 4
![Page 47: Hadoop databases for oracle DBAs](https://reader036.fdocuments.in/reader036/viewer/2022062412/58ef18901a28abd15e8b4689/html5/thumbnails/47.jpg)
Hint: Nobody builds their own Linux anymore
![Page 48: Hadoop databases for oracle DBAs](https://reader036.fdocuments.in/reader036/viewer/2022062412/58ef18901a28abd15e8b4689/html5/thumbnails/48.jpg)
Choose a Hadoop distribution that suits you
![Page 49: Hadoop databases for oracle DBAs](https://reader036.fdocuments.in/reader036/viewer/2022062412/58ef18901a28abd15e8b4689/html5/thumbnails/49.jpg)
Hadoop distributions
• Pre-built and pre-integrated (aka: all things work out of the box)
• Each has its own “philosophy” …
• … As well as a preferred Hadoop database
![Page 50: Hadoop databases for oracle DBAs](https://reader036.fdocuments.in/reader036/viewer/2022062412/58ef18901a28abd15e8b4689/html5/thumbnails/50.jpg)
So what’s in it for me?
• It’s interesting (cool technology that hits many recent buzzwords)
• If you know ORACLE, it’s close to your skill set
• It’s promising and future-oriented
![Page 51: Hadoop databases for oracle DBAs](https://reader036.fdocuments.in/reader036/viewer/2022062412/58ef18901a28abd15e8b4689/html5/thumbnails/51.jpg)
Q&A
![Page 52: Hadoop databases for oracle DBAs](https://reader036.fdocuments.in/reader036/viewer/2022062412/58ef18901a28abd15e8b4689/html5/thumbnails/52.jpg)
Please Complete Your Session Evaluation Evaluate this session in your COLLABORATE app. Pull up this session and tap "Session Evaluation" to complete the survey.