Connecting Hadoop and Oracle - DOAG Deutsche ORACLE ... (transcript)
gluent.com 2
Intro: About me
• Tanel Põder
• Oracle Database Performance geek (18+ years)
• Exadata Performance geek
• Linux Performance geek
• Hadoop Performance geek
• CEO & co-founder:
Expert Oracle Exadata book (2nd edition is out now!)
Instant promotion
[Diagram: Gluent as a data virtualization layer - connecting Oracle, Teradata, NoSQL, MSSQL and applications (App X, App Y, App Z) to Big Data sources via Open Data Formats!]
Gluent Advisor (Database Offload)
1. Analyzes DB storage use and access patterns for safe offloading
2. 500+ databases analyzed
3. 10+ PB analyzed - 81% offloadable
4. 2-24x query speedup
Interested in analyzing your database?
Why Hadoop?
• Scalability in Software!
• Open Data Formats
• Future-proof!
• One Data, Many Engines! (Hive, Impala, Spark, Presto, Giraph)
SQL-on-Hadoop is only one of many applications of Hadoop: graph processing, fulltext search, image processing, log processing, streams
All processing can be close to data!
• One of the mantras of Hadoop: "Moving computation is cheaper than moving data"
• Hadoop data processing frameworks hide the complexity of placing the computation of data on their local (or closest) cluster nodes
• No super-expensive interconnect network, MPI & scheduling complexity
• No need for shared storage (expensive SAN + storage arrays!)
• Now you can build a supercomputer from commodity pizza boxes!
[Diagram: Node 1 .. Node N, each with local disks, each running "your code" under a distributed parallel execution framework - Impala, Hive, Spark, HBase/coprocessors, etc.]
One Data, Many Engines!
• Decoupling storage from compute + open data formats = flexible, future-proof data systems!
[Diagram: storage layers (HDFS holding Parquet/Avro/XML/CSV, Amazon S3 holding Parquet/web logs, Kudu column-store) accessed by many engines - Impala SQL, Hive, Solr/Search, Spark, MapReduce - via libraries such as libparquet and libkudu]
3 major big data plumbing scenarios
1. Load data from Hadoop to Oracle
2. Offload data from Oracle to Hadoop
3. Query Hadoop data in Oracle
A big difference between data copy and on-demand data query tools
Use Case: Move Data to/from Hadoop: Sqoop
• A parallel JDBC <-> HDFS data copy tool
• Apache 2.0 licensed (like most other Hadoop components)
• Generates MapReduce jobs that connect to Oracle with JDBC
[Diagram: Sqoop launches parallel MapReduce jobs that connect to the Oracle DB over JDBC and write files into Hadoop/HDFS; Hive metadata describes the resulting files]
Example: Copy Data to Hadoop with Sqoop
• Sqoop is "just" a JDBC DB client capable of writing to HDFS
• It is very flexible

sqoop import --connect jdbc:oracle:thin:@oel6:1521/LIN112 --username system \
  --null-string '' --null-non-string '' \
  --target-dir=/user/impala/offload/tmp/ssh/sales \
  --append -m1 --fetch-size=5000 \
  --fields-terminated-by ',' --lines-terminated-by '\n' \
  --optionally-enclosed-by '\"' --escaped-by '\"' \
  --split-by TIME_ID \
  --query "\"SELECT * FROM SSH.SALES WHERE TIME_ID < TO_DATE('1998-01-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS','NLS_CALENDAR=GREGORIAN') AND \$CONDITIONS\""

Just an example of sqoop syntax
Sqoop use cases
1. Copy data from Oracle to Hadoop
   • Entire schemas, tables, partitions or any SQL query result
2. Copy data from Hadoop to Oracle
   • Read HDFS files, convert to JDBC arrays and insert
3. Copy changed rows from Oracle to Hadoop
   • Sqoop supports adding WHERE syntax to table reads:
     WHERE last_chg > TIMESTAMP'2015-01-10 12:34:56'
     WHERE ora_rowscn > 123456789
Sqoop with Oracle performance optimizations - 1
• Is called OraOOP
• Was a separate module by Guy Harrison's team at Quest (Dell)
• Included in standard Sqoop v1.4.5+ (and coming to Sqoop 2 too)
• sqoop import --direct …
• Parallel Sqoop Oracle reads done by block ranges or partitions
  • Previously required wide index range scans to parallelize the workload
• Reads Oracle data with full table scans and direct path reads
  • Much less IO, CPU and buffer cache impact on production databases
Sqoop with Oracle performance optimizations - 2
• Data loading into Oracle via direct path insert or merge
  • Bypasses the buffer cache and most redo/undo generation
• HCC compression on-the-fly for loaded tables
  • ALTER TABLE fact COMPRESS FOR QUERY HIGH
  • All following direct path inserts will be compressed
• Sqoop is probably the best tool for the data copy/move use case
  • For batch data movement needs
  • If used with the Oracle optimizations
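The compress-then-load sequence above can be sketched as follows. This is a hedged illustration: the staging_fact source table is hypothetical, and QUERY HIGH compression requires HCC-capable storage (e.g. Exadata).

```sql
-- Set the table's default compression so that future direct path
-- loads write HCC-compressed blocks (ALTER TABLE is from the slide;
-- staging_fact is an illustrative source table)
ALTER TABLE fact COMPRESS FOR QUERY HIGH;

-- Direct path (APPEND) insert: bypasses the buffer cache, generates
-- minimal undo, and lands as HCC-compressed data
INSERT /*+ APPEND */ INTO fact
SELECT * FROM staging_fact;
COMMIT;
```

Conventional (non-APPEND) inserts into the same table would not be compressed; only direct path loads take the HCC code path.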
What about real-time change replication?
• Oracle Change Data Capture
  • Supported in 11.2 - but not recommended by Oracle anymore
  • Desupported in 12.1
• Oracle GoldenGate or Dbvisit Replicate
• Some manual work/scripting involved with all these tools:
  1. Sqoop the source table snapshots to Hadoop
  2. Replicate changes to "delta" tables on HDFS or in HBase
  3. Use a "merge" view on Hive/Impala to merge the base table with the delta
"Applying" updates/changes in Hadoop
• HDFS files are append-only by design (scalability reasons)
  • No random writes or changing of existing HDFS file data
• Base table + delta table approach:
  1. A base table contains the original (batch offloaded) data
  2. A delta table contains any new versions of data
  3. A merge view performs an outer join to show the latest data from the base table or the delta table (if any)
  4. Optional "compaction" - merging base + delta together in the background
• Hive ACID transactions use this approach under the hood
• Cloudera's Kudu project uses a similar concept without HDFS
  • …but deltas can be in memory and stored in flash (if available)
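A minimal sketch of such a merge view in Hive/Impala SQL. The table and column names (sales_base, sales_delta, sale_id) are hypothetical; a production version would also need to pick the newest delta version per key and handle deletes.

```sql
-- Base rows that have no newer version in the delta table ...
CREATE VIEW sales_merged AS
SELECT b.*
FROM   sales_base b
       LEFT OUTER JOIN sales_delta d ON b.sale_id = d.sale_id
WHERE  d.sale_id IS NULL
UNION ALL
-- ... plus every row from the delta table (the latest versions)
SELECT d.*
FROM   sales_delta d;
```

Background "compaction" then periodically rewrites sales_base from sales_merged and truncates the delta, keeping the anti-join cheap.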
References: Incremental data updates in Hadoop
• Cloudera (Impala)
  http://blog.cloudera.com/blog/2015/11/how-to-ingest-and-query-fast-data-with-impala-without-kudu/
• Cloudera (Kudu)
  https://blog.cloudera.com/blog/2015/09/kudu-new-apache-hadoop-storage-for-fast-analytics-on-fast-data/
• Hortonworks (Hive)
  http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.0/bk_dataintegration/content/incrementally-updating-hive-table-with-sqoop-and-ext-table.html
Use Case: Query Hadoop data as if it was in the (Oracle) RDBMS
The ultimate goal (for me at least :)
• Low cost and scalability of Hadoop…
• …together with the sophistication of the Oracle SQL engine
• Keep using your existing (Oracle) applications without re-engineering & rewriting SQL
What do SQL queries do?
1. Retrieve data
   • Disk reads, decompression, filtering, column extraction & pruning
2. Process data
   • Join, parallel distribute, aggregate, sort
3. Return results
   • Final result column projection, send data back over network… or dump to disk
How does the query offloading work? (Ext. tab)

-----------------------------------------------------------
| Id | Operation                      | Name              |
-----------------------------------------------------------
|  0 | SELECT STATEMENT               |                   |
|  1 |  SORT ORDER BY                 |                   |
|* 2 |   FILTER                       |                   |
|  3 |    HASH GROUP BY               |                   |
|* 4 |     HASH JOIN                  |                   |
|* 5 |      HASH JOIN                 |                   |
|* 6 |       EXTERNAL TABLE ACCESS FULL| CUSTOMERS_EXT    |
|* 7 |       EXTERNAL TABLE ACCESS FULL| ORDERS_EXT       |
|  8 |      EXTERNAL TABLE ACCESS FULL | ORDER_ITEMS_EXT  |
-----------------------------------------------------------

Ids 6-8 are data retrieval nodes; the plan nodes above them are data processing nodes.
• Data retrieval nodes do what they need to produce rows/cols
  • Read Oracle data files, read external tables, call ODBC, read memory
• In the rest of the execution plan, everything works as usual
How does the query offloading work?
• Data retrieval nodes produce rows/columns
• Data processing nodes consume rows/columns
• Data processing nodes don't care how and where the rows were read from; the data flow format is always the same.
How does the query offloading work? (ODBC)

--------------------------------------------------------
| Id | Operation        | Name        | Inst   |IN-OUT|
--------------------------------------------------------
|  0 | SELECT STATEMENT |             |        |      |
|  1 |  SORT ORDER BY   |             |        |      |
|* 2 |   FILTER         |             |        |      |
|  3 |    HASH GROUP BY |             |        |      |
|* 4 |     HASH JOIN    |             |        |      |
|* 5 |      HASH JOIN   |             |        |      |
|  6 |       REMOTE     | orders      | IMPALA | R->S |
|  7 |       REMOTE     | customers   | IMPALA | R->S |
|  8 |      REMOTE      | order_items | IMPALA | R->S |
--------------------------------------------------------

Remote SQL Information (identified by operation id):
----------------------------------------------------
 6 - SELECT `order_id`,`order_mode`,`customer_id`,`order_status` FROM `soe`.`orders` WHERE `order_mode`='online' AND `order_status`=5 (accessing 'IMPALA.LOCALDOMAIN')
 7 - SELECT `customer_id`,`cust_first_name`,`cust_last_name`,`nls_territory`,`credit_limit` FROM `soe`.`customers` WHERE `nls_territory` LIKE 'New%' (accessing 'IMPALA.LOCALDOMAIN')
 8 - SELECT `order_id`,`unit_price`,`quantity` FROM `soe`.`order_items` (accessing 'IMPALA.LOCALDOMAIN')

The REMOTE operations are the data retrieval nodes: the remote Hadoop SQL runs in Impala. Data processing above them is business as usual.
Dimensional DW design

[Diagram: DW fact tables in terabytes, partitioned by RANGE (order_date) and sub-partitioned by HASH (customer_id); dimension tables in gigabytes; months to years of history, with the "hotness" of data decreasing with age]

• Fact tables are time-partitioned; old fact data is rarely updated
• After multiple joins on dimension tables, a full scan is done on the fact table
• Some filter predicates apply directly to the fact table (time range, country code, etc.)
DW Data Offload to Hadoop

[Diagram: the time-partitioned fact table keeps hot data and dimension tables on expensive storage in the database, while cold data is offloaded to cheap, scalable storage across Hadoop nodes (e.g. HDFS + Hive/Impala)]
1. Copy Data to Hadoop with Sqoop
• Sqoop is "just" a JDBC DB client capable of writing to HDFS
• It is very flexible

sqoop import --connect jdbc:oracle:thin:@oel6:1521/LIN112 --username system \
  --null-string '' --null-non-string '' \
  --target-dir=/user/impala/offload/tmp/ssh/sales \
  --append -m1 --fetch-size=5000 \
  --fields-terminated-by ',' --lines-terminated-by '\n' \
  --optionally-enclosed-by '\"' --escaped-by '\"' \
  --split-by TIME_ID \
  --query "\"SELECT * FROM SSH.SALES WHERE TIME_ID < TO_DATE('1998-01-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS','NLS_CALENDAR=GREGORIAN') AND \$CONDITIONS\""

Just an example of sqoop syntax
2. Load Data into Hive/Impala in read-optimized format
• The additional external table step gives better flexibility when loading into the target read-optimized table

CREATE TABLE IF NOT EXISTS SSH_tmp.SALES_ext (
    PROD_ID bigint, CUST_ID bigint, TIME_ID timestamp, CHANNEL_ID bigint,
    PROMO_ID bigint, QUANTITY_SOLD bigint, AMOUNT_SOLD bigint
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY '"' LINES TERMINATED BY '\n'
STORED AS TEXTFILE LOCATION '/user/impala/offload/tmp/ssh/sales';

INSERT INTO SSH.SALES PARTITION (year)
SELECT t.*, CAST(YEAR(time_id) AS SMALLINT) FROM SSH_tmp.SALES_ext t;
3a. Query Data via a DB Link / HS / ODBC
• (I'm not covering the ODBC driver install and config here)

SQL> EXPLAIN PLAN FOR SELECT COUNT(*) FROM ssh.sales@impala;
Explained.

SQL> SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
--------------------------------------------------------------
| Id | Operation        | Name | Cost (%CPU)| Inst   |IN-OUT|
--------------------------------------------------------------
|  0 | SELECT STATEMENT |      |      0 (0) |        |      |
|  1 |  REMOTE          |      |            | IMPALA | R->S |
--------------------------------------------------------------

Remote SQL Information (identified by operation id):
----------------------------------------------------
 1 - SELECT COUNT(*) FROM `SSH`.`SALES` A1 (accessing 'IMPALA')

The entire query gets sent to Impala thanks to Oracle Heterogeneous Services
3b. Query Data via Oracle SQL Connector for HDFS

CREATE TABLE SSH.SALES_DP
( "PROD_ID"       NUMBER,
  "CUST_ID"       NUMBER,
  "TIME_ID"       DATE,
  "CHANNEL_ID"    NUMBER,
  "PROMO_ID"      NUMBER,
  "QUANTITY_SOLD" NUMBER,
  "AMOUNT_SOLD"   NUMBER
)
ORGANIZATION EXTERNAL
( TYPE ORACLE_LOADER
  DEFAULT DIRECTORY "OFFLOAD_DIR"
  ACCESS PARAMETERS
  ( external variable data
    PREPROCESSOR "OSCH_BIN_PATH":'hdfs_stream'
  )
  LOCATION
  ( 'osch-tanel-00000', 'osch-tanel-00001',
    'osch-tanel-00002', 'osch-tanel-00003',
    'osch-tanel-00004', 'osch-tanel-00005' )
)
REJECT LIMIT UNLIMITED
NOPARALLEL;

• The "external variable data" option says it's a Data Pump format input stream
• hdfs_stream is the Java HDFS client app installed on your Oracle server as part of the Oracle SQL Connector for HDFS
• The location files are small .xml config files created by Oracle Loader for Hadoop (telling where in HDFS our files are)
3b. Query Data via Oracle SQL Connector for HDFS
• External table access is parallelizable!
  • Multiple HDFS location files must be created for PX (done by the loader)

SQL> SELECT /*+ PARALLEL(4) */ COUNT(*) FROM ssh.sales_dp;

----------------------------------------------------------------------------------
| Id | Operation                       | Name     |    TQ |IN-OUT| PQ Distrib |
----------------------------------------------------------------------------------
|  0 | SELECT STATEMENT                |          |       |      |            |
|  1 |  SORT AGGREGATE                 |          |       |      |            |
|  2 |   PX COORDINATOR                |          |       |      |            |
|  3 |    PX SEND QC (RANDOM)          | :TQ10000 | Q1,00 | P->S | QC (RAND)  |
|  4 |     SORT AGGREGATE              |          | Q1,00 | PCWP |            |
|  5 |      PX BLOCK ITERATOR          |          | Q1,00 | PCWC |            |
|  6 |       EXTERNAL TABLE ACCESS FULL| SALES_DP | Q1,00 | PCWP |            |
----------------------------------------------------------------------------------

• Each Oracle parallel slave can access a different Data Pump file on HDFS
• Filter predicates are not offloaded; all data/columns are read!
3c. Query Data via Big Data SQL

SQL> CREATE TABLE movielog_plus2 (click VARCHAR2(40))
     ORGANIZATION EXTERNAL
     (TYPE ORACLE_HDFS
      DEFAULT DIRECTORY DEFAULT_DIR
      ACCESS PARAMETERS (
        com.oracle.bigdata.cluster=bigdatalite
        com.oracle.bigdata.overflow={"action":"truncate"}
      )
      LOCATION ('/user/oracle/moviework/applog_json/')
     )
     REJECT LIMIT UNLIMITED;

• This is just one simple example: ORACLE_HIVE would use Hive metadata to figure out data location and structure (but the storage cells do the disk reading from HDFS directly)
4. Gluent Query performance win!

[Screenshot: CPU usage of an Oracle-only query vs. the same query with Gluent - Gluent pushes the CPU heavy-lifting to Hadoop]

Demo video available at http://gluent.com/request-a-demo/
How to query the full data set?
• UNION ALL
  • Latest partitions in Oracle: SALES table
  • Offloaded data in Hadoop: SALES@impala or SALES_DP ext tab

CREATE VIEW app_reporting_user.SALES AS
SELECT PROD_ID, CUST_ID, TIME_ID, CHANNEL_ID, PROMO_ID, QUANTITY_SOLD, AMOUNT_SOLD
FROM   SSH.SALES
WHERE  TIME_ID >= TO_DATE(' 1998-07-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS')
UNION ALL
SELECT "prod_id", "cust_id", "time_id", "channel_id", "promo_id", "quantity_sold", "amount_sold"
FROM   SSH.SALES@impala
WHERE  "time_id" < TO_DATE(' 1998-07-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS');
Hybrid table with selective offloading

[Diagram: a union-all SALES view combining hot partitions in Oracle (TABLE ACCESS FULL on expensive storage) with offloaded cold data on Hadoop (EXTERNAL TABLE ACCESS / DB LINK to cheap, scalable storage, e.g. HDFS + Hive/Impala), feeding a HASH JOIN, GROUP BY and SELECT STATEMENT]

SELECT c.cust_gender, SUM(s.amount_sold)
FROM   ssh.customers c, sales_v s
WHERE  c.cust_id = s.cust_id
GROUP BY c.cust_gender
Partially Offloaded Execution Plan

--------------------------------------------------------------------------
| Id | Operation                    | Name      |Pstart| Pstop | Inst   |
--------------------------------------------------------------------------
|  0 | SELECT STATEMENT             |           |      |       |        |
|  1 |  HASH GROUP BY               |           |      |       |        |
|* 2 |   HASH JOIN                  |           |      |       |        |
|  3 |    PARTITION RANGE ALL       |           |    1 |    16 |        |
|  4 |     TABLE ACCESS FULL        | CUSTOMERS |    1 |    16 |        |
|  5 |    VIEW                      | SALES_V   |      |       |        |
|  6 |     UNION-ALL                |           |      |       |        |
|  7 |      PARTITION RANGE ITERATOR|           |    7 |    68 |        |
|  8 |       TABLE ACCESS FULL      | SALES     |    7 |    68 |        |
|  9 |      REMOTE                  | SALES     |      |       | IMPALA |
--------------------------------------------------------------------------

 2 - access("C"."CUST_ID"="S"."CUST_ID")

Remote SQL Information (identified by operation id):
----------------------------------------------------
 9 - SELECT `cust_id`,`time_id`,`amount_sold` FROM `SSH`.`SALES` WHERE `time_id`<'1998-07-01 00:00:00' (accessing 'IMPALA')
Performance
• Vendor X: "We load N terabytes per hour"
• Vendor Y: "We load N billion rows per hour"
• Q: What data types?
  • So much cheaper to load fixed CHARs
  • …but that's not what your real data model looks like
• Q: How wide are the rows?
  • 1 wide varchar column or 500 numeric columns?
• Data type conversion is CPU hungry!
Make sure you know where in the stack you'll be paying the price.
And is it once, or always when accessed?
Performance: Data Retrieval Latency (time to first row/byte)
• Cloudera Impala
  • Low latency (subsecond)
• Hive
  • Wait until Hive 2.0 is available (Hortonworks HDP 2.5, Q3 2016)
  • Hive 2.0 brings interactive queries with LLAP (Live Long and Process)
  • Otherwise you'll wait for jobs and JVMs to start up for every query
  • With LLAP, JVM containers will not be started for simple scans/filters
• Oracle SQL Connector for HDFS
  • Multi-second latency due to hdfs_stream JVM startup
• Oracle Big Data SQL
  • Low latency thanks to "Exadata storage cell software on Hadoop"
Big Data SQL Performance Considerations
• Only data retrieval (the TABLE ACCESS FULL etc.) is offloaded!
  • Scanning, filtering, projection pushdown
• All data processing still happens in the DB layer (on DB CPUs)
  • GROUP BY, JOIN, ORDER BY, analytic functions, PL/SQL etc.
• Just like on Exadata…
  • Storage cells only speed up data retrieval
Thanks!
We liberate enterprise data!
http://gluent.com

Also, the usual:
http://tanelpoder.com
[email protected]