Deep-Dive into Big Data ETL with ODI12c and Oracle Big Data Connectors

Mark Rittman, CTO, Rittman Mead
Oracle OpenWorld 2014, San Francisco
T: +44 (0) 1273 911 268 (UK), (888) 631-1410 (USA), +61 3 9596 7186 (Australia & New Zealand), +91 997 256 7970 (India)
E: [email protected]  W: www.rittmanmead.com

Description

Presented at Oracle OpenWorld 2014: a look at the ETL process within a Hadoop cluster, how data gets in, out and around, and how ODI12c and Oracle's Big Data Connectors can be used for this process.

Transcript of Deep-Dive into Big Data ETL with ODI12c and Oracle Big Data Connectors

1. Deep-Dive into Big Data ETL with ODI12c and Oracle Big Data Connectors
Mark Rittman, CTO, Rittman Mead
Oracle OpenWorld 2014, San Francisco

2. About the Speaker
- Mark Rittman, Co-Founder of Rittman Mead
- Oracle ACE Director, specialising in Oracle BI & DW
- 14 years' experience with Oracle technology
- Regular columnist for Oracle Magazine
- Author of two Oracle Press BI books: Oracle Business Intelligence Developers Guide and Oracle Exalytics Revealed
- Writer for the Rittman Mead blog: http://www.rittmanmead.com/blog
- Email: [email protected]; Twitter: @markrittman

3. About Rittman Mead
- Oracle BI and DW Gold partner
- Winner of five UKOUG Partner of the Year awards in 2013, including BI
- World-leading specialist partner for technical excellence, solutions delivery and innovation in Oracle BI
- Approximately 80 consultants worldwide, all expert in Oracle BI and DW
- Offices in the US (Atlanta), Europe, Australia and India
- Skills in a broad range of supporting Oracle tools: OBIEE, OBIA, ODI EE, Essbase, Oracle OLAP, GoldenGate, Endeca

4. Traditional Data Warehouse / BI Architectures
- Three-layer architecture: staging, foundation and access/performance
- All three layers stored in a relational database (Oracle)
- ETL used to move data from layer to layer
- (Diagram: a traditional relational data warehouse, with data loads from structured sources into staging, foundation/ODS and performance/dimensional layers, feeding a BI tool (OBIEE) with a metadata layer and an OLAP / in-memory tool loading data into its own database.)

5. Introducing Hadoop
- A new approach to data processing and data storage
- Rather than a small number of large, powerful servers, it spreads processing over large numbers of small, cheap, redundant servers
- Spreads the data you're processing over lots of distributed nodes
- Has a scheduling/workload process (the Job Tracker) that sends parts of a job to each of the nodes, a bit like Oracle Parallel Execution
- And does the processing where the data sits, a bit like Exadata storage servers
- Shared-nothing architecture; low-cost and highly horizontally scalable
- (Diagram: a Job Tracker distributing work to Task Trackers co-located with Data Nodes.)

6. Hadoop Tenets: Simplified Distributed Processing
- Hadoop, through MapReduce, breaks processing down into simple stages
- Map: select the columns and values you're interested in, and pass them through as key/value pairs
- Reduce: aggregate the results
- Most ETL jobs can be broken down into filtering, projecting and aggregating
- Hadoop then automatically runs the job on the cluster: shared-nothing small chunks of work, run on the node where the data is, handling faults, gathering the results back in
- (Diagram: mappers filter and project; reducers aggregate; output is one HDFS file per reducer, in a directory.)
- A minimal Hadoop Streaming sketch of this filter/project/aggregate pattern follows below
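The filter/project/aggregate tenet can be illustrated with a Hadoop Streaming job, which lets any executable act as mapper and reducer over stdin/stdout. This is a minimal, hedged sketch rather than anything from the deck: the streaming JAR path is the usual CDH location, and the input is assumed to be tab-separated log records with the page URL in the first field.

# Map: project the first tab-separated field (the page). The framework then
# sorts the mapper output by key, so the reducer can aggregate with a simple
# consecutive count.
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
  -input /user/oracle/incoming/logs \
  -output /user/oracle/page_counts \
  -mapper 'cut -f1' \
  -reducer 'uniq -c'
# Result: one HDFS file per reducer, under /user/oracle/page_counts/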
7. HDFS: Low-Cost, Clustered, Fault-Tolerant Storage
- The filesystem behind Hadoop, used to store data for Hadoop analysis
- Unix-like, uses commands such as ls, mkdir, chown, chmod
- Fault-tolerant, with rapid fault detection and recovery
- High-throughput, with streaming data access and large block sizes
- Designed for data locality, placing data close to where it is processed
- Accessed from the command line, via the hdfs:// URL scheme, GUI tools etc

[oracle@bigdatalite mapreduce]$ hadoop fs -mkdir /user/oracle/my_stuff
[oracle@bigdatalite mapreduce]$ hadoop fs -ls /user/oracle
Found 5 items
drwx------ - oracle hadoop 0 2013-04-27 16:48 /user/oracle/.staging
drwxrwxrwx - oracle hadoop 0 2012-09-18 17:02 /user/oracle/moviedemo
drwxrwxrwx - oracle hadoop 0 2012-10-17 15:58 /user/oracle/moviework
drwxrwxrwx - oracle hadoop 0 2013-05-03 17:49 /user/oracle/my_stuff
drwxrwxrwx - oracle hadoop 0 2012-08-10 16:08 /user/oracle/stage

8. Oracle's Big Data Products
- Oracle Big Data Appliance: Engineered System for big data acquisition and processing, comprising the Cloudera Distribution of Hadoop, Cloudera Manager, open-source R, Oracle NoSQL Database Community Edition, and Oracle Enterprise Linux + the Oracle JVM; new: Oracle Big Data SQL
- Oracle Big Data Connectors: Oracle Loader for Hadoop (Hadoop > Oracle RDBMS), Oracle Direct Connector for HDFS (HDFS > Oracle RDBMS), Oracle Data Integration Adapter for Hadoop, Oracle R Connector for Hadoop
- Oracle NoSQL Database (key/value-store DB based on BerkeleyDB)

9. Moving Data In, Around and Out of Hadoop
- Three stages to Hadoop ETL work, with dedicated Apache / other tools
- Load: receive files in batch, or in real time (logs, events); see the HDFS staging sketch after slide 11 below
- Transform: process and transform data to answer questions
- Store / Export: store in structured form, or export to an RDBMS using Sqoop
- (Diagram: real-time logs/events, file/unstructured imports and RDBMS imports feeding the loading stage, then the processing stage, then the store/export stage producing file and RDBMS exports.)

10. ETL Offloading
- Special use case: offloading low-value, simple ETL work to a Hadoop cluster
- Receiving, aggregating, filtering and pre-processing data for an RDBMS data warehouse
- Potentially frees up high-value Exadata / RDBMS servers for analytic work

11. Core Apache Hadoop Tools
- Apache Hadoop, including MapReduce and HDFS: scalable, fault-tolerant file storage and a parallel programming framework for Hadoop
- Apache Hive: SQL abstraction layer over HDFS; perform set-based ETL within Hadoop
- Apache Pig, Spark: dataflow-type languages over HDFS, Hive etc; extensible through UDFs, streaming etc
- Apache Flume, Apache Sqoop, Apache Kafka: real-time and batch loading into HDFS; modular, fault-tolerant, wide source/target coverage
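The batch side of slide 9's Load stage often amounts to nothing more than staging files into an HDFS landing directory with the filesystem shell. A hedged sketch, with hypothetical paths:

# Stage a day's Apache log files into an HDFS landing directory
hadoop fs -mkdir -p /user/oracle/incoming/2014-09-29
hadoop fs -put /var/log/httpd/access_log* /user/oracle/incoming/2014-09-29/
hadoop fs -ls /user/oracle/incoming/2014-09-29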
12. Hive as the Hadoop Data Warehouse
- MapReduce jobs are typically written in Java, but Hive can make this simpler
- Hive is a query environment over Hadoop/MapReduce to support SQL-like queries
- The Hive server accepts HiveQL queries via Hive ODBC or Hive JDBC, and automatically creates MapReduce jobs against data previously loaded into the Hive HDFS tables
- Approach used by ODI and OBIEE to gain access to Hadoop data
- Allows Hadoop data to be accessed just like any other data source (sort of...)

13. How Hive Provides SQL Access over Hadoop
- Hive uses an RDBMS metastore to hold table and column definitions in schemas
- Hive tables then map onto HDFS-stored files: managed tables and external tables
- Oracle-like query optimizer, compiler and executor
- JDBC and ODBC drivers, plus a CLI etc
- (Diagram: the Hive driver (compile, optimize, execute) and metastore; managed tables live under /user/hive/warehouse/, with HDFS or local files loaded into the Hive HDFS area using the HiveQL CREATE TABLE command; external tables, e.g. under /user/oracle/ or /user/movies/data/, are loaded into HDFS by an external process, then mapped into Hive using the CREATE EXTERNAL TABLE command.)
- A hedged HiveQL sketch of the two table types follows after slide 17 below

14. Oracle Loader for Hadoop
- Oracle technology for accessing Hadoop data, and loading it into an Oracle database
- Pushes data transformation and heavy lifting to the Hadoop cluster, using MapReduce
- Direct-path loads into Oracle Database, partitioned and non-partitioned
- Online and offline loads
- Key technology for fast load of Hadoop results into an Oracle DB

15. Oracle Direct Connector for HDFS
- Enables HDFS as a data source for Oracle Database external tables
- Effectively provides Oracle SQL access over HDFS
- Supports data query, or import into an Oracle DB
- Treat HDFS-stored files in the same way as regular files, but with HDFS's low cost and fault tolerance

16. Oracle R Advanced Analytics for Hadoop
- Add-in to R that extends its capability to Hadoop
- Gives R the ability to create Map and Reduce functions
- Extends R data frames to include Hive tables
- Automatically runs R functions on Hadoop by using Hive tables as a source

17. Just Released: Oracle Big Data SQL
- Part of Oracle Big Data 4.0 (BDA-only); also requires Oracle Database 12c and an Oracle Exadata Database Machine
- Extends the Oracle data dictionary to cover Hive
- Extends Oracle SQL and SmartScan to Hadoop
- Extends the Oracle security model over Hadoop: fine-grained access control, data redaction, data masking
- (Diagram: SQL queries spanning the Exadata database server, Exadata storage servers and the Hadoop cluster, with SmartScan running on both sides via Oracle Big Data SQL.)
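To make slide 13's managed/external distinction concrete, here is a hedged HiveQL sketch run from the shell; the table names, columns and HDFS paths are hypothetical:

hive -e "
-- Managed table: Hive owns the data, storing it under /user/hive/warehouse/
CREATE TABLE movie_ratings (movie_id INT, rating INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- LOAD DATA moves the HDFS file into the Hive warehouse area
LOAD DATA INPATH '/user/oracle/incoming/ratings.txt' INTO TABLE movie_ratings;

-- External table: the files stay where an external process put them,
-- and DROP TABLE removes only the metadata
CREATE EXTERNAL TABLE raw_logs (log_line STRING)
LOCATION '/user/oracle/incoming/2014-09-29';
"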
18. Bringing it All Together: Oracle Data Integrator 12c
- ODI provides an excellent framework for running Hadoop ETL jobs
- The ELT approach pushes transformations down to Hadoop, leveraging the power of the cluster
- Hive, HBase, Sqoop and OLH/ODCH KMs provide native Hadoop loading / transformation, whilst still preserving RDBMS push-down
- Extensible to cover Pig, Spark etc
- Process orchestration; data quality / error handling; metadata and model-driven

19. The Key to ODI Extensibility: Knowledge Modules
- Divides the ETL process into separate steps: extract (load), integrate, check constraints etc
- ODI generates native code for each platform, taking a template for each step and adding table names, column names, join conditions etc
- Easy to extend, and easy to read the code
- Makes it possible for ODI to support Spark, Pig etc in future
- Uses the power of the target platform for integration tasks: Hadoop-native ETL

20. Part of the Wider Oracle Data Integration Platform
- Oracle Data Integrator for large-scale data integration across heterogeneous sources and targets
- Oracle GoldenGate for heterogeneous data replication and changed data capture
- Oracle Enterprise Data Quality for data profiling and cleansing
- Oracle Data Services Integrator for SOA message-based data federation

21. ODI and Big Data Integration Example
- In this example, we'll show an end-to-end ETL process on Hadoop using ODI12c and the BDA
- Scenario: load webserver log data into Hadoop; process, enhance and aggregate it; then load the final summary table into Oracle Database 12c
- Process using the Hadoop framework, and leverage the Big Data Connectors
- Metadata-based ETL development using ODI12c
- Real-world example

22. ETL & Data Flow through the BDA System
- Five-step process to load, transform, aggregate and filter incoming log data
- Leverage ODI's capabilities where possible; make use of Hadoop's power and scalability
- (Diagram: (1) Flume agents ship Apache HTTP Server log files over TCP port 4545 (example) into HDFS, loaded into the hive_raw_apache_access_log Hive table by IKM File to Hive using a RegEx SerDe; (2) IKM Hive Control Append joins it to the posts Hive table, loading log_entries_and_post_detail; (3) a Sqoop extract stages categories_sql_extract, joined in with another IKM Hive Control Append; (4) IKM Hive Transform streams the rows through a Python script against a geocoding IP>Country list Hive table; (5) IKM File / Hive to Oracle bulk-unloads the result to the Oracle DB.)

23. ETL Considerations: Using Hive vs. Regular Oracle SQL
- Not all join types are available in Hive: joins must be equality joins
- No sequences, and no primary keys on tables
- Generally need to stage Oracle or other external data into Hive before joining to it
- Hive latency makes it a poor fit for small microbatch-type work, but other alternatives exist: Spark, Impala etc
- Hive is INSERT / APPEND only, with no updates, deletes etc, but HBase may be suitable for CRUD-type loading
- Don't assume that HiveQL == Oracle SQL; test assumptions before committing to the platform

24. Five-Step ETL Process
1. Take the incoming log files (via Flume) and load them into a structured Hive table
2. Enhance data from that table to include details on authors and posts from other Hive tables
3. Join to some additional reference data held in an Oracle database, to add author details
4. Geocode the log data, so that we have the country for each calling IP address
5. Output the data in summary form to an Oracle database

25. Using Flume to Transport Log Files to the BDA
- Apache Flume is the standard way to transport log files from source through to target
- Initial use case was webserver log files, but it can transport any file from A to B
- Does not do data transformation, but can send to multiple targets / target types
- Mechanisms and checks to ensure successful transport of entries
- Has a concept of agents, sinks and channels: agents collect and forward log data, sinks store it in the final destination, and channels store log data en route
- Simple configuration through INI-style files (see the hedged sketch after slide 28 below), handled outside of ODI12c

26. GoldenGate for Continuous Streaming to Hadoop
- Oracle GoldenGate is also an option, for streaming RDBMS transactions to Hadoop
- Leverages GoldenGate and the HDFS / Hive Java APIs
- Sample implementations on MOS Doc ID 1586210.1 (HDFS) and 1586188.1 (Hive)
- Likely to be a formal part of GoldenGate in a future release, but usable now

27. Load Incoming Log Files into a Hive Table
- First step in the process is to load the incoming log files into a Hive table
- Also need to parse the log entries to extract the request, date, IP address etc columns
- The Hive table can then easily be used in downstream transformations
- Use the IKM File to Hive (LOAD DATA) KM; the source can be local files or HDFS
- Either load the file into the Hive HDFS area, or leave it as an external Hive table
- Ability to use a SerDe to parse the file data

28. First Though, Need to Set Up Topology and Models
- HDFS data servers (source) are defined using the generic File technology
- Workaround to support IKM Hive Control Append: leave the JDBC driver blank, and put the HDFS URL in the JDBC URL field
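Slide 25's INI-style configuration can be sketched as follows. The agent, source, channel and sink names are hypothetical; the Avro source on port 4545 matches the example port shown on slide 22:

# Write a minimal Flume agent config (path hypothetical)
cat > /etc/flume-ng/conf/weblog-agent.conf <<'EOF'
agent.sources  = weblog
agent.channels = mem
agent.sinks    = hdfs-out

# Source: receive log events over Avro/TCP on port 4545
agent.sources.weblog.type = avro
agent.sources.weblog.bind = 0.0.0.0
agent.sources.weblog.port = 4545
agent.sources.weblog.channels = mem

# Channel: buffer events in memory en route
agent.channels.mem.type = memory

# Sink: write the events into an HDFS landing directory
agent.sinks.hdfs-out.type = hdfs
agent.sinks.hdfs-out.hdfs.path = /user/oracle/incoming/logs
agent.sinks.hdfs-out.hdfs.fileType = DataStream
agent.sinks.hdfs-out.channel = mem
EOF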
29. Defining the Physical Schema and Model for an HDFS Directory
- Hadoop processes typically access a whole directory of files in HDFS, rather than a single one
- Hive, Pig etc aggregate all files in that directory and treat them as a single file
- ODI models usually point to a single file though: how do you set up access correctly?

30. Defining Topology and Model for Hive Sources
- Hive is supported out of the box with ODI12c (but requires an ODIAAH licence for the KMs)
- Most recent Hadoop distributions use HiveServer2 rather than HiveServer
- Need to ensure the JDBC drivers support the Hive version
- Use the correct JDBC URL format (jdbc:hive2://)

31. Final Model and Datastore Definitions
- HDFS files for incoming log data, and any other input data
- Hive tables for ETL targets and downstream processing
- Use RKM Hive to reverse-engineer column definitions from Hive

32. Using IKM File to Hive to Load Web Log File Data into Hive
- Create a mapping to load the file source (a single column for weblog entries) into the Hive table
- The target Hive table should have a column for the incoming log row, plus the parsed columns

33. Specifying a SerDe to Parse Incoming Hive Data
- SerDe (Serializer-Deserializer) interfaces give Hive the ability to process new file formats
- Distributed as a JAR file, gives Hive the ability to parse semi-structured formats
- We can use the RegEx SerDe to parse the Apache CombinedLogFormat file into columns (a hedged HiveQL sketch follows after slide 36 below)
- Enabled through the OVERRIDE_ROW_FORMAT option of the IKM File to Hive (LOAD DATA) KM

34. Executing the First ODI12c Mapping
- EXTERNAL_TABLE option chosen in IKM File to Hive (LOAD DATA), as Flume will continue writing to the file until the source log rotates
- View the results of the data load in ODI Studio

35. Join to Additional Hive Tables, Transform using HiveQL
- IKM Hive to Hive Control Append can be used to perform Hive table joins, filtering, aggregation etc
- INSERT only; no DELETE, UPDATE etc
- Not all ODI12c mapping operators are supported, but the basic functionality works OK
- Use this KM to join to other Hive tables, adding more details on post, title etc
- Perform a DISTINCT on the join output, and load into a summary Hive table

36. Joining Hive Tables
- Only equi-joins supported
- Must use ANSI syntax
- More complex joins may not produce valid HiveQL (subqueries etc)
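Slide 33's RegEx SerDe approach can be sketched in HiveQL. This is a hedged adaptation of the Apache Hive documentation's weblog example, not the exact table used in the mapping; the contrib SerDe JAR must be on Hive's classpath:

cat > /tmp/create_access_log.hql <<'EOF'
-- Parse Apache CombinedLogFormat lines into columns using the RegEx SerDe
CREATE TABLE hive_raw_apache_access_log (
  host STRING, identity STRING, user STRING, time STRING, request STRING,
  status STRING, size STRING, referer STRING, agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?"
)
STORED AS TEXTFILE;
EOF
hive -f /tmp/create_access_log.hql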
37. Filtering, Aggregating and Transforming Within Hive
- Aggregate (GROUP BY), DISTINCT, FILTER, EXPRESSION, JOIN, SORT etc mapping operators can be added to the mapping to manipulate data
- Generates HiveQL functions, clauses etc

38. Executing the Second Mapping
- ODI's IKM Hive to Hive Control Append generates HiveQL to perform the data loading
- In the background, Hive on the BDA creates MapReduce job(s) to load and transform the HDFS data
- Automatically runs across the cluster, in parallel and with fault tolerance and HA

39. Bring in Reference Data from an Oracle Database
- In this third step, additional reference data from an Oracle database needs to be added
- In theory, you should be able to add Oracle-sourced datastores to the mapping and join as usual
- But the Oracle / JDBC-generic LKMs don't work with Hive

40. Options for Importing Oracle / RDBMS Data into Hadoop
- Could export the RDBMS data to a file, and load it using IKM File to Hive
- The Oracle Big Data Connectors only export to Oracle, not import to Hadoop
- Best option is to use Apache Sqoop, and the new IKM SQL to Hive-HBase-File knowledge module
- Hadoop-native, automatically runs in parallel
- Uses native JDBC drivers, or OraOop (for example)
- Bi-directional, in and out of Hadoop to the RDBMS
- Can be run from the OS command line (see the hedged import sketch after slide 43 below)

41. Loading RDBMS Data into Hive using Sqoop
- First step is to stage the Oracle data into an equivalent Hive table
- Use the special LKM SQL Multi-Connect global load knowledge module for the Oracle source: it passes responsibility for the load (extract) to the following IKM
- Then use IKM SQL to Hive-HBase-File (Sqoop) to load the Hive table

42. Join the Oracle-Sourced Hive Table to the Existing Hive Table
- Oracle-sourced reference data in Hive can then be joined to the existing Hive table as normal
- Filters, aggregation operators etc can be added to the mapping if required
- Use IKM Hive Control Append as the integration KM

43. ODI Static and Flow Control: Data Quality and Error Handling
- CKM Hive can be used with IKM Hive to Hive Control Append to filter out erroneous data
- Static controls can be used to create data firewalls
- Flow control is used in the Physical mapping view to handle errors and exceptions
- Example: filter out rows where the IP address is from a test harness
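The Sqoop staging step described in slides 40-41 can also be run by hand from the OS command line. A hedged sketch; the connection string, credentials and table names are hypothetical:

# Import an Oracle reference table into a Hive table, in parallel
sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/orcl \
  --username BLOG_REFDATA -P \
  --table POST_CATEGORIES \
  --hive-import --hive-table categories_sql_extract \
  --num-mappers 4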
44. Enabling Flow Control in IKM Hive to Hive Control Append
- Check the ENABLE_FLOW_CONTROL option in the KM settings
- Select CKM Hive as the check knowledge module
- Erroneous rows get moved to an E_ table in Hive, and are not loaded into the target Hive table

45. Using Hive Streaming and Python for Geocoding Data
- Another requirement we have is to geocode the webserver log entries
- Allows us to aggregate page views by country
- Based on the fact that IP ranges can usually be attributed to specific countries
- Not functionality normally found in Hive etc, but it can be done with add-on APIs

46. How GeoIP Geocoding Works
- Uses the free geocoding API and database from MaxMind
- Convert the IP address to an integer, then find which integer range our IP address sits within
- But Hive can't use BETWEEN in a join

47. Solution: IKM Hive Transform
- IKM Hive Transform can pass the output of a Hive SELECT statement through a Perl, Python, shell etc script to transform its content
- Uses Hive's TRANSFORM ... USING ... AS functionality:

hive> add file file:///tmp/add_countries.py;
Added resource: file:///tmp/add_countries.py
hive> select transform (hostname,request_date,post_id,title,author,category)
    > using 'add_countries.py'
    > as (hostname,request_date,post_id,title,author,category,country)
    > from access_per_post_categories;

48. Creating the Python Script for Hive Streaming
- The solution requires a Python API to be installed on all Hadoop nodes, along with the geocoding DB:

wget https://raw.github.com/pypa/pip/master/contrib/get-pip.py
python get-pip.py
pip install pygeoip

- The Python script then parses each incoming stdin line using tab-separation of fields, and outputs the same fields plus an extra field for the country (a quick local smoke test is sketched after slide 49 below):

#!/usr/bin/python
import sys
sys.path.append('/usr/lib/python2.6/site-packages/')
import pygeoip
gi = pygeoip.GeoIP('/tmp/GeoIP.dat')
for line in sys.stdin:
    line = line.rstrip()
    hostname, request_date, post_id, title, author, category = line.split('\t')
    country = gi.country_name_by_addr(hostname)
    print hostname + '\t' + request_date + '\t' + post_id + '\t' + title + '\t' + author + '\t' + category + '\t' + country

49. Setting up the Mapping
- Map the source Hive table to the target, which includes a column for the extra country field
- Copy the script and the GeoIP.dat file to every node's /tmp directory
- Ensure all Python APIs and libraries are installed on each Hadoop node
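Because Hive streaming simply pipes tab-separated rows through the script's stdin and stdout, the script can be smoke-tested locally before being registered with the KM. A hedged sketch; the sample row values are hypothetical, and GeoIP.dat must already be in /tmp:

# Feed one tab-separated row through the script and check the appended country
echo -e '8.8.8.8\t2014-09-29\t1001\tSome Post Title\tmark\tbig-data' \
  | python /tmp/add_countries.py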
50. Configuring IKM Hive Transform
- TRANSFORM_SCRIPT_NAME specifies the name of the script, and the path to the script
- TRANSFORM_SCRIPT has issues with parsing; do not use it, leave it blank and the KM will use the existing script
- Optional ability to specify sort and distribution columns (which can be compound)
- Leave the other options at their defaults

51. Executing the Mapping
- The KM automatically registers the script with Hive (which caches it on all nodes)
- The HiveQL output then runs the contents of the first Hive table through the script, outputting the results to the target table

52. Bulk Unload Summary Data to an Oracle Database
- The final requirement is to unload the final Hive table's contents to an Oracle database
- Several use cases for this: use Hadoop / the BDA for ETL offloading; use the analysis capabilities of the BDA, but then output the results to an RDBMS data mart or DW; permit the use of more advanced SQL query tools; share the results with other applications
- Can use Sqoop for this (see the hedged export sketch after slide 57 below), or use the Oracle Big Data Connectors: fast bulk unload, or transparent Oracle access to Hive

53. IKM File/Hive to Oracle (OLH/ODCH)
- KM for accessing HDFS/Hive data from Oracle
- Either sets up ODCH connectivity, or bulk-unloads via OLH
- Map from an HDFS or Hive source to Oracle tables (via the Oracle technology in Topology)

54. Configuring the KM Physical Settings
- For the access table in the Physical view, change the LKM to LKM SQL Multi-Connect
- This delegates the multi-connect capabilities to the downstream node, so you can use a multi-connect IKM such as IKM File/Hive to Oracle

55. Configuring the KM Physical Settings (continued)
- For the target table, select IKM File/Hive to Oracle; it only becomes available to select once LKM SQL Multi-Connect is selected for the access table
- Key option values to set are:
- OLH_OUTPUT_MODE (use JDBC initially, or OCI if the Oracle Client is installed on the Hadoop client node)
- MAPRED_OUTPUT_BASE_DIR (set to a directory on HDFS that the OS user running ODI can access)

56. Executing the Mapping
- Executing the mapping invokes OLH from the OS command line
- The Hive table (or HDFS file) contents are copied to the Oracle table

57. Create a Package to Sequence the ETL Steps
- Define a package (or load plan) within ODI12c to orchestrate the process
- Call the package / load plan execution from the command line, a web service call, or a schedule
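As an alternative to the OLH-based KM, slide 52's Sqoop route can be scripted directly. A hedged sketch; the connection details and names are hypothetical, and the field terminator assumes Hive's default Ctrl-A delimiter:

# Export a Hive-managed table's HDFS files into an Oracle target table
sqoop export \
  --connect jdbc:oracle:thin:@//dbhost:1521/orcl \
  --username DW -P \
  --table LOG_SUMMARY \
  --export-dir /user/hive/warehouse/access_per_post_summary \
  --input-fields-terminated-by '\001'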
58. Execute the Overall Package
- Each step is executed in sequence
- An end-to-end ETL process, using ODI12c's metadata-driven development process, data quality handling and heterogeneous connectivity, but Hadoop-native processing

59. Conclusions
- Hadoop, and the Oracle Big Data Appliance, is an excellent platform for data capture, analysis and processing
- Hadoop tools such as Hive, Sqoop, MapReduce and Pig provide the means to process and analyse data in parallel, using languages and an approach familiar to Oracle developers
- ODI12c provides several benefits when working with ETL and data loading on Hadoop: metadata-driven design, data quality handling, and KMs to handle the technical complexity
- The Oracle Data Integrator Adapter for Hadoop provides several KMs for Hadoop sources
- In this presentation, we've seen an end-to-end example of big data ETL using ODI: the power of Hadoop and the BDA, with the ETL orchestration of ODI12c

60. Thank You for Attending!
- Thank you for attending this presentation; more information can be found at http://www.rittmanmead.com
- Contact us at [email protected] or [email protected]
- Look out for our book, Oracle Business Intelligence Developers Guide, out now!
- Follow us on Twitter (@rittmanmead) or Facebook (facebook.com/rittmanmead)

61. Deep-Dive into Big Data ETL with ODI12c and Oracle Big Data Connectors
Mark Rittman, CTO, Rittman Mead
Oracle OpenWorld 2014, San Francisco