Filling the Data Lake
dataworks-summithadoop-summit
Transcript of Filling the Data Lake
Filling the Data Lake
June 29, 2016
Chuck Yarbrough, Sr Director, Solutions Marketing and Management, @cyarbrough
Mark Burnette, Enterprise Sales Engineer, @MarkCBurnette
© 2015, Pentaho. All rights reserved. pentaho.com. Worldwide +1 (866) 660-7555 2
Emerging Big Data Use Cases
Improve operational effectiveness
§ Machines/sensors: predict failures, network attacks
§ Financial risk management: reduce fraud, increase security
§ Reduce data warehouse cost
Improve customer experience
§ Build a 360° view to fully understand and serve the customer
§ Drive personalized and adjusted interaction
§ Use automated recommendations logic
Drive incremental revenue
§ Predict customer behavior across all channels
§ Understand and monetize customer behavior
§ Begin to monetize data as a service
Spectrum of Big Data Use Cases
[Figure: use cases plotted by Use Case Complexity (Entry to Advanced) against Business Impact (Optimize to Transform)]
§ Data Warehouse Optimization
§ Streamlined Data Refinery
§ Big Data Exploration
§ Customer 360 Degree View
§ Harnessing Machine & Sensor Data
§ Next Generation Applications
§ Internal Big Data as a Service
§ On-Demand Big Data Blending
§ Big Data Predictive Analytics
§ Monetize My Data
Data Warehouse Optimization
Streamlined Data Refinery
360 Degree View
Big Data Onboarding: Filling the Data Lake
Managing and Automating the Pipeline
[Diagram labels: Data Lake; Data Pipeline; Dynamic Data Pipeline; Data Engineering; Data Preparation; Analytics; Administration; Security; Lifecycle Management; Data Provenance; Monitoring; Automation]
Does Hadoop Have to be Hard?
Things that can help ease the pain:
§ Empower team members to integrate and process Hadoop data
§ Establish a modern data onboarding process that is flexible and scalable
§ Deliver governed analytic insights for large production user bases
Proper Care and Feeding of the Data Lake
More Data, More Problems
Even with good integration tools, major data onboarding projects can be painful:
User Challenges
§ Repetitive manual design
§ Very time-consuming
§ Difficult to maintain
Business Challenges
§ Takes too long
§ Business deadlines at risk
§ Opportunity cost
How do we effectively scale data pipelines to accommodate exploding data sources, volumes, and complexity?
More Data, More Problems
Have you ever had the pleasure of…
Migrating hundreds of sources between systems?
Enabling business users to onboard a variety of data themselves?
Ingesting hundreds of changing data sources into Hadoop?
More Data, More Problems
Modern data onboarding is more than just “dumping data” – it includes:
Managing a changing array of data sources
Establishing repeatable processes at scale
Maintaining control and governance
Data Onboarding: Filling the Data Lake
[Diagram: Disparate Data Sources (CSV, RDBMS) -> Integration Processes / Transformations (Ingest Procedures) -> Hadoop (AVRO)]
Data Onboarding at Scale
[Diagram: Disparate Data Sources (multiple CSV files and RDBMS instances) -> Integration Processes / Transformations (Ingest Procedures) -> Hadoop (AVRO)]
Filling the Data Lake: A Modern Data Onboarding Blueprint
§ Streamline data ingest from a wide variety of source data
§ Reduce dependence on hard-coded data movement procedures
§ Simplify regular data movement at scale into the data lake
Dynamic ELT
[Diagram: Disparate Data Sources (CSV, RDBMS) -> Dynamic Integration Processes / Dynamic Transformations (Ingest Templates) -> Hadoop]
Pass metadata in at run time to generate jobs on the fly (metadata injection)
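The idea of passing metadata in at run time can be illustrated with a minimal Python sketch. This is not Pentaho's metadata injection API; the template layout, the `inject_metadata` helper, and all paths and connection strings are hypothetical names chosen for illustration only.

```python
from copy import deepcopy

# Hypothetical ingest template: the job structure is fixed, but every
# source-specific value is left empty (None) until run time.
INGEST_TEMPLATE = {
    "read": {"source_type": None, "location": None, "fields": None},
    "write": {"format": None, "target_path": None},
}


def inject_metadata(template, metadata):
    """Build a concrete job by filling the template with run-time metadata."""
    job = deepcopy(template)  # leave the shared template untouched
    for section, settings in job.items():
        for key in settings:
            if key not in metadata:
                raise KeyError(f"metadata is missing '{key}' for step '{section}'")
            settings[key] = metadata[key]
    return job


# Two different sources, one template: only the metadata changes.
# All locations below are made-up example values.
csv_job = inject_metadata(INGEST_TEMPLATE, {
    "source_type": "csv",
    "location": "/landing/customers.csv",
    "fields": ["id", "name"],
    "format": "avro",
    "target_path": "/lake/customers",
})
rdbms_job = inject_metadata(INGEST_TEMPLATE, {
    "source_type": "rdbms",
    "location": "jdbc:example://host/sales",
    "fields": ["order_id", "amount"],
    "format": "avro",
    "target_path": "/lake/orders",
})
```

The template itself is never edited per source, so adding a new source in this sketch means adding a metadata record rather than designing a new job.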
Templated workflows
[Diagram: Disparate Data Sources (CSV, RDBMS) -> Dynamic Integration Processes / Dynamic Transformations -> Hadoop]
§ RDBMS -> AVRO Template
§ CSV -> AVRO Template
§ CSV -> HDFS Template
Variety – different metadata, one template
[Diagram: Disparate Data Sources -> Dynamic Integration Processes / Dynamic Transformations (CSV -> AVRO Template) -> Hadoop]
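As a rough illustration of "different metadata, one template", the Python sketch below runs one parsing routine over two differently shaped CSV inputs. The `csv_template` function, the metadata layout, and the sample data are assumptions made for this example; writing the parsed records out as AVRO is deliberately left out.

```python
import csv
import io

# Casts applied per field, selected by the type named in the metadata.
CASTS = {"string": str, "int": int, "double": float}


def csv_template(raw_text, metadata):
    """Parse CSV text into records, driven entirely by the supplied metadata."""
    reader = csv.reader(io.StringIO(raw_text), delimiter=metadata["delimiter"])
    if metadata.get("has_header", False):
        next(reader, None)  # skip the header row when one is declared
    names = [field["name"] for field in metadata["fields"]]
    casts = [CASTS[field["type"]] for field in metadata["fields"]]
    return [
        {name: cast(value) for name, cast, value in zip(names, casts, row)}
        for row in reader
        if row  # ignore blank lines
    ]


# Source 1: comma-delimited with a header row.
customers = csv_template(
    "id,name\n1,Ada\n2,Grace\n",
    {
        "delimiter": ",",
        "has_header": True,
        "fields": [{"name": "id", "type": "int"}, {"name": "name", "type": "string"}],
    },
)

# Source 2: semicolon-delimited, no header, different fields and types.
readings = csv_template(
    "s1;20.5\ns2;19.0\n",
    {
        "delimiter": ";",
        "fields": [{"name": "sensor", "type": "string"}, {"name": "value", "type": "double"}],
    },
)
```

Both calls use the same function body; the delimiter, field names, and data types come only from the metadata.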
Key Takeaway
Managing ETL and ELT procedures
Managing Metadata
Metadata Injection
RDBMS Ingestion
Automated Metadata Extraction
Extract table and store in AVRO
§ Database connection details
§ Table(s)
§ Field names (if available)
§ Data types
§ String length
§ Mask for numbers and dates
§ …
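A minimal sketch of automated metadata extraction from a relational table, using Python's built-in `sqlite3` module as a stand-in for an arbitrary RDBMS. The helper names, the type mapping, and the sample table are assumptions for illustration; a real deployment would read the catalog of its own database, and this sketch only produces an AVRO-style record schema rather than writing AVRO files. Masks for numbers and dates are not covered.

```python
import re
import sqlite3

# Assumed mapping from SQLite declared types to AVRO primitive types;
# anything unlisted falls back to "string".
AVRO_TYPES = {"INTEGER": "long", "REAL": "double"}


def extract_table_metadata(conn, table):
    """Read field names, data types, and string lengths from the catalog."""
    if not table.isidentifier():
        raise ValueError(f"unexpected table name: {table!r}")
    fields = []
    # PRAGMA table_info yields (cid, name, declared type, notnull, default, pk).
    for _cid, name, declared, _notnull, _default, _pk in conn.execute(
        f"PRAGMA table_info({table})"
    ):
        match = re.fullmatch(r"(\w+)\s*(?:\((\d+)\))?", declared.strip())
        base = match.group(1).upper() if match else declared.upper()
        length = int(match.group(2)) if match and match.group(2) else None
        fields.append({"name": name, "data_type": base, "string_length": length})
    return fields


def to_avro_schema(table, fields):
    """Turn extracted metadata into an AVRO record schema (as a dict)."""
    return {
        "type": "record",
        "name": table,
        "fields": [
            # Each field is a union with "null" so missing values are allowed.
            {"name": f["name"], "type": ["null", AVRO_TYPES.get(f["data_type"], "string")]}
            for f in fields
        ],
    }


# Example table, made up for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name VARCHAR(40), balance REAL)")
fields = extract_table_metadata(conn, "customers")
schema = to_avro_schema("customers", fields)
```

Here the only input a user supplies is the connection and the table name; field names, data types, and string lengths are read from the database itself.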
CSV Ingestion
Automated Metadata Extraction
Option 1: Ingest RAW files into HDFS (no parsing)
§ Path to CSVs
Option 2: Parse and store in AVRO
§ Path to CSVs
§ Delimiter
§ Field names (if available)
§ Data types
§ String length
§ Mask for numbers and dates
§ …
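For the parse-and-store option, the delimiter, field names, data types, and string lengths can in principle be inferred from the file itself. The Python sketch below uses a deliberately naive heuristic: it assumes a header row is present and picks the candidate delimiter that occurs most often in the first line. The function names and sample data are hypothetical, and masks for numbers and dates are not handled.

```python
import csv
import io


def infer_type(values):
    """Return the narrowest type name that fits every value in the column."""
    for type_name, cast in (("long", int), ("double", float)):
        try:
            for value in values:
                cast(value)
            return type_name
        except ValueError:
            continue
    return "string"


def extract_csv_metadata(raw_text, candidates=",;\t|"):
    """Infer delimiter, field names, data types, and string lengths."""
    first_line = raw_text.splitlines()[0]
    # Naive guess: the candidate that appears most often in the header line.
    delimiter = max(candidates, key=first_line.count)
    rows = [r for r in csv.reader(io.StringIO(raw_text), delimiter=delimiter) if r]
    header, data = rows[0], rows[1:]
    fields = []
    for index, name in enumerate(header):
        column = [row[index] for row in data]
        fields.append({
            "name": name,
            "data_type": infer_type(column),
            "string_length": max((len(v) for v in column), default=0),
        })
    return {"delimiter": delimiter, "fields": fields}


# Sample content, made up for this sketch.
sample = "id;name;reading\n1;alpha;20.5\n2;beta;19.25\n"
csv_metadata = extract_csv_metadata(sample)
```

The resulting dictionary is the kind of metadata that could then be handed to a CSV -> AVRO template, so that the path to the files is the only configuration a user has to provide.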
Key Takeaway
Automated Metadata Extraction
ELT development (DAYS) -> Provisioning (MINUTES)
Key Takeaways: Data Onboarding Blueprint
Template-based Data Integration
§ Manage metadata vs. ELT procedures
Automated Metadata Extraction
§ Provide minimum required configuration
Reduce Risk
§ Maintain an organized, standardized, and clean data lake
What Next?
Learn more about Big Data Onboarding at Pentaho.com
Download Pentaho Platform at Pentaho.com