Filling the Data Lake

June 29, 2016

Chuck Yarbrough, Sr Director, Solutions Marketing and Management (@cyarbrough)

Mark Burnette, Enterprise Sales Engineer (@MarkCBurnette)

© 2015, Pentaho. All rights reserved. pentaho.com. Worldwide +1 (866) 660-7555

Emerging Big Data Use Cases

Improve operational effectiveness

§  Machines/sensors: predict failures, network attacks

§  Financial risk management: reduce fraud, increase security

§  Reduce data warehouse cost

Improve customer experience

§  Build a 360° view to fully understand and serve the customer

§  Drive personalized and adjusted interaction

§  Use automated recommendations logic

Drive incremental revenue

§  Predict customer behavior across all channels

§  Understand and monetize customer behavior

§  Begin to monetize data as a service


Spectrum of Big Data Use Cases

[Figure: use cases plotted by Use Case Complexity (Entry to Advanced) and Business Impact (Optimize to Transform)]

Use cases shown:

§  Data Warehouse Optimization

§  Streamlined Data Refinery

§  Big Data Exploration

§  Customer 360 Degree View

§  Harnessing Machine & Sensor Data

§  Next Generation Applications

§  Internal Big Data as a Service

§  On-Demand Big Data Blending

§  Big Data Predictive Analytics

§  Monetize My Data

Also labeled on the slide: Data Warehouse Optimization, Streamlined Data Refinery, 360 Degree View, Big Data Onboarding (Filling the Data Lake)

What Does Pentaho Do?


Managing and Automating the Pipeline

[Diagram labels: Data Lake; Data Pipeline; Data Engineering, Analytics, Data Preparation; Administration, Security, Lifecycle Management, Data Provenance, Dynamic Data Pipeline, Monitoring, Automation]


The Data Swamp


The Data Lake


Does Hadoop Have to be Hard?

Things that can help ease the pain:

§  Empower team members to integrate and process Hadoop data

§  Establish a modern data onboarding process that is flexible and scalable

§  Deliver governed analytic insights for large production user bases


Proper Care and Feeding of the Data Lake

Data Onboarding Challenges


More Data, More Problems

Even with good integration tools, major data onboarding projects can be painful:

User Challenges

§  Repetitive manual design

§  Very time-consuming

§  Difficult to maintain

Business Challenges

§  Takes too long

§  Business deadlines at risk

§  Opportunity cost


More Data, More Problems

How do we effectively scale data pipelines to accommodate exploding data sources, volumes, and complexity?

Have you ever had the pleasure of…

§  Migrating hundreds of sources between systems?

§  Enabling business users to onboard a variety of data themselves?

§  Ingesting hundreds of changing data sources into Hadoop?


More Data, More Problems

Modern data onboarding is more than just “dumping data” – it includes:

§  Managing a changing array of data sources

§  Establishing repeatable processes at scale

§  Maintaining control and governance


Data Onboarding: Filling the Data Lake

[Diagram: Disparate Data Sources (CSV, RDBMS) -> Integration Processes / Transformations (Ingest Procedures) -> Hadoop (AVRO)]


Data Onboarding at Scale

[Diagram: Disparate Data Sources (multiple CSV and RDBMS sources) -> Integration Processes / Transformations (Ingest Procedures) -> Hadoop (AVRO)]


Filling the Data Lake: A Modern Data Onboarding Blueprint

Template-based Approach

§  Streamline data ingest from a wide variety of source data

§  Reduce dependence on hard-coded data movement procedures

§  Simplify regular data movement at scale into the Data Lake


Dynamic ELT

[Diagram: Disparate Data Sources (multiple CSV and RDBMS sources) -> Dynamic Integration Processes / Dynamic Transformations (Ingest Templates) -> Hadoop]

Pass metadata in at run time to generate jobs on the fly (metadata injection)
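
To make the idea of metadata injection concrete, here is a minimal sketch in plain Python. It does not use Pentaho's API. The template layout, the `inject_metadata` function, the metadata keys, and the file paths are all hypothetical, chosen only to illustrate how one template with unfilled slots could be combined with per-source metadata at run time to produce a concrete job.

```python
import copy

# Hypothetical job template: the step layout is fixed, but source-specific
# settings are left empty (None) and filled in at run time.
TEMPLATE = {
    "steps": [
        {"type": "read_csv", "path": None, "delimiter": None, "fields": None},
        {"type": "write_avro", "target": None},
    ]
}


def inject_metadata(template, metadata):
    """Return a concrete job: a copy of the template with every empty slot
    filled from the metadata dictionary. Raises KeyError if a slot has no
    matching metadata entry."""
    job = copy.deepcopy(template)  # never mutate the shared template
    for step in job["steps"]:
        for key in list(step):
            if step[key] is None:
                step[key] = metadata[key]
    return job


# Two sources with different metadata, one template (paths are made up).
customers = inject_metadata(TEMPLATE, {
    "path": "/landing/customers.csv",
    "delimiter": ",",
    "fields": ["id", "name", "email"],
    "target": "/lake/customers.avro",
})
orders = inject_metadata(TEMPLATE, {
    "path": "/landing/orders.csv",
    "delimiter": ";",
    "fields": ["order_id", "customer_id", "amount"],
    "target": "/lake/orders.avro",
})
```

Under these assumptions, adding a new source means supplying one more metadata dictionary, not designing one more job.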


Templated workflows

[Diagram: Disparate Data Sources (multiple CSV and RDBMS sources) -> Dynamic Integration Processes / Dynamic Transformations -> Hadoop]

Templates shown:

§  RDBMS -> AVRO Template

§  CSV -> AVRO Template

§  CSV -> HDFS Template
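
One way to picture templated workflows is as a lookup from a (source type, target format) pair to a reusable template. The sketch below is illustrative Python, not Pentaho code. The `TEMPLATES` registry, the `select_template` and `plan_ingest` functions, and the source names are assumptions made for the example; only the three template names come from the slide.

```python
# Hypothetical registry: one reusable template per (source type, target) pair,
# mirroring the three templates named on the slide.
TEMPLATES = {
    ("rdbms", "avro"): "RDBMS -> AVRO Template",
    ("csv", "avro"): "CSV -> AVRO Template",
    ("csv", "hdfs"): "CSV -> HDFS Template",
}


def select_template(source_type, target):
    """Pick the template for a source; fail loudly if none exists."""
    key = (source_type.lower(), target.lower())
    if key not in TEMPLATES:
        raise ValueError(f"no ingest template for {source_type} -> {target}")
    return TEMPLATES[key]


def plan_ingest(sources):
    """Map each source name to the template that will handle it."""
    return {s["name"]: select_template(s["type"], s["target"]) for s in sources}


# Hypothetical source inventory: many sources, only three templates.
plan = plan_ingest([
    {"name": "crm_accounts", "type": "RDBMS", "target": "AVRO"},
    {"name": "web_logs", "type": "CSV", "target": "HDFS"},
    {"name": "price_list", "type": "CSV", "target": "AVRO"},
])
```

The number of templates grows with the number of source/target combinations, not with the number of sources.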


Variety – different metadata, one template

[Diagram: Disparate Data Sources -> Dynamic Integration Processes / Dynamic Transformations (CSV -> AVRO Template) -> Hadoop]


Key Takeaway

Managing ETL and ELT procedures

Managing Metadata

Metadata Injection

Metadata Acquisition


RDBMS Ingestion

Automated Metadata Extraction

Extract table and store in AVRO

§  Database connection details

§  Table(s)

§  Field names (if available)

§  Data types

§  String length

§  Mask for numbers and dates

§  …
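
As a rough illustration of automated metadata extraction from a relational source, the following sketch reads table names, field names, and declared data types from a database catalog. It uses SQLite from the Python standard library as a stand-in for the RDBMS; other databases expose comparable information through their own catalogs. The function names and the sample table are hypothetical, and this is not how Pentaho itself performs the extraction.

```python
import sqlite3


def list_tables(conn):
    """Return the names of all user tables in the database."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [r[0] for r in rows]


def extract_table_metadata(conn, table):
    """Read field names and declared data types from SQLite's catalog.

    The table name is interpolated into the statement, so it must come from
    a trusted source such as list_tables(), never from user input.
    """
    # PRAGMA table_info returns one row per column:
    # (cid, name, type, notnull, dflt_value, pk)
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return [{"name": r[1], "type": r[2], "nullable": not r[3]} for r in rows]


# Hypothetical in-memory database standing in for a real source system.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER NOT NULL, name VARCHAR(50), signup DATE)"
)

# Only the connection is supplied; tables, fields, and types are discovered.
metadata = {t: extract_table_metadata(conn, t) for t in list_tables(conn)}
```

The resulting dictionary is the kind of metadata that could then be passed to an ingest template at run time.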


CSV Ingestion

Automated Metadata Extraction

Option 1: Ingest RAW files into HDFS (no parsing)

§  Path to CSVs

Option 2: Parse and store in AVRO

§  Path to CSVs

§  Delimiter

§  Field names (if available)

§  Data types

§  String length

§  Mask for numbers and dates

§  …
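
For CSV sources, part of the metadata listed for the parse-and-store option (delimiter, field names, data types, string length) can often be inferred from the file itself. The sketch below is a simplified heuristic in plain Python, not Pentaho's implementation. It assumes a header row, at least one data row, and no quoted fields that contain the delimiter; the function names and the sample data are hypothetical.

```python
import csv
import io

CANDIDATE_DELIMITERS = [",", ";", "\t", "|"]


def detect_delimiter(lines):
    """Pick the first candidate that appears the same non-zero number of
    times on every sampled line; raise if none qualifies."""
    for delim in CANDIDATE_DELIMITERS:
        counts = {line.count(delim) for line in lines}
        if len(counts) == 1 and counts.pop() > 0:
            return delim
    raise ValueError("could not detect delimiter")


def infer_type(values):
    """Return 'integer', 'number', or 'string' for a column of text values."""
    for name, cast in (("integer", int), ("number", float)):
        try:
            for v in values:
                cast(v)
            return name
        except ValueError:
            continue
    return "string"


def extract_csv_metadata(text):
    """Infer delimiter, field names, types, and max string length."""
    lines = [line for line in text.splitlines() if line.strip()]
    delimiter = detect_delimiter(lines)
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    rows = [row for row in reader if row]  # skip blank lines
    header, data = rows[0], rows[1:]
    fields = []
    for i, name in enumerate(header):
        column = [row[i] for row in data]
        fields.append({
            "name": name,
            "type": infer_type(column),
            "max_length": max(len(v) for v in column),
        })
    return {"delimiter": delimiter, "fields": fields}


# Hypothetical sample file contents.
sample = "id;name;amount\n1;Ann;10.50\n2;Bob;7\n"
meta = extract_csv_metadata(sample)
```

Masks for numbers and dates are left out here; inferring them reliably needs more than a few lines.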

Demonstration


Key Takeaway

Automated Metadata Extraction

ELT development: DAYS

Provisioning: MINUTES

Summary


Key Takeaways: Data Onboarding Blueprint

§  Template-based Data Integration: manage metadata vs. ELT procedures

§  Automated Metadata Extraction: provide minimum required configuration

§  Reduce Risk: maintain an organized, standardized, and clean data lake


What Next?

§  Learn more about Big Data Onboarding at Pentaho.com

§  Download Pentaho Platform at Pentaho.com

Q&A

Thank You