One Large Data Lake, Hold the Hype

Rocky Mountain DataCon 2016

Jared Winick
Senior Data/Solutions Engineer, Koverse

Outline

• Issues with the usage of “Data Lake”
• Defining Key Characteristics
• A Data Lake Implementation Example
• Discussion

Just because “Data Lake” is overused, misused, abused doesn’t mean the concept is wrong

The Concept of a Data Lake

We all can agree that a Data Lake is a centralized (at least logically) repository for all forms of data within an organization.

https://www.wired.com/2013/04/desktop-cluttered-help/

The Concept of a Data Lake

…but there must be more to it than putting all your data in HDFS or S3.

Defining the Key Characteristics

1. Indexing and search across all data
2. Interactive access for all users in the enterprise
3. Multi-level access control
4. Integration with data science tools
5. Abstractions

A Data Lake has a platform-application duality

Indexing / Search Across All Data

• A Data Lake is often an entry point for data
• It may lack structure or “correctness”
• Search enables you to validate and explore your data

Indexing / Search Across All Data

Search: A18923

Employees Data Set
{ employeeId: “A18923”, email: “jaredwinick@koverse.com”, firstName: “Jared”, … }

Network Events Data Set
{ id: “a18923”, eventType: “login”, time: 1478557775010, … }

Find data across data sets. Understand its format and structure.
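As a rough illustration of the slide above, here is a minimal in-memory sketch of one search index shared by several data sets. The `SearchIndex` class and `tokenize` helper are hypothetical names invented for this example, and lower-casing terms is an assumption made so that “A18923” and “a18923” match; this is not how Koverse implements search.

```python
from collections import defaultdict


def tokenize(value):
    """Lower-case a field value and split on whitespace (assumed normalization)."""
    return str(value).lower().split()


class SearchIndex:
    """Hypothetical inverted index spanning every data set in the lake."""

    def __init__(self):
        # term -> list of (data set name, record)
        self._postings = defaultdict(list)

    def add(self, data_set, record):
        seen = set()
        for value in record.values():
            for term in tokenize(value):
                if term not in seen:  # index each record once per term
                    seen.add(term)
                    self._postings[term].append((data_set, record))

    def search(self, query):
        return list(self._postings.get(query.lower(), []))


index = SearchIndex()
index.add("Employees", {"employeeId": "A18923",
                        "email": "jaredwinick@koverse.com",
                        "firstName": "Jared"})
index.add("Network Events", {"id": "a18923",
                             "eventType": "login",
                             "time": 1478557775010})

hits = index.search("A18923")
for data_set, record in hits:
    # Show where the term was found and which fields each record has.
    print(data_set, sorted(record))
```

One query returns hits from both data sets, and the field names in each hit reveal how each data set is structured.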

Interactive Access for Everyone

• A Data Lake is strategic and should serve many different types of users.

• Should have self-service features.
• Adds up to needing to support interactive, multi-user load.

Multi-Level Access Control

• Every organization has data access control requirements these days.
• Different levels of granularity for different environments/use cases.
– Data Set Level
– Column/Field Level
– Row/Record Level

• Far easier to engineer up front than add on later.

Multi-Level Access Control

[Figure: example table with columns Name, DeptId, DOB, email; labels: Data Set, Column]
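The three levels of granularity can be sketched as successive filters on a read. Everything here is an assumption made for illustration: the `Policy` class, the role names, the `visibility` label field, and the sample records are all invented, and a real system would enforce these checks inside the storage layer rather than in application code.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Policy:
    data_set_roles: set                                # roles allowed to read the data set at all
    column_roles: dict = field(default_factory=dict)   # column -> roles allowed to see it
    row_label_field: Optional[str] = None              # field holding a per-record label, if any


def read(records, policy, user_roles):
    # Data set level: refuse the whole read if the user has no matching role.
    if not (policy.data_set_roles & user_roles):
        raise PermissionError("no access to this data set")
    visible = []
    for record in records:
        # Row/record level: skip records whose label the user does not hold.
        if policy.row_label_field is not None:
            label = record.get(policy.row_label_field)
            if label is not None and label not in user_roles:
                continue
        # Column/field level: drop restricted columns and the label itself.
        row = {}
        for column, value in record.items():
            if column == policy.row_label_field:
                continue
            required = policy.column_roles.get(column)
            if required is None or (required & user_roles):
                row[column] = value
        visible.append(row)
    return visible


# Made-up sample data using the column names from the slide.
employees = [
    {"Name": "Alice", "DeptId": 1, "DOB": "1970-01-01", "email": "alice@example.com"},
    {"Name": "Bob", "DeptId": 2, "DOB": "1971-02-02", "email": "bob@example.com",
     "visibility": "hr"},
]
policy = Policy(data_set_roles={"employee"},
                column_roles={"DOB": {"hr"}},
                row_label_field="visibility")

print(read(employees, policy, {"employee"}))
print(read(employees, policy, {"employee", "hr"}))
```

A user with only the `employee` role sees one record without `DOB`; a user who also holds `hr` sees both records in full.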

Integration With Data Science Tools

• The ultimate point of a Data Lake is to “monetize” data
– For a corporation this is making or saving money
– For a government this is better serving your citizens
– For a research organization this is solving new problems/answering previously unknown questions
• Need to be able to analyze and transform data sets into new data sets
• From BI queries to text analytics to machine learning.

Integration With Data Science Tools

A Data Lake needs to support multiple internal analytic “customers” within the organization.
• SQL / BI tools for Data Analysts
• Spark for Data Engineers
• Notebooks and ML libraries for Data Scientists

Abstractions

• Provide a level of abstraction over your data
– Data Sets / Collections
– Records / Rows
– Transformations
• Enables a consistent API for interacting with any data regardless of its shape, size, and content
– Reusability
– Increased development speed
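One way to picture these abstractions is a small sketch in which a data set is a named collection of records and a transformation is any function from records to records. The names `DataSet`, `run_transformation`, and `count_by` are hypothetical, as is the record with id “b00001”; this is not the Koverse API.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable, List

Record = Dict[str, Any]                      # a record/row is just a mapping
Transformation = Callable[[Iterable[Record]], Iterable[Record]]


@dataclass
class DataSet:
    name: str
    records: List[Record]


def run_transformation(transform: Transformation, source: DataSet,
                       target_name: str) -> DataSet:
    """Apply a transformation to one data set and produce a new data set."""
    return DataSet(target_name, list(transform(source.records)))


def count_by(field_name: str) -> Transformation:
    """Reusable transformation: count records per value of one field."""
    def transform(records):
        counts = {}
        for record in records:
            key = record.get(field_name)
            counts[key] = counts.get(key, 0) + 1
        for key, n in counts.items():
            yield {field_name: key, "count": n}
    return transform


events = DataSet("Network Events", [
    {"id": "a18923", "eventType": "login"},
    {"id": "a18923", "eventType": "logout"},
    {"id": "b00001", "eventType": "login"},
])

per_type = run_transformation(count_by("eventType"), events, "Events per Type")
print(per_type.name, per_type.records)
```

Because `count_by` knows nothing about network events, the same transformation can be reused on any data set, which is the reusability point made on the slide.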

The Koverse Data Lake

Architecture – High Level

[Diagram layers: HDFS and Zookeeper; Accumulo; Koverse]

A distributed key/value store like Apache Accumulo enables storage of very large volumes of data while maintaining low latency access.

Architecture – Distributed Key/Value Store Benefits

These benefits apply to Apache Accumulo, and likely also to Apache HBase, Cassandra, and other similar systems.

1. Easily scale to trillions of key/values
2. Distributed storage
   1. Parallel processing in Hadoop MapReduce or Spark
   2. Fault tolerance
3. Millisecond read latencies with efficient scanning of ranges
4. Fine-grained access control features
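To see why range scans are efficient in this kind of store, here is a toy, single-process sketch of a sorted key/value map. It only illustrates that keeping keys sorted turns a range scan into two binary searches plus a sequential read; `SortedKeyValueStore` is a hypothetical class, not the Accumulo client API, and it has none of the distribution, fault tolerance, or access control listed above.

```python
import bisect


class SortedKeyValueStore:
    """Toy in-memory stand-in for a sorted, distributed key/value store."""

    def __init__(self):
        self._keys = []      # kept in sorted order
        self._values = {}    # key -> value

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, key):
        return self._values.get(key)

    def scan(self, start, end):
        """Yield (key, value) pairs with start <= key < end, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        for key in self._keys[lo:hi]:
            yield key, self._values[key]


store = SortedKeyValueStore()
for key in ["event:0003", "event:0001", "user:a18923", "event:0002"]:
    store.put(key, key.upper())

# ";" sorts immediately after ":", so this range covers every "event:" key.
event_rows = list(store.scan("event:", "event;"))
print(event_rows)
```

Choosing keys so that related entries sort next to each other is what makes a scan like this read only the rows it needs.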

Architecture - Details

[Diagram: Accumulo (Record Table, Index Table, Statistics/Aggregations) and Koverse; labels: “Low latency R/W”, “Efficient Range Scans”, “Users/Apps use REST”]
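The slide names the tables but does not show their layout. As a sketch under that caveat, one plausible arrangement is: each write stores the record under its id, adds one sorted index key per field value, and updates a running statistic. The key format `term\x00record_id`, the record ids, and the function names below are assumptions for illustration only, not the actual Koverse schema.

```python
import bisect
from collections import Counter

record_table = {}         # record id -> record        (assumed "Record Table")
index_keys = []           # sorted "term\x00record_id" (assumed "Index Table")
field_counts = Counter()  # field name -> record count (assumed statistics)


def write(record_id, record):
    """Store a new record, index each field value, and update statistics."""
    record_table[record_id] = record
    for field_name, value in record.items():
        field_counts[field_name] += 1
        key = f"{str(value).lower()}\x00{record_id}"
        pos = bisect.bisect_left(index_keys, key)
        if pos == len(index_keys) or index_keys[pos] != key:
            index_keys.insert(pos, key)


def lookup(term):
    """Range-scan the index for one term, then fetch the matching records."""
    term = term.lower()
    lo = bisect.bisect_left(index_keys, term + "\x00")
    hi = bisect.bisect_left(index_keys, term + "\x01")
    return [record_table[key.split("\x00", 1)[1]] for key in index_keys[lo:hi]]


write("employees/1", {"employeeId": "A18923", "firstName": "Jared"})
write("events/1", {"id": "a18923", "eventType": "login"})

matches = lookup("A18923")
print(matches)
print(field_counts["eventType"])
```

A search becomes one short range scan over the index followed by low-latency point reads of the record table, which matches the two access patterns labeled in the diagram.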

Discussion
