One Large Data Lake, Hold the Hype

One Large Data Lake, Hold the Hype. Rocky Mountain DataCon 2016. Jared Winick, Senior Data/Solutions Engineer, Koverse

Transcript of One Large Data Lake, Hold the Hype

Page 1: One Large Data Lake, Hold the Hype

One Large Data Lake, Hold the Hype

Rocky Mountain DataCon 2016

Jared Winick
Senior Data/Solutions Engineer, Koverse

Page 2: One Large Data Lake, Hold the Hype

Outline

• Issues with the usage of “Data Lake”
• Defining Key Characteristics
• A Data Lake Implementation Example
• Discussion

Page 3: One Large Data Lake, Hold the Hype

Page 4: One Large Data Lake, Hold the Hype

Just because “Data Lake” is overused, misused, and abused doesn’t mean the concept is wrong.

Page 5: One Large Data Lake, Hold the Hype

The Concept of a Data Lake

We all can agree that a Data Lake is a centralized (at least logically) repository for all forms of data within an organization.

Page 6: One Large Data Lake, Hold the Hype

https://www.wired.com/2013/04/desktop-cluttered-help/

Page 7: One Large Data Lake, Hold the Hype

The Concept of a Data Lake

…but there must be more to it than putting all your data in HDFS or S3.

Page 8: One Large Data Lake, Hold the Hype

Defining the Key Characteristics

1. Indexing and search across all data
2. Interactive access for all users in the enterprise
3. Multi-level access control
4. Integration with data science tools
5. Abstractions

A Data Lake has a platform-application duality

Page 9: One Large Data Lake, Hold the Hype

Indexing / Search Across All Data

• A Data Lake is often an entry point for data
• It may lack structure or “correctness”
• Search enables you to validate and explore your data

Page 10: One Large Data Lake, Hold the Hype

Indexing / Search Across All Data

Search: A18923

Employees Data Set
{ employeeId: “A18923”, email: “[email protected]”, firstName: “Jared”, … }

Network Events Data Set
{ id: “a18923”, eventType: “login”, time: 1478557775010, … }

Find data across data sets. Understand its format and structure.
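The search on this slide can be illustrated with a minimal sketch. It assumes two small in-memory data sets that mirror the slide's records (the email field is left out because its value is not recoverable from the slide), and the `search` function and `DATA_SETS` name are hypothetical, not part of any Koverse API. Matching ignores case, so one term finds both “A18923” and “a18923”, and each hit reports the data set and field where the value was found.

```python
# Hypothetical stand-ins for the two data sets shown on the slide.
DATA_SETS = {
    "Employees": [
        {"employeeId": "A18923", "firstName": "Jared"},
    ],
    "Network Events": [
        {"id": "a18923", "eventType": "login", "time": 1478557775010},
    ],
}


def search(data_sets, term):
    """Return (data set, field, record) for every value equal to term, ignoring case."""
    needle = term.lower()
    hits = []
    for name, records in data_sets.items():
        for record in records:
            for field_name, value in record.items():
                if str(value).lower() == needle:
                    hits.append((name, field_name, record))
    return hits


hits = search(DATA_SETS, "A18923")
for name, field_name, record in hits:
    # Listing the field names helps a user understand each record's structure.
    print(f"{name}: matched {field_name!r}; fields = {sorted(record)}")
```

This linear scan only shows the behavior a user sees. A real Data Lake would answer the same query from an index rather than by reading every record.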

Page 11: One Large Data Lake, Hold the Hype

Interactive Access for Everyone

• A Data Lake is strategic and should serve many different types of users.
• Should have self-service features.
• Adds up to needing to support interactive, multi-user load.

Page 12: One Large Data Lake, Hold the Hype

Multi-Level Access Control

• Every organization has data access control requirements these days.
• Different levels of granularity for different environments/use cases.
  – Data Set Level
  – Column/Field Level
  – Row/Record Level
• Far easier to engineer up front than add on later.

Page 13: One Large Data Lake, Hold the Hype

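Multi-Level Access Control

[Figure: an example table with columns Name, DeptId, DOB, and email, annotated to show access control at the Data Set, Column, and Row levels]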
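The three levels of access control can be sketched in a few lines. This is an illustration under stated assumptions, not how Koverse or Accumulo implement it: the `DataSet`, `Row`, and `read` names, the group names, and the sample rows are all hypothetical. Only the column names (Name, DeptId, DOB, email) come from the slide.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class Row:
    values: Dict[str, object]
    required_group: Optional[str] = None  # row/record level; None means unrestricted


@dataclass
class DataSet:
    name: str
    reader_groups: Set[str]        # data set level: groups allowed to read at all
    column_groups: Dict[str, str]  # column/field level: column -> required group
    rows: List[Row]


def read(data_set: DataSet, user_groups: Set[str]) -> List[Dict[str, object]]:
    """Return only the rows and columns that the given groups may see."""
    # Data set level check.
    if not (data_set.reader_groups & user_groups):
        raise PermissionError(f"no access to data set {data_set.name!r}")
    visible = []
    for row in data_set.rows:
        # Row level check.
        if row.required_group is not None and row.required_group not in user_groups:
            continue
        # Column level check.
        visible.append({
            column: value
            for column, value in row.values.items()
            if data_set.column_groups.get(column) is None
            or data_set.column_groups[column] in user_groups
        })
    return visible


# Hypothetical sample data using the column names from the slide.
employees = DataSet(
    name="Employees",
    reader_groups={"staff", "hr"},
    column_groups={"DOB": "hr", "email": "hr"},
    rows=[
        Row({"Name": "Alice", "DeptId": 7, "DOB": "1980-01-01", "email": "alice@example.com"}),
        Row({"Name": "Bob", "DeptId": 9, "DOB": "1975-05-05", "email": "bob@example.com"},
            required_group="hr"),
    ],
)

staff_view = read(employees, {"staff"})
hr_view = read(employees, {"hr"})
print(staff_view)
print(hr_view)
```

Putting all three checks on the single read path is one way to reflect the slide's point that access control is easier to engineer up front than to add later.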

Page 14: One Large Data Lake, Hold the Hype

Integration With Data Science Tools

• The ultimate point of a Data Lake is to “monetize” data
  – For a corporation this is making or saving money
  – For a government this is better serving your citizens
  – For a research organization this is solving new problems/answering previously unknown questions
• Need to be able to analyze and transform data sets into new data sets
• From BI queries to text analytics to machine learning.

Page 15: One Large Data Lake, Hold the Hype

Integration With Data Science Tools

A Data Lake needs to support multiple internal analytic “customers” within the organization.
• SQL / BI tools for Data Analysts
• Spark for Data Engineers
• Notebooks and ML libraries for Data Scientists

Page 16: One Large Data Lake, Hold the Hype

Abstractions

• Provide a level of abstraction over your data
  – Data Sets / Collections
  – Records / Rows
  – Transformations
• Enables a consistent API for interacting with any data regardless of its shape, size, and content
  – Reusability
  – Increased development speed
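A minimal sketch of the three abstractions follows, assuming records are plain dictionaries and a transformation is any function from records to records. The `DataSet` class and the `run_transformation` and `count_by` helpers are hypothetical names chosen for illustration, and the sample events are made up. The point is that one reusable transformation works on any data set, whatever its fields.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Iterable, List

Record = Dict[str, Any]
Transformation = Callable[[Iterable[Record]], Iterable[Record]]


@dataclass
class DataSet:
    name: str
    records: List[Record] = field(default_factory=list)


def run_transformation(transformation: Transformation, source: DataSet, new_name: str) -> DataSet:
    """Derive a new data set from an existing one; the source is left unchanged."""
    return DataSet(new_name, list(transformation(source.records)))


def count_by(field_name: str) -> Transformation:
    """Build a reusable transformation that counts records per value of one field."""
    def transform(records: Iterable[Record]) -> List[Record]:
        counts: Dict[Any, int] = {}
        for record in records:
            key = record.get(field_name)
            counts[key] = counts.get(key, 0) + 1
        return [{field_name: key, "count": count} for key, count in counts.items()]
    return transform


# Hypothetical sample data.
events = DataSet("Network Events", [
    {"id": "a18923", "eventType": "login"},
    {"id": "a18923", "eventType": "logout"},
    {"id": "b20001", "eventType": "login"},
])

by_type = run_transformation(count_by("eventType"), events, "Events By Type")
by_user = run_transformation(count_by("id"), events, "Events By User")
print(by_type.records)
print(by_user.records)
```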

Page 17: One Large Data Lake, Hold the Hype

The Koverse Data Lake

Page 18: One Large Data Lake, Hold the Hype

Architecture – High Level

[Diagram: architecture stack showing HDFS, Zookeeper, Accumulo, Spark, and Koverse]

A distributed key/value store like Apache Accumulo enables storage of very large volumes of data while maintaining low latency access.

Page 19: One Large Data Lake, Hold the Hype

Architecture – Distributed Key/Value Store Benefits

These benefits apply to Apache Accumulo, but also likely to Apache HBase, Cassandra and other similar systems

1. Easily scale to trillions of key/values
2. Distributed storage
   1. Parallel processing in Hadoop MapReduce or Spark
   2. Fault tolerance
3. Millisecond read latencies with efficient scanning of ranges
4. Fine grained access control features
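Efficient range scans follow from keeping keys in sorted order, as systems like Accumulo do. The sketch below is a single-process toy that assumes string keys and uses Python's `bisect` module. The `SortedKeyValueStore` class and the key layout are hypothetical, and apart from the first timestamp (taken from the earlier slide) the sample values are made up.

```python
import bisect


class SortedKeyValueStore:
    """Toy in-memory store that keeps its keys sorted to support range scans."""

    def __init__(self):
        self._keys = []     # sorted list of keys
        self._values = {}   # key -> value

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def scan(self, start, end):
        """Yield (key, value) pairs with start <= key < end, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        for key in self._keys[lo:hi]:
            yield key, self._values[key]


store = SortedKeyValueStore()
store.put("login/b20001/1478557776000", "office")
store.put("login/a18923/1478557780000", "vpn")
store.put("login/a18923/1478557775010", "office")

# All logins for one user share a key prefix, so they sit next to each other
# in sorted order and one scan over that prefix returns them.
prefix = "login/a18923/"
results = list(store.scan(prefix, prefix + "\uffff"))
print(results)
```

Because the scan locates its start and end by binary search, its cost depends on the size of the range returned rather than on the total number of keys. That is the idea behind the read latency claim, although a distributed store adds much more machinery.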

Page 20: One Large Data Lake, Hold the Hype

Architecture - Details

[Diagram: Accumulo (Record Table, Index Table, Statistics/Aggregations Table); Koverse (low latency R/W); Spark (efficient range scans); Apps (users/apps use REST)]
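The slide names three tables but does not describe how they are populated. As a rough sketch under that caveat, the code below assumes that every ingested record is written to a record table, that each of its values is added to an index table, and that a statistics table keeps a per-field count. All names and the layout are hypothetical, and plain dictionaries stand in for Accumulo tables.

```python
from collections import defaultdict

record_table = {}                # (data set, record id) -> record
index_table = defaultdict(set)   # lowercased value -> keys into record_table
stats_table = defaultdict(int)   # (data set, field) -> number of values seen


def ingest(data_set, record_id, record):
    """Write one record and update the index and statistics alongside it."""
    key = (data_set, record_id)
    record_table[key] = record
    for field_name, value in record.items():
        index_table[str(value).lower()].add(key)
        stats_table[(data_set, field_name)] += 1


def lookup(term):
    """Resolve a search term through the index, then fetch the matching records."""
    keys = sorted(index_table.get(term.lower(), ()))
    return [(key[0], record_table[key]) for key in keys]


# Hypothetical record ids; field values follow the earlier search slide.
ingest("Employees", "r1", {"employeeId": "A18923", "firstName": "Jared"})
ingest("Network Events", "e1", {"id": "a18923", "eventType": "login", "time": 1478557775010})

matches = lookup("A18923")
print(matches)
print(dict(stats_table))
```

Doing the index and statistics work at write time is what lets a search be a direct lookup instead of a scan over every record.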

Page 21: One Large Data Lake, Hold the Hype

Discussion