One Large Data Lake, Hold the Hype

Post on 06-Apr-2017


Rocky Mountain DataCon 2016

Jared Winick
Senior Data/Solutions Engineer, Koverse

2

Outline

• Issues with the usage of “Data Lake”
• Defining Key Characteristics
• A Data Lake Implementation Example
• Discussion

3

4

Just because “Data Lake” is overused, misused, and abused doesn’t mean the concept is wrong.

5

The Concept of a Data Lake

We all can agree that a Data Lake is a centralized (at least logically) repository for all forms of data within an organization.

6

https://www.wired.com/2013/04/desktop-cluttered-help/

7

The Concept of a Data Lake

…but there must be more to it than putting all your data in HDFS or S3.

8

Defining the Key Characteristics

1. Indexing and search across all data
2. Interactive access for all users in the enterprise
3. Multi-level access control
4. Integration with data science tools
5. Abstractions

A Data Lake has a platform-application duality

9

Indexing / Search Across All Data

• A Data Lake is often an entry point for data
• It may lack structure or “correctness”
• Search enables you to validate and explore your data

10

Indexing / Search Across All Data

Search: A18923

Employees Data Set
{ employeeId: “A18923”, email: “jaredwinick@koverse.com”, firstName: “Jared”, …}

Network Events Data Set
{ id: “a18923”, eventType: “login”, time: 1478557775010 …}

Find data across data sets. Understand its format and structure.
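
The search shown above can be sketched as a small inverted index built over every field of every data set. This is only an illustration of the idea, not how Koverse implements it: the function names, the tokenizer, and the in-memory index are all assumptions, and the sample records keep only the fields visible on the slide. Terms are lowercased so that “A18923” also finds “a18923”.

```python
import re
from collections import defaultdict

# Sample records modeled on the slide; elided ("…") fields are left out.
DATA_SETS = {
    "Employees": [
        {"employeeId": "A18923", "email": "jaredwinick@koverse.com", "firstName": "Jared"},
    ],
    "Network Events": [
        {"id": "a18923", "eventType": "login", "time": 1478557775010},
    ],
}

def tokenize(value):
    """Lowercase a value and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", str(value).lower())

def build_index(data_sets):
    """Map each term to the (data set, record position, field) places it occurs."""
    index = defaultdict(set)
    for name, records in data_sets.items():
        for pos, record in enumerate(records):
            for field, value in record.items():
                for term in tokenize(value):
                    index[term].add((name, pos, field))
    return index

def search(index, query):
    """Return sorted hits whose field contains every term of the query."""
    terms = tokenize(query)
    if not terms:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in terms))
    return sorted(hits)

INDEX = build_index(DATA_SETS)
for data_set, pos, field in search(INDEX, "A18923"):
    # Shows which data set and field matched, plus the full record,
    # which is what lets a user learn the data's format and structure.
    print(data_set, field, DATA_SETS[data_set][pos])
```

Because the index records the data set and field of each hit, one query both locates the identifier in two unrelated data sets and reveals that it is stored under different field names and in different letter case.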

11

Interactive Access for Everyone

• A Data Lake is strategic and should serve many different types of users.
• Should have self-service features.
• Together, this means supporting interactive, multi-user load.

12

Multi-Level Access Control

• Every organization has data access control requirements these days.
• Different levels of granularity for different environments/use cases.
  – Data Set Level
  – Column/Field Level
  – Row/Record Level
• Far easier to engineer up front than add on later.

13

Multi-Level Access Control

[Figure: a table with columns Name, DeptId, DOB, and email, annotated to show access control at the Data Set, Column, and Row levels]
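
The three levels of access control can be sketched as three checks applied in order on every read. This is a minimal sketch under stated assumptions: the `User` and `DataSetPolicy` classes, the group names, the `_visibility` field, and the sample people are all hypothetical and are not taken from the talk or from any product. Only the column names come from the slide.

```python
from dataclasses import dataclass, field

@dataclass
class User:
    name: str
    groups: set

@dataclass
class DataSetPolicy:
    # Data set level: groups allowed to read the data set at all.
    readers: set
    # Column/field level: field name -> groups allowed to see that field.
    # Fields not listed here are visible to every reader of the data set.
    column_readers: dict = field(default_factory=dict)
    # Row/record level: field that names the group a record is restricted to.
    row_label_field: str = "_visibility"

def read(user, policy, records):
    """Return only the records and fields this user may see."""
    if not (user.groups & policy.readers):
        raise PermissionError(f"{user.name} may not read this data set")
    visible = []
    for record in records:
        label = record.get(policy.row_label_field)
        if label is not None and label not in user.groups:
            continue  # row level: skip records restricted to other groups
        visible.append({
            k: v for k, v in record.items()
            if k != policy.row_label_field
            and (k not in policy.column_readers
                 or user.groups & policy.column_readers[k])
        })
    return visible

# Hypothetical data using the column names from the slide.
records = [
    {"Name": "Ann", "DeptId": 10, "DOB": "1980-01-01", "email": "ann@example.com"},
    {"Name": "Bob", "DeptId": 20, "DOB": "1975-05-05", "email": "bob@example.com",
     "_visibility": "hr"},
]
policy = DataSetPolicy(readers={"staff", "hr"}, column_readers={"DOB": {"hr"}})

analyst = User("analyst", {"staff"})
hr_user = User("hr_user", {"hr"})
print(read(analyst, policy, records))  # no DOB column, no restricted row
print(read(hr_user, policy, records))  # both rows, all columns
```

Putting all three checks in the single read path is one way to act on the point that access control is easier to engineer up front: every consumer goes through the same function, so no caller can skip a level.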

14

Integration With Data Science Tools

• The ultimate point of a Data Lake is to “monetize” data
  – For a corporation this is making or saving money
  – For a government this is better serving your citizens
  – For a research organization this is solving new problems/answering previously unknown questions
• Need to be able to analyze and transform data sets into new data sets
• From BI queries to text analytics to machine learning.

15

Integration With Data Science Tools

A Data Lake needs to support multiple internal analytic “customers” within the organization.

• SQL / BI tools for Data Analysts
• Spark for Data Engineers
• Notebooks and ML libraries for Data Scientists

16

Abstractions

• Provide a level of abstraction over your data
  – Data Sets / Collections
  – Records / Rows
  – Transformations
• Enables a consistent API for interacting with any data regardless of its shape, size, and content
  – Reusability
  – Increased development speed
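
One way to picture these abstractions is as three small types: a record, a data set that holds records, and a transformation that turns one data set into another. The sketch below is an assumption made for illustration. The class names, the `count_by` helper, and the sample events (including the id "b20011") are hypothetical and do not reflect the Koverse API.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Iterable, List

# A record/row is simply a mapping of field name to value, whatever its shape.
Record = Dict[str, Any]

@dataclass
class DataSet:
    name: str
    records: List[Record] = field(default_factory=list)

@dataclass
class Transformation:
    """Turns the records of one data set into the records of a new one."""
    name: str
    fn: Callable[[Iterable[Record]], Iterable[Record]]

    def apply(self, source: DataSet, output_name: str) -> DataSet:
        return DataSet(output_name, list(self.fn(source.records)))

def count_by(field_name: str) -> Transformation:
    """Reusable transformation: count records per distinct value of a field."""
    def fn(records):
        counts = {}
        for record in records:
            key = record.get(field_name)
            counts[key] = counts.get(key, 0) + 1
        return [{field_name: k, "count": c} for k, c in counts.items()]
    return Transformation(f"count_by_{field_name}", fn)

# Hypothetical sample data.
events = DataSet("Network Events", [
    {"id": "a18923", "eventType": "login"},
    {"id": "b20011", "eventType": "login"},
    {"id": "a18923", "eventType": "logout"},
])
summary = count_by("eventType").apply(events, "Event Counts")
print(summary.name, summary.records)
```

The same `count_by` transformation works on any data set with any field, which is where the reusability and development speed come from: the output is itself a `DataSet`, so transformations can be chained.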

17

The Koverse Data Lake

18

Architecture – High Level

[Diagram: architecture components HDFS, Zookeeper, Accumulo, Spark, and Koverse]

A distributed key/value store like Apache Accumulo enables storage of very large volumes of data while maintaining low latency access.

19

Architecture – Distributed Key/Value Store Benefits

These benefits apply to Apache Accumulo, but also likely to Apache HBase, Cassandra and other similar systems

1. Easily scale to trillions of key/values
2. Distributed storage
   1. Parallel processing in Hadoop MapReduce or Spark
   2. Fault tolerance
3. Millisecond read latencies with efficient scanning of ranges
4. Fine-grained access control features
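
The range-scanning benefit comes from keeping keys sorted, so that related entries sit next to each other and a scan only touches the slice it needs. The toy, single-process store below is a sketch of that idea only. It is not the Accumulo API, and the class name, key layout, and sample entries are assumptions made for illustration.

```python
import bisect

class SortedKeyValueStore:
    """Toy in-memory stand-in for a sorted, distributed key/value store."""

    def __init__(self):
        self._keys = []    # kept sorted so that key ranges are contiguous
        self._values = {}

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, key):
        return self._values.get(key)

    def scan(self, start, stop):
        """Yield (key, value) pairs with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, self._values[key]

    def scan_prefix(self, prefix):
        """Scan keys that begin with prefix (assumes no key character above U+FFFF)."""
        return self.scan(prefix, prefix + "\uffff")

store = SortedKeyValueStore()
# Hypothetical key layout: <data set>/<id>/<timestamp>
store.put("events/a18923/1478557775010", "login")
store.put("events/b20011/1478557776000", "login")
store.put("events/a18923/1478557779000", "logout")
store.put("employees/A18923", "Jared")

for key, value in store.scan_prefix("events/a18923/"):
    print(key, value)
```

Choosing the key layout is the main design decision: putting the id before the timestamp makes “all events for one id, in time order” a single contiguous scan found by binary search, rather than a filter over every entry.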

20

Architecture - Details

[Diagram: Accumulo (Record Table, Index Table, Statistics/Aggregations Table); Koverse (low latency R/W); Spark (efficient range scans); Apps (users/apps use REST)]

21

Discussion