One Large Data Lake, Hold the Hype
Rocky Mountain DataCon 2016
Jared Winick
Senior Data/Solutions Engineer, Koverse
2
Outline
• Issues with the usage of “Data Lake”
• Defining Key Characteristics
• A Data Lake Implementation Example
• Discussion
3
4
Just because “Data Lake” is overused, misused, abused
doesn’t mean the concept is wrong
5
The Concept of a Data Lake
We all can agree that a Data Lake is a centralized (at least logically) repository for all forms of data within an organization.
6
https://www.wired.com/2013/04/desktop-cluttered-help/
7
The Concept of a Data Lake
…but there must be more to it than putting all your data in HDFS or S3.
8
Defining the Key Characteristics
1. Indexing and search across all data
2. Interactive access for all users in the enterprise
3. Multi-level access control
4. Integration with data science tools
5. Abstractions
A Data Lake has a platform-application duality
9
Indexing / Search Across All Data
• A Data Lake is often an entry point for data
• It may lack structure or “correctness”
• Search enables you to validate and explore your data
10
Indexing / Search Across All Data
Search: A18923
Employees Data Set
{ employeeId: “A18923”, email: “[email protected]”, firstName: “Jared”, … }
Network Events Data Set
{ id: “a18923”, eventType: “login”, time: 1478557775010 … }
Find data across data sets. Understand its format and structure.
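A minimal sketch of how one search term can surface records from several data sets. It assumes a toy in-memory inverted index with case-insensitive terms; the `SearchIndex` class and `tokenize` helper are hypothetical names for illustration, not a Koverse API. The field values come from the slide, with the email field left out.

```python
from collections import defaultdict


def tokenize(value):
    """Lowercase a field value so that 'A18923' and 'a18923' match."""
    return str(value).lower().split()


class SearchIndex:
    """Toy inverted index (hypothetical): term -> list of (data set, record)."""

    def __init__(self):
        self._postings = defaultdict(list)

    def add(self, data_set, record):
        seen = set()  # index each term at most once per record
        for value in record.values():
            for term in tokenize(value):
                if term not in seen:
                    seen.add(term)
                    self._postings[term].append((data_set, record))

    def search(self, query):
        return list(self._postings.get(query.strip().lower(), []))


index = SearchIndex()
index.add("Employees", {"employeeId": "A18923", "firstName": "Jared"})
index.add("Network Events",
          {"id": "a18923", "eventType": "login", "time": 1478557775010})

hits = index.search("A18923")
for data_set, record in hits:
    print(data_set, record)
```

Because every field value is indexed regardless of field name, the same identifier is found whether a data set calls it `employeeId` or `id`, which is what makes the search useful before the data has been cleaned up.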
11
Interactive Access for Everyone
• A Data Lake is strategic and should serve many different types of users.
• Should have self-service features.
• Adds up to needing to support interactive, multi-user load.
12
Multi-Level Access Control
• Every organization has data access control requirements these days.
• Different levels of granularity for different environments/use cases.
  – Data Set Level
  – Column/Field Level
  – Row/Record Level
• Far easier to engineer up front than add on later.
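The three levels of granularity can be sketched as three checks applied in order on every read. This is an illustration under stated assumptions, not how Koverse or Accumulo implement access control: the `Policy` class, the `read` function, the `_label` field, and the group names and sample records are all made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Policy:
    """Hypothetical policy for one data set (not a Koverse/Accumulo API)."""
    allowed_groups: set                                # data set level
    column_groups: dict = field(default_factory=dict)  # column -> groups that may see it
    row_label_field: str = "_label"                    # row level: group required per record


def read(records, policy, user_groups):
    # Data set level: the user needs at least one allowed group.
    if not (policy.allowed_groups & user_groups):
        raise PermissionError("no access to this data set")
    visible = []
    for record in records:
        # Row/record level: skip records labeled with a group the user lacks.
        label = record.get(policy.row_label_field)
        if label is not None and label not in user_groups:
            continue
        # Column/field level: drop fields restricted to other groups.
        row = {}
        for name, value in record.items():
            if name == policy.row_label_field:
                continue
            required = policy.column_groups.get(name)
            if required is None or required & user_groups:
                row[name] = value
        visible.append(row)
    return visible


# Made-up sample data for illustration only.
records = [
    {"Name": "Ann", "DeptId": 10, "DOB": "1980-01-01"},
    {"Name": "Bob", "DeptId": 20, "DOB": "1975-06-30", "_label": "hr"},
]
policy = Policy(allowed_groups={"staff", "hr"}, column_groups={"DOB": {"hr"}})

staff_view = read(records, policy, {"staff"})
hr_view = read(records, policy, {"hr"})
print(staff_view)
print(hr_view)
```

Putting the checks inside the single read path is one way to act on the "engineer up front" point: every consumer goes through the same filter instead of each application re-implementing it.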
13
Multi-Level Access Control
[Figure: a table with columns Name, DeptId, DOB, email, annotated to show the Data Set, Column, and Row levels of access control]
14
Integration With Data Science Tools
• The ultimate point of a Data Lake is to “monetize” data
  – For a corporation this is making or saving money
  – For a government this is better serving your citizens
  – For a research organization this is solving new problems/answering previously unknown questions
• Need to be able to analyze and transform data sets into new data sets
• From BI queries to text analytics to machine learning.
15
Integration With Data Science Tools
A Data Lake needs to support multiple internal analytic “customers” within the organization.
• SQL / BI tools for Data Analysts
• Spark for Data Engineers
• Notebooks and ML libraries for Data Scientists
16
Abstractions
• Provide a level of abstraction over your data
  – Data Sets / Collections
  – Records / Rows
  – Transformations
• Enables a consistent API for interacting with any data regardless of its shape, size, and content
  – Reusability
  – Increased development speed
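A rough sketch of what these three abstractions could look like in code. The `DataSet`, `Record`, and `Transformation` names mirror the bullets above, but the classes themselves are hypothetical and are not the Koverse API; the "logout" event is invented for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, Iterator, List

# A record is just a mapping of field names to values, whatever its shape.
Record = Dict[str, object]


@dataclass
class DataSet:
    """A named collection of records (hypothetical abstraction)."""
    name: str
    records: List[Record] = field(default_factory=list)

    def __iter__(self) -> Iterator[Record]:
        return iter(self.records)


@dataclass
class Transformation:
    """Turns the records of one data set into a new data set."""
    name: str
    fn: Callable[[Iterable[Record]], Iterable[Record]]

    def run(self, source: DataSet, output_name: str) -> DataSet:
        return DataSet(output_name, list(self.fn(source)))


def logins_only(records):
    """Keep only login events."""
    for r in records:
        if r.get("eventType") == "login":
            yield r


events = DataSet("Network Events", [
    {"id": "a18923", "eventType": "login"},
    {"id": "a18923", "eventType": "logout"},  # invented for the example
])
logins = Transformation("logins_only", logins_only).run(events, "Logins")
print(logins.name, len(logins.records))
```

Since a transformation only sees an iterable of records, the same `logins_only` function can be reused on any data set that has an `eventType` field, which is where the reusability and development speed come from.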
17
The Koverse Data Lake
18
Architecture – High Level
[Diagram components: HDFS, Zookeeper, Accumulo, Spark, Koverse]
A distributed key/value store like Apache Accumulo enables storage of very large volumes of data while maintaining low latency access.
19
Architecture – Distributed Key/Value Store Benefits
These benefits apply to Apache Accumulo, and likely also to Apache HBase, Cassandra, and other similar systems.
1. Easily scale to trillions of key/values
2. Distributed storage
   1. Parallel processing in Hadoop MapReduce or Spark
   2. Fault tolerance
3. Millisecond read latencies with efficient scanning of ranges
4. Fine-grained access control features
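Efficient range scans follow from keeping keys sorted: a scan locates the start of the range and then reads contiguous entries. The sketch below imitates that with an in-memory sorted list and Python's `bisect` module. It is a stand-in for the idea only, not the Accumulo client API; the `SortedKV` class, the `id|timestamp` key layout, and all sample rows other than the first are assumptions made for illustration.

```python
import bisect


class SortedKV:
    """In-memory stand-in for a sorted key/value store (hypothetical)."""

    def __init__(self):
        self._keys = []    # kept in sorted order
        self._values = {}

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def scan(self, start, stop):
        """Yield (key, value) for start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, self._values[key]


def prefix_range(prefix):
    """Range covering keys that start with prefix (toy upper bound)."""
    return prefix, prefix + "\uffff"


store = SortedKV()
# Assumed key layout: "<id>|<timestamp in ms>", fixed-width timestamps.
store.put("a18923|1478557775010", "login")
store.put("a18923|1478557780000", "logout")  # invented row
store.put("b00001|1478557775999", "login")   # invented row

rows = list(store.scan(*prefix_range("a18923|")))
print(rows)
```

Locating the range costs two binary searches no matter how many keys the store holds, and the rows for one id come back already ordered by time because the timestamp is part of the key.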
20
Architecture - Details
[Diagram:
Accumulo tables: Record Table, Index Table, Statistics/Aggregations Table
Koverse: low latency R/W
Spark: efficient range scans
Users/Apps use REST]
21
Discussion