The Central Hub: Defining the Data Lake
Transcript of The Central Hub: Defining the Data Lake
![Page 1: The Central Hub: Defining the Data Lake](https://reader033.fdocuments.in/reader033/viewer/2022042907/5870f76c1a28ab5f528b4e53/html5/thumbnails/1.jpg)
Grab some coffee and enjoy the pre-show banter before the top of the hour!
![Page 2: The Central Hub: Defining the Data Lake](https://reader033.fdocuments.in/reader033/viewer/2022042907/5870f76c1a28ab5f528b4e53/html5/thumbnails/2.jpg)
The Data Lake Survival Guide Exploratory Webcast | October 26, 2016
SPONSORED BY
![Page 3: The Central Hub: Defining the Data Lake](https://reader033.fdocuments.in/reader033/viewer/2022042907/5870f76c1a28ab5f528b4e53/html5/thumbnails/3.jpg)
Presenting
Robin Bloor, Chief Analyst, The Bloor Group, @robinbloor
Host: Eric Kavanagh, CEO, The Bloor Group, @eric_kavanagh
Dez Blanchfield, Data Scientist, The Bloor Group, @dez_blanchfield
![Page 4: The Central Hub: Defining the Data Lake](https://reader033.fdocuments.in/reader033/viewer/2022042907/5870f76c1a28ab5f528b4e53/html5/thumbnails/4.jpg)
Data Lake Survival Guide
Exploratory Webcast: October 26, 2016
Roundtable Webcast: December 8, 2016
Findings Webcast: January 12, 2017
![Page 5: The Central Hub: Defining the Data Lake](https://reader033.fdocuments.in/reader033/viewer/2022042907/5870f76c1a28ab5f528b4e53/html5/thumbnails/5.jpg)
Data Lake Survival
Robin Bloor, PhD
![Page 6: The Central Hub: Defining the Data Lake](https://reader033.fdocuments.in/reader033/viewer/2022042907/5870f76c1a28ab5f528b4e53/html5/thumbnails/6.jpg)
The Sequence of Topics
1 Disturbance in the Force
2 What is a Data Lake, exactly?
3 Streams and Events
![Page 7: The Central Hub: Defining the Data Lake](https://reader033.fdocuments.in/reader033/viewer/2022042907/5870f76c1a28ab5f528b4e53/html5/thumbnails/7.jpg)
1
Disturbance in the Force
![Page 8: The Central Hub: Defining the Data Lake](https://reader033.fdocuments.in/reader033/viewer/2022042907/5870f76c1a28ab5f528b4e53/html5/thumbnails/8.jpg)
The Generic Dimensions of IT
• All IT involves (only) four components:
  • Users
  • Software
  • Data
  • Hardware
• They all relate to each other
• Change any one of these and the other three components have to adjust
• Aggregate these and you get a process
• Time will impose change anyway
• We can also consider:
  • Staff
  • Business Processes
  • Business Information
  • Facility
• And also:
  • People
  • Information
  • Human Activity
  • Civilization (Stuff)
[Figure: Four Fundamental (IT) Factors. Labels: Hardware, Users, Software, Data; Business Information, Business Process, Staff, Facility; All Information, Human Activity, People, Civilization; TIME]
![Page 9: The Central Hub: Defining the Data Lake](https://reader033.fdocuments.in/reader033/viewer/2022042907/5870f76c1a28ab5f528b4e53/html5/thumbnails/9.jpg)
The Technology Layers
§ The buying impulse descends through the stack
§ The impact of technology change rises up the stack
§ This ensures the eventual “legacification” of all technology
[Figure: The Technology Layers. The buying impulse goes down; technology change rises up.]
![Page 10: The Central Hub: Defining the Data Lake](https://reader033.fdocuments.in/reader033/viewer/2022042907/5870f76c1a28ab5f528b4e53/html5/thumbnails/10.jpg)
Disruption in the Technology Layers
§ Disruption (as innovation) can happen in any layer
§ Where it occurs it will impact all layers above it
§ And it may also impact the layers below it (but less quickly)
§ There is no such thing as future-proof, but some technologies definitely live longer
[Figure: The Technology Layers. The buying impulse goes down; technology change rises up.]
![Page 11: The Central Hub: Defining the Data Lake](https://reader033.fdocuments.in/reader033/viewer/2022042907/5870f76c1a28ab5f528b4e53/html5/thumbnails/11.jpg)
Tech Revolutions
§ Mainframe Computer (Batch architecture)
§ On-line Interaction (Centralized architecture)
§ PC (Client Server)
§ Internet (Multi-tier architecture)
§ Mobile (Service Oriented architecture)
§ Internet of Things (Event Driven Architecture)
Note that all of these disruptive changes were driven by hardware innovation.
[Figure: Cloud, Centralized Computer Systems, PC Based Systems, Integrated Systems]
§ Centralized Computer Systems: limited process power; terminals only; few applications; no external data sources
§ PC Based Systems: moderate process power; PCs; spreadsheets & email; many applications; few external data sources
§ Integrated Systems: extensive process power; PCs & apps; analytics capability; wealth of applications; many external data sources
![Page 12: The Central Hub: Defining the Data Lake](https://reader033.fdocuments.in/reader033/viewer/2022042907/5870f76c1a28ab5f528b4e53/html5/thumbnails/12.jpg)
Parallelism: The Imp Out of the Bottle
• Multicore chips enabled parallelism
• It has changed the whole performance equation
• It enabled Big Data
• Big Data is really Big Processing
![Page 13: The Central Hub: Defining the Data Lake](https://reader033.fdocuments.in/reader033/viewer/2022042907/5870f76c1a28ab5f528b4e53/html5/thumbnails/13.jpg)
The Impact of Parallelism
We used to see a 10x performance improvement every 6 years; now we see 1000x (and that’s just an approximation).
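To put those two figures on a per-year footing, here is a small calculation. It assumes steady compound growth across each six-year window, which is a simplification; the slide gives only the endpoints, and the function name is made up for illustration.

```python
def annual_factor(total_factor: float, years: float) -> float:
    """Per-year multiplier that compounds to total_factor over the given years."""
    return total_factor ** (1.0 / years)

# 10x every 6 years (the older pace quoted on the slide)
old_pace = annual_factor(10, 6)
# 1000x every 6 years (the newer pace quoted on the slide)
new_pace = annual_factor(1000, 6)

print(f"old pace: about {old_pace:.2f}x per year")
print(f"new pace: about {new_pace:.2f}x per year")
```

Under that assumption the yearly multiplier goes from roughly 1.47x to roughly 3.16x, and the new yearly factor is the cube of the old one.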
![Page 14: The Central Hub: Defining the Data Lake](https://reader033.fdocuments.in/reader033/viewer/2022042907/5870f76c1a28ab5f528b4e53/html5/thumbnails/14.jpg)
Hardware Factors
• CPUs, GPUs & FPGAs
• Cross breeding
• SoCs
• 3D XPoint and PCM (and memristor?)
• SSDs & parallel access
• Parallel hardware architectures
Performance is accelerating and costs continue to fall.
![Page 15: The Central Hub: Defining the Data Lake](https://reader033.fdocuments.in/reader033/viewer/2022042907/5870f76c1a28ab5f528b4e53/html5/thumbnails/15.jpg)
The Perfect Storm (Software)
• The triumph of Open Source as a business model
• The dominance of Apache
  • Hadoop, the platform for data
  • Spark, for speed
  • Kafka, for connectivity
• The triumph of the cloud and its dominance
• Little data is also big data
• Cost challenges
![Page 16: The Central Hub: Defining the Data Lake](https://reader033.fdocuments.in/reader033/viewer/2022042907/5870f76c1a28ab5f528b4e53/html5/thumbnails/16.jpg)
Then the Data Lake evaporated into the Cloud
2
What is a Data Lake?
![Page 17: The Central Hub: Defining the Data Lake](https://reader033.fdocuments.in/reader033/viewer/2022042907/5870f76c1a28ab5f528b4e53/html5/thumbnails/17.jpg)
Everything in flux
• Hardware (network, storage, servers)
• Data Sources
• Data Staging
• Data Volumes
• Data Flow
• Data Governance
• Data Usage
• Data Structures
• Schema definition
• Ingest Speeds
• Data Workloads
![Page 18: The Central Hub: Defining the Data Lake](https://reader033.fdocuments.in/reader033/viewer/2022042907/5870f76c1a28ab5f528b4e53/html5/thumbnails/18.jpg)
Hadoop Applications
![Page 19: The Central Hub: Defining the Data Lake](https://reader033.fdocuments.in/reader033/viewer/2022042907/5870f76c1a28ab5f528b4e53/html5/thumbnails/19.jpg)
The Scale Out Applications
§ Data Ingest & Staging
§ Data Governance
§ Software development platform
§ Analytics environment
§ Database/Data Warehouse
§ Data Archiving
§ Video rendering & other niche apps
The Data Lake involves just the first two and does not necessarily involve Hadoop
![Page 20: The Central Hub: Defining the Data Lake](https://reader033.fdocuments.in/reader033/viewer/2022042907/5870f76c1a28ab5f528b4e53/html5/thumbnails/20.jpg)
Data Lake, Refinery, Hub, in Overview
Think Logical, Implement Physical
![Page 21: The Central Hub: Defining the Data Lake](https://reader033.fdocuments.in/reader033/viewer/2022042907/5870f76c1a28ab5f528b4e53/html5/thumbnails/21.jpg)
The Data Lake Analytics Picture
[Figure labels: Data Sources; Data Streams; Staging Area (Hadoop); Round Up; Wrangling; Service Mgt; Life Cycle Mgt; MetaData Discovery; MDM; MetaData Mgt; Data Cleansing; Data Lineage; ETL; Data Warehouse or other location; Analytics]
![Page 22: The Central Hub: Defining the Data Lake](https://reader033.fdocuments.in/reader033/viewer/2022042907/5870f76c1a28ab5f528b4e53/html5/thumbnails/22.jpg)
How Data Gets to be Wrong
• Accidentally born wrong
• Deliberately born wrong
• Defective sensor/data source
• Murdered (truncated, overwritten)
• Corrupted in flight (rare)
• Corrupted by bad code (surely not!)
• Corrupted by bad DBA
![Page 23: The Central Hub: Defining the Data Lake](https://reader033.fdocuments.in/reader033/viewer/2022042907/5870f76c1a28ab5f528b4e53/html5/thumbnails/23.jpg)
Data Governance
If data governance was important before Big Data (and it was), it is far more important in the era of Data Lakes.
![Page 24: The Central Hub: Defining the Data Lake](https://reader033.fdocuments.in/reader033/viewer/2022042907/5870f76c1a28ab5f528b4e53/html5/thumbnails/24.jpg)
What Needs To Be Governed
![Page 25: The Central Hub: Defining the Data Lake](https://reader033.fdocuments.in/reader033/viewer/2022042907/5870f76c1a28ab5f528b4e53/html5/thumbnails/25.jpg)
Data Governance
Data Flows and Data Storage
Security & Access
Data cleansing and transformation
Data meaning
Data provenance and lineage
Data archive and disposal
Availability and performance
![Page 26: The Central Hub: Defining the Data Lake](https://reader033.fdocuments.in/reader033/viewer/2022042907/5870f76c1a28ab5f528b4e53/html5/thumbnails/26.jpg)
Analytics Is a Process Not an Activity
• Data Analytics is a multi-disciplinary end-to-end process
• Until recently it was a walled garden. But the walls were torn down by…
  § Data availability
  § Scalable technology
  § Open source tools
• It is now becoming an integrated process
Data Governance is a process, not an activity!!
![Page 27: The Central Hub: Defining the Data Lake](https://reader033.fdocuments.in/reader033/viewer/2022042907/5870f76c1a28ab5f528b4e53/html5/thumbnails/27.jpg)
The Global Map and Data Options
• Move the data to the processing
• Move the processing to the data
• Move the processing and the data
• Shard
All network nodes can be data creators, data stores and processing points.
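The "shard" option can be sketched as follows. This is a minimal illustration, assuming hash-based placement keyed on a record identifier; the node names, the sample records and the `shard_for` helper are all hypothetical and do not come from the webcast.

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash (same key, same shard)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# Hypothetical nodes, each acting as both a data store and a processing point.
nodes = ["node-a", "node-b", "node-c"]

# Hypothetical records: (key, value)
records = [("sensor-17", 21.5), ("sensor-42", 19.0), ("sensor-17", 22.1)]

placement = {}
for key, value in records:
    node = nodes[shard_for(key, len(nodes))]
    placement.setdefault(node, []).append((key, value))

for node, rows in sorted(placement.items()):
    print(node, rows)
```

Because placement depends only on the key, every record for a given key lands on the same node, so processing for that key can run where its data lives.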
![Page 28: The Central Hub: Defining the Data Lake](https://reader033.fdocuments.in/reader033/viewer/2022042907/5870f76c1a28ab5f528b4e53/html5/thumbnails/28.jpg)
Logical Data Lakes
Soon we will be speaking of a logical data lake and multiple physical data lakes.
![Page 29: The Central Hub: Defining the Data Lake](https://reader033.fdocuments.in/reader033/viewer/2022042907/5870f76c1a28ab5f528b4e53/html5/thumbnails/29.jpg)
3
Events and Streams
![Page 30: The Central Hub: Defining the Data Lake](https://reader033.fdocuments.in/reader033/viewer/2022042907/5870f76c1a28ab5f528b4e53/html5/thumbnails/30.jpg)
Big Data, Event Data – The Data of Everything
WHAT IS BIG DATA?
[Figure labels: Business data; Traditional data; Log file data; Operational data; Mobile data; Location data; Social network data; Public data; Commercial databases; Streaming data; Internet of Things]
![Page 31: The Central Hub: Defining the Data Lake](https://reader033.fdocuments.in/reader033/viewer/2022042907/5870f76c1a28ab5f528b4e53/html5/thumbnails/31.jpg)
Events: Atoms and Molecules
The ATOM of data has become the EVENT
A TRANSACTION is a MOLECULE of ATOMIC EVENTS
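One way to picture the atom and molecule idea in code: individual events arrive one at a time, and a transaction is the ordered group of events that share a transaction identifier. The tuple layout and the sample order below are assumptions for illustration only.

```python
from collections import defaultdict

# Each atomic event is a tuple: (transaction_id, sequence_number, description).
# The layout and the sample data are hypothetical.
events = [
    ("order-1", 2, "payment authorised"),
    ("order-2", 1, "basket created"),
    ("order-1", 1, "basket created"),
    ("order-1", 3, "stock reserved"),
]

def assemble_transactions(atomic_events):
    """Group atomic events by transaction id and order them by sequence number."""
    molecules = defaultdict(list)
    for txn_id, seq, description in atomic_events:
        molecules[txn_id].append((seq, description))
    return {txn_id: [d for _, d in sorted(parts)] for txn_id, parts in molecules.items()}

transactions = assemble_transactions(events)
for txn_id, steps in sorted(transactions.items()):
    print(txn_id, steps)
```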
![Page 32: The Central Hub: Defining the Data Lake](https://reader033.fdocuments.in/reader033/viewer/2022042907/5870f76c1a28ab5f528b4e53/html5/thumbnails/32.jpg)
It’s Become an Event-Based World
![Page 33: The Central Hub: Defining the Data Lake](https://reader033.fdocuments.in/reader033/viewer/2022042907/5870f76c1a28ab5f528b4e53/html5/thumbnails/33.jpg)
Events
Think of events as drops of water. They can live in streams, and they can also live in data pools and data lakes.
![Page 34: The Central Hub: Defining the Data Lake](https://reader033.fdocuments.in/reader033/viewer/2022042907/5870f76c1a28ab5f528b4e53/html5/thumbnails/34.jpg)
Two Data Flows
![Page 35: The Central Hub: Defining the Data Lake](https://reader033.fdocuments.in/reader033/viewer/2022042907/5870f76c1a28ab5f528b4e53/html5/thumbnails/35.jpg)
The Traffic Cop (Events)
![Page 36: The Central Hub: Defining the Data Lake](https://reader033.fdocuments.in/reader033/viewer/2022042907/5870f76c1a28ab5f528b4e53/html5/thumbnails/36.jpg)
Event Types
• Instantiation Event
• A State Report
• A Trigger Event
• A Correction Event
We also need to consider:
• Data Refinement
• Aggregations
• Homogeneous Collections
• Derived Data
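The four event types could be modelled roughly as below. The slide only names the types, so the behaviour attached to each one here (create, overwrite, flag as corrected, record a firing) is an assumed interpretation, and all class and function names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class EventType(Enum):
    INSTANTIATION = "instantiation"
    STATE_REPORT = "state_report"
    TRIGGER = "trigger"
    CORRECTION = "correction"

@dataclass(frozen=True)
class Event:
    kind: EventType
    entity: str
    value: Optional[float] = None

def apply_event(state: dict, fired: list, event: Event) -> None:
    """Update state in place according to the (assumed) meaning of each type."""
    if event.kind is EventType.INSTANTIATION:
        # A new entity comes into existence with an initial value.
        state[event.entity] = {"value": event.value, "corrected": False}
    elif event.kind is EventType.STATE_REPORT:
        # The entity reports its current value.
        state[event.entity] = {"value": event.value, "corrected": False}
    elif event.kind is EventType.CORRECTION:
        # A previously reported value is replaced and marked as corrected.
        state[event.entity] = {"value": event.value, "corrected": True}
    elif event.kind is EventType.TRIGGER:
        # Something should happen; here we only record that it fired.
        fired.append(event.entity)

state, fired = {}, []
for ev in [
    Event(EventType.INSTANTIATION, "tank-1", 10.0),
    Event(EventType.STATE_REPORT, "tank-1", 12.0),
    Event(EventType.CORRECTION, "tank-1", 11.5),
    Event(EventType.TRIGGER, "tank-1"),
]:
    apply_event(state, fired, ev)
print(state, fired)
```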
![Page 37: The Central Hub: Defining the Data Lake](https://reader033.fdocuments.in/reader033/viewer/2022042907/5870f76c1a28ab5f528b4e53/html5/thumbnails/37.jpg)
Event Based IoT Architecture
§ The pulse and the threshold alert
§ Some of this involves distributed processing
§ There are known apps and unknown apps, so analytical exploration needs to be enabled
§ Only aggregations will migrate
[Figure labels: Sensors, controllers, CPUs; Source Proc.; Depot (two); Depot Proc.; Central Hub; Central Proc.; Data]
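A rough sketch of the threshold alert and of "only aggregations will migrate": a depot inspects raw readings locally, raises alerts for values above a limit, and forwards only a summary toward the central hub. The threshold value, the sample readings and the function name are assumptions, and treating the "pulse" as a regular stream of readings is an interpretation rather than something the slide states.

```python
from statistics import mean

THRESHOLD = 80.0  # hypothetical alert limit

def depot_process(readings, threshold=THRESHOLD):
    """Return (alerts, summary): alerts stay local, the summary migrates upward."""
    if not readings:
        return [], {"count": 0}
    alerts = [r for r in readings if r > threshold]
    summary = {
        "count": len(readings),
        "min": min(readings),
        "max": max(readings),
        "mean": mean(readings),
    }
    return alerts, summary

# Hypothetical raw readings arriving at a depot from its sensors.
readings = [71.0, 74.5, 83.2, 69.8]
alerts, summary = depot_process(readings)

print("alerts:", alerts)
print("sent to central hub:", summary)
```

The raw readings never leave the depot in this sketch; the central hub sees four numbers instead of the full series.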
![Page 38: The Central Hub: Defining the Data Lake](https://reader033.fdocuments.in/reader033/viewer/2022042907/5870f76c1a28ab5f528b4e53/html5/thumbnails/38.jpg)
Events and Event Data
• Time
• Geographic location
• Virtual/logical location
• Source device
• Device ID
• Actors
• Ownership/Provenance
• Values
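The attributes listed above map naturally onto a record type. The following is a sketch only: the field names, types and the sample values are assumptions chosen to mirror the list, not a schema from the webcast.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional, Tuple

@dataclass
class EventRecord:
    time: datetime                                # when the event occurred
    geo_location: Optional[Tuple[float, float]]   # (latitude, longitude), if known
    logical_location: str                         # virtual/logical location
    source_device: str                            # kind of device that emitted it
    device_id: str                                # identifier of that device
    actors: List[str]                             # people or systems involved
    provenance: str                               # ownership/provenance
    values: Dict[str, Any]                        # the measured or reported values

    def to_json(self) -> str:
        """Serialise the record, writing the timestamp in ISO 8601 form."""
        payload = asdict(self)
        payload["time"] = self.time.isoformat()
        return json.dumps(payload, sort_keys=True)

# Hypothetical example event.
event = EventRecord(
    time=datetime(2016, 10, 26, 15, 0, tzinfo=timezone.utc),
    geo_location=(30.27, -97.74),
    logical_location="plant-3/line-2",
    source_device="thermostat",
    device_id="th-0042",
    actors=["maintenance-bot"],
    provenance="plant-3 operations",
    values={"temperature_c": 21.5},
)
print(event.to_json())
```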
![Page 39: The Central Hub: Defining the Data Lake](https://reader033.fdocuments.in/reader033/viewer/2022042907/5870f76c1a28ab5f528b4e53/html5/thumbnails/39.jpg)
Spark, Storm, Flink & Kafka
• Spark has dethroned Hadoop as a platform and has momentum, both for microbatch and streaming
• Storm provides batch and streaming (event processing capabilities) concurrently via the lambda architecture
• Flink was purpose-built for streaming
• Kafka is the pipe
• Lambda and Zeta Architectures…
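The lambda architecture mentioned above can be illustrated without any of these frameworks: a batch view is recomputed over the full history, a speed view covers only what arrived since the last batch run, and queries merge the two. This is a plain-Python sketch under that reading; the device names and helper functions are hypothetical, and a real deployment would use Spark, Storm, Flink or Kafka rather than in-memory lists.

```python
from collections import Counter

def build_view(events):
    """Count events per device; used for both the batch and the speed view."""
    return Counter(e["device"] for e in events)

# Batch layer: recomputed periodically over the full historical data set.
historical = [{"device": "pump-1"}, {"device": "pump-1"}, {"device": "pump-2"}]
batch_view = build_view(historical)

# Speed layer: covers only events that arrived after the last batch run.
recent = [{"device": "pump-1"}, {"device": "pump-3"}]
speed_view = build_view(recent)

def query(device):
    """Serving layer: merge both views at query time."""
    return batch_view[device] + speed_view[device]

for name in ("pump-1", "pump-2", "pump-3"):
    print(name, query(name))
```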
![Page 40: The Central Hub: Defining the Data Lake](https://reader033.fdocuments.in/reader033/viewer/2022042907/5870f76c1a28ab5f528b4e53/html5/thumbnails/40.jpg)
In Summary
1 Disturbance in the Force
2 What is a Data Lake, exactly?
3 Streams and Events
![Page 41: The Central Hub: Defining the Data Lake](https://reader033.fdocuments.in/reader033/viewer/2022042907/5870f76c1a28ab5f528b4e53/html5/thumbnails/41.jpg)
![Page 42: The Central Hub: Defining the Data Lake](https://reader033.fdocuments.in/reader033/viewer/2022042907/5870f76c1a28ab5f528b4e53/html5/thumbnails/42.jpg)
Questions?
![Page 43: The Central Hub: Defining the Data Lake](https://reader033.fdocuments.in/reader033/viewer/2022042907/5870f76c1a28ab5f528b4e53/html5/thumbnails/43.jpg)
THANK YOU!
FIND OUT MORE at InsideAnalysis.com