Big Data App Server by Lance Riedel, CTO, The Hive, for The Hive India event
![Page 1: Big Data App servor by Lance Riedel, CTO, The Hive for The Hive India event](https://reader036.fdocuments.in/reader036/viewer/2022062511/54c6a2e04a7959d9148b4571/html5/thumbnails/1.jpg)
Big Data App Server
Lance Riedel
![Page 2: Big Data App servor by Lance Riedel, CTO, The Hive for The Hive India event](https://reader036.fdocuments.in/reader036/viewer/2022062511/54c6a2e04a7959d9148b4571/html5/thumbnails/2.jpg)
Big Data App Server
A new application framework for the 4 V’s:
• Volume of raw data (petabytes)
• Velocity at which it is being generated/ingested
• Variety of data sources and schemas
• Advanced data sciences and analytics that can be applied to extract Value
![Page 3: Big Data App servor by Lance Riedel, CTO, The Hive for The Hive India event](https://reader036.fdocuments.in/reader036/viewer/2022062511/54c6a2e04a7959d9148b4571/html5/thumbnails/3.jpg)
![Page 4: Big Data App servor by Lance Riedel, CTO, The Hive for The Hive India event](https://reader036.fdocuments.in/reader036/viewer/2022062511/54c6a2e04a7959d9148b4571/html5/thumbnails/4.jpg)
Big Data App Server Use Cases
• Log/Machine Analytics
• Security/Fraud Detection
• Sensor Data Analytics
• Financial Analytics
• Retail Analytics
• Ad Targeting
• Recommendation (e.g. Netflix, Amazon)
![Page 5: Big Data App servor by Lance Riedel, CTO, The Hive for The Hive India event](https://reader036.fdocuments.in/reader036/viewer/2022062511/54c6a2e04a7959d9148b4571/html5/thumbnails/5.jpg)
Components
[Diagram label: Big Data Platform]
![Page 6: Big Data App servor by Lance Riedel, CTO, The Hive for The Hive India event](https://reader036.fdocuments.in/reader036/viewer/2022062511/54c6a2e04a7959d9148b4571/html5/thumbnails/6.jpg)
APP SERVER COMPONENTS
![Page 7: Big Data App servor by Lance Riedel, CTO, The Hive for The Hive India event](https://reader036.fdocuments.in/reader036/viewer/2022062511/54c6a2e04a7959d9148b4571/html5/thumbnails/7.jpg)
Storage and Compute
[Diagram label: Big Data Platform]
![Page 8: Big Data App servor by Lance Riedel, CTO, The Hive for The Hive India event](https://reader036.fdocuments.in/reader036/viewer/2022062511/54c6a2e04a7959d9148b4571/html5/thumbnails/8.jpg)
Storage and Compute
Motivation
Google needed to capture the web and process it efficiently
• Calculate importance of pages, words, domains against each other
• The more cost-effective they could make it, the more they could process, index, understand
![Page 9: Big Data App servor by Lance Riedel, CTO, The Hive for The Hive India event](https://reader036.fdocuments.in/reader036/viewer/2022062511/54c6a2e04a7959d9148b4571/html5/thumbnails/9.jpg)
Storage/Compute: Centralized
• Centralized doesn’t scale!
• Move a lot of data: bottleneck
![Page 10: Big Data App servor by Lance Riedel, CTO, The Hive for The Hive India event](https://reader036.fdocuments.in/reader036/viewer/2022062511/54c6a2e04a7959d9148b4571/html5/thumbnails/10.jpg)
Storage/Compute: Sharding
• Sharding is splitting the problem into isolated chunks
• Sharding scales, but fails when you need to look across the data
• E.g. how to calculate term weights or top pages across shards?
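The cross-shard problem can be made concrete with a statistic such as inverse document frequency: computed inside one isolated shard, it generally differs from the value over the whole corpus. A minimal Python sketch, assuming made-up shard contents and a hypothetical `idf` helper (neither comes from the slides):

```python
import math

def idf(term, docs):
    """Inverse document frequency of `term` over a list of tokenized docs."""
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing) if containing else float("inf")

# Two isolated shards holding made-up documents (each document is a set of terms).
shard_a = [{"hadoop", "data"}, {"hadoop", "flume"}]
shard_b = [{"data", "search"}, {"search", "index"}, {"index", "data"}]

local_a = idf("hadoop", shard_a)                # every doc in shard A has the term
local_b = idf("hadoop", shard_b)                # no doc in shard B has the term
global_idf = idf("hadoop", shard_a + shard_b)   # 2 of 5 docs across the corpus

print(local_a, local_b, global_idf)
```

Neither shard can produce the corpus-wide weight on its own; getting it requires combining counts from every shard, which is the step plain sharding does not provide.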
![Page 11: Big Data App servor by Lance Riedel, CTO, The Hive for The Hive India event](https://reader036.fdocuments.in/reader036/viewer/2022062511/54c6a2e04a7959d9148b4571/html5/thumbnails/11.jpg)
DFS, MapReduce
• Used a new programming model to distribute computation AND data (NOT sharding)
• Runs on commodity hardware
• Failure resilience using software control
• Easy to calculate across corpus
• Two parts of a complete solution:
  • Distributed File System (DFS)
  • MapReduce
![Page 12: Big Data App servor by Lance Riedel, CTO, The Hive for The Hive India event](https://reader036.fdocuments.in/reader036/viewer/2022062511/54c6a2e04a7959d9148b4571/html5/thumbnails/12.jpg)
Distributed File System
![Page 13: Big Data App servor by Lance Riedel, CTO, The Hive for The Hive India event](https://reader036.fdocuments.in/reader036/viewer/2022062511/54c6a2e04a7959d9148b4571/html5/thumbnails/13.jpg)
MapReduce
• Process where the data resides (data and compute are local to each other)
• Map (read the data, emit a key and a value)
• Reduce (group all values per key, perform another operation)
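The map and reduce steps above can be sketched in a single process with the classic word-count example. This is only an illustration in plain Python, not Hadoop's API; the function names are assumptions, and in a real cluster the grouping step is performed by the framework across machines.

```python
from collections import defaultdict

def map_phase(line):
    """Map: read a record and emit (key, value) pairs, here (word, 1)."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group all emitted values by key (the framework's job in a real cluster)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: perform another operation on the grouped values, here a sum."""
    return key, sum(values)

def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["big data app server", "big data platform"])
print(counts)
```

Because each call to `map_phase` only looks at its own record, the map work can run wherever the data already lives; only the grouped keys need to move.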
![Page 14: Big Data App servor by Lance Riedel, CTO, The Hive for The Hive India event](https://reader036.fdocuments.in/reader036/viewer/2022062511/54c6a2e04a7959d9148b4571/html5/thumbnails/14.jpg)
Hadoop
• Open source implementation of Google’s DFS and MapReduce whitepapers
• Huge ecosystem
• Used by: Yahoo, Facebook, Twitter, LinkedIn, Sears, Apple, The New York Times, Telefonica, and thousands more!
![Page 15: Big Data App servor by Lance Riedel, CTO, The Hive for The Hive India event](https://reader036.fdocuments.in/reader036/viewer/2022062511/54c6a2e04a7959d9148b4571/html5/thumbnails/15.jpg)
Management
[Diagram label: Big Data Platform]
![Page 16: Big Data App servor by Lance Riedel, CTO, The Hive for The Hive India event](https://reader036.fdocuments.in/reader036/viewer/2022062511/54c6a2e04a7959d9148b4571/html5/thumbnails/16.jpg)
Data Ingestion
Motivation
• Data originating from a variety of sources
• Some data more valuable than others:
  • Time-to-live (TTL)
  • Guarantees on delivery
![Page 17: Big Data App servor by Lance Riedel, CTO, The Hive for The Hive India event](https://reader036.fdocuments.in/reader036/viewer/2022062511/54c6a2e04a7959d9148b4571/html5/thumbnails/17.jpg)
Data Ingestion: Apache Flume
• A scalable, fault-tolerant data ingestion pipeline with a configurable topology that works hand in hand with the Hadoop ecosystem
• Configurable delivery guarantees: routing, replication, failover
• Extensible sources and sinks allow for pluggable data sources
• Scales out horizontally: hundreds of thousands of messages/sec
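The failover idea can be illustrated with a toy source, channel and sink pipeline. This is a conceptual sketch in plain Python and not Flume's configuration format or API; the class, the function names and the simulated outage are all assumptions made for the example.

```python
from collections import deque

class FailoverSink:
    """Try sinks in priority order; deliver each event to the first that accepts it."""
    def __init__(self, sinks):
        self.sinks = sinks

    def deliver(self, event):
        for sink in self.sinks:
            try:
                sink(event)
                return True
            except ConnectionError:
                continue  # fall through to the next sink
        return False

def run_pipeline(source, channel, sink):
    """Source -> channel (buffer) -> sink; undelivered events stay in the channel."""
    channel.extend(source)
    for _ in range(len(channel)):
        event = channel.popleft()
        if not sink.deliver(event):
            channel.append(event)  # keep the event for a later retry

primary_store, backup_store = [], []

def primary(event):
    if "bad" in event:  # simulate an outage for some events
        raise ConnectionError("primary unavailable")
    primary_store.append(event)

channel = deque()
sink = FailoverSink([primary, backup_store.append])
run_pipeline(["log-1", "bad-2", "log-3"], channel, sink)
print(primary_store, backup_store, list(channel))
```

Events the primary sink rejects are routed to the backup, and anything no sink accepts remains buffered instead of being dropped.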
![Page 18: Big Data App servor by Lance Riedel, CTO, The Hive for The Hive India event](https://reader036.fdocuments.in/reader036/viewer/2022062511/54c6a2e04a7959d9148b4571/html5/thumbnails/18.jpg)
Workflow
Motivation
Transforming, storing, and joining data can take a lot of steps that need to be repeatable and traceable: the programming model for data
![Page 19: Big Data App servor by Lance Riedel, CTO, The Hive for The Hive India event](https://reader036.fdocuments.in/reader036/viewer/2022062511/54c6a2e04a7959d9148b4571/html5/thumbnails/19.jpg)
Workflow: Oozie
A workflow engine that understands the dependency graph of work and can schedule, replay, and report on the steps
• Jobs triggered by time (frequency) and data availability
• Integrated with the rest of the Hadoop stack
• Scalable, reliable and extensible system
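The dependency-graph idea can be sketched with Python's standard `graphlib` module (Python 3.9+). This is not Oozie's workflow definition format; the step names and the `run` helper are hypothetical and only show ordering, replay and reporting.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical workflow: each step maps to the set of steps it depends on.
workflow = {
    "ingest": set(),
    "clean": {"ingest"},
    "join": {"clean"},
    "aggregate": {"join"},
    "report": {"aggregate"},
}

def run(workflow, actions, done=()):
    """Run steps in dependency order, skipping steps already completed (replay)."""
    log = []
    for step in TopologicalSorter(workflow).static_order():
        if step in done:
            log.append((step, "skipped"))
            continue
        actions[step]()
        log.append((step, "ok"))
    return log

actions = {name: (lambda: None) for name in workflow}  # placeholder actions
first_run = run(workflow, actions)
replay = run(workflow, actions, done={"ingest", "clean"})
print(first_run)
print(replay)
```

The returned log doubles as a simple report of what ran, and passing `done` replays the workflow from the first step that has not yet completed.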
![Page 20: Big Data App servor by Lance Riedel, CTO, The Hive for The Hive India event](https://reader036.fdocuments.in/reader036/viewer/2022062511/54c6a2e04a7959d9148b4571/html5/thumbnails/20.jpg)
Schema Management
Motivation
As data sources explode, the need to understand the data schemas becomes a principal concern
![Page 21: Big Data App servor by Lance Riedel, CTO, The Hive for The Hive India event](https://reader036.fdocuments.in/reader036/viewer/2022062511/54c6a2e04a7959d9148b4571/html5/thumbnails/21.jpg)
Schema: HCatalog
• A table and storage management layer for Hadoop
• Enables users with different data processing tools – Pig, MapReduce, and Hive – to more easily read and write data on the grid.
![Page 22: Big Data App servor by Lance Riedel, CTO, The Hive for The Hive India event](https://reader036.fdocuments.in/reader036/viewer/2022062511/54c6a2e04a7959d9148b4571/html5/thumbnails/22.jpg)
Schema: Avro
• A data serialization system
• When Avro data is stored in a file, its schema is stored with it
• Correspondence between same-named fields, missing fields, extra fields, etc. can all be easily resolved
• Most technologies in the Hadoop stack understand Avro, which helps interoperability and data passing
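How same-named, missing and extra fields might be reconciled can be sketched in plain Python. This is a simplified illustration of the idea and does not use the Avro library; the `resolve` function, the `Event` schema and the sample record are assumptions made for the example.

```python
def resolve(record, reader_schema):
    """Project a record written with one schema onto a reader's schema.

    Simplified rules: fields are matched by name, fields the reader does not
    declare are dropped, and fields missing from the record take the reader's
    default (an error is raised if there is none).
    """
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]
        elif "default" in field:
            resolved[name] = field["default"]
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return resolved

# Hypothetical reader schema, written in Avro's JSON schema style.
reader_schema = {
    "type": "record",
    "name": "Event",
    "fields": [
        {"name": "user", "type": "string"},
        {"name": "country", "type": "string", "default": "unknown"},
    ],
}

# Record produced by an older writer: has an extra field, lacks "country".
written = {"user": "alice", "session": 42}
resolved = resolve(written, reader_schema)
print(resolved)
```

Because the writer's schema travels with the data, a reader can apply rules like these without coordinating releases with every producer.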
![Page 23: Big Data App servor by Lance Riedel, CTO, The Hive for The Hive India event](https://reader036.fdocuments.in/reader036/viewer/2022062511/54c6a2e04a7959d9148b4571/html5/thumbnails/23.jpg)
Data Access, Querying
[Diagram label: Big Data Platform]
![Page 24: Big Data App servor by Lance Riedel, CTO, The Hive for The Hive India event](https://reader036.fdocuments.in/reader036/viewer/2022062511/54c6a2e04a7959d9148b4571/html5/thumbnails/24.jpg)
Data Access
Motivation
Various data access patterns require data stores beyond just the DFS files. An example is a key-value store that needs random access to data.
Solution(s)
There are a number of solutions depending on the use case.
• Google’s BigTable whitepaper
• SQL has been adapted to Hadoop
![Page 25: Big Data App servor by Lance Riedel, CTO, The Hive for The Hive India event](https://reader036.fdocuments.in/reader036/viewer/2022062511/54c6a2e04a7959d9148b4571/html5/thumbnails/25.jpg)
Data Access: HBase
• The Hadoop database: a distributed, scalable, big data store (sorted map), modeled on Google’s BigTable and backed by Hadoop DFS
• Linear and modular scalability
• Automatic and configurable sharding of tables
• Automatic failover support
• Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables
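The "sorted map" idea can be sketched in memory: keeping row keys in sorted order gives both random access by key and efficient range scans. This is not the HBase client API; the `SortedMap` class and the row-key layout are assumptions made for illustration.

```python
import bisect

class SortedMap:
    """Keys kept in sorted order, so point lookups and range scans are cheap."""
    def __init__(self):
        self._keys = []
        self._values = {}

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, key):
        return self._values.get(key)

    def scan(self, start, stop):
        """Yield (key, value) for start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, self._values[key]

table = SortedMap()
for row_key, value in [("user#2", "bob"), ("user#1", "alice"), ("page#9", "/home")]:
    table.put(row_key, value)

point = table.get("user#1")
users = list(table.scan("user#", "user$"))  # "$" sorts right after "#"
print(point, users)
```

Sorted keys also make splitting a table straightforward, since each contiguous key range can be handed to a different server.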
![Page 26: Big Data App servor by Lance Riedel, CTO, The Hive for The Hive India event](https://reader036.fdocuments.in/reader036/viewer/2022062511/54c6a2e04a7959d9148b4571/html5/thumbnails/26.jpg)
Data Access: SQL – Hive, Impala
• SQL querying of raw data on the distributed file system
• Impala: query files on HDFS, including SELECT, JOIN, and aggregate functions, in real time
• Hive: provides easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems
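The query shape mentioned above (SELECT with a JOIN and an aggregate function) looks roughly like the following. The sketch uses Python's built-in `sqlite3` module purely as a stand-in so that it runs anywhere; it is not Hive or Impala, and the table and column names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE page_views (user_id INTEGER, page TEXT);
    CREATE TABLE users (user_id INTEGER, country TEXT);
    INSERT INTO page_views VALUES (1, '/home'), (1, '/cart'), (2, '/home');
    INSERT INTO users VALUES (1, 'IN'), (2, 'US');
""")

# SELECT + JOIN + aggregate function: page views per country.
query = """
    SELECT u.country, COUNT(*) AS views
    FROM page_views AS v
    JOIN users AS u ON u.user_id = v.user_id
    GROUP BY u.country
    ORDER BY u.country
"""
rows = conn.execute(query).fetchall()
print(rows)
```

On a Hadoop cluster the same kind of statement would be run over files in the distributed file system instead of local tables.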
![Page 27: Big Data App servor by Lance Riedel, CTO, The Hive for The Hive India event](https://reader036.fdocuments.in/reader036/viewer/2022062511/54c6a2e04a7959d9148b4571/html5/thumbnails/27.jpg)
Analytics
[Diagram label: Big Data Platform]
![Page 28: Big Data App servor by Lance Riedel, CTO, The Hive for The Hive India event](https://reader036.fdocuments.in/reader036/viewer/2022062511/54c6a2e04a7959d9148b4571/html5/thumbnails/28.jpg)
Data Analytics
Motivation
• Discover the latent value of the data. The core motivation behind Big Data!
• Clustering, Machine Learning, Correlations, Modeling: the guts of Data Science, often with extremely diverse use cases.
Solution(s)
A pluggable architecture that can share schemas, but allow for a suite of tools appropriate for the use case
![Page 29: Big Data App servor by Lance Riedel, CTO, The Hive for The Hive India event](https://reader036.fdocuments.in/reader036/viewer/2022062511/54c6a2e04a7959d9148b4571/html5/thumbnails/29.jpg)
Data Analytics: Example Frameworks
• Mahout
  • Machine learning, clustering
• Pattern - Machine Learning DSL for Hadoop from Cascading
• 0xData
  • Open source math and prediction engine for big data
• Sample Algorithms
  • Random Forest algorithm
  • K-Means Clustering
  • Hierarchical Clustering
  • Linear Regression
  • Logistic Regression
  • Support Vector Machines
  • Artificial Neural Networks
  • Association Rule Learning
![Page 30: Big Data App servor by Lance Riedel, CTO, The Hive for The Hive India event](https://reader036.fdocuments.in/reader036/viewer/2022062511/54c6a2e04a7959d9148b4571/html5/thumbnails/30.jpg)
Serving
[Diagram label: Big Data Platform]
![Page 31: Big Data App servor by Lance Riedel, CTO, The Hive for The Hive India event](https://reader036.fdocuments.in/reader036/viewer/2022062511/54c6a2e04a7959d9148b4571/html5/thumbnails/31.jpg)
Serving
Motivation
• Powering applications for end users
• Search/browse and recommendation engines allow real-time access to data
![Page 32: Big Data App servor by Lance Riedel, CTO, The Hive for The Hive India event](https://reader036.fdocuments.in/reader036/viewer/2022062511/54c6a2e04a7959d9148b4571/html5/thumbnails/32.jpg)
Serving: Search – Solr Cloud
• Builds indexes on top of Hadoop
• Horizontally scalable, fault tolerant
• Incredible flexibility in indexing options
  • Tokenization
  • Field types
  • Data storage
• Search options just as flexible
  • AND, OR, NOT, wildcard
  • Facets (counts from a derived ontology)
  • Extensive algorithm and weighting pluggability
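Tokenization, boolean search and facet counts can be sketched with a tiny in-memory inverted index. This is a conceptual illustration, not Solr's API or query syntax; the documents, the whitespace tokenizer and the `search` and `facets` helpers are assumptions made for the example.

```python
from collections import Counter, defaultdict

# Made-up documents; "category" plays the role of a facet field.
docs = {
    1: {"text": "hadoop distributed file system", "category": "storage"},
    2: {"text": "hadoop mapreduce compute", "category": "compute"},
    3: {"text": "solr search index", "category": "serving"},
    4: {"text": "hadoop search index", "category": "serving"},
}

# Indexing: tokenize on whitespace and build term -> set of document ids.
index = defaultdict(set)
for doc_id, doc in docs.items():
    for token in doc["text"].split():
        index[token].add(doc_id)

def search(must=(), should=(), must_not=()):
    """AND over `must`, OR over `should`, then remove any `must_not` matches."""
    hits = set(docs)
    for term in must:
        hits &= index.get(term, set())
    if should:
        hits &= set().union(*(index.get(term, set()) for term in should))
    for term in must_not:
        hits -= index.get(term, set())
    return hits

def facets(hits, field):
    """Count matching documents per value of `field`."""
    return Counter(docs[doc_id][field] for doc_id in hits)

hits = search(must=["hadoop"], must_not=["mapreduce"])
print(sorted(hits), dict(facets(hits, "category")))
```

A production engine adds scoring, richer tokenizers and distribution across nodes, but the term-to-documents lookup stays at the core.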
![Page 33: Big Data App servor by Lance Riedel, CTO, The Hive for The Hive India event](https://reader036.fdocuments.in/reader036/viewer/2022062511/54c6a2e04a7959d9148b4571/html5/thumbnails/33.jpg)
Serving: Manas – Matching Engine
• The Hive’s massively scalable matching engine
• Handles hundreds of millions to billions of documents efficiently while matching against hundreds to thousands of features
• Nothing exists today in the Open Source community that has these capabilities
![Page 34: Big Data App servor by Lance Riedel, CTO, The Hive for The Hive India event](https://reader036.fdocuments.in/reader036/viewer/2022062511/54c6a2e04a7959d9148b4571/html5/thumbnails/34.jpg)
EXAMPLE APP USE-CASE
![Page 35: Big Data App servor by Lance Riedel, CTO, The Hive for The Hive India event](https://reader036.fdocuments.in/reader036/viewer/2022062511/54c6a2e04a7959d9148b4571/html5/thumbnails/35.jpg)
App Server Data Flow
![Page 36: Big Data App servor by Lance Riedel, CTO, The Hive for The Hive India event](https://reader036.fdocuments.in/reader036/viewer/2022062511/54c6a2e04a7959d9148b4571/html5/thumbnails/36.jpg)
SecurityX on App Server