Big Data App Server by Lance Riedel, CTO, The Hive, for The Hive India event

Big Data App Server

Lance Riedel

Big Data App Server

A new application framework for the 4 V's:
•  Volume of raw data (petabytes)
•  Velocity at which it is being generated/ingested
•  Variety of data sources and schemas
•  Advanced data sciences and analytics that can be applied to extract Value

Big Data App Server Use Cases

•  Log/Machine Analytics
•  Security/Fraud Detection
•  Sensor Data Analytics
•  Financial Analytics
•  Retail Analytics
•  Ad Targeting
•  Recommendation (e.g. Netflix, Amazon)

Big Data Platform: Components

APP SERVER COMPONENTS

Big Data Platform: Storage and Compute

Storage and Compute

Motivation
Google needed to capture the web and process it efficiently
•  Calculate importance of pages, words, domains against each other
•  The more cost-effective they could make it, the more they could process, index, understand

Storage/Compute: Centralized

•  Centralized doesn't scale!
•  Moving a lot of data is the bottleneck

Storage/Compute: Sharding

•  Sharding is splitting the problem into isolated chunks
•  Sharding scales, but fails when you need to look across the data
•  E.g. how do you calculate term weights or top pages across shards?
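To make the cross-shard problem concrete, here is a minimal Python sketch. The toy corpus and the `document_frequency` helper are assumptions for illustration only, not something from the talk. Each shard can count terms in its own documents, but a corpus-wide term weight only appears once the partial counts from every shard are merged.

```python
from collections import Counter

# Hypothetical toy corpus, split across two isolated shards.
shards = [
    ["hadoop scales", "hadoop stores data"],
    ["search needs data", "data everywhere", "more data"],
]

def document_frequency(docs):
    """Count how many documents each term appears in."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.split()))
    return df

# Each shard only sees its own documents, so its view of a term is partial.
per_shard = [document_frequency(docs) for docs in shards]

# A corpus-wide answer requires merging the partial counts from every shard.
global_df = sum(per_shard, Counter())

print(per_shard[0]["data"], per_shard[1]["data"], global_df["data"])
```

Shard 0 on its own would rank "hadoop" as its most common term, while the merged view ranks "data" first. That extra merge step is what isolated shards do not provide by themselves.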


DFS, MapReduce

•  Used a new programming model to distribute computation AND data (NOT sharding)
•  Runs on commodity hardware
•  Failure resilience using software control
•  Easy to calculate across the corpus
•  Two parts of a complete solution:
   •  Distributed File System (DFS)
   •  MapReduce

Distributed File System

MapReduce

•  Process where the data resides (data and compute are local to each other)
•  Map (read the data, emit a key and a value)
•  Reduce (group all values per key, perform another operation)
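The map and reduce steps above can be sketched in a few lines of single-process Python. This is only an illustration of the programming model: the function names (`map_phase`, `shuffle`, `reduce_phase`) are made up for this sketch, and a real framework would run the map step on the nodes that hold the data and do the grouping itself.

```python
from collections import defaultdict

def map_phase(line):
    """Map: read one record and emit (key, value) pairs."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group all emitted values by key (the framework does this in practice)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values collected for one key."""
    return key, sum(values)

def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["big data app server", "big data platform"])
print(counts)
```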

Hadoop

•  Open source implementation of Google's DFS and MapReduce whitepapers
•  Huge ecosystem
•  Used by: Yahoo, Facebook, Twitter, LinkedIn, Sears, Apple, The New York Times, Telefonica, and thousands more!

Big Data Platform: Management

Data Ingestion

Motivation
•  Data originating from a variety of sources
•  Some data more valuable than others:
   •  Time-to-live (TTL)
   •  Guarantees on delivery

Data Ingestion: Apache Flume

•  A scalable, fault-tolerant data ingestion pipeline with a configurable topology that works hand in hand with the Hadoop ecosystem
•  Configurable delivery guarantees: routing, replication, failover
•  Extensible sources and sinks allow for pluggable data sources
•  Scales out horizontally to hundreds of thousands of messages/sec
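As a rough illustration of the difference between replication and failover delivery, here is a small Python sketch. It does not use Flume or its API; the `Sink` class and both delivery functions are hypothetical stand-ins that only model the idea.

```python
class Sink:
    """Hypothetical destination; 'healthy' simulates availability."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.received = []

    def write(self, event):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        self.received.append(event)

def deliver_replicated(event, sinks):
    """Replication: every sink gets its own copy of the event."""
    for sink in sinks:
        sink.write(event)

def deliver_with_failover(event, sinks):
    """Failover: try sinks in priority order and stop at the first success."""
    for sink in sinks:
        try:
            sink.write(event)
            return sink.name
        except ConnectionError:
            continue
    raise RuntimeError("no sink accepted the event")

primary = Sink("primary", healthy=False)
backup = Sink("backup")
used = deliver_with_failover("login event", [primary, backup])
print(used)
```

In a real deployment these choices would be made in the pipeline's configuration rather than in application code.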

Workflow

Motivation
Transforming, storing, and joining data can take a lot of steps that need to be repeatable and traceable: the programming model for data

Workflow: Oozie

A workflow engine that understands the dependency graph of work and can schedule, replay, and report on the steps
•  Jobs triggered by time (frequency) and data availability
•  Integrated with the rest of the Hadoop stack
•  Scalable, reliable and extensible system
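The idea of a dependency graph of work can be sketched with Python's standard `graphlib` module (Python 3.9+). This is not Oozie's workflow format; the step names and the `run_workflow` helper are invented for illustration. It only shows how an engine can derive a valid execution order from declared dependencies and keep a trace to report on.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each step maps to the set of steps it depends on.
workflow = {
    "ingest_logs": set(),
    "ingest_users": set(),
    "join": {"ingest_logs", "ingest_users"},
    "aggregate": {"join"},
    "report": {"aggregate"},
}

def run_workflow(graph, actions):
    """Run every step after its dependencies and keep a trace for reporting."""
    trace = []
    for step in TopologicalSorter(graph).static_order():
        actions[step]()
        trace.append(step)
    return trace

# Placeholder actions; a real step would launch a job.
actions = {step: (lambda: None) for step in workflow}
trace = run_workflow(workflow, actions)
print(trace)
```

Because the trace records the order in which steps ran, the same graph can be replayed from any step whose inputs are still available.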

Schema Management

Motivation
As data sources explode, the need to understand the data schemas becomes a principal concern

Schema: HCatalog

•  A table and storage management layer for Hadoop

•  Enables users with different data processing tools – Pig, MapReduce, and Hive – to more easily read and write data on the grid.

Schema: Avro

•  A data serialization system
•  When Avro data is stored in a file, its schema is stored with it
•  Differences between the writer's and the reader's schema (same-named fields, missing fields, extra fields, etc.) can all be easily resolved
•  Most technologies in the Hadoop stack understand Avro, which gives interoperability and easy data passing
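The schema resolution idea can be illustrated without the Avro library. The sketch below is a simplified model, not Avro's implementation: `resolve`, `REQUIRED`, and the field names are assumptions made for this example. Same-named fields are copied, fields the reader expects but the writer lacks fall back to a default, and extra writer fields are ignored.

```python
# Sentinel marking a reader field that has no default value.
REQUIRED = object()

def resolve(record, writer_fields, reader_fields):
    """Project a record written with one schema onto a reader's schema.

    reader_fields maps field name -> default value, or REQUIRED if none.
    """
    resolved = {}
    for name, default in reader_fields.items():
        if name in writer_fields:
            resolved[name] = record[name]   # same-named field: copy the value
        elif default is not REQUIRED:
            resolved[name] = default        # missing field: use the default
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return resolved                         # extra writer fields are dropped

old_record = {"id": 1, "name": "alice", "legacy_flag": True}
new_view = resolve(
    old_record,
    writer_fields={"id", "name", "legacy_flag"},
    reader_fields={"id": REQUIRED, "name": REQUIRED, "email": "unknown"},
)
print(new_view)
```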

Big Data Platform: Data Access, Querying

Data Access

Motivation
Various data access patterns require data stores beyond just the DFS files. An example is a key-value store that needs random access to data.
Solution(s)
There are a number of solutions depending on the use case.
•  Google's BigTable whitepaper
•  SQL has been adapted to Hadoop

Data Access: HBase

•  The Hadoop database: a distributed, scalable, big data store (sorted map), modeled on Google's BigTable and backed by Hadoop DFS
•  Linear and modular scalability
•  Automatic and configurable sharding of tables
•  Automatic failover support
•  Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables
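The "sorted map" idea is what makes both random access and range scans cheap. Below is a toy, in-memory Python sketch of that data structure. `SortedStore` and the row-key layout are hypothetical and do not reflect the HBase client API.

```python
import bisect

class SortedStore:
    """Toy in-memory sorted map; keys are kept in sorted order."""

    def __init__(self):
        self._keys = []
        self._values = {}

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, key):
        """Random access by exact key."""
        return self._values.get(key)

    def scan(self, start, stop):
        """Yield (key, value) for start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, self._values[key]

store = SortedStore()
store.put("user2#2013-01", "login")
store.put("user1#2013-02", "purchase")
store.put("user1#2013-01", "login")

# All rows for user1: '$' is the character right after '#'.
user1_rows = list(store.scan("user1#", "user1$"))
print(user1_rows)
```

Because related keys sit next to each other, a table can also be split into contiguous key ranges, which is one way to picture automatic sharding of tables.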

Data Access: SQL – Hive, Impala

•  SQL querying of raw data on the distributed file system

•  Impala: query files on HDFS, including SELECT, JOIN, and aggregate functions, in real time
•  Hive: provides easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems

Big Data Platform: Analytics

Data Analytics

Motivation
•  Discover the latent value of the data. The core motivation behind Big Data!
•  Clustering, machine learning, correlations, modeling: the guts of data science, often with extremely diverse use cases
Solution(s)
A pluggable architecture that can share schemas, but allows for a suite of tools appropriate for the use case

Data Analytics: Example Frameworks
•  Mahout
   •  Machine learning, clustering
•  Pattern: machine learning DSL for Hadoop from Cascading
•  0xData
   •  Open source math and prediction engine for big data
•  Sample Algorithms
   •  Random Forest algorithm
   •  K-Means Clustering
   •  Hierarchical Clustering
   •  Linear Regression
   •  Logistic Regression
   •  Support Vector Machines
   •  Artificial Neural Networks
   •  Association Rule Learning

Big Data Platform: Serving

Serving

Motivation
•  Powering applications for end users
•  Search/browse and recommendation engines allow real-time access to data

Serving: Search – Solr Cloud
•  Builds indexes on top of Hadoop
•  Horizontally scalable, fault tolerant
•  Incredible flexibility in indexing options
   •  Tokenization
   •  Field types
   •  Data storage
•  Search options just as flexible
   •  AND, OR, NOT, wildcard
   •  Facets (counts from a derived ontology)
   •  Extensive algorithm and weighting pluggability
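To show what tokenization, boolean operators, and facets mean at the smallest scale, here is a toy inverted index in plain Python. It does not use Solr; the documents, field names, and helper functions are invented for this sketch.

```python
from collections import Counter, defaultdict

# Hypothetical documents with a text field and a facet field ("type").
docs = {
    1: {"text": "Hadoop stores big data", "type": "blog"},
    2: {"text": "Solr searches big data", "type": "blog"},
    3: {"text": "Hadoop runs MapReduce", "type": "paper"},
}

# Tokenization here is just a lowercase whitespace split.
index = defaultdict(set)
for doc_id, doc in docs.items():
    for token in doc["text"].lower().split():
        index[token].add(doc_id)

def postings(term):
    """Return the set of document ids containing the term."""
    return index.get(term.lower(), set())

def search_and(a, b):
    return postings(a) & postings(b)

def search_or(a, b):
    return postings(a) | postings(b)

def search_not(a, b):
    """Documents matching a but NOT b."""
    return postings(a) - postings(b)

def facets(hits, field):
    """Count hits per value of a field."""
    return Counter(docs[doc_id][field] for doc_id in hits)

hits = search_or("hadoop", "solr")
print(sorted(hits), dict(facets(hits, "type")))
```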

Serving: Manas – Matching Engine

•  The Hive’s massively scalable matching engine

•  Handles hundreds of millions to billions of documents efficiently while matching against hundreds to thousands of features

•  Nothing exists today in the open source community that has these capabilities

EXAMPLE APP USE-CASE

App Server Data Flow

SecurityX on App Server
