Big Data, Week 11 (Shawndra Hill's course, University of Pennsylvania)
Big Data
Jason Albert
University of Pennsylvania
PERSPECTIVES
What is Big Data?
High-volume, high-velocity, and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation. (Gartner)
1 Terabyte = 1024 Gigabytes
1 Petabyte = 1024 Terabytes
1 Exabyte = 1024 Petabytes
1 Zettabyte = 1024 Exabytes
1 ZB = 1,099,511,627,776 GB, so 7.9 ZB ≈ 8,686,141,859,430 GB
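As a quick sanity check on the arithmetic above, here is a minimal Python sketch (not part of the original deck) reproducing the binary unit ladder and the 7.9 ZB figure:

```python
# Binary byte-unit ladder: each step up is a factor of 1024.
GB_PER_TB = 1024
GB_PER_PB = 1024 ** 2
GB_PER_EB = 1024 ** 3
GB_PER_ZB = 1024 ** 4          # 1,099,511,627,776 GB

# 7.9 ZB expressed in GB, matching the slide's figure.
print(f"1 ZB  = {GB_PER_ZB:,} GB")
print(f"7.9 ZB = {7.9 * GB_PER_ZB:,.0f} GB")   # ≈ 8,686,141,859,430 GB
```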
How do we handle Big Data?
“MAD” Information Management is the approach:
Must be Magnetic, attracting all data sources
Must be Agile for easy accommodation of data at a rapid pace
Must provide sophisticated statistical methods for its Deep data repository
Why is MAD a departure from the traditional Data Warehouse?
What is the Scope of the Solution?
An End-to-End Solution must be considered:
Consume: Volume, Velocity, Variety
Store: Gigabytes, Terabytes, Petabytes
Process: Cluster, Classify, Predict
Present: Visualize, Interact, Evaluate
Perspectives on Big Data
Does it handle Big Data? (Volume, Velocity, Variety)
Is it considered MAD? (Magnetic, Agile, Deep)
Is it an End-to-End Solution? (Consume, Store, Process, Present)
Options to Consider
Two promising options with low market penetration (Gartner):
MapReduce and alternatives
In-memory Computing
MAP REDUCE
Hadoop = MapReduce + HDFS
An open-source, batch-oriented, data-intensive, general-purpose framework for creating distributed applications that process big data (i.e., Volume, Velocity, Variety)
Hadoop Distributed File System (HDFS)
Data distributed and replicated over multiple systems
Block oriented
MapReduce
Map function processes input records into intermediate key/value pairs
Reduce function merges the intermediate values for each key
Facilitates parallel processing of multiple terabytes of data on large clusters of commodity platforms
Scale out on commodity hardware: fully depreciated, repurposed, low cost
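To make the HDFS bullets concrete, here is a toy Python simulation of block-oriented, replicated storage. This is not Hadoop's actual code; the 64 MB block size and 3-way replication were common defaults of that era, and both are configurable in real HDFS:

```python
import itertools

BLOCK_SIZE = 64 * 1024 * 1024   # bytes; a common HDFS default (configurable)
REPLICATION = 3                  # HDFS's classic default replication factor
NODES = [f"node{i}" for i in range(1, 9)]   # 8 commodity machines

def place_blocks(file_size_bytes):
    """Split a file into blocks and assign each block to REPLICATION nodes."""
    n_blocks = -(-file_size_bytes // BLOCK_SIZE)   # ceiling division
    ring = itertools.cycle(NODES)                  # naive round-robin placement
    return {b: [next(ring) for _ in range(REPLICATION)] for b in range(n_blocks)}

# A 1 TB file becomes 16,384 blocks, each held by 3 of the 8 nodes,
# so a single node can fail without losing data.
placement = place_blocks(1024 ** 4)
print(len(placement), "blocks; block 0 replicas:", placement[0])
```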
MapReduce Workflow
1. Input data is distributed
2. Map tasks each work on a split of the data
3. Mappers output intermediate data
4. Data is exchanged between nodes
5. Intermediate data with the same key goes to the same reducer
6. Reducer output is stored

Map(key, value):
  for each word x in value:
    output.collect(x, 1)

Reduce(keyword, listOfValues):
  sum = 0
  for each x in listOfValues:
    sum += x
  output.collect(keyword, sum)
$ hadoop jar wordcount.jar WordCount /usr/input /usr/output
1. Jack be nimble, Jack be quick, Jack jump over the candlestick.
2. (0, "Jack be nimble,") (15, "Jack be quick,") (28, "Jack jump over the candlestick.")
3. ("Jack", 1), ("be", 1), ("nimble,", 1), ("Jack", 1), ("be", 1), ("quick,", 1), ("Jack", 1), ("jump", 1), ("over", 1), ("the", 1), ("candlestick.", 1)
4. …
5. ("Jack", (1,1,1)), ("be", (1,1)), ("nimble,", (1)), ("quick,", (1)), ("jump", (1)), ("over", (1)), ("the", (1)), ("candlestick.", (1))
6. ("Jack", 3), ("be", 2), ("nimble,", 1), ("quick,", 1), ("jump", 1), ("over", 1), ("the", 1), ("candlestick.", 1)
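The same six steps can be simulated locally. This short Python sketch is a stand-in for the cluster run, not Hadoop's API; it reproduces the map, shuffle, and reduce phases of the word count above:

```python
from collections import defaultdict

lines = ["Jack be nimble,", "Jack be quick,", "Jack jump over the candlestick."]

# Map: emit (word, 1) for every word in every input split.
intermediate = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group all values that share a key, as Hadoop does
# between the map and reduce phases.
groups = defaultdict(list)
for word, count in intermediate:
    groups[word].append(count)

# Reduce: sum the grouped values for each key.
result = {word: sum(counts) for word, counts in groups.items()}
print(result)   # {'Jack': 3, 'be': 2, 'nimble,': 1, ...}
```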
Scale-Out: MapReduce + HDFS
Case Study: Recommendations
1) 9 TB of W3C Extended Log File Format data
2) MapReduce program: sessionExtractor
Session          | Person   | Person
SDF92MGSLOK4M23K | B041Q3EV | N23KFMWE
ASD90K23MOLFWQIE | EM9IU67Y |
Example: LinkedIn's "People You May Know" application
Related uses: Behavior Analytics • Risk & Fraud Analysis • Social Network "Connectedness" • Text Analysis • Regressions (Financial)
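The deck does not show sessionExtractor's code; the following is a hypothetical Hadoop Streaming-style mapper in Python illustrating the idea. The field positions are assumptions about one possible W3C Extended Log layout, not the case study's actual schema:

```python
import sys

# Hypothetical mapper in the spirit of the slide's "sessionExtractor".
# Field positions below are assumed for illustration only.
SESSION_FIELD = 3   # e.g. a session cookie column   (assumed)
PERSON_FIELD = 4    # e.g. an authenticated user id  (assumed)

for line in sys.stdin:
    if line.startswith("#"):        # W3C directive lines: #Version, #Fields, ...
        continue
    fields = line.split()
    if len(fields) > max(SESSION_FIELD, PERSON_FIELD):
        # Emit tab-separated key/value pairs for the shuffle phase.
        print(f"{fields[SESSION_FIELD]}\t{fields[PERSON_FIELD]}")
```

A companion reducer would then group all person ids emitted under the same session key, yielding a session-to-person table like the one above.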
Supplemental Case Study
Product sentiment analysis over time: load one month of Twitter feeds and opinion-board posts onto HDFS
Process with the Word Count pattern, counting positive and negative words associated with a product over time
This type of analysis is being done with some success:
http://techcrunch.com/2012/05/18/study-twitter-sentiment-mirrored-facebooks-stock-price-today/
http://www.cs.ucr.edu/~vagelis/publications/wsdm2012-microblog-financial.pdf
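A minimal sketch of that word-count approach, assuming tiny stand-in word lists (a real run would use a sentiment lexicon and execute as MapReduce jobs over the HDFS data):

```python
# Count positive vs. negative words per day of tweets/opinion posts.
# These word sets are toy stand-ins, not a real sentiment lexicon.
POSITIVE = {"great", "love", "excellent", "good"}
NEGATIVE = {"bad", "broken", "hate", "awful"}

def daily_sentiment(posts):
    """posts: iterable of (date, text); returns {date: pos_count - neg_count}."""
    scores = {}
    for date, text in posts:
        words = text.lower().split()
        score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
        scores[date] = scores.get(date, 0) + score
    return scores

posts = [("2013-03-01", "Love the new phone, great camera"),
         ("2013-03-02", "Screen cracked, awful support")]
print(daily_sentiment(posts))   # {'2013-03-01': 2, '2013-03-02': -1}
```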
MapReduce is Different
MapReduce handles processing differently:
  Distributed programming
  Fault tolerant
MapReduce handles modeling differently:
  Schema-less
  Oriented toward exploration and discovery
MapReduce handles data differently:
  Mostly unstructured data objects
  Vast number of attributes and data sources
  Data sources added and/or updated frequently
  Quality is unknown
External references:
http://developer.yahoo.com/hadoop/
http://code.google.com/edu/parallel/mapreduce-tutorial.html
MapReduce…
…does it handle Big Data?
…is it considered MAD?
  Magnetic
  Agile
  Deep: MapReduce requires algorithm development
…is it an End-to-End Solution?
IN-MEMORY COMPUTING
In-Memory Computing
Overview:
All relevant structured data held in memory
Cache-aware memory organization (the current bottleneck sits between CPU and main memory)
Data partitioning for parallel execution
[Diagram: computation in the application stack vs. the database stack, current vs. future methodology]
Current methodology: optimized for disk access on platforms with limited main memory and slow disk I/O
Future methodology: leverage current innovations in hardware and software to move computation into the database
In-Memory Workflow
In-memory computing applies a combination of:
Optimization: query pruning and data distribution
Execution: SQL statement plans parallelized across cores
Storage: column store with partitioning and compression (5-30x compression ratio)
Persistence: temporal tables and MVCC (multi-version concurrency control)
Example hardware (http://ark.intel.com/): IBM x3850 X5 with QPI scaling or a MAX5 memory tray; 2-4 TB RAM; 2-4 CPUs at 10 cores each; > 4 TB across 8 HDDs
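As an illustration of why the column store compresses well, here is a toy dictionary-encoding sketch in Python. The 5-30x figure depends on the data; this only shows the mechanism:

```python
# A low-cardinality column (e.g. product category) is dictionary-encoded:
# each distinct value is stored once, and rows hold small integer codes.
column = ["beverages", "snacks", "beverages", "dairy", "snacks"] * 200_000

dictionary = sorted(set(column))                  # distinct values, stored once
code_of = {v: i for i, v in enumerate(dictionary)}
encoded = [code_of[v] for v in column]            # small int codes per row

raw_bytes = sum(len(v) for v in column)           # naive uncompressed size
enc_bytes = len(encoded) + sum(len(v) for v in dictionary)  # 1-byte codes + dictionary
print(f"raw ≈ {raw_bytes:,} B, encoded ≈ {enc_bytes:,} B "
      f"({raw_bytes / enc_bytes:.1f}x smaller)")  # ≈ 7x on this toy column
```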
Scale-Out Strategy for In-Memory
Capturing and Presenting
Data provisioning: the in-memory DBMS does not currently accommodate transactional workloads, so data arrives two ways
Trigger replication: new transactions replicate to the in-memory DB, facilitating real-time operational analysis, planning, and simulation
Extraction: ETL (Extract, Transform, Load) tools with support for a large variety of external and internal source systems handle other data sources in near real time, but require job scheduling
e.g. SAP HANA
Case Study: Sales Analysis
1) Load 1.1 billion point-of-sale (PoS) records in < 1 sec
2) Identify top-selling categories
3) Drill down into a category in < 1 sec
4) Add plan/actuals to the schema and visualize
Link to Video: PoS from HANA using Business Objects Explorer
Examples of Performance Gains
Report on product dimensions (120 million line items):
  Standard ERP solution: several minutes on a pre-aggregated dataset; longer for drilldown
  In-Memory: less than 1 second on line-item-level data; about a minute's delay for drilldown
Genome analysis:
  Optimized data warehouse: sequence alignment 81 minutes + variant calling 65 minutes
  In-Memory: sequence alignment 15 minutes + variant calling 19.5 minutes (6.5 min estimated)
  Approximately 2 hours saved
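The genome numbers work out as stated; a quick check of the slide's arithmetic:

```python
# Minutes, taken directly from the slide.
warehouse = 81 + 65        # alignment + variant calling = 146 min
in_memory = 15 + 19.5      # = 34.5 min
saved = warehouse - in_memory
print(f"{saved} min saved ≈ {saved / 60:.1f} h")   # 111.5 min ≈ 1.9 h
```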
In-Memory Computing…
…does it handle Big Data?
…is it considered MAD?
  Magnetic: unstructured data still requires pre-processing
  Agile
  Deep: unsupervised and supervised methods
…is it an End-to-End Solution?
HDFS + MAP REDUCE + IN MEMORY
Case Study: Recommendations
1) 9 TB of W3C Extended Log File Format data
2) MapReduce program: sessionExtractor
Session          | Product  | Product
SDF92MGSLOK4M23K | B041Q3EV | N23KFMWE
ASD90K23MOLFWQIE | EM9IU67Y |
Hadoop-HANA Connector: 18M records
Scale-Out: MapReduce + HDFS
Recall this slide as the Foundation
+ Case Study: Predictive Analysis
1) Add connection details to the Data Reader component
2) Retrieve records
3) Join 1.1B PoS records to the session data
4) Run K-Means clustering on the sessions and explore the outcome
5) Write back to the database for persistence
6) Use the clusters to provide recommendations for future website visitors
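The deck appears to run step 4 inside a commercial predictive tool (it mentions a Data Reader component); here is an equivalent sketch in Python with scikit-learn instead. The three session features are invented for illustration, not taken from the case study:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic session features standing in for the joined session/PoS data:
# pages viewed, spend, and dwell time are assumed feature choices.
rng = np.random.default_rng(0)
sessions = np.column_stack([
    rng.poisson(8, 1000),          # pages viewed per session
    rng.gamma(2.0, 30.0, 1000),    # spend from joined PoS records
    rng.exponential(300.0, 1000),  # dwell time in seconds
])

# Step 4: cluster the sessions.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(sessions)

# Each session now carries a cluster label that can be written back to
# the database (step 5) and used to recommend what similar sessions bought.
print(np.bincount(kmeans.labels_))   # sessions per cluster
```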
Scale-Out Strategy for In-Memory
Recall this slide as the Foundation
Better together
…does it handle Big Data?
MapReduce enables Magnetism: it preprocesses unstructured data
In-Memory enables Agility: data provisioning via replication and extraction
Both enable Deep analysis: during MapReduce preprocessing, and via unsupervised & supervised methods in-memory
…is it an End-to-End Solution?
SAP HANA + Intel Distribution of Hadoop
Announced February 27, 2013:
http://www.sap.com/corporate-en/news.epx?PressID=20498
MAD Improvement Focus
Transformative potential in five domains:
  U.S. Healthcare
  E.U. Public Sector administration
  Retail
  Manufacturing
  Personal location data
Most significant constraint: a shortage of talent able to take advantage of the insights gained from large datasets:
  Deep analytical talent with technical skills in statistics to provide insights
  Data-savvy analysts to interpret, challenge, and base decisions on results
  Support personnel who develop, implement, and maintain the architecture
Source: Big Data: The Next Frontier for Innovation, Competition, and Productivity, McKinsey Global Institute
QUESTIONS?