Building analytical apps on Hadoop
1
Building Analytical Applications on Hadoop
Josh Wills | Director of Data Science
November 2012
2
About Me
3
What are ‘Analytical Applications’?
4
The Humble Dashboard
5
Crossfilter with Flight Information
6
New York Times Electoral Vote Map
7
New York Times Electoral Vote Map (Detail)
8
Analytical Applications vs. Frameworks
9
A Case Study
Developing Analytical Applications
10
2012: The Predicting of the President
11
RealClearPolitics
• Simple Average of Polls
• Transparent
• Simple Interactions
12
FiveThirtyEight
• “Foxy” Model
• Opaque
• Simple Interactions with a richer UI
13
Princeton Election Consortium
• Medians and Polynomials
• Transparent
• Rich Interactions
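A rough illustration of how a simple average of polls differs from a median-based aggregate, as a minimal Python sketch. The poll margins are invented numbers and the function names are hypothetical; this does not reproduce the actual method of RealClearPolitics or the Princeton Election Consortium.

```python
from statistics import mean, median

# Hypothetical poll margins (candidate A minus candidate B, in points).
# These numbers are invented for illustration only.
poll_margins = [2.0, 3.0, 1.0, 4.0, -6.0]


def simple_average(margins):
    """Weight every poll equally and average the margins."""
    return mean(margins)


def median_margin(margins):
    """Take the middle poll, which is less sensitive to a single outlier."""
    return median(margins)


avg = simple_average(poll_margins)
med = median_margin(poll_margins)
print(f"average margin: {avg:.1f}")
print(f"median margin:  {med:.1f}")
```

With one outlying poll in the list, the average is pulled toward the outlier while the median stays with the middle of the pack.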
14
How Did They Do?
15
A Few of These, Because They’re Fun
16
A Few of These, Because They’re Fun
17
A Few of These, Because They’re Fun
18
Here’s the Rub: One Expert Beat Nate
19
Index Funds, Hedge Funds, and Warren Buffett
20
A Brief Introduction to Hadoop
21
Data Storage in 2001: Databases
• Structured schemas
• Intensive processing done where data is stored
• Somewhat reliable
• Expensive at scale
22
Data Storage in 2001: Filers
• No schemas, stores any kind of file
• No data processing capability
• Reliable
• Expensive at scale
23
And Then, This Happened
24
Data Economics: Return on Byte
25
Big Data Economics
• No individual record is particularly valuable
• Having every record is incredibly valuable
• Web index
• Recommendation systems
• Sensor data
• Market basket analysis
• Online advertising
26
Introduction to Hadoop
27
The Hadoop Distributed File System
• Based on the Google File System
• Data stored in large files
• Large block size: 64MB to 256MB per block
• Blocks are replicated to multiple nodes in the cluster
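To make the block and replication bullets concrete, here is a small arithmetic sketch in Python. The function name `hdfs_footprint` is hypothetical, the 128MB block size is one value picked from the 64MB to 256MB range on the slide, and the replication factor of 3 is an assumption (it is the usual HDFS default, but the slide does not state a number).

```python
import math


def hdfs_footprint(file_size_mb, block_size_mb=128, replication=3):
    """Return (number of blocks, total block replicas stored) for one file.

    Assumes the file is split into fixed-size blocks (the last block may be
    partially filled) and that every block is copied `replication` times.
    """
    blocks = math.ceil(file_size_mb / block_size_mb)
    return blocks, blocks * replication


# A 1GB (1024MB) file with 128MB blocks and 3 replicas per block.
blocks, replicas = hdfs_footprint(1024, block_size_mb=128, replication=3)
print(f"{blocks} blocks, {replicas} block replicas across the cluster")
```

Losing a single node therefore leaves other copies of each block available elsewhere in the cluster.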
28
Simple, Reliable Processing: MapReduce
• Map Stage
• Embarrassingly parallel
• Shuffle Stage: Large-scale distributed sort
• Reduce Stage
• Process all of the values that have the same key in a single step
• Process the data where it is stored
• Write once and you’re done.
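The three stages above can be sketched on a single machine in plain Python. This is only an in-memory word-count illustration of the map, shuffle, and reduce steps, not the Hadoop API; all function names here are hypothetical, and a real job would run each stage distributed across the cluster.

```python
from collections import defaultdict


def map_stage(line):
    """Map: emit a (word, 1) pair per word; each line is handled independently."""
    for word in line.split():
        yield word.lower(), 1


def shuffle_stage(pairs):
    """Shuffle: group values by key and sort by key (a local stand-in for the distributed sort)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_stage(key, values):
    """Reduce: process all values that share one key in a single step."""
    return key, sum(values)


def run_job(lines):
    mapped = (pair for line in lines for pair in map_stage(line))
    return [reduce_stage(key, values) for key, values in shuffle_stage(mapped)]


result = run_job(["Hadoop stores data", "Hadoop processes data"])
print(result)
```

Because `map_stage` never looks at more than one line, the map work can be split across machines freely; only the shuffle needs to bring matching keys together.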
29
Developing Analytical Applications with Hadoop
30
Novelty is the Enemy of Adoption
31
The Best Way to Get Started: Apache Hive
• Apache Hive
• Data Warehouse System on top of Hadoop
• SQL-based query language
• SELECT, INSERT, CREATE TABLE
• Includes some MapReduce-specific extensions
32
Borrowing Abstractions
33
Improving the UX (http://github.com/cloudera/impala)
34
Moving Beyond the Abstractions
35
Making the Abstract Concrete
36
Cloudera’s Data Science Course
37
Analytical Applications I Love
38
The Experiments Dashboard
39
Adverse Drug Events
40
Gene Sequencing and Analytics
41
The Doctor’s Perspective
42
A Couple of Themes
1. Structure the data in the way that makes sense for the problem.
2. Interactive inputs, not just interactive outputs.
3. Simpler interfaces that yield more sophisticated answers.
43
Working Towards The Dream
44
Moving Beyond MapReduce
Developing Analytical Applications
45
The Cambrian Explosion…of Frameworks
46
It’s Frameworks All The Way Down: Spark
• Developed at Berkeley’s AMP Lab
• Defines operations on distributed in-memory collections
• Written in Scala
• Supports reading from and writing to HDFS
47
IFATWD: GraphLab
• Developed at CMU
• Lower-level primitives
• (but higher than MPI)
• Map/Reduce => Update/Sort
• Flexible, allows for asynchronous computations
• Reads from HDFS
48
Playing with YARN
49
BranchReduce (http://github.com/cloudera/branchreduce)
50