4. Scalability and MapReduce
Prof. Tudor Dumitraș
Assistant Professor, ECE
University of Maryland, College Park
ENEE 759D | ENEE 459D | CMSC 858Z
http://ter.ps/759d
https://www.facebook.com/SDSAtUMD
Today’s Lecture
• Where we’ve been
– How to say “hapax legomenon” and “heteroskedasticity”
– Interpretation of Statistics
– Attributes of Big Data
• Where we’re going today
– Threats to validity
– Scalability
– MapReduce
• Where we’re going next
– Machine learning
The IROP Keyboard [Zeller, 2011]
To prevent bugs, remove the keystrokes that predict 74% of failure-prone modules in Eclipse
[Figure: Reconstruct Lineage, Korgo worm family. Labels: Sample C, Sample D, Sample E; V1 ?, V2 ?, V3 ?; nodes C, D, E, F, G, N, S, T]
Does this work?
What am I measuring?
How well does this work in the real world?
Will this work tomorrow?
What Am I Measuring: Scalability vs. Latency
• Analyzing data in parallel
– To access 1 TB in 1 min, must distribute data over 20 disks
– Parallelism is useful for algorithms where complexity constants matter
    • N log N operations sequentially => (N log N)/K operations in parallel
– Scalability: ability to throw resources at the problem
• You can measure scalability
– Scaleup (weak scalability):
    • More resources => solve proportionally bigger problem with same latency
– Speedup (strong scalability):
    • More resources => proportionally lower latency with same problem size
Can we make use of 1000s of cheap computers?
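The two scalability measures above can be written as simple ratios. The sketch below is illustrative only: the function names and the timings are assumptions, not measurements from the lecture, and the efficiency helper is a common companion metric rather than something the slide defines.

```python
def speedup(t_one: float, t_k: float) -> float:
    """Strong scaling: same problem size, latency on 1 resource vs. on K."""
    return t_one / t_k


def scaleup(t_small_one: float, t_big_k: float) -> float:
    """Weak scaling: latency for size N on 1 resource vs. size K*N on K resources."""
    return t_small_one / t_big_k


def efficiency(speedup_value: float, k: int) -> float:
    """Fraction of the ideal (linear) speedup that was achieved."""
    return speedup_value / k


# Hypothetical timings in seconds, for illustration only.
k = 8
s = speedup(t_one=400.0, t_k=62.5)             # same problem, 8 machines
e = efficiency(s, k)                           # 1.0 would be linear speedup
w = scaleup(t_small_one=400.0, t_big_k=500.0)  # 1.0 would be ideal scaleup
print(f"speedup={s:.2f} efficiency={e:.2f} scaleup={w:.2f}")
```

A speedup close to K, or a scaleup close to 1, indicates that the added resources are being used well.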
Some Problems Are Embarrassingly Parallel (1)
Input: many TIFF images
Distribute images among K computers
f is a function to convert TIFF to PNG; apply it to every item
Output: a big distributed set of converted images
Task: Convert 405K TIFF images (~4 TB) to PNG
http://open.blogs.nytimes.com/2008/05/21/the-new-york-times-archives-amazon-web-services-timesmachine/
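The pattern on this slide, applying f independently to every item, can be sketched on a single machine. This is a minimal sketch under stated assumptions: worker threads stand in for the K computers, and f only rewrites the file name instead of actually decoding a TIFF, so no image library is needed. The file names are made up.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import PurePath


def f(tiff_name: str) -> str:
    """Stand-in for the TIFF-to-PNG conversion: only the file name is rewritten.
    A real job would decode and re-encode the image here."""
    return str(PurePath(tiff_name).with_suffix(".png"))


def convert_all(names: list[str], k: int) -> list[str]:
    # Each item is independent, so K workers can apply f with no coordination.
    with ThreadPoolExecutor(max_workers=k) as pool:
        return list(pool.map(f, names))


images = [f"scan_{i:04d}.tiff" for i in range(10)]  # hypothetical inputs
converted = convert_all(images, k=4)
print(converted[:3])
```

Because no item depends on any other, the only thing that changes when moving to K real machines is how the inputs are distributed and the outputs collected.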
Some Problems Are Embarrassingly Parallel (2)
Input: millions of documents
Distribute documents among K computers
For each document f returns a set of <word, freq> pairs
Output: a big distributed list of sets of word freqs.
Task: Compute the word frequency of 5M documents
Adapted from slides by Bill Howe
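As a rough illustration of the per-document function f on this slide, the sketch below returns one small histogram per document. The tokenization rule (lowercase, letters and apostrophes only) and the sample documents are assumptions made for the example.

```python
import re
from collections import Counter


def f(document: str) -> Counter:
    """Return the <word, freq> pairs for one document."""
    words = re.findall(r"[a-z']+", document.lower())
    return Counter(words)


# Hypothetical documents, for illustration only.
documents = [
    "the worm spreads and the worm mutates",
    "the patch stops the worm",
]
# One small histogram per document; nothing is combined across documents.
histograms = [f(doc) for doc in documents]
print(histograms[0]["worm"], histograms[1]["worm"])
```

The result is a list of independent histograms, one per document, which is why this task is still embarrassingly parallel.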
Some Problems Are Embarrassingly Parallel (3)
Input: millions of documents
Distribute documents among K computers
For each document f returns a set of <word, freq> pairs
Task: Compute the word frequency across all documents
Now what? We don’t want a bunch of little histograms – we want one big histogram
MapReduce
Task: Compute the word frequency across all documents
Distribute documents among K computers
map: For each document f returns a set of <word, freq> pairs
A big distributed list of sets of word freqs.
Shuffle <word, freq> pairs so that all the counts for a word are sent to the same host
reduce: Add the counts of each word
Output: the distributed histogram
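The map, shuffle, and reduce steps above can be simulated in one process. This is a sketch, not how any particular framework implements the shuffle: the number of reduce hosts, the CRC32-based partitioning rule, and the sample documents are all assumptions chosen for the example.

```python
import re
import zlib
from collections import defaultdict

NUM_REDUCERS = 3  # hypothetical number of reduce hosts


def map_doc(document: str):
    """map: emit a <word, 1> pair for every word occurrence."""
    for word in re.findall(r"[a-z']+", document.lower()):
        yield word, 1


def host_for(word: str) -> int:
    """Shuffle rule: a stable hash sends every pair for a word to one host."""
    return zlib.crc32(word.encode("utf-8")) % NUM_REDUCERS


def word_histogram(documents):
    # shuffle: group the counts by word on the host that owns the word
    hosts = [defaultdict(list) for _ in range(NUM_REDUCERS)]
    for doc in documents:
        for word, count in map_doc(doc):
            hosts[host_for(word)][word].append(count)
    # reduce: each host adds up the counts of its own words
    return [{w: sum(c) for w, c in host.items()} for host in hosts]


docs = ["the worm spreads and the worm mutates", "the patch stops the worm"]
partitions = word_histogram(docs)  # one partial histogram per host
total = {w: n for part in partitions for w, n in part.items()}
print(total["the"], total["worm"])
```

Since each word lives on exactly one host, the per-host dictionaries never overlap, and together they form the distributed histogram.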
Hadoop on One Slide
Source: Huy Vo
• MapReduce was invented at Google [Dean & Ghemawat, OSDI’04]
• Hadoop = open source implementation
• Data stored on HDFS distributed file system
– Direct-attached storage
– No schema needed on load
• Programmers write Map and Reduce functions
• Framework provides automated parallelization and fault tolerance
– Data replication, restarting failed tasks
– Scheduling Map and Reduce tasks on hosts with local copies of input data
MapReduce Programming Model
• Input & Output: each a set of key/value pairs
• Programmer specifies two functions:
map (in_key, in_value) -> list(out_key, intermediate_value)
– Processes input key/value pair
– Produces set of intermediate pairs
reduce (out_key, list(intermediate_value)) -> list(out_value)
– Combines all intermediate values for a particular key
– Produces a set of merged output values (usually just one)
• Inspired by primitives from functional programming languages such as Lisp, Scheme, and Haskell
Slide source: Google
Example: What Does This Do?
map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, 1);

reduce(String output_key, Iterator intermediate_values):
  // output_key: word
  // output_values: ????
  int result = 0;
  for each v in intermediate_values:
    result += v;
  EmitFinal(output_key, result);
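One way to check an answer to the question above is to run the pseudocode. The following is a Python sketch of the same two functions; `run_mapreduce` is a hypothetical single-process driver standing in for the framework, the whitespace split is an assumed definition of "word", and the input documents are made up.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_fn(input_key: str, input_value: str) -> Iterator[tuple[str, int]]:
    # input_key: document name, input_value: document contents
    for w in input_value.split():
        yield w, 1  # plays the role of EmitIntermediate(w, 1)


def reduce_fn(output_key: str, intermediate_values: Iterable[int]) -> tuple[str, int]:
    result = 0
    for v in intermediate_values:
        result += v
    return output_key, result  # plays the role of EmitFinal(output_key, result)


def run_mapreduce(inputs: dict[str, str]) -> dict[str, int]:
    """Hypothetical driver: groups intermediate values by key, then reduces."""
    grouped = defaultdict(list)
    for key, value in inputs.items():
        for out_key, intermediate in map_fn(key, value):
            grouped[out_key].append(intermediate)
    return dict(reduce_fn(k, vs) for k, vs in grouped.items())


counts = run_mapreduce({"a.txt": "to be or not to be", "b.txt": "to do"})
print(counts)
```

The driver is the only part a real framework would replace; the two user-written functions keep the same shape.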
Big Data in the Security Industry
• Booz Allen Hamilton
– Dr. Brian Keller’s colloquium “Innovating with Analytics”
– Sponsors Data Science Bowl, October 5th 1-5:30 pm CSIC 2117 & 2120 https://www.datasciencebowl.com/
• Symantec
– WINE platform for data analytics in security
• Google
– Mine user access patterns to mitigate data loss due to stolen credentials
    • Supplementary to passwords and two-factor authentication
– Fuzz testing at scale
Big Data for Security: Benefits and Challenges
• Benefits
– Ability to analyze data at scale (e.g., the information on the 403 million malware variants created in 2011)
– MapReduce provides a simple programming model, automated parallelization and fault tolerance
    • Commercial parallel DBs (e.g., Vertica, Greenplum, Aster Data) also provide some of these benefits, but they are very expensive
• Challenges
– Lack of ground truth on malware families
– Lack of contextual data: e.g., date and time of appearance
– Inability to collect some types of data owing to privacy concerns
– Sharing data (e.g., malware samples are dangerous, some data sets may include personal information)
Illustrate general threats to validity in experimental cyber security
Threats to Validity
Construct validity: use metrics that model the hypothesis
Internal validity: establish causal connection
Content validity: include only and all relevant data
External validity: generalize results beyond experimental data
Does it work?
What am I measuring?
Will it work in the real world?
Will it work tomorrow?
Review of Lecture
• What did we learn?
– Construct, content, internal, external validity
– Programming in MapReduce
– Measuring scalability
• What’s next?
– Paper discussion: ‘Before We Knew It: An Empirical Study of Zero-Day Attacks In The Real World’
– Next lecture: Machine learning techniques
• Deadline reminder
– Pilot project reports due on Wednesday
– Post report on Piazza