4. Scalability and MapReduce

Prof. Tudor Dumitraș
Assistant Professor, ECE
University of Maryland, College Park

ENEE 759D | ENEE 459D | CMSC 858Z

http://ter.ps/759d

https://www.facebook.com/SDSAtUMD

Today’s Lecture

• Where we’ve been
– How to say “hapax legomenon” and “heteroskedasticity”
– Interpretation of Statistics
– Attributes of Big Data

• Where we’re going today
– Threats to validity
– Scalability
– MapReduce

• Where we’re going next
– Machine learning

The IROP Keyboard [Zeller, 2011]

To prevent bugs, remove the keystrokes that predict 74% of failure-prone modules in Eclipse

[Figure: Reconstruct Lineage, Korgo worm family. Samples C, D, E with unknown versions (V1?, V2?, V3?); lineage nodes C, D, E, F, G, N, S, T]

Does this work?

What am I measuring?

How well does this work in the real world?

Will this work tomorrow?

What Am I Measuring: Scalability vs. Latency

• Analyzing data in parallel
– To access 1 TB in 1 min, must distribute data over 20 disks
– Parallelism is useful for algorithms where complexity constants matter
  • N log N operations sequentially => (N log N)/K operations in parallel
– Scalability: ability to throw resources at the problem

• You can measure scalability
– Scaleup (weak scalability):
  • More resources => solve proportionally bigger problem with same latency
– Speedup (strong scalability):
  • More resources => proportionally lower latency with same problem size

Can we make use of 1000s of cheap computers?
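The two scalability measures above can be written as simple ratios. The sketch below is illustrative only: the function names and the timings in the usage lines are hypothetical, and 1 TB is assumed to mean 10^12 bytes. It also works out the per-disk read rate implied by the "1 TB in 1 min over 20 disks" bullet.

```python
def speedup(t_one_machine: float, t_k_machines: float) -> float:
    """Strong scaling: same problem size, latency on 1 machine vs. K machines.
    The ideal value is K."""
    return t_one_machine / t_k_machines


def scaleup(t_small_on_one: float, t_big_on_k: float) -> float:
    """Weak scaling: latency of a size-N problem on 1 machine divided by the
    latency of a size-K*N problem on K machines. The ideal value is 1."""
    return t_small_on_one / t_big_on_k


def efficiency(speedup_value: float, k: int) -> float:
    """Fraction of the ideal K-fold speedup that was actually achieved."""
    return speedup_value / k


def per_disk_rate(total_bytes: float, seconds: float, disks: int) -> float:
    """Bytes per second each disk must deliver if the read is spread evenly."""
    return total_bytes / seconds / disks


# Hypothetical timings: 120 s on 1 machine, 20 s on 8 machines.
s = speedup(120.0, 20.0)   # 6.0, short of the ideal 8
e = efficiency(s, 8)       # 0.75

# Slide example, assuming 1 TB = 10**12 bytes: roughly 833 MB/s per disk.
rate_mb_s = per_disk_rate(1e12, 60.0, 20) / 1e6
```

A speedup below K, or a scaleup below 1, indicates overhead (coordination, data movement, skew) that extra machines did not remove.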

Some Problems Are Embarrassingly Parallel (1)

Input: many TIFF images

Distribute images among K computers

f is a function that converts TIFF to PNG; apply it to every item

Output: a big distributed set of converted images

Task: Convert 405K TIFF images (~4 TB) to PNG

http://open.blogs.nytimes.com/2008/05/21/the-new-york-times-archives-amazon-web-services-timesmachine/
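The pattern on this slide (apply the same f to every item, with no communication between items) can be sketched on one machine. In the sketch below, K worker threads stand in for K computers, and `convert` is a hypothetical stand-in for f that only rewrites the file name; a real conversion would decode the TIFF and write a PNG with an imaging library.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import PurePosixPath


def convert(tiff_name: str) -> str:
    """Hypothetical stand-in for f: returns the name the PNG would have.
    No image data is read or written here."""
    return str(PurePosixPath(tiff_name).with_suffix(".png"))


def parallel_apply(f, items, k: int):
    """Apply f to every item using k workers. Items are independent, so the
    work can be split in any way; results come back in input order."""
    with ThreadPoolExecutor(max_workers=k) as pool:
        return list(pool.map(f, items))


images = ["scans/page-0001.tif", "scans/page-0002.tif", "scans/page-0003.tif"]
converted = parallel_apply(convert, images, k=2)
```

Because no item depends on any other, adding workers needs no change to f itself, which is what makes the problem embarrassingly parallel.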

Some Problems Are Embarrassingly Parallel (2)

Input: millions of documents

Distribute documents among K computers

For each document f returns a set of <word, freq> pairs

Output: a big distributed list of sets of word freqs.

Task: Compute the word frequency of 5M documents

Adapted from slides by Bill Howe
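A minimal sketch of the per-document function f, assuming a deliberately simple tokenizer (lowercase, runs of letters and apostrophes). The tokenizer, the function name `word_freq`, and the sample documents are illustrative assumptions, not part of the slides.

```python
import re
from collections import Counter


def word_freq(text: str) -> Counter:
    """f for one document: maps each word to its frequency in that document.
    Tokenization rule is an assumption made for this sketch."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


# Each document is processed independently of the others.
docs = {
    "d1": "the cat sat on the mat",
    "d2": "The dog",
}
per_doc = {name: word_freq(text) for name, text in docs.items()}
```

The result is one small histogram per document, which matches the slide's output: a list of sets of word frequencies rather than a single combined count.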

Some Problems Are Embarrassingly Parallel (3)

Input: millions of documents

Distribute documents among K computers

For each document f returns a set of <word, freq> pairs

Task: Compute the word frequency across all documents

Now what? We don’t want a bunch of little histograms; we want one big histogram

MapReduce

Task: Compute the word frequency across all documents

Distribute documents among K computers

map: for each document, f returns a set of <word, freq> pairs, giving a big distributed list of sets of word freqs.

Shuffle <word, freq> pairs so that all the counts for a word are sent to the same host

reduce: add the counts of each word

Output: the distributed histogram
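The shuffle step is what turns many little histograms into one distributed histogram. One common way to route pairs is to hash the word and take the result modulo the number of reduce hosts; the sketch below assumes that scheme. The names `partition`, `shuffle`, and `reduce_counts` are hypothetical, and plain lists stand in for network transfers.

```python
import zlib
from collections import defaultdict


def partition(word: str, num_reducers: int) -> int:
    """Pick the reduce host for a word. crc32 is used because it is stable
    across processes, unlike Python's salted built-in hash() for strings."""
    return zlib.crc32(word.encode("utf-8")) % num_reducers


def shuffle(mapper_outputs, num_reducers: int):
    """Route every (word, freq) pair to the inbox of its reduce host.
    mapper_outputs holds one list of pairs per map task."""
    inboxes = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in mapper_outputs:
        for word, freq in pairs:
            inboxes[partition(word, num_reducers)][word].append(freq)
    return inboxes


def reduce_counts(inbox):
    """Reduce on one host: add up the counts of each word it received."""
    return {word: sum(freqs) for word, freqs in inbox.items()}


mapper_outputs = [[("the", 2), ("cat", 1)], [("the", 1), ("dog", 1)]]
inboxes = shuffle(mapper_outputs, num_reducers=3)
histogram_parts = [reduce_counts(inbox) for inbox in inboxes]
```

Each word appears in exactly one element of `histogram_parts`, so the pieces together form the full histogram without any host ever holding all of it.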

Hadoop on One Slide

Source: Huy Vo

• MapReduce was invented at Google [Dean & Ghemawat, OSDI’04]

• Hadoop = open source implementation

• Data stored on HDFS distributed file system

– Direct-attached storage

– No schema needed on load

• Programmers write Map and Reduce functions

• Framework provides automated parallelization and fault tolerance
– Data replication, restarting failed tasks
– Scheduling Map and Reduce tasks on hosts with local copies of input data
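The last bullet (run tasks on hosts that already hold a copy of the input) can be illustrated with a toy scheduler. This is a sketch of the idea only, not Hadoop's actual scheduler: the function name, the data layout, and the tie-breaking rule are all assumptions made for illustration.

```python
def schedule_map_tasks(block_replicas, hosts):
    """Toy locality-aware assignment of one map task per input block.

    block_replicas: dict mapping block id -> list of hosts holding a replica.
    hosts: list of hosts available to run tasks.
    Returns dict mapping block id -> (chosen host, True if the read is local).
    """
    load = {h: 0 for h in hosts}
    assignment = {}
    for block, replicas in block_replicas.items():
        # Prefer available hosts that store a replica of this block.
        local = [h for h in replicas if h in load]
        candidates = local or hosts
        # Among the candidates, pick the least-loaded host.
        chosen = min(candidates, key=lambda h: load[h])
        assignment[block] = (chosen, chosen in replicas)
        load[chosen] += 1
    return assignment


# Hypothetical cluster: block b2 has no replica on any available host.
block_replicas = {"b0": ["h1", "h2"], "b1": ["h1", "h3"], "b2": ["h9"]}
plan = schedule_map_tasks(block_replicas, hosts=["h1", "h2", "h3"])
```

Replication helps here in two ways: it gives the scheduler several local candidates per block, and it leaves a copy to read from when a task has to be restarted on another host.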

MapReduce Programming Model

• Input & Output: each a set of key/value pairs

• Programmer specifies two functions:

map (in_key, in_value) -> list(out_key, intermediate_value)

– Processes input key/value pair

– Produces set of intermediate pairs

reduce (out_key, list(intermediate_value)) -> list(out_value)

– Combines all intermediate values for a particular key

– Produces a set of merged output values (usually just one)

• Inspired by primitives from functional programming languages such as Lisp, Scheme, and Haskell

Slide source: Google
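The two signatures above are enough to write a single-process driver that groups intermediate values by key between the map and reduce phases. The sketch below is an in-memory illustration, not the Google or Hadoop implementation; `run_mapreduce` and the inverted-index job used to exercise it are hypothetical examples that do not come from the slides.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple


def run_mapreduce(
    inputs: Iterable[Tuple[str, str]],
    mapper: Callable[[str, str], List[Tuple[str, str]]],
    reducer: Callable[[str, List[str]], List[str]],
) -> Dict[str, List[str]]:
    """map(in_key, in_value) -> list of (out_key, intermediate_value);
    reduce(out_key, list of intermediate values) -> list of output values."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for in_key, in_value in inputs:
        for out_key, intermediate in mapper(in_key, in_value):
            groups[out_key].append(intermediate)
    return {key: reducer(key, values) for key, values in groups.items()}


# Illustrative job: inverted index (word -> documents that contain it).
def index_map(doc_name: str, contents: str) -> List[Tuple[str, str]]:
    return [(word, doc_name) for word in set(contents.lower().split())]


def index_reduce(word: str, doc_names: List[str]) -> List[str]:
    return sorted(set(doc_names))


index = run_mapreduce(
    [("d1", "big data"), ("d2", "big compute")], index_map, index_reduce
)
```

The driver knows nothing about the job; only the two functions change from one job to the next, which is the division of labor the slide describes.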

Example: What Does This Do?

map(String input_key, String input_value):
    // input_key: document name
    // input_value: document contents
    for each word w in input_value:
        EmitIntermediate(w, 1);

reduce(String output_key, Iterator intermediate_values):
    // output_key: word
    // output_values: ????
    int result = 0;
    for each v in intermediate_values:
        result += v;
    EmitFinal(output_key, result);
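A runnable Python rendering of the pseudocode above may make its behavior easier to check. It is a sketch under stated assumptions: `yield` plays the role of EmitIntermediate and EmitFinal, words are split on whitespace, and a local sort followed by grouping stands in for the framework's shuffle. The names `map_fn`, `reduce_fn`, and `word_count` are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def map_fn(input_key, input_value):
    # input_key: document name; input_value: document contents
    for w in input_value.split():
        yield (w, 1)  # stands in for EmitIntermediate(w, 1)


def reduce_fn(output_key, intermediate_values):
    # output_key: a word; intermediate_values: the 1s emitted for that word
    result = 0
    for v in intermediate_values:
        result += v
    yield (output_key, result)  # stands in for EmitFinal(output_key, result)


def word_count(documents):
    """Run map on every document, sort pairs by key, then reduce each group."""
    intermediate = [
        pair for name, text in documents.items() for pair in map_fn(name, text)
    ]
    intermediate.sort(key=itemgetter(0))
    output = {}
    for word, group in groupby(intermediate, key=itemgetter(0)):
        for key, total in reduce_fn(word, (v for _, v in group)):
            output[key] = total
    return output


counts = word_count({"a.txt": "to be or not to be", "b.txt": "to do"})
```

Run on the two sample documents, the job returns the number of occurrences of each word across all documents.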

Big Data in the Security Industry

• Booz Allen Hamilton
– Dr. Brian Keller’s colloquium “Innovating with Analytics”
– Sponsors Data Science Bowl, October 5th 1-5:30 pm CSIC 2117 & 2120 https://www.datasciencebowl.com/

• Symantec
– WINE platform for data analytics in security

• Google
– Mine user access patterns to mitigate data loss due to stolen credentials
  • Supplementary to passwords and two-factor authentication
– Fuzz testing at scale

Big Data for Security: Benefits and Challenges

• Benefits
– Ability to analyze data at scale (e.g., the information on the 403 million malware variants created in 2011)
– MapReduce provides a simple programming model, automated parallelization and fault tolerance
  • Commercial parallel DBs (e.g., Vertica, Greenplum, Aster Data) also provide some of these benefits, but they are very expensive

• Challenges
– Lack of ground truth on malware families
– Lack of contextual data: e.g., date and time of appearance
– Inability to collect some types of data owing to privacy concerns
– Sharing data (e.g., malware samples are dangerous, some data sets may include personal information)

Illustrate general threats to validity in experimental cyber security

Threats to Validity

Construct validity: use metrics that model the hypothesis

Internal validity: establish causal connection

Content validity: include only and all relevant data

External validity: generalize results beyond experimental data

Does it work?

What am I measuring?

Will it work in the real world?

Will it work tomorrow?

Review of Lecture

• What did we learn?
– Construct, content, internal, external validity
– Programming in MapReduce
– Measuring scalability

• What’s next?
– Paper discussion: ‘Before We Knew It: An Empirical Study of Zero-Day Attacks In The Real World’
– Next lecture: Machine learning techniques

• Deadline reminder
– Pilot project reports due on Wednesday
– Post report on Piazza