4. Scalability and MapReduce Prof. Tudor Dumitraș Assistant Professor, ECE University of Maryland,...

4. Scalability and MapReduce

Prof. Tudor Dumitraș

Assistant Professor, ECE

University of Maryland, College Park

ENEE 759D | ENEE 459D | CMSC 858Z

http://ter.ps/759d

https://www.facebook.com/SDSAtUMD

Today’s Lecture

• Where we’ve been

– How to say “hapax legomenon” and “heteroskedasticity”

– Interpretation of Statistics

– Attributes of Big Data

• Where we’re going today

– Threats to validity

– Scalability

– MapReduce

• Where we’re going next

– Machine learning

The IROP Keyboard [Zeller, 2011]

To prevent bugs, remove the keystrokes that predict 74% of failure-prone modules in Eclipse

[Figure: Sample C, Sample D, Sample E; V1? V2?]

Does this work?

What am I measuring?

How well does this work in the real world?

Will this work tomorrow?

Reconstruct Lineage

Korgo worm family

What Am I Measuring: Scalability vs. Latency

• Analyzing data in parallel

– To access 1 TB in 1 min, must distribute data over 20 disks

– Parallelism is useful for algorithms where complexity constants matter

• N log N operations sequentially => (N log N)/K operations in parallel

– Scalability: ability to throw resources at the problem

• You can measure scalability

– Scaleup (weak scalability):

• More resources => solve proportionally bigger problem with same latency

– Speedup (strong scalability):

• More resources => proportionally lower latency with same problem size
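The two measurements above can be written as simple ratios. The sketch below is one possible formulation, not taken from the slides: the function names and all latency numbers are hypothetical and chosen only for illustration.

```python
def speedup(latency_1: float, latency_k: float) -> float:
    """Strong scalability: same problem size, K times the resources.
    Ideal value is K."""
    return latency_1 / latency_k


def scaleup(latency_small_1: float, latency_big_k: float) -> float:
    """Weak scalability: problem size and resources both grow by K.
    Ideal value is 1.0 (same latency)."""
    return latency_small_1 / latency_big_k


def efficiency(speedup_value: float, k: int) -> float:
    """Fraction of the ideal speedup actually achieved."""
    return speedup_value / k


# Hypothetical measurements in seconds (invented for this example).
K = 10
t_one_host = 1200.0      # problem of size N on 1 host
t_k_hosts = 150.0        # same problem of size N on K hosts
t_k_hosts_big = 1500.0   # problem of size K*N on K hosts

s = speedup(t_one_host, t_k_hosts)
print("speedup:", s, "efficiency:", efficiency(s, K))
print("scaleup:", scaleup(t_one_host, t_k_hosts_big))
```

With these made-up numbers, 10 hosts give a speedup of 8 rather than 10, and the 10x larger problem takes longer than the original, so neither measurement is ideal.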

Can we make use of 1000s of cheap computers?

Some Problems Are Embarrassingly Parallel (1)

Input: many TIFF images

Distribute images among K computers

f is a function to convert TIFF to PNG; apply it to every item

Output: a big distributed set of converted images

[Diagram: f applied to the images on each computer in parallel]

Task: Convert 405K TIFF images (~4 TB) to PNG

http://open.blogs.nytimes.com/2008/05/21/the-new-york-times-archives-amazon-web-services-timesmachine/
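A minimal sketch of this pattern, assuming a single machine with K worker threads standing in for the K computers. The function f here is a hypothetical placeholder that only renames a file; a real converter would call an image library instead.

```python
from concurrent.futures import ThreadPoolExecutor


def f(tiff_name: str) -> str:
    """Hypothetical stand-in for a TIFF-to-PNG converter.
    It only derives the output name; no image data is touched."""
    return tiff_name.rsplit(".", 1)[0] + ".png"


def convert_all(items: list[str], k: int) -> list[str]:
    """Apply f to every item using k workers.
    No item depends on any other, so no coordination is needed."""
    with ThreadPoolExecutor(max_workers=k) as pool:
        return list(pool.map(f, items))  # map keeps the input order


images = ["page_%03d.tiff" % i for i in range(6)]
converted = convert_all(images, k=3)
print(converted)
```

Because each call to f is independent, adding workers requires no change to f itself; this is what makes the problem embarrassingly parallel.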

Input: millions of documents

Distribute documents among K computers

For each document f returns a set of <word, freq> pairs

Output: a big distributed list of sets of word freqs.

[Diagram: f applied to the documents on each computer in parallel]

Task: Compute the word frequency of 5M documents

Adapted from slides by Bill Howe
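One possible f for this task, as a sketch. Lowercasing and splitting on whitespace are assumptions made here for brevity; the slides do not specify how words are tokenized, and the two sample documents are invented.

```python
from collections import Counter


def f(document: str) -> Counter:
    """Return the <word, freq> pairs for a single document."""
    return Counter(document.lower().split())


# Hypothetical documents for illustration.
docs = ["the cat sat", "the dog sat on the mat"]

# One independent call per document; each call could run on a different host.
histograms = [f(d) for d in docs]
for h in histograms:
    print(dict(h))
```

The result is still one small histogram per document, since f never looks at more than one document at a time.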

Input: millions of documents

[Diagram: f applied to the documents on each computer in parallel]

Task: Compute the word frequency across all documents

Now what? We don’t want a bunch of little histograms, we want one big histogram

MapReduce

A big distributed list of sets of word freqs.

[Diagram: several map tasks running in parallel]

Task: Compute the word frequency across all documents

[Diagram: several reduce tasks running in parallel]

Reduce: add the counts of each word

Shuffle <word, freq> pairs so that all the counts for a word are sent to the same host

Output: the distributed histogram
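The shuffle step can be sketched on one machine as follows. This is an illustration under stated assumptions, not how Hadoop implements it: NUM_REDUCERS, the function names, the CRC32-based partition rule, and the sample documents are all hypothetical.

```python
import zlib
from collections import Counter, defaultdict

NUM_REDUCERS = 2  # hypothetical number of reduce hosts


def map_doc(document: str) -> list[tuple[str, int]]:
    """Map: emit one <word, freq> pair per distinct word in a document."""
    return list(Counter(document.split()).items())


def partition(word: str) -> int:
    """Shuffle rule: a stable hash, so a given word always goes to one host."""
    return zlib.crc32(word.encode("utf-8")) % NUM_REDUCERS


def reduce_counts(word: str, freqs: list[int]) -> tuple[str, int]:
    """Reduce: add the counts of each word."""
    return word, sum(freqs)


docs = ["a b a", "b c", "a c c"]

# Shuffle: route every pair to the reducer that owns its word.
inboxes = [defaultdict(list) for _ in range(NUM_REDUCERS)]
for doc in docs:
    for word, freq in map_doc(doc):
        inboxes[partition(word)][word].append(freq)

# Each reducer ends up holding one slice of the distributed histogram.
histogram_parts = [
    dict(reduce_counts(w, fs) for w, fs in inbox.items()) for inbox in inboxes
]
merged = {w: n for part in histogram_parts for w, n in part.items()}
print(merged)
```

Because the partition rule depends only on the word, no word appears in more than one slice, and the slices together form the one big histogram.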

Hadoop on One Slide

Source: Huy Vo

• MapReduce was invented at Google [Dean & Ghemawat, OSDI’04]

• Hadoop = open source implementation

• Data stored on HDFS distributed file system

– Direct-attached storage

– No schema needed on load

• Programmers write Map and Reduce functions

• Framework provides automated parallelization and fault tolerance

– Data replication, restarting failed tasks

– Scheduling Map and Reduce tasks on hosts with local copies of input data

MapReduce Programming Model

• Input & Output: each a set of key/value pairs

• Programmer specifies two functions:

map (in_key, in_value) -> list(out_key, intermediate_value)

– Processes input key/value pair

– Produces set of intermediate pairs

reduce (out_key, list(intermediate_value)) -> list(out_value)

– Combines all intermediate values for a particular key

– Produces a set of merged output values (usually just one)

• Inspired by primitives from functional programming languages such as Lisp, Scheme, and Haskell

Slide source: Google
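To make the two signatures concrete, here is a minimal single-process sketch of a driver that accepts any map and reduce function of those shapes. The name run_mapreduce and the toy job (total size per file extension, with invented file names and sizes) are assumptions for illustration; a real framework would also distribute the work and handle failures.

```python
from collections import defaultdict
from typing import Any, Callable, Iterable

# map (in_key, in_value) -> list(out_key, intermediate_value)
Mapper = Callable[[Any, Any], Iterable[tuple[Any, Any]]]
# reduce (out_key, list(intermediate_value)) -> list(out_value)
Reducer = Callable[[Any, list[Any]], list[Any]]


def run_mapreduce(
    inputs: Iterable[tuple[Any, Any]], mapper: Mapper, reducer: Reducer
) -> dict[Any, list[Any]]:
    """Run mapper on every input pair, group by out_key, then reduce each group."""
    groups: dict[Any, list[Any]] = defaultdict(list)
    for in_key, in_value in inputs:
        for out_key, value in mapper(in_key, in_value):
            groups[out_key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


# Toy job with hypothetical data: total size per file extension.
def size_by_extension(file_name: str, size: int):
    yield file_name.rsplit(".", 1)[1], size


def total(extension: str, sizes: list[int]) -> list[int]:
    return [sum(sizes)]  # usually just one output value per key


inputs = [("a.tiff", 10), ("b.png", 4), ("c.tiff", 7)]
result = run_mapreduce(inputs, size_by_extension, total)
print(result)
```

The programmer writes only the two small functions; the grouping by key in the middle is the part the framework owns.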

Example: What Does This Do?

map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, 1);

reduce(String output_key, Iterator intermediate_values):
  // output_key: word
  // output_values: ????
  int result = 0;
  for each v in intermediate_values:
    result += v;
  EmitFinal(output_key, result);
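One way to check an answer to the question is to run a direct translation of the pseudocode. The sketch below is an assumption-laden rendering in Python: EmitIntermediate and EmitFinal become yield and return, sorting plus groupby stands in for the framework's grouping step, and the two sample documents are invented.

```python
from itertools import groupby
from operator import itemgetter


def map_fn(input_key: str, input_value: str):
    # input_key: document name; input_value: document contents
    for w in input_value.split():
        yield w, 1  # EmitIntermediate(w, 1)


def reduce_fn(output_key: str, intermediate_values):
    # output_key: word
    result = 0
    for v in intermediate_values:
        result += v
    return output_key, result  # EmitFinal(output_key, result)


# Hypothetical input documents.
documents = {"d1": "to be or not to be", "d2": "to do"}

# Sorting brings equal keys together so groupby can hand each key's
# values to a single reduce call.
pairs = sorted(p for name, text in documents.items() for p in map_fn(name, text))
final = [
    reduce_fn(word, (v for _, v in group))
    for word, group in groupby(pairs, key=itemgetter(0))
]
print(final)
```

Each output pair holds a word and the number of times it occurs across all input documents.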

Big Data in the Security Industry

• Booz Allen Hamilton

– Dr. Brian Keller’s colloquium “Innovating with Analytics”

– Sponsors Data Science Bowl, October 5th 1-5:30 pm CSIC 2117 & 2120 https://www.datasciencebowl.com/

• Symantec

– WINE platform for data analytics in security

• Google

– Mine user access patterns to mitigate data loss due to stolen credentials

• Supplementary to passwords and two-factor authentication

– Fuzz testing at scale

Big Data for Security: Benefits and Challenges

• Benefits

– Ability to analyze data at scale (e.g., the information on the 403 million malware variants created in 2011)

– MapReduce provides a simple programming model, automated parallelization and fault tolerance

• Commercial parallel DBs (e.g. Vertica, Greenplum, Aster Data) also provide some of these benefits, but they are very expensive

• Challenges

– Lack of ground truth on malware families

– Lack of contextual data: e.g., date and time of appearance

– Inability to collect some types of data owing to privacy concerns

– Sharing data (e.g., malware samples are dangerous, some data sets may include personal information)

Illustrate general threats to validity in experimental cyber security

Threats to Validity

Construct validity: use metrics that model the hypothesis

Internal validity: establish causal connection

Content validity: include only and all relevant data

External validity: generalize results beyond experimental data

Does it work?

What am I measuring?

Will it work in the real world?

Will it work tomorrow?

Review of Lecture

• What did we learn?

– Construct, content, internal, external validity

– Programming in MapReduce

– Measuring scalability

• What’s next?

– Paper discussion: ‘Before We Knew It: An Empirical Study of Zero-Day Attacks In The Real World’

– Next lecture: Machine learning techniques

• Deadline reminder

– Pilot project reports due on Wednesday

– Post report on Piazza
