Big Data Processing, 2014/15
Transcript of Lecture 1: Introduction
Claudia Hauff (Web Information Systems)
Course organisation

• 2 lectures a week (Tuesday, Thursday) for 7 weeks
• Content:
  • Data streaming algorithms
  • MapReduce (bulk of the course)
  • Approaches to iterative big-data problems
• One “free” lecture in January (suggestions?)
  • e.g. big data visualisations, industry talk, etc.
• Course style: interactive when possible
Course content

• Introduction
• Data streams 1 & 2
• The MapReduce paradigm
• Looking behind the scenes of MapReduce: HDFS & scheduling
• Algorithm design for MapReduce
• A high-level language for MapReduce: Pig 1 & 2
• MapReduce is not a database, but HBase nearly is
• Let’s iterate a bit: graph algorithms & Giraph
• How does all of this work together? ZooKeeper/YARN

Small changes are possible.
Assignments

• A mix of pen-and-paper and programming
• Individual work
• 5 lab sessions on Hadoop & Co. between Nov. 24 and Jan. 5, on Mondays 8.45am-12.45pm
  • Lab attendance is not compulsory
• Two lab assistants will help you with your lab work (Kilian Grashoff and Robert Carosi)
• Assignments are handed out on Thursdays and due the following Thursday (6 in total)
Grading & exams

• Course grade: 25% assignments, 75% final exam
• Weekly quizzes: up to 1 additional grade point
  • +0.5 if you answer >50% correctly
  • +1 if you answer >75% correctly
• You can pass without handing in the assignments (not recommended!)
• Final exam: January 28, 2-5pm (MC & open questions)
• Resit: April 15, 2-5pm (MC & open questions)

Assignments and quizzes? Yes!
Questions?

• Questions are always welcome
• Contact preferably via mail: [email protected]
• Questions and answers may be posted on Blackboard if they are relevant for others as well
• If the pace is too slow/fast, please let me know
Reading material (lectures)

• Recommended materials are either book chapters or papers, usually available through the TU Delft campus network
• The course covers a recent topic; no single book covers all angles
• MapReduce: Data-Intensive Text Processing with MapReduce by Lin and Dyer, Morgan & Claypool Publishers, 2010. http://lintool.github.io/MapReduceAlgorithms/
Reading material (assignments)

• A book on Java if you are not yet comfortable with it
• Hadoop: The Definitive Guide by Tom White, O’Reilly Media, 2012 (3rd edition)
Big Data
Course objectives

• Explain the ideas behind the “big data buzz”
• Understand and describe the three different paradigms covered in class
• Code productively in one of the most important big data software frameworks to date: Hadoop (and tools building on it)
• Transform big data problems into sensible algorithmic solutions
Today’s learning objectives

• Explain and recognise the V’s of big data in use case scenarios
• Explain the main differences between data streaming and MapReduce algorithms
• Identify the correct approach (streaming vs. MapReduce) to be taken in an application setting
What is “big data”?

• A buzzword with fuzzy boundaries:

“Massive amounts of diverse, unstructured data produced by high-performance applications.”

“Data too large & complex to be effectively handled by standard database technologies currently found in most organisations.”

• Requires novel infrastructure to support storage and processing
Large-scale computing is not new

• Weather forecasting has been a long-standing scientific challenge
  • Supercomputers were already used in the 1970s
  • Equation crunching

(Image source: ECMWF)
Big data processing

• So-called big data technologies are about discovering patterns (in semi-/unstructured data)
• The main focus is on how to make computations on big data feasible, i.e. without a supercomputer
  • We use cluster(s) of commodity hardware!
• Next quarter: the Data Mining course (which will probably use Hadoop) focuses on how to discover those patterns
Just an academic exercise?

• Cloud computing: “anything running inside a browser that gathers and stores user-generated content”
• Utility computing
  • Computing as a metered service
  • A “cloud user” buys any amount of computing power from a “cloud provider” (pay-per-use)
  • Virtual machine instances
  • IaaS: infrastructure as a service
  • Amazon Web Services is the dominant provider
Just an academic exercise? No!

(Slide shows AWS EC2 pricing.) You can run your own big data experiments!
Progress often driven by industry

• Development of big data standards & (open source) software is commonly driven by companies such as Google, Facebook, Twitter, Yahoo!, …
• Why do they care about big data? More data → more knowledge → more money
• More knowledge leads to
  • better customer engagement
  • fraud prevention
  • new products
Big data analytics: IBM pitch

https://www.youtube.com/watch?v=1RYKgj-QK4I
A concrete example: big data vs. small data

Scaling to Very Very Large Corpora for Natural Language Disambiguation. M. Banko and E. Brill, 2001.

Task: confusion set disambiguation (then vs. than, to vs. two vs. too)

(Figure: quality, from poor to perfect, plotted against the amount of training data, from little to a lot.) With little training data simple algorithms fail and a complex algorithm wins, but with a lot of data …
The 3 V’s

• Volume: large amounts of data
• Variety: data comes in many different forms from diverse sources
• Velocity: the content is changing quickly
The 5 V’s

• Volume: large amounts of data
• Variety: data comes in many different forms from diverse sources
• Velocity: the content is changing quickly
• Value: data alone is not enough; how can value be derived from it?
• Veracity: can we trust the data? How accurate is it?
The 7 V’s

• Volume: large amounts of data
• Variety: data comes in many different forms from diverse sources
• Velocity: the content is changing quickly
• Value: data alone is not enough; how can value be derived from it?
• Veracity: can we trust the data? How accurate is it?
• Validity: ensure that the interpreted data is sound
• Visibility: data from diverse sources needs to be stitched together

The 3 or 5 V’s are most commonly used.
The 5 V’s, revisited

Question: how do these attributes apply to the use case of Flickr?
Instantiations
Terminology

• Batch processing: running a series of computer programs without human intervention
• Near real-time: a brief delay between the data becoming available and it being processed
• Real-time: guaranteed response times between the data becoming available and it being processed
Terminology contd.

• Structured data (well-defined fields): the standard in the past
• Semi-structured data
• Unstructured data (by humans, for humans): most common today
Unstructured text

• To get value out of unstructured text we need to impose structure automatically
  • Parse the text
  • Extract meaning from it (can be easy or difficult)
• The amount of data we create is more than doubling every two years; most new data is unstructured or at most semi-structured
• Text is not everything: images, video, audio, etc.
Extracting meaning

(Slide shows examples ranging from easy to difficult.)
IBM Watson: How it works

https://www.youtube.com/watch?v=_Xcmh1LQB9I

8 minutes worth your time, even if not in class. Contains a nice piece about the use of unstructured text.
Examples of Volume & Velocity: Twitter

• >500 million tweets a day
  • On average >5,700 tweets a second
  • Peaks of >100,000 tweets a second: Super Bowl, US elections, New Year’s Eve, the Football World Cup (672M tweets in total for #WorldCup)
• Messages are instantly accessible for search
• Messages are used in post-hoc analyses to gather insights
#WorldCup visualisation: http://bl.ocks.org/anonymous/raw/0c64880b3a791dffb6e4/
Examples of Volume: the Large Hadron Collider @ CERN

• The world’s largest particle accelerator, buried in a tunnel with a 27km circumference
• Enormous numbers of sensors register the passage of particles
• 40 million events/second (1MB raw data per event)
• Generates 15 Petabytes a year (15-25 million GB)
Examples of Velocity: targeted advertising on the Web

• US revenues in 2013: ~$40 billion
• Advertisers usually pay per click
• For each search request, search engines decide
  • whether to show an ad
  • which ad to show
• Users are willing to wait at best 2 seconds for their search results
• Feedback loop via user clicks, user searches, mouse movements, etc.
More interactions/data on the Web

• YouTube: 4 billion views a day; one hour of video uploaded every second
• Facebook: 483 million daily active users (Dec. 2011); 300 Petabytes of data
• Google: >1 billion searches per day (March 2011); processed 100 Terabytes of data per day in 2004 and 20 Petabytes per day in 2008
• Internet Archive: contains 2 Petabytes of data, grows by 20 Terabytes per month (2011)
Use case: movie recommendations

(Slide shows a user-movie rating matrix for Bob, Alice, Tom, Jane, Frank and Joe, with most entries unknown. Bob likes X-MEN; Jane dislikes X-MEN. Should we suggest X-MEN to Frank?)

Question: how can we come up with a suggestion for Frank? Think about data-intensive and non-intensive approaches.
Use case contd.

• Ignore the data, use experts instead (movie reviewers); assumes no large subscriber/reviewer divergence
• Use all the data but ignore individual preferences; assumes that most users are close to the average
• Lump people into preference groups based on shared likes/dislikes; compute a group-based average score per movie
• Focus computational effort on difficult movies (some are universally liked/disliked)

A whole research field is concerned with this question: recommender systems.
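The “preference groups” idea above can be made concrete in a few lines. Below is a minimal Python sketch with made-up users and ratings; the `predict` helper is hypothetical and only illustrates the averaging idea (real recommender systems use far more refined neighbourhood and matrix-factorisation models):

```python
# Toy ratings: user -> {movie: 1 (like) or -1 (dislike)}; illustrative data only
ratings = {
    "Bob":   {"X-MEN": 1, "Alien": 1},
    "Jane":  {"X-MEN": -1, "Amelie": 1},
    "Tom":   {"Alien": 1, "X-MEN": 1},
    "Alice": {"Amelie": 1, "X-MEN": -1},
    "Frank": {"Alien": 1},               # Frank's X-MEN rating is unknown
}

def predict(user, movie):
    """Average the movie's ratings over users who share a like with `user`."""
    my_likes = {m for m, r in ratings[user].items() if r == 1}
    peers = [u for u in ratings
             if u != user and my_likes & {m for m, r in ratings[u].items() if r == 1}]
    votes = [ratings[p][movie] for p in peers if movie in ratings[p]]
    return sum(votes) / len(votes) if votes else 0.0

print(predict("Frank", "X-MEN"))  # -> 1.0: both of Frank's peers like X-MEN
```

A positive score suggests recommending the movie, a negative one advises against it, and 0.0 means the peer group gives no signal.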
Use case contd.

• Netflix Prize: an open competition for the best collaborative filtering algorithm to predict user ratings for films, based on previous ratings (>100 million ratings by 0.5 million users for ~17,000 movies)
• The first competitor to improve over Netflix’s baseline by 10% receives $1,000,000
• The competition started in 2006; the prize money was paid out in 2009 (the winner was 20 minutes quicker than the runner-up, with the same performance!)

Many research teams competed: innovation driven by industry again!
Example of Variety: restaurant locator

• Task: given a person’s location, list the top five restaurants in the neighbourhood

Question: what kind of data do you need for this task?

• Required data:
  • A world map
  • A list of all restaurants in the world (opening hours, GPS coordinates, menu, special offers)
  • Reviews/ratings
  • Optional: social media stream(s)
• The data is continuously changing (restaurants close, new ones open, data formats change, etc.)
Society can benefit too, not just companies

• Accurate predictions of natural disasters and diseases
• Better disaster responses and recovery
  • Timely & effective decisions
  • Provide resources where they are needed the most
• Complete disease/genomics databases to enable biomedical discoveries
• Accurate models to support forecasting of ecosystem developments
Idea: earthquake warnings

• Social sensors: users (humans) on Twitter, Facebook, Instagram, i.e. portals with real-time posting abilities

Earthquake occurs → people in the area tweet about it → warn people further away

• Challenges: how to detect whether a tweet is about an actual earthquake, which earthquake it is about, and where its centre is

Question: do you think this is possible? If so, what are the challenges?
Idea: earthquake warnings contd.

• Goal: the warning should reach people earlier than the seismic waves do
• Travel times of seismic waves: 3-7 km/s; arrival time of a wave 100km away: ~20 seconds
• Performance of an existing system (Sakaki et al., 2010): the Twitter-based earthquake detection warned people faster than the traditional warning system

It is possible!
A brief introduction to Streaming & MapReduce
A brief introduction to Streaming
Data streaming scenario

• Continuous and rapid input of data (a “stream of data”)
• Limited memory to store the data: less than linear in the input size
• Limited time to process each data item; sequential access
• Algorithms have one (or very few) passes over the data
• Can be approached from a practical or mathematical point of view: metric embeddings, pseudo-random computations, sparse approximation theory, …

We go for the practical setup!
Data streaming example

What is the missing number? Stream: 3 6 5 4 1 8 2
A stream of n numbers, a permutation of 1 to n with one number missing; we are allowed one pass over the data.

Solution 1: memorise all numbers seen so far; memory requirement: n bits (impractical for large n)
Solution 2 (closed form): compute the sum of all numbers from 1 to n, n(n+1)/2, then subtract every number seen; what remains is the missing number. Memory: 2 log n bits.
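Solution 2 in code, as a minimal Python sketch (the course labs use Hadoop/Java; Python is used here only to make the idea concrete):

```python
def missing_number(stream, n):
    """One pass, O(log n) memory: start from the closed-form sum
    n(n+1)/2 and subtract every number seen; the remainder is the
    missing number."""
    total = n * (n + 1) // 2
    for x in stream:
        total -= x
    return total

print(missing_number([3, 6, 5, 4, 1, 8, 2], 8))  # -> 7
```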
Data streaming contd.

What is the average? What is the median? Stream: 3 4 9 4 9 8 2
We are allowed one pass over the data and can only store 3 numbers.

• Average: can be computed by keeping track of two numbers (the running sum and the count of numbers seen)
• Median: sample data points, but how?
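The average really does need only two counters. For the median, one standard answer to “sample, but how?” is reservoir sampling, which keeps a uniform random sample of fixed size in a single pass; the median of the sample then approximates the stream’s median. A Python sketch (reservoir sampling is one common technique, not necessarily the one treated later in the course):

```python
import random

def running_average(stream):
    """Average with O(1) state: just the sum and the count."""
    total = count = 0
    for x in stream:
        total += x
        count += 1
    return total / count

def reservoir_sample(stream, k=3):
    """One pass, k slots of memory: every item ends up in the
    sample with equal probability (reservoir sampling)."""
    sample = []
    for i, x in enumerate(stream):
        if i < k:
            sample.append(x)
        else:
            j = random.randrange(i + 1)   # keep x with probability k/(i+1)
            if j < k:
                sample[j] = x
    return sample

stream = [3, 4, 9, 4, 9, 8, 2]
print(running_average(stream))           # -> 39/7 ≈ 5.57
print(sorted(reservoir_sample(stream)))  # 3 random items of the stream
```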
Data streaming contd.

• Typically, simple functions of the stream are computed and used as input to other algorithms:
  • Median
  • Number of distinct elements
  • Longest increasing sequence
  • …
• Closed-form solutions are rare
• Common approaches compute approximations of the true value: sampling, hashing
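To illustrate the hashing route, here is a deliberately crude, Flajolet-Martin-style Python sketch for the number of distinct elements: hash every item and remember the maximum number of trailing zero bits observed; 2 to that power is a rough estimate of the distinct count. Practical estimators (e.g. HyperLogLog) average many such observations; this single-hash variant only shows the principle.

```python
import hashlib

def approx_distinct(stream):
    """Estimate the number of distinct items in one pass with O(1) memory:
    the more distinct items, the more trailing zeros the 'luckiest'
    hash value is likely to have."""
    max_zeros = 0
    for x in stream:
        h = int(hashlib.md5(str(x).encode()).hexdigest(), 16)
        zeros = (h & -h).bit_length() - 1 if h else 128  # trailing zero bits
        max_zeros = max(max_zeros, zeros)
    return 2 ** max_zeros
```

Duplicates cannot inflate the estimate, because equal items hash to the same value; the memory cost stays constant no matter how long the stream is.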
A brief introduction to MapReduce
MapReduce is an industry standard

QCon 2013 (San Francisco): “QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference …”

Hadoop is the open-source implementation of the MapReduce framework.
Industry is moving fast

QCon 2014 (San Francisco)
MapReduce

• Designed for batch processing over large data sets
• No limits on the number of passes, memory, or time
• A programming model for distributed computations, inspired by the functional programming paradigm
MapReduce example

WordCount: given an input text, determine the frequency of each word. The “Hello World” of the MapReduce realm.

Input text: The dog walks around the house. The dog is in the house.

(Diagram: the input text is split into parts (text 1, text 2, text 3), each fed to a Mapper; the Mappers emit (word, 1) pairs such as (dog,1), (dog,1), (walks,1), (house,1), (house,1), (in,1), (the,1), …; the pairs are sorted, and a Reducer adds them up per word: dog: 2, walks: 1, house: 2, …)
MapReduce example contd.

(Same WordCount diagram as before.) We implement the Mapper and the Reducer. Hadoop (and other tools) are responsible for the “rest”.
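Hadoop Mappers and Reducers are written in Java; purely to make the WordCount data flow tangible (map, sort/group, reduce), here is a single-process Python imitation in which the framework’s “rest” shrinks to a sort:

```python
from itertools import groupby

def mapper(text):
    """Emit (word, 1) for every word, like the WordCount Mapper."""
    for word in text.lower().replace(".", "").split():
        yield (word, 1)

def reducer(word, counts):
    """Sum the counts for one word, like the WordCount Reducer."""
    return (word, sum(counts))

text = "The dog walks around the house. The dog is in the house."

# The framework's "rest": sorting groups equal keys together (shuffle/sort)
pairs = sorted(mapper(text))
result = dict(reducer(w, (c for _, c in group))
              for w, group in groupby(pairs, key=lambda kv: kv[0]))
print(result["dog"], result["house"], result["the"])  # -> 2 2 4
```

In real Hadoop the Mappers and Reducers run on different machines and the sort/shuffle moves data between them; the logic per key is the same.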
Summary

• What are the characteristics of “big data”?
• Example use cases of big data
• Hopefully a convincing argument for why you should care
• A brief introduction to data streams and MapReduce
Reading material

Required reading: none.

Recommended reading: Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information by Jules Berman. Chapters 1, 14 and 15.
THE END