New Lecture 2 – Theoretical Underpinnings of MapReduce · 2016. 4. 18. · BRUTE-FORCE MAP/REDUCE...
Transcript of New Lecture 2 – Theoretical Underpinnings of MapReduce · 2016. 4. 18. · BRUTE-FORCE MAP/REDUCE...
![Page 1: New Lecture 2 – Theoretical Underpinnings of MapReduce · 2016. 4. 18. · BRUTE-FORCE MAP/REDUCE INSTEAD OF PRODUCTION SYSTEM 1. Less time coding and debugging 2. Easy to include](https://reader036.fdocuments.in/reader036/viewer/2022081407/605292013518e829e805af66/html5/thumbnails/1.jpg)
BIG DATA
Djoerd Hiemstrahttp://www.cs.utwente.nl/~hiemstra
Big Data & Education, KIVI, 13 April 2016
![Page 2: New Lecture 2 – Theoretical Underpinnings of MapReduce · 2016. 4. 18. · BRUTE-FORCE MAP/REDUCE INSTEAD OF PRODUCTION SYSTEM 1. Less time coding and debugging 2. Easy to include](https://reader036.fdocuments.in/reader036/viewer/2022081407/605292013518e829e805af66/html5/thumbnails/2.jpg)
2
WHY BIG DATA?
![Page 3: New Lecture 2 – Theoretical Underpinnings of MapReduce · 2016. 4. 18. · BRUTE-FORCE MAP/REDUCE INSTEAD OF PRODUCTION SYSTEM 1. Less time coding and debugging 2. Easy to include](https://reader036.fdocuments.in/reader036/viewer/2022081407/605292013518e829e805af66/html5/thumbnails/3.jpg)
3
Source: http://en.wikipedia.org/wiki/Mount_Everest
![Page 4: New Lecture 2 – Theoretical Underpinnings of MapReduce · 2016. 4. 18. · BRUTE-FORCE MAP/REDUCE INSTEAD OF PRODUCTION SYSTEM 1. Less time coding and debugging 2. Easy to include](https://reader036.fdocuments.in/reader036/viewer/2022081407/605292013518e829e805af66/html5/thumbnails/4.jpg)
4
19 May 2012:234 people
reach the top
![Page 5: New Lecture 2 – Theoretical Underpinnings of MapReduce · 2016. 4. 18. · BRUTE-FORCE MAP/REDUCE INSTEAD OF PRODUCTION SYSTEM 1. Less time coding and debugging 2. Easy to include](https://reader036.fdocuments.in/reader036/viewer/2022081407/605292013518e829e805af66/html5/thumbnails/5.jpg)
5
James Hays and Alexei Efros. Scene Completion Using Millions of Photographs. ACM Transactions on Graphics (SIGGRAPH), 26(3), 2007.
![Page 6: New Lecture 2 – Theoretical Underpinnings of MapReduce · 2016. 4. 18. · BRUTE-FORCE MAP/REDUCE INSTEAD OF PRODUCTION SYSTEM 1. Less time coding and debugging 2. Easy to include](https://reader036.fdocuments.in/reader036/viewer/2022081407/605292013518e829e805af66/html5/thumbnails/6.jpg)
6
James Hays and Alexei Efros. Scene Completion Using Millions of Photographs. ACM Transactions on Graphics (SIGGRAPH), 26(3), 2007.
![Page 7: New Lecture 2 – Theoretical Underpinnings of MapReduce · 2016. 4. 18. · BRUTE-FORCE MAP/REDUCE INSTEAD OF PRODUCTION SYSTEM 1. Less time coding and debugging 2. Easy to include](https://reader036.fdocuments.in/reader036/viewer/2022081407/605292013518e829e805af66/html5/thumbnails/7.jpg)
7
James Hays and Alexei Efros. Scene Completion Using Millions of Photographs. ACM Transactions on Graphics (SIGGRAPH), 26(3), 2007.
![Page 8: New Lecture 2 – Theoretical Underpinnings of MapReduce · 2016. 4. 18. · BRUTE-FORCE MAP/REDUCE INSTEAD OF PRODUCTION SYSTEM 1. Less time coding and debugging 2. Easy to include](https://reader036.fdocuments.in/reader036/viewer/2022081407/605292013518e829e805af66/html5/thumbnails/8.jpg)
8
James Hays and Alexei Efros. Scene Completion Using Millions of Photographs. ACM Transactions on Graphics (SIGGRAPH), 26(3), 2007.
![Page 9: New Lecture 2 – Theoretical Underpinnings of MapReduce · 2016. 4. 18. · BRUTE-FORCE MAP/REDUCE INSTEAD OF PRODUCTION SYSTEM 1. Less time coding and debugging 2. Easy to include](https://reader036.fdocuments.in/reader036/viewer/2022081407/605292013518e829e805af66/html5/thumbnails/9.jpg)
9
THE PROGRAM VS.
THE DATA...
![Page 10: New Lecture 2 – Theoretical Underpinnings of MapReduce · 2016. 4. 18. · BRUTE-FORCE MAP/REDUCE INSTEAD OF PRODUCTION SYSTEM 1. Less time coding and debugging 2. Easy to include](https://reader036.fdocuments.in/reader036/viewer/2022081407/605292013518e829e805af66/html5/thumbnails/10.jpg)
10
Alon Halevy, Peter Norvig, and Fernando Pereira. The unreasonableeffectiveness of data. IEEE Intelligent Systems, 24(2), 2009
![Page 11: New Lecture 2 – Theoretical Underpinnings of MapReduce · 2016. 4. 18. · BRUTE-FORCE MAP/REDUCE INSTEAD OF PRODUCTION SYSTEM 1. Less time coding and debugging 2. Easy to include](https://reader036.fdocuments.in/reader036/viewer/2022081407/605292013518e829e805af66/html5/thumbnails/11.jpg)
11
SOME HISTORY
![Page 12: New Lecture 2 – Theoretical Underpinnings of MapReduce · 2016. 4. 18. · BRUTE-FORCE MAP/REDUCE INSTEAD OF PRODUCTION SYSTEM 1. Less time coding and debugging 2. Easy to include](https://reader036.fdocuments.in/reader036/viewer/2022081407/605292013518e829e805af66/html5/thumbnails/12.jpg)
12
“Every time I fire a linguist, the performance of the speech recognizer goes up”
(Frederick Jelinek, 1932 – 2010)
1980'S SPEECH RECOGNITION
![Page 13: New Lecture 2 – Theoretical Underpinnings of MapReduce · 2016. 4. 18. · BRUTE-FORCE MAP/REDUCE INSTEAD OF PRODUCTION SYSTEM 1. Less time coding and debugging 2. Easy to include](https://reader036.fdocuments.in/reader036/viewer/2022081407/605292013518e829e805af66/html5/thumbnails/13.jpg)
13http://www.lrec-conf.org/lrec2004/doc/jelinek.pdf
![Page 14: New Lecture 2 – Theoretical Underpinnings of MapReduce · 2016. 4. 18. · BRUTE-FORCE MAP/REDUCE INSTEAD OF PRODUCTION SYSTEM 1. Less time coding and debugging 2. Easy to include](https://reader036.fdocuments.in/reader036/viewer/2022081407/605292013518e829e805af66/html5/thumbnails/14.jpg)
14
“There is no data like more data” (Robert L. Mercer, IBM in 1985)
1990's: ALL NLP
![Page 15: New Lecture 2 – Theoretical Underpinnings of MapReduce · 2016. 4. 18. · BRUTE-FORCE MAP/REDUCE INSTEAD OF PRODUCTION SYSTEM 1. Less time coding and debugging 2. Easy to include](https://reader036.fdocuments.in/reader036/viewer/2022081407/605292013518e829e805af66/html5/thumbnails/15.jpg)
15
DATADRIVEN MACHINE TRANSLATION Datadriven machine translation
![Page 16: New Lecture 2 – Theoretical Underpinnings of MapReduce · 2016. 4. 18. · BRUTE-FORCE MAP/REDUCE INSTEAD OF PRODUCTION SYSTEM 1. Less time coding and debugging 2. Easy to include](https://reader036.fdocuments.in/reader036/viewer/2022081407/605292013518e829e805af66/html5/thumbnails/16.jpg)
16
MY MASTER THESIS
Datadriven machine translation
Djoerd Hiemstra, Using statistical methods to create a bilingual dictionary, Master’s Thesis, University of Twente, 1996
![Page 17: New Lecture 2 – Theoretical Underpinnings of MapReduce · 2016. 4. 18. · BRUTE-FORCE MAP/REDUCE INSTEAD OF PRODUCTION SYSTEM 1. Less time coding and debugging 2. Easy to include](https://reader036.fdocuments.in/reader036/viewer/2022081407/605292013518e829e805af66/html5/thumbnails/17.jpg)
17
“More data is more important than better algorithms”
(Michele Banko & Eric Brill, Microsoft)
2000's DATA-DRIVEN METHODS IN PRODUCTION
![Page 18: New Lecture 2 – Theoretical Underpinnings of MapReduce · 2016. 4. 18. · BRUTE-FORCE MAP/REDUCE INSTEAD OF PRODUCTION SYSTEM 1. Less time coding and debugging 2. Easy to include](https://reader036.fdocuments.in/reader036/viewer/2022081407/605292013518e829e805af66/html5/thumbnails/18.jpg)
18
Michele Banko and Eric Brill. Scaling to very very large corpora for natural language disambiguation. In Proceedings of the 39th Annual Meeting of the ACL, 2001.
![Page 19: New Lecture 2 – Theoretical Underpinnings of MapReduce · 2016. 4. 18. · BRUTE-FORCE MAP/REDUCE INSTEAD OF PRODUCTION SYSTEM 1. Less time coding and debugging 2. Easy to include](https://reader036.fdocuments.in/reader036/viewer/2022081407/605292013518e829e805af66/html5/thumbnails/19.jpg)
19
Thorsten Brants, Ashok Popat, Peng Xu, Franz Och, Jeffrey Dean. Large Language Models in Machine Translation. In: Proceedings of EMNLP, 2007
![Page 20: New Lecture 2 – Theoretical Underpinnings of MapReduce · 2016. 4. 18. · BRUTE-FORCE MAP/REDUCE INSTEAD OF PRODUCTION SYSTEM 1. Less time coding and debugging 2. Easy to include](https://reader036.fdocuments.in/reader036/viewer/2022081407/605292013518e829e805af66/html5/thumbnails/20.jpg)
20
How toget here
if you are (not) Google?
Thorsten Brants, Ashok Popat, Peng Xu, Franz Och, Jeffrey Dean. Large Language Models in Machine Translation. In: Proceedings of EMNLP, 2007
![Page 21: New Lecture 2 – Theoretical Underpinnings of MapReduce · 2016. 4. 18. · BRUTE-FORCE MAP/REDUCE INSTEAD OF PRODUCTION SYSTEM 1. Less time coding and debugging 2. Easy to include](https://reader036.fdocuments.in/reader036/viewer/2022081407/605292013518e829e805af66/html5/thumbnails/21.jpg)
21
HOW? TEACH A COURSE(and get a DIY datacenter)
![Page 22: New Lecture 2 – Theoretical Underpinnings of MapReduce · 2016. 4. 18. · BRUTE-FORCE MAP/REDUCE INSTEAD OF PRODUCTION SYSTEM 1. Less time coding and debugging 2. Easy to include](https://reader036.fdocuments.in/reader036/viewer/2022081407/605292013518e829e805af66/html5/thumbnails/22.jpg)
22
COURSE: MANAGING BIG DATA
M.Sc. Course Computer Science First edition: Nov. 2009 – Feb. 2010 with Maarten Fokkinga and Robin Aly
![Page 23: New Lecture 2 – Theoretical Underpinnings of MapReduce · 2016. 4. 18. · BRUTE-FORCE MAP/REDUCE INSTEAD OF PRODUCTION SYSTEM 1. Less time coding and debugging 2. Easy to include](https://reader036.fdocuments.in/reader036/viewer/2022081407/605292013518e829e805af66/html5/thumbnails/23.jpg)
23
COURSE: MANAGING BIG DATA
File systems (Google File System) New Storage model (BigTable) Programming paradigm (MapReduce) Programming languages (Haskell, Java,...) New Query languages (Yahoo Pig,...) Queuing and streams (Kafka, Twitter Storm, … )
![Page 24: New Lecture 2 – Theoretical Underpinnings of MapReduce · 2016. 4. 18. · BRUTE-FORCE MAP/REDUCE INSTEAD OF PRODUCTION SYSTEM 1. Less time coding and debugging 2. Easy to include](https://reader036.fdocuments.in/reader036/viewer/2022081407/605292013518e829e805af66/html5/thumbnails/24.jpg)
24
![Page 25: New Lecture 2 – Theoretical Underpinnings of MapReduce · 2016. 4. 18. · BRUTE-FORCE MAP/REDUCE INSTEAD OF PRODUCTION SYSTEM 1. Less time coding and debugging 2. Easy to include](https://reader036.fdocuments.in/reader036/viewer/2022081407/605292013518e829e805af66/html5/thumbnails/25.jpg)
25
![Page 26: New Lecture 2 – Theoretical Underpinnings of MapReduce · 2016. 4. 18. · BRUTE-FORCE MAP/REDUCE INSTEAD OF PRODUCTION SYSTEM 1. Less time coding and debugging 2. Easy to include](https://reader036.fdocuments.in/reader036/viewer/2022081407/605292013518e829e805af66/html5/thumbnails/26.jpg)
26
CASE STUDY: CLUEWEB09
Web crawl of 1 billion pages (25 TB) crawled in Jan. – Feb. 2009 using only the English pages (0.5 billion)
![Page 27: New Lecture 2 – Theoretical Underpinnings of MapReduce · 2016. 4. 18. · BRUTE-FORCE MAP/REDUCE INSTEAD OF PRODUCTION SYSTEM 1. Less time coding and debugging 2. Easy to include](https://reader036.fdocuments.in/reader036/viewer/2022081407/605292013518e829e805af66/html5/thumbnails/27.jpg)
27
CASE STUDY: CLUEWEB09
Rebuild Google's experimental infrastructure Jeffrey Dean. Challenges in building large-scale
information retrieval systems. In WSDM 2009 Using Hadoop
![Page 28: New Lecture 2 – Theoretical Underpinnings of MapReduce · 2016. 4. 18. · BRUTE-FORCE MAP/REDUCE INSTEAD OF PRODUCTION SYSTEM 1. Less time coding and debugging 2. Easy to include](https://reader036.fdocuments.in/reader036/viewer/2022081407/605292013518e829e805af66/html5/thumbnails/28.jpg)
28
SEQUENTIAL SEARCH
50 test queries take less than 30 minutes on Anchor Text representation
Language model, no smoothing, length prior Expected Precision at 5, 10 and 20 documents
(MTC method): 0.42 0.39 0.35
(0.44 0.42 0.38 U. Amsterdam)(0.43 0.38 0.38 Microsoft Asia)(0.42 0.40 0.39 Microsoft UK)
![Page 29: New Lecture 2 – Theoretical Underpinnings of MapReduce · 2016. 4. 18. · BRUTE-FORCE MAP/REDUCE INSTEAD OF PRODUCTION SYSTEM 1. Less time coding and debugging 2. Easy to include](https://reader036.fdocuments.in/reader036/viewer/2022081407/605292013518e829e805af66/html5/thumbnails/29.jpg)
29
EXPERIMENTAL RESULTS
![Page 30: New Lecture 2 – Theoretical Underpinnings of MapReduce · 2016. 4. 18. · BRUTE-FORCE MAP/REDUCE INSTEAD OF PRODUCTION SYSTEM 1. Less time coding and debugging 2. Easy to include](https://reader036.fdocuments.in/reader036/viewer/2022081407/605292013518e829e805af66/html5/thumbnails/30.jpg)
30
BRUTE-FORCE MAP/REDUCE INSTEAD OF PRODUCTION SYSTEM
1. Less time coding and debugging
2. Easy to include new information that is not in the engine’s standard inverted index
3. Oversee all the code in the experiment
4. Large-scale experiments in reasonable time
![Page 31: New Lecture 2 – Theoretical Underpinnings of MapReduce · 2016. 4. 18. · BRUTE-FORCE MAP/REDUCE INSTEAD OF PRODUCTION SYSTEM 1. Less time coding and debugging 2. Easy to include](https://reader036.fdocuments.in/reader036/viewer/2022081407/605292013518e829e805af66/html5/thumbnails/31.jpg)
31
BIG DATA FOR RESEARCHERS
Minimize coding a proof-of-concept Faster turnaround of the experimental cycle:
= more experiments on more data = more improvement of search quality = better system!
![Page 32: New Lecture 2 – Theoretical Underpinnings of MapReduce · 2016. 4. 18. · BRUTE-FORCE MAP/REDUCE INSTEAD OF PRODUCTION SYSTEM 1. Less time coding and debugging 2. Easy to include](https://reader036.fdocuments.in/reader036/viewer/2022081407/605292013518e829e805af66/html5/thumbnails/32.jpg)
32
MORE INFO
Djoerd Hiemstra and Claudia Hauff. MapReduce for information retrieval evaluation. In: CLEF, Multilingual and Multimodal Information Access Evaluation, pages 64-69, 2010
Software open source at http://mirex.sourceforge.net Wait: where
about is this on Google's
graph?
![Page 33: New Lecture 2 – Theoretical Underpinnings of MapReduce · 2016. 4. 18. · BRUTE-FORCE MAP/REDUCE INSTEAD OF PRODUCTION SYSTEM 1. Less time coding and debugging 2. Easy to include](https://reader036.fdocuments.in/reader036/viewer/2022081407/605292013518e829e805af66/html5/thumbnails/33.jpg)
33
We were here in2010!
![Page 34: New Lecture 2 – Theoretical Underpinnings of MapReduce · 2016. 4. 18. · BRUTE-FORCE MAP/REDUCE INSTEAD OF PRODUCTION SYSTEM 1. Less time coding and debugging 2. Easy to include](https://reader036.fdocuments.in/reader036/viewer/2022081407/605292013518e829e805af66/html5/thumbnails/34.jpg)
34
YOU CAN DO WHAT GOOGLE DOES!
(in 3 to 4 years)
![Page 35: New Lecture 2 – Theoretical Underpinnings of MapReduce · 2016. 4. 18. · BRUTE-FORCE MAP/REDUCE INSTEAD OF PRODUCTION SYSTEM 1. Less time coding and debugging 2. Easy to include](https://reader036.fdocuments.in/reader036/viewer/2022081407/605292013518e829e805af66/html5/thumbnails/35.jpg)
35
ACKNOWLEDGEMENTS
Yahoo Research
Netherlands Organization for Scientific Research (NWO).
COMMIT, a public-private ICT research community