MapReduce - uni-bielefeld.dejkrueger/documents/... · … and praxis MapReduce using Hadoop Hadoop...
MapReduce using Hadoop
● MapReduce in theory ...
– Introduction
– Implementation
– Scalable distributed FS
● … and practice
● Discussion
MapReduce: Introduction
● Basis: the functional building blocks map and reduce
● Differs from the familiar functional map / reduce functions used e.g. in Haskell [2]
● Google's implementation: a framework named 'MapReduce' (since 2003)
● Many other implementations exist, including open-source ones
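For comparison, the plain functional map and reduce mentioned above can be sketched as follows. Python is used here only for illustration (the slide itself refers to Haskell), and the word list is made up. Note that there are no keys and no grouping step in this version, which is one visible difference from the MapReduce framework.

```python
from functools import reduce

words = ["map", "reduce", "hadoop"]  # illustrative input, not from the slides

# Functional map: apply a function to every element, one output per input.
lengths = list(map(len, words))

# Functional reduce (fold): combine all elements into one single value.
total = reduce(lambda acc, n: acc + n, lengths, 0)

print(lengths)  # [3, 6, 6]
print(total)    # 15
```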
MapReduce: Framework Properties
● Automatic parallelization, distribution and scheduling of jobs
● Fault-tolerant
● Automatic load balancing
● Optimized network and data transfer
● Monitoring
MapReduce: Framework Applications
● Indexing of large data sets (e.g. for searching)
● Distributed search for patterns in large data sets
● Sorting large data sets
● Evaluation of log data (web)
● Extracting (grepping) data of interest from documents (e.g. web pages)
● Graph generation (user profiling, web page linking)
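As one illustration of these applications, a distributed grep could be phrased as a map that emits matching lines and a reduce that passes them through unchanged. The sketch below runs in a single process; the document names, the pattern and the helper names `grep_map` and `grep_reduce` are hypothetical.

```python
import re
from collections import defaultdict

def grep_map(doc_id, text, pattern):
    """Map: emit (doc_id, line) for every line that matches the pattern."""
    return [(doc_id, line) for line in text.splitlines() if re.search(pattern, line)]

def grep_reduce(doc_id, lines):
    """Reduce: identity, the matching lines are passed through unchanged."""
    return lines

# Hypothetical input documents.
documents = {
    "a.txt": "hadoop is open source\nnothing here",
    "b.txt": "no match\nmapreduce on hadoop",
}

# Group the intermediate pairs by key, as the framework would do.
grouped = defaultdict(list)
for doc_id, text in documents.items():
    for key, line in grep_map(doc_id, text, r"hadoop"):
        grouped[key].append(line)

matches = {doc_id: grep_reduce(doc_id, lines) for doc_id, lines in grouped.items()}
print(matches)
```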
MapReduce: Programming Model
Map :: (key1, a) → [(key2, b)]
Combine :: [(key2, b)] → [(key2, [b])] (done by the MapReduce library)
Reduce :: key2 → [b] → b
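The three signatures can be sketched in Python as follows. This is a minimal single-process sketch, not the framework itself; the function names `combine` and `map_reduce` and the example data are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple, TypeVar

K1 = TypeVar("K1")
A = TypeVar("A")
K2 = TypeVar("K2")
B = TypeVar("B")

def combine(pairs: Iterable[Tuple[K2, B]]) -> Dict[K2, List[B]]:
    """Combine :: [(key2, b)] -> [(key2, [b])], the step done by the library."""
    groups: Dict[K2, List[B]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

def map_reduce(
    inputs: Iterable[Tuple[K1, A]],
    mapper: Callable[[K1, A], List[Tuple[K2, B]]],
    reducer: Callable[[K2, List[B]], B],
) -> Dict[K2, B]:
    """Apply Map to every input pair, group by key2, then Reduce each group."""
    intermediate = [pair for key, value in inputs for pair in mapper(key, value)]
    return {key: reducer(key, values) for key, values in combine(intermediate).items()}

# Hypothetical example: longest word length per language tag.
inputs = [("en", "map reduce"), ("de", "karte"), ("en", "hadoop")]
longest = map_reduce(
    inputs,
    mapper=lambda lang, text: [(lang, len(word)) for word in text.split()],
    reducer=lambda lang, lengths: max(lengths),
)
print(longest)  # {'en': 6, 'de': 5}
```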
MapReduce: Example WordCount
Counting the number of occurrences of each word in a large collection of documents.
map(String value):
    List<Pair> intermediateResult;
    foreach word w in value:
        intermediateResult.add(w, 1);
    return intermediateResult;
reduce(String key, List values):
    result = 0;
    foreach v in values:
        result += v;
    return result;
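A runnable Python translation of this pseudocode might look like the sketch below. The small driver `word_count` stands in for the framework's grouping step and is an assumption for illustration, as are the sample documents.

```python
from collections import defaultdict

def map_words(value):
    """Map: emit (word, 1) for every word of one document."""
    return [(word, 1) for word in value.split()]

def reduce_counts(key, values):
    """Reduce: sum all partial counts collected for one word."""
    return sum(values)

def word_count(documents):
    """Local stand-in for the framework: map, group by word, reduce."""
    grouped = defaultdict(list)
    for document in documents:
        for word, one in map_words(document):
            grouped[word].append(one)
    return {word: reduce_counts(word, ones) for word, ones in grouped.items()}

counts = word_count(["to be or not to be", "to do"])
print(counts)  # {'to': 3, 'be': 2, 'or': 1, 'not': 1, 'do': 1}
```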
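MapReduce: Google's Implementation
Execution overview, taken from [1] (figure):
● (1) fork: the user program forks the master and the workers
● (2) assign map / assign reduce: the master assigns tasks to the workers
● (3) read: map workers read the input
● (4) local write: map workers write intermediate files to their local disks
● (5) remote read: reduce workers read the intermediate files
● (6) write: reduce workers write the output files
● Data flow: input → map phase → intermediate files (on local disks) → reduce phase → output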
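The numbered steps of the execution overview can be imitated in one process with a thread pool. This is only a sketch under stated assumptions: threads stand in for worker machines, in-memory lists stand in for intermediate files, and keys are split over R reduce tasks with hash(key) mod R as described in [1]. The names `map_task`, `reduce_task` and `run_job` are hypothetical.

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

R = 2  # number of reduce tasks (assumed for this sketch)

def partition(key):
    """Stable hash so that a key always goes to the same reduce task."""
    return zlib.crc32(key.encode("utf-8")) % R

def map_task(split):
    """(3) read one input split, (4) write pairs into R local buckets."""
    buckets = [[] for _ in range(R)]
    for word in split.split():
        buckets[partition(word)].append((word, 1))
    return buckets

def reduce_task(r, map_outputs):
    """(5) read bucket r from every map output, (6) return the result."""
    grouped = defaultdict(list)
    for buckets in map_outputs:
        for key, value in buckets[r]:
            grouped[key].append(value)
    return {key: sum(values) for key, values in sorted(grouped.items())}

def run_job(splits, workers=3):
    """Master: (1) start the workers, (2) assign map and reduce tasks."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        map_outputs = list(pool.map(map_task, splits))
        return list(pool.map(lambda r: reduce_task(r, map_outputs), range(R)))

outputs = run_job(["to be or not", "to be", "to do"])
merged = {key: value for part in outputs for key, value in part.items()}
print(merged)
```

Each reduce task produces its own output, which is why the result is a list of R partial dictionaries before merging.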
GFS: Google's File System
“Moving Computation is Cheaper than Moving Data!”
GFS: Google's File System
● Distributed file system (DFS), optimized for very large data sets
● Data is stored in chunks (typically 64 MB in size)
● Chunks are stored redundantly (typically 3 times) on so-called chunkservers
● Favors high data throughput over fast random access
● Write-once, read-many data
● Streaming data access
● Fault tolerant
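The chunk size and replication factor above translate into simple arithmetic. The following sketch assumes 64 MB means 64 × 1024 × 1024 bytes and uses a hypothetical 1 GiB file; the helper names are made up for illustration.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # typical chunk size from the slide, in bytes
REPLICAS = 3                   # typical replication factor from the slide

def chunk_layout(file_size):
    """Return (number of chunks, number of stored chunk replicas)."""
    chunks = (file_size + CHUNK_SIZE - 1) // CHUNK_SIZE  # integer ceiling
    return chunks, chunks * REPLICAS

def chunk_index(offset):
    """Index of the chunk that contains the given byte offset."""
    return offset // CHUNK_SIZE

one_gib = 1024 ** 3
print(chunk_layout(one_gib))          # (16, 48)
print(chunk_index(200 * 1024 ** 2))   # 3
```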
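GFS: Architecture
Architecture taken from [3] (figure):
● Components: application with GFS client, GFS master (file namespace, e.g. /foo/bar → chunk 2ef0), chunkservers (each on top of a local file system)
● Client → master: file name; master → client: chunk handle, chunk location
● Client → chunkserver: chunk handle, byte range; chunkserver → client: chunk data
● Master ↔ chunkserver: instructions and state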
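The read path in this architecture can be sketched with three small in-memory classes. This is a toy model under stated assumptions, not the GFS API: the class and method names are hypothetical, the client asks the master with a file name plus a chunk index computed from the offset (as described in [3]), and a read is assumed to stay within one chunk.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # bytes per chunk

class Chunkserver:
    def __init__(self):
        self.chunks = {}  # chunk handle -> chunk data (bytes)

    def read(self, handle, start, length):
        """Return the requested byte range of one chunk."""
        return self.chunks[handle][start:start + length]

class Master:
    def __init__(self):
        self.namespace = {}  # file name -> list of chunk handles
        self.locations = {}  # chunk handle -> list of chunkservers

    def lookup(self, file_name, chunk_index):
        """Metadata only: the master never serves chunk data itself."""
        handle = self.namespace[file_name][chunk_index]
        return handle, self.locations[handle]

class Client:
    def __init__(self, master):
        self.master = master

    def read(self, file_name, offset, length):
        """Ask the master for the chunk, then read from a chunkserver."""
        index, start = divmod(offset, CHUNK_SIZE)
        handle, servers = self.master.lookup(file_name, index)
        return servers[0].read(handle, start, length)

# Hypothetical setup mirroring the figure: /foo/bar consists of chunk 2ef0.
server_a, server_b = Chunkserver(), Chunkserver()
server_a.chunks["2ef0"] = server_b.chunks["2ef0"] = b"hello gfs"
master = Master()
master.namespace["/foo/bar"] = ["2ef0"]
master.locations["2ef0"] = [server_a, server_b]

data = Client(master).read("/foo/bar", offset=6, length=3)
print(data)  # b'gfs'
```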
… and practice: MapReduce using Hadoop
● Hadoop was created by Doug Cutting, who named it after his son's stuffed elephant.
● Hadoop is open source, available at apache.org
– MapReduce: MapReduce framework
– HDFS: Hadoop Distributed File System
– HBase: distributed database
● Implemented in Java, using C++ to speed up some operations
● Currently supported by Yahoo, Amazon (A9), Google, IBM, …
● Requirements: Java 1.6 and a running ssh/sshd
● Different run modes: single, pseudo-distributed and distributed
… and practice: MapReduce using Hadoop
Counting the number of occurrences of each word in a large collection of documents.
Using Hadoop's Streaming interface
… and practice: Python Mapper
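The mapper code of this slide is not preserved in the transcript. A minimal sketch of a word-count mapper for Hadoop Streaming could look as follows, assuming the default Streaming record format of one "key<TAB>value" pair per line; the function names are hypothetical.

```python
def map_lines(lines):
    """Yield one 'word<TAB>1' record for every word of every input line."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def run(stdin, stdout):
    """Stream records from stdin to stdout, as a Streaming mapper does."""
    for record in map_lines(stdin):
        stdout.write(record + "\n")

# In-memory demonstration with two sample lines.
print(list(map_lines(["to be or", "not to be"])))
```

Used as an actual Streaming script, the file would end with a call such as `run(sys.stdin, sys.stdout)` after `import sys`.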
… and practice: Python Reducer
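The reducer code of this slide is not preserved either. A matching sketch is shown below; it assumes that the input arrives sorted by word, which is how Hadoop Streaming delivers the mapper output to the reducer, and that each line has the form "word<TAB>count". The function names are hypothetical.

```python
from itertools import groupby

def reduce_lines(lines):
    """Sum the counts per word; the input must already be sorted by word."""
    records = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    for word, group in groupby(records, key=lambda record: record[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

def run(stdin, stdout):
    """Stream records from stdin to stdout, as a Streaming reducer does."""
    for record in reduce_lines(stdin):
        stdout.write(record + "\n")

# In-memory demonstration with sorted mapper output.
sorted_input = ["be\t1", "be\t1", "not\t1", "to\t1", "to\t1"]
print(list(reduce_lines(sorted_input)))  # ['be\t2', 'not\t1', 'to\t2']
```

Because `groupby` only merges adjacent records, unsorted input would produce several partial counts for the same word.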
Discussion (1)
Using MapReduce for bioinformatics applications? [5]
Discussion (2)
Using MapReduce: an expensive risk?
http://www.h-online.com/open/news/item/Google-patents-Map-Reduce-908602.html
End
Thanks for your attention!
References
[1] J. Dean and S. Ghemawat – Google Inc. :: MapReduce : Simplified Data Processing on Large Clusters, OSDI 2004
[2] R. Lämmel – Microsoft :: Google's MapReduce Programming Model – Revisited, SCP 2007
[3] S. Ghemawat et al. – Google Inc. :: The Google File System, SOSP 2003
[4] R. Grimm :: Das MapReduce-Framework :: http://www.linux-magazin.de/layout/set/print/content/view/full/46285
[5] M. Schatz :: CloudBurst : highly sensitive read mapping with MapReduce
[6] Apache Hadoop :: http://hadoop.apache.org