Virtual Observations 2012

Map/Reduce & Hadoop

2012/10/25 Hugo Buddelmeijer

Outline

• Why Map/Reduce or Hadoop?

• The Map/Reduce paradigm

• The Hadoop implementation

• Map/Reduce & Astronomy

• Hadoop @ RUG

Why use Map/Reduce

• Suited to problems that

– require large data sets

– have highly parallelisable algorithms

– result in (relatively) small solutions

Why learn Map/Reduce

• Declarative: specify what, not how

• Let the system optimize the how

E.g. input is ‘delivered’ to the program, not ‘retrieved’ by the program.

Alan Perlis’ Epigram 19 (1982):

“A language that doesn't affect the way you think about programming, is not worth knowing.”

The Map/Reduce Paradigm

• Mapper: (k1, v1) -> list(k2, v2)

• Reducer: (k2, list(v2)) -> list(k3, v3)
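
The two signatures can be tried out on a single machine. The following is a minimal sketch, assuming an in-memory driver: `run_mapreduce`, `max_mapper` and `max_reducer` are hypothetical names used for illustration, not part of Hadoop, and a real framework would spread the map, shuffle and reduce steps over many nodes.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Single-process stand-in for a Map/Reduce framework (illustration only)."""
    # Map phase: each (k1, v1) record yields zero or more (k2, v2) pairs.
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in mapper(k1, v1):
            # Shuffle: gather all values that share the same key k2.
            groups[k2].append(v2)
    # Reduce phase: each (k2, list(v2)) yields zero or more (k3, v3) pairs.
    output = []
    for k2 in sorted(groups):
        output.extend(reducer(k2, groups[k2]))
    return output


def max_mapper(key, value):
    # Identity map: pass the record through unchanged.
    yield key, value


def max_reducer(key, values):
    # Keep only the largest value seen for this key.
    yield key, max(values)


if __name__ == "__main__":
    print(run_mapreduce([("a", 3), ("b", 1), ("a", 7)], max_mapper, max_reducer))
```

The driver never tells the mapper or reducer where the data comes from; it only hands them records, which is the sense in which input is ‘delivered’ rather than ‘retrieved’.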

‘Hello World’: Word Count
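
Word count can be written as one mapper and one reducer. This is a sketch in plain Python, assuming the whole input fits in memory; the names `wc_mapper`, `wc_reducer` and `word_count` are made up for this example, and the line index stands in for the byte offset that Hadoop would normally pass as the input key.

```python
from itertools import groupby
from operator import itemgetter


def wc_mapper(offset, line):
    # (k1, v1) = (position of the line, text of the line)
    for word in line.lower().split():
        yield word, 1  # (k2, v2) = (word, 1)


def wc_reducer(word, counts):
    # (k2, list(v2)) = (word, [1, 1, ...]) -> (k3, v3) = (word, total)
    yield word, sum(counts)


def word_count(lines):
    # Map every line, then sort by key so equal words become adjacent,
    # mimicking the sort/shuffle step a framework performs between phases.
    mapped = [pair for offset, line in enumerate(lines)
              for pair in wc_mapper(offset, line)]
    mapped.sort(key=itemgetter(0))
    result = {}
    for word, group in groupby(mapped, key=itemgetter(0)):
        for k3, v3 in wc_reducer(word, (count for _, count in group)):
            result[k3] = v3
    return result


if __name__ == "__main__":
    print(word_count(["to be or", "not to be"]))
```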

A Change of Thought

• Separate data handling from processing

• No side effects in algorithms

• No POSIX! (e.g. no seeking in files)

The Hadoop Implementation

• Open source Map/Reduce implementation from Apache

• HDFS: Hadoop Distributed File System

– Bring processing to the data

• Much additional software, e.g.

– HBase: database built on top of HDFS

Map/Reduce & Astronomy

• Wiley et al. 2011: Coaddition

– Map: select & align frames

– Reduce: stack frames

• Starr et al. 2012: Classifying Transients
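
The coaddition split (map: select & align, reduce: stack) can be illustrated with a toy example. This is not the implementation of Wiley et al.; it is a sketch under strong assumptions: frames are 1-D strips with integer pixel offsets, alignment is a simple shift, stacking is a plain mean, and keying by sky pixel is a simplification chosen here. All names and data are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy data: each frame is a 1-D strip of pixel values plus the
# sky position (integer pixel offset) of its first pixel.
FRAMES = [
    {"offset": 0, "pixels": [1.0, 2.0, 3.0, 4.0]},
    {"offset": 2, "pixels": [5.0, 6.0, 7.0, 8.0]},
    {"offset": 10, "pixels": [9.0, 9.0]},  # lies outside the region used below
]


def coadd_mapper(frame, region):
    """Select pixels inside `region` and align them to sky coordinates."""
    start, stop = region
    for i, value in enumerate(frame["pixels"]):
        sky_x = frame["offset"] + i  # align: shift by the frame offset
        if start <= sky_x < stop:    # select: keep only the requested region
            yield sky_x, value


def coadd_reducer(sky_x, values):
    """Stack: average all aligned pixel values at one sky position."""
    yield sky_x, sum(values) / len(values)


def coadd(frames, region):
    # Group mapper output by sky position, then reduce each group.
    groups = defaultdict(list)
    for frame in frames:
        for sky_x, value in coadd_mapper(frame, region):
            groups[sky_x].append(value)
    return dict(pair for x in sorted(groups)
                for pair in coadd_reducer(x, groups[x]))


if __name__ == "__main__":
    print(coadd(FRAMES, (0, 6)))
```

Each frame is mapped independently of the others, so the select & align step parallelises over frames; only the stacking needs values from several frames together.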

Future Astronomy

• Query large catalogs

• Perform full sky image processing

• Instrument calibration
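
As an illustration of the first item, a catalog query fits the same pattern: the mapper filters rows, the reducer aggregates. A minimal sketch, assuming a hypothetical catalog of (ra, dec, magnitude) rows and a made-up query "count sources brighter than a magnitude limit per sky cell"; none of these names come from an existing survey or library.

```python
from collections import defaultdict

# Hypothetical catalog rows: (right ascension [deg], declination [deg], magnitude)
CATALOG = [
    (10.2, -5.1, 18.0),
    (10.8, -5.9, 21.5),
    (10.9, -5.5, 17.2),
    (200.0, 30.0, 19.0),
]


def query_mapper(row, mag_limit, cell_deg):
    ra, dec, mag = row
    if mag < mag_limit:  # brighter sources have smaller magnitudes
        # Key: index of the square sky cell that contains the source.
        yield (int(ra // cell_deg), int(dec // cell_deg)), 1


def query_reducer(cell, ones):
    yield cell, sum(ones)


def count_bright_sources(catalog, mag_limit=20.0, cell_deg=1.0):
    groups = defaultdict(list)
    for row in catalog:
        for cell, one in query_mapper(row, mag_limit, cell_deg):
            groups[cell].append(one)
    return dict(pair for cell in sorted(groups)
                for pair in query_reducer(cell, groups[cell]))


if __name__ == "__main__":
    print(count_bright_sources(CATALOG))
```

The input catalog can be arbitrarily large while the answer stays small, which matches the kind of problem the paradigm is suited to.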

More Hadoop Concepts

• Sequence Files

• Java / Pipes / Streaming

• Pydoop (self-contained w/ pyfits)

• Tadoop = Target (/Astro-WISE) + Hadoop
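
With Hadoop Streaming, the mapper and reducer can be any executables: records arrive as lines on standard input, key/value pairs leave as tab-separated lines on standard output, and the reducer receives its input sorted by key. The sketch below shows that convention in Python; the function names are hypothetical, and the functions work on any iterable of lines so they can be tried without a cluster.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: read text lines sequentially, emit 'word<TAB>1' lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(sorted_lines):
    """Reducer: input lines must already be sorted by key."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        # Sum the counts of all consecutive lines that share this key.
        yield f"{key}\t{sum(int(value) for _, value in group)}"


def simulate(lines):
    # Local stand-in for: mapper | sort | reducer
    return list(reduce_lines(sorted(map_lines(lines))))
```

In an actual streaming job each function would live in its own script that loops over `sys.stdin` and prints the results. Both functions read their input strictly front to back, in line with the "no seeking in files" rule.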

Hadoop @ RUG

• Experimental 6 node cluster @CIT

• Maintained by Fokke Dijkstra & Bob Droge

• Cloudera CDH3u4 installation

• More information:

– Cloudera.com

– Tom White, Hadoop: The Definitive Guide (O’Reilly)