
Running MapReduce in Docker

EECS 4415

Big Data Systems

Tilemachos Pechlivanoglou

tipech@eecs.yorku.ca

Last week’s WordCount example

[Slide images: contents of mapper.py and reducer.py]
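As a refresher, the pair behaves roughly like the sketch below (an assumed shape, not necessarily identical to last week's files; in the real mapper.py and reducer.py each function would loop over sys.stdin and print its output).

```python
def mapper(lines):
    # Emit "<word>\t1" for every word in the input
    for line in lines:
        for word in line.strip().split():
            yield "%s\t1" % word

def reducer(lines):
    # Hadoop sorts map output by key before the reduce phase,
    # so all lines for the same word arrive adjacent to each other.
    current_word, current_count = None, 0
    for line in lines:
        word, count = line.rsplit("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                yield "%s\t%d" % (current_word, current_count)
            current_word, current_count = word, int(count)
    # Flush the final word
    if current_word is not None:
        yield "%s\t%d" % (current_word, current_count)
```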

System representation

[Diagram: HDFS runs inside a Docker container, which runs on the PC]

docker run -it eecsyorku/eecs4415

[Diagram: the host folder .../Documents/A2 holds mapper.py, reducer.py, input.txt; the container's /app directory is still empty]

docker run -it -v $PWD:/app eecsyorku/eecs4415

[Diagram: with -v, the host folder .../Documents/A2 is mounted at /app, so mapper.py, reducer.py, and input.txt appear inside the container as well]

docker run -it -p 9870:9870 -p 8088:8088 \
-v $PWD:/app eecsyorku/eecs4415

[Diagram: ports 9870 and 8088 are published to the host, so a browser on the PC can reach the Hadoop web UIs at localhost:9870 and localhost:8088]

docker run -it -p 9870:9870 -p 8088:8088 \
-v $PWD:/app eecsyorku/eecs4415

Web UI

[Slide images: screenshots of the HDFS NameNode (9870) and YARN ResourceManager (8088) web UIs]

hdfs dfs -ls /

[Diagram: /app in the container holds mapper.py, reducer.py, input.txt; the HDFS root / is empty]

hdfs dfs -put ./input.txt /

[Diagram: input.txt now also appears in the HDFS root /]

hdfs dfs -mkdir /test

[Diagram: the HDFS root / now contains input.txt and the directory test]

hdfs dfs -rm -r /test

[Diagram: the test directory is gone again; the HDFS root / contains only input.txt]

Running MapReduce

hadoop jar /usr/hadoop-3.0.0/share/hadoop/tools/lib/hadoop-streaming-3.0.0.jar \

-file ./mapper.py \

-mapper ./mapper.py \

-file ./reducer.py \

-reducer ./reducer.py \

-input /input.txt \

-output /output
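Because streaming only pipes lines through stdin and stdout, the job can be sanity-checked locally before submitting it to Hadoop: the shell equivalent is `cat input.txt | python mapper.py | sort | python reducer.py`. A small Python driver for that simulation (a sketch; the script paths are assumptions):

```python
import subprocess
import sys

def run_streaming_locally(input_path, mapper="mapper.py", reducer="reducer.py"):
    """Simulate 'cat input | mapper | sort | reducer' to debug a
    streaming job locally before submitting it to Hadoop."""
    with open(input_path) as f:
        mapped = subprocess.run([sys.executable, mapper], stdin=f,
                                capture_output=True, text=True, check=True).stdout
    # Hadoop sorts map output by key between the two phases (the shuffle)
    shuffled = "".join(sorted(mapped.splitlines(keepends=True)))
    return subprocess.run([sys.executable, reducer], input=shuffled,
                          capture_output=True, text=True, check=True).stdout
```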


hdfs dfs -get /output/part*

[Diagram: the HDFS root / now contains input.txt and an output directory holding _SUCCESS and part-00000; part-00000 is copied back into /app on the PC]

Success!

Contents of part-00000: [slide shows the resulting word counts]
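The fetched part-00000 is plain tab-separated text, so it can be post-processed directly. For example, a hypothetical helper (not part of the assignment) that loads it into a dict:

```python
def parse_counts(path):
    """Parse a streaming WordCount output file (one 'word<TAB>count'
    per line, e.g. the part-00000 fetched with 'hdfs dfs -get')."""
    counts = {}
    with open(path) as f:
        for line in f:
            word, count = line.rstrip("\n").rsplit("\t", 1)
            counts[word] = int(count)
    return counts
```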

Dealing with imports


● For Python packages imported from outside the standard library (e.g. nltk, sklearn):
○ the program will run properly outside Hadoop, but fail inside it with no useful error
○ the packages need to be shipped to the job somehow

● To ship them, compress them into a zip and rename it:
○ zip -r nltk_sklearn.zip nltk sklearn
○ mv nltk_sklearn.zip /path/to/where/your/mapper/will/be/nltk_sklearn.mod
○ hadoop … -file ./nltk_sklearn.mod

● Then import them manually:
○ import zipimport
○ importer = zipimport.zipimporter('nltk_sklearn.mod')
○ sklearn = importer.load_module('sklearn')
○ nltk = importer.load_module('nltk')
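The steps above can be wrapped in a small helper (a sketch; `load_module` is what the slide uses, though newer Python versions deprecate it in favour of `exec_module`):

```python
import zipimport

def load_from_zip(zip_path, module_name):
    """Load module_name from a zip archive shipped alongside the job,
    e.g. the nltk_sklearn.mod passed via 'hadoop ... -file'."""
    importer = zipimport.zipimporter(zip_path)
    return importer.load_module(module_name)
```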

Thank you!


Links to check:
https://zettadatanet.wordpress.com/2015/04/04/a-hands-on-introduction-to-mapreduce-in-python/
https://afourtech.com/guide-docker-commands-examples/
https://medium.com/@rrfd/your-first-map-reduce-using-hadoop-with-python-and-osx-ca3b6f3dfe78
https://hadoop.apache.org/docs/r1.2.1/streaming.html