Hands on Hadoop

Paul Tarjan, Chief Technical Monkey, Yahoo!
http://paulisageek.com

Description

My intro talk for Hadoop and how to use it with Python streaming. Code is here: http://github.com/ptarjan/hands-on-hadoop-tutorial/

Transcript of Hands on Hadoop

Page 1: Hands on Hadoop

Paul Tarjan

Chief Technical Monkey

Yahoo!

http://paulisageek.com

Page 2: Hands on Hadoop

Data is ugly

1 08bade48-1081-488f-b459-6c75d75312ae 2 2affa525151b6c51 79021a2e2c836c1a 327e089362aac70c fca90e7f73f3c8ef af26d27737af376a 100.0 2.0 0.0
29 08bade48-1081-488f-b459-6c75d75312ae 3 769ed4a87b5010f4 3d4b990abb0867c8 cd74a8342d25d090 ab9f74ae002e80ff af26d27737af376a 100.0 1.0 0.0
29 08bade48-1081-488f-b459-6c75d75312ae 2 769ed4a87b5010f4 3d4b990abb0867c8 cd74a8342d25d090 ab9f74ae002e80ff af26d27737af376a 100.0 1.0 0.0
11 08bade48-1081-488f-b459-6c75d75312ae 1 769ed4a87b5010f4 3d4b990abb0867c8 cd74a8342d25d090 ab9f74ae002e80ff af26d27737af376a 100.0 2.0 0.0

Page 3: Hands on Hadoop

Summaries are OK

Page 4: Hands on Hadoop

Graphs are pretty

Page 5: Hands on Hadoop

Word count, yay!

Given a document, how many of each word are there?

Real world:

Given our search logs, how many people click on result 1?
Given our Flickr photos, how many cat photos are there?
Given our web crawl, what are the 10 most popular words?

Page 6: Hands on Hadoop

Map/Reduce
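
(The slide itself is a diagram.) The idea in miniature, as a toy sketch in plain Python rather than Hadoop's actual API: map emits (key, value) pairs, a sort groups them by key, and reduce combines the values for each key.

# Toy map/reduce in plain Python (illustration only, not Hadoop's API)
from itertools import groupby
from operator import itemgetter

docs = ["to be or not to be"]

# map phase: emit one ("word", 1) pair per word
mapped = [(word, 1) for doc in docs for word in doc.split()]

# shuffle/sort phase: group the pairs by key
mapped.sort(key=itemgetter(0))

# reduce phase: sum the values for each key
for word, group in groupby(mapped, key=itemgetter(0)):
    print("%s\t%d" % (word, sum(count for _, count in group)))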

Page 7: Hands on Hadoop

Hadoop streaming

Basically, write some code that runs like:

$ cat data | mapper.py | sort | reducer.py

Page 8: Hands on Hadoop

Python wordcount mapper

http://github.com/ptarjan/hands-on-hadoop-tutorial/blob/master/mapper.py
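
The link above has the real code; a minimal streaming wordcount mapper looks roughly like this (a sketch, not necessarily the repo file verbatim):

#!/usr/bin/env python
# Streaming mapper: read raw lines on stdin, emit "word<TAB>1" per word
import sys

for line in sys.stdin:
    for word in line.split():
        print("%s\t1" % word)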

Page 9: Hands on Hadoop

Python wordcount reducer

http://github.com/ptarjan/hands-on-hadoop-tutorial/blob/master/reducer.py
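
Again, the link has the real code; a minimal streaming wordcount reducer, sketched under the assumption that input arrives sorted by word (which the sort step guarantees):

#!/usr/bin/env python
# Streaming reducer: input is sorted by word, so all "word<TAB>1" lines
# for one word are adjacent; sum them and emit "word<TAB>count"
import sys

current_word = None
count = 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t", 1)
    if word != current_word:
        if current_word is not None:
            print("%s\t%d" % (current_word, count))
        current_word = word
        count = 0
    count += int(value)
if current_word is not None:
    print("%s\t%d" % (current_word, count))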

Page 10: Hands on Hadoop

Run

$ cat data | mapper.py | sort | reducer.py

If that works, run it live:

Put the data into Hadoop's Distributed File System (HDFS)
Run Hadoop
Read the output data from HDFS

Page 11: Hands on Hadoop

Run Hadoop

Stream data through these two files, saving the output back to HDFS:

$HADOOP_HOME/bin/hadoop jar \
    $HADOOP_HOME/hadoop-streaming.jar \
    -input input_dir \
    -output output_dir \
    -mapper mapper.py \
    -reducer reducer.py \
    -file mapper.py -file reducer.py

Page 12: Hands on Hadoop

View output

View output files:
$ hadoop dfs -ls output_dir

Note the multiple output files ("part-00000", "part-00001", etc.)

View output file contents:
$ hadoop dfs -cat output_dir/part-00000

Page 13: Hands on Hadoop

Do it live!

http://github.com/ptarjan/hands-on-hadoop-tutorial/

Page 14: Hands on Hadoop

Live Commands

# make a new directory
mkdir count_example
cd count_example

# get the files (could be done with a git clone, but wget is more prevalent than git for now)
wget -o /dev/null http://github.com/ptarjan/hands-on-hadoop-tutorial/raw/master/mapper.py
wget -o /dev/null http://github.com/ptarjan/hands-on-hadoop-tutorial/raw/master/reducer.py
wget -o /dev/null http://github.com/ptarjan/hands-on-hadoop-tutorial/raw/master/reducer_numsort.py
wget -o /dev/null http://github.com/ptarjan/hands-on-hadoop-tutorial/raw/master/hamlet.txt
chmod u+x *.py

# run it locally
cat hamlet.txt | ./mapper.py | head
cat hamlet.txt | ./mapper.py | sort | ./reducer.py | head
cat hamlet.txt | ./mapper.py | sort | ./reducer.py | sort -k 2 -r -n | head
cat hamlet.txt | ./mapper.py | sort | ./reducer_numsort.py | head
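
reducer_numsort.py isn't reproduced in the deck; judging from how it's piped above, it appears to be a reducer that also orders its own output by count, descending. A guess at its shape (the structure here is mine, not necessarily the repo's):

#!/usr/bin/env python
# Guess at reducer_numsort.py: sum the count for each word, then
# emit the words ordered by count, highest first
import sys
from collections import defaultdict

counts = defaultdict(int)
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t", 1)
    counts[word] += int(value)

for word, count in sorted(counts.items(), key=lambda kv: -kv[1]):
    print("%s\t%d" % (word, count))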

# put the data in
hadoop fs -mkdir count_example
hadoop fs -put hamlet.txt count_example

# yahoo search specific - CHANGE TO YOUR OWN QUEUE
PARAMS=-Dmapred.job.queue.name=search_fast_lane

# run it
hadoop jar $HADOOP_HOME/hadoop-streaming.jar $PARAMS \
    -mapper mapper.py -reducer reducer.py \
    -input count_example/hamlet.txt -output count_example/hamlet_out \
    -file mapper.py -file reducer.py

# view the output
hadoop fs -ls count_example/hamlet_out/
hadoop fs -cat count_example/hamlet_out/* | head

# run the num sorted
hadoop jar $HADOOP_HOME/hadoop-streaming.jar $PARAMS \
    -mapper mapper.py -reducer reducer_numsort.py \
    -input count_example/hamlet.txt -output count_example/hamlet_numsort_out \
    -file mapper.py -file reducer_numsort.py

# view the output
mkdir out
hadoop fs -cat count_example/hamlet_numsort_out/* | sort -nrk 2 > out/hamlet_numsort.txt
head out/hamlet_numsort.txt

# test that hadoop worked
# apply the same sort to make sure ties are broken the same way
cat hamlet.txt | ./mapper.py | sort | ./reducer_numsort.py | sort -nrk 2 > out/hamlet_numsort_local.txt
diff out/hamlet_numsort.txt out/hamlet_numsort_local.txt

# EXTRA (wikipedia)
wget http://download.wikimedia.org/enwiki/latest/enwiki-latest-all-titles-in-ns0.gz
gunzip -c enwiki-latest-all-titles-in-ns0.gz | hadoop fs -put - count_example/wiki_titles
hadoop jar $HADOOP_HOME/hadoop-streaming.jar $PARAMS \
    -mapper mapper.py -reducer reducer_numsort.py \
    -input count_example/wiki_titles -output count_example/wiki_titles_out \
    -file mapper.py -file reducer_numsort.py
hadoop fs -cat count_example/wiki_titles_out/* | head

# cleanup
hadoop fs -rmr count_example
cd ..
rm -r count_example