Transcript of "Introduction to Hadoop Programming" by Bryon Gill, Pittsburgh Supercomputing Center (31 pages).

Page 1

Introduction to Hadoop Programming

Bryon Gill, Pittsburgh Supercomputing Center

Page 2

What We Will Discuss

• Hadoop Architecture Overview
• Practical Examples
  – "Classic" Map/Reduce
  – Hadoop Streaming
  – Spark
• HBase and Other Applications

Page 3

Hadoop Overview

• Framework for Big Data
• Map/Reduce
• Platform for Big Data Applications

Page 4

Map/Reduce

• Apply a Function to all the Data
• Harvest, Sort, and Process the Output
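The two steps above can be sketched in plain Python. This is only a toy illustration of the pattern, not Hadoop code: the input is cut into splits, a map function runs on each split, and a reduce function folds the collected outputs into one result.

```python
from functools import reduce

# Toy illustration of the Map/Reduce pattern (not actual Hadoop code).
data = ["the quick brown fox", "the lazy dog", "the end"]

# Map: apply a function to every split of the data.
def map_fn(split):
    return len(split.split())  # count the words in this split

outputs = [map_fn(split) for split in data]

# Reduce: harvest and combine the mapped outputs into a single result.
result = reduce(lambda a, b: a + b, outputs)
print(result)  # total word count across all splits
```

In real Hadoop the splits live on different nodes and the map calls run in parallel, but the shape of the computation is the same.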

Page 5

© 2010, 2014 Pittsburgh Supercomputing Center

[Diagram — Map/Reduce: Big Data is divided into Splits 1 … n; a Map function F(x) runs on each split, producing Outputs 1 … n; a Reduce function F(x) combines those outputs into the final Result.]

Page 6

HDFS

• Distributed FS Layer
• WORM fs
  – Optimized for Streaming Throughput
• Exports
• Replication
• Process data in place

Page 7

HDFS Invocations: Getting Data In and Out

• hdfs dfs -ls
• hdfs dfs -put
• hdfs dfs -get
• hdfs dfs -rm
• hdfs dfs -mkdir
• hdfs dfs -rmdir

Page 8

Writing Hadoop Programs

• Wordcount Example: WordCount.java
  – Map Class
  – Reduce Class

Page 9

Exercise 1: Compiling

hadoop com.sun.tools.javac.Main WordCount.java

Page 10

Exercise 1: Packaging

jar cf wc.jar WordCount*.class

Page 11

Exercise 1: Submitting

hadoop jar wc.jar WordCount \
  /datasets/compleat.txt \
  output \
  -D mapred.reduce.tasks=2

Page 12

Configuring your Job Submission

• Mappers and Reducers
• Java options
• Other parameters

Page 13

Monitoring

• Important Web Interface Ports:
  – dxchd01.psc.edu:8088 – YARN Resource Manager (track jobs)
  – dxchd01.psc.edu:50070 – HDFS (NameNode)
  – dxchd02.psc.edu:8042 – NodeManager (slave nodes)
  – dxchd02.psc.edu:50075 – DataNode (slave nodes)
  – dxchd01.psc.edu:8080 – Spark Master web interface

Page 14

Hadoop Streaming

• Write Map/Reduce Jobs in any language
• Excellent for Fast Prototyping

Page 15

Hadoop Streaming: Bash Example

• Bash wc and cat

hadoop jar \
  $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming*.jar \
  -input /datasets/plays/ \
  -output streaming-out \
  -mapper '/bin/cat' \
  -reducer '/usr/bin/wc -l'

Page 16

Hadoop Streaming Python Example

• Wordcount in Python

hadoop jar \
  $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming*.jar \
  -file mapper.py \
  -mapper mapper.py \
  -file reducer.py \
  -reducer reducer.py \
  -input /datasets/plays/ \
  -output pyout
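A minimal mapper.py/reducer.py pair might look like the following. This is a sketch of the usual streaming word count, assuming tab-separated key/value lines and relying on the fact that Hadoop sorts the mapper output by key, so equal words reach the reducer on consecutive lines.

```python
#!/usr/bin/env python3
import sys

# mapper.py would end with: run_mapper(sys.stdin)
def run_mapper(lines, out=sys.stdout):
    # Emit "word<TAB>1" for every word in the input.
    for line in lines:
        for word in line.split():
            out.write(f"{word}\t1\n")

# reducer.py would end with: run_reducer(sys.stdin)
def run_reducer(lines, out=sys.stdout):
    # Sum the counts for each word; input arrives sorted by key.
    current, total = None, 0
    for line in lines:
        word, count = line.rstrip("\n").split("\t")
        if word == current:
            total += int(count)
        else:
            if current is not None:
                out.write(f"{current}\t{total}\n")
            current, total = word, int(count)
    if current is not None:
        out.write(f"{current}\t{total}\n")
```

Because each script just reads stdin and writes stdout, both can be tested on the command line with an ordinary pipe (`cat file | ./mapper.py | sort | ./reducer.py`) before submitting to the cluster.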

Page 17

Spark

• Alternate programming framework using HDFS
• Optimized for in-memory computation
• Well supported in Java, Python, Scala

Page 18

Spark Resilient Distributed Dataset (RDD)

“Formally, an RDD is a read-only, partitioned collection of records. RDDs can only be created through deterministic operations on either (1) data in stable storage or (2) other RDDs.”

(Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, Zaharia et al.)

Page 19

Spark Resilient Distributed Dataset (RDD)

• RDD for short
• Persistence-enabled data collections
• Transformations
• Actions
• Flexible implementation: memory vs. hybrid vs. disk

Page 20

Selected RDD Transformations

• map(func)
• filter(func)
• flatMap(func)
• sample(withReplacement, fraction, seed)
• distinct([numTasks])
• union(otherDataset)
• join(otherDataset)
• cogroup(otherDataset, [numTasks])
• cartesian(otherDataset)
• groupByKey([numTasks])
• reduceByKey(func, [numTasks])
• sortByKey([ascending], [numTasks])
• repartition(numPartitions)
• pipe(command, [envVars])

Page 21

Selected RDD Actions

• reduce(func)
• count()
• collect()
• take(n)
• saveAsTextFile(path)

Actions Trigger Transformations!
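As a rough mental model, the following uses plain Python to stand in for Spark; it illustrates what a few of the operations mean, not how Spark implements them. flatMap expands each record into several, map transforms each record, and reduceByKey merges the values that share a key. In real Spark these transformations are lazy: nothing runs until an action such as collect() or count() asks for results.

```python
from itertools import groupby
from functools import reduce

# Plain-Python sketch of a few RDD operation semantics (not Spark itself).
lines = ["to be", "or not to be"]

# flatMap(func): one input record expands into many output records.
words = [w for line in lines for w in line.split()]

# map(func): transform each record, here pairing every word with a 1.
pairs = [(w, 1) for w in words]

# reduceByKey(func): combine the values that share a key.
pairs.sort(key=lambda kv: kv[0])  # groupby needs equal keys adjacent
counts = {key: reduce(lambda a, b: a + b, (v for _, v in group))
          for key, group in groupby(pairs, key=lambda kv: kv[0])}
print(counts)  # word -> count
```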

Page 22

Spark example

• lettercount.py

Page 23

Spark Machine Learning Library

• Clustering (K-Means)
• Many others; full list at

http://spark.apache.org/docs/1.0.1/mllib-guide.html

Page 24

K-means

• Randomly seed cluster starting points

• Assign each point to its nearest centroid, then compute each cluster's new mean

• If the centroids changed, repeat

• If the centroids stay the same, they've converged and we're done.

• Awesome visualization:

http://www.naftaliharris.com/blog/visualizing-k-means-clustering/
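The loop above can be sketched directly. This is a toy one-dimensional version with fixed initial seeds rather than random ones, so it only illustrates the assign/recompute/converge cycle:

```python
# Toy 1-D K-means illustrating the steps above (fixed seeds, not random).
def kmeans(points, centroids, max_iter=100):
    for _ in range(max_iter):
        # Assign each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster.
        new = [sum(c) / len(c) if c else centroids[i]
               for i, c in enumerate(clusters)]
        # If the centroids stay the same, they've converged.
        if new == centroids:
            return new
        centroids = new
    return centroids

print(kmeans([1.0, 2.0, 10.0, 11.0], [1.0, 10.0]))  # → [1.5, 10.5]
```

MLlib's version does the same thing over partitioned data, with smarter seeding and a tolerance threshold instead of exact equality.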

Page 25

Exercise 2: K-Means

spark-submit \
  $SPARK_HOME/examples/src/main/python/mllib/kmeans.py \
  hdfs://dxchd01.psc.edu:/datasets/kmeans_data.txt 3

spark-submit \
  $SPARK_HOME/examples/src/main/python/mllib/kmeans.py \
  hdfs://dxchd01.psc.edu:/datasets/archiver.txt 2

Page 26

NoSQL Databases

• “Not Only SQL”
• Similar interface to an RDBMS
• Trade consistency guarantees for horizontal scaling

Page 27

NoSQL Overview

• HBase
• Hive
• Cassandra
• Others

Page 28

HBase

• Modeled after Google's Bigtable
• Linear scalability
• Automated failover
• Hadoop (and Spark) integration
• Stores its data as files on HDFS, which in turn sits on the native filesystem

Page 29

HBase Example

hbase shell
status
help
create '$userid-table', '$userid-family'
put '$userid-table', 'row01', '$userid-family:col01', 'value1'
scan '$userid-table'
disable '$userid-table'
drop '$userid-table'

Page 30

Questions?

• Thanks!

Page 31

References and Useful Links

• HDFS shell commands: http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html

• Apache Hadoop official releases: https://hadoop.apache.org/releases.html

• Apache Spark documentation: http://spark.apache.org/docs/latest/

• Apache Spark MLlib documentation: https://spark.apache.org/docs/latest/mllib-guide.html

• Apache HBase: http://hbase.apache.org/