Big Data: Data Analysis Boot Camp: Hadoop and R
Chuck Cartledge, PhD
24 September 2017

Transcript of Big Data: Data Analysis Boot Camp: Hadoop and R

  • 1/26

    Introduction Basics Hands-on Q & A Conclusion References Files

    Big Data: Data Analysis Boot Camp: Hadoop and R

    Chuck Cartledge, PhD

    24 September 2017

  • 2/26


    Table of contents (1 of 1)

    1 Introduction

    2 Basics

    3 Hands-on

    4 Q & A

    5 Conclusion

    6 References

    7 Files

  • 3/26


    What are we going to cover?

    1 Look at the Hadoop map-reduce programming model

    2 Pick apart the “classic” map-reduce word count program

    3 Look at how the map-reduce model can be used with complex keys

  • 4/26


    Hadoop Distributed File System (HDFS)

    The Hadoop Distributed File System (HDFS)

    “The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search engine project. HDFS is now an Apache Hadoop subproject.”

    Apache Staff [2]

  • 5/26


    Hadoop Distributed File System (HDFS)

    HDFS Assumptions and Goals [2]

    Hardware Failure: Hardware failure is the norm rather than the exception.

    Streaming Data Access: Applications that run on HDFS need streaming access to their data sets.

    Large Data Sets: Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size.

    Simple Coherency Model: HDFS applications need a write-once-read-many access model for files.

    Moving Computation is Cheaper than Moving Data: A computation requested by an application is much more efficient if it is executed near its data.

    Portability Across Heterogeneous Hardware and Software Platforms: HDFS has been designed to be easily portable from one platform to another.

  • 6/26


    Hadoop Distributed File System (HDFS)

    HDFS Implementations [3]

    Hardware Failure: Redundant copies of the data are kept by the system.

    Streaming Data Access: Applications that run on HDFS need streaming access to their data sets. Programs read and write data from and to STDIN and STDOUT.

    Large Data Sets: An HDFS data file is “chunked” to minimize total program execution time.

    Simple Coherency Model: HDFS trades off some POSIX requirements for performance, so some operations may behave differently than you expect them to.

    Moving Computation is Cheaper than Moving Data: map() functions are copied to the data, and the results are copied to the reducer functions.

    Portability Across Heterogeneous Hardware and Software Platforms: Systems built in accordance with (IAW) standards gain market share.

  • 7/26


    Hadoop Distributed File System (HDFS)

    HDFS terminology

    Some terms:

    namenode: manages the filesystem namespace. It maintains the filesystem tree and the metadata for all the files and directories in the tree.

    client: accesses the filesystem on behalf of the user by communicating with the namenode and datanodes. The client presents a POSIX-like filesystem interface.

    datanode: datanodes are the workhorses of the filesystem. They store and retrieve blocks when they are told to (by clients or the namenode).

    Our applications are clients, and the mysteries of the name and data nodes are hidden from us.

    Image from [1].

  • 8/26


    Hadoop Distributed File System (HDFS)

    Same image.

    Image from [1].

  • 9/26


    Map/Reduce computing model

    The map-reduce model from a 50,000-foot view.

    A simple and powerful model:

    1 A line of data is presented to a “mapper” function.

    2 The “mapper” outputs 0 or more key and value tuples per presented input line.

    3 Hadoop sorts and merges all keys and values so that there is one key with one or more values.

    4 The “reducer” processes each key and associated values to the output.

    Image from [1].
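The four numbered steps above can be sketched in plain R, with no Hadoop involved; the input lines and variable names below are illustrative stand-ins, not the deck's attached program.

```r
# Map / sort-merge / reduce flow, simulated in base R.
lines <- c("the cat sat", "the dog sat")

# Map: each input line emits one (word, 1) tuple per word.
pairs <- unlist(lapply(lines, function(l) strsplit(l, " ")[[1]]))

# Sort/merge: group the values by key, one key with one or more values.
grouped <- split(rep(1, length(pairs)), pairs)

# Reduce: collapse each key's values to a single count.
counts <- sapply(grouped, sum)
counts[["the"]]   # 2
```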

  • 10/26


    Map/Reduce computing model

    Same image.

    Image from [1].

  • 11/26


    Map/Reduce computing model

    A lower level view

    There are a lot of processes and coordination happening behind the scenes. The client submits a job to Hadoop, mapper functions are copied to the data, key values are sorted, then presented to the reducers, and output is written. Much of this activity can be monitored at port 8787.

    Image from [1].

  • 12/26


    Map/Reduce computing model

    Same image.

    Image from [1].

  • 13/26


    Word count

    Classic word count program

    The program is in the attached file (Hadoop word count). We’ll:

    1 Set some environment variables for Hadoop

    2 Load necessary R libraries

    3 Download and save the text file

    4 Do some HDFS housekeeping

    5 Define and execute the map-reduce job

    6 See where the results ended up
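Steps 3 and 5 above can be simulated in base R without a cluster; the sample text is a made-up stand-in for the downloaded file, and table() plays the role of the map-reduce job.

```r
# Word count without Hadoop: tokenize, then count.
text <- "To be or not to be that is the question"

# Tokenize: lower-case and split on anything that is not a letter.
words <- unlist(strsplit(tolower(text), "[^a-z]+"))
words <- words[nchar(words) > 0]

# Count: equivalent in effect to the map-reduce word count.
counts <- sort(table(words), decreasing = TRUE)
counts[["to"]]   # 2
```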

  • 14/26


    Word count

    Ways to modify the word count program.

    Remove all “words” that are in fact a space

    Remove all “stop” words

    Remove all words that are numbers

    Stem all words

    Process a different text file

    Create a histogram of the first n most common words

    Estimate the “reading” level of the processed text

    Create a word cloud in the shape of something associated with the text
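A few of the modifications above (dropping blank “words”, stop words, and numbers) might look like this in base R; the stop-word list here is a tiny illustrative sample, not the full list a package such as tm supplies.

```r
# Filter a token vector: drop blanks, stop words, and pure numbers.
words <- c("the", "42", "hadoop", "", " ", "a", "cluster")
stops <- c("the", "a", "an", "of")

kept <- words[trimws(words) != "" &          # drop spaces and empties
              !words %in% stops &            # drop stop words
              !grepl("^[0-9]+$", words)]     # drop pure numbers
kept   # "hadoop" "cluster"
```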

  • 15/26


    Airports and travel

    Looking at air traffic between US domestic airports (attached Airport route exploration)

    Mashing data from different sources.

    Use the US Government Bureau of Transportation Statistics to get route data

    Use OpenFlights to find airport latitude and longitude

    Use the Hadoop map/reduce model to create a pivot table

    Plot results

    Attached file.
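The pivot-table step above can be sketched with base R's aggregate() in place of the Hadoop map/reduce job; the route rows below are made up for illustration.

```r
# Sum the freight weight carried on each origin/destination pair.
routes <- data.frame(origin = c("JFK", "JFK", "LAX", "ORD"),
                     dest   = c("LAX", "LAX", "ORD", "JFK"),
                     weight = c(10, 5, 7, 3))

pivot <- aggregate(weight ~ origin + dest, data = routes, FUN = sum)
```

With Hadoop, origin/destination would be the (possibly complex) key and weight the value; aggregate() collapses the same key groups locally.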

  • 16/26


    Airports and travel

    Same image.

    Attached file.

  • 17/26


    Airports and travel

    Bureau of Transportation Statistics home page

    https://www.bts.gov


  • 18/26


    Airports and travel

    BTS Airlines and Airports page

    https://www.rita.dot.gov/bts/sites/rita.dot.gov.bts/files/subject_areas/airline_information/index.html

  • 19/26


    Airports and travel

    BTS Domestic Data Selection page

    https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=258&DB_Short_Name=Air%20Carriers

  • 20/26


    Airports and travel

    OpenFlights home page

    https://openflights.org/data.html


  • 21/26


    Airports and travel

    Lessons learned about keyval()

    Some “interesting” things about the keyval() function:

    1 The last call wins. If your processing creates a collection of key value pairs, the last keyval() call is the data passed to reduce().

    2 keyval() is vectorized. There can be more than one key or value passed to the function.

    To pass more than one key value combination, use: keyval(c(...), c(...))

    Be aware that the shorter argument will be recycled as necessary to match the longer argument. Execute keyval at the R prompt to see the code.
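The recycling behavior described above can be illustrated with a small base-R stand-in; keyval_sketch is a hypothetical helper that mimics how rmr2's keyval() pairs vectorized keys and values, not part of rmr2 itself.

```r
# Pair keys with values, recycling the shorter argument (R's usual rule).
keyval_sketch <- function(key, val) {
  n <- max(length(key), length(val))
  data.frame(key = rep_len(key, n), val = rep_len(val, n))
}

kv <- keyval_sketch(c("a", "b"), 1:4)   # keys "a", "b" are recycled
kv$key   # "a" "b" "a" "b"
```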

  • 22/26


    Airports and travel

    Ways to modify the airport program.

    Change the lines between airports to great circle routes

    Reduce the number of routes to those that carry the greatest weight

    See the difference between cargo and passenger routes

    Modify the routes to show source and destination

    Identify the most common carriers by weight

    Identify the most frequent carriers

    Compute net weight exchange between airports (find sources and sinks)

    If the data is for US domestic routes, why are there links to Chile?

    Expand the list of airport locations to remove all unknown locations

  • 23/26


    Q & A time.

    Q: How many Oregonians does it take to screw in a light bulb?

    A: Three. One to screw in the light bulb and two to fend off all those Californians trying to share the experience.

  • 24/26


    What have we covered?

    Gained an understanding of how R interfaces with the Hadoop map-reduce programming model

    “Played” with a word count program

    Looked at things that airlines carry between airports and how to display that data

    Next: BDAR Chapter 5, RDBMSs and R

  • 25/26


    References (1 of 1)

    [1] Ricky Ho, How Hadoop Map/Reduce works, 2008.

    [2] Apache Staff, HDFS Architecture Guide, https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html, 2017.

    [3] Tom White, Hadoop: The Definitive Guide, 4th Edition, O’Reilly Media, Inc., 2015.


  • 26/26


    Files of interest

    1 Hadoop word count

    2 Airport route exploration

    3 R library script file

    4 Route information

    # Clear the workspace, load the helper script, then load the required libraries.
    rm(list = ls())
    source("library.R")
    loadLibraries()