Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley

Introduction to Hadoop presentation at Carnegie Mellon University, Silicon Valley Campus.

Transcript of Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley

  • Introduction to Apache Hadoop and its Ecosystem
    Mark Grover | Intro to Cloud Computing, Carnegie Mellon SV
    github.com/markgrover/hadoop-intro-fast
    © Copyright 2010-2014 Cloudera, Inc. All rights reserved.
  • About Me
    Committer on Apache Bigtop; committer and PPMC member on Apache Sentry (incubating)
    Contributor to Apache Hadoop, Hive, Spark, Sqoop, and Flume
    Software developer at Cloudera
    @mark_grover
    www.linkedin.com/in/grovermark
  • Co-author of an O'Reilly book
    @hadooparchbook
    hadooparchitecturebook.com
    To be released early 2015
  • About the Presentation
    What's ahead:
    Fundamental Concepts
    HDFS: The Hadoop Distributed File System
    Data Processing with MapReduce
    Demo
    Conclusion + Q&A
  • Fundamental Concepts: Why the World Needs Hadoop
  • What's the craze about Hadoop?
    Volume: more and more data being generated; machine-generated data is increasing
    Velocity: data coming in at higher speed
    Variety: audio, video, images, log files, web pages, social network connections, etc.
  • We Need a System that Scales
    Too much data for traditional tools
    Two key problems:
    How to reliably store this data at a reasonable cost
    How to process all the data we've stored
  • What is Apache Hadoop?
    Scalable data storage and processing
    Distributed and fault-tolerant
    Runs on standard hardware
    Two main components
    Storage: the Hadoop Distributed File System (HDFS)
    Processing: MapReduce
    Hadoop clusters are composed of computers called nodes
    Clusters range from a single node up to several thousand nodes
  • How Did Apache Hadoop Originate?
    Heavily influenced by Google's architecture, notably the Google File System and MapReduce papers
    Other Web companies quickly saw the benefits; early adoption by Yahoo, Facebook, and others
    Timeline: 2002, Nutch spun off from Lucene; 2003, Google publishes GFS paper; 2004, Google publishes MapReduce paper; 2005, Nutch rewritten for MapReduce; 2006, Hadoop becomes Lucene subproject
  • Comparing Hadoop to Other Systems
    Monolithic systems don't scale
    Modern high-performance computing (HPC) systems are distributed; they spread computations across many machines in parallel
    Widely used for scientific applications
    Let's examine how a typical HPC system works
  • Architecture of a Typical HPC System
    (Diagram: compute nodes connected to a storage system over a fast network)
    Step 1: Copy input data from the storage system to the compute nodes
    Step 2: Process the data on the compute nodes
    Step 3: Copy output data back to the storage system
  • You Don't Just Need Speed
    The problem is that we have way more data than code:
      $ du -ks code/
      1,087
      $ du -ks data/
      854,632,947,314
  • You Need Speed At Scale
    (Diagram: the fast network between the storage system and the compute nodes becomes the bottleneck)
  • Hadoop Design Fundamental: Data Locality
    This is a hallmark of Hadoop's design:
    Don't bring the data to the computation; bring the computation to the data
    Hadoop uses the same machines for storage and processing
    Significantly reduces the need to transfer data across the network
  • Other Hadoop Design Fundamentals
    Machine failure is unavoidable: embrace it
    Build reliability into the system
    More is usually better than faster
    Throughput matters more than latency
  • The Hadoop Distributed Filesystem (HDFS)
  • HDFS: Hadoop Distributed File System
    Inspired by the Google File System
    Reliable, low-cost storage for massive amounts of data
    Similar to a UNIX filesystem in some ways:
    Hierarchical UNIX-style paths (e.g., /sales/alice.txt)
    UNIX-style file ownership and permissions
  • HDFS: Hadoop Distributed File System
    There are also some major deviations from UNIX filesystems:
    Highly optimized for processing data with MapReduce
    Designed for sequential access to large files
    Cannot modify file content once written
    It's actually a user-space Java process
    Accessed using special commands or APIs
    No concept of a current working directory
  • Copying Local Data To and From HDFS
    Remember that HDFS is distinct from your local filesystem
    hadoop fs -put copies local files to HDFS
    hadoop fs -get fetches a local copy of a file from HDFS
    (Diagram: a client machine issuing these commands against a Hadoop cluster)
      $ hadoop fs -put sales.txt /reports
      $ hadoop fs -get /reports/sales.txt
  • HDFS Demo
    I will now demonstrate the following (a sketch of the corresponding commands appears below):
    1. How to list the contents of a directory
    2. How to create a directory in HDFS
    3. How to copy a local file to HDFS
    4. How to display the contents of a file in HDFS
    5. How to remove a file from HDFS
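    For reference, here is a minimal sketch of the commands behind these five steps, assuming a local file named sales.txt and an HDFS directory named /reports (hypothetical names carried over from the previous slide):

      $ hadoop fs -ls /                       # 1. list the contents of a directory
      $ hadoop fs -mkdir /reports             # 2. create a directory in HDFS
      $ hadoop fs -put sales.txt /reports     # 3. copy a local file to HDFS
      $ hadoop fs -cat /reports/sales.txt     # 4. display the contents of a file in HDFS
      $ hadoop fs -rm /reports/sales.txt      # 5. remove a file from HDFS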
  • A Scalable Data Processing Framework: Data Processing with MapReduce
  • What is MapReduce?
    MapReduce is a programming model; it's a way of processing data
    You can implement MapReduce in any language
  • Understanding Map and Reduce
    You supply two functions to process data: Map and Reduce
    Map: typically used to transform, parse, or filter data
    Reduce: typically used to summarize results
    The Map function always runs first; the Reduce function runs afterwards, but is optional
    Each piece is simple, but can be powerful when combined
  • MapReduce Benefits
    Scalability: Hadoop divides the processing job into individual tasks, which execute in parallel (independently) across the cluster
    Simplicity: processes one record at a time
    Ease of use: Hadoop provides job scheduling and other infrastructure; far simpler for developers than typical distributed computing
  • MapReduce in Hadoop
    MapReduce processing in Hadoop is batch-oriented
    A MapReduce job is broken down into smaller tasks; tasks run concurrently, each processing a small amount of the overall input
    MapReduce code for Hadoop is usually written in Java, using Hadoop's API directly
    You can do basic MapReduce in other languages using the Hadoop Streaming wrapper program
    Some advanced features require Java code
  • MapReduce Example in Python
    The following example uses Python via Hadoop Streaming
    It processes log files and summarizes events by type
    I'll explain both the data flow and the code
  • Job Input
    Here's the job input; each map task gets a chunk of this data to process, typically corresponding to a single block in HDFS
      2013-06-29 22:16:49.391 CDT INFO "This can wait"
      2013-06-29 22:16:52.143 CDT INFO "Blah blah blah"
      2013-06-29 22:16:54.276 CDT WARN "This seems bad"
      2013-06-29 22:16:57.471 CDT INFO "More blather"
      2013-06-29 22:17:01.290 CDT WARN "Not looking good"
      2013-06-29 22:17:03.812 CDT INFO "Fairly unimportant"
      2013-06-29 22:17:05.362 CDT ERROR "Out of memory!"
  • Python Code for Map Function
      #!/usr/bin/env python
      import sys

      # Define the list of known log levels
      levels = ['TRACE', 'DEBUG', 'INFO', 'WARN', 'ERROR', 'FATAL']

      # Read records from standard input; use whitespace to split into fields
      for line in sys.stdin:
          fields = line.split()
          # Extract the level field and convert to uppercase for consistency
          level = fields[3].upper()
          # If it matches a known level, print it, a tab separator, and the
          # literal value 1 (since the level can only occur once per line)
          if level in levels:
              print "%s\t1" % level
  • Output of Map Function
    The map function produces key/value pairs as output:
      INFO 1
      INFO 1
      WARN 1
      INFO 1
      WARN 1
      INFO 1
      ERROR 1
  • The Shuffle and Sort
    Hadoop automatically merges, sorts, and groups map output; the result is passed as input to the reduce function (more on this later)
    Map output: INFO 1, INFO 1, WARN 1, INFO 1, WARN 1, INFO 1, ERROR 1
    Reduce input (after shuffle and sort): ERROR 1, INFO 1, INFO 1, INFO 1, INFO 1, WARN 1, WARN 1
  • Input to Reduce Function
    The reduce function receives a key and all values for that key
    Keys are always passed to reducers in sorted order; although not obvious here, values are unordered
      ERROR 1
      INFO 1
      INFO 1
      INFO 1
      INFO 1
      WARN 1
      WARN 1
  • Python Code for Reduce Function
      #!/usr/bin/env python
      import sys

      # Initialize loop variables
      previous_key = None
      sum = 0

      for line in sys.stdin:
          # Extract the key and value passed via standard input
          key, value = line.split()
          if key == previous_key:
              # If the key is unchanged, increment the count
              sum = sum + int(value)
          else:
              # If the key changed, print data for the old level
              if previous_key:
                  print '%s\t%i' % (previous_key, sum)
              # Start tracking data for the new record
              previous_key = key
              sum = 1

      # Print data for the final key
      print '%s\t%i' % (previous_key, sum)
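    Because Streaming scripts simply read standard input and write standard output, you can test the whole pipeline locally before running it on a cluster, with the UNIX sort command standing in for Hadoop's shuffle and sort. A minimal sketch, assuming the two scripts are saved as map.py and reduce.py (hypothetical names) and marked executable, with the sample log saved as access.log (also hypothetical):

      $ cat access.log | ./map.py | sort | ./reduce.py
      ERROR   1
      INFO    4
      WARN    2

    The counts match the reduce output shown on the next slide.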
  • Output of Reduce Function
    Its output is a sum for each level:
      ERROR 1
      INFO 4
      WARN 2
  • Recap of Data Flow
    Map input: the raw log lines shown earlier
    Map output: INFO 1, INFO 1, WARN 1, INFO 1, WARN 1, INFO 1, ERROR 1
    Shuffle and sort
    Reduce input: ERROR 1, INFO 1, INFO 1, INFO 1, INFO 1, WARN 1, WARN 1
    Reduce output: ERROR 1, INFO 4, WARN 2
  • How to Run a Hadoop Streaming Job
    I'll demonstrate this now (a sketch of the command appears below)
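    For reference, a minimal sketch of such a command. The streaming jar's location varies by distribution and version, and the script names and HDFS paths here are assumptions carried over from the example:

      $ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
          -input /logs \
          -output /logs_output \
          -mapper map.py \
          -reducer reduce.py \
          -file map.py \
          -file reduce.py

    The -file options ship the local scripts to the cluster nodes, and the -output directory must not already exist when the job starts.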
  • Open Source Tools that Complement Hadoop: The Hadoop Ecosystem
  • The Hadoop Ecosystem
    "Core Hadoop" consists of HDFS and MapReduce; these are the kernel of a much broader platform
    Hadoop has many related projects: some help you integrate Hadoop with other systems, others help you analyze your data
    These are not considered core Hadoop; rather, they're part of the Hadoop ecosystem
    Many are also open source Apache projects
  • Visual Overview of a Complete Workflow
    (Diagram: a Hadoop cluster with Impala, surrounded by the following activities)
    Import transaction data from an RDBMS
    Sessionize Web log data with Pig
    Sentiment analysis on social media with Hive
    Analyst uses Impala for business intelligence
    Generate nightly reports using Pig, Hive, or Impala
    Build product recommendations for the Web site
  • Key Points
    We're generating massive volumes of data, and this data can be extremely valuable; companies can now analyze what they previously discarded
    Hadoop supports large-scale data storage and processing
    Heavily influenced by Google's architecture; already in production at thousands of organizations
    HDFS is Hadoop's storage layer; MapReduce is Hadoop's processing framework
    Many ecosystem projects complement Hadoop: some help you integrate Hadoop with existing systems, others help you analyze the data you've stored
  • Highly Recommended Books
    Hadoop: The Definitive Guide by Tom White (ISBN: 1-449-31152-0)
    Hadoop Operations by Eric Sammer (ISBN: 1-449-32705-2)
  • Questions?
    Thank you for attending! I'll be happy to answer any additional questions now
    Demo and slides at github.com/markgrover/hadoop-intro-fast
    Twitter: @mark_grover
    Survey page: tiny.cloudera.com/mark