HADOOP ADMIN: Session -2 What is Hadoop?. AGENDA Hadoop Demo using Cygwin HDFS Daemons Map Reduce...


  • Slide 1
  • HADOOP ADMIN: Session -2 What is Hadoop?
  • Slide 2
  • AGENDA
      • Hadoop demo using Cygwin
      • HDFS daemons
      • MapReduce daemons
      • Hadoop ecosystem projects
  • Slide 3
  • Hadoop Using Cygwin
      • What is Cygwin?
      • Hadoop needs Java version 1.6 or higher
      • bin/hadoop
      • bin/hadoop jar hadoop-examples-1.0.4.jar wordcount input output
      • Word count example
      • Tokenization problem
      • Modifying the program
  • Slide 4
  • HDFS Daemons
      • NameNode (how many? 1): file metadata and block map; the metadata is held in RAM
      • Secondary NameNode: housekeeping and transaction log checkpointing; not a backup node or standby node
      • DataNode (how many? many): block data (file contents); block reports and heartbeats
      • Checkpointing labels in the diagram: roll edits; copy fsimage and edits; replay all edits and create a new fsimage; rename new edits; send new fsimage
      • Other diagram labels: Read, Data Block 1, Data Node 1
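
The checkpointing labels on this slide (roll edits, copy fsimage and edits, replay all edits, send new fsimage) can be illustrated with a toy model. This is only a sketch under stated assumptions: the dictionary-based `namenode`, the `("create" | "delete", path)` edit format, and the function names are hypothetical and are not Hadoop APIs.

```python
def replay_edits(fsimage, edits):
    """Apply logged operations to a copy of the image and return the new image."""
    new_image = dict(fsimage)
    for op, path in edits:
        if op == "create":
            new_image[path] = []          # new file with no blocks yet
        elif op == "delete":
            new_image.pop(path, None)
        else:
            raise ValueError(f"unknown op: {op}")
    return new_image


def checkpoint(namenode):
    """Toy checkpoint cycle; `namenode` is a hypothetical dict, not a Hadoop object."""
    # Roll edits: the NameNode starts a fresh log for new operations.
    old_edits = namenode["edits"]
    namenode["edits"] = []
    # Copy fsimage and edits to the Secondary NameNode.
    image_copy, edits_copy = dict(namenode["fsimage"]), list(old_edits)
    # Replay all edits and create a new fsimage.
    new_image = replay_edits(image_copy, edits_copy)
    # Send the new fsimage back to the NameNode.
    namenode["fsimage"] = new_image
    return namenode


nn = {"fsimage": {"/a": []}, "edits": [("create", "/b"), ("delete", "/a")]}
checkpoint(nn)
print(nn["fsimage"])  # {'/b': []}
```

The point of the model is that the merged image replaces a growing edit log, which is housekeeping work rather than a standby copy of the NameNode.
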
  • Slide 5
  • MapReduce V1 Daemons: JobTracker, TaskTracker
  • Slide 6
  • Word Count over a Given Set of Web Pages
      • Input pages: "see bob throw" and "see spot run"
      • Per-page counts: see 1, bob 1, throw 1; see 1, spot 1, run 1
      • Combined result: bob 1, run 1, see 2, spot 1, throw 1
      • Can we do word count in parallel?
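
The word count on this slide can be sketched as a map step, a grouping step, and a reduce step. This is a minimal single-process illustration in plain Python, not Hadoop code; the function names are made up for the example. Each call to `map_words` is independent of the others, which is what makes the map step parallelizable.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(word, counts):
    """Reduce step: sum all the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_words(line)]
    grouped = shuffle(pairs)
    return dict(reduce_counts(w, c) for w, c in sorted(grouped.items()))


print(word_count(["see bob throw", "see spot run"]))
# {'bob': 1, 'run': 1, 'see': 2, 'spot': 1, 'throw': 1}
```
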
  • Slide 7
  • The MapReduce Framework (pioneered by Google)
  • Slide 8
  • Automatic Parallel Execution in MapReduce (Google): handles failures automatically, e.g., restarts tasks if a node fails; runs multiple copies of the same task to avoid a slow task slowing down the whole job
  • Slide 9
  • MapReduce in Hadoop (1)
  • Slide 10
  • MapReduce in Hadoop (2)
  • Slide 11
  • Data Flow in a MapReduce Program in Hadoop: InputFormat -> Map function -> Partitioner -> Sorting & Merging -> Combiner -> Shuffling -> Merging -> Reduce function -> OutputFormat (1:many)
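
Two of the stages named on this slide, the partitioner and the combiner, can be sketched in a few lines. This is an illustration only, assuming word-count style `(word, count)` pairs: the function names are hypothetical, and `zlib.crc32` is a stand-in for the key hash, chosen because Python's built-in `hash()` for strings varies between runs.

```python
import zlib
from collections import Counter


def partition(key, num_reducers):
    """Partitioner: choose the reducer that receives every pair with this key."""
    # crc32 is an assumed stand-in for the key's hash code.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def combine(pairs):
    """Combiner: pre-sum counts on the map side to shrink shuffle traffic."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())


def map_side(line, num_reducers):
    """Map one line, combine its output, then bucket pairs by target reducer."""
    pairs = [(word, 1) for word in line.split()]
    buckets = {r: [] for r in range(num_reducers)}
    for word, count in combine(pairs):
        buckets[partition(word, num_reducers)].append((word, count))
    return buckets


buckets = map_side("see spot see run", num_reducers=2)
```

Because the partition depends only on the key, every occurrence of a word from every mapper lands on the same reducer, which is what lets the reduce function see all counts for that word.
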
  • Slide 12
  • Lifecycle of a MapReduce Job Map function Reduce function Run this program as a MapReduce job
  • Slide 13
  • Lifecycle of a MapReduce Job Map function Reduce function Run this program as a MapReduce job
  • Slide 14
  • Lifecycle of a MapReduce Job (timeline): Input Splits; Map Wave 1, Map Wave 2; Reduce Wave 1, Reduce Wave 2. How are the number of splits, number of map and reduce tasks, memory allocation to tasks, etc., determined?
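
As a rough back-of-the-envelope answer to the question on this slide, the number of splits and waves can be estimated with ceiling division. This is a simplified sketch, assuming one split per block-sized chunk of input, one map task per split, and a fixed number of task slots; the 1000 MB file, 64 MB split size, and 8 slots are illustrative values, and real jobs depend on the configured parameters.

```python
def ceil_div(a, b):
    """Integer ceiling division, avoiding floating-point rounding."""
    return (a + b - 1) // b


def num_splits(file_size_bytes, split_size_bytes):
    """Assumed model: one input split per split-sized chunk of the file."""
    return ceil_div(file_size_bytes, split_size_bytes)


def num_waves(num_tasks, total_slots):
    """Tasks run in waves when there are more tasks than available slots."""
    return ceil_div(num_tasks, total_slots)


MB = 1024 * 1024
splits = num_splits(1000 * MB, 64 * MB)   # illustrative file and split sizes
waves = num_waves(splits, total_slots=8)  # illustrative cluster capacity
print(splits, waves)  # 16 2
```
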
  • Slide 15
  • Job Configuration Parameters: 190+ parameters in Hadoop; set manually, or defaults are used
  • Slide 16
  • Hadoop Ecosystem/Sub Projects: Pig, HBase, Sqoop, Hive
  • Slide 17
  • PIG
      • One frequent complaint about MapReduce is that it's difficult to program; another criticism is that the development cycle is very long.
      • As you implement a program in MapReduce, you'll have to think at the level of mapper and reducer functions and job chaining.
      • Pig started as a research project within Yahoo! in the summer of 2006, joining the Apache Incubator in September of 2007.
      • Pig is a dataflow programming environment for processing very large files. Pig's language is called Pig Latin.
      • Pig is a Hadoop extension that simplifies Hadoop programming by giving you a high-level data processing language while keeping Hadoop's simple scalability and reliability.
      • Yahoo! runs 40% of all its Hadoop jobs with Pig, and Twitter uses Pig as well. Indeed, it was created at Yahoo! to make it easier for researchers and engineers to mine the huge datasets there.
  • Slide 18
  • PIG: What it looks like. Callouts on the example: "Not a variable, a relation"; "Loads a data file into a relation, with a defined schema"
  • Slide 19
  • Word count example in PIG
      • text = LOAD 'text' USING TextLoader();   (loads each line as one column)
      • tokens = FOREACH text GENERATE FLATTEN(TOKENIZE($0)) AS word;
      • wordcount = FOREACH (GROUP tokens BY word) GENERATE group AS word, COUNT_STAR($1);
      • Diagram: Pig job -> MR transformation -> MR jobs -> HDFS
  • Slide 20
  • PIG vs Hive
      • Pig is a new language, easy to learn if you know languages similar to Perl.
      • Hive is a subset of SQL with very simple variations to enable MapReduce-like computation.
      • So, if you come from a SQL background, you will find HiveQL extremely easy to pick up (many of your SQL queries will run as is), while if you come from a procedural programming background (without SQL knowledge), Pig will be much more suitable for you.
      • Hive is a bit easier to integrate with other systems and tools since it speaks the language they already speak (i.e., SQL).
      • Ultimately, the choice between Hive and Pig will depend on the exact requirements of the application domain and the preferences of the implementers and those writing queries.
  • Slide 21
  • HIVE (HQL)
      • Hive is a data warehouse infrastructure built on top of Hadoop that can compile SQL queries into MapReduce jobs and run them on a Hadoop cluster.
      • Invented at Facebook to solve their own problems.
      • Provides a SQL-like query language (HQL/HiveQL) to retrieve and process the data.
      • JDBC/ODBC access is provided.
      • Currently also used together with HBase.
  • Slide 22
  • HBase
      • HBase is not a high-level language that compiles to MapReduce; it is about allowing Hadoop to support lookups/transactions on key/value pairs.
      • HBase allows you to do quick random lookups instead of scanning all of the data sequentially, and to insert, update, or delete in the middle of the data, not just add/append.
  • Slide 23
  • Sqoop
      • Loads bulk data into Hadoop from relational databases.
      • Imports individual tables or entire databases to files in HDFS.
      • Provides the ability to import from SQL databases straight into your Hive data warehouse.
      • Importing the USERS table into HDFS could be done with the command:
          $ sqoop --connect jdbc:mysql://db.example.com/website --table USERS --local --hive-import