Hadoop Workshop on EC2 : March 2015

Danairat T., 2013, [email protected], Big Data Hadoop – Hands On Workshop

Big Data using Hadoop

Hands On Workshop

March 2015

Dr. Thanachart Numnonda, Certified Java Programmer

[email protected]

Danairat T., Certified Java Programmer, TOGAF – Silver

[email protected], +66-81-559-1446


Danairat T., [email protected]; Thanachart Numnonda, [email protected]; Aug 2013; Big Data Hadoop on Amazon EMR – Hands On Workshop

Hands-On: Launch a virtual server on EC2 Amazon Web Services




Hadoop Installation

Hadoop provides three installation choices:

1. Local mode: an unzip-and-run mode to get you started right away, where all parts of Hadoop run within the same JVM

2. Pseudo-distributed mode: the different parts of Hadoop run as separate Java processes, but within a single machine

3. Distributed mode: the real setup, which spans multiple machines




Virtual Server

This lab uses an EC2 virtual server to install a Hadoop server with the following features:

● Ubuntu Server 14.04 LTS
● m3.medium: 1 vCPU, 3.75 GB memory
● Security group: default
● Key pair: imchadoop




Select the EC2 service and click Launch Instance




Select an Amazon Machine Image (AMI): Ubuntu Server 14.04 LTS (PV)




Choose the m3.medium instance type




Leave configuration details as default




Add Storage: 20 GB




Name the instance




Select an existing security group > Select Security Group Name: default




Click Launch and choose imchadoop as a key pair




Review the instance, then click Connect for instructions on connecting to it




Connect to an instance from Mac/Linux




Connect to an instance from Windows using Putty




Connect to the instance




Hands-On: Installing Hadoop




Installing Hadoop and Ecosystem

1. Update the system

2. Configuring SSH

3. Installing JDK 1.7

4. Download/Extract Hadoop

5. Installing Hadoop

6. Configure xml files

7. Formatting HDFS

8. Start Hadoop

9. Hadoop Web Console

10. Stop Hadoop

Notes:

Hadoop and IPv6: Apache Hadoop is not currently supported on IPv6 networks. It has only been tested and developed on IPv4 stacks. Hadoop needs IPv4 to work, and only IPv4 clients can talk to the cluster. If your organisation moves to IPv6 only, you will encounter problems. Source: http://wiki.apache.org/hadoop/HadoopIPv6




1) Update the system: sudo apt-get update




2. Configuring SSH: ssh-keygen




Enabling SSH access to your local machine

$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Testing the SSH setup by connecting to your local machine

$ ssh 54.68.149.232

Type Exit

$ exit




3) Install JDK 1.7: sudo apt-get install openjdk-7-jdk

(Enter Y when prompted)

(Type command > java -version)




4) Download/Extract Hadoop

1) Type command > wget http://mirror.issp.co.th/apache/hadoop/common/hadoop-1.2.1/hadoop-1.2.1.tar.gz

2) Type command > tar -xvzf hadoop-1.2.1.tar.gz

3) Type command > sudo mv hadoop-1.2.1 /usr/local/hadoop




5) Installing Hadoop

1) Type command > sudo vi $HOME/.bashrc

2) Add the configuration shown in the figure below

3) Type command > exec bash

4) Type command > sudo vi /usr/local/hadoop/conf/hadoop-env.sh

5) Edit the file as shown in the figure below




6) Configuring Hadoop conf/*-site.xml

1. core-site.xml (hadoop.tmp.dir, fs.default.name)

2. hdfs-site.xml (dfs.replication)

3. mapred-site.xml (mapred.job.tracker)




Configuring core-site.xml

1) Type command > sudo vi /usr/local/hadoop/conf/core-site.xml

2) Add the private IP of the server as shown in the figure below

(in this case a private IP is 172.31.12.11)




Configuring mapred-site.xml

1) Type command > sudo vi /usr/local/hadoop/conf/mapred-site.xml

2) Add the private IP of the JobTracker server as shown in the figure below

(in this case a private IP is 172.31.12.11)




Configuring hdfs-site.xml

1) Type command > sudo vi /usr/local/hadoop/conf/hdfs-site.xml

2) Add the configuration as shown in the figure below




7) Formatting HDFS

1) Type command > sudo mkdir /usr/local/hadoop/tmp

2) Type command > sudo chown ubuntu /usr/local/hadoop

3) Type command > sudo chown ubuntu /usr/local/hadoop/tmp

4) Type command > hadoop namenode -format




Starting Hadoop

ubuntu@ip-172-31-12-11:~$ start-all.sh

Starting up a Namenode, Datanode, Jobtracker and a Tasktracker on your machine.

ubuntu@ip-172-31-12-11:~$ jps

11567 Jps

10766 NameNode

11099 JobTracker

11221 TaskTracker

10899 DataNode

11018 SecondaryNameNode

ubuntu@ip-172-31-12-11:~$

Check the Java processes with jps; Hadoop is now running in pseudo-distributed mode




Hadoop is up!

Viewing the Hadoop HDFS using WebUI http://54.68.149.232:50070/




Stopping Hadoop

ubuntu@ip-172-31-12-11:~$ /usr/local/hadoop/bin/stop-all.sh

stopping jobtracker

localhost: stopping tasktracker

stopping namenode

localhost: stopping datanode

localhost: stopping secondarynamenode




Hands-On: Importing Data to HDFS using Hadoop Command Line




Importing Data to Hadoop

Download War and Peace Full Text

www.gutenberg.org/ebooks/2600




Importing Data to Hadoop

Download the file pg2600.txt

$ wget https://dl.dropboxusercontent.com/u/12655380/pg2600.txt

$ hadoop fs -mkdir /input

$ hadoop fs -mkdir /output

$ hadoop fs -copyFromLocal pg2600.txt /input

Import to Hadoop





Hands-On: Reviewing, Retrieving, Deleting Data from HDFS




Review file in Hadoop HDFS

List HDFS File

Read HDFS File

ubuntu@ip-172-31-12-11:~$ hadoop fs -cat /input/pg2600.txt

Retrieve HDFS File to Local File System

ubuntu@ip-172-31-12-11:~$ hadoop fs -copyToLocal /input/pg2600.txt /tmp/file.txt

Please see also http://hadoop.apache.org/docs/r1.0.4/commands_manual.html




Review file in Hadoop HDFS using WebUI




Hadoop Port Numbers

Daemon                    Default Port   Configuration Parameter in conf/*-site.xml
HDFS  Namenode            50070          dfs.http.address
HDFS  Datanodes           50075          dfs.datanode.http.address
HDFS  Secondarynamenode   50090          dfs.secondary.http.address
MR    JobTracker          50030          mapred.job.tracker.http.address
MR    Tasktrackers        50060          mapred.task.tracker.http.address




Review Content from System shell




Removing data from HDFS using Shell Command

[hdadmin@localhost detach]$ hadoop dfs -rm /input/input_test.txt

Deleted hdfs://localhost:54310/input/input_test.txt

[hdadmin@localhost detach]$




Lecture: Understanding MapReduce Processing

[Diagram: Client; Name Node / Job Tracker; three Data Node / Task Tracker pairs; Map Reduce]




High Level Architecture of MapReduce




Before MapReduce…

● Large scale data processing was difficult!
  – Managing hundreds or thousands of processors
  – Managing parallelization and distribution
  – I/O Scheduling
  – Status and monitoring
  – Fault/crash tolerance

● MapReduce provides all of these, easily!

Source: http://labs.google.com/papers/mapreduce-osdi04-slides/index-auto-0002.html


MapReduce Overview

● What is it?
  – Programming model used by Google
  – A combination of the Map and Reduce models with an associated implementation
  – Used for processing and generating large data sets


MapReduce Overview

● How does it solve our previously mentioned problems?
  – MapReduce is highly scalable and can be used across many computers.
  – Many small machines can be used to process jobs that normally could not be processed by a large machine.


MapReduce Framework

Source: www.bigdatauniversity.com




How Map and Reduce Work Together


How Map and Reduce Work Together

● Map returns information
● Reduce accepts information
● Reduce applies a user-defined function to reduce the amount of data


Map Abstraction

● Inputs a key/value pair
  – Key is a reference to the input value
  – Value is the data set on which to operate
● Evaluation
  – Function defined by user
  – Applies to every value in value input
● Might need to parse input
● Produces a new list of key/value pairs
  – Can be different type from input pair


Reduce Abstraction

● Starts with intermediate Key / Value pairs
● Ends with finalized Key / Value pairs
● Starting pairs are sorted by key
● Iterator supplies the values for a given key to the Reduce function.


Reduce Abstraction

● Typically a function that:
  – Starts with a large number of key/value pairs
    ● One key/value for each word in all files being grepped (including multiple entries for the same word)
  – Ends with very few key/value pairs
    ● One key/value for each unique word across all the files with the number of instances summed into this entry
● Broken up so a given worker works with input of the same key.


Other Applications

● Yahoo!
  – Webmap application uses Hadoop to create a database of information on all known webpages
● Facebook
  – Hive data center uses Hadoop to provide business statistics to application developers and advertisers
● Rackspace
  – Analyzes server log files and usage data using Hadoop


Why is this approach better?

● Creates an abstraction for dealing with complex overhead
  – The computations are simple, the overhead is messy
● Removing the overhead makes programs much smaller and thus easier to use
  – Less testing is required as well. The MapReduce libraries can be assumed to work properly, so only user code needs to be tested
● Division of labor also handled by the MapReduce libraries, so programmers only need to focus on the actual computation


MapReduce Framework

map: (K1, V1) -> list(K2, V2)

reduce: (K2, list(V2)) -> list(K3, V3)
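The two signatures can be tried out without a cluster. The following is a minimal plain-Java sketch, not Hadoop code: the class name MapReduceSketch and its helper run() are made up for illustration, and a TreeMap stands in for the framework's sort and group step.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of map -> group by key -> reduce (no Hadoop classes involved).
class MapReduceSketch {

    // map: (K1, V1) -> list(K2, V2); here (line offset, line text) -> list of (word, 1)
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new SimpleEntry<String, Integer>(word, 1));
            }
        }
        return out;
    }

    // reduce: (K2, list(V2)) -> (K3, V3); here (word, [1, 1, ...]) -> (word, sum)
    static Map.Entry<String, Integer> reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return new SimpleEntry<String, Integer>(key, sum);
    }

    // Runs map over every line, groups values by key, then reduces each group.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        long offset = 0;
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(offset, line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<Integer>()).add(kv.getValue());
            }
            offset += line.length() + 1; // next line starts after this line and its newline
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            Map.Entry<String, Integer> r = reduce(e.getKey(), e.getValue());
            result.put(r.getKey(), r.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {Bear=2, Car=3, Deer=2, River=2}
        System.out.println(run(Arrays.asList("Deer Bear River", "Car Car River", "Deer Car Bear")));
    }
}
```

In a real job the grouping happens across machines; here it is a single in-memory map, which is enough to see how the types line up.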




How does the MapReduce work?

[Figure: MapReduce data flow. Labels: InputSplit, RecordReader, map output as a list of (Key, Value), Sorting, Partitioning, intermediate file as a list of (Key, List of Values), RecordWriter]




How does the MapReduce work?

[Figure: word-count data flow with a combiner. Labels: InputSplit, RecordReader, map output as a list of (Key, Value), Combining (e.g. Car, 2), Sorting, Partitioning, grouped output as a list of (Key, List of Values): Bear, {1,1}; Car, {2,1}; River, {1,1}; Deer, {1,1}; RecordWriter]




MapReduce Processing – The Dataflow

1. InputFormat, InputSplits, RecordReader

2. Mapper - your focus is here

3. Partition, Shuffle & Sort

4. Reducer - your focus is here

5. OutputFormat, RecordWriter




InputFormat

InputFormat               Description                                        Key                                        Value
TextInputFormat           Default format; reads lines of text files          The byte offset of the line                The line contents
KeyValueTextInputFormat   Parses lines into key, val pairs                   Everything up to the first tab character   The remainder of the line
SequenceFileInputFormat   A Hadoop-specific high-performance binary format   user-defined                               user-defined




InputSplit

An InputSplit describes a unit of work that comprises a single map task.

InputSplit presents a byte-oriented view of the input.

You can influence the split size by setting the mapred.min.split.size parameter in core-site.xml, or by overriding the parameter in the JobConf object used to submit a particular MapReduce job.

RecordReader

RecordReader reads <key, value> pairs from an InputSplit.

Typically the RecordReader converts the byte-oriented view of the input, provided by the InputSplit, and presents a record-oriented view to the Mapper.



http://hadoop.apache.org/docs/r1.1.1/api/org/apache/hadoop/mapred/RecordReader.html
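To make the byte-oriented versus record-oriented distinction concrete, here is a small plain-Java sketch. It is not the Hadoop RecordReader interface; LineRecordSketch and readRecords() are hypothetical names, and the sketch only mimics the (byte offset, line contents) records that the default text input format is described as producing.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: turns a byte-oriented split into (byte offset, line) records.
class LineRecordSketch {

    static Map<Long, String> readRecords(byte[] split) {
        Map<Long, String> records = new LinkedHashMap<>();
        int start = 0;
        for (int i = 0; i <= split.length; i++) {
            boolean atEnd = (i == split.length);
            if (atEnd || split[i] == '\n') {
                // Emit a record unless this is the empty tail after a final newline.
                if (i > start || !atEnd) {
                    String line = new String(split, start, i - start, StandardCharsets.UTF_8);
                    records.put((long) start, line); // key = byte offset where the line starts
                }
                start = i + 1;
            }
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "Deer Bear River\nCar Car River\n".getBytes(StandardCharsets.UTF_8);
        for (Map.Entry<Long, String> r : readRecords(data).entrySet()) {
            System.out.println(r.getKey() + "\t" + r.getValue());
        }
    }
}
```

The Mapper in the WordCount example receives exactly this shape of input: a LongWritable offset as the key and the line as a Text value.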


Mapper

Mapper: The Mapper applies the user-defined logic to each input (key, value) pair and emits (key, value) pair(s), which are forwarded to the Reducers.

Partition, Shuffle & Sort

After the first map tasks have completed, the nodes may still be performing several more map tasks each. But they also begin exchanging the intermediate outputs from the map tasks to where they are required by the reducers.

The Partitioner controls how map outputs are assigned to reduce tasks. The total number of partitions is the same as the number of reduce tasks for the job.

The set of intermediate keys on a single node is automatically sorted by Hadoop before they are presented to the Reducer.

This process of moving map outputs to the reducers is known as shuffling.
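The partitioning rule can be sketched in plain Java. The arithmetic below (clear the sign bit of the key's hash code, then take it modulo the number of reduce tasks) follows the approach of Hadoop's default hash partitioner; the class PartitionSketch and the shuffle() helper are illustrative names, and the in-memory TreeMap only imitates the per-partition sort.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

// Illustrative only: assigns map output keys to reduce tasks and sorts keys within each partition.
class PartitionSketch {

    // Hash-based partitioning: the same key always lands on the same reduce task.
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Routes each emitted (key, 1) pair to its partition; TreeMap keeps keys sorted.
    static List<TreeMap<String, List<Integer>>> shuffle(List<String> mapOutputKeys, int numReduceTasks) {
        List<TreeMap<String, List<Integer>>> partitions = new ArrayList<>();
        for (int i = 0; i < numReduceTasks; i++) {
            partitions.add(new TreeMap<String, List<Integer>>());
        }
        for (String key : mapOutputKeys) {
            partitions.get(partitionFor(key, numReduceTasks))
                      .computeIfAbsent(key, k -> new ArrayList<Integer>())
                      .add(1);
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("Deer", "Bear", "River", "Car", "Car", "River", "Deer", "Car", "Bear");
        List<TreeMap<String, List<Integer>>> partitions = shuffle(keys, 2);
        for (int i = 0; i < partitions.size(); i++) {
            System.out.println("reducer " + i + " <- " + partitions.get(i));
        }
    }
}
```

Because the partition depends only on the key, every value for a given word reaches the same reducer, which is what lets the reducer compute a complete sum.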




Reducer

This is an instance of user-provided code that reads each key and its iterator of values in the assigned partition. The OutputCollector object in the Reducer phase has a method named collect(), which collects a (key, value) output.

OutputFormat, RecordWriter

OutputFormat governs the writing format in OutputCollector, and RecordWriter writes output into HDFS.

OutputFormat               Description
TextOutputFormat           Default; writes lines in "key \t value" form
SequenceFileOutputFormat   Writes binary files suitable for reading into subsequent MapReduce jobs
NullOutputFormat           Generates no output files




Hands-On: Writing your own MapReduce Program




Wordcount (Hello World in Hadoop)

package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {

    // Mapper: emits (word, 1) for every token in the input line
    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }




Wordcount (Hello World in Hadoop)

    // Reducer: sums the counts collected for each word
    public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }




Wordcount (Hello World in Hadoop)

    // Driver: args[0] is the HDFS input path, args[1] is the HDFS output path
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}




Hands-On: Packaging MapReduce and Deploying to Hadoop Runtime Environment




Packaging Map Reduce Program

Usage

Assuming HADOOP_HOME is the root of the installation and HADOOP_VERSION is the Hadoop version installed, compile WordCount.java and create a jar:

$ wget https://dl.dropboxusercontent.com/u/12655380/WordCount.java

$ mkdir hduser

$ javac -classpath /usr/local/hadoop/hadoop-core-1.2.1.jar -d hduser WordCount.java

$ jar -cvf ./wordcount.jar -C hduser/ .

$ hadoop jar ./wordcount.jar org.myorg.WordCount /input/* /output/wordcount_output_dir

Output:

…….

$ hadoop fs -cat /output/wordcount_output_dir/part-00000




Reviewing MapReduce Output Result




Hands-On: Writing Map/Reduce Program on Eclipse




Starting Eclipse




Create a Java Project

Let's name it HadoopWordCount




Add dependencies to the project

● Add the following two JARs to your build path: hadoop-common.jar and hadoop-mapreduce-client-core.jar. Both can be found at /usr/lib/hadoop/client
● Perform the following steps:
  – Add a folder named lib to the project
  – Copy the mentioned JARs into this folder
  – Right-click on the project name >> select Build Path >> then Configure Build Path
  – Click on Add Jars, select these two JARs from the lib folder


Add dependencies to the project


Writing a source code

● Right-click the project, then select New >> Package
● Name the package org.myorg
● Right-click org.myorg, then select New >> Class
● Name the class WordCount
● Write the source code as shown in the previous slides


Building a Jar file

● Right-click the project, then select Export
● Select Java and then JAR file
● Provide the JAR name, as wordcount.jar
● Leave the JAR package options as default
● In the JAR Manifest Specification section, at the bottom, specify the Main class
● In this case, select WordCount
● Click on Finish
● The JAR file will be built and will be located at cloudera/workspace

Note: you may need to resize the dialog font by selecting Windows >> Preferences >> Appearance >> Colors and Fonts


Lecture: Understanding Hive




Introduction: A Petabyte Scale Data Warehouse Using Hadoop

Hive was developed by Facebook and is designed to enable easy data summarization, ad-hoc querying and analysis of large volumes of data. It provides a simple query language called HiveQL, which is based on SQL.




What Hive is NOT

Hive is not designed for online transaction processing and does not offer real-time queries and row level updates. It is best used for batch jobs over large sets of immutable data (like web logs, etc.).




Hive Metastore

● Store Hive metadata

● Configurations

– Embedded: in-process metastore, in-process database

– Local: in-process metastore, out-of-process database

– Remote: out-of-process metastore, out-of-process database


Hive Schema-On-Read

● Faster loads into the database (simply copy or move)

● Slower queries

● Flexibility – multiple schemas for the same data
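Schema-on-read can be pictured with a small plain-Java sketch: the stored line never changes, and column names are applied only when the line is read. The class SchemaOnReadSketch, the read() helper and the sample record are all illustrative assumptions, not part of Hive.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: applies a schema to a raw delimited line at read time.
class SchemaOnReadSketch {

    static Map<String, String> read(String rawLine, String delimiterRegex, String... columns) {
        String[] fields = rawLine.split(delimiterRegex, -1);
        Map<String, String> row = new LinkedHashMap<>();
        for (int i = 0; i < columns.length; i++) {
            // Columns with no matching field become null rather than failing the read.
            row.put(columns[i], i < fields.length ? fields[i] : null);
        }
        return row;
    }

    public static void main(String[] args) {
        String raw = "1|24|M|technician|85711"; // hypothetical sample record
        // Two different schemas over the same stored bytes.
        System.out.println(read(raw, "\\|", "userid", "age", "gender", "occupation", "zipcode"));
        System.out.println(read(raw, "\\|", "userid", "age"));
    }
}
```

Loading stays cheap because nothing is parsed at load time; the parsing cost is paid by every query instead, which matches the trade-off listed above.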


HiveQL

● Hive Query Language
● SQL dialect
● No support for:
  – UPDATE, DELETE
  – Transactions
  – Indexes
  – HAVING clause in SELECT
  – Updateable or materialized views
  – Stored procedures


Hive Tables

● Managed: CREATE TABLE
  – LOAD: File moved into Hive's data warehouse directory
  – DROP: Both data and metadata are deleted
● External: CREATE EXTERNAL TABLE
  – LOAD: No file moved
  – DROP: Only metadata deleted
  – Use when sharing data between Hive and Hadoop applications, or when you want to use multiple schemas on the same data


Running Hive

Hive Shell

● Interactive: hive
● Script: hive -f myscript
● Inline: hive -e 'SELECT * FROM mytable'

Hive.apache.org




System Architecture and Components

• Metastore: To store the metadata.
• Query compiler and execution engine: To convert SQL queries to a sequence of map/reduce jobs that are then executed on Hadoop.
• SerDe and ObjectInspectors: Programmable interfaces and implementations of common data formats and types. A SerDe is a combination of a Serializer and a Deserializer (hence, Ser-De). The Deserializer interface takes a string or binary representation of a record, and translates it into a Java object that Hive can manipulate. The Serializer, however, will take a Java object that Hive has been working with, and turn it into something that Hive can write to HDFS or another supported system.
• UDF and UDAF: Programmable interfaces and implementations for user defined functions (scalar and aggregate functions).
• Clients: Command line client similar to the MySQL command line.

hive.apache.org




Architecture Overview

[Figure: Hive architecture. Labels: Hive CLI (DDL, Queries, Browsing), Mgmt. Web UI, Thrift API, MetaStore, Hive QL (Parser, Planner, Execution), SerDe (Thrift, Jute, JSON..), Map Reduce, HDFS]

Hive.apache.org




Sample HiveQL

The query compiler uses the information stored in the metastore to convert SQL queries into a sequence of map/reduce jobs, e.g. the following queries:

SELECT * FROM t where t.c = 'xyz'

SELECT t1.c2 FROM t1 JOIN t2 ON (t1.c1 = t2.c1)

SELECT t1.c1, count(1) from t1 group by t1.c1

Hive.apache.org




Hands-On: Creating Table and Retrieving Data using Hive




Hive Hands-On Labs

1. Installing Hive

2. Configuring / Starting Hive

3. Creating Hive Table

4. Reviewing Hive Table in HDFS

5. Alter and Drop Hive Table

6. Preparing Dataset

7. Loading Data to Hive Table

8. Querying Data from Hive Table

9. Reviewing Hive Table Content from HDFS Command and WebUI




1. Installing Hive

# wget http://apache.mesi.com.ar/hive/hive-1.1.0/apache-hive-1.1.0-bin.tar.gz

# tar -xvzf apache-hive-1.1.0-bin.tar.gz

# sudo mv apache-hive-1.1.0-bin /usr/local

# rm apache-hive-1.1.0-bin.tar.gz

Install Hive binary file





1. Installing Hive: Edit $HOME/.bashrc

# sudo vi $HOME/.bashrc




2. Configuring Hive: Creating HDFS Directories for Hive

Create the HDFS /tmp/hive and /user/hive/warehouse directories

[hdadmin@localhost ~]$ hadoop fs -mkdir /tmp/hive

[hdadmin@localhost ~]$ hadoop fs -mkdir /user/hive/warehouse

[hdadmin@localhost ~]$ hadoop fs -chmod 777 /tmp/hive

[hdadmin@localhost ~]$ hadoop fs -chmod 777 /user/hive/warehouse




2. Starting Hive

hive> quit;

Quit from Hive




3. Creating Hive Table

hive (default)> CREATE TABLE test_tbl(id INT, country STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;

OK

Time taken: 4.069 seconds

hive (default)> show tables;

OK

test_tbl


hive (default)> describe test_tbl;

OK

id int

country string


hive (default)>

See also: https://cwiki.apache.org/Hive/languagemanual-ddl.html




4. Reviewing Hive Table in HDFS

[hdadmin@localhost hdadmin]$ hadoop fs -ls /user/hive/warehouse

Found 1 items

drwxr-xr-x - hdadmin supergroup 0 2013-03-17 17:51 /user/hive/warehouse/test_tbl

[hdadmin@localhost hdadmin]$

Review Hive Table from HDFS WebUI




5. Alter and Drop Hive Table

hive (default)> alter table test_tbl add columns (remarks STRING);

hive (default)> describe test_tbl;

OK

id int

country string

remarks string


hive (default)> drop table test_tbl;

OK


See also: https://cwiki.apache.org/Hive/adminmanual-metastoreadmin.html




6. Preparing Large Dataset

http://grouplens.org/datasets/movielens/




MovieLens Dataset

1) Type command > wget http://files.grouplens.org/datasets/movielens/ml-100k.zip

2) Type command > sudo apt-get install unzip

3) Type command > unzip ml-100k.zip

4) Type command > more ml-100k/u.user




6. Loading Data to Hive Table

hive (default)> exit;

ubuntu@ip-172-31-12-11:~/ml-100k$ hadoop fs -put u.user /dataset/movielens/users

Loading data to Hive table

$ hive

hive (default)> CREATE EXTERNAL TABLE users (userid INT, age INT,

gender STRING, occupation STRING, zipcode STRING) ROW FORMAT

DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE

LOCATION '/dataset/movielens/users';

Creating Hive table




7. Querying Data from Hive Table




8. Loading Data to test_tbl Table

$ hive

hive (default)> CREATE TABLE test_tbl(id INT, country STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;

Creating Hive table

hive (default)> LOAD DATA LOCAL INPATH '/tmp/test_tbl_data.csv' INTO TABLE test_tbl;

Copying data from file:/tmp/test_tbl_data.csv

Copying file: file:/tmp/test_tbl_data.csv

Loading data to table default.test_tbl

OK


hive (default)>

Loading data to Hive table




9. Reviewing Hive Table Content from HDFS Command and WebUI

[hdadmin@localhost hdadmin]$ hadoop fs -ls /user/hive/warehouse/test_tbl

Found 1 items

-rw-r--r-- 1 hdadmin supergroup 59 2013-03-17 18:08 /user/hive/warehouse/test_tbl/test_tbl_data.csv


[hdadmin@localhost hdadmin]$ hadoop fs -cat /user/hive/warehouse/test_tbl/test_tbl_data.csv

1,USA

62,Indonesia

63,Philippines

65,Singapore

66,Thailand





Loading Data to Hive Table

$ hive

hive (default)> CREATE TABLE products

(

prod_name STRING,

description STRING,

category STRING,

qty_on_hand INT,

prod_num STRING,

packaged_with ARRAY<STRING>

)

row format delimited

fields terminated by ','

collection items terminated by ':'

stored as textfile;

Creating Hive table
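The two delimiters in this table definition can be illustrated with a short plain-Java sketch: ',' separates the six fields, and ':' separates the items inside the last (ARRAY) column. ProductRowSketch, packagedWith() and the sample row are made up for illustration; this is not how Hive itself is implemented.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustrative only: splits one text row the way the products table describes its layout.
class ProductRowSketch {

    // Returns the items of the sixth column (packaged_with), split on ':'.
    static List<String> packagedWith(String line) {
        String[] fields = line.split(",", -1);
        String last = fields.length > 5 ? fields[5] : "";
        return last.isEmpty() ? Collections.<String>emptyList() : Arrays.asList(last.split(":"));
    }

    public static void main(String[] args) {
        // Hypothetical row: prod_name, description, category, qty_on_hand, prod_num, packaged_with
        String row = "widget,A small widget,tools,12,P-100,batteries:manual";
        System.out.println(packagedWith(row));
    }
}
```

A consequence worth noting is that field values themselves must not contain ',' or ':' in a text file laid out this way.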




Lecture: Understanding Pig




Introduction: A high-level platform for creating MapReduce programs using Hadoop

Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.




Pig Components

● Two Components
  – Language (Pig Latin)
  – Compiler
● Two Execution Environments
  – Local: pig -x local
  – Distributed: pig -x mapreduce




Running Pig

● Script: pig myscript
● Command line (Grunt): pig
● Embedded: writing a Java program




Pig Latin





Pig Execution Stages

Source: Introduction to Apache Hadoop-Pig, Prashant Kommireddi




Why Pig?

● Makes writing Hadoop jobs easier
  – 5% of the code, 5% of the time
  – You don't need to be a programmer to write Pig scripts
● Provides the major functionality required for Data Warehouse and Analytics
  – Load, Filter, Join, Group By, Order, Transform
● Users can write custom UDFs (User Defined Functions)

Source: Introduction to Apache Hadoop-Pig, Prashant Kommireddi




Pig vs. Hive




Hands-On: Running a Pig script




Installing Pig

# wget http://archive.apache.org/dist/hadoop/pig/stable/pig-0.7.0.tar.gz

# tar -xvzf pig-0.7.0.tar.gz

# sudo mv pig-0.7.0 /usr/local/

# rm pig-0.7.0.tar.gz

Install Pig binary file




Installing Pig: Edit $HOME/.bashrc





Starting Pig Command Line




countryFilter.pig

A = load 'hdi-data.csv' using PigStorage(',') AS (id:int, country:chararray, hdi:float, lifeex:int, mysch:int, eysch:int, gni:int);
B = FILTER A BY gni > 2000;
C = ORDER B BY gni;
dump C;

#Preparing Data

ubuntu@ip-172-31-12-11:~$ wget https://www.dropbox.com/s/pp168a6oiwqkxyu/hdi-data.csv

#Edit Your Script

ubuntu@ip-172-31-12-11:~$ vi countryFilter.pig

Writing a Pig Script





ubuntu@ip-172-31-12-11:~$ pig -x local

grunt> run countryFilter.pig

Running a Pig Script
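For comparison, the FILTER and ORDER steps of countryFilter.pig can be written out as a plain-Java sketch. It assumes the same seven-column, comma-separated layout that the script declares, with gni as the last column; CountryFilterSketch, filterAndOrder() and the sample rows are made up and are not taken from hdi-data.csv.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Illustrative only: the relational steps of the Pig script expressed with Java streams.
class CountryFilterSketch {

    // Keeps rows whose gni (column index 6) exceeds the threshold, then sorts ascending by gni.
    static List<String[]> filterAndOrder(List<String> csvLines, int minGniExclusive) {
        return csvLines.stream()
                .map(line -> line.split(","))
                .filter(cols -> Integer.parseInt(cols[6].trim()) > minGniExclusive)
                .sorted(Comparator.comparingInt((String[] cols) -> Integer.parseInt(cols[6].trim())))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up rows in the declared column order: id, country, hdi, lifeex, mysch, eysch, gni
        List<String> lines = Arrays.asList(
                "1,Alpha,0.90,80,12,17,47000",
                "2,Beta,0.40,55,4,9,1500",
                "3,Gamma,0.70,72,9,13,8000");
        for (String[] row : filterAndOrder(lines, 2000)) {
            System.out.println(String.join(",", row));
        }
    }
}
```

The Pig version is shorter and, in mapreduce mode, runs the same steps as distributed jobs instead of on a single in-memory list.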




Lecture: Understanding Sqoop




Introduction

Sqoop (“SQL-to-Hadoop”) is a straightforward command-line tool with the following capabilities:

• Imports individual tables or entire databases to files in HDFS

• Generates Java classes to allow you to interact with your imported data

• Provides the ability to import from SQL databases straight into your Hive data warehouse

See also: http://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html




Architecture Overview





Hands-On: Loading Data from DBMS to Hadoop HDFS




Sqoop Hands-On Labs

1. Loading Data into MySQL DB

2. Installing Sqoop

3. Configuring Sqoop

4. Installing DB driver for Sqoop

5. Importing data from MySQL to Hive Table

6. Reviewing data from Hive Table

7. Reviewing HDFS Database Table files




1. MySQL RDS Server on AWS

An RDS server is running on AWS with the following configuration

> database: imc_db

> username: admin

> password: imcinstitute

> addr: imcinstitutedb.cmw65obdqfnx.us-west-2.rds.amazonaws.com

[This address may change]




1. country_tbl data

Testing data query from MySQL DB

Table name > country_tbl




2. Installing Sqoop

# wget http://apache.osuosl.org/sqoop/1.4.5/sqoop-1.4.5.bin__hadoop-1.0.0.tar.gz

# tar -xvzf sqoop-1.4.5.bin__hadoop-1.0.0.tar.gz

# sudo mv sqoop-1.4.5.bin__hadoop-1.0.0 /usr/local/

# rm sqoop-1.4.5.bin__hadoop-1.0.0.tar.gz




Installing Sqoop: Edit $HOME/.bashrc





3. Configuring Sqoop

ubuntu@ip-172-31-12-11:~$ cd /usr/local/sqoop-1.4.5.bin__hadoop-1.0.0/conf/

ubuntu@ip-172-31-12-11:~$ vi sqoop-env.sh




4. Installing DB driver for Sqoop

ubuntu@ip-172-31-12-11:~$ cd /usr/local/sqoop-1.4.5.bin__hadoop-1.0.0/lib/

ubuntu@ip-172-31-12-11:/usr/local/sqoop-1.4.5.bin__hadoop-1.0.0/lib$ wget https://www.dropbox.com/s/6zrp5nerrwfixcj/mysql-connector-java-5.1.23-bin.jar

ubuntu@ip-172-31-12-11:/usr/local/sqoop-1.4.5.bin__hadoop-1.0.0/lib$ exit





5. Importing data from MySQL to Hive Table

[hdadmin@localhost ~]$ sqoop import --connect jdbc:mysql://imcinstitutedb.cmw65obdqfnx.us-west-2.rds.amazonaws.com/imc_db --username admin -P --table country_tbl --hive-import --hive-table country -m 1

Warning: /usr/lib/hbase does not exist! HBase imports will fail.

Please set $HBASE_HOME to the root of your HBase installation.

Warning: $HADOOP_HOME is deprecated.

Enter password: <enter here>




6. Reviewing data from Hive Table





Start a web browser at http://54.68.149.232:50070 then navigate to /user/hive/warehouse





Lecture: Understanding HBase




Introduction: An open source, non-relational, distributed database

HBase is an open source, non-relational, distributed database modeled after Google's BigTable and is written in Java. It is developed as part of Apache Software Foundation's Apache Hadoop project and runs on top of HDFS, providing BigTable-like capabilities for Hadoop. That is, it provides a fault-tolerant way of storing large quantities of sparse data.




HBase Features

● Hadoop database modelled after Google's Bigtable
● Column-oriented data store, known as the Hadoop Database
● Supports random real-time CRUD operations (unlike HDFS)
● NoSQL database
● Open source, written in Java
● Runs on a cluster of commodity hardware




When to use HBase?

● When you need high volume data to be stored
● Unstructured data
● Sparse data
● Column-oriented data
● Versioned data (same data template, captured at various times, time-elapse data)
● When you need high scalability




Which one to use?

● HDFS
  – Append-only dataset (no random write)
  – Read the whole dataset (no random read)
● HBase
  – Need random write and/or read
  – Has thousands of operations per second on TB+ of data
● RDBMS
  – Data fits on one big node
  – Need full transaction support
  – Need real-time query capabilities




HBase Components

● Region
  – Where rows of a table are stored
● Region Server
  – Hosts the tables
● Master
  – Coordinates the Region Servers
● ZooKeeper
● HDFS
● API
  – The Java Client API




HBase Architecture





HBase Shell Commands





Hands-On: Running HBase




Installing HBase

# wget http://apache.cs.utah.edu/hbase/hbase-1.0.0/hbase-1.0.0-bin.tar.gz

# tar -xvzf hbase-1.0.0-bin.tar.gz

# sudo mv hbase-1.0.0 /usr/local/

# rm hbase-1.0.0-bin.tar.gz




Installing HBase: Edit $HOME/.bashrc





Starting HBase shell

ubuntu@ip-172-31-12-11:~$ start-hbase.sh

starting master, logging to /usr/local/hbase-0.94.10/logs/hbase-hdadmin-master-localhost.localdomain.out

ubuntu@ip-172-31-12-11:~$ jps

3064 TaskTracker

2836 SecondaryNameNode

2588 NameNode

3513 Jps

3327 HMaster

2938 JobTracker

2707 DataNode

ubuntu@ip-172-31-12-11:~$ hbase shell

HBase Shell; enter 'help<RETURN>' for list of supported commands.

Type "exit<RETURN>" to leave the HBase Shell

Version 0.94.10, r1504995, Fri Jul 19 20:24:16 UTC 2013

hbase(main):001:0>




Create a table and insert data in HBase

hbase(main):009:0> create 'test', 'cf'

0 row(s) in 1.0830 seconds

hbase(main):010:0> put 'test', 'row1', 'cf:a', 'val1'


hbase(main):011:0> scan 'test'

ROW COLUMN+CELL

row1 column=cf:a, timestamp=1375363287644, value=val1


hbase(main):002:0> get 'test', 'row1'

COLUMN CELL

cf:a timestamp=1375363287644, value=val1
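The shape of the data in the scan and get output (row key, then family:qualifier column, then timestamped value) can be pictured as nested sorted maps. The sketch below is a plain-Java mental model only; HBaseModelSketch and its put() and get() methods are made-up names and do not use the HBase client API.

```java
import java.util.Comparator;
import java.util.NavigableMap;
import java.util.TreeMap;

// Illustrative only: row key -> "family:qualifier" -> timestamp (newest first) -> value
class HBaseModelSketch {

    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> table =
            new TreeMap<String, NavigableMap<String, NavigableMap<Long, String>>>();

    // Stores a new version of a cell; older versions are kept alongside it.
    void put(String row, String column, long timestamp, String value) {
        table.computeIfAbsent(row, r -> new TreeMap<String, NavigableMap<Long, String>>())
             .computeIfAbsent(column, c -> new TreeMap<Long, String>(Comparator.<Long>reverseOrder()))
             .put(timestamp, value);
    }

    // Returns the newest value of a cell, or null when the row or column is absent.
    String get(String row, String column) {
        NavigableMap<String, NavigableMap<Long, String>> columns = table.get(row);
        if (columns == null) {
            return null;
        }
        NavigableMap<Long, String> versions = columns.get(column);
        return (versions == null || versions.isEmpty()) ? null : versions.firstEntry().getValue();
    }

    public static void main(String[] args) {
        HBaseModelSketch test = new HBaseModelSketch();
        test.put("row1", "cf:a", 1L, "val0");
        test.put("row1", "cf:a", 2L, "val1");
        System.out.println(test.get("row1", "cf:a")); // prints val1, the newest version
    }
}
```

Rows that never receive a value for a column simply have no entry for it, which is why sparse data costs nothing to store in this model.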





Recommendations for Further Study




Thank you

www.imcinstitute.com

www.facebook.com/imcinstitute


