COSC 6397
Big Data Analytics
Hadoop MapReduce Infrastructure:
Pig, Hive, and Mahout
Edgar Gabriel
Spring 2017
Pig
• Pig is a platform for analyzing large data sets
– abstraction on top of Hadoop
– Provides high level programming language designed
for data processing
– Converted into MapReduce and executed on Hadoop
Clusters
Why use Pig?
• MapReduce requires programmers
– Must think in terms of map and reduce functions
– More than likely will require Java programming
• Pig provides a high-level language that can be used by analysts and
scientists
– Does not require know-how in parallel programming
• Pig’s Features
– Join Datasets
– Sort Datasets
– Filter
– Data Types
– Group By
– User Defined Functions
Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/
Pig Components
• Pig Latin
– Command based language
– Designed specifically for data transformation and flow
expression
• Execution Environment
– The environment in which Pig Latin commands are
executed
– Supporting local and Hadoop execution modes
• Pig compiler converts Pig Latin to MapReduce
– Optimizations are applied automatically, in contrast to the
manual tuning required for hand-written MapReduce code
Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/
Running Pig
• Script
– Execute commands in a file
– $ pig scriptFile.pig
• Grunt
– Interactive Shell for executing Pig Commands
– Started when script file is NOT provided
– Can execute scripts from Grunt via run or exec commands
• Embedded
– Execute Pig commands using the PigServer class (see the sketch below)
– Can have programmatic access to Grunt via PigRunner
class
Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/
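A minimal sketch of the embedded mode above, using the PigServer class; the query, file name and output directory are invented for illustration:

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class EmbeddedPigExample {
    public static void main(String[] args) throws Exception {
        // start an embedded Pig execution environment (LOCAL here, MAPREDUCE on a cluster)
        PigServer pig = new PigServer(ExecType.LOCAL);
        // register Pig Latin statements; nothing executes until an action such as store()
        pig.registerQuery("records = LOAD 'a.txt' AS (letter:chararray, count:int);");
        // store() triggers execution and writes the bag to the given location
        pig.store("records", "pig-output");
        pig.shutdown();
    }
}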
Pig Latin concepts
• Building blocks
– Field – piece of data
– Tuple – ordered set of fields, represented with "(" and ")",
e.g. (10.4, 5, word, 4, field1)
– Bag – collection of tuples, represented with "{" and "}",
e.g. { (10.4, 5, word, 4, field1), (this, 1, blah) }
• Some similarities to relational databases
– Bag is a table in the database
– Tuple is a row in a table
Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/
Simple Pig Latin example
$ pig
grunt> cat /input/pig/a.txt
a 1
d 4
c 9
k 6
grunt> records = LOAD '/input/pig/a.txt' as (letter:chararray, count:int);
grunt> dump records;
...
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer
.MapReduceLauncher - 50% complete
2012-07-14 17:36:22,040 [main] INFO
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer
.MapReduceLauncher - 100% complete
...
(a,1)
(d,4)
(c,9)
(k,6)
grunt>
Notes on the session above:
– $ pig loads Grunt in the default map-reduce mode
– Grunt supports file system commands (e.g. cat)
– LOAD reads the contents of the text file into a bag called records
– DUMP displays records on the screen
Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/
Simple Pig Latin example
• No action is taken until DUMP or STORE commands are
encountered
– Pig will parse, validate and analyze statements but not execute
them
• STORE – saves results (typically to a file)
• DUMP – displays the results to the screen
– doesn’t make sense to print large arrays to the screen
– For information and debugging purposes you can print a small
sub-set to the screen
grunt> records = LOAD '/input/excite-small.log'
AS (userId:chararray, timestamp:long, query:chararray);
grunt> toPrint = LIMIT records 5;
grunt> DUMP toPrint;
Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/
Simple Pig Latin example
LOAD 'data' [USING function] [AS schema];
• data – name of the directory or file
– Must be in single quotes
• USING – specifies the load function to use
– By default uses PigStorage which parses each line into
fields using a delimiter
– Default delimiter is tab (‘\t’)
– The delimiter can be customized using regular
expressions
• AS – assign a schema to incoming data
– Assigns names and types to fields ( alias:type)
– (name:chararray, age:int, gpa:float)
Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/
records = LOAD '/input/excite-small.log' USING PigStorage()
          AS (userId:chararray, timestamp:long, query:chararray);
• int – signed 32-bit integer, e.g. 10
• long – signed 64-bit integer, e.g. 10L or 10l
• float – 32-bit floating point, e.g. 10.5F or 10.5f
• double – 64-bit floating point, e.g. 10.5 or 10.5e2 or 10.5E2
• chararray – character array (string) in Unicode UTF-8, e.g. hello world
• bytearray – byte array (blob)
• tuple – an ordered set of fields, e.g. (T: tuple (f1:int, f2:int))
• bag – a collection of tuples, e.g. (B: bag {T: tuple(t1:int, t2:int)})
Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/
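To illustrate the complex types above, a hedged sketch of a LOAD statement with tuple and bag fields; the file name, field names and layout are invented:

visits = LOAD '/input/visits.txt'
         AS (name:chararray,
             location:tuple(lat:double, lon:double),
             pages:bag{t:tuple(site:chararray, hits:int)});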
Pig Latin Diagnostic Tools
• Display the structure of the Bag
– grunt> DESCRIBE <bag_name>;
• Display Execution Plan
– Produces Various reports, e.g. logical plan, MapReduce
plan
– grunt> EXPLAIN <bag_name>;
• Illustrate how Pig engine transforms the data
– grunt> ILLUSTRATE <bag_name>;
Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/
Joining Two Data Sets
• Join Steps
– Load records into a bag from input #1
– Load records into a bag from input #2
– Join the 2 data-sets (bags) by provided join key
• Default Join is Inner Join
– Rows are joined where the keys match
– Rows that do not have matches are not included in the
result
(Figure: Venn diagram showing the inner join as the overlap of Set 1 and Set 2)
Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/
Simple join example
1. Load records into a bag from input #1
posts = load '/input/user-posts.txt' using PigStorage(',')
        as (user:chararray, post:chararray, date:long);
2. Load records into a bag from input #2
likes = load '/input/user-likes.txt' using PigStorage(',')
        as (user:chararray, likes:int, date:long);
3. Join the data sets: when a key is equal in both data sets, the rows are
joined into a new single row; in this case when the user name is equal
userInfo = join posts by user, likes by user;
dump userInfo;
Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/
$ hdfs dfs -cat /input/user-posts.txt
user1,Funny Story,1343182026191
user2,Cool Deal,1343182133839
user4,Interesting Post,1343182154633
user5,Yet Another Blog,13431839394
$ hdfs dfs -cat /input/user-likes.txt
user1,12,1343182026191
user2,7,1343182139394
user3,0,1343182154633
user4,50,1343182147364
$ pig /code/InnerJoin.pig
(user1,Funny Story,1343182026191,user1,12,1343182026191)
(user2,Cool Deal,1343182133839,user2,7,1343182139394)
(user4,Interesting Post,1343182154633,user4,50,1343182147364)
Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/
Outer Join
• Records which will not join with the ‘other’ record-set are still
included in the result
• Left Outer
– Records from the first data-set are included whether they have
a match or not. Fields from the unmatched (second) bag are
set to null.
• Right Outer
– The opposite of Left Outer Join: Records from the second data-
set are included no matter what. Fields from the unmatched
(first) bag are set to null.
• Full Outer
– Records from both sides are included. For unmatched records
the fields from the ‘other’ bag are set to null.
Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/
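A hedged Pig Latin sketch of a left outer join, reusing the posts and likes bags from the earlier example; unmatched likes fields come back as null:

userInfo = JOIN posts BY user LEFT OUTER, likes BY user;
DUMP userInfo;

Replacing LEFT OUTER with RIGHT OUTER or FULL OUTER gives the other two variants described above.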
Pig Use cases
• Loading large amounts of data
– Pig is built on top of Hadoop -> scales with the number of
servers
– Alternative to manual bulk loading, e.g. into HBase
• Using different data sources, e.g.
– collect web server logs,
– use external programs to fetch geo-location data for the users’
IP addresses,
– join the new set of geo-located web traffic to stored click maps
• Support for data sampling (see the sketch below)
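A minimal Pig Latin sketch of the sampling support mentioned above, using the SAMPLE operator (the input file is the one from the join example):

records = LOAD '/input/user-posts.txt' USING PigStorage(',')
          AS (user:chararray, post:chararray, date:long);
small = SAMPLE records 0.01;   -- keeps roughly 1% of the rows, chosen probabilistically
DUMP small;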
Hive
• Data Warehousing Solution built on top of Hadoop
• Provides SQL-like query language named HiveQL
– Minimal learning curve for people with SQL expertise
– Data analysts are target audience
• Early Hive development work started at Facebook in
2007
• Translates HiveQL statements into a set of MapReduce
Jobs which are then executed on a Hadoop Cluster
Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/
Hive
• Ability to bring structure to various data formats
• Simple interface for ad hoc querying, analyzing and
summarizing large amounts of data
• Access to files on various data stores such as HDFS and
HBase
• Hive does NOT provide low latency or realtime queries
– Even querying small amounts of data may take minutes
• Designed for scalability and ease-of-use rather than low
latency responses
Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/
Hive
• To support features like schema(s) and data
partitioning Hive keeps its metadata in a Relational
Database
– Packaged with Derby, a lightweight embedded SQL DB
• The default Derby-based setup is good for evaluation and testing
• Schema is not shared between users as each user has
their own instance of embedded Derby
• Stored in metastore_db directory which resides in the
directory that hive was started from
– Can easily switch to another SQL installation such as MySQL
Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/
Hive Architecture
Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/
Hive Interface Options
• Command Line Interface (CLI)
• Hive Web Interface – https://cwiki.apache.org/confluence/display/Hive/HiveWebInterface
• Java Database Connectivity (JDBC) – https://cwiki.apache.org/confluence/display/Hive/HiveClient
• Re-used from Relational Databases
– Database: Set of Tables, used for name conflict resolution
– Table: Set of Rows that have the same schema (same
columns)
– Row: A single record; a set of columns
– Column: provides value and type for a single value
Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/
Hive creating a table
hive> CREATE TABLE posts (user STRING, post STRING, time BIGINT)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ','
> STORED AS TEXTFILE;
OK
Time taken: 10.606 seconds
hive> show tables;
OK
posts
Time taken: 0.221 seconds
hive> describe posts;
OK
user string
post string
time bigint
Time taken: 0.212 seconds
Notes on the session above:
– CREATE TABLE creates a table with 3 columns
– ROW FORMAT / FIELDS TERMINATED BY specify how the underlying file should be parsed
– describe posts displays the schema for the posts table
Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/
Hive Query Data
hive> select * from posts where user="user2";
...
OK
user2 Cool Deal 1343182133839
Time taken: 12.184 seconds
hive> select * from posts where time<=1343182133839 limit 2;
...
OK
user1 Funny Story 1343182026191
user2 Cool Deal 1343182133839
Time taken: 12.003 seconds
hive>
Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/
Partitions
• To increase performance Hive has the capability to
partition data
– The values of a partitioned column divide a table into
segments
– Entire partitions can be ignored at query time
– Similar to relational databases’ indexes but not as
granular
• Partitions have to be properly created by users
– When inserting data must specify a partition
• At query time, whenever appropriate, Hive will
automatically filter out partitions
Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/
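A hedged HiveQL sketch of partitioning, extending the posts table from earlier; the partition column country and the file name are invented:

hive> CREATE TABLE posts_by_country (user STRING, post STRING, time BIGINT)
    > PARTITIONED BY (country STRING)
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY ','
    > STORED AS TEXTFILE;
hive> LOAD DATA LOCAL INPATH 'user-posts-us.txt'
    > INTO TABLE posts_by_country PARTITION (country = 'US');
hive> SELECT * FROM posts_by_country WHERE country = 'US';

Because the last query filters on the partition column, Hive only has to read the 'US' partition and can ignore all others.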
Bucketing
• Mechanism to query and examine random samples of
data
• Break data into a set of buckets based on a hash
function of a "bucket column"
– Capability to execute queries on a sub-set of random data
• Doesn’t automatically enforce bucketing
– The user is required to specify the number of buckets by
setting the number of reducers (see the sketch below)
Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/
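A hedged HiveQL sketch of bucketing; the table name and bucket count are invented, and on older Hive versions the hive.enforce.bucketing setting makes the number of reducers match the bucket count:

hive> SET hive.enforce.bucketing = true;
hive> CREATE TABLE posts_bucketed (user STRING, post STRING, time BIGINT)
    > CLUSTERED BY (user) INTO 4 BUCKETS
    > STORED AS TEXTFILE;
hive> INSERT OVERWRITE TABLE posts_bucketed SELECT * FROM posts;
hive> SELECT * FROM posts_bucketed TABLESAMPLE (BUCKET 1 OUT OF 4 ON user);

The TABLESAMPLE clause reads a single bucket, i.e. roughly a quarter of the data.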
Joins
• Hive supports outer joins – left, right and full joins
• Can join multiple tables
• Default Join is Inner Join
– Rows are joined where the keys match
– Rows that do not have matches are not included in the
result
Slide based on lecture http://www.coreservlets.com/hadoop-tutorial/
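A hedged HiveQL sketch of the joins described above, assuming a likes table loaded analogously to the posts table (user STRING, likes INT, time BIGINT):

hive> SELECT p.user, p.post, l.likes
    > FROM posts p JOIN likes l ON (p.user = l.user);
hive> SELECT p.user, p.post, l.likes
    > FROM posts p LEFT OUTER JOIN likes l ON (p.user = l.user);

The first query is an inner join; the second keeps every posts row and returns NULL for the likes columns where no match exists.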
Pig vs. Hive
• Hive
– Uses an SQL-like query language called HiveQL (HQL)
– Gives non-programmers the ability to query and analyze
data in Hadoop.
• Pig
– Uses a workflow driven scripting language
– Don't need to be an expert Java programmer but need a
few coding skills.
– Can be used to convert unstructured data into a
meaningful form.
Mahout
• Scalable machine learning library
– Built with MapReduce and Hadoop in mind
– Written in Java
• Focusing on three application scenarios
– Recommendation Systems
– Clustering
– Classifiers
• Multiple ways for utilizing Mahout
– Java Interfaces
– Command line interfaces
• The newest Mahout releases target Spark instead of
MapReduce!
Classification
• Currently supported algorithms
– Naïve Bayes Classifier
– Hidden Markov Models
– Logistic Regression
– Random Forest
Clustering
• Currently supported algorithms
– Canopy clustering
– K-means clustering
– Fuzzy k-means clustering
– Spectral clustering
• Multiple tools available to support clustering
– clusterdump: utility to output results of a clustering to a
text file
– cluster visualization
Mahout input arguments
• Input data has to be sequence files and sequence
vectors
– Sequence file: generic Hadoop concept for binary files containing
• a list of key/value pairs
• the classes used for the key and the value
– Sequence vector: binary file containing a list of key/(array of values) pairs
• For Mahout algorithms, the key has to be Text and the value has to be
of type VectorWritable (which is a Mahout class, not a Hadoop class)
Sequence Files
• Creating a sequence file using command line arguments
gabriel@shark> mahout seqdirectory -i /lastfm/input/ -o /lastfm/seqfiles
• Looking at the output of a sequence file
gabriel@shark> mahout seqdumper -i /lastfm/seqfiles/control-data.seq | more
Input Path: file:/lastfm/seqfiles/control-data.seq
Key class: class org.apache.hadoop.io.Text Value Class: class
org.apache.mahout.math.VectorWritable
Key: 0: Value: {0:28.7812,1:34.4632,2:31.3381}
Key: 1: Value: {0:24.8923,1:25.741,2:27.5532}
…
Sequence File from Java
• Required if the original input file is not already structured in a
manner that can be interpreted as key/value pairs
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.Scanner;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.VectorWritable;

public class CreateSequenceFile {
  public static void main(String[] args) throws FileNotFoundException, IOException {
    String filename = "/home/gabriel/mahouttest/synthetic-control-data/input/synthetic-control.data";
    String outputfilename = "/home/gabriel/mahouttest/synthetic-control-data/seqfile/synthetic-control.seq";
    Path path = new Path(outputfilename);
    BufferedReader br = new BufferedReader(new FileReader(filename));
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // sequence file with Text keys and Mahout VectorWritable values
    SequenceFile.Writer writer =
        new SequenceFile.Writer(fs, conf, path, Text.class, VectorWritable.class);
    long tempkey = 0;
    String line;
    while ((line = br.readLine()) != null) {
      // parse up to 64 doubles from the current line into a dense vector
      Scanner scanner = new Scanner(new StringReader(line));
      double[] values = new double[64];
      int i = 0;
      while (scanner.hasNextDouble() && i < 64) {
        values[i] = scanner.nextDouble();
        i++;
      }
      VectorWritable vec = new VectorWritable(new DenseVector(values));
      // key is the running line number, value is the vector
      Text key = new Text(String.format("%d", tempkey));
      writer.append(key, vec);
      tempkey++;
    }
    writer.close();
    br.close();
  }
}
Using Mahout clustering
– The SequenceFile containing the input vectors
– The SequenceFile containing the initial cluster centers
– The similarity measure to be used
– The convergence threshold
– The number of iterations to be done
– The Vector implementation used in the input files
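These inputs map onto the KMeansDriver API of the Mahout releases this material is based on (the exact signature differs between Mahout versions); a hedged Java sketch with invented paths:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.clustering.kmeans.KMeansDriver;
import org.apache.mahout.common.distance.EuclideanDistanceMeasure;

public class RunKMeans {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        KMeansDriver.run(conf,
            new Path("/lastfm/seqvectors"),        // SequenceFile containing the input vectors
            new Path("/lastfm/initial-clusters"),  // SequenceFile containing the initial cluster centers
            new Path("/lastfm/kmeans-output"),     // output working directory
            new EuclideanDistanceMeasure(),        // similarity measure
            0.001,                                 // convergence threshold
            10,                                    // number of iterations
            true,                                  // also assign the input vectors to the final clusters
            false);                                // false = run as MapReduce jobs, true = sequential
    }
}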
Distance measures
• Euclidean distance measure
• Squared Euclidean distance measure
• Manhattan distance measure
• Cosine distance measure
• Tanimoto distance measure
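As a reference, the standard definitions behind these measures, for vectors a and b of dimension n (a sketch in LaTeX notation):

\begin{align*}
d_{\mathrm{Euclidean}}(a,b)   &= \sqrt{\textstyle\sum_{i=1}^{n} (a_i - b_i)^2} \\
d_{\mathrm{SqEuclidean}}(a,b) &= \textstyle\sum_{i=1}^{n} (a_i - b_i)^2 \\
d_{\mathrm{Manhattan}}(a,b)   &= \textstyle\sum_{i=1}^{n} |a_i - b_i| \\
d_{\mathrm{Cosine}}(a,b)      &= 1 - \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert} \\
d_{\mathrm{Tanimoto}}(a,b)    &= 1 - \frac{a \cdot b}{\lVert a \rVert^{2} + \lVert b \rVert^{2} - a \cdot b}
\end{align*}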
Running Mahout Clustering algorithms
bin/mahout kmeans
-i <input vectors directory> \
-c <input clusters directory> \
-o <output working directory> \
-k <optional no. of initial clusters> \
-dm <DistanceMeasure> \
-x <maximum number of iterations> \
-cd <optional convergence delta. Default is 0.5> \
-ow <overwrite output directory if present>
-cl <run input vector clustering after computing Canopies>
-xm <execution method: sequential or mapreduce>
mahout clusterdump -i /gabriel/clustering/canopy/clusters-0-final \
    --pointsDir /gabriel/clustering/canopy/clusteredPoints \
    -o /home/gabriel/mahouttest/synthetic-control-data/canopy.out