
Ignore HDInsight at Your Own Peril: Everything You Need to Know

by Daniel Jebaraj

Contents

Ignore HDInsight at Your Own Peril: Everything You Need to Know
    Abstract
    Introduction
    Storage and Analysis of Big Data
    Hadoop/HDInsight
        Scalable Storage: Hadoop Distributed File System
    Scalable processing
        MapReduce
            Map
            Shuffle
            Reduce
        MapReduce sample: Java implementation of word count
            Prerequisites
            Compiling provided Java sample
            Upload the input text document to HDFS
        C# implementation of word count
            Review results
            Important notes
            C# Mapper
            C# Reducer
    MapReduce the Easy Way
    Building a simple recommendation engine
        Perfectly correlated data
        Uncorrelated data
        C# implementation to calculate correlations
    Simple Recommendation System Using Pig
        Load and store
        Relation
        Joins
        Filter
        Projection
        Grouping
        Dump
        Pig script that analyzes movie ratings
            Load the data from HDFS
            Obtain a list of unique movie combinations
            Project the data to a more usable form
            Obtain groups containing ratings for each pair of movies
            Calculating correlations
            Project final results
            Dump final results for review
            Running the script
            Results
        Applying the same concepts to a much larger set of data
            Structure of u.item
            Structure of u.data
            Running the script
    The Role of Traditional BI
    Data Mining Post-ETL
    Data Mining with Big Data
    Big Data Processing Is Not Just for Big Data
    Conclusion: Harnessing Your Data Is Easier Than You Think
    How Can Syncfusion Help?
    Contact information
    Appendix A: Installing and Configuring HDInsight on a Single Node (Pseudo-Cluster)
    Appendix B: Configuring NetBeans for HDInsight Development on Windows


    Ignore HDInsight at Your Own Peril: Everything You Need to Know

Abstract

HDInsight is a Microsoft-provided distribution of Apache Hadoop, adapted for and tested on the Windows platform. It can be deployed to private, self-hosted clusters or accessed as a service on the Windows Azure1 cloud. It is currently available as a public preview and is expected to be released in the fourth quarter of 2013.

HDInsight brings truly scalable processing of structured and unstructured data to the Windows platform. This white paper covers all you need to know about Hadoop, MapReduce, and HDInsight in order to harness the benefits of big data.

    This white paper is suitable for both developers and managers. You could skip over the included code

    and still get a working understanding of the overall system. For developers, we have included several

    working samples that will help you understand the environment.

    1 http://www.windowsazure.com/en-us/services/hdinsight/


Introduction

There has been a virtual explosion in the amount of data being created. Not very long ago, transactional information was the main source of data, and only a few large organizations accumulated unwieldy amounts of it. The need to store and process such amounts of data was not a common business requirement for most organizations.

    Now, the situation has changed dramatically. Organizations have woken up to the reality that huge

    amounts of data are being generated on a daily basis by people and machines.

    Consider some examples:

• Activity logs generated by customers browsing websites.

• Logs generated by complex, field-deployed machinery, recording key measurements several times a second.

• Signals generated on social media related to a company's products and services.

Such data can be "big": difficult to store and process using traditional methods. This unwieldiness is what distinguishes big data from other data.

    In spite of storage and processing difficulties, big data offers potentially huge business value. It provides

    an opportunity to gather insight concerning business activities in ways previously not possible. Customer

    web log information, for instance, can be used to predict valuable trends that an organization may not

    otherwise be aware of. Machine-generated information can be used to predict failure, probability of

    accidents, and other such events long before they happen. Social media signals can be used to predict

    the failure or success of specific marketing initiatives.

    This white paper focuses on the storage and processing of big data using the HDInsight distribution (the

    terms Hadoop and HDInsight are used interchangeably). Understanding this is critical to harnessing big

    data and putting it to use to further business goals.

Storage and Analysis of Big Data

Though most organizations have realized that they are accumulating more data than ever before, only a minority have implemented a storage and analysis strategy for such data. There is good reason for this.

    The storage of transactional data in relational database systems has been well understood for several

    decades. The tools that are used for this purpose, such as SQL Server and Oracle, are well understood.

    Relational data is stored in a highly structured format and processed using SQL. The warehousing of this

    data to build online analytical processing (OLAP) systems is also understood well.

On the other hand, unstructured data (especially in huge quantities) is by nature very different. It has no predefined structure. Relational databases are not suitable for storing it, and storage on traditional file systems is also problematic since the size of the data often exceeds the capabilities of single machines.


Hadoop/HDInsight

Against this backdrop, Hadoop has gained broad acceptance as an effective storage and processing

    mechanism for big data. Hadoop is an open-source implementation of systems that Google

    implemented2 internally to solve big data problems related to storing indexes for web scale data.

    Hadoop at its core has two pieces: one for storing large amounts of unstructured data in a cost-effective

    manner and another for processing large amounts of data in a cost-effective manner.

• The data storage solution is named the Hadoop Distributed File System (HDFS).

• The processing solution is an implementation of the MapReduce programming model documented by Google.

Scalable Storage: Hadoop Distributed File System

HDFS is a file system designed to store large amounts of data economically. It does this by storing data on multiple commodity machines, scaling out instead of scaling up.

    HDFS, for all its underlying magic, is simple to understand at a conceptual level.

• Each file that is stored by HDFS is split into large blocks (typically 64 MB each, but this setting is configurable).

• Each block is then stored on multiple machines that are part of the HDFS cluster. A centralized metadata store has information on where individual parts of a file are stored.

• Considering that HDFS is implemented on commodity hardware, machines and disks are expected to fail. When a node fails, HDFS will ensure that the data blocks the node held are replicated to other systems.

    This scheme allows for the storage of large files in a fault-tolerant manner across multiple machines.

    2 http://research.google.com/archive/mapreduce.html


[Figure: HDFS visually]

    In HDFS, the metadata store is typically on a machine referred to as the name node. The nodes where

    data is stored are referred to as data nodes. In the previous diagram, there are three data nodes. Each

    of these nodes contains a copy of each block of data that is stored on the HDFS cluster. A production

    implementation of HDFS will have many more nodes, but the essential structure still applies.

    The data blocks stored on individual machines also play an important role in efficiently processing data

    by the implementation of MapReduce in Hadoop, but we will have more to say about that shortly.

Scalable processing

Before we discuss MapReduce, it will be helpful to carefully consider the issues associated with scaling out the processing of data across multiple machines. We will do this using a simple example: assume we have a text file, and we would like an individual count of each word that appears in that text file.

This is the pseudo-code for a simple word-counting program that runs on a single machine:

1. Open the text file for reading, and read each line.
2. Parse each line into words.
3. Increment and store the count of each word as it appears, in a dictionary or similar structure.
4. Close the file and output summary information.
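For reference, the following is a minimal C# sketch of this single-machine approach. It is illustrative only (the input file name is hypothetical) and is not part of the whitepaper's sample code.

using System;
using System.Collections.Generic;
using System.IO;

class SingleMachineWordCount
{
    static void Main()
    {
        // Dictionary that accumulates the count of each word.
        var counts = new Dictionary<string, int>();

        // Open the text file for reading and read each line (the path is illustrative).
        foreach (string line in File.ReadLines("input.txt"))
        {
            // Parse the line into words.
            foreach (string word in line.Split(new[] { ' ', '\t' },
                StringSplitOptions.RemoveEmptyEntries))
            {
                // Increment and store the count of the word.
                int current;
                counts.TryGetValue(word, out current);
                counts[word] = current + 1;
            }
        }

        // Output summary information.
        foreach (var pair in counts)
            Console.WriteLine("{0}\t{1}", pair.Key, pair.Value);
    }
}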

    Simple enough. Now consider that you have several gigabytes (maybe petabytes) of text files. How will

    we modify the simple program described above to process this kind of information by scaling out3 across

    multiple machines?

    Some issues to consider:

• Data storage: The system should provide a way to store the data being processed.

• Data distribution: The system should be able to distribute data to each of the processing nodes.

• Parallelizable algorithm: The processing algorithm should be parallelizable. Each node should be able to run without being held up by another during any given stage. If nodes have to share data, the difficulties associated with the synchronization of such data must be considered.

• Fault tolerance: Any of the nodes processing data can fail. For the system to be resilient to the failure of individual nodes, the failure should be promptly detected and the work delegated to another available node.

• Aggregation: The system should have a way to aggregate results produced by individual nodes to compute final summaries.

• Storage of results: The final output can itself be substantial; there should be an effective way to store and retrieve it.

3 Scale up vs. scale out: It will not be ideal to implement such a processing system on a single machine. A powerful machine can certainly process gigabytes of text, but there is a limit to this kind of scaling.

    As we consider these aspects, it is evident that implementing a custom version of a truly scalable parallel

    system across multiple machines is not a trivial task, even for a problem as simple as counting words.

    Hadoop makes scaling out processing easier by implementing solutions to these issues, summarized in

    the following table.

Issue considered | The Hadoop solution
Data storage | HDFS
Data distribution | Hadoop keeps data distribution between nodes to a minimum. It instead moves processing code to each node and processes the data where it is already available on disk.
Parallelizable algorithm | We will study this aspect in more detail, but in essence, as of version 1.x, Hadoop mandates the MapReduce programming model to enable a scalable processing model.
Fault tolerance | Hadoop monitors data storage nodes and will add replicas as nodes become unavailable. Hadoop also monitors tasks assigned to nodes and will reassign work if a node becomes unavailable.
Aggregation | This is accounted for in a distributed manner through the Reduce stage, as we will see in the next section.
Storage of results | HDFS


MapReduce

We have seen that Hadoop, as of version 1.x4, mandates the MapReduce programming model. MapReduce is a functional programming model that moves away from shared resources and the related synchronization or contention issues. It instead uses simple parts that are inherently scalable to achieve complex solutions.

Google's paper on MapReduce provides the following description:

    MapReduce is a programming model and an associated implementation for

    processing and generating large data sets. Users specify a map function that

    processes a key/value pair to generate a set of intermediate key/value pairs, and a

    reduce function that merges all intermediate values associated with the same

    intermediate key. Many real-world tasks are expressible in this model.

    Programs written in this functional style are automatically parallelized and

    executed on a large cluster of commodity machines. The run-time system takes

    care of the details of partitioning the input data, scheduling the program's

    execution across a set of machines, handling machine failures, and managing the

    required inter-machine communication. This allows programmers without any

    experience with parallel and distributed systems to easily utilize the resources of a

    large distributed system.

The MapReduce programming model is not hard to understand, especially if we study it using a simple example. MapReduce as implemented in Hadoop comprises three stages, which we will now look at in detail.

Map

The Map stage takes input in the form of a key and a value, processes the input, and then outputs another key and value. In this sense, it is no different from the implementation of Map in many programming environments.

    Considering the word count example, a Map task is likely to follow these steps:

Input

• Key: A key identifying the value being provided to the Mapper. In the context of Hadoop and the word-counting problem, this key is simply the starting index of the text in the data block being processed. We can consider it to be opaque and ignore it.

• Value: A single line of text. Consider it a unit to be processed.

Processing (implemented by user code)

• Splits the provided line of text into individual words.

Output

For each word, output a key/value pair. The mechanism to output these values is provided by Hadoop.

    4 Version 2.0 introduces additional programming models.


• Key: A suitable key is the actual word detected.

• Value: A suitable value is 1. Think of this as a distributed marker that simply denotes that we saw a particular word once. It is important to distinguish this from a dictionary approach. With a dictionary, we would look up the current value and increment it by one. In our case, we do not do this. Every time we see a word, we simply mark that we have seen it again by outputting a 1. Aggregation will happen later.

Example walkthrough

Input to Mapper

Key: {Any number indicating the index within the block being processed}
Value: Twinkle, Twinkle Little Star

Output by Mapper

We assume that punctuation does not count in our context. Note that the word Twinkle was seen twice during processing, and therefore appears twice with 1 as the value and Twinkle as the key.

Key | Value
Twinkle | 1
Twinkle | 1
Little | 1
Star | 1

Shuffle

Once the Map stage is over, data collected from the Mappers (remember, there could be several Mappers operating in parallel) will be sent to the Shuffle stage. During the Shuffle stage, all values that have the same key are collected and stored as a conceptual list tied to the key under which they were registered.

In the word count example, assuming the single line of text we observed earlier was the only input, this is what the output of the Shuffle phase should be:

Key | List of values
Twinkle | 1, 1
Little | 1
Star | 1

    The Shuffle stage guarantees that data under a specific key will be sent to exactly one reducer (the next

    stage).

    Shuffle is not typically implemented by the application. Hadoop implements shuffle and guarantees that

    all data values that belong to a single key will be gathered together and passed to a single reducer. In

    the instance mentioned above, the key Twinkle will be processed by a single reducer. It will never be


    processed by more than one reducer. Data under different keys can of course be routed to different

    reducers.

Reduce

The reducer's role is to process the transformed data and output yet another key-value pair. This is the key-value pair that is actually written to the output. In the word count sample, the reducer can simply return the word as a key again, with the value being a summation of all the 1s that appear in the provided list of values. This is, of course, the number of times the word appeared in the text, which is the desired output.

Key | Value
Twinkle | 2
Little | 1
Star | 1

    The beauty of MapReduce is that once a problem is broken into MapReduce terms and tested on a small

    amount of data, you can be confident you have a scalable solution that can handle large volumes of

    similar data.
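Before looking at the real Hadoop code, the following hedged C# sketch mimics the three stages in memory with LINQ. It only illustrates the flow of key/value pairs; it is not how Hadoop actually executes a job.

using System;
using System.Linq;

class MapShuffleReduceIllustration
{
    static void Main()
    {
        string[] lines = { "Twinkle, Twinkle Little Star" };

        // Map: emit a (word, 1) pair for every word on every line.
        var mapped = lines
            .SelectMany(line => line.Split(new[] { ' ', ',' },
                StringSplitOptions.RemoveEmptyEntries))
            .Select(word => new { Key = word, Value = 1 });

        // Shuffle: gather all values that share the same key.
        var shuffled = mapped.GroupBy(pair => pair.Key);

        // Reduce: sum the list of 1s collected under each key.
        foreach (var group in shuffled)
            Console.WriteLine("{0}\t{1}", group.Key, group.Sum(pair => pair.Value));
    }
}

Running this prints Twinkle 2, Little 1, and Star 1, matching the walkthrough above.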

    We will now review a working implementation of the word count problem implemented using

    MapReduce in Java and C#.

    We chose to show the solution in both Java and C# since Java is the native language of the Hadoop

    environment. Other languages such as C# are supported by streaming through stdin and stdout, but Java

    is the language you will often turn to when reviewing available sample code or implementing more

    advanced Hadoop features. For this reason, it is a good idea to have a working knowledge of using Java

    with Hadoop.

MapReduce sample: Java implementation of word count

Prerequisites

1. Install HDInsight following the steps given in Appendix A: Installing and Configuring HDInsight on a Single Node (Pseudo-Cluster).
2. Download the sample code from https://bitbucket.org/syncfusion/hdinsightwp/src.
3. Install and configure the NetBeans IDE for Hadoop development as documented in Appendix B: Configuring NetBeans for HDInsight Development on Windows.

Compiling provided Java sample

Open and compile the Java word count sample, available in the sample folder word count java, using the NetBeans IDE.

Alternatively, you can use your favorite Java IDE or the Java command line, but keep the following in mind:

• You have to use the 64-bit version of Java 6 for compilation.
• You have to package the compiled class files in a JAR file for execution by the Hadoop environment.


    Once you have a compiled JAR file, please follow these steps to execute the sample:

Upload the input text document to HDFS

1. Start the Hadoop command line shell through the link that is created when you install HDInsight.
2. Navigate to the folder where you have installed the source for this article. Specifically, navigate to the folder named data.
3. Since HDFS is a virtual file system, Hadoop provides access to HDFS files through a shell. The shell offers several standard POSIX commands. To copy the data file required for this sample, use the following commands:

   hadoop fs -mkdir warpeace
   Creates a directory on HDFS named warpeace. The directory will be created under the user's home directory since we did not provide an absolute path.

   hadoop fs -put warpeace.txt warpeace
   Uploads warpeace.txt from the local file system to the HDFS file system.

   hadoop fs -ls warpeace
   Verifies that the file was uploaded correctly.

4. Next, navigate to the sample folder named word count (java\wordcount\dist). The compiled JAR file named wordcount.jar should be under this directory if you compiled using the NetBeans IDE.
5. Run this command: hadoop jar wordcount.jar warpeace warpeacecount
6. Hadoop will start running the job right away. If all goes well, you should see the task complete with no error messages.
7. Once the task completes with no errors, issue this command to see the output: hadoop fs -ls warpeacecount

   There are multiple files in the output. One contains status information, and another contains log information. The third file, with a name of the form part-#####, is the file of interest that contains the actual results. If there were multiple nodes working in parallel, we would see additional files that contain parts of the result.

8. Use this command to view the content of the output directory: hadoop fs -cat warpeacecount/part*
9. You should observe the word counts dumped to the console.


After running the Java MapReduce implementation of word count, take a look at the three parts of the code, reproduced here.

The Mapper

The Mapper takes lines of input and, for each word seen, emits the word as a key with the value 1, as described in our earlier walkthrough.

// Template arguments state the type of the input key/value and output key/value.
public static class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> oc, Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            oc.collect(word, one);
        }
    }
}

    Reducer

    The Reducer aggregates output from the Shuffle stage, as seen in the earlier walkthrough. It then

    outputs each word as a key with its total count as a value.

public static class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}

    Entry point with configuration information

    Hadoop requires some plumbing in order to submit a job, but this plumbing is straightforward. It uses

    console arguments to configure values, such as the input and output paths. Several other settings can

    also be specified if needed.

public static void main(String[] args) throws IOException {
    JobConf jobConf = new JobConf(Wordcount.class);
    jobConf.setJobName("Word Count example");

    jobConf.setOutputKeyClass(Text.class);
    jobConf.setOutputValueClass(IntWritable.class);

    jobConf.setMapperClass(Map.class);
    jobConf.setReducerClass(Reduce.class);

    jobConf.setInputFormat(TextInputFormat.class);
    jobConf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.addInputPath(jobConf, new Path(args[0]));
    FileOutputFormat.setOutputPath(jobConf, new Path(args[1]));

    JobClient.runJob(jobConf);
}

    If you review each line in the code sample and observe the results, you should have a working

    understanding of how MapReduce works.

C# implementation of word count

We will now take a look at the C# implementation of the same sample.

The C# sample is available in the folder named word count cs. Unlike the Java sample, the C# sample is configured to run itself; you do not, therefore, need to invoke Hadoop directly. Just know that you can have a self-running executable that starts a Hadoop job, or you can have a command start the job for you based on provided parameters (as we did with the Java sample).

    The C# sample, once it is done, will create a folder named warpeacecountcs with results identical to the

    Java version.

Review results

hadoop fs -cat warpeacecountcs/part*

Important notes

• The C# sample uses the Hadoop SDK available on CodePlex. We have included copies of the assemblies and files needed, so you will not need to build the SDK to work with the sample.

• If you do have issues running the C# sample, we recommend that you build the Hadoop SDK from source and then run the sample against the updated dependencies.

• Though the Hadoop SDK is also available through NuGet, we do not recommend going that route since we experienced some issues when building against the NuGet version. Building the SDK from source is the way to go if you have issues.

    The C# versions of the Mapper and Reducer are shown in the following sample. If you compare them

    with the Java version, you will see they have similar functionality.

C# Mapper

public class WordCountMapper : MapperBase
{
    public override void Map(string inputLine, MapperContext context)
    {
        try
        {
            string[] words = inputLine.Split(' ');
            foreach (string word in words)
                context.EmitKeyValue(word, "1");
        }
        catch (ArgumentException)
        {
            return;
        }
    }
}

C# Reducer

public class WordCountReducer : ReducerCombinerBase
{
    public override void Reduce(string key, IEnumerable<string> values,
        ReducerCombinerContext context)
    {
        context.EmitKeyValue(key, values.Count().ToString());
    }
}

MapReduce the Easy Way

We have looked at writing MapReduce the hard way with Java and C#. It is essential to understand how MapReduce works; for that reason, the Java and C# samples we just reviewed are useful. In practice, however, writing MapReduce jobs in C# or Java can be compared to writing software in assembly: you don't usually do it unless you absolutely have to.

    Several domain-specific languages exist for the specific purpose of authoring MapReduce jobs. Pig

    (http://pig.apache.org/) and Hive (http://hive.apache.org/) are two of the more commonly used

    languages.


    We will not work with Hive in this article, but we will spend a fair amount of time with Pig. Hive provides

    a SQL-like approach to specifying MapReduce jobs. If you are interested in Hive, we encourage you to

    check out material available online and the book Programming Hive5.

    Pig and Hive are both compelling environments for authoring MapReduce jobs. As developers, we prefer

    Pig since its syntax is closer to that of a programming language. If you come from a SQL background, you

    may prefer Hive. HDInsight has great support for both. You will not be at a disadvantage choosing one

    over the other for most tasks.

    In the next section, we will look into building a simple product recommendation engine using Pig. The

    task of building a product recommendation engine is a real-world, big data use case. We will simplify its

    specification and implementation in order to make it easier to understand, but the fundamental ideas

    will remain the same as those in actual use. Working through this sample will give you a good

    understanding of using Pig for complex MapReduce tasks.

Building a simple recommendation engine

Product recommendations are available on many popular sites such as Amazon.com and Netflix.com. When we review or buy a specific product, these websites usually offer helpful suggestions for other products that may be of interest. They use complex algorithms, tuned over years, to achieve these results. The underlying concepts, though, are quite simple. The key concept is that of a correlation between two pieces of data.

Consider the following table with two columns of data. Even a casual review tells us that one column moves in tandem with the other (it is simply the first column multiplied by a constant).

Perfectly correlated data

1	10
2	20
3	30
4	40
5	50
6	60
6	60
7	70
8	80
9	90
10	100

    On the other hand, consider the table below. The two columns are not related in an evident manner.

Uncorrelated data

1	3123
2	12321
3	3123
4	12312
5	5555555
6	2323123
6	123213
7	23123
8	12313
9	1231232
10	13

5 http://www.amazon.com/Programming-Hive-Edward-Capriolo/dp/1449319335

    A mathematical way to measure the extent of correlation is the Pearson product-moment correlation

    coefficient. You can read about the formula and complete details on Wikipedia6. The Pearson Coefficient

    can vary between -1 and 1, as summarized below.

Pearson product-moment correlation coefficient value | Comments
-1 | Perfectly correlated data, but as one value rises the other decreases.
0 | Uncorrelated data.
+1 | Perfectly correlated data.

    The Pearson Coefficient can be any number between these values. Please refer to the Microsoft Excel

    file named cor.xlsx, available under the folder correlation-excel in the sample code folder. It has simple

    examples of correlations. Excel has built-in support for calculating the Pearson product-moment

    correlation coefficient7.
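If you would rather see the calculation itself than rely on Excel or a statistics library, the following C# sketch computes the coefficient directly from its definition. It is an illustration we have added here, not part of the provided samples.

using System;
using System.Collections.Generic;
using System.Linq;

static class PearsonExample
{
    // r = sum((x - meanX) * (y - meanY)) /
    //     (sqrt(sum((x - meanX)^2)) * sqrt(sum((y - meanY)^2)))
    public static double Correlation(IList<double> x, IList<double> y)
    {
        if (x.Count != y.Count || x.Count == 0)
            throw new ArgumentException("Inputs must be non-empty and of equal length.");

        double meanX = x.Average();
        double meanY = y.Average();

        double sumXY = 0, sumXX = 0, sumYY = 0;
        for (int i = 0; i < x.Count; i++)
        {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sumXY += dx * dy;
            sumXX += dx * dx;
            sumYY += dy * dy;
        }

        return sumXY / (Math.Sqrt(sumXX) * Math.Sqrt(sumYY));
    }
}

Called with the ratings used in the next example (2, 4, 4, 5 for the first movie and 3, 4.5, 3.5, 5 for the second), it returns approximately 0.8706, matching the Excel CORREL result quoted below.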

Applying this information to our problem (deriving recommendations for related products), consider the following:

• Assume there are two movies, The Lord of the Rings and The Chronicles of Narnia, that we wish to evaluate to see if they are similar (similarity being defined in this context as the likelihood that someone who likes one will also like the other).

• Assume users watched both movies and rated them, as given in the following table.

Name | The Lord of the Rings | The Chronicles of Narnia
Jack | 2 | 3
Mark | 4 | 4.5
Albert | 4 | 3.5
John | 5 | 5

    6 http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient.

    7 http://office.microsoft.com/en-us/excel-help/correl-HP005209023.aspx.


Using Excel's CORREL function, we calculate the Pearson correlation coefficient to be 0.8705715.

    Please refer to sample Excel file correlation-excel\lord of the rings.xlsx to play with the provided data.

    It is clear that the ratings for the two movies are strongly correlated. Now assume you have similar

    ratings for thousands of movies from millions of users. It should be possible to calculate the correlation

    coefficients for each pair of movies where ratings from the same user are available for both. Once these

    have been calculated, they can be loaded into a relational database system; we should be able to quickly

    look up the top N movies simply by looking at the pre-calculated correlation values.

    Note: There are other ways to calculate correlations, and it is entirely possible that one system is vastly

    superior to another for certain kinds of data. We use the Pearson product-moment correlation

    coefficient since it is one of the most commonly used and is easily calculated using Excel. Also, the

    method we use has a substantial number of shortcomings (dealing with sparse data is one). As stated

earlier, it does, however, serve as a useful example for understanding more complex uses of MapReduce.

    Consider the data set ratings.csv8 available in the folder named data, included with the sample code for

    this document. It has data in the following form.

Name of movie critic | Name of movie | Rating
Lisa Rose | Lady in the Water | 2.5
Lisa Rose | Snakes on a Plane | 3.5
Lisa Rose | Just My Luck | 3
Lisa Rose | Superman Returns | 3.5
Lisa Rose | You Me and Dupree | 2.5
Lisa Rose | The Night Listener | 3
Gene Seymour | Lady in the Water | 3
Gene Seymour | Snakes on a Plane | 3.5
Gene Seymour | Just My Luck | 1.5
Gene Seymour | Superman Returns | 5
Gene Seymour | The Night Listener | 3

    This data is only slightly different from the form we considered earlier. We need to obtain pairs of

    movies and obtain ratings for each of them by the same user. Once we do this, we will have a list of

    ratings for pairs of movies by the same users. We can then calculate the correlation coefficient for the

    pairs of movies.

    8 This data set and sample were adapted from the excellent book, Programming Collective Intelligence: Building

    Smart Web 2.0 Applications by Toby Segaran - http://www.amazon.com/Programming-Collective-Intelligence-Building-Applications/dp/0596529325.


C# implementation to calculate correlations

We have included a sample implemented in straight C# without the use of MapReduce (simple recommendation - cs console). This will help clarify the actual process of making these calculations. We will then perform the same calculations using Pig.

    The code that calculates correlations follows this procedure:

    1. It takes an array containing all ratings and the two movies for which correlation values should be

    calculated.

    2. It then gets a list of critics who have rated both movies using LINQ.

    3. It prepares two lists with parallel ratings.

    4. It then uses the LINQStatistics9 library to calculate the Pearson product-moment correlation

    coefficient from the two lists of data.

// Calculate the correlation between two movies using ratings by the same critic.
// The stronger the correlation, the more similar the two movies can be considered to be.
private static double CalculateCorrelation(MovieRating[] movieRatings, string movie1, string movie2)
{
    Console.WriteLine(movie1);
    Console.WriteLine(movie2);

    // Get the critics who have rated either movie.
    var items1 = movieRatings.Where(item => item.Movie == movie1).ToArray();
    var items2 = movieRatings.Where(item => item.Movie == movie2).ToArray();

    // Critics who have rated both movies.
    var commonItems = items1.Intersect(items2,
        new PropertyComparer<MovieRating>("Critic")).Select(item => item.Critic);

    // No common critics - the correlation is taken to be 0.0.
    if (!commonItems.Any())
        return 0.0;

    var ratings1 = new List<double>();
    var ratings2 = new List<double>();

    foreach (var critic in commonItems)
    {
        DumpWithOffset(critic);

        var r1 = items1.Where(i => i.Critic == critic).Select(i => i.Rating).First();
        ratings1.Add(r1);
        DumpWithOffset(r1);

        var r2 = items2.Where(i => i.Critic == critic).Select(i => i.Rating).First();
        ratings2.Add(r2);
        DumpWithOffset(r2);
    }

    return ratings1.Pearson(ratings2);
}

9 http://www.codeproject.com/Articles/42492/Using-LINQ-to-Calculate-Basic-Statistics

Once we are able to calculate the correlation coefficient for a pair of movies, obtaining recommendations is simply a matter of finding the movies with the highest correlation scores relative to the movie in question. The code is given in the following sample and is straightforward.

    In the code, the threshold parameter exists to ensure that movies with a very low or negative

    correlation are not picked up. This can certainly be a problem if your data set is very sparse and does not

    contain enough ratings. For our purpose, the threshold is set to -1. For practical use, it may need to be

    set to 0.5 or so.

public static Recommendation[] GetRelatedProducts(MovieRating[] movieRatings,
    string movie, double threshold = -1)
{
    var allMovies = movieRatings.Select(x => x.Movie).Distinct();

    var results = allMovies.Where(c => c != movie)
        .Select(c => new Recommendation()
        {
            Movie = c,
            Rating = Analysis.CalculateCorrelation(movieRatings, movie, c)
        })
        .Where(x => x.Rating > threshold)
        .OrderByDescending(x => x.Rating);

    return results.ToArray();
}
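For context, a driver for this code might look roughly like the following sketch. The parsing shown here is our own simplification and assumes MovieRating exposes settable Critic, Movie, and Rating properties and that the code runs where GetRelatedProducts is visible; the layout of ratings.csv is critic, movie, rating, as shown earlier, but the actual console sample may be organized differently.

// Hypothetical driver; requires System, System.Globalization, System.IO, and System.Linq.
var movieRatings = File.ReadLines("ratings.csv")
    .Select(line => line.Split(','))
    .Select(parts => new MovieRating
    {
        Critic = parts[0],
        Movie = parts[1],
        Rating = double.Parse(parts[2], CultureInfo.InvariantCulture)
    })
    .ToArray();

foreach (var recommendation in GetRelatedProducts(movieRatings, "Superman Returns"))
    Console.WriteLine("{0}\t{1}", recommendation.Movie, recommendation.Rating);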

    Running the program to obtain the top related movies based on the movie Superman Returns provides

    the following output:

You Me and Dupree	0.657951694959769
Lady in the Water	0.487950036474267
Snakes on a Plane	0.111803398874989
The Night Listener	-0.179847194799054
Just My Luck	-0.422890031611031

Simple Recommendation System Using Pig

Let us now analyze the same data set using Pig. Pig is termed a data-flow language. It allows us to express our processing requirements as a series of transformations, the result of one flowing into another. Pig then translates our specifications into Map and Reduce tasks.

In our opinion, it is similar to LINQ; once you play around with a few samples, you will have a good idea of how it works. The key concepts are explained below. We will stick to explaining these concepts in a manner that makes sense for our samples and will not stray into additional details. If you need a complete introduction to Pig, we recommend Programming Pig by Alan Gates10.

Load and store

Pig can load and store data from or to HDFS and other data sources. Pig can load files containing comma-separated or tab-delimited data, and it can handle several other forms of data as well. It is also possible to extend Pig to support custom data sources.

Relation

Pig works with collections of data that it refers to as relations. A relation is not to be confused with a relational database relation. A relation in Pig terminology is simply a collection of data. For our current context, it is best to think of a relation as similar to a table with rows and columns of data. When grouped, relations can also contain keys with an associated collection of values for each unique key.

10 http://www.amazon.com/Programming-Pig-Alan-Gates/dp/1449302645


Joins

Pig can accomplish joins in a manner that is conceptually intuitive for users who have worked with relational data. It can join two relations using a common key.

Filter

Pig can apply filters to data. A provided predicate is checked to see whether data should be included or excluded.

Projection

Pig can project from an existing collection in a manner similar to the SQL Select statement. Pig's equivalent statement is named Generate.

Grouping

Pig can group data by one or more keys. Once grouped, you do not have to flatten the resulting data. You can maintain a hierarchical structure with keys and lists of values related to the keys. These can then be projected as needed.

Dump

Pig includes a Dump statement that can dump the contents of a relation to the console. Dump is useful when working with Pig since you can run commands without writing the results to disk.
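Since the text compares Pig to LINQ, the following C# fragment sketches rough LINQ counterparts for the operations just described. The mapping is conceptual only: Pig runs these operations as distributed MapReduce jobs, not in memory, and the tiny data set here is purely illustrative.

using System;
using System.Linq;

class PigConceptsInLinq
{
    static void Main()
    {
        // Illustrative in-memory "relation" of (critic, movie, rating) rows.
        var ratings = new[]
        {
            new { Critic = "Lisa Rose", Movie = "Lady in the Water", Rating = 2.5 },
            new { Critic = "Lisa Rose", Movie = "Snakes on a Plane", Rating = 3.5 }
        };

        // JOIN   ~ Join: pair rows from two relations that share a key (here, the critic).
        var joined = ratings.Join(ratings, r => r.Critic, r => r.Critic,
            (left, right) => new { left, right });

        // FILTER ~ Where: keep only rows matching a predicate.
        var filtered = joined.Where(pair => pair.left.Movie != pair.right.Movie);

        // GENERATE (projection) ~ Select: reshape each row.
        var projected = filtered.Select(pair => new { Movie1 = pair.left.Movie, Movie2 = pair.right.Movie });

        // GROUP  ~ GroupBy: collect rows under each unique key.
        var grouped = projected.GroupBy(pair => new { pair.Movie1, pair.Movie2 });

        // DUMP   ~ writing the relation to the console (LOAD/STORE would be file I/O).
        foreach (var group in grouped)
            Console.WriteLine("{0} / {1}: {2} row(s)", group.Key.Movie1, group.Key.Movie2, group.Count());
    }
}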

Pig script that analyzes movie ratings

In the following sections, we explain the Pig code (in the sample simple recommendation - pig\recommend.pig) that analyzes the ratings document to calculate correlations between all pairs of movies, as we did with C#.

Load the data from HDFS

The first step is to load the data from HDFS. We use Pig's load statement. Since the file we are processing (ratings.csv) is in CSV format, we pass the comma as the separator to PigStorage (the default load mechanism).

We have to load the same data twice in order to do a self-join. In the future, it may be possible to work with two references to the same relation, but Pig does not currently work this way.

ratings1 = load 'recommend/ratings.csv' using PigStorage(',') as (critic:chararray, movie:chararray, rating:double);
ratings2 = load 'recommend/ratings.csv' using PigStorage(',') as (critic:chararray, movie:chararray, rating:double);

Obtain a list of unique movie combinations

In order to obtain a list of unique movie combinations, we first do a self-join by the name of the critic. For each record, this gives us a complete set of records with ratings by the same critic. We then have to filter out records with duplicate movie names.

As an example, consider the first record in the data set.

Critic's name | Movie name | Rating
Lisa Rose | Lady in the Water | 2.5


This is the complete list of ratings by Lisa Rose:

Lisa Rose | Lady in the Water | 2.5
Lisa Rose | Snakes on a Plane | 3.5
Lisa Rose | Just My Luck | 3
Lisa Rose | Superman Returns | 3.5
Lisa Rose | You Me and Dupree | 2.5
Lisa Rose | The Night Listener | 3

After the join, the result of joining just the first row with the second relation should appear as seen in the following table, which repeats the first row and combines it with each of Lisa Rose's ratings.

Lisa Rose | Lady in the Water | 2.5 | Lisa Rose | Lady in the Water | 2.5
Lisa Rose | Lady in the Water | 2.5 | Lisa Rose | Snakes on a Plane | 3.5
Lisa Rose | Lady in the Water | 2.5 | Lisa Rose | Just My Luck | 3
Lisa Rose | Lady in the Water | 2.5 | Lisa Rose | Superman Returns | 3.5
Lisa Rose | Lady in the Water | 2.5 | Lisa Rose | You Me and Dupree | 2.5
Lisa Rose | Lady in the Water | 2.5 | Lisa Rose | The Night Listener | 3

As you can see, the first row pairs the movie with itself and needs to be filtered from our result. After filtering, the results derived from the first row of data will appear as seen in the following table.

Lisa Rose | Lady in the Water | 2.5 | Lisa Rose | Snakes on a Plane | 3.5
Lisa Rose | Lady in the Water | 2.5 | Lisa Rose | Just My Luck | 3
Lisa Rose | Lady in the Water | 2.5 | Lisa Rose | Superman Returns | 3.5
Lisa Rose | Lady in the Water | 2.5 | Lisa Rose | You Me and Dupree | 2.5
Lisa Rose | Lady in the Water | 2.5 | Lisa Rose | The Night Listener | 3

The Pig code that effects this transformation is:

combined = JOIN ratings1 BY critic, ratings2 BY critic;


The filter operation removes combinations where the two movies are identical; because it keeps only pairs in which the first movie name sorts before the second, each combination of movies also appears only once. One point to note is that, after a join, we refer to fields using the form original_relation_name::field_name.

filtered = FILTER combined BY ratings1::movie < ratings2::movie;

Project the data to a more usable form

We project the results of the join, properly naming each field in the process. This will make it easier to continue processing the data.

movie_pairs = FOREACH filtered GENERATE ratings1::critic AS critic1,
    ratings1::movie AS movie1,
    ratings1::rating AS rating1,
    ratings2::critic AS critic2,
    ratings2::movie AS movie2,
    ratings2::rating AS rating2;

Obtain groups containing ratings for each pair of movies

This is achieved by simply grouping the relation by both movie fields.

grouped_ratings = group movie_pairs by (movie1, movie2);

Calculating correlations

We now have all the information we need to calculate correlations. Pig offers built-in support for this through the COR function, and we make use of the pairs of ratings gathered during the grouping.

COR returns a list of records with additional information besides the correlation value. We use the FLATTEN statement to flatten the results from COR into a single tuple of data.

correlations = foreach grouped_ratings generate group.movie1 as movie1, group.movie2 as movie2,
    FLATTEN(COR(movie_pairs.rating1, movie_pairs.rating2)) as (var1, var2, correlation);

Project final results

Now, we just need the names of the movies and the correlation coefficient.

results = foreach correlations generate movie1, movie2, correlation;

Dump final results for review

We do not store the results in this case; we simply dump them to the console for review.

dump results;


Running the script

The script we just walked through is available at simple recommendation - pig\recommend.pig. Follow these steps to run it.

1. Create a folder on HDFS named recommend:
   hadoop fs -mkdir recommend
2. Upload data\ratings.csv to HDFS:
   hadoop fs -put ratings.csv recommend
3. Run the script with the command:
   pig recommend.pig
4. Though the data set is small, the script will take a while to complete since there is fixed processing overhead in setting up the MapReduce jobs.
5. When it finishes, you should see the results dumped to the console.

    Results

    (Just My Luck,Superman Returns,-0.42289003161103106)

    (Just My Luck,Lady in the Water,-0.944911182523068)

    (Just My Luck,Snakes on a Plane,-0.3333333333333333)

    (Just My Luck,You Me and Dupree,-0.4856618642571827)

    (Just My Luck,The Night Listener,0.5555555555555556)

    (Superman Returns,You Me and Dupree,0.657951694959769)

    (Superman Returns,The Night Listener,-0.1798471947990542)

    (Lady in the Water,Superman Returns,0.4879500364742666)

    (Lady in the Water,Snakes on a Plane,0.7637626158259734)

    (Lady in the Water,You Me and Dupree,0.3333333333333333)

    (Lady in the Water,The Night Listener,-0.6123724356957946)

    (Snakes on a Plane,Superman Returns,0.11180339887498948)

    (Snakes on a Plane,You Me and Dupree,-0.6454972243679028)

    (Snakes on a Plane,The Night Listener,-0.5663521139548541)

    (The Night Listener,You Me and Dupree,-0.25)

    If you compare these results with the results from the C# version, you will observe that they are

    identical.

Applying the same concepts to a much larger set of data

We can apply the same system to a much larger data set: the MovieLens ratings data set available from http://www.grouplens.org/datasets/movielens/.


The MovieLens site offers rating data sets with 100,000, one million, and 10 million records11. The structure of the data set is explained below. We are only interested in the u.item and u.data files.

Structure of u.item

Each line in this file has information on a specific movie. The only two fields that we will end up using are the first two, containing the unique ID of the movie and the name of the movie.

Movie ID | Movie name | Several other fields (unused)
1 | Toy Story (1995) | NA

Structure of u.data

Each line in this file has an identifier for the critic, the movie they rated, and the rating they gave. There is also a column with timestamp information, which is not needed for our purpose.

Critic ID | Movie ID | Rating | Timestamp (unused)
196 | 242 | 3 | 881250949

    The complete Pig script used to perform this analysis is given in the following sample. Observe that it is

    similar to the script we used with the smaller data set. The only major difference is that we perform an

    extra join to include movie names as part of the results since the names are stored in a separate u.item

    file.

-- Load MovieLens file u.data twice since we need to do a self-join to obtain unique pairs of movies as before.
-- This script assumes you have uploaded the u.data and u.item files into a folder named movielens on HDFS.
ratings1 = load 'movielens/u.data' as (critic:long, movie:long, rating:double);
ratings2 = load 'movielens/u.data' as (critic:long, movie:long, rating:double);

-- Join by critic
combined = JOIN ratings1 BY critic, ratings2 BY critic;

-- Since movies are identified by an ID of type long, we filter out cases where both movies are identical.
-- The resulting relation contains unique pairs of movies.
filtered = FILTER combined BY ratings1::movie != ratings2::movie;

11 We ran our test on a single machine with the 100k record data set. For larger data sets, it may be better to run on a true cluster, one that you set up and configure locally. Or better yet, use one that you set up on Windows Azure's HDInsight service: http://www.windowsazure.com/en-us/services/hdinsight/.


-- Project intermediate results
movie_pairs = FOREACH filtered GENERATE ratings1::critic AS critic1,
                                        ratings1::movie  AS movie1,
                                        ratings1::rating AS rating1,
                                        ratings2::critic AS critic2,
                                        ratings2::movie  AS movie2,
                                        ratings2::rating AS rating2;

-- Group by pairs of movie IDs
grouped_ratings = group movie_pairs by (movie1, movie2);

-- Calculate the correlation in rating values
correlations = foreach grouped_ratings generate group.movie1 as movie1,
                                                group.movie2 as movie2,
    FLATTEN(COR(movie_pairs.rating1, movie_pairs.rating2)) as (var1, var2, correlation);

-- Project results, removing fields that we do not need
results = foreach correlations generate movie1, movie2, correlation;

-- Load item names and do a join to get actual names instead of just ID references.
-- Notice that the separator between fields is a '|' in the file u.item.
movies = load 'movielens/u.item' using PigStorage('|') as (movie:long, moviename:chararray);

-- Get the name of the first movie
named_results = JOIN results BY movie1, movies BY movie;

-- Get the name of the second movie
named_results2 = JOIN named_results BY results::movie2, movies BY movie;

-- Write the results to HDFS.
-- Please ensure that this folder does not exist.
-- Remove it with "hadoop fs -rmr movielensout" if it exists.


    STORE named_results2 INTO 'movielensout';

Running the script

1. Download and extract the 100k data set from the MovieLens website,[12] http://www.grouplens.org/datasets/movielens/. These files are not included with the provided sample code.

2. The data set contains several files. Only two of them are required for our immediate needs: u.data and u.item.

3. Upload the u.data and u.item files to HDFS:

   hadoop fs -mkdir movielens
   hadoop fs -put u.data u.item movielens

4. Run the script with the command: pig movielens.pig. The script will take a while to complete. Our tests on a single-machine pseudo-cluster took about 10 minutes with the 100k data set.

5. The results are written to a folder named movielensout by default. If you run the script more than once, be sure to remove this folder (hadoop fs -rmr movielensout) before running the script. Pig will complain if the output folder already exists.

6. Once the script completes, you can copy the results to your local file system for review:

   hadoop fs -get movielensout/part*

    A portion of the output we obtained from this job is shown in the following table.

Movie 1 ID | Movie 2 ID | Correlation Coefficient | Movie 1 ID (repeated[13]) | Movie 1 Name | Movie 2 ID (repeated) | Movie 2 Name
1469 | 4 | -0.6651330399133046 | 1469 | Tom and Huck (1995) | 4 | Get Shorty (1995)
1489 | 4 | 0.8703882797784892  | 1489 | Chasers (1994) | 4 | Get Shorty (1995)
1510 | 4 | NaN                 | 1510 | Mad Dog Time (1996) | 4 | Get Shorty (1995)
1475 | 4 | -0.3273268353539886 | 1475 | Bhaji on the Beach (1993) | 4 | Get Shorty (1995)
1419 | 4 | 0.6750771560841521  | 1419 | Highlander III: The Sorcerer (1994) | 4 | Get Shorty (1995)
1436 | 4 | 1.0                 | 1436 | Mr. Jones (1993) | 4 | Get Shorty (1995)
1656 | 4 | NaN                 | 1656 | Little City (1998) | 4 | Get Shorty (1995)

[12] MovieLens usage terms prohibit distribution of the data. You will have to download a copy yourself in order to test this script.

[13] Repeated due to the joins with u.item. We should have added a projection, but we did not do so to keep the code succinct.


Looking at a couple of the result rows (the Chasers and Tom and Huck rows above), it is a safe bet to recommend Get Shorty to those who like Chasers, and a bad idea to recommend Get Shorty to those who like Tom and Huck.

The output contains repeated fields as well as several NaN values. It would be a good exercise to modify the Pig script so that NaN values, which appear due to a lack of common ratings for a pair of movies, are removed. You can also modify the projections so that duplicate fields are removed; one possible approach is sketched below.
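The following lines are a minimal sketch of that exercise, not part of the original script. They assume the results and named_results2 relations defined above and the field order produced by the two joins, and they use a cast to chararray to detect NaN correlations, which is one convenient way to express this check in Pig. The relation names clean_results and final_results are ours.

-- Sketch: remove pairs whose correlation is NaN.
-- Casting the double to chararray turns NaN into the string 'NaN'.
clean_results = FILTER results BY (chararray)correlation != 'NaN';

-- The two joins above would then use clean_results instead of results.
-- After the second join, keep only the fields we care about; positional
-- references sidestep the duplicated ID columns ($4 and $6 are the two
-- movie names and $2 is the correlation, given the join order used above).
final_results = FOREACH named_results2 GENERATE $4 AS movie1name,
                                                $6 AS movie2name,
                                                $2 AS correlation;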

    The content thus far should have given you a good overview of the fundamentals of working with

    Hadoop/HDInsight. You should now have enough of an understanding of the general environment

    related to big data to briefly review some related topics.

The Role of Traditional BI

Analysis with Hadoop is a batch process, and such analysis takes time. While Hadoop can become an invaluable part of your extract-transform-load (ETL) pipeline, in many instances you still need to store the final output in relational form or in a data warehouse of some sort so that it can be accessed on demand. In fact, this is exactly what we would do with the movie recommendations we calculated: store them in a relational database, update the information at regular intervals as new ratings come in, and then serve movie recommendations on demand. One simple way to hand the Pig output to a relational loader is sketched below.
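As one illustration, the STORE step at the end of the script could write comma-separated text that a bulk loader (for example, SQL Server's bcp utility or BULK INSERT) can then import into a table. This is a hedged sketch under assumptions: the relation name final_results and the folder recommendations_csv are hypothetical, and PigStorage(',') is simply the built-in storage function with a comma delimiter.

-- Sketch only: write the final results as comma-separated text on HDFS so a
-- relational bulk loader can import them into a table on a schedule.
STORE final_results INTO 'recommendations_csv' USING PigStorage(',');
-- Afterwards, copy the output locally with: hadoop fs -get recommendations_csv/part*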

In our opinion, traditional Business Intelligence tools do not lose their importance. They just get much more powerful by harnessing the capabilities of the Hadoop ecosystem. This is precisely what we expect to see happening on the Microsoft Business Intelligence stack as well: Microsoft and other vendors will make it easier to integrate Hadoop with SQL Server and SQL Server Analysis Services.

Data Mining Post-ETL

The ability to perform large-scale ETL on data that was previously unavailable or difficult to process opens up several additional avenues for data mining. Common data mining tasks, such as dependency modeling, clustering, classification, anomaly detection, regression, and aggregation, depend on access to good data from multiple sources. The ability to process big data now provides additional data sources that can be used with these tasks.

The famous example[14] of Target learning that a customer was pregnant by analyzing and modeling shopping habits was likely achieved because Target was able to integrate not just transactional data, but additional data sources such as web activity logs. The combined model is likely superior to one built from a smaller segment of the available data.

It is important to understand that the actual data mining environment does not have to be capable of handling big data; the mining process can still be performed using traditional tools. The open source R environment is a powerful tool for data mining, and SQL Server provides a solid set of data mining tools that is already available to many organizations. Used in tandem with Hadoop, R and SQL Server can build compelling models that predict customer spending, customer actions, fraud, machine failure, and much more.

[14] http://www.nytimes.com/2012/02/19/magazine/shopping-habits.html?_r=0


Data Mining with Big Data

It is also possible to model directly on big data. The open source Mahout[15] project implements several algorithms (including a recommendation system suitable for production use) in a distributed manner that integrates with Hadoop. A complete list of currently implemented algorithms is available at https://cwiki.apache.org/confluence/display/MAHOUT/Algorithms.

Big Data Processing Is Not Just for Big Data

The processing methods we have seen apply to a broad swath of data that does not necessarily have to be big; they are powerful and useful for much smaller data sets as well, especially when the data is not in a structured format. As a small illustration, the word count example from earlier in this paper can be expressed in a few lines of Pig, as sketched below.
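The following is a minimal sketch and is not one of the provided samples: the input path input/books.txt is hypothetical, while TOKENIZE, FLATTEN, COUNT, GROUP, and STORE are standard built-in Pig operations.

-- Minimal Pig word count (sketch). The input path is hypothetical.
lines  = LOAD 'input/books.txt' AS (line:chararray);
-- TOKENIZE splits each line into a bag of words; FLATTEN unnests the bag.
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS occurrences;
STORE counts INTO 'wordcount_out';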

Conclusion: Harnessing Your Data Is Easier Than You Think

There has been a lot of talk about big data, Hadoop, and data science. Having a good understanding of the environment surrounding big data will help you make the right decisions, whether as a developer or a business person.

At Syncfusion, we sincerely believe getting your big data strategy right is not hard. It just requires a solid understanding of the fundamentals and a willingness to push boundaries and test how the adoption of new strategies can make your business more competitive. The future belongs not to those who have more data, but to those who put their data to good use. We close with this statement by Tim O'Reilly in a Google+ conversation[16]:

    "Companies that have massive amounts of data without massive amounts of clue are going to be displaced by startups that have less data but more clue."

How Can Syncfusion Help?

Syncfusion has been working with HDInsight since the earliest releases of the product. We also have extensive experience with traditional business intelligence, data mining, the R environment, data visualization, and enterprise reporting.

Syncfusion's solutions team can implement big data solutions end to end. Contact us today to learn more.

[15] http://mahout.apache.org/

[16] https://plus.google.com/+TimOReilly/posts/4Xa76AtxYwd


Contact information

Syncfusion, Inc.

    2501 Aerial Center Parkway

    Suite 200

    Morrisville, NC 27560

    USA

    [email protected]


Appendix A: Installing and Configuring HDInsight on a Single Node (Pseudo-Cluster)

1. Install HDInsight (developer preview version as of October 24, 2013) from http://www.microsoft.com/web/gallery/install.aspx?appid=HDINSIGHT-PREVIEW.

2. Install according to the installation package prompts.

3. Once installation is complete, ensure that the services installed by HDInsight are running. They are not set to run by default, so you will have to start them manually or change their settings so they start automatically.

4. The installer creates a shortcut to a command-line environment configured correctly for running Hadoop. Navigate to this shortcut and start the environment.


Appendix B: Configuring NetBeans for HDInsight Development on Windows

1. Install version 6 of the Java SDK from http://www.oracle.com/technetwork/java/javasebusiness/downloads/java-archive-downloads-javase6-419409.html.

2. Install the latest version of NetBeans from https://netbeans.org/downloads/.

3. Once installed, you can open the included word count Java sample in NetBeans.

4. Once you open the project, select project properties by right-clicking the project name, as shown in the following image, and selecting Properties.

5. The following dialog will be displayed. Select Libraries and check whether JDK 1.6 (the version of the JDK that corresponds to Java 6) is selected. If JDK 1.6 is not selected, please select it. If JDK 1.6 is not listed, click Manage Platforms.


6. Click Add Platform in the next dialog that is displayed.


7. A selection dialog will then be displayed; point it to the location of the JDK, as shown in the following image.

8. Now, make sure JDK 1.6 is selected as the platform, and close the selection dialog. The project should then display JDK 1.6 under the Libraries tree entry.


9. The word count Java project already contains a reference to the hadoop-core-1.1.0-SNAPSHOT.jar file. In new projects that you create, you should include a reference to this library (installed by HDInsight to {install disk}:\Hadoop\hadoop-1.1.0-SNAPSHOT\hadoop-core-1.1.0-SNAPSHOT.jar). You may have to add additional library references if you use additional features; please consult the included documentation for this information.

10. Once these settings are in place, you should be able to build the project using the Run > Build Project menu option. A JAR file will be created under a folder named dist inside the main project folder. This JAR file can be deployed to Hadoop clusters.