
Ignore HDInsight at Your Own Peril: Everything You Need to Know

by Daniel Jebaraj

Contents

Ignore HDInsight at Your Own Peril: Everything You Need to Know
    Abstract
    Introduction
    Storage and Analysis of Big Data
    Hadoop/HDInsight
        Scalable Storage: Hadoop Distributed File System
    Scalable processing
        MapReduce
            Map
            Shuffle
            Reduce
        MapReduce sample: Java implementation of word count
            Prerequisites
            Compiling provided Java sample
            Upload the input text document to HDFS
        C# implementation of word count
            Review results
            Important notes
            C# Mapper
            C# Reducer
    MapReduce the Easy Way
    Building a simple recommendation engine
        Perfectly correlated data
        Uncorrelated data
        C# implementation to calculate correlations
    Simple Recommendation System Using Pig
        Load and store
        Relation
        Joins
        Filter
        Projection
        Grouping
        Dump
        Pig script that analyzes movie ratings
            Load the data from HDFS
            Obtain a list of unique movie combinations
            Project the data to a more usable form
            Obtain groups containing ratings for each pair of movies
            Calculating correlations
            Project final results
            Dump final results for review
            Running the script
            Results
        Applying the same concepts to a much larger set of data
            Structure of u.item
            Structure of u.data
            Running the script
    The Role of Traditional BI
    Data Mining Post-ETL
    Data Mining with Big Data
    Big Data Processing Is Not Just for Big Data
    Conclusion: Harnessing Your Data Is Easier Than You Think
    How Can Syncfusion Help?
    Contact information
    Appendix A: Installing and Configuring HDInsight on a Single Node (Pseudo-Cluster)
    Appendix B: Configuring NetBeans for HDInsight Development on Windows


    Ignore HDInsight at Your Own Peril: Everything You Need to Know

Abstract

HDInsight is a Microsoft-provided distribution of Apache Hadoop, adapted for and tested on the Windows platform. It can be deployed to private, self-hosted clusters or accessed as a service on the Windows Azure1 cloud. It is currently available as a public preview and is expected to be released in the fourth quarter of 2013.

HDInsight brings truly scalable processing of structured and unstructured data to the Windows platform. This white paper covers all you need to know about Hadoop, MapReduce, and HDInsight in order to harness the benefits of big data.

    This white paper is suitable for both developers and managers. You could skip over the included code

    and still get a working understanding of the overall system. For developers, we have included several

    working samples that will help you understand the environment.

    1 http://www.windowsazure.com/en-us/services/hdinsight/


Introduction

There has been a virtual explosion in the amount of data being created. Not very long ago, transactional information was the main source of data, and only a few large organizations accumulated unwieldy amounts of it. The need to store and process such amounts of data was not a common business requirement for most organizations.

    Now, the situation has changed dramatically. Organizations have woken up to the reality that huge

    amounts of data are being generated on a daily basis by people and machines.

    Consider some examples:

• Activity logs generated by customers browsing websites.

• Logs generated by complex, field-deployed machinery, recording key measurements several times a second.

• Signals generated on social media related to a company's products and services.

Such data can be "big": difficult to store and process using traditional methods. This unwieldiness is what distinguishes big data from other data.

    In spite of storage and processing difficulties, big data offers potentially huge business value. It provides

    an opportunity to gather insight concerning business activities in ways previously not possible. Customer

    web log information, for instance, can be used to predict valuable trends that an organization may not

    otherwise be aware of. Machine-generated information can be used to predict failure, probability of

    accidents, and other such events long before they happen. Social media signals can be used to predict

    the failure or success of specific marketing initiatives.

    This white paper focuses on the storage and processing of big data using the HDInsight distribution (the

    terms Hadoop and HDInsight are used interchangeably). Understanding this is critical to harnessing big

    data and putting it to use to further business goals.

Storage and Analysis of Big Data

Though most organizations have realized that they are accumulating more data than ever before, only a minority have implemented a storage and analysis strategy for such data. There is good reason for this.

    The storage of transactional data in relational database systems has been well understood for several

    decades. The tools that are used for this purpose, such as SQL Server and Oracle, are well understood.

    Relational data is stored in a highly structured format and processed using SQL. The warehousing of this

    data to build online analytical processing (OLAP) systems is also understood well.

On the other hand, unstructured data (especially in huge quantities) is by nature very different. It has no predefined structure. Relational databases are not suitable for storing it, and storage on traditional file systems is also problematic since the size of the data often exceeds the capabilities of single machines.


Hadoop/HDInsight

Against this backdrop, Hadoop has gained broad acceptance as an effective storage and processing

    mechanism for big data. Hadoop is an open-source implementation of systems that Google

    implemented2 internally to solve big data problems related to storing indexes for web scale data.

    Hadoop at its core has two pieces: one for storing large amounts of unstructured data in a cost-effective

    manner and another for processing large amounts of data in a cost-effective manner.

• The data storage solution is named the Hadoop Distributed File System (HDFS).

• The processing solution is an implementation of the MapReduce programming model documented by Google.

Scalable Storage: Hadoop Distributed File System

HDFS is a file system designed to store large amounts of data economically. It does this by storing data on multiple commodity machines, scaling out instead of scaling up.

    HDFS, for all its underlying magic, is simple to understand at a conceptual level.

• Each file that is stored by HDFS is split into large blocks (typically 64 MB each, but this setting is configurable).

• Each block is then stored on multiple machines that are part of the HDFS cluster. A centralized metadata store has information on where individual parts of a file are stored.

• Considering that HDFS is implemented on commodity hardware, machines and disks are expected to fail. When a node fails, HDFS will ensure that the data blocks the node held are replicated to other systems.

    This scheme allows for the storage of large files in a fault-tolerant manner across multiple machines.

    2 http://research.google.com/archive/mapreduce.html


[Figure: HDFS visually]

    In HDFS, the metadata store is typically on a machine referred to as the name node. The nodes where

    data is stored are referred to as data nodes. In the previous diagram, there are three data nodes. Each

    of these nodes contains a copy of each block of data that is stored on the HDFS cluster. A production

    implementation of HDFS will have many more nodes, but the essential structure still applies.

    The data blocks stored on individual machines also play an important role in efficiently processing data

    by the implementation of MapReduce in Hadoop, but we will have more to say about that shortly.

Scalable processing

Before we discuss MapReduce, it will be helpful to carefully consider the issues associated with scaling out the processing of data across multiple machines. We will do this using a simple example: assume we have a text file, and we would like an individual count of each word that appears in that text file.

This is the pseudo-code for a simple word-counting program that runs on a single machine:

1. Open the text file for reading, and read each line.
2. Parse each line into words.
3. Increment and store the count of each word as it appears, in a dictionary or similar structure.
4. Close the file and output summary information.
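For reference, the following is a minimal C# sketch of this single-machine approach. It is illustrative only (the input file name is hypothetical) and is not part of the whitepaper's sample code.

using System;
using System.Collections.Generic;
using System.IO;

class SingleMachineWordCount
{
    static void Main()
    {
        // Dictionary that accumulates the count of each word.
        var counts = new Dictionary<string, int>();

        // Open the text file for reading and read each line (the path is illustrative).
        foreach (string line in File.ReadLines("input.txt"))
        {
            // Parse the line into words.
            foreach (string word in line.Split(new[] { ' ', '\t' },
                StringSplitOptions.RemoveEmptyEntries))
            {
                // Increment and store the count of the word.
                int current;
                counts.TryGetValue(word, out current);
                counts[word] = current + 1;
            }
        }

        // Output summary information.
        foreach (var pair in counts)
            Console.WriteLine("{0}\t{1}", pair.Key, pair.Value);
    }
}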

    Simple enough. Now consider that you have several gigabytes (maybe petabytes) of text files. How will

    we modify the simple program described above to process this kind of information by scaling out3 across

    multiple machines?

    Some issues to consider:

• Data storage: The system should provide a way to store the data being processed.

• Data distribution: The system should be able to distribute data to each of the processing nodes.

• Parallelizable algorithm: The processing algorithm should be parallelizable. Each node should be able to run without being held up by another during any given stage. If nodes have to share data, the difficulties associated with the synchronization of such data must be considered.

• Fault tolerance: Any of the nodes processing data can fail. For the system to be resilient to the failure of individual nodes, the failure should be promptly detected and the work delegated to another available node.

• Aggregation: The system should have a way to aggregate results produced by individual nodes to compute final summaries.

• Storage of results: The final output can itself be substantial; there should be an effective way to store and retrieve it.

3 Scale up vs. scale out: It will not be ideal to implement such a processing system on a single machine. A powerful machine can certainly process gigabytes of text, but there is a limit to this kind of scaling.

    As we consider these aspects, it is evident that implementing a custom version of a truly scalable parallel

    system across multiple machines is not a trivial task, even for a problem as simple as counting words.

    Hadoop makes scaling out processing easier by implementing solutions to these issues, summarized in

    the following table.

Issue considered | The Hadoop solution
Data storage | HDFS
Data distribution | Hadoop keeps data distribution between nodes to a minimum. It instead moves processing code to each node and processes the data where it is already available on disk.
Parallelizable algorithm | We will study this aspect in more detail, but in essence, as of version 1.x, Hadoop mandates the MapReduce programming model to enable a scalable processing model.
Fault tolerance | Hadoop monitors data storage nodes and will add replicas as nodes become unavailable. Hadoop also monitors tasks assigned to nodes and will reassign work if a node becomes unavailable.
Aggregation | This is accounted for in a distributed manner through the Reduce stage, as we will see in the next section.
Storage of results | HDFS


MapReduce

We have seen that Hadoop, as of version 1.x4, mandates the MapReduce programming model. MapReduce is a functional programming model that moves away from shared resources and the related synchronization or contention issues. It instead uses simple parts that are inherently scalable to achieve complex solutions.

Google's paper on MapReduce provides the following description:

    MapReduce is a programming model and an associated implementation for

    processing and generating large data sets. Users specify a map function that

    processes a key/value pair to generate a set of intermediate key/value pairs, and a

    reduce function that merges all intermediate values associated with the same

    intermediate key. Many real-world tasks are expressible in this model.

    Programs written in this functional style are automatically parallelized and

    executed on a large cluster of commodity machines. The run-time system takes

    care of the details of partitioning the input data, scheduling the program's

    execution across a set of machines, handling machine failures, and managing the

    required inter-machine communication. This allows programmers without any

    experience with parallel and distributed systems to easily utilize the resources of a

    large distributed system.

The MapReduce programming model is not hard to understand, especially if we study it using a simple example. MapReduce as implemented in Hadoop comprises three stages, which we will now look at in detail.

Map

The Map stage takes input in the form of a key and a value, processes the input, and then outputs another key and value. In this sense, it is no different from the implementation of Map in many programming environments.

    Considering the word count example, a Map task is likely to follow these steps:

Input

• Key: A key identifying the value being provided to the Mapper. In the context of Hadoop and the word-counting problem, this key is simply the starting index of the text in the data block being processed. We can consider it to be opaque and ignore it.

• Value: A single line of text. Consider it a unit to be processed.

Processing (implemented by user code)

• Splits the provided line of text into individual words.

Output

For each word, output a key/value pair. The mechanism to output these values is provided by Hadoop.

    4 Version 2.0 introduces additional programming models.


• Key: A suitable key is the actual word detected.

• Value: A suitable value is 1. Think of this as a distributed marker that simply denotes that we saw a particular word once. It is important to distinguish this from a dictionary approach. With a dictionary, we would look up the current value and increment it by one. In our case, we do not do this. Every time we see a word, we simply mark that we have seen it again by outputting a 1. Aggregation will happen later.

Example walkthrough

Input to Mapper

Key: {Any number indicating the index within the block being processed}
Value: Twinkle, Twinkle Little Star

Output by Mapper

We assume that punctuation does not count in our context. Note that the word Twinkle was seen twice during processing, and therefore appears twice with 1 as the value and Twinkle as the key.

Key | Value
Twinkle | 1
Twinkle | 1
Little | 1
Star | 1

Shuffle

Once the Map stage is over, data collected from the Mappers (remember, there could be several Mappers operating in parallel) will be sent to the Shuffle stage. During the Shuffle stage, all values that have the same key are collected and stored as a conceptual list tied to the key under which they were registered.

In the word count example, assuming the single line of text we observed earlier was the only input, this is what the output of the Shuffle phase should be:

Key | List of values
Twinkle | 1, 1
Little | 1
Star | 1

    The Shuffle stage guarantees that data under a specific key will be sent to exactly one reducer (the next

    stage).

    Shuffle is not typically implemented by the application. Hadoop implements shuffle and guarantees that

    all data values that belong to a single key will be gathered together and passed to a single reducer. In

    the instance mentioned above, the key Twinkle will be processed by a single reducer. It will never be


    processed by more than one reducer. Data under different keys can of course be routed to different

    reducers.

Reduce

The reducer's role is to process the transformed data and output yet another key-value pair. This is the key-value pair that is actually written to the output. In the word count sample, the reducer can simply return the word as a key again, with the value being a summation of all the 1s that appear in the provided list of values. This is, of course, the number of times the word appeared in the text, which is the desired output.

Key | Value
Twinkle | 2
Little | 1
Star | 1

    The beauty of MapReduce is that once a problem is broken into MapReduce terms and tested on a small

    amount of data, you can be confident you have a scalable solution that can handle large volumes of

    similar data.
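Before looking at the real Hadoop code, the following hedged C# sketch mimics the three stages in memory with LINQ. It only illustrates the flow of key/value pairs; it is not how Hadoop actually executes a job.

using System;
using System.Linq;

class MapShuffleReduceIllustration
{
    static void Main()
    {
        string[] lines = { "Twinkle, Twinkle Little Star" };

        // Map: emit a (word, 1) pair for every word on every line.
        var mapped = lines
            .SelectMany(line => line.Split(new[] { ' ', ',' },
                StringSplitOptions.RemoveEmptyEntries))
            .Select(word => new { Key = word, Value = 1 });

        // Shuffle: gather all values that share the same key.
        var shuffled = mapped.GroupBy(pair => pair.Key);

        // Reduce: sum the list of 1s collected under each key.
        foreach (var group in shuffled)
            Console.WriteLine("{0}\t{1}", group.Key, group.Sum(pair => pair.Value));
    }
}

Running this prints Twinkle 2, Little 1, and Star 1, matching the walkthrough above.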

    We will now review a working implementation of the word count problem implemented using

    MapReduce in Java and C#.

    We chose to show the solution in both Java and C# since Java is the native language of the Hadoop

    environment. Other languages such as C# are supported by streaming through stdin and stdout, but Java

    is the language you will often turn to when reviewing available sample code or implementing more

    advanced Hadoop features. For this reason, it is a good idea to have a working knowledge of using Java

    with Hadoop.

MapReduce sample: Java implementation of word count

Prerequisites

1. Install HDInsight following the steps given in Appendix A: Installing and Configuring HDInsight on a Single Node (Pseudo-Cluster).
2. Download the sample code from https://bitbucket.org/syncfusion/hdinsightwp/src.
3. Install and configure the NetBeans IDE for Hadoop development as documented in Appendix B: Configuring NetBeans for HDInsight Development on Windows.

Compiling provided Java sample

Open and compile the Java word count sample, available in the sample folder word count java, using the NetBeans IDE.

Alternatively, you can use your favorite Java IDE or the Java command line, but keep the following in mind:

• You have to use the 64-bit version of Java 6 for compilation.
• You have to package the compiled class files in a JAR file for execution by the Hadoop environment.


    Once you have a compiled JAR file, please follow these steps to execute the sample:

Upload the input text document to HDFS

1. Start the Hadoop command line shell through the link that is created when you install HDInsight.
2. Navigate to the folder where you have installed the source for this article. Specifically, navigate to the folder named data.
3. Since HDFS is a virtual file system, Hadoop provides access to HDFS files through a shell. The shell offers several standard POSIX commands. To copy the data file required for this sample, use the following commands:

   hadoop fs -mkdir warpeace
   Creates a directory on HDFS named warpeace. The directory will be created under the user's home directory since we did not provide an absolute path.

   hadoop fs -put warpeace.txt warpeace
   Uploads warpeace.txt from the local file system to the HDFS file system.

   hadoop fs -ls warpeace
   Verifies that the file was uploaded correctly.

4. Next, navigate to the sample folder named word count (java\wordcount\dist). The compiled JAR file named wordcount.jar should be under this directory if you compiled using the NetBeans IDE.
5. Run this command: hadoop jar wordcount.jar warpeace warpeacecount
6. Hadoop will start running the job right away. If all goes well, you should see the task complete with no error messages.
7. Once the task completes with no errors, issue this command to see the output: hadoop fs -ls warpeacecount

   There are multiple files in the output. One contains status information, and another contains log information. The third file, with a name of the form part-#####, is the file of interest that contains the actual results. If there were multiple nodes working in parallel, we would see additional files that contain parts of the result.

8. Use this command to view the content of the output directory: hadoop fs -cat warpeacecount/part*
9. You should observe the word counts dumped to the console.


After running the Java MapReduce implementation of word count, take a look at the three parts of the code, reproduced here.

The Mapper

The Mapper takes lines of input and, for each word seen, emits the word as a key with the value 1, as described in our earlier walkthrough.

// Template arguments state the type of the input key/value and output key/value.
public static class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> oc, Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            oc.collect(word, one);
        }
    }
}

    Reducer

    The Reducer aggregates output from the Shuffle stage, as seen in the earlier walkthrough. It then

    outputs each word as a key with its total count as a value.

public static class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}

    Entry point with configuration information

    Hadoop requires some plumbing in order to submit a job, but this plumbing is straightforward. It uses

    console arguments to configure values, such as the input and output paths. Several other settings can

    also be specified if needed.

public static void main(String[] args) throws IOException {
    JobConf jobConf = new JobConf(Wordcount.class);
    jobConf.setJobName("Word Count example");

    jobConf.setOutputKeyClass(Text.class);
    jobConf.setOutputValueClass(IntWritable.class);

    jobConf.setMapperClass(Map.class);
    jobConf.setReducerClass(Reduce.class);

    jobConf.setInputFormat(TextInputFormat.class);
    jobConf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.addInputPath(jobConf, new Path(args[0]));
    FileOutputFormat.setOutputPath(jobConf, new Path(args[1]));

    JobClient.runJob(jobConf);
}

    If you review each line in the code sample and observe the results, you should have a working

    understanding of how MapReduce works.

C# implementation of word count

We will now take a look at the C# implementation of the same sample.

The C# sample is available in the folder named word count cs. Unlike the Java sample, the C# sample is configured to run itself; you do not, therefore, need to invoke Hadoop directly. Just know that you can have a self-running executable that starts a Hadoop job, or you can have a command start the job for you based on provided parameters (as we did with the Java sample).

    The C# sample, once it is done, will create a folder named warpeacecountcs with results identical to the

    Java version.

Review results

hadoop fs -cat warpeacecountcs/part*

Important notes

• The C# sample uses the Hadoop SDK available on CodePlex. We have included copies of the assemblies and files needed, so you will not need to build the SDK to work with the sample.

• If you do have issues running the C# sample, we recommend that you build the Hadoop SDK from source and then run the sample against the updated dependencies.

• Though the Hadoop SDK is also available through NuGet, we do not recommend going that route since we experienced some issues when building against the NuGet version. Building the SDK from source is the way to go if you have issues.

    The C# versions of the Mapper and Reducer are shown in the following sample. If you compare them

    with the Java version, you will see they have similar functionality.

C# Mapper

public class WordCountMapper : MapperBase
{
    public override void Map(string inputLine, MapperContext context)
    {
        try
        {
            string[] words = inputLine.Split(' ');
            foreach (string word in words)
                context.EmitKeyValue(word, "1");
        }
        catch (ArgumentException)
        {
            return;
        }
    }
}

C# Reducer

public class WordCountReducer : ReducerCombinerBase
{
    public override void Reduce(string key, IEnumerable<string> values,
        ReducerCombinerContext context)
    {
        context.EmitKeyValue(key, values.Count().ToString());
    }
}

MapReduce the Easy Way

We have looked at writing MapReduce the hard way with Java and C#. It is essential to understand how MapReduce works; for that reason, the Java and C# samples we just reviewed are useful. In practice, however, writing MapReduce jobs in C# or Java can be compared to writing software in assembly: you don't usually do it unless you absolutely have to.

    Several domain-specific languages exist for the specific purpose of authoring MapReduce jobs. Pig

    (http://pig.apache.org/) and Hive (http://hive.apache.org/) are two of the more commonly used

    languages.


    We will not work with Hive in this article, but we will spend a fair amount of time with Pig. Hive provides

    a SQL-like approach to specifying MapReduce jobs. If you are interested in Hive, we encourage you to

    check out material available online and the book Programming Hive5.

    Pig and Hive are both compelling environments for authoring MapReduce jobs. As developers, we prefer

    Pig since its syntax is closer to that of a programming language. If you come from a SQL background, you

    may prefer Hive. HDInsight has great support for both. You will not be at a disadvantage choosing one

    over the other for most tasks.

    In the next section, we will look into building a simple product recommendation engine using Pig. The

    task of building a product recommendation engine is a real-world, big data use case. We will simplify its

    specification and implementation in order to make it easier to understand, but the fundamental ideas

    will remain the same as those in actual use. Working through this sample will give you a good

    understanding of using Pig for complex MapReduce tasks.

Building a simple recommendation engine

Product recommendations are available on many popular sites such as Amazon.com and Netflix.com. When we review or buy a specific product, these websites usually offer helpful suggestions for other products that may be of interest. They use complex algorithms, tuned over years, to achieve these results. The underlying concepts, though, are quite simple. The key concept is that of a correlation between two pieces of data.

Consider the following table with two columns of data. Even a casual review tells us that one column moves in tandem with the other (it is simply the first column multiplied by a constant).

Perfectly correlated data

1	10
2	20
3	30
4	40
5	50
6	60
6	60
7	70
8	80
9	90
10	100

    On the other hand, consider the table below. The two columns are not related in an evident manner.

Uncorrelated data

1	3123
2	12321
3	3123
4	12312
5	5555555
6	2323123
6	123213
7	23123
8	12313
9	1231232
10	13

5 http://www.amazon.com/Programming-Hive-Edward-Capriolo/dp/1449319335

    A mathematical way to measure the extent of correlation is the Pearson product-moment correlation

    coefficient. You can read about the formula and complete details on Wikipedia6. The Pearson Coefficient

    can vary between -1 and 1, as summarized below.

Pearson product-moment correlation coefficient value | Comments
-1 | Perfectly correlated data, but as one value rises the other decreases.
0 | Uncorrelated data.
+1 | Perfectly correlated data.

    The Pearson Coefficient can be any number between these values. Please refer to the Microsoft Excel

    file named cor.xlsx, available under the folder correlation-excel in the sample code folder. It has simple

    examples of correlations. Excel has built-in support for calculating the Pearson product-moment

    correlation coefficient7.
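If you would rather see the calculation itself than rely on Excel or a statistics library, the following C# sketch computes the coefficient directly from its definition. It is an illustration we have added here, not part of the provided samples.

using System;
using System.Collections.Generic;
using System.Linq;

static class PearsonExample
{
    // r = sum((x - meanX) * (y - meanY)) /
    //     (sqrt(sum((x - meanX)^2)) * sqrt(sum((y - meanY)^2)))
    public static double Correlation(IList<double> x, IList<double> y)
    {
        if (x.Count != y.Count || x.Count == 0)
            throw new ArgumentException("Inputs must be non-empty and of equal length.");

        double meanX = x.Average();
        double meanY = y.Average();

        double sumXY = 0, sumXX = 0, sumYY = 0;
        for (int i = 0; i < x.Count; i++)
        {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sumXY += dx * dy;
            sumXX += dx * dx;
            sumYY += dy * dy;
        }

        return sumXY / (Math.Sqrt(sumXX) * Math.Sqrt(sumYY));
    }
}

Called with the ratings used in the next example (2, 4, 4, 5 for the first movie and 3, 4.5, 3.5, 5 for the second), it returns approximately 0.8706, matching the Excel CORREL result quoted below.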

Applying this information to our problem (deriving recommendations for related products), consider the following:

• Assume there are two movies, The Lord of the Rings and The Chronicles of Narnia, that we wish to evaluate to see if they are similar (similarity being defined in this context as the likelihood that someone who likes one will also like the other).

• Assume users watched both movies and rated them, as given in the following table.

Name | The Lord of the Rings | The Chronicles of Narnia
Jack | 2 | 3
Mark | 4 | 4.5
Albert | 4 | 3.5
John | 5 | 5

    6 http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient.

    7 http://office.microsoft.com/en-us/excel-help/correl-HP005209023.aspx.


Using Excel's CORREL function, we calculate the Pearson correlation coefficient to be 0.8705715.

    Please refer to sample Excel file correlation-excel\lord of the rings.xlsx to play with the provided data.

    It is clear that the ratings for the two movies are strongly correlated. Now assume you have similar

    ratings for thousands of movies from millions of users. It should be possible to calculate the correlation

    coefficients for each pair of movies where ratings from the same user are available for both. Once these

    have been calculated, they can be loaded into a relational database system; we should be able to quickly

    look up the top N movies simply by looking at the pre-calculated correlation values.

    Note: There are other ways to calculate correlations, and it is entirely possible that one system is vastly

    superior to another for certain kinds of data. We use the Pearson product-moment correlation

    coefficient since it is one of the most commonly used and is easily calculated using Excel. Also, the

    method we use has a substantial number of shortcomings (dealing with sparse data is one). As stated

earlier, it does, however, serve as a useful example for understanding more complex uses of MapReduce.

    Consider the data set ratings.csv8 available in the folder named data, included with the sample code for

    this document. It has data in the following form.

Name of movie critic | Name of movie | Rating
Lisa Rose | Lady in the Water | 2.5
Lisa Rose | Snakes on a Plane | 3.5
Lisa Rose | Just My Luck | 3
Lisa Rose | Superman Returns | 3.5
Lisa Rose | You Me and Dupree | 2.5
Lisa Rose | The Night Listener | 3
Gene Seymour | Lady in the Water | 3
Gene Seymour | Snakes on a Plane | 3.5
Gene Seymour | Just My Luck | 1.5
Gene Seymour | Superman Returns | 5
Gene Seymour | The Night Listener | 3

    This data is only slightly different from the form we considered earlier. We need to obtain pairs of

    movies and obtain ratings for each of them by the same user. Once we do this, we will have a list of

    ratings for pairs of movies by the same users. We can then calculate the correlation coefficient for the

    pairs of movies.

    8 This data set and sample were adapted from the excellent book, Programming Collective Intelligence: Building

    Smart Web 2.0 Applications by Toby Segaran - http://www.amazon.com/Programming-Collective-Intelligence-Building-Applications/dp/0596529325.


C# implementation to calculate correlations

We have included a sample implemented in straight C# without the use of MapReduce (simple recommendation - cs console). This will help clarify the actual process of making these calculations. We will then perform the same calculations using Pig.

    The code that calculates correlations follows this procedure:

    1. It takes an array containing all ratings and the two movies for which correlation values should be

    calculated.

    2. It then gets a list of critics who have rated both movies using LINQ.

    3. It prepares two lists with parallel ratings.

    4. It then uses the LINQStatistics9 library to calculate the Pearson product-moment correlation

    coefficient from the two lists of data.

// Calculate the correlation between two movies using ratings by the same critic.
// The stronger the correlation, the more similar the two movies can be considered to be.
private static double CalculateCorrelation(MovieRating[] movieRatings, string movie1, string movie2)
{
    Console.WriteLine(movie1);
    Console.WriteLine(movie2);

    // Get the critics who have rated either movie.
    var items1 = movieRatings.Where(item => item.Movie == movie1).ToArray();
    var items2 = movieRatings.Where(item => item.Movie == movie2).ToArray();

    // Critics who have rated both movies.
    var commonItems = items1.Intersect(items2,
        new PropertyComparer<MovieRating>("Critic")).Select(item => item.Critic);

    // No common critics - the correlation is taken to be 0.0.
    if (!commonItems.Any())
        return 0.0;

    var ratings1 = new List<double>();
    var ratings2 = new List<double>();

    foreach (var critic in commonItems)
    {
        DumpWithOffset(critic);

        var r1 = items1.Where(i => i.Critic == critic).Select(i => i.Rating).First();
        ratings1.Add(r1);
        DumpWithOffset(r1);

        var r2 = items2.Where(i => i.Critic == critic).Select(i => i.Rating).First();
        ratings2.Add(r2);
        DumpWithOffset(r2);
    }

    return ratings1.Pearson(ratings2);
}

9 http://www.codeproject.com/Articles/42492/Using-LINQ-to-Calculate-Basic-Statistics

Once we are able to calculate the correlation coefficient for a pair of movies, obtaining recommendations is simply a matter of finding the movies with the highest correlation scores relative to the movie in question. The code is given in the following sample and is straightforward.

    In the code, the threshold parameter exists to ensure that movies with a very low or negative

    correlation are not picked up. This can certainly be a problem if your data set is very sparse and does not

    contain enough ratings. For our purpose, the threshold is set to -1. For practical use, it may need to be

    set to 0.5 or so.

public static Recommendation[] GetRelatedProducts(MovieRating[] movieRatings,
    string movie, double threshold = -1)
{
    var allMovies = movieRatings.Select(x => x.Movie).Distinct();

    var results = allMovies.Where(c => c != movie)
        .Select(c => new Recommendation()
        {
            Movie = c,
            Rating = Analysis.CalculateCorrelation(movieRatings, movie, c)
        })
        .Where(x => x.Rating > threshold)
        .OrderByDescending(x => x.Rating);

    return results.ToArray();
}
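For context, a driver for this code might look roughly like the following sketch. The parsing shown here is our own simplification and assumes MovieRating exposes settable Critic, Movie, and Rating properties and that the code runs where GetRelatedProducts is visible; the layout of ratings.csv is critic, movie, rating, as shown earlier, but the actual console sample may be organized differently.

// Hypothetical driver; requires System, System.Globalization, System.IO, and System.Linq.
var movieRatings = File.ReadLines("ratings.csv")
    .Select(line => line.Split(','))
    .Select(parts => new MovieRating
    {
        Critic = parts[0],
        Movie = parts[1],
        Rating = double.Parse(parts[2], CultureInfo.InvariantCulture)
    })
    .ToArray();

foreach (var recommendation in GetRelatedProducts(movieRatings, "Superman Returns"))
    Console.WriteLine("{0}\t{1}", recommendation.Movie, recommendation.Rating);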

    Running the program to obtain the top related movies based on the movie Superman Returns provides

    the following output:

You Me and Dupree	0.657951694959769
Lady in the Water	0.487950036474267
Snakes on a Plane	0.111803398874989
The Night Listener	-0.179847194799054
Just My Luck	-0.422890031611031

Simple Recommendation System Using Pig

Let us now analyze the same data set using Pig. Pig is termed a data-flow language. It allows us to express our processing requirements as a series of transformations, the result of one flowing into another. Pig then translates our specifications into Map and Reduce tasks.

In our opinion, it is similar to LINQ; once you play around with a few samples, you will have a good idea of how it works. The key concepts are explained below. We will stick to explaining these concepts in a manner that makes sense for our samples and will not stray into additional details. If you need a complete introduction to Pig, we recommend Programming Pig by Alan Gates10.

Load and store

Pig can load and store data from or to HDFS and other data sources. Pig can load files containing comma-separated or tab-delimited data, and it can handle several other forms of data as well. It is also possible to extend Pig to support custom data sources.

Relation

Pig works with collections of data that it refers to as relations. A relation is not to be confused with a relational database relation. A relation in Pig terminology is simply a collection of data. For our current context, it is best to think of a relation as similar to a table with rows and columns of data. When grouped, relations can also contain keys with an associated collection of values for each unique key.

10 http://www.amazon.com/Programming-Pig-Alan-Gates/dp/1449302645


Joins

Pig can accomplish joins in a manner that is conceptually intuitive for users who have worked with relational data. It can join two relations using a common key.

Filter

Pig can apply filters to data. A provided predicate is checked to see whether data should be included or excluded.

Projection

Pig can project from an existing collection in a manner similar to the SQL Select statement. Pig's equivalent statement is named Generate.

Grouping

Pig can group data by one or more keys. Once grouped, you do not have to flatten the resulting data. You can maintain a hierarchical structure with keys and lists of values related to the keys. These can then be projected as needed.

Dump

Pig includes a Dump statement that can dump the contents of a relation to the console. Dump is useful when working with Pig since you can run commands without writing the results to disk.
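Since the text compares Pig to LINQ, the following C# fragment sketches rough LINQ counterparts for the operations just described. The mapping is conceptual only: Pig runs these operations as distributed MapReduce jobs, not in memory, and the tiny data set here is purely illustrative.

using System;
using System.Linq;

class PigConceptsInLinq
{
    static void Main()
    {
        // Illustrative in-memory "relation" of (critic, movie, rating) rows.
        var ratings = new[]
        {
            new { Critic = "Lisa Rose", Movie = "Lady in the Water", Rating = 2.5 },
            new { Critic = "Lisa Rose", Movie = "Snakes on a Plane", Rating = 3.5 }
        };

        // JOIN   ~ Join: pair rows from two relations that share a key (here, the critic).
        var joined = ratings.Join(ratings, r => r.Critic, r => r.Critic,
            (left, right) => new { left, right });

        // FILTER ~ Where: keep only rows matching a predicate.
        var filtered = joined.Where(pair => pair.left.Movie != pair.right.Movie);

        // GENERATE (projection) ~ Select: reshape each row.
        var projected = filtered.Select(pair => new { Movie1 = pair.left.Movie, Movie2 = pair.right.Movie });

        // GROUP  ~ GroupBy: collect rows under each unique key.
        var grouped = projected.GroupBy(pair => new { pair.Movie1, pair.Movie2 });

        // DUMP   ~ writing the relation to the console (LOAD/STORE would be file I/O).
        foreach (var group in grouped)
            Console.WriteLine("{0} / {1}: {2} row(s)", group.Key.Movie1, group.Key.Movie2, group.Count());
    }
}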

Pig script that analyzes movie ratings

In the following sections, we explain the Pig code (in the sample simple recommendation - pig\recommend.pig) that analyzes the ratings document to calculate correlations between all pairs of movies, as we did with C#.

Load the data from HDFS

The first step is to load the data from HDFS. We use Pig's load statement. Since the file we are processing (ratings.csv) is in CSV format, we pass the comma as the separator to PigStorage (the default load mechanism).

We have to load the same data twice in order to do a self-join. In the future, it may be possible to work with two references to the same relation, but Pig does not currently work this way.

ratings1 = load 'recommend/ratings.csv' using PigStorage(',') as (critic:chararray, movie:chararray, rating:double);
ratings2 = load 'recommend/ratings.csv' using PigStorage(',') as (critic:chararray, movie:chararray, rating:double);

Obtain a list of unique movie combinations

In order to obtain a list of unique movie combinations, we first do a self-join by the name of the critic. For each record, this gives us a complete set of records with ratings by the same critic. We then have to filter out records with duplicate movie names.

As an example, consider the first record in the data set.

Critic's name | Movie name | Rating
Lisa Rose | Lady in the Water | 2.5


This is the complete list of ratings by Lisa Rose:

Lisa Rose | Lady in the Water | 2.5
Lisa Rose | Snakes on a Plane | 3.5
Lisa Rose | Just My Luck | 3
Lisa Rose | Superman Returns | 3.5
Lisa Rose | You Me and Dupree | 2.5
Lisa Rose | The Night Listener | 3

After the join, the result of joining just the first row with the second relation should appear as seen in the following table, which repeats the first row and combines it with each of Lisa Rose's ratings.

Lisa Rose | Lady in the Water | 2.5 | Lisa Rose | Lady in the Water | 2.5
Lisa Rose | Lady in the Water | 2.5 | Lisa Rose | Snakes on a Plane | 3.5
Lisa Rose | Lady in the Water | 2.5 | Lisa Rose | Just My Luck | 3
Lisa Rose | Lady in the Water | 2.5 | Lisa Rose | Superman Returns | 3.5
Lisa Rose | Lady in the Water | 2.5 | Lisa Rose | You Me and Dupree | 2.5
Lisa Rose | Lady in the Water | 2.5 | Lisa Rose | The Night Listener | 3

As you can see, the first row pairs the movie with itself and needs to be filtered from our result. After filtering, the results derived from the first row of data will appear as seen in the following table.

Lisa Rose | Lady in the Water | 2.5 | Lisa Rose | Snakes on a Plane | 3.5
Lisa Rose | Lady in the Water | 2.5 | Lisa Rose | Just My Luck | 3
Lisa Rose | Lady in the Water | 2.5 | Lisa Rose | Superman Returns | 3.5
Lisa Rose | Lady in the Water | 2.5 | Lisa Rose | You Me and Dupree | 2.5
Lisa Rose | Lady in the Water | 2.5 | Lisa Rose | The Night Listener | 3

The Pig code that effects this transformation is:

combined = JOIN ratings1 BY critic, ratings2 BY critic;


The filter operation removes combinations where the two movies are identical; because it keeps only pairs in which the first movie name sorts before the second, each combination of movies also appears only once. One point to note is that, after a join, we refer to fields using the form original_relation_name::field_name.

filtered = FILTER combined BY ratings1::movie < ratings2::movie;

Project the data to a more usable form

We project the results of the join, properly naming each field in the process. This will make it easier to continue processing the data.

movie_pairs = FOREACH filtered GENERATE ratings1::critic AS critic1,
    ratings1::movie AS movie1,
    ratings1::rating AS rating1,
    ratings2::critic AS critic2,
    ratings2::movie AS movie2,
    ratings2::rating AS rating2;

Obtain groups containing ratings for each pair of movies

This is achieved by simply grouping the relation by both movie fields.

grouped_ratings = group movie_pairs by (movie1, movie2);

Calculating correlations

We now have all the information we need to calculate correlations. Pig offers built-in support for this through the COR function, and we make use of the pairs of ratings gathered during the grouping.

COR returns a list of records with additional information besides the correlation value. We use the FLATTEN statement to flatten the results from COR into a single tuple of data.

correlations = foreach grouped_ratings generate group.movie1 as movie1, group.movie2 as movie2,
    FLATTEN(COR(movie_pairs.rating1, movie_pairs.rating2)) as (var1, var2, correlation);

Project final results

Now, we just need the names of the movies and the correlation coefficient.

results = foreach correlations generate movie1, movie2, correlation;

Dump final results for review

We do not store the results in this case; we simply dump them to the console for review.

dump results;


Running the script

The script we just walked through is available at simple recommendation - pig\recommend.pig. Follow these steps to run it.

1. Create a folder on HDFS named recommend:
   hadoop fs -mkdir recommend
2. Upload data\ratings.csv to HDFS:
   hadoop fs -put ratings.csv recommend
3. Run the script with the command:
   pig recommend.pig
4. Though the data set is small, the script will take a while to complete since there is fixed processing overhead in setting up the MapReduce jobs.
5. When it finishes, you should see the results dumped to the console.

    Results

    (Just My Luck,Superman Returns,-0.42289003161103106)

    (Just My Luck,Lady in the Water,-0.944911182523068)

    (Just My Luck,Snakes on a Plane,-0.3333333333333333)

    (Just My Luck,You Me and Dupree,-0.4856618642571827)

    (Just My Luck,The Night Listener,0.5555555555555556)

    (Superman Returns,You Me and Dupree,0.657951694959769)

    (Superman Returns,The Night Listener,-0.1798471947990542)

    (Lady in the Water,Superman Returns,0.4879500364742666)

    (Lady in the Water,Snakes on a Plane,0.7637626158259734)

    (Lady in the Water,You Me and Dupree,0.3333333333333333)

    (Lady in the Water,The Night Listener,-0.6123724356957946)

    (Snakes on a Plane,Superman Returns,0.11180339887498948)

    (Snakes on a Plane,You Me and Dupree,-0.6454972243679028)

    (Snakes on a Plane,The Night Listener,-0.5663521139548541)

    (The Night Listener,You Me and Dupree,-0.25)

    If you compare these results with the results from the C# version, you will observe that they are

    identical.

Applying the same concepts to a much larger set of data

We can apply the same system to a much larger data set: the MovieLens ratings data set available from http://www.grouplens.org/datasets/movielens/.


The MovieLens site offers rating data sets with 100,000, one million, and 10 million records11. The structure of the data set is explained below. We are only interested in the u.item and u.data files.

Structure of u.item

Each line in this file has information on a specific movie. The only two fields that we will end up using are the first two, containing the unique ID of the movie and the name of the movie.

Movie ID | Movie name | Several other fields (unused)
1 | Toy Story (1995) | NA

Structure of u.data

Each line in this file has an identifier for the critic, the movie they rated, and the rating they gave. There is also a column with timestamp information, which is not needed for our purpose.

Critic ID | Movie ID | Rating | Timestamp (unused)
196 | 242 | 3 | 881250949

    The complete Pig script used to perform this analysis is given in the following sample. Observe that it is

    similar to the script we used with the smaller data set. The only major difference is that we perform an

    extra join to include movie names as part of the results since the names are stored in a separate u.item

    file.

-- Load MovieLens file u.data twice since we need to do a self-join to obtain unique pairs of movies as before.
-- This script assumes you have uploaded the u.data and u.item files into a folder named movielens on HDFS.
ratings1 = load 'movielens/u.data' as (critic:long, movie:long, rating:double);
ratings2 = load 'movielens/u.data' as (critic:long, movie:long, rating:double);

-- Join by critic
combined = JOIN ratings1 BY critic, ratings2 BY critic;

-- Since movies are identified by an ID of type long, we filter out cases where both movies are identical.
-- The resulting relation contains unique pairs of movies.
filtered = FILTER combined BY ratings1::movie != ratings2::movie;

11 We ran our test on a single machine with the 100k record data set. For larger data sets, it may be better to run on a true cluster, one that you set up and configure locally. Or better yet, use one that you set up on Windows Azure's HDInsight service: http://www.windowsazure.com/en-us/services/hdinsight/.


-- Project intermediate results
movie_pairs = FOREACH filtered GENERATE ratings1::critic AS critic1,
                                        ratings1::movie  AS movie1,
                                        ratings1::rating AS rating1,
                                        ratings2::critic AS critic2,
                                        ratings2::movie  AS movie2,
                                        ratings2::rating AS rating2;

-- Group by pairs of movie IDs
grouped_ratings = group movie_pairs by (movie1, movie2);

-- Calculate the correlation in rating values
correlations = foreach grouped_ratings generate group.movie1 as movie1,
                                                group.movie2 as movie2,
    FLATTEN(COR(movie_pairs.rating1, movie_pairs.rating2)) as (var1, var2, correlation);

-- Project results, removing fields that we do not need
results = foreach correlations generate movie1, movie2, correlation;

-- Load item names and do a join to get actual names instead of just ID references.
-- Notice that the separator between fields is a '|' in the file u.item.
movies = load 'movielens/u.item' using PigStorage('|') as (movie:long, moviename:chararray);

-- Get the name of the first movie
named_results = JOIN results BY movie1, movies BY movie;

-- Get the name of the second movie
named_results2 = JOIN named_results BY results::movie2, movies BY movie;

-- Write the results to HDFS.
-- Please ensure that this folder does not exist.
-- Remove it with "hadoop fs -rmr movielensout" if it exists.


    STORE named_results2 INTO 'movielensout';

Running the script

1. Download and extract the 100k data set from the MovieLens website,[12] http://www.grouplens.org/datasets/movielens/. These files are not included with the provided sample code.

2. The data set contains several files. Only two of them are required for our immediate needs: u.data and u.item.

3. Upload the u.data and u.item files to HDFS:

   hadoop fs -mkdir movielens
   hadoop fs -put u.data u.item movielens

4. Run the script with the command: pig movielens.pig. The script will take a while to complete. Our tests on a single-machine pseudo-cluster took about 10 minutes with the 100k data set.

5. The results are written to a folder named movielensout by default. If you run the script more than once, be sure to remove this folder (hadoop fs -rmr movielensout) before running the script. Pig will complain if the output folder already exists.

6. Once the script completes, you can copy the results to your local file system for review:

   hadoop fs -get movielensout/part*

    A portion of the output we obtained from this job is shown in the following table.

Movie 1 ID | Movie 2 ID | Correlation Coefficient | Movie 1 ID (repeated[13]) | Movie 1 Name | Movie 2 ID (repeated) | Movie 2 Name
1469 | 4 | -0.6651330399133046 | 1469 | Tom and Huck (1995) | 4 | Get Shorty (1995)
1489 | 4 | 0.8703882797784892  | 1489 | Chasers (1994) | 4 | Get Shorty (1995)
1510 | 4 | NaN                 | 1510 | Mad Dog Time (1996) | 4 | Get Shorty (1995)
1475 | 4 | -0.3273268353539886 | 1475 | Bhaji on the Beach (1993) | 4 | Get Shorty (1995)
1419 | 4 | 0.6750771560841521  | 1419 | Highlander III: The Sorcerer (1994) | 4 | Get Shorty (1995)
1436 | 4 | 1.0                 | 1436 | Mr. Jones (1993) | 4 | Get Shorty (1995)
1656 | 4 | NaN                 | 1656 | Little City (1998) | 4 | Get Shorty (1995)

[12] MovieLens usage terms prohibit distribution of the data. You will have to download a copy yourself in order to test this script.

[13] Repeated due to the joins with u.item. We should have added a projection, but we did not do so to keep the code succinct.


Looking at a couple of the result rows (the Chasers and Tom and Huck rows above), it is a safe bet to recommend Get Shorty to those who like Chasers, and a bad idea to recommend Get Shorty to those who like Tom and Huck.

The output contains repeated fields as well as several NaN values. It would be a good exercise to modify the Pig script so that NaN values, which appear due to a lack of common ratings for a pair of movies, are removed. You can also modify the projections so that duplicate fields are removed; one possible approach is sketched below.
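The following lines are a minimal sketch of that exercise, not part of the original script. They assume the results and named_results2 relations defined above and the field order produced by the two joins, and they use a cast to chararray to detect NaN correlations, which is one convenient way to express this check in Pig. The relation names clean_results and final_results are ours.

-- Sketch: remove pairs whose correlation is NaN.
-- Casting the double to chararray turns NaN into the string 'NaN'.
clean_results = FILTER results BY (chararray)correlation != 'NaN';

-- The two joins above would then use clean_results instead of results.
-- After the second join, keep only the fields we care about; positional
-- references sidestep the duplicated ID columns ($4 and $6 are the two
-- movie names and $2 is the correlation, given the join order used above).
final_results = FOREACH named_results2 GENERATE $4 AS movie1name,
                                                $6 AS movie2name,
                                                $2 AS correlation;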

    The content thus far should have given you a good overview of the fundamentals of working with

    Hadoop/HDInsight. You should now have enough of an understanding of the general environment

    related to big data to briefly review some related topics.

The Role of Traditional BI

Analysis with Hadoop is a batch process, and such analysis takes time. While Hadoop can become an invaluable part of your extract-transform-load (ETL) pipeline, in many instances you still need to store the final output in relational form or in a data warehouse of some sort so that it can be accessed on demand. In fact, this is exactly what we would do with the movie recommendations we calculated: store them in a relational database, update the information at regular intervals as new ratings come in, and then serve movie recommendations on demand. One simple way to hand the Pig output to a relational loader is sketched below.
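As one illustration, the STORE step at the end of the script could write comma-separated text that a bulk loader (for example, SQL Server's bcp utility or BULK INSERT) can then import into a table. This is a hedged sketch under assumptions: the relation name final_results and the folder recommendations_csv are hypothetical, and PigStorage(',') is simply the built-in storage function with a comma delimiter.

-- Sketch only: write the final results as comma-separated text on HDFS so a
-- relational bulk loader can import them into a table on a schedule.
STORE final_results INTO 'recommendations_csv' USING PigStorage(',');
-- Afterwards, copy the output locally with: hadoop fs -get recommendations_csv/part*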

In our opinion, traditional Business Intelligence tools do not lose their importance. They just get much more powerful by harnessing the capabilities of the Hadoop ecosystem. This is precisely what we expect to see happening on the Microsoft Business Intelligence stack as well: Microsoft and other vendors will make it easier to integrate Hadoop with SQL Server and SQL Server Analysis Services.

Data Mining Post-ETL

The ability to perform large-scale ETL on data that was previously unavailable or difficult to process opens up several additional avenues for data mining. Common data mining tasks, such as dependency modeling, clustering, classification, anomaly detection, regression, and aggregation, depend on access to good data from multiple sources. The ability to process big data now provides additional data sources that can be used with these tasks.

The famous example[14] of Target learning that a customer was pregnant by analyzing and modeling shopping habits was likely achieved because Target was able to integrate not just transactional data, but additional data sources such as web activity logs. The combined model is likely superior to one built from a smaller segment of the available data.

It is important to understand that the actual data mining environment does not have to be capable of handling big data; the mining process can still be performed using traditional tools. The open source R environment is a powerful tool for data mining, and SQL Server provides a solid set of data mining tools that is already available to many organizations. Used in tandem with Hadoop, R and SQL Server can build compelling models that predict customer spending, customer actions, fraud, machine failure, and much more.

[14] http://www.nytimes.com/2012/02/19/magazine/shopping-habits.html?_r=0


Data Mining with Big Data

It is also possible to model directly on big data. The open source Mahout[15] project implements several algorithms (including a recommendation system suitable for production use) in a distributed manner that integrates with Hadoop. A complete list of currently implemented algorithms is available at https://cwiki.apache.org/confluence/display/MAHOUT/Algorithms.

Big Data Processing Is Not Just for Big Data

The processing methods we have seen apply to a broad swath of data that does not necessarily have to be big; they are powerful and useful for much smaller data sets as well, especially when the data is not in a structured format. As a small illustration, the word count example from earlier in this paper can be expressed in a few lines of Pig, as sketched below.
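The following is a minimal sketch and is not one of the provided samples: the input path input/books.txt is hypothetical, while TOKENIZE, FLATTEN, COUNT, GROUP, and STORE are standard built-in Pig operations.

-- Minimal Pig word count (sketch). The input path is hypothetical.
lines  = LOAD 'input/books.txt' AS (line:chararray);
-- TOKENIZE splits each line into a bag of words; FLATTEN unnests the bag.
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS occurrences;
STORE counts INTO 'wordcount_out';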

Conclusion: Harnessing Your Data Is Easier Than You Think

There has been a lot of talk about big data, Hadoop, and data science. Having a good understanding of the environment surrounding big data will help you make the right decisions, whether as a developer or a business person.

At Syncfusion, we sincerely believe getting your big data strategy right is not hard. It just requires a solid understanding of the fundamentals and a willingness to push boundaries and test how the adoption of new strategies can make your business more competitive. The future belongs not to those who have more data, but to those who put their data to good use. We close with this statement by Tim O'Reilly in a Google+ conversation[16]:

    "Companies that have massive amounts of data without massive amounts of clue are going to be displaced by startups that have less data but more clue."

How Can Syncfusion Help?

Syncfusion has been working with HDInsight since the earliest releases of the product. We also have extensive experience with traditional business intelligence, data mining, the R environment, data visualization, and enterprise reporting.

Syncfusion's solutions team can implement big data solutions end to end. Contact us today to learn more.

[15] http://mahout.apache.org/

[16] https://plus.google.com/+TimOReilly/posts/4Xa76AtxYwd


Contact information

Syncfusion, Inc.

    2501 Aerial Center Parkway

    Suite 200

    Morrisville, NC 27560

    USA

    [email protected]


Appendix A: Installing and Configuring HDInsight on a Single Node (Pseudo-Cluster)

1. Install HDInsight (developer preview version as of October 24, 2013) from http://www.microsoft.com/web/gallery/install.aspx?appid=HDINSIGHT-PREVIEW.

2. Install according to the installation package prompts.

3. Once installation is complete, ensure that the services installed by HDInsight are running. They are not set to run by default, so you will have to start them manually or change their settings so they start automatically.

4. The installer creates a shortcut to a command-line environment configured correctly for running Hadoop. Navigate to this shortcut and start the environment.


Appendix B: Configuring NetBeans for HDInsight Development on Windows

1. Install version 6 of the Java SDK from http://www.oracle.com/technetwork/java/javasebusiness/downloads/java-archive-downloads-javase6-419409.html.

2. Install the latest version of NetBeans from https://netbeans.org/downloads/.

3. Once installed, you can open the included word count Java sample in NetBeans.

4. Once you open the project, select project properties by right-clicking the project name, as shown in the following image, and selecting Properties.

5. The following dialog will be displayed. Select Libraries and check whether JDK 1.6 (the version of the JDK that corresponds to Java 6) is selected. If JDK 1.6 is not selected, please select it. If JDK 1.6 is not listed, click Manage Platforms.


6. Click Add Platform in the next dialog that is displayed.


7. A selection dialog will then be displayed; point it to the location of the JDK, as shown in the following image.

8. Now, make sure JDK 1.6 is selected as the platform, and close the selection dialog. The project should then display JDK 1.6 under the Libraries tree entry.


9. The word count Java project already contains a reference to the hadoop-core-1.1.0-SNAPSHOT.jar file. In new projects that you create, you should include a reference to this library (installed by HDInsight to {install disk}:\Hadoop\hadoop-1.1.0-SNAPSHOT\hadoop-core-1.1.0-SNAPSHOT.jar). You may have to add additional library references if you use additional features; please consult the included documentation for this information.

10. Once these settings are in place, you should be able to build the project using the Run > Build Project menu option. A JAR file will be created under a folder named dist inside the main project folder. This JAR file can be deployed to Hadoop clusters.