Exploring Yelp Dataset in Map Reduce, HIVE and HPCC

Preface

The era we belong to is ruled by data, and advertently or inadvertently we all contribute to it. Parallel processing framework concepts were introduced in the 1990s, but the real test for these frameworks came when data production rates grew exponentially. Big Data tools and software help in processing this data, and the resulting analysis lets an organization determine the relevancy of its data, the patterns underlying it, and other useful information that can drive better business decisions. This white paper discusses Map Reduce and Hive, which belong to the Hadoop ecosystem, and the High Performance Computing Cluster (HPCC) platform, in the context of a subset of data from the Yelp Dataset Challenge.

The Data

Understand the data

It is good practice to understand the data before you get your hands dirty processing it: what the data represents, where it comes from, what formats it uses, and so on.

Yelp

Yelp is a multi-platform local business review and social networking application, owned by Yelp, Inc. of California, that publishes crowd-sourced reviews about local businesses. Reviewers rate businesses on a five-star rating system.

Yelp dataset

Every year Yelp publishes a dataset about local businesses, users and reviews across the country. The Yelp Dataset Challenge, with a $35K prize, is to mine the academic data and analyze trends such as which cuisines Yelpers are raving about in different countries, or how much of a business' success is really due to just location.

The dataset we are going to explore is the Yelp business and review dataset, which is provided in JSON format. The business dataset identifies every business with a unique business id and also provides the city, location, state and similar attributes. The review dataset has the individual reviews, each done by a user identified by a user id, against a business.

Business

Review

Hadoop Ecosystem and HPCC

HPCC

High-Performance Computing Cluster (HPCC) Systems has a history of 14 years; it was developed by LexisNexis Risk Solutions for its data and information services business, covering data analytics and scoring, full-text storage and retrieval, entity extraction and resolution, and data linking. Enterprise Control Language (ECL) is the primary programming language for the HPCC environment. ECL is compiled into optimized C++, which is then compiled into DLLs for execution on the Thor and Roxie platforms.

Hadoop

Hadoop was created by Doug Cutting and Mike Cafarella in 2005 at Yahoo! The MapReduce and HDFS components were inspired by the Google papers on MapReduce and the Google File System (GFS). The Hadoop ecosystem is a collection of additional software packages that can run alongside Hadoop: Apache Pig, Apache Hive, Apache HBase, Apache Spark and others. Hadoop Map Reduce jobs are usually written in Java; other languages are supported through a streaming or pipe interface.

Hive

Apache Hive is a data warehousing infrastructure based on Hadoop. It provides a simple query language called HiveQL, which is similar to SQL. A Hive query gets converted into a sequence of stages; these may be map/reduce stages, or stages that perform Metastore or file system operations like move and rename.

Data Processing

Environments

The operations on the Yelp dataset were performed in the following environments.

Framework     | Host OS     | Virtualization Software | VM Image
HPCC Systems  | Windows 8.1 | VMware Player 5.0.3     | HPCC Systems VM-amd64-4.2.4-3
HIVE          | Windows 8.1 | Oracle VM VirtualBox    | Cloudera Quick Start VM 4.7.0 Virtual Box (Red Hat 64 bit)
Map Reduce    | Windows 8.1 | Oracle VM VirtualBox    | Ubuntu-12.04.5-desktop-i386 ISO 64 bit

Data Loading

HPCC

Data loading is controlled through the Distributed File Utility (DFU) server. Data typically arrives on the landing zone (for example, by FTP). File movement across components is initiated by the DFU. Data is copied from the landing zone and is distributed (sprayed) to the Data Refinery (Thor) by the ECL code. A single physical file is distributed into multiple physical files across the nodes of a cluster; the aggregate of the physical files creates one logical file that is addressed by the ECL code.

Here the JSON data is converted into CSV and uploaded to the cluster.

HADOOP

We can use the command line to manage files on HDFS. The hadoop fs -put command copies a single source file, or multiple source files, from the local file system to the Hadoop file system:

hadoop fs -put /home/user/review_cleaned.csv /user/hadoop/dir3/

HUE

Using the Hue Metastore manager in Cloudera we can create a table manually from a JSON file by specifying the input column names. A Python script was used to clean the JSON, and the cleaned review JSON has the following headers:

["funny", "useful", "cool", "user_id", "review_id", "text", "business_id", "stars", "date", "type"]

A screen capture of the Metastore manager in Hue after the Yelp_Biz table has been created from the Yelp business dataset JSON follows.

Queries

We mainly focus on three operations: COUNT, GROUP and JOIN.

- A COUNT query of the number of records based on a filter (city)
- A GROUP query on one of the columns of the dataset (city, review count)
- A JOIN of two datasets (business and review)

Query Execution in HPCC

In HPCC we read logical files by specifying the layout of the file. Once we have read the file we can specify a filter by naming the column and the filter value.

Q1: How many businesses in Scottsdale city got reviewed?

The code snippet follows:

EXPORT YBiz := DATASET('~query_result::biz', YBiz_Lay, CSV(HEADING(1)));
OUTPUT(COUNT(YBiz(city = '"Scottsdale"')), NAMED('Scott_Review'));

(The double quotes embedded in the filter value are artifacts of the CSV conversion; the Q3 snippet strips them with REGEXREPLACE.)

Result

Q2: Find the total number of review counts for businesses in each city.

This is going to be a group-by based on city. Code snippet:

CityCountRec := RECORD
  Biz_Clean.city;
  statecount := SUM(GROUP, Biz_Clean.review_c);
END;
personSummary := TABLE(Biz_Clean, CityCountRec, city);
OUTPUT(personSummary, ALL);

In ECL we group records by defining a RECORD layout that refers to the dataset and passing it, together with the grouping field, to TABLE.

Query Result

Execution Graph

Q3: List the 10 coolest businesses.

We have the cool attribute in the review dataset and the business name in the Yelp business dataset, so obviously we need a join of the two datasets. Code snippet:

Join_Lay := RECORD
  STRING27 biz_id;
  STRING20 name;
  STRING7 cool;
END;

Join_Lay JoinTran(YBiz B, YReview R) := TRANSFORM
  SELF.biz_id := B.biz_id;
  SELF.name := REGEXREPLACE('"', B.name, '');
  SELF.cool := REGEXREPLACE('"', R.cool, '');
END;

YJoin := JOIN(YBiz, YReview, LEFT.biz_id = RIGHT.business_id, JoinTran(LEFT, RIGHT));

YLay := RECORD
  STRING20 name := YJoin.name;
  INTEGER4 cool := SUM(GROUP, (INTEGER4)YJoin.cool);
END;
z := TABLE(YJoin, YLay, biz_id);
y := SORT(z, -cool);
OUTPUT(y[1..10]);

Query Execution in Hive

Q1: How many businesses in Scottsdale city got reviewed?

We can implement a simple COUNT query in HIVE:

SELECT COUNT(*) FROM yelp_biz WHERE city = 'Scottsdale';

After executing the query in the HIVE query editor we get the result.

There was one map task and one reduce task.

Total Map Reduce CPU Time Spent: 2 seconds 480 milliseconds

Q2: Find the total number of review counts for businesses in each city.

A Hive query with SUM will fetch the result:

SELECT b.city, SUM(review_count) AS review FROM yelp_biz b GROUP BY b.city;

The result

There was one map task and one reduce task.

Total Map Reduce CPU Time Spent: 2 seconds 810 milliseconds

Q3: Top 10 coolest restaurants.

SELECT r.business_id, name, SUM(cool) AS coolness
FROM review r
JOIN business b ON (r.business_id = b.business_id)
WHERE categories LIKE '%Restaurants%'
GROUP BY r.business_id, name
ORDER BY coolness DESC
LIMIT 10;

There were 3 map reduce jobs.

Total Map Reduce CPU Time Spent: 17 seconds 610 milliseconds

Result

Query Execution using Map Reduce

JSON is harder to partition into distinct segments than a format such as XML, because JSON doesn't have a token (like an end tag in XML) to denote the start or end of a record. The problem with using JSON in Map Reduce is that large JSON files need to be splittable. Elephant Bird, an open source project that contains some useful utilities for working with LZOP compression, has an LzoJsonInputFormat that can read JSON, though it requires that the input file be LZOP compressed.

So we have uploaded the CSV file for the dataset into HDFS and will do the processing on it.

Q1: How many businesses in Scottsdale city got reviewed?

Mapper code snippet
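The snippet itself is not reproduced in this transcript. As an illustration only, here is a minimal sketch of what such a mapper could look like, assuming the cleaned business CSV keeps the city in column index 3 (a hypothetical position) and splitting naively on commas; the class name is likewise hypothetical.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits ("Scottsdale", 1) for every business record whose city matches.
public class ScottsdaleCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text city = new Text("Scottsdale");

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Naive CSV split; quoted fields with embedded commas may be misparsed,
        // which could account for small count differences between platforms.
        String[] fields = value.toString().split(",");
        // Column index 3 for city is an assumption about the cleaned CSV layout.
        if (fields.length > 3 && fields[3].replace("\"", "").equals("Scottsdale")) {
            context.write(city, ONE);
        }
    }
}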

Reducer code snippet
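Again as a sketch rather than the original code, a matching sum reducer:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the 1s emitted by the mapper to produce the final count.
public class ScottsdaleCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

A driver along the following lines would wire the two together; the input file name is hypothetical, while the output path mirrors the yelpdb/01 folder mentioned below:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ScottsdaleCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "scottsdale business count");
        job.setJarByClass(ScottsdaleCountDriver.class);
        job.setMapperClass(ScottsdaleCountMapper.class);
        job.setCombinerClass(ScottsdaleCountReducer.class); // safe: summing is associative
        job.setReducerClass(ScottsdaleCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/user/hadoop/dir3/biz_cleaned.csv"));
        FileOutputFormat.setOutputPath(job, new Path("yelpdb/01"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}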

Map Reduce Logs

We wrote the data to the yelpdb/01 folder.

There is a slight mismatch of one record compared to the Hive and HPCC executions.

Q2: Find the total number of review counts for businesses in each city.

Mapper code snippet
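As above, the snippet is not reproduced in this transcript; a minimal sketch, assuming city in column index 3 and review_count in column index 4 of the cleaned business CSV (both hypothetical positions):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (city, review_count) for every business record.
public class CityReviewCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");  // naive CSV split
        if (fields.length > 4) {
            try {
                int reviews = Integer.parseInt(fields[4].replace("\"", "").trim());
                String city = fields[3].replace("\"", "");
                context.write(new Text(city), new IntWritable(reviews));
            } catch (NumberFormatException e) {
                // Skip the header row and malformed records.
            }
        }
    }
}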

Reducer code snippet
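And the matching sum reducer, again only as a sketch:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Totals the review counts emitted for each city.
public class CityReviewCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text city, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable c : counts) {
            total += c.get();
        }
        context.write(city, new IntWritable(total));
    }
}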

Map Reduce Logs

The output file was written to the yelpdb/02 folder.

Summary

The easiest way of processing the dataset was Hive, since the data can be loaded and its layout defined via HUE, and HiveQL's SQL-like syntax increases usability. Writing map reduce programs for big data problems becomes easy if you are familiar with Java, and it gives you more control over implementation and execution.

HPCC is a lesser-known platform but powerful enough to compete with Hadoop. Its language, ECL, is powerful, but its unfamiliar syntax and semantics mean more development time for novices. If you consider moving files from HPCC to Hadoop, Comma Separated Values (CSV) files are the best option, even though HPCC can also produce XML files.

As most of the queries were executed in single-node local or virtual machine environments, a performance benchmark comparing the big data platforms or technologies would not be meaningful. At this point the comparable measures are lines of code and coding time.
