OPTIMIZING SQL-ON-HADOOP PERFORMANCE ON
ANALYZING WEB ACCESS LOGS DATA
Ayushi
M.Tech (CSE), SAMCET, Bhopal (M.P.)
Ankur Taneja
Head of Dept. of CSE, SAMCET, Bhopal (M.P.)
Abstract— A web log file is a log file automatically created and maintained by a web server. Analyzing web server access log files offers valuable insight into website usage. Log files collect a variety of data about the requests made to a web server and act as a visitor sign-in sheet: they can tell us which pages get the most and the least traffic, which sites refer visitors to our site, and which pages visitors view. Because of the tremendous usage of the web, log files are growing at an ever faster rate and their size is becoming huge. Processing this explosive growth of log files with relational database technology has hit a bottleneck. To analyze such large datasets we need a parallel processing system and a reliable data storage mechanism. Hadoop rides the big data wave, processing massive quantities of information on clusters of commodity hardware. In this paper we present the Hadoop framework for storing and processing large log files, together with the Hadoop ecosystem tools used for analysis, Hive and Pig, applied to preprocessing huge volumes of web log files, deriving website statistics and learning user behavior. We also compare the performance of the two analytical tools on log file analysis.
Keywords-- Hadoop, data mining, log file analysis, behavior mining, web mining, Hive, Pig.
I. INTRODUCTION
As the world increasingly moves online, every field has its own way of putting its applications on the Internet. Sitting at home we can shop, do banking-related work, get weather information and use many more services. In such a competitive environment, service providers are eager to know whether they are providing the best service in the market, whether people are purchasing their products and whether users find their applications interesting and friendly to use; in banking, for instance, they need to know how many customers are looking forward to a particular scheme. In the same way, they need to know what problems have occurred and how to resolve them, how to make websites or web applications engaging, which products people are not purchasing and, in that case, how to improve advertising strategies to attract customers and shape future marketing plans. Logs hold the answers to all of these questions. Logs come in all shapes, but as applications and infrastructures grow, the result is a
massive amount of distributed data that is useful to mine. From web and mail servers to kernel and boot logs, modern servers hold a rich set of information. Massive amounts of distributed data are a perfect application for Apache Hadoop, as are log files: time-ordered, structured textual data. Log processing can extract a variety of information. One of its most common uses is to extract errors or count the occurrences of some event within a system (such as login failures). Some types of performance data, such as connections or transactions per second, can also be extracted. Other useful information includes the extraction (map) and construction of site visits (reduce) from a web log; this analysis can also support detection of unique user visits in addition to file access statistics. As data grows over the years, storage and analysis become formidable tasks, which in turn increases processing time and cost. Although various techniques and algorithms are used in distributed computing, the problem remains unsolved. To overcome this issue, Hadoop MapReduce is used to process large numbers of files in parallel. The World Wide Web emits data in ever larger quantities as users perform more of their day-to-day activities online. User interaction with a website is analyzed through web server log files, computer-generated data in a semi-structured format. This paper presents an analysis of web server log files using Hadoop MapReduce to preprocess the log files.
HADOOP
The Apache Hadoop project develops open-source software for scalable, reliable, distributed computing. The Apache Hadoop library is a framework that allows the distributed processing of large data sets across clusters of thousands of independent commodity computers, handling large amounts (terabytes, petabytes) of data. Hadoop was derived from the Google File System (GFS) and Google's MapReduce. Apache Hadoop is a good choice for workloads such as Twitter analysis because it works on huge distributed data. It is an open-source framework for distributed storage and large-scale distributed processing of data sets on clusters. Hadoop runs applications using the MapReduce model, in which data is processed in parallel on different cluster nodes. In short, the Hadoop framework makes it possible to develop applications that run on clusters of computers and perform complete statistical analysis of huge amounts of data. Hadoop MapReduce is a software framework for easily writing applications that process big amounts of data in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
Apache Hive
Facebook created Hive for analyzing large datasets. It is the most widely adopted data warehousing application providing a relational model and SQL interface, and its infrastructure runs on top of Hadoop. It mainly helps in summarizing, querying and analyzing otherwise unstructured data. Since its incubation in 2008, Apache Hive has been considered the standard for batch and interactive SQL workloads on data in Hadoop. Hive offers Hadoop users the broadest set of SQL semantics at petabyte scale with interactive response times. Hive tables are similar to relational database tables, but a Hive table is made up of partitions. Hive supports overwriting and appending data in its tables. Within a particular database, data in the tables is serialized and each table has a corresponding HDFS directory. Tables are further divided into partitions, which determine the distribution of data within subdirectories. Hive supports data types such as binary, char, boolean, double, bigint, decimal, int, string, smallint, timestamp and float. Primitive data types
can be combined to form complex data types such as arrays and maps.
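As a brief illustration (the table and column names below are our own sketch, not taken from the paper), a partitioned table combining primitive and complex types can be declared like this:

CREATE TABLE page_visits (
  ip      STRING,
  bytes   BIGINT,
  pages   ARRAY<STRING>,         -- pages viewed during one visit
  headers MAP<STRING, STRING>    -- request headers as key/value pairs
)
PARTITIONED BY (visit_date STRING)  -- each partition maps to an HDFS subdirectory
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';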
Apache Pig
Yahoo started Pig as a research project focused on the analysis of large datasets. It was designed in the style of SQL as well as MapReduce, and it is generally used with Hadoop. Pig Latin is the procedural language used by Apache Pig: programmers write Pig scripts and execute commands in the grunt shell, which runs MapReduce programs when a Pig script is executed. Apache Pig can execute in three modes. Interactive mode: users get output by entering Pig Latin statements directly. Batch mode: users run Apache Pig on a single file with the .pig extension. Embedded mode: users define their own functions, called User Defined Functions (UDFs). The major components of Apache Pig are the parser, which checks the syntax of the script; the optimizer, which carries out plan optimizations such as push-down; the compiler, which compiles the plan into MapReduce jobs; and the execution engine, which executes the MapReduce jobs, after which Hadoop produces the results.
II. LITERATURE REVIEW
According to [1], web mining [13] is the application of data mining techniques to extract useful knowledge from web data, which includes web documents, hyperlinks between documents, usage logs of web sites and so on. Web usage mining is the process of applying data mining techniques to discover usage patterns from web data; it is one of the techniques for personalizing web pages. Web usage data is gathered at different levels, such as the server, client and proxy levels, and from different resources of web browser and web server interaction over the HTTP protocol [3]. In the current scenario the number of online customers increases day by day, and each click on a web page creates on the order of a hundred bytes of data in a typical website log file. When a web user submits a request to a web server, the user's activity is recorded on the server side; these records of web access are called log files. The request information sent by the user to the web server via the protocol is recorded in the log file. Log file entries [4] contain fields such as the IP address of the computer making the request, the visitor data, the line of the hit, the request method, the location and name of the requested file, the HTTP status code and the size of the requested file.
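For example, a single entry in the Apache Common Log Format carries all of these fields on one line (this sample is the standard example from the Apache documentation, not a line from the paper's dataset):

127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326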
Log files can be classified into categories depending on where they are stored, namely web server logs and application server logs. A web server [5] maintains two types of log files: an access log and an error log. The access log records all requests that were made to the server; the error log records all requests that failed, along with the reason for the failure as recorded by the application. Log files contain many parameters that are very useful for recognizing user browsing patterns [6, 7, 8].
Mining the web log file helps servers and e-commerce sites predict the behavior of their online customers. The number of online customers increases every day, and so does the size of the web access log [10]. Large websites handling millions of simultaneous visitors can generate hundreds of petabytes of logs per day. Existing data mining techniques store web log files in a traditional DBMS for analysis, but an RDBMS cannot store and manage petabytes of heterogeneous data. So, to analyze such big web log files efficiently and effectively, we need faster, efficient and effective parallel and scalable data mining algorithms, together with a cluster of storage devices to hold petabytes of web log data and a parallel computing model for analyzing it. The Hadoop framework provides reliable clustered storage to keep large web log data in a distributed manner, and parallel processing features to process it efficiently and effectively [11, 12]. The web logs preprocessed in the Hadoop MapReduce environment are further processed to predict a user's next request unobtrusively, increasing user interest and reducing the response time of the e-commerce system.
The paper shows how to process log files using MapReduce and how the Hadoop framework is used for parallel computation over log files. Data collected from various resources is loaded into HDFS to facilitate MapReduce processing. The authors show that processing big data in a Hadoop environment minimizes computation and response time, and that their HM_PP algorithm achieves good accuracy in predicting users' preferred pages, so the e-commerce system can be accessed with big data analytics tools at low response time and with good prediction accuracy. In future, log analysis could be done with correlation engines such as RSA enVision and in an HA cloud environment, and the work could be extended with semantic analysis for better prediction.
In [2], the authors describe how big data analytics has recently attracted intense interest from academia and industry alike for its attempt to extract knowledge, information and wisdom from big data. Big data and cloud computing are two of the most important trends defining the new emerging analytical tools. Big data analytical capabilities delivered through cloud models could ease adoption for many industries and, beyond the cost savings, could surface insights that provide various kinds of competitive advantage. Many companies provide online big data analytical tools, among them the Amazon big data analytics platform, the Hive web-based interface, SAP big data analytics, IBM InfoSphere BigInsights, Teradata big data analytics, the 1010data big data platform and the Cloudera big data solution. These companies analyze huge amounts of data with the help of different kinds of tools and also provide simple user interfaces for analyzing data.
III. PROBLEM DEFINITION
Companies like Flipkart, Snapdeal and Amazon routinely produce a huge amount of logs on a daily basis, and they continually improve their operations and services by analyzing that data. Analyzing these huge amounts of data in a very short period of time is a crucial task for any business analyst. The problem of log file analysis is complicated not only by volume but also by disparate structure: log files are semi-structured or unstructured, so traditional tools and techniques are not feasible, because they cannot handle such large amounts of unstructured data.
For this reason, data mining needs preprocessing and analytic methods to find the value. Indeed, data mining is closely related to artificial intelligence, machine learning and so on. The scale of data management in data mining and in big data differs significantly, but the basic method of extracting the value is very similar. In data mining, the process of extracting knowledge requires data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation and knowledge presentation. Big data emerged from solving the requirements and challenges of data mining [13].
IV. PROPOSED WORK
Analyzing such large and complex data requires a powerful tool. We use Hadoop, an open-source implementation of MapReduce, a powerful tool designed for deep analysis and transformation of very large data.

Figure 1. Workflow diagram
In this paper we design an approach for handling the problems raised by the larger data volume and the dynamic data characteristics when finding and performing operations on the data sets. For the analysis we first use Hadoop as the standard platform, on a single-node Ubuntu machine [9], to meet the challenges of big data through the MapReduce framework, where the complete data is mapped to frequent datasets and reduced to a smaller, more manageable size. After that we use big data analytical tools to refine the unstructured data and analyze it.
V. EXPERIMENTAL & RESULT ANALYSIS
All the experiments were performed on an i3-2410M CPU @ 2.30 GHz with 3 GB of RAM running Ubuntu 14. We then configured hadoop-1.1.2 on Ubuntu and, alongside Hadoop, integrated the big data analytical tools Hive and Pig on top of it. To achieve this we follow three steps:
Loading data into HDFS.
Analyzing the data using Apache Hive and Pig.
Comparing the performance of Hive and Pig.
Loading Data into HDFS
First we load different access and error log files into HDFS; in this work we analyze the NASA web access log, which is a common access log. Figure 2 shows the loading of a log file into HDFS, and the figure makes clear that there is no structure to the data in these log files. After loading the log files into HDFS we analyze them using big data analytical tools such as Apache Hive and Pig; the next section analyzes these complex log files.

Figure 2. Loading web access logs into HDFS
Analyzing using Hive & Pig
After storing the raw log data in HDFS, we can start analyzing these complex log files using Apache Hive. To analyze the common log file we first create a nasa_log table to store the access log data efficiently in a structured manner. To convert the unstructured, complex log file into a structured tabular format we use RegexSerDe properties in Hive, which transform the unstructured data into a structured format; we create the table and apply the regex SerDe properties to it.
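A hedged sketch of what this table definition can look like (the paper does not print its exact statement; the regex, the column names and the choice of the built-in org.apache.hadoop.hive.serde2.RegexSerDe, for which hive-contrib ships an equivalent on older releases, are our assumptions):

CREATE TABLE nasa_log (
  host STRING, identity STRING, user_name STRING, request_time STRING,
  request STRING, status STRING, size STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  -- One capture group per column, matching the Common Log Format.
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) \\[([^\\]]*)\\] \"([^\"]*)\" ([^ ]*) ([^ ]*)"
)
STORED AS TEXTFILE;

-- Point the table at the raw file already loaded into HDFS.
LOAD DATA INPATH '/logfile' INTO TABLE nasa_log;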
For such a Hive query, the Hive engine launches a MapReduce job that preprocesses the log files; the job is launched by running the query in the terminal. Once the MapReduce job finishes, we get the output of the query. Figure 3 shows the hosts (IP addresses) with the maximum frequency, i.e. hit counts; the time taken by the Hive query is also shown in Figure 3: it takes 47.099 seconds to finish execution.
Figure 3. Maximum hits from IP addresses
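The result in Figure 3 comes from a plain GROUP BY aggregation; a minimal sketch against the nasa_log schema assumed above:

SELECT host, COUNT(*) AS hits
FROM nasa_log
GROUP BY host
ORDER BY hits DESC
LIMIT 10;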
Similarly, we can find the various status codes together with their frequencies and the time taken by the Hive query; Figure 4 shows the status codes we get, along with their frequencies and the query time.

Figure 4. Various status codes along with their frequencies
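The status-code breakdown is the same aggregation over a different column (again a sketch under the schema assumed above):

SELECT status, COUNT(*) AS frequency
FROM nasa_log
GROUP BY status;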
Similarly, we can find the most frequently hit pages that users access, along with their frequencies and the time taken by Hive, as shown in Figure 5.

Figure 5. Maximum hitting pages
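Ranking pages reuses the same pattern, grouping on the request field (a sketch; the paper's exact query is not shown):

SELECT request, COUNT(*) AS hits
FROM nasa_log
GROUP BY request
ORDER BY hits DESC
LIMIT 10;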
Analyzing using Pig
Now we analyze the NASA web access log files with Pig, another big data analytical tool for performing analysis on large amounts of data. For this we first start Pig by entering the grunt shell, simply by typing the pig command. To analyze the unstructured log files we register the piggybank loader in the grunt shell, through which we can validate and process the log data. We then find the IP addresses with the highest hit counts by writing the following Pig script.
REGISTER /home/Desktop/piggybank-0.11.0.jar;

DEFINE ApacheCommonLogLoader
  org.apache.pig.piggybank.storage.apachelog.CommonLogLoader();

-- Load the raw log with piggybank's Common Log Format loader.
logs = LOAD '/logfile' USING ApacheCommonLogLoader AS (
  addr: chararray, logname: chararray, user: chararray,
  time: chararray, method: chararray, uri: chararray,
  proto: chararray, status: int, bytes: int);

-- Count hits per address and keep the ten busiest.
addrs = GROUP logs BY addr;
counts = FOREACH addrs GENERATE FLATTEN($0), COUNT($1) AS count;
top = ORDER counts BY $1 DESC;
result = LIMIT top 10;
DUMP result;
After the Pig script completes execution we get its output, shown in Figure 6.

Figure 6. Output generated by Pig

The time taken by Pig is shown in Figure 7, which clearly shows that Pig started execution at 15:41:06 and finished at 15:43:10, i.e. the Pig script took 124 seconds to complete.

Figure 7. Time taken by the Pig script
Comparison between Hive and Pig
After analyzing the access logs with both Hive and Pig, we can see that the two tools produce the same result, meaning both are equally accurate, but they take different execution times to generate it; Table 1 shows the time taken by Hive and Pig.

Table 1. Time taken by Hive and Pig

Figure 8. Time taken by Hive and Pig
Our experiment shows Hive to be more useful than Pig for this analysis, and Hive performs faster than Pig on several parameters. The query results above demonstrate that the execution time taken by Hive is much lower than that of Pig. Hive also generates fewer MapReduce jobs than Pig, which is one reason its execution time is lower. Another benefit of Hive is the number of lines of code, which is higher in Pig, whereas in Hive a one-line query is often sufficient. A further parameter is the load on the mr-jobhistory server: executing Pig scripts puts much more load on the history server because Pig switches between aliases more often, whereas Hive involves less switching and thereby reduces the load on the mr-jobhistory server. The experimental results are shown below.
Table 2. Number of jobs launched by Pig and Hive

Table 3. Queries executed w.r.t. the mr-jobhistory server
Figure 9. Queries executed w.r.t. the mr-jobhistory server (y-axis: execution time taken, in min)
Optimizing Query Performance
Here we also optimize Hive query performance: we apply the serialization process to the starting table once, store the resultant rows in a new table, and then run all queries against this new table, obtaining results faster than when performing the same operations on the table that must be deserialized on every query. To test this we execute the same queries on two Hive tables, the first without the optimization and the second with it, and record the different execution times of the queries on the two tables. For this we create another table called lognew; the schema difference between the two tables is shown in Figure 10.

Figure 10. Schema difference between the two tables
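One way to realize this step is to materialize the regex-parsed rows once with CREATE TABLE ... AS SELECT; a hedged sketch (the paper names the lognew table but does not print the statement):

-- Materialize the parsed rows into plain delimited text once, so that
-- later queries avoid re-running the RegexSerDe on every scan.
CREATE TABLE lognew
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
AS SELECT * FROM nasa_log;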
The time taken by the queries when run on the normal table and on the optimized table is shown below.

Table 4. Time taken by the queries
Figure 11. Execution time taken by queries on the two Hive tables
VI. CONCLUSION
The World Wide Web has necessitated that users make use of automated tools to find desired information resources and to follow and assess their usage patterns. We have presented a best-fit Hadoop MapReduce programming model for analyzing web application log files. In this system, data storage is provided by HDFS, and the MapReduce model applied over the log files gives analyzed results in minimal response time. To get categorized analysis results, Hive and Pig queries are written over the MapReduce result. We also compared the performance of Hive and Pig: Hive performs better than Pig in processing access logs in terms of execution time. Finally, we also optimized Hive query performance for analyzing the log data.
REFERENCES
[01] Dr. S. Suguna, M. Vithya and J. I. Christy Eunaicy, "Big Data Analysis in E-commerce System Using Hadoop MapReduce", IEEE, 2016.
[02] Rahul Kumar Chawda and Dr. Ghanshyam Thakur, "Big Data and Advanced Analytics Tools", 2016 Symposium on Colossal Data Analysis and Networking (CDAN), IEEE, 2016, ISBN: 978-1-5090-0669-4/16.
[03] M. Santhanakumar and C. Christopher Columbus, "Web Usage Analysis of Web Pages Using RapidMiner", WSEAS Transactions on Computers, EISSN: 2224-2872, vol. 3, May 2015.
[04] Shaily G. Langhnoja, Mehul P. Barot and Darshak B. Mehta, "Web Usage Mining Using Association Rule Mining on Clustered Data for Pattern Discovery", International Journal of Data Mining Techniques and Applications, vol. 2, issue 1, June 2013.
[05] Web server logs ://http. Sever side log.org.
[06] Nanhay Singh, Achin Jain and Ram Shringar Raw, "Comparison Analysis of Web Usage Mining Using Pattern Recognition Techniques", International Journal of Data Mining & Knowledge Management Process (IJDKP), vol. 3, issue 4, July 2013.
[07] J. Srivastava et al., "Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data", ACM SIGKDD Explorations, vol. 1, issue 2, pp. 12-23, 2000.
[08] S. Saravanan and B. Uma Maheswari, "Analyzing Large Web Log Files in a Hadoop Distributed Cluster Environment", International Journal of Computer Technology & Applications, vol. 5, pp. 1677-1681.
[09] Michael G. Noll, "Running Hadoop on Ubuntu Linux (Single-Node Cluster)", [online], available at http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
[10] K. V. Shvachko, "The Hadoop Distributed File System Requirements", MSST '10: Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST).
[11] Apache Hadoop, http://hadoop.apache.org.
[12] Orzota Inc., "Beyond Web Application Log Analysis using Apache Hadoop", white paper.
[13] Matthew A. Russell, Mining the Social Web: Data Mining Facebook, Twitter, LinkedIn, Google+, GitHub, and More.