OPTIMIZING SQL-ON-HADOOP PERFORMANCE ON
ANALYZING WEB ACCESS LOGS DATA
Ayushi
M.Tech (CSE), SAMCET, Bhopal (M.P.)
Ankur Taneja
Head of Dept. of CSE, SAMCET, Bhopal (M.P.)
Abstract— A web log file is a log file automatically created and maintained by a web server. Analyzing web server access log files offers valuable insight into website usage. Log files collect a variety of data about the requests made to a web server and act as a visitor sign-in sheet: they can tell us which pages get the most and the least traffic, which sites refer visitors to our site, and which pages visitors view. Because of the tremendous usage of the web, log files are growing at an ever faster rate and their size is becoming huge. Processing this explosive growth of log files with relational database technology has hit a bottleneck. To analyze such large datasets we need a parallel processing system and a reliable data storage mechanism. Hadoop rides the big data wave, processing massive quantities of information on clusters of commodity hardware. In this paper we present the Hadoop framework for storing and processing large log files, together with the Hadoop ecosystem tools used for analysis, Hive and Pig, applied to preprocessing huge volumes of web log files, deriving website statistics and learning user behavior. We also compare the performance of the two analytical tools on log file analysis.
Keywords-- Hadoop, data mining, log file analysis, behavior mining, web mining, Hive, Pig.
I. INTRODUCTION
As the world increasingly moves online, every field has its own way of putting its applications on the Internet. Sitting at home we can shop, do banking-related work, get weather information and use many more services. In such a competitive environment, service providers are eager to know whether they are providing the best service in the market, whether people are purchasing their products and whether users find their applications interesting and friendly to use; in banking, for instance, they need to know how many customers are looking forward to a particular scheme. In the same way, they need to know what problems have occurred and how to resolve them, how to make websites or web applications engaging, which products people are not purchasing and, in that case, how to improve advertising strategies to attract customers and shape future marketing plans. Logs hold the answers to all of these questions. Logs come in all shapes, but as applications and infrastructures grow, the result is a
massive amount of distributed data that is useful to mine. From web and mail servers to kernel and boot logs, modern servers hold a rich set of information. Massive amounts of distributed data are a perfect application for Apache Hadoop, as are log files: time-ordered, structured textual data. Log processing can extract a variety of information. One of its most common uses is to extract errors or count the occurrences of some event within a system (such as login failures). Some types of performance data, such as connections or transactions per second, can also be extracted. Other useful information includes the extraction (map) and construction of site visits (reduce) from a web log; this analysis can also support detection of unique user visits in addition to file access statistics. As data grows over the years, storage and analysis become formidable tasks, which in turn increases processing time and cost. Although various techniques and algorithms are used in distributed computing, the problem remains unsolved. To overcome this issue, Hadoop MapReduce is used to process large numbers of files in parallel. The World Wide Web emits data in ever larger quantities as users perform more of their day-to-day activities online. User interaction with a website is analyzed through web server log files, computer-generated data in a semi-structured format. This paper presents an analysis of web server log files using Hadoop MapReduce to preprocess the log files.
HADOOP
The Apache Hadoop project develops open-source software for scalable, reliable, distributed computing. The Apache Hadoop library is a framework that allows the distributed processing of large data sets across clusters of thousands of independent commodity computers, handling large amounts (terabytes, petabytes) of data. Hadoop was derived from the Google File System (GFS) and Google's MapReduce. Apache Hadoop is a good choice for workloads such as Twitter analysis because it works on huge distributed data. It is an open-source framework for distributed storage and large-scale distributed processing of data sets on clusters. Hadoop runs applications using the MapReduce model, in which data is processed in parallel on different cluster nodes. In short, the Hadoop framework makes it possible to develop applications that run on clusters of computers and perform complete statistical analysis of huge amounts of data. Hadoop MapReduce is a software framework for easily writing applications that process big amounts of data in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
Apache Hive
Facebook created Hive for analyzing large datasets. It is the most widely adopted data warehousing application providing a relational model and SQL interface, and its infrastructure runs on top of Hadoop. It mainly helps in summarizing, querying and analyzing otherwise unstructured data. Since its incubation in 2008, Apache Hive has been considered the standard for batch and interactive SQL workloads on data in Hadoop. Hive offers Hadoop users the broadest set of SQL semantics at petabyte scale with interactive response times. Hive tables are similar to relational database tables, but a Hive table is made up of partitions. Hive supports overwriting and appending data in its tables. Within a particular database, data in the tables is serialized and each table has a corresponding HDFS directory. Tables are further divided into partitions, which determine the distribution of data within subdirectories. Hive supports data types such as binary, char, boolean, double, bigint, decimal, int, string, smallint, timestamp and float. Primitive data types
can be combined to form complex data types such as arrays and maps.
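As a brief illustration (the table and column names below are our own sketch, not taken from the paper), a partitioned table combining primitive and complex types can be declared like this:

CREATE TABLE page_visits (
  ip      STRING,
  bytes   BIGINT,
  pages   ARRAY<STRING>,         -- pages viewed during one visit
  headers MAP<STRING, STRING>    -- request headers as key/value pairs
)
PARTITIONED BY (visit_date STRING)  -- each partition maps to an HDFS subdirectory
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';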
Apache Pig
Yahoo started Pig as a research project focused on the analysis of large datasets. It was designed in the style of SQL as well as MapReduce, and it is generally used with Hadoop. Pig Latin is the procedural language used by Apache Pig: programmers write Pig scripts and execute commands in the grunt shell, which runs MapReduce programs when a Pig script is executed. Apache Pig can execute in three modes. Interactive mode: users get output by entering Pig Latin statements directly. Batch mode: users run Apache Pig on a single file with the .pig extension. Embedded mode: users define their own functions, called User Defined Functions (UDFs). The major components of Apache Pig are the parser, which checks the syntax of the script; the optimizer, which carries out plan optimizations such as push-down; the compiler, which compiles the plan into MapReduce jobs; and the execution engine, which executes the MapReduce jobs, after which Hadoop produces the results.
II. LITERATURE REVIEW
According to [1], web mining [13] is the application of data mining techniques to extract useful knowledge from web data, which includes web documents, hyperlinks between documents, usage logs of web sites and so on. Web usage mining is the process of applying data mining techniques to discover usage patterns from web data; it is one of the techniques for personalizing web pages. Web usage data is gathered at different levels, such as the server, client and proxy levels, and from different resources of web browser and web server interaction over the HTTP protocol [3]. In the current scenario the number of online customers increases day by day, and each click on a web page creates on the order of a hundred bytes of data in a typical website log file. When a web user submits a request to a web server, the user's activity is recorded on the server side; these records of web access are called log files. The request information sent by the user to the web server via the protocol is recorded in the log file. Log file entries [4] contain fields such as the IP address of the computer making the request, the visitor data, the line of the hit, the request method, the location and name of the requested file, the HTTP status code and the size of the requested file.
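For example, a single entry in the Apache Common Log Format carries all of these fields on one line (this sample is the standard example from the Apache documentation, not a line from the paper's dataset):

127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326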
Log files can be classified into categories depending on where they are stored, namely web server logs and application server logs. A web server [5] maintains two types of log files: an access log and an error log. The access log records all requests that were made to the server; the error log records all requests that failed, along with the reason for the failure as recorded by the application. Log files contain many parameters that are very useful for recognizing user browsing patterns [6, 7, 8].
Mining the web log file helps servers and e-commerce sites predict the behavior of their online customers. The number of online customers increases every day, and so does the size of the web access log [10]. Large websites handling millions of simultaneous visitors can generate hundreds of petabytes of logs per day. Existing data mining techniques store web log files in a traditional DBMS for analysis, but an RDBMS cannot store and manage petabytes of heterogeneous data. So, to analyze such big web log files efficiently and effectively, we need faster, efficient and effective parallel and scalable data mining algorithms, together with a cluster of storage devices to hold petabytes of web log data and a parallel computing model for analyzing it. The Hadoop framework provides reliable clustered storage to keep large web log data in a distributed manner, and parallel processing features to process it efficiently and effectively [11, 12]. The web logs preprocessed in the Hadoop MapReduce environment are further processed to predict a user's next request unobtrusively, increasing user interest and reducing the response time of the e-commerce system.
The paper shows how to process log files using MapReduce and how the Hadoop framework is used for parallel computation over log files. Data collected from various resources is loaded into HDFS to facilitate MapReduce processing. The authors show that processing big data in a Hadoop environment minimizes computation and response time, and that their HM_PP algorithm achieves good accuracy in predicting users' preferred pages, so the e-commerce system can be accessed with big data analytics tools at low response time and with good prediction accuracy. In future, log analysis could be done with correlation engines such as RSA enVision and in an HA cloud environment, and the work could be extended with semantic analysis for better prediction.
In [2], the authors describe how big data analytics has recently attracted intense interest from academia and industry alike for its attempt to extract knowledge, information and wisdom from big data. Big data and cloud computing are two of the most important trends defining the new emerging analytical tools. Big data analytical capabilities delivered through cloud models could ease adoption for many industries and, beyond the cost savings, could surface insights that provide various kinds of competitive advantage. Many companies provide online big data analytical tools, among them the Amazon big data analytics platform, the Hive web-based interface, SAP big data analytics, IBM InfoSphere BigInsights, Teradata big data analytics, the 1010data big data platform and the Cloudera big data solution. These companies analyze huge amounts of data with the help of different kinds of tools and also provide simple user interfaces for analyzing data.
III. PROBLEM DEFINITION
Companies like Flipkart, Snapdeal and Amazon routinely produce a huge amount of logs on a daily basis, and they continually improve their operations and services by analyzing that data. Analyzing these huge amounts of data in a very short period of time is a crucial task for any business analyst. The problem of log file analysis is complicated not only by volume but also by disparate structure: log files are semi-structured or unstructured, so traditional tools and techniques are not feasible, because they cannot handle such large amounts of unstructured data.
For this reason, data mining needs preprocessing and analytic methods to find the value. Indeed, data mining is closely related to artificial intelligence, machine learning and so on. The scale of data management in data mining and in big data differs significantly, but the basic method of extracting the value is very similar. In data mining, the process of extracting knowledge requires data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation and knowledge presentation. Big data emerged from solving the requirements and challenges of data mining [13].
IV. PROPOSED WORK
Analyzing such large and complex data requires a powerful tool. We use Hadoop, an open-source implementation of MapReduce, a powerful tool designed for deep analysis and transformation of very large data.

Figure 1. Workflow diagram
In this paper we design an approach for handling the problems raised by the larger data volume and the dynamic data characteristics when finding and performing operations on the data sets. For the analysis we first use Hadoop as the standard platform, on a single-node Ubuntu machine [9], to meet the challenges of big data through the MapReduce framework, where the complete data is mapped to frequent datasets and reduced to a smaller, more manageable size. After that we use big data analytical tools to refine the unstructured data and analyze it.
V. EXPERIMENTAL & RESULT ANALYSIS
All the experiments were performed on an i3-2410M CPU @ 2.30 GHz with 3 GB of RAM running Ubuntu 14. We then configured hadoop-1.1.2 on Ubuntu and, alongside Hadoop, integrated the big data analytical tools Hive and Pig on top of it. To achieve this we follow three steps:
Loading data into HDFS.
Analyzing the data using Apache Hive and Pig.
Comparing the performance of Hive and Pig.
Loading Data into HDFS
First we load different access and error log files into HDFS; in this work we analyze the NASA web access log, which is a common access log. Figure 2 shows the loading of a log file into HDFS, and the figure makes clear that there is no structure to the data in these log files. After loading the log files into HDFS we analyze them using big data analytical tools such as Apache Hive and Pig; the next section analyzes these complex log files.

Figure 2. Loading web access logs into HDFS
Analyzing using Hive & Pig
After storing the raw log data in HDFS, we can start analyzing these complex log files using Apache Hive. To analyze the common log file we first create a nasa_log table to store the access log data efficiently in a structured manner. To convert the unstructured, complex log file into a structured tabular format we use RegexSerDe properties in Hive, which transform the unstructured data into a structured format; we create the table and apply the regex SerDe properties to it.
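A hedged sketch of what this table definition can look like (the paper does not print its exact statement; the regex, the column names and the choice of the built-in org.apache.hadoop.hive.serde2.RegexSerDe, for which hive-contrib ships an equivalent on older releases, are our assumptions):

CREATE TABLE nasa_log (
  host STRING, identity STRING, user_name STRING, request_time STRING,
  request STRING, status STRING, size STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  -- One capture group per column, matching the Common Log Format.
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) \\[([^\\]]*)\\] \"([^\"]*)\" ([^ ]*) ([^ ]*)"
)
STORED AS TEXTFILE;

-- Point the table at the raw file already loaded into HDFS.
LOAD DATA INPATH '/logfile' INTO TABLE nasa_log;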
For such a Hive query, the Hive engine launches a MapReduce job that preprocesses the log files; the job is launched by running the query in the terminal. Once the MapReduce job finishes, we get the output of the query. Figure 3 shows the hosts (IP addresses) with the maximum frequency, i.e. hit counts; the time taken by the Hive query is also shown in Figure 3: it takes 47.099 seconds to finish execution.
Figure 3. Maximum hits from IP addresses
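The result in Figure 3 comes from a plain GROUP BY aggregation; a minimal sketch against the nasa_log schema assumed above:

SELECT host, COUNT(*) AS hits
FROM nasa_log
GROUP BY host
ORDER BY hits DESC
LIMIT 10;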
Similarly, we can find the various status codes together with their frequencies and the time taken by the Hive query; Figure 4 shows the status codes we get, along with their frequencies and the query time.

Figure 4. Various status codes along with their frequencies
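The status-code breakdown is the same aggregation over a different column (again a sketch under the schema assumed above):

SELECT status, COUNT(*) AS frequency
FROM nasa_log
GROUP BY status;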
Similarly, we can find the most frequently hit pages that users access, along with their frequencies and the time taken by Hive, as shown in Figure 5.

Figure 5. Maximum hitting pages
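Ranking pages reuses the same pattern, grouping on the request field (a sketch; the paper's exact query is not shown):

SELECT request, COUNT(*) AS hits
FROM nasa_log
GROUP BY request
ORDER BY hits DESC
LIMIT 10;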
Analyzing using Pig
Now we analyze the NASA web access log files with Pig, another big data analytical tool for performing analysis on large amounts of data. For this we first start Pig by entering the grunt shell, simply by typing the pig command. To analyze the unstructured log files we register the piggybank loader in the grunt shell, through which we can validate and process the log data. We then find the IP addresses with the highest hit counts by writing the following Pig script.
REGISTER /home/Desktop/piggybank-0.11.0.jar;

DEFINE ApacheCommonLogLoader
  org.apache.pig.piggybank.storage.apachelog.CommonLogLoader();

-- Load the raw log with piggybank's Common Log Format loader.
logs = LOAD '/logfile' USING ApacheCommonLogLoader AS (
  addr: chararray, logname: chararray, user: chararray,
  time: chararray, method: chararray, uri: chararray,
  proto: chararray, status: int, bytes: int);

-- Count hits per address and keep the ten busiest.
addrs = GROUP logs BY addr;
counts = FOREACH addrs GENERATE FLATTEN($0), COUNT($1) AS count;
top = ORDER counts BY $1 DESC;
result = LIMIT top 10;
DUMP result;
After the Pig script completes execution we get its output, shown in Figure 6.

Figure 6. Output generated by Pig

The time taken by Pig is shown in Figure 7, which clearly shows that Pig started execution at 15:41:06 and finished at 15:43:10, i.e. the Pig script took 124 seconds to complete.

Figure 7. Time taken by the Pig script
Comparison between Hive and Pig
After analyzing the access logs with both Hive and Pig, we can see that the two tools produce the same result, meaning both are equally accurate, but they take different execution times to generate it; Table 1 shows the time taken by Hive and Pig.

Table 1. Time taken by Hive and Pig

Figure 8. Time taken by Hive and Pig
Our experiment shows Hive to be more useful than Pig for this analysis, and Hive performs faster than Pig on several parameters. The query results above demonstrate that the execution time taken by Hive is much lower than that of Pig. Hive also generates fewer MapReduce jobs than Pig, which is one reason its execution time is lower. Another benefit of Hive is the number of lines of code, which is higher in Pig, whereas in Hive a one-line query is often sufficient. A further parameter is the load on the mr-jobhistory server: executing Pig scripts puts much more load on the history server because Pig switches between aliases more often, whereas Hive involves less switching and thereby reduces the load on the mr-jobhistory server. The experimental results are shown below.
Table 2. Number of jobs launched by Pig and Hive

Table 3. Queries executed w.r.t. the mr-jobhistory server
Figure 9. Queries executed w.r.t. the mr-jobhistory server (y-axis: execution time taken, in min)
Optimizing Query Performance
Here we also optimize Hive query performance: we apply the serialization process to the starting table once, store the resultant rows in a new table, and then run all queries against this new table, obtaining results faster than when performing the same operations on the table that must be deserialized on every query. To test this we execute the same queries on two Hive tables, the first without the optimization and the second with it, and record the different execution times of the queries on the two tables. For this we create another table called lognew; the schema difference between the two tables is shown in Figure 10.

Figure 10. Schema difference between the two tables
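One way to realize this step is to materialize the regex-parsed rows once with CREATE TABLE ... AS SELECT; a hedged sketch (the paper names the lognew table but does not print the statement):

-- Materialize the parsed rows into plain delimited text once, so that
-- later queries avoid re-running the RegexSerDe on every scan.
CREATE TABLE lognew
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
AS SELECT * FROM nasa_log;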
The time taken by the queries when run on the normal table and on the optimized table is shown below.

Table 4. Time taken by the queries
Figure 11. Execution time taken by queries on the two Hive tables
VI. CONCLUSION
The World Wide Web has necessitated that users make use of automated tools to find desired information resources and to follow and assess their usage patterns. We have presented a best-fit Hadoop MapReduce programming model for analyzing web application log files. In this system, data storage is provided by HDFS, and the MapReduce model applied over the log files gives analyzed results in minimal response time. To get categorized analysis results, Hive and Pig queries are written over the MapReduce result. We also compared the performance of Hive and Pig: Hive performs better than Pig in processing access logs in terms of execution time. Finally, we also optimized Hive query performance for analyzing the log data.
REFERENCES
[01] Dr. S. Suguna, M. Vithya and J. I. Christy Eunaicy, "Big Data Analysis in E-commerce System Using Hadoop MapReduce", IEEE, 2016.
[02] Rahul Kumar Chawda and Dr. Ghanshyam Thakur, "Big Data and Advanced Analytics Tools", 2016 Symposium on Colossal Data Analysis and Networking (CDAN), IEEE, 2016, ISBN: 978-1-5090-0669-4/16.
[03] M. Santhanakumar and C. Christopher Columbus, "Web Usage Analysis of Web Pages Using RapidMiner", WSEAS Transactions on Computers, EISSN: 2224-2872, vol. 3, May 2015.
[04] Shaily G. Langhnoja, Mehul P. Barot and Darshak B. Mehta, "Web Usage Mining Using Association Rule Mining on Clustered Data for Pattern Discovery", International Journal of Data Mining Techniques and Applications, vol. 2, issue 1, June 2013.
[05] Web server logs ://http. Sever side log.org.
[06] Nanhay Singh, Achin Jain and Ram Shringar Raw, "Comparison Analysis of Web Usage Mining Using Pattern Recognition Techniques", International Journal of Data Mining & Knowledge Management Process (IJDKP), vol. 3, issue 4, July 2013.
[07] J. Srivastava et al., "Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data", ACM SIGKDD Explorations, vol. 1, issue 2, pp. 12-23, 2000.
[08] S. Saravanan and B. Uma Maheswari, "Analyzing Large Web Log Files in a Hadoop Distributed Cluster Environment", International Journal of Computer Technology & Applications, vol. 5, pp. 1677-1681.
[09] Michael G. Noll, "Running Hadoop on Ubuntu Linux (Single-Node Cluster)", [online], available at http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
[10] K. V. Shvachko, "The Hadoop Distributed File System Requirements", MSST '10: Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST).
[11] Apache Hadoop, http://hadoop.apache.org.
[12] Orzota Inc., "Beyond Web Application Log Analysis using Apache Hadoop", white paper.
[13] Matthew A. Russell, Mining the Social Web: Data Mining Facebook, Twitter, LinkedIn, Google+, GitHub, and More.