International Journal of Pure and Applied Mathematics, Volume 119 No. 12 2018, 13571-13583. ISSN: 1314-3395 (on-line version). url: http://www.ijpam.eu. Special Issue


TEXT MINING BASED RETRIEVE SIMILARITY CONTENT WEBPAGE IN WEB MINING TECHNIQUES

S. Amudha*, Dr. I. Elizabeth Shanthi**

*(Ph.D Research Scholar, Department of Computer Science, Avinashilingam Institute for Home Science and Higher Education for Women, Coimbatore, India, [email protected])

**(Professor, Department of Computer Science, Avinashilingam Institute for Home Science and Higher Education for Women, Coimbatore, India, [email protected])

Abstract:

Search engines index a huge amount of information on the web. A search engine retrieves content based on the query, and the user views some pages of the search results. The users' views of the web information produce the ranking values of the web pages that are used for retrieval. Most of the time, the results for a user's query do not contain documents relevant to the user's search, and the relevant documents do not have the highest ranking values. The proposed work overcomes this drawback to retrieve more relevant documents. The proposed framework has three main parts: (i) web crawler: retrieves the web page content from the search engine based on the user's query; (ii) preprocessing: tokenization (nonempty sequences of characters excluding spaces and punctuation), stopword removal (removing function words and connective words) and stemming (removing inflections that convey parts of speech, tense and number); (iii) similarity: finds similar web content among the documents retrieved by the web crawler using clustering techniques.

Keywords: web crawler, search engine, information retrieval.

I. INTRODUCTION

Nowadays the World Wide Web (WWW) is considered to be the best source of information. Its importance is mainly due to easy access, low cost and responsiveness to users' needs in the shortest time [10]. Due to the vast number of web pages that exist, analyzing and clustering the results is still the most important challenge in the design of search engines, and more than half of all retrieved web pages in any search engine have been reported to be


irrelevant. Search engines are the major tools for finding and accessing content on the web. Whenever users seek information, they enter their query in a search engine. The search engine searches through web pages and returns a list of relevant ones [6].

Current Web search techniques are not directly suited for indexing and retrieval of semantic mark-up. A document is treated as a bag of words, where words or word variants are recognized as indexing terms. The existing semantic mark-up is either simply ignored by many search engines for indexing purposes or not processed in a way that allows the mark-up to be distinguished from other text during the search. The upcoming Web search is no longer limited to matching keywords of the query against documents; instead, complex information needs can be expressed in a structured way, with precise and structured answers as results. The kind of search in which the user's information needs are addressed by considering the meaning of the user's query as well as the available resources is referred to as Semantic Search [12].

One of the most challenging issues in any web search engine is finding high quality web pages. The quality of pages is defined based on user preferences. The problem of ranking is then to sort web pages based on users' requests or preferences. To make the web more interesting and productive, we need a good and efficient ranking algorithm for crawling and searching [2].

The reason search results are ranked in an information retrieval (IR) system derives from the assumption that information-seeking users should get all the information relevant to their search query and only that information. Although mathematical and statistical methods of varying complexity exist to determine the relevance of a search result, such methods use algorithms that build in assumptions of relevance. In the end, it is the subjective relevance of a result that matters to the user, "because an information-retrieval system exists only to serve its users" [4].

The workflow of a web crawler can be described roughly as follows [13]:

(1) A search engine assigns some URLs as the initial URLs for every web crawler. The web crawler then pushes them into a URL queue (queued URLs), in which each URL instructs the web crawler where to travel on the Web.

(2) The web crawler starts working with the initial URLs.

(3) When the web crawler retrieves web pages, it extracts all of the URLs (current URLs) in the web pages.

(4) The web crawler adds them to the queued URLs.

(5) Thereafter, to continue crawling, the web crawler chooses URLs from the queued URLs and deletes these crawled URLs.

(6) The web crawler repeats (2) to (5) until no URLs remain in the queue of URLs.
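The queue-driven loop in steps (1) to (6) can be sketched in Java as follows. This is a minimal illustration rather than the crawler used in this work: the class name `CrawlSketch`, the `links` map (an in-memory stand-in for fetching a page and extracting its URLs) and the `seen` set that keeps a URL from being queued twice are all assumptions added for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class CrawlSketch {
    // Visits pages starting from the seed URLs and returns them in crawl order.
    // `links` is a hypothetical stand-in for "fetch the page and extract its URLs";
    // a real crawler would issue HTTP requests here.
    static List<String> crawl(List<String> seeds, Map<String, List<String>> links) {
        Deque<String> queue = new ArrayDeque<>(seeds);  // step 1: queued URLs
        Set<String> seen = new HashSet<>(seeds);         // assumption: never queue a URL twice
        List<String> visited = new ArrayList<>();
        while (!queue.isEmpty()) {                       // step 6: repeat until the queue is empty
            String url = queue.poll();                   // step 5: choose a URL and delete it
            visited.add(url);                            // steps 2-3: "retrieve" the page
            for (String out : links.getOrDefault(url, List.of())) {
                if (seen.add(out)) {
                    queue.add(out);                      // step 4: add new URLs to the queue
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = Map.of(
            "a", List.of("b", "c"),
            "b", List.of("c", "d"),
            "c", List.of("a"));
        System.out.println(crawl(List.of("a"), links));
    }
}
```

A first-in, first-out queue gives a breadth-first crawl; a crawler that ranks pages while crawling would replace the queue with a priority queue ordered by the ranking score.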

Currently, the most classic Web structure algorithm is the PageRank algorithm, which Sergey Brin and Larry Page proposed at Stanford University. To verify the performance of the algorithm, they applied it to the Google search engine prototype, and Google has since become the world's most well-known search engine.

Many of the existing page ranking algorithms are based on connectivity. Graph theory plays an important role in page ranking, and many algorithms use the in-links and out-links of a page to rank it. Another important concept from graph theory is eccentricity. The smaller the eccentricity of a page, the greater its reachability. That is, the rank will be high for pages that have a small eccentricity value [8].
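Eccentricity can be computed with a breadth-first search. The sketch below assumes the standard graph-theory definition (the largest shortest-path distance from a page to any other page), restricted here to pages reachable by following out-links. The class name, method name and toy link map are hypothetical, and the ranking method of [8] is not reproduced.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class EccentricitySketch {
    // Eccentricity of `start`: the largest shortest-path length (counted in links)
    // from `start` to any page reachable from it.
    static int eccentricity(String start, Map<String, List<String>> outLinks) {
        Map<String, Integer> dist = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        dist.put(start, 0);
        queue.add(start);
        int max = 0;
        while (!queue.isEmpty()) {
            String u = queue.poll();
            for (String w : outLinks.getOrDefault(u, List.of())) {
                if (!dist.containsKey(w)) {      // first visit is the shortest path in BFS
                    int d = dist.get(u) + 1;
                    dist.put(w, d);
                    max = Math.max(max, d);
                    queue.add(w);
                }
            }
        }
        return max;
    }

    public static void main(String[] args) {
        // A hub page linked to and from three leaf pages.
        Map<String, List<String>> g = Map.of(
            "hub", List.of("a", "b", "c"),
            "a", List.of("hub"),
            "b", List.of("hub"),
            "c", List.of("hub"));
        System.out.println("hub: " + eccentricity("hub", g));
        System.out.println("a: " + eccentricity("a", g));
    }
}
```

In this toy graph the hub reaches every page in one click while a leaf needs two, so the hub has the smaller eccentricity and would receive the higher rank under the rule stated above.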

Web information retrieval may be defined as the application of information retrieval theories and methodologies to the World Wide Web. The web information retrieval task faces several challenges compared to classic information retrieval, for the following reasons [11]:

- The difference in size between the document collections used for classic information retrieval and the web makes the task of web information retrieval a tedious one.

- The structure of the web is another important factor. On the web, the links between the documents exhibit unique patterns.

- The web exhibits dynamic behavior.

- The information on the web is heterogeneous in nature, with multiple types of document formats coexisting.

- Most of the content on the web is duplicated.

- The task of web information retrieval needs to deal with several types of users, from professionals to naive users.

II. RELATED WORK

Iman Rasekh (2015) proposed a new type of web page search based on competitive intelligence and used link-based ranking to identify user preferences. The system takes keywords from the user and retrieves information based on those keywords. The retrieved information is then analyzed and stored in the system. Next, the system finds the relationships between users and, based on user behavior, classifies the users' web pages. Finally, an ICA semantic algorithm produces the final result for the user.

Suruchi Chawla (2016) developed a method for optimal ranking of clicked URLs using a genetic algorithm based on clustered web page query sessions for personalised web search. The system uses a dataset of query sessions collected from the web in three domains: academics, entertainment and sports. It improves the average precision of personalised web search through cluster-based optimal ranking of clicked URLs in the selected domains and places more relevant documents among the top URLs. Personalised web search (PWS) using optimally ranked clicked URLs is more effective at producing relevant documents.

Vali Derhami, Elahe Khodadadian, Mohammad Ghasemzadeh and Ali Mohammad Zareh Bidoki (2013) developed two new algorithms using the concept of reinforcement learning from artificial intelligence. The experiments use benchmark datasets such as LETOR and the dotIR data collection, with three common evaluation measures: precision, mean average precision and normalized discounted cumulative gain. The system calculates the score of every web page by treating each page as a state with a value. The RL Rank algorithm iteratively computes the score of a web page based on the connectivity of out-links from the current page. The second algorithm combines the content-based BM25 algorithm with RL Rank. These algorithms improve on the results of existing ranking systems.

Vikas Jinda, Seema Bawa and Shalini Batra (2014) describe semantic search on the web using different ranking algorithms. They review relevancy ranking approaches based on semantics that are considered appropriate for retrieving relevant information. The review examines the approaches according to their methodologies and the unique characteristics of their ranking processes. In both the classical IR-based search model and semantic search models, ranking involves three stages: entity ranking, relationship ranking and semantic document ranking. The review identifies a number of parameters of semantic search on the web that are used directly or indirectly in the ranking process.

Ali Mohammad Zareh Bidoki and Nasser Yazdani proposed DistanceRank, an intelligent ranking algorithm for web pages. DistanceRank is a recursive method based on reinforcement learning that computes the rank of web pages from the distance between them, measured as the average number of clicks between two pages. The authors used the University of California at Berkeley's web site, with five million web pages, to evaluate DistanceRank in two scenarios: crawl scheduling and rank ordering. Finally, they compared the rank ordering of DistanceRank with the PageRank algorithm and Google's ranking, with and without a user query. The DistanceRank algorithm produced 5% more throughput than the other algorithms.

Ahmet Selman Bozkir and Ebru Akcapinar Sezer (2018) proposed a layout-based calculation of web page similarity ranks that considers structural and vision-based features. The system considers two categories of features. In the first category, structural similarities are analysed through inspection of DOM trees, using five types of structural layout components together with whitespace. In the second category, a computer vision method, the histogram of oriented gradients (HOG), is employed to capture edge orientation. The feature extraction phase uses spatial pyramid matching. The paper achieved the following goals: the visual layouts of web pages were mapped and compared in a multi-resolution scheme, the intermediate process of visual segmentation was removed, and efficient and easily comparable web page layout signatures were generated.

Gabrielle Demange (2017) studied settings in which evaluations between two groups abound, for instance between buyers and sellers. A ranking method assigns scores to the members of each side based on these evaluations, and the mutual centrality method is characterized by two properties. Finally, the mutual centrality and congruence methods coincide for affiliation networks. The characterization applies to any pair of evaluation matrices and to the minimization of the error in affiliation networks.

Bo Yang, Hechang Chen, Xuehua Zhao, Masato Naka and Jing Huang (2015) developed a probabilistic counting based method to quantitatively and efficiently compute the diversity of inbound hyperlinks, and the Drank algorithm to rank pages by simultaneously analysing the quantity, quality and diversity of their inbound hyperlinks. The Drank algorithm computes the diversity of each pair of pages, adjusts the hyperlink weights based on diversity, and computes page authority according to the updated hyperlink weights.

Christiane Behnert and Dirk Lewandowski (2015) proposed that library information systems adopt ranking approaches from web search engines. They organize ranking factors into six groups: text statistics, popularity, freshness, locality and availability, content properties and user background. The first group measures the relevancy of the content, and the popularity group is based on citation analysis. The remaining groups also play a major role in relevancy ranking.

S Hariharan, S Dhanasekar and Kalyani Desikan (2015) developed a reachability-based web page ranking method using Haar wavelets with multiresolution. The system represents page ranking as a structured signal built from the in-link, out-link and reachability values of the web pages in the network graph. The page ranking of web pages uses the averages and coefficients of the input signal and a down-sampling process. Finally, the results of the original page rank and the category-based page rank are compared, and the category-based page rank produces better results than the others.

YaJun Du and YuFeng Hai (2013) proposed a new method for measuring similarity based on formal concept analysis (FCA) for web page ranking using the user's web log. The proposed algorithm finds the intension and extension similarity by analyzing a user's browsing pattern across hyperlinks, and also finds the information similarity between two nouns using the user's web log. The system computes the semantic similarity between two concepts and finds the similarity ranking of web pages in its own focused web crawler. They showed that the semantic ranks of web pages are useful and efficient for guiding a web crawler's choice of web pages for continuing work.

Michael Scholz, Jella Pfeiffer and Franz Rothlauf (2017) proposed a default ranking algorithm for non-personalized product ranking on the landing pages of online stores. Their product centrality ranking algorithm (PCRA) uses the PageRank centrality of products in a product domination graph to find their rank values. The graph consists of nodes and edges: the nodes represent products and the edges represent dominance relationships between the products. The PCRA algorithm achieves more accurate rankings than existing algorithms.

Vidya P V, Reghu Raj P C and Jayan V (2016) proposed a multilingual information search algorithm with web page ranking based on the user's query. The system performs five major tasks: preprocessing, searching, processing web page contents, retrieval and ranking. It uses cross-lingual information retrieval among English, Hindi and Malayalam and performs pre- and post-processing of user queries in different languages. Finally, it improves the quality of the results obtained from Google search.

III. PROPOSED METHODOLOGY

The proposed work was developed in Java to find similar content in retrieved documents. The first step in this framework is to pass the user's query to the search engine to retrieve content from the web. The search engine then processes the user's query and returns the search results, and a window size is set to control how many result pages are retrieved. For example, with WZ=2, if the first results page has 10 links and the next page has 8, then 18 web page links are retrieved in total.

The web crawler uses the HTTP protocol to retrieve the documents behind the links, which are identified with the help of the href attribute. The href attribute is used to extract the web information at a particular link. Finally, the content similarity among the retrieved documents is computed.
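Extracting link targets from the href attribute can be sketched as below. The regular expression is a deliberately crude assumption that handles only quoted href values inside anchor tags; the class and method names are hypothetical, and a production crawler would more likely rely on a proper HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HrefExtractor {
    // Matches href="..." or href='...' inside an <a ...> tag, ignoring case.
    // Group 1 is the quote character, group 2 is the link target.
    private static final Pattern HREF = Pattern.compile(
        "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1", Pattern.CASE_INSENSITIVE);

    // Returns the link targets in the order they appear in the HTML text.
    static List<String> extractLinks(String html) {
        List<String> out = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            out.add(m.group(2));
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<p><a class=\"x\" href=\"https://www.apple.com/in/\">Apple</a> "
                    + "<A HREF='/in/mac/'>Mac</A></p>";
        System.out.println(extractLinks(html));
    }
}
```

Relative targets such as `/in/mac/` would still need to be resolved against the page URL before being fetched.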

STOPWORDS

Stop words are a subset of natural language. Stop words should be eliminated from a text because they make the text look heavier and are less important to analysts. Removing stop words decreases the dimensionality of the term space.

Figure 1. Workflow of retrieve similarity content

The most common words in text documents are articles, prepositions and pronouns, which do not convey the meaning of the documents. These words are treated as stop words. Examples of stop words are: the, in, a, an, with, etc. Stop words are removed from documents because they are not considered keywords in text mining applications.
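A stop-word filter over a list of tokens can be sketched as follows. The stop list holds only the five sample words named above, which is an assumption made for brevity; a practical list would be much longer, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class StopWordFilter {
    // Illustrative stop list: only the samples named in the text.
    static final Set<String> STOP_WORDS = Set.of("the", "in", "a", "an", "with");

    // Returns the tokens that are not stop words, comparing case-insensitively.
    static List<String> removeStopWords(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            if (!STOP_WORDS.contains(t.toLowerCase(Locale.ROOT))) {
                kept.add(t);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(removeStopWords(List.of("The", "apple", "in", "a", "store")));
    }
}
```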

STEMMING

This technique is used to find the root/stem of a word. For example, the words select, selected, selecting and selections can all be stemmed to the word "select" [6]. The purpose of this method is to remove various suffixes, to reduce the number of words, to have exactly matching stems, and to save time and memory space.

[Figure 1 box labels, in workflow order: Query, Search Engine, Web Crawler, Retrieve Webpages, Preprocessing, Find The Similarity Content]

PORTER'S STEMMER

Porter's stemming algorithm, proposed in 1980, is one of the most popular stemming algorithms. Many modifications and enhancements have been made and suggested on the basic algorithm.

It is based on the idea that the suffixes in the English language are mostly made up of combinations of smaller and simpler suffixes. It has five steps, and at each step, rules are applied until one of them passes the conditions. If a rule is accepted, the suffix is removed accordingly and the next step is performed. The resultant stem at the end of the fifth step is returned. A rule has the form (condition) suffix → new suffix. For example, the rule (m>0) EED → EE means "if the part of the word before the EED ending contains at least one vowel followed by a consonant, change the ending to EE". Therefore "agreed" becomes "agree" whereas "feed" remains unchanged.

Porter designed a detailed framework for stemming which is known as 'Snowball'. The main purpose of the framework is to allow programmers to develop their own stemmers for other character sets or languages. By comparison, the Lovins stemmer is a heavier stemmer that produces a higher data reduction [13]. The Lovins algorithm is noticeably larger than the Porter algorithm because of its very extensive endings list. In one way, however, this is used to advantage: it is faster. It has effectively traded space for time, and with its large suffix set it needs just two major steps to remove a suffix, compared with the five of the Porter algorithm.
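The (m>0) EED → EE rule described above can be sketched in Java. The sketch covers this single rule only, not the full five-step stemmer. It assumes lower-case input and Porter's usual definition of the measure m as the number of vowel-consonant sequences in the stem, with 'y' counted as a vowel when it follows a consonant. The class and method names are hypothetical.

```java
final class PorterEedRule {
    // A letter is a consonant unless it is a, e, i, o, u,
    // or a 'y' that is preceded by a consonant.
    static boolean isConsonant(String w, int i) {
        char ch = w.charAt(i);
        if ("aeiou".indexOf(ch) >= 0) {
            return false;
        }
        if (ch == 'y') {
            return i == 0 || !isConsonant(w, i - 1);
        }
        return true;
    }

    // Porter's measure m: the number of vowel-to-consonant transitions,
    // i.e. the m in the pattern [C](VC)^m[V].
    static int measure(String stem) {
        int m = 0;
        boolean prevVowel = false;
        for (int i = 0; i < stem.length(); i++) {
            boolean vowel = !isConsonant(stem, i);
            if (prevVowel && !vowel) {
                m++;
            }
            prevVowel = vowel;
        }
        return m;
    }

    // Applies (m>0) EED -> EE, where m is measured on the part before "eed".
    static String applyEed(String word) {
        if (word.endsWith("eed")) {
            String stem = word.substring(0, word.length() - 3);
            if (measure(stem) > 0) {
                return stem + "ee";
            }
        }
        return word;
    }

    public static void main(String[] args) {
        System.out.println(applyEed("agreed"));
        System.out.println(applyEed("feed"));
    }
}
```

For "agreed" the stem is "agr", which contains the vowel-consonant pair "ag", so m is 1 and the ending is rewritten. For "feed" the stem is "f", m is 0, and the word is left alone.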

K-MEANS CLUSTERING ALGORITHM

K-means is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. The procedure follows a simple way to classify a given data set into a certain number of clusters fixed a priori. The main idea is to define k centers, one for each cluster. These centers should be placed carefully, because different locations cause different results. So, the better choice is to place them as far away from each other as possible. The next step is to take each point belonging to the given data set and associate it with the nearest center. When no point is pending, the first step is completed and every point is associated with its nearest cluster. At this point, k new centroids are recalculated as the barycenters of the clusters resulting from the previous step. With these k new centroids, a new binding is made between the same data set points and the nearest new center. A loop has thus been generated. As a result of this loop, the k centers change their location step by step until no more changes occur, in other words, until the centers do not move any further. Finally, this algorithm aims at minimizing an objective function, the squared error function, given by:

$$J(V) = \sum_{i=1}^{c} \sum_{j=1}^{c_i} \left( \lVert x_i - v_j \rVert \right)^2$$

where,

'||xi - vj||' is the Euclidean distance between xi and vj.

'ci' is the number of data points in the ith cluster.

'c' is the number of cluster centers.

Algorithmic steps for k-means clustering

Let X = {x1, x2, x3, ..., xn} be the set of data points and V = {v1, v2, ..., vc} be the set of centers.

1) Randomly select 'c' cluster centers.

2) Measure the distance between each data point and the cluster centers.

3) Assign each data point to the cluster center whose distance from the data point is the minimum over all the cluster centers.

4) Recalculate the new cluster center using:

$$v_i = \frac{1}{c_i} \sum_{j=1}^{c_i} x_j$$

where 'ci' signifies the number of data points in the ith cluster and the sum runs over the data points assigned to that cluster.

5) Recalculate the distance between every data point and the newly found cluster centers.

6) If no data point was reassigned then stop, otherwise repeat from step 3).
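Steps 2) to 6) can be sketched in Java as follows. Two choices here are assumptions made for the example and are not taken from the paper: the initial centers are supplied by the caller instead of being selected at random (step 1), so that the run is reproducible, and a cluster that ends up empty keeps its previous center. The class and method names are hypothetical.

```java
import java.util.Arrays;

final class KMeansSketch {
    // Squared Euclidean distance between two points of equal dimension.
    static double sqDist(double[] a, double[] b) {
        double s = 0;
        for (int d = 0; d < a.length; d++) {
            s += (a[d] - b[d]) * (a[d] - b[d]);
        }
        return s;
    }

    // Runs k-means from the given initial centers `v` (updated in place) and
    // returns, for each point in `x`, the index of the cluster it belongs to.
    static int[] cluster(double[][] x, double[][] v) {
        int[] assign = new int[x.length];
        Arrays.fill(assign, -1);
        boolean changed = true;
        while (changed) {                       // step 6: stop when nothing was reassigned
            changed = false;
            for (int i = 0; i < x.length; i++) {    // steps 2-3: nearest center wins
                int best = 0;
                for (int j = 1; j < v.length; j++) {
                    if (sqDist(x[i], v[j]) < sqDist(x[i], v[best])) {
                        best = j;
                    }
                }
                if (best != assign[i]) {
                    assign[i] = best;
                    changed = true;
                }
            }
            for (int j = 0; j < v.length; j++) {    // step 4: center = mean of its points
                double[] sum = new double[v[j].length];
                int count = 0;
                for (int i = 0; i < x.length; i++) {
                    if (assign[i] == j) {
                        for (int d = 0; d < sum.length; d++) {
                            sum[d] += x[i][d];
                        }
                        count++;
                    }
                }
                if (count > 0) {                    // assumption: empty cluster keeps its center
                    for (int d = 0; d < sum.length; d++) {
                        v[j][d] = sum[d] / count;
                    }
                }
            }
        }
        return assign;
    }

    public static void main(String[] args) {
        double[][] x = {{1, 1}, {1.5, 2}, {8, 8}, {9, 9}};
        double[][] v = {{0, 0}, {10, 10}};
        System.out.println(Arrays.toString(cluster(x, v)));
        System.out.println(Arrays.deepToString(v));
    }
}
```

Comparing squared distances gives the same nearest center as comparing Euclidean distances, so the square root is skipped.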

IV. EXPERIMENTAL RESULT AND EVALUATION

This experiment was done on a dataset based on single-keyword user queries. The system captures the search results obtained from the Google, Yahoo, Bing and Ask search engines. To generate the dataset, the user enters a single keyword as the input query, which is passed to the Google, Yahoo, Bing and Ask search engines. Figure 1 shows the personalized search engine.

The search results are retrieved and stored in the system using the href attribute and the h3 HTML tag. The system was evaluated on retrieving the most similar content in the web pages, and the experiment was developed using Java NetBeans and MySQL. The first step collects data from the various search engines based on the user's input query and retrieves the search results. The second step preprocesses the dataset as follows:

Tokenization:

Tokenization is the procedure of splitting a stream of text up into words, phrases, symbols, or other meaningful elements called tokens. The list of tokens becomes input for further processing such as parsing or text mining.

Tokenization is useful both in linguistics and in computer science, where it forms part of lexical analysis. Figure 2 shows the tokenization results in preprocessing.
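A simple tokenizer can be sketched as below, following the definition of a token given in the abstract (a nonempty sequence of characters excluding spaces and punctuation). Splitting on runs of whitespace and ASCII punctuation is an assumption made for this example, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class TokenizerSketch {
    // Splits text on runs of whitespace or ASCII punctuation and
    // keeps only the nonempty pieces.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[\\s\\p{Punct}]+")) {
            if (!t.isEmpty()) {   // a leading separator yields one empty piece
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Search engines, like Google; return results!"));
    }
}
```

Because the apostrophe counts as punctuation, a word such as "user's" is split into "user" and "s"; whether that is desirable depends on the later stop-word and stemming steps.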

Stopwords:

Stop words are terms that occur many times in a collection and therefore are not discriminating, for example to, a, the, of, from, etc. Stop terms are assessed for a domain, and stop word lists are maintained. Removing stop words decreases the index size. The trend in information retrieval has been to reduce the size of the stop word list or to stop using one altogether, relying instead on better index compression and query-dependent weighting of stop terms. Figure 3 shows the stopword results in preprocessing.

Stemming:

Stemming is also known as conflation. It reduces the variants of a word that arise from inflection or derivation to a common stem. Stemming improves effectiveness by providing a better match between a query and a relevant document. A user who is searching for "swimming" might be interested in documents containing "swim".

It decreases the size of the term index by ~17% and acts as a form of lossy compression. Our system uses the Porter stemmer to remove suffixes such as ing, ion, ious, etc. In this stemmer, an incoming word is cleaned up in the initialization phase, one prefix trimming phase then takes place, and then five suffix trimming phases occur. Figure 4 shows the stemming results in preprocessing.


Figure 1: Personalised Search Engine

Figure 2: Tokenization Result

Figure 3: Stopwords Result

Figure 4: Stemming Result

Table 1 shows the grouping of similar content in the web pages using the k-means clustering algorithm, and Table 2 shows the unique similar content with cluster values retrieved from web pages across the various search engines. Figure 5 charts the similar content retrieved from web pages across the various search engines, and Table 3 lists the corresponding links.

FILE CLUSTER VALUE

apple//google//https___en.wikipedia.org_wiki_Apple_Inc..txt 1 0

apple//yahoo//https___en.wikipedia.org_wiki_Apple_Inc..txt 1 0

apple//bing//https___en.wikipedia.org_wiki_Apple_Inc..txt 1 0

apple//ask//https___en.wikipedia.org_wiki_Apple_Inc..txt 1 0

apple//google//https___support.apple.com_en_in.txt 2 1.75890591

apple//yahoo//https___support.apple.com_en_in.txt 2 1.75890591

apple//bing//https___support.apple.com_en_in.txt 2 1.75890591

apple//ask//https___support.apple.com_.txt 2 1.75890591

apple//google//https___www.apple.com_in_.txt 3 5.140819694

apple//google//https___www.apple.com_in_buy_shop_.txt 3 5.140819694

apple//google//https___www.apple.com_in_iphone_.txt 3 5.140819694

apple//yahoo//https___www.apple.com_.txt 3 5.140819694

apple//yahoo//https___www.apple.com_in_.txt 3 5.140819694

apple//yahoo//https___www.apple.com_in_ipad_.txt 3 5.140819694

apple//yahoo//https___www.apple.com_in_iphone_.txt 3 5.140819694

apple//bing//https___www.apple.com_in_.txt 3 5.140819694

apple//bing//https___www.apple.com_in_buy_.txt 3 5.140819694

apple//bing//https___www.apple.com_in_contact_.txt 3 5.140819694

apple//bing//https___www.apple.com_in_iphone_.txt 3 5.140819694

apple//bing//http___www.myimaginestore.com_.txt 3 5.140819694

apple//ask//https___www.apple.com_.txt 3 5.140819694

apple//ask//https___www.apple.com_in_buy_shop_.txt 3 5.140819694

apple//bing//https___simple.wikipedia.org_wiki_Apple.txt 4 0

apple//yahoo//https___www.apple.com_iphone_.txt 5 5.490529511

apple//ask//https___www.apple.com_ipad_.txt 5 5.490529511

apple//ask//https___www.apple.com_iphone_.txt 5 5.490529511

apple//ask//https___www.apple.com_watch_.txt 5 5.490529511

apple//google//https___www.apple.com_in_iphone_battery_and_performance_.txt 6 0

apple//google//https___www.apple.com_in_macbook_.txt 7 0


apple//google//https___www.apple.com_in_mac_.txt 8 13.01161655

apple//yahoo//https___www.apple.com_in_ios_ios_11_.txt 8 13.01161655

apple//yahoo//https___www.apple.com_in_mac_.txt 8 13.01161655

apple//yahoo//https___www.apple.com_iphone_se_.txt 8 13.01161655

apple//bing//https___www.apple.com_in_mac_.txt 8 13.01161655

apple//ask//https___www.apple.com_mac_.txt 8 13.01161655

apple//google//https___www.engadget.com_2018_01_26_apple_homepod_2018_release_.txt 9 0

apple//google//https___www.youtube.com_channel_UCE_M8A5yxnLfW0KghEeajjw.txt 10 1.414213562

apple//ask//https___www.youtube.com_user_Apple.txt 10 1.414213562

apple//google//http___imaginestore.org_.txt 11 0

Table 1: Group the similarity content using K-means Clustering Algorithm

Link VALUES

http://www.myimaginestore.com/ 5.14082

https://en.wikipedia.org/wiki/Apple_Inc. 0

https://support.apple.com/ 1.758906

https://support.apple.com/en-in 1.758906

https://www.apple.com/ 5.14082

https://www.apple.com/in/ 5.14082

https://www.apple.com/in/buy/ 5.14082

https://www.apple.com/in/buy/shop/ 5.14082

https://www.apple.com/in/iphone-battery-and-performance/ 0

https://www.apple.com/in/iphone/ 5.14082

https://www.apple.com/in/mac/ 13.01162

https://www.apple.com/iphone/ 5.14082

Table 2: Retrieve similar content with cluster values in webpages in various search engine


Figure 5: Chart for Retrieve similar content in webpages in various search

LINK

http://www.myimaginestore.com/

https://en.wikipedia.org/wiki/Apple_Inc.

https://support.apple.com/

https://support.apple.com/en-in

https://www.apple.com/

https://www.apple.com/in/

https://www.apple.com/in/buy/

https://www.apple.com/in/buy/shop/

https://www.apple.com/in/iphone-battery-and-performance/

https://www.apple.com/in/iphone/

https://www.apple.com/in/mac/

https://www.apple.com/iphone/

Table 3: Similar content retrieved from webpages across various search engines

[Figure 5 chart: value axis from 0.0 to 14.0; one entry per link listed in Table 3]


V. CONCLUSION AND FUTURE WORK

This paper has proposed a framework that uses the k-means clustering algorithm to retrieve similar content from webpages returned by various search engines. The system removes noisy values from the dataset during preprocessing, using tokenization, stopword removal, and stemming with the Porter stemmer algorithm. The performance of the proposed work is assessed on similar content in webpages. Our future work is to improve the similarity links and the page ranking values for web pages.
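The preprocessing stage summarized above can be sketched as follows. This is an illustrative outline under stated assumptions, not the system's code: the stopword set is a small hypothetical subset, lowercasing is assumed, and crude_stem is a simple suffix stripper standing in for the Porter stemmer, whose full rule set is considerably longer.

```python
import re

# Illustrative subset of function words and connectives (an assumption;
# a real system would use a much larger list).
STOPWORDS = {"a", "an", "and", "the", "of", "in", "for", "to", "is", "are", "on", "with"}

# Crude stand-in for stemming. These are NOT the Porter stemmer rules.
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Split text into non-empty runs of letters and digits,
    dropping spaces and punctuation. Lowercasing is an assumption."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword set."""
    return [t for t in tokens if t not in STOPWORDS]


def crude_stem(token):
    """Strip the first matching suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, remove stopwords, then stem each remaining token."""
    return [crude_stem(t) for t in remove_stopwords(tokenize(text))]


sample = "The new Macs are shipping in stores, and pricing is listed."
tokens = preprocess(sample)
print(tokens)  # ['new', 'mac', 'shipp', 'stor', 'pric', 'list']
```

The stems need not be dictionary words; they only have to map inflected forms of a word to the same token before the pages are compared.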

REFERENCES

[1] Ahmet Selman Bozkir, Ebru Akcapinar Sezer, "Layout-based computation of web page similarity ranks", Int. J. Human-Computer Studies 110 (2018) 95–114.

[2] Ali Mohammad Zareh Bidoki, Nasser Yazdani, "DistanceRank: An intelligent ranking algorithm for web pages", Information Processing and Management 44 (2008) 877–892.

[3] Bo Yang, Hechang Chen, Xuehua Zhao, Masato Naka, Jing Huang, "On characterizing and computing the diversity of hyperlinks for anti-spamming page ranking", Knowledge-Based Systems 77 (2015) 56–67.

[4] Christiane Behnert, Dirk Lewandowski, "Ranking Search Results in Library Information Systems — Considering Ranking Approaches Adapted From Web Search Engines", The Journal of Academic Librarianship 41 (2015) 725–735.

[5] Gabrielle Demange, "Mutual rankings", Mathematical Social Sciences 90 (2017) 35–42.

[6] Iman Rasekh, "A New Competitive Intelligence-Based Strategy for Web Page Search", The 2015 International Conference on Soft Computing and Software Engineering (SCSE 2015), Procedia Computer Science 62 (2015) 450–456.

[7] Michael Scholz, Jella Pfeiffer, Franz Rothlauf, "Using PageRank for non-personalized default rankings in dynamic markets", European Journal of Operational Research 260 (2017) 388–401.

[8] S Hariharan, S Dhanasekar, Kalyani Desikan, "Reachability Based Web Page Ranking Using Wavelets", 2nd International Symposium on Big Data and Cloud Computing (ISBCC'15), Procedia Computer Science 50 (2015) 157–162.

[9] Suruchi Chawla, "A novel approach of cluster based optimal ranking of clicked URLs using genetic algorithm for effective personalized web search", Applied Soft Computing 46 (2016) 90–103.

[10] Vali Derhami, Elahe Khodadadian, Mohammad Ghasemzadeh, Ali Mohammad Zareh Bidoki, "Applying reinforcement learning for web pages ranking algorithms", Applied Soft Computing 13 (2013) 1686–1692.

[11] Vidya P V, Reghu Raj P C, Jayan V, "Web Page Ranking Using Multilingual Information Search Algorithm - A Novel Approach", International Conference on Emerging Trends in Engineering, Science and Technology (ICETEST - 2015), Procedia Technology 24 (2016) 1240–1247.

[12] Vikas Jindal, Seema Bawa, Shalini Batra, "A review of ranking approaches for semantic search on Web", Information Processing and Management 50 (2014) 416–425.

[13] YaJun Du, YuFeng Hai, "Semantic ranking of web pages based on formal concept analysis", The Journal of Systems and Software 86 (2013) 187–197.
