Web Content Outlier Mining

8/2/2019 Web Content Outlier Mining

1/21


4/26/12

DETECTION AND REMOVAL OF REDUNDANT WEB CONTENT THROUGH RECTANGULAR AND SIGNED APPROACH


Presented by,

Pramod.N

Reg no. 091718V Sem M.C.A

A.I.M.I.T



Topics to be Covered

Abstract

Introduction

Related Works

Need for the Algorithm

Rectangular and signed approach

Algorithm Review

Conclusion

References



Abstract

Today, the Internet marks the era of the information revolution, and people rely on search engines to get relevant information without replicas. As the number of duplicated web pages increases, so do the indexing space and time complexity. Finding and removing these pages is therefore significant for search engines and similar systems. Web content outlier mining plays a vital role in covering all these aspects. Existing algorithms for web content outlier mining focus on applying weightage to structured documents. In this research work, by contrast, a mathematical approach based on signed and rectangular representation is developed to detect and remove redundancy between unstructured web documents as well. This method optimizes the indexing of web documents and improves the quality of search engines.


Introduction

Due to the voluminous amount of information available on the web, most people like to perform their tasks over the Internet.

Many web documents have redundant and irrelevant contents.

Web content outlier mining concentrates on finding outliers such as noisy, irrelevant and redundant pages among web documents.

Generally, outliers are data that obviously deviate from the rest, disobey the general mode or behavior of the data, and disagree with other existing data.


Web Mining

In general, web mining tasks can be classified into three major categories:

I. Web structure mining

II. Web usage mining

III. Web content mining.

Web structure mining tries to discover useful knowledge from the structure of hyperlinks.

Web usage mining refers to the discovery of user access patterns from web usage logs.

Web content mining aims to extract useful information from web pages based on their contents.



Related works

G. Poonkuzhali suggested a set-theoretical approach for detecting and eliminating redundant links in web documents.

Giuseppe Antonio Di proposed an algorithm based on clone detection and similarity metrics to detect duplicate pages in web sites and applications implemented with HTML; it works only for structured web documents.


Contd

Yunhe Weng came up with an improved COPS (Copy Detection Algorithm) scheme, which aims to protect the intellectual property of the document owner by detecting overlap among documents.

This method performs similarity computation only for the pages that are relevant to the suspicious pages.


Contd

Zhongming Han developed a novel multilayer framework for detecting duplicated web pages through two algorithms for detecting similar text paragraphs, based on edit distance and the bootstrap method.

This method detects duplicates efficiently using only tag statistics and text comparison, but it cannot find duplicates among multiple web pages.


Contd

All the above works on web content mining lack simplicity of concept and computation.

These issues motivate a novel mathematical approach, based on signed and rectangular representation, to detect and remove redundant web documents with less time and space complexity.

Apart from the above benefits, there is a need for an algorithm that works well for both unstructured and structured data.


Need for the Algorithm

Existing web mining algorithms do not consider documents having varying contents within the same category, called web content outliers.

Most of the time we get different web documents with the same contents!

This algorithm focuses on the detection and removal of noise and redundancy, which amounts to outlier mining.


Design of Proposed System


Rectangular and signed approach

In this framework, web documents are extracted from search engines in response to a query submitted by the user.

The obtained web documents are then preprocessed: stop words are removed, words are stemmed, and everything other than text, such as hyperlinks, sound and images, is discarded.

Then the number of documents extracted from the web is counted.
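
The preprocessing step described above could be sketched roughly as follows in Python. This is only an illustration, not the presenter's implementation: the stop-word list, the suffix-stripping rule used in place of a real stemmer, and the names STOP_WORDS, strip_suffix and preprocess are all assumptions made for the example.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "are", "the", "of", "is", "in", "to", "on"}


def strip_suffix(word):
    """Crude stand-in for stemming: drop a few common English suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(html):
    """Return the stemmed, stop-word-free words of an HTML fragment."""
    text = re.sub(r"<[^>]+>", " ", html)         # drop markup such as link and image tags
    words = re.findall(r"[a-z]+", text.lower())  # keep alphabetic tokens only
    return [strip_suffix(w) for w in words if w not in STOP_WORDS]


sample = '<p>The engines are <a href="x.html">indexing</a> the pages</p>'
cleaned = preprocess(sample)
print(cleaned)  # ['engine', 'index', 'page']
```

A production system would more likely rely on an HTML parser and an established stemmer; the regular expressions here only keep the sketch short.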


Contd

Next, an n x m matrix representation is generated for each of the extracted documents based on a 4-tuple whose elements are:

I. Number of pages

II. Paragraphs

III. Lines

IV. Word occurrences

Then all the 4-tuple elements taken from the n x m matrices of the first two documents are compared, and the outcome is stored using the signed approach.


Contd

Finally, redundancy computation is done based on the results of the similarity computation.

Every element of Di is addressed by a 4-tuple.

For example, the 4-tuple (3,2,5,8) refers to the 8th word of the 5th line in the 2nd paragraph of the 3rd page.

The n x m matrix representation makes retrieval and searching of web content easy.
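
As a small illustration of the 4-tuple addressing, the sketch below maps every word of a document to its (page, paragraph, line, word) position. The nested-list layout of the input and the name index_document are assumptions made for this example; the slides do not say how the matrix is stored.

```python
def index_document(pages):
    """Map each word to its 1-based (page, paragraph, line, word) position.

    `pages` is a hypothetical nested list: pages -> paragraphs -> lines -> words.
    """
    index = {}
    for k, page in enumerate(pages, start=1):
        for l, paragraph in enumerate(page, start=1):
            for m, line in enumerate(paragraph, start=1):
                for n, word in enumerate(line, start=1):
                    index[(k, l, m, n)] = word
    return index


# One page holding one paragraph of two lines.
doc = [[[["web", "content"], ["outlier", "mining"]]]]
index = index_document(doc)
print(index[(1, 1, 2, 1)])  # outlier (1st word of the 2nd line)
```

With such an index, looking up the word at any position is a single dictionary access, which is one way to read the claim about easy retrieval.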


Redundancy Computation (RC) Algorithm

Input: User Query q;

Method: Rectangular representation and Signed Approach

Output: Web Document without Redundancy.

Step 1: Extract input web documents Di based on the user query, where 1 ≤ i ≤ N.

Step 2: Preprocess all the extracted documents.

Step 3: Calculate the maximum number of pages p, paragraphs q, lines r and words s in any of the extracted web documents.

Step 4: Generate an n x m matrix for all extracted web documents with 4-tuples (k, l, m, n), where 1 ≤ k ≤ p, 1 ≤ l ≤ q, 1 ≤ m ≤ r and 1 ≤ n ≤ s respectively.
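
Step 3 could be computed along the following lines. The nested-list document layout (pages -> paragraphs -> lines -> words) and the function name max_dimensions are assumptions for illustration only.

```python
def max_dimensions(documents):
    """Return (p, q, r, s): the largest page, paragraph, line and word counts
    found in any document, where each document is a nested list
    pages -> paragraphs -> lines -> words (a hypothetical layout)."""
    p = q = r = s = 0
    for pages in documents:
        p = max(p, len(pages))
        for page in pages:
            q = max(q, len(page))
            for paragraph in page:
                r = max(r, len(paragraph))
                for line in paragraph:
                    s = max(s, len(line))
    return p, q, r, s


d1 = [[[["a", "b", "c"]]]]            # 1 page, 1 paragraph, 1 line, 3 words
d2 = [[[["a"], ["b"]]], [[["c"]]]]    # 2 pages; the first has a 2-line paragraph
bounds = max_dimensions([d1, d2])
print(bounds)  # (2, 1, 2, 3)
```

These bounds give the index ranges used for k, l, m and n in Step 4.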


Contd

Step 5: Initialize i=1.

Step 7: Assign j=i+1.

Step 8: Initialize PC=0 and NC=0 (PC = positive count, NC = negative count).

Step 9: Consider the first element, addressed by the 4-tuple (k,l,m,n), from Di and Dj and perform string comparison.

Step 10: If they are similar, update PC=PC+1; else NC=NC+1.

Step 11: Repeat steps 9 and 10 for all the elements of the 4-tuples up to (p,q,r,s) taken from Di and Dj.
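
The counting of PC and NC for one pair of documents might look like the sketch below. It assumes each document is a dictionary keyed by 4-tuples, that "similar" means exact string equality, and that positions empty in both documents are skipped; none of these details is stated on the slides, and the name signed_counts is hypothetical.

```python
from itertools import product


def signed_counts(doc_i, doc_j, bounds):
    """Return (PC, NC) for two documents stored as {(k, l, m, n): word} dicts."""
    p, q, r, s = bounds
    pc = nc = 0
    ranges = [range(1, limit + 1) for limit in (p, q, r, s)]
    for pos in product(*ranges):
        word_i, word_j = doc_i.get(pos), doc_j.get(pos)
        if word_i is None and word_j is None:
            continue  # assumption: cells empty in both documents are ignored
        if word_i == word_j:
            pc += 1   # matching words add to the positive count
        else:
            nc += 1   # differing or missing words add to the negative count
    return pc, nc


d_i = {(1, 1, 1, 1): "web", (1, 1, 1, 2): "content", (1, 1, 1, 3): "mining"}
d_j = {(1, 1, 1, 1): "web", (1, 1, 1, 2): "usage", (1, 1, 1, 3): "mining"}
counts = signed_counts(d_i, d_j, (1, 1, 1, 3))
print(counts)  # (2, 1)
```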


Contd

Step 12: If PC ≥ NC then

Di and Dj are redundant.

Remove Dj from the set of documents.

Else

Di and Dj are not redundant.

Step 13: Increment j and repeat steps 8 to 12 while j ≤ N.

Step 14: At the termination of step 13, redundancy with the first document is eliminated.

Step 15: Increment i and repeat the steps 7 to 13 until i
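
Putting the steps together, the pairwise elimination could be sketched as below. This is an illustrative reading, not the presenter's code: it assumes documents are dictionaries keyed by 4-tuples, that a pair counts as redundant when PC is at least NC, and that the outer loop runs over every remaining document. The names signed_counts and remove_redundant are hypothetical.

```python
def signed_counts(doc_i, doc_j):
    """Return (PC, NC) over every position present in either document."""
    pc = nc = 0
    for pos in doc_i.keys() | doc_j.keys():
        if doc_i.get(pos) == doc_j.get(pos):
            pc += 1
        else:
            nc += 1
    return pc, nc


def remove_redundant(documents):
    """Keep each document only if it is not redundant with an earlier kept one."""
    kept = []
    for doc in documents:
        redundant = False
        for earlier in kept:
            pc, nc = signed_counts(earlier, doc)
            if pc >= nc:  # assumed threshold for declaring redundancy
                redundant = True
                break
        if not redundant:
            kept.append(doc)
    return kept


a = {(1, 1, 1, 1): "web", (1, 1, 1, 2): "content", (1, 1, 1, 3): "mining"}
b = {(1, 1, 1, 1): "web", (1, 1, 1, 2): "usage", (1, 1, 1, 3): "mining"}
c = {(1, 1, 1, 1): "search", (1, 1, 1, 2): "engine", (1, 1, 1, 3): "index"}
unique = remove_redundant([a, b, c])
print(len(unique))  # 2: b is dropped as redundant with a
```

Comparing each document only against the earlier documents that were kept gives the same result as the nested i/j loops, because a removed Dj never takes part in later comparisons.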



Algorithm Review


Conclusion

Experimental results show that memory space, search time and run time are reduced by using the rectangular representation and signed approach.

As the efficiency of web content is increased, the quality of search engines also increases.

This method is very simple to implement.

This algorithm works well for both unstructured and structured data.


References

http://www.ijest.info/docs/IJEST10-02-09-11.pdf

http://www.waset.org/journals/waset/v56/v56-150.pdf

http://www.wseas.us/e-library/conferences/2011/Venice/ACACOS/ACACOS-12.pdf

http://www.libsearch.com/view/1323898



Thank You
