Web Content Outlier Mining
8/2/2019 Web Content Outlier Mining
4/26/12
DETECTION AND REMOVAL OF REDUNDANT WEB CONTENT
THROUGH RECTANGULAR AND SIGNED APPROACH
Presented by,
Pramod.N
Reg no. 091718V Sem M.C.A
A.I.M.I.T
-
Topics to be Covered
Abstract
Introduction
Related Works
Need for the Algorithm
Rectangular and signed approach
Algorithm Review
Conclusion
References
-
Abstract
Today, the Internet marks the era of the information revolution, and people rely on search engines to get relevant information without replicas. As the number of duplicated web pages increases, the indexing space and time complexity increase. Finding and removing these pages becomes significant for search engines and similar systems. Web content outlier mining plays a vital role in covering all these aspects. Existing algorithms for web content outlier mining focus on applying weightage to structured documents. In this research work, by contrast, a mathematical approach based on signed and rectangular representation is developed to detect and remove redundancy between unstructured web documents as well. This method optimizes the indexing of web documents and improves the quality of search engines.
-
Introduction
Due to the voluminous amount of information available on the web, most people like to perform their tasks over the internet.
There are many web documents which have redundant and irrelevant contents.
Web content outlier mining concentrates on finding outliers such as noise and irrelevant and redundant pages among web documents.
Generally, outliers are data that obviously deviate from others, disobey the general mode or behavior of the data, and disaccord with other existing data.
-
Web Mining
In general, web mining tasks can be classified into three major categories:
I. Web structure mining
II. Web usage mining
III. Web content mining
Web structure mining tries to discover useful knowledge from the structure of hyperlinks.
Web usage mining refers to the discovery of user access patterns from web usage logs.
Web content mining aims to extract/mine useful information from web pages based on their contents.
-
Related works
G Poonkuzhali suggested a set-theoretical approach for detecting and eliminating redundant links in web documents.
Giuseppe Antoio Di proposed an algorithm based on clone detection and similarity metrics to detect duplicate pages in web sites and applications implemented with HTML, which works only for structured web documents.
-
Contd
Yunhe Weng came up with an improved COPS (Copy Detection Algorithm) scheme, which aims to protect the intellectual property of the document owner by detecting overlap among documents.
This method performs similarity computation only for the pages that are relevant to the suspicious pages.
-
Contd
Zhongming Han developed a novel multilayer framework for detecting duplicated web pages through two similar-text-paragraph detection algorithms based on edit distance and a bootstrap method.
This method detects duplicates efficiently using only tag statistics and text comparison, but it cannot find duplicates among multiple web pages.
-
Contd
All the above works on web content mining lack simplicity of concept and computation.
These issues motivate a novel mathematical approach, based on signed and rectangular representation, that detects and removes redundant web documents with less time and space complexity.
Apart from the above benefits, there is a need for an algorithm that works well for both unstructured and structured data.
-
Need for the Algorithm
Existing web mining algorithms do not consider documents whose contents vary from others in the same category; such documents are called web content outliers.
Most of the time, we get different web documents with the same contents.
This algorithm focuses on the detection and removal of noise and redundancy, which amounts to outlier mining.
-
Design of Proposed System
-
Rectangular and signed approach
In this framework, web documents are extracted from search engines in response to a query given by the user.
The obtained web documents are then preprocessed, i.e., stop words, stem words, and all data other than text, such as hyperlinks, sound and images, are removed.
Then the number of documents extracted from the web is counted.
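The preprocessing step above could be sketched roughly as follows. This is only an illustration under stated assumptions: the stop-word list is a tiny made-up sample, `naive_stem` is a hypothetical stand-in for a real stemmer, and reading "stem words" as reducing each word to its stem is an interpretation, not something the slides spell out.

```python
import re

# Assumption: a tiny illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "and", "to", "in", "on"}


def strip_markup(html: str) -> str:
    """Drop script/style bodies and all tags, keeping only the visible text."""
    html = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
    return re.sub(r"(?s)<[^>]+>", " ", html)


def naive_stem(word: str) -> str:
    """Very rough suffix stripping (hypothetical stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(html: str) -> list[str]:
    """Return lower-cased, stop-word-free, stemmed words from an HTML string."""
    text = strip_markup(html).lower()
    words = re.findall(r"[a-z0-9]+", text)
    return [naive_stem(w) for w in words if w not in STOP_WORDS]


if __name__ == "__main__":
    page = '<p>The cats are <a href="x">running</a> in the garden</p>'
    print(preprocess(page))
```

The sketch flattens a page into a word list; keeping the page, paragraph and line boundaries that the later steps rely on would need a structure-aware parser.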
-
Contd
Next, an n x m matrix representation is generated for all the extracted documents based on 4-tuples with the following components:
I. Number of pages
II. Paragraphs
III. Lines
IV. Word occurrences
Then all the elements of the 4-tuples taken from the n x m matrices of the first two documents are compared, and the outcome is stored using the signed approach.
-
Contd
Finally, redundancy computation is done based on the results of the similarity computation.
Every element of Di is addressed by a 4-tuple.
For example, the 4-tuple (3,2,5,8) refers to the 8th word of the 5th line in the 2nd paragraph of the 3rd page.
Using the n x m matrix representation makes retrieval and searching of web content easy.
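One way to picture the 4-tuple addressing is a mapping from (page, paragraph, line, word) positions to words. The sketch below assumes a document is already available as nested lists (pages, then paragraphs, then lines, then words); that input shape and the name `to_tuples` are illustrative choices, not taken from the slides.

```python
def to_tuples(document):
    """Map (page, paragraph, line, word) -> word, with all indices 1-based.

    `document` is assumed to be nested lists:
    pages -> paragraphs -> lines -> words.
    """
    cells = {}
    for k, page in enumerate(document, start=1):
        for l, paragraph in enumerate(page, start=1):
            for m, line in enumerate(paragraph, start=1):
                for n, word in enumerate(line, start=1):
                    cells[(k, l, m, n)] = word
    return cells


if __name__ == "__main__":
    # One page holding one paragraph of two lines, then a second page.
    doc = [
        [[["web", "content"], ["outlier", "mining"]]],
        [[["search"]]],
    ]
    cells = to_tuples(doc)
    # (1, 1, 2, 2): 2nd word of the 2nd line in the 1st paragraph of the 1st page.
    print(cells[(1, 1, 2, 2)])
```

A dictionary keyed by the 4-tuple gives direct lookup of any word position, which is the kind of easy retrieval the slide attributes to the matrix representation.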
-
Redundancy Computation (RC) Algorithm
Input: User query q.
Method: Rectangular representation and signed approach.
Output: Web documents without redundancy.
Step 1: Extract input web documents Di based on the user query, where 1 ≤ i ≤ N.
Step 2: Preprocess all the extracted documents.
Step 3: Calculate the maximum number of pages p, paragraphs q, lines r and words s in any of the extracted web documents.
Step 4: Generate an n x m matrix for all extracted web documents with 4-tuples (k, l, m, n), where 1 ≤ k ≤ p, 1 ≤ l ≤ q, 1 ≤ m ≤ r and 1 ≤ n ≤ s respectively.
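Step 3 can be illustrated with a short sketch that scans every document for the largest page, paragraph, line and word counts. It assumes each document is a nested list (pages, paragraphs, lines, words); the function name `max_dimensions` is hypothetical.

```python
def max_dimensions(documents):
    """Return (p, q, r, s): the largest number of pages per document,
    paragraphs per page, lines per paragraph and words per line."""
    p = q = r = s = 0
    for pages in documents:
        p = max(p, len(pages))
        for paragraphs in pages:
            q = max(q, len(paragraphs))
            for lines in paragraphs:
                r = max(r, len(lines))
                for words in lines:
                    s = max(s, len(words))
    return p, q, r, s


if __name__ == "__main__":
    d1 = [[[["a", "b"], ["c"]]]]                # 1 page, 1 paragraph, 2 lines
    d2 = [[[["a"]]], [[["x", "y", "z"]]]]       # 2 pages, 1 line each
    print(max_dimensions([d1, d2]))
```

These four bounds fix the range of every 4-tuple index, so all documents can be compared position by position over the same index space.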
-
Contd
Step 5: Initialize i=1.
Step 7: Assign j=i+1.
Step 8: Initialize PC=0 and NC=0 (PC = positive count, NC = negative count).
Step 9: Consider the first element in the 4-tuple (k,l,m,n) from Di and Dj and perform string comparison.
Step 10: If they are similar, update PC=PC+1; else NC=NC+1.
Step 11: Repeat steps 9 and 10 for all the elements of the 4-tuples up to (p,q,r,s) taken from Di and Dj.
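The positive/negative counting for one pair of documents might look like the sketch below. It assumes each document is a dictionary from 4-tuples to words, treats "similar" as exact string equality, and skips positions that are empty in both documents; the slides do not settle any of these points, so all three are assumptions.

```python
def signed_counts(cells_i, cells_j):
    """Compare two documents position by position.

    Each argument maps a 4-tuple (k, l, m, n) to the word at that position.
    Returns (PC, NC): the number of matching and non-matching positions.
    """
    pc = nc = 0
    # Assumption: only positions filled in at least one document are compared.
    for key in cells_i.keys() | cells_j.keys():
        if cells_i.get(key) == cells_j.get(key):
            pc += 1  # same word at the same position
        else:
            nc += 1  # different word, or the position is missing in one document
    return pc, nc


if __name__ == "__main__":
    di = {(1, 1, 1, 1): "web", (1, 1, 1, 2): "mining"}
    dj = {(1, 1, 1, 1): "web", (1, 1, 1, 2): "content", (1, 1, 1, 3): "outlier"}
    print(signed_counts(di, dj))
```

Iterating over the union of occupied positions gives the same counts as walking every index up to (p,q,r,s) and ignoring cells that are empty on both sides.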
-
Contd
Step 12: If PC ≥ NC then
Di and Dj are redundant.
Remove Dj from the set of documents.
Else
Di and Dj are not redundant.
Step 13: Increment j and repeat steps 8 to 12 while j ≤ N.
Step 14: At the termination of the 13th step, redundancy with the first document is eliminated.
Step 15: Increment i and repeat the steps 7 to 13 until i
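Putting the loop over i and j together, a minimal sketch of the redundancy computation could read as follows. The comparison operator between PC and NC is not legible in this transcript, so the test `pc >= nc` is an assumption, as are the dictionary-based document representation and the function names.

```python
def signed_counts(cells_i, cells_j):
    """Return (PC, NC) for two documents given as {(k, l, m, n): word} dicts."""
    pc = nc = 0
    for key in cells_i.keys() | cells_j.keys():
        if cells_i.get(key) == cells_j.get(key):
            pc += 1
        else:
            nc += 1
    return pc, nc


def remove_redundant(documents):
    """Drop every document that is redundant with an earlier kept document."""
    docs = list(documents)  # work on a copy; the caller's list is untouched
    i = 0
    while i < len(docs):
        j = i + 1
        while j < len(docs):
            pc, nc = signed_counts(docs[i], docs[j])
            # Assumption: "redundant" means PC >= NC (operator lost in the source).
            if pc >= nc:
                del docs[j]  # D_j is redundant with D_i; the next document slides into j
            else:
                j += 1
        i += 1
    return docs


if __name__ == "__main__":
    a = {(1, 1, 1, 1): "web", (1, 1, 1, 2): "mining"}
    b = {(1, 1, 1, 1): "data", (1, 1, 1, 2): "base"}
    print(len(remove_redundant([a, dict(a), b])))
```

Because a removed document never becomes D_i, each surviving document is one that was not redundant with any earlier survivor.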
-
Algorithm Review
-
Conclusion
Experimental results show that memory space, search time and run time are reduced by using the rectangular representation and signed approach.
As web content is handled more efficiently, the quality of search engines also improves.
This method is very simple to implement.
This algorithm works well for both unstructured and structured data.
-
References
http://www.ijest.info/docs/IJEST10-02-09-11.pdf
http://www.waset.org/journals/waset/v56/v56-150.pdf
http://www.wseas.us/e-library/conferences/2011/Venice/ACACOS/ACACOS-12.pdf
http://www.libsearch.com/view/1323898
-
Thank You