Web Content Outlier Mining

  • 8/2/2019 Web Content Outlier Mining

    4/26/12

    DETECTION AND REMOVAL OF REDUNDANT WEB CONTENT THROUGH RECTANGULAR AND SIGNED APPROACH

    Presented by,

    Pramod.N

    Reg no. 091718V Sem M.C.A

    A.I.M.I.T

    Topics to be Covered

    Abstract

    Introduction

    Related Works

    Need for the Algorithm

    Rectangular and signed approach

    Algorithm Review

    Conclusion

    References

    Abstract

    Today, the Internet marks the era of the information revolution, and people rely on search engines to get relevant information without replicas. As the number of duplicated web pages increases, the indexing space and time complexity increase. Finding and removing these pages becomes significant for search engines and other similar systems. Web content outlier mining plays a vital role in covering all these aspects. Existing algorithms for web content outlier mining focus on applying weightage to structured documents, whereas in this research work a mathematical approach based on signed and rectangular representation is developed to detect and remove redundancy between unstructured web documents as well. This method optimizes the indexing of web documents and improves the quality of search engines.

    Introduction

    Due to the voluminous amount of information available on the web, most people like to perform their tasks over the Internet.

    Many web documents have redundant and irrelevant contents.

    Web content outlier mining concentrates on finding outliers such as noise, irrelevant pages and redundant pages among web documents.

    Generally, outliers are data that obviously deviate from the rest, disobey the general mode or behavior of the data, and disagree with other existing data.

    Web Mining

    In general, web mining tasks can be classified into three major categories:

    I. Web structure mining

    II. Web usage mining

    III. Web content mining

    Web structure mining tries to discover useful knowledge from the structure of hyperlinks.

    Web usage mining refers to the discovery of user access patterns from web usage logs.

    Web content mining aims to extract/mine useful information from web pages based on their contents.

    Related works

    G Poonkuzhali suggested a set-theoretical approach for detecting and eliminating redundant links in web documents.

    Giuseppe Antoio Di proposed an algorithm based on clone detection and similarity metrics to detect duplicate pages in web sites and applications implemented with HTML, which works only for structured web documents.

    Contd

    Yunhe Weng came up with an improved COPS (Copy Detection Algorithm) scheme, which aims to protect the intellectual property of the document owner by detecting overlap among documents.

    This method performs similarity computation only for the pages that are relevant to the suspicious pages.

    Contd

    Zhongming Han developed a novel multilayer framework for detecting duplicated web pages through two algorithms for detecting similar text paragraphs, based on edit distance and a bootstrap method.

    This method detects duplicates efficiently using only tag statistics and text comparison, but it cannot find duplicates among multiple web pages.

    Contd

    All the above works on web content mining lack simplicity of concept and computation.

    These issues motivate a novel mathematical approach, based on signed and rectangular representation, to detect and remove redundant web documents with less time and space complexity.

    Apart from the above benefits, there is a need for an algorithm which works well for both unstructured and structured data.

    Need for the Algorithm

    Existing web mining algorithms do not consider documents having varying contents within the same category, called web content outliers.

    Most of the time we get different web documents with the same contents!

    This algorithm focuses on the detection and removal of noise and redundancy, which amounts to outlier mining.

    Design of Proposed System

    Rectangular and signed approach

    In this framework, web documents are extracted from search engines in response to a query given by the user.

    Then the obtained web documents are preprocessed, i.e., stop words, stem words, and all data other than text (hyperlinks, sound, images, etc.) are removed.

    Then the number of documents extracted from the web is counted.
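
    A minimal sketch of the preprocessing step described above, assuming the extracted documents arrive as HTML strings. The `STOP_WORDS` set and the `preprocess` function are hypothetical names introduced for illustration, and the handling of stem words is left out for brevity:

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "are", "is", "of", "the", "to"}


def preprocess(html: str) -> list[str]:
    """Return the lowercase text words of a page, minus markup and stop words."""
    # Drop script/style bodies first, then every remaining tag
    # (hyperlinks, images, sound and other non-text elements).
    text = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", text)
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


page = '<p>The quality of <a href="x.html">search engines</a> is improved.</p><img src="a.png">'
print(preprocess(page))  # ['quality', 'search', 'engines', 'improved']
```

    Regular expressions are used here only to keep the sketch short; a proper HTML parser would be more robust on real pages.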

    Contd

    Next, an n x m matrix representation is generated for all the extracted documents based on four tuples, namely:

    I. Number of pages

    II. Paragraphs

    III. Lines

    IV. Word occurrences

    Then all the elements of the 4-tuples taken from the n x m matrices of the first two documents are compared, and the outcome is stored using the signed approach.

    Contd

    Finally, redundancy computation is done based on the results of the similarity computation.

    Every element of Di is taken as a 4-tuple.

    For example, the 4-tuple (3,2,5,8) refers to the 8th word from the 5th line in the 2nd paragraph of the 3rd page.

    Using the n x m matrix representation makes retrieval and searching of web content easy.
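
    The 4-tuple addressing described above can be sketched as a lookup table. This is an illustration only: the function name `index_document` and the layout assumptions (form feed between pages, blank line between paragraphs, newline between lines) are not taken from the presentation, and a dictionary stands in for the n x m matrix:

```python
def index_document(text: str) -> dict[tuple[int, int, int, int], str]:
    """Map 1-based (page, paragraph, line, word) positions to words."""
    index = {}
    # Assumed layout: "\f" separates pages, a blank line separates paragraphs.
    for k, page in enumerate(text.split("\f"), start=1):
        for l, paragraph in enumerate(page.split("\n\n"), start=1):
            for m, line in enumerate(paragraph.split("\n"), start=1):
                for n, word in enumerate(line.split(), start=1):
                    index[(k, l, m, n)] = word
    return index


doc = "web content mining\nfinds outliers\n\nredundant pages\fsecond page here"
idx = index_document(doc)
print(idx[(1, 1, 2, 1)])  # 1st word, 2nd line, 1st paragraph, 1st page: finds
print(idx[(2, 1, 1, 3)])  # 3rd word, 1st line, 1st paragraph, 2nd page: here
```

    Keying every word by its position gives direct retrieval of any word from its 4-tuple, which is the property the slide attributes to the matrix representation.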

    Redundancy Computation (RC) Algorithm

    Input: User Query q;

    Method: Rectangular representation and Signed Approach

    Output: Web Document without Redundancy.

    Step 1: Extract input web documents Di based on the user query, where 1 ≤ i ≤ N.

    Step 2: Preprocess all the extracted documents.

    Step 3: Calculate the maximum number of pages p, paragraphs q, lines r and words s in any of the extracted web documents.

    Step 4: Generate an n x m matrix for all extracted web documents with 4-tuples (k, l, m, n), where 1 ≤ k ≤ p, 1 ≤ l ≤ q, 1 ≤ m ≤ r and 1 ≤ n ≤ s respectively.

    Contd

    Step 5: Initialize i=1.

    Step 7: Assign j=i+1.

    Step 8: Initialize PC=0 and NC=0 (PC = positive count, NC = negative count).

    Step 9: Consider the first element in the 4-tuple (k,l,m,n) from Di and Dj and perform a string comparison.

    Step 10: If they are similar, update PC=PC+1; else NC=NC+1.

    Step 11: Repeat steps 9 and 10 for all the elements of the 4-tuples up to (p,q,r,s) taken from Di and Dj.

    Contd

    Step 12: If PC NC then Di and Dj are redundant.

    Remove Dj from the set of documents.

    Else

    Di and Dj are not redundant.

    Step 13: Increment j and repeat the steps from 8 to 12 until jN.

    Step 14: At the termination of the 13th step, redundancy with the first document is eliminated.

    Step 15: Increment i and repeat the steps 7 to 13 until i
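
    The pairwise loop of the RC algorithm can be sketched as follows. This is a rough illustration under stated assumptions, not the presenter's implementation: the comparison operator in Step 12 is not legible in this transcript, so `PC >= NC` is assumed; "similar" is taken to mean exact string equality; positions that are empty in both documents are skipped; and the names `is_redundant`, `remove_redundant` and `single_line` are hypothetical:

```python
Index = dict[tuple[int, int, int, int], str]


def is_redundant(di: Index, dj: Index) -> bool:
    """Signed comparison of two documents (Steps 8 to 12)."""
    pc = nc = 0  # PC = positive count, NC = negative count
    # Visiting the union of occupied positions is the same as scanning every
    # 4-tuple up to (p, q, r, s) while skipping cells empty in both documents.
    for pos in di.keys() | dj.keys():
        if di.get(pos) == dj.get(pos):  # assumption: "similar" means equal strings
            pc += 1
        else:
            nc += 1
    return pc >= nc  # assumption: the operator is missing in the transcript


def remove_redundant(docs: list[Index]) -> list[Index]:
    """Drop every Dj that is redundant with an earlier surviving Di."""
    docs = list(docs)
    i = 0
    while i < len(docs) - 1:
        j = i + 1
        while j < len(docs):
            if is_redundant(docs[i], docs[j]):
                del docs[j]  # the next document shifts into position j
            else:
                j += 1
        i += 1
    return docs


def single_line(words: str) -> Index:
    """Helper for the demo: index a one-line, one-paragraph, one-page document."""
    return {(1, 1, 1, n): w for n, w in enumerate(words.split(), start=1)}


a = single_line("web content outlier mining")
b = single_line("web content outlier detection")  # PC=3, NC=1 against a
c = single_line("search engine index")            # PC=0, NC=4 against a
print(len(remove_redundant([a, b, c])))  # 2
```

    With these assumptions, b is removed as redundant with a, while c survives because none of its positions match.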

    Algorithm Review

    Conclusion

    Experimental results show that memory space, search time and run time are reduced by using the rectangular representation and signed approach.

    As the efficiency of web content is increased, the quality of search engines also increases.

    This method is very simple to implement.

    This algorithm works well for both unstructured and structured data.

    References

    http://www.ijest.info/docs/IJEST10-02-09-11.pdf

    http://www.waset.org/journals/waset/v56/v56-150.pdf

    http://www.wseas.us/e-library/conferences/2011/Venice/ACACOS/ACACOS-12.pdf

    http://www.libsearch.com/view/1323898

    Thank You
