An Approach to Identify Duplicated Web Pages

Supervisor: Mr. Sayed Morteza Zaker
Presenter: Fateme Hadinezhad


Outline

1) Introduction
2) Definitions
3) Problem Statement
4) Results
5) Contribution
6) Conclusion
7) References


Introduction (1)

1) The rapid diffusion of the Internet and of the World Wide Web infrastructure is producing considerable growth in the demand for new Web sites and Web applications (WAs).

2) To further reduce time-to-market, new pages are often obtained by reusing the code of existing pages through simple copy-and-paste operations.

3) Duplicated Web pages, which have the same structure and differ only in the data they include, can be considered clones.


Introduction (2)

4) In this paper, an approach to detect duplicated pages in WAs is proposed.

5) The validity of the proposed approach has been assessed by means of experiments involving several WAs.

6) Organization of the paper:

Section 1: clone analysis

Section 2: identification of duplicated pages in WAs

Section 3: experiments carried out

Section 4: concluding remarks


Definitions

WAs (Web applications): a Web site may be thought of as a static site that may also provide dynamic information. A Web application provides the Web user with a means to modify the site status.

Clones: duplicated or similar portions of code in software artifacts.

Near-miss clone: a fragment of code that partially coincides with another one. The concept was introduced by Baxter et al. [Bax98].

Clone analysis: the research area that investigates methods and techniques for automatically detecting clones.


Problem Statement

Detecting duplicated WA pages with the Levenshtein distance is, in general, computationally very expensive.

The algorithm for computing the Levenshtein distance has a computational complexity of O(n^2), where n is the length of the longer string.
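To make the quadratic cost concrete, here is a minimal sketch of the classic dynamic-programming algorithm for the Levenshtein distance. It is an illustration only, not the implementation used in the paper, and the function name `levenshtein` is an assumption. The nested loops compute one table cell per pair of characters, which is where the O(n^2) bound comes from.

```python
def levenshtein(u: str, v: str) -> int:
    """Edit distance between u and v (insertions, deletions, substitutions)."""
    # Only two rows of the (len(u) + 1) x (len(v) + 1) table are kept in memory.
    previous = list(range(len(v) + 1))  # distance from the empty prefix of u
    for i, cu in enumerate(u, start=1):
        current = [i]  # distance from u[:i] to the empty prefix of v
        for j, cv in enumerate(v, start=1):
            cost = 0 if cu == cv else 1
            current.append(min(
                previous[j] + 1,         # delete cu
                current[j - 1] + 1,      # insert cv
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
```

When every pair of pages in a WA is compared, this cost is paid once per pair, which is why the approach becomes expensive for long pages.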


Results(1)


Results(2)

<td width="18%">
<img src="../images/Nuovo.jpg" width="92" height="27"></td>

(td, width, img, src, width, height, /td)

u = hifgieb

<td width="35%">
<div align="right"><img src="../pic1.jpg" width="92" height="27"></div></td>

(td, width, div, align, img, src, width, height, /div, /td)

v = hidcfgieab
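The encoding in this example can be sketched in Python. This is an illustration under stated assumptions, not the tool used in the paper: the class `TagSequence` and the helpers `tag_tokens` and `encode` are hypothetical names, and the alphabet is built by assigning letters to the distinct tokens in sorted order, a choice that happens to reproduce the strings u and v of the example.

```python
from html.parser import HTMLParser


class TagSequence(HTMLParser):
    """Collects tag names and attribute names in document order, ignoring data."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(tag)
        self.tokens.extend(name for name, _value in attrs)

    def handle_endtag(self, tag):
        self.tokens.append("/" + tag)


def tag_tokens(html: str) -> list[str]:
    parser = TagSequence()
    parser.feed(html)
    parser.close()
    return parser.tokens


def encode(token_lists: list[list[str]]) -> list[str]:
    # Assumption: one symbol per distinct token, assigned in sorted order.
    alphabet = sorted({t for tokens in token_lists for t in tokens})
    symbol = {t: chr(ord("a") + i) for i, t in enumerate(alphabet)}
    return ["".join(symbol[t] for t in tokens) for tokens in token_lists]


page1 = '<td width="18%"><img src="../images/Nuovo.jpg" width="92" height="27"></td>'
page2 = ('<td width="35%"><div align="right"><img src="../pic1.jpg" '
         'width="92" height="27"></div></td>')
u, v = encode([tag_tokens(page1), tag_tokens(page2)])
print(u, v)
```

Attribute values and text are discarded, so two fragments that differ only in their data map to the same string.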


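Results(3)

D(u, v) = 3 (Levenshtein distance between the strings u and v)

ED = 1.732 (distance between the tag-frequency vectors of the two fragments)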
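These two values can be related back to the example fragments. D(u, v) = 3 is consistent with the Levenshtein distance between u and v, since three insertions (div, align, /div) turn u into v. The value 1.732 equals sqrt(3), which is what a Euclidean distance between tag-frequency vectors gives if attribute names are counted along with tags. That reading is an assumption, and the sketch below, with the hypothetical helper `frequency_distance`, only illustrates it.

```python
from collections import Counter
from math import sqrt


def frequency_distance(tokens_a, tokens_b):
    """Euclidean distance between the token-frequency vectors of two fragments."""
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    # A Counter returns 0 for tokens it has not seen, so absent tags count as 0.
    all_tokens = freq_a.keys() | freq_b.keys()
    return sqrt(sum((freq_a[t] - freq_b[t]) ** 2 for t in all_tokens))


u_tokens = ["td", "width", "img", "src", "width", "height", "/td"]
v_tokens = ["td", "width", "div", "align", "img", "src", "width", "height",
            "/div", "/td"]

ed = frequency_distance(u_tokens, v_tokens)
print(round(ed, 3))
```

Unlike the Levenshtein distance, this metric ignores the order of the tags and needs only one pass over each fragment to build its frequency vector.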


Results(4)


Contribution(1)

Background and motivations:

1) Software clones and clone analysis

2) Web applications and Software clones

Client pages:

a) static page

b) dynamic page


Contribution(2)

Web page:

a) control component

b) data component

Metrics to detect duplicated Web pages

1) Detecting duplicated Web pages by the Levenshtein distance


Contribution(3)

a) Detecting duplicated client pages

2) Detecting duplicated client pages using a frequency-based metric

3) Detecting duplicated server pages

Case studies

Clone detection within a WA


Conclusion(1)

In this paper, an approach to clone analysis in the context of Web systems has been proposed.

Pages of a WA that have the same control component were considered clones, even if they differed in the data component.

Two methods for detecting duplicated WA pages have been defined and tested experimentally: one exploiting the Levenshtein distance and the other based on the frequency of the HTML tags in a page.


Conclusion(2)

The proposed approach has also been successfully applied to identify a case of plagiarism.

Further experimentation should be carried out to better validate the proposed methods.


References

[Bak93] Baker B. S., A theory of parametrized pattern matching: algorithms and applications, in Proceedings of the 25th Annual ACM Symposium on Theory of Computing, 71-80, May 1993.

[Bak95] Baker B. S., On finding duplication and near duplication in large software systems, in Proc. of the 2nd Working Conference on Reverse Engineering, IEEE Computer Society Press, 1995.

[Bak95b] Baker B. S., Parametrized pattern matching via Boyer-Moore algorithms, in Proceedings of the Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, 541-550, Jan 1995.


[Bal00] Balazinska M., Merlo E., Dagenais M., Lagüe B., Kontogiannis K., Advanced clone-analysis to support object-oriented system refactoring, in Seventh Working Conference on Reverse Engineering, 98-107, Nov 2000.

[Bal99] Balazinska M., Merlo E., Dagenais M., Lagüe B., Kontogiannis K., Measuring clone based reengineering opportunities, in International Symposium on Software Metrics, METRICS’99, IEEE Computer Society Press, Nov 1999.


[Bax98] Baxter I. D., Yahin A., Moura L., Sant’Anna M., Bier L., Clone Detection Using Abstract Syntax Trees, in Proceedings of the International Conference on Software Maintenance, 368-377, IEEE Computer Society Press, 1998.

[Ber84] Berghel H. L., Sallach D. L., Measurements of program similarity in identical task environments, SIGPLAN Notices, 19(8):65-76, Aug 1984.

[Frak92] Frakes W. B., Baeza-Yates R., Information Retrieval: Data Structures and Algorithms, Prentice-Hall, Englewood Cliffs, NJ, 1992.


[Gri81] Grier S., A tool that detects plagiarism in PASCAL programs, in SIGCSE Bulletin, 13(1), 1981.

[Hor90] Horwitz S., Identifying the semantic and textual differences between two versions of a program, in Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, 234-245, June 1990.

[Jan88] Jankowitz H. T., Detecting plagiarism in student PASCAL programs, in Computer Journal, 31(1):1-8, 1988.


[Kon96] Kontogiannis K., De Mori R., Merlo E., Galler M., Bernstein M., Pattern Matching for clone and concept detection, in Journal of Automated Software Engineering, 3:77-108, Mar 1996.

[Kon95] Kontogiannis K., De Mori R., Bernstein M., Merlo E., Pattern Matching for Design Concept Localization, in Proc. of the 2nd Working Conference on Reverse Engineering, IEEE Computer Society Press, 1995.
