Supervisor: Mr. Sayed Morteza Zaker. Presenter: Fateme Hadinezhad.
An Approach to Identify Duplicated Web Pages
Outline
1) Introduction
2) Definition
3) Problem Statement
4) Results
5) Contribution
6) Conclusion
7) References
Introduction (1)
1) The rapid diffusion of the Internet and of the World Wide Web
infrastructure is producing considerable growth in the demand for new
Web sites and Web applications (WAs).
2) To further reduce time-to-market, new pages are often obtained by
reusing the code of existing pages through simple copy-and-paste
operations.
3) Duplicated Web pages, which have the same structure and differ only
in the data they include, can be considered clones.
Introduction (2)
4) This paper proposes an approach to detect duplicated pages in WAs.
5) The validity of the proposed approach has been assessed by
means of experiments involving several WAs.
6) The paper is organized as follows:
Section 1: clone analysis
Section 2: identification of duplicated pages in WAs
Section 3: experiments carried out
Section 4: concluding remarks
Definition
WAs: a Web site may be thought of as a static site that may also provide
dynamic information, whereas a Web application (WA) provides the Web user
with a means to modify the site status.
Clones: duplicated or similar portions of code in software artifacts.
Near-miss clone: a fragment of code that partially coincides with another
one; the concept was introduced by Baxter et al. [Bax98].
Clone analysis: the research area that investigates methods and
techniques for automatically detecting clones.
Problem Statement
Detecting duplicated WA pages with the Levenshtein distance is in
general very expensive from a computational point of view.
The algorithm for computing the Levenshtein distance has
computational complexity O(n^2), where n is the length of the
longer string.
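To make the quadratic cost concrete, here is a minimal sketch of the textbook dynamic-programming computation of the Levenshtein distance. It is not taken from the paper; the function name `levenshtein` and the two-row layout are illustrative choices. The nested loops visit one table cell per pair of characters, which is where the O(n^2) bound comes from when the two strings have similar length.

```python
def levenshtein(s: str, t: str) -> int:
    """Edit distance between s and t (insertions, deletions, substitutions)."""
    if len(s) < len(t):
        s, t = t, s  # keep the shorter string in the inner loop
    previous = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        current = [i]  # distance from s[:i] to the empty prefix of t
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(previous[j] + 1,          # delete cs
                               current[j - 1] + 1,       # insert ct
                               previous[j - 1] + cost))  # match or substitute
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

Only two rows of the table are kept in memory, so space is linear, but the running time is still proportional to the product of the two string lengths.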
Results (1)
Results (2)
<td width="18%">
<img src="../images/Nuovo.jpg" width="92"
height="27"></td>
(td, width, img, src, width, height, /td)
u = hifgieb
<td width="35%">
<div align="right"><img src="../pic1.jpg"
width="92" height="27"></div></td>
(td, width, div, align, img, src, width,
height, /div, /td)
v = hidcfgieab
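The encoding step in this example can be sketched as follows: keep only tag and attribute names in document order, then map each distinct name to one letter. The class `StructureTokens` and the helpers `tokenize` and `encode` are hypothetical names, and assigning letters in sorted token order is an assumption; it happens to reproduce the strings u and v shown on the slide, but the slides do not state how the alphabet was chosen.

```python
from html.parser import HTMLParser
import string


class StructureTokens(HTMLParser):
    """Collects tag and attribute names in document order; values and text are ignored."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(tag)
        self.tokens.extend(name for name, _value in attrs)

    def handle_endtag(self, tag):
        self.tokens.append("/" + tag)


def tokenize(html: str) -> list:
    parser = StructureTokens()
    parser.feed(html)
    parser.close()
    return parser.tokens


def encode(pages):
    """Map each distinct token to one letter and return one string per page."""
    token_lists = [tokenize(page) for page in pages]
    alphabet = sorted({tok for toks in token_lists for tok in toks})
    # Assumption: letters follow sorted token order; this sketch supports at most 26 tokens.
    symbol = dict(zip(alphabet, string.ascii_lowercase))
    return ["".join(symbol[tok] for tok in toks) for toks in token_lists]


page_u = '<td width="18%"><img src="../images/Nuovo.jpg" width="92" height="27"></td>'
page_v = ('<td width="35%"><div align="right"><img src="../pic1.jpg" '
          'width="92" height="27"></div></td>')
u, v = encode([page_u, page_v])
print(u, v)  # hifgieb hidcfgieab
```

Because attribute values and text content are dropped, two pages that differ only in their data produce the same string, which is the sense in which they share a structure.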
Results (3)
Results (4)
D(u, v) = 3
ED = 1.732
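Both values are consistent with the encoded strings u = hifgieb and v = hidcfgieab. The string v can be obtained from u by inserting three symbols (d, c and a), which gives a Levenshtein distance of 3. The value 1.732 is approximately the square root of 3, which matches a Euclidean distance between the symbol-frequency vectors of the two strings. Reading ED as that Euclidean distance is an assumption, since the slide does not spell the abbreviation out; the helper name `euclidean_distance` is illustrative.

```python
from collections import Counter
from math import sqrt


def euclidean_distance(u: str, v: str) -> float:
    """Euclidean distance between the symbol-frequency vectors of u and v."""
    fu, fv = Counter(u), Counter(v)
    # A Counter returns 0 for symbols that do not occur in a string.
    return sqrt(sum((fu[s] - fv[s]) ** 2 for s in set(fu) | set(fv)))


u = "hifgieb"
v = "hidcfgieab"
print(round(euclidean_distance(u, v), 3))  # 1.732
```

Counting symbols takes time linear in the string length, so a frequency-based comparison avoids the quadratic table that the Levenshtein distance requires, at the price of ignoring the order of the symbols.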
Contribution (1)
Background and motivations:
1) Software clones and clone analysis
2) Web applications and software clones
Client pages:
a) static page
b) dynamic page
Contribution (2)
Web page:
a) control component
b) data component
Metrics to detect duplicated Web pages:
1) Detecting duplicated Web pages by the Levenshtein distance
Contribution (3)
a) Detecting duplicated client pages
2) Detecting duplicated client pages using a frequency-based metric
3) Detecting duplicated server pages
Case studies:
Clone detection within a WA
Conclusion (1)
This paper proposed an approach to clone analysis in the context of
Web systems.
Pages of a WA having the same control component were
considered clones, even if they differed in the data component.
Two methods for detecting duplicated WA pages were defined and
evaluated experimentally: one exploits the Levenshtein distance, and
the other is based on the frequency of the HTML tags in a page.
Conclusion (2)
The proposed approach has also been successfully applied to
identify a case of plagiarism.
Further experimentation should be carried out to better validate
the proposed methods.
References
[Bak93] Baker B. S., A theory of parametrized pattern matching:
algorithms and applications, in Proceedings of the 25th Annual
ACM Symposium on Theory of Computing, 71-80, May 1993.
[Bak95] Baker B. S., On finding duplication and near duplication in
large software systems, in Proc. of the 2nd Working Conference
on Reverse Engineering, IEEE Computer Society Press, 1995.
[Bak95b] Baker B. S., Parametrized pattern matching via Boyer-Moore
algorithms, in Proceedings of the Sixth Annual ACM-SIAM
Symposium on Discrete Algorithms, 541-550, Jan 1995.
[Bal00] Balazinska M., Merlo E., Dagenais M., Lagüe B., Kontogiannis
K., Advanced clone-analysis to support object-oriented system
refactoring, in Seventh Working Conference on Reverse
Engineering, 98-107, Nov 2000.
[Bal99] Balazinska M., Merlo E., Dagenais M., Lagüe B., Kontogiannis
K., Measuring clone based reengineering opportunities, in
International Symposium on Software Metrics, METRICS’99,
IEEE Computer Society Press, Nov 1999.
[Bax98] Baxter I. D., Yahin A., Moura L., Sant’Anna M., Bier L., Clone
Detection Using Abstract Syntax Trees, in Proceedings of the
International Conference on Software Maintenance, 368-377,
IEEE Computer Society Press, 1998.
[Ber84] Berghel H. L., Sallach D. L., Measurements of program
similarity in identical task environments, SIGPLAN Notices,
9(8):65-76, Aug 1984.
[Frak92] Frakes W. B., Baeza-Yates R., Information Retrieval: Data
Structures and Algorithms, Prentice-Hall, Englewood Cliffs, NJ,
1992.
[Gri81] Grier S., A tool that detects plagiarism in PASCAL programs, in
SIGCSE Bulletin, 13(1), 1981.
[Hor90] Horwitz S., Identifying the semantic and textual differences
between two versions of a program, in Proceedings of the ACM
SIGPLAN Conference on Programming Language Design and
Implementation, 234-245, June 1990.
[Jan88] Jankowitz H. T., Detecting plagiarism in student PASCAL
programs, in Computer Journal, 31(1):1-8, 1988.
[Kon96] Kontogiannis K., De Mori R., Merlo E., Galler M., Bernstein
M., Pattern Matching for clone and concept detection, in
Journal of Automated Software Engineering, 3:77-108, Mar
1996.
[Kon95] Kontogiannis K., De Mori R., Bernstein M., Merlo E., Pattern
Matching for Design Concept Localization, in Proc. of the 2nd
Working Conference on Reverse Engineering, IEEE Computer
Society Press, 1995.