ISTANBUL TECHNICAL UNIVERSITY FACULTY of COMPUTER and INFORMATICS
EVALUATION of DOM TREE SIMILARITIES
Graduation Project
Teoman Turan 040100014
Department: Computer Engineering
Supervisor: Asst. Prof. Dr. Tolga OVATMAN
June 2015
Declaration of Originality
I declare that
1. In this study, all citations from other sources are clearly indicated by giving reference
to the relevant sources, and
2. Sections except for the citations, especially theoretical studies and software/hardware
that form the main subject of the project, are prepared by me.
Istanbul, 06/29/2015
Teoman Turan
EVALUATION of DOM TREE SIMILARITIES
(SUMMARY)
HTML (Hyper-Text Markup Language) is a markup language used to design web pages. The syntax
of an HTML file forms a tree. Its major nodes are the elements of the file (the tags written
between the signs < and >). Its minor nodes, connected to the element nodes as their children,
are the attributes of these elements (color, format, link target, image, etc.) and the texts that sit
between their opening and closing tags. Starting from the root node, which corresponds to the
<html> element, the remaining element nodes form a tree according to the order in which they
are nested in the syntax. This tree is called the “DOM (Document Object Model) Tree”. The main
purpose of this graduation project is to develop an algorithm that evaluates the level of
similarity between the designs of two HTML files by comparing their DOM trees. Along with
the algorithm, a simple GUI (graphical user interface) has been developed; it contains buttons
to load and compare HTML files and text labels that display the results.
The project has been developed under Windows 8.1 (64-bit, English) with all updates installed.
The latest version of Eclipse IDE for Java EE Developers (Luna) has been used as the IDE, and
the Java programming language has been chosen, with the latest versions of the Java
Development Kit (JDK) and the Java Runtime Environment (JRE). For the first part of the
development, the Dom4j Java library has been used to parse an HTML file and extract its DOM
tree. The rest of the development mostly consists of building the algorithm that compares two
extracted DOM trees, using the object-oriented programming features of Java.
The system returns three kinds of similarity ratio: the similarity ratio with respect to the
frequency of the elements of the HTML files; the similarity ratio with respect to the number of
attributes and their parents (which nodes own at least one attribute?); and the similarity ratio
with respect to the number of text nodes and their parents (which nodes own text children?).
A fourth, overall similarity ratio is calculated from the former three, each weighted by its
influence. Comments in an HTML file, if any, are ignored. To use the application, the user
presses the relevant buttons to load the HTML files, choosing them through the file explorer,
and then presses the button that calculates the four similarity ratios between them. The results
are printed in the same window.
EVALUATION of DOM TREE SIMILARITIES
(SUMMARY, translated from the Turkish ÖZET)
HTML (Hyper-Text Markup Language) is a markup language used to design web pages. The syntax
of an HTML file forms a tree: its major nodes are the elements of the file (the tags written
between the signs < and >), and its other nodes, connected to the element nodes as their
children, are the attributes (text format, text color, an embedded image, the address a link
points to, the language of the page, the width of table columns, etc.) and the texts that sit
between the opening and closing tags. Starting from the root node, which corresponds to the
<html> element at the very beginning, or more precisely at the outermost shell, of an HTML
file, a tree structure arises according to the order in which the elements of the file are nested.
A small example can be seen in the code piece below.
<html>
<body>
<h1>My First Paragraph</h1>
<p id="para">The capital of Turkey is Ankara.</p>
</body>
</html>
When this code piece is laid out as a tree, html, body, h1 and p are the major nodes of the tree.
According to the nesting order, html is the root node at the top, the body node descends from
it, and the body node in turn has the child nodes h1 and p. In addition, because of the
id="para" attribute in the <p> tag, an id child node is connected to the p node. Furthermore,
since the text “My First Paragraph” sits between the opening and closing tags of <h1>, a text
child node is connected to this element. The same holds for the p node because of the text “The
capital of Turkey is Ankara.” The resulting tree is called the “DOM (Document Object Model)
tree”.
The main purpose of the graduation project described in this report is to develop an algorithm
that measures the level of similarity between the designs of two HTML files by comparing their
DOM trees. The system examines and parses the two loaded HTML files and extracts their DOM
trees. It then returns the similarity percentages between the two DOM trees using the
developed algorithm. Along with the algorithm, a plain user interface has been designed; it
contains buttons for loading the HTML files and running the similarity evaluation, and it
displays the results.
The project has been developed under the Windows 8.1 (64-bit, English) operating system with
all updates installed. The latest version of the development environment Eclipse IDE for Java
EE Developers (Luna) has been used as the IDE. The Java programming language, with the latest
versions of the Java Development Kit (JDK) and the Java Runtime Environment (JRE), has been
chosen because it supports object-oriented programming and offers suitable, easy-to-use
libraries for some of the steps. For the first part of the development, the HTML documents
have to be read by the system, scanned from beginning to end, and their DOM trees extracted.
Dom4j, a Java library, has been used for this purpose. The rest of the development consisted
of producing the algorithm, developing the interface, and testing.
The developed system first reads and examines two HTML files and extracts their DOM trees
with all nodes and connections; it then compares these trees with the developed algorithm and
returns the percentage by which their designs resemble each other. According to the algorithm,
the system returns three types of similarity percentage. The first is the similarity ratio with
respect to the amount (frequency) of the distinct element nodes in the HTML files. The second
is the similarity ratio with respect to the number of text nodes in the HTML files and the
number of parent nodes that own at least one text node: the question “Which nodes have text
children?” is answered, and the method used for the first similarity ratio is applied to the
parent element nodes that are found. The third is the similarity ratio with respect to the
number of element attributes in the HTML files and the number of parent nodes that own at
least one attribute node; as for the text nodes, the question “Which nodes have attribute
children?” is answered, and the method used for the first similarity ratio is applied to the
parent element nodes that are found. In addition, an “overall similarity ratio” is calculated as
a fourth type. This ratio is an average of the first three similarity ratios weighted by their
influence and importance.
Since the system compares the designs of two HTML files, and therefore of two DOM trees, the
values inside the text nodes (the texts themselves) and the values taken by the element
attributes do not matter. Because “design” is the subject, what matters is the existence of the
nodes and their connections to each other. The system also disregards the comments in the
HTML files.
The application is easy to use. The user first clicks the relevant button to load an HTML file
into the system, then finds and selects the file through the file explorer. Next, the user clicks
the button that calculates the similarity ratios. The similarity ratios calculated in the
background are printed in the same window. This procedure can be repeated with different
HTML files without leaving the application.
TABLE of CONTENTS
1 – INTRODUCTION
1.1. The Augmentation of Web Pages and the Results
1.2. A Brief Introduction to the Main Problem
1.3. The Study Done and the Results
1.4. The Sections of the Report
2 – THE PROJECT DESCRIPTION and PLAN
2.1. The Project Description
2.2. The Project Plan
3 – THEORETICAL INFORMATION
3.1. HTML (Hyper-Text Markup Language)
3.2. DOM (Document Object Model)
4 – ANALYSIS and MODELLING
4.1. Understanding the Main Problem
4.2. Modelling
4.2.1. The Programming Language and the Development Environment
4.2.2. The Project Hierarchy
4.2.3. The Programming Concept
4.2.4. The Modelling of Classes
4.2.5. The Comparison and Similarity Evaluation Algorithm
5 – DESCRIPTION, IMPLEMENTATION and TEST
5.1. Classes and Methods
5.1.1. ElementNode.java
5.1.2. AttributeNode.java
5.1.3. TextNode.java
5.1.4. TreeSim.java
6 – EXPERIMENTAL RESULTS
7 – THE RESULTS and SUGGESTIONS
8 – REFERENCES
1. INTRODUCTION
1.1. The Augmentation of Web Pages and the Results
As the role the World Wide Web (WWW) plays in our lives grows at a remarkable pace, the total
number of web pages naturally increases in parallel. To meet our diversified interests and
specific needs such as research or acquiring media, the total number of web sites around the
world has approached 1 billion. According to the statistics published by Internet Live Stats,
assuming that the term “web site” means a distinct hostname, as of mid-2014 there are
968,882,453 web sites visited by 2,295,249,355 Internet users, which means that each web site
is visited by approximately 3 users. [1]
This huge growth leads to the following fact: since it is impossible to produce enough distinct
page designs for almost a billion web sites, the differences between the designs of web pages
that serve the same purpose, such as the official web site of a product or company, a bulletin
board, or a video sharing platform, are shrinking. Web page templates are a good example of
this. When a new web page serving the same purpose has to be designed, one of a handful of
ready-made templates can be used today. At present, it is possible to see thousands of web
pages whose schematic structures are close to each other. Figure 1.1 below shows a sample
template that can be used for the designs of a great number of new web sites.
The fact that web page designs are getting closer to each other raises an issue that ought to be
studied: the evaluation of web page similarities.
Figure 1.1: A sample template
1.2. A Brief Introduction to the Main Problem
A web site consists of a number of interconnected web pages. As stated in the summary section
of the thesis, a web page is written in HTML, and an HTML file forms the source of a web page.
The content of the web page can thus be described as a reflection of the way the statements in
the source HTML file are constructed. The placement of the elements in an HTML file, their
attributes, and the texts that sit among them form the schema, in other words the design, of the
relevant web page. This nesting order forms a tree called the DOM tree.
All of this information leads to the following conclusion: how similar two web pages are can be
evaluated by an algorithm that compares the DOM trees extracted from these pages. Hence, the
main problem addressed in this thesis is to devise an algorithm that compares two DOM trees
with respect to a chosen property and then measures the level of similarity between them
accordingly.
1.3. The Study Done and the Results
In the light of the facts explained in Section 1.2, the first step is to extract the nodes of the
DOM trees and the connections among them by parsing the HTML files. To achieve this, Dom4j,
a Java library that provides classes, data types and methods for parsing an HTML file and
extracting its DOM tree, has been used. Using the facilities of the library together with the
object-oriented programming features of the Java programming language, the following DOM
components are extracted from an HTML file whose path is passed to the system as a parameter:
Element nodes: The major nodes, corresponding to the tags between the marks < and >.
Attribute nodes: One kind of minor node, corresponding to the attributes of the elements.
Attributes are the statements inside a tag that indicate specific properties of that element,
such as the path of an embedded image, the font size, the text format, the type of script used,
the target of a link, or the identity of the element. These nodes are connected to the element
nodes as their children.
Text nodes: Like the attribute nodes, these nodes are connected to element nodes, namely to
those corresponding to tags whose opening (<…>) and closing (</…>) ends enclose a text. If an
element encloses such a text, its corresponding node has text nodes as children.
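The extraction step described above can be sketched as follows. This is only a minimal illustration, not the project's code: it uses the DOM parser bundled with the JDK (javax.xml.parsers) instead of Dom4j so that it runs without extra libraries, it assumes well-formed (XHTML-like) input, and the class DomComponentSketch with its fields and methods is a hypothetical name chosen for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper (not part of the project): collects the three node kinds
// with the JDK's built-in DOM parser; the project itself uses Dom4j.
class DomComponentSketch {
    final List<String> elements = new ArrayList<>();    // element names in document order
    final List<String> attributes = new ArrayList<>();  // "owner@attribute" pairs
    final List<String> textParents = new ArrayList<>(); // owner element of each non-blank text node

    // Assumption: the markup is well-formed, so an XML parser can read it.
    static DomComponentSketch extract(String wellFormedMarkup) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(wellFormedMarkup)));
            DomComponentSketch result = new DomComponentSketch();
            result.visit(doc.getDocumentElement());
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("markup could not be parsed", e);
        }
    }

    // Depth-first walk over element nodes; attribute and text children are recorded,
    // comment nodes are skipped, and text values are deliberately not stored.
    private void visit(Node element) {
        elements.add(element.getNodeName());
        NamedNodeMap attrs = element.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            attributes.add(element.getNodeName() + "@" + attrs.item(i).getNodeName());
        }
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                visit(child);
            } else if (child.getNodeType() == Node.TEXT_NODE
                    && !child.getTextContent().trim().isEmpty()) {
                textParents.add(element.getNodeName());
            }
        }
    }

    public static void main(String[] args) {
        DomComponentSketch c = extract(
            "<html><body><h1>Title</h1><p id=\"para\">Some text.</p><!-- note --></body></html>");
        System.out.println("elements:     " + c.elements);
        System.out.println("attributes:   " + c.attributes);
        System.out.println("text parents: " + c.textParents);
    }
}
```

Only names are recorded, which matches the idea that the design of a page, rather than its content, is what gets compared.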
Most of the remaining development concerns the construction of the comparison algorithm.
Given the lists of DOM components explained above, the algorithm can be summarized briefly as
being based on the ratio of the number of like-for-like nodes to the total number of nodes. In
more detail, the calculation of the following 4 similarity ratios (in percentages) has been
implemented:
The similarity ratio between the element nodes of two DOM trees: The frequency, in other
words the number of occurrences, of each distinct element node extracted and stored in a list
in the first step (for example, the number of li, the number of table, the number of p, the
number of script) is found. Then, for each distinct element, the lower of its frequencies in
the two trees is taken, and the sum of these lower frequencies is divided by the total number
of element nodes.
The similarity ratio between the text nodes of two DOM trees: The frequency of each distinct
element node that has at least one text node as a child is found; then the same procedure as for
the element nodes is followed to reach the result.
The similarity ratio between the attribute nodes of two DOM trees: The procedure used for the
text nodes is applied to each distinct element node that has at least one attribute node as a
child. The procedure used for the element nodes is also applied to each distinct attribute
node. The average of these two results gives the final ratio.
The overall similarity ratio: This is calculated from the similarities described above, weighted
by their influence as follows: 60% of the element node similarity, 30% of the text node
similarity, and 10% of the attribute node similarity.
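The four ratios listed above can be illustrated with a short sketch. It is not the project's implementation: the class and method names are hypothetical, and since the text does not say which total the sum of the lower frequencies is divided by, the sketch assumes the node total of the larger tree. The 60/30/10 weights and the averaging of the two attribute results are taken from the description above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; names are not taken from the project.
class SimilaritySketch {

    // Sum of the lower frequencies over the node total, as a percentage.
    // Assumption: the denominator is the node total of the larger tree.
    static double frequencySimilarity(Map<String, Integer> first, Map<String, Integer> second) {
        int shared = 0;
        for (Map.Entry<String, Integer> e : first.entrySet()) {
            shared += Math.min(e.getValue(), second.getOrDefault(e.getKey(), 0));
        }
        int total = Math.max(total(first), total(second));
        // Assumption: two empty frequency lists are treated as identical.
        return total == 0 ? 100.0 : 100.0 * shared / total;
    }

    private static int total(Map<String, Integer> frequencies) {
        int sum = 0;
        for (int f : frequencies.values()) {
            sum += f;
        }
        return sum;
    }

    // Attribute similarity: average of the parent-based and the attribute-based ratio.
    static double attributeSimilarity(double byParents, double byAttributes) {
        return (byParents + byAttributes) / 2.0;
    }

    // Weights stated in the text: 60% element, 30% text, 10% attribute.
    static double overall(double element, double text, double attribute) {
        return 0.6 * element + 0.3 * text + 0.1 * attribute;
    }

    public static void main(String[] args) {
        Map<String, Integer> a = new HashMap<>();
        a.put("div", 3);
        a.put("p", 2);
        a.put("li", 4);
        Map<String, Integer> b = new HashMap<>();
        b.put("div", 1);
        b.put("p", 2);
        b.put("table", 1);
        double element = frequencySimilarity(a, b); // lower frequencies: 1 + 2 + 0 = 3, larger total: 9
        System.out.println("element similarity: " + element);
        System.out.println("overall similarity: " + overall(element, 50.0, 20.0));
    }
}
```

With a different choice of denominator (for example the sum of the higher frequencies) the percentages change, but the structure of the calculation stays the same.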
Since only the design, in other words the schema, structure or skeleton, of a web page is
examined, the values of attribute nodes and text nodes have not been taken into consideration.
Moreover, the object-oriented programming features of Java have been used to implement the
algorithm.
According to the results of tests performed by loading several different HTML files whose
contents are known to the tester, the algorithm can be said to be successful, as the similarity
ratios are coherent and satisfying. The more similar the DOM structures of two HTML files are,
with elements owning the same or similar features, the higher the similarity ratios returned by
the application developed within this thesis project, and vice versa.
1.4. The Sections of the Report
This thesis report consists of the following sections.
Summary: The project is summarized under this section in both Turkish and English.
Introduction: The main problem underlying the implementation of the project is introduced
here.
The Project Description and Plan: The project that was implemented to solve the main problem
is described here. The project plan, which states how long each part of the process (survey,
development, test, etc.) was planned to take, is presented under the same section.
Theoretical Information: Under this section, the theoretical information used for the
implementation of the project, which was collected during the research phase at the beginning,
is presented.
Analysis and Modelling: The main issue is analyzed, and the way to realize a system that solves
the problem is presented.
Description, Implementation and Test: After the modelling of the project in the previous
section, the software implementation of the system is explained here without reproducing most
of the source code. The way the system can be tested is also presented under this section.
Experimental Results: The results of tests that have been performed a satisfying number of
times are presented here to show that the system works free of bugs and that the main
algorithm implemented in the background is reliable.
The Results and Suggestions: Under this section, the solution produced with this project is
interpreted considering factors including the performance of the application and the budget.
Some suggestions for those who would like to examine the same issue are also included here.
References: The sources cited within the report are listed under this section clearly, including
their addresses.
2. THE PROJECT DESCRIPTION and PLAN
2.1. The Project Description
“DOM Similarity Evaluator” is a cross-platform application written in the Java programming
language and developed using Eclipse IDE for Java EE Developers (Luna). The application
developed within the thesis project is simple: it can be launched from a single JAR file, and its
graphical user interface is easy to use.
The development project, which resides in the Eclipse workspace and is the source of the
application, is named “TreeSimilarity”. The library used to parse an HTML file and extract its
DOM tree, Dom4j (dom4j-1.6.1.jar), is also included in the development environment.
The aim of the project is to compare the DOM trees extracted by parsing two HTML files, and
then to evaluate (measure) the level of similarity between their designs with respect to their
element nodes, text nodes and attribute nodes. This comparison therefore also amounts to a
comparison between the designs of two web pages whose sources are two different HTML files.
2.2. The Project Plan
From the approval of the thesis topic to its submission, the project plan consists of the
following phases:
Theoretical research
Analysis and Modelling
Development
Test
Documentation
Theoretical research: This is the first phase. It was planned to take approximately 3 months.
The purpose of this phase is to understand the main problem, to gather theoretical information
on the topic, to decide which methods and technologies are going to be used, and to plan the
project development. Within this phase, the supervisor also provides some academic sources.
Analysis and Modelling: This is the second phase. It was planned to take approximately 1
month. The purpose of this phase is to analyze the main problem and then to prepare the
solution that is going to be followed during the development process. Within this phase, the
question of what can be done to implement the solution is also answered.
Development: This is the third phase. It was planned to take approximately 5 months. After
the collection of theoretical information, and the analysis and modelling of the project, this
phase consists of the implementation.
Test: This is the fourth phase. It was planned to take approximately 1 month. After the
development of the project is completed, the resulting application is tested with various
inputs (several HTML files for this project) to see whether the results are acceptable,
satisfying and reasonable.
Documentation: This is the final phase. It was planned to take approximately 1 month. This
phase actually consists of three kinds of report: the project plan submitted at the beginning
of the project, the interim report submitted around the midpoint of the project, and the final
report, the major one, submitted at the end of the project. Most of the period is spent on the
final report, in which the project is introduced, the theoretical information used for the
implementation is presented, the project plan is given, and the development process is
explained.
Figure 2.1. Gantt Chart
3. THEORETICAL INFORMATION
3.1. HTML (Hyper-Text Markup Language)
Hyper-Text Markup Language, usually abbreviated as HTML, is a markup language used to create
web pages. If a text file that contains HTML statements is saved with the filename extension
“.html” or “.htm”, the file becomes the source of a web page whose design is the visual
reflection of the code inside the file. The output can easily be seen by opening the file with a
web browser, which reads an HTML file and renders it into a web page. This transformation is
also called interpreting the HTML code. The two major differences between .html and .htm are
that some host servers require the starting page to be named “index.html” rather than
“index.htm”, and that the DOS/Windows 3.x platform did not allow a filename extension longer
than 3 characters. [2]
The major component of the syntax of an HTML code is the HTML elements, which are written
as tags enclosed in the angle bracket signs, < and >. Some of these elements consist of opening
and closing pairs, like <p> and </p>, whereas others are unpaired, like <hr/> and <img/>. The
elements in an HTML code form the basic schema of the output page. Each element in the code
corresponds to an embedded block or action on the output page.
The second significant component of the HTML syntax is the HTML attributes. An attribute
indicates a specific property of its owner element. An attribute, or a series of attributes, is
placed inside the opening tag, right after the name of the owner element. For instance, an id
attribute with the value “x” on a div element means that the identity of that div element,
which is unique throughout the whole page, is x. As another example, the src attribute of an
img element, whose value is the path of an image file to be embedded in the page, corresponds
to the existence of that image on the output page.
The third notable component of an HTML file is the texts. Texts usually sit between the opening
and closing tags of an element that corresponds to a block containing text. A web browser
renders the texts in an HTML file according to the elements and attributes that enclose them.
A sample HTML code is given below.
<html>
  <head>
    <title>Sample HTML File</title>
  </head>
  <body>
    <div id="first_block">
      <h1>The Latest Sport News</h1>
      <p id="233"><a href="www.foxsports.com/soccer">Los Angeles Galaxy has won the
        Major League Soccer title!</a></p>
      <button type="button" onclick="alert('Congrats!')">Are you happy?</button>
    </div>
  </body>
</html>
In this code piece, the following components are HTML elements: html, head, title, body, div,
h1, p, a, button.
The id attribute of the div element takes the value “first_block”, which means that the block is
known throughout the whole document by the unique identity first_block. The href attribute of
the a element, whose value is www.foxsports.com/soccer, stands for a link that goes to the URL
in the attribute value. The type attribute of the button element indicates the kind of button,
and the onclick attribute, whose value is a JavaScript call, alert(‘Congrats!’), specifies what
happens when the user clicks on the button. The output page can be seen in Figure 3.1 below.
(The button had been clicked before the screenshot was taken.)
Figure 3.1. The output page from the HTML code given above
3.2. DOM (Document Object Model)
According to the definition by W3C, the Document Object Model, usually abbreviated as DOM, is
a platform- and language-neutral interface through which programs and scripts can dynamically
access the content, structure and style of a structured document. Besides accessing them, a
program or script can also update the accessed components via the DOM. [3] The DOM interface
can provide a structural representation for HTML, XML, and SVG documents, and a component is
often accessed through the DOM using JavaScript. [4]
The nesting order of the components of an HTML, XML or SVG document forms a tree structure.
This structure is called the “DOM Tree”: the element with the outermost opening and closing
tags in the document corresponds to the root node, an element directly inside it corresponds
to its child node or one of its child nodes, and the remaining inner elements are placed in the
same way.
<element1>
  <element2>
    <element3>
    </element3>
    <element4>
    </element4>
  </element2>
</element1>
The sample structure given above as pseudo-code forms the DOM tree shown in Graph 3.1: Since
element1 is the outermost element, it is the root node. Since element2 is nested directly inside
it, the corresponding node is connected to the root node as its child. Since element3 and
element4 are nested one level further, under the same shell of element2, the nodes
corresponding to them become the children of the node corresponding to element2.
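The mapping from nesting to parent and child nodes can be made concrete with a small sketch that prints each element indented by its depth in the tree. It is an illustration only: it relies on the DOM parser bundled with the JDK rather than Dom4j, and NestingSketch and outline are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch: renders the element nodes of a well-formed document
// as an indented outline, two spaces per nesting level.
class NestingSketch {

    static String outline(String wellFormedMarkup) {
        try {
            Node root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(wellFormedMarkup)))
                    .getDocumentElement();
            StringBuilder out = new StringBuilder();
            append(root, 0, out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("markup could not be parsed", e);
        }
    }

    // The outermost element is written at depth 0; each child element one level deeper.
    private static void append(Node element, int depth, StringBuilder out) {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        out.append(element.getNodeName()).append('\n');
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            if (children.item(i).getNodeType() == Node.ELEMENT_NODE) {
                append(children.item(i), depth + 1, out);
            }
        }
    }

    public static void main(String[] args) {
        System.out.print(outline(
            "<element1><element2><element3></element3><element4></element4></element2></element1>"));
        // prints:
        // element1
        //   element2
        //     element3
        //     element4
    }
}
```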
The application developed within this thesis project deals with the HTML DOM. The basic
concept explained here also holds for an HTML document, with some additional components that
correspond to further node types in the DOM tree. As stated in W3Schools, everything in an
HTML DOM corresponds to a node: the document itself corresponds to a document node. HTML
elements correspond to element nodes in the tree structure. The attributes of these elements
correspond to attribute nodes. Texts that sit between elements correspond to text nodes.
Moreover, comment lines in an HTML document correspond to comment nodes in the relevant
DOM tree. [5]
Element1
  Element2
    Element3
    Element4
Graph 3.1. The DOM Tree of the Sample Structure Given Above
In an HTML DOM tree, since elements are the major components of the syntax of an HTML code,
the nodes corresponding to the elements of an HTML file can be described as the major kind of
node. If an element owns at least one attribute, the corresponding attribute nodes are
connected to the owning element node as its children. If there is a text between the tags of an
element, the corresponding text node is connected to the owning element node as its child.
In the light of this information, the sample HTML code given in Section 3.1 has the DOM tree
given below.
Graph 3.2. The HTML DOM Tree of the Sample HTML Code in Section 3.1
Html
  Head
    Title
      Text “Sample HTML File”
  Body
    Div
      Id “first_block”
      h1
        Text “The Latest Sport News”
      br
      p
        id “233”
        a
          Href “www.foxsports.com/soccer”
          Text “Los Angeles Galaxy has won the MLS title.”
      Button
        Type “button”
        Text “Are you happy?”
        Onclick “alert(‘Congrats!’)”
Within this thesis project, since only the designs of web pages are considered, the values of
text and attribute nodes are not taken into consideration in the comparison algorithm, even
though they are included in the representation of the DOM tree.
4. ANALYSIS and MODELLING
4.1. Understanding the Main Problem
The core of the project is to construct an algorithm that compares two DOM trees extracted
from two parsed HTML files and then measures the level of similarity between them. This
problem contains a more general question: how to compare two trees consisting of different
nodes. Thus, beyond parsing an HTML file and extracting its DOM tree, the actual task is to
devise a way of comparing two tree structures consisting of different nodes and evaluating the
level of similarity between them. To build the comparison algorithm and evaluate how similar
the trees are, a feature of a tree structure has to be focused on, such as the difference in
frequency between nodes that share the same characteristics. Hence, the project should be
modelled around a tree structure: classes representing node types, a method to iterate over
the tree, a method to extract the content and the features of each node, and dedicated methods
to implement the comparison and the similarity evaluation.
4.2. Modelling
4.2.1. The Programming Language and the Development Environment
The application developed within the thesis project has been chosen to be a Java application.
It can be launched simply from a single JAR file. In line with this choice of programming
language, Eclipse IDE for Java EE Developers has been chosen as the integrated development
environment (IDE).
Since it is a Java application, it is also cross-platform. This means that it can run under any
operating system that supports Java, such as Windows, any Linux distribution, or any UNIX
operating system.
4.2.2. The Project Hierarchy
As stated in Section 2.1, the name of the development project that resides in the workspace of
Eclipse IDE is TreeSimilarity. Under the src folder, there is a single package named treesim,
which acts as an umbrella for the whole project. This package contains all Java files, each of
which has the filename extension .java and contains the necessary classes. The main file, from
which the project is launched after compilation and which holds the main method, is
TreeSim.java. The rest consists of classes representing node types: AttributeNode.java,
ElementNode.java, and TextNode.java.
Figure 4.1. The Project Hierarchy in Eclipse IDE
4.2.3. The Programming Concept
The Java programming language is based on object-oriented programming; even the main method
of the source code is encapsulated in a class. Owing to its beneficial features, object-oriented
programming has been used extensively for the implementation of the solution to the analyzed
problem. Moreover, Java's container data types such as ArrayList have been used heavily, as
they offer convenient solutions for data storage.
The vast majority of the project is formed by the TreeSim class. The other classes, which
represent the node types, are used by this class.
4.2.4. The Modelling of Classes
The ElementNode class represents element nodes. It holds the name of an element and its
frequency, that is, how many instances of it exist in the tree.
The AttributeNode class represents attribute nodes. It holds the name of an attribute and its
frequency.
The TextNode class represents text nodes. Since only the design is compared, the value of a
text node (the text output itself) is not taken into consideration. The class therefore holds
data about the parent of a text node: the name of the parent element node to which the text
node is connected, and the frequency of that parent element node.
The TreeSim class forms the core of the project. Along with the main method, it contains
almost all of the other methods in the project and the code that implements the graphical user
interface of the application. It uses the three classes modelled above.
4.2.5. The Comparison and Similarity Evaluation Algorithm
This is the most significant part of the development, as it forms the core of the project and
models the solution to the main problem.
Before the comparison can begin, a method is needed that iterates over the DOM tree extracted
from a parsed HTML file and visits all of its nodes. For each visited node, the relevant data is
placed in an object of the class representing that node type, and these objects are kept in
private storage fields. After all necessary data has been collected, the comparison proceeds
by node type as follows.
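The traversal step can be illustrated with a minimal sketch. It is not the project's own code: it uses the DOM parser bundled with the JDK instead of dom4j so that it runs without extra libraries, it assumes well-formed markup, and the class name DomCollector and its three lists are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical collector; a stand-in for the storage fields described above. */
final class DomCollector {
    final List<String> elements = new ArrayList<>();    // element names in visit order
    final List<String> attributes = new ArrayList<>();  // attribute names
    final List<String> textParents = new ArrayList<>(); // names of elements that own text

    /** Pre-order walk: record the node itself, then descend into its children. */
    void visit(Node node) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            elements.add(node.getNodeName());
            NamedNodeMap attrs = node.getAttributes();
            for (int i = 0; i < attrs.getLength(); i++) {
                attributes.add(attrs.item(i).getNodeName());
            }
        } else if (node.getNodeType() == Node.TEXT_NODE
                && !node.getTextContent().trim().isEmpty()) {
            // Only the owner of the text is recorded, not the text itself
            textParents.add(node.getParentNode().getNodeName());
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            visit(children.item(i));
        }
    }

    /** Parses well-formed markup and collects the data of every node. */
    static DomCollector collect(String markup) {
        try {
            Node root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)))
                    .getDocumentElement();
            DomCollector collector = new DomCollector();
            collector.visit(root);
            return collector;
        } catch (Exception e) {
            throw new IllegalArgumentException("Markup could not be parsed", e);
        }
    }

    public static void main(String[] args) {
        DomCollector c = collect("<html><body><div class=\"box\"><p>Hello</p>"
                + "<a href=\"#\">Link</a></div></body></html>");
        System.out.println(c.elements);    // prints [html, body, div, p, a]
        System.out.println(c.attributes);  // prints [class, href]
        System.out.println(c.textParents); // prints [p, a]
    }
}
```

Blank text nodes are skipped in this sketch; whether the project does the same is not stated in the report.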
Element nodes: First, all distinct elements in the iterated DOM tree are collected with their
frequency information. For example, let the iteration over a DOM tree give the following list
of element nodes.
html head title body div h1 p div h2 p a button div h1 p table ul li li li li
The distinct elements with their frequencies become as follows.
html: 1
head: 1
title: 1
body: 1
div: 3
h1: 2
h2: 1
p: 3
a: 1
button: 1
table: 1
ul: 1
li: 4
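Collecting the distinct elements with their frequencies can be sketched as follows. This is an illustration only: the project stores the data in ElementNode objects, whereas the sketch uses a map for brevity, and the class name FrequencyCounter is hypothetical. The input is the element list of the example.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper; a map stands in for a list of ElementNode objects. */
final class FrequencyCounter {
    /** Maps each distinct name to its number of occurrences, in first-seen order. */
    static Map<String, Integer> count(List<String> names) {
        Map<String, Integer> frequencies = new LinkedHashMap<>();
        for (String name : names) {
            frequencies.merge(name, 1, Integer::sum); // insert 1 or add 1 to the old count
        }
        return frequencies;
    }

    public static void main(String[] args) {
        List<String> visited = Arrays.asList(
                ("html head title body div h1 p div h2 p a button "
                + "div h1 p table ul li li li li").split(" "));
        // Prints one "name: frequency" line per distinct element
        count(visited).forEach((name, n) -> System.out.println(name + ": " + n));
    }
}
```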
The same procedure is followed for the other DOM tree. Once the frequency data for all distinct
elements of both trees has been acquired, the common elements, that is, the elements that exist
in both trees, are considered: for each of them, the lower of its two frequencies is added to a
separate frequency list. The sum of these values is then divided by the total number of element
nodes in whichever tree has more of them. When the result is multiplied by 100, the similarity
ratio with respect to element nodes is obtained.
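The calculation can be sketched as follows, under the reading that the divisor is the larger of the two total element counts. The class and method names are hypothetical, and returning 0 for two empty trees is an assumption of the sketch, since that case is not covered in the text.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the element similarity ratio. */
final class ElementSimilarity {
    static Map<String, Integer> count(List<String> names) {
        Map<String, Integer> frequencies = new HashMap<>();
        for (String name : names) {
            frequencies.merge(name, 1, Integer::sum);
        }
        return frequencies;
    }

    /** Sum of the lower frequencies of common names, over the larger node count, in percent. */
    static double ratio(List<String> first, List<String> second) {
        Map<String, Integer> f1 = count(first);
        Map<String, Integer> f2 = count(second);
        int shared = 0;
        for (Map.Entry<String, Integer> entry : f1.entrySet()) {
            Integer other = f2.get(entry.getKey());
            if (other != null) { // the name exists in both trees
                shared += Math.min(entry.getValue(), other);
            }
        }
        int larger = Math.max(first.size(), second.size());
        return larger == 0 ? 0.0 : 100.0 * shared / larger; // assumption: empty trees give 0
    }

    public static void main(String[] args) {
        List<String> first = Arrays.asList("html", "body", "div", "div", "p");
        List<String> second = Arrays.asList("html", "body", "div", "p", "span");
        // Common: html (1), body (1), div (1), p (1); larger tree has 5 nodes
        System.out.println(ratio(first, second)); // prints 80.0
    }
}
```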
Attribute nodes: The procedure followed for element nodes is also applied to the attribute
nodes themselves, which yields one ratio value. In addition, the same procedure is applied to
the element nodes that have at least one attribute, in other words, the parent element nodes
of the attribute nodes, which yields a second ratio value. The average of these two values
gives the final similarity ratio with respect to attribute nodes.
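A minimal sketch of this two-part calculation is given below. The names are hypothetical, and the sketch assumes that an element with several attributes is counted once as a parent; the text does not state this explicitly.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the attribute similarity ratio. */
final class AttributeSimilarity {
    /** Same procedure as for element nodes: shared frequency over the larger count. */
    static double ratio(List<String> first, List<String> second) {
        Map<String, Integer> f1 = new HashMap<>();
        Map<String, Integer> f2 = new HashMap<>();
        for (String name : first) f1.merge(name, 1, Integer::sum);
        for (String name : second) f2.merge(name, 1, Integer::sum);
        int shared = 0;
        for (Map.Entry<String, Integer> entry : f1.entrySet()) {
            shared += Math.min(entry.getValue(), f2.getOrDefault(entry.getKey(), 0));
        }
        int larger = Math.max(first.size(), second.size());
        return larger == 0 ? 0.0 : 100.0 * shared / larger;
    }

    /** Average of the ratio over attribute names and the ratio over their parent elements. */
    static double attributeRatio(List<String> attrs1, List<String> parents1,
                                 List<String> attrs2, List<String> parents2) {
        return (ratio(attrs1, attrs2) + ratio(parents1, parents2)) / 2.0;
    }

    public static void main(String[] args) {
        // First tree: div(class, id), a(href), img(src)
        List<String> attrs1 = Arrays.asList("class", "id", "href", "src");
        List<String> parents1 = Arrays.asList("div", "a", "img");
        // Second tree: div(class), a(href), img(alt)
        List<String> attrs2 = Arrays.asList("class", "href", "alt");
        List<String> parents2 = Arrays.asList("div", "a", "img");
        // Attribute ratio 50.0, parent ratio 100.0
        System.out.println(attributeRatio(attrs1, parents1, attrs2, parents2)); // prints 75.0
    }
}
```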
Text nodes: The procedure followed for element nodes is applied to the element nodes that have
text nodes as children, in other words, the parent element nodes of the text nodes. This yields
the similarity ratio with respect to text nodes.
Overall: Considering how much a difference in each node type matters when two DOM trees are
compared, the order of influence can be taken as follows: element nodes > text nodes >
attribute nodes. In line with this order of importance, weights of 60% for element nodes, 30%
for text nodes, and 10% for attribute nodes have been assigned as a reasonable distribution.
The overall similarity ratio between two DOM trees is the sum of the three similarity ratios
weighted in this way.
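The weighting can be written out as a short sketch. The class and method names are hypothetical, and the input ratios in the example are made up for illustration; they are not results of the project.

```java
/** Hypothetical sketch of the weighted overall ratio. */
final class OverallSimilarity {
    static final int ELEMENT_WEIGHT = 60;   // percent
    static final int TEXT_WEIGHT = 30;      // percent
    static final int ATTRIBUTE_WEIGHT = 10; // percent

    /** Combines the three per-type ratios (each in percent) into one overall ratio. */
    static double overall(double elementRatio, double textRatio, double attributeRatio) {
        return (ELEMENT_WEIGHT * elementRatio
                + TEXT_WEIGHT * textRatio
                + ATTRIBUTE_WEIGHT * attributeRatio) / 100.0;
    }

    public static void main(String[] args) {
        // Illustrative inputs only: 48 + 21 + 7.5
        System.out.println(overall(80.0, 70.0, 75.0)); // prints 76.5
    }
}
```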
5. DESCRIPTION, IMPLEMENTATION, AND TEST
5.1. Classes and Methods
5.1.1. ElementNode.java
Figure 5.1. ElementNode Class
This class represents the structure of an element node. It holds the name of the element and
its frequency, that is, how many instances of it exist in the tree. Since these are private
fields, getter and setter methods are also provided for them.
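A class of this shape can be sketched as follows. This is not a copy of the class shown in Figure 5.1: the name ElementNodeSketch, the constructor, and the accessor names are assumptions made for the illustration.

```java
/** Hypothetical sketch of a node class with a name and a frequency. */
final class ElementNodeSketch {
    private String name;   // tag name of the element, e.g. "div"
    private int frequency; // how many instances of the element exist in the tree

    ElementNodeSketch(String name, int frequency) {
        this.name = name;
        this.frequency = frequency;
    }

    String getName() { return name; }
    void setName(String name) { this.name = name; }
    int getFrequency() { return frequency; }
    void setFrequency(int frequency) { this.frequency = frequency; }

    public static void main(String[] args) {
        ElementNodeSketch div = new ElementNodeSketch("div", 1);
        div.setFrequency(div.getFrequency() + 1); // a second div was visited
        System.out.println(div.getName() + ": " + div.getFrequency()); // prints div: 2
    }
}
```

The attribute and text node classes described in the following subsections would have the same shape under these assumptions.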
5.1.2. AttributeNode.java
Figure 5.2. AttributeNode Class
This class represents the structure of an attribute node. It holds the name of the attribute
and its frequency. Getter and setter methods are also provided for these private fields.
5.1.3. TextNode.java
Figure 5.3. TextNode Class
This class represents the structure of a text node. It holds the name of the parent element
node that owns the text node, and the frequency of that parent. Getter and setter methods are
also provided for these private fields.
5.1.4. TreeSim.java
This class can be described as the main class of the project since it encapsulates the main
method. Along with the main method, almost all of the methods have been implemented here.
Figure 5.4. TreeSim Class
The body of the class begins with a number of private fields such as containers, GUI
components, and arrays. The methods of the class are of more interest and are described below.
main(String[]): This is the main method of the class. The application is launched from this
method after compilation.
TreeSim(): This is the no-argument constructor of the class. It contains the statements that
load the GUI components and register the action events linked to them.
actionPerformed(ActionEvent): This GUI method defines what happens when a button is clicked.
The action events passed to it as a parameter are linked to the GUI components (buttons, in
this case) inside the body of TreeSim(). When the action event of the “Calculate the
Similarity” button is triggered, the comparison and similarity evaluation algorithm is run.
commonAvailabilityChecker methods: These methods check whether a given node exists in both
DOM trees. The kind of node each method checks is indicated by its name. A method returns true
if the node is found in both DOM trees, and false otherwise.
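The idea behind such a check can be sketched as follows. The signature is hypothetical, since the report does not list the parameters of these methods, and plain name lists stand in for the project's node containers.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of a check for presence in both trees. */
final class CommonAvailability {
    /** Returns true only if the name occurs among the nodes of both trees. */
    static boolean existsInBoth(String name, List<String> firstTree, List<String> secondTree) {
        return firstTree.contains(name) && secondTree.contains(name);
    }

    public static void main(String[] args) {
        List<String> first = Arrays.asList("html", "body", "div", "p");
        List<String> second = Arrays.asList("html", "body", "table");
        System.out.println(existsInBoth("body", first, second)); // prints true
        System.out.println(existsInBoth("div", first, second));  // prints false
    }
}
```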
extractUnique methods: These methods extract the distinct elements of a DOM tree that has
already been iterated over.
preParentOrderFirstDOM(Node): This method iterates over the first DOM tree, which is obtained
by parsing the first HTML file, and reads the information stored in each node and the
connections between these nodes. The gathered data is stored in the node arrays for the first
DOM tree.
preParentOrderSecondDOM(Node): This method is the equivalent of the previous one for the
second DOM tree.
5.2. Test
When the application is launched, the window of its simple GUI appears on screen as can be
seen in the figure below.
Figure 5.5. The Application
“Load the First HTML File” button: This button opens a file explorer dialog for selecting the
HTML file to be loaded into the system.
“Load the Second HTML File” button: Equivalent to the first one, for the second HTML file.
“Calculate the Similarity” button: Once this button is pressed, the similarity ratios shown
in the lower line are calculated according to the algorithm developed within the context of
the thesis project.
“Quit the Application” button: The user can click this button to exit the application. A
confirmation dialog box with “Yes” and “No” choices then appears on screen.
Figure 5.6. A Sample Run
6. EXPERIMENTAL RESULTS
For this thesis report, several HTML files whose content (syntax and rendered output) is
already known have been tested by loading them into the system. Two of them are very similar
to each other, whereas another pair consists of files that are very dissimilar to each other.
The files containing longer HTML statements were taken from a template portal. The other two
files, which contain shorter pieces of code, were written for this test.
Evaluating the similarity between the DOM trees of two similar HTML files:
Figure 6.1. The First HTML File
Figure 6.2. The Second HTML File
Figure 6.3. The Output of the First HTML File
Figure 6.4. The Output of the Second HTML File
Figure 6.5. The Test Result
As can be seen from Figure 6.5, the result is reasonable for the loaded HTML files. The
overall similarity ratio is high, but not close to 100%, because of design dissimilarities
(in structure, text ownership, and attributes).
Evaluating the similarity between the DOM trees of two dissimilar HTML files:
Figure 6.6. The First HTML File – 2
Figure 6.7. The Second HTML File – 2
Figure 6.8. The Output of the First HTML File – 2
Figure 6.9. The Output of the Second HTML File – 2
Figure 6.10. The Test Result – 2
As can be seen from Figure 6.10, the overall similarity ratio is very low. This result is
reasonable, because the HTML files are clearly very dissimilar to each other.
In light of these test results, the algorithm can be considered successful on the tested
files, and the application ran the algorithm without any observed errors.
7. THE RESULT and SUGGESTIONS
First of all, as the test results in the previous section show, the algorithm can be
considered successful on the tested files, and the application ran the algorithm without any
observed errors.
The graphical user interface (GUI) of the application is very simple and easy to use, but it
should be enhanced to be more polished and more useful.
A considerable drawback of the application and the algorithm is that the system currently
expects error-free HTML source code; it has no error detection and correction mechanism. In
addition, the dom4j library can fail on elements that consist of an unpaired (single) tag when
the “/” mark is missing right before the closing bracket, >.
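This limitation is presumably a consequence of XML well-formedness rules, which require every element to be closed. The sketch below illustrates the behaviour with the XML parser bundled with the JDK rather than with dom4j itself, so it shows the general rule and not dom4j's exact error handling; the class name WellFormedCheck is hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical check; uses the JDK parser as a stand-in for dom4j. */
final class WellFormedCheck {
    /** Returns true if an XML parser accepts the markup, false if it is not well-formed. */
    static boolean parses(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler()); // rethrows fatal errors, prints nothing
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            return false; // e.g. an element that is never closed
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parses("<p>line<br/></p>")); // prints true
        System.out.println(parses("<p>line<br></p>"));  // prints false
    }
}
```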
The project is open source, so it can readily be improved.