A Methodological Approach for Web Sites Reengineering · 22/05/2003 FNRS contact day on "Software...

A Methodological Approach for Web Sites Reengineering

Estiévenart Fabrice, François Aurore, Henrard Jean, Hainaut Jean-Luc



22/05/2003 FNRS contact day on "Software (re-)engineering"

Context

• Still many static web sites…
  – Advantage
    • easy to create
    • → for small web sites
  – Drawbacks
    • data and layout are mixed up
    • maintenance problems
    • → out-of-date or redundant information
    • → non-homogeneous design
  – Solution
    • DBMS + scripts (PHP, Perl, …)


Goals

• To provide methods and tools for web site reengineering:

[Figure: web site (web pages) → Web Site Reverse Engineering → data + conceptual schema → Web Site Engineering → web site (web pages) + DB]


Method: overview

[Figure: web site (web pages) → Classification → HTML web pages grouped by page type → Cleaning → XHTML web pages per page type → Semantic Enrichment → META file per page type → Extraction → XML per page type + XML schema → Conceptualisation → conceptual schema]


Method: step 1

• Page classification
  – « Page type » = a set of pages relating to the same concept, which display very similar information in a very similar layout
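The slides do not say how pages are classified. As a minimal sketch of one possible heuristic (an assumption, not the authors' method), pages can be grouped by the set of tag paths they contain, so that pages sharing the same structural skeleton become candidates for the same page type. The class `PathCollector`, the functions `signature` and `classify`, and the sample pages are all hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser


class PathCollector(HTMLParser):
    """Collect the set of tag paths (e.g. 'html/body/table') found in a page."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        self.paths.add("/".join(self.stack))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; unclosed inner tags are dropped.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def signature(html):
    """Return the page's structural signature: its set of tag paths."""
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return frozenset(collector.paths)


def classify(pages):
    """Group page names by signature; each group is a candidate page type."""
    groups = defaultdict(list)
    for name, html in pages.items():
        groups[signature(html)].append(name)
    return sorted(sorted(names) for names in groups.values())


# Invented sample pages, for illustration only.
pages = {
    "dept_a.html": "<html><body><table><tr><td>Dept A</td></tr>"
                   "<tr><td>Phone</td></tr></table></body></html>",
    "dept_b.html": "<html><body><table><tr><td>Dept B</td></tr></table></body></html>",
    "news.html": "<html><body><h1>News</h1><p>Item</p><p>Item</p></body></html>",
}
candidate_types = classify(pages)
print(candidate_types)
```

Exact signature equality is deliberately strict; on real sites a similarity threshold and a manual review of the groups would probably be needed, since pages of one type may differ in layout and structure.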


Method: steps 2 and 3.1

• HTML cleaning
• Semantic enrichment
  – For each page type
    • Concept identification and description on a sample page
      - « Concept » = a part of the HTML tree describing the layout, the structure and possibly the value of a certain reality
      - Ex: the concept « Phone Number »

          <tr>
            <td align="middle" bgcolor="#FFFF66"><b>Phone :</b></td>
            <td bgcolor="#FFFF99">+32 71 72.23.49</td>
          </tr>

      - Ex: the concept « Address », composed of « Street » and « City »

          <table width="100%">
            <tr><td><b>Address :</b></td></tr>
            <tr><td>Quality Street, 25</td></tr>
            <tr><td>London</td></tr>
          </table>


Method: the META file

<HTMLDescription xmlns:meta="http://www.cetic.be/FR/CRAQ-DB.htm">
  <meta:element name="Department">
    <html>
      <head>...</head>
      <body>
        <table>
          <meta:element ref="DeptName"/>
          <meta:element ref="PhoneNumber"/>
          <meta:element ref="Address"/>
        </table>
      </body>
    </html>
  </meta:element>
  <meta:element name="PhoneNumber">
    <tr>
      <td align="middle" bgcolor="#FFFF66"><b>Phone :</b></td>
      <td bgcolor="#FFFF99"><meta:value/></td>
    </tr>
  </meta:element>
  …
</HTMLDescription>
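To make the structure of such a file concrete, here is a minimal sketch that reads a reduced META file with Python's standard `xml.etree.ElementTree` and lists, for each named concept, the concepts it references and whether it carries a value. The reduced file keeps only the two concepts shown on the slide; the function `describe_concepts` is hypothetical and is not part of the authors' tools.

```python
import xml.etree.ElementTree as ET

META_NS = "http://www.cetic.be/FR/CRAQ-DB.htm"  # namespace URI as given on the slide

# Reduced META file: only the two concepts that the slide shows in full.
META_FILE = """
<HTMLDescription xmlns:meta="http://www.cetic.be/FR/CRAQ-DB.htm">
  <meta:element name="Department">
    <html><body><table>
      <meta:element ref="DeptName"/>
      <meta:element ref="PhoneNumber"/>
      <meta:element ref="Address"/>
    </table></body></html>
  </meta:element>
  <meta:element name="PhoneNumber">
    <tr>
      <td align="middle" bgcolor="#FFFF66"><b>Phone :</b></td>
      <td bgcolor="#FFFF99"><meta:value/></td>
    </tr>
  </meta:element>
</HTMLDescription>
"""


def describe_concepts(meta_xml):
    """Map each named concept to its referenced concepts and a has-value flag."""
    root = ET.fromstring(meta_xml)
    element_tag = f"{{{META_NS}}}element"
    value_tag = f"{{{META_NS}}}value"
    concepts = {}
    for concept in root.iter(element_tag):
        name = concept.get("name")
        if name is None:
            continue  # <meta:element ref="..."/> is a reference, not a definition
        refs = [child.get("ref") for child in concept.iter(element_tag)
                if child is not concept and child.get("ref")]
        has_value = next(concept.iter(value_tag), None) is not None
        concepts[name] = {"refs": refs, "has_value": has_value}
    return concepts


concepts = describe_concepts(META_FILE)
for name, info in concepts.items():
    print(name, info)
```

In this reading, a concept with references is a composite (here « Department ») and a concept with a `<meta:value/>` placeholder is a leaf that holds data (here « PhoneNumber »).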


Method: step 3.2

• Application to other pages of the same type
  - there may be layout and structure differences between pages of the same type → the same concept may have several descriptions
  - Example: a layout difference

      <tr> <td><b>Name :</b></td></tr>

      <tr> <td><i>Name :</i></td></tr>

  - Example: a structure difference

      <table width="100%">
        <tr><td>Address :</td></tr>
        <tr><td>Quality Street, 25</td></tr>
        <tr><td>London</td></tr>
      </table>

      <table width="100%">
        <tr><td>Address :</td></tr>
        <tr><td>New York</td></tr>
        <tr><td>Main Street, 110</td></tr>
      </table>
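One way to picture "several descriptions for one concept" is a list of alternative templates that are tried in turn against each page fragment. The sketch below is only an illustration under that assumption: `NAME_DESCRIPTIONS`, `shape` and `matching_description` are hypothetical names, and the comparison ignores attributes and whitespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical alternative descriptions of the label of a concept "Name":
# the sample page uses <b>, another page of the same type uses <i>.
NAME_DESCRIPTIONS = [
    "<tr><td><b>Name :</b></td></tr>",
    "<tr><td><i>Name :</i></td></tr>",
]


def shape(element):
    """Reduce an element to (tag, stripped text, child shapes); attributes are ignored."""
    return (element.tag, (element.text or "").strip(),
            tuple(shape(child) for child in element))


def matching_description(fragment, descriptions):
    """Return the index of the first description with the fragment's shape, else None."""
    target = shape(ET.fromstring(fragment))
    for index, description in enumerate(descriptions):
        if shape(ET.fromstring(description)) == target:
            return index
    return None


first = matching_description("<tr> <td><b>Name :</b></td></tr>", NAME_DESCRIPTIONS)
second = matching_description("<tr> <td><i>Name :</i></td></tr>", NAME_DESCRIPTIONS)
unknown = matching_description("<tr><td><u>Name :</u></td></tr>", NAME_DESCRIPTIONS)
print(first, second, unknown)
```

A fragment that matches no known description (the `<u>` variant here) signals that a new description has to be added for that concept.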


Method: step 4

• Data and schema extraction
  – Data extraction
    • Web pages + META file → XML document
  – Data structure extraction
    • META file → XML schema
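The data extraction step can be sketched as matching a concept description against a cleaned page and copying the text found at the `<meta:value/>` position into an XML element. The code below is an assumption-laden illustration, not the authors' extractor: `match` and `extract_data` are hypothetical, attributes are ignored, the page is assumed to be well-formed XHTML without a default namespace, and the second table row is invented.

```python
import xml.etree.ElementTree as ET

META_NS = "http://www.cetic.be/FR/CRAQ-DB.htm"
VALUE_TAG = f"{{{META_NS}}}value"

# Description of the concept "PhoneNumber", as on the META file slide.
PHONE_TEMPLATE = (
    f'<tr xmlns:meta="{META_NS}">'
    '<td align="middle" bgcolor="#FFFF66"><b>Phone :</b></td>'
    '<td bgcolor="#FFFF99"><meta:value/></td>'
    '</tr>'
)

# Cleaned page fragment; the "Fax" row is invented to show a non-matching row.
PAGE = """<html><body><table>
<tr><td align="middle" bgcolor="#FFFF66"><b>Phone :</b></td>
    <td bgcolor="#FFFF99">+32 71 72.23.49</td></tr>
<tr><td><b>Fax :</b></td><td>(none)</td></tr>
</table></body></html>"""


def match(template, node, values):
    """Return True if node follows the template; collect text at <meta:value/>."""
    if template.tag != node.tag:
        return False
    t_children = list(template)
    if len(t_children) == 1 and t_children[0].tag == VALUE_TAG:
        values.append((node.text or "").strip())
        return True
    n_children = list(node)
    if len(t_children) != len(n_children):
        return False
    if (template.text or "").strip() != (node.text or "").strip():
        return False  # fixed text such as the label must be identical
    return all(match(t, n, values) for t, n in zip(t_children, n_children))


def extract_data(concept, template_xml, page_xhtml):
    """Return one <concept>value</concept> element per matching page fragment."""
    template = ET.fromstring(template_xml)
    page = ET.fromstring(page_xhtml)
    results = []
    for node in page.iter(template.tag):
        values = []
        if match(template, node, values) and values:
            item = ET.Element(concept)
            item.text = values[0]
            results.append(item)
    return results


extracted = extract_data("PhoneNumber", PHONE_TEMPLATE, PAGE)
serialized = [ET.tostring(element, encoding="unicode") for element in extracted]
print(serialized)
```

Only the row whose label text is « Phone : » matches, so the invented « Fax » row is skipped. Deriving the XML schema from the META file would walk the same descriptions without looking at any page.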


Method: step 5.1

• Schema integration
  – To discover relationships between concepts
  – To detect redundancy

[Figure: XML schema diagram. Element types shown: Department (webFileName[0-1]; seq: DeptName, Description, Staff, ProjectsList), Staff (seq: Administrative[*], Academic[*], PHD[*]), ProjectsList, Project (seq: ProjectName, ProjectDate), Person (PersonName as keyref, PersonFunction), Person_1 (key), Description_1, and the content elements PersonName_1, PersonFunction_1, PersonAddress, Description, DeptName; cardinality labels (1-1, 0-1, 0-N, 1-N) not reproduced]


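Method: step 5.2

• Schema conceptualisation

[Figure: conceptual schema.
  Entity types:
    DEPARTMENT (Name, Phone, Fax, Address, Map, Mail, Description)
    PERSON (Name, Function, Address, Status; id: Name)
    PROJECT (Name, Date)
  Relationship types:
    staff (cardinalities 1-1, 0-N)
    in (cardinalities 1-1, 1-N)]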


Web Site Engineering

• Database engineering + Data migration

[Figure: the XML document is migrated into the DB described by the following schema.
    DEPARTMENT (ID_DEP, Name, Phone, Fax, Address, Map, Mail, Description; id: ID_DEP)
    PERSON (Name, Function, Address, Status, ID_DEP; id: Name; ref: ID_DEP)
    PROJECT (Name, Date, ID_DEP; equ: ID_DEP)]
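As a rough sketch of the migration step, the code below loads a small extracted XML document into an in-memory SQLite database whose tables follow the DEPARTMENT / PERSON / PROJECT schema of the slide. Everything here is an assumption made for illustration: the XML document and its values are invented, the Staff sub-groups (Administrative, Academic, PHD) are flattened into Person elements, only a subset of the columns is created, and the slides do not say which DBMS or migration code was actually used.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Invented extracted document; element names follow the schema slides.
XML_DOCUMENT = """
<Department>
  <DeptName>Computer Science</DeptName>
  <Staff>
    <Person><PersonName>A. Martin</PersonName><PersonFunction>Professor</PersonFunction></Person>
    <Person><PersonName>B. Dupont</PersonName><PersonFunction>Secretary</PersonFunction></Person>
  </Staff>
  <ProjectsList>
    <Project><ProjectName>Demo</ProjectName><ProjectDate>2003</ProjectDate></Project>
  </ProjectsList>
</Department>
"""

# Subset of the schema on the slide (Phone, Fax, Map, ... omitted).
DDL = """
CREATE TABLE DEPARTMENT (ID_DEP INTEGER PRIMARY KEY, Name TEXT NOT NULL);
CREATE TABLE PERSON (Name TEXT PRIMARY KEY, Function TEXT,
                     ID_DEP INTEGER NOT NULL REFERENCES DEPARTMENT(ID_DEP));
CREATE TABLE PROJECT (Name TEXT, Date TEXT,
                      ID_DEP INTEGER NOT NULL REFERENCES DEPARTMENT(ID_DEP));
"""


def migrate(xml_text, connection):
    """Insert one department, its staff and its projects; return the new ID_DEP."""
    dept = ET.fromstring(xml_text)
    cursor = connection.execute(
        "INSERT INTO DEPARTMENT (Name) VALUES (?)", (dept.findtext("DeptName"),))
    dept_id = cursor.lastrowid  # generated technical identifier
    for person in dept.iter("Person"):
        connection.execute(
            "INSERT INTO PERSON (Name, Function, ID_DEP) VALUES (?, ?, ?)",
            (person.findtext("PersonName"), person.findtext("PersonFunction"), dept_id))
    for project in dept.iter("Project"):
        connection.execute(
            "INSERT INTO PROJECT (Name, Date, ID_DEP) VALUES (?, ?, ?)",
            (project.findtext("ProjectName"), project.findtext("ProjectDate"), dept_id))
    connection.commit()
    return dept_id


connection = sqlite3.connect(":memory:")
connection.executescript(DDL)
migrate(XML_DOCUMENT, connection)
staff = connection.execute("SELECT Name FROM PERSON ORDER BY Name").fetchall()
project_count = connection.execute("SELECT COUNT(*) FROM PROJECT").fetchone()[0]
print(staff, project_count)
```

The `equ: ID_DEP` constraint on PROJECT (every department has at least one project) is not enforced by this DDL; it would need a trigger or an application-level check.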


Tools

• HTML cleaning
  – Tidy
• Semantic enrichment
  – XML editor or the semantic browser (based on Mozilla) to edit/generate the META file
• Data and schema extraction
  – XML parsers (Java DOM)
  – pageType.extractSchema(METAfile) → XMLSchema
  – pageType.extractData(HTML*, METAfile) → XML
• Schema integration/conceptualisation and database engineering
  – CASE tool DB-Main


Conclusion and future work

• A method and tools to extract data and their structure from a web site
• Difficulty: enormous diversity of web page structures and layouts used to represent the same reality
• Future work
  – test on real-size web sites
  – refine the semantic enrichment step
    • improve GUI
    • automation/assistance based on heuristics