Extracting Structured Data from Web Pages

Extracting Structured Data from Web Pages By Arsun ARTEL, Özgün ÖZIŞIKYILMAZ 05.11.2003 Instructor: Prof. Taflan Gündem

Transcript of Extracting Structured Data from Web Pages

Page 1: Extracting Structured Data from Web Pages

Extracting Structured Data from Web Pages

By Arsun ARTEL, Özgün ÖZIŞIKYILMAZ 05.11.2003

Instructor: Prof. Taflan Gündem

Page 2: Extracting Structured Data from Web Pages

Presentation Outline

• Motivation
• Example Pages
• Model & Problem Formulation
• Approach in Detail
  – General
  – Underlying Terminology
  – Modules and their operations
• Experimental Results
• Conclusion

Page 4: Extracting Structured Data from Web Pages

Motivation

• Extracting structured data from web pages is useful, since it enables us to pose complex queries over the data.

• This paper focuses on the problem of automatically extracting structured data from a collection of pages.

• There are many web sites that contain a large collection of “structured” pages.

Page 6: Extracting Structured Data from Web Pages

Example Pages

• In the real world there are many examples of structured web pages.
  – Amazon web site, eBay web site, etc.

• Two examples from www.amazon.com
  – My System
  – An Eternal Golden Braid

Page 7: Extracting Structured Data from Web Pages

Example Pages (My System: 21st Century Edition)

Page 8: Extracting Structured Data from Web Pages

Example Pages (An Eternal Golden Braid)

Page 10: Extracting Structured Data from Web Pages

Underlying Problems

• Complex Schema: The “schema” of the information encoded in the web pages could be very complex, with arbitrary levels of nesting. For instance, each book page can contain a set of authors, with each author having a set of addresses, and so on.

• Template vs. Data: Syntactically, there is nothing that distinguishes the text that is part of the template from the text that is part of the data.

Page 11: Extracting Structured Data from Web Pages

How is a page created from a template?

x extracted from the database

Page 12: Extracting Structured Data from Web Pages

Basic Type, Tuples and Sets

• Basic Type: Basic unit of text

• Tuple: Ordered list of types, <T1,T2,…,Tn>

• Set: {T1}

< C Programming Language, {< Brian, Kernighan >, < Dennis, Ritchie >}, $30.00 >

Page 13: Extracting Structured Data from Web Pages

Schema and Instance

< C Programming Language, {< Brian, Kernighan >, < Dennis, Ritchie >}, $30.00 >
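The relationship between a schema and an instance can be sketched in Python. This is only an illustration: the class names `Basic`, `TupleType`, `SetType` and the function `conforms` are hypothetical, and the schema written for the book example (a title, a set of first-name/last-name tuples, a price) is an assumption read off the instance above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Basic:
    """Basic type: a unit of text."""


@dataclass(frozen=True)
class TupleType:
    """Ordered list of types <T1, ..., Tn>."""
    fields: tuple


@dataclass(frozen=True)
class SetType:
    """Set type {T}: any number of values of one element type."""
    element: object


def conforms(value, schema):
    """Return True if `value` is an instance of `schema`."""
    if isinstance(schema, Basic):
        return isinstance(value, str)
    if isinstance(schema, TupleType):
        return (isinstance(value, tuple)
                and len(value) == len(schema.fields)
                and all(conforms(v, s) for v, s in zip(value, schema.fields)))
    if isinstance(schema, SetType):
        return (isinstance(value, (set, frozenset))
                and all(conforms(v, schema.element) for v in value))
    return False


B = Basic()
# Assumed schema for the book example: <B, {<B, B>}, B>
book_schema = TupleType((B, SetType(TupleType((B, B))), B))
book = (
    "C Programming Language",
    frozenset({("Brian", "Kernighan"), ("Dennis", "Ritchie")}),
    "$30.00",
)
print(conforms(book, book_schema))
```

A value with the wrong shape, such as a tuple with a single field, is rejected by the same check.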

Page 14: Extracting Structured Data from Web Pages

Template Definition

• Own example:

• Schema: S = <β, {β}, β>, where β denotes the basic type

• Template: TS = <A * B {*}E C * D>

• A = ‘Title:’, B = ‘Presented by:’, C = ‘Cost:’, D = ‘ ’, E = ‘and’

• Instance of TS: Title: Extracting Structured Data Presented by: Arsun and Özgün Cost: 1hr
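How the template TS produces the instance above can be sketched as a small Python function. The function name `encode` is hypothetical, and two details are assumptions: tokens are joined with single spaces, and D is treated as whitespace that adds no visible text.

```python
# Template strings from the slide; D is assumed to be blank.
A, B, C, D, E = "Title:", "Presented by:", "Cost:", " ", "and"


def encode(title, presenters, cost):
    """Fill the template <A * B {*}E C * D> with values.

    `presenters` is the set-valued field; its elements are separated by E.
    """
    parts = [A, title, B, f" {E} ".join(presenters), C, cost, D]
    # Drop blank parts so that D does not leave trailing whitespace.
    return " ".join(p for p in parts if p.strip())


page = encode("Extracting Structured Data", ["Arsun", "Özgün"], "1hr")
print(page)
```

With these inputs the function rebuilds the instance shown on the slide; a different number of presenters only changes how many times E appears.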

Page 15: Extracting Structured Data from Web Pages

Template Encoding (T1,x1)

Page 17: Extracting Structured Data from Web Pages

General Description of EXALG

Page 18: Extracting Structured Data from Web Pages

Multiple Pages

Set of Reviewers

Page 19: Extracting Structured Data from Web Pages

Correct Solution for those pages

Page 20: Extracting Structured Data from Web Pages

Some Terminology (1)

• The occurrence-vector of a token t is defined as the vector <f1,f2,…,fn>, where fi is the number of occurrences of t in the i-th page.

• An equivalence class is a maximal set of tokens having the same occurrence-vector.

• A token is said to have a unique role if all occurrences of the token in the pages are generated by a single template-token.
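The first two definitions can be illustrated with a short Python sketch. It assumes plain whitespace tokenization and three made-up pages; the function names are hypothetical, and the role differentiation that EXALG performs on real HTML tokens is not modeled.

```python
from collections import Counter, defaultdict


def occurrence_vectors(pages):
    """Map each token to <f1, ..., fn>, its count in each page."""
    counts = [Counter(page.split()) for page in pages]
    tokens = set().union(*counts)
    return {t: tuple(c[t] for c in counts) for t in tokens}


def equivalence_classes(vectors):
    """Group tokens that share the same occurrence-vector."""
    classes = defaultdict(set)
    for token, vec in vectors.items():
        classes[vec].add(token)
    return dict(classes)


# Made-up pages for illustration only.
pages = [
    "Title: Databases Cost: 10",
    "Title: Data Mining Cost: 20",
    "Title: Compilers Cost: 30",
]
vectors = occurrence_vectors(pages)
classes = equivalence_classes(vectors)
print(classes[(1, 1, 1)])  # the tokens that occur exactly once in every page
```

Here the template-like tokens `Title:` and `Cost:` end up in the class with vector <1,1,1>, while `Data`, `Mining` and `20` share the vector <0,1,0> only by coincidence.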

Page 21: Extracting Structured Data from Web Pages

Some Terminology (2)

<1,1,1,1>

<1,2,1,0>

No unique role

Page 22: Extracting Structured Data from Web Pages

Some Terminology (3)

• For real pages, an equivalence class of large size and support is usually valid, where support of a token is defined as the number of pages in which the token occurs.

• Example of an invalid equivalence class:
  – {Data, Mining, Jeff, 2, Jane, 6} has occurrence vector <0, 1, 0, 0>

Page 23: Extracting Structured Data from Web Pages

Some Terminology (4)

• The equivalence classes with large size and support are called LFEQs (for Large and Frequent EQuivalence class). LFEQs are rarely formed by “chance”.

• Threshold for size and support is set by the user (SizeThres, SupThres).
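Selecting LFEQs with the two thresholds might look like the Python sketch below. The function name `find_lfeqs` is hypothetical, the comparison is assumed to be "greater than or equal", and the tokens in the first class are invented; the second class is the invalid example from the previous slide.

```python
def find_lfeqs(classes, size_thres, sup_thres):
    """Keep equivalence classes that are both large and frequent.

    `classes` maps an occurrence-vector to its set of tokens.
    Size is the number of tokens; support is the number of pages
    in which the tokens occur (non-zero vector entries).
    """
    lfeqs = {}
    for vec, tokens in classes.items():
        support = sum(1 for f in vec if f > 0)
        if len(tokens) >= size_thres and support >= sup_thres:
            lfeqs[vec] = tokens
    return lfeqs


classes = {
    # Invented template-like tokens occurring once in each of four pages.
    (1, 1, 1, 1): {"<html>", "<body>", "Name:", "</body>", "</html>"},
    # Invalid class from the slide: large, but it occurs in one page only.
    (0, 1, 0, 0): {"Data", "Mining", "Jeff", "2", "Jane", "6"},
}
lfeqs = find_lfeqs(classes, size_thres=3, sup_thres=3)
print(sorted(lfeqs))
```

The second class is larger than the first but has support 1, so a support threshold of 3 filters it out.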

Page 24: Extracting Structured Data from Web Pages

Some Terminology (5)

• Valid equivalence class properties: Ordering and Nesting

• Back to own example:

• Template: TS = <A * B {*}E C * D>

• A = ‘Title:’, B = ‘Presented by:’, C = ‘Cost:’, D = ‘ ’, E = ‘and’

• Ordered: A > B > C > D

• Nesting: B > E > C

Page 25: Extracting Structured Data from Web Pages

Important Observations

• In practice, two page-tokens with different occurrence-paths have different roles. The occurrence-path of a page-token is its path from the root in the parse tree produced by an HTML parser.

• Two page-tokens having same occurrence paths, but with different neighbours also have different roles
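The first observation can be made concrete with Python's standard `html.parser` module. This is a simplified sketch, not the parser used by EXALG: the class `PathCollector` and the function `occurrence_paths` are hypothetical names, only text tokens are recorded, and a few void elements are skipped so that they do not stay on the stack.

```python
from html.parser import HTMLParser

# Elements without a closing tag; skipped so they never stay on the stack.
VOID = {"br", "hr", "img", "input", "meta", "link"}


class PathCollector(HTMLParser):
    """Record each text token with the tag path leading to it."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop up to and including the matching open tag, if there is one.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        path = "/".join(self.stack)
        for word in data.split():
            self.tokens.append((word, path))


def occurrence_paths(html):
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return collector.tokens


paths = occurrence_paths("<html><body><b>Name</b><p>Name</p></body></html>")
print(paths)
```

The two occurrences of `Name` get the paths `html/body/b` and `html/body/p`, so under the first observation they would be treated as having different roles.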

Page 26: Extracting Structured Data from Web Pages

Explanation of observations

Page 27: Extracting Structured Data from Web Pages

Modules and their operations

[Figure: EXALG module diagram. The input pages flow into the Equivalence Class Generation Module (ECGM) and then into the Analysis Module, which outputs the template, schema and values.]

• FindEq: Find Equivalence Classes
• HandInv: Handle Invalid Equivalence Classes
• DiffEq: Differentiate Roles Using Eq Class
• DiffForm: Differentiate Roles Using Format
• ConstTemp: Construct Template
• ExVal: Extract Value

Page 28: Extracting Structured Data from Web Pages

Constructing Template (1)

• The extraction algorithm determines the positions between consecutive tokens of an equivalence class that are non-empty.

• A position between two consecutive tokens is empty if the two tokens always occur contiguously, and non-empty, otherwise.
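The empty/non-empty test can be sketched in Python as follows. The sketch assumes whitespace-separated tokens and two invented pages, and the function name `position_is_empty` is hypothetical.

```python
def position_is_empty(pages, left, right):
    """True if `right` directly follows every occurrence of `left`."""
    for page in pages:
        tokens = page.split()
        for i, tok in enumerate(tokens):
            if tok == left:
                follower = tokens[i + 1] if i + 1 < len(tokens) else None
                if follower != right:
                    return False
    return True


# Invented pages; tags are written as separate tokens for simplicity.
pages = [
    "<b> Name: </b> Ann <i> Age: </i> 31",
    "<b> Name: </b> Bob Lee <i> Age: </i> 27",
]
print(position_is_empty(pages, "<b>", "Name:"))  # always contiguous
print(position_is_empty(pages, "</b>", "<i>"))   # data sits in between
```

The position between `</b>` and `<i>` is non-empty because a name appears there, which marks it as a place where data is inserted.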

Page 29: Extracting Structured Data from Web Pages

Constructing Template (2)

• The tokens connected by empty positions belong to the template.

• In the non-empty positions, there are either basic types (strings extracted from database), or a more complex type

• This unknown type can be determined by inspecting input pages
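Once the template tokens are known, the contents of the non-empty positions can be read off a page. The Python sketch below is a simplification under stated assumptions: each template token occurs exactly once per page, tokens are whitespace-separated, and every non-empty position is returned as a single string, so a set-valued position is not split into its elements. The function name `extract_values` is hypothetical.

```python
def extract_values(page, template_tokens):
    """Return the text found in each non-empty position of `page`."""
    tokens = page.split()
    # Index of each template token; assumes one occurrence per page.
    idx = [tokens.index(t) for t in template_tokens]
    bounds = idx[1:] + [len(tokens)]
    values = []
    for start, end in zip(idx, bounds):
        gap = tokens[start + 1:end]
        if gap:  # an empty gap is an empty position: template only
            values.append(" ".join(gap))
    return values


page = ("Title: Extracting Structured Data "
        "Presented by: Arsun and Özgün Cost: 1hr")
values = extract_values(page, ["Title:", "Presented", "by:", "Cost:"])
print(values)
```

The position between `Presented` and `by:` is empty and yields nothing; the other three positions yield the title, the presenters and the cost.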

Page 30: Extracting Structured Data from Web Pages

Constructing Template (3)

Page 32: Extracting Structured Data from Web Pages

Experimental Results (1)

• EXALG is compared with RoadRunner; however, RoadRunner makes simplifying assumptions.

• The first six web pages are obtained from the RoadRunner site.

• The last three web pages have more complex structure.

Page 33: Extracting Structured Data from Web Pages

Experimental Results (2)

Page 35: Extracting Structured Data from Web Pages

Concluding Remarks

• EXALG first discovers the unknown template that generated the pages and then uses the discovered template to extract the data from the input pages.

• Besides getting very good results, EXALG does not fail completely: it still extracts data even when some of its assumptions are not met by the input collection.

• No human intervention: the template and the data are obtained automatically.

Page 36: Extracting Structured Data from Web Pages

Future Work

• Automatically locate collections of pages that are structured

• Check whether it is feasible to generate a large database from these pages

Page 37: Extracting Structured Data from Web Pages

Questions & Answers