Semiautomatic Generation of Data-Extraction Ontologies Master’s Thesis Proposal Yihong Ding.


Page 1

Semiautomatic Generation of Data-Extraction Ontologies

Master’s Thesis Proposal

Yihong Ding

Page 2

Querying the Web (Two Approaches)

• Enhanced query language
  – Examples: WebSQL, WebOQL
  – Sources: structured, or restructured before parsing
• Wrapper
  – Enables querying in a database-like fashion
  – Depends on source format
    • Not resilient
    • The same topic in different formats needs different wrappers

Page 3

Data-Extraction Ontology

• Beyond the wrapper approach
  – Extraction technique for data-rich, unstructured, multiple-record Web documents
  – Does not depend on source format
    • Resilient
    • The same topic in different formats uses the same ontology

• Good experimental results
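The contrast between the two approaches can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the two sample advertisements, the function names `wrapper_extract` and `ontology_extract`, and the regular expressions are hypothetical and are not taken from the proposal or from any existing extraction tool. The wrapper relies on a fixed table layout, while the recognizers look only at the text.

```python
import re

# Two made-up car ads on the same topic in different layouts (illustrative data).
AD_TABLE = "<tr><td>1998</td><td>Honda</td><td>Civic</td><td>72,000 miles</td></tr>"
AD_TEXT = "For sale: Honda Civic, 1998, only 72,000 miles. Call today."


def wrapper_extract(html):
    """Format-dependent wrapper: assumes a four-cell table row."""
    cells = re.findall(r"<td>(.*?)</td>", html)
    if len(cells) != 4:
        return None  # the layout differs, so the wrapper finds nothing
    return {"Year": cells[0], "Make": cells[1],
            "Model": cells[2], "Mileage": cells[3]}


# Hypothetical recognizers keyed by object-set name; the patterns are assumptions.
RECOGNIZERS = {
    "Year": re.compile(r"\b(?:19|20)\d{2}\b"),
    "Make": re.compile(r"\b(?:Honda|Ford|Toyota)\b"),
    "Mileage": re.compile(r"\b\d{1,3}(?:,\d{3})*\s*miles\b"),
}


def ontology_extract(text):
    """Format-independent extraction: strip markup, then apply each recognizer."""
    plain = re.sub(r"<[^>]+>", " ", text)
    found = {}
    for name, pattern in RECOGNIZERS.items():
        match = pattern.search(plain)
        if match:
            found[name] = match.group(0)
    return found


if __name__ == "__main__":
    print(wrapper_extract(AD_TEXT))    # the wrapper fails on the free-text ad
    print(ontology_extract(AD_TABLE))  # the recognizers handle both layouts
    print(ontology_extract(AD_TEXT))
```

The recognizers return the same values for both advertisements because they never inspect the markup, which is the sense in which the slide calls the approach resilient.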

Page 4

Main Difficulty (Creating the Data-Extraction Ontology)

• Users must be experts
  – Database theory
  – Regular expression generation
• Manual creation is impractical
  – Very large information sources
  – Frequently added sources of interest
  – Many varying text formats

Page 5

Semiautomatic Data-Extraction Generation

[Diagram with boxes labeled: Input Knowledge Sources; Training Document(s); Validation Documents; Generation & Updating Process; Generated Data-Extraction Ontology]

Page 6

Generation Process

• For this research, three steps are expected:
  – Gathering Knowledge
  – Generating Initial Ontology
  – Validation & Updating Strategy
• Ontology Generation Performance Evaluation

Page 7

Example: Extract Information from Country Library Web Site (http://www.tradeport.org/ts/countries/)

[Figure labels: Car Advertisement XML Base; CIA Factbook XML Base]

Page 8

Learning & Discovering Algorithm

[Diagram with node labels: All; Car; Make; Model; Mileage; …; Country; CountryName; Capital; Area; Population; Agriculture; …. Source labels: Car Advertisement XML Base; CIA Factbook XML Base]
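The node labels on this slide read like a tree rooted at All, with one branch per knowledge source. The slides do not say how that structure is built, so the following is only a sketch under assumptions: the XML layout (a root element holding one element per record), the sample values, and the function name `build_knowledge_tree` are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Made-up XML fragments standing in for the two knowledge sources.
# The element layout is an assumption, not the actual format of either base.
CAR_XML = (
    "<Cars><Car><Make>Honda</Make><Model>Civic</Model>"
    "<Mileage>72000</Mileage></Car></Cars>"
)
COUNTRY_XML = (
    "<Countries><Country><CountryName>Chile</CountryName>"
    "<Capital>Santiago</Capital></Country></Countries>"
)


def build_knowledge_tree(xml_sources):
    """Collect record types and their field names under a single 'All' root."""
    tree = {"All": {}}
    for xml_text in xml_sources:
        root = ET.fromstring(xml_text)
        for record in root:  # e.g. <Car> or <Country>
            fields = tree["All"].setdefault(record.tag, [])
            for field in record:  # e.g. <Make>, <Capital>
                if field.tag not in fields:
                    fields.append(field.tag)
    return tree


if __name__ == "__main__":
    print(build_knowledge_tree([CAR_XML, COUNTRY_XML]))
```

Only element names are collected here; the values inside the elements, which a real generator might mine for recognizer patterns, are ignored.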

Page 9

Learning & Discovering Algorithm

[Diagram showing two copies of the same set of node labels: All; Car; Make; Model; Mileage; …; Country; CountryName; Capital; Area; Population; Agriculture; …]

Page 10

Performance Evaluation

• Measure precision and recall for each lexical object set in the generated extraction ontology
• Measure what was generated with respect to what could have been generated
• Measure what was generated with respect to what should not have been generated
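The slide does not give formulas, so the sketch below uses the standard definitions of precision and recall, applied to one lexical object set at a time. The function name `precision_recall` and the sample values are illustrative assumptions, not results from the proposal.

```python
def precision_recall(generated, expected):
    """Return (precision, recall) for one lexical object set.

    generated: items the generated ontology produced
    expected:  items a correct ontology could have produced
    """
    generated, expected = set(generated), set(expected)
    correct = generated & expected
    # Precision penalizes items that should not have been generated.
    precision = len(correct) / len(generated) if generated else 0.0
    # Recall penalizes items that could have been generated but were missed.
    recall = len(correct) / len(expected) if expected else 0.0
    return precision, recall


if __name__ == "__main__":
    # Made-up example for a hypothetical Make object set.
    generated = {"Honda", "Ford", "Civic"}
    expected = {"Honda", "Ford", "Toyota", "Nissan"}
    p, r = precision_recall(generated, expected)
    print(f"precision={p:.2f} recall={r:.2f}")
```

In the made-up example, two of the three generated items are correct and two of the four expected items are found, so precision is 2/3 and recall is 1/2.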

Page 11

Delimitation

Will not …
• Consider all storage formats for existing knowledge
  – XML
• Consider all document formats
  – HTML
  – Plain Text
• Let users update the input knowledge source at run-time

Page 12

Contribution

• Semi-automatically generate a data-extraction ontology

• Exploit the existing knowledge

• Link existing data-extraction tools

• Create a partial library of regular expression recognizers