Semiautomatic Generation of Data-Extraction Ontologies Master’s Thesis Proposal Yihong Ding.
Semiautomatic Generation of
Data-Extraction Ontologies
Master’s Thesis Proposal
Yihong Ding
Querying the Web (Two Approaches)
• Enhanced query language
  – Examples: WebSQL, WebOQL
  – Sources: structured, or restructured before parsing
• Wrapper
  – Enables querying in a database-like fashion
  – Depends on source format
    • Not resilient
    • The same topic in different formats needs different wrappers
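To illustrate why a format-dependent wrapper is not resilient, here is a minimal sketch. It assumes two invented car-ad snippets on the same topic; the function name `wrap_table_ad`, the pattern, and both sample pages are hypothetical and do not come from the proposal.

```python
import re

# Hypothetical wrapper written for ONE page layout: an HTML table row
# with make, model, and price in fixed cell positions.
ROW = re.compile(r"<tr><td>(.*?)</td><td>(.*?)</td><td>(.*?)</td></tr>")

def wrap_table_ad(page):
    """Return a list of record dicts extracted from the table layout."""
    return [
        {"make": m.group(1), "model": m.group(2), "price": m.group(3)}
        for m in ROW.finditer(page)
    ]

# Two invented pages describing the same car in different formats.
table_page = "<tr><td>Honda</td><td>Civic</td><td>$4,500</td></tr>"
prose_page = "<p>1995 Honda Civic, runs great, asking $4,500.</p>"

print(wrap_table_ad(table_page))  # one record
print(wrap_table_ad(prose_page))  # empty list: the layout changed
```

The second call finds nothing because the wrapper keys on markup positions rather than on the data, so a second wrapper would be needed for the prose layout.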
Data-Extraction Ontology
• Beyond the wrapper approach
  – Extraction technique for data-rich, unstructured, multiple-record Web documents
  – Does not depend on source format
    • Resilient
    • The same topic in different formats uses the same ontology
• Good experimental results
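The format independence claimed above can be sketched with recognizers that match the data itself rather than the surrounding markup. This is a simplified illustration only: the object-set names, the regular expressions, and the sample pages are assumptions, and a real data-extraction ontology carries more than a dictionary of patterns.

```python
import re

# Hypothetical object sets. Each recognizer is keyed to the data
# values, not to the markup around them.
RECOGNIZERS = {
    "Make": re.compile(r"\b(Honda|Ford|Toyota)\b"),
    "Price": re.compile(r"\$\d{1,3}(?:,\d{3})*"),
    "Year": re.compile(r"\b(?:19|20)\d{2}\b"),
}

def extract(text):
    """Return the first match of each recognizer found in `text`."""
    found = {}
    for name, pattern in RECOGNIZERS.items():
        match = pattern.search(text)
        if match:
            found[name] = match.group(0)
    return found

# Two invented pages describing the same car in different formats.
table_page = "<tr><td>1995</td><td>Honda</td><td>$4,500</td></tr>"
prose_page = "<p>1995 Honda Civic, runs great, asking $4,500.</p>"

print(extract(table_page))
print(extract(prose_page))
```

Both calls return the same three values, since the recognizers never refer to table cells or paragraph tags.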
Main Difficulty (Creating the Data-Extraction Ontology)
• Users must be experts in
  – Database theory
  – Regular expression generation
• Manual creation is impractical
  – Very large information sources
  – Frequently added sources of interest
  – Many varying text formats
Semiautomatic Data-Extraction Generation
[Figure: Generation & Updating Process. Labels: Input Knowledge Sources; Training Document(s); Validation Documents; Generated Data-Extraction Ontology]
Generation Process
• For this research, three steps are expected:
  – Gathering Knowledge
  – Generating Initial Ontology
  – Validation & Updating Strategy
• Ontology Generation Performance Evaluation
Example: Extract Information from Country Library Web Site (http://www.tradeport.org/ts/countries/)
[Figure labels: Car Advertisement XML Base; CIA Factbook XML Base]
Learning & Discovering Algorithm
[Figure: concept tree rooted at All, with labels Car Advertisement XML Base and CIA Factbook XML Base]
• Car: Mileage, Make, Model, …
• Country: Capital, Area, CountryName, Population, Agriculture, …
Learning & Discovering Algorithm
[Figure: the concept tree rooted at All, shown twice]
• Car: Mileage, Make, Model, …
• Country: Capital, Area, CountryName, Population, Agriculture, …
Performance Evaluation
• Measure precision and recall for each lexical object set in the generated extraction ontology
• Measure what was generated with respect to what could have been generated
• Measure what was generated with respect to what should not have been generated
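A minimal sketch of the standard precision and recall computation for one lexical object set follows. It assumes that "could have been generated" corresponds to recall and "should not have been generated" to precision; the function name `precision_recall` and the sample values are invented for illustration and are not results from the proposal.

```python
def precision_recall(generated, expected):
    """Compare generated items against a hand-built expected set.

    precision = correct / all generated items
    recall    = correct / all expected items
    """
    generated, expected = set(generated), set(expected)
    correct = generated & expected
    precision = len(correct) / len(generated) if generated else 0.0
    recall = len(correct) / len(expected) if expected else 0.0
    return precision, recall

# Invented values for a hypothetical "Capital" object set.
generated = {"Paris", "Rome", "Madrid", "Texas"}
expected = {"Paris", "Rome", "Madrid", "Lisbon", "Berlin"}

p, r = precision_recall(generated, expected)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here three of the four generated items are correct and three of the five expected items were found, so the two measures penalize different kinds of error.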
Delimitation
Will not …
• Consider all storage formats for existing knowledge
  – XML
• Consider all document formats
  – HTML
  – Plain Text
• Let users update the input knowledge source at run-time
Contribution
• Semiautomatically generate a data-extraction ontology
• Exploit existing knowledge
• Link existing data-extraction tools
• Create a partial library of regular expression recognizers