Paper Elseiver

8/3/2019 Paper Elseiver


A FRAMEWORK FOR SEMANTIC ENRICHMENT FROM INFORMATION INTEGRATING

MARCO JAVIER SUÁREZ BARÓN, JUAN VELAZQUEZ, CARLOS ANDRES CIFUENTES

Abstract- In this paper, we propose a Semantic Web architecture comprising information extraction (IE) through deductive logic, information integration based on XML/RDF, and a model for semantic enrichment given by an ontology. The purpose of this research is to generate XML/RDF documents that integrate the information obtained in the extraction process. We develop an ontology for cardiology diseases in OWL for processing, which includes querying and inference over table structures. By studying a building ZigBee network scenario, we show that semantic web technologies can provide high-level information extraction and inference of knowledge.

Keywords:

Information extracting, Information integrating, semantic enrichment, ontology

Resumen- In this work, we propose an architecture for the Semantic Web: information extraction (IE) through deductive logic, information integration based on XML/RDF and, finally, a model for semantic enrichment given by an ontology. The purpose of this research is the generation of XML/RDF documents in order to integrate the information obtained in the extraction process. We develop an ontology in the domain of cardiovascular diseases using OWL for the transformation, which includes querying and inference over table structures. Through the study of a ZigBee network for building scenarios, we show that semantic web technologies can provide high-level information extraction and inference of knowledge.

Palabras Clave:

Information extraction, information integration, semantic enrichment, ontology

1. INTRODUCTION

The web contains large volumes of raw data that are naturally heterogeneous in format (e.g. PDF, HTML, PS); documents also contain different structures such as tables, graphics and text. These data need to be extracted and enriched with semantic information. PDF documents and HTML tables are mostly structured, but we usually do not know the structure in advance [1]. We begin our examination of Information Extraction (IE) by considering a specific example: tables contained within XHTML and HTML pages or PDF documents. In order to integrate the information of the XHTML, HTML and PDF documents, it is necessary to determine the area of knowledge (Health, Tourism, Nutrition, etc.), a classification of the types of tables, and the output format for the extracted information. This article presents a classification of HTML and PDF table structures, a description of the ways of organizing and presenting information in tables, a set of rules and heuristics for extracting information from web tables, and the design of the intermediate DTD that facilitates the generation of XML schemas. Here, XML is adopted as the output format of the extraction process, and RDF/XML as the format for integrating information through an OWL ontology. XML was chosen because of its high adaptability, accessibility and interoperability across different data processing contexts. In addition, the W3C has made it a standard for the development of Web applications.
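To make the role of XML as the output format concrete, the following is a minimal sketch of converting an HTML table into an XML document. It is not the wrapper described in this article: the class name `TableExtractor`, the function `tables_to_xml`, the element names `tables`, `table`, `row` and `cell`, and the sample data are all assumptions chosen for illustration, and the sketch assumes non-nested tables with explicit closing tags.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class TableExtractor(HTMLParser):
    """Collects the cell text of every <table> in an HTML page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.tables = []   # each table is a list of rows; each row is a list of strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Normalize internal whitespace before storing the cell text.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None
            self._cell = None


def tables_to_xml(html):
    """Return an XML string with one <table> element per HTML table."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    root = ET.Element("tables")
    for rows in parser.tables:
        table_el = ET.SubElement(root, "table")
        for cells in rows:
            row_el = ET.SubElement(table_el, "row")
            for text in cells:
                ET.SubElement(row_el, "cell").text = text
    return ET.tostring(root, encoding="unicode")


# Made-up sample page, used only to exercise the sketch.
sample = (
    "<table>"
    "<tr><th>Disease</th><th>Symptom</th></tr>"
    "<tr><td>Angina</td><td>Chest pain</td></tr>"
    "</table>"
)
xml_out = tables_to_xml(sample)
```

A real wrapper would additionally need the table classification and heuristic rules that the article refers to; this sketch only shows the final serialization step.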

2. Methodology

One objective of our work is an approach to solving the problem of data heterogeneity on the Web [2]. As a solution, we propose the model shown in Figure 1. The focus of this solution is the use of deductive logic and XML technologies for detecting, extracting and integrating information. In this work, the unstructured data sources are mainly texts contained in structures known as tables. These structures are held in a Web Collection [3], which aims to provide a store of documents to be used by the wrapper.
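The detection step over a Web Collection might be sketched as follows. This is an illustration under stated assumptions, not the article's wrapper: the collection is modeled as a plain dictionary from document name to HTML text, and the thresholds `min_rows` and `min_cells` used to discard layout-only tables are hypothetical values, as are the names `TableCounter` and `candidate_documents`.

```python
from html.parser import HTMLParser


class TableCounter(HTMLParser):
    """Records [row_count, cell_count] for each top-level <table> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # current nesting depth of <table> elements
        self.shapes = []  # one [rows, cells] pair per top-level table

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.depth += 1
            if self.depth == 1:
                self.shapes.append([0, 0])
        elif self.depth == 1 and tag == "tr":
            self.shapes[-1][0] += 1
        elif self.depth == 1 and tag in ("td", "th"):
            self.shapes[-1][1] += 1

    def handle_endtag(self, tag):
        if tag == "table" and self.depth > 0:
            self.depth -= 1


def candidate_documents(collection, min_rows=2, min_cells=4):
    """Return the names of documents that hold at least one data-like table.

    The thresholds are assumptions for this sketch, not values from the article.
    """
    selected = []
    for name, html in collection.items():
        counter = TableCounter()
        counter.feed(html)
        counter.close()
        if any(r >= min_rows and c >= min_cells for r, c in counter.shapes):
            selected.append(name)
    return selected


# Made-up collection used only to exercise the sketch.
web_collection = {
    "a.html": "<table><tr><td>1</td><td>2</td></tr><tr><td>3</td><td>4</td></tr></table>",
    "b.html": "<p>No tables here.</p>",
    "c.html": "<table><tr><td>layout only</td></tr></table>",
}
selected = candidate_documents(web_collection)
```

Only documents that pass this filter would be handed on to the extraction step; the size-based filter stands in for whatever validation rules the wrapper actually applies.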



been defined [14]. Shared refers to a common understanding of some domain that can be communicated across people and computers [13].

We have implemented the Ontology using OWL and the RDF/XML language, stressing efficiency and ease of use. In Figure 7, a general view of the Ontology can be observed. In semantic terms, it allows a model with a simple table structure to be generated starting from complex structures. In this case, the algorithm detected the presence of a single table with regroupings in the HTML document. It can be observed that the generated XML document corresponds to the semantic structure of Figure 7. On the other hand, for a table that uses paragraphs as a way of distributing information, each paragraph will be considered in the extraction process as a row of a table, even though the paragraphs are contained in the same cell.
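The rule that paragraphs sharing one cell are treated as separate rows can be illustrated with a short sketch. The names `ParagraphSplitter` and `cell_to_rows` are hypothetical, and the fallback for cells without any paragraph (keep the cell as a single row) is an assumption of this sketch rather than a rule stated in the article.

```python
from html.parser import HTMLParser


class ParagraphSplitter(HTMLParser):
    """Collects the text of each <p> found inside one table cell (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []  # text of each completed paragraph
        self.loose_text = []  # text found outside any paragraph
        self._buffer = None   # fragments of the paragraph being read

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.flush()      # a new <p> implicitly ends the previous one
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p":
            self.flush()

    def handle_data(self, data):
        target = self._buffer if self._buffer is not None else self.loose_text
        target.append(data)

    def flush(self):
        """Store the paragraph read so far, if any, with whitespace normalized."""
        if self._buffer is not None:
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._buffer = None


def cell_to_rows(cell_html):
    """Return one row value per paragraph in the cell, or the whole cell as one row."""
    splitter = ParagraphSplitter()
    splitter.feed(cell_html)
    splitter.close()
    splitter.flush()  # covers a final <p> that has no closing tag
    if splitter.paragraphs:
        return splitter.paragraphs
    text = " ".join("".join(splitter.loose_text).split())
    return [text] if text else []


rows = cell_to_rows("<td><p>first paragraph</p><p>second paragraph</p></td>")
```

Each returned string would then become its own row in the generated XML, even though all of them came from a single cell of the source table.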

Fig. 6 Part of Ontology on Cardiology for Information integrating
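As a hedged illustration of how an extracted table row could be expressed in RDF/XML for integration with such an ontology, the sketch below serializes one row as a typed RDF node. The namespace `http://example.org/cardiology#`, the class name `Disease`, the property `symptom` and the sample values are hypothetical; they are not taken from the ontology in Fig. 6.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
CARDIO_NS = "http://example.org/cardiology#"  # hypothetical ontology namespace

# Ask ElementTree to use readable prefixes when serializing.
ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("cardio", CARDIO_NS)


def row_to_rdf(subject_id, class_name, properties):
    """Serialize one extracted row as a typed node inside an rdf:RDF element."""
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    node = ET.SubElement(
        root,
        f"{{{CARDIO_NS}}}{class_name}",
        {f"{{{RDF_NS}}}about": CARDIO_NS + subject_id},
    )
    # Each column of the row becomes a literal-valued property of the node.
    for name, value in properties.items():
        ET.SubElement(node, f"{{{CARDIO_NS}}}{name}").text = value
    return ET.tostring(root, encoding="unicode")


# Made-up row used only to exercise the sketch.
rdf_xml = row_to_rdf("Angina", "Disease", {"symptom": "Chest pain"})
```

The standard library is used here only to keep the sketch self-contained; a dedicated RDF toolkit would normally handle serialization and validation against the OWL ontology.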

We will now present the semantic enrichment algorithm, based on the previously developed conceptualization. The preprocessing steps of data enrichment are not shown. We note that categories are not maintained in the final result of the algorithm, as their status is ill defined. Thus, all assignments of categories to semantic types are temporary and are deleted at the end of the algorithm. For the propose

Algorithm: Semantic Enrichment

MQ
    AT Candidated(Col, th) = {t}
    If sub(t, Col) is the set of column values assumed by A-term t
    Then
        Applicant factors of Indexing:
        1. Equality: ($M_v = M_t$)
        2. Inclusion: ($M_v \subseteq M_t$ or $M_t \subseteq M_v$)
        3. Intersection: ($M_t \cap M_v \neq \emptyset$) or ($M_v \cap M_t = \emptyset$)
Fin MQ

Fig 8. A part of Semantic enrichment algorithm
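The three indexing factors listed in the algorithm (equality, inclusion, intersection) can be read as plain set comparisons. The sketch below assumes that Mv is the set of values found in a table column and Mt is the set of values associated with a candidate term t; the surviving text does not define these symbols, so this reading, the function name `indexing_factor` and the sample values are assumptions.

```python
def indexing_factor(mv, mt):
    """Classify the relation between two value sets.

    mv: values observed in a column (assumed meaning of Mv).
    mt: values associated with a candidate term (assumed meaning of Mt).
    Returns "equality", "inclusion", "intersection" or "disjoint".
    """
    mv, mt = set(mv), set(mt)
    if mv == mt:
        return "equality"
    if mv <= mt or mt <= mv:
        # One set is fully contained in the other.
        return "inclusion"
    if mv & mt:
        # The sets overlap without either containing the other.
        return "intersection"
    return "disjoint"


# Made-up values used only to exercise the sketch.
column_values = {"angina", "arrhythmia"}
relation = indexing_factor(column_values, {"angina", "arrhythmia", "stroke"})
```

The checks are ordered from the strongest relation to the weakest, so each pair of sets receives exactly one label.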

The goal of this section is to give an approach towards an algorithm or intelligent agent for enrichment

Fig. 7 Output for wrapper for the conversion of table through semantic hierarchies.

9. Conclusions and future work

We have shown a tool for extracting and integrating semistructured information from a set of HTML pages and PDF documents and converting the extracted information into a set of XML documents. This article also introduces an approach to the open extraction of information contained in XHTML, HTML and PDF. A study and a classification of tables according to their structure and complexity made it possible to define a set of heuristic rules for extracting the information contained in tables of this type. For this purpose, a data processing Wrapper prototype has been implemented. It captures Web documents, detects the presence of tables and filters the tables to be validated for the process. Moreover, it displays the output information as XML-type documents. After evaluating the results obtained with the extraction prototype, we consider them satisfactory and of high quality. Based on this, the next step in the research will be the semantic improvement of the generated RDF/XML knowledge base in the domain of Cardiology. This component will be tested through a link with a ZigBee wireless network for integrating biomedical sensors on patients with cardiovascular diseases.

10. References

[4] Luger, G. F. (1992). "Artificial Intelligence: Structures and Strategies for Complex Problem Solving". Benjamin/Cummings Publishing.

[4] Lim, S. J. & K, Y. (2002). "Extracting Information from semistructured sources". pp. 71-80.

[4] Maruyama, H. (1999). "XML y Java". Ed. Prentice Hall.

[4] Nejdl, W. & Loser, A. (2003). "Super-Peer-Based Routing and Clustering Strategies for RDF-Based Peer-to-Peer Networks". In Proceedings of WWW. pp. 123-130.

[4] Suárez, M. J. (2005). "An Approach to Semantic Indexing and Information Retrieval". In Proceedings CIIDET, IEEE México. pp. 34-350.

[4] Popov, B., Kiryakov, A., Manov, D. & Kirilov (2005). "Towards Semantic Web Information Extraction". In Proceedings CIIDET, IEEE México. pp. 34-350.

[x] Towards Semantic Web Information Extraction. Borislav Popov, Atanas Kiryakov, Dimitar Manov, Angel Kirilov, Damyan Ognyanoff, Miroslav Goranov. Ontotext Lab, Sirma AI EOOD, 135 Tsarigradsko Shose, Sofia 1784, Bulgaria.

[y] Automating the Extraction of Data from HTML Tables with Unknown Structure. David W. Embley and Cui Tao, Department of Computer Science; Stephen W. Liddle, Information Systems Group and Rollins eBusiness Center. Brigham Young University, Provo, Utah 84602, U.S.A. {embley,ctao}@cs.byu.edu, [email protected]

[z] Extracting Information from Semi-structured Web Documents. Ajay Hemnani and Stephane Bressan. National University of Singapore, 3 Science Drive 2, Singapore 117543. {hemnania,steph}@comp.nus.edu.sg

[w] C. Blaschke and A. Valencia. The frame-based module of the Suiseki information extraction system. IEEE Intelligent Systems, 17:14-20, 2002.