Treatment of Semantic Heterogeneity ...
GESIS
Robert Strötgen
Social Science Information Centre, Bonn
euroCRIS 2002, 29th August 2002
Treatment of Semantic Heterogeneity using Meta-Data Extraction and Query Translation
Outline
- What is semantic heterogeneity?
- Meta-Data extraction
- Semantic relations
- Query translation
- Outlook
Project CARMEN
- Metadata (Dublin Core Element Set in RDF, “Meta-Maker”, digital signatures)
- Retrieval on structured documents and heterogeneous data types (search engine and gatherer for XML documents)
- Methods for the treatment of remaining semantic heterogeneity in CARMEN
Semantic Heterogeneity
- Technical heterogeneity (different platforms, databases, formats) is not the concern of CARMEN.
- Semantic heterogeneity arises when data collections use
  - different thesauri or classifications for content description,
  - varying metadata or no metadata at all,
  - or when intellectually indexed documents meet completely unindexed Internet pages.
Material: Social Sciences
- SOLIS/FORIS vs. Internet documents from the social sciences
- SOLIS/FORIS: specialized documentation databases with high-quality content description such as abstracts, controlled keywords and classification
- Internet documents: in the majority of cases without any metadata, with high semantic and formal heterogeneity
Extraction of Meta-Data
[Figure: extractor heuristics turn an unstructured PostScript document and a structured HTML document (source: www.tum.de/preprints/...) into Dublin Core metadata in RDF. Example values: dc:creator "Schmid, Werner"; dc:subject (Keyword) "Multigrid Methods, Eigenvalue Problems, Multigroup Diffusion"; dcq:abstract "Safety analysis of nuclear reactors strongly relies on numerical simulation of the reactor core. ..."; a Math. Subject Classification entry 65N55 "Multigrid methods; domain decomposition", described with rdf:type (Classification), rdf:value and rdfs:label; further MSC.]
Meta-Data in Test Corpus
- Size: 3,661 documents
- File format: only HTML documents
- TITLE:
  - Correct title tags: 96 %
  - Title present, but incorrectly coded: 17.7 % of the rest
- KEYWORD:
  - Correct keyword tags: 25.5 %
- ABSTRACT:
  - Correct description tags: 21 %
  - Abstract present, but incorrectly coded: 39.4 % of the rest
Extraction from HTML files - Some Problems
- Missing or irregular use of meta tags (author, keywords, DC tags)
- Inconsistent use of semantic HTML tags (title, h1, h2, address etc.)
- Irregular formatting style for context information (type size, type style, horizontal orientation etc.)
- Missing context information (date, author, institution, etc.)
- HTML is not used in conformance with the specification!
Converting HTML → XML
Advantages:
- (Syntactical) homogenisation of HTML files
- XML allows the use of many existing tools for document analysis, particularly the query language XPath.
Disadvantage:
- Poor performance of the conversion process (not a big issue: extraction runs during the gathering process, not at retrieval time)
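The conversion step can be illustrated with a minimal sketch. It is not the CARMEN converter: it assumes Python's standard library only, and the class name `HtmlToXml` and the sample page are hypothetical. It closes forgotten tags so that the result is a well-formed tree, which can then be queried with the XPath subset that `xml.etree.ElementTree` supports.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never have a closing tag in HTML.
VOID = {"br", "hr", "img", "input", "link", "meta"}


class HtmlToXml(HTMLParser):
    """Builds a well-formed ElementTree from loosely written HTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.builder = ET.TreeBuilder()
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        self.builder.start(tag, {k: v or "" for k, v in attrs})
        if tag in VOID:
            self.builder.end(tag)
        else:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray closing tag: ignore it
        # Close any elements the author forgot to close.
        while self.open_tags:
            current = self.open_tags.pop()
            self.builder.end(current)
            if current == tag:
                break

    def handle_data(self, data):
        if self.open_tags:  # drop text outside the root element
            self.builder.data(data)

    def to_tree(self, html):
        self.feed(html)
        self.close()
        while self.open_tags:
            self.builder.end(self.open_tags.pop())
        return self.builder.close()


# Hypothetical page with an unclosed <h1> and <p>.
html = """<html><head><title>Labour Markets</title>
<meta name="keywords" content="migration, employment">
</head><body><h1>Labour Markets in Europe<p>First paragraph<br>text</body></html>"""

root = HtmlToXml().to_tree(html)
title = root.findtext(".//title")
keywords = [m.get("content") for m in root.findall(".//meta[@name='keywords']")]
```

Once the tree is well-formed, each extraction heuristic can be written as a path expression instead of ad hoc string matching, which is the advantage the slide names.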
HTML Heuristic: Title (part)
    If (<title>-tag exists && <title> does not contain "untitled" && HMAX exists) {
        /* 'does not contain "untitled"' is to be searched as case-insensitive substring in <title> */
        If (<title> == HMAX) {
            <1> Title[1] = <title>
        } elsif (<title> contains HMAX) {
            /* 'contains' always means case-insensitive substring */
            <2> Title[0,8] = <title>
        } elsif (HMAX contains <title>) {
            <3> Title[0,8] = HMAX
        } else {
            <4> Title[0,8] = <title> + HMAX
        }
    } elsif (<title> exists && S exists) {
        /* i.e. <title> exists AND an item //p/b, //i/p etc. exists */
        <5> Title[0,5] = <title> + S
    } elsif (<title> exists) {
        <6> Title[0,5] = <title>
    } elsif (<Hx> exists) {
        <7> Title[0,3] = HMAX
    } elsif (S exists) {
        <8> Title[0,1] = S
    }
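The heuristic can be written as a small runnable function. This is a sketch under stated assumptions, not the project's code: the bracketed numbers such as `Title[0,8]` are read as a confidence of 0.8, `HMAX` as the text of the highest-ranking heading, `S` as the text of an emphasised item, and `+` as concatenation with a space. The function name `extract_title` is hypothetical.

```python
from typing import Optional, Tuple


def extract_title(title: Optional[str], hmax: Optional[str],
                  s: Optional[str]) -> Optional[Tuple[str, float]]:
    """Return (candidate title, assumed confidence) following rules <1> to <8>.

    title: text of the <title> tag, or None if absent
    hmax:  text of the highest-ranking heading, or None if no heading exists
    s:     text of an emphasised item such as //p/b, or None
    """
    def contains(text: str, part: str) -> bool:
        # 'contains' is a case-insensitive substring test, as on the slide.
        return part.lower() in text.lower()

    if title and hmax and not contains(title, "untitled"):
        if title == hmax:
            return title, 1.0               # <1>
        if contains(title, hmax):
            return title, 0.8               # <2>
        if contains(hmax, title):
            return hmax, 0.8                # <3>
        return f"{title} {hmax}", 0.8       # <4>
    if title and s:
        return f"{title} {s}", 0.5          # <5>
    if title:
        return title, 0.5                   # <6>
    if hmax:
        return hmax, 0.3                    # <7>
    if s:
        return s, 0.1                       # <8>
    return None
```

As in the pseudocode, a title containing "untitled" is not discarded; it only falls through to the lower-confidence rules <5> and <6>.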
Results and Outlook
Extraction of Meta-Data
- TITLE: 80 % extracted with medium or high quality
- KEYWORDS: nearly 100 % extracted with high quality
- ABSTRACTS: 90 % extracted with medium/high quality
Conclusion
- In principle transferable to other domains
- Expensive maintenance
- Only a compromise solution until builders of web pages use Dublin Core or another metadata standard
Semantic Relations
- Intellectual transfer relations (cross-concordances)
  - Tools for creation: SIS-TMS for thesauri, CarmenX for classifications
- Statistical transfer relations (co-occurrence analysis)
Cross-Concordances in SIS-TMS
SIS-TMS Correlation Editor
Parallel Corpus
[Figure: document set A (doc. A1, A2, A3) and document set B (doc. B1, B2, B3) are each indexed with their own thesaurus or classification (terms a, b, c, d and terms x, a, y, z). A known relation of documents links the two sets; from it, a derived relation of terms is obtained.]
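How a relation of terms can be derived from a known relation of documents is sketched below. The slides do not name the association measure, so this example assumes a simple conditional frequency: the share of linked document pairs carrying term `a` on one side that also carry term `b` on the other. The function name, the threshold and the sample pairs are hypothetical.

```python
from collections import Counter, defaultdict


def derive_term_relations(pairs, min_weight=0.5):
    """Derive weighted term relations from a parallel corpus.

    pairs: iterable of (terms_a, terms_b), one entry per linked document pair,
           where terms_a and terms_b come from the two different vocabularies.
    """
    term_count = Counter()   # pairs in which a term of vocabulary A occurs
    pair_count = Counter()   # pairs in which (a, b) occur together
    for terms_a, terms_b in pairs:
        for a in set(terms_a):
            term_count[a] += 1
            for b in set(terms_b):
                pair_count[a, b] += 1

    relations = defaultdict(dict)
    for (a, b), n in pair_count.items():
        weight = n / term_count[a]   # assumed measure: share of a's pairs with b
        if weight >= min_weight:
            relations[a][b] = weight
    return dict(relations)


# Hypothetical parallel corpus of three linked document pairs.
pairs = [
    ({"a", "b"}, {"x", "y"}),
    ({"a"}, {"x"}),
    ({"b", "c"}, {"y", "z"}),
]
relations = derive_term_relations(pairs, min_weight=0.6)
```

Other measures from co-occurrence analysis could be substituted for the conditional frequency without changing the structure of the computation.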
Corpus with Internet Documents
[Figure: Internet full-text documents are linked to terms from the probabilistic indexer (x, a, y, z, ...), but not to the entries of the classification/thesaurus (a, b, c, d, ...).]
Social science Internet documents are not indexed using a thesaurus or classification.
Simulating a Parallel Corpus
[Figure: Internet full-text documents carry terms from the probabilistic indexer (x, a, y, z, ...). A probabilistic search links the entries of the classification/thesaurus (a, b, c, d, ...) to the documents with weights such as 0.5, 0.1 and 0.8.]
A probabilistic search results in weighted relations between classification classes or thesaurus terms and documents.
Result: Simulated Parallel Corpus
[Figure: each Internet full-text document is connected by weighted links both to terms from the probabilistic indexer (x, a, y, z, ...; weights such as 0.8, 0.6, 0.9, 0.5, 0.7, 0.4) and to entries of the classification/thesaurus (a, b, c, d, ...; weights such as 0.1, 0.5, 0.8).]
Term-Term-Matrix
Classification/thesaurus entries (a, b, c, d, ...) and their related terms from the probabilistic indexer (x, a, y, z, ...):
- a: x (0.8); y (0.4)
- b: x (0.3); z (0.3)
- c: a (0.2); y (0.4)
- d: x (0.6); y (0.7)
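One way to turn the weighted links of a simulated parallel corpus into a term-term matrix is sketched here. The slides do not give the formula, so the normalisation is an assumption: for each thesaurus entry, the products of its document weight and the free-term weight are summed over all documents and divided by the entry's total weight. The function name and the sample weights are hypothetical.

```python
from collections import defaultdict


def term_term_matrix(documents):
    """Combine weighted document links into a term-term matrix.

    documents: iterable of (thesaurus_weights, free_term_weights), two dicts
               per document that map a term to its weight for that document.
    """
    joint = defaultdict(lambda: defaultdict(float))
    total = defaultdict(float)
    for thesaurus_weights, free_weights in documents:
        for t, wt in thesaurus_weights.items():
            total[t] += wt
            for f, wf in free_weights.items():
                joint[t][f] += wt * wf   # both links must be strong to count
    return {
        t: {f: w / total[t] for f, w in row.items()}
        for t, row in joint.items() if total[t] > 0
    }


# Hypothetical corpus of two documents.
documents = [
    ({"a": 0.8}, {"x": 0.5, "y": 0.5}),
    ({"a": 0.2, "b": 1.0}, {"z": 1.0}),
]
matrix = term_term_matrix(documents)
```

Each row of the result has the same shape as a row of the matrix on the slide: a thesaurus entry followed by weighted free terms.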
Tool: Jester
Java Environment for Statistical TransfERs: support and assistance for creating statistical transfer relations from a parallel corpus
Query Transformation
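[Figure: a query is addressed to the databases A, B and C. Transformations turn the original Query into a translated Query' for those databases that use a different vocabulary.]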
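A query transformation of this kind can be sketched as a lookup in a transfer table. The example below is illustrative only: the function names, the `max_terms` cut-off and the ranking by weight are assumptions, and the table reuses the values shown for the entries a and d on the Term-Term-Matrix slide.

```python
def transform_query(terms, transfer_table, max_terms=3):
    """Replace each query term by its best-weighted transfer terms.

    Terms without an entry in the transfer table are passed through unchanged.
    """
    result = []
    for term in terms:
        targets = transfer_table.get(term)
        if not targets:
            result.append(term)
            continue
        # Highest weight first; ties are broken alphabetically.
        best = sorted(targets.items(), key=lambda kv: (-kv[1], kv[0]))[:max_terms]
        result.extend(target for target, _ in best)
    return result


def route_query(terms, transfer_tables):
    """Build one query per database; None means no transformation is needed."""
    return {
        db: transform_query(terms, table) if table else list(terms)
        for db, table in transfer_tables.items()
    }


table = {"a": {"x": 0.8, "y": 0.4}, "d": {"x": 0.6, "y": 0.7}}
queries = route_query(["d"], {"A": table, "B": None})
# queries == {"A": ["y", "x"], "B": ["d"]}
```

Limiting the number of transfer terms per query term is one simple form of the user parametrisation that the results slide calls for.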
Binding of Query Languages
Pluggable QueryParsers and QueryPrinters for different query languages make exploitation in other contexts easy.
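The idea of pluggable parsers and printers can be shown with a small registry. This is a sketch, not the interfaces of the CARMEN Java classes, which the slides do not show: the two toy query languages ("and" and "plus"), the registry dictionaries and the `convert` function are all hypothetical.

```python
from typing import Callable, Dict, List

# A parser turns query text into a list of terms (the internal representation);
# a printer turns the list of terms back into query text.
PARSERS: Dict[str, Callable[[str], List[str]]] = {
    "and": lambda text: [t.strip() for t in text.split(" AND ") if t.strip()],
    "plus": lambda text: [t.lstrip("+") for t in text.split()],
}
PRINTERS: Dict[str, Callable[[List[str]], str]] = {
    "and": lambda terms: " AND ".join(terms),
    "plus": lambda terms: " ".join("+" + t for t in terms),
}


def convert(text, source, target, transfer=lambda terms: terms):
    """Parse a query, optionally apply a transfer, print it in the target language."""
    terms = PARSERS[source](text)
    return PRINTERS[target](transfer(terms))


converted = convert("Dominanz AND Migration", "and", "plus")
# converted == "+Dominanz +Migration"
```

Supporting a further query language then only requires registering one more parser and printer; the transfer step in the middle stays untouched.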
CARMEN Transfer Architecture
- The retrieval server (HyRex) identifies transferable parts of a query and sends them to the transfer service.
- Partial queries are exchanged using XML/XIRQL.
- The transfer service runs as a Tomcat servlet server.
[Figure: HyRex takes a partial query from the XIRQL query and sends it over HTTP (text/xirql) to the transfer module (CGI/servlets). The query transfer returns an XIRQL partial query', again over HTTP (text/xirql), which leads to the XIRQL query'.]
Evaluation of Transfer Modules
- Retrieval tests using transfer modules (on a corpus of Internet documents indexed with Fulcrum SearchServer)
- Limitation: no use of the weight information of transfer relations
- Tested transfer: SOLIS/IZ-Thesaurus → SoWi Internet documents/free terms
- Comparison: search using IZ-Thesaurus terms vs. search using free terms from the transfer
- 2 exemplary searches for each of 3 domains (women's studies, migration, sociology of industry)
Exemplary Search: „Dominanz“
- „Dominanz“ (“dominance”): 16 relevant documents
- 10 transfer terms (Dominanz, Messen, Mongolei, Nichtregierungsorganisation, Flugzeug, Datenaustausch, Kommunikationsraum, Kommunikationstechnologie, Medienpädagogik, Wüste):
  - 14 additional documents, 7 of them relevant (50 %, increase 44 %)
- Precision: 77 %
Exemplary Search: „Leiharbeit“
- „Leiharbeit“ (“temporary work”): 10 relevant documents
- 4 transfer terms (Leiharbeit, Arbeitsphysiologie, Organisationsmodell, Risikoabschätzung):
  - 10 additional documents, 2 of them relevant (20 %, increase 20 %)
- Precision: 60 %
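The percentages reported for the two exemplary searches can be reproduced with a few lines of arithmetic. The slides do not state how precision was computed; the sketch assumes that the search without transfer returned only the relevant documents (16 and 10 respectively), since that reading matches all of the reported figures. The function name is hypothetical.

```python
def transfer_gain(relevant_before, added, added_relevant):
    """Return (share of new documents that are relevant,
               increase in relevant documents,
               precision of the combined result), each as a fraction.

    Assumption: every document found without transfer was relevant.
    """
    share_new = added_relevant / added
    increase = added_relevant / relevant_before
    precision = (relevant_before + added_relevant) / (relevant_before + added)
    return share_new, increase, precision


dominanz = [round(100 * v) for v in transfer_gain(16, 14, 7)]
leiharbeit = [round(100 * v) for v in transfer_gain(10, 10, 2)]
# dominanz == [50, 44, 77], leiharbeit == [20, 20, 60]
```

For „Dominanz“ this gives 7/14 = 50 %, 7/16 ≈ 44 % and 23/30 ≈ 77 %; for „Leiharbeit“ 2/10 = 20 %, 2/10 = 20 % and 12/20 = 60 %.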
Results
- All exemplary searches using transfers lead to additional relevant documents compared with a search without transfer.
- Share of relevant documents among all new documents: between 13 % and 55 %
- Transfer terms are not always evident (example: „Wüste“ (“desert”))
- In some cases very many transfer terms (user parametrisation or better algorithms needed)
Outlook (What needs to be done?)
- Improvement of double (parallel) corpora:
  - Kind of documents
  - Diversity of document types
  - Diversity of institutions / web sites
  - Domain
  - Corpus size
- Comparison of transfers using statistical relations vs. intellectual relations
- Improvement of algorithms
- Effect of interactive, iterative retrieval and user parametrisation / adjustment
- User tests
Exploitation
- Services (transfer)
- Software (Java classes)
- Projects:
  - Virtuelle Fachbibliothek Sozialwissenschaften (ViBSoz)
  - European Schools Treasury Browser (ETB)
  - Informationsverbund Bildung – Sozialwissenschaften – Psychologie (InfoConnex)
Contact: [email protected]