Schema, ontologies, archives and next generation IR problems · Warren Next Gen IR. Introduction...

Schema, ontologies, archives and next generation IR problems. Dr. Robert Warren ([email protected]), Math and Statistics, Carleton University, Canada (now); DRS School of Computer Science, University of Waterloo, Canada (previously); Webis, Bauhaus Universität, Weimar.

Page 1

Schema, ontologies, archives and next generation IR problems

Dr. Robert Warren
[email protected]
Math and Statistics, Carleton University, Canada (now)
DRS School of Computer Science, University of Waterloo, Canada (previously)
Webis, Bauhaus Universität, Weimar

Page 2

1 Introduction

2 Non-traditional information retrieval

3 Storage, Integration Problems

4 Conclusion

Page 3

Muninn Project (Great War)

Support for changes in organizations, people and relationships.
Taxonomy, multilingual, multicultural.
Record instances and classes of objects that don’t exist anymore.
Current bias is towards Canada and the British Empire.
Less ontological engineering than design-by-exception.
Talking to and integrating with library systems is misery.

Page 4

The problem with archives

“We’d love to give you the data but we can’t get it out of the computer system more than one page at a time.”

“We’d love to give you the data but it’s on 300 unlabeled tapes.”

“We’d love to give you the data but we don’t know where it is.”

“We’d love to give you the data. Please fill out this licensing agreement.”

“We’d love to give you the data but we can’t afford to store it so we’ll burn it.”

Page 5

Previously in the search world...

Faceted search didn’t work out.
Everything is a free form query with magic(tm).

...This is what is going to happen

The last problem was the data bloat, we now have the meta-data bloat.
Meta-data is now an ontology, a schema, documents already annotated with WordNet, cancorp, etc... some of this has already happened with word processors.
Querying this is somewhere between classical databases and information retrieval.
Data management is a nightmare. It’s about to get worse.

Page 6

[Figure: organization chart of Great War units, drawn with unit-size symbols (I, I I, I I I, X, XX, XXX).]
Units: #1 Co, PPCLI, RCR, 7th Brigade, 3rd Canadian Division, Canadian Corps (≈50,000), Canadian Expeditionary Forces; 80th Brigade, 27th British Division, XVI Corps, British Expeditionary Forces.
Other size labels: ≈1,000, ≈3,000, ≈10,000.
People and roles: Cyril Biddulph (Lieutenant), John MacPherson (Captain), Regiment, Commanding Officer, Commanding Officer (acting).
Relations: BattalionPartOf Brigade, BrigadePartOf Division, DivisionPartOfCorps, isA, isA<1915, isAlso, AssignedTo, AmalgamatedCommand, Appointed, SeniorTo, ??.
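One way to read the chart above is as a set of part-of relations that only hold for some years. The Python sketch below is a minimal illustration under that assumption. The `PartOf` class, the `chain_of_command` helper and the 1915 cut-over year (borrowed loosely from the "isA<1915" label) are hypothetical and are not the Muninn ontology itself.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PartOf:
    """A part-of relation that holds for start <= year < end (None = open-ended)."""
    child: str
    parent: str
    start: Optional[int] = None
    end: Optional[int] = None

    def holds_in(self, year: int) -> bool:
        after_start = self.start is None or self.start <= year
        before_end = self.end is None or year < self.end
        return after_start and before_end


# Illustrative data only: unit names come from the chart; the years and the
# choice of which relations are time-bounded are assumptions.
RELATIONS = [
    PartOf("PPCLI", "80th Brigade", end=1915),
    PartOf("80th Brigade", "27th British Division"),
    PartOf("27th British Division", "XVI Corps"),
    PartOf("PPCLI", "7th Brigade", start=1915),
    PartOf("7th Brigade", "3rd Canadian Division"),
    PartOf("3rd Canadian Division", "Canadian Corps"),
]


def chain_of_command(unit: str, year: int) -> List[str]:
    """Walk upward through the relations that hold in the given year."""
    chain: List[str] = []
    current = unit
    while True:
        parents = [r.parent for r in RELATIONS
                   if r.child == current and r.holds_in(year)]
        # Stop at the top of the hierarchy, or if the data contains a cycle.
        if not parents or parents[0] in chain:
            return chain
        current = parents[0]
        chain.append(current)


print(chain_of_command("PPCLI", 1914))
print(chain_of_command("PPCLI", 1916))
```

The same unit name resolves to a different parent depending on the year asked about, which is the kind of question a flat catalogue record cannot answer.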

Page 7

Page 8

Exploratory IR in unknown domains

Ongoing work with Shelley Hulan, English Literature.
Large collections of mixed documents.
Retrieval needs aren’t traditional - “soft” requirements.
“Operationalizing” those requirements is painful.
Ongoing difficulties with complex information retrieval problems.

Page 9

Prototypes using passage voice

Page 10

Prototypes using passage voice

Page 11

Messages and Signals - a data-mining perspective

Page 12

SNA driven Information Retrieval

Information Retrieval on large sets of legal documents.
Used a novel social networking method to modify document rankings.
Above median for 2/3 topics and top score for one topic.
Problem: Some normalization problems with senders outside of the network.

\[ P_{doc} = P_{bm25}\,\mathrm{AVG}\bigl(\forall P_{doc}(\mathit{sender})\bigr) \tag{1} \]
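Equation (1) can be sketched in a few lines of Python. This is only one possible reading: it treats the juxtaposition as a product and takes the averaged term to be the mean BM25 score over all documents from the same sender. The function name `rerank_by_sender` and the toy scores are hypothetical, and the slide does not say how the original system handled senders outside the network.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict


def rerank_by_sender(bm25: Dict[str, float],
                     sender_of: Dict[str, str]) -> Dict[str, float]:
    """Scale each document's BM25 score by the mean score of its sender's documents."""
    scores_by_sender = defaultdict(list)
    for doc_id, score in bm25.items():
        scores_by_sender[sender_of[doc_id]].append(score)

    # Assumed reading of AVG(...): average BM25 score per sender.
    sender_avg = {s: mean(scores) for s, scores in scores_by_sender.items()}

    return {doc_id: score * sender_avg[sender_of[doc_id]]
            for doc_id, score in bm25.items()}


# Toy data, not taken from the talk.
bm25_scores = {"d1": 0.5, "d2": 0.25, "d3": 0.75}
senders = {"d1": "alice", "d2": "alice", "d3": "bob"}

reranked = rerank_by_sender(bm25_scores, senders)
for doc_id in sorted(reranked, key=reranked.get, reverse=True):
    print(doc_id, reranked[doc_id])
```

A sender with a single document is averaged over that one score, which hints at why senders with little presence in the network are hard to normalize.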

Page 13

Current Era
“Kill them all, let god sort them all” (Local madman)

Crusades Era
“Massacrez-les, car le seigneur connaît les siens.” [“Slaughter them, for the Lord knows his own.”] (Arnaud Amalric, French madman) [a]

[a] http://fr.wikipedia.org/wiki/Arnaud_Amaury

Research problem
Track both the idiom and the underlying themes across all 8 centuries.

Page 14

RDF (Resource Description Framework) / Linked Open Data

XML like, but with cross references between files.

<org:University>Bauhaus</org:University>

OWL (Web Ontology Language)

<owl:Class rdf:ID="University">
  <rdfs:subClassOf rdf:resource="#Educational_Organization" />
  <rdf:type rdf:resource="http://xxxx.de/de_universities" />
</owl:Class>
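To make the "cross references" point concrete, the sketch below reads a class declaration like the one above with Python's standard `xml.etree.ElementTree` and lists the resources each class points to. It is a minimal illustration rather than an RDF parser: the wrapping `rdf:RDF` element, the namespace declarations and the `superclasses` helper are assumptions added so the fragment parses, and a dedicated RDF library would be the usual choice for real data.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
OWL = "http://www.w3.org/2002/07/owl#"

# The slide's class declaration, wrapped so that it is a complete XML document.
DOC = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:rdfs="{RDFS}" xmlns:owl="{OWL}">
  <owl:Class rdf:ID="University">
    <rdfs:subClassOf rdf:resource="#Educational_Organization" />
  </owl:Class>
</rdf:RDF>"""


def superclasses(xml_text: str) -> Dict[str, List[str]]:
    """Map each declared class ID to the resources named by its subClassOf links."""
    root = ET.fromstring(xml_text)
    result: Dict[str, List[str]] = {}
    for cls in root.findall(f"{{{OWL}}}Class"):
        class_id = cls.get(f"{{{RDF}}}ID")
        result[class_id] = [
            link.get(f"{{{RDF}}}resource")
            for link in cls.findall(f"{{{RDFS}}}subClassOf")
        ]
    return result


print(superclasses(DOC))
```

The `rdf:resource` value is only a reference; resolving it may mean fetching another file, which is where the integration work starts.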

Page 15

Library Catalogs (MARC)

Warren, Baby Boy, 1919-1977

Page 16

Differing views of the same data

<foaf:firstName>Robert</foaf:firstName>
<foaf:lastName>Warren</foaf:lastName>

GNL view

<gnd:preferredNameForThePerson>Rob Warren</gnd:preferredNameForThePerson>
<gnd:forename>Robert</gnd:forename>
<gnd:surname>Warren</gnd:surname>
<gnd:locQualifier>Academic</gnd:locQualifier>

Library View

<foaf:Author>Robert H. Warren (Academic, 1973-)</foaf:Author>
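The three views above describe one person with three different shapes of name. A minimal Python sketch of reconciling them follows. The dictionaries simply mirror the snippets, while the `name_from_*` helpers and the choice to compare only forename and surname are illustrative assumptions, far cruder than real authority-record matching.

```python
import re
from typing import Dict, Tuple

Name = Tuple[str, str]  # (forename, surname)

FOAF_VIEW = {"foaf:firstName": "Robert", "foaf:lastName": "Warren"}
GND_VIEW = {
    "gnd:preferredNameForThePerson": "Rob Warren",
    "gnd:forename": "Robert",
    "gnd:surname": "Warren",
    "gnd:locQualifier": "Academic",
}
LIBRARY_VIEW = {"foaf:Author": "Robert H. Warren (Academic, 1973-)"}


def name_from_foaf(view: Dict[str, str]) -> Name:
    return view["foaf:firstName"], view["foaf:lastName"]


def name_from_gnd(view: Dict[str, str]) -> Name:
    return view["gnd:forename"], view["gnd:surname"]


def name_from_library(view: Dict[str, str]) -> Name:
    heading = view["foaf:Author"]
    # Drop the trailing "(qualifier, dates)" part, then keep first and last tokens.
    bare = re.sub(r"\s*\([^)]*\)\s*$", "", heading)
    tokens = bare.split()
    return tokens[0], tokens[-1]


names = {
    name_from_foaf(FOAF_VIEW),
    name_from_gnd(GND_VIEW),
    name_from_library(LIBRARY_VIEW),
}
print(names)
```

Each source needs its own extraction rule, and the library heading has to be parsed as a string before it can be compared at all.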

Page 17

The human being is the IR implementation

MARC / Library Catalogs mimic library cards.
In 1967, about 600k books published worldwide. In 2011, 600k in the UK alone (+2M others). Expect 14M books published in the US in 2012.
None of this will scale! (And the LOC knows it.)
The thing and the name of the thing are not the same thing!

Both MARC and the FOND reference change!

RG 4353.664 550-670, Robert Warren

Page 18

We have to make this work together:

Taxonomies (Canada is a realm, a constitutional monarchy, a confederation, a chunk of land and a Dominion. Newfoundland used to be a Dominion but is now a Province and part of Canada.)
Good quality, cross referenced information systems are coming and the data deluge will be replaced with the meta-meta-data deluge.
Cultural translation versus word translation (see wikipedia)
We urgently need a super-class to the “bag-o-words” model.
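The taxonomy point can be illustrated with a small Python sketch in which one entity carries several types at once and some types hold only for a period. The `TypeAssertion` tuple and `kinds_of` helper are hypothetical names. The 1949 boundary is the year Newfoundland joined Canada, and the undated Canadian types are left open-ended purely as a simplification.

```python
from typing import List, NamedTuple, Optional


class TypeAssertion(NamedTuple):
    """Entity has this kind for start <= year < end (None = open-ended)."""
    entity: str
    kind: str
    start: Optional[int] = None
    end: Optional[int] = None


ASSERTIONS = [
    TypeAssertion("Canada", "Realm"),
    TypeAssertion("Canada", "Constitutional monarchy"),
    TypeAssertion("Canada", "Confederation"),
    TypeAssertion("Canada", "Dominion"),
    TypeAssertion("Newfoundland", "Dominion", end=1949),
    TypeAssertion("Newfoundland", "Province", start=1949),
]


def kinds_of(entity: str, year: int) -> List[str]:
    """Return every kind asserted for the entity in the given year."""
    return [
        a.kind
        for a in ASSERTIONS
        if a.entity == entity
        and (a.start is None or a.start <= year)
        and (a.end is None or year < a.end)
    ]


print(kinds_of("Canada", 2000))
print(kinds_of("Newfoundland", 1930))
print(kinds_of("Newfoundland", 2000))
```

A single-parent taxonomy would force a choice between these kinds, while a list of dated assertions keeps all of them.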

Conclusion

Questions?
