March 2006 NaCTeM – Ray R. Larson
Prof. Ray R. Larson
University of California, Berkeley, School of Information
Metadata as Infrastructure for Information Retrieval and Text Mining
Overview
• Metadata as Infrastructure
  – What, Where, When and Who?
• What are Entry Vocabulary Indexes?
  – Notion of an EVI
  – How are EVIs Built
• Time Period Directories
  – Mining Metadata for new metadata
Metadata as Infrastructure
The difference between memorization and understanding lies in knowing the context and relationships of whatever is of interest. When setting out to learn about a new topic, a well-tested practice is to follow the traditional “5Ws and the H”: Who?, What?, When?, Where?, Why?, and How?
The reference collections of paper-based libraries provide a structured environment for resources, with encyclopedias and subject catalogs, gazetteers, chronologies, and biographical dictionaries, offering direct support for at least What, Where, When, and Who.
The digital environment does not yet provide an effective, and easily exploited, infrastructure comparable to the traditional reference library.
What?
Searching texts by topic, e.g. Dewey, LCSH, any subject index, or category scheme applied to documents.
Two kinds of mapping in every search:
• Documents are assigned to topic categories, e.g. Dewey
• Queries have to map to topic categories, e.g. Dewey’s Relativ Index from ordinary words/phrases to Decimal Classification numbers.
Also mapping between topic systems, e.g. US Patent classification and International Patent Classification.
‘What’ searches involve mapping to controlled vocabularies
[Diagram labels: Texts; Thesaurus/Ontology]
Start with a collection of documents.
Classify and index with a controlled vocabulary, or use a pre-indexed collection.
[Diagram label: Index]
Problem: Controlled vocabularies can be difficult for people to use.
Index term: “pass mtr veh spark ign eng”
In Library of Congress subject headings, use “Economic Policy” for “Wirtschaftspolitik”.
Solution: Entry Level Vocabulary Indexes.
EVI: “pass mtr veh spark ign eng” = “Automobile”
[Diagram labels: Index; EVI]
“What” and Entry Vocabulary Indexes
EVIs are a means of mapping from the user’s vocabulary to the controlled vocabulary of a collection of documents…
Building and Searching EVIs
• User has a question but is unfamiliar with the domain he wants to search.
• User selects a subject domain of interest. Domains to select from: Engineering, Medicine, Biology, Social science, etc.
• Has an Entry Vocabulary Module been built?
  – YES: Use an existing EVI.
  – NO: Build one.
Building an Entry Vocabulary Module (EVI)
• Internet DB indexed with a controlled vocabulary.
• Download a set of training data.
• Part of speech tagging (for noun phrases).
• Extract terms (words and noun phrases) from titles and abstracts.
• Build associations between extracted terms & controlled vocabularies.
Searching
• Map user’s query to ranked list of controlled vocabulary terms.
• User selects search terms from the ranked list of terms returned by the EVI.
Technical Details
Building an Entry Vocabulary Module (EVI)
• Internet DB indexed with a controlled vocabulary.
• Download a set of training data.
• Part of speech tagging (for noun phrases).
• Extract terms (words and noun phrases) from titles and abstracts.
• Build associations between extracted terms & controlled vocabularies.
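The term-extraction step can be sketched roughly as follows. This is a minimal illustration, not the tagger-based method the slides describe: a small stopword list stands in for part of speech tagging, and runs of consecutive non-stopwords are treated as candidate phrases. The names `STOPWORDS` and `extract_terms`, and the sample title, are assumptions made for this example.

```python
import re

# Tiny illustrative stopword list (an assumption; the slides use a
# part of speech tagger to find noun phrases instead).
STOPWORDS = {"a", "an", "and", "the", "of", "in", "for", "on", "with", "to", "by"}

def extract_terms(text):
    """Return (words, phrases) found in a title or abstract.

    words:   every non-stopword token
    phrases: runs of two or more consecutive non-stopword tokens
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    words, phrases, run = [], [], []
    for tok in tokens + [None]:          # the trailing None flushes the last run
        if tok is not None and tok not in STOPWORDS:
            words.append(tok)
            run.append(tok)
        else:
            if len(run) > 1:
                phrases.append(" ".join(run))
            run = []
    return words, phrases

# Hypothetical title used only to exercise the function.
words, phrases = extract_terms(
    "Emission control for spark ignition engines in passenger vehicles"
)
```

Both lists would then be paired with the controlled-vocabulary terms assigned to the same record.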
Association Measure
          C    ¬C
   t      a    b
  ¬t      c    d
Where t is the occurrence of a term and C is the occurrence of a class in the training set.
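One way to fill in the four cells for a single term/class pair is sketched below. The record layout (a set of extracted terms plus a set of assigned classes per training record) and the toy records are assumptions for illustration only.

```python
def contingency(records, term, cls):
    """Count the cells a, b, c, d for one (term, class) pair.

    records: iterable of (terms, classes) pairs, one per training record,
             where both members are sets of strings.
    """
    a = b = c = d = 0
    for terms, classes in records:
        has_t = term in terms
        has_c = cls in classes
        if has_t and has_c:
            a += 1          # term present, class assigned
        elif has_t:
            b += 1          # term present, class not assigned
        elif has_c:
            c += 1          # term absent, class assigned
        else:
            d += 1          # neither
    return a, b, c, d

# Toy training records, invented for this example.
RECORDS = [
    ({"automobile", "engine"}, {"pass mtr veh spark ign eng"}),
    ({"automobile", "safety"}, {"pass mtr veh spark ign eng"}),
    ({"automobile", "insurance"}, {"insurance"}),
    ({"policy"}, {"insurance"}),
]

cells = contingency(RECORDS, "automobile", "pass mtr veh spark ign eng")
```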
Association Measure
Maximum Likelihood ratio

$$W(C,t) = 2\left[\log L(p_1, a, a+b) + \log L(p_2, c, c+d) - \log L(p, a, a+b) - \log L(p, c, c+d)\right]$$

where

$$\log L(p, k, n) = k \log p + (n - k) \log(1 - p)$$

and

$$p_1 = \frac{a}{a+b}, \qquad p_2 = \frac{c}{c+d}, \qquad p = \frac{a+c}{a+b+c+d}$$

Viz. Dunning
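The measure can be computed directly from the four cell counts. The sketch below assumes that both rows of the table are non-empty (a+b > 0 and c+d > 0) and uses the usual convention that 0·log 0 = 0; the function names are chosen for this example.

```python
import math

def _xlogy(x, y):
    """x * log(y), with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)

def log_l(p, k, n):
    """Binomial log-likelihood kernel: k*log(p) + (n - k)*log(1 - p)."""
    return _xlogy(k, p) + _xlogy(n - k, 1 - p)

def association(a, b, c, d):
    """Likelihood ratio weight W(C, t) from contingency cells a, b, c, d.

    Assumes a + b > 0 and c + d > 0.
    """
    p1 = a / (a + b)                    # P(C | t)
    p2 = c / (c + d)                    # P(C | not t)
    p = (a + c) / (a + b + c + d)       # P(C) overall
    return 2 * (log_l(p1, a, a + b) + log_l(p2, c, c + d)
                - log_l(p, a, a + b) - log_l(p, c, c + d))
```

When the class is equally likely with and without the term, the weight is zero; it grows as the two conditional rates move apart, so sorting classes by this weight gives the ranked list an EVI returns.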
Alternatively
Because the “evidence” terms in EVIs can be considered a document, you can also use IR techniques and use the top-ranked classes for classification or query expansion
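A minimal sketch of that alternative: each class’s evidence terms form a pseudo-document, and classes are ranked against the query by TF-IDF cosine similarity. The weighting scheme is one common choice picked for illustration, not necessarily the one used in the project, and the evidence data is invented.

```python
import math
from collections import Counter

def rank_classes(evidence, query_terms):
    """Rank classes by cosine similarity between query and evidence terms.

    evidence: dict mapping class label -> list of evidence terms
    """
    n = len(evidence)
    df = Counter(t for terms in evidence.values() for t in set(terms))
    # Smoothed inverse document frequency (an illustrative choice).
    idf = {t: math.log((n + 1) / (count + 1)) + 1 for t, count in df.items()}

    def vec(terms):
        tf = Counter(t for t in terms if t in idf)
        return {t: f * idf[t] for t, f in tf.items()}

    def cosine(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        nu = math.sqrt(sum(w * w for w in u.values()))
        nv = math.sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    q = vec(query_terms)
    scores = {c: cosine(q, vec(terms)) for c, terms in evidence.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Invented evidence terms for two classes.
EVIDENCE = {
    "pass mtr veh spark ign eng": ["automobile", "car", "engine", "automobile"],
    "Economic Policy": ["economy", "policy", "government"],
}
ranked = rank_classes(EVIDENCE, ["automobile", "engine"])
```

The top-ranked labels can be offered to the user for selection or appended to the query as expansion terms.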
Find: Plutonium
In: Arabic, Chinese, Greek, Japanese, Korean, Russian, Tamil
Statistical association: $W(c,t) = 2[\log L(p_1, a, a+b) + \ldots]$
Digital library resources
EVI example
User query: “Automobile”
• EVI 1 → Index term: “pass mtr veh spark ign eng”
• EVI 2 → Index term: “automobiles” OR “internal combustion engines”
But why stop there?
Index
EVI
“Which EVI do I use?”
[Diagram: several indexes and their EVIs]
EVI to EVIs
[Diagram: several indexes and their EVIs, plus EVI2]
Find: Plutonium
In: Arabic, Chinese, Greek, Japanese, Korean, Russian, Tamil
Why not treat language the same way?
It is also difficult to move between different media forms
[Diagram labels: Texts; Numeric datasets; Thesaurus/Ontology; EVI]
Searching across data types
Different media can be linked indirectly via metadata, but often (e.g. for socio-economic numeric data series) you also need to specify WHERE to get correct results
But texts associated with numeric data can be mapped as well…
[Diagram labels: Texts; Numeric datasets; captions; Thesaurus/Ontology; EVI; EVI]
EVI to Numeric Data example
[Diagram, steps numbered 1 to 11: new query; EVI; LCSH; MARC; search results; captions; numeric table; numeric database; online catalog; search interface 1; search interface 2]
But there are also geographic dependencies…
[Diagram labels: Texts; Numeric datasets; captions; Maps/Geo Data; Thesaurus/Ontology; EVI; EVI]
WHERE: Place names are problematic…
• Variant forms: St. Petersburg, Санкт Петербург, Saint-Pétersbourg, . . .
• Multiple names: Cluj, in Romania / Roumania / Rumania, is also called Klausenburg and Kolozsvar.
• Name changes: Bombay → Mumbai.
• Homographs: Vienna, VA, and Vienna, Austria; 50 Springfields.
• Anachronisms: No Germany before 1870.
• Vague, e.g. Midwest, Silicon Valley.
• Unstable boundaries: 19th century Poland; Balkans; USSR.
Use a gazetteer!
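A gazetteer lookup that copes with variant forms, multiple names and homographs might look like the sketch below. The miniature gazetteer, its identifiers and the rounded coordinates are approximate, illustrative values, not data from any real gazetteer service.

```python
import unicodedata
from collections import defaultdict

def normalize(name):
    """Lower-case, strip accents, and turn punctuation into spaces."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    kept = "".join(ch if ch.isalnum() else " " for ch in no_marks.lower())
    return " ".join(kept.split())

# Hypothetical entries; coordinates are rough and for illustration only.
GAZETTEER = [
    {"id": "g1", "names": ["Cluj", "Klausenburg", "Kolozsvár"],
     "within": "Romania", "lat": 46.77, "lon": 23.59},
    {"id": "g2", "names": ["Mumbai", "Bombay"],
     "within": "India", "lat": 19.08, "lon": 72.88},
    {"id": "g3", "names": ["Vienna"],
     "within": "Austria", "lat": 48.21, "lon": 16.37},
    {"id": "g4", "names": ["Vienna"],
     "within": "Virginia, USA", "lat": 38.90, "lon": -77.26},
]

# Index every name variant under its normalized form.
INDEX = defaultdict(list)
for entry in GAZETTEER:
    for variant in entry["names"]:
        INDEX[normalize(variant)].append(entry)

def lookup(name):
    """Return all gazetteer entries whose names match, possibly several."""
    return INDEX.get(normalize(name), [])
```

A homograph such as “Vienna” returns more than one entry, so the caller still has to disambiguate, for example by the containing region or by proximity to other places named in the same document.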
WHERE. Geo-temporal search interface. Place names found in documents. Gazetteer provided lat. & long. Places displayed on map.
Timebar
Zoom on map. Click on place for a list of records. Click on record to display text.
Catalogs and gazetteers should talk to each other!
Geographic sort / display of catalog search result.
Catalog search
Gazetteer search
So geographic search becomes part of the infrastructure
[Diagram labels: Texts; Numeric datasets; captions; Maps/Geo Data; Gazetteers; Thesaurus/Ontology; EVI]
WHEN: Search by time is also weakly supported…
• Calendars are the standard for time
• But people use the names of events to refer to time periods
• Named time periods resemble place names in being:
  – Unstable: European War, Great War, First World War
  – Multiple: Second World War, Great Patriotic War
  – Ambiguous: “Civil war” in different centuries in England, USA, Spain, etc.
• Places have temporal aspects & periods have geographical aspects: when the Stone Age was varies by region
Similarity between place names and period names
Suggests a similar solution: A gazetteer-like Time Period Directory.
• Gazetteer: Place name – Type – Spatial markers (Lat & long) – When
• Time Period Directory: Period name – Type – Time markers (Calendar) – Where
Note the symmetry in the connections between Where and When.
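The symmetry can be made concrete with two parallel record types and a query that combines Where and When. The field names, the type labels and the two sample periods (including the simplification of “Europe” as the place for the First World War) are assumptions for illustration.

```python
from typing import List, NamedTuple, Optional, Tuple

class Place(NamedTuple):
    name: str
    place_type: str
    lat: float                                  # spatial marker
    lon: float
    when: Optional[Tuple[int, int]] = None      # years the name applies, if known

class Period(NamedTuple):
    name: str
    period_type: str
    begin: int                                  # time markers (calendar years)
    end: int
    where: List[str]                            # associated place names

# Illustrative directory entries.
PERIODS = [
    Period("Siege of Magdeburg", "Period of Conflict", 1550, 1551,
           ["Magdeburg (Germany)"]),
    Period("First World War", "Period of Conflict", 1914, 1918, ["Europe"]),
]

def periods_at(periods, place_name, year):
    """Named periods associated with a place that cover a given year."""
    return [p for p in periods
            if place_name in p.where and p.begin <= year <= p.end]

def places_during(periods, begin, end):
    """Place names attached to any period overlapping [begin, end]."""
    found = []
    for p in periods:
        if p.begin <= end and begin <= p.end:
            found.extend(w for w in p.where if w not in found)
    return found
```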
Solution - Time Period Directories
Initial development involved mining the Library of Congress Subject Authority file for named time periods…
LC MARC Authorities Records
<USMARC>
<Fld001>sh 00000613 </Fld001>
<Fld151><a>Magdeburg (Germany)</a><x>History</x><y>Siege, 1550-1551</y></Fld151>
<Fld550><w>g</w><a>Sieges</a><z>Germany</z></Fld550>
<Fld670><a>Work cat.: 45053442: Besselmeier, S. Warhafftige history vnd beschreibung des Magdeburgischen Kriegs, 1552.</a></Fld670>
<Fld670><a>Cath. encyc.</a><b>(Magdeburg: besieged (1550-51) by the Margrave Maurice of Saxony)</b></Fld670>
<Fld670><a>Ox. encyc. reformation</a><b>(Magdeburg: ... during the 1550-1551 siege of Magdeburg ...)</b></Fld670>
</USMARC>
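Mining a record like this for a named time period can be sketched with the Python standard library. The tag names follow the record shown above; the function name, the returned dictionary layout and the date pattern (a simple “YYYY-YYYY” range in the chronological subdivision) are assumptions, and real authority records will need more cases.

```python
import re
import xml.etree.ElementTree as ET

# Abridged copy of the record above.
RECORD = """<USMARC><Fld001>sh 00000613 </Fld001>
<Fld151><a>Magdeburg (Germany)</a><x>History</x><y>Siege, 1550-1551</y></Fld151>
<Fld550><w>g</w><a>Sieges</a><z>Germany</z></Fld550></USMARC>"""

def named_period(xml_text):
    """Extract place, period label and date range from one authority record.

    Returns None when the record has no chronological subdivision (<y>).
    """
    root = ET.fromstring(xml_text)
    heading = root.find("Fld151")
    if heading is None:
        return None
    chrono = heading.findtext("y")
    if chrono is None:
        return None
    match = re.search(r"(\d{3,4})\s*-\s*(\d{3,4})", chrono)
    return {
        "id": (root.findtext("Fld001") or "").strip(),
        "place": heading.findtext("a"),
        "label": chrono,
        "begin": int(match.group(1)) if match else None,
        "end": int(match.group(2)) if match else None,
        "broader": root.findtext("Fld550/a"),   # broader heading, if present
    }

period = named_period(RECORD)
```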
timePeriodEntry: Time Period Directory Instance. Contains components described below.
• periodID: Unique identifier
• periodName: Period name, can be repeated for alternative names
  – Information about language, script, transliteration scheme
  – Source information and notes (where was the period name mentioned)
• descriptiveNotes: Description of time period
• dates: Calendar and date format
  – Begin & end date (exact, earliest, latest, most-likely, advocated-by-source, ongoing)
  – Notes, sources
• periodClassification: Period type, e.g. Period of Conflict, Art movement
  – Can plug in different classification schemes
  – Can be repeated for several classifications
• location: Places associated with the time period
  – Contains both place name and entry to a gazetteer providing more specific place information like latitude / longitude coordinates
  – Can plug in different location indicators (e.g. ADL gazetteer, Getty Thesaurus of Geographic Names)
  – Recently added coordinates for direct use
• relatedPeriod: Related time periods
  – periodID of related periods
  – Information about relationship type (part-of, successor etc.)
  – Can plug in different relationship type schemes
• entryMetadata: Notes about creator / creation of instance
  – Entry date
  – Modification date
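As a rough picture of how these components fit together, the sketch below models one entry with Python dataclasses. The component names follow the list above, but the Python field names, the types and the sample identifier are assumptions; the actual directory schema is not shown in the slides.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional

@dataclass
class Dates:
    begin: str
    end: str
    calendar: Optional[str] = None
    notes: str = ""

@dataclass
class Location:
    place_name: str
    gazetteer_ref: Optional[str] = None   # e.g. an ADL or Getty TGN identifier
    lat: Optional[float] = None           # coordinates for direct use
    lon: Optional[float] = None

@dataclass
class RelatedPeriod:
    period_id: str
    relationship: str                     # e.g. "part-of", "successor"

@dataclass
class TimePeriodEntry:
    period_id: str
    period_names: List[str]               # repeatable for alternative names
    dates: Dates
    descriptive_notes: str = ""
    classifications: List[str] = field(default_factory=list)
    locations: List[Location] = field(default_factory=list)
    related_periods: List[RelatedPeriod] = field(default_factory=list)
    entry_metadata: dict = field(default_factory=dict)

# Entry built from the Magdeburg authority record; the ID is hypothetical.
entry = TimePeriodEntry(
    period_id="example-0001",
    period_names=["Magdeburg (Germany), History, Siege, 1550-1551"],
    dates=Dates(begin="1550", end="1551"),
    classifications=["Sieges"],
    locations=[Location(place_name="Magdeburg (Germany)")],
)
record = asdict(entry)   # nested dict, ready for serialization
```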
Time periods by named location
Catalog Search Result
Web Interface - Access by map
Zoomable interface gives access to geographically focused info…
Web Interface - Access by timeline
Link initiates search of the Library of Congress catalog for all records relating to this time period.
WHEN and WHAT
These named time periods are derived from Library of Congress catalog subject headings and so can be used for catalog searching, which finds books on topics important for that time period.
Time period directories link via the place (or time)
[Diagram labels: Texts; Numeric datasets; captions; Maps/Geo Data; Gazetteers; Thesaurus/Ontology; EVI; Time Period Directory; Time lines, Chronologies]
WHEN, WHERE and WHO
Catalog records found from a time period search commonly include names of persons important at that time. Their names can be forwarded to, e.g., biographies in the Wikipedia encyclopedia.
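Forwarding a name could be as simple as building an article URL from it. The sketch below assumes the catalog gives names in “Surname, Forename” order and that the article title is the name in direct order; neither is guaranteed, and the target page may not exist. The function names are chosen for this example.

```python
from urllib.parse import quote

def direct_order(heading):
    """Turn 'Surname, Forename' into 'Forename Surname'; leave other forms alone."""
    if "," not in heading:
        return heading.strip()
    surname, forename = (part.strip() for part in heading.split(",", 1))
    return f"{forename} {surname}"

def wikipedia_url(name, lang="en"):
    """Build a Wikipedia article URL for a personal name (spaces become underscores)."""
    title = quote(name.replace(" ", "_"))
    return f"https://{lang}.wikipedia.org/wiki/{title}"

url = wikipedia_url(direct_order("Goldberg, Emanuel"))
```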
Place and time are broadly important across numerous tools and genres including, e.g., language atlases, library catalogs, biographical dictionaries, bibliographies, archival finding aids, museum records, etc.
Biographical dictionaries are heavy on place and time: Emanuel Goldberg, Born Moscow 1881. PhD under Wilhelm Ostwald, Univ. of Leipzig, 1906. Director, Zeiss Ikon, Dresden, 1926-33. Moved to Palestine 1937. Died Tel Aviv, 1970.
Life as a series of episodes involving Activity (WHAT), WHERE, WHEN, and WHO else.
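The episode view of a life maps naturally onto a small record type. The sketch below encodes the Goldberg entry from the slide; the `Episode` type and its field names are assumptions made for illustration.

```python
from typing import NamedTuple, Optional

class Episode(NamedTuple):
    what: str                    # activity
    where: Optional[str]         # place name, resolvable via a gazetteer
    when: str                    # date or date range as given in the source
    who: Optional[str] = None    # other person involved, if any

# Episodes transcribed from the biographical summary above.
GOLDBERG = [
    Episode("Born", "Moscow", "1881"),
    Episode("PhD", "Univ. of Leipzig", "1906", who="Wilhelm Ostwald"),
    Episode("Director, Zeiss Ikon", "Dresden", "1926-33"),
    Episode("Moved", "Palestine", "1937"),
    Episode("Died", "Tel Aviv", "1970"),
]

def places(episodes):
    """Place names in order of appearance, for plotting on a map."""
    return [e.where for e in episodes if e.where]

def associates(episodes):
    """Other people named in the episodes, for forwarding to biographies."""
    return [e.who for e in episodes if e.who]
```

Each facet of an episode then links to its own reference resource: places to a gazetteer, dates to a time period directory, people to other biographical entries.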
A new form of biographical dictionary would link to all
[Diagram labels: Texts; Numeric datasets; captions; Maps/Geo Data; Gazetteers; Thesaurus/Ontology; EVI; Time Period Directory; Time lines, Chronologies; Biographical Dictionary]
A Metadata Infrastructure
CATALOGS: Archives, Historical Societies, Libraries, Museums, Public Television, Publishers, Booksellers
RESOURCES: Audio, Images, Numeric Data, Objects, Texts, Virtual Reality, Webpages
INTERMEDIA INFRASTRUCTURE:
  Facet  | Authority Control       | Special Display Tools
  WHAT   | Thesaurus               | Syndetic Structure
  WHERE  | Gazetteer               | Maps
  WHEN   | Time Period Directory   | Timelines
  WHO    | Biographical Dictionary | Text and Images
[Diagram labels: Learners; Dossiers]
Acknowledgements
• Electronic Cultural Atlas Initiative project
• This work was partially supported by the Institute of Museum and Library Services through a National Leadership Grant for Libraries, award number LG-02-04-0041-04, Oct 2004 - Sept 2006, entitled “Supporting the Learner: What, Where, When and Who”. See: http://ecai.org/imls2004
• Michael Buckland, Fred Gey, Vivien Petras, Matt Meiske, Kim Carl
Contact: ray@sims.berkeley.edu