2010.04.05 - SLIDE 1 IS 240 – Spring 2010 Prof. Ray Larson University of California, Berkeley School of Information Principles of Information Retrieval Lecture 19: DLs and GIR

2010.04.05 - SLIDE 1IS 240 – Spring 2010

Prof. Ray Larson University of California, Berkeley

School of Information

Principles of Information Retrieval

Lecture 19: DLs and GIR

2010.04.05 - SLIDE 2IS 240 – Spring 2010

Today

• Digital Libraries and IR

• Image Retrieval in DL
  – From paper presented at the 1999 ASIS Annual Meeting

• More on Geographic Information Retrieval

2010.04.05 - SLIDE 3IS 240 – Spring 2010

UCB Digital Library Project: Research Agenda

• Funded by NSF/NASA/DARPA Digital Library Initiative (Phases I and II) ~1993-2004

• Research agenda
  – Understand user needs.
  – Extend functionality of documents.
    • “Enliven” legacy documents.
  – Improve access to information.
  – Scale to large systems.
  – Re-Invent Scholarly Information Access and Use

2010.04.05 - SLIDE 4IS 240 – Spring 2010

Testbed: An Environmental Digital Library

• Collection: Diverse material relevant to California’s key habitats.

• Users: A consortium of state agencies, development corporations, private corporations, regional government alliances, educational institutions, and libraries.

• Potential: Impact on state-wide environmental system (CERES)

2010.04.05 - SLIDE 5IS 240 – Spring 2010

The Environmental Library - Users/Contributors

• California Resources Agency, California Environment Resources Evaluation System (CERES)

• California Department of Water Resources

• The California Department of Fish & Game

• SANDAG

• UC Water Resources Center Archives

• New Partners: CDL and SDSC

2010.04.05 - SLIDE 6IS 240 – Spring 2010

The Environmental Library - Contents

• Environmental technical reports, bulletins, etc.
• County general plans
• Aerial and ground photography
• USGS topographic maps
• Land use and other special purpose maps
• Sensor data
• “Derived” information
• Collection databases for the classification and distribution of the California biota (e.g., SMASCH)
• Supporting 3-D, economic, traffic, etc. models
• Videos collected by the California Resources Agency

2010.04.05 - SLIDE 7IS 240 – Spring 2010

The Environmental Library - Contents

• As of mid 1999, the collection represents about three quarters of a terabyte of data, including over 70,000 digital images, over 300,000 pages of environmental documents, and over a million records in geographical and botanical databases.

2010.04.05 - SLIDE 8IS 240 – Spring 2010

Botanical Data:

• The CalFlora Database contains taxonomic and distribution information for more than 8000 native California plants. The Occurrence Database includes over 300,000 records of California plant sightings from many federal, state, and private sources. The botanical databases are linked to our CalPhotos collection of California plants, and are also linked to external collections of data, maps, and photos.

2010.04.05 - SLIDE 9IS 240 – Spring 2010

Geographical Data:

• Much of the geographical data in our collection is being used to develop our web-based GIS Viewer. The Street Finder uses 500,000 TIGER records of S.F. Bay Area streets along with the 70,000 records from the USGS GNIS database. California Dams is a database of information about the 1395 dams under state jurisdiction. An additional 11 GB of geographical data represents maps and imagery that have been processed for inclusion as layers in our GIS Viewer. This includes Digital Ortho Quads and DRG maps for the S.F. Bay Area.

2010.04.05 - SLIDE 10IS 240 – Spring 2010

Documents:

• Most of the 300,000 pages of digital documents are environmental reports and plans that were provided by California state agencies. This collection includes documents, maps, articles, and reports on the California environment including Environmental Impact Reports (EIRs), educational pamphlets, water usage bulletins, and county plans. Documents in this collection come from the California Department of Water Resources (DWR), California Department of Fish and Game (DFG), San Diego Association of Governments (SANDAG), and many other agencies. Among the most frequently accessed documents are County General Plans for every California county and a survey of 125 Sacramento Delta fish species.

2010.04.05 - SLIDE 11IS 240 – Spring 2010

Documents - cont.

• The collection also includes about 20 MB of full-text (HTML) documents from the World Conservation Digital Library. In addition to providing online access to important environmental documents, the document collection is the testbed for our Multivalent Document research.

2010.04.05 - SLIDE 12IS 240 – Spring 2010

Photographs:

• The photo collection includes 17,000 images of California natural resources from the state Department of Water Resources; several hundred aerial photos; 17,000 photos of California native plants from St. Mary's College, the California Academy of Sciences, and others; a small collection of California animals; and 40,000 Corel stock photos.

2010.04.05 - SLIDE 13IS 240 – Spring 2010

Testbed Success Stories

• LUPIN: CERES’ Land Use Planning Information Network
  – California County General Plans and other environmental documents.
  – Enter at Resources Agency Server, documents stored at and retrieved from UCB DLIB server.

• California flood relief efforts
  – High demand for some data sets only available on our server (created by document recognition).

• CalFlora: Creation and interoperation of repositories pertaining to plant biology.

• Cloning of services at Cal State Library, FBI

2010.04.05 - SLIDE 14IS 240 – Spring 2010

Research Highlights

• Documents
  – Multivalent Document prototype
    • Page images, structured documents, GIS data, photographs

• Intelligent Access to Content
  – Document recognition
  – Vision-based Image Retrieval: stuff, thing, scene retrieval
  – Natural Language Processing: categorizing the web, Cheshire II, TileBar Interfaces

2010.04.05 - SLIDE 15IS 240 – Spring 2010

User Interface Paradigms: Multivalent Documents

• An approach to new document types and their authoring.

• Supports active, distributed, composable transformations of multimedia documents.

• Enables sophisticated annotations, intelligent result handling, user-modifiable interface, composite documents.

2010.04.05 - SLIDE 16IS 240 – Spring 2010

Multivalent Documents

[Figure: a multivalent document drawn as layers stacked over a scanned page image: Cheshire Layer, OCR Layer, OCR Mapping Layer, GIS Layer, and Table Layer, connected to Network Protocols & Resources.]

Valence: 2: The relative capacity to unite, react, or interact (as with antigens or a biological substrate).
– Webster’s 7th Collegiate Dictionary

2010.04.05 - SLIDE 17IS 240 – Spring 2010

2010.04.05 - SLIDE 18IS 240 – Spring 2010

GIS in the MVD Framework

• Layers are georeferenced data sets.
• Behaviors are
  – display semi-transparently
  – pan
  – zoom
  – issue query
  – display context
  – “spatial hyperlinks”
  – annotations
• Written in Java (to be merged with MVD-1 code line?)

2010.04.05 - SLIDE 19IS 240 – Spring 2010

GIS Viewer Example http://elib.cs.berkeley.edu/annotations/gis/buildings.html

2010.04.05 - SLIDE 20IS 240 – Spring 2010

Overview of Cheshire II

• The Cheshire II system is intended to provide an easy-to-use, standards-compliant system capable of retrieving any type of information in a wide variety of settings.

2010.04.05 - SLIDE 21IS 240 – Spring 2010

Overview of Cheshire II

• It supports SGML and XML.
• It is a client/server application.
• Uses the Z39.50 Information Retrieval Protocol.
• Server supports a Relational Database Gateway.
• Supports Boolean searching of all servers.
• Supports probabilistic ranked retrieval in the Cheshire search engine.
• Search engine supports “nearest neighbor” searches and relevance feedback.
• GUI interface on X window displays.
• WWW/CGI forms interface for DL, using combined client/server CGI scripting via WebCheshire.
• Image Content retrieval using BlobWorld
• Support for the SDLIP (Simple Digital Library Interoperability Protocol) for search and as Z39.50 Gateway

2010.04.05 - SLIDE 22IS 240 – Spring 2010

Cheshire II Searching

[Figure: Cheshire II searching local and remote collections (images, scanned text) across the Internet, with every connection made over Z39.50.]

2010.04.05 - SLIDE 23IS 240 – Spring 2010

Current Usage of Cheshire II

• Web clients for:

– NSF/NASA/ARPA Digital Library

• Includes support for full-text and page-level search.

• Experimental Blob-World image search

– SunSite

– University of Liverpool.

– University of Essex, HDS (part of AHDS)

– California Sheet Music Project

– Cha-Cha (Berkeley Intranet Search Engine)

– Univ. of Virginia

• Cheshire ranking algorithm is basis for Inktomi (i.e., Yahoo, Hotbot, MSN? and others)

2010.04.05 - SLIDE 24IS 240 – Spring 2010

Image Retrieval Research

• Finding “Stuff” vs “Things”

• BlobWorld

• Other Vision Research

2010.04.05 - SLIDE 25IS 240 – Spring 2010

Blobworld: use regions for retrieval

• We want to find general objects → represent images based on coherent regions

2010.04.05 - SLIDE 26IS 240 – Spring 2010

Outline

• Why regions?

• Creating Blobworld: segmentation and description

• Using Blobworld: query experiments

• Indexing blobs for faster querying

• Conclusions

2010.04.05 - SLIDE 27IS 240 – Spring 2010

Creating and using Blobworld

[Figure: processing pipeline. Create: extract features → segment image → describe regions. Use: query.]

2010.04.05 - SLIDE 28IS 240 – Spring 2010

Extract features for each pixel

• Color
  – Take average color (L*a*b*) at the selected scale → ignore local color variations due to texture
  – “zebra = gray horse + stripes”

• Texture
  – Find contrast, anisotropy, polarity at the selected scale

• Position

2010.04.05 - SLIDE 29IS 240 – Spring 2010

Find groups in feature space

• Model feature distribution as a mixture of Gaussians using Expectation-Maximization (EM)

2010.04.05 - SLIDE 30IS 240 – Spring 2010

Find regions in the image

• Label each pixel based on its Gaussian cluster

• Find connected components → regions

[Figure: image pixels labeled with cluster numbers 1 to 4 and grouped into connected regions.]
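The two steps on this slide (label pixels by cluster, then split each cluster into connected pieces) can be sketched as a flood fill. This is a minimal illustration, not the Blobworld code: the function name `connected_regions`, the use of 4-connectivity, and the tiny label grid are all assumptions made for the example.

```python
from collections import deque


def connected_regions(labels):
    """Split a 2-D grid of per-pixel cluster labels into 4-connected regions.

    Returns a grid of region ids (0, 1, 2, ...) with the same shape.
    """
    rows, cols = len(labels), len(labels[0])
    region = [[-1] * cols for _ in range(rows)]
    next_id = 0
    for r in range(rows):
        for c in range(cols):
            if region[r][c] != -1:
                continue  # pixel already belongs to a region
            # Flood-fill every pixel reachable through same-cluster neighbours.
            region[r][c] = next_id
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and region[ny][nx] == -1
                            and labels[ny][nx] == labels[y][x]):
                        region[ny][nx] = next_id
                        queue.append((ny, nx))
            next_id += 1
    return region


# Two pixels share cluster 1 but are not adjacent, so they become separate regions.
cluster_labels = [[1, 1, 2],
                  [3, 1, 2],
                  [3, 3, 1]]
print(connected_regions(cluster_labels))
```

Note that one Gaussian cluster can yield several regions, which is why the region count is usually larger than K.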

2010.04.05 - SLIDE 31IS 240 – Spring 2010

Describe regions by color, texture, shape

• Color
  – Color histogram within region
  – Quadratic distance: encode similarity between color bins

$$d^2_{\mathrm{hist}}(x, y) = (x - y)^{T} A \, (x - y)$$

• Texture
  – Mean contrast and anisotropy: stripes vs. spots vs. smooth

• (Basic) Shape
  – Fourier descriptors of contour

2010.04.05 - SLIDE 32IS 240 – Spring 2010

Select appropriate scale for processing

• Polarity: do all the gradient vectors point in the same direction?

• Choose scale where polarity stabilizes → include one approximate period

2010.04.05 - SLIDE 33IS 240 – Spring 2010

Initialize means using image data

• Before, we picked random initialization
• Now, choose initial means based on image tiles
• Add noise to means and restart EM (4 runs per K)

[Figure: segmentations for K = 2, K = 3, K = 4, K = 5.]

2010.04.05 - SLIDE 34IS 240 – Spring 2010

Grouping: Expectation-Maximization

• Given class characteristics (μ, Σ), find class membership
• Given class membership, find class characteristics (μ, Σ)
• Iterate

[Figure: loop alternating between “update labels” and “update μ, Σ”.]

2010.04.05 - SLIDE 35IS 240 – Spring 2010

How many Gaussians?

• Model selection: Minimum Description Length
  – Prefer fewer Gaussians if performance is comparable

[Figure: alternative segmentations of the same image compared side by side.]
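The slide does not give a formula for the trade-off. One standard MDL-style criterion for Gaussian mixtures, stated here as an assumption rather than as the exact rule used in the lecture, picks the number of components K that maximizes the log-likelihood minus a penalty that grows with the number of free parameters:

$$\hat{K} = \arg\max_K \left[ \log L(x \mid \Theta_K) - \frac{m_K}{2} \log N \right], \qquad m_K = (K - 1) + K d + K \frac{d(d+1)}{2}$$

Here N is the number of pixels, d is the feature dimension, and m_K counts the mixing weights, the means, and the entries of the full covariance matrices. Adding a Gaussian is accepted only if the likelihood gain outweighs the extra (m_{K+1} − m_K)/2 · log N penalty.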

2010.04.05 - SLIDE 36IS 240 – Spring 2010

Find groups in feature space

• Model feature distribution as a mixture of Gaussians using Expectation-Maximization (EM)

2010.04.05 - SLIDE 37IS 240 – Spring 2010

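EM math

Probability density:

$$f(x \mid \Theta) = \sum_{i=1}^{K} \alpha_i f_i(x \mid \theta_i), \qquad f_i(x \mid \theta_i) = \frac{1}{(2\pi)^{d/2} \det(\Sigma_i)^{1/2}} \, e^{-\frac{1}{2} (x - \mu_i)^{T} \Sigma_i^{-1} (x - \mu_i)}$$

Update equations:

$$\alpha_i^{\mathrm{new}} = \frac{1}{N} \sum_{j=1}^{N} p(i \mid x_j, \Theta^{\mathrm{old}})$$

$$\mu_i^{\mathrm{new}} = \frac{\sum_{j=1}^{N} x_j \, p(i \mid x_j, \Theta^{\mathrm{old}})}{\sum_{j=1}^{N} p(i \mid x_j, \Theta^{\mathrm{old}})}$$

$$\Sigma_i^{\mathrm{new}} = \frac{\sum_{j=1}^{N} p(i \mid x_j, \Theta^{\mathrm{old}}) \, (x_j - \mu_i^{\mathrm{new}}) (x_j - \mu_i^{\mathrm{new}})^{T}}{\sum_{j=1}^{N} p(i \mid x_j, \Theta^{\mathrm{old}})}$$

where

$$p(i \mid x_j, \Theta^{\mathrm{old}}) = \frac{\alpha_i f_i(x_j \mid \theta_i)}{\sum_{k=1}^{K} \alpha_k f_k(x_j \mid \theta_k)}$$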
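To make the E-step and M-step concrete, here is a minimal sketch of EM for a mixture of Gaussians in one dimension, so that each covariance matrix reduces to a scalar variance. The function names, the fixed iteration count, and the variance floor are assumptions added for the example; Blobworld itself works on multi-dimensional color, texture, and position features.

```python
import math


def gaussian(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def em_step(data, alphas, means, variances):
    """One EM iteration for a 1-D mixture of K Gaussians."""
    K, N = len(alphas), len(data)
    # E-step: posterior p(i | x_j, Theta_old) for each point j and component i.
    post = []
    for x in data:
        weighted = [alphas[i] * gaussian(x, means[i], variances[i]) for i in range(K)]
        total = sum(weighted)
        post.append([w / total for w in weighted])
    # M-step: re-estimate weights, means and variances from the posteriors.
    new_alphas, new_means, new_vars = [], [], []
    for i in range(K):
        resp = sum(post[j][i] for j in range(N))
        mean = sum(post[j][i] * data[j] for j in range(N)) / resp
        var = sum(post[j][i] * (data[j] - mean) ** 2 for j in range(N)) / resp
        new_alphas.append(resp / N)
        new_means.append(mean)
        new_vars.append(max(var, 1e-6))  # variance floor: an added safeguard
    return new_alphas, new_means, new_vars


def fit(data, initial_means, iterations=50):
    """Run EM from the given initial means with equal weights and unit variances."""
    K = len(initial_means)
    alphas, means, variances = [1.0 / K] * K, list(initial_means), [1.0] * K
    for _ in range(iterations):
        alphas, means, variances = em_step(data, alphas, means, variances)
    return alphas, means, variances


samples = [0.0, 0.1, -0.1, 5.0, 5.1, 4.9]
print(fit(samples, [0.0, 5.0]))
```

The sketch does not handle a component whose total responsibility drops to zero; restarting from perturbed means, as the earlier initialization slide suggests, is one way around that.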

2010.04.05 - SLIDE 38IS 240 – Spring 2010

Encode similarity between color bins

• Quadratic distance

• Distance between histograms x and y:

$$d^2_{\mathrm{hist}}(x, y) = (x - y)^{T} A \, (x - y)$$

• A_ij is based on the similarity between bins i and j
  – Neighboring bins have A_ij = 0.5
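A small sketch of the quadratic distance shows why the matrix A matters. The helper `neighbor_similarity` is hypothetical: it arranges bins in a one-dimensional chain with 1 on the diagonal and 0.5 for adjacent bins, whereas real color bins have neighbors in three dimensions.

```python
def quadratic_distance(x, y, A):
    """Return (x - y)' A (x - y) for histograms x, y and bin-similarity matrix A."""
    diff = [xi - yi for xi, yi in zip(x, y)]
    n = len(diff)
    return sum(diff[i] * A[i][j] * diff[j] for i in range(n) for j in range(n))


def neighbor_similarity(n, off_diagonal=0.5):
    """A[i][i] = 1, A[i][j] = off_diagonal for adjacent bins, 0 otherwise."""
    return [[1.0 if i == j else off_diagonal if abs(i - j) == 1 else 0.0
             for j in range(n)] for i in range(n)]


A = neighbor_similarity(3)
all_in_bin0 = [1.0, 0.0, 0.0]
all_in_bin1 = [0.0, 1.0, 0.0]
all_in_bin2 = [0.0, 0.0, 1.0]
print(quadratic_distance(all_in_bin0, all_in_bin1, A))  # adjacent bins
print(quadratic_distance(all_in_bin0, all_in_bin2, A))  # distant bins
```

With a plain Euclidean distance both pairs would be equally far apart; the off-diagonal entries of A make histograms concentrated in neighboring bins count as more similar.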

2010.04.05 - SLIDE 39IS 240 – Spring 2010

Fourier descriptors for shape

• [Zahn & Roskies ’72, Kuhl & Giardina ’82]

• Find (x,y) representation of outer contour

• Find Fourier series of (x,y)
  – Coefficients specify an ellipse (4 parameters):
  – major axis, minor axis, orientation, starting point

• Remove starting point ambiguity

• Store first ten Fourier coefficients
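The idea can be sketched with a simpler variant than the elliptic descriptors cited on the slide: treat each contour point as a complex number x + iy, take a discrete Fourier transform, and keep only coefficient magnitudes. Taking magnitudes is the shortcut used here to remove the starting-point ambiguity, and dividing by the first magnitude removes scale; both choices, the function name, and the assumption that the contour is traced counterclockwise are illustrative, not the lecture's method.

```python
import cmath


def fourier_descriptors(contour, count=10):
    """Magnitudes of low-order DFT coefficients of a closed (x, y) contour.

    Assumes the contour is traced counterclockwise and has more than
    `count` points.
    """
    points = [complex(x, y) for x, y in contour]
    n = len(points)

    def coefficient(k):
        return sum(p * cmath.exp(-2j * cmath.pi * k * t / n)
                   for t, p in enumerate(points)) / n

    # Skip k = 0 (the centroid) so the result ignores translation.
    mags = [abs(coefficient(k)) for k in range(1, count + 1)]
    scale = mags[0] or 1.0  # normalize by the first harmonic to ignore size
    return [m / scale for m in mags]


square = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (1, 2), (0, 2), (0, 1)]
restarted = square[3:] + square[:3]  # same shape, different starting point
print(fourier_descriptors(square, count=4))
print(fourier_descriptors(restarted, count=4))
```

Both calls print the same values up to rounding, because moving the starting point only changes the phase of each coefficient.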

2010.04.05 - SLIDE 40IS 240 – Spring 2010

Creating and using Blobworld

[Figure: processing pipeline. Create: extract features → segment image → describe regions. Use: query.]

2010.04.05 - SLIDE 41IS 240 – Spring 2010

Querying: let user see the representation

• Current systems are unsatisfying
  – User can’t see what the computer sees
  – Unclear how parameters relate to the image

• User should interact with the representation
  – Helps in query formulation
  – Makes results understandable
  – Minimizes disappointment

• http://elib.cs.berkeley.edu/photos/blobworld

2010.04.05 - SLIDE 42IS 240 – Spring 2010

2010.04.05 - SLIDE 43IS 240 – Spring 2010

2010.04.05 - SLIDE 44IS 240 – Spring 2010

2010.04.05 - SLIDE 45IS 240 – Spring 2010

2010.04.05 - SLIDE 46IS 240 – Spring 2010

2010.04.05 - SLIDE 47IS 240 – Spring 2010

2010.04.05 - SLIDE 48IS 240 – Spring 2010

Query experiments

• Collection of 10,000 Corel stock photos

• Five query images in each of ten categories (e.g., cheetahs, polar bears, airplanes)

• Compare Blobworld to global histogram queries

• Precision (% of retrieved images that are correct) vs. Recall (% of correct images that are retrieved)
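The two measures defined in the last bullet can be computed directly from sets of image identifiers. The function name and the identifiers below are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set against the relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Four images returned for a "tiger" query; five tiger images exist in total.
returned = ["tiger1", "tiger2", "pumpkin1", "tiger3"]
all_tigers = ["tiger1", "tiger2", "tiger3", "tiger4", "tiger5"]
print(precision_recall(returned, all_tigers))
```

Three of the four returned images are tigers, and three of the five tigers were found, so the example yields a precision of 0.75 and a recall of 0.6.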

2010.04.05 - SLIDE 49IS 240 – Spring 2010

Distinctive objects

• Tigers, cheetahs, and zebras:
  – Blobworld does better than global histograms

[Figure panels: cheetahs, zebras.]

2010.04.05 - SLIDE 50IS 240 – Spring 2010

Distinctive objects and backgrounds

• Eagles and black bears:
  – Blobworld does better than global histograms

[Figure panel: black bears.]

2010.04.05 - SLIDE 51IS 240 – Spring 2010

Distinctive scenes

• Airplanes and brown bears:
  – Global histograms do better than Blobworld
  – But Blobworld has room to grow (shape, etc.)

[Figure panel: airplanes.]

2010.04.05 - SLIDE 52IS 240 – Spring 2010

Index to search huge collections

• Indexing is trickier than for traditional data

• We can afford some mistakes: even with full search, we’ll miss some tigers and include some pumpkins

• Two approaches we have tried:
  – Store terms and treat image as a document
  – Store features and index using a tree

• Final (“correct”) ranking of images from index

2010.04.05 - SLIDE 53IS 240 – Spring 2010

Index using conventional IR methods

• Treat each database blob as a document
  – Store “terms” (bins) for color, texture, location, and shape
  – Repeat color terms based on histogram weights

• Index using Cheshire II

• Treat each query blob as a document
  – Repeat “terms” according to query weights
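A rough sketch of the "blob as document" idea: quantized features become pseudo-words, and a color term is repeated in proportion to its histogram weight so that a text engine's term-frequency statistics carry the weight. The term spellings (`color0`, `texture2`, ...), the `repeats` scale factor, and the function name are hypothetical; the slide does not say how Cheshire II encoded them.

```python
def blob_to_terms(color_hist, texture_bin, location_bin, shape_bin, repeats=10):
    """Turn one blob description into a bag of pseudo-words for a text indexer."""
    terms = []
    for bin_id, weight in enumerate(color_hist):
        # Repeat each color term in proportion to its histogram weight.
        terms += [f"color{bin_id}"] * round(weight * repeats)
    terms += [f"texture{texture_bin}", f"location{location_bin}", f"shape{shape_bin}"]
    return terms


print(blob_to_terms([0.5, 0.3, 0.2, 0.0], texture_bin=2, location_bin=7, shape_bin=1))
```

The resulting list can be joined with spaces and handed to any text indexer as the body of a document whose identifier is the blob id.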

2010.04.05 - SLIDE 54IS 240 – Spring 2010

Indexing and Retrieval with Cheshire II

• Originally used the same probabilistic algorithm used for text
  – Blobs are not distributed like text words or stems

• Now using a weighting based on coordination level match with a minimum threshold (must have at least half of the characteristics of the query cluster).

• Still eyeballing data, but seems much better for many types of queries
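Coordination level matching simply counts how many distinct query terms a document shares with the query. The sketch below ranks blobs by that count and drops any blob matching fewer than half of the query terms, as the slide describes; the function name, the tie-breaking by blob id, and the sample terms are assumptions.

```python
def coordination_level_rank(query_terms, blobs, min_fraction=0.5):
    """Rank blobs by how many distinct query terms they share with the query."""
    query = set(query_terms)
    threshold = min_fraction * len(query)
    scored = []
    for blob_id, terms in blobs.items():
        overlap = len(query & set(terms))
        if overlap >= threshold:  # minimum-match threshold
            scored.append((overlap, blob_id))
    # Highest overlap first; ties broken by blob id for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [blob_id for _, blob_id in scored]


query = ["color3", "color4", "texture2", "shape1"]
index = {
    "blob_a": ["color3", "color4", "texture2", "shape1"],
    "blob_b": ["color3", "texture2", "shape5"],
    "blob_c": ["color9", "shape1"],
}
print(coordination_level_rank(query, index))
```

Here `blob_c` shares only one of four query terms and is filtered out, while `blob_a` outranks `blob_b`.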

2010.04.05 - SLIDE 55IS 240 – Spring 2010

2010.04.05 - SLIDE 56IS 240 – Spring 2010

2010.04.05 - SLIDE 57IS 240 – Spring 2010

2010.04.05 - SLIDE 58IS 240 – Spring 2010

2010.04.05 - SLIDE 59IS 240 – Spring 2010

Conclusions

• Image retrieval in general collections requires region segmentation and description

• Blobworld yields high precision in queries for distinctive objects

• Blobworld can be indexed to allow fast querying

2010.04.05 - SLIDE 67IS 240 – Spring 2010

Geographic Information Retrieval and Spatial Browsing

Ray R. Larson

School of Library and Information Studies

University of California, Berkeley

2010.04.05 - SLIDE 68IS 240 – Spring 2010

Concerns for Digital Libraries

• Excellent summary in Distributed Geolibraries from NRC.
  – Distributed resources
  – Distributed users
  – Distributed services

• Access for a broad population is critical for many Digital Libraries

2010.04.05 - SLIDE 69IS 240 – Spring 2010

Concerns for Digital Libraries

• Georeferenced Information (geoinformation) provides one organizational perspective

• Other common perspectives include Topical Classification schemes, Temporal/Historical organization (ECAI)

• DLs can provide multiple views of the same information

2010.04.05 - SLIDE 70IS 240 – Spring 2010

Concerns for Digital Libraries

• Most DLs are intended for a broad user base:
  – varying levels of expertise in the contents
  – varying requirements for access methods
  – simple expressions of interest in natural language should be supported
  – Mapping NL to controlled vocabularies (including Digital Gazetteers)

2010.04.05 - SLIDE 71IS 240 – Spring 2010

Digital Library Needs

• Geographic and Spatial Querying

• Spatial Browsing

• Geographic and Spatial Indexing

• (Berkeley DL contents and examples)

2010.04.05 - SLIDE 72IS 240 – Spring 2010

Overview

• What is Geographic Information Retrieval?

• Geographic and Spatial Querying and Browsing.

• Geographic and Spatial Indexing.

• Examples of GIR Systems and Geographically Indexed Information.

2010.04.05 - SLIDE 73IS 240 – Spring 2010

Introduction

• What is Geographic Information Retrieval?
  – GIR is concerned with providing access to georeferenced information sources. It includes all of the areas of traditional IR research with the addition of spatially and geographically oriented indexing and retrieval.
  – It combines aspects of DBMS research, User Interface Research, GIS research, and Information Retrieval research.

2010.04.05 - SLIDE 74IS 240 – Spring 2010

Introduction

• The need for Geographic and Spatial Information Retrieval.
  – Digital Libraries
    • Sequoia 2000
    • UC Berkeley NSF/NASA/ARPA Digital Library Project
    • UC Santa Barbara Alexandria Project
    • NSDI - National Spatial Data Infrastructure
  – Next-Generation Online Catalogs
    • Cheshire II

2010.04.05 - SLIDE 75IS 240 – Spring 2010

Geographic and Spatial Querying

• Both imply querying on relationships within a particular coordinate system

• Spatial querying is the more general term

• Can be defined as queries about the spatial relationships (intersection, containment, boundary, adjacency, proximity) of entities geometrically defined and located in space

2010.04.05 - SLIDE 76IS 240 – Spring 2010

Geographic and Spatial Querying

• Geographical coordinates are geometric relationships (distance and direction can be measured on a continuous scale)
  – E.g., “5.21 miles north of Champaign”

• Spatial relations may be both geometric and topological (spatially related but without measurable distance or absolute direction)
  – E.g., “inside the city limits”
  – “left side of Beckman Institute”

2010.04.05 - SLIDE 77IS 240 – Spring 2010

Geographic and Spatial Querying

• Types of spatial queries
  – Point-in-polygon: “What do we have at this X,Y point?”
  – Region Queries: “What do we have in this region?”
    • Which point-encoded items lie within the region
    • What lines (borders, etc.) lie within or cross the region
    • What areas overlap the region area

[Figure: map with X and Y axes.]
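The two query types on this slide reduce to small geometric tests. The sketch below uses the classic ray-casting test for point-in-polygon and a bounding-box overlap test as a cheap first filter for region queries; the function names and the choice of bounding boxes instead of exact polygon intersection are assumptions made for illustration.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: is (x, y) inside the polygon given as [(x, y), ...]?"""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray going right from (x, y) cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def boxes_overlap(a, b):
    """Do two (min_x, min_y, max_x, max_y) bounding boxes overlap?"""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


county = [(0, 0), (4, 0), (4, 4), (0, 4)]
print(point_in_polygon(2, 2, county))               # point inside the polygon
print(point_in_polygon(5, 2, county))               # point outside the polygon
print(boxes_overlap((0, 0, 2, 2), (1, 1, 3, 3)))    # regions that overlap
```

Points lying exactly on an edge are not treated specially here; a production system would need a stated convention for that case.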

2010.04.05 - SLIDE 78IS 240 – Spring 2010

Geographic and Spatial Querying

• Types of spatial queries, cont.
  – Distance and Buffer Zone Queries
    • What cities lie within 40 miles of the border of Northern and Southern Ireland?
    • What wetlands lie within 50 miles of London?
  – Path Queries
    • What is the shortest route from San Francisco to Los Angeles?
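A distance query around a point, such as the London example, can be sketched with the haversine great-circle formula. The function names and the approximate coordinates below are illustrative assumptions; a buffer around a line feature such as a border would need a point-to-line distance instead, which this sketch does not cover.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two latitude/longitude points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def within_distance(center, places, miles):
    """Names of places whose point location lies within `miles` of `center`."""
    lat0, lon0 = center
    return [name for name, (lat, lon) in places.items()
            if haversine_miles(lat0, lon0, lat, lon) <= miles]


london = (51.5074, -0.1278)  # approximate coordinates, for illustration only
places = {"Reading": (51.4543, -0.9781), "Manchester": (53.4808, -2.2426)}
print(within_distance(london, places, 50))
```

Only Reading falls inside the 50-mile buffer in this example.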

2010.04.05 - SLIDE 79IS 240 – Spring 2010

Geographic and Spatial Querying

• Types of spatial queries, cont.
  – Multimedia Queries: Use non-map georeferenced information.
    • What are the names of farmers affected by flooding in Monterey and Santa Cruz Counties?

[Figure: document pages p123 and p127.]

2010.04.05 - SLIDE 80IS 240 – Spring 2010

Spatial Browsing

• Combines ad hoc spatial querying with interactive displays

• HyperMap concept

• Pseudo-HyperMaps

2010.04.05 - SLIDE 81IS 240 – Spring 2010

Spatial Browsing

• Advantages:
  – May not need the accuracy of a full GIS
  – Comprehensible searching metaphor for many materials

• Problems:
  – Clutter and differing scales.
  – Requires good (and preferably accurate) geographical indexing
  – Assumes that the user knows some geography

2010.04.05 - SLIDE 82IS 240 – Spring 2010

Geographic and Spatial Indexing

• Traditional geographic indexing involves using place names from LCSH and name authorities. These have some problems:
  – Names are not unique
  – The places referred to change size, shape and names over time
  – Spelling variations
  – Some places are temporary conventions (study areas, etc.)

2010.04.05 - SLIDE 83IS 240 – Spring 2010

Digital Gazetteers

• Geographic names are and will remain the primary Entry Vocabulary for DL spatial queries
  – The gazetteer must support as many variant forms of the name as possible
    • Including temporal ranges for particular names
  – Querying must support spatial reasoning based on gazetteer and other geographic and temporal information in the system or accessible by network access
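One way to picture a gazetteer entry with variant names and temporal ranges is a record that lists each name together with the years in which it applies. The record layout, the `lookup` function, and the rounded dates in the sample entry are all illustrative assumptions, not the structure of any particular gazetteer.

```python
def lookup(gazetteer, name, year=None):
    """Return entries with a matching variant name whose date range covers `year`."""
    wanted = name.casefold()
    results = []
    for entry in gazetteer:
        for variant, start, end in entry["names"]:
            if variant.casefold() == wanted and (year is None or start <= year <= end):
                results.append(entry)
                break
    return results


# Illustrative entry; the year ranges are approximate and the footprint is a point.
gazetteer = [
    {
        "id": "place-1",
        "footprint": (41.01, 28.98),
        "names": [
            ("Byzantium", -657, 330),
            ("Constantinople", 330, 1930),
            ("Istanbul", 1930, 9999),
        ],
    },
]
print([e["id"] for e in lookup(gazetteer, "constantinople", year=1453)])
print([e["id"] for e in lookup(gazetteer, "Istanbul", year=1453)])
```

All three names resolve to the same footprint, but only when the query year falls inside the range recorded for that name.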

2010.04.05 - SLIDE 84IS 240 – Spring 2010

2010.04.05 - SLIDE 85IS 240 – Spring 2010

Geographic and Spatial Indexing

• Geographic coordinates have some advantages over names:
  – They are persistent regardless of name, political boundary or other changes
  – They can be simply connected to spatial browsing interfaces and GIS data.
  – They provide a consistent framework for GIR applications and spatial queries.

• However, the geographic extents and boundaries of entities also change over time
  – This may be the primary interest of historical scholarship

2010.04.05 - SLIDE 86IS 240 – Spring 2010

Geographic and Spatial Indexing

• GIPSY: Automatic georeferencing of texts (Geographic Info Processing System)
  – The work of Allison Woodruff and Christian Plaunt; later DBMS-based version by Jolly Chen; new version planned
  – Designed to operate on the full text of documents
  – Extracts geographic terms and attempts to identify the coordinates of the places discussed in the text using a combination of evidence

2010.04.05 - SLIDE 87IS 240 – Spring 2010

Geographic and Spatial Indexing

• GIPSY cont.
  – Used the USGS Geographic Names Information System (GNIS) and Geographic Information Retrieval and Analysis System (GIRAS) to associate names with coordinates of named places, geographic features and land use characteristics.

2010.04.05 - SLIDE 88IS 240 – Spring 2010

Geographic and Spatial Indexing

• GIPSY cont.
  – Identified places are added as “elevations” with each place adding a weight based on its frequency in the text and database characteristics
  – The resulting map is analysed to identify the most likely locations, and coordinates for those locations are extracted
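The "elevation" idea can be sketched on a coarse grid: every place mentioned in the text raises the cells it covers by its weight, and the highest cells mark the most likely location. Rectangular footprints, the grid resolution, the function names, and the weights below are simplifying assumptions; the slides do not specify how GIPSY represented footprints or computed weights.

```python
def accumulate(grid_size, footprints):
    """Sum weighted place footprints into a grid of "elevations".

    footprints: list of (min_row, min_col, max_row, max_col, weight),
    with inclusive cell bounds.
    """
    rows, cols = grid_size
    surface = [[0.0] * cols for _ in range(rows)]
    for r0, c0, r1, c1, weight in footprints:
        for r in range(r0, r1 + 1):
            for c in range(c0, c1 + 1):
                surface[r][c] += weight
    return surface


def peak_cells(surface):
    """Return the highest elevation and the cells that reach it."""
    top = max(max(row) for row in surface)
    cells = [(r, c) for r, row in enumerate(surface)
             for c, value in enumerate(row) if value == top]
    return top, cells


# Three overlapping place footprints with made-up weights.
mentions = [(0, 0, 2, 2, 1.0), (1, 1, 3, 3, 2.0), (2, 2, 2, 2, 1.0)]
surface = accumulate((4, 4), mentions)
print(peak_cells(surface))
```

The single cell covered by all three footprints ends up highest, which is the behavior the overlay on the following slide illustrates for a real document.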

2010.04.05 - SLIDE 89IS 240 – Spring 2010

Geographic and Spatial Indexing

• GIPSY Map Overlay

“The proposed project is the construction of a new State Water Project facility, the coastal branch... by water purveyors of northern Santa Barbara County... delivering water to San Luis Obispo ...”

2010.04.05 - SLIDE 90IS 240 – Spring 2010

Geographic and Spatial Indexing

• To be useful for the range of cultural and humanities materials being collected in digital libraries, the GIPSY gazetteer must
  – Support many different time ranges, location and boundary changes
  – Support synonymous and variant names with differing locations for the same entity
  – Support names in multiple languages, scripts and usages

2010.04.05 - SLIDE 91IS 240 – Spring 2010

ECAI

• The Electronic Cultural Atlas Initiative is a collaboration between IT professionals and humanities scholars

• ECAI is developing a globally distributed spatio-temporal library of cultural and historical resources with a centralized metadata catalogue and a GIS viewer

• Currently the ECAI consortium includes over 250 projects

2010.04.05 - SLIDE 92IS 240 – Spring 2010

ECAI

• Projects range from small works by individual scholars to large nationally and internationally funded efforts. E.g.:
  – geography of Greco-Roman culture (Perseus project)
  – toponym locations for over 300,000 images of Buddhist art and architecture
  – Seals of the Sasanian Empire
  – historical trade routes of Eurasia
  – the map of Hideyoshi’s invasion of Korea
  – historical GIS projects for China, Great Britain, the United States, the Black Sea and Tibet

2010.04.05 - SLIDE 93IS 240 – Spring 2010

Perseus

2010.04.05 - SLIDE 94IS 240 – Spring 2010

The Sasanian Empire

2010.04.05 - SLIDE 95IS 240 – Spring 2010

Opening shot of the Sasanian Empire ECAI project, showing a map with diverse resources, a timeline, and a menu of available map layers.

2010.04.05 - SLIDE 96IS 240 – Spring 2010

Users may zoom in to see resources that are only visible at a higher level of detail.

2010.04.05 - SLIDE 97IS 240 – Spring 2010

Spatial objects on the map are linked to a table of attributes, which may include any information about the objects. Note that this is a scholarly tool. By creating a “name quality” field, the author has noted that there is disagreement about the locations and names of places in the Sasanian Empire.

2010.04.05 - SLIDE 98IS 240 – Spring 2010

Sites on the map may be linked to resources elsewhere on the internet. In this case, important archaeological sites on the map are linked to web-based tours.

2010.04.05 - SLIDE 99IS 240 – Spring 2010

The map interface may be used to show change over time. The “Sasanian Empire ca. 270s” resource is highlighted, and the “Sasanian Empire ca. 570s” is greyed out. If a user slides the timeline bar, the new boundary of the empire will appear.

2010.04.05 - SLIDE 100IS 240 – Spring 2010

In a different time range, not only do the boundaries of the empire appear different, but the sites that were active during the earlier era (the red dots) have moved as well.

2010.04.05 - SLIDE 101IS 240 – Spring 2010

TimeMap is a user authoring tool, not merely a viewer. Users can control the look of the icons, the map layers that comprise a project, and, as shown here, the map scale at which different layers will become visible.

2010.04.05 - SLIDE 102IS 240 – Spring 2010

This screen displays the metadata for a part of the Sasanian Empire project. The metadata includes functional (tm.) metadata to enable connection to the map interface in addition to cataloguing (dc. and ecai.) metadata. Using the menu on the left, users may choose to map individual map layers or packaged projects.

2010.04.05 - SLIDE 103IS 240 – Spring 2010

Historic Sydney

2010.04.05 - SLIDE 104IS 240 – Spring 2010

Google Earth GIR - Demo

2010.04.05 - SLIDE 105IS 240 – Spring 2010

The Mongol Empire

2010.04.05 - SLIDE 106IS 240 – Spring 2010

Prof. Ray Larson University of California, Berkeley

School of Information

Tuesday and Thursday 10:30 am - 12:00 pm

Spring 2007

http://courses.ischool.berkeley.edu/i240/s07

Principles of Information Retrieval

Lecture 23: GIR Continued

2010.04.05 - SLIDE 107IS 240 – Spring 2010

Today

• Review
  – Geographic Information Retrieval

• Parts of this lecture were presented at the invitational conference “The ‘I’ in Geographic Information Science”, Manchester, U.K., July 2001

• GIR Algorithms and evaluation based on a presentation to the 2004 European Conference on Digital Libraries, held in Bath, U.K.
