Page 1:

Digital Libraries Initiatives: What I learned (and didn't) in 10 years

Hector Garcia-Molina

Stanford University

Work with: Sergey Brin, Junghoo Cho, Taher Haveliwala, Jun Hirai, Glen Jeh, Zoltan Gyongyi, Andy Kacsmar, Sep Kamvar, Wang Lam, Mor Naaman, Larry Page, Andreas Paepcke, Sriram Raghavan, Gary Wesley, Rebecca Wesley, and others...

Page 3:

Outline

• DLI I & II Experience

– (with special help from Andreas and Rebecca)

• Stanford Research

• “Controversial” Questions for the Future

Page 4:

Disclaimer

• Stanford Perspective

Page 5:

DLI Experience

• Lots of great research!

• Lots of great content!

• Main Event was....

Page 6:

Main DL Event

IEEE and ACM DL Conferences Merge into JCDL!!

Page 9:

The WWW Tsunami

• Before the Web:

– Publishers, catalogs,...

– Librarians: see the need for technology

– CS Types: want to have social impact

Page 10:

The WWW Tsunami

• The Web Arrives:

– few coherent collections

– producers = consumers

– everything free

– heterogeneous

– merge: shopping, entertainment, library services ...

Page 11:

CS-Library “Tensions”

• Web generated a lot of excitement, but...

• “Friendly tensions” as everyone adjusted:

– Techies take all the funding!

– Librarians don’t get it!

Page 12:

Example: CS-TR Experience

• History

• Copyright issues

• Pubs servers everywhere

• Citeseer,...

• Organization vs chaos

• Chaos wins! (this round)

[Figure: timeline marking NCSTRL and DLI I & II]

Page 13:

Bright Future

• DLI I & II made important contributions (more later...)

• Huge volume of information available

• Direct communication between authors and librarians

• Core library functions needed, more than ever:

– organization

– curation

– trusted information

– ...

[Figure: timeline marking NCSTRL, DLI I & II, and today]

Page 14:

Stanford DLI Project

Page 15:

Stanford Theme - Phase I

• “GLUE” for accessing diverse libraries and services

[Figure: parties joined by the glue layer: Internet Libraries; Payment Institutions; Search Agents; User Interfaces and Annotations; Commercial Information Brokers & Providers; Copyright Services; Query/Data Conversion. Protocols shown: HTTP, Z39.50, Telnet]

Page 16:

InfoBus Example

[Figure: InfoBus diagram. Sources: Folio, Dialog, DigiCash, F.V., each behind its proxy (FolioProxy, DialogProxy, DigiCashProxy, F.V.Proxy). Bus services: DLite, Gloss, QueryTrans, MetaData, U-Pai, Contracts]

Q: Find Ti distributed (W) systems

Page 17:

InfoBus Example

[Figure: InfoBus diagram. Sources: Folio, Dialog, DigiCash, F.V., each behind its proxy (FolioProxy, DialogProxy, DigiCashProxy, F.V.Proxy). Bus services: DLite, Gloss, QueryTrans, MetaData, U-Pai, Contracts]

Q: Find Ti distributed (W) systems

Suggested: Folio, Dialog

Page 18:

InfoBus Example

[Figure: InfoBus diagram. Sources: Folio, Dialog, DigiCash, F.V., each behind its proxy (FolioProxy, DialogProxy, DigiCashProxy, F.V.Proxy). Bus services: DLite, Gloss, QueryTrans, MetaData, U-Pai, Contracts]

Q: Find Ti distributed (W) systems

Q’: Find Ti distributed AND systems

Query Translation

Page 19:

InfoBus Example

[Figure: InfoBus diagram. Sources: Folio, Dialog, DigiCash, F.V., each behind its proxy (FolioProxy, DialogProxy, DigiCashProxy, F.V.Proxy). Bus services: DLite, Gloss, QueryTrans, MetaData, U-Pai, Contracts]

Q: Find Ti distributed (W) systems

Pay per View

Page 20:

InfoBus Details

[Figure: clients (DLITE client, Z39.50 client, SenseMaker client) connect through LSP, Z-cl, and Z-sr components to libraries L1 ... Ln (including a Z39.50 library) and services S1 ... Sn (Payment, Translation, MetaData, ...). Bus layer: ILU Objects, Information Models, DLI Protocol]

Page 21:

Querying Sources

• Differences: Language, Operators, Attributes,...

Q1: title contains large AND distributed (W) system

Q2: FIND heading large AND distributed NEAR system

Page 22:

Query Translation

[Figure: user query → Query Translator (uses target syntax, capabilities, schemas) → target IR systems → Post-Filter (applies filter queries) → final results]

Page 23:

Stop Word Examples

• User Query Q1:

– title contains gone AND with AND the AND wind

• Subsuming Query QS: (for Dialog)

– title contains gone AND wind

• Filter Query QF:

– title contains with AND the

[Figure: Q1 → query trans → QS → source → AS → post-filter (QF) → A1]

Page 24:

Stop Word Examples

• User Query Q1:

– title contains gone (W) with (W) the (W) wind

• Subsuming Query QS: (for Dialog)

– title contains gone (2W) wind

• Filter Query QF:

– title contains gone (W) with (W) the (W) wind

[Figure: Q1 → query trans → QS → source → AS → post-filter (QF) → A1]
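
The stop-word translation on these two slides can be sketched in a few lines. This is only an illustration under stated assumptions: the stop-word list, the function names `subsuming_query` and `post_filter`, and the whitespace tokenizer are hypothetical, and the real Dialog syntax and the project's translator are not reproduced here. The sketch drops stop words from an adjacency query, widens `(W)` to `(nW)` to cover the dropped words, and then checks the original phrase locally.

```python
# Assumed stop-word list for the target source (hypothetical, for illustration).
STOP_WORDS = frozenset({"with", "the"})


def subsuming_query(phrase_words, stop_words=STOP_WORDS):
    """Build QS for an adjacency query w1 (W) w2 (W) ...

    Stop words are dropped, and the operator between the surviving
    neighbours is widened to (nW), n = number of dropped words in the gap.
    """
    parts, gap = [], 0
    for word in phrase_words:
        if word in stop_words:
            gap += 1  # this word cannot be sent to the source
            continue
        if parts:
            parts.append("(W)" if gap == 0 else f"({gap}W)")
        parts.append(word)
        gap = 0
    return " ".join(parts)


def post_filter(titles, phrase_words):
    """Apply QF locally: keep titles containing the exact adjacent phrase."""
    n = len(phrase_words)
    kept = []
    for title in titles:
        words = title.lower().split()
        if any(words[i:i + n] == phrase_words for i in range(len(words) - n + 1)):
            kept.append(title)
    return kept


q1 = ["gone", "with", "the", "wind"]
qs = subsuming_query(q1)  # what the source is asked to evaluate
# Pretend the source answered QS with these titles (AS), then filter to A1.
source_answer = ["Gone with the Wind", "Gone like the wind"]
final_answer = post_filter(source_answer, q1)
```

The subsuming query may return extra documents, which is why the filter step has to run on the client side after the source answers.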

Page 25:

Translation Overhead: Stop Words

[Figure: size of user query with (W) operator vs. size of subsuming query without stop words; text field on Dialog]

Page 26:

Summary

• Option 1: Avoid Translation

– Need: common language and operators

– Need: common attributes

• Option 2: Translate

– Need: source meta-data

– Need: user involvement in translation

Page 27:

Stanford DLI II: Technical Barriers

Economic Weaknesses

Information Loss

Information Overload

Service Heterogeneity

Physical Barriers

Page 28:

WebBase Architecture

[Figure: components shown: WWW, Web Crawlers, Repository, Feature Repository, Retrieval Indexes, Multicast Engine, WebBase API, Clients]

Page 29:

PowerBrowser - Start Screen

Page 30:

PowerBrowser - Hypertext View

Page 31:

Copy Detection

Copy Detection System

Page 32:

Replicated Collections on the Web

Page 33:

Archival Repository

[Figure: a server holding Stanford TRs and a server holding Illinois TRs, with a Stanford archival repository and an Illinois archival repository]

Page 34:

Archival Repository Design

• If I have $100K/yr

• Want 99.999% “reliability”

– how many copies

– how much preventive maintenance

– ???

[Figure: Preventive Maintenance and Aging. x-axis: Start of Aging (years): 1, 3, 5, 10, Never; y-axis: MTTF (years), 0 to 80; one series per Preventive Maintenance Period (years): 1, 3, 5, 10, Never]

Page 35:

Crawler Friendly Web Servers

• Year 2000 Paper:

– Onn Brandman, Junghoo Cho, Narayanan Shivakumar

– Help crawlers identify pages of interest

[Figure: crawler pulls pages from the web server]

Page 36:

Crawler Friendly Web Servers

• Year 2000 Paper:

– Onn Brandman, Junghoo Cho, Narayanan Shivakumar

– Help crawlers identify pages of interest

[Figure: crawler pulls a digest from the web server]

Other options:

• Push

• Filter service

Page 37:

Needless Requests

Page 38:

Improved Freshness

Page 40:

DLI Technology Transfer

• Research Product: Students

• Transfer Takes Time!

Page 41:

Economic Weaknesses

Information Loss

Information Overload

Service Heterogeneity

Physical Barriers

• Interoperability

• Value Filtering

• Mobile Access

• IP Infrastructure

• Archival Repository

Technologies for Digital Libraries

Page 42:

“Controversial” Questions

Page 43:

Is Metadata Dead?

document

metadata

Page 44:

Will the Semantic Web Make It?

• Will tags be generated?

• By whom?

• Agreement

[Figure: web → semantic tags → ? → Search Engine]

Page 45:

Is Google the Future Digital Library?

Page 46:

Not Online, Not Worth Having?

• Bill Arms Quote

Page 47:

Are Publishers Still Needed?

Page 48:

Here Today, Gone Tomorrow?

• Will we find today’s materials in 50 years?

Page 49:

Will Lawyers Win?

Page 50:

Summary

• We learned a lot from DLI I & II

• Trained students who are changing the world

• Many challenges ahead...

Page 51:

Extra Slides

Page 52:

Outline

• dli

– 94-98; 00-05

– lots of great research; wonderful sites (cervantes)

– the web: like doing research on tidal pools when a tsunami hits

– before the web:

• librarians: catalogs, publishers in control, research funding low

• computer science: chance to have impact; do good for society

– the web

• blurred distinction between producers and consumers

• no coherent collections (with curator who controlled, organized...)

• everything free (expectation that...)

• heterogeneity (beyond html...)

• merged shopping, work, library, entertainment... blurred distinctions...

– tensions cs-librarians

• cs folks taking all the funding to work on technology

• librarians “don’t get it”: times are changing

• CS-TR experience...copyrights, servers, search, etc...

Page 53:

Outline

• dli (continued)

– Bright Future

• direct communication between librarians and authors (camera ready...)

• huge volume of information available

• core function of librarianship remains (organize, categorize,....); now more than ever: need to filter out junk, need to organize, synthesize....

• more on this future later on in talk...

Page 54:

Outline

• summary of stanford work

Page 55:

Outline

• dli

• summary of stanford work

• future issues

– will semantic web ever make it?

– Is metadata really dead?

– Are publishers still needed?

– Is Google the digital library of the future?

• google scans books

– Is paper relevant?

• bill arms: “If it is not online, it is not worth having”

• my students do not cite anything not online (Michigan story)

– Will we be able to find today’s digital materials in 50 years?

– How will DLs be funded? DL Research funded?

Page 56:

Research Areas

• Interface: our window to a digital library

• Interoperation: accessing heterogeneous services

• Discovery: finding desired resources

• Translation: speaking the right language

• Payment: multiple policies & currencies

• Interpretation: understanding results

• Creation: generating new information

Page 57:

Outline

• Overview of Digital Library Initiative

• The Stanford Digital Library Project

– Overview

– The InfoBus

– Internet Meta-Searching

• Discovery

• Querying

• Merging and ranking

– STARTS Protocol

Page 58:

Discovery: Exhaustive Searching

[Figure: queries are sent to every source, and each source returns an answer]

Page 59:

Discovery: Full Index

[Figure: an extractor at each source sends full text to a central INDEX; a query to the index returns document identifiers, followed by requests to the sources for specific documents]

Page 60:

Discovery: GLOSS

[Figure: a collector at each source sends statistics to GLOSS; a query to GLOSS returns hints, and the query is then sent to a source]

Page 61:

Example:

• query: find author Knuth and title computers

• statistics GLOSS keeps on databases:

DB     #docs   #docs with author Knuth   #docs with title computers
db1    100     0                         3
db2    200     10                        200
db3    1000    100                       100
db4    1000    1                         1

Which database(s) should the user search?

Page 62:

Example (cont.):

• q = find author Knuth and title computers

DB     #docs   #docs with author Knuth   #docs with title computers
db1    100     0                         3
db2    200     10                        200
db3    1000    100                       100
db4    1000    1                         1

• Use IND predictor (others available).

• Resulting rank:

ESize(q, db2) = (10/200)*(200/200)*200 = 10 docs

ESize(q, db3) = (100/1000)*(100/1000)*1000 = 10 docs

ESize(q, db4) = (1/1000)*(1/1000)*1000 = 0.001 docs

ESize(q, db1) = (0/100)*(3/100)*100 = 0 docs
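
The arithmetic on this slide can be reproduced with a short sketch. It assumes, as the ESize lines suggest, that the IND predictor treats the query terms as independent and multiplies the per-term document fractions by the database size; the function name `esize_ind` and the `stats` layout are hypothetical and are not taken from the GLOSS implementation.

```python
from fractions import Fraction


def esize_ind(num_docs, term_doc_counts):
    """Estimated number of matching documents under term independence.

    num_docs: total documents in the database.
    term_doc_counts: for each query term, how many documents contain it.
    """
    if num_docs == 0:
        return Fraction(0)
    estimate = Fraction(num_docs)
    for count in term_doc_counts:
        estimate *= Fraction(count, num_docs)  # fraction of docs with this term
    return estimate


# Statistics from the slide: (#docs, [#docs author Knuth, #docs title computers])
stats = {
    "db1": (100, [0, 3]),
    "db2": (200, [10, 200]),
    "db3": (1000, [100, 100]),
    "db4": (1000, [1, 1]),
}

estimates = {db: esize_ind(n, counts) for db, (n, counts) in stats.items()}
# Rank databases by estimated result size, largest first (ties keep input order).
ranking = sorted(estimates, key=estimates.get, reverse=True)
```

Exact fractions are used only to keep the small db4 estimate free of floating-point noise; plain floats would serve equally well for ranking.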

Page 63:

GLOSS Results

• Experimental Evaluation

• GLOSS hints “very good” 85% to 90% of the time

• GLOSS index is 2% of the size of full index

Page 64:

Summary

• GLOSS and other resource discovery tools work…

• BUT require meta-data collection facilities.

[Figure: a source with a collector, receiving queries]

Page 65:

Translation Overhead: Stop Words

[Figure: size of user query with AND operator vs. size of subsuming query without stop words; text field on Dialog]

Page 66:

Translation Overhead: Stop Words

[Figure: size of user query with (W) operator vs. size of subsuming query without stop words; text field on Dialog; annotation: remaining length greater than 1]

Page 67:

Ranking & Interpreting Results

• How do we merge ranked results?

– Example: Query: “distributed databases”

– Source1: (d1, 0.7), (d2, 0.3)

– Source2: (d3, 100), (d4, 82), (d5, 71)

Page 68:

Ranking & Interpreting Results

• Need additional information from sources

– Example: Query: “distributed databases”

– Source1:

( doc = d1, rank = 0.7, frequency[“distributed”] = 100, frequency[“databases”] = 1000, totalDocuments = 5000 ),

( doc = d2, rank = 0.3, frequency[“distributed”] = 10, frequency[“databases”] = 300, totalDocuments = 5000 )

Page 69:

Target Ranking

• Compute target ranking:

– Source1: (d1, T100), (d2, T50)

– Source2: (d3, T150), (d4, T80), (d5, T25)

• Merge:

– Combined: (d3, T150), (d1, T100), (d4, T80), (d2, T50), (d5, T25)

Page 70:

Target Ranking

• Compute target ranking:

– Source1: (d1, T100), (d2, T50)

– Source2: (d3, T150), (d4, T80), (d5, T25)

• Merge:

– Combined: (d3, T150), (d1, T100), (d4, T80), (d2, T50), (d5, T25)

• Question: Are we positive (d3, T150) is best?

– Maybe (dx, 0.25) at Source1 (ranked below d2 there) has target rank of (dx, T200)??
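
The merge step on this slide is small enough to sketch. The example assumes the target scores (the T values) have already been computed for every retrieved document; how they are derived from the exported statistics is not shown on the slides, so that part is left out. The function name `merge_by_target_score` is hypothetical.

```python
from itertools import chain


def merge_by_target_score(*source_results):
    """Merge per-source (doc_id, target_score) lists into one ranked list.

    The lists are re-sorted rather than assumed sorted, because a source's
    own rank order need not agree with the target ranking.
    """
    return sorted(chain(*source_results), key=lambda pair: pair[1], reverse=True)


source1 = [("d1", 100), ("d2", 50)]
source2 = [("d3", 150), ("d4", 80), ("d5", 25)]
combined = merge_by_target_score(source1, source2)
```

The merged list covers only documents the sources actually returned. A document such as dx, which Source1 ranked too low to return, never reaches the merge, which is exactly the question the slide raises.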

Page 71:

Summary

• Sources need to export auxiliary ranking information

• We need some “knowledge” of source ranking function

Page 72:

STARTS

• Stanford Protocol for Internet Search and Retrieval

• Participants:

– Fulcrum, Infoseek, Microsoft Network, Verity, WAIS

– GILS, Harvest, Netscape, PLS, HP, others

• Goal: Simplify the Job of Meta-searchers.

• Goal: Simplicity

• Can be used by different transport protocols.

• Visit:

– http://www-db.stanford.edu/~gravano/starts_home.html

Page 73:

STARTS Components

(1) Common scheme for collecting meta-data

(2) Common query language

(3) Common result ranking information

[Figure: a source with Collector, Queries, and Answers arrows, labeled (1), (2), and (3)]

Page 74:

Metadata Example (SOIF)

@SMetaAttributes{
Version{10}: STARTS 1.0
SourceID{8}: Source-1
FieldsSupported{17}: [basic-1 author]
ModifiersSupported{19}: {basic-1 phonetics}
FieldModifierCombinations{39}: ([basic-1 author] {basic-1 phonetics})
QueryPartsSupported{2}: RF
ScoreRange{7}: 0.0 1.0
RankingAlgorithmID{6}: Acme-1
...

Page 75:

Sample Query (SOIF)

@SQuery{
Version{10}: STARTS 1.0
FilterExpression{48}: ((author ``Ullman'') and (title stem ``databases''))
RankingExpression{61}: list( (body-of-text ``distributed'') (body-of-text ``databases''))
DropStopWords{1}: T
DefaultAttributeSet{7}: basic-1
DefaultLanguage{5}: en-US
AnswerFields{12}: title author
MinDocumentScore{3}: 0.5
MaxNumberDocuments{2}: 10
}

Page 76:

Meta-Searching Conclusion

• Need extra information from sources:

– Meta-data

– Ranking information

• For querying multiple sources:

– Need standard query language; or

– Need query translation machinery

Page 77:

Meta-Searching Conclusion

• Other issues:

– Payment

– Preserving advertisements

– Improved “value” filtering

Page 78:

The Stanford Digital Library Project

[Figure: parties joined by the project: Internet Libraries; Payment Institutions; Search Agents; User Interfaces and Annotations; Commercial Information Brokers & Providers; Copyright Services; Query/Data Conversion. Protocols shown: HTTP, Z39.50, Telnet]

Page 79:

Interoperability Challenges

• Growing number of players, formats, countries,...

• Repositories Services

• Dynamic artifacts

Digital Libraries

Page 80:

Standards

• Too Many

– e.g., Z39.50, HTTP, SDLIP, CORBA, DASL, ...

• Narrow

– e.g., XML not a silver bullet

• Nevertheless Important...translation

Page 81:

Query Translation Example

Q: Find Title contains(“cats” near “dogs”)

Q’: Find Title contains(“cats”) AND contains(“dogs”)

doc 1: blah, blah, cats and dogs blah, blah

doc 2: blah, cats, blah, blah, blah, blah, dogs

[Figure: translate turns Q into Q’; the target system returns {doc1, doc2}; filter reduces this to {doc1}]
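
The translate-then-filter flow in this example can be sketched as follows. The slide does not say how close "near" has to be, so the `max_distance=3` window, the regex tokenizer, and all function names here are assumptions made for illustration.

```python
import re


def tokens(text):
    """Lowercased word tokens; punctuation is ignored."""
    return re.findall(r"\w+", text.lower())


def matches_and(text, a, b):
    """Q': the weaker query the target system can evaluate."""
    words = tokens(text)
    return a in words and b in words


def matches_near(text, a, b, max_distance=3):
    """Q: the original proximity condition, checked locally by the filter."""
    words = tokens(text)
    pos_a = [i for i, w in enumerate(words) if w == a]
    pos_b = [i for i, w in enumerate(words) if w == b]
    return any(abs(i - j) <= max_distance for i in pos_a for j in pos_b)


def near_query(docs, a, b, max_distance=3):
    """Run Q' at the target, then post-filter its answer with Q."""
    candidates = {name: text for name, text in docs.items() if matches_and(text, a, b)}
    return sorted(name for name, text in candidates.items()
                  if matches_near(text, a, b, max_distance))


docs = {
    "doc1": "blah, blah, cats and dogs blah, blah",
    "doc2": "blah, cats, blah, blah, blah, blah, dogs",
}
result = near_query(docs, "cats", "dogs")
```

Both documents satisfy Q’, so the target returns both; only doc1 survives the proximity check under the assumed window.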

Page 82:

Another Query Translation Example

Q: Find [grade > 8] AND [name =“elton john”]

Q’: Find [score = A] AND [last-name = “john”] AND [first-name = “elton”]

[Figure: translate turns Q into Q’ for the target system]

translate:

• basic rules

• translation algorithm

• error estimation

Page 83:

Filtering Challenges

• Too much information

• Not controlled

Page 84:

Current Filtering

textual similarity

Page 85:

Page Rank Filtering

textual similarity

page rank (Google)

Page 86:

Initial Page Rank

[Figure: link graph with page rank values 4 and 1]

Page 87:

Recursive Page Rank

[Figure: link graph with page rank values 4, 1, 6, 1, 2, 2, and the annotation 1+2+1+2 = 6]
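
One reading of the figure is that each page splits its rank evenly across its out-links and a page's new rank is the sum of the shares it receives. The sketch below shows a single propagation step under that reading. The graph, the page names, and the function `propagate_once` are hypothetical, and the step omits the damping factor and repeated iteration used in full PageRank.

```python
def propagate_once(rank, out_links):
    """One simplified rank-propagation step (no damping).

    rank: current rank of every page.
    out_links: page -> list of pages it links to.
    Each page divides its rank evenly among its out-links.
    """
    new_rank = {page: 0.0 for page in rank}
    for page, targets in out_links.items():
        if not targets:
            continue  # a page without out-links passes nothing on
        share = rank[page] / len(targets)
        for target in targets:
            new_rank[target] += share
    return new_rank


# Hypothetical graph: two rank-4 pages with two links each,
# two rank-1 pages with one link each, all pointing at page "x".
rank = {"a": 4, "b": 1, "c": 4, "d": 1, "x": 0, "y": 0, "z": 0}
out_links = {"a": ["x", "y"], "b": ["x"], "c": ["x", "z"], "d": ["x"]}
new_rank = propagate_once(rank, out_links)
```

In this made-up graph, page x receives 2 + 1 + 2 + 1, matching the sum annotated on the slide.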

Page 88:

Value Filtering

textual similarity

page rank

geography

context

opinions

access

Page 89:

Mobile Access Challenges

• Limited Screen Size

• Limited Bandwidth

• Disconnected Operation

• Limited Power

Page 90:

Power Browsing

Techniques

• Show only text headers

• Show URLs, anchors, titles

• Order URLs by page rank

• Summarize text

• Summarize set of pages

• Low-resolution pictures

• Site search, word completion

• ...

Page 91:

PowerBrowser - Text View

Page 92:

PowerBrowser - History

Page 93:

Economic Challenges

• Piracy

• Payment

• Heterogeneity

• Security/Privacy

Page 94:

Piracy on the Internet

Page 95:

Approaches

• Copy Prevention

– isolation

– cryptography

– secure viewer

• Copy Detection

– watermarking

– content based

Page 96:

Copy Detection

• Content– text

– audio

– video

• Challenges– crawling the web, mailing lists,...

– large scale comparison

– false negatives, positives

– different formats, sampling rates, frame rates,...

– adversary tries to fool system

Page 97:

Example: Text Copy Detection

[Figure: two pipelines sharing a database (hash table) of chunk signatures. Registration: get document → break into chunks → compute signature → store in database. Detection: get document → break into chunks → compute signature → probe database → above threshold?]
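
The two pipelines on this slide can be sketched as a small class. The slides leave the chunking unit, the signature function, and the threshold open (the next slide lists them as issues), so the choices here are assumptions for illustration: overlapping five-word windows, SHA-1 digests, and a 0.5 overlap threshold. The names `CopyDetector`, `register`, and `probe` are hypothetical.

```python
import hashlib
from collections import defaultdict


def chunks(text, size=5):
    """Overlapping word windows; the chunking unit is an assumption."""
    words = text.lower().split()
    if not words:
        return []
    if len(words) <= size:
        return [" ".join(words)]
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]


def signature(chunk):
    """Fixed-size signature of a chunk (SHA-1 chosen for illustration)."""
    return hashlib.sha1(chunk.encode("utf-8")).hexdigest()


class CopyDetector:
    def __init__(self, threshold=0.5):
        self.table = defaultdict(set)  # signature -> ids of registered docs
        self.threshold = threshold

    def register(self, doc_id, text):
        """Registration pipeline: chunk, sign, store in the hash table."""
        for chunk in chunks(text):
            self.table[signature(chunk)].add(doc_id)

    def probe(self, text):
        """Detection pipeline: fraction of signatures shared with each
        registered document, keeping those at or above the threshold."""
        sigs = {signature(chunk) for chunk in chunks(text)}
        if not sigs:
            return {}
        hits = defaultdict(int)
        for sig in sigs:
            for doc_id in self.table.get(sig, ()):
                hits[doc_id] += 1
        return {doc_id: n / len(sigs) for doc_id, n in hits.items()
                if n / len(sigs) >= self.threshold}


detector = CopyDetector()
original = "the quick brown fox jumps over the lazy dog today"
detector.register("orig", original)
same = detector.probe(original)
other = detector.probe("completely different words appear in this other sample document here")
```

Exact hashing of word windows is easy to defeat by small edits, which is one reason the slides list adversaries and chunk choice as open issues.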

Page 98:

Text Detection Issues

• What are chunks?

• What is threshold?

• How to foil adversary?

• How to compare hypertext documents?

Page 99:

Information Preservation Challenges

• Preserving the Bits

– Evolving hardware

– Evolving software

– Evolving organizations

• Preserving the Meaning

Page 100:

Archiving the Web

[Figure: a server with documents and a web server with web pages, alongside the Stanford archival repository and web users]

Page 101:

InfoMonitor History View

Page 102:

InfoMonitor Snapshot View

Page 104:

Archival Repository

• Object Identifier Signature

• No Deletions (never ever!)

handle

set set

new version?