Reference Collections: Collection Characteristics.

Page 1

Reference Collections: Collection Characteristics

Page 2

CACM Collection

3204 Communications of the ACM articles

Focus of collection: computer science

Structured subfields:
– Author names
– Date information
– Word stems from title and abstract
– Categories from hierarchical classification
– Direct references between articles
– Bibliographic coupling connections
– Number of co-citations for each pair of articles
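The last three subfields can all be derived from the direct references. The sketch below is only an illustration on hypothetical toy data (the article ids and the `references` mapping are made up, not taken from CACM). It assumes the usual definitions: two articles are bibliographically coupled when they cite the same article, and two articles are co-cited when a third article cites both.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: article id -> set of article ids it cites.
references = {
    "A": {"C", "D"},
    "B": {"C", "D", "E"},
    "C": {"E"},
    "D": set(),
    "E": set(),
}


def bibliographic_coupling(refs):
    """For each pair of articles, count the references they have in common."""
    coupling = Counter()
    for a, b in combinations(sorted(refs), 2):
        shared = len(refs[a] & refs[b])
        if shared:
            coupling[(a, b)] = shared
    return coupling


def co_citations(refs):
    """For each pair of articles, count the articles that cite both."""
    counts = Counter()
    for cited in refs.values():
        for pair in combinations(sorted(cited), 2):
            counts[pair] += 1
    return counts


coupling = bibliographic_coupling(references)
cocited = co_citations(references)
print(coupling)
print(cocited)
```

In this toy example A and B are coupled through their shared references C and D, while C and D are co-cited by both A and B.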

Page 3

CACM Collection

3204 Communications of the ACM articles

Test information requests:
– 52 information requests in natural language with two Boolean query expressions
– Average of 11.4 terms per query
– Requests are rather specific with an average of about 15 relevant documents
– Result in relatively low precision and recall
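For reference, precision and recall for a single request can be computed as below. This is a minimal sketch using the standard set-based definitions. The document ids and the retrieved set are hypothetical, and only the figure of 15 relevant documents echoes the average quoted above.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one request, using set overlap."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical request: 15 relevant documents, 20 retrieved, 6 in common.
relevant = set(range(1, 16))    # ids 1..15
retrieved = set(range(10, 30))  # ids 10..29
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

With so few relevant documents per request, every missed or spurious document moves both measures noticeably.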

Page 4

ISI Collection

1460 documents from the Institute for Scientific Information

Focus of collection: information science

Structured subfields:
– Author names
– Word stems from title and abstract
– Number of co-citations for each pair of articles

Page 5

ISI Collection

1460 documents from the Institute for Scientific Information

Test information requests:
– 35 information requests in natural language with Boolean query expressions
– Average of 8.1 terms per query
– 41 information requests in natural language without Boolean query expressions
– Requests are fairly general with an average of about 50 relevant documents
– Higher precision and recall

Page 6

Observation

Collection   # of Docs   # of Terms   Terms/Doc
CACM         3204        10446        40.1
ISI          1460        7392         104.9

Number of terms increases slowly with number of documents
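The observation can be checked with a little arithmetic on the table. The sketch below only compares the two rows. It is a rough illustration, since the two collections differ in subject and in document length, so it should not be read as a fitted growth law.

```python
# Figures copied from the table above.
collections = {
    "CACM": {"docs": 3204, "terms": 10446},
    "ISI": {"docs": 1460, "terms": 7392},
}

cacm, isi = collections["CACM"], collections["ISI"]

# How many times larger CACM is than ISI, in documents and in distinct terms.
doc_ratio = cacm["docs"] / isi["docs"]
term_ratio = cacm["terms"] / isi["terms"]

print(f"documents: x{doc_ratio:.2f}, terms: x{term_ratio:.2f}")
```

CACM has more than twice as many documents as ISI but well under one and a half times as many distinct terms.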

Page 7

Cystic Fibrosis Collection

1239 articles indexed under “Cystic Fibrosis” in MEDLINE

Structured subfields:
– MEDLINE accession number
– Author
– Title
– Source
– Major subjects
– Minor subjects
– Abstract (or extract)
– References in the document
– Citations to the document

Page 8

Cystic Fibrosis Collection

1239 articles indexed under “Cystic Fibrosis” in MEDLINE

Test information requests:
– 100 information requests
– Relevance assessed by four experts on a scale of 0 (not relevant), 1 (marginal relevance), and 2 (high relevance)
– Overall relevance is the sum of the four scores (0-8)
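The scoring scheme amounts to a simple sum, sketched below. The function name `overall_relevance` and the sample scores are hypothetical and used only for illustration.

```python
def overall_relevance(scores):
    """Sum four expert scores, each 0, 1 or 2, into an overall score of 0-8."""
    if len(scores) != 4:
        raise ValueError("expected exactly four expert scores")
    if any(s not in (0, 1, 2) for s in scores):
        raise ValueError("each score must be 0, 1 or 2")
    return sum(scores)


# Hypothetical judgments for one (request, document) pair.
print(overall_relevance([2, 1, 0, 2]))
```

A graded total like this lets an evaluation separate documents that every expert found highly relevant from those only one expert found marginal.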

Page 9

Discussion Questions

In developing a search engine:
– How would you use metadata (e.g. author, title, abstract)?
– How would you use document structure?
– How would you use references, citations, co-citations?
– How would you use hyperlinks?