N-gram Topic Models for Bibliometric Analysis

Gideon Mann, David Mimno, and Andrew McCallum

Can topic models provide better measurements of the impact of research literature?

Bibliometrics and Scientometrics

Typically analyzes patterns of citations in research literature

Derek de Solla Price: “Little Science, Big Science”

Eugene Garfield: Science Citation Index, Journal Citation Reports

Comparing apples to apples: top journals by citations

Biochemistry and molecular biology:

J. Biol. Chem 405017

Cell 136472

Biochem.-US 96809

Mathematics:

Lect. Notes Math 6926

T. Am. Math. Soc 6469

J. Math. Anal. Appl. 6004

Source: Journal Citation Reports (2004)

What’s wrong with grouping by journal?

• 10 of the 200 most cited papers in CiteSeer are unpublished technical reports; 15% of the most cited papers are from conference proceedings

• Open-access publication increasing, but venue information often not available

• Hand-entered ISI citation data is noisy

• An article has only one venue, but journals cover many topics

A topic model for N-grams

Determine whether the next word will be part of an n-gram based on the current word and the current hidden topic. “White house” is a collocation in politics, but may not be one in real estate.
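As a rough illustration of the idea above, the decision "does the next word join the current word in an n-gram?" can be treated as a binary draw whose probability depends on both the current word and the current topic. The sketch below is only a toy: the `JOIN_PROB` table, its probabilities, and the topic names are invented for illustration, whereas an actual model would learn these quantities from data.

```python
import random

# Hypothetical, hand-set probabilities that the word FOLLOWING `word`
# joins it in an n-gram, given the current topic. Invented values only.
JOIN_PROB = {
    ("politics", "white"): 0.9,      # "white house" is likely a collocation
    ("real_estate", "white"): 0.05,  # "white house" is likely just a colour
}


def next_word_joins(topic, word, rng, default=0.0):
    """Sample the binary 'continues an n-gram' indicator for the next word."""
    p = JOIN_PROB.get((topic, word), default)
    return rng.random() < p


rng = random.Random(0)  # fixed seed so the sketch is repeatable
trials = 10_000
politics_rate = sum(
    next_word_joins("politics", "white", rng) for _ in range(trials)
) / trials
estate_rate = sum(
    next_word_joins("real_estate", "white", rng) for _ in range(trials)
) / trials
```

With these made-up numbers, "white" is followed by an n-gram continuation far more often under the politics topic than under the real estate topic, which is the behaviour the slide describes.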

Sample n-gram topics

1. Digital Libraries (102): digital, electronic, library, metadata, access; “digital libraries”, “digital library”, “electronic commerce”, “dublin core”, “cultural heritage”

2. WWW (129): web, site, pages, page, www, sites; “world wide web”, “web pages”, “web sites”, “web site”, “world wide”

3. Ontologies (186): semantic, ontology, ontologies, rdf, semantics, meta; “semantic web”, “description logics”, “rdf schema”, “description logic”, “resource description framework”

4. Web services (184): web, services, service, xml, business; “web services”, “web service”, “markup language”, “xml documents”, “xml schema”

Assigning topics to documents

1. Build a 200-topic n-gram topic model on 300k documents

2. Remove stopword or methodological topics (e.g. “efficient, fast, speed”)

3. For each document d, if more than 10% of d’s tokens are assigned to topic t, and that comprises more than two tokens, assign d to t

Each topic is now an intellectual “domain” that includes some number of documents. We can substitute topic for journal in most traditional bibliometric indicators. We can also now define several new indicators.
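The assignment rule in step 3 can be sketched as a simple threshold test over per-token topic assignments. This is a minimal sketch under stated assumptions: the function name, the input format (one topic id per token), and the choice to compute the share over all of a document's tokens are illustrative guesses, not details given in the slides.

```python
from collections import Counter


def assign_topics(token_topics, excluded=frozenset(),
                  min_share=0.10, min_tokens=2):
    """Return the set of topics a document is assigned to.

    token_topics: one topic id per token in the document (assumed format).
    excluded:     topics dropped in step 2 (stopword/methodological topics).
    A topic qualifies if it covers MORE than `min_share` of the tokens
    and MORE than `min_tokens` tokens in absolute terms.
    """
    total = len(token_topics)
    if total == 0:
        return set()
    counts = Counter(token_topics)
    return {
        topic
        for topic, count in counts.items()
        if topic not in excluded
        and count / total > min_share
        and count > min_tokens
    }


# Hypothetical 20-token document: topic 7 stands in for a removed topic.
doc = [138] * 12 + [7] * 5 + [50] * 2 + [3] * 1
assigned = assign_topics(doc, excluded={7})

# A very short document: topic 138 covers two thirds of the tokens,
# but only two tokens, so the absolute threshold rejects it.
short_doc_assigned = assign_topics([138, 138, 5])
```

Because a document can pass the threshold for several topics, this rule allows one document to belong to more than one "domain", unlike a single journal venue.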

Impact Factor

Journal Impact Factor: Citations from articles published in 2004 to articles in Cell published in 2002-3, divided by the number of articles published in Cell in 2002-3.

2004 Impact factors from JCR:

Nature 32.182

Cell 28.389

JMLR 5.952

Machine Learning 3.258
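The impact factor definition above can be sketched as a short function. Passing group membership as a plain set of paper ids means the same function works whether the group is a journal or a topic. All names and the toy data here are hypothetical, and the sketch assumes citations are available as (citing paper, cited paper) pairs.

```python
def impact_factor(citations, pub_year, members, year):
    """Impact factor of a group of papers for a given year.

    citations: iterable of (citing_id, cited_id) pairs (assumed format).
    pub_year:  dict mapping paper id -> publication year.
    members:   set of paper ids in the journal or topic.
    year:      the citing year, e.g. 2004 (window is then 2002-2003).
    """
    window = {year - 2, year - 1}
    targets = {p for p in members if pub_year[p] in window}
    if not targets:
        return 0.0
    hits = sum(
        1
        for citing, cited in citations
        if pub_year[citing] == year and cited in targets
    )
    return hits / len(targets)


# Invented toy data: papers a, b, c belong to the group.
pub_year = {"a": 2002, "b": 2003, "c": 2001, "x": 2004, "y": 2004, "z": 2003}
members = {"a", "b", "c"}
citations = [("x", "a"), ("y", "a"), ("x", "b"), ("z", "a"), ("y", "c")]

toy_if = impact_factor(citations, pub_year, members, 2004)
```

In the toy data only three citations count (2004 papers citing `a` or `b`), and two group papers fall in the 2002-2003 window, so the result is 3 / 2.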

Topic Impact Factor

Broad Impact: Diffusion

Journal Diffusion: number of journals citing Cell divided by the total number of citations to Cell, over a given time period, times 100

Problem: relatively brittle at low citation counts. If a topic/journal is cited twice by two different topics/journals, it will have high diffusion.
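The brittleness is easy to see in a small calculation. The sketch below assumes each received citation is labelled with the journal or topic it came from; the function name and example labels are illustrative only.

```python
def diffusion(citing_sources):
    """Distinct citing journals/topics per citation, times 100.

    citing_sources: one journal or topic label per citation received.
    """
    if not citing_sources:
        return 0.0
    return 100.0 * len(set(citing_sources)) / len(citing_sources)


# Cited only twice, by two different sources: maximal diffusion.
rarely_cited = diffusion(["A", "B"])

# Cited 100 times by the same two sources: very low diffusion.
heavily_cited = diffusion(["A"] * 90 + ["B"] * 10)
```

Both examples are cited by exactly two sources, yet the rarely cited one scores 100 and the heavily cited one scores 2, so the measure rewards low citation counts rather than breadth.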

Broad Impact: Diversity

Topic Diversity: Entropy of the distribution of citing topics

Entropy is better at capturing the broad end of the impact spectrum: with diffusion, the highest-scoring topics are identical to the least frequently cited topics
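Topic diversity as defined here is the Shannon entropy of the citing-topic distribution. The sketch below assumes one topic label per received citation and uses base-2 logarithms; the slides do not state the base, so treat that choice, along with the function name, as an assumption.

```python
import math
from collections import Counter


def topic_diversity(citing_topics):
    """Shannon entropy (in bits) of the distribution of citing topics.

    citing_topics: one topic label per citation received (assumed format).
    """
    counts = Counter(citing_topics)
    total = sum(counts.values())
    return -sum(
        (c / total) * math.log2(c / total) for c in counts.values()
    )


# Two citations from two different topics: only 1 bit of diversity.
two_citations = topic_diversity(["A", "B"])

# 100 citations spread evenly over four topics: 2 bits.
broad = topic_diversity(["A"] * 25 + ["B"] * 25 + ["C"] * 25 + ["D"] * 25)

# 100 citations all from one topic: no diversity at all.
narrow = topic_diversity(["A"] * 100)
```

Unlike a ratio of distinct sources to citations, entropy does not shrink simply because a topic is cited often, and it cannot exceed the logarithm of the number of citing topics.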

Broad Impact: Diversity

Topic Diversity: Entropy of the distribution of citing topics

Topic diversity can also be measured for papers:

Longevity: Cited Half Life

Two views:

• Given a paper, what is the median age of citations to that paper?

• What is the median age of citations from current literature?
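Both views reduce to a median over citation ages, differing only in which end of the citation is held fixed. The following is a minimal sketch with hypothetical function names and invented years, assuming age is measured as the difference in publication years.

```python
from statistics import median


def cited_half_life(paper_year, citing_years):
    """View 1: median age of the citations TO a paper.

    citing_years: publication years of the papers that cite it.
    """
    return median([year - paper_year for year in citing_years])


def citing_half_life(current_year, referenced_years):
    """View 2: median age of the citations FROM current literature.

    referenced_years: publication years of the works being cited.
    """
    return median([current_year - year for year in referenced_years])


# Invented example: a 1990 paper cited in these years.
view_1 = cited_half_life(1990, [1991, 1992, 1995, 2000, 2004])

# Invented example: works referenced by papers published in 2004.
view_2 = citing_half_life(2004, [2003, 2001, 1998, 1990])
```

To obtain a topic-level figure, the same functions could be applied to the pooled citations of all documents assigned to a topic.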

History: Topical Precedence

Within a topic, what are the earliest papers that received more than n citations?

Information Retrieval (138):

• On Relevance, Probabilistic Indexing and Information Retrieval, Kuhns and Maron (1960)

• Expected Search Length: A Single Measure of Retrieval Effectiveness Based on the Weak Ordering Action of Retrieval Systems, Cooper (1968)

• Relevance feedback in information retrieval, Rocchio (1971)

• Relevance feedback and the optimization of retrieval effectiveness, Salton (1971)

• New experiments in relevance feedback, Ide (1971)

• Automatic Indexing of a Sound Database Using Self-organizing Neural Nets, Feiten and Gunzel (1982)
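A list like the one above can be produced by filtering a topic's papers on citation count and sorting by year. The sketch below uses invented placeholder records and a hypothetical `earliest_cited_papers` function; the threshold `n` and the list length `k` are free parameters.

```python
def earliest_cited_papers(papers, n, k=5):
    """Earliest papers in a topic with more than `n` citations.

    papers: iterable of (title, year, citation_count) tuples for the
            documents assigned to one topic (assumed format).
    Returns at most `k` papers, oldest first.
    """
    qualifying = [p for p in papers if p[2] > n]
    return sorted(qualifying, key=lambda p: p[1])[:k]


# Invented placeholder records, not real bibliographic data.
topic_papers = [
    ("paper A", 1975, 40),
    ("paper B", 1960, 12),
    ("paper C", 1968, 3),   # too few citations to qualify when n = 10
    ("paper D", 1971, 25),
]

precedents = earliest_cited_papers(topic_papers, n=10, k=2)
```

Raising `n` trades recall of early work for confidence that the listed papers really shaped the topic.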

Sharing: Topical Transfer