Exploring data quality and retrieval strategies for Mendeley reader counts


Exploring data quality and retrieval strategies for Mendeley reader counts

Zohreh Zahedi1, Stefanie Haustein2 & Timothy D. Bowman2

[email protected] [email protected] [email protected]

@zohrehzahedi @stefhaustein @timothydbowman

1 Leiden University, The Netherlands; 2 Université de Montréal, Canada

Metrics14 - ASIS&T SIGMET Workshop, Seattle, 5th November, 2014

• online reference management tool
• usage statistics, available via open API (a minimal reader-count query is sketched below)

• 2.8 million users, 275,860 groups, 535 million user documents (02/2014)

• 68 million unique publications (08/2012; 281 million user documents)

Mendeley statistics based on monthly user counts from 10/2010 to 02/2014 on the Mendeley website accessed through the Internet Archive
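To make the API-based retrieval concrete, below is a minimal sketch of a DOI-based reader-count lookup against the Mendeley catalog API. It is an illustration only: it assumes a valid OAuth2 access token, and the endpoint, parameters, and response fields follow the public API documentation rather than the exact procedure used in this study.

# Minimal sketch of a reader-count lookup via the Mendeley catalog API.
# Assumes a valid OAuth2 access token; endpoint, Accept header, and field
# names are illustrative and may differ from the current API documentation.
import requests

API_URL = "https://api.mendeley.com/catalog"  # catalog search endpoint
ACCESS_TOKEN = "<your-oauth2-token>"          # placeholder, obtain via OAuth2

def reader_count_by_doi(doi):
    """Return the Mendeley reader count for a DOI, or None if not found."""
    response = requests.get(
        API_URL,
        params={"doi": doi, "view": "stats"},  # 'stats' view exposes reader_count
        headers={
            "Authorization": f"Bearer {ACCESS_TOKEN}",
            "Accept": "application/vnd.mendeley-document.1+json",
        },
        timeout=30,
    )
    response.raise_for_status()
    documents = response.json()                # list of matching catalog entries
    if not documents:
        return None
    return documents[0].get("reader_count")

# example call with a placeholder DOI
# print(reader_count_by_doi("10.1234/example-doi"))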

Research Objectives

• metadata quality and its effect on retrieval

Research Objectives

• fluctuation in Mendeley coverage and readership counts over time and through different retrieval strategies (Bar-Ilan, 2014)

• altmetric studies and tools use different retrieval strategies:
• DOI API search
• title search (e.g., Webometric Analyst)

lack of systematic study to determine effect of retrieval strategy

Research Objectives

• analyzing metadata quality of Mendeley entries systematically

• testing completeness and accuracy of relevant metadata fields

• identifying and quantifying error types
• analyzing differences between retrieval strategies

determine best retrieval strategy for collecting Mendeley reader counts

Research Questions

• How accurate is the metadata on Mendeley for a random sample of publications?

• To what extent do results differ between:
• manual title search in the online catalog
• API search via DOI

• What are the most frequent error types in the bibliographic data on Mendeley?

• What retrieval strategy provides the most accurate and complete results for the sampled publications?

Data set and Method

• random sample of 2012 WoS publications: 384 of 1,873,759 documents

• manual title search via the Mendeley online catalog: n=384

• simultaneous DOI search via the Mendeley API: n=264 (−31%, i.e., documents without a DOI excluded)

• comparison of all relevant metadata fields (a comparison sketch follows this list):
• Author
• DOI
• ISSN
• Issue
• Pages
• Source
• Title
• Volume
• Year
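As a rough illustration of the metadata comparison, the sketch below checks each Web of Science record against the retrieved Mendeley entry field by field and records whether each field is identical, different, or missing. Field names and the simple normalization rule are assumptions for illustration, not the authors' exact procedure.

# Illustrative sketch of the field-by-field comparison between a WoS record
# and the retrieved Mendeley entry. Field keys and the normalization rule
# are assumptions, not the study's exact implementation.
FIELDS = ["author", "doi", "issn", "issue", "pages",
          "source", "title", "volume", "year"]

def normalize(value):
    """Lower-case and strip surrounding whitespace for a tolerant comparison."""
    return str(value).strip().lower() if value is not None else ""

def compare_records(wos_record, mendeley_record):
    """Return a dict mapping each field to 'identical', 'different', or 'missing'."""
    result = {}
    for field in FIELDS:
        wos_value = normalize(wos_record.get(field))
        mendeley_value = normalize(mendeley_record.get(field))
        if not mendeley_value:
            result[field] = "missing"
        elif wos_value == mendeley_value:
            result[field] = "identical"
        else:
            result[field] = "different"
    return result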

Results: overview

[Venn diagram: documents found by manual title search vs. API DOI search — manual title search: n=384 searched, 47.4% found; API DOI search: n=264 searched, 91.3% found, 2 false positives]

Results: overview

                                     documents          reader counts
                                       N      %          N      %       +
identical reader counts              103   36.4        975   41.1       0
  identical                          102   36.0        975   41.1       0
  identical, both 0                    1    0.4          0    0.0       0
API higher                           111   39.2        752   31.7     718
  API higher                          10    3.5        204    8.6     170
  API higher, manual not found        80   28.3        548   23.1     548
  API 0, manual not found             21    7.4          0    0.0       0
manual higher                         69   24.4        644   27.2     563
  manual higher                       21    7.4        379   16.0     298
  manual higher, API not found        40   14.1        242   10.2     242
  manual higher, API 0                 6    2.1         23    1.0      23
  manual 0, API not found              2    0.7          0    0.0       0
all documents                        283  100.0      2,371  100.0   1,281

Results: incorrect metadata

Share of correct and incorrect metadata per field, by retrieval strategy (title search: n=182; DOI search: n=241):

Field     Title search (correct / incorrect)   DOI search (correct / incorrect)
Author    93% / 7%                              94% / 6%
DOI       92% / 4%                              100%* / 0%*
ISSN      87% / 13%                             32% / 68%
Issue     90% / 6%                              83% / 10%
Pages     80% / 14%                             83% / 10%
Source    73% / 27%                             76% / 24%
Title     85% / 15%                             82% / 18%
Volume    94% / 6%                              91% / 7%
Year      99% / 1%                              99% / 1%

*the API DOI search retrieved two false positives, which are not included in this analysis

Results: error types

[Figure: distribution of error types for title search and DOI search]

Conclusions

• errors in fields commonly used for matching (title search / DOI search):

• Title: 15% / 18%
• First author: 7% / 6%
• Year: 1% / 1%

• Source (27% / 24%), ISSN (13% / 68%), Volume (6% / 7%), Issue (6% / 10%), and Pages (14% / 10%) should not be used for matching

• special characters produce most errors; removing them would resolve a large share of errors (see the normalization sketch below):
• Title: 81% / 84%
• First author: 67% / 73%
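A minimal sketch of the "remove special characters" normalization suggested above: strip accents and non-alphanumeric characters before comparing titles or author names. The exact character set to remove is an assumption.

# Sketch of the "remove special characters" idea from the conclusions:
# normalize accents, keep only letters, digits, and spaces, collapse
# whitespace. The exact rule is an assumption for illustration.
import re
import unicodedata

def strip_special_characters(text):
    """Drop accents and non-alphanumeric characters, collapse whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^0-9a-zA-Z ]+", " ", text)        # remove special characters
    return re.sub(r"\s+", " ", text).strip().lower()   # collapse spaces, lower-case

# e.g. strip_special_characters("β-catenin signalling: a review?")
#      -> "catenin signalling a review"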

Conclusions

• results of retrieval strategies:

• manual title search: 182 (64%) documents & 1,653 readers
• API DOI search: 241 (85%) documents & 1,808 readers
• combined: 283 documents & 2,371 readers (max) / 2,486 readers (sum)

• DOI search found 101 (36%) additional documents, but:
• could not be applied to 120 (31%) documents without a DOI
• did not retrieve 42 (15%) documents found by title search
• led to 2 (1%) false positives

combination of DOI and title search with special characters removed (a sketch follows below)
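Putting the recommendation together, here is a hedged sketch of the combined strategy: query by DOI where one exists, fall back to a title search on the cleaned title, and accept a title hit only when the normalized titles match exactly, to guard against false positives. search_by_doi and search_by_title are hypothetical stand-ins for catalog look-ups such as the one sketched earlier.

# Hedged sketch of the recommended combined strategy: DOI first, then a
# title search with special characters removed, accepting only exact
# normalized-title matches. The helper functions passed in are hypothetical.
import re

def normalize_title(title):
    """Drop special characters, collapse whitespace, lower-case."""
    title = re.sub(r"[^0-9a-zA-Z ]+", " ", title)
    return re.sub(r"\s+", " ", title).strip().lower()

def combined_reader_count(record, search_by_doi, search_by_title):
    """Return a reader count for a WoS record, or None if no reliable match."""
    doi = record.get("doi")
    if doi:
        hit = search_by_doi(doi)              # exact DOI match, if a DOI exists
        if hit is not None:
            return hit.get("reader_count")
    clean_title = normalize_title(record["title"])
    for hit in search_by_title(clean_title):  # fallback: title search
        if normalize_title(hit.get("title", "")) == clean_title:
            return hit.get("reader_count")    # accept only exact normalized match
    return None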

Thank you for your attention!