Exploring data quality and retrieval strategies for Mendeley reader counts
Exploring data quality and retrieval strategies for Mendeley reader counts
Zohreh Zahedi1, Stefanie Haustein2 & Timothy D. Bowman2
[email protected] [email protected] [email protected]
@zohrehzahedi @stefhaustein @timothydbowman
1 Leiden University, The Netherlands; 2 Université de Montréal, Canada
Metrics14 - ASIS&T SIGMET Workshop, Seattle, 5th November, 2014
• online reference management tool
• usage statistics, available via open API
• 2.8 million users, 275,860 groups, 535 million user documents (02/2014)
• 68 million unique publications (08/2012; 281 million user documents)
Mendeley statistics based on monthly user counts from 10/2010 to 02/2014 on the Mendeley website accessed through the Internet Archive
Research Objectives
• fluctuation in Mendeley coverage and readership counts over time and across different retrieval strategies (Bar-Ilan, 2014)
• altmetric studies and tools use different retrieval strategies:
  • DOI search via the API
  • title search (e.g., Webometric Analyst)
→ lack of a systematic study to determine the effect of the retrieval strategy
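The DOI-based strategy can be illustrated with a short sketch. The code below only constructs the lookup URL for a DOI-based catalog query; the endpoint path, the `doi`/`view` query parameters, and the bearer-token header are assumptions about the Mendeley API, not details taken from the slides.

```python
from urllib.parse import urlencode

# Assumed Mendeley catalog endpoint (illustrative, not confirmed by the slides)
MENDELEY_CATALOG = "https://api.mendeley.com/catalog"

def catalog_request_url(doi: str, view: str = "stats") -> str:
    """Build the URL for a DOI-based catalog lookup.

    view="stats" is assumed here to request readership statistics.
    """
    return f"{MENDELEY_CATALOG}?{urlencode({'doi': doi, 'view': view})}"

url = catalog_request_url("10.1000/example.doi")
# An authenticated request would then look roughly like (token handling omitted):
# requests.get(url, headers={"Authorization": f"Bearer {token}"})
```

The title-search strategy, by contrast, queries the online catalog with the (cleaned) publication title instead of a DOI.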
Research Objectives
• analyzing metadata quality of Mendeley entries systematically
• testing completeness and accuracy of relevant metadata fields
• identify and quantify error types
• analyze differences between retrieval strategies
→ determine the best retrieval strategy for collecting Mendeley reader counts
Research Questions
• How accurate is the metadata on Mendeley for a random sample of publications?
• To what extent do results differ between:
  • manual title search in the online catalog
  • API search via DOI
• What are the most frequent error types in the bibliographic data on Mendeley?
• What retrieval strategy provides the most accurate and complete results for the sampled publications?
Data set and Method
• random sample of 2012 WoS publications: 384 of 1,873,759 documents
• manual title search via Mendeley online catalog n=384
• DOI search via Mendeley API simultaneously: n=264 (−31%)
• comparison of all relevant metadata fields: Author, DOI, ISSN, Issue, Pages, Source, Title, Volume, Year
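The field-by-field comparison can be sketched as follows. The record structure and field names are illustrative; this is not the authors' actual matching code.

```python
# Metadata fields compared between the WoS record and the Mendeley entry
FIELDS = ["author", "doi", "issn", "issue", "pages", "source", "title", "volume", "year"]

def compare_records(wos: dict, mendeley: dict) -> dict:
    """Return per-field agreement between two bibliographic records.

    Missing fields count as disagreement; comparison is case-insensitive.
    (Illustrative sketch under assumed record layout.)
    """
    result = {}
    for field in FIELDS:
        a, b = wos.get(field), mendeley.get(field)
        result[field] = (
            a is not None and b is not None
            and str(a).strip().lower() == str(b).strip().lower()
        )
    return result

# Toy example: title and year agree, DOI is missing on the Mendeley side
agreement = compare_records(
    {"title": "Exploring Data Quality", "year": "2012", "doi": "10.1000/x"},
    {"title": "exploring data quality", "year": "2012"},
)
```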
Results: overview

| category | documents N | documents % | readers N | readers % | reader diff |
|---|---|---|---|---|---|
| identical reader counts | 103 | 36.4 | 975 | 41.1 | 0 |
| · identical | 102 | 36.0 | 975 | 41.1 | 0 |
| · identical, both 0 | 1 | 0.4 | 0 | 0.0 | 0 |
| API higher | 111 | 39.2 | 752 | 31.7 | 718 |
| · API higher | 10 | 3.5 | 204 | 8.6 | 170 |
| · API higher, manual not found | 80 | 28.3 | 548 | 23.1 | 548 |
| · API 0, manual not found | 21 | 7.4 | 0 | 0.0 | 0 |
| manual higher | 69 | 24.4 | 644 | 27.2 | 563 |
| · manual higher | 21 | 7.4 | 379 | 16.0 | 298 |
| · manual higher, API not found | 40 | 14.1 | 242 | 10.2 | 242 |
| · manual higher, API 0 | 6 | 2.1 | 23 | 1.0 | 23 |
| · manual 0, API not found | 2 | 0.7 | 0 | 0.0 | 0 |
| all documents | 283 | 100.0 | 2,371 | 100.0 | 1,281 |
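Combining the two strategies per document can be sketched as taking the higher reader count wherever both strategies found a record (mirroring the "max" combination in the overview; the function itself is an illustration, not the authors' code):

```python
def combine_reader_counts(title_search: dict, doi_search: dict) -> dict:
    """Merge per-document reader counts from two retrieval strategies,
    keeping the higher count when both strategies found the document."""
    combined = {}
    for doc_id in set(title_search) | set(doi_search):
        combined[doc_id] = max(title_search.get(doc_id, 0), doi_search.get(doc_id, 0))
    return combined

# Toy example: each strategy misses one of three documents
manual = {"doc1": 5, "doc2": 3}
api = {"doc2": 7, "doc3": 2}
combined = combine_reader_counts(manual, api)  # {'doc1': 5, 'doc2': 7, 'doc3': 2}
```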
Results: incorrect metadata

| field | title search correct (n=182) | title search incorrect | DOI search correct (n=241) | DOI search incorrect |
|---|---|---|---|---|
| Author | 93% | 7% | 94% | 6% |
| DOI | 92% | 4% | 100%* | 0%* |
| ISSN | 87% | 13% | 32% | 68% |
| Issue | 90% | 6% | 83% | 10% |
| Pages | 80% | 14% | 83% | 10% |
| Source | 73% | 27% | 76% | 24% |
| Title | 85% | 15% | 82% | 18% |
| Volume | 94% | 6% | 91% | 7% |
| Year | 99% | 1% | 99% | 1% |

* The API DOI search retrieved two false positives, which are not included in this analysis.
Conclusions
• errors in fields commonly used for matching (title search / DOI search):
  • Title: 15% / 18%
  • First author: 7% / 6%
  • Year: 1% / 1%
• source (27% / 24%), ISSN (13% / 68%), volume (6% / 7%), issue (6% / 10%), and page number (14% / 10%) should not be used for matching
• special characters produce most errors; removing them would resolve a large share of errors:
  • Title: 81% / 84%
  • First author: 67% / 73%
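The special-character cleaning suggested above can be sketched as a normalization step applied before matching. The slides do not specify the exact cleaning the authors used, so this function is an assumption:

```python
import re
import unicodedata

def normalize(text: str) -> str:
    """Lowercase, drop accents, and remove all non-alphanumeric characters
    so that special-character differences no longer block matching.
    (Illustrative sketch of the cleaning step, not the authors' method.)"""
    decomposed = unicodedata.normalize("NFKD", text)       # split accents off letters
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]", "", ascii_only.lower())

# Titles differing only in accents and punctuation now match:
normalize("Naïve Bayes: A Re-Evaluation") == normalize("naive bayes a reevaluation")  # True
```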
Conclusions
• results of the retrieval strategies:
  • manual title search: 182 (64%) documents & 1,653 readers
  • API DOI search: 241 (85%) documents & 1,808 readers
  • combined: 283 documents & 2,371 (max) / 2,486 (sum) readers
• DOI search found 101 (36%) additional documents, but:
  • could not be applied to 120 (31%) documents without a DOI
  • did not retrieve 42 (15%) documents found by title search
  • led to 2 (1%) false positives
→ combination of DOI and title search without special characters