Towards a Semantic Citation Index for the German Social Sciences
William Dinkel, Philipp Mayr, Frank Sawitzky, Andreas Strotmann*
GESIS – Leibnizinstitut für Sozialwissenschaften, Köln
*alphabetic ordering of names
The Problem
● German sociology / political science research output / impact coverage in SSCI
– SOLIS: ~1/3 each of books, journal articles, chapters
● Cover ~50% of German researchers' “relevant” output*
– ~1/3 of core journals covered in SSCI**
– So, ~10% of literature indexed there
– Very low percentage of cited literature indexed in SSCI***
● * Research rating exercise Sociology, Wissenschaftsrat
● ** compared to SOLIS “class A” journals
● *** Chi (IfQ) study of core German political science journals
The Problem (ctd.)
● Citation culture in the social sciences
– Citations are important
● Perhaps even more so than in the natural sciences
– Some authors are extremely highly cited (Weber, Marx...)
● Suspect very high(!!) Gini coefficient in distribution
● But: it is their books (not articles) that are highly cited!
– Significant fraction of citations are contrastive
– Datasets (survey results) highly mentioned, not cited
– Multilingual citation environment
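The suspected concentration of citations on a few authors can be made measurable with a Gini coefficient over per-author (or per-work) citation counts. The following is a minimal sketch, not part of the original slides: the function name `gini` and the sample counts are made up for illustration.

```python
def gini(counts):
    """Gini coefficient of non-negative citation counts.

    0.0 means every item is cited equally often; values close to 1.0
    mean that a few items collect almost all citations.
    """
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Standard formula on ascending-sorted values x_1..x_n:
    #   G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    weighted = sum(rank * x for rank, x in enumerate(values, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


# Made-up citation counts, for illustration only.
equal_counts = [10, 10, 10, 10]
skewed_counts = [0, 0, 0, 1000]

print(gini(equal_counts))   # evenly spread citations
print(gini(skewed_counts))  # one item collects everything
```

With n items the maximum value of this estimator is (n - 1) / n, so small samples never reach exactly 1.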
The Need
● German social scientists & SSCI
– They consider their field inadequately represented in “the” citation index
– But use it quite heavily anyway
● e.g. for research, evaluation
● Survey of sociologists and political scientists, GESIS
The Need (ctd.)
● We need a citation index for the (German) social sciences
– Existing citation indexes frankly inadequate
● No reasonable effort in sight to resolve this
– Hence, we need to build our own
● If we want to do serious bibliometrics on SocSci
● If we want to provide a decent social science citation index in, e.g., sociology or political science
The Need (ctd.)
● We need an open semantic citation index for the (German) social sciences
– Incorporate referential semantics into search engine
● e.g., reliable hyperlinks to referenced articles
● e.g., equivalence or hierarchy relations for translations, aggregations
– Publish referential semantics as linked open data
● Allow other institutions to discover references to their holdings in our database(s)
● Invite them to offer the same service to us, too
– Bibliometrics requires cleaned/disambiguated data!
The Long-Term Goal
A globally distributed open semantic citation index
● Based on digital full-text collections (cooperate with publishers)
– Semi-automatic / Computer-aided
– Algorithms + professional indexers (authority files) + crowd sourcing +...
● Reference extraction (with contexts)
– Enables sentiment analysis (important in social sciences)
● Reference matching
– Enables referential semantics
● Open reference semantics information exchange
– „<this> paper indexed in our collection cites <that> paper indexed in yours“
Sowiport – German Social Sciences Research Information
● GESIS' Sowiport portal: Single access point to 18 databases, including
– 6 Cambridge Scientific Abstracts databases on social sciences
– GESIS' own SOLIS (literature) and SOFIS (projects) RISs
– SSOAR (Social Science Open Access Repository) @ GESIS
● Goal: Extend to social science citation index
– CSA comes with cited refs for some docs
– SSOAR – extract refs from OA full text and index in Sowiport
– Extract links to data sets / surveys used but not cited from full texts
– Crawl Google Scholar for citations to “our” docs
– Link to/from RePEc (and other) data ...
First Steps: National CSA Social Sciences Citation Index
● Cambridge Scientific Abstracts – Social Sciences
– 6 CSA databases offered & run by GESIS
● National research licence for Germany
– Include >8 million references
● A good starting point
● Recently activated in Sowiport
● ~25-30% refs found to link to other records
– Using simple matching algorithm
– Biased towards precision (>90%), not recall
First Steps: CSA Reference Matching
Reference matching is much(!) harder in social sciences
● Social science publication culture
– Books & chapters, and articles
● Published in roughly equal numbers, books cited most
– Multilingual publishing
● English is not the only language
● Publications may be cited in translation, different editions
– Broad referencing behaviour
● Large proportion of references to non-source items
=> A first-try high-precision match rate of ~25-30% is an excellent result
● Close to expected rate of references to journal articles
CSA References in GESIS' Sowiport Database
● Each full record contains „references“ and „cited-by“ information
– Some with actionable links to full records
● Combines WoS/Scopus and Google Scholar approaches to citation index construction
First Steps: Citation Extraction
● SSOAR full texts
– First successful experiments to extract references from full text
● Based on RePEc's ParsCit
● Extended to German citation styles
– First successful experiments to identify acknowledgments of large surveys in text
Next Steps: “Haus der Sozialwissenschaften”
● Goal: Digital Special Collection for German Social Scientists
– Digital access to full literature in one place
● Large parts unfortunately only accessible in-house
● Collect existing digital versions from “all” sources
● Digitize “important” literature where necessary
● Full text of literature, survey data, project descriptions...
● Joint DFG application with Sondersammelgebiet Sozialwissenschaften, Univ.- & Stadt-Bibl. Köln
Next Step: “GESIS Application Laboratory Web 3.0”
● Full text collection and processing results available in toto to visiting researchers
– Social scientists
– Computer scientists
– Computational linguists
– Bibliometricians: You are invited!!!
● Upgrade database
– e.g. disambiguation of authors, institutions, titles
– e.g. incorporation of external authority files / semantic web
Experiment: E-Traces
● Goal: Tracking ideas through the sociology literature (“text re-use”)
– Experiment (ongoing): attempt to categorize citation contexts as positive/neutral/negative (sentiment analysis)
– BMBF funded project with U Leipzig, U Göttingen
● Long term use: identify negative citations and contrastive co-citations for social science citation index
Summary
● For GESIS' core covered social sciences (German sociology, political science), traditional citation indexes are inadequate
● and Google Scholar only provides “cited by” info
● Yet, GESIS' core audience uses them
● and complains about their inadequacies
● Bibliometrics requires an adequate citation index for reliable results (given typical distributions)
● but no improvements in sight for classic indexes
● Therefore, we need to build our own
● and we have the expertise at GESIS to succeed where others have failed
● and we have taken the first few steps in this direction
Summary (ctd.)
● In the long run, we would like
– A citation index that is
● Semantic (with explicit referential semantics)
● Distributed (each institution builds their own)
● Open (each institution shares semantics as LOD)
● Global (implemented world wide)
● Cooperative (indexers+researchers contribute)
● Computer-aided (software to get started, people to improve)
– Based on best practices we hope to develop
Two Models of Citation Graphs
Bipartite (Classic IR) Model: Citing and Cited Partitions
• Citing nodes: full bibliographic records
• Cited nodes: „keys“, e.g.
– First author name & initials + Year of publication + Journal key + volume + number + page
Uniform Model: Interconnected Documents
• All nodes: bibliographic records
– Citing nodes full records
– Cited nodes mostly simplified records
– „Matched“ cited nodes have full records
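The two models above can be sketched as data structures. This is an illustrative sketch only: the class names `CitedKey` and `Record`, the helper functions `citing_docs` and `cited_by`, and all sample values are assumptions made for this example, not structures used by any of the databases discussed.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass(frozen=True)
class CitedKey:
    """Bipartite model: a cited node is only a key, not a record."""
    first_author: str  # family name plus initials
    year: int
    journal_key: str
    volume: str
    number: str
    page: str


@dataclass
class Record:
    """Uniform model: every node is a bibliographic record."""
    record_id: str
    full: bool  # True for citing records and for "matched" cited records
    cites: List[str] = field(default_factory=list)  # ids of cited records


# Bipartite model: citing document id -> set of cited keys (hypothetical data).
key = CitedKey("Smith J", 2001, "EXJ", "12", "3", "45")
bipartite: Dict[str, Set[CitedKey]] = {"doc-a": {key}, "doc-b": {key}}


def citing_docs(graph: Dict[str, Set[CitedKey]], cited: CitedKey) -> List[str]:
    """All citing documents that reference the given key."""
    return sorted(doc for doc, keys in graph.items() if cited in keys)


# Uniform model: "ref-1" is a simplified (unmatched) record, while "doc-b"
# is both a citing document and a matched cited document.
uniform: Dict[str, Record] = {
    "doc-a": Record("doc-a", full=True, cites=["doc-b", "ref-1"]),
    "doc-b": Record("doc-b", full=True, cites=["ref-1"]),
    "ref-1": Record("ref-1", full=False),
}


def cited_by(graph: Dict[str, Record], record_id: str) -> List[str]:
    """All records that cite the given record."""
    return sorted(r.record_id for r in graph.values() if record_id in r.cites)


print(citing_docs(bipartite, key))
print(cited_by(uniform, "doc-b"))
```

In the uniform sketch a document can appear on both sides of a citation, which the bipartite model cannot express without first matching keys to records.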
Citation Matching
• Goal: Citation network
– Unique nodes for documents
• Sub tasks:
– Match cited references to each other
– Match cited references to full records
– Match full records across databases
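Each of the three sub tasks yields pairs of items judged to describe the same document; merging those pairs gives the unique nodes of the citation network. One common way to do the merging is a union-find structure. The sketch below is an assumption about how this could be done, not the method used in the experiments; the identifiers are hypothetical.

```python
class UnionFind:
    """Minimal union-find (disjoint sets) with path halving."""

    def __init__(self):
        self.parent = {}

    def find(self, item):
        self.parent.setdefault(item, item)
        while self.parent[item] != item:
            # Path halving: point item at its grandparent, then step up.
            self.parent[item] = self.parent[self.parent[item]]
            item = self.parent[item]
        return item

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def unique_nodes(items, matches):
    """Group items into document nodes, given pairwise match decisions."""
    uf = UnionFind()
    for item in items:
        uf.find(item)  # register items that never match anything
    for a, b in matches:
        uf.union(a, b)
    groups = {}
    for item in items:
        groups.setdefault(uf.find(item), set()).add(item)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical identifiers: "ref:" = cited reference, "rec:" = full record.
items = ["ref:1", "ref:2", "ref:3", "rec:A", "rec:B@db2"]
matches = [
    ("ref:1", "ref:2"),      # cited references matched to each other
    ("ref:2", "rec:A"),      # cited reference matched to a full record
    ("rec:A", "rec:B@db2"),  # full records matched across databases
]
print(unique_nodes(items, matches))
```

Because matches are merged transitively, a single false positive can fuse two documents into one node, which is one reason to favor precision in the pairwise matching.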
Matching Citations to Full Records
„Internal“ matching
● Direct access to full database(s)
● Options: match key based or algorithmic matching
„External“ matching
● Access only via search engine
● Options: matching against same or different database
Scopus Citations
• Cited reference info contains
– Up to 8 author names (family+inits)
• Including last author
• Frequently as cited (not standardized or corrected)
– Publication year, title, journal name/vol./nr./p.
• Frequently as cited
– Reasonably well parsable, not normalized
Matching Scopus Citations to Scopus Full Records
External matching: Scopus search engine
● „Algorithm“: parse Scopus reference into subfields, construct complex search queries for Scopus engine, download resulting full records, choose best fit
● High precision searches: complex searches allowed, many searchable fields
– Improve recall by successively vaguer queries
● Small number of downloads allowed, so many queries needed to construct sizable citation index
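The idea of "successively vaguer queries" can be sketched as a loop that drops fields from the query until something is found and then picks the best-fitting hit. This is a sketch under assumptions: the field names, the relaxation levels in `LEVELS`, the `overlap` score, and the in-memory `toy_search` function are all hypothetical stand-ins and do not reflect the actual Scopus query syntax or API.

```python
# Query levels from most specific to vaguest (an assumed ordering).
LEVELS = [
    ("first_author", "year", "title", "journal", "volume", "page"),
    ("first_author", "year", "title"),
    ("first_author", "year"),
]


def overlap(ref, rec):
    """Number of parsed reference fields that agree with a candidate record."""
    return sum(1 for name, value in ref.items() if value and rec.get(name) == value)


def match_reference(ref, search):
    """Try strict queries first, relax on failure, return (best_hit, level)."""
    tried = []
    for level, names in enumerate(LEVELS):
        query = {name: ref[name] for name in names if ref.get(name)}
        if not query or query in tried:
            continue  # nothing to ask, or identical to an earlier query
        tried.append(query)
        hits = search(query)
        if hits:
            return max(hits, key=lambda rec: overlap(ref, rec)), level
    return None, None


# Toy index and search function standing in for the external search engine.
INDEX = [
    {"id": "r1", "first_author": "Smith J", "year": "2001", "title": "Stem cells",
     "journal": "Ex J", "volume": "12", "page": "45"},
    {"id": "r2", "first_author": "Smith J", "year": "2001", "title": "Other work",
     "journal": "Ex J", "volume": "12", "page": "99"},
]


def toy_search(query):
    return [rec for rec in INDEX if all(rec.get(k) == v for k, v in query.items())]


# Reference "as cited", with a wrong page number.
cited = {"first_author": "Smith J", "year": "2001", "title": "Stem cells",
         "journal": "Ex J", "volume": "12", "page": "54"}
best, level = match_reference(cited, toy_search)
print(best["id"], level)
```

Skipping repeated queries matters when the number of downloads is limited, since a sparsely parsed reference would otherwise send the same query at several levels.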
Matching Scopus Citations to PubMed Full Records
CrossDB External Match: Scopus/Medline
● „Algorithm“: parse Scopus reference, construct PubMed batch citation matcher queries, download matched PubMed(!) records
– Only for biomedical fields
– Result is a citation network of PubMed records, not Scopus
– Requires matching of Scopus citing records as well
● Either direction (Scopus<->PubMed)
● Both include PubMed IDs
Matching Web of Science References to WoS Full Records
WoS cited reference info contains
● First author (last name plus initials)
● Publication year
● Source title code
● Vol./num./page
● More and more frequently DOI
No title included!
Matching WoS Cited References to WoS Record
External matching via WoS web search
● Only small queries supported
– Many downloads necessary
● Crucial search fields not supported (vol., num.)
– Therefore highly ambiguous results to be expected
● Requires translation of source title from code to full title
● Requires algorithmic filtering of correct hit from long result list
Matching WoS references to WoS
● Internal Matching
● Kompetenzzentrum Bibliometrie has full local copy of WoS data
● Experiment: good „match key“ to support this?
– Dinkel (2011), ISSI
– Results in error estimates for references
Building a Citation Index for the Social Sciences: CSA
● Basis: Cambridge Scientific Abstracts (Social Sciences)
– To be extended with additional sources of cited refs info
● Nationwide licensing scheme for Germany administered at GESIS
● Six CSA/ProQuest databases incorporated into GESIS' „Sowiport“ social sciences portal
– Now including ~8.5 million cited references
● No matchings to full records provided by ProQuest
● Early experimental results available on portal
– Focus on precision, not recall
Citation Matching in CSA
„Algorithm“:
● Internal matching
– However, across multiple CSA databases
● Parse references; construct search queries (Solr)
– exact title and year
– or fuzzy title and year and ISSN;
– choose first match
● Favors precision over recall
– Fuzzy match only for journal literature, for example
● Research to be continued!
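The matching steps above can be sketched in plain Python. Note the assumptions: the slides describe Solr queries, whereas this sketch works on an in-memory list and uses `difflib.SequenceMatcher` as a stand-in for Solr's fuzzy matching; the function names, the field names, the title normalization, and the 0.9 similarity threshold are all hypothetical choices for illustration.

```python
from difflib import SequenceMatcher


def normalize(title):
    """Lowercase, trim surrounding spaces/periods, collapse whitespace."""
    return " ".join(title.lower().strip(" .").split())


def match_csa(ref, records, threshold=0.9):
    """Return the id of the first matching record, or None.

    Step 1: exact (normalized) title and year.
    Step 2: fuzzy title and year and ISSN, which only applies to
            references that carry an ISSN, i.e. journal literature.
    """
    title = normalize(ref["title"])
    year = ref["year"]
    for rec in records:
        if rec["year"] == year and normalize(rec["title"]) == title:
            return rec["id"]  # choose first match
    issn = ref.get("issn")
    if issn:
        for rec in records:
            if rec["year"] != year or rec.get("issn") != issn:
                continue
            similarity = SequenceMatcher(None, title, normalize(rec["title"])).ratio()
            if similarity >= threshold:
                return rec["id"]  # choose first match
    return None


# Hypothetical records and references.
records = [
    {"id": "A", "title": "Citation analysis in sociology", "year": "2005", "issn": "1234-5678"},
    {"id": "B", "title": "No More Abuse", "year": "2000"},
]
book_ref = {"title": "No More Abuse.", "year": "2000"}
typo_ref = {"title": "Citation analysis in sociolgy", "year": "2005", "issn": "1234-5678"}
typo_ref_no_issn = {"title": "Citation analysis in sociolgy", "year": "2005"}

print(match_csa(book_ref, records))
print(match_csa(typo_ref, records))
print(match_csa(typo_ref_no_issn, records))
```

Requiring the ISSN before allowing a fuzzy title match is what keeps the sketch biased towards precision: a misspelled book title is simply left unmatched.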
Experiments - Datasets
Caveat
● Scopus/PubMed and WoS experiments run on stem cell research field (biomedical area)
– < 100k citing docs, ~1 million references
– >95% refs are to journal articles
● CSA experiment run on social sciences databases
– ~1 million full records, ~10 million references
● Only recent records contain refs
● Many(!!) refs to non-journal articles
Some Rough Numbers
● Scopus ↔ PubMed full record matching
– >95% match rate
● Scopus references → Scopus/PubMed full record
– ~90% match rate „exact“ + ~5% fuzzy match
– ~1% false positives needed to be filtered out
● WoS references → WoS full record
– ~90% match rate
– >>50% false positives needed to be filtered out
● CSA references → CSA full record
– ~30% match rate
– ~1% false positives
CSA reference information
● Fields: citing ID, reference ID, authors, title, year, publisher, source title/num/vol/p., ISSN
– Format changes, though
● Mostly automatically parsed, as fields frequently mis-assigned
● Example (book):
<CI>200601317</CI><CA>Voice UK</CA>
<CT>No More Abuse.</CT><CY>2000</CY>
<CZ>Derby: Voice UK</CZ>
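A record in the tagged form shown above can be split into fields with a small parser. This is a sketch under assumptions: it only handles flat two-letter tags starting with "C" as they appear in the example, it keeps the tag names as dictionary keys rather than guessing their official meaning, and the `plausible_year` check is a made-up illustration of how mis-assigned fields might be spotted.

```python
import re

# Matches <Cx>content</Cx> pairs where x is one uppercase letter.
TAG_RE = re.compile(r"<(C[A-Z])>(.*?)</\1>", re.DOTALL)


def parse_reference(raw):
    """Return a dict mapping each two-letter tag to its stripped content."""
    return {tag: value.strip() for tag, value in TAG_RE.findall(raw)}


def plausible_year(fields):
    """Cheap sanity check: the CY field should hold a four-digit year."""
    return re.fullmatch(r"\d{4}", fields.get("CY", "")) is not None


RAW = (
    "<CI>200601317</CI><CA>Voice UK</CA>\n"
    "<CT>No More Abuse.</CT><CY>2000</CY>\n"
    "<CZ>Derby: Voice UK</CZ>"
)

fields = parse_reference(RAW)
print(fields)
print(plausible_year(fields))
```

Since the slides note that the format changes, a real parser would need to tolerate missing or additional tags; the dictionary-based result makes that straightforward.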
Discussion
● Plenty of research opportunities to improve matching of non-journal literature references to source records
– e.g. to GESIS' own SOLIS / SOFIS / SSOAR databases
– e.g. by crawling Google Scholar for reference links
– You are invited to try your hand at this, too!
● See below: GESIS Application Laboratory