Evaluation of INCISO: A system for automatic elaboration of a Citation Index in Social Science Spanish Journals. Thomas Krichel, LIU & НГУ, 2007-10-16

Evaluation of INCISO: A system for automatic elaboration of a Citation Index in Social Science Spanish Journals

Thomas Krichel

LIU & НГУ

2007-10-16

Authors part 1

• José Manuel Barrueco Cruz
  – Social Sciences Libraries, University of Valencia, Tarongers Campus, 46022 Valencia, Spain

[email protected]

• Pedro Blesa Pons
  – Department of Information Systems and Computation, Polytechnic University of Valencia, Valencia, Spain

[email protected]

Authors part 2

• Thomas Krichel
  – College of Information and Computer Science, Long Island University, 720 Northern Boulevard, Brookville 11548-1300, USA
  – Faculty of Information Technology, Novosibirsk State University, 2 Pirogova Street, Novosibirsk, Russia

[email protected]

Authors part 3

• Julia Osca-Lluch
  – Institute of the History of Sciences and Documentation López Piñero (University of Valencia-CSIC), 46010 Valencia, Spain

[email protected]

• Elena Velasco
  – Department of Information Systems and Computation, Polytechnic University of Valencia, Valencia, Spain

[email protected]

Acknowledgments

• The research was supported by grant HUM2004-05532 from the Spanish Science and Education Ministry.

• The grant pays for my expenses here.

• I paid for my own travel.

vocabulary

• The terms "citation" and "reference" are very often used as synonyms.

• This usage is only partly correct.

reference

• A reference is what you find in the "References" section at the end of an academic paper.

• It is a string that describes another academic paper.

• The idea of an academic paper itself has become more vague, with the same paper appearing in different versions.

citation

• A citation is a link between two papers.

• To find citations, we need some way of identifying papers.

• We need to find the references in papers.

• We need to determine which paper a reference points to.

why bother?

• Citation indexes are useful for two purposes
  – to navigate the literature...
  – to assess the literature...

• The first function is in decline.

• The second function is becoming more important.

to navigate the literature

• A citation index lets us move between papers. It lets us find papers that are related to a given paper without having to use key terms.

• Use of a citation index is vital in surveying work, when we want to be sure no important paper has been left out.

to assess the literature

• One measure of a paper's success is the use of the paper by others, as measured by the number of citations it receives.

• How good this measure is remains generally disputed.

• But measuring impact through citations has been pretty much the only game in town.

supply of citation indexes

• The Institute for Scientific Information has been holding a virtual monopoly on large, quality-controlled citation indexes.

• Recently, Elsevier's Scopus system has emerged as a competitor.

• Neither of the two sources is freely available. But that's not the only problem.

quality problems

• The quality of the ISI data, last time I looked at it, was terrible.

• It was no more than a glorified reference list, since no effort was made to find citations.

• Author first names in references were cut, making it difficult to find references to an author.

limited coverage

• The two commercial indexes have limited coverage of non-English materials.

• Generally, they fare poorly on any off-mainstream work. They exclude much of the work written in languages other than English.

• Such work is therefore difficult to evaluate.

alternative projects 1

• CiteSeer was the first automated citation index.

• It uses an automated procedure. Starting from seed papers, it extracts references.

• If a reference contains a link, it follows the link to the next paper, and so on.

alternative projects 2

• In 2000, José Manuel Barrueco Cruz used a modified version of the CiteSeer code to build CitEc, a citation index for RePEc.

• This index is freely available. While it does not cover all of RePEc, it is large enough to do scientometric studies with. Ask me about it.

alternative projects 3

• Google Scholar (apparently) has a set of quality-controlled sources, mainly from participating publishers and/or intermediaries.

• It extracts references from these sources and links them together.

• It is popular with users.

INCISO

• Stands for "índice de ciencias sociales" (social sciences index).

• Aims to build a tool to parse citations from Spanish social science journals.

• The index is supposed to be made public on the project web site http://inciso.openlib.org

system aims

• Some of the aims are
  – It should in principle be multi-disciplinary.
  – It should be freely available.
  – It should be based on open source software as much as possible.
  – It should run autonomously and continuously.

• Many of these aims are not reached.

system evaluation

• The system is not meant for end-users so a user evaluation makes no sense.

• Likewise, performance evaluation through speed, disk space usage and other such criteria appears useless.

• Instead, we use criteria from information retrieval (IR) evaluation.

IR based evaluation

• We compare the results of the system with what a human would find.

• What the human has found is assumed to be true.

• Then we can use
  – recall = correct features found by inciso divided by correct features found by human
  – precision = correct features found by inciso divided by features found by inciso.
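The two measures can be sketched in a few lines. This is a minimal illustration, not inciso's code: it assumes the features found by inciso and by the human are each available as a set of identifiers, and the function name and sample data are made up.

```python
def precision_recall(found_by_system, found_by_human):
    """Return (precision, recall), treating the human's set as the truth."""
    system = set(found_by_system)
    truth = set(found_by_human)
    correct = system & truth  # features the system found that the human confirms
    precision = len(correct) / len(system) if system else 0.0
    recall = len(correct) / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical example: the system reports 4 features, 3 of them correct,
# while the human found 6 features in total.
system = {"r1", "r2", "r3", "junk"}
human = {"r1", "r2", "r3", "r4", "r5", "r6"}
precision, recall = precision_recall(system, human)
print(precision, recall)  # 0.75 0.5
```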

step 0

• We use a database of papers published in Spanish social science journals between 1994 and 2004.

• There are 133798 items in this dataset.

• We call it the metadata base.

step 1

• Select Spanish social science journals.

• We use six journals because they allow open access.

• The span of publication varies.

step 2

• For each article we have in the open access journals, we download the full text of the papers.

• Inciso does not deal with full-text other than PDF.

• There are 1493 papers available as PDF full text.

step 3

• We convert the PDF to text.

• We use Vividata OCR shop as this appears to give better results than the freely available rivals.

• Many converters "bomb out" at some stage of some documents. Unfortunately, reference lists are at the end.

• For 495 files, conversion fails.

• 1028 texts remain.

step 4

• We parse the converted file to find the reference section.

• For 219 files, inciso was not able to detect a reference section.

• Of the 219 files, roughly 40% did not contain a reference section.

• For the other 60% the reference section was there but inciso did not find it.

• 809 texts remain.

step 5

• We have found a reference section. Now we have to split it into reference strings.

• Finding the limits of a reference string is an important challenge at this stage.

evaluation of step 5

• Here we deal with references found by inciso vs those found by a human.

• inciso has found 13279 references in the 809 documents.

• For the data that I have, the human has only looked at 236 of the documents.

• And I am only giving you the results in the paper; I have no other data.

evaluation of step 5

• For a subset of 236 of them we examine the references manually to calculate precision and recall.

• Precision is 93%. Most of the references found are references rather than some junk.

• But recall is poor at 52%. There are thus around 7,000 references in the papers.

step 6

• In step 6, we parse the reference string to extract metadata elements
  – authors
  – title
  – year

• Basically, it's a regular expression orgy.
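To give a flavour of this kind of parsing, here is a minimal sketch. The pattern is not one of inciso's actual expressions: it assumes a single reference layout, "Authors (Year). Title. Rest", with authors separated by semicolons, and the example reference is invented.

```python
import re

# Assumed reference layout: "Authors (Year). Title. Rest".
REFERENCE_PATTERN = re.compile(
    r"^(?P<authors>.+?)\s*"        # everything before the year
    r"\((?P<year>\d{4})\)\.?\s*"   # four-digit year in parentheses
    r"(?P<title>[^.]+)\."          # title: text up to the first full stop
)


def parse_reference(reference):
    """Return a dict with authors, year and title, or None if no match."""
    match = REFERENCE_PATTERN.match(reference.strip())
    if match is None:
        return None
    authors = [a.strip() for a in match.group("authors").split(";") if a.strip()]
    return {
        "authors": authors,
        "year": int(match.group("year")),
        "title": match.group("title").strip(),
    }


# Invented reference string, for illustration only.
example = "García, A.; López, B. (1999). Un título de ejemplo. Revista Ficticia, 12, 3-20."
parsed = parse_reference(example)
```

With this pattern, a title that itself contains a full stop is cut short, and a reference without a parenthesised year is not parsed at all. A real parser needs many such patterns, one per citation style.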

evaluation of step 6

Precision figures for the manually checked references

• 96.3% of years

• 62.5% of titles

• 60.7% of texts

• 58.0% of authors

main problems at stage 6

• Over 50% are related to the Spanish language
  – Use of hyphen with author names
  – Papers without a publication year, where you find “en prensa” (in press) instead
  – The system identifies “et al.”, but not its Spanish version “y otros”
  – The title consists of more than one sentence, or contains punctuation marks.
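Two of these problems lend themselves to a small illustration. The sketch below is hypothetical and not taken from inciso: it assumes the year appears in parentheses, and it treats “y otros” like “et al.” and “en prensa” as a placeholder for a missing year.

```python
import re

# Spanish "y otros" plays the role of "et al." in an author list.
ET_AL_PATTERN = re.compile(r"\b(?:et\s+al\.?|y\s+otros)(?=\W|$)", re.IGNORECASE)
# Either a four-digit year or "en prensa" (in press), in parentheses.
YEAR_PATTERN = re.compile(r"\((?P<year>\d{4}|en\s+prensa)\)", re.IGNORECASE)


def has_more_authors(author_string):
    """True if the author list is abbreviated, in Latin or in Spanish."""
    return ET_AL_PATTERN.search(author_string) is not None


def extract_year(reference):
    """Return the year as an int, the string 'in press', or None."""
    match = YEAR_PATTERN.search(reference)
    if match is None:
        return None
    value = match.group("year")
    return int(value) if value.isdigit() else "in press"
```

For example, `extract_year("Pérez, J. (en prensa). Título.")` gives `"in press"` rather than failing, and `has_more_authors("Pérez, J. y otros")` is true.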

step 7

• We parse the database of references to look for citations. There are three cases
  – case "I": a correct citation to a document in the metadata base.
  – case "E": a reference to a document that is not in the metadata base.
  – case "P": a reference to a document in the metadata base, but not found.
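The three cases can be written down as a small classification. This is a sketch under assumptions: for each reference we know what inciso matched and what a human matched (a metadata base identifier, or None for no match), and the human is taken to be right. A wrong match by the system is also counted as "P" here, which is a simplification of mine and not something the slides state.

```python
from collections import Counter


def classify_reference(system_match, human_match):
    """Classify one reference as 'I', 'E' or 'P'.

    Both arguments are metadata base identifiers, or None for "no match".
    The human's judgement is taken as the truth.
    """
    if human_match is None:
        return "E"  # the cited document is not in the metadata base
    if system_match == human_match:
        return "I"  # correct citation to a document in the metadata base
    return "P"      # document is in the metadata base, but the system missed it


# Hypothetical (system, human) pairs for five references.
pairs = [("d1", "d1"), (None, "d2"), (None, None), (None, "d3"), (None, "d4")]
counts = Counter(classify_reference(s, h) for s, h in pairs)
```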

step 7 evaluation

• Let R be the total number of references that inciso has found.

• Then we have P/R=80%.

• Recall is only 20%.

summary

• The current state of the system is really poor.

• Say you have a complete metadata base.

• Out there is a pile of 1000 citations in documents.

• You try to find them using inciso.

• For 1000 references we find 62:
  – 495/1493 = 331/1000 will be in documents that don't convert
  – 219*6/10/1028 = 127/1000 will be in reference sections that are not found
  – 480/1000 will not be found because of incorrect reference parsing

• For those 62 references that we have found, inciso will only find 62/5 = 12 citations.
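The arithmetic on this slide can be reproduced directly from the counts given in steps 2 to 4. The sketch below only restates the slide's calculation; the variable names are mine, and fractions are truncated to whole numbers as the slide does.

```python
# Figures taken from the slides.
pdf_files = 1493                 # papers available as PDF (step 2)
failed_conversions = 495         # PDF-to-text failures (step 3)
texts_after_conversion = 1028    # texts remaining after step 3
no_section_detected = 219        # files with no detected reference section (step 4)
lost_in_parsing = 480            # per 1000, stated directly on the slide

# Losses per 1000 references, truncated to whole numbers as on the slide.
lost_in_conversion = 1000 * failed_conversions // pdf_files
# Roughly 60% (6/10) of the 219 files did contain a reference section.
lost_in_detection = 1000 * no_section_detected * 6 // 10 // texts_after_conversion

references_found = 1000 - lost_in_conversion - lost_in_detection - lost_in_parsing
citations_found = references_found // 5  # step 7 recall of 20%, i.e. one in five

print(lost_in_conversion, lost_in_detection, references_found, citations_found)
# prints: 331 127 62 12
```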

http://openlib.org/home/krichel

Thank you for your attention!

weep now!