Evaluation of INCISO: A system for automatic elaboration of a Citation Index in Social Science...

Post on 27-Mar-2015


Evaluation of INCISO: A system for automatic elaboration of a Citation Index in Social Science Spanish Journals

Thomas Krichel

LIU & НГУ

2007-10-16

Authors part 1

• José Manuel Barrueco Cruz

– Social Sciences Libraries, University of Valencia, Tarongers Campus, 46022 Valencia, Spain

– barrueco@uv.es

• Pedro Blesa Pons

– Department of Information Systems and Computation, Polytechnic University of Valencia, Valencia, Spain

– pblesa@dsic.upv.es

Authors part 2

• Thomas Krichel

– College of Information and Computer Science, Long Island University, 720 Northern Boulevard, Brookville 11548-1300, USA

– Faculty of Information Technology, Novosibirsk State University, 2 Pirogova Street, Novosibirsk, Russia

– krichel@openlib.org

Authors part 3

• Julia Osca-Lluch

– Institute of the History of Sciences and Documentation López Piñero (University of Valencia-CSIC), 46010 Valencia, Spain

– m.julia.osca@uv.es

• Elena Velasco

– Department of Information Systems and Computation, Polytechnic University of Valencia, Valencia, Spain

– elvear@aaa.upv.es

Acknowledgments

• The research was supported by grant HUM2004-05532 from the Spanish Science and Education Ministry.

• The grant pays for my expenses here.

• I paid for my own travel.

vocabulary

• The terms "citation" and "reference" are very often used as synonyms.

• This usage is only partly correct.

reference

• A reference is what you find in the "References" section at the end of an academic paper.

• It is a string that describes another academic paper.

• The idea of an academic paper itself has become more vague, with the same paper appearing in different versions.

citation

• A citation is a link between two papers.

• To find citations, we need some way of identifying papers.

• We need to find the references in papers.

• We need to know which paper a reference points to.

why bother?

• Citation indexes are useful for two purposes:

– to navigate the literature

– to assess the literature

• The first function is in decline.

• The second function is becoming more important.

to navigate the literature

• A citation index lets us move between papers. It lets us find papers that are related to a given paper without having to use key terms.

• Use of a citation index is vital in surveying work, when we want to be sure no important paper has been left out.

to assess the literature

• One measure of a paper's success is its use by others, as shown by the number of citations it receives.

• How good this measure is remains disputed.

• But impact measurement through citations has been pretty much the only game in town.

supply of citation indexes

• The Institute for Scientific Information has been holding a virtual monopoly on large, quality-controlled citation indexes.

• Recently Elsevier's Scopus system has come as a competitor.

• Neither of the two sources is freely available. But that's not the only problem.

quality problems

• The quality of the ISI data, last time I looked at it, was terrible.

• It was no more than a glorified reference list, since no effort was made to find citations.

• Author first names in references were cut, making it difficult to find references to an author.

limited coverage

• The two commercial indexes have limited coverage of non-English materials.

• Generally they fare poorly on any off-mainstream work. They exclude much of the work written in languages other than English.

• Such work is therefore difficult to evaluate.

alternative projects 1

• CiteSeer was the first automated citation index.

• It uses an automated procedure. Starting from seed papers, it extracts references.

• If a reference contains a link, it looks up the next paper etc.
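The crawl described on this slide can be sketched as a breadth-first walk. This is only an illustration of the idea, not CiteSeer's code: the function name `crawl` and the toy dictionary `references_of` (paper identifier mapped to the identifiers its references link to) are assumptions made for the example.

```python
from collections import deque


def crawl(seeds, references_of):
    """Follow reference links breadth-first, starting from the seed papers.

    references_of is a hypothetical in-memory stand-in for "download the
    paper and extract the links found in its references".
    """
    seen = set(seeds)
    queue = deque(seeds)
    citations = []  # (citing paper, cited paper) pairs
    while queue:
        paper = queue.popleft()
        for linked in references_of.get(paper, []):
            citations.append((paper, linked))
            if linked not in seen:  # visit each paper only once
                seen.add(linked)
                queue.append(linked)
    return citations


# Toy data: paper "a" is the seed and links to "b" and "c"; "b" links to "c".
toy_links = {"a": ["b", "c"], "b": ["c"], "c": []}
found = crawl(["a"], toy_links)
print(found)
```

Each pair in the result is a citation in the sense used earlier in the talk: a link between two identified papers.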

alternative projects 2

• In 2000 José Manuel Barrueco Cruz used a modified version of the CiteSeer code to build CitEc, a citation index for RePEc.

• This index is freely available. While it does not cover all of RePEc, it is large enough to do scientometric studies with it. Ask me about it.

alternative projects 3

• Google Scholar (apparently) has a set of quality-controlled sources, mainly from participating publishers and/or intermediaries.

• It extracts references from these sources and links them together.

• It is popular with users.

INCISO

• stands for "índice de ciencias sociales" (Spanish for "social sciences index")

• Aims to build a tool to parse citations from Spanish social sciences journals.

• The index is supposed to be made public on the project web site http://inciso.openlib.org

system aims

• Some of the aims are:

– It should in principle be multi-disciplinary.

– It should be freely available.

– It should be based on open source software as much as possible.

– It should run autonomously and continuously.

• Many of these aims are not reached.

system evaluation

• The system is not meant for end-users so a user evaluation makes no sense.

• Likewise performance evaluation through speed, disk space usage and other such criteria appears useless.

• We are using instead criteria from information retrieval (IR) evaluation.

IR based evaluation

• We compare the results of the system with what a human would find.

• What the human has found is assumed to be true.

• Then we can use

– recall = correct features found by inciso divided by correct features found by human

– precision = correct features found by inciso divided by features found by inciso.
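The two measures can be sketched in a few lines, treating what the human found as the truth. The function name `precision_recall` and the toy feature labels are assumptions for illustration; they are not taken from the project.

```python
def precision_recall(found_by_system, found_by_human):
    """Return (precision, recall), assuming the human's findings are true."""
    system = set(found_by_system)
    human = set(found_by_human)
    correct = system & human  # features the system found that the human confirms
    precision = len(correct) / len(system) if system else 0.0
    recall = len(correct) / len(human) if human else 0.0
    return precision, recall


# Hypothetical toy data: the system finds four features, one of them junk;
# the human finds six features in the same documents.
system_features = {"ref-a", "ref-b", "ref-c", "junk-1"}
human_features = {"ref-a", "ref-b", "ref-c", "ref-d", "ref-e", "ref-f"}
precision, recall = precision_recall(system_features, human_features)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In the toy data, three of the four system features are correct, and three of the six human features were found.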

step 0

• We use a database of papers published in Spanish social science journals between 1994 and 2004.

• There are 133798 items in this dataset.

• We call it the metadata base.

step 1

• Select Spanish social science journals.

• We use six journals because they allow open access.

• The span of publication varies.

step 2

• For each article we have in the open access journals, we download the full text of the papers.

• Inciso does not deal with full-text other than PDF.

• There are 1493 papers available as PDF full text.

step 3

• We convert the PDF to text.

• We use Vividata OCR shop as this appears to give better results than the freely available rivals.

• Many converters "bomb out" at some stage of some documents. Unfortunately, reference lists are at the end.

• For 495 files, conversion fails.

• 1028 texts remain.

step 4

• We parse the converted file to find the reference section.

• For 219 files, inciso was not able to detect a reference section.

• Of the 219 files, roughly 40% did not contain a reference section.

• For the other 60% the reference section was there but inciso did not find it.

• 809 texts remain.
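One simple way to look for a reference section is to search the converted text for a heading line. The sketch below is an assumption about how such a detector could work, not inciso's actual rule: the heading list in `HEADING` and the function name `find_reference_section` are made up for the example.

```python
import re

# Hypothetical list of headings; the headings inciso looks for are not
# given in the talk.
HEADING = re.compile(
    r"^\s*(?:referencias(?:\s+bibliogr[aá]ficas)?|bibliograf[ií]a|references)\s*:?\s*$",
    re.IGNORECASE,
)


def find_reference_section(text):
    """Return the text after the last reference heading, or None if absent."""
    lines = text.splitlines()
    # Search from the end, because reference lists sit at the end of a paper.
    for index in range(len(lines) - 1, -1, -1):
        if HEADING.match(lines[index]):
            return "\n".join(lines[index + 1:]).strip()
    return None


sample = (
    "Texto del artículo.\n"
    "Referencias bibliográficas\n"
    "Pérez, J. (1999). Un título.\n"
    "López, A. (2001). Otro título.\n"
)
section = find_reference_section(sample)
print(section)
```

A detector like this returns None both when the paper has no reference section and when the heading is unusual or was damaged in conversion, which are the two situations behind the 219 files above.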

step 5

• We have found a reference section. Now we have to split it into reference strings.

• Finding the limits of a reference string is an important challenge at that stage.
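To show why the limits are hard to find, here is a minimal splitting sketch. It assumes that a new reference starts on a line that begins with a capitalised surname followed by a comma; that rule, the pattern `START` and the function name `split_references` are illustrative assumptions, not the rule inciso uses.

```python
import re

# Assumption: a reference starts with one or more capitalised name words
# followed by a comma, e.g. "Barrueco Cruz," or "Osca-Lluch,".
START = re.compile(r"^[A-ZÁÉÍÓÚÑ][\w'’-]+(?:\s+[A-ZÁÉÍÓÚÑ][\w'’-]+)*,")


def split_references(section):
    """Split a reference section into one string per reference."""
    references = []
    for line in section.splitlines():
        line = line.strip()
        if not line:
            continue
        if START.match(line) or not references:
            references.append(line)  # a new reference begins here
        else:
            references[-1] += " " + line  # continuation of the previous one
    return references


section_text = (
    "Barrueco Cruz, J. M. (2000). Un título largo\n"
    "que continúa en otra línea.\n"
    "Krichel, T. (2001). Another title.\n"
)
reference_strings = split_references(section_text)
print(reference_strings)
```

The rule breaks as soon as a continuation line happens to start with something like "Valencia, Spain", which would be split off as a reference of its own.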

evaluation of step 5

• Here we deal with references found by inciso vs those found by a human.

• inciso has found 13279 references in the 809 documents.

• For the data that I have, the human has only looked at 236 of the documents.

• And I am only giving you the results in the paper; I have no other data.

evaluation of step 5

• For a subset of 236 of them we examine the references manually to calculate precision and recall.

• Precision is 93%. Most of the references found are references rather than some junk.

• But recall is poor at 52%. There are thus around 7,000 references in the papers.

step 6

• In step 6, we parse the reference string to extract metadata elements:

– authors

– title

– year

• Basically, it's a regular expression orgy.
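A single pattern of the kind this step relies on might look as follows. It assumes references shaped like "Authors (Year). Title. Rest"; that layout, the pattern `REFERENCE` and the function name `parse_reference` are assumptions for illustration, and real reference lists need many more patterns.

```python
import re

# Assumption: the reference looks like "Authors (Year). Title. Rest".
REFERENCE = re.compile(
    r"^(?P<authors>.+?)\s*\((?P<year>\d{4})\)[.:,]?\s*(?P<title>[^.]+)\."
)


def parse_reference(reference):
    """Return a dict with authors, year and title, or None if no match."""
    match = REFERENCE.match(reference)
    if match is None:
        return None
    return {
        "authors": match.group("authors").strip(),
        "year": match.group("year"),
        "title": match.group("title").strip(),
    }


parsed = parse_reference("Krichel, T. (2001). Another title. Some Journal, 3, 1-10.")
print(parsed)
```

The title group stops at the first full stop, so a title made of two sentences is cut short.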

evaluation of step 6

Precision figures for the manually checked references:

• years: 96.3%

• titles: 62.5%

• texts: 60.7%

• authors: 58.0%

main problems at stage 6

• Over 50% are related to the Spanish language:

– Use of hyphens in author names.

– Papers without a year of publication, where you find “en prensa” (in press) instead.

– The system identifies “et al.”, but not its Spanish version “y otros”.

– The title consists of more than one sentence, or contains punctuation marks.
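Two of these problems can be addressed by widening the patterns. The sketch below is an assumption about what such a fix could look like, not a description of inciso: the names `YEAR_OR_IN_PRESS`, `MORE_AUTHORS`, `year_of` and `has_more_authors` are invented for the example.

```python
import re

# Accept "(en prensa)" wherever a four-digit year in brackets is accepted.
YEAR_OR_IN_PRESS = re.compile(r"\((\d{4}|en prensa)\)", re.IGNORECASE)
# Accept the Spanish "y otros" next to the Latin "et al".
MORE_AUTHORS = re.compile(r"\b(?:et al|y otros)\b", re.IGNORECASE)


def year_of(reference):
    """Return the year, "in press" for "en prensa", or None if neither occurs."""
    match = YEAR_OR_IN_PRESS.search(reference)
    if match is None:
        return None
    value = match.group(1).lower()
    return "in press" if value == "en prensa" else value


def has_more_authors(reference):
    """True if the author list is abbreviated with "et al." or "y otros"."""
    return MORE_AUTHORS.search(reference) is not None


example = "Pérez, J. y otros (en prensa). Un título."
print(year_of(example), has_more_authors(example))
```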

step 7

• We parse the database of references to look for citations. There are three cases:

– case "I": a correct citation to a document in the metadata base.

– case "E": a reference to a document that is not in the metadata base.

– case "P": a reference to a document in the metadata base, but not found.
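The three cases can be sketched as a lookup followed by a comparison with the human's answer. Matching on normalised title plus year is an assumption, as are the names `normalise`, `find_in_metadata_base` and `classify` and the toy metadata base; the talk does not say how inciso matches.

```python
import unicodedata


def normalise(title):
    """Lower-case, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", title.lower())
    kept = "".join(c for c in decomposed if c.isalnum() or c.isspace())
    return " ".join(kept.split())


def find_in_metadata_base(parsed, metadata_base):
    """Return the identifier of the matching document, or None."""
    return metadata_base.get((normalise(parsed["title"]), parsed["year"]))


def classify(system_match, human_match):
    """Classify one reference, taking the human's match as the truth."""
    if human_match is None:
        return "E"  # the cited document is not in the metadata base
    if system_match == human_match:
        return "I"  # correct citation found
    if system_match is None:
        return "P"  # in the metadata base, but not found
    return "?"  # linked to the wrong document: not covered by the three cases


# Hypothetical metadata base with a single document.
toy_base = {("un titulo", "1999"): "doc-1"}
match = find_in_metadata_base({"title": "Un Título.", "year": "1999"}, toy_base)
print(match, classify(match, "doc-1"))
```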

step 7 evaluation

• Let R be the total number of references that inciso has found.

• Then we have P/R=80%.

• Recall is only 20%.

summary

• The current state of the system is really poor.

• Say you have a complete metadata base.

• Out there is a pile of 1000 citations in documents.

• You try to find them using inciso.

• For 1000 references we find 62:

– 495/1493 = 331/1000 will be in documents that don't convert

– 219*6/10/1028 = 127/1000 will be in reference sections that are not found

– 480/1000 will not be found because of incorrect reference parsing

• For those 62 references that we have found, inciso will only find 62/5 = 12 citations.
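The arithmetic of this summary can be replayed as follows. The sketch takes 495/1493 as the share of documents that do not convert, which is what the figure of 331 per 1000 corresponds to, and it truncates to whole numbers because that reproduces the figures on the slide; the variable names are made up for the example.

```python
papers_with_pdf = 1493
failed_conversions = 495
texts_after_conversion = 1028
no_section_detected = 219
lost_to_parsing = 480  # per 1000, taken from the slide as given

# Losses per 1000 references, truncated to whole numbers.
lost_to_conversion = 1000 * failed_conversions // papers_with_pdf
# 6/10 of the 219 files had a reference section that was not found.
lost_to_section = 1000 * no_section_detected * 6 // (10 * texts_after_conversion)

references_found = 1000 - lost_to_conversion - lost_to_section - lost_to_parsing
# Step 7 recall is 20%, so one in five found references becomes a citation.
citations_found = references_found // 5

print(lost_to_conversion, lost_to_section, references_found, citations_found)
```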

http://openlib.org/home/krichel

Thank you for your attention!

weep now!