INIS Training Seminar: Subject Analysis, Thesaurus and Computer Assisted Indexing
November 2009 INIS Training Seminar 1
International Atomic Energy Agency
INIS Training Seminar
Subject Analysis, Thesaurus and Computer Assisted Indexing
23 – 27 November 2009
Vienna, Austria
Alexander Nevyjel
Head, Content Management Group
Introduction to Subject Analysis
• Subject Analysis should be carried out whenever possible by subject specialists with a good knowledge of the subject matter and a familiarity with the subject analysis tools of the respective database (subject categories, thesaurus, subject analysis rules)
• Steps of Subject Analysis
  • subject classification
  • abstracting
  • subject indexing
Subject Classification
• The main topic of the document determines the primary subject category
• If there are other significant topics, one or more secondary subject categories can be assigned in addition
Abstracting
• Each input item should contain an English abstract (exception: short communications)
• Abstracts in other languages are optional
• If an author abstract is available, it should be checked by the subject specialist, and edited, if necessary
• An abstract should be as informative as possible
• Emphasize what is novel about the information in the original document
Thesaurus
“A thesaurus is a terminological control device used in translating from the natural language of documents, indexers or users into a more constrained system language. It is a controlled and dynamic vocabulary of semantically and generically related terms which covers a specific domain of knowledge”
This definition has been adopted by UNESCO: “Guidelines for the establishment and development of monolingual thesauri”, UNESCO, SC/W/255, Paris, September 1973
The Thesaurus and its Structure
Relationship   Symbol  Cross reference
hierarchical   BT      broader term (level 1, 2, ...)
hierarchical   NT      narrower term (level 1, 2, ...)
affinitive     RT      related term
preferential   UF      used for (reciprocally USE ...)
preferential   UF+     used for multiple (reciprocally USE ... AND ...)
preferential   SF      seen for (reciprocally SEE ... OR ...)
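The reciprocal nature of these cross references can be illustrated with a small data structure. This is only a sketch: the class and method names are invented for illustration, the terms in the usage lines are placeholders rather than INIS Thesaurus entries, and UF+ and SF are left out for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """Word block of one descriptor; attribute names follow the slide's symbols."""
    term: str
    bt: set[str] = field(default_factory=set)  # broader terms
    nt: set[str] = field(default_factory=set)  # narrower terms
    rt: set[str] = field(default_factory=set)  # related terms
    uf: set[str] = field(default_factory=set)  # forbidden terms pointing here


class Thesaurus:
    """Hypothetical container that keeps both sides of each relation in sync."""

    def __init__(self):
        self.entries = {}  # descriptor -> Entry
        self.use = {}      # forbidden term -> descriptor (reciprocal of UF)

    def _entry(self, term):
        return self.entries.setdefault(term, Entry(term))

    def add_bt(self, term, broader):
        # BT and NT are two views of one link, so both sides are recorded.
        self._entry(term).bt.add(broader)
        self._entry(broader).nt.add(term)

    def add_rt(self, a, b):
        # RT is symmetric.
        self._entry(a).rt.add(b)
        self._entry(b).rt.add(a)

    def add_uf(self, descriptor, forbidden):
        # UF on the descriptor corresponds to USE on the forbidden term.
        self._entry(descriptor).uf.add(forbidden)
        self.use[forbidden] = descriptor

    def resolve(self, term):
        """Return the valid descriptor for a term, following USE if needed."""
        if term in self.entries:
            return term
        return self.use.get(term)


# Placeholder terms, for illustration only.
thesaurus = Thesaurus()
thesaurus.add_bt("NARROW TERM", "BROAD TERM")
thesaurus.add_uf("NARROW TERM", "OLD NAME")
print(thesaurus.entries["BROAD TERM"].nt)  # {'NARROW TERM'}
print(thesaurus.resolve("OLD NAME"))       # NARROW TERM
```

Storing the USE mapping separately from the descriptors mirrors the slide's distinction between valid descriptors and forbidden terms.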
Subject Indexing
Subject indexing means analysing the information content of a piece of literature and expressing the meaningful information content in the language of the database, using the controlled vocabulary of the Thesaurus
• Understanding of the content --> subject specialist
• Familiarity with Thesaurus and indexing rules
• Select a set of descriptors that describes the subject content of the piece of literature
Procedures for Indexing
• Carefully read the title and abstract and scan the body of the piece of literature
• Scan the full text (introduction, table of contents, tables, graphs, figures, conclusion) to find information items missing from the abstract or requiring more precision
• Identify the concept(s) about which the piece of literature contains useful information
• Translate the concepts into descriptors
• Avoid overindexing
Proposed Terms (Technical Note 175)
If no suitable descriptor exists in the Thesaurus for the retrieval of a useful concept, make a proposal for a new one, containing the following:
• Proposed term
• Proposed word block of the term (in particular proposed BTs)
• Potential forbidden terms pointing to this proposed descriptor
• Scope note when appropriate
• Explanation and justification for the proposal
• One or more sample records
The purpose of subject indexing is to enable useful retrieval
Computer-assisted Indexing - CAI
• Kick-off Meeting Jan 2004
• Implementation and Customisation Jun 2004
• Production Indexing from Jun 2004 ongoing
• CAI version 1.0 final acceptance Aug 2004
• Tuning of the system from Aug 2004 ongoing
• CAI batch processing for Member States Dec 2004
• CAI online (remote access) for Member States Nov 2007
CAI Thesaurus Extension
“Hidden terms” are character patterns representing the different appearances of a concept in the free text, which is indexed by one or more descriptors.
• handled similarly to “forbidden terms”, with one or more USE relations
• CAI internal only
• not exported to INIS production system
• not exported to FIBRE
• not printed in any appearance of the thesaurus
• support identification of descriptors in the free text
Hidden Terms: Compounds
Descriptor             hidden term        free text
MAGNESIUM BORIDES      MgB_2              MgB2
MAGNESIUM CARBONATES   MgCO_3             MgCO3
MAGNESIUM HYDRIDES     MgH_2              MgH2
IRON BROMIDES          iron dibromide
IRON BROMIDES          iron tribromide
ARSENIC IONS           As"3"-             As3-
ACETYLENE              C_2H_2             C2H2
ACETALDEHYDE           C_2H_4O            C2H4O
ACETIC ACID            C_2H_4O_2          C2H4O2
approx. 1400 hidden terms (expected 3000)
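A minimal sketch of how such hidden terms could map free text to descriptors. It assumes, from the paired columns above, that `_` marks a subscript and `"` marks a superscript in the hidden-term notation. The function names and the tokenising rule are illustrative and do not describe the actual CAI software.

```python
import re

# Hidden terms copied from the slide, written in the slide's markup.
HIDDEN_TERMS = {
    "MgB_2": "MAGNESIUM BORIDES",
    "MgCO_3": "MAGNESIUM CARBONATES",
    "MgH_2": "MAGNESIUM HYDRIDES",
    'As"3"-': "ARSENIC IONS",
    "C_2H_2": "ACETYLENE",
    "C_2H_4O": "ACETALDEHYDE",
    "C_2H_4O_2": "ACETIC ACID",
}


def plain_form(hidden_term: str) -> str:
    """Drop the assumed sub/superscript markers, e.g. 'MgB_2' -> 'MgB2'."""
    return hidden_term.replace("_", "").replace('"', "")


# Lookup keyed by the plain form as it would appear in free text.
LOOKUP = {plain_form(h): d for h, d in HIDDEN_TERMS.items()}


def suggest(text: str) -> list[str]:
    """Return descriptors whose plain hidden term occurs as a whole token."""
    found = []
    for token in re.findall(r"[A-Za-z0-9+\-]+", text):
        descriptor = LOOKUP.get(token)
        if descriptor and descriptor not in found:
            found.append(descriptor)
    return found


print(suggest("Superconductivity in MgB2 and hydrogen storage in MgH2."))
# ['MAGNESIUM BORIDES', 'MAGNESIUM HYDRIDES']
```

Matching whole tokens with exact case keeps `C2H4O` from also firing on `C2H4O2`, at the cost of missing differently capitalised spellings.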
Hidden Terms: Isotopes
Descriptor    hidden term    free text
CESIUM 137                   Cesium 137, Cesium-137
              "1"3"7cs       137Cs
              137 caesium    137 Caesium, 137-Caesium
              caesium 137    Caesium 137, Caesium-137
              137 cesium     137 Cesium, 137-Cesium
              137 cs         137 Cs, 137-Cs
              cs 137         Cs 137, Cs-137
              cs"1"3"7       Cs137
              cs137          Cs137
CESIUM 138    "1"3"8"mcs     138mCs
              cs"1"3"8"m     Cs138m
approx. 22,400 hidden terms
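The free-text forms listed for CESIUM 137 follow a regular pattern: element name or symbol combined with the mass number, in either order, joined by a space, a hyphen, or (for the symbol) nothing. The sketch below generates such variants. The slides do not say how the isotope hidden terms were actually produced, so the function and its rules are assumptions for illustration only.

```python
def isotope_variants(names, symbol, mass):
    """List plain-text spellings of an isotope.

    names  -- element name spellings, e.g. ["Cesium", "Caesium"]
    symbol -- element symbol, e.g. "Cs"
    mass   -- mass number, e.g. 137
    """
    variants = []
    for label in [*names, symbol]:
        variants += [
            f"{label} {mass}",   # Cesium 137
            f"{label}-{mass}",   # Cesium-137
            f"{mass} {label}",   # 137 Cesium
            f"{mass}-{label}",   # 137-Cesium
        ]
    # Directly juxtaposed forms are generated for the symbol only.
    variants += [f"{mass}{symbol}", f"{symbol}{mass}"]
    return variants


for variant in isotope_variants(["Cesium", "Caesium"], "Cs", 137):
    print(variant)
```

For cesium 137 this yields 14 spellings, the same set as the free-text column above.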
Hidden Terms: Elementary Particles
Descriptor                   hidden term      free text
B QUARKS                     bottom quarks
T QUARKS                     top quarks
ELECTRON NEUTRINOS           #nu#_e           νe
MUON NEUTRINOS               #nu#_#mu#        νμ
TAU NEUTRINOS                #nu#_#tau#       ντ
RHO-770 MESONS               #rho#-770        ρ-770
OMEGA-782 MESONS             #omega#-782      ω-782
KAONS NEUTRAL                K"0              K0
KAONS NEUTRAL SHORT-LIVED    K"0_S            K0S
KAONS NEUTRAL LONG-LIVED     K"0_L            K0L
approx. 300 hidden terms
Hidden Terms: UK/US Spellings
Descriptor                   hidden term
A CENTERS                    a centres
ACTIVITY METERS              activity metres
ANALOG COMPUTERS             analogue computers
ANESTHESIA                   anaesthesia
ARCHAEOLOGY                  archeology
AUSTRIAN ORGANIZATIONS       austrian organisations
BALLISTIC MISSILE DEFENSE    ballistic missile defence
BAYARD-ALPERT GAGES          bayard-alpert gauges
BEAM ANALYZERS               beam analysers
BEHAVIOR                     behaviour
CATALOGS                     catalogues
approx. 800 hidden terms
Hidden Terms: Diacritics and Countries
Descriptor                  hidden term
Diacritics:
BAECKLUND TRANSFORMATION    backlund transformation
BRUECKNER MODEL             bruckner model
BRUNSBUETTEL REACTOR        brunsbuttel reactor
MOESSBAUER EFFECT           mossbauer effect
Country Names:
CAMBODIA                    kampuchea
COTE D'IVOIRE               ivory coast
GREECE                      hellas
MYANMAR                     burma
SYRIA                       syrian arab republic
THAILAND                    siam
approx. 250 hidden terms
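In the diacritics examples the descriptor spells an umlaut as a digraph (MOESSBAUER), while the hidden term is the spelling with the diacritic simply dropped (mossbauer). Both ASCII forms can be derived from a name written with diacritics, as in this sketch. The helper names and the three-letter digraph table are assumptions for illustration, not CAI's actual rules.

```python
import unicodedata

# Assumed digraph convention, matching the descriptor spellings on the slide.
UMLAUT_DIGRAPHS = {"ä": "ae", "ö": "oe", "ü": "ue"}


def strip_diacritics(text: str) -> str:
    """Remove combining marks: 'mössbauer' -> 'mossbauer'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def transliterate(text: str) -> str:
    """Spell umlauts as digraphs: 'mössbauer' -> 'moessbauer'."""
    text = unicodedata.normalize("NFC", text)  # ensure precomposed letters
    for letter, digraph in UMLAUT_DIGRAPHS.items():
        text = text.replace(letter, digraph)
    return strip_diacritics(text)  # drop any remaining marks


name = "mössbauer effect"
print(transliterate(name).upper())  # MOESSBAUER EFFECT  (descriptor form)
print(strip_diacritics(name))       # mossbauer effect   (hidden-term form)
```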
Hidden Terms: Other Spellings
Descriptor                  hidden term
Singular/Plural
FUNGI                       fungus
FUNGI                       funguses
G MATRIX                    g matrices
G MATRIX                    g matrixes
Reverse Sequence
ATOM-MOLECULE COLLISIONS    atom-molecule scattering
ATOM-MOLECULE COLLISIONS    molecule-atom scattering
ATOM-MOLECULE COLLISIONS    atom-molecule reactions
ATOM-MOLECULE COLLISIONS    molecule-atom reactions
ATOM-MOLECULE COLLISIONS    atom-molecule interactions
ATOM-MOLECULE COLLISIONS    molecule-atom interactions
approx. 900 hidden terms
CAI Thesaurus Extension
• Thesaurus
  • Valid Descriptors 21,826
  • Forbidden Terms 9,009
• CAI
  • Hidden Terms 34,381
• Total 65,216
Terminological Knowledge Base
Further Improvements necessary
• “+” and “-” signs
  • K+ --> KAONS PLUS, KAONS MINUS, POTASSIUM IONS
• Case sensitivity
  • TiN --> TIN (instead of TITANIUM NITRIDES)
  • gas --> GALLIUM SULFIDES
  • “…who is the …” --> WHO (World Health Organization)
• Verbs versus Nouns
  • “… this leads us to …” --> LEAD
  • “… this leaves it ….” --> LEAVES
• Homographic terms
  • Solutions --> SOLUTIONS or MATHEMATICAL SOLUTIONS
• Nuclear Reactions, e.g. 14N(γ,α)10B
  • Targets
  • Beams
  • Reactions
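The case-sensitivity problem can be reproduced with a toy matcher. The pattern list below is hypothetical and only modelled on the slide's examples. It is not taken from the CAI knowledge base, and the matching rule is a deliberate simplification.

```python
# Hypothetical patterns: two chemical formulas and one ordinary word.
PATTERNS = [
    ("TiN", "TITANIUM NITRIDES"),
    ("GaS", "GALLIUM SULFIDES"),
    ("tin", "TIN"),
]


def match(token: str, case_sensitive: bool) -> list[str]:
    """Return every descriptor whose pattern equals the token."""
    hits = []
    for pattern, descriptor in PATTERNS:
        if case_sensitive:
            same = token == pattern
        else:
            same = token.lower() == pattern.lower()
        if same:
            hits.append(descriptor)
    return hits


print(match("gas", case_sensitive=False))  # ['GALLIUM SULFIDES']  (false hit)
print(match("gas", case_sensitive=True))   # []
print(match("TiN", case_sensitive=False))  # ['TITANIUM NITRIDES', 'TIN']
print(match("TiN", case_sensitive=True))   # ['TITANIUM NITRIDES']
print(match("Tin", case_sensitive=True))   # []  (sentence-initial word is missed)
```

Ignoring case produces the false suggestions shown on the slide, while strict case matching loses ordinary words capitalised at the start of a sentence, so neither setting is sufficient alone.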
CAI-Workflow
[Workflow diagram. Box labels: Input from Member States; Electronic Records from Publishers; Full Indexing; Proposed Terms/No Indexing; CAI Offline/Batch; Records with CAI-suggested Descriptors; CAI Interactive; INIS Subject Analysis Module; Records with Full Indexing; Training of CAI; INIS Verification and Production System. Legend: Interactive CAI Processing; Batch Mode; Conventional Processing.]
CAI Batch and Online Processing
• Input: MemSt-CC-yymmdd-xxxxxxxxxxx
• MemSt is a standard prefix (meaning “member state”)
• CC is the country code
• yymmdd is the date when the file was generated
• xxxxxxxxxxx is any additional identification
• Examples
  • MemSt-AR-041203-thisismytestfile
  • MemSt-FR-041212-fileidentification
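The naming convention above can be checked mechanically. The sketch below assumes a two-letter upper-case country code (as in the AR and FR examples) and a non-empty identification part. The function name is invented for illustration.

```python
import re
from datetime import datetime

# MemSt-CC-yymmdd-xxxxxxxxxxx; the two-letter code is an assumption
# based on the examples on the slide.
INPUT_NAME = re.compile(r"^MemSt-(?P<cc>[A-Z]{2})-(?P<date>\d{6})-(?P<ident>.+)$")


def parse_input_name(name: str):
    """Split a CAI input file name into country code, date, identification."""
    m = INPUT_NAME.match(name)
    if m is None:
        raise ValueError(f"not a MemSt-CC-yymmdd-... file name: {name!r}")
    # strptime also rejects impossible dates such as 041340.
    generated = datetime.strptime(m["date"], "%y%m%d").date()
    return m["cc"], generated, m["ident"]


cc, generated, ident = parse_input_name("MemSt-AR-041203-thisismytestfile")
print(cc, generated.isoformat(), ident)  # AR 2004-12-03 thisismytestfile
```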
CAI Batch Processing
• Output: _MemSt-CC-yymmdd-xxxxxxxxxxx
• These files carry the CAI-suggested descriptors in tag 800, preceded by the string ##CAI suggestions##;
• Example: 800^##CAI suggestions##; DESCRIPTOR1; DESCRIPTOR2; DESCRIPTOR3; …
• The output file is sent back to the Member State for reviewing
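A reviewer's tool could read the suggestions back out of tag 800 along these lines. The sketch assumes, from the single example on the slide, that `^` separates the tag number from its content and that descriptors are separated by `;`. The function name is illustrative.

```python
MARKER = "##CAI suggestions##;"


def parse_tag_800(line: str):
    """Return (is_cai_suggestion, descriptors) for one tag 800 line."""
    tag, sep, value = line.partition("^")
    if tag.strip() != "800" or not sep:
        raise ValueError(f"not a tag 800 line: {line!r}")
    value = value.strip()
    suggested = value.startswith(MARKER)
    if suggested:
        value = value[len(MARKER):]
    # Split on ';' and drop empty pieces left by a trailing separator.
    descriptors = [d.strip() for d in value.split(";") if d.strip()]
    return suggested, descriptors


line = "800^##CAI suggestions##; DESCRIPTOR1; DESCRIPTOR2; DESCRIPTOR3"
print(parse_tag_800(line))
# (True, ['DESCRIPTOR1', 'DESCRIPTOR2', 'DESCRIPTOR3'])
```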
CAI Batch and Online Processing: Reviewing Process
• Delete all suggested descriptors which are too general
• Add relevant descriptors which were not found
  • numerical values, e.g. pressure ranges, temperature ranges, ...
  • nuclear reactions
  • chemical compounds, alloys, etc.
• CAI is cleaning up BT/NTs --> clean up BT/NTs from manual additions
• Clean up suggestions from homographic terms
CAI Batch and Online Processing: Finalisation Process
CAI batch
• When reviewing of a record is completed: delete “##CAI suggestions##”
• When reviewing of all records is completed: submit the file to “INIS Input Box”
CAI online
• When reaching the last record: press the “export and exit” button
• The file goes directly to the INIS production system or, if required, is sent back to the Member State for reviewing
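For the batch route, the two finalisation steps can be sketched as a marker removal plus a check that no record still carries the marker. This assumes the `;` that follows the marker is deleted together with it and that each record keeps its descriptors on one tag 800 line. Both points are assumptions, since the slides show only the marker string, and the function names are invented.

```python
MARKER = "##CAI suggestions##"


def finalise_line(line: str) -> str:
    """Delete the marker, and the ';' after it, from a reviewed tag 800 line."""
    head, found, tail = line.partition(MARKER)
    if not found:
        return line  # already finalised
    return head + tail.lstrip("; ")


def ready_to_submit(lines) -> bool:
    """True when no line still contains the marker."""
    return all(MARKER not in line for line in lines)


reviewed = ["800^##CAI suggestions##; DESCRIPTOR1; DESCRIPTOR2"]
print(ready_to_submit(reviewed))        # False
final = [finalise_line(line) for line in reviewed]
print(final)                            # ['800^DESCRIPTOR1; DESCRIPTOR2']
print(ready_to_submit(final))           # True
```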
CAI Production Statistics (01-06-2004 until 31-08-2009)
          2004 (Jun-Dec)    2005    2006    2007    2008  2009 (Jan-Aug)    Total
AIP                19859   17827   19557    9657    8249            4108    79257
ANS                    -       -     813    1256       -               -     2069
Elsevier            3124   23809   35716   32175   26993           18625   140442
IOPP                3291    8751    8059    7973   10526            8355    46955
IAEA                2131    2171    3984    4445    4843            2532    20106
Springer               -       -       -       -    6113            1000     7113
MemSt                  -       -     660      65    3045            3105     6875
Total              28405   52558   68789   55571   59769           37725   302817
CAI Batch Processing Statistics (2005 until 31-08-2009)
2005 2006 2007 2008 2009/1-8 Total
AR 141 4 53 198
AU 224 224
BG 32 199 151 43 425
CN 299 2319 2314 2959 3059 10950
DE 363 644 1019 879 607 3512
ET 13017 9186 4062 26265
FR 138 721 859
JP 11 32 43
LT 39 69 108
MY 133 270 205 112 61 781
US 97 46 143
UZ 359 396 43 798
VN 8 16 83 82 189
others 306 105 411
Total 2014 4611 16965 13402 7914 44906
CAI online for Member States, introduced in July 2007
• Tested by
  • China
  • Germany
  • France
  • India
  • Japan
  • Switzerland
  • Uruguay
• Regularly in use by
  • Argentina
  • Brazil
  • China
  • Czech Republic
  • Japan
  • Switzerland
CAI online and CAI batch are now regular services for Member States