Decomposing Text Processing for Retrieval:
Cheshire tries GRID@CLEF
Ray R. Larson
School of Information
University of California, Berkeley
CLEF 2009 -- Corfu, Greece September 21, 2007
GRID@CLEF Task
Capture intermediate stages of text processing for IR and export those in an XML format that can be integrated with other (compatible) systems
This year's task covers stages such as tokenization and stemming, but does not address index creation or the actual retrieval process
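To make the idea of capturing intermediate stages concrete, here is a minimal sketch that runs a toy pipeline and keeps the token stream produced by each stage. The stage names follow the stream types reported for this run (Raw, Lowercase, Stoplist, Stemmer); the stage order, the stop list, and the suffix-stripping "stemmer" are assumptions for illustration and are not Cheshire II code.

```python
import re

# Hypothetical stop list, for illustration only.
STOPWORDS = {"the", "of", "and", "a", "in", "to"}


def tokenize(text):
    """Split text into raw tokens made of letters, digits, and hyphens."""
    return re.findall(r"[A-Za-z0-9-]+", text)


def lowercase(tokens):
    """Fold every token to lower case."""
    return [t.lower() for t in tokens]


def remove_stopwords(tokens):
    """Drop tokens that appear in the stop list."""
    return [t for t in tokens if t not in STOPWORDS]


def toy_stem(tokens):
    """Strip a trailing 's' as a stand-in for a real stemmer."""
    return [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens]


def run_pipeline(text):
    """Run the stages in order and keep every intermediate stream."""
    streams = {}
    streams["raw"] = tokenize(text)
    streams["lowercase"] = lowercase(streams["raw"])
    streams["stoplist"] = remove_stopwords(streams["lowercase"])
    streams["stemmer"] = toy_stem(streams["stoplist"])
    return streams


streams = run_pipeline("The Floods of the Rivers")
for name, tokens in streams.items():
    print(name, tokens)
```

Each entry in `streams` corresponds to one stream that could be exported on its own, which is what lets another system pick up processing at any stage.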
GRID@CLEF Task
One goal is for systems to both export and import intermediate processing streams, and eventually to share them. We also hope to use other systems' streams as inputs for subtasks that we currently cannot do, or cannot do effectively, such as decompounding German words
Adapting Cheshire II for GRID@CLEF
Cheshire II is a suite of C programs for IR comprising over 150K lines of code
  Main programs are the indexer and several server and client programs where retrieval is performed
Since identical text processing must be used in both indexing and search, those modules are shared across several programs
Adapting Cheshire II for GRID@CLEF
For this task we created a special version of the main Cheshire indexing program, which included:
  A new module to output the XML streams
  A significant number of changes to the source code for particular modules
  Many changes involved passing more information into lower levels of the call hierarchy via new parameters
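The kind of change described here, passing extra information down the call hierarchy through new parameters, can be sketched as follows. This is an illustration in Python rather than the actual C source; the function names and the `stream_out` parameter are hypothetical.

```python
import io


def normalize_token(token, stream_out=None):
    """Lowest level: normalize one token and, if a sink is given, report it."""
    normalized = token.lower()
    if stream_out is not None:
        stream_out.write(f"{token}\t{normalized}\n")
    return normalized


def process_field(text, stream_out=None):
    """Middle level: split a field into tokens and pass the sink further down."""
    return [normalize_token(t, stream_out) for t in text.split()]


def index_record(record, stream_out=None):
    """Top level: process each field of a record."""
    return {name: process_field(text, stream_out) for name, text in record.items()}


sink = io.StringIO()
index_record({"title": "Cheshire Tries GRID"}, stream_out=sink)
print(sink.getvalue(), end="")
```

Because the new parameter defaults to `None`, existing callers keep working unchanged, while the special indexing build can supply a sink and capture what the lowest level sees.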
Issues
The tasks assume a “bag of words” model
  But Cheshire is an SGML/XML search system, and the tasks as currently defined do not consider structural analysis and faceted indexing
  E.g., there is no provision for multiple indexes taken from different parts of the overall records, as determined by the SGML/XML tags
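The contrast with a single bag of words can be illustrated with a small sketch that builds one token list per index, each drawn from a different tagged part of a record. The record, the tag names, and the index names are made up for the example; Cheshire's real index configuration is not shown in this talk.

```python
import xml.etree.ElementTree as ET

# Made-up record for illustration.
RECORD = """<DOC>
  <HEADLINE>Flood warnings issued</HEADLINE>
  <TEXT>Rivers rose after heavy rain</TEXT>
</DOC>"""

# Hypothetical mapping from index name to the tag it draws its text from.
INDEX_TAGS = {"headline_idx": "HEADLINE", "text_idx": "TEXT"}


def extract_by_index(xml_text, index_tags):
    """Return one lowercased token list per index, taken from its own tag."""
    root = ET.fromstring(xml_text)
    result = {}
    for index_name, tag in index_tags.items():
        words = []
        for element in root.iter(tag):
            words.extend("".join(element.itertext()).lower().split())
        result[index_name] = words
    return result


per_index = extract_by_index(RECORD, INDEX_TAGS)
for index_name, words in per_index.items():
    print(index_name, words)
```

A stream format that carries only a flat token sequence per document has no place to record which index, or which part of the record, each token belongs to.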
Issues
No specification of how unique identifiers for tokens, documents, etc. are to be derived
  In Cheshire II the unique document identifier is just a serial number assigned in the first stage of collection processing (before any tokenizing, parsing, etc. is done)
  There are also term ID numbers assigned to unique terms in an index, but not until a much later stage in our normal processing
Other participants made different choices, revealing a challenge for interoperability
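A small sketch shows why the timing of identifier assignment matters. Document serial numbers can be handed out as records arrive, while term IDs only exist once the tokens have been seen. Starting both counters at 1 and numbering terms by first appearance are assumptions made for the example, not a description of Cheshire II internals.

```python
from itertools import count


def assign_doc_ids(raw_records):
    """Give each record a serial number before any parsing or tokenizing."""
    serial = count(1)
    return [(next(serial), record) for record in raw_records]


def assign_term_ids(token_lists):
    """Give each distinct term an ID in order of first appearance."""
    term_ids = {}
    for tokens in token_lists:
        for token in tokens:
            term_ids.setdefault(token, len(term_ids) + 1)
    return term_ids


docs = assign_doc_ids(["grid clef grid", "clef task"])
term_ids = assign_term_ids(text.split() for _, text in docs)
print(docs)
print(term_ids)
```

Two systems that number documents or terms at different stages, or in a different order, will produce identifiers that cannot be matched without an agreed derivation rule.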
<?xml version="1.0" encoding="UTF-8"?>
<circo xmlns:xs="http://www.w3.org/2001/XMLSchema"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xmlns="http://circo.dei.unipd.it/"
       xsi:schemalocation="http://circo.dei.unipd.it/ http://ims.dei.unipd.it/xml/circo-schema-instance"
       xmlns:dc="http://purl.org/dc/elements/1.1/">
  <metadata>
    <dc:creator> Cheshire II Grid Version </dc:creator>
    <dc:rights> Copyright (c) 1990-2009 Regents of the University of California, All Rights Reserved. </dc:rights>
    <dc:date> Thu Aug 20 18:42:31 2009 </dc:date>
  </metadata>
  <stream identifier="Cheshire_Raw_Tokens_/projects/cheshire/DATA/GRID/DATA/LATIMES94.tar" chunked="false" chunk-number="0" last-chunk="false" digest-type="NONE">
    <component identifier="cheshire_idxdata1" type="tokenizer" description="A tokenizer separates an input document into a stream of tokens.">
      <actor identifier="Larson" />
    </component>
    <resources>
      <resource identifier="/projects/cheshire/DATA/GRID/DATA/LATIMES94.tar-1" mime-type="text/plain">
        <stream identifier="Cheshire_Raw_Tokens_/projects/cheshire/DATA/GRID/DATA/LATIMES94.tar" chunked="false" chunk-number="0" last-chunk="false" digest-type="NONE" />
        <tokens>
          <token identifier="Cheshire_Raw_Tokens_/projects/cheshire/DATA/GRID/DATA/LATIMES94.tar-1-0" value="LA070294-0001">
            <resource identifier="/projects/cheshire/DATA/GRID/DATA/LATIMES94.tar-1" mime-type="text/plain"/>
          </token>
          <token identifier="Cheshire_Raw_Tokens_/projects/cheshire/DATA/GRID/DATA/LATIMES94.tar-1-1" value="LA070294">
            <resource identifier="/projects/cheshire/DATA/GRID/DATA/LATIMES94.tar-1" mime-type="text/plain"/>
          </token>
          <token identifier="Cheshire_Raw_Tokens_/projects/cheshire/DATA/GRID/DATA/LATIMES94.tar-1-2" value="056774">
            <resource identifier="/projects/cheshire/DATA/GRID/DATA/LATIMES94.tar-1" mime-type="text/plain"/>
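The token entries in this excerpt appear to follow a simple pattern: the token identifier is the stream identifier, then the document serial number, then the token's position, and each token points back to its source resource. The sketch below builds elements of that shape with Python's standard `xml.etree.ElementTree`. The pattern is inferred from the excerpt alone, the function name and sample identifiers are hypothetical, and the CIRCO namespace is left out to keep the example short.

```python
import xml.etree.ElementTree as ET


def build_tokens(stream_id, resource_id, doc_serial, values):
    """Build a <tokens> element holding one <token> per value.

    The identifier layout (stream, document serial, position) is an
    assumption read off the excerpt, not taken from a specification.
    """
    tokens = ET.Element("tokens")
    for position, value in enumerate(values):
        token = ET.SubElement(
            tokens,
            "token",
            {"identifier": f"{stream_id}-{doc_serial}-{position}", "value": value},
        )
        ET.SubElement(
            token,
            "resource",
            {"identifier": resource_id, "mime-type": "text/plain"},
        )
    return tokens


tokens = build_tokens(
    "Example_Raw_Tokens", "example.tar-1", 1, ["LA070294-0001", "LA070294"]
)
print(ET.tostring(tokens, encoding="unicode"))
```

Repeating the full stream identifier and the resource reference on every token makes each entry self-describing, at the cost of very large output files.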
Sizes of Output Files
Size             Language   Type
17,512,495,859   ENG        Lowercase
17,160,746,788   ENG        Raw
 9,428,111,692   ENG        Stoplist
 9,039,244,183   ENG        Stemmer
 9,351,225,505   FRE        Lowercase
 9,179,611,865   FRE        Raw
 4,750,229,994   FRE        Stoplist
 4,565,266,242   FRE        Stemmer
18,519,746,160   GER        Lowercase
18,207,971,231   GER        Raw
10,533,324,838   GER        Stoplist
10,127,100,602   GER        Stemmer
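A quick way to read these figures is to express each stream as a fraction of the raw stream for the same language. The sketch below does that arithmetic with the sizes listed above; the unit of the sizes is not stated in the talk, and the helper name is hypothetical.

```python
# Output file sizes as reported above, keyed by (language, stream type).
SIZES = {
    ("ENG", "Lowercase"): 17_512_495_859,
    ("ENG", "Raw"): 17_160_746_788,
    ("ENG", "Stoplist"): 9_428_111_692,
    ("ENG", "Stemmer"): 9_039_244_183,
    ("FRE", "Lowercase"): 9_351_225_505,
    ("FRE", "Raw"): 9_179_611_865,
    ("FRE", "Stoplist"): 4_750_229_994,
    ("FRE", "Stemmer"): 4_565_266_242,
    ("GER", "Lowercase"): 18_519_746_160,
    ("GER", "Raw"): 18_207_971_231,
    ("GER", "Stoplist"): 10_533_324_838,
    ("GER", "Stemmer"): 10_127_100_602,
}


def ratio_to_raw(sizes, language, stream_type):
    """Size of one stream divided by the raw stream for the same language."""
    return sizes[(language, stream_type)] / sizes[(language, "Raw")]


for language in ("ENG", "FRE", "GER"):
    for stream_type in ("Lowercase", "Stoplist", "Stemmer"):
        ratio = ratio_to_raw(SIZES, language, stream_type)
        print(f"{language} {stream_type}: {ratio:.3f}")
```

In all three languages the lowercase stream is slightly larger than the raw one, while the stoplist and stemmer streams come to a little over half of it.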
Conclusions
The exercise turned out to be useful in uncovering unrecognized bugs in the system
  E.g., dual extraction for hyphenated terms was only extracting the first term of a hyphenated pair, not both
Because the streams (so far) can be dumped as encountered, the performance impact is only the I/O overhead
Revisiting the text processing of the system suggested some new possible functions at this level
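The hyphenation bug is easy to picture with a small sketch. The first function mimics the faulty behaviour described above, and the second emits every part of the hyphenated term. Keeping the full hyphenated form alongside its parts is an assumption about what "dual extraction" means here, and neither function is taken from the Cheshire II source.

```python
def buggy_expand(token):
    """Mimic the reported bug: only the first part of the pair is extracted."""
    if "-" not in token:
        return [token]
    return [token, token.split("-")[0]]


def expand_hyphenated(token):
    """Emit the full term followed by each non-empty hyphen-separated part."""
    if "-" not in token:
        return [token]
    parts = [part for part in token.split("-") if part]
    return [token] + parts


print(buggy_expand("state-owned"))
print(expand_hyphenated("state-owned"))
```

Dumping the token stream makes this kind of omission visible at a glance, whereas in an index it shows up only as an unexplained missed match.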
Conclusions
The challenge will be to make the stream representations universal enough for sharing and combining different system results for different stages