IBM Research
© Copyright IBM Corporation 2005
A Development Environment for Configurable Meta-Annotators in a Pipelined
NLP Architecture
Youssef Drissi, Branimir Boguraev, Mary Neff, David Ferrucci, Paul Keyser and Anthony Levas
IBM T.J. Watson Research Center
{youssefd,bran,ferrucci,pkeyser,levas}@us.ibm.com
Outline
Background:
- Text Analytics
- Unstructured Information Management Architecture (UIMA)
The Challenges
- The Consumability Challenges
Our Approach to meet these challenges
- The Concept-Centric Approach
- Our Text Analytics Development Cycle
A Scenario (Demo)
- Detecting sentiments about cars from a corpus of car reviews
Text Analytics
[Figure: layered analysis of the example sentence "Fred Center is the CEO of Center Micros".
 Parser layer: NP, VP, PP.
 Named Entity layer: Person, Organization.
 Relationship layer: CeoOf (Arg1: Person, Arg2: Org).]
UIMA: A runtime framework for Text Analytics
UIMA: Unstructured Information Management Architecture
[Figure: data flowing through a pipeline of annotators.
 Annotators: Tokenizer, POS Tagger, PERSON Finder, COMPANY Finder, CEO Relationship.
 Analysis results (concepts): PERSON, COMPANY, CEO Relationship.
 Models represented by: list of terms, dictionaries, regular expressions, pattern files, statistical models, etc.]
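The pipeline idea pictured above can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions, not the UIMA API: the types `Span` and `SimpleAnnotator`, the class `PipelineSketch`, and the toy phrase lists are all hypothetical. Each annotator reads the text plus the annotations produced so far and appends its own, so the relationship annotator can build on the PERSON and COMPANY finders that ran before it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical types for illustration only; this is not the UIMA API.
final class Span {
    final String type;
    final int begin;
    final int end;

    Span(String type, int begin, int end) {
        this.type = type;
        this.begin = begin;
        this.end = end;
    }
}

interface SimpleAnnotator {
    // Reads the text and earlier annotations, then appends new annotations.
    void process(String text, List<Span> annotations);
}

final class PipelineSketch {
    // Tokenizer: one Token span per whitespace-delimited word.
    static final SimpleAnnotator TOKENIZER = (text, anns) -> {
        int i = 0;
        while (i < text.length()) {
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) i++;
            int start = i;
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) i++;
            if (i > start) anns.add(new Span("Token", start, i));
        }
    };

    // Toy concept finder: marks every occurrence of the listed phrases.
    static SimpleAnnotator phraseFinder(String type, List<String> phrases) {
        return (text, anns) -> {
            for (String p : phrases) {
                if (p.isEmpty()) continue; // an empty phrase would never advance
                for (int at = text.indexOf(p); at >= 0; at = text.indexOf(p, at + p.length())) {
                    anns.add(new Span(type, at, at + p.length()));
                }
            }
        };
    }

    // Relationship annotator: depends on PERSON and COMPANY spans found earlier.
    static final SimpleAnnotator CEO_RELATION = (text, anns) -> {
        List<Span> found = new ArrayList<>();
        for (Span person : anns) {
            if (!person.type.equals("PERSON")) continue;
            for (Span company : anns) {
                if (company.type.equals("COMPANY") && person.end <= company.begin
                        && text.substring(person.end, company.begin).contains("CEO of")) {
                    found.add(new Span("CeoOf", person.begin, company.end));
                }
            }
        }
        anns.addAll(found);
    };

    // Runs the annotators in order over one shared annotation list.
    static List<Span> analyze(String text) {
        List<SimpleAnnotator> pipeline = Arrays.asList(
                TOKENIZER,
                phraseFinder("PERSON", Arrays.asList("Fred Center")),
                phraseFinder("COMPANY", Arrays.asList("Center Micros")),
                CEO_RELATION);
        List<Span> anns = new ArrayList<>();
        for (SimpleAnnotator a : pipeline) a.process(text, anns);
        return anns;
    }

    public static void main(String[] args) {
        String text = "Fred Center is the CEO of Center Micros";
        for (Span s : analyze(text)) {
            System.out.println(s.type + ": " + text.substring(s.begin, s.end));
        }
    }
}
```

Reordering the list so that `CEO_RELATION` runs first would yield no relationship, which is the dependency a pipelined architecture has to manage.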
Sample Annotator: Java Code
/**
 * This annotator searches for person titles using simple string matching.
 *
 * @param aTCAS TCAS containing document text and previously discovered
 *        annotations, and to which new annotations are to be written.
 * @param aResultSpec A list of output types and features that this annotator
 *        should produce.
 *
 * @see com.ibm.uima.analysis_engine.annotator.TextAnnotator#process(TCAS, ResultSpecification)
 */
public void process(TCAS aTCAS, ResultSpecification aResultSpec)
    throws AnnotatorProcessException
{
  try
  {
    // If the ResultSpec doesn't include the PersonTitle type, we have
    // nothing to do.
    if (!aResultSpec.containsType("example.PersonTitle"))
    {
      return;
    }
    if (mContainingType == null)
    {
      // Search the whole document for PersonTitle annotations
      String text = aTCAS.getDocumentText();
      annotateRange(aTCAS, text, 0, aResultSpec);
    }
    else
    {
      // Search only within annotations of type mContainingType
      // Get an iterator over the annotations of type mContainingType.
      FSIterator it = aTCAS.getAnnotationIndex(mContainingType).iterator();
      // Loop over the iterator.
      while (it.isValid())
      {
        // Get the next annotation from the iterator
        AnnotationFS annot = (AnnotationFS) it.get();
        // Get text covered by this annotation
        String coveredText = annot.getCoveredText();
        // Get begin position of this annotation
        int annotBegin = annot.getBegin();
        // search for matches within this
        annotateRange(aTCAS, coveredText, annotBegin, aResultSpec);
        // Advance the iterator.
        it.moveToNext();
      }
    }
  }
  catch (Exception e)
  {
    throw new AnnotatorProcessException(e);
  }
}
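The slide does not show `annotateRange`, so the following is only a guess at the kind of "simple string matching" it performs, written as plain Java with no UIMA dependency. The class `TitleMatchSketch`, the method `findTitles`, and the title list are hypothetical. The point it illustrates is why the range's begin offset is passed along: matches are found relative to the covered text and must be shifted back into document coordinates.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; the slide does not show annotateRange.
final class TitleMatchSketch {
    // Assumed title list; a real annotator would likely read it from configuration.
    static final List<String> TITLES = Arrays.asList("Dr.", "Prof.", "Mr.", "Ms.");

    /**
     * Finds every title in rangeText and returns {begin, end} pairs in
     * document coordinates, obtained by adding rangeBegin to each local offset.
     */
    static List<int[]> findTitles(String rangeText, int rangeBegin) {
        List<int[]> matches = new ArrayList<>();
        for (String title : TITLES) {
            int pos = rangeText.indexOf(title);
            while (pos >= 0) {
                matches.add(new int[] {rangeBegin + pos, rangeBegin + pos + title.length()});
                pos = rangeText.indexOf(title, pos + title.length());
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        String document = "Yesterday Dr. Smith met Prof. Jones.";
        int rangeBegin = 10; // pretend a containing annotation starts here
        String covered = document.substring(rangeBegin);
        for (int[] m : findTitles(covered, rangeBegin)) {
            System.out.println(document.substring(m[0], m[1]) + " [" + m[0] + "," + m[1] + ")");
        }
    }
}
```

Plain substring search also fires inside longer words, so a production annotator would probably check token boundaries as well.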
Sample Annotator: AFST Grammar Syntax
# Shallow parser cascade: level 8
honour % SUB[] , PSUB[] , Phrase[] ;
boundary % Sentence[] ;
#_____
# auxtensed = Token[_unilex=~"VB+AUX:P"] | Token[_unilex=~"VB+AUX:Z"] | Token[_unilex=~"VB+AUX:D"] ;
vrbtensed = Token[_unilex=~"VB-AUX:P"] | Token[_unilex=~"VB-AUX:Z"] | Token[_unilex=~"VB-AUX:D"] ;
vrbuntensed = Token[_unilex=~"VB-AUX:I"] ;
vrbgrpmodal = ( VG[@descend] . Token[_unilex=~"MD"] . Token[_unilex=~"RB"]* .
                ( ( Token[_unilex=~"VB-AUX:I"] ) |
                  ( Token[_unilex=~"VB+AUX:I"] . Token[_unilex=~"VB-AUX:G"] ) ) .
                Token[_unilex=~"RB"]* . <U> ) |
              ( PVG[@descend] . Token[_unilex=~"MD"] . Token[_unilex=~"RB"]* .
                Token[_unilex=~"VB+AUX:I"] . Token[_unilex=~"RB"]* .
                Token[_unilex=~"VB-AUX:N"] . Token[_unilex=~"RB"]* . <U> ) ;
vrbgrpinfform = VG[@descend] . Token[_orth=~*SWORD]* . Token[_unilex=~"VB:I"] . <U> ;
#_____
simplenp = NP[] ;# simple noun phrase
possnp = PNP[] ;# possessive noun phrase
npp = NPP[] ;# noun phrase with a trailing PP
nplist = NPList[] ;# a list of NP's
complexnp = CNP[] ;# complex (appositive) NP
npphrase = :simplenp |
:possnp |
:npp |
:nplist |
:complexnp ; # an entity behaving like an NP
#______
export
scannerEight = ( :vrbgrptensed | :vrbgrpinfform ) .
Token[_unilex=~"RP"]|<E> .
<E>/[OBJ . :npphrase . <E>/]OBJ ;
Sample Annotator: Semantic Dictionary Authority File
<?xml version="1.0" encoding="UTF-8"?>
<authority name="BlueJAuthority">
  <FirstName>
    <class name="First" superclass="NameComponent">
      <instance base="Ronald" variant="Ronney" confidence="1.0" syncat="np" />
      <instance base="Ronald" variant="Ronni" confidence="1.0" syncat="np" />
      <instance base="Ronald" variant="Ronnie" confidence="1.0" syncat="np" />
      <instance base="Ronald" variant="Ronny" confidence="1.0" syncat="np" />
      <instance base="Ronald" variant="Rony" confidence="1.0" syncat="np" />
    </class>
  </FirstName>
  <MovieTitle>
    <class name="MovieTitle" superclass="Art">
      <instance base="12 Angry Men" variant="" confidence="1.0" syncat="np" />
      <instance base="2001 : A Space Odyssey" variant="" confidence="1.0" syncat="np" />
      <instance base="25th Hour" variant="" confidence="1.0" syncat="np" />
      <instance base="42nd Street" variant="" confidence="1.0" syncat="np" />
      <instance base="A Beautiful Mind" variant="" confidence="1.0" syncat="np" />
      <instance base="A Clockwork Orange" variant="" confidence="1.0" syncat="np" />
      <instance base="A Farewell to Arms" variant="" confidence="1.0" syncat="np" />
      <instance base="A Few Good Men" variant="" confidence="1.0" syncat="np" />
      <instance base="A League of Their Own" variant="" confidence="1.0" syncat="np" />
      <instance base="A Letter to Three Wives" variant="" confidence="1.0" syncat="np" />
      <instance base="A Life Less Ordinary" variant="" confidence="1.0" syncat="np" />
      <instance base="A Man for All Seasons" variant="" confidence="1.0" syncat="np" />
      <instance base="A Midsummer Night 's Dream" variant="" confidence="1.0" syncat="np" />
      <instance base="A New Hope" variant="" confidence="1.0" syncat="np" />
      <instance base="A Night At The Opera" variant="" confidence="1.0" syncat="np" />
    </class>
  </MovieTitle>
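To make the authority-file layout concrete, here is a minimal sketch of how such a file could be read into a lookup table with the standard Java DOM parser. It is not the tool's actual loader: the class `AuthorityFileSketch` and method `loadSurfaceForms` are hypothetical, the embedded sample is a shortened, explicitly closed copy of the excerpt above, and the rule "an empty `variant` means the `base` itself is the surface form" is an assumption.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical reader for illustration; not the tool's actual loader.
final class AuthorityFileSketch {
    // Shortened sample in the layout shown on the slide, closed so it is well-formed.
    static final String SAMPLE =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
            + "<authority name=\"BlueJAuthority\">"
            + "<FirstName><class name=\"First\" superclass=\"NameComponent\">"
            + "<instance base=\"Ronald\" variant=\"Ronnie\" confidence=\"1.0\" syncat=\"np\" />"
            + "<instance base=\"Ronald\" variant=\"Ronny\" confidence=\"1.0\" syncat=\"np\" />"
            + "</class></FirstName>"
            + "<MovieTitle><class name=\"MovieTitle\" superclass=\"Art\">"
            + "<instance base=\"12 Angry Men\" variant=\"\" confidence=\"1.0\" syncat=\"np\" />"
            + "</class></MovieTitle>"
            + "</authority>";

    /** Maps each surface form (variant, or base when variant is empty) to its base. */
    static Map<String, String> loadSurfaceForms(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList instances = doc.getElementsByTagName("instance");
            Map<String, String> surfaceToBase = new LinkedHashMap<>();
            for (int i = 0; i < instances.getLength(); i++) {
                Element e = (Element) instances.item(i);
                String base = e.getAttribute("base");
                String variant = e.getAttribute("variant");
                // Assumption: an empty variant means the base is the surface form.
                surfaceToBase.put(variant.isEmpty() ? base : variant, base);
            }
            return surfaceToBase;
        } catch (Exception ex) {
            throw new IllegalArgumentException("Could not parse authority file", ex);
        }
    }

    public static void main(String[] args) {
        loadSurfaceForms(SAMPLE).forEach((surface, base) ->
                System.out.println(surface + " -> " + base));
    }
}
```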
The Consumability Challenge
Building Analytics is a complex process
- Requires highly trained individuals:
  • NLP Experts
  • UIMA Experts
  • Advanced Java programmers with XML skills
- Is very time-consuming:
  • Need time for learning the UIMA framework
  • Need time for building the annotators
Key Features
End-to-End Text Analytics Development Tool
- Supports the full Cycle of Text Analytics Development Activities
Ease Of Use
- Insulates the user from the complexity of the underlying frameworks
Concept-Centric
- Lets the user think in terms of concepts as opposed to annotators and software components
Extensibility
- Supports plugging in new model types, model editors, results viewers, and exploration tools
Text Analytics Development Cycle
[Figure: the development cycle, beginning at "Start".
 Activities: Corpus & Domain Exploration; Type System Development; Identify Domain-Relevant Concepts; Develop Concept Models; Configure & Assemble Application Analysis Engine; Run Analytics; Evaluate Discovery Results.
 Artifacts: Ontology (Type System); Concept Models; Concept Finder; Structured Information; Evaluation Results.]
Scenario: Detecting Sentiments about Cars and Car Features
Demo
Conclusion
This work addresses the text analytics consumability challenges with a platform that provides:
- Support for the full Cycle of Text Analytics Development Activities
- Ease Of Use
- Support for a Concept-Centric development process
- Extensibility
Thank You
Merci
Shoukran
Overview
Concepts
- Concepts to find in Text
Documents
- Corpora that can be used in analysis
Concept Finders
- Analysis Engines built from concept models
Results
- Results from running Concept Finder on Corpora.
Domain Exploration
GlossEx: Domain Exploration Tool
Ontologies, Concepts and Models
Ontology
- A group of concepts in a domain
Concept
- A single concept in the domain
Model
- An analytic for finding a specific Concept
Building Models For Concepts
Build CarAspectModel using Semantic Dictionary CAT
1. Enter a representative Term
2. Select synonyms (e.g. from WordNet)
3. Store Terms in a dictionary
Building Models For Concepts
Build CarAspectModel using Semantic Dictionary CAT
1. Add representative Terms
2. Select synonyms (e.g. from WordNet)
3. Store Terms in a dictionary
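The three steps above (term, synonyms, dictionary) could end in entries like those on the authority-file slide. The sketch below is an assumption about that mapping, not a description of what the Semantic Dictionary CAT actually writes: the class `DictionaryEntrySketch`, its methods, and the fixed `confidence="1.0"` and `syncat="np"` values (copied from the sample file) are illustrative only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; the real tool's storage format is not shown in the slides.
final class DictionaryEntrySketch {
    // Minimal escaping for values placed inside double-quoted XML attributes.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace("\"", "&quot;");
    }

    private static String instance(String base, String variant) {
        // Assumed defaults, copied from the sample authority file.
        return "<instance base=\"" + escape(base) + "\" variant=\"" + escape(variant)
                + "\" confidence=\"1.0\" syncat=\"np\" />";
    }

    /** One entry for the representative term, then one per selected synonym. */
    static List<String> toInstances(String representativeTerm, List<String> synonyms) {
        List<String> lines = new ArrayList<>();
        lines.add(instance(representativeTerm, ""));
        for (String synonym : synonyms) {
            lines.add(instance(representativeTerm, synonym));
        }
        return lines;
    }

    public static void main(String[] args) {
        // Synonyms would come from the user's selection, e.g. out of WordNet.
        for (String line : toInstances("engine", Arrays.asList("motor", "powerplant"))) {
            System.out.println(line);
        }
    }
}
```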
Building Models
Build CarSentimentModel using AFST CAT
1. Drag and Drop ConceptModels onto WorkArea
2. Interconnect to define pattern sequence
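Interconnecting concept models into a pattern sequence amounts to saying "these concepts, in this order". The sketch below illustrates that idea with a plain consecutive-sequence matcher over per-token concept labels. It is deliberately much simpler than a finite-state transducer and is not the AFST engine; the class `PatternSequenceSketch` and the labels `Copula`, `Sentiment`, and `Other` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; this is a plain sequence matcher, not the AFST engine.
final class PatternSequenceSketch {
    /**
     * Returns every index at which the concept labels in pattern occur
     * consecutively in labels (one label per token, in text order).
     */
    static List<Integer> findSequences(List<String> labels, List<String> pattern) {
        List<Integer> starts = new ArrayList<>();
        if (pattern.isEmpty()) {
            return starts; // an empty pattern matches nothing in this sketch
        }
        for (int i = 0; i + pattern.size() <= labels.size(); i++) {
            if (labels.subList(i, i + pattern.size()).equals(pattern)) {
                starts.add(i);
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        // "The engine is powerful", labelled by earlier (assumed) concept models.
        List<String> labels = Arrays.asList("Other", "CarAspect", "Copula", "Sentiment");
        List<String> carSentiment = Arrays.asList("CarAspect", "Copula", "Sentiment");
        System.out.println(findSequences(labels, carSentiment)); // prints [1]
    }
}
```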
Building ConceptFinders
Build a ConceptFinder for CarSentiments
1. Select All Relevant Concepts
2. The System generates a ConceptFinder for the selected concepts
Running Analytics to get Results
Run ConceptFinder on a Corpus
1. Select ConceptFinder
2. Select Corpus
3. Run the analysis
Results Evaluation
Annotations Viewer
Iterative Refinement Tools
Concordance Viewer
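For readers unfamiliar with concordances, the following is a minimal keyword-in-context sketch: each hit is shown with a fixed window of surrounding characters so a model author can see how a term is actually used. It is a generic illustration, not the tool's viewer; the class `ConcordanceSketch`, the method `concordance`, and the bracket display are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical keyword-in-context sketch; not the tool's Concordance Viewer.
final class ConcordanceSketch {
    /** Returns one line per occurrence of keyword, with up to width characters of context per side. */
    static List<String> concordance(String text, String keyword, int width) {
        List<String> lines = new ArrayList<>();
        if (keyword.isEmpty()) {
            return lines; // nothing sensible to look up
        }
        for (int at = text.indexOf(keyword); at >= 0;
                at = text.indexOf(keyword, at + keyword.length())) {
            int left = Math.max(0, at - width);
            int right = Math.min(text.length(), at + keyword.length() + width);
            lines.add(text.substring(left, at) + "[" + keyword + "]"
                    + text.substring(at + keyword.length(), right));
        }
        return lines;
    }

    public static void main(String[] args) {
        String review = "The engine is quiet. I love the engine.";
        for (String line : concordance(review, "engine", 12)) {
            System.out.println(line);
        }
    }
}
```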
Results Evaluation
Collection Level Statistics: Comparing Results
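One simple form of collection-level comparison is to count annotations per concept across the whole corpus for two runs and look at the differences. The sketch below assumes each run is available as a list of concept labels per document. The class `CollectionStatsSketch` and both method names are hypothetical, and the slides do not say which statistics the tool actually reports.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of collection-level counts; not the tool's statistics view.
final class CollectionStatsSketch {
    /** Counts annotations per concept over all documents of one run. */
    static Map<String, Integer> countByConcept(List<List<String>> labelsPerDocument) {
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> document : labelsPerDocument) {
            for (String concept : document) {
                counts.merge(concept, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Per-concept change in count from the earlier run to the later run. */
    static Map<String, Integer> difference(Map<String, Integer> before, Map<String, Integer> after) {
        Map<String, Integer> delta = new TreeMap<>();
        for (Map.Entry<String, Integer> e : before.entrySet()) {
            delta.put(e.getKey(), -e.getValue());
        }
        for (Map.Entry<String, Integer> e : after.entrySet()) {
            delta.merge(e.getKey(), e.getValue(), Integer::sum);
        }
        return delta;
    }

    public static void main(String[] args) {
        // Two runs over the same two-document corpus, e.g. before and after refining a model.
        List<List<String>> run1 = Arrays.asList(
                Arrays.asList("CarAspect", "CarSentiment"), Arrays.asList("CarAspect"));
        List<List<String>> run2 = Arrays.asList(
                Arrays.asList("CarAspect", "CarSentiment"), Arrays.asList("CarAspect", "CarSentiment"));
        System.out.println(difference(countByConcept(run1), countByConcept(run2)));
    }
}
```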
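Plugin Components: CATs & KoGs
[Figure: two plugin frameworks.
 CATs Plugin Framework: a CAT pairs a Configurable Annotator with a UI, for example the Dictionary Configurable Annotator with the Semantic Dictionary UI.
 KoGs Plugin Framework: example KoGs are the Concordance Indexer and the Concordance Explorer UI.]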