Post on 29-Jan-2016
What’s “NEXT”?
Navigating through Dense Annotation Spaces
Branimir K. Boguraev, Mary S. Neff
Language Engineering for Content Analysis
IBM T.J. Watson Research Center, Yorktown Heights, NY
Dense Annotation Spaces
Service Reps can read customer name, in order to contact the customer.
{np} {nps} {md} {vb} {nn} {nn} {in} {nn} {to} {vb} {dt} {nn}
[NP] [NP] [NP] [NP] [VG] [VG]
[PP]
[SUB] [OBJ] [OBJ]
[SC]
[SENT]
Annotation ‘trees’
[Figure: the example sentence “Service Reps can read customer name, in order to contact the customer.” with its annotation layers (part-of-speech tags; [NP] and [VG]; [PP]; [SUB] and [OBJ]; [SC]; [SENT]), viewed as trees]
Annotation lattice
[Figure: the example sentence “Service Reps can read customer name, in order to contact the customer.” with its annotation layers (part-of-speech tags; [NP] and [VG]; [PP]; [SUB] and [OBJ]; [SC]; [SENT]), viewed as a lattice]
Navigational Challenges
[PNAME]
[Title] [Name]
[First] [Middle] [Last]
What is visible to the lattice traversal engine?
Annotation-Based Finite State Transducer (AFst)
o UIMA-based
o A finite state calculus over typed feature structures
o Cf. “grep” over a sequence of annotations, specified as types and features
np = <E>/[NP .
     Token[pos=~"DT"] | <E> .
     Token[pos=~"JJ"]* .
     ( Token[pos=~"NN"] | Token[pos=~"NNS"] ) .
     <E>/]NP ;
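
To make the “grep over annotations” analogy concrete, here is a minimal Python sketch of what the np rule describes: an optional determiner, any number of adjectives, then a singular or plural noun. It is not the AFst engine. The `Token` tuple and the `find_noun_phrases` helper are hypothetical names, tags are compared by plain equality rather than with the `=~` operator, and scanning resumes after each match by assumption.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (surface string, part-of-speech tag); hypothetical representation


def find_noun_phrases(tokens: List[Token]) -> List[Tuple[int, int]]:
    """Return (start, end) token index pairs, end exclusive, for DT? JJ* (NN | NNS)."""
    spans = []
    i = 0
    while i < len(tokens):
        j = i
        # Token[pos=~"DT"] | <E> : an optional determiner
        if j < len(tokens) and tokens[j][1] == "DT":
            j += 1
        # Token[pos=~"JJ"]* : zero or more adjectives
        while j < len(tokens) and tokens[j][1] == "JJ":
            j += 1
        # ( Token[pos=~"NN"] | Token[pos=~"NNS"] ) : the nominal head
        if j < len(tokens) and tokens[j][1] in ("NN", "NNS"):
            spans.append((i, j + 1))  # this is where the rule would post an NP annotation
            i = j + 1                 # assumption: resume scanning after the match
        else:
            i += 1
    return spans


tokens = [("the", "DT"), ("new", "JJ"), ("customer", "NN"), ("reads", "VBZ"), ("names", "NNS")]
print(find_noun_phrases(tokens))
```

The rule as written admits a single head noun, so a compound such as “customer name” would produce two separate matches in this sketch.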
Pitching the Iterator: support for navigational control
[Figure: the example sentence “Service Reps can read customer name, in order to contact the customer.” with its annotation layers (part-of-speech tags; [NP] and [VG]; [PP]; [SUB] and [OBJ]; [SC]; [SENT])]
Defining a particular path through the annotation space requires a lattice traversal engine that can focus, simultaneously, on:
o Sequential constraints ~ pattern matching
  - Horizontal: prenominal modifier and nominal head
o Structural constraints
  - Vertical: iterate over NPs with a specific configurational relationship, e.g. not sentence-initial, not in a PP
o Configurational constraints
  - Type prioritization
AFst Traversal Regime
Linearizing the Lattice: what’s “next”?
o Unambiguous Typeset iterator, inferred from grammar: … [SUB] . [VG] . [OBJ] . [PP] …
o UIMA natural annotation sort order:
  - Start position ascending
  - Length descending
  - Type priority, defined in UIMA descriptors
[NP] [NP] [NP] [NP] [VG] [VG]
[PP]
[SUB] [OBJ] [OBJ]
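
The three-part sort order can be sketched as a sort key, with the typeset iterator as a filter on top of it. This is an illustration in Python, not UIMA code. The `Annotation` tuple, the `TYPE_PRIORITY` list and both helper functions are hypothetical, and the priority list only stands in for what a UIMA descriptor would declare. The sketch filters by type only and does not resolve overlaps.

```python
from typing import Iterable, List, NamedTuple, Sequence


class Annotation(NamedTuple):
    type_name: str
    begin: int
    end: int


# Hypothetical type priority list; earlier entries sort first on ties.
TYPE_PRIORITY = ["SENT", "SC", "SUB", "OBJ", "PP", "NP", "VG"]


def natural_order(annotations: Iterable[Annotation],
                  type_priority: Sequence[str] = TYPE_PRIORITY) -> List[Annotation]:
    """Sort by start ascending, then length descending, then type priority."""
    rank = {name: i for i, name in enumerate(type_priority)}
    return sorted(
        annotations,
        key=lambda a: (a.begin, -(a.end - a.begin), rank.get(a.type_name, len(rank))),
    )


def typeset_view(annotations: Iterable[Annotation], types: Iterable[str]) -> List[Annotation]:
    """Keep only the types a grammar mentions, in natural order."""
    wanted = set(types)
    return [a for a in natural_order(annotations) if a.type_name in wanted]


annotations = [
    Annotation("VG", 13, 21),
    Annotation("NP", 0, 12),
    Annotation("SUB", 0, 12),
    Annotation("SENT", 0, 70),
]
print([a.type_name for a in natural_order(annotations)])
print([a.type_name for a in typeset_view(annotations, ["SUB", "VG", "OBJ", "PP"])])
```

In this sketch, restricting the view to the types named in the grammar is what makes “next” well defined: the [NP] that shares a span with [SUB] is simply never seen.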
Linearizing the Lattice: what’s “next”?
Grammar-wide declarations
boundary % Sentence[];
honour % Address[] ;
month = Token[lemma=~"January"] |
        Token[lemma=~"February"] | … ;
date = <E>/[Year .
       :month | <E> .
       Token[string=~:^[12]\d{3}$:] .
       <E>/]Year;
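
The date rule and the boundary declaration can be approximated in Python as follows. This is a sketch under stated assumptions, not the AFst semantics. `MONTHS`, `find_years` and `find_years_within` are hypothetical names. The month set holds only the two months the slide spells out. The `boundary % Sentence[]` declaration is read here as “a match never crosses a sentence boundary”, and the `honour` declaration is not modelled.

```python
import re
from typing import List, Tuple

MONTHS = {"January", "February"}   # the slide elides the remaining alternatives
YEAR = re.compile(r"^[12]\d{3}$")  # four digits starting with 1 or 2


def find_years(lemmas: List[str]) -> List[Tuple[int, int]]:
    """Return (start, end) spans, end exclusive, for (month | empty) followed by a year."""
    spans = []
    for i, lemma in enumerate(lemmas):
        if YEAR.match(lemma):
            has_month = i > 0 and lemmas[i - 1] in MONTHS
            spans.append((i - 1 if has_month else i, i + 1))
    return spans


def find_years_within(sentences: List[List[str]]) -> List[List[Tuple[int, int]]]:
    """Assumed reading of 'boundary % Sentence[]': match inside each sentence separately."""
    return [find_years(sentence) for sentence in sentences]


print(find_years(["in", "January", "2016", "and", "12345"]))
print(find_years_within([["January"], ["2016"]]))
```

In the second call the month sits in a different sentence from the year, so the match covers the year token alone.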
Focus: Selecting Nested Boundary Annotations
<nameValuePair>
  <name>Focus</name>
  <value><array>
    <string>Section[label=~:Education:]</string>
    <string>Sentence[number==1]</string>
  </array></value>
</nameValuePair>
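
One way to read the Focus setting is as a chain of nested selectors, outermost first: find the sections whose label matches “Education”, then the first sentence inside each. The Python sketch below illustrates that reading only. The dictionary representation and the `focus`, `covers`, `is_education_section` and `is_first_sentence` helpers are hypothetical, and sentence numbers are assumed to restart in every section.

```python
import re


def covers(outer, inner):
    """True when `inner` lies entirely within the span of `outer`."""
    return outer["begin"] <= inner["begin"] and inner["end"] <= outer["end"]


def focus(annotations, selectors):
    """Apply selectors outermost first; each level must lie inside the previous one."""
    scopes = [None]  # None stands for the whole document
    for matches in selectors:
        scopes = [
            a
            for scope in scopes
            for a in annotations
            if matches(a) and (scope is None or covers(scope, a))
        ]
    return scopes


def is_education_section(a):
    # Section[label=~:Education:]
    return a["type"] == "Section" and re.search("Education", str(a.get("label", ""))) is not None


def is_first_sentence(a):
    # Sentence[number==1]
    return a["type"] == "Sentence" and a.get("number") == 1


annotations = [
    {"type": "Section", "label": "Experience", "begin": 0, "end": 100},
    {"type": "Sentence", "number": 1, "begin": 0, "end": 40},
    {"type": "Section", "label": "Education", "begin": 100, "end": 200},
    {"type": "Sentence", "number": 1, "begin": 100, "end": 130},
    {"type": "Sentence", "number": 2, "begin": 131, "end": 200},
]
focused = focus(annotations, [is_education_section, is_first_sentence])
print(focused)
```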
Linearizing the Lattice: what’s “next”?
Grammar-wide declarations
match % first, last, longest, shortest, all
advance % skip, step
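
The slide lists the option names without defining them, so the Python sketch below rests entirely on assumptions. `longest`, `shortest` and `all` are taken to choose among the candidate matches that start at the same position. `skip` is taken to resume after the end of the match and `step` at the next position. `first` and `last` are left out because their meaning cannot be inferred from the slide. All function names are hypothetical.

```python
def select(candidates, match):
    """Pick which of the candidate (start, end) spans found at one position to keep."""
    if not candidates:
        return []
    if match == "longest":
        return [max(candidates, key=lambda s: s[1] - s[0])]
    if match == "shortest":
        return [min(candidates, key=lambda s: s[1] - s[0])]
    if match == "all":
        return list(candidates)
    raise ValueError(f"unsupported match policy: {match}")


def scan(length, candidates_at, match="longest", advance="skip"):
    """Walk positions 0..length-1, collecting the matches kept at each position."""
    results, position = [], 0
    while position < length:
        kept = select(candidates_at(position), match)
        results.extend(kept)
        if advance == "skip" and kept:
            # assumed 'skip': jump past the furthest end of what was just matched
            position = max(position + 1, max(end for _, end in kept))
        else:
            # assumed 'step': move on by one position
            position += 1
    return results


tags = ["NN", "NN", "VB", "NN"]


def noun_runs(start):
    """All runs of consecutive NN tags that begin at `start`."""
    spans, end = [], start
    while end < len(tags) and tags[end] == "NN":
        end += 1
        spans.append((start, end))
    return spans


print(scan(len(tags), noun_runs, match="longest", advance="skip"))
print(scan(len(tags), noun_runs, match="shortest", advance="step"))
```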
What’s “next”?: Switching Levels, Mixed Iterator
Refocus the iterator to examine inner contour: @descend, @ascend
findDrSmith =
  <E>/PName[@descend] .
  Title[string=~"Dr."] .
  <E>/Name[@descend] .
  First[] | <E> .
  Last[string=="Smith"] .
  <E>/Name[@ascend] .
  <E>/PName[@ascend] ;
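
The level switching in findDrSmith can be pictured as stepping into an annotation's children and back out. The Python sketch below models nesting with an explicit `children` list, which is an assumption made for illustration. The `Node` class and `is_dr_smith` function are hypothetical, strings are compared by equality rather than with `=~`, and each level is assumed to be fully consumed before ascending.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    type_name: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def is_dr_smith(pname: Node) -> bool:
    """Mirror findDrSmith: PName > (Title "Dr.", Name > (First?, Last "Smith"))."""
    if pname.type_name != "PName":
        return False
    inner = pname.children                 # PName[@descend]
    if len(inner) != 2:
        return False
    title, name = inner
    if title.type_name != "Title" or title.text != "Dr.":
        return False
    if name.type_name != "Name":
        return False
    parts = name.children                  # Name[@descend]
    if parts and parts[0].type_name == "First":
        parts = parts[1:]                  # First[] | <E>
    # Last[string=="Smith"], then @ascend out of Name and PName
    return len(parts) == 1 and parts[0].type_name == "Last" and parts[0].text == "Smith"


dr_john_smith = Node("PName", children=[
    Node("Title", "Dr."),
    Node("Name", children=[Node("First", "John"), Node("Last", "Smith")]),
])
dr_smith = Node("PName", children=[
    Node("Title", "Dr."),
    Node("Name", children=[Node("Last", "Smith")]),
])
dr_jones = Node("PName", children=[
    Node("Title", "Dr."),
    Node("Name", children=[Node("Last", "Jones")]),
])
print(is_dr_smith(dr_john_smith), is_dr_smith(dr_smith), is_dr_smith(dr_jones))
```

The rule as written has no slot for a [Middle] annotation, so under this reading a name with a middle component would not match.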
Alternate Multiple Level Access
Upper/lower context without switching levels
Token[_costarts=~Sentence[number==1]];
Subject[_covers=~PName[]];
PName[_costarts=~NP[],_coends=~NP[]];
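
The three positional constraints reduce to simple span comparisons. In the Python sketch below the `Annotation` tuple and all function names are hypothetical, and `_covers` is read as “the annotation being tested spans the one named in the constraint”, which the slide does not state explicitly.

```python
from typing import List, NamedTuple


class Annotation(NamedTuple):
    type_name: str
    begin: int
    end: int


def costarts(a: Annotation, b: Annotation) -> bool:
    return a.begin == b.begin


def coends(a: Annotation, b: Annotation) -> bool:
    return a.end == b.end


def covers(a: Annotation, b: Annotation) -> bool:
    return a.begin <= b.begin and b.end <= a.end


def pnames_aligned_with_np(annotations: List[Annotation]) -> List[Annotation]:
    """PName[_costarts=~NP[],_coends=~NP[]]: starts where some NP starts, ends where some NP ends."""
    nps = [a for a in annotations if a.type_name == "NP"]
    return [
        p for p in annotations
        if p.type_name == "PName"
        and any(costarts(p, n) for n in nps)
        and any(coends(p, n) for n in nps)
    ]


annotations = [
    Annotation("NP", 0, 10),
    Annotation("PName", 0, 10),
    Annotation("PName", 4, 10),
    Annotation("NP", 20, 30),
]
print(pnames_aligned_with_np(annotations))
```

The second PName ends with an NP but does not start with one, so only the first is selected.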
Grammar cascading
o From simpler to more complex analyses
o Outputs of lower levels feed as inputs into higher levels
  - Small noun phrases & verb groups
  - Prepositional, possessive & adjectival phrases
  - More complex noun phrases
  - Variety of clause types
  - Grammatical relations (subject, object)
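
A cascade can be sketched as a list of stages, each reading everything produced so far and adding its own annotations. The two toy stages below are hypothetical and far simpler than real grammars. `chunk` merges adjacent noun tags into NP and adjacent verb tags into VG. `relations` then reads those phrases, never the raw tags, to emit SUB and OBJ. Annotations are (type, begin, end) tuples over token indices, and the input tags are assumed to arrive in text order.

```python
NOUN_TAGS = {"np", "nps", "nn"}
VERB_TAGS = {"md", "vb"}


def chunk(annotations):
    """Level 1: merge adjacent noun tags into NP and adjacent verb tags into VG."""
    out = []
    for kind, begin, end in annotations:
        label = "NP" if kind in NOUN_TAGS else "VG" if kind in VERB_TAGS else None
        if label is None:
            continue
        if out and out[-1][0] == label and out[-1][2] == begin:
            out[-1] = (label, out[-1][1], end)   # extend the current phrase
        else:
            out.append((label, begin, end))
    return out


def relations(annotations):
    """Level 2: an NP right before a VG is SUB; an NP right after a VG is OBJ."""
    phrases = sorted((a for a in annotations if a[0] in ("NP", "VG")), key=lambda a: a[1])
    out = []
    for left, right in zip(phrases, phrases[1:]):
        if left[2] != right[1]:
            continue
        if left[0] == "NP" and right[0] == "VG":
            out.append(("SUB", left[1], left[2]))
        elif left[0] == "VG" and right[0] == "NP":
            out.append(("OBJ", right[1], right[2]))
    return out


def run_cascade(annotations, stages):
    """Feed the growing annotation set through each stage in turn."""
    for stage in stages:
        annotations = annotations + stage(annotations)
    return annotations


# "Service Reps can read customer name", tagged as on the slides
tags = [("np", 0, 1), ("nps", 1, 2), ("md", 2, 3), ("vb", 3, 4), ("nn", 4, 5), ("nn", 5, 6)]
result = run_cascade(tags, [chunk, relations])
print(result[len(tags):])
```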
Implementations
o Shallow parsing
o Named entity detection interleaved with shallow parsing
o Terminology identification in new domains
o Temporal expression parsing
o Privacy policy rules
o Information extraction from resumes
o Information extraction from contact center telephone calls
Future work list
o Alternate (semi-ambiguous) iterator, useful for “disambiguator” grammars
  - Actor[] Director[]
o Tree-walk iterator for tree representations where children are explicitly referenced in features
Performance Notes
Performance is a function of
o How grammar is written
o Optimisation of fst graph (grammar compiler)
o Optimisation of symbol compiler
o Optimisation of executor
However … for the benefit of the curious … IBM Software Group (Dublin) optimised the last two, and …
IBM LanguageWare (Dublin) text analysis performance results
The analysis:
- AFST rules and FST dictionary
- 26 rules, 7 dictionaries (things like first names, indicators like Corp. etc)
- creating Person and Company annotations
The Test
- test set: Enron
- 924 files (4.5Mb)
The Results:
Precision for Company Annotations only: 0.81
Recall for Company Annotations only: 0.67
Precision for Person Annotations only: 0.93
Recall for Person Annotations only: 0.91
Processing time: 3.4 seconds
This is 10 times faster than the best-of-breed internal reference annotators.
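
For readers who want a single figure per annotation type, the reported precision and recall can be combined into an F1 score, and the test-set size and processing time into a throughput. Neither derived number appears on the slide. The Python sketch below only does the arithmetic and assumes that “Mb” means megabytes.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall) as reported on the slide
results = {"Company": (0.81, 0.67), "Person": (0.93, 0.91)}
for label, (p, r) in results.items():
    print(f"{label}: F1 = {f1(p, r):.2f}")

# Assumption: "4.5Mb" is 4.5 megabytes
megabytes, files, seconds = 4.5, 924, 3.4
print(f"{megabytes / seconds:.2f} MB/s, {files / seconds:.0f} files/s")
```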
Perpetrators … er… Responsible parties
Bran Boguraev, Mary Neff, Bran Lambov, D.J. McCloskey
Thilo Goetz, Thomas Hampp, Oliver Suhre
Roy Byrd, Herb Chong, Albert Eskenazi, Paul Kaye, Son Bao Pham, Lokesh Shresta, Max Silberztein
For more on AFst and tools --
Tomorrow, 12:25 in Fez 1:
A Development Environment for Configurable Meta-Annotators in a Pipelined NLP Environment
Youssef Drissi, Branimir Boguraev, David Ferrucci, Paul Keyser, and Anthony Levas