ANNIE and JAPE GATE Training Course 23 November 2006 Diana Maynard Andrey Shafirin.


ANNIE and JAPE

GATE Training Course, 23 November 2006

Diana Maynard, Andrey Shafirin


GATE and Information Extraction

● Basic introduction to IE and GATE

● Overview of ANNIE

● JAPE: rule writing

● JAPE debugger


GATE and IE

● IE is one of the core tasks GATE is designed for

● IE is the basis for many other, more complex applications, e.g. semantic annotation

● Cornerstone of IE is Named Entity Recognition


A Typical IE System

1. Pre-processing
– format detection
– tokenisation
– word segmentation
– sense disambiguation
– sentence splitting
– POS tagging

2. Named entity detection
– entity detection
– coreference


Two Approaches to IE

Knowledge Engineering
● rule based
● developed by experienced language engineers
● make use of human intuition
● obtain marginally better performance
● development could be very time consuming
● some changes may be hard to accommodate

Learning Systems
● use statistics or other machine learning
● developers do not need LE expertise
● require large amounts of annotated training data
● some changes may require re-annotation of the entire training corpus


Named Entity Recognition

● NE involves identification of proper names in texts, and classification into a set of predefined categories of interest.

● Three universally accepted categories: person, location and organisation

● Other common tasks: recognition of date/time expressions, measures (percent, money, weight etc), email addresses etc.

● Other domain-specific entities: names of drugs, medical conditions, names of ships, bibliographic references etc.


ANNIE

[Figure: ANNIE pipeline diagram. Input (URL or text) in a document format (XML, HTML, SGML, email, …) becomes a GATE Document, which passes through the ANNIE IE modules: Unicode Tokeniser (character class sequence rules), FS Gazetteer Lookup (lists), Sentence Splitter (JAPE sentence patterns), Hepple POS Tagger (Brill rules, lexicon), Semantic Tagger (JAPE IE grammar cascade), OrthoMatcher, Pronominal Coreferencer (JAPE grammar). Output: GATE Document, XML dump of IE annotations.]

NOTE: square boxes are processes, rounded ones are data.


Unicode Tokeniser

● Bases tokenisation on Unicode character classes

● Language-independent tokenisation

● Declarative token specification language, e.g.:

"UPPERCASE_LETTER" "LOWERCASE_LETTER"* > Token;orth=upperInitial;kind=word;

Look at the ANNIE English tokeniser and at tokenisers for other languages (in the plugins directory) for more information and examples


Gazetteer

● Set of lists compiled into Finite State Machines
● 60k entries in 80 types, inc.: organization; artifact; location; amount_unit; manufacturer; transport_means; company_designator; currency_unit; date; government_designator; ...
● Each list has attributes MajorType and MinorType (and Language):
city.lst: location: city: english
currency_prefix.lst: currency_unit: pre_amount
currency_unit.lst: currency_unit: post_amount
● Attributes are used as input to JAPE grammars
● List entries may be entities or parts of entities, or they may contain contextual information (e.g. job titles often indicate people)
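As an illustration of gazetteer attributes feeding a JAPE grammar, the following is a minimal sketch rather than a rule from the ANNIE grammars. The phase name, rule name and the Money output type are hypothetical; the majorType and minorType values are the ones from the currency_prefix.lst line above, and Token.kind is assumed to be set by the ANNIE tokeniser.

```jape
Phase: MoneySketch
Input: Token Lookup
Options: control = appelt

// A currency prefix found by the gazetteer, followed by a number, e.g. "$ 20"
Rule: PrefixedAmount
(
  {Lookup.majorType == currency_unit, Lookup.minorType == pre_amount}
  {Token.kind == number}
):amount
-->
:amount.Money = {kind = "amount", rule = "PrefixedAmount"}
```

The gazetteer only marks the currency symbol; it is the grammar that combines the Lookup with its neighbouring token to form the entity.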


The Named Entity Grammar

● JAPE phases run sequentially and constitute a cascade of FSTs over annotations
● hand-coded rules applied to annotations to identify NEs
● annotations from format analysis, tokeniser, POS tagger and gazetteer modules
● use of contextual information
● rule priority based on pattern length, rule status and rule ordering
● Common entities: persons, locations, organisations, dates, addresses.
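A cascade is usually declared in a main grammar file that lists the phases in the order they should run. The sketch below shows the general shape only; the grammar name and the phase names are hypothetical, and each phase name is assumed to correspond to a .jape file of the same name in the same directory.

```jape
// main.jape (hypothetical): the phases of the cascade, in execution order
MultiPhase: NESketch
Phases:
  first_pass
  person
  location
  clean
```

Because the phases run sequentially, annotations created by an earlier phase (for example first_pass) can be used as input by the later ones.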


Orthomatcher

● Orthographic coreference between annotations in the same document, e.g. Mr Brown, James Brown

● Matching rules are invoked between annotations of the same type, or between an existing annotation and an “Unknown” annotation

● The latter is the only case where an annotation type can be changed

● Lookup tables of aliases and exceptions (i.e. overriding of matching rules)

● Also pronominal coreference (see User Guide)


JAPE: a Jolly And Pleasant Experience

● Grammars (cascades of phases)
– Phases (lists of rules)

● Rules
– LHS (patterns)
– RHS (actions)

● Priority
– Implicit: longest match, first mention
– Explicit: priority


LHS of JAPE rules

● The LHS of the rule contains patterns to be matched, in the form of annotations (and optionally their attributes).

● Annotation types to be recognised must be declared at the beginning of the phase

● Annotations may be combined using traditional operators [ | * + ?]

● There is no negative operator

● More than one pattern can be matched in a single rule

● Left and right context (not to be annotated) can be matched
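The points above can be put together in a single phase file. This is a minimal sketch under stated assumptions, not a rule from the ANNIE grammars: the phase name, the rule name and the Year output type are hypothetical, and Token.kind and Token.length are assumed to be the features set by the ANNIE tokeniser.

```jape
// The Input line declares the annotation types the LHS may refer to
Phase: YearSketch
Input: Token
Options: control = appelt

Rule: YearInContext
(
  // Left context: matched, but outside the :year binding, so not annotated
  {Token.string == "in"} | {Token.string == "by"}
)
(
  // A four-digit number, e.g. "2006"
  {Token.kind == number, Token.length == "4"}
):year
-->
:year.Year = {rule = "YearInContext"}
```

Only the span bound to :year receives the new annotation; the preposition is consumed by the match but left unannotated.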


Examples of LHS patterns

({Lookup.majorType == location}):loc

---------------------

({Token.string == "in"} | {Token.string == "by"})
({Year}):date

--------------------

(
  ({Lookup.majorType == jobtitle}):jobtitle
  {Surname}
):person


RHS of JAPE rules

({Lookup.majorType == location}):loc
-->
:loc.Location = {kind = "city", rule = "Location1"}

----------------------

(
  ({Lookup.majorType == jobtitle}):jobtitle
  {Surname}
):person
-->
:jobtitle.JobTitle = {rule = "PersonJobTitle"},
:person.Person = {kind = "Surname", rule = "PersonJobTitle"}


Complex RHS

● JAPE RHS is quite limited in what you can do
● But you can use any Java you like on the RHS of the rule
● Useful for e.g. removing temporary annotations and percolating and manipulating features from previous annotations
● Also means you can use JAPE for many other things apart from just creating annotations, e.g. counting things, manipulating the text, adding annotations to the document, etc.
● And you don’t have to be a Java expert to do it.
● Although it helps to have friends who are….
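As a small illustration of "counting things" with Java on the RHS, the sketch below keeps a running total as a document feature. It is an assumption-laden example, not part of ANNIE: the phase name, rule name and the personCount feature name are hypothetical, and Person annotations are assumed to have been created by an earlier phase.

```jape
Phase: CountSketch
Input: Person
Options: control = appelt

Rule: CountPerson
(
  {Person}
):person
-->
{
  // Read the running total kept as a document feature (absent on the first match)
  Integer soFar = (Integer) doc.getFeatures().get("personCount");
  int updated = (soFar == null) ? 1 : soFar.intValue() + 1;
  // Store the new total back on the document
  doc.getFeatures().put("personCount", Integer.valueOf(updated));
}
```

Note that the total accumulates if the phase is run more than once over the same document, so the feature would need resetting between runs.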


Example of using Java in a rule

Rule: FirstName
(
  {Lookup.majorType == person_first}
):person
-->
{
  // Annotations bound to the label "person" on the LHS
  gate.AnnotationSet person = (gate.AnnotationSet)bindings.get("person");
  // The matched Lookup annotation
  gate.Annotation personAnn = (gate.Annotation)person.iterator().next();
  // Copy the Lookup's minorType into a "gender" feature
  gate.FeatureMap features = Factory.newFeatureMap();
  features.put("gender", personAnn.getFeatures().get("minorType"));
  features.put("rule", "FirstName");
  // Create the new annotation over the matched span
  outputAS.add(person.firstNode(), person.lastNode(), "FirstPerson", features);
}


Available Java objects

● bindings: binding variables
● doc: GATE Document
● annotations: all GATE Document annotations
● inputAS, outputAS: phase input and output annotations
● ontology

See documentation for more details…..


JAPE Application modes

● Brill (fires all matches)
● First (shortest match fires)
● Once (Phase exits after first match)
● All (as for Brill, but matching continues from offset following the current one, not from the end of the last match)
● Appelt (priority ordering: longest match fires, then explicit rule priority, then first defined rule fires)

Note that prioritisation only operates within a single phase, not globally
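The application mode is chosen per phase in the Options line. The following is a minimal sketch of how the choice changes the outcome; the phase, rule and output annotation names are hypothetical, and Token.orth is assumed to be the orthography feature set by the ANNIE tokeniser.

```jape
// Other values for control: brill, first, once, all
Phase: ModesSketch
Input: Token
Options: control = appelt

// Matches a run of capitalised tokens, e.g. "New York"
Rule: UpperRun
Priority: 10
(
  ({Token.orth == upperInitial})+
):run
-->
:run.UpperRun = {rule = "UpperRun"}

// Matches a single capitalised token, e.g. "New"
Rule: SingleUpper
Priority: 50
(
  {Token.orth == upperInitial}
):one
-->
:one.SingleUpper = {rule = "SingleUpper"}
```

On the text "New York" under appelt, UpperRun fires because it has the longest match, even though its explicit priority is lower; priority is only consulted when matches have the same length. With control = brill, every match of both rules would fire.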


{A}+ Application Modes

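[Figure: which spans the pattern {A}+ matches over a sequence of three A annotations (A A A) in each application mode: Appelt, Once, Brill, First, All.]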


Example: “China Sea”

Rule: Location1
Priority: 25
(
  ({Lookup.majorType == loc_key, Lookup.minorType == pre})?
  {Lookup.minorType == country}
  ({Lookup.majorType == loc_key, Lookup.minorType == post})?
):locName
-->
:locName.Location = {kind = "location", rule = "Location1"}

Rule: Location2
Priority: 20
(
  {Lookup.minorType == location}
):location
-->
:location.Name = {kind = "location", rule = "GazLocation"}


JAPE Hints and Tricks

● JAPE is quite limited in some respects as to what can be done
– There is no negative operator
– It can be slow if it is badly written, e.g. ({Token})*
– Context is consumed, which can make rule-writing awkward
– Priority can be difficult to set correctly

● But fear not, there is generally a sneaky way around it…..


How to prevent a pattern from matching

Rule: disablePattern
Priority: 1000
(<pattern>)
-->
{}

● Instead of having a negative operator, we can simply add a high-priority rule which does nothing when fired.

● This rule will be preferred to a lower-priority rule which performs the intended action, so the action is performed only when the high-priority pattern does not apply.
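A concrete version of this trick might look as follows. It is a sketch only: the phase, rule and Candidate annotation names are hypothetical, and it assumes that the gazetteer marks date words such as "May" with majorType date and that the tokeniser sets Token.orth. Both rules match the same single-token span, so under appelt the explicit priority decides between them.

```jape
Phase: BlockSketch
Input: Token Lookup
Options: control = appelt

// High priority, empty action: date words such as "May" are consumed silently
Rule: SkipDateWord
Priority: 1000
(
  {Lookup.majorType == date}
):skip
-->
{}

// Lower priority: any other capitalised token becomes a Candidate
Rule: CapitalisedCandidate
Priority: 10
(
  {Token.orth == upperInitial}
):cand
-->
:cand.Candidate = {rule = "CapitalisedCandidate"}
```

The blocking rule only wins when its match is at least as long as the competing one, since appelt prefers the longest match before looking at priority.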


How to play with input annotations

Input: Person Organisation VerbWork Split
…
Rule: RelationWorkIn
({Person} {VerbWork} {Organisation})
-->
{… /* create annotation of type "Relation" */ …}

● Use existing annotations to find relations
● We ignore Tokens to enable more flexibility, i.e. there could be additional words between the annotations specified
● Split ensures we don’t cross sentence boundaries
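One way the elided RHS could be filled in is sketched below. This is an illustrative assumption, not the original rule body: the phase name, the extra :person, :org and :rel labels, and the personId and orgId feature names are all hypothetical. The Person, VerbWork and Organisation annotations are assumed to have been created by earlier phases.

```jape
Phase: RelationSketch
Input: Person Organisation VerbWork Split
Options: control = appelt

Rule: RelationWorkIn
(
  ({Person}):person
  {VerbWork}
  ({Organisation}):org
):rel
-->
{
  // The whole matched span, and the two entities inside it
  gate.AnnotationSet rel = (gate.AnnotationSet) bindings.get("rel");
  gate.Annotation person =
      ((gate.AnnotationSet) bindings.get("person")).iterator().next();
  gate.Annotation org =
      ((gate.AnnotationSet) bindings.get("org")).iterator().next();

  // Record which annotations take part in the relation
  gate.FeatureMap features = Factory.newFeatureMap();
  features.put("personId", person.getId());
  features.put("orgId", org.getId());
  features.put("rule", "RelationWorkIn");

  // Create the Relation annotation over the whole matched span
  outputAS.add(rel.firstNode(), rel.lastNode(), "Relation", features);
}
```

Because Token is not listed in Input, words between the three annotations are skipped, while a Split annotation between them prevents the match.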


How to deal with overlapping annotations

● Because matched annotations are consumed, when two annotations overlap (e.g. in gazetteer lists), the second one will never be matched.

● E.g. for the string “hALCAM” with Lookups hAL, ALCAM, and CAM, ALCAM will never be matched

● Solution is to delete the annotations once matched, and then rerun the same grammar phase over the text

● The process may need to be repeated several times (determine by trial and error)
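The delete-and-rerun approach could be sketched like this. The phase and rule names, the Protein output type and the majorType value are hypothetical; the rule creates an annotation over the matched Lookup and then removes that Lookup from the input set, so a second run of the same phase can match an overlapping Lookup that was skipped the first time.

```jape
Phase: OverlapSketch
Input: Lookup
Options: control = appelt

Rule: ConsumeLookup
(
  {Lookup.majorType == protein}
):match
-->
:match.Protein = {rule = "ConsumeLookup"},
{
  // Remove the matched Lookup so that a rerun of this phase
  // can reach Lookups that overlapped with it
  gate.AnnotationSet match = (gate.AnnotationSet) bindings.get("match");
  inputAS.removeAll(match);
}
```

In a multiphase grammar the rerun amounts to including the same phase more than once in the pipeline, with the number of repetitions found by trial and error as noted above.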


More examples

● In the GATE User Guide under the section “Useful tricks with JAPE”

● Look in the ANNIE grammars and in the foreign language grammars – there are many examples of little tricks

● Check the GATE mailing list archives


Custom Processing Resource for your grammars

1. Java developer extends GATE's default JAPE Transducer, creating a Java class

package com.yourcompany;
import gate.creole.Transducer;
public class CustomTransducer extends Transducer {}

2. JAPE developer adds definition in the plugin’s creole.xml

<RESOURCE>
  <NAME>My custom JAPE Transducer</NAME>
  <CLASS>com.yourcompany.CustomTransducer</CLASS>
  <PARAMETER NAME="document" RUNTIME="true">gate.Document</PARAMETER>
  <PARAMETER NAME="inputASName" RUNTIME="true" OPTIONAL="true">java.lang.String</PARAMETER>
  <PARAMETER NAME="outputASName" RUNTIME="true" OPTIONAL="true">java.lang.String</PARAMETER>
  <PARAMETER NAME="grammarURL" DEFAULT="myDir/myMain.jape" SUFFIXES="jape">java.net.URL</PARAMETER>
  <PARAMETER NAME="encoding" DEFAULT="UTF-8">java.lang.String</PARAMETER>
</RESOURCE>

3. GATE user opens custom resource in GATE GUI

Right-click on “Processing Resources”
In the pop-up menu select “New >” --> “My custom JAPE Transducer”


JAPE debugger

● Speeds up the development of JAPE grammars

● Integrated in GATE GUI

● Friendly for non-experts

Allows you to:

● Inspect the pattern matching

● Find overridden rules

● Detect complex inter-rule influence

● And many other things


Inspection of pattern matching


Overridden rules


Inter-rule influence (finding problem)


Inter-rule influence (what is that?)


Inter-rule influence (problem synopsis)

Text processed:

… of the J. L. Kellog Graduate School of Management and the Indiana University School of Business …

Conflicting rule:

Rule: NotPersonFull
Priority: 80
// Det + Surname
// This rule was commented course
// J.L. Kellog processed without J.
// 17.06.03
(
  {Token.category == DT} |
  {Token.category == PRP} |
  {Token.category == RB}
)
(
  (PREFIX)* (UPPER) (PERSONENDING)?
):foo

Shadowed rule:

Rule: PersonFullExt
Priority: 100
// F.W. Jones Fred Jones
// Andrew "Flip" Filipowski
// Andrew J. "Flip" Filipowski
//({Token.category == DT})?
(
  ((FIRSTNAME | FIRSTNAMEAMBIG))+
  (INITIALS)?
  ((FIRSTNAME | FIRSTNAMEAMBIG))*
  (PREFIX)*
  ((UPPER)):surname
  (PERSONENDING)?
):person
-->


Coming soon….. JAPE4

What JAPE4 IS:
● a new version of the internal language in GATE release 4
● a language based on the original JAPE
● incorporates best practices from JAPE, Jape+ and Japec
● 3-5 times faster than JAPE

What JAPE4 IS NOT:
● an improved version of original Jape, Jape+ or Japec, but rather a new language
● a language backward compatible with JAPE

In most cases it seems to be possible to easily modify original Jape, Jape+ or Japec grammars to be compatible with JAPE4 specification.