GATE: Bridging the Gap between Terminology and Linguistics

University of Sheffield, NLP

Diana Maynard

University of Sheffield, UK


Why do terminologists need GATE?

• Terminologists face the problem of lack of suitable tools to process their data.

• Lots of in-house tools for doing individual things

• Lack of common tools that can be used collaboratively and across different systems and domains.

• Tools must be flexible, robust and able to adapt to different processing tasks and languages

• GATE and its components are a key tool in today's world of information and data overload

• Enable users to perform tasks such as document management, business intelligence, information retrieval, question answering, and knowledge indexing, modelling and conceptualisation.


GATE can help terminologists:

• Save time and money on management of text and data from multiple sources

• Find hidden links scattered across huge volumes of diverse information

• Integrate structured data from a variety of sources

• Interlink text and data

• Collect information and extract new facts


A vision for text mining

• It is difficult to access unstructured information efficiently

• IE automates extraction of facts from text at reasonable accuracy and cost, increasing the value and utility of unstructured content

• Interlinking of text and data enables more efficient search, navigation and querying

• Text analysis is a matter of engineering: GATE offers practical solutions able to match specific requirements


Threat tracking application


Text mining and semantic annotation

• Extract structured data from text by

– Linking references to entities

– Linking entities to their semantic descriptions

• Automatic semantic annotation based on IE technology

• Attaches metadata to documents, which can be used for searching and hyperlinking

• Adds value to content of libraries, enabling user interaction with content

• Enhanced capability for cross-referencing and dynamic document classification


Semantic Annotation


Semantic Annotation of Entities

• Recognition of the type of the entities in the text from a rich taxonomy of classes

• Reference to their semantic description.

• Traditional NE recognition approach results in: <Person>Lama Ole Nydahl</Person>

• Semantic Annotation of NEs results in: <ReligiousPerson ID="http://..kim/Person111111">Lama Ole Nydahl</ReligiousPerson>
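The difference between the two annotation styles can be sketched in a few lines of Python. This is only an illustration, not GATE or KIM code: the `Annotation` class, the `to_inline` helper and the `example.org` URI are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Annotation:
    start: int
    end: int
    type: str                           # class from the taxonomy, e.g. "Person"
    instance_uri: Optional[str] = None  # link to a semantic description, if any


def to_inline(text: str, ann: Annotation) -> str:
    """Render one annotation as inline markup, in the style shown on the slide."""
    attr = f' ID="{ann.instance_uri}"' if ann.instance_uri else ""
    return f"<{ann.type}{attr}>{text[ann.start:ann.end]}</{ann.type}>"


text = "Lama Ole Nydahl"
# Traditional NE recognition: a coarse type only.
plain = Annotation(0, len(text), "Person")
# Semantic annotation: a finer class plus a reference to an instance (hypothetical URI).
semantic = Annotation(0, len(text), "ReligiousPerson", "http://example.org/kb/Person1")

print(to_inline(text, plain))
print(to_inline(text, semantic))
```

The only structural change is the extra reference: the annotation still covers the same span, but it now points at a description that other tools can query.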


GATE: the Swiss Army Knife of NLP

• Has an attachment for almost every eventuality

• Some are hard to prise open

• Some are useful, but you might have to put up with a bit of clunkiness in practice

• Some will only be useful once in a lifetime, but you're glad to have them just in case.

• There are many imitations, but nothing like the real thing.


History of GATE

• early 1990s: you want me to write that all over again?

• 1995-7: first GATE (and "large-scale IE") project

• 1996: GATE 1: Tcl/Tk, Perl, C++, ...

• 2002: release of completely rewritten version 2, 100% Java

• 2009: mature ecosystem with established community

– Tens of thousands of research users

– 25,000 downloads per year

– commercial users getting serious


GATE is very eco-friendly!


GATE commercial users

Typical commercial uses:

• dynamic search and indexing of repositories

• finding relations between elements in distributed repositories

• aggregating information from different text sources

• populating repositories

• fact finding from distributed knowledge sources

Typical users:

• Pharmaceutics, news, intelligence (business, competitor, government, etc.), manufacturing, telecommunications


So what exactly is GATE?

An architecture: A macro-level organisational picture for HLT software systems.

A framework: For programmers, GATE is an object-oriented class library that implements the architecture.

A development environment: For language engineers, computational linguists et al, a graphical development environment.

A community of users and contributors


Architectural principles

Non-prescriptive, theory neutral (strength and weakness)

Re-use, interoperation, not reimplementation (e.g. diverse XML support, integration of Protégé, Jena, Yale...)

(Almost) everything is a component, and component sets are user-extendable

(Almost) all operations are available both from API and GUI


In short…

GATE includes:

• components for language processing, e.g. parsers, machine learning tools, stemmers, IR tools, IE components for various languages...

• tools for visualising and manipulating text, annotations, ontologies, parse trees, etc.

• various information extraction tools

• evaluation and benchmarking tools


Algorithms + Data + GUI = Applications

GATE components are one of three types:

• Language Resources (LRs), e.g. lexicons, corpora, ontologies

• Processing Resources (PRs), e.g. parsers, generators, taggers

• Visual Resources (VRs), i.e. visualisation and editing components

Algorithms are separated from the data, which means:

– the two can be developed independently by users with different expertise.

– alternative resources of one type can be used without affecting the other, e.g. a different visual resource can be used with the same language resource
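The separation of algorithms from data can be illustrated with a minimal Python sketch. It is not the GATE class library: `Document`, `whitespace_tokeniser`, `capitalised_tagger` and `run_pipeline` are hypothetical names, and the two processing steps are deliberately trivial.

```python
from typing import Callable, Dict, List


class Document:
    """Language resource: text plus annotations, with no processing logic."""

    def __init__(self, text: str):
        self.text = text
        self.annotations: List[Dict] = []


# A processing resource is any callable that adds annotations to a document.
ProcessingResource = Callable[[Document], None]


def whitespace_tokeniser(doc: Document) -> None:
    pos = 0
    for word in doc.text.split():
        start = doc.text.index(word, pos)
        pos = start + len(word)
        doc.annotations.append({"type": "Token", "start": start, "end": pos})


def capitalised_tagger(doc: Document) -> None:
    # Builds on Token annotations produced by an earlier resource.
    for ann in list(doc.annotations):
        if ann["type"] == "Token" and doc.text[ann["start"]].isupper():
            doc.annotations.append(
                {"type": "Capitalised", "start": ann["start"], "end": ann["end"]}
            )


def run_pipeline(doc: Document, pipeline: List[ProcessingResource]) -> Document:
    for resource in pipeline:
        resource(doc)
    return doc


doc = run_pipeline(Document("GATE runs in Sheffield"),
                   [whitespace_tokeniser, capitalised_tagger])
print([a for a in doc.annotations if a["type"] == "Capitalised"])
```

Because the document knows nothing about the processing steps, either tokeniser or tagger could be swapped for another implementation without touching the data class.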


But isn’t GATE just about IE?

• Many people think of GATE as an IE tool

• IE is its primary function, but it also does a lot more

• Pretty much any kind of linguistic processing can be done in GATE

• The only field we really don't cover is Machine Translation, but you could easily add components for that if you wanted

• More about the other functionality later, but now back to IE...


Two Approaches to IE

Knowledge Engineering

• rule based

• developed by experienced language engineers

• make use of human intuition

• obtain marginally better performance

• development could be very time consuming

• some changes may be hard to accommodate

Learning Systems

• use statistics or other machine learning

• developers do not need LE expertise

• requires large amounts of annotated training data

• some changes may require re-annotation of the entire training corpus


Named Entity Recognition

• Named Entity recognition is the cornerstone of IE

• Identification of proper names in texts, and classification into a set of predefined categories of interest.

• Three universally accepted categories: person, location and organisation

• Other common tasks: recognition of date/time expressions, measures (percent, money, weight etc), email addresses etc.

• Other domain-specific entities: names of drugs, medical conditions, names of ships, bibliographic references etc.


ANNIE

• ANNIE is GATE's rule-based IE system

• It uses the language engineering approach (though we also have tools in GATE for ML)

• Distributed as part of GATE

• Uses a finite-state pattern-action rule language, JAPE

• More on JAPE later.....

• ANNIE contains a reusable and easily extendable set of components:

– generic preprocessing components for tokenisation, sentence splitting etc

– components for performing NE on general open domain text


ANNIE Modules


Unicode Tokeniser

• Bases tokenisation on Unicode character classes

• Language-independent tokenisation

• Declarative token specification language, e.g.:

"UPPERCASE_LETTER" "LOWERCASE_LETTER"* > Token; orthography=upperInitial; kind=word

• Identifies words, numbers, spaces, different classes of punctuation, orthography

• Recognition deliberately basic so that

– more powerful tools (JAPE) can be used for finer distinctions

– there are greater reuse possibilities
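Tokenising on Unicode character classes can be sketched with Python's standard `unicodedata` module. This is a rough illustration of the idea and not the GATE tokeniser: the function names, the four token kinds and the orthography labels other than `upperInitial` are assumptions made for the example.

```python
import unicodedata
from itertools import groupby


def char_kind(ch: str) -> str:
    """Map a character to a coarse token kind via its Unicode category."""
    if ch.isspace():
        return "space"
    category = unicodedata.category(ch)  # e.g. "Lu", "Ll", "Nd", "Po"
    if category.startswith("L"):
        return "word"
    if category.startswith("N"):
        return "number"
    return "punctuation"


def orthography(word: str) -> str:
    if word.isupper():
        return "allCaps"
    if word[0].isupper() and word[1:].islower():
        return "upperInitial"
    if word.islower():
        return "lowercase"
    return "mixedCaps"


def tokenise(text: str):
    tokens, pos = [], 0
    for kind, group in groupby(text, key=char_kind):
        run = "".join(group)
        # Each punctuation mark is its own token; other runs stay together.
        pieces = list(run) if kind == "punctuation" else [run]
        for piece in pieces:
            token = {"string": piece, "kind": kind,
                     "start": pos, "end": pos + len(piece)}
            if kind == "word":
                token["orthography"] = orthography(piece)
            tokens.append(token)
            pos += len(piece)
    return tokens


for token in tokenise("\u00c9mile paid 20p."):
    print(token)
```

Nothing here is specific to English, which is the point of basing the rules on character classes rather than on a fixed alphabet.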


Gazetteer

• Set of lists compiled into Finite State Machines

• 60k entries in 80 types

• List entries are matched in the text as Lookup annotations

• Each list has some pre-defined features, which enable different kinds of matches to be identified

• Additional arbitrary features and values can be added to individual list entries

• Entries can be matched according to root forms, or more flexibly based on e.g. edit distance
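A gazetteer lookup can be approximated as follows. The sketch uses a plain Python dictionary with longest-match search instead of compiled finite state machines, and it matches case-insensitively; both are simplifications chosen here. The list contents and the function names `build_gazetteer` and `lookup` are hypothetical; the `majorType` and `minorType` feature names follow the JAPE examples in these slides.

```python
def build_gazetteer(lists):
    """lists maps (majorType, minorType) to a list of entry strings."""
    index = {}
    for (major, minor), entries in lists.items():
        for entry in entries:
            key = tuple(entry.lower().split())
            index[key] = {"majorType": major, "minorType": minor}
    return index


def lookup(tokens, index):
    """Return (start, end, features) for the longest entry found at each position."""
    max_len = max((len(key) for key in index), default=0)
    found, i = [], 0
    while i < len(tokens):
        match = None
        # Try the longest candidate first so "New York" beats "New".
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in index:
                match = (i, i + n, index[key])
                break
        if match:
            found.append(match)
            i = match[1]
        else:
            i += 1
    return found


index = build_gazetteer({("location", "city"): ["Sheffield", "New York"]})
tokens = "She moved from New York to Sheffield".split()
print(lookup(tokens, index))
```

Each result plays the role of a Lookup annotation: a span plus the features inherited from the list it came from.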


Limitations of gazetteers

• Gazetteer lists are designed for annotating simple, regular features

• Some flexibility is provided, but this is not enough for most tasks

• Recognising e-mail addresses using just a gazetteer would be impossible

• But combined with other linguistic pre-processing results, we have a whole lot of annotations and features

• POS tags, capitalisation, punctuation, lookup features, etc can all be combined to form patterns suggesting more complex information

• Luckily, we have JAPE to take care of this.


What is JAPE?

• a Jolly and Pleasant Experience

• Specially developed pattern matching language for GATE

• Each JAPE rule consists of

– LHS which contains patterns to match

– RHS which details the annotations (and optionally features) to be created

• JAPE rules combine to create a phase

• Rule priority based on pattern length, rule status and rule ordering

• Phases combine to create a grammar
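The way competing rules are ranked can be sketched as a three-part sort key. This assumes an appelt-style control (the style named in the share price example later in these slides), and it reads "rule status" as an explicit numeric priority; that reading, the `Candidate` class and the `choose` function are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    rule: str
    length: int    # how much of the input the rule's pattern matched
    priority: int  # explicit priority of the rule (higher wins)
    order: int     # position of the rule in the grammar (lower = earlier)


def choose(candidates):
    """Longest match first, then highest priority, then earliest rule."""
    return max(candidates, key=lambda c: (c.length, c.priority, -c.order))


candidates = [
    Candidate("A", length=2, priority=0, order=0),
    Candidate("B", length=3, priority=0, order=1),
    Candidate("C", length=3, priority=10, order=2),
    Candidate("D", length=3, priority=10, order=3),
]
print(choose(candidates).rule)
```

Rule A loses on length, B loses on priority, and D loses to C only because C is defined earlier.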


Named Entity Grammars

• Hand-coded rules written in JAPE applied to annotations to identify NEs

• Phases run sequentially and constitute a cascade of FSTs over annotations

• Annotations from format analysis, tokeniser, splitter, POS tagger, morphological analysis, gazetteer etc.

• Because phases are sequential, annotations can be built up over a period of phases, as new information is gleaned

• Standard named entities: persons, locations, organisations, dates, addresses, money

• Basic NE grammars can be adapted for new applications, domains and languages


JAPE example

University of Sheffield

Rule: namedUniversity
(
  {Token.string == "University"}
  {Token.string == "of"}
  ({Lookup.minorType == city} | ({Token.category == NNP})+)
):orgName
-->
:orgName.Organisation = {kind = "university", rule = "namedUniversity"}

• Looks for specific words “University of” followed by:

– city name from gazetteer, or

– one or more proper nouns
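For readers unfamiliar with JAPE, the logic of the namedUniversity rule can be paraphrased in Python. This is a hand-written approximation, not how JAPE executes: tokens are plain dictionaries, a gazetteer hit is modelled as a `minorType` key on a single token, and `match_named_university` is a hypothetical helper.

```python
def match_named_university(tokens):
    """Find 'University of' followed by a city Lookup or one or more NNP tokens."""
    results, i = [], 0
    while i + 2 < len(tokens):
        if tokens[i]["string"] == "University" and tokens[i + 1]["string"] == "of":
            j = i + 2
            # Alternative 1: a single token marked as a city by the gazetteer.
            city_end = j + 1 if tokens[j].get("minorType") == "city" else j
            # Alternative 2: a run of one or more proper nouns.
            nnp_end = j
            while nnp_end < len(tokens) and tokens[nnp_end]["category"] == "NNP":
                nnp_end += 1
            end = max(city_end, nnp_end)  # prefer the longer alternative
            if end > j:
                results.append({"type": "Organisation", "start": i, "end": end,
                                "kind": "university", "rule": "namedUniversity"})
                i = end
                continue
        i += 1
    return results


tokens = [
    {"string": "The", "category": "DT"},
    {"string": "University", "category": "NNP"},
    {"string": "of", "category": "IN"},
    {"string": "Sheffield", "category": "NNP", "minorType": "city"},
    {"string": "is", "category": "VBZ"},
    {"string": "large", "category": "JJ"},
]
print(match_named_university(tokens))
```

The start and end values are token indices; the JAPE rule expresses the same pattern declaratively and leaves the matching to the finite-state engine.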


Combining existing annotations

Associate a company with a share price

e.g. Whitbread shares closed up 2p at 645p.

Phase: Shares
Input: Token Organization Lookup Money Percent
Options: control = appelt

Rule: ShareChange
(
  {Organization}
  ({Token})[0,3]
  {Lookup.majorType == "change"}
  ({Token})[0,3]
  ({Money} | {Percent})
):change
-->
:change.ShareChange = {rule = "ShareChange"}


Orthomatcher

• Orthographic coreference between annotations in the same document, e.g. Mr Brown, James Brown

• Matching rules are invoked between annotations of the same type, or between an existing annotation and an “Unknown” annotation

• The latter is the only case where an annotation type can be changed

• Lookup tables of aliases and exceptions (i.e. overriding of matching rules)

• Also PRs for pronominal and nominal coreference
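A very reduced sketch of orthographic coreference is given below. It implements a single matching rule (one name is the tail of the other once titles are dropped) plus an alias table, whereas the real component has many more rules. The title list, the alias entry and the names `normalise`, `orth_match` and `resolve` are all hypothetical.

```python
TITLES = {"mr", "mrs", "ms", "dr", "prof"}
ALIASES = {"ibm": "international business machines"}  # hypothetical lookup table


def normalise(name):
    words = [w.strip(".").lower() for w in name.split()]
    words = [w for w in words if w not in TITLES]
    joined = " ".join(words)
    return ALIASES.get(joined, joined).split()


def orth_match(a, b):
    """True if the names are equal, or the shorter is the tail of the longer."""
    wa, wb = normalise(a), normalise(b)
    if not wa or not wb:
        return False
    shorter, longer = sorted((wa, wb), key=len)
    return longer[-len(shorter):] == shorter


def resolve(mentions):
    """Group mentions into chains; an Unknown mention adopts the matched type."""
    chains = []
    for i, mention in enumerate(mentions):
        for chain in chains:
            head = mentions[chain[0]]
            compatible = (mention["type"] == head["type"]
                          or "Unknown" in (mention["type"], head["type"]))
            if compatible and orth_match(mention["text"], head["text"]):
                if mention["type"] == "Unknown":
                    mention["type"] = head["type"]
                elif head["type"] == "Unknown":
                    for k in chain:
                        mentions[k]["type"] = mention["type"]
                chain.append(i)
                break
        else:
            chains.append([i])
    return chains


mentions = [
    {"text": "James Brown", "type": "Person"},
    {"text": "Mr Brown", "type": "Unknown"},
    {"text": "Sheffield", "type": "Location"},
]
chains = resolve(mentions)
print(chains, mentions[1]["type"])
```

As in the slide, the type of a mention changes only when an "Unknown" annotation is matched against a typed one.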


What about other languages?

• Since we're based in Sheffield, you can't blame us for developing GATE primarily for English

• But contrary to popular belief about the British, we don't hate all foreigners!

• And we have lots of capabilities for processing in other languages

• Currently systems for English, French, German, Romanian, Bulgarian, Russian, Cebuano, Hindi, Chinese, Arabic

• You have a POS tagger for Swahili? Just add it as a plugin and combine it with existing tokeniser etc.


It's all Chinese to me....


Processing multiple languages

• If you have a language identifier PR, you can combine processing of texts in different languages in a single application

• The system will choose the right PRs for each document or document section

• Conditional application fires a PR if some condition is met
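The conditional application idea can be sketched as a pipeline in which each step carries an optional condition on a document feature. Everything here is a toy: the language identifier is a tiny word-list check, the taggers only record that they ran, and the names `Doc`, `PIPELINE` and `run_conditional` are hypothetical.

```python
class Doc:
    def __init__(self, text):
        self.text = text
        self.features = {}
        self.log = []  # records which resources actually ran


def language_identifier(doc):
    # Toy stand-in for a real language identifier.
    words = set(doc.text.lower().split())
    doc.features["lang"] = "de" if words & {"der", "die", "das", "und"} else "en"


def english_tagger(doc):
    doc.log.append("english_tagger")


def german_tagger(doc):
    doc.log.append("german_tagger")


# Each entry: (processing resource, condition). None means "always run";
# otherwise the resource fires only if the feature has the given value.
PIPELINE = [
    (language_identifier, None),
    (english_tagger, ("lang", "en")),
    (german_tagger, ("lang", "de")),
]


def run_conditional(doc, pipeline):
    for resource, condition in pipeline:
        if condition is None or doc.features.get(condition[0]) == condition[1]:
            resource(doc)
    return doc


print(run_conditional(Doc("Der Hund und die Katze"), PIPELINE).log)
print(run_conditional(Doc("The cat sat on the mat"), PIPELINE).log)
```

A single application definition thus serves both languages, with the identifier's output deciding which branch is taken for each document.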


Other plugins

• Parsers (Stanford, MiniPar, RASP, SUPPLE)

• More flexible gazetteers

• Specialised NE (Chemistry, Biomedicine, etc)

• PRs for other languages, Alignment

• Lemmatisers, morphological analyser, NP and VP chunkers

• Machine Learning

• Evaluation toolkit including IAA

• IR, Google and Yahoo search engines, web crawlers

• WordNet

• Whole host of ontology-based tools


Alignment plugin


GATE in use

• We have dozens of applications, not all just research projects!

• A few examples.....


Semantic Annotation

• Adding information to documents that is usable by machines to enable better presentation, navigation or searching, e.g. Perseus:


Indexing news at the BBC

BBC Archives: 'Newsnight' archiving time is 8 hours per hour

Automatic transcription to extract some potential indexing terms

• Result: temporally precise, but very noisy data

Partial solution: search the web, intranet, digital library for related pages, and process with IE/SA

• Result: less noisy but temporally imprecise

So we merge this information with the speech signal data

• Result: works well for easy stuff (high precision, low recall)


Ontology linking at FAO

• FAO have sets of fisheries-related ontologies, e.g. gear, species, fishing areas

• No way to link between them using ontology alignment techniques, because we require information external to the ontology (e.g. that a fish lives in a particular area)

• NLP techniques make use of information from documents which provide this missing link

• Not always an exact match between text and the ontology elements, e.g. Mummichogs vs. fundulus heteroclitus

• Use techniques such as headword matching, noun phrase chunking, synonym and acronym finding, etc

• Find relations in the text to link the entities together


Ontology linking at FAO

[Diagram: the ontologies Fishing Gear, Fishing Area, Species and Commodities, linked by the relations caught_by, found_in and basis_of]


Matching text descriptions

• Find NPs and terms; use OntoRootGazetteer to find morphological variants of ontology elements, perform headword and synonym matching etc.

• “Pelagic species, mainly fish and cephalopods, northern shrimp (also small crustaceans, krill”

• Match text span to ontology instance, retaining URIs

• Create annotations and features, e.g. caught_by = {gear_type = midwater otter trawls, target_species = cephalopods}

• Convert to RDF triples
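The last step can be sketched as follows, using the caught_by example above and emitting one N-Triples line. In the real application the URIs retained from the ontology would be used; here they are minted from a hypothetical `example.org` namespace, and the helper names are illustrative.

```python
BASE = "http://example.org/fisheries#"  # hypothetical namespace


def uri(label):
    """Mint a URI reference from a label (stand-in for a retained ontology URI)."""
    return "<" + BASE + label.strip().replace(" ", "_") + ">"


def relation_to_ntriples(relation, subject_label, object_label):
    return f"{uri(subject_label)} {uri(relation)} {uri(object_label)} ."


annotation = {
    "relation": "caught_by",
    "gear_type": "midwater otter trawls",
    "target_species": "cephalopods",
}
triple = relation_to_ntriples(annotation["relation"],
                              annotation["target_species"],
                              annotation["gear_type"])
print(triple)
```

The species is taken as the subject and the gear as the object, reading the relation as "species caught_by gear"; that direction is an interpretation of the slide rather than something it states.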


Using ANNIC to view results


Outsmarting our competitors


If you can't beat 'em, join 'em

• UIMA

• OpenCalais

• Lingpipe

All integrated into GATE as plugins


UIMA

• UIMA is an NL engineering platform developed by IBM

• Shares some functionality with GATE, but is complementary in most respects.

• Interoperability layer has been developed to allow UIMA applications to be run within GATE, and vice versa, in order to combine elements of both.

• Emphasis is on architectural support, including asynchronous scaleout (deploying many copies of an application in parallel)

• Much narrower range of resources provided than GATE

http://incubator.apache.org/uima/


OpenCalais

• Web service for semantic annotation of text.

• The user submits a document to the web service, which returns entity and relations annotations in RDF, JSON or some other format.

• Typically, users integrate OpenCalais annotation of their web pages to provide additional links and ‘semantic functionality’.

• OpenCalais annotates both relations and entities, although the GATE plugin only supports entities.

http://www.opencalais.com


LingPipe

• Provides a set of IE and data mining tools, largely ML-based. Has a set of models trained for particular tasks/corpora.

• Limited ontology support: can connect entities found to databases and ontologies

• Advantage: ML models can suggest more than one output, ranked by confidence. The user can choose the number of suggestions generated.

• Disadvantage: ML models only apply to specific tasks and domains.

http://alias-i.com/lingpipe/index.html


In summary...

• We like to think GATE is the best thing since sliced bread for most NLP and terminology tasks

• You can use it for plenty of other things too; don't let us stop you being creative!

• Incorporates a huge number of plugins, is easily extendable and highly customisable

• The only limit is your imagination...

• So if you're now convinced you can't live without GATE, there are two possibilities:

– ask us to get involved with a project

– try GATE yourself


Get your own hands dirty

• We run training courses three times a year in Sheffield and other selected locations

• Different tracks available

• GATE certification available


More info, contact details, demos, publications: http://gate.ac.uk

Now it's time to nudge your neighbour if they are asleep....

Or ask that burning question about GATE.
