University of Sheffield, NLP
GATE: Bridging the Gap between Terminology and Linguistics
Diana Maynard
University of Sheffield, UK
Why do terminologists need GATE?
• Terminologists face the problem of lack of suitable tools to process their data.
• Lots of in-house tools for doing individual things
• Lack of common tools that can be used collaboratively and across different systems and domains.
• Tools must be flexible, robust and able to adapt to different processing tasks and languages
• GATE and its components are a key tool in today's world of information and data overload
• Enable users to perform tasks such as document management, business intelligence, information retrieval, question answering, and knowledge indexing, modelling and conceptualisation.
GATE can help terminologists:
• Save time and money on management of text and data from multiple sources
• Find hidden links scattered across huge volumes of diverse information
• Integrate structured data from a variety of sources
• Interlink text and data
• Collect information and extract new facts
A vision for text mining
• It is difficult to access unstructured information efficiently
• IE automates extraction of facts from text at reasonable accuracy and cost, increasing the value and utility of unstructured content
• Interlinking of text and data enables more efficient search, navigation and querying
• Text analysis is a matter of engineering: GATE offers practical solutions able to match specific requirements
Threat tracking application
Text mining and semantic annotation
• Extract structured data from text by
– Linking references to entities
– Linking entities to their semantic descriptions
• Automatic semantic annotation based on IE technology
• Attaches metadata to documents, which can be used for searching and hyperlinking
• Adds value to content of libraries, enabling user interaction with content
• Enhanced capability for cross-referencing and dynamic document classification
Semantic Annotation
Semantic Annotation of Entities
• Recognition of the type of the entities in the text from a rich taxonomy of classes
• Reference to their semantic description.
• Traditional NE recognition approach results in:
<Person>Lama Ole Nydahl</Person>
• Semantic Annotation of NEs results in:
<ReligiousPerson ID="http://..kim/Person111111">Lama Ole Nydahl</ReligiousPerson>
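The difference between the two outputs can be sketched in a few lines of Python. This is an illustration only, not GATE or KIM code: the `Annotation` class, the tiny `TAXONOMY` table and the helper names are assumptions made for the example, and the instance identifier is copied in its elided form from the slide.

```python
from dataclasses import dataclass, field

# Hypothetical mini-taxonomy: each class maps to its parent class.
TAXONOMY = {"ReligiousPerson": "Person", "Person": "Entity"}


@dataclass
class Annotation:
    text: str
    type: str
    features: dict = field(default_factory=dict)


def is_a(cls, ancestor):
    """Return True if cls is ancestor or one of its descendants in TAXONOMY."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = TAXONOMY.get(cls)
    return False


def to_markup(ann):
    """Render an annotation as inline markup, with features as attributes."""
    attrs = "".join(f' {key}="{value}"' for key, value in ann.features.items())
    return f"<{ann.type}{attrs}>{ann.text}</{ann.type}>"


traditional = Annotation("Lama Ole Nydahl", "Person")
semantic = Annotation(
    "Lama Ole Nydahl", "ReligiousPerson", {"ID": "http://..kim/Person111111"}
)
print(to_markup(traditional))
print(to_markup(semantic))
print(is_a(semantic.type, "Person"))
```

Because the semantic annotation carries both a class from the taxonomy and an instance identifier, a query for any `Person` can still retrieve it, while the identifier links the mention to its semantic description.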
GATE: the Swiss Army Knife of NLP
• Has an attachment for almost every eventuality
• Some are hard to prise open
• Some are useful, but you might have to put up with a bit of clunkiness in practice
• Some will only be useful once in a lifetime, but you're glad to have them just in case.
• There are many imitations, but nothing like the real thing.
History of GATE
• early 1990s: you want me to write that all over again?
• 1995-7: first GATE (and "large-scale IE") project
• 1996: GATE 1: Tcl/Tk, Perl, C++, ...
• 2002: release of completely rewritten version 2, 100% Java
• 2009: mature ecosystem with established community
– Tens of thousands of research users
– 25,000 downloads per year
– commercial users getting serious
GATE is very eco-friendly!
GATE commercial users
Typical commercial uses:
• dynamic search and indexing of repositories
• finding relations between elements in distributed repositories
• aggregating information from different text sources
• populating repositories
• fact finding from distributed knowledge sources
Typical users:
• Pharmaceutics, news, intelligence (business, competitor, government, etc.), manufacturing, telecommunications
So what exactly is GATE?
An architecture: A macro-level organisational picture for HLT software systems.
A framework: For programmers, GATE is an object-oriented class library that implements the architecture.
A development environment: For language engineers, computational linguists et al, a graphical development environment.
A community of users and contributors
Architectural principles
Non-prescriptive, theory neutral (strength and weakness)
Re-use, interoperation, not reimplementation (e.g. diverse XML support, integration of Protégé, Jena, Yale...)
(Almost) everything is a component, and component sets are user-extendable
(Almost) all operations are available both from API and GUI
In short…
GATE includes:
• components for language processing, e.g. parsers, machine learning tools, stemmers, IR tools, IE components for various languages...
• tools for visualising and manipulating text, annotations, ontologies, parse trees, etc.
• various information extraction tools
• evaluation and benchmarking tools
Algorithms + Data + GUI = Applications
GATE components are one of three types:
• Language Resources (LRs), e.g. lexicons, corpora, ontologies
• Processing Resources (PRs), e.g. parsers, generators, taggers
• Visual Resources (VRs), i.e. visualisation and editing components
Algorithms are separated from the data, which means:
– the two can be developed independently by users with different expertise.
– alternative resources of one type can be used without affecting the other, e.g. a different visual resource can be used with the same language resource
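The separation of data, algorithms and views can be illustrated with a minimal Python sketch. It does not use the GATE API (which is a Java class library); `Document`, `WhitespaceTokeniser`, `inline_view` and `count_view` are hypothetical stand-ins for a Language Resource, a Processing Resource and two Visual Resources.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Stand-in for a Language Resource: holds data only."""
    text: str
    annotations: list = field(default_factory=list)


class WhitespaceTokeniser:
    """Stand-in for a Processing Resource: an algorithm that adds annotations."""

    def execute(self, doc):
        position = 0
        for word in doc.text.split():
            start = doc.text.index(word, position)
            end = start + len(word)
            doc.annotations.append(("Token", start, end))
            position = end


def inline_view(doc):
    """Stand-in for a Visual Resource: shows each annotated span in brackets."""
    return " ".join(f"[{doc.text[s:e]}]" for _, s, e in doc.annotations)


def count_view(doc):
    """An alternative view over the same, unchanged document."""
    return f"{len(doc.annotations)} annotations"


doc = Document("GATE separates algorithms from data")
WhitespaceTokeniser().execute(doc)
print(inline_view(doc))
print(count_view(doc))
```

Either view can be swapped in without touching the document or the tokeniser, which is the point made in the last bullet above.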
But isn’t GATE just about IE?
• Many people think of GATE as an IE tool
• IE is its primary function, but it also does a lot more
• Pretty much any kind of linguistic processing can be done in GATE
• The only field we really don't cover is Machine Translation, but you could easily add components for that if you wanted
• More about the other functionality later, but now back to IE...
Two Approaches to IE
Knowledge Engineering
• rule based
• developed by experienced language engineers
• make use of human intuition
• obtain marginally better performance
• development could be very time consuming
• some changes may be hard to accommodate
Learning Systems
• use statistics or other machine learning
• developers do not need LE expertise
• requires large amounts of annotated training data
• some changes may require re-annotation of the entire training corpus
Named Entity Recognition
• Named Entity recognition is the cornerstone of IE
• Identification of proper names in texts, and classification into a set of predefined categories of interest.
• Three universally accepted categories: person, location and organisation
• Other common tasks: recognition of date/time expressions, measures (percent, money, weight etc), email addresses etc.
• Other domain-specific entities: names of drugs, medical conditions, names of ships, bibliographic references etc.
ANNIE
• ANNIE is GATE's rule-based IE system
• It uses the language engineering approach (though we also have tools in GATE for ML)
• Distributed as part of GATE
• Uses a finite-state pattern-action rule language, JAPE
• More on JAPE later.....
• ANNIE contains a reusable and easily extendable set of components:
– generic preprocessing components for tokenisation, sentence splitting etc
– components for performing NE on general open domain text
ANNIE Modules
Unicode Tokeniser
• Bases tokenisation on Unicode character classes
• Language-independent tokenisation
• Declarative token specification language, e.g.:
"UPPERCASE_LETTER" "LOWERCASE_LETTER"* > Token; orthography=upperInitial; kind=word
• Identifies words, numbers, spaces, different classes of punctuation, orthography
• Recognition deliberately basic so that
– more powerful tools (JAPE) can be used for finer distinctions
– greater reuse possibilities
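The idea of tokenising on Unicode character classes can be sketched in Python with the standard `unicodedata` module. This is a rough illustration rather than the tokeniser's real rule set: the four token kinds and the orthography labels other than `upperInitial` are assumptions chosen for the example.

```python
import unicodedata


def kind_of(ch):
    """Map a character to a coarse token kind via its Unicode category."""
    category = unicodedata.category(ch)
    if category.startswith("L"):
        return "word"
    if category.startswith("N"):
        return "number"
    if category.startswith("Z") or ch.isspace():
        return "space"
    return "punctuation"


def orthography(word):
    """Label the capitalisation pattern of a word token."""
    if word.isupper():
        return "allCaps"
    if word[0].isupper() and word[1:].islower():
        return "upperInitial"
    if word.islower():
        return "lowercase"
    return "mixedCaps"


def tokenise(text):
    """Split text into tokens; letters and digits group into runs."""
    tokens = []
    i = 0
    while i < len(text):
        kind = kind_of(text[i])
        j = i + 1
        # Only words and numbers extend over several characters.
        while j < len(text) and kind in ("word", "number") and kind_of(text[j]) == kind:
            j += 1
        token = {"string": text[i:j], "kind": kind, "start": i, "end": j}
        if kind == "word":
            token["orthography"] = orthography(text[i:j])
        tokens.append(token)
        i = j
    return tokens


for token in tokenise("Sheffield, UK 2009"):
    print(token)
```

Nothing in the sketch refers to a particular alphabet, so the same function handles, for instance, Cyrillic text without modification.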
Gazetteer
• Set of lists compiled into Finite State Machines
• 60k entries in 80 types
• List entries are matched in the text as Lookup annotations
• Each list has some pre-defined features, which enable different kinds of matches to be identified
• Additional arbitrary features and values can be added to individual list entries
• Entries can be matched according to root forms, or more flexibly based on e.g. edit distance
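A gazetteer lookup of this kind can be approximated in Python by compiling the lists into a word-level trie, a simple finite-state structure. The lists, the `majorType`/`minorType` values and the function names below are made up for the example, and keeping only the longest match is a choice of this sketch, not a statement about GATE's behaviour.

```python
import re

# Hypothetical gazetteer lists, keyed by (majorType, minorType).
LISTS = {
    ("location", "city"): ["Sheffield", "New York", "York"],
    ("organization", "company"): ["Whitbread"],
}
END = "$"  # marks a node where a complete list entry ends


def build_trie(lists):
    """Compile all list entries into one nested-dict trie over words."""
    trie = {}
    for (major, minor), entries in lists.items():
        for entry in entries:
            node = trie
            for word in entry.split():
                node = node.setdefault(word, {})
            node[END] = {"majorType": major, "minorType": minor}
    return trie


def lookup(text, trie):
    """Return Lookup annotations for the longest list entry at each position."""
    words = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+", text)]
    results = []
    i = 0
    while i < len(words):
        node, best, j = trie, None, i
        while j < len(words) and words[j][0] in node:
            node = node[words[j][0]]
            j += 1
            if END in node:
                best = (j, node[END])
        if best is None:
            i += 1
            continue
        stop, features = best
        results.append(
            {"type": "Lookup", "start": words[i][1], "end": words[stop - 1][2], **features}
        )
        i = stop
    return results


text = "Flights from New York to Sheffield"
for annotation in lookup(text, build_trie(LISTS)):
    print(text[annotation["start"]:annotation["end"]], annotation)
```

Extra features for individual entries could be stored alongside `majorType` and `minorType` in the end-of-entry node.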
Limitations of gazetteers
• Gazetteer lists are designed for annotating simple, regular features
• Some flexibility is provided, but this is not enough for most tasks
• Recognising e-mail addresses using just a gazetteer would be impossible
• But combined with other linguistic pre-processing results, we have a whole lot of annotations and features
• POS tags, capitalisation, punctuation, lookup features, etc can all be combined to form patterns suggesting more complex information
• Luckily, we have JAPE to take care of this.
What is JAPE?
• a Jolly and Pleasant Experience
• Specially developed pattern matching language for GATE
• Each JAPE rule consists of
– LHS which contains patterns to match
– RHS which details the annotations (and optionally features) to be created
• JAPE rules combine to create a phase
• Rule priority based on pattern length, rule status and rule ordering
• Phases combine to create a grammar
Named Entity Grammars
• Hand-coded rules written in JAPE applied to annotations to identify NEs
• Phases run sequentially and constitute a cascade of FSTs over annotations
• Annotations from format analysis, tokeniser, splitter, POS tagger, morphological analysis, gazetteer etc.
• Because phases are sequential, annotations can be built up over a series of phases, as new information is gleaned
• Standard named entities: persons, locations, organisations, dates, addresses, money
• Basic NE grammars can be adapted for new applications, domains and languages
JAPE example
University of Sheffield
Rule: namedUniversity
(
  {Token.string == "University"}
  {Token.string == "of"}
  ({Lookup.minorType == city} | ({Token.category == NNP})+)
):orgName
-->
  :orgName.Organisation = {kind = "university", rule = "namedUniversity"}
• Looks for specific words “University of” followed by:
– city name from gazetteer, or
– one or more proper nouns
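To make the rule's logic concrete, here is a rough Python emulation of what it matches. It is not how JAPE is executed: the token dictionaries are hypothetical, and for simplicity the gazetteer `Lookup` is modelled as a `minorType` feature on a single token rather than as a separate annotation.

```python
def match_named_university(tokens):
    """Return Organisation annotations for 'University of' + city or proper nouns."""
    found = []
    i = 0
    while i + 2 < len(tokens):
        if tokens[i]["string"] == "University" and tokens[i + 1]["string"] == "of":
            j = i + 2
            if tokens[j].get("minorType") == "city":
                # First alternative: a city name found by the gazetteer.
                j += 1
            else:
                # Second alternative: one or more proper nouns (NNP).
                while j < len(tokens) and tokens[j].get("category") == "NNP":
                    j += 1
            if j > i + 2:
                found.append({
                    "type": "Organisation",
                    "text": " ".join(t["string"] for t in tokens[i:j]),
                    "features": {"kind": "university", "rule": "namedUniversity"},
                })
                i = j
                continue
        i += 1
    return found


sentence = [
    {"string": "the", "category": "DT"},
    {"string": "University", "category": "NNP"},
    {"string": "of", "category": "IN"},
    {"string": "Sheffield", "category": "NNP", "minorType": "city"},
    {"string": "is", "category": "VBZ"},
]
print(match_named_university(sentence))
```

The pattern on the left-hand side corresponds to the matching loop, and the right-hand side corresponds to the dictionary that is created for each match.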
Combining existing annotations
Associate a company with a share price
e.g. Whitbread shares closed up 2p at 645p.
Phase: Shares
Input: Token Organization Lookup Money Percent
Options: control = appelt
Rule: ShareChange
(
  {Organization}
  ({Token})[0,3]
  {Lookup.majorType == "change"}
  ({Token})[0,3]
  ({Money} | {Percent})
):change
-->
  :change.ShareChange = {rule = "ShareChange"}
Orthomatcher
• Orthographic coreference between annotations in the same document, e.g. Mr Brown, James Brown
• Matching rules are invoked between annotations of the same type, or between an existing annotation and an “Unknown” annotation
• The latter is the only case where an annotation type can be changed
• Lookup tables of aliases and exceptions (i.e. overriding of matching rules)
• Also PRs for pronominal and nominal coreference
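A heavily simplified Python sketch of orthographic matching is given below. The single matching rule (the shorter name must be the tail of the longer one), the title list and the alias table are assumptions for illustration; the real component uses a larger set of rules and exception tables.

```python
TITLES = {"mr", "mrs", "ms", "dr", "prof"}
# Hypothetical alias table: pairs of normalised names that always corefer.
ALIASES = {("ibm", "international business machines")}


def normalise(name):
    """Lower-case the name and drop titles and trailing full stops."""
    words = [w.rstrip(".").lower() for w in name.split()]
    return [w for w in words if w not in TITLES]


def orthomatch(a, b):
    """Decide whether two annotations plausibly name the same entity."""
    if a["type"] != b["type"] and "Unknown" not in (a["type"], b["type"]):
        return False
    wa, wb = normalise(a["text"]), normalise(b["text"])
    if not wa or not wb:
        return False
    pair = (" ".join(wa), " ".join(wb))
    if pair in ALIASES or pair[::-1] in ALIASES:
        return True
    shorter, longer = sorted((wa, wb), key=len)
    # Simplified rule: the shorter name must be the tail of the longer one.
    return longer[-len(shorter):] == shorter


def resolve_unknown(a, b):
    """If the pair matches and one side is Unknown, give it the other's type."""
    if orthomatch(a, b):
        if a["type"] == "Unknown":
            a["type"] = b["type"]
        elif b["type"] == "Unknown":
            b["type"] = a["type"]


mr_brown = {"text": "Mr Brown", "type": "Person"}
james_brown = {"text": "James Brown", "type": "Person"}
unknown = {"text": "Brown", "type": "Unknown"}
print(orthomatch(mr_brown, james_brown))
resolve_unknown(james_brown, unknown)
print(unknown)
```

As in the bullets above, only the "Unknown" annotation ever has its type changed; two annotations of different known types are never matched.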
What about other languages?
• Since we're based in Sheffield, you can't blame us for developing GATE primarily for English
• But contrary to popular belief about the British, we don't hate all foreigners!
• And we have lots of capabilities for processing in other languages
• Currently systems for English, French, German, Romanian, Bulgarian, Russian, Cebuano, Hindi, Chinese, Arabic
• You have a POS tagger for Swahili? Just add it as a plugin and combine it with existing tokeniser etc.
It's all Chinese to me....
Processing multiple languages
• If you have a language identifier PR, you can combine processing of texts in different languages in a single application
• The system will choose the right PRs for each document or document section
• Conditional application fires a PR if some condition is met
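The conditional mechanism can be pictured with a small Python sketch: each processing step is paired with a condition on the document's features and only fires when the condition holds. The stopword-based language identifier, the feature names `lang` and `processed_by`, and the tagger stand-ins are all hypothetical and far simpler than real components.

```python
# Tiny stopword lists for a toy language identifier (illustrative only).
STOPWORDS = {
    "en": {"the", "and", "of", "is"},
    "fr": {"le", "la", "et", "est"},
    "de": {"der", "die", "und", "ist"},
}


def language_identifier(doc):
    """Set the 'lang' feature to the language with most stopword hits."""
    words = doc["text"].lower().split()
    scores = {lang: sum(w in stop for w in words) for lang, stop in STOPWORDS.items()}
    doc["features"]["lang"] = max(scores, key=scores.get)


def make_tagger(lang):
    """Build a stand-in for a language-specific processing resource."""
    def tagger(doc):
        doc["features"]["processed_by"] = f"{lang}-tagger"
    return tagger


def lang_is(lang):
    """Condition: fire only if the document's 'lang' feature equals lang."""
    return lambda doc: doc["features"].get("lang") == lang


# Each entry pairs a step with the condition under which it should fire.
PIPELINE = [
    (language_identifier, lambda doc: True),
    (make_tagger("en"), lang_is("en")),
    (make_tagger("fr"), lang_is("fr")),
    (make_tagger("de"), lang_is("de")),
]


def run(doc):
    for step, condition in PIPELINE:
        if condition(doc):
            step(doc)
    return doc


print(run({"text": "le chat est sur la table", "features": {}}))
print(run({"text": "the cat is on the table", "features": {}}))
```

The same pipeline handles both documents; the feature set by the first step decides which of the later steps run.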
Other plugins
• Parsers (Stanford, MiniPar, RASP, SUPPLE)
• More flexible gazetteers
• Specialised NE (Chemistry, Biomedicine, etc)
• PRs for other languages, Alignment
• Lemmatisers, morphological analyser, NP and VP chunkers
• Machine Learning
• Evaluation toolkit including IAA
• IR, Google and Yahoo search engines, web crawlers
• WordNet
• Whole host of ontology-based tools
Alignment plugin
GATE in use
• We have dozens of applications, not all just research projects!
• A few examples.....
Semantic Annotation
• Adding information to documents that is usable by machines to enable better presentation, navigation or searching, e.g. Perseus:
Indexing news at the BBC
BBC Archives: 'Newsnight' archiving time is 8 hours per hour
Automatic transcription to extract some potential indexing terms
• Result: temporally precise, but very noisy data
Partial solution: search the web, intranet, digital library for related pages, and process with IE/SA
• Result: less noisy but temporally imprecise
So we merge this information with the speech signal data
• Result: works well for easy stuff (high precision, low recall)
Ontology linking at FAO
• FAO have sets of fisheries-related ontologies, e.g. gear, species, fishing areas
• No way to link between them using ontology alignment techniques, because we require information external to the ontology (fish lives in a particular area)
• NLP techniques make use of information from documents which provide this missing link
• Not always an exact match between text and the ontology elements, e.g. Mummichogs vs. fundulus heteroclitus
• Use techniques such as headword matching, noun phrase chunking, synonym and acronym finding, etc
• Find relations in the text to link the entities together
Ontology linking at FAO
[Diagram: the Fishing Gear, Fishing Area, Species and Commodities ontologies, linked by the relations caught_by, found_in and basis_of]
Matching text descriptions
• Find NPs and terms; use OntoRootGazetteer to find morphological variants of ontology elements, perform headword and synonym matching etc.
• “Pelagic species, mainly fish and cephalopods, northern shrimp (also small crustaceans, krill”
• Match text span to ontology instance, retaining URIs
• Create annotations and features, e.g. caught_by = {gear_type = midwater otter trawls, target_species = cephalopods}
• Convert to RDF triples
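The final conversion step can be sketched in Python as follows. The namespace `http://example.org/fisheries#`, the annotation layout and the `ROLES` table (which feature supplies the subject and which the object) are placeholders invented for the example; the real URIs would come from the FAO ontologies.

```python
# Hypothetical namespace; real URIs would be retained from the ontologies.
NS = "http://example.org/fisheries#"

# A relation annotation whose features already hold ontology instance URIs.
annotation = {
    "relation": "caught_by",
    "features": {
        "target_species": NS + "cephalopods",
        "gear_type": NS + "midwater_otter_trawls",
    },
}

# Assumed mapping: which feature is the subject and which the object.
ROLES = {"caught_by": ("target_species", "gear_type")}


def to_ntriple(ann):
    """Serialise one relation annotation as a single N-Triples line."""
    subject_key, object_key = ROLES[ann["relation"]]
    subject = ann["features"][subject_key]
    obj = ann["features"][object_key]
    predicate = NS + ann["relation"]
    return f"<{subject}> <{predicate}> <{obj}> ."


print(to_ntriple(annotation))
```

Each relation found in the text yields one triple, so the species, gear and area ontologies end up linked through statements extracted from documents.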
Using ANNIC to view results
Outsmarting our competitors
If you can't beat 'em, join 'em
• UIMA
• OpenCalais
• LingPipe
All integrated into GATE as plugins
UIMA
• UIMA is an NL engineering platform developed by IBM
• Shares some functionality with GATE, but is complementary in most respects.
• Interoperability layer has been developed to allow UIMA applications to be run within GATE, and vice versa, in order to combine elements of both.
• Emphasis is on architectural support, including asynchronous scaleout (deploying many copies of an application in parallel)
• Much narrower range of resources provided than GATE
http://incubator.apache.org/uima/
OpenCalais
• Web service for semantic annotation of text.
• The user submits a document to the web service, which returns entity and relations annotations in RDF, JSON or some other format.
• Typically, users integrate OpenCalais annotation of their web pages to provide additional links and ‘semantic functionality’.
• OpenCalais annotates both relations and entities, although the GATE plugin only supports entities.
http://www.opencalais.com
LingPipe
• Provides a set of IE and data mining tools, largely ML-based. Has a set of models trained for particular tasks/corpora.
• Limited ontology support: can connect entities found to databases and ontologies
• Advantage: ML models can suggest more than one output, ranked by confidence. The user can choose the number of suggestions generated.
• Disadvantage: ML models only apply to specific tasks and domains.
http://alias-i.com/lingpipe/index.html
In summary...
• We like to think GATE is the best thing since sliced bread for most NLP and terminology tasks
• You can use it for plenty of other things too, don't let us stop you being creative!
• Incorporates a huge number of plugins, is easily extendable and highly customisable
• The only limit is your imagination...
• So if you're now convinced you can't live without GATE, there are two possibilities:
– ask us to get involved with a project
– try GATE yourself
Get your own hands dirty
• We run 3x yearly training courses in Sheffield and other selected locations
• Different tracks available
• GATE certification available
More info, contact details, demos, publications: http://gate.ac.uk
Now it's time to nudge your neighbour if they are asleep....
Or ask that burning question about GATE.