GATE Overview and Demo University of Washington CLMA Treehouse Presentation October 8, 2010 Prescott Klassen


Page 1:

GATE Overview and Demo

University of Washington CLMA Treehouse Presentation

October 8, 2010
Prescott Klassen

Page 2:

Overview

• Summary of GATE information and documentation found at gate.ac.uk
• GATE Developer features, components, and plug-ins
• IDE Demo
• Embedded GATE
• Using GATE with Condor on Patas
• GATE code samples

Page 3:

Background

• Sheffield Natural Language Processing Group at the University of Sheffield
• Released 1996; re-written and re-released 2002
• Latest Release: GATE 5.2.1 (May 6, 2010) for Windows, Linux, Solaris, and Mac OS
• Beta Release: GATE 6.0 (Beta 1, August 21, 2010)
• 100% Java Reference Implementation
• Compatible with IBM Unstructured Information Management Architecture (UIMA)
• Open Source (GNU Library General Public License)
• XML Corpus Encoding Standard (XCES) format, used by the American National Corpus

Page 4:

What is GATE?

• An architecture describing how language processing systems are made up of components.

• A framework (class library) written in Java and tested on Linux, Windows and Solaris.

• A graphical development environment built on the framework (IDE for NLP)

Page 5:

GATE Products

• GATE Developer
  – IDE for language processing components, bundled with ANNIE (A Nearly-New Information Extraction system) and plug-ins
• GATE Teamware
  – Web app for collaborative semantic annotation projects, incorporating a workflow engine and a backend service infrastructure
• GATE Embedded
  – Object library optimized for inclusion in applications
• GATE Services
  – Hosted services for cloud application development
• GATE Wiki
  – Wiki/CMS
• GATE Cloud
  – Cloud computing solution for hosted large-scale text processing

Page 6:

GATE Components

• Language Resources (LRs)—documents, corpora and ontologies

• Processing Resources (PRs)—parsers, stemmers, co-reference resolvers, ML components, etc.

• Visual Resources (VRs)—IDE components that provide a visual interface (GUI) to GATE components and plug-ins

Page 7:

Language Resources

• Documents, corpora, and ontologies
• Can persist in Java Serial Store or Lucene Serial Data Store
• Document = content + annotations + features
• “Stand-off” Markup
• Annotations as Directed Acyclic Graphs (start Node, end Node, ID, type, Feature Map, pointers into the source document as character offsets)
• Input Formats: Plain Text, HTML, SGML, XML, RTF, Email, PDF, Microsoft Word
• Ontology support (Sesame2, OWLIM3)
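The stand-off model on this slide (document = content + annotations + features, with annotations pointing into the text by character offset) can be sketched in plain Java. This is an illustration only: `StandoffDemo`, `Doc`, and `Span` are hypothetical names, not GATE classes, and the sketch leaves out GATE's node objects and annotation sets.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical plain-Java model of stand-off markup; these are NOT GATE classes.
final class StandoffDemo {

    // An annotation records where it applies instead of wrapping the text in tags.
    static final class Span {
        final int id;
        final String type;
        final int start; // inclusive character offset
        final int end;   // exclusive character offset
        final Map<String, Object> features = new LinkedHashMap<>();

        Span(int id, String type, int start, int end) {
            this.id = id;
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    // Document = content + annotations; the content itself is never modified.
    static final class Doc {
        final String content;
        final List<Span> annotations = new ArrayList<>();

        Doc(String content) {
            this.content = content;
        }

        Span annotate(String type, int start, int end) {
            if (start < 0 || start > end || end > content.length()) {
                throw new IllegalArgumentException("offsets outside document");
            }
            Span span = new Span(annotations.size(), type, start, end);
            annotations.add(span);
            return span;
        }

        // Recover the annotated text from the stored offsets.
        String textOf(Span span) {
            return content.substring(span.start, span.end);
        }
    }

    public static void main(String[] args) {
        String text = "GATE was developed at the University of Sheffield.";
        Doc doc = new Doc(text);

        int from = text.indexOf("Sheffield");
        Span place = doc.annotate("Location", from, from + "Sheffield".length());
        place.features.put("kind", "city");

        System.out.println(place.type + " " + place.features + " -> " + doc.textOf(place));
    }
}
```

Because the annotations live beside the text rather than inside it, several overlapping layers (tokens, sentences, entities) can describe the same characters without conflicting markup.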

Page 8:

Processing Resources

• ANNIE (A Nearly-New Information Extraction System)
  – Document Reset
  – Tokeniser
  – Gazetteer
  – Sentence Splitter
  – RegEx Sentence Splitter
  – Part of Speech Tagger
  – Semantic Tagger
  – Orthographic Coreference (OrthoMatcher)
  – Pronominal Coreference

Page 9:

Processing Resources

• JAPE (Java Annotation Pattern Engine):
  – Regular expressions over annotations
  – Finite state transduction over annotations based on regular expressions
  – Not against strings but against annotation graphs
  – Non-deterministic
• ANNIC: ANNotations-In-Context
  – Full-featured annotation indexing and retrieval system
  – Searchable Serial DataStore
  – Based on Lucene
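To make "regular expressions over annotations, not against strings" concrete, here is a small plain-Java sketch. It encodes each annotation type as one letter and runs an ordinary regex over the resulting type sequence. This is not JAPE syntax and not how JAPE is implemented; the class name, the annotation types, and the pattern `(Title)? FirstName Surname` are all assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only: not JAPE syntax and not JAPE's implementation.
final class AnnotationPatternDemo {

    // A typed token; in GATE this role is played by annotations on a document.
    static final class Token {
        final String type;
        final String text;

        Token(String type, String text) {
            this.type = type;
            this.text = text;
        }
    }

    // Encode each annotation type as one letter so a regex can run over types, not characters.
    static char symbol(String type) {
        switch (type) {
            case "Title":     return 'T';
            case "FirstName": return 'F';
            case "Surname":   return 'S';
            default:          return 'o';
        }
    }

    // Finds every run matching (Title)? FirstName Surname and returns the covered text.
    static List<String> findPersons(List<Token> tokens) {
        StringBuilder types = new StringBuilder();
        for (Token t : tokens) {
            types.append(symbol(t.type));
        }
        Matcher m = Pattern.compile("T?FS").matcher(types);
        List<String> result = new ArrayList<>();
        while (m.find()) {
            StringBuilder covered = new StringBuilder();
            // Match positions index into the token list, one symbol per token.
            for (int i = m.start(); i < m.end(); i++) {
                if (covered.length() > 0) {
                    covered.append(' ');
                }
                covered.append(tokens.get(i).text);
            }
            result.add(covered.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        List<Token> tokens = Arrays.asList(
            new Token("Other", "Yesterday"),
            new Token("Title", "Dr."),
            new Token("FirstName", "Ada"),
            new Token("Surname", "Lovelace"),
            new Token("Other", "met"),
            new Token("FirstName", "Alan"),
            new Token("Surname", "Turing"));
        System.out.println(findPersons(tokens)); // prints [Dr. Ada Lovelace, Alan Turing]
    }
}
```

The pattern never looks at the characters of "Ada" or "Turing"; it matches purely on the sequence of annotation types, which is the idea the slide attributes to JAPE.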

Page 10:

Processing Resources

• The Annotation Diff Tool
  – Enables two sets of annotations in one or two documents to be compared
  – Figures are generated for precision, recall, and F-measure
• Corpus Benchmark Tool
  – Applies evaluation across an entire corpus
• Balanced Distance Metric (BDM) Ontology Tool
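The three figures named on this slide can be computed as follows. This is a minimal sketch that assumes strict matching (two annotations agree only if type, start, and end are identical); it is not the Annotation Diff Tool's code, and `AnnotationScores` and its methods are hypothetical names.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of strict-match scoring; not the Annotation Diff Tool's code.
final class AnnotationScores {

    // Identify an annotation by type and offsets so that sets can be intersected.
    static String key(String type, int start, int end) {
        return type + ":" + start + ":" + end;
    }

    // Number of response annotations that also appear in the gold (key) set.
    static int correct(Set<String> gold, Set<String> response) {
        Set<String> both = new HashSet<>(response);
        both.retainAll(gold);
        return both.size();
    }

    // Precision = correct / all annotations the system produced.
    static double precision(Set<String> gold, Set<String> response) {
        return response.isEmpty() ? 0.0 : (double) correct(gold, response) / response.size();
    }

    // Recall = correct / all annotations in the gold set.
    static double recall(Set<String> gold, Set<String> response) {
        return gold.isEmpty() ? 0.0 : (double) correct(gold, response) / gold.size();
    }

    // Balanced F-measure: harmonic mean of precision and recall.
    static double fMeasure(Set<String> gold, Set<String> response) {
        double p = precision(gold, response);
        double r = recall(gold, response);
        return (p + r == 0.0) ? 0.0 : 2 * p * r / (p + r);
    }

    public static void main(String[] args) {
        Set<String> gold = new HashSet<>(Arrays.asList(
            key("Person", 0, 16), key("Location", 35, 42),
            key("Date", 50, 60), key("Organization", 70, 80)));
        Set<String> response = new HashSet<>(Arrays.asList(
            key("Person", 0, 16), key("Location", 35, 42), key("Person", 90, 95)));

        System.out.println("precision = " + precision(gold, response));
        System.out.println("recall    = " + recall(gold, response));
        System.out.println("F-measure = " + fMeasure(gold, response));
    }
}
```

In the example, two of the three response annotations are correct and two of the four gold annotations are found, so precision is 2/3, recall is 1/2, and the F-measure is 4/7.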

Page 11:

Processing Resources (Plug-ins)

• OntoGazetteer
• HashGazetteer
• Gazetteer List Collector
• Large KB Gazetteer
• Ontology-Aware JAPE Transducer
• Batch Learning PR (LibSVM, PAUM algorithm, Weka interface)
• Machine Learning PR (Maxent, Weka and SVM Light)

Page 12:

Resources on the Web site gate.ac.uk

• User Guide
• Movie Tutorials
• Developer’s Guide/API docs
• NLP Application Programmer’s Guide
• Research Papers
• GATE project descriptions
• Demos
• Plug-in Info
• Commercial/Academic partnerships
• Etc.

Page 13:

IDE Demo

Page 14:

What is GATE Embedded?

• Everything in GATE IDE without the GUI
• A Java framework for many different types of NLP solutions
• A complex assortment of core functionality and plug-ins
• Extensible and Composable
  – GATE can be included as a component in other Java Frameworks and vice-versa

Page 15:

Example Application with a GATE Embedded Component

Page 16:

Running GATE (“Hello World”)

    import java.io.File;

    import gate.*;
    import gate.creole.*;

    public class Main {

        public static void main(String[] args) throws Exception {
            // Point GATE at its installation and plug-in directories before initialising.
            Gate.setGateHome(new File("<Path to GATE>"));
            Gate.setPluginsHome(new File("<Path to Plugins>"));

            Gate.init(); // start GATE
        }
    }

Page 17:

Registering Directories

    // File.toURL() is deprecated and does not escape special characters,
    // so convert through toURI() first.
    Gate.getCreoleRegister().registerDirectories(
        new File(Gate.getPluginsHome(), "ANNIE").toURI().toURL());

    Gate.getCreoleRegister().registerDirectories(
        new File(Gate.getPluginsHome(), "Information_Retrieval").toURI().toURL());

    Gate.getCreoleRegister().registerDirectories(
        new File(Gate.getPluginsHome(), "Stemmer_Snowball").toURI().toURL());

Page 18:

Creating Processing Resources

    SerialAnalyserController annieController =
        (SerialAnalyserController) Factory.createResource(
            "gate.creole.SerialAnalyserController",
            Factory.newFeatureMap(), Factory.newFeatureMap(), "ANNIE");

    FeatureMap params = Factory.newFeatureMap();

    // Processing resources run in the order they are added to the controller.
    annieController.add((ProcessingResource) Factory.createResource(
        "gate.creole.annotdelete.AnnotationDeletePR", params));
    annieController.add((ProcessingResource) Factory.createResource(
        "gate.creole.tokeniser.DefaultTokeniser", params));
    annieController.add((ProcessingResource) Factory.createResource(
        "stemmer.SnowballStemmer", params));
    annieController.add((ProcessingResource) Factory.createResource(
        "gate.creole.gazetteer.DefaultGazetteer", params));
    annieController.add((ProcessingResource) Factory.createResource(
        "gate.creole.splitter.RegexSentenceSplitter", params));
    annieController.add((ProcessingResource) Factory.createResource(
        "gate.creole.POSTagger", params));
    annieController.add((ProcessingResource) Factory.createResource(
        "gate.creole.ANNIETransducer", params));
    annieController.add((ProcessingResource) Factory.createResource(
        "gate.creole.orthomatcher.OrthoMatcher", params));

    // The pronominal coreferencer takes its own parameter map.
    FeatureMap coRefParams = Factory.newFeatureMap();
    coRefParams.put("resolveIt", "true");

    annieController.add((ProcessingResource) Factory.createResource(
        "gate.creole.coref.Coreferencer", coRefParams));
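The controller pattern used on this slide (add processing steps in order, then execute them over every document in a corpus) can be sketched without GATE on the classpath. Everything here is hypothetical: `SerialPipelineDemo`, `Step`, `Pipeline`, and the three toy steps only mimic the roles of a serial controller, a document reset, a tokeniser, and a tagger.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the serial-controller idea; none of these types are GATE classes.
final class SerialPipelineDemo {

    // A document carries its text plus whatever the steps have produced so far.
    static final class Doc {
        final String text;
        final List<String> tokens = new ArrayList<>();
        final List<String> capitalised = new ArrayList<>();

        Doc(String text) {
            this.text = text;
        }
    }

    // One processing step; it reads and updates the document in place.
    interface Step {
        void execute(Doc doc);
    }

    static final class Pipeline {
        private final List<Step> steps = new ArrayList<>();
        private final List<Doc> corpus = new ArrayList<>();

        void add(Step step) {
            steps.add(step);
        }

        void addDocument(Doc doc) {
            corpus.add(doc);
        }

        // Runs every step, in insertion order, over every document.
        void execute() {
            for (Doc doc : corpus) {
                for (Step step : steps) {
                    step.execute(doc);
                }
            }
        }
    }

    static Pipeline defaultPipeline() {
        Pipeline pipeline = new Pipeline();
        // Step 1: reset, so that re-running the pipeline does not duplicate results.
        pipeline.add(doc -> {
            doc.tokens.clear();
            doc.capitalised.clear();
        });
        // Step 2: whitespace tokeniser.
        pipeline.add(doc -> {
            for (String token : doc.text.trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    doc.tokens.add(token);
                }
            }
        });
        // Step 3: tag capitalised tokens; this depends on the tokeniser having run first.
        pipeline.add(doc -> {
            for (String token : doc.tokens) {
                if (Character.isUpperCase(token.charAt(0))) {
                    doc.capitalised.add(token);
                }
            }
        });
        return pipeline;
    }

    public static void main(String[] args) {
        Pipeline pipeline = defaultPipeline();
        Doc doc = new Doc("GATE runs on Patas");
        pipeline.addDocument(doc);
        pipeline.execute();
        System.out.println(doc.tokens + " " + doc.capitalised);
    }
}
```

The sketch shows why ordering matters: the third step finds nothing unless the tokeniser has already populated the document, which mirrors why the tokeniser is added before the taggers in the ANNIE sequence.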

Page 19:

Creating Language Resources

    Corpus corpus = Factory.newCorpus("DUC Queries");

    // ConfigMgr is not a GATE class; it appears to be the application's own configuration helper.
    @SuppressWarnings("static-access")
    File topicsFile = new File(ConfigMgr.getTopicFilePath() + "topics.xml");
    gate.Document topicDoc = Factory.newDocument(topicsFile.toURI().toURL());

    // Run the controller over the one-document corpus.
    corpus.add(topicDoc);
    annieController.setCorpus(corpus);

    annieController.execute();

Page 20:

Iteration and Cleanup

    AnnotationSet defaultAnnotations = topicDoc.getAnnotations();
    AnnotationSet originalMarkup = topicDoc.getAnnotations("Original markups");
    AnnotationSet topicAnnotationSet = originalMarkup.get("TOPIC");

    for (Annotation topicAnnotation : topicAnnotationSet) {
        ArrayList<Query> topicQueryArrayList = new ArrayList<Query>();

        if (ConfigMgr.isQueryBreakdown()) {
            topicQueryArrayList = Utilities.buildTopicMultiQuery(topicAnnotation,
                originalMarkup, defaultAnnotations, config);
        } else {
            topicQueryArrayList = Utilities.buildTopicQuery(topicAnnotation,
                originalMarkup, defaultAnnotations, config);
        }

        String topicKey = null;

        topicKey = topicQueryArrayList.get(0).getDucTopicName();
        globalQueryHash.put(topicKey, topicQueryArrayList);
    }

    // Release the GATE resources once the queries have been built.
    topicDoc.cleanup();
    Factory.deleteResource(topicDoc);
    corpus.cleanup();
    Factory.deleteResource(corpus);

Page 21:

Iterating through Annotations

    public static AnnotationSet getChildAnnotationSet(
            String childAnnotationSetName, Annotation annotation,
            AnnotationSet parentAnnotationSet) throws NullPointerException {

        AnnotationSet childAnnotationSet = null;

        // traverse nested Annotation Set for named annotation
        // using parent offsets to delimit range
        try {
            childAnnotationSet = parentAnnotationSet.get(childAnnotationSetName,
                annotation.getStartNode().getOffset(),
                annotation.getEndNode().getOffset());
            if (childAnnotationSet == null) {
                throw new NullPointerException();
            }
        } catch (Exception e) {
            System.err.println(e.getMessage());
        }

        return childAnnotationSet;
    }
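The lookup above uses a parent annotation's offsets to delimit which child annotations are returned. The plain-Java sketch below shows two possible readings of "within the range": fully contained versus merely overlapping. The names are hypothetical and this is not GATE code; which of the two behaviours a given `AnnotationSet` method applies is best confirmed in the GATE Javadoc.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical plain-Java illustration of offset-delimited lookup; not GATE classes.
final class RangeLookupDemo {

    static final class Span {
        final String type;
        final long start; // inclusive
        final long end;   // exclusive

        Span(String type, long start, long end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return type + "[" + start + "," + end + ")";
        }
    }

    // Spans of the given type that lie entirely inside [start, end).
    static List<Span> contained(List<Span> all, String type, long start, long end) {
        List<Span> result = new ArrayList<>();
        for (Span s : all) {
            if (s.type.equals(type) && s.start >= start && s.end <= end) {
                result.add(s);
            }
        }
        return result;
    }

    // Spans of the given type that share at least one character with [start, end).
    static List<Span> overlapping(List<Span> all, String type, long start, long end) {
        List<Span> result = new ArrayList<>();
        for (Span s : all) {
            if (s.type.equals(type) && s.start < end && s.end > start) {
                result.add(s);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Span topic = new Span("TOPIC", 0, 100);
        List<Span> all = Arrays.asList(
            topic,
            new Span("TITLE", 5, 20),
            new Span("TITLE", 95, 120),
            new Span("NARR", 30, 60));

        System.out.println("contained:   " + contained(all, "TITLE", topic.start, topic.end));
        System.out.println("overlapping: " + overlapping(all, "TITLE", topic.start, topic.end));
    }
}
```

Here the second TITLE span straddles the end of the TOPIC span, so it appears in the overlapping result but not in the contained one. The difference matters when annotations from the original markup do not nest cleanly.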

Page 22:

Example Script for Compiling on Patas

    #!/bin/bash

    GATE=/NLP_TOOLS/tool_sets/gate/gate-5.1

    # Build the classpath from the same JARs, in the same order, as the original one-line command.
    CP=.:$GATE/bin/gate.jar
    for jar in activation.jar ant-contrib-1.0b2.jar ant.jar ant-junit.jar \
               ant-launcher.jar jdom.jar antlr.jar ant-nodeps.jar ant-trax.jar \
               Bib2HTML.jar commons-discovery-0.2.jar commons-fileupload-1.0.jar \
               commons-lang-2.4.jar commons-logging.jar concurrent.jar gate-asm.jar \
               gate-compiler-jdt.jar gateHmm.jar \
               geronimo-ws-metadata_2.0_spec-1.1.1.jar GnuGetOpt.jar icu4j.jar \
               jakarta-oro-2.0.5.jar javacc.jar jaxb-api-2.0.jar jaxen-1.1.jar \
               jaxws-api-2.0.jar junit.jar jwnl.jar log4j-1.2.14.jar lubm.jar \
               lucene-core-2.2.0.jar mail.jar nekohtml-1.9.8+2039483.jar \
               ontotext.jar orajdbc3.jar PDFBox-0.7.2.jar pg73jdbc3.jar \
               poi-2.5.1-final-20040804.jar spring-beans-2.0.8.jar \
               spring-core-2.0.8.jar stax-api-1.0.1.jar tm-extractors-0.4.jar \
               wstx-lgpl-3.2.3.jar xercesImpl.jar xml-apis.jar xmlunit-1.2.jar \
               xpp3-1.1.3.3_min.jar xstream-1.2.jar; do
        CP=$CP:$GATE/lib/$jar
    done
    CP=$CP:edu.mit.jwi_2.1.5.jar

    javac -classpath "$CP" ling573extractive/*.java

Page 23:

GATE Condor Script

    universe = java
    executable = ling573extractive/Main.class
    arguments = ling573extractive.Main
    output = ling573extractive.output
    error = ling573extractive.error
    jar_files = /NLP_TOOLS/tool_sets/gate/gate-5.1/bin/gate.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/junit.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/ant-junit.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/jdom.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/commons-lang-2.4.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/gate-asm.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/gate-compiler-jdt.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/log4j-1.2.14.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/lucene-core-2.2.0.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/nekohtml-1.9.8+2039483.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/ontotext.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/PDFBox-0.7.2.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/orajdbc3.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/wstx-lgpl-3.2.3.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/xercesImpl.jar,edu.mit.jwi_2.1.5.jar
    java_vm_args = -Xmn100M -Xms500M -Xmx500M
    +RequiresWholeMachine = True
    Requirements = ( Memory > 0 && TotalMemory >= (7*1024) )
    queue
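Since the submit file passes heap settings through `java_vm_args`, one simple way to confirm they reached the JVM on the execute node is to log the heap limit at startup. The class below is a hypothetical helper, not part of the original project; it uses only the standard `Runtime` API. With `-Xmx500M` the reported figure is typically close to, and often slightly below, 500 MB, depending on the JVM and garbage collector.

```java
// Hypothetical startup check; not part of the original ling573extractive code.
final class HeapCheck {

    // Maximum heap the JVM will attempt to use, in megabytes.
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    // One-line summary suitable for the job's output file.
    static String report() {
        Runtime rt = Runtime.getRuntime();
        long committed = rt.totalMemory() / (1024L * 1024L);
        return "max heap " + maxHeapMegabytes() + " MB, committed " + committed + " MB, "
            + rt.availableProcessors() + " processors";
    }

    public static void main(String[] args) {
        System.out.println(report());
    }
}
```

Calling `HeapCheck.report()` at the top of the job's `main` method would place the line in the file named by `output`, which makes memory-related failures on a cluster node easier to diagnose.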

Page 24:

Discussion