Experiences with UIMA in NLP teaching and research
Manuela Kunze,
Dietmar Rösner
University of Magdeburg Knowledge Based Systems and Document Processing
Kunze, Rösner: Experiences with UIMA in NLP teaching and research 2
Overview
• What is UIMA?
• First Experiments
• NLP Teaching
• Conclusion
UIMA: Unstructured Information Management Architecture
• a software architecture for developing and deploying unstructured information management (UIM) applications
• UIM application: a software system that analyses large volumes of unstructured information to
  – discover,
  – organize, and
  – deliver relevant knowledge to the end user
• a software architecture that specifies
  – component interfaces, data representations, …
• http://www.research.ibm.com/UIMA/
UIMA: Unstructured Information Management Architecture
… interfaces to a collection of data items (e.g., documents) to be analyzed. Collection Readers return CASes that contain the documents to analyze, possibly along with additional metadata.
… takes a CAS, analyzes its contents, and produces an enriched CAS. Analysis Engines can be recursively composed of other Analysis Engines (called an Aggregate Analysis Engine). Aggregates may also contain CAS Consumers.
… may be used by a Collection Reader to populate a CAS from a document. An example of a CAS Initializer is an HTML parser that de-tags an HTML document and also inserts paragraph annotations (determined from <P> tags in the original HTML) into the CAS.
… consume the enriched CAS that was produced by the sequence of Analysis Engines before it, and produce an application-specific data structure, such as a search engine index or database.
CAS: Common Analysis Structure
CPE: Collection Processing Engine
[Ferrucci et al.: Unstructured Information Management Architecture (UIMA): SDK User's Guide and Reference]
UIMA: Unstructured Information Management Architecture
• Analysis Engine (AE):
  – a component that analyzes artifacts (e.g. documents) and infers information about them
  – consists of two parts:
    • Java classes (typically packaged as one or more JAR files) and
    • AE descriptors (one or more XML files)
      – the configuration settings for the Analysis Engine as well as
      – a description of the AE’s input and output requirements.
UIMA: Unstructured Information Management Architecture
[Figure: how an analysis engine is defined in Java and XML. Nodes: analysis engine, annotator, processing resources, type system, annotation interface. Edge labels: linked to a type system, uses, define an annotator, create.]
• define annotation type:
  – name
  – features (begin, end, …)
• describe analysis engine:
  – annotator class
  – input parameter
  – output of annotations
  – external resources
• interface
• resources
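The idea of an annotation type with a name and begin/end features can be sketched in a few lines. The following is only an illustration in Python, not the UIMA Java API: `Annotation`, `Document` (a stand-in for a CAS) and `token_annotator` are hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Annotation:
    """An annotation type with the features named on the slide."""
    name: str   # annotation type name, e.g. "Token"
    begin: int  # offset of the first covered character
    end: int    # offset one past the last covered character


@dataclass
class Document:
    """Hypothetical stand-in for a CAS: the text plus its annotations."""
    text: str
    annotations: list = field(default_factory=list)

    def covered_text(self, annotation: Annotation) -> str:
        return self.text[annotation.begin:annotation.end]


def token_annotator(doc: Document) -> None:
    """A toy annotator: marks every whitespace-separated token."""
    position = 0
    for token in doc.text.split():
        begin = doc.text.index(token, position)
        end = begin + len(token)
        doc.annotations.append(Annotation("Token", begin, end))
        position = end


doc = Document("Egyptian Museum is open daily")
token_annotator(doc)
tokens = [doc.covered_text(a) for a in doc.annotations]
```

In UIMA itself the type would be declared in an XML type system descriptor and the annotator written as a Java class; the sketch only mirrors the division of labour between the two.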
UIMA: Unstructured Information Management Architecture
• Aggregate Analysis Engine:
  – combines different analysis engines within one Analysis Engine
[Ferrucci et al.: Unstructured Information Management Architecture (UIMA): SDK User's Guide and Reference]
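The recursive composition of analysis engines can be illustrated with plain function composition. This is a sketch under simplifying assumptions, not UIMA code: the CAS is reduced to a list of annotation names, and `primitive` and `aggregate` are hypothetical helpers.

```python
from typing import Callable, List

# Hypothetical stand-in for a CAS: just the list of annotation names so far.
Cas = List[str]
Engine = Callable[[Cas], Cas]


def primitive(name: str) -> Engine:
    """Build a toy engine that records its own name as an annotation."""
    def run(cas: Cas) -> Cas:
        return cas + [name]
    return run


def aggregate(*engines: Engine) -> Engine:
    """Combine engines into one engine that applies them in order."""
    def run(cas: Cas) -> Cas:
        for engine in engines:
            cas = engine(cas)
        return cas
    return run


# An aggregate is itself an engine, so it can be nested in another aggregate.
time_engines = aggregate(primitive("time"), primitive("interval"))
pipeline = aggregate(primitive("museum"), time_engines, primitive("restriction"))
result = pipeline([])
```

Because `aggregate` returns something with the same signature as a primitive engine, nesting needs no special handling.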
Overview
• Introduction
• First Experiments
• NLP Teaching
• Conclusion
First Experiments: UIMA vs. GATE
• base line:
  – 2 persons, 2 systems, 1 corpus and 1 extraction task
  – skills/experiences of the persons:
[Table: rows Person 1 and Person 2; columns UIMA, GATE, Eclipse/Java]
Task of the Experiment
• process a corpus of websites
  – to detect and extract information relevant for tourists
    • opening times of museums, prices of hotels, …
• corpus:
  – 30 tourism web sites of Egypt
  – additional 20 web sites of Washington, New York, London
• output:
  – Prolog facts for a reasoner
  – Questions:
    • Which museum is now open?
    • …
Evaluation Topics/Points
• ease of getting acquainted with the system?
– quality of documentation: completeness, clarity, up-to-dateness, …?
– tutorials, use cases, …?
• processing and linguistic resources?
– lexica, Gazetteer lists, tools
• tools for resource maintenance and extension?
– quality: self-explanatory, robust, comfortable
• speed of processing?
• single document vs. large corpora?
• limitations, suggestions for improvement?
• support for im-/export of a variety of document formats?
Excerpts from the Corpus
• The Egyptian Museum is open the hours: 9am-5pm daily
• The Military Museum is open the hours: Summer: 8am-5:30pm; winter: 8am-4:30pm
• Palace Museum is open the hours: 8am-5:30pm (summer) 8am-4:30pm (winter)
• 10am-2pm, 6pm-9pm Sat-Wed; 6pm-9pm Fri
• …
UIMA Application
• several annotators (like a pipeline)
  – museum pattern: regular expressions
  – time pattern: regular expressions
  – interval of times: regular expressions
  – restrictions: window covering two time intervals and a restriction
  – museum information: window covering a museum and opening hours
• input: ... *Fraunces Tavern Museum* 54 Pearl St. - 1-212-425-1778 Tuesday-Friday, 12pm-5pm; …
• Prolog facts:
museumopen('Fraunces Tavern Museum ', '2005-12-01T12:00:00', '2005-12-01T17:00:00').
museumopen('Fraunces Tavern Museum ', '2005-12-02T12:00:00', '2005-12-02T17:00:00').
museumopen('Fraunces Tavern Museum ', '2005-12-03T12:00:00', '2005-12-03T17:00:00').
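A regular-expression step of this kind, from a time interval in the text to a Prolog fact, might look as follows. This is a minimal Python sketch, not the authors' Java annotators: the pattern, the helper names and the choice of date are assumptions made for illustration.

```python
import re
from datetime import date, datetime, time
from typing import Optional

# Assumed pattern for intervals such as "12pm-5pm" or "8am-5:30pm".
TIME = r"(\d{1,2})(?::(\d{2}))?(am|pm)"
INTERVAL = re.compile(TIME + r"\s*-\s*" + TIME)


def to_time(hour: str, minute: Optional[str], meridiem: str) -> time:
    """Convert a 12-hour clock reading to a 24-hour time."""
    h = int(hour) % 12 + (12 if meridiem == "pm" else 0)
    return time(h, int(minute or 0))


def find_intervals(text: str):
    """Yield (opening, closing) times for every interval in the text."""
    for m in INTERVAL.finditer(text):
        yield to_time(*m.group(1, 2, 3)), to_time(*m.group(4, 5, 6))


def prolog_fact(museum: str, day: date, opening: time, closing: time) -> str:
    """Format one museumopen/3 fact with ISO 8601 timestamps."""
    start = datetime.combine(day, opening).isoformat()
    stop = datetime.combine(day, closing).isoformat()
    return f"museumopen('{museum}', '{start}', '{stop}')."


text = "Tuesday-Friday, 12pm-5pm"
facts = [
    prolog_fact("Fraunces Tavern Museum", date(2005, 12, 1), opening, closing)
    for opening, closing in find_intervals(text)
]
```

The sketch covers only the time pattern and the fact output; the window-based steps that link an interval to a museum name and a day restriction are left out.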
UIMA: Results
• information annotated in the documents:
  – names of museums, hotels
  – times, time intervals
  – time restrictions
  – prices, intervals of prices (hotel prices)
  – keywords for museum category
  – names of pharaohs (annotated with a correction of misspellings)
• information about hotels and museums is exported into Prolog facts and into a short textual summary
  – templates filled with the detected information
    • hotels: Price information about Cosmopolitan Hotel : $157
    • museums:
*** *Fraunces Tavern Museum* ***
Open from 12:00:00 to 17:00:00;
Restriction: Tuesday-Friday
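Filling such a summary template from detected values takes very little code. The sketch below uses Python's standard `string.Template`; the template text follows the museum example above, while the field names are assumptions.

```python
from string import Template

# Template modelled on the museum summary; field names are hypothetical.
MUSEUM_TEMPLATE = Template(
    "*** *$name* ***\n"
    "Open from $opening to $closing;\n"
    "Restriction: $restriction"
)

summary = MUSEUM_TEMPLATE.substitute(
    name="Fraunces Tavern Museum",
    opening="12:00:00",
    closing="17:00:00",
    restriction="Tuesday-Friday",
)
```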
UIMA vs. GATE: Conclusion
• no final judgement about whether to use GATE or UIMA
  – depends on
    • your task
      – task description
      – expected results
      – which processing resources are necessary
    • your preferences for interface
      – prefer the Eclipse environment (or other Java editors)
      – prefer a comfortable GUI
UIMA vs. GATE: Conclusion
• GATE:
  – tools available
  – comfortable GUI
• UIMA:
  – plain framework
  – simplified definition of (complex) result structures
  – simplified pre- and postprocessing of annotations
• both are extensible
  – e.g. for processing German documents
'German' Extension of Processing Resources
• XDOC document suite
  – tools for processing German documents
  – tools implemented in Common Lisp
• for UIMA
  – Java reimplementation of the tools
  – several analysis engines
XDOC in UIMA
• annotation of
  – part-of-speech (Morphix, heuristics)
  – semantic categories
  – named entities (vehicles, cities, …)
• a coarse approach for classification of PP
  – using maxent library
UIMA: Evaluation
documentation?
- good
- illustrative examples (tutorial)
- completeness: some topics are described only very briefly
- prior experience with Eclipse and Java programming is helpful
UIMA: Evaluation
processing and linguistic resources?
- annotators only from tutorial
  - sentence annotation
  - word annotation
  - date/time annotators
  - examples for using regular expressions etc.
- external resources can be integrated:
  - lexical resources as external resources (text files)
  - existing processing resources
    - implementation of an interface is necessary
UIMA: Evaluation
tools for resource maintenance and extension?
- specific Eclipse component editors or
- simple text editors
UIMA: Evaluation
speed of processing?
- faster than GATE?
- the CPE provides detailed information about the processing time of each module
UIMA: Evaluation
single docs vs. large corpora?
- Collection Reader
  - document(s) from a directory
UIMA: Evaluation
limitations, suggestions for improvement?
• no limitations:
  – everything is possible, but implementation or interfacing has to be done by the user
• wish:
  – more processing and linguistic resources within the distribution
UIMA: Evaluation
im-/export of document formats?
- import: CAS Initializer
- export: CAS Consumer
  - transform annotations into any other format
  - export of
    - document + annotations
    - only annotations
  - required: Java application
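The two export variants (document plus annotations, or annotations only) can be sketched as a small consumer-style function. This is a Python illustration rather than a UIMA CAS Consumer, which would be written in Java; JSON as the target format and the function name `export` are assumptions.

```python
import json
from typing import List, Tuple

# An annotation is reduced here to (type name, begin offset, end offset).
Span = Tuple[str, int, int]


def export(text: str, annotations: List[Span], include_document: bool = True) -> str:
    """Serialize annotations, optionally together with the document text."""
    payload = {
        "annotations": [
            {"type": name, "begin": begin, "end": end}
            for name, begin, end in annotations
        ]
    }
    if include_document:
        payload["document"] = text
    return json.dumps(payload, sort_keys=True)


text = "Open 12pm-5pm"
spans = [("TimeInterval", 5, 13)]
with_document = json.loads(export(text, spans))
annotations_only = json.loads(export(text, spans, include_document=False))
```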
Overview
• Introduction
• First Experiments
• NLP Teaching
• Conclusion
NLP Teaching
• course: Information Extraction
• aim of the course: to acquaint our students with information extraction as a basic NLP technology
  – UIMA, GATE
• students: computer science, data-knowledge engineering
• skills of the students: Java programming
NLP Teaching
• different corpora:
  – news about FIFA world cup 2006 in Germany,
  – description of drugs,
  – announcements of new books, …
• tasks for students
  – to develop different analysis engines and combine them for annotation of
    • URLs,
    • email addresses,
    • names of players,
    • results of games, …
  – using regular expressions, external resources, maximum entropy models
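A regular-expression solution to the URL and email address tasks could start from something like the following. It is a Python sketch with deliberately simplified patterns, not a course solution: the patterns, the `annotate` helper and the sample sentence are all assumptions.

```python
import re

# Simplified patterns; real-world URLs and addresses need more care.
PATTERNS = {
    "URL": re.compile(r"https?://[^\s<>\"']+"),
    "Email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
}


def annotate(text: str):
    """Return (type, begin, end) spans for all matches, sorted by offset."""
    spans = []
    for name, pattern in PATTERNS.items():
        spans.extend((name, m.start(), m.end()) for m in pattern.finditer(text))
    return sorted(spans, key=lambda span: span[1])


text = "Tickets: http://www.example.org/tickets or mail info@example.org"
spans = annotate(text)
covered = [(name, text[begin:end]) for name, begin, end in spans]
```

One known weakness of the URL pattern is that it swallows trailing punctuation, for example a full stop directly after the address.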
UIMA: A Students' View
• easy to handle
• Java programming (environment)
• problems of students:
  – to understand the dependencies between the several descriptors
• helpful for teaching (future work):
  – a 'comparator' of different solutions of students
  – which solution is the best, related to a 'master' solution
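One way such a 'comparator' might rank student solutions is to score each against the master solution by precision, recall and F1 over exactly matching annotation spans. The slides only name the idea, so the metric, the function `score` and the sample data below are all assumptions.

```python
from typing import Set, Tuple

# An annotation is reduced here to (type name, begin offset, end offset).
Span = Tuple[str, int, int]


def score(solution: Set[Span], master: Set[Span]) -> Tuple[float, float, float]:
    """Precision, recall and F1 of a solution, counting exact span matches."""
    hits = len(solution & master)
    precision = hits / len(solution) if solution else 0.0
    recall = hits / len(master) if master else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


master = {("URL", 9, 39), ("Email", 48, 64)}
solutions = {
    "student_a": {("URL", 9, 39)},                                       # misses the email
    "student_b": {("URL", 9, 39), ("Email", 48, 64), ("Email", 0, 5)},   # one spurious span
}
best = max(solutions, key=lambda name: score(solutions[name], master)[2])
```

Exact matching is strict; a comparator for teaching might also give partial credit for overlapping spans.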
Overview
• Introduction
• First Experiments
• NLP Teaching
• Conclusion
Conclusion
• UIMA:
  – easy to learn and to handle
  – supports the management of
    • different annotations
    • different processing resources
  – integration of external resources (processing resources as well as lexical resources)
  – splitting of 'processing steps':
    • reader, initializer, analysis engine, consumer
• 'wish-list':
  – a kind of JAPE transducer
    • interface to GATE's processing resources is available
  – 'comparator' for evaluation of solutions