Apache UIMA and Metadata Generation
Gestione delle Informazioni su Web - 2009/2010
Tommaso Teofili
tommaso [at] apache [dot] org
Wednesday, 14 April 2010
Agenda
Unstructured information management
The ASF
Apache UIMA
Goals
Overview
Components
Usage
UIM?
Unstructured Information Management
A wide topic: text, audio, video
Different (possibly mixed) approaches (NLP, Machine Learning, IR, Ontologies, Automated reasoning, Knowledge Sources)
Apache UIMA
Apache Software Foundation
Non-profit corporation
“...provides organizational, legal, and financial support for a broad range of open source software projects...”
“...collaborative and meritocratic development process...”
“...pragmatic Apache License...”
Apache UIMA
Architectural framework to manage unstructured data (Java, C++)
Just graduated to an Apache Top-Level Project
Former IBM Research project donated to the ASF
OASIS Standard
Apache UIMA - Goals
“Our goal is to support a thriving community of users and developers of UIMA frameworks, tools, and annotators, facilitating the analysis of unstructured content such as text, audio and video”
Apache UIMA - bridging worlds
Apache UIMA - Overview
UIMA supports the development, discovery, composition and deployment of multi-modal analytics for the analysis of unstructured information and its integration with search technologies
Apache UIMA - Multimodal Analysis
Multimodal analysis means the ability to process a resource from various “points of view”
Example: a video stream from which we want to extract subtitles and also automatically recognize the actors involved
Here, though, we are mainly interested in text...
Sample scenario
Content Management System containing free text articles about movies
We want such articles to be automatically enriched with metadata contained inside the text (movies, directors, actors/actresses, distribution) and linked to “similar” articles (i.e., articles dealing with the same movies or actors)
So that we can search for “similar” articles
Sample scenario - articles about movies
Sample scenario
UIMA can help enrich articles with metadata
Think of filling the instance variables of an Article.java class with proper values
Then persisting it to a database, so that we can query for articles dealing with the same actors
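As a rough illustration of the idea, here is a minimal sketch of what such an Article class might look like. The field names are assumptions derived from the metadata listed in the scenario (movies, directors, actors, distribution), and the isSimilarTo helper is hypothetical; persistence is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical domain object: field names are assumptions, not taken from the slides.
class Article {
    private final String text;
    private final List<String> movies = new ArrayList<>();
    private final List<String> directors = new ArrayList<>();
    private final List<String> actors = new ArrayList<>();
    private String distribution;

    Article(String text) {
        this.text = text;
    }

    String getText() { return text; }
    List<String> getMovies() { return movies; }
    List<String> getDirectors() { return directors; }
    List<String> getActors() { return actors; }
    String getDistribution() { return distribution; }
    void setDistribution(String distribution) { this.distribution = distribution; }

    // Assumed notion of "similar": the two articles share at least one actor or movie.
    boolean isSimilarTo(Article other) {
        for (String actor : actors) {
            if (other.actors.contains(actor)) return true;
        }
        for (String movie : movies) {
            if (other.movies.contains(movie)) return true;
        }
        return false;
    }
}
```

The analysis pipeline's job is then to fill the lists; once they are filled, finding related articles reduces to comparing metadata rather than free text.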
Filling Article with metadata
Sample scenario - metadata
UIMA - Annotations and Entities
Apache UIMA - Annotation
The association of a piece of metadata, such as a label, with a region of text (or other type of artifact).
For example, the label “Person” associated with a region of text “Fred Center” constitutes an annotation. We say “Person” annotates the span of text from X to Y containing exactly “Fred Center”
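The definition above can be sketched as a label plus a pair of character offsets. The class below is a simplified stand-in written for illustration (the name SpanAnnotation is made up); it is not the UIMA Annotation class, although real UIMA annotations also carry begin and end offsets.

```java
// Simplified stand-in for an annotation: a label attached to the span [begin, end).
final class SpanAnnotation {
    final String label;
    final int begin; // offset of the first covered character (X)
    final int end;   // offset just past the last covered character (Y)

    SpanAnnotation(String label, int begin, int end) {
        if (begin < 0 || end < begin) {
            throw new IllegalArgumentException("invalid span: " + begin + ".." + end);
        }
        this.label = label;
        this.begin = begin;
        this.end = end;
    }

    // Returns the region of the document that this annotation covers.
    String coveredText(String document) {
        return document.substring(begin, end);
    }
}
```

For a document containing “Fred Center”, begin would be the index where “Fred” starts and end the index just after “Center”; coveredText then returns exactly “Fred Center”.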
Apache UIMA - Basic Steps
Domain model definition
Analysis pipeline definition
Arrange components:
Define components that read data from sources
Add and customize analysis components: Patterns, Dictionaries, RegEx, External services, NLP, etc...
Define components that write information to target storage
Analysis pipeline(s) execution
Defining domain model within UIMA using Type Systems
The Type System is where we describe which metadata we would like to extract
Low representational gap
Like almost everything in UIMA: described (and generated!) using XML
Possible to define multiple Type Systems for different purposes
Defining domain model within UIMA using Type Systems
Define at least one Type inside the Type System for each object in the domain model
It is useful to define more fine-grained Types (for the values of type properties, called Features)
If we want to extract information about articles we create an Article type inside the Type System
We’ll also need to create annotations/entities for movies, actors, directors, etc...
Types usually extend Annotation or TOP
Type System for Articles
How does UIMA extract metadata?
Apache UIMA - Analysis Engines
Basic UIMA building blocks
Analyze a document
Infer and record descriptive attributes (about documents/regions)
Generate analysis results
Apache UIMA - AEs
Analysis Engines are described by a descriptor (XML)
Can be Primitive (a single AE) or Aggregate (a pipeline of AEs)
Analysis algorithms can be switched by changing the descriptor instead of the code
Contain Type System definitions
Define Capabilities
Apache UIMA - AnalysisComponent API
initialize : Performs (once) any startup tasks required by this component
process : Processes the resource to analyze, generating analysis results (metadata)
destroy : Frees all resources held; called only once, when the framework has finished using this component
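The three-method lifecycle can be illustrated with a deliberately simplified, hypothetical interface. SimpleComponent below is not the real UIMA AnalysisComponent interface (whose methods take framework objects as parameters); it only mirrors the call order: initialize once, process once per document, destroy once.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified lifecycle interface; NOT the real UIMA signatures.
interface SimpleComponent {
    void initialize();
    void process(String document, List<String> results);
    void destroy();
}

// Toy component: records every word that starts with an uppercase letter.
class UppercaseWordFinder implements SimpleComponent {
    private boolean ready;
    int initCalls;
    int destroyCalls;

    public void initialize() {
        ready = true;
        initCalls++;
    }

    public void process(String document, List<String> results) {
        if (!ready) {
            throw new IllegalStateException("initialize() must be called first");
        }
        for (String word : document.split("\\s+")) {
            if (!word.isEmpty() && Character.isUpperCase(word.charAt(0))) {
                results.add(word);
            }
        }
    }

    public void destroy() {
        ready = false;
        destroyCalls++;
    }
}

// Plays the role of the framework: drives the lifecycle over a batch of documents.
class LifecycleDemo {
    static List<String> runAll(SimpleComponent component, List<String> documents) {
        List<String> results = new ArrayList<>();
        component.initialize();                 // once, before any document
        try {
            for (String document : documents) {
                component.process(document, results); // once per document
            }
        } finally {
            component.destroy();                // once, even if processing fails
        }
        return results;
    }
}
```

Keeping setup and teardown out of process means expensive resources (dictionaries, models) are loaded a single time however many documents flow through.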
Apache UIMA - Annotators
Analysis Engine algorithm
Annotator : A software component implemented to produce and record annotations over regions of an artifact (e.g., text document, audio, and video)
Annotators implement the AnalysisComponent interface
Apache UIMA - Roles
AnalysisEngine : High level block responsible for analysis - contains at least one AnalysisComponent
AnalysisComponent : interface for any component responsible for analyzing artifacts
Annotator : implementation of AnalysisComponent responsible for creating Annotations
Apache UIMA - AEs
Analysis Engines in a Pipeline
Apache UIMA - Analysis Results
Where do analysis results end up?
How do annotators represent and share their results?
CAS - Common Analysis Structure
Maintains typed indexes of extracted results
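To make the notion of "typed indexes" concrete, here is a much-simplified sketch of a shared result container. MiniCas and its methods are invented for illustration and do not reflect the actual CAS API; the point is only that annotators write results under a type name and later annotators read them back by type.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, much-simplified stand-in for a CAS: document text plus results indexed by type.
class MiniCas {
    static final class Ann {
        final String type;
        final int begin;
        final int end;

        Ann(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }
    }

    private final String documentText;
    private final Map<String, List<Ann>> indexByType = new HashMap<>();

    MiniCas(String documentText) {
        this.documentText = documentText;
    }

    String getDocumentText() {
        return documentText;
    }

    // Called by an annotator to record a result.
    void add(String type, int begin, int end) {
        List<Ann> bucket = indexByType.get(type);
        if (bucket == null) {
            bucket = new ArrayList<>();
            indexByType.put(type, bucket);
        }
        bucket.add(new Ann(type, begin, end));
    }

    // Called by a downstream annotator or consumer to read all results of one type.
    List<Ann> select(String type) {
        List<Ann> bucket = indexByType.get(type);
        if (bucket == null) {
            return Collections.emptyList();
        }
        return Collections.unmodifiableList(bucket);
    }
}
```

In a pipeline, a first annotator might add "Person" entries and a second one might call select("Person") to decide which of them are also actors, without either component knowing about the other.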
Common Analysis Structure
Which algorithms lie beneath AEs?
Apache UIMA & NLP
NLP (Natural Language Processing) is a theoretically motivated range of computational techniques for analyzing and representing naturally occurring texts at one or more levels of linguistic analysis for the purpose of achieving human-like language processing for a range of tasks or applications
It’s an AI discipline
Apache UIMA & NLP
“accomplish human-like language processing”
Paraphrase an input text
Translate the text into another language
Answer questions about the contents of the text
Draw inferences from the text <--
Apache UIMA & NLP
“an NLP-based IR system has the goal of providing more precise, complete information in response to a user’s real information need”
various levels of processing
that’s where we are!
Apache UIMA - First Approaches
Simplest : Write RegEx and Dictionaries and mix them together
NLP-like : Tokenize -> Sentence identification -> PoS Tagging -> Custom (Domain specific) structures
Analysis Engines in a Pipeline
Sample scenario - extract actors
Tokenize article text
Identify sentences
Tag PoS
Identify Persons using regular expressions and PoS
Use Person annotations, Tokens’ PoS and Sentences to extract relations between terms to identify Persons who are also Actors
Sample scenario - PersonAnnotator
I have a dictionary of names (simple to find and/or build)
I use a DictionaryAnnotator to extract NameAnnotations
I don’t have a dictionary of surnames
Every time a matching name (a NameAnnotation) is found, we look for one or more subsequent tokens (to account for persons with a double name or surname) whose PoS is “undefined” or a noun (but not a verb) and which start with an uppercase letter
If found, then the name + token(s) sequence annotates a Person (e.g. “Michael J. Fox”)
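The heuristic described above might be sketched as follows, outside of UIMA and over a plain list of tokens. The Token class, the PoS labels "noun" and "undefined", and the method names are assumptions made for this example; a real annotator would read Token and NameAnnotation objects from the CAS instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class PersonFinder {
    // Hypothetical token: surface text plus a coarse part-of-speech label.
    static final class Token {
        final String text;
        final String pos;

        Token(String text, String pos) {
            this.text = text;
            this.pos = pos;
        }
    }

    // Returns "name + following token(s)" sequences that look like persons.
    static List<String> findPersons(List<Token> tokens, Set<String> firstNames) {
        List<String> persons = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            if (!firstNames.contains(tokens.get(i).text)) {
                i++;
                continue;
            }
            // Dictionary hit: greedily collect the capitalized tokens that follow.
            StringBuilder person = new StringBuilder(tokens.get(i).text);
            int j = i + 1;
            while (j < tokens.size() && looksLikeNamePart(tokens.get(j))) {
                person.append(' ').append(tokens.get(j).text);
                j++;
            }
            if (j > i + 1) {
                persons.add(person.toString()); // at least one following token matched
            }
            i = j; // continue after the tokens consumed so far
        }
        return persons;
    }

    // PoS is "undefined" or a noun, and the token starts with an uppercase letter.
    static boolean looksLikeNamePart(Token token) {
        boolean posOk = "undefined".equals(token.pos) || "noun".equals(token.pos);
        return posOk && !token.text.isEmpty() && Character.isUpperCase(token.text.charAt(0));
    }
}
```

A first name with no capitalized follower is deliberately not reported, which trades some recall for fewer false positives.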
PersonAnnotator sample
Sample scenario - articles about movies
Sample scenario
Identifying actors can be simple if we know that Persons who are also actors perform some well-known actions
e.g.: a Person who is also an Actor “stars as” CharacterInTheMovie (which will eventually be tagged as a Person too)
e.g.: if the snippet “CharacterInTheMovie (Person)” exists, then Person is usually an Actor
Then we can build an ActorAnnotator
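The two cues above could be sketched with plain regular expressions. This is only an illustration under stated assumptions: the class and method names are invented, the persons are assumed to have been found already, and a real ActorAnnotator would work on Person annotations in the CAS rather than on strings.

```java
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Pattern;

class ActorFinder {
    // Returns the persons that match one of the two "actor" cues in the text.
    static Set<String> findActors(String text, Collection<String> persons) {
        Set<String> actors = new LinkedHashSet<>();
        for (String person : persons) {
            String quoted = Pattern.quote(person); // treat the name literally, dots included

            // Cue 1: "<Person> stars as ..."
            boolean starsAs = Pattern.compile(quoted + "\\s+stars\\s+as\\b")
                    .matcher(text).find();

            // Cue 2: "<Character> (<Person>)", i.e. the person appears alone in parentheses
            boolean inParentheses = Pattern.compile("\\(\\s*" + quoted + "\\s*\\)")
                    .matcher(text).find();

            if (starsAs || inParentheses) {
                actors.add(person);
            }
        }
        return actors;
    }
}
```

Both cues are heuristics: a sentence such as “Title (Director)” would also satisfy the second one, so the result should be treated as candidate actors rather than certainties.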
Sample scenario
Apache UIMA experience
Under SVN at
http://svn.apache.org/repos/asf/uima/uimaj/trunk/uimaj-examples/
there are some examples; the getting started guides are also very useful for becoming familiar with UIMA
http://uima.apache.org/documentation.html#getting_started
Subscribe to the users@ and [email protected] mailing lists
Apache UIMA - Components
Type Systems
Analysis Engines
CAS
Collection Processing Manager/Engine
Flow Controllers
CAS Consumers
Asynchronous Scaleout
Sandbox Components
Eclipse Plugins
Tools
Apache UIMA - Flow Controllers
A component which implements the interfaces needed to specify a custom flow within an Aggregate Analysis Engine
Enables conditional pipelines
Apache UIMA - CAS Consumers
Components responsible for taking the results from the CAS and storing them into a database, or other storage device
Apache UIMA - Collection Processing and a bigger picture
Apache UIMA - Asynchronous Scaleout
An add-on to the base Java framework, supporting a very flexible scaleout capability based on JMS (Java Message Service) and Apache ActiveMQ (a messaging and integration patterns provider)
A powerful clustering solution, very useful when the size of the source documents is huge
Apache UIMA - Sandbox Basics
Tokenizer
HMM Tagger
Dictionaries (DictionaryAnnotator, ConceptMapper)
Snowball
ConfigurableFeatureExtractor
Apache UIMA - External Services
External IE (information extraction) engines exposing web services can be easily integrated into UIMA:
AlchemyAPI Annotator
OpenCalais Annotator
Apache UIMA - Tika
Apache Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries. The TikaAnnotator uses Tika to generate annotations representing the original markup of a document, and to extract its text and metadata.
Apache UIMA - Lucas
Very useful to build search engines!
Stores CAS data in Lucene indexes
Transforms annotation objects of a CAS into Lucene token streams, which are stored in a Lucene document
Apache UIMA - Tools
JCasGen
PEAR Installer, Merger, Packager
Component Descriptor Editor
CPE Configurator
Java Annotation Viewer
CAS Visual Debugger
Document Analyzer
Apache UIMA
We can aggregate existing components or write and deploy our own new ones
There are lots of repositories for UIMA containing open source analysis engines, type systems, etc...
We do, though, have to know our domain well enough
Please mind the “false positives” issue
References
http://www.apache.org
http://uima.apache.org
http://www.oasis-open.org
http://www.cnlp.org/publications/03NLP.LIS.Encyclopedia.pdf
http://nlp.stanford.edu/
http://www.opencalais.com/gnosis/
http://www.dsi.unive.it/~marin/docs/hmm-it.pdf
http://en.wikipedia.org/wiki/Hidden_Markov_model