Apache UIMA and Metadata Generation

Apache UIMA and Metadata Generation. Information Management on the Web (Gestione delle Informazioni su Web), 2009/2010. Tommaso Teofili, tommaso [at] apache [dot] org. Wednesday, 14 April 2010.

description

Slides giving an overview of Apache UIMA and how it can be used for metadata generation, presented in the "Information management on the Web" course at DIA (Computer Science Department) of Roma Tre University

Transcript of Apache UIMA and Metadata Generation

Page 1: Apache UIMA and Metadata Generation

Apache UIMA and Metadata Generation

Information Management on the Web (Gestione delle Informazioni su Web) - 2009/2010

Tommaso Teofili

tommaso [at] apache [dot] org

Wednesday, 14 April 2010

Page 2: Apache UIMA and Metadata Generation

Agenda

Unstructured information management

The ASF

Apache UIMA

Goals

Overview

Components

Usage

Page 3: Apache UIMA and Metadata Generation

UIM?

Unstructured Information Management

A wide topic: text, audio, video

Different (possibly mixed) approaches (NLP, Machine Learning, IR, Ontologies, Automated reasoning, Knowledge Sources)

Apache UIMA

Page 4: Apache UIMA and Metadata Generation

Apache Software Foundation

Non-profit corporation

“...provides organizational, legal, and financial support for a broad range of open source software projects...”

“...collaborative and meritocratic development process...”

“...pragmatic Apache License...”

Page 5: Apache UIMA and Metadata Generation

Apache UIMA

Architectural framework to manage unstructured data (Java, C++)

Just graduated to an Apache Top-Level Project

Former IBM research project donated to the ASF

OASIS Standard

Page 6: Apache UIMA and Metadata Generation

Apache UIMA - Goals

“Our goal is to support a thriving community of users and developers of UIMA frameworks, tools, and annotators, facilitating the analysis of unstructured content such as text, audio and video”

Page 7: Apache UIMA and Metadata Generation

Apache UIMA - bridging worlds

Page 8: Apache UIMA and Metadata Generation

Apache UIMA - Overview

UIMA supports the development, discovery, composition and deployment of multi-modal analytics for the analysis of unstructured information and its integration with search technologies

Page 9: Apache UIMA and Metadata Generation

Apache UIMA - Multimodal Analysis

Multimodal analysis means the ability to process a resource from various “points of view”

Example: a video stream from which we want to extract subtitles and also automatically recognize the actors involved

Here, though, we are mainly interested in text...

Page 10: Apache UIMA and Metadata Generation

Sample scenario

Content Management System containing free text articles about movies

We want such articles to be automatically enriched with metadata contained inside the text (movies, directors, actors/actresses, distribution) and linked to “similar” articles (e.g. articles dealing with the same movies or actors)

So that we can search for “similar” articles

Page 11: Apache UIMA and Metadata Generation

Sample scenario - articles about movies

Page 12: Apache UIMA and Metadata Generation

Sample scenario

UIMA can help enrich articles with metadata

Think of filling the instance variables of an Article.java object with proper values

Then persisting it to a database to query articles dealing with the same actors
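
As a rough illustration of the idea above, the class below sketches what such an Article.java might look like. The slides do not show the actual class, so the field names (text, movies, directors, actors) and the sharesActorWith helper are assumptions made for this example only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical domain object: field names are illustrative, not taken from the slides.
class Article {
    private final String text;                                 // free text of the article
    private final List<String> movies = new ArrayList<>();     // metadata to be filled by the analysis
    private final List<String> directors = new ArrayList<>();
    private final List<String> actors = new ArrayList<>();

    Article(String text) {
        this.text = text;
    }

    String getText() { return text; }
    List<String> getMovies() { return movies; }
    List<String> getDirectors() { return directors; }
    List<String> getActors() { return actors; }

    // Two articles count as "similar" here if they mention at least one common actor.
    boolean sharesActorWith(Article other) {
        for (String actor : actors) {
            if (other.actors.contains(actor)) {
                return true;
            }
        }
        return false;
    }
}
```

Once the lists are filled, a query for articles dealing with the same actors reduces to a comparison such as sharesActorWith, or to the equivalent database query after persisting the object.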

Page 13: Apache UIMA and Metadata Generation

Filling Article with metadata

Page 14: Apache UIMA and Metadata Generation

Sample scenario - metadata

Page 15: Apache UIMA and Metadata Generation

UIMA - Annotations and Entities

Page 16: Apache UIMA and Metadata Generation

Apache UIMA - Annotation

The association of metadata, such as a label, with a region of text (or another type of artifact).

For example, the label “Person” associated with a region of text “Fred Center” constitutes an annotation. We say “Person” annotates the span of text from X to Y containing exactly “Fred Center”
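
The definition above can be made concrete with a small stand-in class. This is only a sketch of the concept (a label plus begin and end offsets); the class name SpanLabel and its method are invented here and are not the UIMA Annotation class.

```java
// Minimal stand-in for an annotation: a label plus begin/end offsets into the text.
// This is NOT the UIMA Annotation class, only an illustration of the concept.
class SpanLabel {
    final String label;
    final int begin;   // offset of the first annotated character (inclusive)
    final int end;     // offset one past the last annotated character (exclusive)

    SpanLabel(String label, int begin, int end) {
        this.label = label;
        this.begin = begin;
        this.end = end;
    }

    // Returns the region of the document that this label annotates.
    String coveredText(String document) {
        return document.substring(begin, end);
    }
}
```

For a document containing “Fred Center”, a SpanLabel with label “Person” whose offsets are the X and Y of that substring returns exactly “Fred Center” from coveredText.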

Page 17: Apache UIMA and Metadata Generation

Apache UIMA - Basic Steps

Domain model definition

Analysis pipeline definition

Arrange components:

Define components that read data from sources

Add and customize analysis components: Patterns, Dictionaries, RegEx, External services, NLP, etc...

Define components that write the extracted information to target storage

Analysis pipeline(s) execution

Page 18: Apache UIMA and Metadata Generation

Defining domain model within UIMA using Type Systems

Type System is the place where we describe which metadata we would like to extract

Low representational gap

Like almost everything in UIMA: described (and generated!) using XML

Possible to define multiple Type Systems for different purposes

Page 19: Apache UIMA and Metadata Generation

Defining domain model within UIMA using Type Systems

Define at least one Type in the Type System for each object in the domain model

It is useful to define more fine-grained Types (for the values of type properties, called Features)

If we want to extract information about articles, we create an Article type inside the Type System

We’ll also need to create annotations/entities for movies, actors, directors, etc...

Types usually extend Annotation or TOP

Page 20: Apache UIMA and Metadata Generation

Type System for Articles

Page 21: Apache UIMA and Metadata Generation

How does UIMA extract metadata?

Page 22: Apache UIMA and Metadata Generation

Apache UIMA - Analysis Engines

Basic UIMA building blocks

Analyze a document

Infer and record descriptive attributes (about documents/regions)

Generate analysis results

Page 23: Apache UIMA and Metadata Generation

Apache UIMA - AEs

Analysis Engines are described by a descriptor (XML)

Can be Primitive (a single AE) or Aggregate (a pipeline of AEs)

Analysis algorithms can be switched by changing the descriptor instead of the code

Contain Type System definitions

Define Capabilities

Page 24: Apache UIMA and Metadata Generation

Apache UIMA - AnalysisComponent API

initialize: performs (once) any startup tasks required by this component

process: processes the resource to analyze, generating analysis results (metadata)

destroy: frees all resources held; called only once, when the framework has finished using this component
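
The lifecycle described above can be sketched with a simplified interface. The names LifecycleComponent and LengthRecorder are hypothetical; the real AnalysisComponent interface works on a UIMA context and a CAS and declares further methods, so this is only an illustration of the call order.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, hypothetical stand-in for the component lifecycle described above.
// The real UIMA interface has different parameter types and additional methods.
interface LifecycleComponent {
    void initialize();              // called once before any document is processed
    void process(String document);  // called once per document
    void destroy();                 // called once when the component is no longer needed
}

// Toy component that records the length of every document it processes.
class LengthRecorder implements LifecycleComponent {
    private List<Integer> lengths;  // allocated in initialize, released in destroy

    public void initialize() {
        lengths = new ArrayList<>();
    }

    public void process(String document) {
        lengths.add(document.length());
    }

    public void destroy() {
        lengths = null;
    }

    List<Integer> getLengths() {
        return lengths;
    }
}
```

A caller would invoke initialize once, process once per document, and destroy once at the end, which mirrors the contract of the three methods listed above.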

Page 25: Apache UIMA and Metadata Generation

Apache UIMA - Annotators

Analysis Engine algorithm

Annotator : A software component implemented to produce and record annotations over regions of an artifact (e.g., text document, audio, and video)

Annotators implement the AnalysisComponent interface

Page 26: Apache UIMA and Metadata Generation

Apache UIMA - Roles

AnalysisEngine : High level block responsible for analysis - contains at least one AnalysisComponent

AnalysisComponent : interface for any component responsible for analyzing artifacts

Annotator : implementation of AnalysisComponent responsible for creating Annotations

Page 27: Apache UIMA and Metadata Generation

Apache UIMA - AEs

Page 28: Apache UIMA and Metadata Generation

Analysis Engines in a Pipeline

Page 29: Apache UIMA and Metadata Generation

Apache UIMA - Analysis Results

Where do analysis results end up?

How do annotators represent and share their results?

CAS - Common Analysis Structure

Maintains typed indexes of extracted results
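
To give a feeling for what “typed indexes of extracted results” means, here is a toy container holding one document plus results grouped by type name. MiniCas and its methods are invented for this sketch; the real CAS API is considerably richer and differs in its details.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy container inspired by the CAS idea: one document plus results indexed by type.
// The class and method names are invented for this sketch and are not the UIMA CAS API.
class MiniCas {
    private final String documentText;
    // Maps a type name to the list of {begin, end} offsets recorded for that type.
    private final Map<String, List<int[]>> index = new HashMap<>();

    MiniCas(String documentText) {
        this.documentText = documentText;
    }

    String getDocumentText() { return documentText; }

    // Records that the region [begin, end) of the document has the given type.
    void add(String type, int begin, int end) {
        List<int[]> spans = index.get(type);
        if (spans == null) {
            spans = new ArrayList<>();
            index.put(type, spans);
        }
        spans.add(new int[] {begin, end});
    }

    // Returns the text covered by every result of the given type, in insertion order.
    List<String> coveredTexts(String type) {
        List<int[]> spans = index.get(type);
        if (spans == null) {
            return Collections.emptyList();
        }
        List<String> texts = new ArrayList<>();
        for (int[] span : spans) {
            texts.add(documentText.substring(span[0], span[1]));
        }
        return texts;
    }
}
```

One annotator can add “Person” results and a later one can read them back by type, which is the sharing role the CAS plays between analysis engines.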

Page 30: Apache UIMA and Metadata Generation

Common Analysis Structure

Page 31: Apache UIMA and Metadata Generation

Which algorithms lie under AEs?

Page 32: Apache UIMA and Metadata Generation

Apache UIMA & NLP

NLP (Natural Language Processing) is a theoretically motivated range of computational techniques for analyzing and representing naturally occurring texts at one or more levels of linguistic analysis for the purpose of achieving human-like language processing for a range of tasks or applications

It’s an AI discipline

Page 33: Apache UIMA and Metadata Generation

Apache UIMA & NLP

“accomplish human-like language processing”

Paraphrase an input text

Translate the text into another language

Answer questions about the contents of the text

Draw inferences from the text <--

Page 34: Apache UIMA and Metadata Generation

Apache UIMA & NLP

“an NLP-based IR system has the goal of providing more precise, complete information in response to a user’s real information need”

various levels of processing

that’s where we are!

Page 35: Apache UIMA and Metadata Generation

Apache UIMA - First Approaches

Simplest : Write RegEx and Dictionaries and mix them together

NLP-like : Tokenize -> Sentence identification -> PoS Tagging -> Custom (Domain specific) structures
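
The “simplest” approach can be sketched in plain Java as a dictionary lookup combined with a regular expression. The dictionary entries, the year pattern and the class name SimpleExtractor are made-up examples, and no UIMA component is involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of the "simplest" approach: a dictionary lookup combined with a regular expression.
// The dictionary entries and the year pattern are made-up examples.
class SimpleExtractor {
    private static final List<String> MOVIE_DICTIONARY =
            Arrays.asList("Back to the Future", "Blade Runner");
    private static final Pattern YEAR = Pattern.compile("\\b(19|20)\\d{2}\\b");

    // Returns every dictionary entry that occurs verbatim in the text.
    static List<String> findMovies(String text) {
        List<String> found = new ArrayList<>();
        for (String title : MOVIE_DICTIONARY) {
            if (text.contains(title)) {
                found.add(title);
            }
        }
        return found;
    }

    // Returns every standalone four-digit number from 1900 to 2099 found in the text.
    static List<String> findYears(String text) {
        List<String> years = new ArrayList<>();
        Matcher matcher = YEAR.matcher(text);
        while (matcher.find()) {
            years.add(matcher.group());
        }
        return years;
    }
}
```

Mixing the two results (for example, pairing a movie title with a year found in the same sentence) is where the domain-specific logic would go.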

Page 36: Apache UIMA and Metadata Generation

Analysis Engines in a Pipeline

Page 37: Apache UIMA and Metadata Generation

Sample scenario - extract actors

Tokenize article text

Identify sentences

Tag PoS

Identify Persons using regular expressions and PoS

Use Person annotations, Tokens’ PoS and Sentences to extract relations between terms to identify Persons who are also Actors

Page 38: Apache UIMA and Metadata Generation

Sample scenario - PersonAnnotator

I have a dictionary of names (simple to find and/or build)

I use a DictionaryAnnotator to extract NameAnnotations

I don’t have a dictionary of surnames

Every time a matching name (a NameAnnotation) is found, we look for one or more subsequent tokens (to account for persons with a double name or surname) whose PoS is “undefined” or a noun (but not a verb) and which start with an uppercase letter

If found, then the name + token(s) sequence annotates a Person (e.g. “Michael J. Fox”)
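
The heuristic above can be approximated in plain Java. This sketch is not the PersonAnnotator from the slides: the name dictionary is a made-up sample, tokens are split on whitespace, and the PoS check is left out entirely, so only the dictionary match and the uppercase test are shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of the PersonAnnotator heuristic: a known first name followed by capitalized tokens.
// The name dictionary is a made-up sample, and the PoS check from the slides is left out.
class PersonHeuristic {
    private static final Set<String> FIRST_NAMES =
            new HashSet<>(Arrays.asList("Michael", "Christopher", "Lea"));

    static List<String> findPersons(String text) {
        String[] tokens = text.split("\\s+");
        List<String> persons = new ArrayList<>();
        int i = 0;
        while (i < tokens.length) {
            if (!FIRST_NAMES.contains(tokens[i])) {
                i++;
                continue;
            }
            StringBuilder person = new StringBuilder(tokens[i]);
            int j = i + 1;
            // Extend the candidate while the following tokens start with an uppercase letter.
            while (j < tokens.length && Character.isUpperCase(tokens[j].charAt(0))) {
                String cleaned = stripTrailingPunctuation(tokens[j]);
                person.append(' ').append(cleaned);
                boolean punctuationRemoved = !cleaned.equals(tokens[j]);
                j++;
                if (punctuationRemoved) {
                    break;  // a comma or full stop ends the name
                }
            }
            if (j > i + 1) {
                persons.add(person.toString());  // name + at least one following token
            }
            i = j;
        }
        return persons;
    }

    // Removes trailing punctuation but keeps the dot of initials such as "J.".
    private static String stripTrailingPunctuation(String token) {
        int end = token.length();
        while (end > 0 && ",;:!?)".indexOf(token.charAt(end - 1)) >= 0) {
            end--;
        }
        if (end > 2 && token.charAt(end - 1) == '.') {
            end--;
        }
        return token.substring(0, end);
    }
}
```

Without the PoS filter this version will happily accept any capitalized word after a known name, which is one way false positives creep into such rules.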

Page 39: Apache UIMA and Metadata Generation

PersonAnnotator sample

Page 40: Apache UIMA and Metadata Generation

Sample scenario - articles about movies

Page 41: Apache UIMA and Metadata Generation

Sample scenario

Getting actors can be simple if we know that Persons who are also actors perform some well-known actions

e.g.: a Person who “stars as” CharacterInTheMovie (which will eventually be tagged as a Person too) is also an Actor

e.g.: if the snippet “CharacterInTheMovie (Person)” exists, then Person is usually an Actor

Then we can build an ActorAnnotator
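
The two rules above can be sketched as follows. Plain string matching stands in for the annotation-based logic a real ActorAnnotator would use, the class name ActorHeuristic is invented, and the second rule is simplified: it only checks that the Person appears in parentheses, not that a character name precedes it.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the two ActorAnnotator rules, applied to persons that were already identified.
// Plain string matching stands in for the annotation-based logic of a real annotator.
class ActorHeuristic {
    static List<String> findActors(String text, List<String> persons) {
        List<String> actors = new ArrayList<>();
        for (String person : persons) {
            // Rule 1: Person "stars as" CharacterInTheMovie
            boolean starsAs = text.contains(person + " stars as ");
            // Rule 2 (simplified): CharacterInTheMovie (Person)
            boolean inParentheses = text.contains("(" + person + ")");
            if (starsAs || inParentheses) {
                actors.add(person);
            }
        }
        return actors;
    }
}
```

Given the Persons found in an article, the method keeps only those matched by one of the two patterns, so characters and directors tagged as Person are left out unless they happen to match.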

Page 42: Apache UIMA and Metadata Generation

Sample scenario

Page 44: Apache UIMA and Metadata Generation

Apache UIMA - Components

Type Systems

Analysis Engines

CAS

Collection Processing Manager/Engine

Flow Controllers

CAS Consumers

Asynchronous Scaleout

Sandbox Components

Eclipse Plugins

Tools

Page 45: Apache UIMA and Metadata Generation

Apache UIMA - Flow Controllers

A component which implements the interfaces needed to specify a custom flow within an Aggregate Analysis Engine

Enabling conditional pipelines
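
A conditional pipeline can be illustrated with a toy model in which each step runs only if the annotation type it needs is already available. Everything here (ConditionalFlow, Step, the type names) is invented for the sketch; it does not use the UIMA FlowController interfaces, and each step is assumed to always produce its output type when it runs.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy illustration of a conditional pipeline: a step runs only if its required type exists.
// All names are invented for this sketch; it does not use the UIMA FlowController interfaces.
class ConditionalFlow {
    static class Step {
        final String name;
        final String requires;  // annotation type that must already exist, or null for none
        final String produces;  // annotation type assumed to be added when the step runs

        Step(String name, String requires, String produces) {
            this.name = name;
            this.requires = requires;
            this.produces = produces;
        }
    }

    private final List<Step> steps = new ArrayList<>();

    ConditionalFlow addStep(String name, String requires, String produces) {
        steps.add(new Step(name, requires, produces));
        return this;
    }

    // Visits the steps in order and returns the names of those that were executed.
    List<String> run(Set<String> initialTypes) {
        Set<String> available = new HashSet<>(initialTypes);
        List<String> executed = new ArrayList<>();
        for (Step step : steps) {
            if (step.requires == null || available.contains(step.requires)) {
                executed.add(step.name);
                available.add(step.produces);
            }
        }
        return executed;
    }
}
```

With steps such as Tokenizer, PersonAnnotator and ActorAnnotator chained by their required types, a step whose input type never appears is simply skipped instead of failing.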

Page 46: Apache UIMA and Metadata Generation

Apache UIMA - CAS Consumers

Components responsible for taking the results from the CAS and storing them into a database, or other storage device

Page 47: Apache UIMA and Metadata Generation

Apache UIMA - Collection Processing and a bigger picture

Page 48: Apache UIMA and Metadata Generation

Apache UIMA - Asynchronous Scaleout

An add-on to the base Java framework, supporting a very flexible scaleout capability based on JMS (Java Message Service) and Apache ActiveMQ (a messaging and integration patterns provider)

A powerful clustering solution, very useful when the volume of source documents is huge

Page 49: Apache UIMA and Metadata Generation

Apache UIMA - Sandbox Basics

Tokenizer

HMM Tagger

Dictionaries (DictionaryAnnotator, ConceptMapper)

Snowball

ConfigurableFeatureExtractor

Page 50: Apache UIMA and Metadata Generation

Apache UIMA - External Services

External IE (information extraction) engines exposing web services can be integrated easily into UIMA:

AlchemyAPI Annotator

OpenCalais Annotator

Page 51: Apache UIMA and Metadata Generation

Apache UIMA - Tika

Apache Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries. The TikaAnnotator uses Tika to generate annotations representing the original markup of a document and to extract its text and metadata.

Page 52: Apache UIMA and Metadata Generation

Apache UIMA - Lucas

Very useful to build search engines!

Stores CAS data in Lucene indexes

Transforms annotation objects of a CAS into Lucene token streams, which are stored in a Lucene document

Page 53: Apache UIMA and Metadata Generation

Apache UIMA - Tools

JCasGen

PEAR Installer, Merger, Packager

Component Descriptor Editor

CPE Configurator

Java Annotation Viewer

CAS Visual Debugger

Document Analyzer

Page 54: Apache UIMA and Metadata Generation

Apache UIMA

We can aggregate existing components or write and deploy our own new ones

There are lots of repositories for UIMA containing open source analysis engines, type systems, etc...

We do, though, have to know our domain well enough

Please mind the “false positives” issue
