Apache UIMA - Hands on code

Apache UIMA - hands on code Gestione delle Informazioni su Web - 2010/2011 Tommaso Teofili tommaso [at] apache [dot] org

description

A lesson about real UIMA use cases, integration with search engines, and a short hands-on coding session


Page 1: Apache UIMA - Hands on code

Apache UIMA - hands on code

Gestione delle Informazioni su Web - 2010/2011

Tommaso Teofili

tommaso [at] apache [dot] org

Page 2: Apache UIMA - Hands on code

Use Cases - Agenda

UC1 : Real Estate market analysis

UC2 : Tenders automatic information extraction

UIMA & search engines

Tutorial

Assignment

Page 3: Apache UIMA - Hands on code

UC1 : Source

An online announcement site for sellers and buyers

General purpose (cars, real estate, hi-fi, etc.)

Local scope (Rome and nearby)

Page 4: Apache UIMA - Hands on code

UC1 - Goals

Track real estate market in order to:

Make smart decisions

Predict how things will go in the (near) future

Estate listing text is unstructured

Aggregate queries for statistical analysis need structured information

Page 5: Apache UIMA - Hands on code

UC1 - Source

Page 6: Apache UIMA - Hands on code

UC1 - Blocks

Page 7: Apache UIMA - Hands on code

UC1 - Crawler

A specialized crawler extracts data from the source

Estate listing data is stored in files, grouped by zone, in a directory on a managed machine

Site navigation is defined with one XML file for each city zone

The crawler downloads page fragments twice a week

The free text extracted from the estate listings is saved as XML, grouped by zone

Page 8: Apache UIMA - Hands on code

UC1 - Crawler

Issues :

Cookies had to be enabled

Some HTTP headers were required

Fixed sleep intervals were needed between requests
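The three crawler issues above can be sketched with the JDK's own HTTP client. This is only an illustration, not the crawler used in the project: the class name, header values and the five-second pause are assumptions, since the slides do not say which headers or delay were actually used.

```java
import java.io.IOException;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

class PoliteFetcher {
    // Hypothetical values, chosen only for the example.
    static final String USER_AGENT = "Mozilla/5.0 (compatible; example-crawler)";
    static final Duration PAUSE = Duration.ofSeconds(5);

    // Issue 1: the client keeps cookies between requests.
    static HttpClient newClient() {
        return HttpClient.newBuilder()
                .cookieHandler(new CookieManager(null, CookiePolicy.ACCEPT_ALL))
                .build();
    }

    // Issue 2: every request carries the headers the site expects.
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", USER_AGENT)
                .header("Accept-Language", "it-IT,it;q=0.8")
                .GET()
                .build();
    }

    // Issue 3: a fixed pause separates consecutive requests.
    static List<String> fetchAll(HttpClient client, List<String> urls)
            throws IOException, InterruptedException {
        List<String> bodies = new ArrayList<>();
        for (int i = 0; i < urls.size(); i++) {
            if (i > 0) {
                Thread.sleep(PAUSE.toMillis());
            }
            HttpResponse<String> response =
                    client.send(buildRequest(urls.get(i)), HttpResponse.BodyHandlers.ofString());
            bodies.add(response.body());
        }
        return bodies;
    }
}
```

A caller would create one client with `PoliteFetcher.newClient()` and reuse it for all the pages of a zone, so that cookies set by the first response are sent with the following requests.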

Page 9: Apache UIMA - Hands on code

UC1 - Domain

Announcement

Zone

MagazineNumber

HouseStructure (with properties)

Page 10: Apache UIMA - Hands on code

UC1 - Information Extraction Engine

Goal : extract price, zone and telephone number

The first version used huge regular expressions

Hard to maintain and inefficient

Poor extraction

Page 11: Apache UIMA - Hands on code

UC1 - IE Engine

New requirements: extract the structure of the house

Number of rooms, garage (box), garden(s), external spaces, number of bathrooms, kitchen, etc.

Track more fine grained zones

Page 12: Apache UIMA - Hands on code

Sample text

“ven 26 Dic APPIA via grottaferrata metro 2° piano assolato ingresso salone americana cucina camera cameretta bagno soppalco posto auto € 295.000”
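To make the extraction goal concrete, here is a minimal regular-expression sketch that pulls the price and the zone out of the sample listing. It is not the engine described in these slides. It assumes that the garbled characters in the sample stand for "°" and "€", that prices use dots as thousands separators, and that the zone is the first all-uppercase word; the class and method names are made up for the example.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ListingRegexSketch {
    // Sample listing from the slide, with the degree and euro signs written as escapes.
    static final String SAMPLE =
            "ven 26 Dic APPIA via grottaferrata metro 2\u00b0 piano assolato ingresso salone "
            + "americana cucina camera cameretta bagno soppalco posto auto \u20ac 295.000";

    // Assumption: a euro sign followed by digits grouped with dots.
    static final Pattern PRICE = Pattern.compile("\u20ac\\s*(\\d{1,3}(?:\\.\\d{3})*)");

    // Assumption: the zone is the first word made only of uppercase letters.
    static final Pattern ZONE = Pattern.compile("\\b[A-Z]{2,}\\b");

    // Returns the price in euros, if a price-like token is present.
    static Optional<Integer> price(String text) {
        Matcher m = PRICE.matcher(text);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(Integer.parseInt(m.group(1).replace(".", "")));
    }

    // Returns the first uppercase word, if any.
    static Optional<String> zone(String text) {
        Matcher m = ZONE.matcher(text);
        return m.find() ? Optional.of(m.group()) : Optional.empty();
    }
}
```

Even this small sketch hints at why the regex-only version was hard to maintain: each new field (rooms, bathrooms, garage) needs another hand-tuned pattern, and the patterns know nothing about each other.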

Page 13: Apache UIMA - Hands on code

UC1 - ContentAnnotator

From the XML produced by the crawler only estate listings must be extracted

A simple parser gets each node containing an estate listing (whose text is itself unstructured)

Create a ContentAnnotation over the document
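The idea of marking each listing with an annotation can be sketched without UIMA, using plain begin/end offsets over a document text. Everything here is an assumption made for illustration: the `listing` element name, the `ContentSpans` and `Span` classes, and the choice to join listings with newlines. In UIMA the spans would be ContentAnnotation instances added to the CAS.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ContentSpans {
    // A begin/end pair over the document text, mimicking what an annotation stores.
    static final class Span {
        final int begin;
        final int end;

        Span(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }

        String coveredText(String documentText) {
            return documentText.substring(begin, end);
        }
    }

    final String text;
    final List<Span> spans;

    ContentSpans(String text, List<Span> spans) {
        this.text = text;
        this.spans = spans;
    }

    // Collects the text of every <listing> node (hypothetical element name)
    // into one document text and records one span per listing.
    static ContentSpans fromXml(String xml) {
        NodeList nodes;
        try {
            nodes = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getElementsByTagName("listing");
        } catch (Exception e) {
            throw new IllegalArgumentException("invalid crawler XML", e);
        }
        StringBuilder document = new StringBuilder();
        List<Span> spans = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            int begin = document.length();
            document.append(nodes.item(i).getTextContent().trim());
            spans.add(new Span(begin, document.length()));
            document.append('\n');
        }
        return new ContentSpans(document.toString(), spans);
    }
}
```

Later annotators can then work inside each span only, so that a price found in one listing is never attributed to its neighbour.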

Page 14: Apache UIMA - Hands on code

ContentAnnotation

Page 15: Apache UIMA - Hands on code

UC1 - Entities

Page 16: Apache UIMA - Hands on code

UC1 - ZoneAnnotation

Page 17: Apache UIMA - Hands on code

UC1 - Consuming extracted information

The previous version of the IE engine produced XML files that had to be parsed again to store structured data in the DB

With UIMA, a CAS Consumer at the end of the analysis pipeline can write the extracted information to the DB automatically

Page 18: Apache UIMA - Hands on code

UIMA - CAS Consumer

Analysis Engine responsible for consuming information contained inside the CAS

Can write extracted information to:

DBMS

Lucene index

Filesystem

...
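The role of a consumer at the end of the pipeline can be illustrated with a small stand-in written in plain Java. This is not the UIMA CAS Consumer API: the `ResultConsumer` interface and the tab-separated output are hypothetical, and the extracted fields arrive as a simple map rather than as annotations read from a CAS.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.List;
import java.util.Map;

class ConsumerSketch {
    // Hypothetical stand-in for a consumer: it receives the fields extracted
    // from one document and stores them somewhere (file, DB, index).
    interface ResultConsumer {
        void consume(Map<String, String> fields);
    }

    // A consumer that writes one tab-separated line per document,
    // with columns in a fixed order and empty cells for missing fields.
    static ResultConsumer tsvWriter(Writer out, List<String> columns) {
        return fields -> {
            StringBuilder line = new StringBuilder();
            for (int i = 0; i < columns.size(); i++) {
                if (i > 0) {
                    line.append('\t');
                }
                line.append(fields.getOrDefault(columns.get(i), ""));
            }
            line.append('\n');
            try {
                out.write(line.toString());
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        };
    }
}
```

Swapping the sink (a JDBC insert or a Lucene document instead of a line of text) changes only the consumer, while the annotators before it stay untouched.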

Page 19: Apache UIMA - Hands on code

UC1 - Analysis Graphs

Page 20: Apache UIMA - Hands on code

UC1 - Analysis Graphs

Page 21: Apache UIMA - Hands on code

UC2 - Monitor of EU announcements

Monitor various sources which provide announcements and tenders

Automate the long monitoring process of such sources and automatically extract useful common information from announcements’ texts

Page 22: Apache UIMA - Hands on code

UC2 - Blocks

Page 23: Apache UIMA - Hands on code

Different input texts

Page 24: Apache UIMA - Hands on code

Different input texts

Page 25: Apache UIMA - Hands on code

Different input texts

Page 26: Apache UIMA - Hands on code

UC2 - Domain annotations

Language

Abstract

Activity

Beneficiary

Budget

Expiration date

Funding type

Geographic region

Sector

Subject

Title

Tags

Page 27: Apache UIMA - Hands on code

UC2 - Domain entities

First and most important is an entity that represents the entire tender or announcement

The domain annotations eventually fill in the properties of this entity

Page 28: Apache UIMA - Hands on code

UC2 - Simple first

Each annotator first looks:

if some metadata was extracted during navigation

for the most common pattern for defining information inside such announcements

e.g. “Budget: 200000$” or “Financial information: ......”

Such patterns are common in different languages
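The "simple first" lookup order can be sketched as follows: use the metadata collected during navigation when it exists, otherwise fall back to a "Label: value" pattern in the text. The class and method names are hypothetical, and the assumption that the label sits at the start of a line is mine, not something stated in the slides.

```java
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LabeledFieldSketch {
    // Finds a "Label: value" line in the text, ignoring case, and returns the value.
    static Optional<String> valueOf(String label, String text) {
        Pattern p = Pattern.compile(
                "(?im)^\\s*" + Pattern.quote(label) + "\\s*:\\s*(.+?)\\s*$");
        Matcher m = p.matcher(text);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    // Step 1: metadata gathered during navigation wins.
    // Step 2: otherwise look for the common "Label: value" pattern.
    static Optional<String> extract(String label,
                                    Map<String, String> navigationMetadata,
                                    String text) {
        String fromNavigation = navigationMetadata.get(label);
        if (fromNavigation != null && !fromNavigation.isBlank()) {
            return Optional.of(fromNavigation);
        }
        return valueOf(label, text);
    }
}
```

For example, `LabeledFieldSketch.valueOf("Budget", "Budget: 200000$")` yields the raw value `200000$`; splitting amount and currency would be a further step left out here.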

Page 29: Apache UIMA - Hands on code

UC2 - AbstractAnnotator

The abstract is usually in the first part of the document

We use Tokenizer and Tagger to get Tokens (with PoS tags) and Sentences

We use a dictionary of “good” words and linguistic patterns

We search the first sentences of the document for the objectives of the announcement
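A much simplified version of this heuristic is sketched below: split the text into sentences, look only at the first few, and keep the one with the most dictionary hits. The word list, the naive sentence splitter and the class name are assumptions; the real annotator relies on a Tokenizer and a PoS Tagger, and also on linguistic patterns, none of which are modelled here.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.Set;

class AbstractSketch {
    // Hypothetical dictionary of "good" words that signal an objective.
    static final Set<String> GOOD_WORDS =
            Set.of("aim", "aims", "objective", "objectives", "purpose", "goal");

    // Returns the sentence, among the first `firstSentences`, with the most
    // dictionary hits; empty if none of them contains a good word.
    static Optional<String> findAbstract(String text, int firstSentences) {
        // Naive splitter: break after '.', '!' or '?' followed by whitespace.
        String[] sentences = text.trim().split("(?<=[.!?])\\s+");
        String best = null;
        int bestScore = 0;
        int limit = Math.min(firstSentences, sentences.length);
        for (int i = 0; i < limit; i++) {
            int score = 0;
            // Naive tokenizer: lowercase, split on anything that is not a letter.
            for (String token : sentences[i].toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
                if (GOOD_WORDS.contains(token)) {
                    score++;
                }
            }
            if (score > bestScore) {
                bestScore = score;
                best = sentences[i];
            }
        }
        return Optional.ofNullable(best);
    }
}
```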

Page 30: Apache UIMA - Hands on code

UC2 - ExpirationDateAnnotator

A DateAnnotator runs first

Iterate over DateAnnotations

Get sentences wrapping such DateAnnotations

Check if some terms or patterns like “the deadline is ...” appear near a DateAnnotation
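The steps above can be sketched in plain Java. A regular expression plays the role of the DateAnnotator, the sentence around each date is found by looking for the nearest full stops, and a date is accepted when a cue term appears in that sentence. The date format (dd/MM/yyyy), the cue list and the class name are assumptions made for the example.

```java
import java.util.List;
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DeadlineSketch {
    // Stand-in for the DateAnnotator: assumes dates written as dd/MM/yyyy.
    static final Pattern DATE = Pattern.compile("\\b\\d{1,2}/\\d{1,2}/\\d{4}\\b");

    // Hypothetical cue terms that mark a date as the expiration date.
    static final List<String> CUES = List.of("deadline", "expires", "closing date");

    // Returns the first date whose enclosing sentence contains a cue term.
    static Optional<String> expirationDate(String text) {
        Matcher m = DATE.matcher(text);
        while (m.find()) {
            // Sentence wrapping the date: from the previous '.' to the next '.'.
            int begin = text.lastIndexOf('.', m.start()) + 1;
            int end = text.indexOf('.', m.end());
            if (end < 0) {
                end = text.length();
            }
            String sentence = text.substring(begin, end).toLowerCase(Locale.ROOT);
            for (String cue : CUES) {
                if (sentence.contains(cue)) {
                    return Optional.of(m.group());
                }
            }
        }
        return Optional.empty();
    }
}
```

Working on top of date spans found earlier keeps this step small: it only decides which of the dates is the deadline and never has to recognise dates itself.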

Page 31: Apache UIMA - Hands on code

UC2 - BandoEntity

Page 32: Apache UIMA - Hands on code

UIMA & Search Engines

Decorate documents with automatically extracted metadata to improve search experience

relevance

results

clustering

Page 33: Apache UIMA - Hands on code

Information Retrieval and Named Entities

Page 34: Apache UIMA - Hands on code

UIMA & Search Engines

“Push” scenario:

documents are sent to UIMA, which extracts metadata and writes it to the index with a CAS Consumer

“Pull” scenario:

documents are sent to Lucene which asks UIMA to extract metadata for it and then Lucene itself writes them to the index

“On demand” scenario:

metadata are extracted only on demand, each time a document is retrieved or shown

Page 35: Apache UIMA - Hands on code

UIMA - tutorial

create a Type System

create an Analysis Engine descriptor

create a simple Annotator

Page 36: Apache UIMA - Hands on code

Assignment

Named Entities Recognition

sport: person, player, coach, team, competition

videogames: person, videogame character, videogame, software house, hardware requirement

Precision & Recall test
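For the precision and recall test, a minimal sketch of the computation is given below. It assumes that each entity is represented as a string (for example a type plus its offsets) and that a predicted entity counts as correct only when it matches a gold entity exactly; the class name and this representation are choices made for the example.

```java
import java.util.HashSet;
import java.util.Set;

class PrecisionRecall {
    // Entities found by the annotator that are also in the gold standard.
    static int truePositives(Set<String> gold, Set<String> predicted) {
        Set<String> common = new HashSet<>(gold);
        common.retainAll(predicted);
        return common.size();
    }

    // Precision: correct predictions divided by all predictions.
    static double precision(Set<String> gold, Set<String> predicted) {
        if (predicted.isEmpty()) {
            return 0.0;
        }
        return (double) truePositives(gold, predicted) / predicted.size();
    }

    // Recall: correct predictions divided by all gold entities.
    static double recall(Set<String> gold, Set<String> predicted) {
        if (gold.isEmpty()) {
            return 0.0;
        }
        return (double) truePositives(gold, predicted) / gold.size();
    }
}
```

With four gold entities and three predictions, two of which are correct, precision is 2/3 and recall is 2/4.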