BeeSpace Informatics Research

BeeSpace Informatics Research. ChengXiang (“Cheng”) Zhai; Department of Computer Science, Institute for Genomic Biology, Statistics, Graduate School of Library & Information Science; University of Illinois at Urbana-Champaign. BeeSpace Workshop, May 22, 2009.


BeeSpace Informatics Research. ChengXiang (“Cheng”) Zhai; Department of Computer Science, Institute for Genomic Biology, Statistics, Graduate School of Library & Information Science; University of Illinois at Urbana-Champaign. BeeSpace Workshop, May 21, 2007. Overview of BeeSpace Technology.

Transcript of BeeSpace Informatics Research

Page 1: BeeSpace Informatics Research

BeeSpace Informatics Research

ChengXiang (“Cheng”) Zhai

Department of Computer Science

Institute for Genomic Biology

Statistics

Graduate School of Library & Information Science

University of Illinois at Urbana-Champaign

BeeSpace Workshop, May 22, 2009

Page 2: BeeSpace Informatics Research

Overview of BeeSpace Technology

[Architecture diagram. Component labels: Literature Text; Search Engine; Natural Language Understanding; Words/Phrases; Entities; Content Analysis; Relational Database; Meta Data; Text Miner; Space/Region Manager, Navigation Support; Space Navigation; Function Annotator; Gene Summarizer; …; Task Support; Users.]

Page 3: BeeSpace Informatics Research

Part 1: Content Analysis

Page 4: BeeSpace Informatics Research

Natural Language Understanding

…We have cloned and sequenced a cDNA encoding Apis mellifera ultraspiracle (AMUSP) and examined its responses to …

[Figure: the sentence above annotated with noun-phrase (NP) and verb-phrase (VP) chunk labels and two Gene labels.]

Page 5: BeeSpace Informatics Research

Sample Technique 1: Automatic Gene Recognition

• Syntactic clues:
  – Capitalization (especially acronyms)
  – Numbers (gene families)
  – Punctuation: -, /, :, etc.

• Contextual clues:
  – Local: surrounding words such as “gene”, “encoding”, “regulation”, “expressed”, etc.
  – Global: same noun phrase occurs several times in the same article
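The clues listed above can be turned into binary features for a candidate token. The following is a minimal sketch, not the BeeSpace implementation: the function name `clue_features`, the feature names, the window size, and the cue set are all assumptions made for illustration, and the global clue is simplified to exact token repetition rather than repeated noun phrases.

```python
import re

# Hypothetical cue list, taken from the local contextual clues on the slide.
CONTEXT_CUES = {"gene", "encoding", "regulation", "expressed"}


def clue_features(tokens, i, window=2):
    """Return binary clue features for the candidate token tokens[i]."""
    tok = tokens[i]
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    # Lower-cased words around the candidate, excluding the candidate itself.
    context = {t.lower() for j, t in enumerate(tokens[lo:hi], start=lo) if j != i}
    return {
        # Syntactic clues
        "starts_with_capital": tok[:1].isupper(),
        "is_acronym": len(tok) > 1 and tok.isupper(),
        "contains_digit": any(c.isdigit() for c in tok),
        "contains_punct": bool(re.search(r"[-/:]", tok)),
        # Contextual clues (local and a simplified global one)
        "context_cue": bool(context & CONTEXT_CUES),
        "repeated_in_article": sum(t == tok for t in tokens) > 1,
    }


tokens = "We have cloned a cDNA encoding Apis mellifera ultraspiracle ( AMUSP )".split()
amusp = clue_features(tokens, tokens.index("AMUSP"))
apis = clue_features(tokens, tokens.index("Apis"))
```

Here `amusp` fires the capitalization and acronym clues, while `apis` fires the local context clue because “encoding” is within two tokens.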

Page 6: BeeSpace Informatics Research

Maximum Entropy Model for Gene Tagging

• Given an observation (a token or a noun phrase), together with its context, denoted as x

• Predict y ∈ {gene, non-gene}

• Maximum entropy model:

$P(y \mid x) = K \exp\left(\sum_i \lambda_i f_i(x, y)\right)$, where K is a normalizing constant

• Typical f:
  – y = gene & candidate phrase starts with a capital letter
  – y = gene & candidate phrase contains digits

• Estimate $\lambda_i$ with training data
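To make the model concrete, here is a minimal sketch of how P(y|x) could be computed from the two typical features named on the slide. The weights are illustrative values chosen by hand, not estimated from data, and the dictionary keys used to describe x are hypothetical.

```python
import math

LABELS = ("gene", "non-gene")


def feature_vector(x, y):
    """Binary features f_i(x, y); x is a dict of observed clues (hypothetical keys)."""
    return [
        1.0 if y == "gene" and x["starts_with_capital"] else 0.0,
        1.0 if y == "gene" and x["contains_digit"] else 0.0,
    ]


def maxent_prob(x, weights):
    """P(y|x) = K * exp(sum_i lambda_i * f_i(x, y)); K normalizes over the labels."""
    scores = {
        y: math.exp(sum(lam * f for lam, f in zip(weights, feature_vector(x, y))))
        for y in LABELS
    }
    z = sum(scores.values())  # 1 / K
    return {y: s / z for y, s in scores.items()}


weights = [1.2, 0.8]  # illustrative lambda_i values, an assumption
probs = maxent_prob({"starts_with_capital": True, "contains_digit": True}, weights)
neutral = maxent_prob({"starts_with_capital": False, "contains_digit": False}, weights)
```

When no feature fires, both labels score exp(0) = 1 and the model is indifferent; each active feature shifts probability toward “gene” in proportion to its weight.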

Page 7: BeeSpace Informatics Research

Domain overfitting problem

• When a learning-based gene tagger is applied to a domain different from the training domain(s), the performance tends to decrease significantly.

• The same problem occurs in other types of text, e.g., named entities in news articles.

Training domain   Test domain   F1
mouse             mouse         0.541
fly               mouse         0.281
Reuters           Reuters       0.908
Reuters           WSJ           0.643

Page 8: BeeSpace Informatics Research

Observation I

• Overemphasis on domain-specific features in the trained model

• Example gene names from the fly domain: wingless, daughterless, eyeless, apexless, …

• The feature “suffix -less” is weighted high in the model trained on fly data

Page 9: BeeSpace Informatics Research

Observation II

• Generalizable features: generalize well in all domains
  – …decapentaplegic and wingless are expressed in analogous patterns in each primordium of… (fly)
  – …that CD38 is expressed by both neurons and glial cells…that PABPC5 is expressed in fetal brain and in a range of adult tissues. (mouse)

Page 10: BeeSpace Informatics Research

Observation II


• The feature “$w_{i+2}$ = expressed” (the word two positions after the candidate is “expressed”) is generalizable

Page 11: BeeSpace Informatics Research

Generalizability-based feature ranking

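[Figure: each training domain (fly, mouse, D3, …, Dm) has its own feature ranking with ranks 1–8; the features “expressed” and “-less” appear at different ranks in each domain. The per-domain rankings are combined into a single ranking; the values shown are 0.125 for “expressed” and 0.167 for “-less”.]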

Page 12: BeeSpace Informatics Research

Adapting Biological Named Entity Recognizer

[Flow diagram. Labels: training data T1 … Tm; individual domain feature ranking (O1 … Om); feature re-ranking (O′); generalizable features; domain-specific features; feature selection for D0, D1, …, Dm; top d0 features for D0, top d1 features for D1, …, top dm features for Dm; learning entity recognizer; testing; test data; E.]

• d features: $d = \lambda_0 d_0 + (1 - \lambda_0)(\lambda_1 d_1 + \dots + \lambda_m d_m)$

• Parameters: $\lambda_0, \lambda_1, \dots, \lambda_m$

Page 13: BeeSpace Informatics Research

Effectiveness of Domain Adaptation

Exp     Method     Precision   Recall    F1
F+M→Y   Baseline   0.557       0.466     0.508
        Domain     0.575       0.516     0.544
        % Imprv.   +3.2%       +10.7%    +7.1%
F+Y→M   Baseline   0.571       0.335     0.422
        Domain     0.582       0.381     0.461
        % Imprv.   +1.9%       +13.7%    +9.2%
M+Y→F   Baseline   0.583       0.097     0.166
        Domain     0.591       0.139     0.225
        % Imprv.   +1.4%       +43.3%    +35.5%

• Text data from BioCreAtIvE (Medline)
• 3 organisms (Fly, Mouse, Yeast)
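The “% Imprv.” figures are relative changes over the baseline. As a small worked check, the sketch below recomputes them for the F1 column from the rounded scores reported above; the function name `relative_improvement` is an assumption for illustration.

```python
def relative_improvement(baseline, adapted):
    """Percentage change of `adapted` over `baseline`."""
    return 100.0 * (adapted - baseline) / baseline


# (baseline F1, domain-adapted F1) as reported in the results above.
f1_scores = {
    "F+M→Y": (0.508, 0.544),
    "F+Y→M": (0.422, 0.461),
    "M+Y→F": (0.166, 0.225),
}

improvements = {
    exp: round(relative_improvement(base, adapted), 1)
    for exp, (base, adapted) in f1_scores.items()
}
for exp, pct in improvements.items():
    print(f"{exp}: {pct:+.1f}%")
```

The largest relative gain is for M+Y→F, where the baseline F1 is lowest, so a small absolute gain translates into a large percentage.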

Page 14: BeeSpace Informatics Research

Gene Recognition in V3

• A variation of the basic maximum entropy model
  – Classes: {Begin, Inside, Outside}
  – Features: syntactic features, POS tags, class labels of previous two tokens
  – Post-processing to exploit global features

• Leverage existing toolkit: BMR
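The {Begin, Inside, Outside} classes turn gene recognition into per-token labeling. The sketch below shows one way to encode gene spans as such labels and to read off the labels of the previous two tokens as features. It is an illustration only: the function names, the padding symbol `<s>`, and the example spans are assumptions, and it does not use the BMR toolkit.

```python
def bio_labels(tokens, gene_spans):
    """Label each token Begin/Inside/Outside given (start, end) spans, end exclusive."""
    labels = ["Outside"] * len(tokens)
    for start, end in gene_spans:
        labels[start] = "Begin"
        for i in range(start + 1, end):
            labels[i] = "Inside"
    return labels


def previous_label_features(labels, i):
    """Class labels of the two tokens before position i, padded at the start."""
    padded = ["<s>", "<s>"] + labels
    return {"label-1": padded[i + 1], "label-2": padded[i]}


tokens = "a cDNA encoding Apis mellifera ultraspiracle ( AMUSP )".split()
# Hypothetical gene spans chosen for illustration.
labels = bio_labels(tokens, [(3, 6), (7, 8)])
features_at_4 = previous_label_features(labels, 4)
```

At prediction time the previous two labels are the tagger's own earlier outputs, which is why the tokens have to be labeled left to right.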

Page 15: BeeSpace Informatics Research

Part 2: Navigation Support

Page 16: BeeSpace Informatics Research

Space-Region Navigation

[Diagram. Literature Spaces: Bee, Fly, Behavior, Bird, …. Topic Regions: Bee Forager, Bird Singing, Fly Rover, …. Operations between spaces and regions: MAP, EXTRACT, SWITCHING. Set operations on spaces and on regions: Intersection, Union, …. User collections: My Regions/Topics, My Spaces.]

Page 17: BeeSpace Informatics Research

MAP: Topic/Region → Space

• MAP: Use the topic/region description as a query to search a given space

• Retrieval algorithm:
  – Query word distribution: p(w|Q)
  – Document word distribution: p(w|D)
  – Score a document based on similarity of Q and D

• Leverage existing retrieval toolkits: Lemur/Indri

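$\mathrm{score}(Q, D) = -D(\theta_Q \,\|\, \theta_D) = -\sum_{w \in \text{Vocabulary}} p(w \mid \theta_Q) \log \frac{p(w \mid \theta_Q)}{p(w \mid \theta_D)}$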
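The scoring rule compares the query word distribution with each document word distribution using KL divergence. The following is a minimal sketch under stated assumptions: it does not call Lemur/Indri, the document model is smoothed with a simple linear interpolation against the collection (the slide does not say which smoothing is used), and the three toy documents are invented for illustration.

```python
import math
from collections import Counter


def language_model(text):
    """Maximum likelihood unigram distribution over whitespace tokens."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def smoothed_doc_model(doc_text, collection_model, alpha=0.5):
    """p(w|D) interpolated with the collection model (smoothing is an assumption)."""
    doc = language_model(doc_text)
    return lambda w: (1 - alpha) * doc.get(w, 0.0) + alpha * collection_model.get(w, 1e-9)


def kl_score(query_model, doc_prob):
    """score(Q, D) = -sum_w p(w|Q) * log(p(w|Q) / p(w|D)).

    Words with p(w|Q) = 0 contribute nothing, so only query words are summed.
    """
    return -sum(p * math.log(p / doc_prob(w)) for w, p in query_model.items())


docs = {  # toy documents, invented for illustration
    "d1": "honey bee foraging behavior and nectar collection",
    "d2": "actin filaments in insect flight muscle",
    "d3": "division of labor and foraging in the colony",
}
collection = language_model(" ".join(docs.values()))
query = language_model("foraging behavior")
ranking = sorted(
    docs,
    key=lambda d: kl_score(query, smoothed_doc_model(docs[d], collection)),
    reverse=True,
)
```

In MAP, the query would be the topic/region description, and the top-ranked documents form the mapped space.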

Page 18: BeeSpace Informatics Research

EXTRACT: Space → Topic/Region

• Assume k topics, each being represented by a word distribution

• Use a k-component mixture model to fit the documents in a given space (EM algorithm)

• The estimated k component word distributions are taken as k topic regions

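Likelihood:

$\log p(C \mid \Lambda) = \sum_{D \in C} \sum_{i=1}^{|D|} \log\left[\lambda \, p(D_i \mid \theta_B) + (1 - \lambda) \sum_{j=1}^{k} \pi_j \, p(D_i \mid \theta_j)\right]$

Maximum likelihood estimator: $\Lambda^* = \arg\max_{\Lambda} p(C \mid \Lambda)$

Bayesian estimator: $\Lambda^* = \arg\max_{\Lambda} p(\Lambda \mid C) = \arg\max_{\Lambda} p(C \mid \Lambda) \, p(\Lambda)$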
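The maximum likelihood estimate has no closed form, so it is fitted iteratively with EM. The sketch below is one possible reading of the mixture model, not the BeeSpace code. It assumes a fixed background weight `lam`, a background distribution set to collection word frequencies, and topic mixing weights estimated separately for each document; none of these choices is stated on the slide. The toy documents are invented.

```python
import math
import random


def em_topics(docs, k, lam=0.5, iters=50, seed=0):
    """Fit k topic word distributions to tokenized docs with EM.

    Per word w in document d:
        p(w) = lam * p(w|B) + (1 - lam) * sum_j pi[d][j] * p(w|theta_j)
    """
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    counts = [{w: d.count(w) for w in set(d)} for d in docs]
    total = sum(len(d) for d in docs)
    background = {w: sum(c.get(w, 0) for c in counts) / total for w in vocab}

    def normalize(weights):
        s = sum(weights.values())
        return {w: v / s for w, v in weights.items()}

    theta = [normalize({w: rng.random() + 0.1 for w in vocab}) for _ in range(k)]
    pi = [[1.0 / k] * k for _ in docs]
    history = []  # log-likelihood before each update
    for _ in range(iters):
        new_theta = [{w: 0.0 for w in vocab} for _ in range(k)]
        new_pi = [[0.0] * k for _ in docs]
        loglik = 0.0
        for d, c in enumerate(counts):
            for w, n in c.items():
                topic_part = [(1 - lam) * pi[d][j] * theta[j][w] for j in range(k)]
                p_w = lam * background[w] + sum(topic_part)
                loglik += n * math.log(p_w)
                for j in range(k):
                    # E-step: expected number of occurrences of w generated by topic j
                    resp = n * topic_part[j] / p_w
                    new_theta[j][w] += resp
                    new_pi[d][j] += resp
        # M-step: renormalize the expected counts
        theta = [normalize(t) for t in new_theta]
        pi = [[v / sum(row) for v in row] for row in new_pi]
        history.append(loglik)
    return theta, pi, history


docs = [  # toy tokenized documents, invented for illustration
    ["foraging", "nectar", "food", "foraging"],
    ["actin", "muscle", "filaments", "muscle"],
    ["nectar", "food", "foraging"],
    ["filaments", "actin", "muscle"],
]
theta, pi, history = em_topics(docs, k=2)
```

Each entry of `theta` is a word distribution that can be read as one topic region, and `history` records the log-likelihood, which EM never decreases.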

Page 19: BeeSpace Informatics Research

A Sample Topic & Corresponding Space

Word Distribution (language model):

filaments    0.0410238
muscle       0.0327107
actin        0.0287701
z            0.0221623
filament     0.0169888
myosin       0.0153909
thick        0.00968766
thin         0.00926895
sections     0.00924286
er           0.00890264
band         0.00802833
muscles      0.00789018
antibodies   0.00736094
myofibrils   0.00688588
flight       0.00670859
images       0.00649626

Meaningful labels:

• actin filaments
• flight muscle
• flight muscles

Example documents:

• actin filaments in honeybee-flight muscle move collectively
• arrangement of filaments and cross-links in the bee flight muscle z disk by image analysis of oblique sections
• identification of a connecting filament protein in insect fibrillar flight muscle
• the invertebrate myosin filament subfilament arrangement of the solid filaments of insect flight muscles
• structure of thick filaments from insect flight muscle

Page 20: BeeSpace Informatics Research

Incorporating Topic Priors

• Either topic extraction or clustering:
  – User exploration: the user usually has a preference.
  – E.g., the user wants one topic/cluster to be about foraging behavior

• Use a prior to guide topic extraction
  – Prior as a simple language model
  – E.g. forage 0.2; foraging 0.3; food 0.05; etc.

Page 21: BeeSpace Informatics Research

Incorporating a Topic Prior

[Figure: update formulas for the original EM and for EM with a prior; the terms contributed by the prior are labeled “Prior”.]
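The update formulas on this slide did not survive in the transcript. As a hedged illustration of the general idea only, the sketch below shows a common way to add a prior to the M-step: the prior language model contributes `mu` pseudo-counts to the expected word counts (a conjugate Dirichlet prior). This is an assumption about the form of the update, not a reconstruction of the slide's formulas, and the expected counts are invented.

```python
def m_step(expected_counts, prior=None, mu=0.0):
    """Re-estimate p(w|theta_j) from expected word counts.

    With prior=None this is the maximum likelihood update. Otherwise the prior
    language model adds mu * p(w|prior) pseudo-counts to each word (assumed
    Dirichlet prior), pulling the topic toward the prior's words.
    """
    prior = prior or {}
    vocab = set(expected_counts) | set(prior)
    total = sum(expected_counts.values()) + mu * sum(prior.values())
    return {
        w: (expected_counts.get(w, 0.0) + mu * prior.get(w, 0.0)) / total
        for w in vocab
    }


# Invented expected counts from a hypothetical E-step.
expected = {"age": 4.0, "colony": 3.0, "labor": 2.0, "division": 1.0}
prior = {"labor": 0.2, "division": 0.2}  # prior used for Sample Topic 1

ml_estimate = m_step(expected)
map_estimate = m_step(expected, prior, mu=10.0)
```

Larger `mu` makes the prior dominate; `mu = 0` recovers the original EM update.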

Page 22: BeeSpace Informatics Research

Incorporating Topic Priors: Sample Topic 1

Prior:

labor      0.2
division   0.2

Extracted word distribution:

age          0.0672687
division     0.0551497
labor        0.052136
colony       0.038305
foraging     0.0357817
foragers     0.0236658
workers      0.0191248
task         0.0190672
behavioral   0.0189017
behavior     0.0168805
older        0.0143466
tasks        0.013823
old          0.011839
individual   0.0114329
ages         0.0102134
young        0.00985875
genotypic    0.00963096
social       0.00883439

Page 23: BeeSpace Informatics Research

Incorporating Topic Priors: Sample Topic 2

Prior:

behavioral   0.2
maturation   0.2

Extracted word distribution:

behavioral    0.110674
age           0.0789419
maturation    0.057956
task          0.0318285
division      0.0312101
labor         0.0293371
workers       0.0222682
colony        0.0199028
social        0.0188699
behavior      0.0171008
performance   0.0117176
foragers      0.0110682
genotypic     0.0106029
differences   0.0103761
polyethism    0.00904816
older         0.00808171
plasticity    0.00804363
changes       0.00794045

Page 24: BeeSpace Informatics Research

Exploit Prior for Concept Switching

First word distribution:

foraging      0.290076
nectar        0.114508
food          0.106655
forage        0.0734919
colony        0.0660329
pollen        0.0427706
flower        0.0400582
sucrose       0.0334728
source        0.0319787
behavior      0.0283774
individual    0.028029
rate          0.0242806
recruitment   0.0200597
time          0.0197362
reward        0.0196271
task          0.0182461
sitter        0.00604067
rover         0.00582791
rovers        0.00306051

Second word distribution:

foraging      0.142473
foragers      0.0582921
forage        0.0557498
food          0.0393453
nectar        0.03217
colony        0.019416
source        0.0153349
hive          0.0151726
dance         0.013336
forager       0.0127668
information   0.0117961
feeder        0.010944
rate          0.0104752
recruitment   0.00870751
individual    0.0086414
reward        0.00810706
flower        0.00800705
dancing       0.00794827
behavior      0.00789228

Page 25: BeeSpace Informatics Research

Part 3: Task Support

Page 26: BeeSpace Informatics Research

Gene Summarization

• Task: Automatically generate a text summary for a given gene

• Challenge: Need to summarize different aspects of a gene

• Standard summarization methods would generate an unstructured summary

• Solution: A new method for generating semi-structured summaries

Page 27: BeeSpace Informatics Research

An Ideal Gene Summary

• http://flybase.bio.indiana.edu/.bin/fbidq.html?FBgn0000017

[Figure: example gene summary with sections labeled GP, EL, SI, GI, MP, and WFPI.]

Page 28: BeeSpace Informatics Research

Semi-structured Text Summarization

Page 29: BeeSpace Informatics Research

Summary example (Abl)

Page 30: BeeSpace Informatics Research

A General Entity Summarizer

• Task: Given any entity and k aspects to summarize, generate a semi-structured summary

• Assumption: Training sentences available for each aspect

• Method:
  – Train a recognizer for each aspect
  – Given an entity, retrieve sentences relevant to the entity
  – Classify each sentence into one of the k aspects
  – Choose the best sentences in each category
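The four steps of the method can be sketched end to end as follows. This is an illustration under several assumptions, not the BeeSpace summarizer: the aspect recognizer is a simple add-one-smoothed unigram model per aspect, retrieval is a plain substring match on the entity name, the “best” sentence is the one with the highest length-normalized score, and the aspect names and sentences are invented.

```python
import math
from collections import Counter, defaultdict


def train_aspect_models(training):
    """training: {aspect: [sentence, ...]} -> per-aspect word counts."""
    return {
        aspect: Counter(w for s in sents for w in s.lower().split())
        for aspect, sents in training.items()
    }


def aspect_score(sentence, counts, vocab_size):
    """Log-likelihood of the sentence under an add-one-smoothed unigram model."""
    total = sum(counts.values())
    return sum(
        math.log((counts[w] + 1) / (total + vocab_size))
        for w in sentence.lower().split()
    )


def summarize(entity, sentences, training, per_aspect=1):
    models = train_aspect_models(training)                      # step 1: train
    vocab_size = len({w for c in models.values() for w in c}) + 1
    relevant = [s for s in sentences if entity.lower() in s.lower()]  # step 2: retrieve
    buckets = defaultdict(list)
    for s in relevant:                                          # step 3: classify
        scores = {a: aspect_score(s, c, vocab_size) for a, c in models.items()}
        best = max(scores, key=scores.get)
        buckets[best].append((scores[best] / len(s.split()), s))
    return {                                                    # step 4: choose best
        a: [s for _, s in sorted(items, reverse=True)[:per_aspect]]
        for a, items in buckets.items()
    }


training = {  # invented training sentences for two hypothetical aspects
    "expression": ["the gene is expressed in the brain",
                   "expression was detected in embryonic tissue"],
    "mutant phenotype": ["mutants show reduced wing size",
                         "loss of function mutants are lethal"],
}
sentences = ["Abl is expressed in the embryonic brain",
             "Abl mutants are lethal",
             "The weather was sunny"]
summary = summarize("Abl", sentences, training)
```

The result is a semi-structured summary: a mapping from each aspect to its selected sentences, with irrelevant sentences dropped at the retrieval step.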

Page 31: BeeSpace Informatics Research

Summary

• All the methods we developed are
  – General
  – Scalable

• The problems are hard, but good progress has been made in all the directions
  – The V3 system has only incorporated the basic research results
  – More advanced technologies are available for immediate implementation
    • Better tokenization for retrieval
    • Domain adaptation techniques
    • Automatic topic labeling
    • General entity summarizer

• More research to be done in
  – Entity & relation extraction
  – Graph mining/question answering
  – Domain adaptation
  – Active learning

Page 32: BeeSpace Informatics Research

Looking Ahead: X-Space…

[Architecture diagram, repeated from the overview slide. Component labels: Literature Text; Search Engine; Natural Language Understanding; Words/Phrases; Entities; Content Analysis; Relational Database; Meta Data; Text Miner; Space/Region Manager, Navigation Support; Space Navigation; Function Annotator; Gene Summarizer; …; Task Support; Users.]

Page 33: BeeSpace Informatics Research

Thank You!

Questions?