Analyzing NIH Funding Patterns with Statistical Text Analysis

Jihyun Park, Eric Nalisnick, Padhraic Smyth: Dept. of Computer Science, University of California, Irvine

Margaret Blume-Kohout: New Mexico Consortium

Ralf Krestel: Web Science Research Group, Hasso-Plattner-Institut


Page 1: Analyzing NIH Funding Patterns with Statistical Text Analysis · jihyunp/files/jihyunp_NIH_LLDA_slides_final.pdf · Jihyun Park, AAAI-16


Measuring the Impact of NIH (National Institutes of Health) Funding

▸ NIH invests over $30 billion each year

▸ Can we gain insight into this process using text and metadata?

▸ Our approach is to use statistical topic modeling

▸ We used grants data from NCI (National Cancer Institute)

[Figure: NCI funded grants. X-axis: year, 1994 to 2014. Y-axis: number of grants (thousands), 4 to 11. Annotation: "ARRA Funded".]

[Figure: NCI funding amount. X-axis: year, 1994 to 2014. Y-axis: billions (US dollars), 1.5 to 6. Annotation: "ARRA Funded".]

Jihyun Park, AAAI-16 Scholarly Big Data workshop, Feb 2016

Overview

NCI Data: PROJECT ID, GRANT ABSTRACT, RCDC Labels, FUNDING, YEAR, …

Text Classification Techniques

For each grant: probability of each label being associated with the grant

GENETICS          1.0
HUMAN GENOME      0.8
BIOENGINEERING    0.7
NANO TECHNOLOGY   0.2

Funding + Year information

Funding patterns over time for each area

NCI (National Cancer Institute) Data

▸ Grant abstracts from 1994 through 2013

▸ Text Processing
  ▸ BOW representation
  ▸ Removed 500 common stopwords
  ▸ Extracted noun-phrase terms using an NLP parser

▸ BOW Data
  ▸ Total 149,901 documents
  ▸ Number of documents with labels (training data): 31,628 (2008~2011)
  ▸ Number of documents without labels: 118,273
  ▸ Size of vocabulary (W): 29,713
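The text-processing steps above can be sketched roughly as follows. This is a minimal illustration only: it assumes whitespace tokenization and a tiny hypothetical stopword set, whereas the slides describe 500 stopwords and noun-phrase extraction with an NLP parser, neither of which is reproduced here.

```python
from collections import Counter

# Hypothetical stopword list for illustration; the slides removed 500 common stopwords.
STOPWORDS = {"the", "of", "in", "and", "a", "to", "for", "is"}

def bag_of_words(abstract, stopwords=STOPWORDS):
    """Return term counts for one abstract after lowercasing and stopword removal."""
    tokens = [t.strip(".,;:()").lower() for t in abstract.split()]
    return Counter(t for t in tokens if t and t not in stopwords)

def build_vocabulary(bows):
    """Map each distinct term across all documents to a column index."""
    terms = sorted({term for bow in bows for term in bow})
    return {term: i for i, term in enumerate(terms)}

# Made-up abstracts, not taken from the NCI data.
abstracts = [
    "Glioma growth in the brain of mice.",
    "Obesity and breast cancer risk in women.",
]
bows = [bag_of_words(a) for a in abstracts]
vocab = build_vocabulary(bows)
```

Each `Counter` is one sparse row of the document-term count matrix, and `vocab` fixes the column order.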

LDA: Topics are Represented as Distributions over Words

Terrorism: SEPT_11, WAR, SECURITY, IRAQ, TERRORISM, NATION, KILLED, AFGHANISTAN, ATTACKS, OSAMA_BIN_LADEN

Wall Street Firms: WALL_STREET, ANALYSTS, INVESTORS, FIRM, GOLDMAN_SACHS, FIRMS, INVESTMENT, MERRILL_LYNCH, COMPANIES, SECURITIES

Stock Market: WEEK, DOW_JONES, POINTS, 10_YR_TREASURY_YIELD, PERCENT, CLOSE, NASDAQ_COMPOSITE, STANDARD_POOR, CHANGE, FRIDAY

Bankruptcy: BANKRUPTCY, CREDITORS, BANKRUPTCY_PROTECTION, ASSETS, COMPANY, FILED, BANKRUPTCY_FILING, ENRON, BANKRUPTCY_COURT, KMART

Figures from Mark Steyvers

LDA: Documents are Represented as Combinations of Topics

[Figure: four topics (Terrorism, Wall Street Firms, Stock Market, Bankruptcy), with each document drawn as a weighted mixture of them. Weights shown: Document 1: 70%, 30%; Document 2: 50%, 50%; Document 3: 90%, …]

Figures from Mark Steyvers
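The idea that a document is a combination of topics can be sketched numerically: the document's word distribution is the topic-word distributions mixed by the document's topic weights, p(w | d) = Σ_t p(w | t) p(t | d). The topic names and probabilities below are toy values chosen for illustration, not numbers from the figure.

```python
# Toy topic-word distributions p(word | topic); values are illustrative only.
topics = {
    "stock_market": {"dow_jones": 0.6, "investors": 0.3, "bankruptcy": 0.1},
    "bankruptcy": {"dow_jones": 0.1, "investors": 0.2, "bankruptcy": 0.7},
}

def document_word_distribution(topic_weights, topics):
    """Mix topic-word distributions: p(w | d) = sum_t p(w | t) * p(t | d)."""
    mixed = {}
    for topic, weight in topic_weights.items():
        for word, prob in topics[topic].items():
            mixed[word] = mixed.get(word, 0.0) + weight * prob
    return mixed

# A document that is 70% one topic and 30% another.
doc1 = document_word_distribution({"stock_market": 0.7, "bankruptcy": 0.3}, topics)
```

Because the topic weights and each topic's word probabilities both sum to 1, the mixed distribution also sums to 1.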

LDA (Latent Dirichlet Allocation)

▸ Topic Models as Factor Analysis for Count Data

[Figure: the D × W document-word count matrix factored into a D × T matrix (T topic weights for each document) and a T × W matrix (topic-word probability distribution).]

NIH Data Representation for L-LDA

[Figure: sparse document-term count matrix. Rows (Documents): doc1 to doc10. Columns (Words or Terms): brain, lung_cancer, women, obesity, children, mice, experiment, hbv, quality, glioma, researcher.]

NIH Data Representation for L-LDA

[Figure: the document-term count matrix (Documents × Words or Terms) extended with binary columns for Codes or Labels: brain cancer, breast cancer, kidney disease, lung cancer, mind and body. Each document has a 1 under each label it carries.]

NIH Data Representation for L-LDA

[Figure: the document-term count matrix (Documents × Words or Terms) with binary Codes or Labels columns (brain cancer, breast cancer, kidney disease, lung cancer, mind and body) plus two further columns, Background 1 and Background 2, which are set to 1 for every document.]
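One way to read the layout above: Labeled LDA restricts each document to the topics named by its labels, and here every document is additionally allowed the background topics. A minimal sketch of building such a binary row, using the label names from the figure and hypothetical label assignments:

```python
LABELS = ["brain cancer", "breast cancer", "kidney disease", "lung cancer", "mind and body"]
BACKGROUND = ["Background 1", "Background 2"]

def allowed_topic_row(doc_labels, labels=LABELS, background=BACKGROUND):
    """Binary row: 1 for each label the document carries, and 1 for every background topic."""
    row = [1 if label in doc_labels else 0 for label in labels]
    return row + [1] * len(background)

# Hypothetical label assignments, not taken from the slides.
doc_labels = {"doc1": {"brain cancer"}, "doc4": {"breast cancer", "lung cancer"}}
matrix = {doc: allowed_topic_row(labels) for doc, labels in doc_labels.items()}
```

The last two entries of every row are always 1, which matches the background columns being filled in for all ten documents in the figure.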

Examples of Topics from NCI Abstracts (5 out of 98)

Brain Cancer: glioma, brain tumor, gbm, malignant glioma, glioblastoma, brain

Breast Cancer: breast cancer, women, breast cancer cell, breast, breast cancer patient, brca1

Kidney Disease: rcc, kidney cancer, renal cell carcinoma, vhl, renal cancer, pvhl

Background 1: program, trainee, university, training, candidate, field

Background 7: model, mice, work, experiment, human, mouse model

88 Topics from RCDC labels, 10 Background topics

Evaluation

Grants with RCDC labels (31,628 documents, 29,713 terms)

TRAIN 90% (28K docs)

TEST 10% (3K docs)

Sampling probabilities were averaged over the words in a document to calculate AUC and R-precision scores:

$p(\text{code} \mid \text{doc}) \propto \sum_{i} p(\text{code} \mid \text{word}_i, \text{doc})$

$p(\text{code} \mid \text{doc})$ is then scored with AUC and R-Precision.
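The scoring step can be sketched as below: average the per-word code probabilities to get one score per code, then rank codes for the document. The slides do not say whether AUC was computed per document or per code, so the per-document version here is an assumption, and all numbers are toy values.

```python
def doc_code_scores(word_code_probs):
    """Average p(code | word_i, doc) over the words of one document."""
    n_words = len(word_code_probs)
    n_codes = len(word_code_probs[0])
    return [sum(row[k] for row in word_code_probs) / n_words for k in range(n_codes)]

def r_precision(scores, true_codes):
    """Fraction of the top-R scored codes that are true, where R = number of true codes."""
    r = len(true_codes)
    ranked = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
    return sum(1 for k in ranked[:r] if k in true_codes) / r

def auc(scores, true_codes):
    """Probability that a true code is scored above a false one (ties count half)."""
    pos = [s for k, s in enumerate(scores) if k in true_codes]
    neg = [s for k, s in enumerate(scores) if k not in true_codes]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy document: 3 words, 4 codes; each row is p(code | word_i, doc).
word_code_probs = [
    [0.7, 0.1, 0.1, 0.1],
    [0.5, 0.3, 0.1, 0.1],
    [0.2, 0.2, 0.5, 0.1],
]
scores = doc_code_scores(word_code_probs)
true_codes = {0, 1}
```

R-precision looks only at the top R codes, while AUC compares every true code against every false code, so the two can disagree on the same ranking.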

Logistic Regression Classifier

L-LDA TOPIC PROBABILITY: $p(\text{code} = k \mid \text{doc})$ for $k = 1, 2, \ldots, 87, 88$

Logistic Regression Classifier

CALIBRATED TOPIC PROBABILITY: $p_{LR}(\text{code} = k \mid d)$

88 logistic regression classifiers trained to produce calibrated probabilities using training data
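A minimal sketch of one such calibrator follows. It assumes, since the slide does not say, that each classifier takes the single L-LDA probability for its own code as input, in the style of Platt scaling. The fit is plain gradient descent, and the data, learning rate, and step count are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_calibrator(scores, labels, lr=0.5, steps=2000):
    """Fit p_LR = sigmoid(a * score + b) by gradient descent on the mean log loss."""
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y  # derivative of the log loss w.r.t. the logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

# Toy training data for one code: L-LDA probability and whether the code applies.
llda_scores = [0.05, 0.10, 0.20, 0.40, 0.60, 0.80]
has_code = [0, 0, 0, 1, 1, 1]
a, b = fit_calibrator(llda_scores, has_code)
calibrated = [sigmoid(a * s + b) for s in llda_scores]
```

Under this assumption the full pipeline would repeat the fit once per code, giving 88 pairs `(a, b)`.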

Evaluation Result

              L-LDA, p(c | d)    L-LDA + Logistic Regression, pLR(c | d)
AUC           0.80               0.89
R-Precision   0.56               0.64

Analyzing Funding Patterns over Time

▸ Fractionally assign the funds in direct proportion to the probabilities from the logistic regression classifiers, $p_{LR}(\text{code} \mid \text{doc})$

$$w_{cd} = \frac{p_{LR}(c \mid d)}{\sum_{k} p_{LR}(c = k \mid d)}, \qquad F_{cy} = \sum_{d:\, y_d = y} w_{cd}\, x_d, \qquad c = 1, 2, \ldots, 88$$

$w_{cd}$: weight for the category c for document d

$x_d$: amount of funding for document d (adjusted for inflation)

$y_d$: year when document d was funded

$F_{cy}$: total estimated amount of funding for category c in year y
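The allocation above can be sketched directly: normalize each grant's calibrated probabilities into weights, then add weight times funding into a per-category, per-year total. The grant records below are hypothetical.

```python
from collections import defaultdict

def funding_by_category_year(grants):
    """Accumulate F[c][y] = sum over grants d funded in year y of w_cd * x_d.

    Each grant is (year, funding, probs) where probs[c] = p_LR(c | d).
    """
    totals = defaultdict(lambda: defaultdict(float))
    for year, funding, probs in grants:
        norm = sum(probs.values())  # denominator of w_cd
        for category, p in probs.items():
            totals[category][year] += (p / norm) * funding
    return totals

# Hypothetical grants: (funding year, inflation-adjusted amount, calibrated probabilities).
grants = [
    (2008, 100.0, {"Genetics": 0.6, "Obesity": 0.2}),
    (2008, 50.0, {"Genetics": 0.1, "Obesity": 0.4}),
    (2009, 80.0, {"Genetics": 0.5, "Obesity": 0.5}),
]
F = funding_by_category_year(grants)
```

Because the weights for each grant sum to 1, the category totals for a year add up to the total funding awarded that year.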

[Figure: Estimated percentage of funding allocated to 4 general RCDC categories: Nanotechnology, Networking and Information Technology R&D, Human Genome, Obesity. X-axis: year, 1994 to 2014. Y-axis: funding percentage, 0 to 1.6.]

[Figure: Estimated percentage of funding allocated to 4 specific RCDC disease categories: Lung Cancer, Liver Cancer, Brain Cancer, Infectious Diseases. X-axis: year, 1994 to 2014. Y-axis: funding percentage, 0 to 1.6.]

[Figure: funding percentage by year for Breast Cancer, Translational Research, and Epidemiology And Longitudinal Studies. X-axis: year, 1994 to 2014. Y-axis: funding percentage, 0 to 7.]

[Figure: funding percentage by year for Genetics, Biotechnology, Breast Cancer, Obesity, and Sleep Research. X-axis: year, 1994 to 2014. Y-axis: funding percentage, 0 to 14.]

Conclusions

▸ Summary
  ▸ Labeled topic modeling and logistic classifiers can be combined to analyze NIH grant funding data
  ▸ Statistical topic modeling allows linking of text with metadata in a quantifiable manner

▸ Future Work
  ▸ Jointly analyze grants and scientific articles related to the grants (ongoing)
  ▸ Broader analysis of the economic and policy implications
  ▸ Improvements on topic model
    ▸ How to best calibrate
    ▸ Selecting the right hyperparameters
    ▸ Methods such as using seed words


THANK YOU


BACKUP SLIDES

NCI (National Cancer Institute) Data

NCI Data: PROJECT ID, GRANT ABSTRACT, RCDC Labels, FUNDING, YEAR, …

[Figure: Number of tokens in a document. X-axis: number of tokens, 0 to 250. Y-axis: 0 to 5, scale ×10^4.]

[Figure: Number of labels in a document. X-axis: number of labels per document, 0 to 30. Y-axis: 0 to 10,000.]

Examples of Topics from NCI Abstracts (5 out of 88)

Brain Cancer: glioma, brain tumor, gbm, malignant glioma, glioblastoma, brain

Breast Cancer: breast cancer, women, breast cancer cell, breast, breast cancer patient, brca1

Kidney Disease: rcc, kidney cancer, renal cell carcinoma, vhl, renal cancer, pvhl

Hepatitis: hcv, hbv, liver cancer, hepatitis virus, hbv infection, hbv replication

Lung Cancer: glioma lung cancer, nsclc, lung, leading cause, cancer death, egfr

Labeled-LDA for NCI Grants

▸ 88 Topics (RCDC Codes)

▸ 10 Background Topics

▸ Hyperparameters
  ▸ Dirichlet prior for word-topic distribution
    ▸ $\beta_w = 0.01$
  ▸ Dirichlet prior for doc-topic distribution
    ▸ Used proportional alphas: $\sum_{c=1}^{88} \alpha_c = 5$, $\sum_{b}^{B} \alpha_b = 1$
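The slide gives only the totals for the alphas, not what they are proportional to. The sketch below assumes, purely for illustration, that each label's alpha is proportional to how often the label occurs in the training data, scaled so the label alphas sum to 5. The label counts are hypothetical.

```python
def proportional_alphas(label_counts, total_mass):
    """Scale counts so the resulting alphas sum to total_mass."""
    total = sum(label_counts.values())
    return {label: total_mass * count / total for label, count in label_counts.items()}

# Hypothetical label frequencies; the slides do not list them.
label_counts = {"Genetics": 600, "Breast Cancer": 300, "Sleep Research": 100}
alphas = proportional_alphas(label_counts, total_mass=5.0)
```

The same helper with `total_mass=1.0` would cover the background topics under the same assumption.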


Analyzing Funding Patterns over Time

▸ Fractionally assign the funds in direct proportion to the probability

$$w_{cd} = \frac{p(c \mid d)}{\sum_{k} p(c = k \mid d)}, \qquad F_{cy} = \sum_{d:\, y_d = y} w_{cd}\, x_d, \qquad c = 1, 2, \ldots, 88$$

NCI (National Cancer Institute) Data

▸ 149,901 grants in total

▸ for FY1994~FY2013

▸ Number of grants with labels: 31,628 (2008~2011)

▸ Number of grants without labels: 118,273

▸ Size of vocabulary (W): 29,713

[Figure: sparse D × W document-word count matrix.]