I2 B2 2006 Pedersen

November 10, 2006 I2B2 - Smoker Status Challenge

Determining Smoker Status using Supervised and Unsupervised Learning with Lexical Features

Ted Pedersen
University of Minnesota, Duluth

tpederse@d.umn.edu
http://www.d.umn.edu/~tpederse

Approaches

• Smoking Status as Text Classification
– supervised learning
– lexical features
– techniques used to good effect in word sense disambiguation

• Smoking Status as Text Clustering
– unsupervised learning
– lexical features
– techniques used to good effect in word sense discrimination

Objectives

• How well do WSD techniques generalize to related but different problems?
– smoking status as "meaning" of record?
– not quite the same problem…

• How well do WSD features generalize?
– bag of words, unigrams
– bigrams
– collocations

• How well do learning algorithms generalize?
– supervised and unsupervised

Experimental Variations: Supervised Learning

• Learning Algorithm
– naïve Bayesian classifier
– J48 decision tree
– support vector machine (SMO)

• Feature Sets (also used in unsupervised)
– unigrams, bigrams, trigrams
– various frequency and measure of association cutoffs
– Stop List of 472 words
    • 392 function words
    • 80 words that occurred in more than half the records
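The feature-set variations above can be sketched in a few lines. This is a minimal illustration, not the SenseTools implementation: the tokenizer, the function names, and the choice to remove stop words before forming n-grams are all assumptions, and measure-of-association cutoffs are not shown.

```python
from collections import Counter
import re

def tokenize(text):
    """Lowercase and keep alphabetic runs only (a simplifying assumption)."""
    return re.findall(r"[a-z]+", text.lower())

def ngram_features(records, n=1, min_count=5, stop_words=frozenset()):
    """Return the n-grams that occur at least min_count times in the records.

    Stop words are dropped before n-grams are formed (an assumption), so no
    returned n-gram contains a stop word.
    """
    counts = Counter()
    for text in records:
        tokens = [t for t in tokenize(text) if t not in stop_words]
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return {gram for gram, c in counts.items() if c >= min_count}

def document_frequency_stop_words(records, threshold=0.5):
    """Return words that occur in more than `threshold` of the records."""
    df = Counter()
    for text in records:
        df.update(set(tokenize(text)))
    return {w for w, c in df.items() if c > threshold * len(records)}
```

With the settings reported here, the unigram run would correspond to `ngram_features(records, n=1, min_count=5, stop_words=...)`, where the stop list combines the function words with the output of `document_frequency_stop_words(records)`.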

Decision Tree

• J48 most accurate when using unigram features that occurred 5 or more times in the training data
– over 3,600 unigrams as candidate features
– decision tree has 47 nodes and 24 leaves
– accuracy of 82% (327/398)

Decision Tree: unigrams, 5 or more times

82% accuracy (327/398), 10-fold cross validation on train

  a   b   c   d   e   <-- classified as
 20   5   1   7   3 |  a = PAST-SMOKER
  8  46   3   8   1 |  b = NON-SMOKER
  8   2 240   2   0 |  c = UNKNOWN
  7   5   1  21   1 |  d = CURR-SMOKER
  1   3   1   4   0 |  e = SMOKER
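The reported accuracy can be checked directly from the confusion matrix: correct decisions lie on the diagonal, and each row sums to the number of records with that true label. The short sketch below uses the cross-validation matrix for the unigram tree; the helper names are illustrative only.

```python
def accuracy(matrix):
    """Return (correct, total, accuracy) for a square confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct, total, correct / total

def per_class_recall(matrix):
    """Fraction of each true class (row) that was classified correctly."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]

# Rows: PAST-SMOKER, NON-SMOKER, UNKNOWN, CURR-SMOKER, SMOKER
train_unigram = [
    [20, 5, 1, 7, 3],
    [8, 46, 3, 8, 1],
    [8, 2, 240, 2, 0],
    [7, 5, 1, 21, 1],
    [1, 3, 1, 4, 0],
]

correct, total, acc = accuracy(train_unigram)
print(correct, total, round(acc * 100))   # 327 398 82
print([round(r, 2) for r in per_class_recall(train_unigram)])
```

The per-class view shows that none of the 9 SMOKER records is recovered, while UNKNOWN accounts for 240 of the 327 correct decisions.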

Manual Inspection

• From the decision tree learned from the 3,600 features, we decided to use the following in a second experiment:
– cigarette, drinks, quit, smoke, smoked, smoker, smokes, smoking, tobacco
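As a rough illustration of what a record looks like once it is reduced to these nine unigrams, the sketch below maps a text to a binary presence vector. Binary presence (rather than counts) and the simple tokenizer are assumptions; the slides do not say how feature values were encoded.

```python
import re

# The nine unigrams selected by manual inspection of the decision tree.
FEATURES = ["cigarette", "drinks", "quit", "smoke", "smoked",
            "smoker", "smokes", "smoking", "tobacco"]

def feature_vector(text):
    """Return a 0/1 vector marking which of the nine unigrams occur in text."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return [1 if f in tokens else 0 for f in FEATURES]

print(feature_vector("Patient quit smoking in 1990; denies tobacco."))
```

Matching is on whole tokens, so "smoking" does not also switch on "smoke".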

9-feature Decision Tree: selected from unigram tree

9-feature Decision Tree: 87% accuracy (345/398), 10-fold cross validation on train

  a   b   c   d   e   <-- classified as
 20   5   1  10   0 |  a = PAST-SMOKER
  0  51   2  13   0 |  b = NON-SMOKER
  0   1 250   1   0 |  c = UNKNOWN
  5   4   2  24   0 |  d = CURR-SMOKER
  0   3   1   5   0 |  e = SMOKER

9-feature Decision Tree: 82% accuracy (85/104), evaluation data

  a   b   c   d   e   <-- classified as
 62   0   1   0   0 |  a = UNKNOWN
  1  10   1   0   4 |  b = NON-SMOKER
  0   2   4   0   5 |  c = PAST-SMOKER
  0   0   0   0   3 |  d = SMOKER
  0   1   1   0   9 |  e = CURR-SMOKER

9-feature Decision Tree: 90% accuracy (94/104), evaluation data

  a   b   f   <-- classified as
 62   0   1 |  a = UNKNOWN
  1  10   5 |  b = NON-SMOKER
  0   3  22 |  f = ALL-SMOKER
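The 3-class result follows from the 5-class evaluation matrix by summing the rows and columns of the three smoker categories. The sketch below shows that merge; it assumes the columns of the 5-class matrix follow the same order as its rows, and the function name is illustrative.

```python
def collapse(matrix, labels, groups):
    """Merge confusion-matrix classes.

    groups maps an original label to its merged label; labels that are not
    in groups are kept unchanged. Returns (new_labels, new_matrix).
    """
    new_labels = []
    for label in labels:
        merged = groups.get(label, label)
        if merged not in new_labels:
            new_labels.append(merged)
    index = {label: new_labels.index(groups.get(label, label)) for label in labels}
    out = [[0] * len(new_labels) for _ in new_labels]
    for i, row_label in enumerate(labels):
        for j, col_label in enumerate(labels):
            out[index[row_label]][index[col_label]] += matrix[i][j]
    return new_labels, out

labels = ["UNKNOWN", "NON-SMOKER", "PAST-SMOKER", "SMOKER", "CURR-SMOKER"]
eval_5class = [
    [62, 0, 1, 0, 0],
    [1, 10, 1, 0, 4],
    [0, 2, 4, 0, 5],
    [0, 0, 0, 0, 3],
    [0, 1, 1, 0, 9],
]
smokers = {name: "ALL-SMOKER" for name in ("PAST-SMOKER", "SMOKER", "CURR-SMOKER")}

new_labels, eval_3class = collapse(eval_5class, labels, smokers)
print(new_labels)    # ['UNKNOWN', 'NON-SMOKER', 'ALL-SMOKER']
print(eval_3class)   # [[62, 0, 1], [1, 10, 5], [0, 3, 22]]
```

The gain from 85 to 94 correct comes entirely from the 9 records that were assigned to the wrong smoker subcategory in the 5-class setting.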

Unsupervised Experiments

• Bigram Features
– allow up to 5 intervening words
– occur 2 or more times in training data
– limit to those that include "smok" --> 96 features
– social smoking, pack smoking, smoking alcohol, smoking family, smoke drink, cigarette smoking, allergies smoking, allergies smoked, smoking quit, quit smoking, smoker drinks, former smoker, social smoke, denies smoking, habits smoking, ...
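A bigram with up to 5 intervening words is an ordered word pair whose members are at most 6 positions apart. The sketch below extracts such pairs, applies the frequency cutoff, and keeps only pairs in which either word contains "smok". Function names, the tokenizer, and the removal of stop words before windowing are assumptions, not a description of the SenseClusters code.

```python
from collections import Counter
import re

def windowed_bigrams(tokens, max_gap=5):
    """Yield ordered pairs with at most max_gap words between the two members."""
    for i, first in enumerate(tokens):
        for j in range(i + 1, min(i + max_gap + 2, len(tokens))):
            yield (first, tokens[j])

def smoking_bigrams(records, min_count=2, stem="smok", max_gap=5,
                    stop_words=frozenset()):
    """Return bigrams seen at least min_count times that contain the stem."""
    counts = Counter()
    for text in records:
        tokens = [t for t in re.findall(r"[a-z]+", text.lower())
                  if t not in stop_words]
        counts.update(windowed_bigrams(tokens, max_gap))
    return {bigram for bigram, c in counts.items()
            if c >= min_count and any(stem in word for word in bigram)}
```

For example, `smoking_bigrams(["quit smoking", "quit cigarette smoking"])` keeps only `("quit", "smoking")`, the one pair that occurs twice.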

Unsupervised Context Representations

• 2nd order Context Representations
– Latent Semantic Analysis, native SenseClusters
– each record represented by a vector that is the average of vectors that represent the individual features:

• LSA
– each bigram is replaced by a vector showing the records in which it occurs

• native SenseClusters
– each word is replaced by a vector showing the second words it occurs with as a bigram
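The averaging step can be illustrated for the native SenseClusters case, where each word is represented by the second words it occurs with as a bigram. The sketch uses sparse dictionaries as vectors and raw counts as values; both are assumptions, as are the function names. For the LSA variant the same averaging would apply to bigram-by-record vectors instead; no dimensionality reduction is shown.

```python
from collections import defaultdict

def word_vectors(bigram_counts):
    """Map each first word to a sparse vector {second_word: count}."""
    vectors = defaultdict(dict)
    for (first, second), count in bigram_counts.items():
        vectors[first][second] = count
    return dict(vectors)

def average_vectors(vectors):
    """Average a list of sparse vectors; an empty list gives an empty vector."""
    if not vectors:
        return {}
    total = defaultdict(float)
    for vec in vectors:
        for dim, value in vec.items():
            total[dim] += value
    return {dim: value / len(vectors) for dim, value in total.items()}

def record_vector(tokens, vectors):
    """Second-order representation: average the vectors of the known words."""
    return average_vectors([vectors[t] for t in tokens if t in vectors])

bigram_counts = {("quit", "smoking"): 2,
                 ("denies", "smoking"): 1,
                 ("denies", "alcohol"): 1}
vectors = word_vectors(bigram_counts)
print(record_vector(["patient", "quit", "denies"], vectors))
```

Words without a vector ("patient" here) are skipped, so the record is the average of the vectors for "quit" and "denies".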

Unsupervised Clustering

• Once vectors for all records are created, they are clustered using a partitional method similar to k-means

• The number of clusters is automatically discovered using the PK2 measure, which compares successive values of the clustering criterion function

• Clusters are assigned to categories based on their distribution in the training data
– unknown, non-smoker, past-smoker, current-smoker, smoker
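Two of the steps above lend themselves to a short sketch. The first function labels each cluster with the most frequent category among its labeled members, which is one plausible reading of "based on distribution in training data". The second computes ratios of successive criterion function values, which is one way to "compare successive values"; it is not the PK2 selection rule itself, whose details are not given here. All names are hypothetical.

```python
from collections import Counter, defaultdict

def cluster_to_category(cluster_ids, gold_labels):
    """Label each cluster with the majority category of its labeled members."""
    by_cluster = defaultdict(Counter)
    for cluster_id, label in zip(cluster_ids, gold_labels):
        by_cluster[cluster_id][label] += 1
    return {cluster_id: counts.most_common(1)[0][0]
            for cluster_id, counts in by_cluster.items()}

def successive_ratios(criterion_values):
    """Ratio of the criterion value at k clusters to the value at k - 1.

    criterion_values[i] is assumed to hold the value for k = i + 1 clusters.
    """
    return {k: criterion_values[k - 1] / criterion_values[k - 2]
            for k in range(2, len(criterion_values) + 1)}

mapping = cluster_to_category(
    [0, 0, 0, 1, 1, 2],
    ["UNKNOWN", "UNKNOWN", "NON-SMOKER", "CURR-SMOKER", "CURR-SMOKER", "UNKNOWN"],
)
print(mapping)   # {0: 'UNKNOWN', 1: 'CURR-SMOKER', 2: 'UNKNOWN'}
```

Under majority labeling, several clusters may map to the same category, and a category with few records may receive no cluster at all.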

SenseClusters: 69% accuracy (72/104), evaluation data

  a   b   c   d   e   <-- classified as
 63   0   0   0   0 |  a = UNKNOWN
 10   0   0   0   6 |  b = NON-SMOKER
  1   0   0   0   2 |  d = SMOKER

SenseClusters: 79% accuracy (82/104), evaluation data

Latent Semantic Analysis: 68% accuracy (71/104), evaluation data

  a   b   c   d   e   <-- classified as
 63   0   0   0   0 |  a = UNKNOWN
 10   0   0   0   6 |  b = NON-SMOKER
  1   0   0   0   2 |  d = SMOKER

Latent Semantic Analysis: 77% accuracy (80/104), evaluation data

Conclusions

• Results dominated by UNKNOWN
– sets lower bound of 61%

• Errors dominated by confusion in ALL-SMOKER
– reduction to 3 classes improves results significantly (decision tree: 82% to 90% on evaluation data)

• Decision tree aided feature selection

• Manual tuning of feature sets performed since records focus well beyond smoking status

• Unsupervised clustering found "right" number of clusters perhaps, did well in that light
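The 61% lower bound is the accuracy of always answering UNKNOWN on the evaluation data. The sketch below derives it from the class counts (the row sums of the 5-class evaluation matrix for the decision tree) and compares each reported result against it; the dictionary keys are illustrative labels only.

```python
# True class counts in the evaluation data (row sums of the 5-class matrix).
eval_counts = {"UNKNOWN": 63, "NON-SMOKER": 16, "PAST-SMOKER": 11,
               "SMOKER": 3, "CURR-SMOKER": 11}

total = sum(eval_counts.values())
majority_label, majority_count = max(eval_counts.items(), key=lambda kv: kv[1])
baseline = majority_count / total

# Correct decisions out of 104, as reported on the result slides.
results = {
    "decision tree, 5 classes": 85,
    "decision tree, 3 classes": 94,
    "SenseClusters, 5 classes": 72,
    "SenseClusters, 3 classes": 82,
    "LSA, 5 classes": 71,
    "LSA, 3 classes": 80,
}

print(f"baseline ({majority_label}): {baseline:.0%}")
for name, correct in results.items():
    gain = (correct - majority_count) / total
    print(f"{name}: {correct / total:.0%} ({gain:+.0%} vs. baseline)")
```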

Software Resources

• Supervised Experiments
– SenseTools (free, from Duluth): http://www.d.umn.edu/~tpederse/sensetools.html
– Weka (free, from Waikato): http://www.cs.waikato.ac.nz/ml/weka/

• Unsupervised Experiments
– SenseClusters (free, from Duluth): http://senseclusters.sourceforge.net
