LSA.303 Introduction to Computational Linguistics


1. CS 424P / LINGUIST 287: Extracting Social Meaning and Sentiment
Dan Jurafsky
Lecture 8: Medical Applications:
Intoxication, Depression, Trauma, Alzheimer's, General Medical Health

2. Topic 1: Intoxication
3. Hollien et al 2001
Methods:
35 young adults, 19 males, 16 females
given series of doses of alcohol
speech collected at 4 BAC stages
Rainbow passage
difficult words (buttercup, shapupie)
extemp speech (Tell us about your favorite TV program)
head-mounted mikes
Investigated:
F0 mean and variance
duration/rate of speech
intensity
disfluencies
4. Hollien et al 2001 Results: F0
5. Hollien et al 2001 Results: Duration
6. Hollien et al 2001 Results: Disfluencies
7. Hollien et al 2001 Results: Magnitudes
8. Hollien et al 2001 Results: Speaker Specific Effects
What did they find?
9. A famous case study
Johnson, K., Pisoni, D. & Bernacki, R. (1990) Do voice recordings reveal whether a person is intoxicated?: A case study. Phonetica. 47: 215-237.
10. Exxon Valdez
11. Was Captain Hazelwood drunk?
Not clear if this is relevant, since
he was asleep below deck
The third mate was in charge of the wheelhouse
the ship's radar was broken
But it is a well-studied case
12. Johnson et al examined 3 kinds of cues
Segmental Effects
Disfluencies
Suprasegmental Effects
13. Keith Johnson's /s/ and /ʃ/
14. /ʃ/: Captain Hazelwood
15. 16. 17. Duration
18. F0
19. Summary
20. Questions
Johnson et al. examined various possible causes.
What other kinds of speaker state could cause drop in F0, slower speech, and disfluencies?
21. New Corpus!
Alcohol Language Corpus
Florian Schiel et al 2009, 2010
http://www.bas.uni-muenchen.de/forschung/Bas/BasALCeng.html
124 speakers, 11,160 recordings
recorded in a car (sometimes with engine running)
tonguetwisters
command and control speech (turn off the radio)
spontaneous dialogue and monologue
sample, drunk:
sample, sober:
22. Automatic Classification
Use of prosodic speech characteristics for automated detection of alcohol intoxication
Michael Levit, Richard Huber, Anton Batliner, Elmar Noeth
Break utterance into phrases automatically, based on
fundamental frequency (where possible);
zero-crossing rate
energy
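A minimal sketch of frame-level energy and zero-crossing rate, and of splitting on low-energy pauses. This is not Levit et al.'s segmenter: the F0 cue is left out, and the frame length, hop size, energy floor and minimum pause length are all hypothetical values chosen for illustration.

```python
def frame_features(samples, frame_len=400, hop=160):
    """Return (energy, zero_crossing_rate) for each frame of a mono signal."""
    feats = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        # A zero crossing is a sign change between neighbouring samples.
        crossings = sum(
            1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
        )
        feats.append((energy, crossings / (frame_len - 1)))
    return feats


def split_into_phrases(feats, energy_floor=1e-4, min_pause_frames=3):
    """Group frame indices into (first, last) phrases separated by pauses."""
    phrases, current, silent_run = [], [], 0
    for i, (energy, _zcr) in enumerate(feats):
        if energy >= energy_floor:
            current.append(i)
            silent_run = 0
        else:
            silent_run += 1
            # Close the phrase only once the pause is long enough.
            if current and silent_run >= min_pause_frames:
                phrases.append((current[0], current[-1]))
                current = []
    if current:
        phrases.append((current[0], current[-1]))
    return phrases
```

The zero-crossing rate is computed but not used for the split here; in a fuller version it could help separate unvoiced speech from silence.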
23. Then use 4 classes of features
Prosodic
F0 max, F0 min, energy max, energy min, pause length
Duration of voiced regions, unvoiced regions, etc.
Jitter and shimmer
Average cepstrum and cepstral slope
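Jitter and shimmer can be sketched with the common "local" definition: the mean absolute difference between consecutive glottal cycles, divided by the mean value. Jitter applies it to cycle periods, shimmer to cycle peak amplitudes. Whether Levit et al. used exactly this variant is an assumption, and the numbers below are made-up illustrative values.

```python
def local_perturbation(values):
    """Mean absolute difference of consecutive values, relative to the mean."""
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return (sum(diffs) / len(diffs)) / (sum(values) / len(values))


# Hypothetical per-cycle measurements for one voiced region.
periods_ms = [10.0, 10.2, 9.9, 10.1, 10.0]
peak_amplitudes = [0.80, 0.78, 0.82, 0.79, 0.81]

jitter = local_perturbation(periods_ms)        # cycle-to-cycle period variation
shimmer = local_perturbation(peak_amplitudes)  # cycle-to-cycle amplitude variation
```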
24. Methods
Alcoholized speech samples collected at the Police Academy of Hessen, Germany
120 readings (87 minutes) of a fable
33 male speakers
BAC between 0 and .24/mille
Binary task: above or below 0.8/mille
leave-one-out cross-validation
neural net classifier
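The evaluation loop can be sketched as follows. A nearest-centroid classifier is swapped in for the neural net purely to keep the example short, and the feature vectors and BAC values in the usage note are invented. Treating a BAC exactly at the threshold as "below" is also an assumption.

```python
def nearest_centroid_predict(train_x, train_y, x):
    """Predict the label whose class mean is closest to x."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [r for r, y in zip(train_x, train_y) if y == label]
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(x, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def leave_one_out_accuracy(features, bac_values, threshold=0.8):
    """Hold out each sample once, train on the rest, and score the held-out one."""
    labels = [bac > threshold for bac in bac_values]  # binary task
    correct = 0
    for i in range(len(features)):
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if nearest_centroid_predict(train_x, train_y, features[i]) == labels[i]:
            correct += 1
    return correct / len(features)
```

Usage, with toy one-dimensional features: `leave_one_out_accuracy([[1.0], [1.2], [0.9], [3.0], [3.1], [2.9]], [0.1, 0.3, 0.5, 1.0, 1.5, 2.0])`.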
25. Results of Levit et al.
Used dev set to find best classifier
This used two feature classes:
Prosodic features
Jitter/shimmer
Results with this classifier
62% phrase-accuracy
69% for the whole speech sample
voting of the phrases
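Phrase voting can be sketched as a simple majority vote over the phrase-level decisions for one speech sample. The label strings are hypothetical, and breaking ties in favour of the label seen first is an assumption, not something the slide specifies.

```python
from collections import Counter


def vote(phrase_predictions):
    """Return the most common phrase-level label for one speech sample."""
    if not phrase_predictions:
        raise ValueError("no phrase predictions to vote on")
    counts = Counter(phrase_predictions)
    # most_common keeps first-seen order among equally frequent labels.
    return counts.most_common(1)[0][0]
```

Because individual phrase errors can be outvoted, sample-level accuracy can exceed phrase-level accuracy, as in the 62% versus 69% figures above.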
26. Automatic detection features in the Bavarian corpus
Humans: 62%-75%
Machine: features used to date:
F0
duration
rhythm (correlated with duration but doesn't require word transcripts)
formants (F1 mean and F4 variance)
Future work!!!
disfluencies
other segmental features:
s versus sh
but Schiel finding: more hyperarticulation in vowels in women in their corpus
27. Topic 2: Depression
28. Stirman and Pennebaker
Suicidal poets
300 poems from early, middle, late periods of
9 suicidal poets
9 non-suicidal poets
29. Stirman and Pennebaker: 2 models
Durkheim disengagement model:
suicidal individual has failed to integrate into society sufficiently, is detached from social life
detach from the source of their pain, withdraw from social relationships, become more self-oriented
prediction:
more self-references, fewer group references
Hopelessness model:
Suicide takes place during extended periods of sadness and desperation, pervasive feelings of helplessness, thoughts of death
prediction:
more negative emotion, fewer positive, more refs to death
30. Methods
156 poems from 9 poets who
committed suicide
published, well-known
in English
had written within 1 year of committing suicide
Control poets matched for nationality, education, sex, era.
31. The poets
32. Stirman and Pennebaker: Results
33. Significant factors
Disengagement theory
I, me, mine
we, our, ours
Hopelessness theory
death, grave
Other
sexual words (lust, breast)
34. Rude et al: Language use of depressed and depression-vulnerable college students
Beck (1967) cognitive theory of depression
depression-prone individuals see the world and themselves in pervasively negative terms
Pyszynski and Greenberg (1987)
think about themselves
after the loss of a central source of self-worth, unable to exit a self-regulatory cycle concerned with efforts to regain what was lost.
results in self-focus, self-blame
Durkheim social integration/disengagement
perception of self as not integrated into society is key to suicidality and possibly depression
35. Methods
College freshmen
31 currently-depressed (standard inventories)
26 formerly-depressed
67 never-depressed
Session 1: take depression inventory
Session 2: write essay
"please describe your deepest thoughts and feelings about being in college ... write continuously off the top of your head. Don't worry about grammar or spelling. Just write continuously."
36. Results
depressed used more "I", "me" than never-depressed
turned out to be only "I"
and used more negative emotional words
not enough "we" to check Durkheim model
formerly depressed participants used more "I" in the last third of the essay
37. Ramirez-Esparza et al: Depression in English and Spanish
Study 1: Use LIWC counts on 320 posts from English and Spanish forums
80 posts each from depression forums in English and Spanish
80 control posts each from breast cancer forums
Run the following LIWC categories
I
we
negative emotion
positive emotion
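A minimal sketch of LIWC-style counting: each category score is the percentage of tokens in a post that fall in that category's word list. The word lists below are tiny hypothetical stand-ins, not the LIWC dictionaries, and only English is covered.

```python
import re

# Hypothetical miniature word lists; the real LIWC categories are far larger.
CATEGORIES = {
    "I": {"i", "me", "my", "mine", "myself"},
    "we": {"we", "us", "our", "ours", "ourselves"},
    "negative emotion": {"sad", "hurt", "afraid", "hopeless"},
    "positive emotion": {"happy", "good", "hope", "love"},
}


def category_rates(text):
    """Return the percentage of tokens in each category for one post."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens)
    rates = {}
    for name, words in CATEGORIES.items():
        hits = sum(1 for t in tokens if t in words)
        rates[name] = 100.0 * hits / total if total else 0.0
    return rates
```

Rates per post can then be averaged within each group (depression versus breast cancer forums, English versus Spanish) and compared.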
38. Results of Study 1
39. Conclusions?
40. Study 2
From depression forums:
404 English posts
404 Spanish posts
Create a term by document matrix of content words
200 most frequent content words
Do a factor analysis
dimensionality reduction in term-document matrix
Used 5 factors
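The matrix-building step can be sketched as below: pick the most frequent content words across all posts, then count each one in each post. The stopword list is a hypothetical placeholder for whatever "content word" filter the study used, and the factor analysis itself is not shown.

```python
import re
from collections import Counter

# Hypothetical function-word list used to approximate "content words".
STOPWORDS = {"the", "a", "an", "and", "of", "to", "i", "is", "in", "my"}


def term_document_matrix(posts, vocab_size=200):
    """Return (vocab, matrix) with one row per term and one column per post."""
    tokenized = [
        [t for t in re.findall(r"[a-z]+", post.lower()) if t not in STOPWORDS]
        for post in posts
    ]
    totals = Counter(t for doc in tokenized for t in doc)
    vocab = [word for word, _ in totals.most_common(vocab_size)]
    doc_counts = [Counter(doc) for doc in tokenized]
    matrix = [[counts[word] for counts in doc_counts] for word in vocab]
    return vocab, matrix
```

With 404 posts and `vocab_size=200`, the result is a 200 by 404 count matrix ready for a factor-analysis routine.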
41. English Factors
42. Spanish Factors
43. Implications?
Problems?
New applications?
44. Topic 3: Trauma
45. Cohn, Mehl, Pennebaker: Linguistic Markers of Psychological Change Surrounding September 11, 2001
1084 LiveJournal users
all blog entries for 2 months before and after 9/11
Lumped prior two months into one baseline corpus.
Investigated changes after 9/11 compared to that baseline
Using LIWC categories
46. Variables examined
Emotional positivity
difference between LIWC scores for positive emotion words (happy, good, nice) and negative emotion words (kill, ugly, guilty).
cognitive processing
think, question, because: concerned with organizing and intellectually understanding issues
social orientation
talk, share, friends and personal pronouns besides I/me. (essentially counts # of references to other people)
47. Last factor: Psychological Distancing
psychological distancing
factor-analytic:
+ articles,
+ words > 6 letters long
- I/me/mine
- would/should/could
- present tense verbs
low score = personal, experiential language, focus on here and now
high score: abstract, impersonal, rational tone
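A rough sketch of a distancing score built from the signed components listed above. Two simplifications are assumptions rather than the study's method: the components are combined as an unweighted sum of token rates instead of factor-analytic weights, and present tense verbs are omitted because detecting them needs a part-of-speech tagger.

```python
import re

ARTICLES = {"a", "an", "the"}
FIRST_PERSON = {"i", "me", "mine"}
DISCREPANCY = {"would", "should", "could"}


def distancing_score(text):
    """Higher = more abstract and impersonal; lower = more personal."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    n = len(tokens)
    articles = sum(1 for t in tokens if t in ARTICLES) / n
    long_words = sum(1 for t in tokens if len(t) > 6) / n
    first_person = sum(1 for t in tokens if t in FIRST_PERSON) / n
    discrepancy = sum(1 for t in tokens if t in DISCREPANCY) / n
    # Positive indicators add to the score, negative indicators subtract.
    return articles + long_words - first_person - discrepancy
```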
48. Results
49. Implications?
Methodological problems?
Ideas for exciting new studies?
50. Topic 4: Alzheimer's
51. The Nun Study
Linguistic Ability in Early Life and the Neuropathology of Alzheimer's Disease and Cerebrovascular Disease: Findings from the Nun Study
D.A. SNOWDON, L.H. GREINER, AND W.R. MARKESBERY
The Nun Study: a longitudinal study of aging and Alzheimer's disease
Cognitive and physical function assessed annually
All participants agreed to brain donation at death
At the first exam given between 1991 and 1993, the 678 participants were 75 to 102 years old.
This study:
subset of 74 participants
for whom we had handwritten autobiographies from early life,
all of whom had died.
52. The data
In September 1930
leader of the School Sisters of Notre Dame religious congregation requested each sister write
a short sketch of her own life. This account should not contain more than two to three hundred words and should be written on a single sheet of paper ... include the place of birth, parentage, interesting and edifying events of one's childhood, schools attended, influences that led to the convent, religious life, and its outstanding events.
Handwritten diaries found in two participating convents, Baltimore and Milwaukee
53. The linguistic analysis
Grammatical complexity
Developmental Level metric (Cheung/Kemper)
sentences classified from 0 (simple one-clause sentences) to 7 (complex sentences with multiple embedding and subordination)
Idea density:
average number of ideas expressed per 10 words. Ideas are elementary propositions, typically a verb, adjective, adverb, or prepositional phrase. Complex propositions that stated or inferred causal, temporal, or other relationships between ideas also were counted.
Prior studies suggest:
idea density is associated with educational level, vocabulary, and general knowledge
grammatical complexity is associated with working memory, performance on speeded tasks, and writing skill.
54. Idea density
I was born in Eau Claire, Wis., on May 24, 1913 and was baptized in St. James Church.
(1) I was born,
(2) born in Eau Claire, Wis.,
(3) born on May 24, 1913,
(4) I was baptized,
(5) was baptized in church
(6) was baptized in St. James Church,
(7) I was born...and was baptized.
There are 18 words or utterances in that sentence.
The idea density for that sentence was 3.9 (7/18 * 10 = 3.9 ideas per 10 words).
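The arithmetic for this sentence can be checked with a few lines. The proposition count (7) is taken from the hand analysis above; only the word count and the scaling to ideas per 10 words are computed, and whitespace splitting is an assumed way of counting words.

```python
def idea_density(num_ideas, num_words):
    """Ideas expressed per 10 words."""
    return 10.0 * num_ideas / num_words


sentence = (
    "I was born in Eau Claire, Wis., on May 24, 1913 "
    "and was baptized in St. James Church."
)
num_words = len(sentence.split())       # whitespace-delimited words
density = idea_density(7, num_words)    # 7 propositions counted by hand
print(num_words, round(density, 1))
```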
55. Results
Participants with neuropathologically defined Alzheimer's disease
had lower idea density scores than those without Alzheimer's
Correlations between idea density scores and mean neurofibrillary tangle counts
-0.59 for the frontal lobe,
-0.48 for the temporal lobe,
-0.49 for the parietal lobe
56. Explanations?
Early studies found the same results with a college-educated subset of the population who were teachers, suggesting education was not the key factor
They suggest:
Low linguistic ability in early life may reflect suboptimal neurological and cognitive development which might increase susceptibility to the development of Alzheimer's disease pathology in late life
57. Garrard et al. 2005
British writer Iris Murdoch
last novel published 1995,
Diagnosed with Alzheimer's 1997
Compared three novels
Under the Net (first)
The Sea, The Sea (in her prime)
Jackson's Dilemma (final novel)
All her books written in longhand with little editing
58. Type to token ratio in the 3 novels
59. Syntactic Complexity
60. Mean proportions of usages of the 10 most frequently occurring words in each book that appear twice within a series of short intervals, ranging from consecutive positions in the text to a separation of three intervening words.
Garrard P et al. Brain 2005;128:250-260
Brain Vol. 128 No. 2 Guarantors of Brain 2004; all rights reserved
61. Parts of speech
62. Comparative distributions of values of: (A) frequency and (B) word length in the three books.
Garrard P et al. Brain 2005;128:250-260
Brain Vol. 128 No. 2 Guarantors of Brain 2004; all rights reserved
63. From Under the Net, 1954
"So you may imagine how unhappy it makes me to have to cool my heels at Newhaven, waiting for the trains to run again, and with the smell of France still fresh in my nostrils. On this occasion, too, the bottles of cognac, which I always smuggle, had been taken from me by the Customs, so that when closing time came I was utterly abandoned to the torments of a morbid self-scrutiny."
From Jackson's Dilemma, 1995
"His beautiful mother had died of cancer when he was 10. He had seen her die. When he heard his father's sobs he knew. When he was 18, his younger brother was drowned. He had no other siblings. He loved his mother and his brother passionately. He had not got on with his father. His father, who was rich and played at being an architect, wanted Edward to be an architect too. Edward did not want to be an architect."
64. Lancashire and Hirst
Vocabulary Changes in Agatha Christie's Mysteries as an Indication of Dementia: A Case Study
Ian Lancashire and Graeme Hirst 2009
65. 66. Vocabulary Changes in Agatha Christie's Mysteries as an Indication of Dementia: A Case Study
Ian Lancashire and Graeme Hirst 2009
Examined all of Agatha Christie's novels
Features:
Nicholas, M., Obler, L. K., Albert, M. L., Helm-Estabrooks, N. (1985). Empty speech in Alzheimer's disease and fluent aphasia. Journal of Speech and Hearing Research, 28: 405-10.
Number of unique word types
Number of different repeated n-grams up to 5
Number of occurrences of "thing", "anything", and "something"
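These three features can be sketched for a single text as below, with the type-to-token ratio added for comparison with the Murdoch analysis. The tokenizer is a simple assumption, and so is the n-gram range of 2 to 5, since the slide does not say whether single words count. The original study's text sampling is not reproduced.

```python
import re
from collections import Counter

INDEFINITE = {"thing", "anything", "something"}


def vocabulary_features(text, min_n=2, max_n=5):
    """Count word types, distinct repeated n-grams, and indefinite words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    repeated = 0
    for n in range(min_n, max_n + 1):
        grams = Counter(
            tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
        # A "repeated" n-gram is a distinct n-gram seen more than once.
        repeated += sum(1 for count in grams.values() if count > 1)
    types = len(set(tokens))
    return {
        "tokens": len(tokens),
        "types": types,
        "type_token_ratio": types / len(tokens) if tokens else 0.0,
        "repeated_ngrams": repeated,
        "indefinite_words": sum(1 for t in tokens if t in INDEFINITE),
    }
```

Because type counts grow with text length, comparisons across novels are only meaningful on equal-sized samples.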
67. 68. Results
69. Topic 5: Writing and physical health
People asked to write about traumatic experiences
subsequently exhibit better physical health
than people asked to write about superficial topics
Intuition:
people who write about emotional topics report that the experiment makes them think differently about their experience.
Hypothesis:
Do changes in writing style correlate with improved health?
Could we find these changes automatically?
70. Singular Value Decomposition
Singular Value Decomposition (SVD) is a form of factor analysis
Any m × n matrix A can be written using an SVD of the form
A = U D V^T
where:
U is an m × n matrix (a hanger matrix)
D is an n × n diagonal matrix (a stretcher matrix)
V^T is an n × n matrix (an aligner matrix)
(see http://www.uwlax.edu/faculty/will/svd/index.html)
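A small numerical check of this factorization, using NumPy's `numpy.linalg.svd` with `full_matrices=False` so that the shapes match the ones above (for m ≥ n). The matrix values are arbitrary illustrative numbers.

```python
import numpy as np

# An arbitrary 3 x 2 example matrix (m = 3, n = 2).
A = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [0.0, 2.0]])

# With full_matrices=False, U is m x n, d holds the n singular values,
# and Vt is n x n.
U, d, Vt = np.linalg.svd(A, full_matrices=False)
D = np.diag(d)

reconstructed = U @ D @ Vt
print(U.shape, D.shape, Vt.shape)
print(np.allclose(A, reconstructed))
```

The singular values in `d` are returned in descending order, which is what makes truncating to the first few dimensions meaningful.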
71. Application of SVD to LSA
Assemble a large corpus of natural language
Parse corpus into meaningful passages
Form matrix with passages as rows and words as columns
SVD applied to re-represent the words and passages as vectors in a high-dimensional semantic space
72. SVD: an example (1) Titles of Technical Memos
c1: Human machine interface for ABC computer applications
c2: A survey of user opinion of computer system response time
c3: The EPS user interface management system
c4: System and human system engineering testing of EPS
c5: Relation of user perceived response time to error measurement
m1: The generation of random, binary, ordered trees
m2: The intersection graph of paths in trees
m3: Graph minors IV: Widths of trees and well-quasi-ordering
m4: Graph minors: A survey
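The nine memo titles can be run through the LSA steps as a sketch: build a passage-by-word count matrix, apply SVD, and keep the first two dimensions. The stopword list, the rule of keeping words that occur in at least two titles, and the choice of k = 2 are assumptions made for illustration. The titles are typed out as plain strings.

```python
import re
from collections import Counter

import numpy as np

TITLES = {
    "c1": "Human machine interface for ABC computer applications",
    "c2": "A survey of user opinion of computer system response time",
    "c3": "The EPS user interface management system",
    "c4": "System and human system engineering testing of EPS",
    "c5": "Relation of user perceived response time to error measurement",
    "m1": "The generation of random, binary, ordered trees",
    "m2": "The intersection graph of paths in trees",
    "m3": "Graph minors IV: Widths of trees and well-quasi-ordering",
    "m4": "Graph minors: A survey",
}
STOPWORDS = {"a", "and", "for", "in", "of", "the", "to"}  # assumed list


def tokenize(title):
    words = re.findall(r"[a-z]+", title.lower())
    return [w for w in words if w not in STOPWORDS]


docs = {name: tokenize(title) for name, title in TITLES.items()}

# Keep words that occur in at least two titles (assumed selection rule).
doc_freq = Counter(w for words in docs.values() for w in set(words))
vocab = sorted(w for w, df in doc_freq.items() if df >= 2)

# Passages as rows, words as columns.
X = np.array(
    [[words.count(w) for w in vocab] for words in docs.values()], dtype=float
)

U, d, Vt = np.linalg.svd(X, full_matrices=False)
k = 2  # number of dimensions kept (hypothetical choice)
passage_vectors = U[:, :k] * d[:k]   # one k-dimensional vector per title
word_vectors = Vt[:k, :].T * d[:k]   # one k-dimensional vector per word
X_k = passage_vectors @ Vt[:k, :]    # rank-k approximation of X
```

Titles or words can then be compared by the cosine of their reduced vectors rather than by raw word overlap.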