Beyond Bag of Words - Text Analytics World · Beyond Bag of Words! Taking Statistical Text Mining...
Transcript of Beyond Bag of Words - Text Analytics World · Beyond Bag of Words! Taking Statistical Text Mining...
![Page 1: Beyond Bag of Words - Text Analytics World · Beyond Bag of Words! Taking Statistical Text Mining to the Next Level Andrew Fast, Ph.D. Chief Scientist Elder Research Inc. 300 W Main](https://reader034.fdocuments.in/reader034/viewer/2022052015/602d010874a50d424c1ca8b8/html5/thumbnails/1.jpg)
Beyond Bag of Words!Taking Statistical Text Mining to the Next Level
Andrew Fast, Ph.D. Chief Scientist
Elder Research Inc. 300 W Main St., Suite 301
Charlottesville, Virginia 22903 (434) 973-7673
Text Analytics World, San Francisco, April 2013
![Page 2: Beyond Bag of Words - Text Analytics World · Beyond Bag of Words! Taking Statistical Text Mining to the Next Level Andrew Fast, Ph.D. Chief Scientist Elder Research Inc. 300 W Main](https://reader034.fdocuments.in/reader034/viewer/2022052015/602d010874a50d424c1ca8b8/html5/thumbnails/2.jpg)
Challenge of Text • Large amounts of unstructured textual data but
still need understanding
• Do not know all useful features in advance – Useful features are unknown – Or, too labor intensive to enumerate all features
• Combination of structured (numerical) and unstructured data
![Page 3: Beyond Bag of Words - Text Analytics World · Beyond Bag of Words! Taking Statistical Text Mining to the Next Level Andrew Fast, Ph.D. Chief Scientist Elder Research Inc. 300 W Main](https://reader034.fdocuments.in/reader034/viewer/2022052015/602d010874a50d424c1ca8b8/html5/thumbnails/3.jpg)
Complementary Strengths • Humans – Thoughtful – Nuance – Slow
• Machines – Repeatable – Brute Force – Fast
Text Mining
![Page 4: Beyond Bag of Words - Text Analytics World · Beyond Bag of Words! Taking Statistical Text Mining to the Next Level Andrew Fast, Ph.D. Chief Scientist Elder Research Inc. 300 W Main](https://reader034.fdocuments.in/reader034/viewer/2022052015/602d010874a50d424c1ca8b8/html5/thumbnails/4.jpg)
Focus on Document Classification
Are you interested in results about individual words or at a higher level
(i.e., sentences, paragraphs or documents)?
Do you want to sort all documents into
categories or search for specific documents ?
Do you want to automatically identify specific facts or gain
overall understanding?
Do you have categories already?
Are your documents independent or connected via
hyperlinks?
Information Retrieval
Web MiningDocument
Classification
DocumentClustering
InformationExtraction
Concept Extraction
ConnectedIndependent
Is your focus on the meaning of the text or the
structure?
Natural Language Processing
Structure Meaning
Text Mining Foundations
WordsDocuments
No Categories Have Categories
Search SortUnderstandingSpecific Facts
From Chapter 2 of:
![Page 5: Beyond Bag of Words - Text Analytics World · Beyond Bag of Words! Taking Statistical Text Mining to the Next Level Andrew Fast, Ph.D. Chief Scientist Elder Research Inc. 300 W Main](https://reader034.fdocuments.in/reader034/viewer/2022052015/602d010874a50d424c1ca8b8/html5/thumbnails/5.jpg)
Text Mining vs. Text Analytics
Text Mining Both Text Analy1cs
Approach Sta,s,cal Machinery Linguis,c Rules
Inputs Features of words and documents
Performance Equivalent (eventually)
Effort Crea,ng training data Tuning rule-‐sets
Strength Flexibility Human Understanding
Rule Generator Algorithmic Human
![Page 6: Beyond Bag of Words - Text Analytics World · Beyond Bag of Words! Taking Statistical Text Mining to the Next Level Andrew Fast, Ph.D. Chief Scientist Elder Research Inc. 300 W Main](https://reader034.fdocuments.in/reader034/viewer/2022052015/602d010874a50d424c1ca8b8/html5/thumbnails/6.jpg)
Naïve Bayes Classification • A statistical text mining algorithm for document
classification and categorization
– “Bag of Words” - Assumes all features are independent given the class labels
– No human insight or understanding into the text
– Allows for identification of unknown rules
– Extends to new datasets with minimal work
– Incorporates multiple kinds of evidence
– Fast!
Strengths Weaknesses
![Page 7: Beyond Bag of Words - Text Analytics World · Beyond Bag of Words! Taking Statistical Text Mining to the Next Level Andrew Fast, Ph.D. Chief Scientist Elder Research Inc. 300 W Main](https://reader034.fdocuments.in/reader034/viewer/2022052015/602d010874a50d424c1ca8b8/html5/thumbnails/7.jpg)
“Bag of Words” Limits
“She can refuse to overlook our row,” he moped, “unless I entrance her with the right present: a hit!”
Her moped is presently right at the entrance to the overlook; she had hit a row of refuse cans!
![Page 8: Beyond Bag of Words - Text Analytics World · Beyond Bag of Words! Taking Statistical Text Mining to the Next Level Andrew Fast, Ph.D. Chief Scientist Elder Research Inc. 300 W Main](https://reader034.fdocuments.in/reader034/viewer/2022052015/602d010874a50d424c1ca8b8/html5/thumbnails/8.jpg)
Success Stories • Customer Satisfaction (sentiment analysis) for a
major insurance company • Predicting churn of mobile phone customers for
nTelos
• Disability Approval for the Social Security Administration
![Page 9: Beyond Bag of Words - Text Analytics World · Beyond Bag of Words! Taking Statistical Text Mining to the Next Level Andrew Fast, Ph.D. Chief Scientist Elder Research Inc. 300 W Main](https://reader034.fdocuments.in/reader034/viewer/2022052015/602d010874a50d424c1ca8b8/html5/thumbnails/9.jpg)
• Big Value
• Rich meaning
• Large collection of historical data
• Multiple meanings • Mixed messages
• Messy text
• Short text
• Both structured data and unstructured text
Shared Characteristics
![Page 10: Beyond Bag of Words - Text Analytics World · Beyond Bag of Words! Taking Statistical Text Mining to the Next Level Andrew Fast, Ph.D. Chief Scientist Elder Research Inc. 300 W Main](https://reader034.fdocuments.in/reader034/viewer/2022052015/602d010874a50d424c1ca8b8/html5/thumbnails/10.jpg)
“Bag of Words” • Assumption that each word occurrence is
independent as if drawn from a bag – Context and word order do not matter!
• Transform each document into a feature vector for input into statistical modeling algorithms
• Extremely high-dimensional space – Typically one dimension per word
1 1 1 1 1 1 0 0 0 0 1 1 0 1 0 1 1 1 0 0 1 1
1 0 1 0 1 1 1 0 0 1 1
![Page 11: Beyond Bag of Words - Text Analytics World · Beyond Bag of Words! Taking Statistical Text Mining to the Next Level Andrew Fast, Ph.D. Chief Scientist Elder Research Inc. 300 W Main](https://reader034.fdocuments.in/reader034/viewer/2022052015/602d010874a50d424c1ca8b8/html5/thumbnails/11.jpg)
Effectiveness of Bag of Words
Document Length
Level of Nuance
![Page 12: Beyond Bag of Words - Text Analytics World · Beyond Bag of Words! Taking Statistical Text Mining to the Next Level Andrew Fast, Ph.D. Chief Scientist Elder Research Inc. 300 W Main](https://reader034.fdocuments.in/reader034/viewer/2022052015/602d010874a50d424c1ca8b8/html5/thumbnails/12.jpg)
Beyond Bag of Words
• Focus on methods for improving the accuracy of statistical document classification by transforming the feature vector creation process
1 1 1 1 1 1 0 1 0 1 0 1 1 1
1 0 1 0 1 1 1
Raw Text Feature Vectors (1)
Multi-Word Phrases
(2) Concept
Reduction
(3) Term
Weighting
(4) Combining Evidence
![Page 13: Beyond Bag of Words - Text Analytics World · Beyond Bag of Words! Taking Statistical Text Mining to the Next Level Andrew Fast, Ph.D. Chief Scientist Elder Research Inc. 300 W Main](https://reader034.fdocuments.in/reader034/viewer/2022052015/602d010874a50d424c1ca8b8/html5/thumbnails/13.jpg)
Multi-Word Phrases
• Biggest exception for the bag of words assumption
• Also known as collocations
Raw Text (1)
Multi-Word Phrases
![Page 14: Beyond Bag of Words - Text Analytics World · Beyond Bag of Words! Taking Statistical Text Mining to the Next Level Andrew Fast, Ph.D. Chief Scientist Elder Research Inc. 300 W Main](https://reader034.fdocuments.in/reader034/viewer/2022052015/602d010874a50d424c1ca8b8/html5/thumbnails/14.jpg)
Examples of multi-word phrases • “President of the United States” • “moving violation” • “homeowners insurance” • “Lou Gehrig’s Disease” • “Learning Disability” • “Blackberry Pearl Flip” • “HTC Desire”
![Page 15: Beyond Bag of Words - Text Analytics World · Beyond Bag of Words! Taking Statistical Text Mining to the Next Level Andrew Fast, Ph.D. Chief Scientist Elder Research Inc. 300 W Main](https://reader034.fdocuments.in/reader034/viewer/2022052015/602d010874a50d424c1ca8b8/html5/thumbnails/15.jpg)
Multi-Word Phrase Detection
Raw Text (1)
Multi-Word Phrases
List-‐based Rule-‐based Sta,s,cal
![Page 16: Beyond Bag of Words - Text Analytics World · Beyond Bag of Words! Taking Statistical Text Mining to the Next Level Andrew Fast, Ph.D. Chief Scientist Elder Research Inc. 300 W Main](https://reader034.fdocuments.in/reader034/viewer/2022052015/602d010874a50d424c1ca8b8/html5/thumbnails/16.jpg)
Case Study: nTelos Wireless • Task: Identify customers likely to leave the
network (“churn”)
• Data: Combination of structured and unstructured data – Text consists of customer service rep notes from
customer interaction including summaries of customer responses.
• Key Insight: The model of the customers phone is major factor in customer satisfaction.
![Page 17: Beyond Bag of Words - Text Analytics World · Beyond Bag of Words! Taking Statistical Text Mining to the Next Level Andrew Fast, Ph.D. Chief Scientist Elder Research Inc. 300 W Main](https://reader034.fdocuments.in/reader034/viewer/2022052015/602d010874a50d424c1ca8b8/html5/thumbnails/17.jpg)
List-based Collocation • Used a list of supported phones to group full
phone models together
Size
Less frequent
Color
Negative Positive Neutral More frequent
![Page 18: Beyond Bag of Words - Text Analytics World · Beyond Bag of Words! Taking Statistical Text Mining to the Next Level Andrew Fast, Ph.D. Chief Scientist Elder Research Inc. 300 W Main](https://reader034.fdocuments.in/reader034/viewer/2022052015/602d010874a50d424c1ca8b8/html5/thumbnails/18.jpg)
Overall Results: nTelos • Adding textual features to structured data resulted
in a 3.1% increase in prediction accuracy on hold-out data.
• Brand alone is churn neutral.
• Certain older phone models tied strongly to churn
• Also, customer provided equipment (CPE) is tied strongly to churn
![Page 19: Beyond Bag of Words - Text Analytics World · Beyond Bag of Words! Taking Statistical Text Mining to the Next Level Andrew Fast, Ph.D. Chief Scientist Elder Research Inc. 300 W Main](https://reader034.fdocuments.in/reader034/viewer/2022052015/602d010874a50d424c1ca8b8/html5/thumbnails/19.jpg)
Challenge: Unknown Phrases • What happens when the helpful phrases are not
known in advance? – Or, taxonomy too extensive to use efficiently?
• SSA: medical concerns and diseases
• Insurance: positive and negative sentiment
![Page 20: Beyond Bag of Words - Text Analytics World · Beyond Bag of Words! Taking Statistical Text Mining to the Next Level Andrew Fast, Ph.D. Chief Scientist Elder Research Inc. 300 W Main](https://reader034.fdocuments.in/reader034/viewer/2022052015/602d010874a50d424c1ca8b8/html5/thumbnails/20.jpg)
• Use regular expressions with Part-of-Speech tags
• Focused on Noun-based patterns
Collocations using Rules Start
Noun1 Adj
Conj.
No Yes
Det.
Prep.
JJ*
N*JJ*
N*
*
JJ*
Noun2
N*IN
N*
*
CC
N*
*N*
JJ*
DT
N**
*
*
N* -‐> Nouns JJ* -‐> Adjec,ves CC -‐> Coordina,ng Conjunc,on DT -‐> Determiner IN -‐> Preposi,on
[Justeson and Katz 1995]
![Page 21: Beyond Bag of Words - Text Analytics World · Beyond Bag of Words! Taking Statistical Text Mining to the Next Level Andrew Fast, Ph.D. Chief Scientist Elder Research Inc. 300 W Main](https://reader034.fdocuments.in/reader034/viewer/2022052015/602d010874a50d424c1ca8b8/html5/thumbnails/21.jpg)
Collocations Using Statistics • Use inductive
modeling approach 1. Use analysts to label
training documents 2. Build a statistical
model from labeled data
3. Apply model to new data
21
![Page 22: Beyond Bag of Words - Text Analytics World · Beyond Bag of Words! Taking Statistical Text Mining to the Next Level Andrew Fast, Ph.D. Chief Scientist Elder Research Inc. 300 W Main](https://reader034.fdocuments.in/reader034/viewer/2022052015/602d010874a50d424c1ca8b8/html5/thumbnails/22.jpg)
Method Comparison • you could lower my rates. I'm an excellent driver
with no accidents or moving violations.
• $COMPANY$'s homeowner's coverage is good and priced reasonably. But $COMPANY$'s auto coverage is, at best, average yet overpriced. The same applies to our motorcycle policy - average yet overpriced.
Statistical Only Rule-based Only Both
![Page 23: Beyond Bag of Words - Text Analytics World · Beyond Bag of Words! Taking Statistical Text Mining to the Next Level Andrew Fast, Ph.D. Chief Scientist Elder Research Inc. 300 W Main](https://reader034.fdocuments.in/reader034/viewer/2022052015/602d010874a50d424c1ca8b8/html5/thumbnails/23.jpg)
Entity Extraction • Entity Extraction is a specialized form of multi-
word phrase detection – Limited to proper names, organizations, and places.
• Often more accurate than general purpose detectors for names
• Both rule-based and statistical approaches.
![Page 24: Beyond Bag of Words - Text Analytics World · Beyond Bag of Words! Taking Statistical Text Mining to the Next Level Andrew Fast, Ph.D. Chief Scientist Elder Research Inc. 300 W Main](https://reader034.fdocuments.in/reader034/viewer/2022052015/602d010874a50d424c1ca8b8/html5/thumbnails/24.jpg)
Concept Reduction
• Combine multiple expressions of the same concept into a single term
Raw Text (1)
Multi-Word Phrases
(2) Concept
Reduction
![Page 25: Beyond Bag of Words - Text Analytics World · Beyond Bag of Words! Taking Statistical Text Mining to the Next Level Andrew Fast, Ph.D. Chief Scientist Elder Research Inc. 300 W Main](https://reader034.fdocuments.in/reader034/viewer/2022052015/602d010874a50d424c1ca8b8/html5/thumbnails/25.jpg)
Concept Reduction • Replace tokens representing the same concept
with a standard token – Synonyms – Dates – Specific proper nouns – Abbreviation Expansion
• How do you do this? – Lists, Regexes, Rules – Automatic synonym detection
![Page 26: Beyond Bag of Words - Text Analytics World · Beyond Bag of Words! Taking Statistical Text Mining to the Next Level Andrew Fast, Ph.D. Chief Scientist Elder Research Inc. 300 W Main](https://reader034.fdocuments.in/reader034/viewer/2022052015/602d010874a50d424c1ca8b8/html5/thumbnails/26.jpg)
Word Clustering • Uses similar techniques to document clustering • …but uses a “term-context” vector instead.
dolphin
homew
ork
sand
wich
0
my
ate
a cat
dog
the
0
1
1
0
1
1
0
0
0
1
0
2
1
0
1
0
1 Document 2
Document 1
1 1 0 0 1 1 0 0 1 Document 3
Document 1: My dog ate my homework. Document 2: My cat ate the sandwich. Document 3: A dolphin ate the homework.
sandwich
0
my
ate
a cat
dog
dolphin
homew
ork
sand
wich
the
0
0
1
0
1
0
0
0
0
0
0
1
0
0
0
1
0
0
1
0
0
2
0
0
2
1
0
0
0
1
1
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
1
0
0
0
0
2
0
0
1
0
0
1
0
0
0
0
0
1
0
1
0
0
0
0
0
1
0
0
0
my
ate a
cat dog
dolphin
homework
the
![Page 27: Beyond Bag of Words - Text Analytics World · Beyond Bag of Words! Taking Statistical Text Mining to the Next Level Andrew Fast, Ph.D. Chief Scientist Elder Research Inc. 300 W Main](https://reader034.fdocuments.in/reader034/viewer/2022052015/602d010874a50d424c1ca8b8/html5/thumbnails/27.jpg)
Concept Extraction
• Automatically detect conceptually related words using statistical clustering
cancer
germ cell tumor
gastric cancer
progression
osteoarthritis
radiation therapy
hodgkin lymphoma
staginglung cancer
cancers
risk factor
adjuvant chemotherapyovarian cancer
breast cancer
endometrial cancer
tumors
pancreatic cancer
colorectal cancer
breast
malignantprostate
malignancy
prostate cancer
advanced
bladder cancer
adenocarcinoma
![Page 28: Beyond Bag of Words - Text Analytics World · Beyond Bag of Words! Taking Statistical Text Mining to the Next Level Andrew Fast, Ph.D. Chief Scientist Elder Research Inc. 300 W Main](https://reader034.fdocuments.in/reader034/viewer/2022052015/602d010874a50d424c1ca8b8/html5/thumbnails/28.jpg)
Case Study: Satisfaction Survey • Task: Identify sentiment and level of satisfaction from
combination of numerical and textual survey data for an insurance company
• Data: Combination of structured and unstructured data – Text consists of open-ended comments supplied by the
customer
• Key Insight: Not all negative sentiment about the company, the survey itself was a major source of negative sentiment!
![Page 29: Beyond Bag of Words - Text Analytics World · Beyond Bag of Words! Taking Statistical Text Mining to the Next Level Andrew Fast, Ph.D. Chief Scientist Elder Research Inc. 300 W Main](https://reader034.fdocuments.in/reader034/viewer/2022052015/602d010874a50d424c1ca8b8/html5/thumbnails/29.jpg)
company
Understanding Customer Behavior
Copyright © 2012 Elder Research, Inc. 29
![Page 30: Beyond Bag of Words - Text Analytics World · Beyond Bag of Words! Taking Statistical Text Mining to the Next Level Andrew Fast, Ph.D. Chief Scientist Elder Research Inc. 300 W Main](https://reader034.fdocuments.in/reader034/viewer/2022052015/602d010874a50d424c1ca8b8/html5/thumbnails/30.jpg)
Statistical Categorization Summary Area Posi1ve
Experience Posi1ve No
Experience Neutral Mixed Sugges1on Nega1ve Nega1ve
Experience Off-‐Topic
Agency 1.4% 61.1% 17.8% 7.7% 0.3% 0.1% 9.2% 1.5% 0.9%
Billing 1.0% 47.5% 16.4% 18.5% 0.3% 2.6% 10.1% 1.2% 2.3%
Claims 2.0% 33.3% 40.7% 16.0% 0.6% 0.0% 3.4% 2.5% 1.5%
Coverage 1.0% 43.3% 14.2% 20.7% 1.0% 0.6% 14.4% 0.3% 4.5%
Onboarding 0.7% 29.2% 46.5% 16.0% 0.1% 0.0% 3.7% 0.4% 3.4%
Phone 0.7% 28.3% 20.2% 45.6% 0.1% 0.0% 2.2% 0.4% 2.5%
Policy 0.5% 42.3% 27.5% 22.1% 0.0% 0.3% 3.7% 0.4% 3.2%
Premium 0.9% 39.4% 26.2% 20.5% 0.0% 0.2% 9.5% 0.2% 3.1%
Renewal 0.7% 49.3% 18.6% 18.2% 0.1% 0.5% 8.4% 0.8% 3.2%
Value/Price 0.6% 38.1% 14.6% 16.6% 0.1% 0.4% 27.4% 0.3% 1.9%
Website 1.4% 30.8% 19.3% 40.5% 0.0% 0.5% 4.9% 0.1% 2.3%
![Page 31: Beyond Bag of Words - Text Analytics World · Beyond Bag of Words! Taking Statistical Text Mining to the Next Level Andrew Fast, Ph.D. Chief Scientist Elder Research Inc. 300 W Main](https://reader034.fdocuments.in/reader034/viewer/2022052015/602d010874a50d424c1ca8b8/html5/thumbnails/31.jpg)
Term Weighting
• Focus on methods for improving the accuracy of statistical document classification by transforming the feature vector creation process
Raw Text (1)
Multi-Word Phrases
(2) Concept
Reduction
(3) Term
Weighting
![Page 32: Beyond Bag of Words - Text Analytics World · Beyond Bag of Words! Taking Statistical Text Mining to the Next Level Andrew Fast, Ph.D. Chief Scientist Elder Research Inc. 300 W Main](https://reader034.fdocuments.in/reader034/viewer/2022052015/602d010874a50d424c1ca8b8/html5/thumbnails/32.jpg)
Challenge: Weighting Scores • How do you weight rare concepts?
• Naïve Bayes: – Count number of times a word is associated with a
particular outcome – Leads to skewed weights on rare concepts
![Page 33: Beyond Bag of Words - Text Analytics World · Beyond Bag of Words! Taking Statistical Text Mining to the Next Level Andrew Fast, Ph.D. Chief Scientist Elder Research Inc. 300 W Main](https://reader034.fdocuments.in/reader034/viewer/2022052015/602d010874a50d424c1ca8b8/html5/thumbnails/33.jpg)
Case Study: SSA Disability Approval • Pain: Approval process is long, bureaucratic
• Goal: Fast-track “easy” cases • Challenge: Free-text on disability application • Result: 20% of Approvals possible immediately
and with greater consistency
33
Up to 2 Years ! 1/3 of cases
eventually approved
1/2 of appeals overturn original decision
With Text Mining, 1/5 of cases approved
immediately!
![Page 34: Beyond Bag of Words - Text Analytics World · Beyond Bag of Words! Taking Statistical Text Mining to the Next Level Andrew Fast, Ph.D. Chief Scientist Elder Research Inc. 300 W Main](https://reader034.fdocuments.in/reader034/viewer/2022052015/602d010874a50d424c1ca8b8/html5/thumbnails/34.jpg)
Weighting with a Prior • Draw from Bayesian statistics and smooth the raw count
with an empirical prior – Use baseline probability of the most probable classification
• For SSA, roughly 33% of applications approved – Counts for each word are initialized with the baseline probability
• Known as Shrinkage, James-Stein Estimator, Ridge Regression, etc.
• Hypothetical Example: Multiple Myeloma – Appears 5 times, 4 times was approved = 80% predicted weight – With prior of 33%, now weight is 5/8 = 62.5% predicted weight
• More data outweighs the prior
![Page 35: Beyond Bag of Words - Text Analytics World · Beyond Bag of Words! Taking Statistical Text Mining to the Next Level Andrew Fast, Ph.D. Chief Scientist Elder Research Inc. 300 W Main](https://reader034.fdocuments.in/reader034/viewer/2022052015/602d010874a50d424c1ca8b8/html5/thumbnails/35.jpg)
Combining Evidence
• How to score documents so that “strong” words are emphasized and “weak” words are ignored
1 1 1 1 1 1 0 1 0 1 0 1 1 1
1 0 1 0 1 1 1
Raw Text Feature Vectors (1)
Multi-Word Phrases
(2) Concept
Reduction
(3) Term
Weighting
(4) Combining Evidence
![Page 36: Beyond Bag of Words - Text Analytics World · Beyond Bag of Words! Taking Statistical Text Mining to the Next Level Andrew Fast, Ph.D. Chief Scientist Elder Research Inc. 300 W Main](https://reader034.fdocuments.in/reader034/viewer/2022052015/602d010874a50d424c1ca8b8/html5/thumbnails/36.jpg)
Example • “Multiple Myeloma I have been diagnosed with
Multiple Myeloma (cancer of the bone marrow) and am currently undergoing treatment to prepare me for an autologous stem cell transplant. There has been a brain tumor associated with this, for which I have had....”
![Page 37: Beyond Bag of Words - Text Analytics World · Beyond Bag of Words! Taking Statistical Text Mining to the Next Level Andrew Fast, Ph.D. Chief Scientist Elder Research Inc. 300 W Main](https://reader034.fdocuments.in/reader034/viewer/2022052015/602d010874a50d424c1ca8b8/html5/thumbnails/37.jpg)
Combining Weights • Common aggregations don’t match medical
domain requirements – SUM: many symptoms increases probability of
predicting approval – MAX: ignores multiple serious symptoms – AVG: minor symptoms water down major symptoms
• Naïve Bayes uses maximum a priori (MAP) approach – All evidence combined equally
![Page 38: Beyond Bag of Words - Text Analytics World · Beyond Bag of Words! Taking Statistical Text Mining to the Next Level Andrew Fast, Ph.D. Chief Scientist Elder Research Inc. 300 W Main](https://reader034.fdocuments.in/reader034/viewer/2022052015/602d010874a50d424c1ca8b8/html5/thumbnails/38.jpg)
Our approach for SSA If (no data), then use prior Else If (max(probability) < 0.5) then use that max. Else:
i. Ignore concepts with probability < 0.5ii. Combine the remaining ones with a log-
likelihood formula and use the resulting joint probability.
![Page 39: Beyond Bag of Words - Text Analytics World · Beyond Bag of Words! Taking Statistical Text Mining to the Next Level Andrew Fast, Ph.D. Chief Scientist Elder Research Inc. 300 W Main](https://reader034.fdocuments.in/reader034/viewer/2022052015/602d010874a50d424c1ca8b8/html5/thumbnails/39.jpg)
Example Weights
![Page 40: Beyond Bag of Words - Text Analytics World · Beyond Bag of Words! Taking Statistical Text Mining to the Next Level Andrew Fast, Ph.D. Chief Scientist Elder Research Inc. 300 W Main](https://reader034.fdocuments.in/reader034/viewer/2022052015/602d010874a50d424c1ca8b8/html5/thumbnails/40.jpg)
SSA Solution
• Convert individual words and phrases to features • Exploit custom method for combining evidence from
multiple features • Text classifier accuracy equivalent to committee of
experts 30% baseline -> 90% model accuracy
40
fractured mesothelioma
cast metastized
pain leg
doctor collision
ankle pancreatic cancer
Claims
![Page 41: Beyond Bag of Words - Text Analytics World · Beyond Bag of Words! Taking Statistical Text Mining to the Next Level Andrew Fast, Ph.D. Chief Scientist Elder Research Inc. 300 W Main](https://reader034.fdocuments.in/reader034/viewer/2022052015/602d010874a50d424c1ca8b8/html5/thumbnails/41.jpg)
Summary
• Described methods for improving the accuracy of statistical document classification by transforming the feature vector creation process
1 1 1 1 1 1 0 1 0 1 0 1 1 1
1 0 1 0 1 1 1
Raw Text Feature Vectors (1)
Multi-Word Phrases
(2) Concept
Reduction
(3) Term
Weighting
(4) Combining Evidence
![Page 42: Beyond Bag of Words - Text Analytics World · Beyond Bag of Words! Taking Statistical Text Mining to the Next Level Andrew Fast, Ph.D. Chief Scientist Elder Research Inc. 300 W Main](https://reader034.fdocuments.in/reader034/viewer/2022052015/602d010874a50d424c1ca8b8/html5/thumbnails/42.jpg)
42
Contact Information
Andrew Fast, Ph.D. Chief Scientist
(434) 973-7673 www.datamininglab.com
![Page 43: Beyond Bag of Words - Text Analytics World · Beyond Bag of Words! Taking Statistical Text Mining to the Next Level Andrew Fast, Ph.D. Chief Scientist Elder Research Inc. 300 W Main](https://reader034.fdocuments.in/reader034/viewer/2022052015/602d010874a50d424c1ca8b8/html5/thumbnails/43.jpg)
Practical Text Mining • Winner of the 2012
PROSE award for Computing and Information Science
• Written for a technical audience seeking more text experience
• Includes trial versions of major software tools
![Page 44: Beyond Bag of Words - Text Analytics World · Beyond Bag of Words! Taking Statistical Text Mining to the Next Level Andrew Fast, Ph.D. Chief Scientist Elder Research Inc. 300 W Main](https://reader034.fdocuments.in/reader034/viewer/2022052015/602d010874a50d424c1ca8b8/html5/thumbnails/44.jpg)
44
Andrew Fast!Chief Scientist, Elder Research, Inc.
Dr. Fast graduated Magna Cum Laude from Bethel University and earned Master’s and Ph.D. degrees in Computer Science from the University of Massachusetts Amherst. There, his research focused on causal data mining and mining complex relational data such as social networks. At ERI, Andrew leads the development of new tools and algorithms for data and text mining for applications of capabilities assessment, fraud detection, and national security. Dr. Fast has published on an array of applications including detecting securities fraud using the social network among brokers, and understanding the structure of criminal and violent groups. Other publications cover modeling peer-to-peer music file sharing networks, understanding how collective classification works, and predicting playoff success of NFL head coaches (work featured on ESPN.com). With John Elder and other co-authors, Andrew has written a book on Practical Text Mining, that was awarded the prose Award for Computing and Information Science in 2012.
Dr. Andrew Fast leads research in Text Mining and Social Network Analysis at Elder Research, the nation’s leading data mining consultancy. ERI was founded in 1995 and has offices in Charlottesville VA and Washington DC,(www.datamininglab.com). ERI focuses on Federal, commercial, investment, and security applications of advanced analytics, including stock selection, image recognition, biometrics, process optimization, cross-selling, drug efficacy, credit scoring, risk management, and fraud detection.