ACRE Text Categorization Platform
Greg Brewster, PhD
Associate Professor, School of Computing
CTO, Vertical Data LLC
CSC 594 – Text Mining and Analytics
October 15, 2015
ACRE is…
The Auto-Categorization and Retrieval Engine:
• A scalable text document labeling system
• Python (command line) and web2py (web version in development)
• A product (soon) of Vertical Data, LLC; patent pending
• A research project on hybrid text classification methods and medical NLP
Unstructured Text
Businesses must monitor and manage a massive inflow of unstructured text:
• Documents
• Social Media
• Big Data
What Everybody Wants…
Unstructured Text => Structured Data
This can be done by labeling text.
(The term ties in with network technologies – Multi-Protocol Label Switching (MPLS).)
What Labels Let You Do
• Filtering
• Data Routing
• Monitor / Notify
• Auto-Categorization
• Clustering
• Prioritization
• Etc.
Label Models
An ACRE Label Model analyzes each text item and selects one or more values from a list or Label Tree (taxonomy), using:
• Natural Language Processing – stemming, part-of-speech analysis, named entity extraction, synonym substitutions, drop/go lists
• Pattern Extraction / Rules – keyword/pattern matches determine the label value
• Machine Learning – similarity to Trained Word Clouds determines the values
Executing one Label Model adds one column of results (labels) to the Label Table.

Model Execution Results
Each Label Model adds a column: Survey Results => ACRE => Labeled Survey Results
Models Executed:
1. Topic
2. Sentiment
3. Before_Room
4. Alarm_Words
Why ACRE?
• Adds value! => Creates new structured data from existing unstructured data
• Designed from the ground up for text analysis, as opposed to numerical analytics with a text add-on
• Human-guided analysis
• Intuitive model definition, use and results
• Combined/extended models
• Iterative model improvement
• Fast prototyping
• Scalable
Hierarchical Labeling: Label Value Trees
• Each text item is labeled with one or more leaf values
• Hierarchical evaluation during the ML stage
• Tree model for aggregated results visualization
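A label value tree can be pictured as a nested structure whose leaves are the assignable labels. The sketch below is a minimal illustration of that idea, not ACRE's actual schema; the tree contents and function names are invented for the example.

```python
# A minimal sketch of a label value tree: internal nodes group related
# labels, and each text item receives one or more leaf values.
# Tree shape and category names are illustrative only.
TOPIC_TREE = {
    "Room": {"Bed": {}, "Bath": {}},
    "Service": {"Front Desk": {}, "Housekeeping": {}},
}

def leaf_values(tree, prefix=()):
    """Yield the path to every leaf label in the tree."""
    for name, subtree in tree.items():
        path = prefix + (name,)
        if subtree:
            yield from leaf_values(subtree, path)
        else:
            yield path

leaves = [" / ".join(p) for p in leaf_values(TOPIC_TREE)]
# leaves == ['Room / Bed', 'Room / Bath',
#            'Service / Front Desk', 'Service / Housekeeping']
```

Aggregated results can then be rolled up the same paths for visualization, which is what the tree model on this slide refers to.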
Labeling and Indexing
Indexing for search has undergone massive growth:
• 1980s/90s – Isolated topic-specific search engines
• 1990s/00s – Spiders index the WWW – Google search
• Today – Continuous indexing by the OS
Labeling is poised for similar growth and ubiquitous deployment:
• Today – Isolated auto-classification within specific applications
• Tomorrow – Auto-labeling of all enterprise data for enhanced search, data filtering and monitoring
Previous Label Model Examples
• Sentiment Model – possible labels = {Positive, Negative, Neutral}
• Language Model – {English, Spanish, French}
• Retention Model – {Discard, Retain, Discoverable, Legal Hold}
• Emotion Model – {Happy, Satisfied, Enthusiastic, Complaining, Angry, Threatening, Violent}
• Themes Model, Spam Model, Threat Analysis Model
Model Library
The Model Library contains tested Label Models. Users can copy models and edit them for their own use, and can also contribute models.
Label Model Library

Label Name | Description | Label Values
Retention | Should data be retained and for how long? | Discard, Retain, Retain: 30 days, Retain: 90 days
Sentiment | Overall tone of text | Positive, Neutral, Negative
Emotion | Identifies presence of words with high emotional content | Angry, Anxious, Sad, Happy, Grateful, Enthusiastic, Threatening, None
HashTags | List of hashtags ("#tag") in text | Extracted with pattern "#text"
Affinities | Identifies common user interests | MLB, NBA, NFL, Golf, Cars, Shopping, New York, Cooking, Technology, etc.
Social Media Labeling
5 Label Models / 4 Stakeholders
ACRE for Web (Nov. 2015)

ACRE with SaaS (2016)
Deploy local or cloud. BigML for Text!
Origin of ACRE
Dr. Peter Jackson’s work pioneered legal document classification at Thomson Reuters in the 1990s, eliminating >90% of human effort.
ACRE additions:
• Combined ES-ML modeling
• “Vertical” expertise: Medical, Security
Why is Text Analytics Hard?
• Lots of domain-specific details
• Many models are ad hoc and not very reusable
• Hard to generalize techniques
• Extensive dictionaries and topic lists often required
• Requires 2 experts: a modeler and a subject expert
  • These two are often far apart in knowledge base
  • Requires a lot of time from both
• Multiple techniques used – often in combination: NLP, Search / Rules, Machine Learning
Combining ES – ML Models
Combined models can reduce effort and yield improved accuracy:
• ES Model trains ML Model – each labeling result from the ES model is used as a training instance for the ML model
• ES Model fails over to ML Model – evaluate each text input with the ES model first, then fail over to the ML model if there is no ES result
• Iterative improvement – add ES rules to improve ML model results incrementally
• (**Next steps**) ML Model => Rules
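The ES-to-ML failover can be sketched in a few lines. This is an illustration of the control flow only: both model functions below are toy stand-ins, not ACRE code.

```python
# Sketch of ES -> ML failover: try the expert-system (rules) model
# first; if no rule fires, fall back to the ML model.
def es_model(text):
    """Toy rules model: returns a label, or None when no rule matches."""
    if "great" in text.lower():
        return "Positive"
    if "awful" in text.lower():
        return "Negative"
    return None  # no ES result -> fail over

def ml_model(text):
    """Stand-in for the trained ML model (always returns a label)."""
    return "Neutral"

def label_with_failover(text):
    return es_model(text) or ml_model(text)

print(label_with_failover("Great stay!"))   # Positive (ES rule fired)
print(label_with_failover("It was fine."))  # Neutral (ML failover)
```

The same ES results can double as training instances for the ML model, which is the "ES Model trains ML Model" variant above.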
ACRE Labeling Process
Pattern Models
Metadata Extraction by Pattern: the Decision Manager defines “Patterns”, which are:
• A context pattern containing an extraction pattern in parentheses
• If the context pattern is matched, the extraction pattern match becomes the label value
Examples:
• Extract social security numbers – Pattern: “(\d{3}-\d{2}-\d{4})”
• Extract hashtags – Pattern: “(#\w*)”
• Extract the word before “Room” – Pattern: “(\w*) Room”
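These patterns map directly onto Python regular expressions, where the parenthesized group is the extracted label value. A minimal sketch (the `extract` helper and example strings are invented for illustration):

```python
import re

# The whole regex is the context pattern; the parenthesized group is
# the extraction pattern whose match becomes the label value.
PATTERNS = {
    "SSN":         r"(\d{3}-\d{2}-\d{4})",
    "HashTag":     r"(#\w*)",
    "Before_Room": r"(\w*) Room",
}

def extract(label, text):
    """Return all extraction-pattern matches for the given label."""
    return re.findall(PATTERNS[label], text)

print(extract("HashTag", "Loving the #conference wifi #nlp"))
# ['#conference', '#nlp']
print(extract("Before_Room", "Meet in the Green Room at noon"))
# ['Green']
```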
Hashtag Extraction Results
Current ACRE Tools: Rules
Rules associate text patterns with a label value. When text is processed, a rule match assigns the corresponding label value.
Rules Definition File
• Patterns are regular expressions connected by booleans
• If multiple patterns match in a given document, then:
  • If option Multi_Values = True, all matched values are retained
  • If option Multi_Values = False, the value with the greatest sum of matched rule weights is selected

Category | Value | Pattern | Weight
Sentiment | Positive | Good | 1
Sentiment | Positive | Great | 1.5
Sentiment | Positive | "Not bad" | 0.4
Sentiment | Positive | Gr[eio].* | 1
Sentiment | Positive | Best & Good | 1.7
Sentiment | Negative | Terrible | 2
Sentiment | Negative | Awful | 1.5
Sentiment | Negative | unhappy | 1
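The weighted-selection semantics above can be sketched as follows. This is an assumption-laden illustration, not ACRE's implementation: it reuses the Sentiment rows from the table but omits boolean-connected patterns such as "Best & Good", and the `label_text` function is invented for the example.

```python
import re

# (value, regex pattern, weight) rules, mirroring the Sentiment table.
RULES = [
    ("Positive", r"Good",      1.0),
    ("Positive", r"Great",     1.5),
    ("Positive", r"Not bad",   0.4),
    ("Positive", r"Gr[eio].*", 1.0),
    ("Negative", r"Terrible",  2.0),
    ("Negative", r"Awful",     1.5),
    ("Negative", r"unhappy",   1.0),
]

def label_text(text, multi_values=False):
    """Sum weights of matched rules per value, then apply Multi_Values."""
    scores = {}
    for value, pattern, weight in RULES:
        if re.search(pattern, text, re.IGNORECASE):
            scores[value] = scores.get(value, 0.0) + weight
    if not scores:
        return None
    if multi_values:
        return sorted(scores)           # Multi_Values = True: keep all
    return max(scores, key=scores.get)  # False: greatest summed weight wins

print(label_text("Great service but awful food", multi_values=True))
# ['Negative', 'Positive']
print(label_text("Great service but awful food"))
# 'Positive'  (Positive sums 1.5 + 1.0, Negative only 1.5)
```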
Results Report (CSV)
ACRE Machine Learning
Current: Nearest Word Cloud
• The Machine Learning Model is “trained” on example text for each label
• ACRE applies NLP processing to each text item to isolate the best terms to be used in analysis
• The model stores a “Trained Word Cloud” (aggregated term frequencies calculated from training) for each label value
• Combined model option: train the ML model using Rule matches
• To choose a label for new text, ACRE calculates a Confidence value for each possible label, based on vector similarity between the new text’s cloud and each Trained Word Cloud
• Next: the ML vector can be constructed from any data fields (structured or unstructured)
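A minimal sketch of the Nearest Word Cloud idea, assuming cosine similarity as the vector similarity measure (one of the measures the Tuning Parameters slide lists); the training snippets, threshold value and function names are invented for the example.

```python
import math
from collections import Counter

def cloud(texts):
    """Aggregate term frequencies over example texts into a word cloud."""
    c = Counter()
    for t in texts:
        c.update(t.lower().split())
    return c

def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Trained Word Clouds, one per label value (toy training data).
TRAINED = {
    "Positive": cloud(["great room lovely view", "friendly helpful staff"]),
    "Negative": cloud(["dirty room terrible noise", "rude staff awful smell"]),
}

def classify(text, threshold=0.1):
    """Pick the nearest cloud; fall back to Neutral below the threshold."""
    scores = {label: cosine(cloud([text]), wc) for label, wc in TRAINED.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else "Neutral"

print(classify("the staff were friendly and the view was great"))  # Positive
```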
Categorizing “Sentiment” using the ACRE ML Algorithm
Compare the new document’s word cloud to the Trained Word Clouds for Positive and Negative. Pick the best match, or Neutral if there is no good match.
NLP Processing for ML Vocabulary
To eliminate noise and increase accuracy, it is essential to minimize the “vocabulary” of words included in ML analysis. ACRE’s NLP processing does this:
• *Stemming groups different word forms (run, running, ran)
• Part-of-Speech (POS) tags word use (noun, verb, adjective, …)
• Named Entity (NE) tags names of people, places, institutions, …
• Synonym processing groups words with identical meaning
• Filtering eliminates data records that match filter rules
• *Drop List specifies words, phrases, stems, POS tags or NE tags that should be eliminated from analysis
• Go List specifies words, phrases, stems, POS tags or NE tags that must be included in analysis
* Implemented now
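The two steps marked as implemented (drop list and stemming) can be sketched as a tiny pipeline. ACRE uses NLTK for this; the stand-in suffix stemmer below is far cruder than a real stemmer and exists only to show the vocabulary-reduction idea, and the drop list contents are invented.

```python
# Sketch of vocabulary minimization: lowercase, remove drop-list
# words, then group word forms with a naive suffix stemmer.
DROP_LIST = {"the", "a", "an", "and", "was", "were", "is"}

def naive_stem(word):
    """Crude suffix stripping; a real stemmer handles far more cases."""
    for suffix in ("ning", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def ml_vocabulary(text):
    tokens = [w.lower() for w in text.split()]
    return [naive_stem(w) for w in tokens if w not in DROP_LIST]

print(ml_vocabulary("The runner was running and the lights were dimmed"))
# ['runner', 'run', 'light', 'dimm']
```

Note that the naive stemmer cannot group irregular forms like "ran" with "run", which is one reason a proper stemmer (NLTK's, in ACRE's case) matters.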
Tuning Parameters
Options for tuning the ML algorithm (**More research here**):
• Drop List options
• Term weights
• IDF (Inverse Document Frequency) options
• Confidence threshold
• Max term caps
• Similarity measures – cosine similarity, MSE, absolute error
Matching with Search and “Find Similar”
Search finds all data records for a search query, filtered by Labels, POS and NE results.
“Find Similar” uses ML Confidence results to match similar data records:
• Step 1: Create a new ML model and train it on a single reference data record.
• Step 2: Execute the ML model on the other data records, generating a Confidence factor for each one that measures its similarity to the reference.
• Step 3: Sort results by Confidence to group the closest matches together.
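The three steps can be sketched as follows. Cosine similarity over raw term counts stands in for the ML Confidence factor here; the record texts and function names are invented for the example.

```python
import math
from collections import Counter

def vec(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def find_similar(reference, records):
    ref = vec(reference)                                  # Step 1: "train" on the reference
    scored = [(cosine(vec(r), ref), r) for r in records]  # Step 2: confidence per record
    return sorted(scored, reverse=True)                   # Step 3: sort by confidence

records = ["flight delayed again", "my flight was delayed", "great weather today"]
for score, text in find_similar("delayed flight", records):
    print(round(score, 2), text)
```

Records sharing the most vocabulary with the reference sort to the top; unrelated records score 0.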
Example – Find Similar
Dataset: 5000 CCN-International tweets. Find Similar results for the first tweet listed.
Example: Language Labeling
Dataset: 3625 survey responses from CFI Group.
Objective: Determine the language of each response.
Results: For each language: number of responses, list of responses, word cloud.
Approach:
• First, Rules-only analysis using common words in each language
• Second, a combined model where Rules train the ML model and then ES fails over to ML
Example: Language Labeling
Define the Language Label Model:
• Language = {English, Spanish, French}
• Multi_Values = False (1 language per response)
Create simple Rules:
• Spanish: (todo OR perfecto OR siempre OR luces OR excelente)
• French: (troit OR avec OR vendeuse OR normaux OR ajouter)
• English: (clothes OR store OR best OR service OR jeans OR price OR sale)
Experiment #1: Rules-only results:
• English: 1237 items
• Spanish: 38 items
• French: 9 items
• Unlabeled: 2346 items
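The rules-only experiment can be sketched directly from the word lists above. The `label_language` function and sample responses are invented for illustration; the tiny word lists follow the slide, which is exactly why so many responses stay unlabeled until the ML failover stage.

```python
import re

# OR-connected keyword rules per language, as defined on the slide.
LANGUAGE_RULES = {
    "Spanish": r"\b(todo|perfecto|siempre|luces|excelente)\b",
    "French":  r"\b(troit|avec|vendeuse|normaux|ajouter)\b",
    "English": r"\b(clothes|store|best|service|jeans|price|sale)\b",
}

def label_language(response):
    """Multi_Values = False: return the first matching language, else None."""
    for language, pattern in LANGUAGE_RULES.items():
        if re.search(pattern, response, re.IGNORECASE):
            return language
    return None  # unlabeled -> candidate for ML failover

print(label_language("El servicio fue excelente, todo perfecto"))  # Spanish
print(label_language("Great jeans at a great price"))              # English
print(label_language("Muy bien"))                                  # None
```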
Example: Language Labeling
Experiment #2: Combined model where Rules train the ML model and ES fails over to ML.
Results:
• English: 3540 items
• Spanish: 76 items
• French: 9 items
• Unlabeled: 5 items (below Threshold)
Combined ES-ML analysis correctly labels the language of nearly every survey response.
(**Research**) What other labeling decisions are well-suited for combined analysis?
Software History
• May 2013: Vertical Data, LLC, incorporated
• October 2014: ACRE v1.2 release – categorization search engine on a Windows/SQL 2014 Server .NET platform – suspended
• July 6, 2015: ACRE v1.3 deployed! Command-line categorization models using Python/NLTK
• November 2015: ACRE for Web – web2py deployment of ACRE
• 2016: ACRE 2.0 – full GUI, visualizations, SaaS interface
ACRE v1.3 – Category Model Contents
ACRE Label Model (.alm) file contents

ACRE Model Execution Example

ACRE v1.3 Example – Label Table Results

ACRE v1.3 Example – Summary Results (per model)
(Chart categories: Room, Bath, Location, Service, Other)
ACRE v1.3: Term Frequency Results
Term frequency table and Term Word Cloud for every node in the Category Tree.
ACRE v1.3 Model Execution Modes
• Rules-Only Execution: select labels based on rule pattern matches. Deterministic, auditable, reproducible.
• Classic ML Execution: train using exemplars for each value; select labels based on Nearest Word Cloud (NWC).
• Rules with ML Fill-in: train using rule matches (labeling patterns); select labels based on rule pattern matches, and if no rule matches, select based on NWC.
• ML with Seeding Patterns: train using rule matches (seed patterns); select labels based on Nearest Word Cloud (NWC).
ACRE v1.3 for You!
I can provide Prof. Tomuro with a Python distribution of ACRE v1.3 for CSC 594 students to try out by next week (10/20/2015) if you are interested.
User documentation: http://vertical-data.com – click on PRODUCTS. User’s Guide, Command Reference and Options Reference available.
ACRE v2.0 Wire Frames

ACRE v2.0 – Word Cloud Viewer

Reviewing Word Clouds

Graphing Results
Ongoing Trials
Vertical Data has 2 product trials once the Web version releases in November:
• Optimization Group – customer survey data
• Mt. Sinai hospital / HealthIX research study of radiology electronic medical records: extracting Findings and SNOMED diagnosis codes from doctors’ comments
Literature: 2007 Medical NLP Challenge results on extracting ICD9 codes from radiology reports.
Conclusions
• ACRE provides a set of useful tools for text analytics, categorization and search.
• I can provide a distribution of ACRE v1.3 (command line) for CSC 594 use by 10/20.
• Contact me ([email protected]) if you would like to work on:
  • Research on combined ES/ML model performance
  • Research on Medical NLP