ACRE Text Categorization Platform
Greg Brewster, PhD
Associate Professor, School of Computing
CTO, Vertical Data LLC
CSC 594 – Text Mining and Analytics
October 15, 2015
ACRE is…
The Auto-Categorization and Retrieval Engine:
• A scalable text document labeling system
• Python (command line) and web2py (web version in development)
• A product (soon) of Vertical Data, LLC; patent pending
• A research project on hybrid text classification methods and medical NLP
Unstructured Text
Businesses must monitor and manage a massive inflow of unstructured text:
• Documents
• Social Media
• Big Data
What Everybody Wants…
Unstructured Text => Structured Data
This can be done by labeling text.
(The term ties in with network technologies – Multi-Protocol Label Switching (MPLS).)
What Labels Let You Do
• Filtering
• Data Routing
• Monitor / Notify
• Auto-Categorization
• Clustering
• Prioritization
• Etc.
Label Models
An ACRE Label Model analyzes each text item and selects one or more values from a list or Label Tree (taxonomy), using:
• Natural Language Processing – stemming, part-of-speech analysis, named entity extraction, synonym substitutions, drop/go lists
• Pattern Extraction / Rules – keyword/pattern matches determine the label value
• Machine Learning – similarity to Trained Word Clouds determines the values
Executing one Label Model adds one column of results (labels) to the Label Table.

Model Execution Results
Each Label Model adds a column: Survey Results => ACRE => Labeled Survey Results
Models Executed:
1. Topic
2. Sentiment
3. Before_Room
4. Alarm_Words
Why ACRE?
• Adds value! => Creates new structured data from existing unstructured data
• Designed from the ground up for text analysis, as opposed to numerical analytics with a text add-on
• Human-guided analysis
• Intuitive model definition, use and results
• Combined/extended models
• Iterative model improvement
• Fast prototyping
• Scalable
Hierarchical Labeling: Label Value Trees
• Each text item is labeled with one or more leaf values
• Hierarchical evaluation during the ML stage
• Tree model for aggregated results visualization
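A label value tree can be pictured as a nested structure whose leaves are the assignable labels. The sketch below is a minimal illustration of that idea, not ACRE's actual schema; the tree contents and function names are invented for the example.

```python
# A minimal sketch of a label value tree: internal nodes group related
# labels, and each text item receives one or more leaf values.
# Tree shape and category names are illustrative only.
TOPIC_TREE = {
    "Room": {"Bed": {}, "Bath": {}},
    "Service": {"Front Desk": {}, "Housekeeping": {}},
}

def leaf_values(tree, prefix=()):
    """Yield the path to every leaf label in the tree."""
    for name, subtree in tree.items():
        path = prefix + (name,)
        if subtree:
            yield from leaf_values(subtree, path)
        else:
            yield path

leaves = [" / ".join(p) for p in leaf_values(TOPIC_TREE)]
# leaves == ['Room / Bed', 'Room / Bath',
#            'Service / Front Desk', 'Service / Housekeeping']
```

Aggregated results can then be rolled up the same paths for visualization, which is what the tree model on this slide refers to.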
Labeling and Indexing
Indexing for search has undergone massive growth:
• 1980s/90s – Isolated topic-specific search engines
• 1990s/00s – Spiders index the WWW – Google search
• Today – Continuous indexing by the OS
Labeling is poised for similar growth and ubiquitous deployment:
• Today – Isolated auto-classification within specific applications
• Tomorrow – Auto-labeling of all enterprise data for enhanced search, data filtering and monitoring
Previous Label Model Examples
• Sentiment Model – possible labels = {Positive, Negative, Neutral}
• Language Model – {English, Spanish, French}
• Retention Model – {Discard, Retain, Discoverable, Legal Hold}
• Emotion Model – {Happy, Satisfied, Enthusiastic, Complaining, Angry, Threatening, Violent}
• Themes Model, Spam Model, Threat Analysis Model
Model Library
The Model Library contains tested Label Models. Users can copy models and edit them for their own use, and can also contribute models.
Label Model Library

Label Name | Description | Label Values
Retention | Should data be retained and for how long? | Discard, Retain, Retain: 30 days, Retain: 90 days
Sentiment | Overall tone of text | Positive, Neutral, Negative
Emotion | Identifies presence of words with high emotional content | Angry, Anxious, Sad, Happy, Grateful, Enthusiastic, Threatening, None
HashTags | List of hashtags ("#tag") in text | Extracted with pattern "#text"
Affinities | Identifies common user interests | MLB, NBA, NFL, Golf, Cars, Shopping, New York, Cooking, Technology, etc.
Social Media Labeling
5 Label Models / 4 Stakeholders
ACRE for Web (Nov. 2015)

ACRE with SaaS (2016)
Deploy local or cloud. BigML for Text!
Origin of ACRE
Dr. Peter Jackson’s work pioneered legal document classification at Thomson Reuters in the 1990s, eliminating >90% of human effort.
ACRE additions:
• Combined ES-ML modeling
• “Vertical” expertise: Medical, Security
Why is Text Analytics Hard?
• Lots of domain-specific details
• Many models are ad hoc and not very reusable
• Hard to generalize techniques
• Extensive dictionaries and topic lists often required
• Requires 2 experts: a modeler and a subject expert
  • These two are often far apart in knowledge base
  • Requires a lot of time from both
• Multiple techniques used – often in combination: NLP, Search / Rules, Machine Learning
Combining ES – ML Models
Combined models can reduce effort and yield improved accuracy:
• ES Model trains ML Model – each labeling result from the ES model is used as a training instance for the ML model
• ES Model fails over to ML Model – evaluate each text input with the ES model first, then fail over to the ML model if there is no ES result
• Iterative improvement – add ES rules to improve ML model results incrementally
• (**Next steps**) ML Model => Rules
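The ES-to-ML failover can be sketched in a few lines. This is an illustration of the control flow only: both model functions below are toy stand-ins, not ACRE code.

```python
# Sketch of ES -> ML failover: try the expert-system (rules) model
# first; if no rule fires, fall back to the ML model.
def es_model(text):
    """Toy rules model: returns a label, or None when no rule matches."""
    if "great" in text.lower():
        return "Positive"
    if "awful" in text.lower():
        return "Negative"
    return None  # no ES result -> fail over

def ml_model(text):
    """Stand-in for the trained ML model (always returns a label)."""
    return "Neutral"

def label_with_failover(text):
    return es_model(text) or ml_model(text)

print(label_with_failover("Great stay!"))   # Positive (ES rule fired)
print(label_with_failover("It was fine."))  # Neutral (ML failover)
```

The same ES results can double as training instances for the ML model, which is the "ES Model trains ML Model" variant above.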
ACRE Labeling Process
Pattern Models
Metadata Extraction by Pattern: the Decision Manager defines “Patterns”, which are:
• A context pattern containing an extraction pattern in parentheses
• If the context pattern is matched, the extraction pattern match becomes the label value
Examples:
• Extract social security numbers – Pattern: “(\d{3}-\d{2}-\d{4})”
• Extract hashtags – Pattern: “(#\w*)”
• Extract the word before “Room” – Pattern: “(\w*) Room”
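These patterns map directly onto Python regular expressions, where the parenthesized group is the extracted label value. A minimal sketch (the `extract` helper and example strings are invented for illustration):

```python
import re

# The whole regex is the context pattern; the parenthesized group is
# the extraction pattern whose match becomes the label value.
PATTERNS = {
    "SSN":         r"(\d{3}-\d{2}-\d{4})",
    "HashTag":     r"(#\w*)",
    "Before_Room": r"(\w*) Room",
}

def extract(label, text):
    """Return all extraction-pattern matches for the given label."""
    return re.findall(PATTERNS[label], text)

print(extract("HashTag", "Loving the #conference wifi #nlp"))
# ['#conference', '#nlp']
print(extract("Before_Room", "Meet in the Green Room at noon"))
# ['Green']
```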
Hashtag Extraction Results
Current ACRE Tools: Rules
Rules associate text patterns with a label value. When text is processed, a rule match assigns the corresponding label value.
Rules Definition File
• Patterns are regular expressions connected by booleans
• If multiple patterns match in a given document, then:
  • If option Multi_Values = True, all matched values are retained
  • If option Multi_Values = False, the value with the greatest sum of matched rule weights is selected

Category | Value | Pattern | Weight
Sentiment | Positive | Good | 1
Sentiment | Positive | Great | 1.5
Sentiment | Positive | "Not bad" | 0.4
Sentiment | Positive | Gr[eio].* | 1
Sentiment | Positive | Best & Good | 1.7
Sentiment | Negative | Terrible | 2
Sentiment | Negative | Awful | 1.5
Sentiment | Negative | unhappy | 1
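The weighted-selection semantics above can be sketched as follows. This is an assumption-laden illustration, not ACRE's implementation: it reuses the Sentiment rows from the table but omits boolean-connected patterns such as "Best & Good", and the `label_text` function is invented for the example.

```python
import re

# (value, regex pattern, weight) rules, mirroring the Sentiment table.
RULES = [
    ("Positive", r"Good",      1.0),
    ("Positive", r"Great",     1.5),
    ("Positive", r"Not bad",   0.4),
    ("Positive", r"Gr[eio].*", 1.0),
    ("Negative", r"Terrible",  2.0),
    ("Negative", r"Awful",     1.5),
    ("Negative", r"unhappy",   1.0),
]

def label_text(text, multi_values=False):
    """Sum weights of matched rules per value, then apply Multi_Values."""
    scores = {}
    for value, pattern, weight in RULES:
        if re.search(pattern, text, re.IGNORECASE):
            scores[value] = scores.get(value, 0.0) + weight
    if not scores:
        return None
    if multi_values:
        return sorted(scores)           # Multi_Values = True: keep all
    return max(scores, key=scores.get)  # False: greatest summed weight wins

print(label_text("Great service but awful food", multi_values=True))
# ['Negative', 'Positive']
print(label_text("Great service but awful food"))
# 'Positive'  (Positive sums 1.5 + 1.0, Negative only 1.5)
```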
Results Report (CSV)
ACRE Machine Learning
Current: Nearest Word Cloud
• The Machine Learning Model is “trained” on example text for each label
• ACRE applies NLP processing to each text item to isolate the best terms to be used in analysis
• The model stores a “Trained Word Cloud” (aggregated term frequencies calculated from training) for each label value
• Combined model option: train the ML model using Rule matches
• To choose a label for new text, ACRE calculates a Confidence value for each possible label, based on vector similarity between the new text’s cloud and each Trained Word Cloud
• Next: the ML vector can be constructed from any data fields (structured or unstructured)
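A minimal sketch of the Nearest Word Cloud idea, assuming cosine similarity as the vector similarity measure (one of the measures the Tuning Parameters slide lists); the training snippets, threshold value and function names are invented for the example.

```python
import math
from collections import Counter

def cloud(texts):
    """Aggregate term frequencies over example texts into a word cloud."""
    c = Counter()
    for t in texts:
        c.update(t.lower().split())
    return c

def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Trained Word Clouds, one per label value (toy training data).
TRAINED = {
    "Positive": cloud(["great room lovely view", "friendly helpful staff"]),
    "Negative": cloud(["dirty room terrible noise", "rude staff awful smell"]),
}

def classify(text, threshold=0.1):
    """Pick the nearest cloud; fall back to Neutral below the threshold."""
    scores = {label: cosine(cloud([text]), wc) for label, wc in TRAINED.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else "Neutral"

print(classify("the staff were friendly and the view was great"))  # Positive
```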
Categorizing “Sentiment” using the ACRE ML Algorithm
Compare the new document’s word cloud to the Trained Word Clouds for Positive and Negative. Pick the best match, or Neutral if there is no good match.
NLP Processing for ML Vocabulary
To eliminate noise and increase accuracy, it is essential to minimize the “vocabulary” of words included in ML analysis. ACRE’s NLP processing does this:
• *Stemming groups different word forms (run, running, ran)
• Part-of-Speech (POS) tags word use (noun, verb, adjective, …)
• Named Entity (NE) tags names of people, places, institutions, …
• Synonym processing groups words with identical meaning
• Filtering eliminates data records that match filter rules
• *Drop List specifies words, phrases, stems, POS tags or NE tags that should be eliminated from analysis
• Go List specifies words, phrases, stems, POS tags or NE tags that must be included in analysis
* Implemented now
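The two steps marked as implemented (drop list and stemming) can be sketched as a tiny pipeline. ACRE uses NLTK for this; the stand-in suffix stemmer below is far cruder than a real stemmer and exists only to show the vocabulary-reduction idea, and the drop list contents are invented.

```python
# Sketch of vocabulary minimization: lowercase, remove drop-list
# words, then group word forms with a naive suffix stemmer.
DROP_LIST = {"the", "a", "an", "and", "was", "were", "is"}

def naive_stem(word):
    """Crude suffix stripping; a real stemmer handles far more cases."""
    for suffix in ("ning", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def ml_vocabulary(text):
    tokens = [w.lower() for w in text.split()]
    return [naive_stem(w) for w in tokens if w not in DROP_LIST]

print(ml_vocabulary("The runner was running and the lights were dimmed"))
# ['runner', 'run', 'light', 'dimm']
```

Note that the naive stemmer cannot group irregular forms like "ran" with "run", which is one reason a proper stemmer (NLTK's, in ACRE's case) matters.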
Tuning Parameters
Options for tuning the ML algorithm (**More research here**):
• Drop List options
• Term weights
• IDF (Inverse Document Frequency) options
• Confidence threshold
• Max term caps
• Similarity measures – cosine similarity, MSE, absolute error
Matching with Search and “Find Similar”
Search finds all data records for a search query, filtered by Labels, POS and NE results.
“Find Similar” uses ML Confidence results to match similar data records:
• Step 1: Create a new ML model and train it on a single reference data record.
• Step 2: Execute the ML model on the other data records, generating a Confidence factor for each one that measures its similarity to the reference.
• Step 3: Sort results by Confidence to group the closest matches together.
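The three steps can be sketched as follows. Cosine similarity over raw term counts stands in for the ML Confidence factor here; the record texts and function names are invented for the example.

```python
import math
from collections import Counter

def vec(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def find_similar(reference, records):
    ref = vec(reference)                                  # Step 1: "train" on the reference
    scored = [(cosine(vec(r), ref), r) for r in records]  # Step 2: confidence per record
    return sorted(scored, reverse=True)                   # Step 3: sort by confidence

records = ["flight delayed again", "my flight was delayed", "great weather today"]
for score, text in find_similar("delayed flight", records):
    print(round(score, 2), text)
```

Records sharing the most vocabulary with the reference sort to the top; unrelated records score 0.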
Example – Find Similar
Dataset: 5000 CCN-International tweets. Find Similar results for the first tweet listed.
Example: Language Labeling
Dataset: 3625 survey responses from CFI Group.
Objective: Determine the language of each response.
Results: For each language: number of responses, list of responses, word cloud.
Approach:
• First, Rules-only analysis using common words in each language
• Second, a combined model where Rules train the ML model and then ES fails over to ML
Example: Language Labeling
Define the Language Label Model:
• Language = {English, Spanish, French}
• Multi_Values = False (1 language per response)
Create simple Rules:
• Spanish: (todo OR perfecto OR siempre OR luces OR excelente)
• French: (troit OR avec OR vendeuse OR normaux OR ajouter)
• English: (clothes OR store OR best OR service OR jeans OR price OR sale)
Experiment #1: Rules-only results:
• English: 1237 items
• Spanish: 38 items
• French: 9 items
• Unlabeled: 2346 items
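The rules-only experiment can be sketched directly from the word lists above. The `label_language` function and sample responses are invented for illustration; the tiny word lists follow the slide, which is exactly why so many responses stay unlabeled until the ML failover stage.

```python
import re

# OR-connected keyword rules per language, as defined on the slide.
LANGUAGE_RULES = {
    "Spanish": r"\b(todo|perfecto|siempre|luces|excelente)\b",
    "French":  r"\b(troit|avec|vendeuse|normaux|ajouter)\b",
    "English": r"\b(clothes|store|best|service|jeans|price|sale)\b",
}

def label_language(response):
    """Multi_Values = False: return the first matching language, else None."""
    for language, pattern in LANGUAGE_RULES.items():
        if re.search(pattern, response, re.IGNORECASE):
            return language
    return None  # unlabeled -> candidate for ML failover

print(label_language("El servicio fue excelente, todo perfecto"))  # Spanish
print(label_language("Great jeans at a great price"))              # English
print(label_language("Muy bien"))                                  # None
```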
Example: Language Labeling
Experiment #2: Combined model where Rules train the ML model and ES fails over to ML.
Results:
• English: 3540 items
• Spanish: 76 items
• French: 9 items
• Unlabeled: 5 items (below Threshold)
Combined ES-ML analysis correctly labels the language of nearly every survey response.
(**Research**) What other labeling decisions are well-suited for combined analysis?
Software History
• May 2013: Vertical Data, LLC, incorporated
• October 2014: ACRE v1.2 release – categorization search engine on a Windows/SQL 2014 Server .NET platform – suspended
• July 6, 2015: ACRE v1.3 deployed! Command-line categorization models using Python/NLTK
• November 2015: ACRE for Web – web2py deployment of ACRE
• 2016: ACRE 2.0 – full GUI, visualizations, SaaS interface
ACRE v1.3 – Category Model Contents
ACRE Label Model (.alm) file contents

ACRE Model Execution Example

ACRE v1.3 Example – Label Table Results

ACRE v1.3 Example – Summary Results (per model)
(Chart categories: Room, Bath, Location, Service, Other)
ACRE v1.3: Term Frequency Results
Term frequency table and Term Word Cloud for every node in the Category Tree.
ACRE v1.3 Model Execution Modes
• Rules-Only Execution: select labels based on rule pattern matches. Deterministic, auditable, reproducible.
• Classic ML Execution: train using exemplars for each value; select labels based on Nearest Word Cloud (NWC).
• Rules with ML Fill-in: train using rule matches (labeling patterns); select labels based on rule pattern matches, and if no rule matches, select based on NWC.
• ML with Seeding Patterns: train using rule matches (seed patterns); select labels based on Nearest Word Cloud (NWC).
ACRE v1.3 for You!
I can provide Prof. Tomuro with a Python distribution of ACRE v1.3 for CSC 594 students to try out by next week (10/20/2015) if you are interested.
User documentation: http://vertical-data.com – click on PRODUCTS. User’s Guide, Command Reference and Options Reference available.
ACRE v2.0 Wire Frames

ACRE v2.0 – Word Cloud Viewer

Reviewing Word Clouds

Graphing Results
Ongoing Trials
Vertical Data has 2 product trials once the Web version releases in November:
• Optimization Group – customer survey data
• Mt. Sinai hospital / HealthIX research study of radiology electronic medical records: extracting Findings and SNOMED diagnosis codes from doctors’ comments
Literature: 2007 Medical NLP Challenge results on extracting ICD9 codes from radiology reports.
Conclusions
• ACRE provides a set of useful tools for text analytics, categorization and search.
• I can provide a distribution of ACRE v1.3 (command line) for CSC 594 use by 10/20.
• Contact me ([email protected]) if you would like to work on:
  • Research on combined ES/ML model performance
  • Research on Medical NLP