SemTech Text Analytics Evaluation


Transcript of SemTech Text Analytics Evaluation

Page 1: SemTech Text Analytics Evaluation

SemTech Text Analytics Evaluation

Tom Reamy
Chief Knowledge Architect, KAPS Group
Knowledge Architecture Professional Services
http://www.kapsgroup.com

Page 2: SemTech Text Analytics Evaluation

Agenda
Text Analytics Features, Varieties, Vendors
Evaluation Process
– Start with Self-Knowledge
– Text Analytics Team
– Features and Capabilities – Filter
Proof of Concept/Pilot
– Themes and Issues
– Case Study
Conclusion

Page 3: SemTech Text Analytics Evaluation

KAPS Group: General
Knowledge Architecture Professional Services
Virtual company: network of consultants – 8-10 partners – SAS, SAP, FAST, Smart Logic, Concept Searching, etc.
Consulting, strategy, knowledge architecture audit
Services:
– Taxonomy/Text Analytics development, consulting, customization
– Technology consulting – search, CMS, portals, etc.
– Evaluation of Enterprise Search, Text Analytics
– Metadata standards and implementation
– Knowledge Management: collaboration, expertise, e-learning
– Applied theory – faceted taxonomies, complexity theory, natural categories

Page 4: SemTech Text Analytics Evaluation

Introduction to Text Analytics: Text Analytics Features

Noun Phrase Extraction (Entities, Concepts, Events, etc.)
– Catalogs with variants, rule-based and dynamic
– Multiple types, custom classes – entities, concepts, events
– Feeds facets

Summarization
– Customizable rules, map to different content

Fact Extraction
– Relationships of entities – people-organizations-activities
– Ontologies – triples, RDF, etc.

Sentiment Analysis
– Statistical, rules – full categorization set of operators
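As an illustration of the kind of output noun-phrase and entity extraction produces, here is a minimal, generic sketch using the open-source spaCy library – not any vendor's engine, and the sample sentence is made up:

```python
# Generic entity / noun-phrase extraction sketch using open-source spaCy.
# Assumes the small English model is installed:
#   python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Tom Reamy of KAPS Group compared SAS and IBM at SemTech.")

# Entities (people, organizations, etc.) are the kind of thing that feeds facets.
for ent in doc.ents:
    print(ent.text, "->", ent.label_)

# Noun phrases are candidate concepts for a catalog or taxonomy.
for chunk in doc.noun_chunks:
    print("noun phrase:", chunk.text)
```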

Page 5: SemTech Text Analytics Evaluation

Introduction to Text Analytics: Text Analytics Features

Auto-categorization
– Training sets – Bayesian, vector space
– Terms – literal strings, stemming, dictionary of related terms
– Rules – simple – position in text (title, body, URL)
– Semantic network – predefined relationships, sets of rules
– Boolean – full search syntax – AND, OR, NOT
– Advanced – NEAR (#), PARAGRAPH, SENTENCE

This is the most difficult feature to develop
Build on a taxonomy; combine with extraction
– e.g., fire a category if any of a list of entities occurs together with other words (see the sketch below)
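A minimal sketch of what such a Boolean categorization rule with a NEAR operator might look like in code; vendor rule syntaxes differ, and the category, terms, and window size here are illustrative only:

```python
# Hedged sketch of a Boolean categorization rule with a NEAR operator.
# Rule: (network OR service) NEAR outage, AND NOT "scheduled maintenance".
import re

def near(text: str, term_a: str, term_b: str, window: int = 10) -> bool:
    """True if term_a and term_b occur within `window` tokens of each other."""
    tokens = re.findall(r"\w+", text.lower())
    pos_a = [i for i, t in enumerate(tokens) if t == term_a]
    pos_b = [i for i, t in enumerate(tokens) if t == term_b]
    return any(abs(a - b) <= window for a in pos_a for b in pos_b)

def categorize_outage_report(doc: str) -> bool:
    text = doc.lower()
    return ((near(text, "network", "outage") or near(text, "service", "outage"))
            and "scheduled maintenance" not in text)

print(categorize_outage_report("Customers reported a service outage in the eastern region."))  # True
```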

Page 6: SemTech Text Analytics Evaluation

Page 7: SemTech Text Analytics Evaluation

Page 8: SemTech Text Analytics Evaluation

Page 9: SemTech Text Analytics Evaluation

Page 10: SemTech Text Analytics Evaluation

Page 11: SemTech Text Analytics Evaluation

Page 12: SemTech Text Analytics Evaluation

Varieties of Taxonomy/Text Analytics Software

Taxonomy Management
– Synaptica, SchemaLogic

Full Platform
– SAS-Teragram, SAP-Inxight, Smart Logic, Data Harmony, Concept Searching, Expert System, IBM, GATE

Content Management – embedded

Embedded – Search
– FAST, Autonomy, Endeca, Exalead, etc.

Specialty
– Sentiment Analysis, VOC – Lexalytics, Attensity / Reports
– Ontology – extraction, plus ontology

Page 13: SemTech Text Analytics Evaluation

Evaluating Taxonomy/Text Analytics Software: Start with Self-Knowledge

Strategic and Business Context
– Info problems – what, and how severe
– Strategic questions – why, what value from the taxonomy/text analytics, how are you going to use it
Formal process – KA audit – content, users, technology, business and information behaviors, applications
– Or informal for a smaller organization or application-specific initiatives
Text Analytics Strategy/Model – forms, technology, people
– Existing taxonomic resources, software
You need this foundation both to evaluate and to develop.

Page 14: SemTech Text Analytics Evaluation

Evaluating Taxonomy/Text Analytics Software: Start with Self-Knowledge

Do you need it – and if so, what blend?
Taxonomy Management – full functionality
– Multiple taxonomies, languages, authors-editors
Technology Environment – text mining, ECM, enterprise search
– Where is it embedded, integration issues
Publishing Process – where and how is metadata being added – now and in the projected future
– Can it utilize auto-categorization, entity extraction, summarization?
Applications – text mining, BI, CI, social media, mobile?

Page 15: SemTech Text Analytics Evaluation

Design of the Text Analytics Selection Team

Traditional Candidates – IT
Experience with large software purchases
– Search/categorization is unlike other software
Experience with needs assessments
– Need more – know what questions to ask, knowledge audit
Objective criteria
– Looking where there is light?
Asking IT to select text analytics software is like asking a construction company to select the design of your house.
They have the budget
– OK, they can play.

Page 16: SemTech Text Analytics Evaluation

Design of the Text Analytics Selection Team

Traditional Candidates – Business Owners
Understand the business
– But don't understand information behavior
Focus on business value, not technology
– A focus on semantics is needed
Asking business owners to select text analytics software is like asking a restaurant owner to do the cooking.
They can get executive sponsorship, support, and budget.
– OK, they can play.

Page 17: SemTech Text Analytics Evaluation

Design of the Text Analytics Selection Team

Traditional Candidates – Library
Understand information structure
– But not how it is used in the business
Experts in the search experience and categorization
– Suitable for experts, not regular users
Asking librarians to select text analytics software is like asking an accountant to establish your financial strategy.
Experience with a variety of search engines, taxonomy software, integration issues
– OK, they can play.

Page 18: SemTech Text Analytics Evaluation

Design of the Text Analytics Selection Team

Interdisciplinary Team, headed by Information Professionals

Relative Contributions
– IT – set necessary conditions, support tests
– Business – provide input into requirements, support the project
– Library – provide input into requirements, add understanding of search semantics and functionality

Much more likely to make a good decision
Creates the foundation for implementation

Page 19: SemTech Text Analytics Evaluation

Evaluating Text Analytics Software – Process

Start with self-knowledge
Eliminate the unfit:
– Filter One – ask experts – reputation, research – Gartner, etc.
  • Market strength of vendor, platforms, etc.
– Filter Two – feature scorecard – minimum, must-have – a filter (see the sketch below)
– Filter Three – technology filter – match to your overall scope and capabilities – a filter, not a focus
– Filter Four – focus group, one-day visit – 3-4 vendors
Deep pilot (2) / POC – advanced features, integration, semantics
Focus on the working relationship with the vendor.
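To make Filter Two concrete, here is a minimal, hypothetical feature-scorecard filter; the vendor names and feature lists are placeholders, not findings from the talk:

```python
# Minimal sketch of a "Filter Two" feature scorecard (hypothetical data):
# a vendor passes only if it covers every must-have feature.
MUST_HAVE = {"auto-categorization", "entity extraction", "sentiment analysis"}

vendors = {
    "Vendor A": {"auto-categorization", "entity extraction"},
    "Vendor B": {"auto-categorization", "entity extraction",
                 "sentiment analysis", "summarization"},
}

# Keep only vendors whose feature set is a superset of the must-haves.
passing = [name for name, features in vendors.items() if MUST_HAVE <= features]
print(passing)  # ['Vendor B']
```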

Page 20: SemTech Text Analytics Evaluation

Initial Evaluation Example Outcomes

Filter One:
– Company A, B – sentiment analysis focus, weak categorization
– Company C – lack of a full suite of text analytics
– Company D – business concerns, support
– Open Source – license issues
– Ontology Vendors – missing categorization capabilities

4 Demos
– Saw a variety of different approaches, but:
– Company X – lacking sentiment analysis, would require 2 vendors
– Company Y – lack of language support, development cost

Page 21: SemTech Text Analytics Evaluation

Evaluating Taxonomy Software: POC – Approach

Quality of results is the essential factor
Six-week POC – bake-off / or a short pilot
Real-life scenarios, categorization with your content
Preparation:
– Preliminary analysis of content and users' information needs
– Set up the software in a lab – relatively easy
– Train taxonomist(s) on the software
– Develop a taxonomy if none is available
Six-week POC – 3 rounds of development, test, refine / not out-of-the-box
Need SMEs as test evaluators – also to do an initial categorization of content

Page 22: SemTech Text Analytics Evaluation

Evaluating Taxonomy Software: POC – Initial Design

The majority of the time is spent on auto-categorization
Need to balance uniformity of results with vendor-unique capabilities – have to determine this at POC time
Risks – getting the software installed and working, getting the right content, initial categorization of content
Elements:
– Content
– Search terms / search scenarios
– Training sets (see the sketch below for one way to split content into training and test sets)
– Test sets of content
Development Team – expert consultants plus internal taxonomists and technical staff
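As a sketch of preparing training and test sets for the POC, here is one way to split pre-categorized content using scikit-learn; the documents and labels are placeholders for your own content:

```python
# Hypothetical split of pre-categorized content into training and test sets,
# so each vendor is scored on documents it was not tuned against.
from sklearn.model_selection import train_test_split

documents = ["doc text 1 ...", "doc text 2 ...", "doc text 3 ...", "doc text 4 ..."]
labels = ["Billing", "Outage", "Billing", "Outage"]  # initial human categorization

train_docs, test_docs, train_labels, test_labels = train_test_split(
    documents, labels, test_size=0.5, stratify=labels, random_state=42
)
print(len(train_docs), "training docs,", len(test_docs), "test docs")
```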

Page 23: SemTech Text Analytics Evaluation

Evaluating Taxonomy Software: POC – Range of Evaluations

Basic – can this stuff work at all?
Auto-categorization against the existing taxonomy – variety of content
Clustering – automatic node generation
Summarization
Entity extraction – build a number of catalogs – decide which ones based on projected needs – example: privacy info (SS#, phone, etc.; see the sketch below)
Entity examples – people, organizations, methods, etc.
Evaluate usability in action by taxonomists
Integration – with ontologies
Output – XML, APIs
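For the privacy-info catalog mentioned above, a minimal sketch of the kind of patterns one might test during the entity-extraction part of a POC; these are simplified regular expressions, not production-grade detectors:

```python
# Hedged sketch of a small privacy-info "catalog": simplified patterns
# for US Social Security numbers and phone numbers.
import re

PRIVACY_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\(?\d{3}\)?[-. ]\d{3}[-. ]\d{4}\b"),
}

def extract_privacy_info(text: str) -> dict:
    """Return any SSN- or phone-shaped strings found in the text."""
    return {name: pat.findall(text) for name, pat in PRIVACY_PATTERNS.items()}

print(extract_privacy_info("Call 415-555-0123 about SSN 123-45-6789."))
# {'ssn': ['123-45-6789'], 'phone': ['415-555-0123']}
```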

Page 24: SemTech Text Analytics Evaluation

Evaluating Text Analytics Software: POC – Issues

Quality of content – a range of issues – from spelling to size to ?
Quality of the initial human categorization
Normalize among different test evaluators
Quality of taxonomists – experience with text analytics software and/or experience with the content and information needs and behaviors
Quality of the taxonomy
– General issues – structure (too flat or too deep)
– Overlapping categories
– Differences in use – browse, index, categorize
For categorization, the essential issue is the complexity of language
For entity extraction, the essential issue is scale and disambiguation

Page 25: SemTech Text Analytics Evaluation

Evaluating Text Analytics Software: Risks

CIO/CTO problem – this is not a regular software process
Language is messy, not just complex
– 30% accuracy isn't 30% done – it could be 90% done
Variability of human categorization / expression
– Even professional writers – journalist examples
Categorization is iterative, not "the program works"
– Need a realistic budget and a flexible project plan
"Anyone can do categorization"
– Librarians often overdo it, SMEs often get lost (keywords)
Meta-language issues – understanding the results
– Need to educate IT and business in their language

Page 26: SemTech Text Analytics Evaluation

Case Study: Telecom Service

Evaluation criteria:
– Company history, reputation
– Full platform – categorization, extraction, sentiment
– Integration – Java, API-SDK, Linux
– Multiple languages
– Scale – millions of docs a day
– Total cost of ownership
– Ease of development – new
– Vendor relationship – OEM, etc.

Vendors: Expert System, IBM, SAS, Smart Logic
Option – multiple vendors – sentiment & platform

Page 27: SemTech Text Analytics Evaluation

POC Design Discussion: Evaluation Criteria

Basic Test Design – categorize the test set
– Score – by file name, human testers

Categorization
– Accuracy level – 80-90%
– Effort level per accuracy level

Sentiment Analysis
– Accuracy level – 80-90%
– Effort level per accuracy level

Quantify development time – main elements
Comparison of two vendors – how to score? (see the scoring sketch below)
– Combination of scores and a report
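A minimal sketch of how recall and precision per category can be scored against the human "gold" categorization of the test set; the documents and categories here are made up, and the real POC scored by file name with human testers:

```python
# Hedged sketch of per-category precision/recall scoring for a vendor's
# categorization output versus the human gold categorization.
from collections import defaultdict

def precision_recall(gold: dict, predicted: dict):
    """gold/predicted map document id -> set of category labels."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for doc_id, gold_cats in gold.items():
        pred_cats = predicted.get(doc_id, set())
        for cat in pred_cats & gold_cats: tp[cat] += 1   # correct assignments
        for cat in pred_cats - gold_cats: fp[cat] += 1   # false assignments
        for cat in gold_cats - pred_cats: fn[cat] += 1   # missed assignments
    scores = {}
    for cat in set(tp) | set(fp) | set(fn):
        p = tp[cat] / (tp[cat] + fp[cat]) if tp[cat] + fp[cat] else 0.0
        r = tp[cat] / (tp[cat] + fn[cat]) if tp[cat] + fn[cat] else 0.0
        scores[cat] = {"precision": round(p, 3), "recall": round(r, 3)}
    return scores

gold = {"doc1": {"Motivation"}, "doc2": {"Actions"}, "doc3": {"Motivation"}}
pred = {"doc1": {"Motivation"}, "doc2": {"Actions", "Motivation"}, "doc3": set()}
print(precision_recall(gold, pred))
```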

Page 28: SemTech Text Analytics Evaluation

Text Analytics POC Outcomes: Categorization Results

                         SAS     IBM
Recall – Motivation      92.6    90.7
Recall – Actions         93.8    88.3
Precision – Motivation   84.3
Precision – Actions      100
Uncategorized            87.5
Raw Precision            73      46

Page 29: SemTech Text Analytics Evaluation

Text Analytics POC Outcomes: Vendor Comparisons

Categorization results – both good, edge to SAS on precision
– Use of relevancy to set thresholds
Development environment
– IBM as a toolkit provides more flexibility, but it also increases development effort
Methodology – IBM enforces a good method, but takes more time
– SAS can be used in exactly the same way
SAS has a much more complete set of operators – NOT, DIST, START

Page 30: SemTech Text Analytics Evaluation

Text Analytics POC Outcomes: Vendor Comparisons – Functionality

Sentiment Analysis
– SAS has a workbench; IBM would require more development
– SAS also has statistical modeling capabilities
Entity and Fact Extraction – seems basically the same
– SAS can use operators for improved disambiguation
Summarization
– SAS has it built in
– IBM could develop it using categorization rules – but it is not clear that this would be as effective without operators

Conclusion: Both can do the job, edge to SAS

Page 31: SemTech Text Analytics Evaluation

Conclusion

Start with self-knowledge – what will you use it for?
– Current environment – technology, information
Basic features are only filters, not scores
Integration – need an integrated team (IT, Business, KA)
– For evaluation and development
POC – your content, real-world scenarios – not scores
Foundation for development, experience with the software
– Development is better, faster, cheaper
Categorization is essential, and time-consuming
Next: Text Analytics + Semantic Web + Ontology
– Integration of data and text mining
– Mutual enrichment – smarter data, richer analytics

Page 32: SemTech Text Analytics Evaluation

Questions?

Tom Reamy
[email protected]
KAPS Group
Knowledge Architecture Professional Services
http://www.kapsgroup.com