Taming Text

Taming Text Grant Ingersoll CTO, LucidWorks @tamingtext, @gsingers


Presentation on Taming Text given at the Triangle Java User Group on March 18th, 2013. It covers search, question answering, clustering, classification, named entity recognition, and more. See http://www.manning.com/ingersoll for more.

Transcript of Taming Text

Page 1: Taming Text

Taming Text

Grant Ingersoll

CTO, LucidWorks

@tamingtext, @gsingers

Page 2: Taming Text

About the Book

• Goal: An engineer’s guide to search and Natural Language Processing (NLP) and Machine Learning

• Target Audience: You

• All examples in Java, but concepts easily ported

• Covers:
– Search, Fuzzy string matching, human language basics, clustering, classification, Question Answering, Intro to advanced topics

Page 3: Taming Text

Answer Me This!

• What is trimethylbenzene?
– http://localhost:8983/solr/answer?q=What+is+trimethylbenzene%3F&defType=qa&qa=true&qa.qf=body

• who is ten minute warning?
– http://localhost:8983/solr/answer?q=who+is+ten+minute+warning%3F&defType=qa&qa=true&qa.qf=body

• what station serves the A train?
– http://localhost:8983/solr/answer?q=what+station+serves+the+A+train%3F&defType=qa&qa=true&qa.qf=body
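The request URLs above are just the question text, URL-encoded, plus three fixed parameters. A minimal sketch of building one in Java follows; the class and method names (QaUrlBuilder, build) are made up for illustration and are not from the book's code, while the endpoint and parameters are copied from the examples on this slide.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class QaUrlBuilder {
    // Endpoint and parameters copied from the example URLs on the slide.
    static final String BASE = "http://localhost:8983/solr/answer";
    static final String FIXED_PARAMS = "&defType=qa&qa=true&qa.qf=body";

    /** Returns the request URL for a natural language question. */
    static String build(String question) {
        try {
            // URLEncoder turns spaces into '+' and '?' into "%3F".
            String q = URLEncoder.encode(question, "UTF-8");
            return BASE + "?q=" + q + FIXED_PARAMS;
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(build("What is trimethylbenzene?"));
    }
}
```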

Page 4: Taming Text

Fact-based QA Demo

Page 5: Taming Text

What does it take to build this system?

Page 6: Taming Text

Agenda

• Question Answering In Detail
– Building Blocks
– Indexing
– Search/Passage Retrieval
– Classification
– Scoring

• Other Interesting Topics
– Clustering
– Fuzzy-Wuzzy Strings

• What’s next?

• Resources

Page 7: Taming Text

A Grain of Salt

• Text is a strange and magical world filled with…
– Evil villains
– Jesters
– Wizards
– Unicorns
– Heroes!

• In other words, no system will be perfect

http://images1.wikia.nocookie.net/__cb20121110131756/lotr/images/thumb/e/e7/Gandalf_the_Grey.jpg/220px-Gandalf_the_Grey.jpg

Page 8: Taming Text

The Ugly Truth

• You will spend most of your time in NLP, search, etc. doing “grunt” work nicely labeled as:
– Preprocessing
– Feature Selection
– Sampling
– Validation/testing/etc.
– Content extraction
– ETL

• Corollary: Start with simple, tried and true algorithms, then iterate

Page 9: Taming Text

Getting Started

• git clone git@github.com:tamingtext/book.git

• See the README for pre-requisites

• ./bin contains useful scripts to get started

• You’ll need to download some pretty big dependencies:
– OpenNLP Models
– WordNet
– Wikipedia subset

Page 10: Taming Text

Question Answering (QA)

Page 11: Taming Text

What is QA?

• You’ve seen QA in action already thanks to IBM and Jeopardy!

• Instead of providing 10 blue links, provide the answer!

• Exercises many search and NLP features

• See Ch. 8

Page 12: Taming Text

Simple QA Workflow

Page 13: Taming Text

Building Blocks

• Sentence Detection

• Part of Speech Tagging

• Parsing

• Ch. 2
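Sentence detection is the first of these building blocks. The book uses Apache OpenNLP for it, which needs the library and a trained model on hand. As a rough stand-in that runs on a bare JDK, the sketch below uses java.text.BreakIterator instead; it is rule-based, not OpenNLP's statistical detector, and the class name SentenceSplitter is invented for this example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class SentenceSplitter {
    /** Splits text into sentences using the JDK's rule-based BreakIterator. */
    static List<String> split(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        // Walk consecutive boundaries; each [start, end) span is one sentence.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "Solr retrieves passages. OpenNLP tags the words. Wikipedia supplies the facts.";
        for (String s : split(text)) {
            System.out.println(s);
        }
    }
}
```

Rule-based splitting tends to stumble on abbreviations and unusual punctuation, which is one reason a trained detector is attractive for real text.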

Page 14: Taming Text

QA in Taming Text

• Apache Solr for Passage Retrieval and integration

• Apache OpenNLP for sentence detection, parsing, POS tagging and answer type classification

• Custom code for Query Parsing, Scoring
– See com.tamingtext.qa package

• Wikipedia for “truth”

Page 15: Taming Text

Demo

• $TT_HOME/bin/start-solr.sh solr-qa
– http://localhost:8983/solr/answer

• Once that is up and running
– $TT_HOME/bin/indexWikipedia.sh --wikiFile ~/projects/manning/maven.tamingtext.com/freebase-wex-2011-01-18-articles-first10k.tsv

• When done, you can ask questions!

Page 16: Taming Text

Indexing

• Ingest raw data into the system and make it available for search

• Garbage In, Garbage Out
– Need to spend some time understanding and modeling your data just like you would with a DB
– Lather, rinse, repeat

• See the $TT_HOME/apache-solr/solr-qa/conf/schema.xml for setup

• WikipediaWexIndexer.java for indexing code

Page 17: Taming Text

Aside: Named Entity Recognition

• NER is the process of extracting proper names, etc. from text

• Plays a vital role in QA and many other NLP systems

• Often solved using classification approaches

Page 18: Taming Text

• Custom Query Parser takes in the user’s natural language query, classifies it to find the Answer Type, and generates a Solr query

• Retrieve candidate passages that match keywords and expected answer type

• Unlike keyword search, we need to know exactly where matches occur

Page 19: Taming Text

Answer Type Classification

• Answer Type examples:
– Person (P), Location (L), Organization (O), Time Point (T), Duration (R), Money (M)
– See page 248 for more

• Train an OpenNLP classifier off of a set of previously annotated questions, e.g.:
– P Which French monarch reinstated the divine right of the monarchy to France and was known as `The Sun King' because of the splendour of his reign?
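The annotated example above appears to be an answer type code, a space, and then the question. Assuming that layout holds for every training line (an assumption drawn only from this one example), reading a line into a label and its text could look like the sketch below. The class AnswerTypeExample is hypothetical; the code-to-name table uses only the six types listed on this slide.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class AnswerTypeExample {
    // Codes and names as listed on the slide; the book defines more.
    static final Map<String, String> NAMES = new LinkedHashMap<>();
    static {
        NAMES.put("P", "Person");
        NAMES.put("L", "Location");
        NAMES.put("O", "Organization");
        NAMES.put("T", "Time Point");
        NAMES.put("R", "Duration");
        NAMES.put("M", "Money");
    }

    final String code;
    final String question;

    private AnswerTypeExample(String code, String question) {
        this.code = code;
        this.question = question;
    }

    /** Parses "<code> <question>"; this layout is assumed from the slide's example. */
    static AnswerTypeExample parse(String line) {
        String trimmed = line.trim();
        int space = trimmed.indexOf(' ');
        if (space < 0) {
            throw new IllegalArgumentException("expected '<code> <question>': " + line);
        }
        return new AnswerTypeExample(trimmed.substring(0, space),
                                     trimmed.substring(space + 1).trim());
    }

    String typeName() {
        return NAMES.getOrDefault(code, "Unknown");
    }

    public static void main(String[] args) {
        AnswerTypeExample ex = parse("P Which French monarch was known as the Sun King?");
        System.out.println(ex.typeName() + ": " + ex.question);
    }
}
```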

Page 20: Taming Text

Scoring

Page 21: Taming Text

Other Areas of NLP/Machine Learning

Page 22: Taming Text

Clustering

• Group together content based on some notion of similarity

• Book covers (ch. 6):
– Search result clustering using Carrot2
– Whole collection clustering using Mahout
– Topic Modeling

• Mahout comes with many different algorithms

Page 23: Taming Text

Clustering Use Cases

• Google News

• Outlier detection in smart grids

• Recommendations
– Products
– People, etc.

Page 24: Taming Text

In Focus: K-Means

http://en.wikipedia.org/wiki/K-means_clustering
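For readers who want to see the moving parts, here is a small, self-contained sketch of the standard k-means loop (Lloyd's algorithm): assign each point to its nearest centroid, then move each centroid to the mean of its points, and repeat until nothing changes. It is not the Mahout implementation the book uses. The class name and the naive "first k points" initialization are choices made for this example only.

```java
import java.util.Arrays;

final class KMeansSketch {
    /**
     * Runs Lloyd's algorithm and returns the cluster index of each point.
     * Assumes k <= points.length and that all points share one dimension.
     */
    static int[] cluster(double[][] points, int k, int maxIterations) {
        int n = points.length;
        int dim = points[0].length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = points[c].clone(); // naive init: first k points
        }
        int[] assignment = new int[n];
        Arrays.fill(assignment, -1);

        for (int iter = 0; iter < maxIterations; iter++) {
            // Assignment step: attach each point to its nearest centroid.
            boolean changed = false;
            for (int i = 0; i < n; i++) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double d = squaredDistance(points[i], centroids[c]);
                    if (d < bestDist) {
                        bestDist = d;
                        best = c;
                    }
                }
                if (assignment[i] != best) {
                    assignment[i] = best;
                    changed = true;
                }
            }
            if (!changed) {
                break; // converged
            }
            // Update step: move each centroid to the mean of its points.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int i = 0; i < n; i++) {
                counts[assignment[i]]++;
                for (int d = 0; d < dim; d++) {
                    sums[assignment[i]][d] += points[i][d];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // leave an empty cluster's centroid where it is
                }
                for (int d = 0; d < dim; d++) {
                    centroids[c][d] = sums[c][d] / counts[c];
                }
            }
        }
        return assignment;
    }

    private static double squaredDistance(double[] a, double[] b) {
        double sum = 0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }

    public static void main(String[] args) {
        double[][] points = {{1, 1}, {9, 9}, {1.5, 2}, {8, 8}, {1, 0.5}, {9, 8.5}};
        System.out.println(Arrays.toString(cluster(points, 2, 100)));
    }
}
```

For text, the points would typically be document vectors (for example term weights), and results depend heavily on the initial centroids, so real implementations usually pick them more carefully.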

Page 25: Taming Text

Fuzzy-Wuzzy Strings

• Fuzzy string matching is a common, and difficult, problem

• Useful for solving problems like:
– “Did you mean” spell checking
– Auto-suggest
– Record linkage

Page 26: Taming Text

Common Approaches

• See com.tamingtext.fuzzy package

• Jaccard
– Measure character overlap

• Levenshtein (Edit Distance)
– Count the number of edits required to transform one word into the other

• Jaro-Winkler
– Account for position
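Two of these measures are short enough to sketch directly. The code below is an illustration written for this transcript, not the code in com.tamingtext.fuzzy: Jaccard is computed over the sets of distinct characters in each string (one common reading of "character overlap"), and Levenshtein uses the usual dynamic-programming table with unit costs for insert, delete, and substitute.

```java
import java.util.HashSet;
import java.util.Set;

final class FuzzyMatch {
    /** Jaccard similarity over distinct characters: |A ∩ B| / |A ∪ B|. */
    static double jaccard(String a, String b) {
        Set<Character> setA = chars(a);
        Set<Character> setB = chars(b);
        if (setA.isEmpty() && setB.isEmpty()) {
            return 1.0; // two empty strings are treated as identical
        }
        Set<Character> intersection = new HashSet<>(setA);
        intersection.retainAll(setB);
        Set<Character> union = new HashSet<>(setA);
        union.addAll(setB);
        return (double) intersection.size() / union.size();
    }

    private static Set<Character> chars(String s) {
        Set<Character> set = new HashSet<>();
        for (char c : s.toCharArray()) {
            set.add(c);
        }
        return set;
    }

    /** Levenshtein distance: minimum inserts, deletes, and substitutions. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j inserts
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletes
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("kitten", "sitting")); // prints 3
        System.out.println(jaccard("abc", "abd"));            // prints 0.5
    }
}
```

Note that this Jaccard variant ignores order and repetition entirely, so anagrams score 1.0; that gap is part of what position-aware measures such as Jaro-Winkler address.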

Page 27: Taming Text

Trie

• The Trie is a very useful data structure for working with strings

• Find common subsequences

• Auto-suggest, others

• Ternary Search Trie
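To make the auto-suggest use case concrete, here is a minimal sketch of a plain trie with prefix lookup. It is not a ternary search trie and not the book's implementation; the class name SuggestTrie is invented for this example. Children are kept in a TreeMap so suggestions come back in alphabetical order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class SuggestTrie {
    private static final class Node {
        final Map<Character, Node> children = new TreeMap<>();
        boolean isWord; // true if a stored word ends at this node
    }

    private final Node root = new Node();

    /** Adds a word, creating one node per character as needed. */
    void insert(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.isWord = true;
    }

    /** Returns every stored word that starts with the prefix, in sorted order. */
    List<String> suggest(String prefix) {
        List<String> results = new ArrayList<>();
        Node node = root;
        for (char c : prefix.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return results; // no stored word has this prefix
            }
        }
        collect(node, new StringBuilder(prefix), results);
        return results;
    }

    // Depth-first walk below the prefix node, rebuilding each word on the way.
    private static void collect(Node node, StringBuilder path, List<String> out) {
        if (node.isWord) {
            out.add(path.toString());
        }
        for (Map.Entry<Character, Node> entry : node.children.entrySet()) {
            char c = entry.getKey();
            path.append(c);
            collect(entry.getValue(), path, out);
            path.deleteCharAt(path.length() - 1);
        }
    }

    public static void main(String[] args) {
        SuggestTrie trie = new SuggestTrie();
        for (String w : new String[] {"solr", "solid", "search", "sort"}) {
            trie.insert(w);
        }
        System.out.println(trie.suggest("sol")); // prints [solid, solr]
    }
}
```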

Page 28: Taming Text

What’s Next?

Page 29: Taming Text

Much Harder Problems

• Chapter 9

• Semantics, Pragmatics and beyond

• Sentiment Analysis

• Document and collection summarization

• Relationship Extraction

• Cross-language Search

• Importance

Page 30: Taming Text

Thank You!

• 3 copies of Taming Text

Page 31: Taming Text

Resources

• http://www.manning.com/ingersoll
– http://github.com/tamingtext/book

• http://www.tamingtext.com

• @tamingtext

• Me:
– @gsingers
– [email protected]