Post on 10-May-2015
Taming Text
Grant Ingersoll, CTO, LucidWorks
@tamingtext, @gsingers
About the Book
• Goal: An engineer’s guide to search and Natural Language Processing (NLP) and Machine Learning
• Target Audience: You
• All examples in Java, but concepts easily ported
• Covers:
– Search, fuzzy string matching, human language basics, clustering, classification, Question Answering, intro to advanced topics
Answer Me This!
• What is trimethylbenzene?
– http://localhost:8983/solr/answer?q=What+is+trimethylbenzene%3F&defType=qa&qa=true&qa.qf=body
• Who is ten minute warning?
– http://localhost:8983/solr/answer?q=who+is+ten+minute+warning%3F&defType=qa&qa=true&qa.qf=body
• What station serves the A train?
– http://localhost:8983/solr/answer?q=what+station+serves+the+A+train%3F&defType=qa&qa=true&qa.qf=body
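The request URLs above all follow one pattern, which can be sketched in Java as below. The class and method names (QaUrlBuilder, build) are hypothetical and are not taken from the book's code; only the host, the /solr/answer path, and the defType, qa, and qa.qf parameters come from the slide.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper: assembles the demo's request URL from a question.
class QaUrlBuilder {
    // Base URL of the answer handler, as shown on the slide.
    static final String BASE = "http://localhost:8983/solr/answer";

    /** Builds a request URL for a natural-language question. */
    static String build(String question) {
        try {
            // URLEncoder uses form encoding: spaces become '+', '?' becomes "%3F".
            String q = URLEncoder.encode(question, "UTF-8");
            return BASE + "?q=" + q + "&defType=qa&qa=true&qa.qf=body";
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(build("What is trimethylbenzene?"));
    }
}
```

Encoding the question this way reproduces the first URL on the slide, which is a quick check that the query string is well formed before sending it to Solr.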
Fact-based QA Demo
What does it take to build this system?
Agenda
• Question Answering In Detail
– Building Blocks
– Indexing
– Search/Passage Retrieval
– Classification
– Scoring
• Other Interesting Topics
– Clustering
– Fuzzy-Wuzzy Strings
• What’s next?
• Resources
A Grain of Salt
• Text is a strange and magical world filled with…
– Evil villains
– Jesters
– Wizards
– Unicorns
– Heroes!
• In other words, no system will be perfect
The Ugly Truth
• You will spend most of your time in NLP, search, etc. doing “grunt” work nicely labeled as:
– Preprocessing
– Feature Selection
– Sampling
– Validation/testing/etc.
– Content extraction
– ETL
• Corollary: Start with simple, tried and true algorithms, then iterate
Getting Started
• git clone git@github.com:tamingtext/book.git
• See the README for prerequisites
• ./bin contains useful scripts to get started
• You’ll need to download some pretty big dependencies:
– OpenNLP Models
– WordNet
– Wikipedia subset
Question Answering (QA)
What is QA?
• You’ve seen QA in action already thanks to IBM and Jeopardy!
• Instead of providing 10 blue links, provide the answer!
• Exercises many search and NLP features
• See Ch. 8
Simple QA Workflow
Building Blocks
• Sentence Detection
• Part of Speech Tagging
• Parsing
• Ch. 2
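As a rough illustration of the first building block, sentence detection, here is a minimal sketch that uses the JDK's rule-based java.text.BreakIterator. This is a stand-in chosen so the example has no dependencies; the book uses Apache OpenNLP's trained models for this step, and the class name SentenceSplitter is hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: rule-based sentence detection with the JDK only.
class SentenceSplitter {
    /** Splits text into trimmed sentences using the JDK's sentence rules. */
    static List<String> split(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        // Each call to next() returns the end offset of the next sentence.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        for (String s : split("Solr retrieves passages. OpenNLP tags parts of speech.")) {
            System.out.println(s);
        }
    }
}
```

Rule-based splitting tends to stumble on abbreviations and unusual punctuation, which is one reason a trained sentence model is often preferred for real text.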
QA in Taming Text
• Apache Solr for Passage Retrieval and integration
• Apache OpenNLP for sentence detection, parsing, POS tagging and answer type classification
• Custom code for Query Parsing, Scoring
– See com.tamingtext.qa package
• Wikipedia for “truth”
Demo
• $TT_HOME/bin/start-solr.sh solr-qa
– http://localhost:8983/solr/answer
• Once that is up and running
– $TT_HOME/bin/indexWikipedia.sh --wikiFile ~/projects/manning/maven.tamingtext.com/freebase-wex-2011-01-18-articles-first10k.tsv
• When done, you can ask questions!
Indexing
• Ingest raw data into the system and make it available for search
• Garbage In, Garbage Out
– Need to spend some time understanding and modeling your data just like you would with a DB
– Lather, rinse, repeat
• See the $TT_HOME/apache-solr/solr-qa/conf/schema.xml for setup
• WikipediaWexIndexer.java for indexing code
Aside: Named Entity Recognition
• NER is the process of extracting proper names, etc. from text
• Plays a vital role in QA and many other NLP systems
• Often solved using classification approaches
• Custom Query Parser takes in user’s natural language query, classifies it to find the Answer Type and generates Solr query
• Retrieve candidate passages that match keywords and expected answer type
• Unlike keyword search, we need to know exactly where matches occur
Answer Type Classification
• Answer Type examples:
– Person (P), Location (L), Organization (O), Time Point (T), Duration (R), Money (M)
– See page 248 for more
• Train an OpenNLP classifier off of a set of previously annotated questions, e.g.:
– P Which French monarch reinstated the divine right of the monarchy to France and was known as `The Sun King' because of the splendour of his reign?
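To make the training data format concrete, here is a minimal sketch that parses one annotated question into its answer type code and question text. It assumes each line is the code, a single space, then the question, which is inferred from the slide's example rather than from the book's actual training files. The class name AnnotatedQuestion is hypothetical, and this covers only the data preparation step, not the OpenNLP training itself.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical value class for one annotated training question.
class AnnotatedQuestion {
    // Answer type codes listed on the slide.
    static final Map<String, String> ANSWER_TYPES = new LinkedHashMap<>();
    static {
        ANSWER_TYPES.put("P", "Person");
        ANSWER_TYPES.put("L", "Location");
        ANSWER_TYPES.put("O", "Organization");
        ANSWER_TYPES.put("T", "Time Point");
        ANSWER_TYPES.put("R", "Duration");
        ANSWER_TYPES.put("M", "Money");
    }

    final String code;
    final String question;

    AnnotatedQuestion(String code, String question) {
        this.code = code;
        this.question = question;
    }

    /** Parses "CODE question text"; the line layout is an assumption. */
    static AnnotatedQuestion parse(String line) {
        String trimmed = line.trim();
        int space = trimmed.indexOf(' ');
        if (space < 0) {
            throw new IllegalArgumentException("Expected '<code> <question>': " + line);
        }
        String code = trimmed.substring(0, space);
        if (!ANSWER_TYPES.containsKey(code)) {
            throw new IllegalArgumentException("Unknown answer type code: " + code);
        }
        return new AnnotatedQuestion(code, trimmed.substring(space + 1).trim());
    }

    String typeName() {
        return ANSWER_TYPES.get(code);
    }

    public static void main(String[] args) {
        AnnotatedQuestion q = parse("P Which French monarch reinstated the divine right "
                + "of the monarchy to France and was known as `The Sun King' "
                + "because of the splendour of his reign?");
        System.out.println(q.typeName() + ": " + q.question);
    }
}
```

Rejecting unknown codes while loading is a cheap way to catch labeling mistakes before they reach the classifier.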
Scoring
Other Areas of NLP/Machine Learning
Clustering
• Group together content based on some notion of similarity
• Book covers (ch. 6):
– Search result clustering using Carrot2
– Whole collection clustering using Mahout
– Topic Modeling
• Mahout comes with many different algorithms
Clustering Use Cases
• Google News
• Outlier detection in smart grids
• Recommendations
– Products
– People, etc.
In Focus: K-Means
http://en.wikipedia.org/wiki/K-means_clustering
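For readers who want to see the idea in code, here is a minimal sketch of the standard k-means loop (Lloyd's algorithm): assign each point to its nearest centroid, then move each centroid to the mean of its points. It is not Mahout's implementation. The class name KMeansSketch is hypothetical, and seeding the centroids with the first k points is a simplifying assumption made to keep the run deterministic.

```java
import java.util.Arrays;

// Hypothetical, dependency-free sketch of Lloyd's k-means iterations.
class KMeansSketch {
    /** Returns the cluster index of each point; requires points.length >= k. */
    static int[] cluster(double[][] points, int k, int maxIterations) {
        int dims = points[0].length;
        // Assumption: seed centroids with the first k points (naive but deterministic).
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = points[c].clone();
        }
        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1);

        for (int iter = 0; iter < maxIterations; iter++) {
            // Assignment step: attach every point to its nearest centroid.
            boolean changed = false;
            for (int i = 0; i < points.length; i++) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double d = squaredDistance(points[i], centroids[c]);
                    if (d < bestDist) {
                        bestDist = d;
                        best = c;
                    }
                }
                if (assignment[i] != best) {
                    assignment[i] = best;
                    changed = true;
                }
            }
            if (!changed) {
                break; // converged: no point switched clusters
            }
            // Update step: move each centroid to the mean of its points.
            double[][] sums = new double[k][dims];
            int[] counts = new int[k];
            for (int i = 0; i < points.length; i++) {
                counts[assignment[i]]++;
                for (int d = 0; d < dims; d++) {
                    sums[assignment[i]][d] += points[i][d];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] > 0) { // an empty cluster keeps its old centroid
                    for (int d = 0; d < dims; d++) {
                        centroids[c][d] = sums[c][d] / counts[c];
                    }
                }
            }
        }
        return assignment;
    }

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }

    public static void main(String[] args) {
        double[][] points = {{1, 1}, {9, 9}, {1, 2}, {8, 9}, {2, 1}, {9, 8}};
        System.out.println(Arrays.toString(cluster(points, 2, 10)));
    }
}
```

For text, each point would typically be a document's term vector, and the distance measure is often swapped for something like cosine distance.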
Fuzzy-Wuzzy Strings
• Fuzzy string matching is a common, and difficult, problem
• Useful for solving problems like:
– “Did you mean” spell checking
– Auto-suggest
– Record linkage
Common Approaches
• See com.tamingtext.fuzzy package
• Jaccard
– Measure character overlap
• Levenshtein (Edit Distance)
– Count the number of edits required to transform one word into the other
• Jaro-Winkler
– Account for position
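Two of the measures above are short enough to sketch directly. The code below is an illustration, not the code in com.tamingtext.fuzzy: the class name FuzzyMetrics is hypothetical, Jaccard is computed over sets of distinct characters, and Levenshtein uses the usual two-row dynamic programming table with unit cost for insert, delete, and substitute.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper with two common fuzzy string measures.
class FuzzyMetrics {
    /** Jaccard similarity over distinct characters: |A ∩ B| / |A ∪ B|. */
    static double jaccard(String a, String b) {
        Set<Character> setA = chars(a);
        Set<Character> setB = chars(b);
        if (setA.isEmpty() && setB.isEmpty()) {
            return 1.0; // convention chosen here: two empty strings are identical
        }
        Set<Character> intersection = new HashSet<>(setA);
        intersection.retainAll(setB);
        Set<Character> union = new HashSet<>(setA);
        union.addAll(setB);
        return (double) intersection.size() / union.size();
    }

    private static Set<Character> chars(String s) {
        Set<Character> set = new HashSet<>();
        for (char ch : s.toCharArray()) {
            set.add(ch);
        }
        return set;
    }

    /** Levenshtein distance: minimum number of single-character edits. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("kitten", "sitting")); // 3
        System.out.println(jaccard("taming", "timing"));
    }
}
```

Character-set Jaccard ignores order and repetition, so anagrams score 1.0; that blind spot is what position-aware measures such as Jaro-Winkler are meant to address.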
Trie
• The Trie is a very useful data structure for working with strings
• Find common subsequences
• Auto-suggest, others
• Ternary Search Trie
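The auto-suggest use case can be sketched with a plain character trie, shown below. Note that this is a simple map-based trie and not the ternary search trie mentioned on the slide, and that the class name PrefixTrie and its methods are hypothetical. Children are kept in a TreeMap so suggestions come back in alphabetical order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical map-based trie supporting prefix lookups for auto-suggest.
class PrefixTrie {
    private static final class Node {
        // TreeMap keeps children sorted, so traversal yields alphabetical order.
        final Map<Character, Node> children = new TreeMap<>();
        boolean isWord;
    }

    private final Node root = new Node();

    /** Adds a word, creating one node per character as needed. */
    void insert(String word) {
        Node node = root;
        for (char ch : word.toCharArray()) {
            node = node.children.computeIfAbsent(ch, c -> new Node());
        }
        node.isWord = true;
    }

    /** Returns all stored words that start with the given prefix. */
    List<String> suggest(String prefix) {
        List<String> results = new ArrayList<>();
        Node node = root;
        for (char ch : prefix.toCharArray()) {
            node = node.children.get(ch);
            if (node == null) {
                return results; // no stored word has this prefix
            }
        }
        collect(node, new StringBuilder(prefix), results);
        return results;
    }

    // Depth-first walk below the prefix node, rebuilding each word on the way.
    private void collect(Node node, StringBuilder path, List<String> out) {
        if (node.isWord) {
            out.add(path.toString());
        }
        for (Map.Entry<Character, Node> entry : node.children.entrySet()) {
            path.append(entry.getKey());
            collect(entry.getValue(), path, out);
            path.setLength(path.length() - 1);
        }
    }

    public static void main(String[] args) {
        PrefixTrie trie = new PrefixTrie();
        for (String w : new String[] {"solr", "solar", "search", "mahout"}) {
            trie.insert(w);
        }
        System.out.println(trie.suggest("so")); // [solar, solr]
    }
}
```

Lookup cost depends on the length of the prefix rather than the number of stored words, which is what makes the structure attractive for suggest-as-you-type.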
What’s Next?
Much Harder Problems
• Chapter 9
• Semantics, Pragmatics and beyond
• Sentiment Analysis
• Document and collection summarization
• Relationship Extraction
• Cross-language Search
• Importance
Thank You!
• 3 copies of Taming Text
Resources
• http://www.manning.com/ingersoll
– http://github.com/tamingtext/book
• http://www.tamingtext.com
• @tamingtext
• Me:
– @gsingers
– grant@lucidworks.com