
The Well-Tempered Search Application

Variations on a Theme:

Why does my search app suck, and what can I do about it?

Ted Sullivan – (old Phuddy Duddy) Senior (very much so I’m afraid) Solutions (I hope) Architect (and sometime plumber), Lucidworks Technical Services

Our Basic Premises (Premisi?)

• Lemma 1: Search Applications use algorithms that make finding chunks of text within large datasets possible in HTT (human-tolerable time).

• Lemma 2: These algorithms work by breaking text into primitive components and building up a search “experience” from that.

• Lemma 3: Lemma 2 is not sufficient to achieve Lemma 1.

The Basic Disconnect

• Text can be analyzed at the level of tokens (syntax) and at the level of meaning (semantics).

• We think one way (semantics), search engines think another (syntax – i.e. token order).

• How do we bridge the gap? … More clever algorithms!

Art and Science

• We need to be intelligent curators of these algorithms. Craftsmen (craftswomen?) that think of these as tools with a specific purpose.

• Like any good craftsperson – we need a wide array of tools to get the job done (well almost).

When is my search app done?

• Quick answer: NEVER (ain’t consultin’ great?)

• Long answer – As long as it continues to improve, like fine wine or bourbon, you are on the path to enlightenment.

• How do you get there, grasshopper? Add semantic intelligence to the engine!


Search cannot be shrink-wrapped!!

What have we got for Donny behind Curtain #1 Jay?

Well Monty - Heeeeeeeeeeeerrrrrrrreeeeesssss the Google … SEARCH Appliance!!!!*

Sorry Donny – It’s a ZONK!* but Google Web Search has some Serious Mojo!

Prelude part 1 – The basic problem

The inverted index and “bag-of-words” search:

The red fox jumped over the fence.

Time flies like an arrow. Fruit flies like a banana.

Doc 1: the → 1, 6   red → 2   fox → 3   jumped → 4   over → 5   fence → 7

Doc 2: flies → 2, 7   like → 3, 8
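A minimal sketch of the idea, assuming whitespace tokenization and one toy document (class and variable names are illustrative, not from the deck):

import java.util.*;

public class TinyInvertedIndex {
    // term -> positions (1-based) at which the term occurs in the document
    private final Map<String, List<Integer>> postings = new TreeMap<>();

    public void index(String doc) {
        String[] tokens = doc.toLowerCase().replaceAll("[.,]", "").split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            postings.computeIfAbsent(tokens[pos], t -> new ArrayList<>()).add(pos + 1);
        }
    }

    public List<Integer> positions(String term) {
        return postings.getOrDefault(term.toLowerCase(), Collections.emptyList());
    }

    public static void main(String[] args) {
        TinyInvertedIndex idx = new TinyInvertedIndex();
        idx.index("Time flies like an arrow. Fruit flies like a banana.");
        System.out.println("flies -> " + idx.positions("flies")); // [2, 7]
        System.out.println("like  -> " + idx.positions("like"));  // [3, 8]
    }
}

Once text is reduced to a bag of terms and positions, "Time flies like an arrow" and "Fruit flies like a banana" look a lot more alike to the engine than they do to us, which is the basic disconnect.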

Prelude part B – The Tried and True

• Phrase and Proximity boosting and “Slop” (see the example query after this list)

• Synonyms and stop words

• Stemming or Lemmatization

• Autocomplete

• Best Bets / Landing Pages – the sledgehammer

• Spell check – spell suggest – aka the warm fuzzies.
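For instance, a rough sketch of phrase and proximity boosting with Solr’s edismax parser (field names are illustrative): qf picks the match fields, pf boosts documents where the whole query occurs as a phrase, and ps allows some slop:

http://localhost:8983/solr/collection1/select?defType=edismax&q=red+sofa&qf=text&pf=text^10&ps=2&wt=json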

Fugue - Subject or Exposition

Search engines need more ‘semantic awareness’ or at least the illusion of this.

There is a heavy-duty solution called Artificial Intelligence which, except in the fertile imagination of Hollywood screenwriters, is not there yet. So we need to fake it just a bit.

Theme and Variations I: Autophrasing and the Red Sofa

Theme: When multiple words mean just one thing.

Fuzzy way: Boosting phrases (proximity and phrase slop) pushes false positives down – i.e. out of the limelight – i.e. shoves ‘em under the rug.

This encounters a problem with faceted search

Like the eye of Sauron in LOTR or Santa Claus, the faceting engine SEES ALL (sins)!

Brake Pads example: hits on things that have ‘brake’ (like children’s stroller brakes) and ‘pads’ – like mattress pads.

Variation I: Autophrasing

AutoPhrasingTokenFilter tells Lucene not to break up a noun phrase that represents a single thing – you provide a flat list of phrases.

It creates a one-to-one token mapping, which Lucene prefers because it avoids the “sausagization” problem.

https://github.com/LucidWorks/auto-phrase-tokenfilter

income tax refund
income tax
tax refund

“income tax” is not income. A “tax refund” is not a tax.

Solution: Autophrasing + synonym mapping

income tax => tax
tax refund => refund

Autophrasing Example

Autophrasing Setup

autophrases.txt:

income tax
tax refund
tax rebate
sales tax
property tax

synonyms.txt:

income_tax,property_tax,sales_tax,tax
tax_refund,refund,rebate,tax_rebate

<fieldType name="text_autophrase" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="com.lucidworks.analysis.AutoPhrasingTokenFilterFactory"
            phrases="autophrases.txt" includeTokens="true" replaceWhitespaceWith="_" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
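Only the index-time analyzer is shown above. One way to apply the synonyms.txt mappings to the underscored phrase tokens is a query-time analyzer along these lines (a sketch, not from the original deck; note that standard query parsers split on whitespace before analysis, so the GitHub project above also ships a query parser plugin to handle the query side):

<analyzer type="query">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="com.lucidworks.analysis.AutoPhrasingTokenFilterFactory"
          phrases="autophrases.txt" includeTokens="true" replaceWhitespaceWith="_" />
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
  <filter class="solr.PorterStemFilterFactory"/>
</analyzer>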

Multi-term synonym problem

• New York, New York – it’s a HELLOVA town!

The subject was inspired by an old JIRA ticket: LUCENE-1622

“if multi-word synonyms are indexed together with the original token stream (at overlapping positions), then a query for a partial synonym sequence (e.g., “big” in the synonym “big apple” for “new york city”) causes the document to match”

(or “apple” which will hit on my blog post if you crawl lucidworks.com !)

This means certain phrase queries should match but don't (e.g.: "hotspot is down"), and other phrase queries shouldn't match but do (e.g.: "fast hotspot fi").

Other cases do work correctly (e.g.: "fast hotspot"). We refer to this "lossy serialization" as sausagization, because the incoming graph is unexpectedly turned from a correct word lattice into an incorrect sausage.

This limitation is challenging to fix: it requires changing the index format (and Codec APIs) to store an additional int position length per position, and then fixing positional queries to respect this value.

Sausagization: from Mike McCandless’s Changing Bits blog post “Lucene’s TokenStreams are actually graphs!”

http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html

Multi-term synonym demo

autophrases.txt:

new york
new york state
empire state
new york city
new york new york
big apple
ny ny
city of new york
state of new york
ny state

synonyms.txt:

new_york => new_york_state,new_york_city,big_apple,new_york_new_york,ny_ny,nyc,empire_state,ny_state,state_of_new_york
new_york_state,empire_state,ny_state,state_of_new_york
new_york_city,big_apple,new_york_new_york,ny_ny,nyc,city_of_new_york

This document is about new york state.

This document is about new york city.

There is a lot going on in NYC.

I heart the big apple.

The empire state is a great state.

New York, New York is a hellova town.

I am a native of the great state of New York.

Queries: new york / new york city / new york state

Multi-term synonym demo

/select /autophrase


Query: empire state


(Even a blind squirrel finds a nut once in a while)

Variation II: The Red Sofa Problem

{"response":{"numFound":3,"start":0,"docs":[

{"color":"red",”text":"This is the red sofa example. Please find with 'red sofa' query.",

{"color":"red",”text":"This is a red beach ball. It is red in color but is not something that you

should not sit on because you would tend to roll off.",{

"color":"blue",”text":"This is a blue sofa, it should only hit on sofas that are blue in color."]

}}

OOTB – q=red sofa is interpreted as text:red text:sofa (default OR)

http://localhost:8983/solr/collection1/select?q=red+sofa&wt=json

Closing the Loop: Content Tagging and Intelligent Query Filtering

Using the search index itself as the knowledge source:

Solution for the Red Sofa problem – Query Autofiltering: search-index-driven query introspection / query rewriting

Lucene FieldCache Magic

Lucene FieldCache (moved out of Lucene core and superseded by UninvertingReader in Lucene 5.0)

Inverted Index: Show all documents that have this term value in this field.

Uninverted or Forward Index: Show all term values that have been indexed in this field.

SolrIndexSearcher searcher = rb.req.getSearcher();
// Uninverted (forward) view of the category field, e.g. "color"
SortedDocValues fieldValues = FieldCache.DEFAULT.getTermsIndex(
    searcher.getAtomicReader(), categoryField);
...
// Walk the query tokens and check which ones are known values of that field
StringTokenizer strtok = new StringTokenizer(query, " .,:;\"'");
while (strtok.hasMoreTokens()) {
  String tok = strtok.nextToken().toLowerCase();
  BytesRef key = new BytesRef(tok.getBytes());
  if (fieldValues.lookupTerm(key) >= 0) {
    // the token matches an indexed value of categoryField, so rewrite it
    // as a filter on that field (e.g. color:red) instead of a free-text term
  }
}

Query Autofiltering

{"response":{"numFound":1,"start":0,"docs":[

{"id":"1","color":"red","description":"This is the red sofa example. Please find with 'red sofa' query."]

}

http://localhost:8983/solr/collection1/infer?q=red+sofa&wt=json

Now a search for “red sofa” only returns … red sofas!

But – is this too “brute force”? The takeaway is that using the search index AS a knowledge store can be very powerful!

Architecture: it’s all about Plumbing

• Pipelines for every occasion.

Indexing Pipelines – good ol’ ETL: content enrichment, tagging, metadata cleanup

Query Pipelines – identification, query preprocessing, introspection

One is the “hand” the other, the “glove”

Index Pipelines

Lots of choices here:

• Internal to Solr – DIH, UpdateRequestProcessor

Pros and cons

• External – Morphlines, Open Pipeline, Flume, Spark, Hadoop, Custom SolrJ

• Lucidworks Fusion

Entity and Fact Extraction

Entities: Things, Locations, Dates, People, Organizations, Concepts

Entity Relationships:
Company was acquired by Company
Drug cures Disease
Person likes Pizza

Annotation Pipelines (UIMA, Lucidworks Fusion): Entity Extraction followed by Fact Extraction

Pattern method: $Drug is used to treat $Condition

Parts of Speech (POS) analysis

Subject Predicate Object
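A toy sketch of the pattern method (a plain regex standing in for the entity-extraction and annotation stages; the sentence and class names are just illustrations):

import java.util.regex.*;

public class FactPatternDemo {
    public static void main(String[] args) {
        // Pattern method: "$Drug is used to treat $Condition"
        // Assumes an earlier entity-extraction step has tagged drug and
        // condition mentions; here a naive regex stands in for all of that.
        Pattern p = Pattern.compile("(\\w+) is used to treat ([\\w ]+)\\.");
        Matcher m = p.matcher("Aspirin is used to treat headaches.");
        if (m.find()) {
            // Subject       Predicate   Object
            System.out.println(m.group(1) + "  treats  " + m.group(2));
        }
    }
}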

Theme and Variations II: The Classification Wars

• Machine Learning or Taxonomy – is it a Floor Wax or a Dessert Topping?

Answer: It’s a floor wax AND a dessert topping! It’s delicious, and just look at that shine!

Machine Learning

Use mathematical vector-crunching algorithms like Latent Dirichlet Allocation (LDA), Bayesian Inference, Maximum Entropy, log likelihood, Support Vector Machines (SVM) etc., to find patterns and to associate those patterns with concepts.

Can be supervised (i.e. given a training set) or unsupervised (the algorithm just finds clusters). Supervised approaches are sometimes called semi-automatic classifiers.

Check out Taming Text by Ingersoll, Morton and Farris (Manning)

Machine Learning in Lucidworks Fusion

Training Data → NLP Trainer Stage → NLP Model

Test Data + NLP Model → NLP Classifier Stage → Classified Documents

Taxonomy or Ontology

“Knowledge graphs” that relate things and concepts to each other either hierarchically or associatively.

Pros:
Works without large amounts of content to analyze
Encapsulates the knowledge of human subject matter experts

Cons:
Often not well designed for search (mixes semantic relationship types / organizational logic)
Requires curation by subject matter experts whose time is costly

Taxonomies Designed for Search

Category nodes and Evidence nodes

Category Node:
A ‘parent’ node
Can have child nodes that are: Sub-Categories, Evidence Nodes

Evidence Node:
Tends to be a leaf node (no children)
Contains keyterms (synonyms)
May contain “rules”, e.g. (if contains term a and term b but not term c)
Evidence Nodes can have more than one category node parent

Hits on Evidence Nodes add to the cumulative score of a Category Node.

Scores can be diluted as they accumulate up the hierarchy – so that the nearest category gets the strongest ‘vote’.
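A minimal sketch of that roll-up, assuming a simple parent pointer per category and a 0.5 dilution factor per level (both are assumptions for illustration, not from the deck):

public class CategoryNode {
    final String name;
    final CategoryNode parent;
    double score = 0.0;

    CategoryNode(String name, CategoryNode parent) { this.name = name; this.parent = parent; }

    // An evidence-node hit contributes its full weight to the nearest category
    // and a diluted share to each ancestor above it.
    void addEvidenceHit(double weight) {
        double w = weight;
        for (CategoryNode n = this; n != null; n = n.parent) {
            n.score += w;
            w *= 0.5;
        }
    }

    public static void main(String[] args) {
        CategoryNode fortune100 = new CategoryNode("Fortune 100 Companies", null);
        CategoryNode manufacturing = new CategoryNode("Manufacturing", fortune100);
        CategoryNode automobiles = new CategoryNode("Automobiles", manufacturing);

        automobiles.addEvidenceHit(1.0);           // e.g. the document mentions "Ford"

        System.out.println(automobiles.score);     // 1.0  - nearest category, strongest vote
        System.out.println(manufacturing.score);   // 0.5
        System.out.println(fortune100.score);      // 0.25
    }
}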

Fortune 100 Companies
  Energy
  Financial Services
    Investment Banks
    Commercial Banks
  Health Care
    Health Insurance
    HMO
    Medical Devices
    Pharmaceuticals
  Hospitality
  Manufacturing
    Aircraft
    Automobiles
    Electrical Equipment

US Corporations

Foreign Corporations
  British, Chinese, French, German, Japanese, Russian, etc.

Evidence examples: Ford, GM, Chrysler, Toyota, BMW; GE, Boeing; Bank of America, Hyatt

Query Pipelines

The ‘Wh’ Words: Who, What, When, Where

Who are they (authentication)?

What can they see (security - authorization)?

When can they see it (entitlement)?

What are they interested in (personalization / recommendation)?

Where are they now (location)?

Query Pipelines

Inferential Search: Query introspection -> Query modification.

Query Autofiltering

Are you feeling lucky today?

Topic boosting / spotlighting: Use ML to detect the topic, then boost and/or spotlight results tagged this way.

Use a specialized collection to store ‘facet knowledge’

The Art of the Fugue: Inferential Search

• Infer what the user is looking for and give them that

• Clever software infers meaning aka query “intent”

• When we do this right, it appears to be magic!

Machine Learning Drives Query Introspection

Training Data → NLP Trainer Stage → NLP Model

Test Data + NLP Model → NLP Classifier Stage → Classified Documents

Machine Learning models can drive Query Introspection

NL Query → NLP Query Stage (applies the NLP Model) → Tagged Query → Landing Page / Boost Documents
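A rough sketch of that query stage (not Fusion’s actual pipeline API; the classifier interface and the topic_s field are placeholders): classify the incoming query, then add a boost on documents tagged with that topic.

import org.apache.solr.client.solrj.SolrQuery;

public class TopicBoostStage {
    // Placeholder for whatever NLP model the pipeline has loaded.
    interface TopicClassifier { String classify(String text); }

    SolrQuery process(String userQuery, TopicClassifier model) {
        SolrQuery q = new SolrQuery(userQuery);
        String topic = model.classify(userQuery);      // e.g. "health_care"
        if (topic != null) {
            q.set("defType", "edismax");
            q.set("bq", "topic_s:" + topic + "^5");    // boost docs tagged with that topic
        }
        return q;
    }
}

If the model is confident enough, the same tagged query could redirect to a landing page instead of just boosting.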

Da Capo al Coda

• Killer search apps are crafted from fine ingredients and, like fine whiskey, will get better with age – if you are paying attention to ‘what’ your users are looking for.

• Putting the pieces together requires an understanding of ‘what’ things are, independent of the words used to describe them.

Thanks for your attention!

Ted Sullivan – Lucidworks Technical Services

[email protected] | ted.sullivan5 | LinkedIn

Metuchen, New Jersey (You gotta problem with that?)