Building Search & Recommendation Engines


Transcript of Building Search & Recommendation Engines

Page 1: Building Search & Recommendation Engines

Building Search & Recommendation Engines

Trey Grainger

SVP of Engineering, Lucidworks

Greenville Data Science 2017.06.29

Page 2: Building Search & Recommendation Engines

Trey Grainger, SVP of Engineering

• Previously Director of Engineering @ CareerBuilder

• MBA, Management of Technology – Georgia Tech

• BA, Computer Science, Business, & Philosophy – Furman University

• Information Retrieval & Web Search - Stanford University

Other fun projects:

• Co-author of Solr in Action, plus numerous research papers

• Frequent conference speaker

• Founder of Celiaccess.com, the gluten-free search engine

• Lucene/Solr contributor

• Startup Investor / Advisor

About Me

Page 3: Building Search & Recommendation Engines

what do you do?

Page 4: Building Search & Recommendation Engines
Page 5: Building Search & Recommendation Engines

Search-Driven Everything

Customer Service

Customer Insights

Fraud Surveillance

Research Portal

Online Retail

Digital Content

Page 6: Building Search & Recommendation Engines

Apache Solr

Page 7: Building Search & Recommendation Engines

“Solr is the popular, blazing-fast, open source enterprise search platform built on Apache Lucene™.”

Page 8: Building Search & Recommendation Engines

Key Solr Features:

● Multilingual Keyword search

● Relevancy Ranking of results

● Faceting & Analytics (nested / relational)

● Highlighting

● Spelling Correction

● Autocomplete/Type-ahead Prediction

● Sorting, Grouping, Deduplication

● Distributed, Fault-tolerant, Scalable

● Geospatial search

● Complex Function queries

● Recommendations (More Like This)

● Graph Queries and Traversals

● SQL Query Support

● Streaming Aggregations

● Batch and Streaming processing

● Highly Configurable / Plugins

● Learning to Rank

● Building machine-learning models

● … many more

*Source: Solr in Action, chapter 2

Page 9: Building Search & Recommendation Engines

The standard for enterprise search.

90% of Fortune 500 uses Solr.

Page 10: Building Search & Recommendation Engines

Lucidworks Fusion

Page 11: Building Search & Recommendation Engines
Page 12: Building Search & Recommendation Engines

All Your Data

Page 13: Building Search & Recommendation Engines

• Over 50 connectors to integrate all your data

• Robust parsing framework to seamlessly ingest all your document types

• Point-and-click indexing configuration and iterative simulation of results for full control over your ETL process

• Your security model enforced end-to-end, from ingest to search, across your different datasources

Page 14: Building Search & Recommendation Engines

Experience Management

Page 15: Building Search & Recommendation Engines

• Relevancy tuning: Point-and-click query pipeline configuration allows fine-grained control of results.

• Machine-driven relevancy: Signals aggregation learns and automatically tunes relevancy and drives recommendations out of the box.

• Powerful pipeline stages: Customize fields, stages, synonyms, boosts, facets, machine learning models, your own scripted behavior, and dozens of other powerful search stages.

• Turnkey search UI (Lucidworks View): Build a sophisticated end-to-end search application in just hours.

Page 16: Building Search & Recommendation Engines

• Seamless integration of your entire search & analytics platform

• All capabilities exposed through secured APIs, so you can use our UI or build your own.

• End-to-end security policies can be applied out of the box to every aspect of your search ecosystem.

• Distributed, fault-tolerant scaling and supervision of your entire search application

Page 17: Building Search & Recommendation Engines
Page 18: Building Search & Recommendation Engines

• Modular library of UI components to create prototypes in hours, not weeks.

• Fine-grained security for government, Fortune 500, military, and law enforcement, enforcing permissions by item, role, geography, or other parameters.

• Stateless architecture so apps are robust, easy to deploy, and highly scalable.

• Supports over 25 data platforms including Solr, SharePoint, Elasticsearch, Cloudera, Attivio, FAST, MongoDB, and many more, and of course Lucidworks Fusion.

• Full library of visualization components for charts, pivots, graphs, and more.

• Pre-tested, re-usable modules include pagination, faceting, geospatial mapping, rich snippets, heatmaps, topic pages, and more.

Create custom search and discovery applications in minutes.

Page 19: Building Search & Recommendation Engines

Lucidworks Fusion Architecture (diagram): Apache Solr (shards), Apache Zookeeper (leader election, load balancing, shared config management), and Apache Spark (workers, cluster manager) sit beneath Fusion's core services (NLP, recommenders/signals, blob storage, pipelines, scheduling, alerting/messaging, connectors), which are exposed through a REST API, the Admin UI, and Twigkit. Data flows in from logs, files, web, database, and cloud sources, with optional HDFS storage and security built in.

Page 20: Building Search & Recommendation Engines

Fusion powers search for the brightest companies in the world.

Page 21: Building Search & Recommendation Engines

Lucidworks Fusion

Page 22: Building Search & Recommendation Engines

search & relevancy

Page 23: Building Search & Recommendation Engines

Basic Keyword Search

The beginning of a typical search journey

Page 24: Building Search & Recommendation Engines

What you SEND to Lucene/Solr:

Document   Content Field
doc1       once upon a time, in a land far, far away
doc2       the cow jumped over the moon.
doc3       the quick brown fox jumped over the lazy dog.
doc4       the cat in the hat
doc5       The brown cow said “moo” once.
…          …

How the content is INDEXED into Lucene/Solr (conceptually):

Term    Documents
a       doc1 [2x]
brown   doc3 [1x], doc5 [1x]
cat     doc4 [1x]
cow     doc2 [1x], doc5 [1x]
…       …
once    doc1 [1x], doc5 [1x]
over    doc2 [1x], doc3 [1x]
the     doc2 [2x], doc3 [2x], doc4 [2x], doc5 [1x]
…       …

The inverted index
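To make the indexing step concrete, here is a minimal Python sketch (illustrative only, not Solr's actual implementation) that builds an inverted index like the one above:

from collections import defaultdict

docs = {
    "doc1": "once upon a time in a land far far away",
    "doc2": "the cow jumped over the moon",
    "doc3": "the quick brown fox jumped over the lazy dog",
    "doc4": "the cat in the hat",
    "doc5": "the brown cow said moo once",   # punctuation pre-stripped for simplicity
}

# term -> {doc id -> term frequency}: one "posting list" per term
index = defaultdict(lambda: defaultdict(int))
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term][doc_id] += 1

print(dict(index["the"]))    # {'doc2': 2, 'doc3': 2, 'doc4': 2, 'doc5': 1}
print(dict(index["brown"]))  # {'doc3': 1, 'doc5': 1}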

Page 25: Building Search & Recommendation Engines

/solr/select/?q=apache solr

Term     Documents
…        …
apache   doc1, doc3, doc4, doc5
hadoop   doc2, doc4, doc6
…        …
solr     doc1, doc3, doc4, doc7, doc8
…        …

apache      → doc1, doc3, doc4, doc5
solr        → doc1, doc3, doc4, doc7, doc8
apache solr → doc1, doc3, doc4

Matching queries to documents

Page 26: Building Search & Recommendation Engines

Text Analysis

Generating terms to index from raw text

Page 27: Building Search & Recommendation Engines

Text Analysis in Solr

A text field in Lucene/Solr has an Analyzer containing:

① Zero or more CharFilters: take incoming text and “clean it up” before it is tokenized

② One Tokenizer: splits incoming text into a Token Stream containing zero or more Tokens

③ Zero or more TokenFilters: examine and optionally modify each Token in the Token Stream

*From Solr in Action, Chapter 6
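For example, a chain with an HTMLStripCharFilter, a StandardTokenizer, and LowerCase and Stop token filters would process the (hypothetical) input “<b>The Quick Brown Fox!</b>” roughly as follows:

CharFilter (HTMLStrip):   "<b>The Quick Brown Fox!</b>" → "The Quick Brown Fox!"
Tokenizer (Standard):     → [The] [Quick] [Brown] [Fox]
TokenFilter (LowerCase):  → [the] [quick] [brown] [fox]
TokenFilter (Stop):       → [quick] [brown] [fox]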

Page 28: Building Search & Recommendation Engines

Page 29: Building Search & Recommendation Engines

Page 30: Building Search & Recommendation Engines

Page 31: Building Search & Recommendation Engines

Multi-lingual Text Analysis

Analyzing text across multiple languages

Page 32: Building Search & Recommendation Engines

Example English Analysis Chains

<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="lang/en_protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="lang/en_synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.KStemFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

Page 33: Building Search & Recommendation Engines

Per-language Analysis Chains

*Some of the 32 different language configurations in Appendix B of Solr in Action

Page 34: Building Search & Recommendation Engines


Page 35: Building Search & Recommendation Engines

Which Stemmer do I choose?

*From Solr in Action, Chapter 14


Page 36: Building Search & Recommendation Engines

Common English Stemmers

Page 37: Building Search & Recommendation Engines

Page 38: Building Search & Recommendation Engines

Relevancy Ranking

Scoring the results, returning the best matches

Page 39: Building Search & Recommendation Engines

Classic Lucene/Solr Relevancy Algorithm:

*Source: Solr in Action, chapter 3

Score(q, d) = [ ∑ over t in q: tf(t in d) · idf(t)² · t.getBoost() · norm(t, d) ] · coord(q, d) · queryNorm(q)

Where: t = term; d = document; q = query; f = field

tf(t in d) = numTermOccurrencesInDocument^½
idf(t) = 1 + log( numDocs / (docFreq + 1) )
coord(q, d) = numTermsInDocumentFromQuery / numTermsInQuery
queryNorm(q) = 1 / sumOfSquaredWeights^½
sumOfSquaredWeights = q.getBoost()² · ∑ over t in q: ( idf(t) · t.getBoost() )²
norm(t, d) = d.getBoost() · lengthNorm(f) · f.getBoost()

Page 40: Building Search & Recommendation Engines


Page 41: Building Search & Recommendation Engines

• Term Frequency: “How well does a term describe a document?”
  – Measure: how often the term occurs in the document

• Inverse Document Frequency: “How important is this term overall?”
  – Measure: how rare the term is across all documents

TF * IDF

*Source: Solr in Action, chapter 3
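As a rough illustration, the core term weight of the classic formula can be sketched in Python (simplified: the boost, coord, and norm factors above are omitted):

import math

def classic_score(query_terms, doc_terms, doc_freqs, num_docs):
    score = 0.0
    for t in query_terms:
        tf = math.sqrt(doc_terms.count(t))                 # tf = occurrences^(1/2)
        idf = 1 + math.log(num_docs / (doc_freqs.get(t, 0) + 1))
        score += tf * idf * idf                            # note: idf is squared
    return score

# e.g. 1,000 docs total; "apache" appears in 100 of them, "solr" in 10
doc = "apache solr is a search engine built on apache lucene".split()
print(classic_score(["apache", "solr"], doc, {"apache": 100, "solr": 10}, 1000))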

Page 42: Building Search & Recommendation Engines

BM25 (Okapi “Best Match”, 25th iteration)

Score(q, d) = ∑ over t in q: idf(t) · ( tf(t in d) · (k + 1) ) / ( tf(t in d) + k · (1 – b + b · |d| / avgdl) )

Where: t = term; d = document; q = query; i = index

tf(t in d) = numTermOccurrencesInDocument
idf(t) = 1 + log( numDocs / (docFreq + 1) )
|d| = document length (number of terms in d)
avgdl = average document length across the index
k = free parameter, usually ~1.2 to 2.0; raises the term-frequency saturation point.
b = free parameter, usually ~0.75; increases the impact of document-length normalization.
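A minimal Python sketch of the formula above (illustrative, not Lucene's implementation; doc_freqs and avgdl are assumed to be precomputed):

import math

def bm25_score(query_terms, doc_terms, doc_freqs, num_docs, avgdl, k=1.2, b=0.75):
    score = 0.0
    dl = len(doc_terms)                       # |d|: this document's length
    for t in query_terms:
        tf = doc_terms.count(t)               # raw term frequency
        if tf == 0:
            continue
        idf = 1 + math.log(num_docs / (doc_freqs.get(t, 0) + 1))
        # tf saturates as it grows; longer-than-average docs are penalized via b
        score += idf * (tf * (k + 1)) / (tf + k * (1 - b + b * dl / avgdl))
    return score

Unlike raw tf·idf, the term-frequency component saturates: the 20th occurrence of a term adds far less to the score than the 2nd.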

Page 43: Building Search & Recommendation Engines

News search: popularity and freshness drive relevance.

Restaurant search: geographical proximity and price range are critical.

Ecommerce: likelihood of a purchase is key.

Movie search: more popular titles are generally more relevant.

Job search: category of job, salary range, and geographical proximity matter.

TF * IDF of keywords can’t hold its own against good domain-specific relevance factors!

That’s great, but what about domain-specific knowledge?

Page 44: Building Search & Recommendation Engines

*Example from chapter 16 of Solr in Action

Domain-specific relevancy calculation (News Website Example)

News website (each of the four function queries weighted 25%):

/select?
  fq=$myQuery&
  q=_query_:"{!func}scale(query($myQuery),0,100)"
    AND _query_:"{!func}div(100,map(geodist(),0,1,1))"
    AND _query_:"{!func}recip(rord(publicationDate),0,100,100)"
    AND _query_:"{!func}scale(popularity,0,100)"&
  myQuery="street festival"&
  sfield=location&
  pt=33.748,-84.391

Page 45: Building Search & Recommendation Engines

Fancy boosting functions (Restaurant Search Example)

Distance (50%) + keywords (30%) + category (20%)

q=_val_:"scale(mul(query($keywords),1),0,30)" AND
  _val_:"scale(sum($radiusInKm,mul(query($distance),-1)),0,50)" AND
  _val_:"scale(mul(query($category),1),0,20)"
&keywords=filet mignon
&radiusInKm=48.28
&distance=_val_:"geodist(latitudelongitude.latlon_is,33.77402,-84.29659)"
&category="fine dining"
&fq={!cache=false v=$keywords}

Page 46: Building Search & Recommendation Engines

This is powerful, but feels like a lot of work to get right…

Page 47: Building Search & Recommendation Engines

what is “reflected intelligence”?

Page 48: Building Search & Recommendation Engines

The Three C’s

Content: Keywords and other features in your documents
Collaboration: How others have chosen to interact with your system
Context: Available information about your users and their intent

Reflected Intelligence: “Leveraging previous data and interactions to improve how new data and interactions should be interpreted”

Page 49: Building Search & Recommendation Engines

● Recommendation algorithms

● Building user profiles from past searches, clicks, and other actions

● Identifying correlations between keywords/phrases

● Building out automatically-generated ontologies from content and queries

● Determining relevancy judgements (precision, recall, nDCG, etc.) from click logs

● Learning to Rank: using relevancy judgements and machine learning to train a relevance model

● Discovering misspellings, synonyms, acronyms, and related keywords

● Disambiguation of keyword phrases with multiple meanings

● Learning what’s important in your content

Examples of Reflected Intelligence

Page 50: Building Search & Recommendation Engines

John lives in Boston but wants to move to New York or possibly another big city. He is currently a sales manager but wants to move towards business development.

Irene is a bartender in Dublin and is only interested in jobs within 10 km of her location in the food service industry.

Irfan is a software engineer in Atlanta and is interested in software engineering jobs at a Big Data company. He is happy to move across the U.S. for the right job.

Jane is a nurse educator in Boston seeking between $40K and $60K.

*Example from chapter 16 of Solr in Action

Consider what you know about users

Page 51: Building Search & Recommendation Engines

http://localhost:8983/solr/jobs/select/?
  fl=jobtitle,city,state,salary&
  q=(
      jobtitle:"nurse educator"^25 OR jobtitle:(nurse educator)^10
    )
    AND (
      (city:"Boston" AND state:"MA")^15
      OR state:"MA"
    )
    AND _val_:"map(salary, 40000, 60000, 10, 0)"

*Example from chapter 16 of Solr in Action

Query for Jane

Jane is a nurse educator in Boston seeking between $40K and $60K.

Page 52: Building Search & Recommendation Engines

Search Results for Jane

{ ...
  "response":{"numFound":22,"start":0,"docs":[
    {"jobtitle":"Clinical Educator (New England/Boston)",
     "city":"Boston",
     "state":"MA",
     "salary":41503},
    {"jobtitle":"Nurse Educator",
     "city":"Braintree",
     "state":"MA",
     "salary":56183},
    {"jobtitle":"Nurse Educator",
     "city":"Brighton",
     "state":"MA",
     "salary":71359},
    …]}}

*Example documents available @ http://github.com/treygrainger/solr-in-action

Page 53: Building Search & Recommendation Engines

You just built a recommendation engine!

Page 54: Building Search & Recommendation Engines

Collaborative Filtering

What you SEND to Lucene/Solr:

Document   “Users who bought this product” field
doc1       user1, user4, user5
doc2       user2, user3
doc3       user4
doc4       user4, user5
doc5       user4, user1
…          …

How the content is INDEXED into Lucene/Solr (conceptually):

Term     Documents
user1    doc1, doc5
user2    doc2
user3    doc2
user4    doc1, doc3, doc4, doc5
user5    doc1, doc4
…        …

Page 55: Building Search & Recommendation Engines

Step 1: Find similar users who like the same documents

q=documentid:("doc1" OR "doc4")

Document   “Users who bought this product” field
doc1       user1, user4, user5
doc2       user2, user3
doc3       user4
doc4       user4, user5
doc5       user4, user1
…          …

Top-scoring results (most similar users):
1) user4 (2 shared likes)
2) user5 (2 shared likes)
3) user1 (1 shared like)

*Source: Solr in Action, chapter 16

Page 56: Building Search & Recommendation Engines

Step 2: Search for docs “liked” by those similar users

Most similar users:
1) user4 (2 shared likes)
2) user5 (2 shared likes)
3) user1 (1 shared like)

/solr/select/?q=userlikes:("user4"^2 OR "user5"^2 OR "user1"^1)

Term     Documents
user1    doc1, doc5
user2    doc2
user3    doc2
user4    doc1, doc3, doc4, doc5
user5    doc1, doc4
…        …

Top recommended documents:
1) doc1 (matches user4, user5, user1)
2) doc4 (matches user4, user5)
3) doc5 (matches user4, user1)
4) doc3 (matches user4)
// doc2 does not match

*Source: Solr in Action, chapter 16
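The same two-step logic, sketched as a toy in-memory Python version of the two index lookups above:

from collections import Counter

likes = {   # document -> users who "liked"/bought it (the indexed field above)
    "doc1": {"user1", "user4", "user5"},
    "doc2": {"user2", "user3"},
    "doc3": {"user4"},
    "doc4": {"user4", "user5"},
    "doc5": {"user4", "user1"},
}
target_docs = {"doc1", "doc4"}          # what the current user liked

# Step 1: rank users by how many likes they share with the current user
similar = Counter(u for d in target_docs for u in likes[d])

# Step 2: rank other documents by the similarity scores of the users who liked them
recs = Counter()
for d, users in likes.items():
    if d not in target_docs:            # skip docs the user already has
        for u in users:
            recs[d] += similar[u]

print(recs.most_common())               # [('doc5', 3), ('doc3', 2), ('doc2', 0)]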

Page 57: Building Search & Recommendation Engines

Using matrix factorization is typically more efficient (Ships with Fusion 3.1):

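For context, a minimal sketch of the matrix-factorization idea using alternating least squares in Python/numpy (a toy illustration, not Fusion's implementation; the ratings matrix is hypothetical):

import numpy as np

R = np.array([[5, 3, 0, 1],     # toy user x item ratings; 0 = unobserved
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 0, 5, 4]], dtype=float)

n_users, n_items, k, lam = R.shape[0], R.shape[1], 2, 0.1
U = np.random.rand(n_users, k)  # latent user factors
V = np.random.rand(n_items, k)  # latent item factors

for _ in range(20):             # alternate closed-form least-squares solves
    for u in range(n_users):
        obs = R[u] > 0
        U[u] = np.linalg.solve(V[obs].T @ V[obs] + lam * np.eye(k),
                               V[obs].T @ R[u, obs])
    for i in range(n_items):
        obs = R[:, i] > 0
        V[i] = np.linalg.solve(U[obs].T @ U[obs] + lam * np.eye(k),
                               U[obs].T @ R[obs, i])

print(np.round(U @ V.T, 1))     # predicted scores; recommend top unobserved items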

Page 58: Building Search & Recommendation Engines

Feedback Loops

User searches → user sees results → user takes an action → users’ actions inform system improvements → (and the cycle repeats)

Page 59: Building Search & Recommendation Engines

Demo:

Signals & Recommendations

Page 60: Building Search & Recommendation Engines
Page 61: Building Search & Recommendation Engines
Page 62: Building Search & Recommendation Engines

• 200%+ increase in click-through rates

• 91% lower TCO

• 50,000 fewer support tickets

• Increased customer satisfaction

Page 63: Building Search & Recommendation Engines

Learning to Rank

Page 64: Building Search & Recommendation Engines

Learning to Rank (LTR)

● Applies machine learning techniques to discover the combination of features that provides the best ranking.

● Requires a labeled set of documents with relevancy scores for a given set of queries (see the nDCG sketch below).

● Features used for ranking are usually more computationally expensive than the ones used for matching.

● Typically re-ranks a subset of the matched documents (e.g. the top 1,000).
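Since LTR training and evaluation both revolve around relevancy judgements, here is a quick illustrative Python sketch of how a metric like nDCG scores one ranked result list:

import math

def dcg(judgements):
    # judgements: graded relevance labels, in the order the engine ranked them
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(judgements))

def ndcg(judgements):
    ideal = dcg(sorted(judgements, reverse=True))   # best possible ordering
    return dcg(judgements) / ideal if ideal else 0.0

print(ndcg([3, 2, 3, 0, 1, 2]))   # ~0.96: close to the ideal ordering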

Page 65: Building Search & Recommendation Engines


Page 66: Building Search & Recommendation Engines

Common LTR Algorithms

• RankNet* (neural networks, boosted trees)

• LambdaMART* (regression trees)

• SVM Rank** (SVM classifier)

* http://research.microsoft.com/pubs/132652/MSR-TR-2010-82.pdf
** http://research.microsoft.com/en-us/people/hangli/cao-et-al-sigir2006.pdf

Page 67: Building Search & Recommendation Engines

LambdaMART Example

Source: T. Grainger, K. AlJadda. "Reflected Intelligence: Evolving self-learning data systems". Georgia Tech, 2016.

Page 68: Building Search & Recommendation Engines

Demo: Learning to Rank

Page 69: Building Search & Recommendation Engines

#1: Pull, Build, Start Solr
git clone https://github.com/apache/lucene-solr.git && cd lucene-solr/solr
ant server
bin/solr -e techproducts -Dsolr.ltr.enabled=true

#2: Run Searches
http://localhost:8983/solr/techproducts/browse?q=ipod

#3: Supply User Relevancy Judgements
cd contrib/ltr/example/
nano user_queries.txt

#4: Install Training Library
curl -L https://github.com/cjlin1/liblinear/archive/v210.zip > liblinear-2.1.tar.gz
tar -xf liblinear-2.1.tar.gz && mv liblinear-210 liblinear
cd liblinear && make && cd ../

#5: Train and Upload Model
./train_and_upload_demo_model.py -c config.json

#6: Re-run Searches using Machine-learned Ranking Model
http://localhost:8983/solr/techproducts/browse?q=ipod
  &rq={!ltr model=exampleModel reRankDocs=25 efi.user_query=$q}

Page 70: Building Search & Recommendation Engines

# Run Searches
http://localhost:8983/solr/techproducts/select?q=ipod

Page 71: Building Search & Recommendation Engines

# Supply User Relevancy Judgements
nano contrib/ltr/example/user_queries.txt
# Format: query | doc id | relevancy judgement | source

# Train and Upload Model
./train_and_upload_demo_model.py -c config.json

Page 72: Building Search & Recommendation Engines

# Re-run Searches using Machine-learned Ranking Model
http://localhost:8984/solr/techproducts/browse?q=ipod
  &rq={!ltr model=exampleModel reRankDocs=100 efi.user_query=$q}

Page 73: Building Search & Recommendation Engines

The Relevancy Spectrum (diagram): Traditional Keyword Search, Recommendations, Domain-aware Matching, Semantic Search, Personalized Search, Augmented Search, User Intent.

Page 74: Building Search & Recommendation Engines

Streaming Expressions & Graph Traversals

Page 75: Building Search & Recommendation Engines

• Perform relational operations on streams (see the example below)

• Stream sources: search, jdbc, facets, features, gatherNodes, shortestPath, train, model, random, stats, topic

• Stream decorators: classify, commit, complement, daemon, executor, fetch, having, leftOuterJoin, hashJoin, innerJoin, intersect, merge, null, outerHashJoin, parallel, priority, reduce, rollup, scoreNodes, select, sort, top, unique, update

Streaming Expressions
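For instance, a hypothetical innerJoin over two collections (the collection and field names here are illustrative, following the pattern documented for Solr's streaming API):

innerJoin(
  search(people, q="*:*", fl="personId,name", sort="personId asc", qt="/export"),
  search(pets, q="type:cat", fl="ownerId,petName", sort="ownerId asc", qt="/export"),
  on="personId=ownerId"
)

Both underlying streams are sorted on the join key, so the join can be performed by streaming over the two result sets in parallel rather than materializing them in memory.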

Page 76: Building Search & Recommendation Engines

Streaming Expressions - Examples: shortest-path graph traversal, parallel batch processing, training a logistic regression model, distributed joins, rapid export of all search results, pulling results from an external database, and classifying search results.

Sources: https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions
         http://joelsolr.blogspot.com/2016/10/solr-63-batch-jobs-parallel-etl-and.html

Page 77: Building Search & Recommendation Engines

Graph Use Cases

• Anomaly detection / fraud detection

• Recommenders

• Social network analysis

• Graph search

• Access control

• Relationship discovery / scoring

Examples:

o Find all draft blog posts about “Parallel SQL” written by a developer

o Find all tweets mentioning “Solr” by me or people I follow

o Find 3-star hotels in NYC my friends stayed in last year

Page 78: Building Search & Recommendation Engines

Solr Graph Timeline

• Some data is much more naturally represented as a graph structure

• Solr 6.0: introduced the Graph Query Parser

• Solr 6.1: introduced graph Streaming Expressions

• Solr 6.6: current version

• TBD: Semantic Knowledge Graph (patch available)

Page 79: Building Search & Recommendation Engines

Graph Query Parser

• Query-time, cycle-aware graph traversal that can rank documents based on relationships

• Provides controls for depth, filtering of results, and inclusion of root and/or leaf nodes

• Limitations: single node/shard only

Examples:

• http://localhost:8983/solr/graph/query?fl=id,score&
    q={!graph from=in_edge to=out_edge}id:A

• http://localhost:8983/solr/my_graph/query?fl=id&
    q={!graph from=in_edge to=out_edge traversalFilter='foo:[* TO 15]'}id:A

• http://localhost:8983/solr/my_graph/query?fl=id&
    q={!graph from=in_edge to=out_edge maxDepth=1}foo:[* TO 10]

Page 80: Building Search & Recommendation Engines

Graph Streaming Expressions

• Part of Solr’s broader Streaming Expressions capability

• Implements a powerful, breadth-first traversal

• Works across shards AND collections

• Supports aggregations

• Cycle aware

curl -X POST -H "Content-Type: application/x-www-form-urlencoded" \
  -d 'expr=…' "http://localhost:18984/solr/movielens/stream"

Page 81: Building Search & Recommendation Engines

All movies that user 389 watched:

expr=gatherNodes(movielens,
       walk="389->user_id_i",
       gather="movie_id_i")

Page 82: Building Search & Recommendation Engines

All movies that viewers of a specific movie watched (movie 161: “The Air Up There”):

expr=gatherNodes(movielens,
       gatherNodes(movielens,
         walk="161->movie_id_i",
         gather="user_id_i"),
       walk="node->user_id_i",
       gather="movie_id_i",
       trackTraversal="true")

Page 83: Building Search & Recommendation Engines

Collaborative Filtering

expr=top(n="5", sort="count(*) desc",
      gatherNodes(movielens,
        top(n="30", sort="count(*) desc",
          gatherNodes(movielens,
            search(movielens, q="user_id_i:305", fl="movie_id_i",
                   sort="movie_id_i asc", qt="/export"),
            walk="movie_id_i->movie_id_i",
            gather="user_id_i",
            maxDocFreq="10000",
            count(*))),
        walk="node->user_id_i",
        gather="movie_id_i",
        count(*)))

Page 84: Building Search & Recommendation Engines

Comparing Graph Choices

Solr:
• Best use case: QParser for predefined relationships as filters; Streaming Expressions for fast, query-based, distributed graph ops
• Common graph algorithms (e.g. Pregel, traversal): Partial
• Scaling: QParser on co-located shards only; Streaming Expressions: yes
• Commercial license required: No
• Visualizations: GraphML support (e.g. Gephi)

Elastic Graph:
• Best use case: limited to sequential, term-relatedness exploration only
• Common graph algorithms: No
• Scaling: Yes
• Commercial license required: Yes
• Visualizations: Kibana

Neo4j:
• Best use case: graph ops and querying that fit on a single node
• Common graph algorithms: Yes
• Scaling: Master/Replica
• Commercial license required: GPLv3
• Visualizations: Neo4j browser

Spark GraphX:
• Best use case: large-scale, iterative graph ops
• Common graph algorithms: Yes
• Scaling: Yes
• Commercial license required: No
• Visualizations: 3rd party

Page 85: Building Search & Recommendation Engines

Data-driven App Sophistication

Basic Keyword Search (inverted index, tf-idf, bm25, query formulation, etc.)
→ Taxonomies / Entity Extraction (entity recognition, ontologies, synonyms, etc.)
→ Query Intent (query classification, semantic query parsing, concept expansion, rules, clustering, classification)
→ Relevancy Tuning (signals, A/B testing / genetic algorithms, Learning to Rank, neural networks)
→ Self-learning

Page 86: Building Search & Recommendation Engines

Additional References:

Page 87: Building Search & Recommendation Engines

Additional References:

Page 88: Building Search & Recommendation Engines

Contact Info

Trey Grainger
[email protected]
@treygrainger

http://solrinaction.com
Meetup discount (39% off): 39grainger

Other presentations: http://www.treygrainger.com

Page 89: Building Search & Recommendation Engines


Audience Questions

#1: How can you figure out the meaning or intent of keywords, particularly when there are multiple ways to represent them or multiple meanings?

Page 90: Building Search & Recommendation Engines

How do we handle phrases with ambiguous meanings?

Example    Related Keywords (representing multiple meanings)
driver     truck driver, linux, windows, courier, embedded, cdl, delivery
architect  autocad drafter, designer, enterprise architect, java architect, architectural designer, data architect, oracle, java, architectural drafter, autocad, drafter, cad, engineer
…          …

Source: M. Korayem, C. Ortiz, K. AlJadda, T. Grainger. "Query Sense Disambiguation Leveraging Large Scale User Behavioral Data". IEEE Big Data 2015.

Page 91: Building Search & Recommendation Engines

A few methodologies:

1) Query Log Mining
2) Semantic Knowledge Graph

Page 92: Building Search & Recommendation Engines

Query Log Mining: Discovering ambiguous phrases

1) Classify users who ran each search in the search logs (i.e. by the job title classifications of the jobs to which they applied).

2) Create a probabilistic graphical model of those classifications mapped to each keyword phrase.

3) Segment the "search term => related search terms" list by classification, to return a separate related-terms list per classification.

Source: M. Korayem, C. Ortiz, K. AlJadda, T. Grainger. "Query Sense Disambiguation Leveraging Large Scale User Behavioral Data". IEEE Big Data 2015.

Page 93: Building Search & Recommendation Engines

Semantic Knowledge Graph: Discovering ambiguous phrases

1) Exact same concept, but use a document classification field (i.e. category) as the first level of your graph, and the related terms as the second level to which you traverse.

2) Has the benefit that you don’t need query logs to mine. But the result will be representative of your data, as opposed to your users’ intent, so the quality depends on how clean and representative your documents are.

Page 94: Building Search & Recommendation Engines

Disambiguated meanings (represented as term vectors)

Example    Related Keywords (Disambiguated Meanings)
architect  1: enterprise architect, java architect, data architect, oracle, java, .net
           2: architectural designer, architectural drafter, autocad, autocad drafter, designer, drafter, cad, engineer
driver     1: linux, windows, embedded
           2: truck driver, cdl driver, delivery driver, class b driver, cdl, courier
designer   1: design, print, animation, artist, illustrator, creative, graphic artist, graphic, photoshop, video
           2: graphic, web designer, design, web design, graphic design, graphic designer
           3: design, drafter, cad designer, draftsman, autocad, mechanical designer, proe, structural designer, revit
…          …

Source: M. Korayem, C. Ortiz, K. AlJadda, T. Grainger. "Query Sense Disambiguation Leveraging Large Scale User Behavioral Data". IEEE Big Data 2015.

Page 95: Building Search & Recommendation Engines

Using the disambiguated meanings

In a situation where a user searches for an ambiguous phrase, what information can we use to pick the correct underlying meaning?

1. Any pre-existing knowledge about the user:
   • User is a software engineer
   • User has previously run searches for “c++” and “linux”

2. Context within the query:
   • User searched for windows AND driver vs. courier OR driver

3. If all else fails (and there is no context), use the most commonly occurring meaning:

   driver  1: linux, windows, embedded
           2: truck driver, cdl driver, delivery driver, class b driver, cdl, courier

Source: M. Korayem, C. Ortiz, K. AlJadda, T. Grainger. "Query Sense Disambiguation Leveraging Large Scale User Behavioral Data". IEEE Big Data 2015.
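A toy Python sketch of that selection logic, choosing the sense whose related-term list best overlaps the available context (the helper function and weighting are illustrative; term lists are from the table above):

def pick_sense(context_terms, senses):
    # senses: related-term lists, ordered by how common each meaning is
    def overlap(sense):
        return len(set(sense) & set(context_terms))
    best = max(senses, key=overlap)
    return best if overlap(best) > 0 else senses[0]  # no context: most common

driver_senses = [
    ["linux", "windows", "embedded"],
    ["truck driver", "cdl driver", "delivery driver", "class b driver", "cdl", "courier"],
]
print(pick_sense(["windows"], driver_senses))   # software-driver sense
print(pick_sense(["courier"], driver_senses))   # truck-driver sense
print(pick_sense([], driver_senses))            # falls back to the most common sense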

Page 96: Building Search & Recommendation Engines


Audience Questions

#2: Can you tell me more about the semantic knowledge graph?

See: http://www.treygrainger.com/posts/presentations/the-semantic-knowledge-graph/