Apache Solr 4 Part 1 - Introduction, Features, Recency Ranking and Popularity Ranking

Post on 18-Dec-2014

1.309 views 1 download

description

Apache Solr 4 Part 1 - Why use Solr ? How to build Recency Ranking and Popularity Ranking ?

Transcript of Apache Solr 4 Part 1 - Introduction, Features, Recency Ranking and Popularity Ranking

Apache Solr!

Ramzi Alqrainy!

Search Guy!

Part 1!

What !is Apache Solr ?!

Apache Solr!!

is!

“ a standalone full-text search server with Apache Lucene at the backend. “!

!

!

Cont.!

Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. !

!

In brief Apache Solr exposes Lucene's JAVA API as REST like API's which can be called over HTTP from any programming language/platform.!

Why!use Apache Solr ?!

Features!

l  Full Text Search!l  Faceted navigation!l  More items like this(Recommendation)/

Related searches !l  Spell Suggest/Auto-Complete!l  Custom document ranking/ordering!l  Snippet generation/highlighting!And a lot More....!

Why Solr ?!

Also, Solr is only provides :!

1. Result Grouping / Field Collapsing!

2. Query Elevation!

3. Pivot Facet!

4. Pluggable Search/update Workflow!

5. Hash-Based Duplication!

Field Collapsing  

“ Collapses a group of results with the same field value down to a single (or fixed number) of entries.”!

For example, most search engines such as Google collapse on site so only one or two entries are shown, along with a link to click to see more results from that site. Field collapsing can also be used to suppress duplicate documents.!

Result Grouping  

“ groups documents with a common field value into groups, returning the top documents per group, and the top groups based on what documents are in the groups”!

One example is a search at Best Buy for a common term such as DVD, that shows the top 3 results for each category ("TVs & Video","Movies","Computers", etc)!

Query Elevation  

enables you to configure the top results for a given query regardless of the normal lucene scoring. This is sometimes called "sponsored search", "editorial boosting" or "best bets".!

Pivot Facet  

You can think of it as "Decision Tree Faceting" which tells you in advance what the "next" set of facet results would be for a field if you apply a constraint from the current facet results!

Pluggable Search/update Workflow  

You can modify the workflow of existing API endpoints / document instert or updates!

Hash-Based Duplication  

Determining the uniqueness of a document not based on ad ID-Field, but the hash signature of a field.!

!

Useful for web pages for example, where the URL may be different but the content the same.!

Boost documents by age!

•  Just do a descending sort by age = done?!

•  B o o s t m o r e r e c e n t d o c u m e n t s a n d p e n a l i z e o l d e r documents just for being old!

•  U s e f u l f o r n e w s , business docs, and local search !

Solr: Indexing!In schema.xml: <fieldType name="tdate" class="solr.TrieDateField" omitNorms="true" precisionStep="6" positionIncrementGap="0"/> <field name="pubdate" type="tdate" indexed="true" stored="true" required="true" />

Date published = DateUtils.round(item.getPublishedOnDate(),Calendar.HOUR);

FunctionQuery Basics!•  FunctionQuery: Computes a value for each

document!– Ranking!– Sorting!

constant literal fieldvalue ord rord sum sub product

pow abs log sqrt map scale query linear

recip max min ms sqedist - Squared Euclidean Dist hsin, ghhsin - Haversine Formula geohash - Convert to geohash strdist

Solr: Query Time Boost!•  Use the recip function with the ms function:!q={!boost b=$recency v=$qq}& recency=recip(ms(NOW/HOUR,pubdate),3.16e-11,0.08,0.05)& qq=wine

•  Use edismax vs. dismax if possible:!

q=wine& boost=recip(ms(NOW/HOUR,pubdate),3.16e-11,0.08,0.05)

•  Recip is a highly tunable function!–  recip(x,m,a,b) implementing a / (m*x + b) –  m = 3.16E-11 a= 0.08 b=0.05 x = Document Age

17

Tune Solr recip function!

18

Tips and Tricks!•  Boost should be a multiplier on the relevancy score !

•  {!boost b=} syntax confuses the spell checker so you need to use spellcheck.q to be explicit!q={!boost b=$recency v=$qq}&spellcheck.q=wine

•  Bottom out the old age penalty using min:!–  min(recip(…), 0.20)

•  Not a one-size fits all solution – academic research focused on when to apply it !

19

•  Score based on number of unique views!

•  Not known at indexing time!

•  View count should be broken into time slots!

20

Boost by Popularity!

Popularity Illustrated!

21

Solr: ExternalFileField!In schema.xml: <fieldType name="externalPopularityScore" keyField="id" defVal="1" stored="false" indexed="false" class=”solr.ExternalFileField" valType="pfloat"/> <field name="popularity" type="externalPopularityScore" />

22

Popularity Boost: Nuts & Bolts!

23

Logs  Solr  Server  

User activity logged

View  Coun1ng  Job  

solr-home/data/ external_popularity

a=1.114 b=1.05 c=1.111 …

commit

Popularity Tips & Tricks •  For big, high traffic sites, use log analysis!

– Perfect problem for MapReduce!– Take a look at Hive for analyzing large volumes

of log data!

•  Minimum popularity score is 1 (not zero) … up to 2 or more!–  1 + (0.4*recent + 0.3*lastWeek + 0.2*lastMonth

…)!

•  Watch out for spell checker “buildOnCommit”!

24

Filtering By User Preferences •  Easy approach is to build basic preference

fields in to the index:!– Content types of interest – content_type!– High-level categories of interest - category!– Source of interest – source!

!•  We had too many categories and sources that

a user could enable / disable to use basic filtering!– Custom SearchComponent with a connection to a

JDBC DataSource!

25

Preferences Component!

•  Connects to a database!•  Caches DocIdSet in a Solr FastLRUCache!•  Cached values marked as dirty using a

simple timestamp passed in the request!!Declared in solrconfig.xml:! <searchComponent ! class=“demo.solr.PreferencesComponent" ! name=”pref">! <str name="jdbcJndi">jdbc/solr</str> ! </searchComponent>! 26

YOU HAVE QUESTIONS …. WE HAVE ANSWERS!

ramzi.alqrainy@gmail.com!

References!

•  h5p://wiki.apache.org/solr/  •  h5p://www.lucidworks.com/  •  Apache  Solr  4  Cookbook