Searching for The Matrix in haystack (with Elasticsearch)

1. Searching for The Matrix in haystack (with Elasticsearch). Synopsi.TV case study. Tom Sirn, @junckritter. Pyvo/Rubyslava, November 2012

2. The Environment: Recommendation service for movies and TV shows. People mark titles they watched (check-in) and rate them; get recommendations; make Watch Later or other-purpose lists; search (to check in, add to a list, share, etc.)

3. The Problem: Input box for search on top of the web page. Many movies and TV shows in the database; a lot of them have similar titles or use similar words; some are more likely to be searched for. Little input information: 3 or 4 letters. Autocomplete, not only exact match.

4. The Red Pill

5. The Blue Pill

6. The Tool: Elasticsearch. Designed for searching in documents; based on Lucene, the de facto standard; young yet feature-rich; quick development (despite 1 core developer); a company was recently founded around it, with $10M funding in the A round.

7. The (Wannabe) Solution: Differentiate titles by whether they have cover, plot, cast, directors; by year; by popularity (whatever it means). Prefer the ones with more data and the more popular ones.

8. The Text, First Attempt: Text Query (now Match Query) of phrase_prefix type: all words in the input, with matching of prefixes (m, ma, mat, ...), in the same order of words; operator and; not_analyzed name field (not broken down into words).

9. The Text, First Attempt: slop parameter allows a changed order of words and skipped words: "matrix revolutions" also matches "revolutions matrix" and "matrix first revolutions".
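A minimal sketch of what the request body for this first attempt might look like, written as a plain Python dict. The helper name phrase_prefix_query and the default field name "name" are assumptions for illustration; the "type", "slop" and "operator" keys follow the Match Query syntax of the Elasticsearch releases current at the time of the talk, and newer releases spell this query match_phrase_prefix instead.

```python
import json


def phrase_prefix_query(user_input, slop=0, field="name"):
    """Build a search body for a phrase_prefix Match Query (hypothetical helper)."""
    return {
        "query": {
            "match": {
                field: {
                    "query": user_input,
                    "type": "phrase_prefix",  # prefix matching, as on slide 8
                    "operator": "and",        # setting named on slide 8
                    "slop": slop,             # tolerated reordering / skipped words
                }
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(phrase_prefix_query("matrix rev", slop=2), indent=2))
```

With slop=0 the words have to appear in the typed order; raising slop is what lets "revolutions matrix" find "matrix revolutions".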

10. The Sorting, First Attempt: Default scoring considers only the occurrence of the text in documents. We also want other properties of the document to count. Custom Score Query: define a script for scoring. script: _score * doc["rating"].value
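As a rough illustration, the Custom Score Query wraps the text query and adds the script. The sketch below only builds the request body; the helper name custom_score_query and the wrapped match query are assumptions. The custom_score query type belongs to the Elasticsearch versions of that era and was later replaced by function_score.

```python
import json


def custom_score_query(text_query, script):
    """Wrap a text query in a custom_score query (hypothetical helper)."""
    return {
        "query": {
            "custom_score": {
                "query": text_query,  # the original text match
                "script": script,     # re-scores every matching document
            }
        }
    }


if __name__ == "__main__":
    body = custom_score_query(
        {"match": {"name": "matrix"}},
        "_score * doc['rating'].value",  # script from the slide
    )
    print(json.dumps(body, indent=2))
```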

11. The Rating: Allows preferring more popular titles. External: top lists, links, etc. Internal: usage data from the system. Problem for newly added titles: lack of data of both types.

12. The Tuning of Rating: Get rid of external data. Only score the completeness of each document; release year. script: 3 * log(_score) + 1 * log(doc["year"].date.year - 1880) + 0.75 * log(doc["watched_count"].value + 1)
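To get a feeling for how the three terms of this script trade off, the formula can be replayed outside Elasticsearch. The sketch below assumes that log is the natural logarithm; the function name tuned_score and the sample numbers are made up for illustration.

```python
import math


def tuned_score(text_score, year, watched_count):
    """Replay the scoring script from slide 12 in plain Python.

    Assumes natural logarithms; text_score must be > 0 and year > 1880.
    """
    return (
        3 * math.log(text_score)
        + 1 * math.log(year - 1880)
        + 0.75 * math.log(watched_count + 1)
    )


if __name__ == "__main__":
    # Same text score, different release year and usage (made-up values).
    print(tuned_score(2.0, 1999, 5000))
    print(tuned_score(2.0, 1950, 10))
```

The logarithms dampen each input, so a very popular title gains an advantage without completely drowning out the text match.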

13. The Tuning of Query: Name field analyzed, edgeNGram filter

    index:
      analysis:
        filter:
          my_ngram:
            type: edgeNGram
            min_gram: 1
            max_gram: 11
            side: front
        analyzer:
          my_analyzer:
            type: custom
            tokenizer: standard
            filter: [lowercase, asciifolding, my_ngram]
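What this analyzer roughly produces at index time can be approximated in a few lines of Python. This is only a sketch: the standard tokenizer is approximated with a simple word regex, asciifolding with Unicode decomposition, and the function names are made up.

```python
import re
import unicodedata


def edge_ngrams(token, min_gram=1, max_gram=11):
    """Front edge n-grams of a token: 'mat' -> ['m', 'ma', 'mat']."""
    longest = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, longest + 1)]


def analyze(text):
    """Approximate my_analyzer: tokenize, lowercase, fold to ASCII, edge n-grams."""
    lowered = text.lower()
    # Rough stand-in for asciifolding: drop combining accents after decomposition.
    folded = unicodedata.normalize("NFKD", lowered).encode("ascii", "ignore").decode("ascii")
    tokens = re.findall(r"\w+", folded)  # rough stand-in for the standard tokenizer
    return [gram for token in tokens for gram in edge_ngrams(token)]


if __name__ == "__main__":
    print(analyze("The Matrix"))
```

Because every prefix of every word is indexed as its own term, typing "mat" can match "The Matrix" without a prefix query at search time.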

14. The AKAs: "Also known as": names of a title in different countries. A lot of additional data, sometimes only noise; the original is still the most important.

15. The AKAs: Array of AKAs: problems with scoring of short names. Nested AKA documents: the query does not return which nested document matched. AKA document as a child of the title: AKAs have their own information (original, country, slug). Top Children Query: which AKA matched; another query with Ids Filter: get titles.
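The second step of this two-query flow could be sketched as follows: collect the title IDs of the AKAs that matched and fetch those titles with an Ids Filter. The shape of the matched AKA records (a "title_id" key) and the helper name are assumptions; the slides do not show how the parent IDs were actually read from the first response.

```python
def titles_query_for_akas(matched_akas):
    """Build the follow-up search body that fetches titles by ID (hypothetical helper).

    matched_akas: iterable of dicts with an assumed 'title_id' key.
    """
    title_ids = []
    for aka in matched_akas:
        title_id = aka["title_id"]
        if title_id not in title_ids:  # keep first-seen order, drop duplicates
            title_ids.append(title_id)
    return {
        "query": {
            "filtered": {
                "query": {"match_all": {}},
                "filter": {"ids": {"values": title_ids}},
            }
        }
    }


if __name__ == "__main__":
    akas = [
        {"title_id": "42", "country": "US"},
        {"title_id": "42", "country": "SK"},
        {"title_id": "7", "country": "FR"},
    ]
    print(titles_query_for_akas(akas))
```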

16. The Sorting, Second Attempt: Custom Filters Score Query: apply a set of filters; each filter boosts documents which pass its condition. boost parameter of a filter: differentiates the importance of that filter. score_mode: sum or product of boost values.
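A sketch of how such a query body might be assembled, together with a tiny replay of how boosts combine. The helper names, field names and boost values are invented for illustration; in the custom_filters_score query of that era the sum and product modes are spelled "total" and "multiply".

```python
def boosted_query(text_query, boosted_filters, score_mode="total"):
    """Build a custom_filters_score body from (filter, boost) pairs (hypothetical helper)."""
    return {
        "query": {
            "custom_filters_score": {
                "query": text_query,
                "filters": [{"filter": f, "boost": b} for f, b in boosted_filters],
                "score_mode": score_mode,
            }
        }
    }


def combine_boosts(boosts, score_mode="total"):
    """Replay how the boosts of all passing filters are combined."""
    if score_mode == "total":
        return sum(boosts)
    if score_mode == "multiply":
        result = 1.0
        for boost in boosts:
            result *= boost
        return result
    raise ValueError("only 'total' and 'multiply' are sketched here")


if __name__ == "__main__":
    filters = [
        ({"term": {"aka_original": True}}, 2.0),  # assumed field name
        ({"term": {"has_cover": True}}, 1.5),     # assumed field name
    ]
    print(boosted_query({"match": {"name": "matrix"}}, filters))
    print(combine_boosts([2.0, 1.5], "total"), combine_boosts([2.0, 1.5], "multiply"))
```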

17. The Sorting, Used Score Filters: Release date (in case of a TV show, the last episode) in the last 6 months; release date in the next 3 months; original AKA; all important categories filled; not Short genre; not TV movie.

18. The Sorting, Short Input: Special case: 1-3 letters. Very rare to be an exact match; should work after typing the first letter. Only titles from this year; with 3 letters, also titles from the near future and the previous year.
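The short-input rule can be expressed as a small function that picks a release-year window from the input length. This is a sketch under assumptions: "near future" is taken to mean the next calendar year, and the function name is made up.

```python
def year_window(user_input, current_year):
    """Return (first_year, last_year) for short inputs, or None for longer ones.

    Assumption: 'near future' is modeled as the next calendar year.
    """
    length = len(user_input.strip())
    if length == 0 or length > 3:
        return None  # not the short-input special case
    if length == 3:
        return (current_year - 1, current_year + 1)
    return (current_year, current_year)  # 1 or 2 letters: this year only


if __name__ == "__main__":
    for text in ("m", "ma", "mat", "matrix"):
        print(repr(text), year_window(text, 2012))
```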

19. The Year in Input: "Matrix 1999"; "Matrix Reloaded (2003)"; "Matrix 2000-": released up to 2000; "Matrix 2000+": released since 2000.
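One possible way to recognize these input forms is a regular expression that splits a trailing year from the title text. The sketch below is not taken from the talk; the function name, the accepted year range (1800-2099) and the return shape are assumptions.

```python
import re

# Trailing year, optionally in parentheses, optionally followed by + or -.
YEAR_PATTERN = re.compile(r"\(?\b((?:18|19|20)\d{2})\)?([+-])?\s*$")


def split_year(user_input):
    """Return (title_text, year_from, year_to); None means an open bound."""
    text = user_input.strip()
    match = YEAR_PATTERN.search(text)
    if not match:
        return text, None, None
    year = int(match.group(1))
    modifier = match.group(2)
    title = text[: match.start()].strip()
    if modifier == "-":
        return title, None, year  # released up to that year
    if modifier == "+":
        return title, year, None  # released since that year
    return title, year, year      # exact year


if __name__ == "__main__":
    for text in ("Matrix 1999", "Matrix Reloaded (2003)", "Matrix 2000-", "Matrix 2000+"):
        print(repr(text), split_year(text))
```

Titles that consist only of a number that looks like a year would need extra handling, since this sketch treats the number as a year and leaves the title text empty.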

20. One More Thing, Advanced Search: Titles also have data about their usage. Watched by Friends filter: shows titles with IDs of your friends in the proper field (TermsFilter([IDS])). Not Watched filter: shows titles in which your ID is absent (NotFilter(TermFilter(ID))). Combination: titles to watch to catch up with friends.
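The combination of the two filters can be sketched as a single bool filter: the terms clause plays the role of TermsFilter and the must_not clause the role of NotFilter(TermFilter). The field name "watched_by" and the helper name are assumptions, since the slides do not name the field.

```python
def catch_up_filter(friend_ids, my_id, field="watched_by"):
    """Titles watched by at least one friend but not by me (hypothetical helper)."""
    return {
        "bool": {
            # Watched by Friends: any friend's ID appears in the field.
            "must": [{"terms": {field: list(friend_ids)}}],
            # Not Watched: my own ID is absent from the field.
            "must_not": [{"term": {field: my_id}}],
        }
    }


if __name__ == "__main__":
    print(catch_up_filter([11, 12, 13], my_id=1))
```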

21. The End. Thanks. Tom Sirn, @junckritter
