Solr for Data Science

Page 1: Solr for Data Science

Solr for Data Science: Scalable search and analytics in one

Grant Ingersoll, CTO: @gsingers

Page 2: Solr for Data Science
Page 3: Solr for Data Science

http://github.com/lucidworks/solr-for-datascience

Page 4: Solr for Data Science

Solr in a nutshell

8M+ total downloads

Solr is both established & growing

250,000+ monthly downloads

Largest community of developers.

2500+ open Solr jobs.

Solr is the most widely used search solution on the planet.

Lucidworks: Unmatched Solr expertise.

1/3 of the active committers

70% of the open source code is committed

Lucene/Solr Revolution: world’s largest open source user conference dedicated to Lucene/Solr.

Solr has tens of thousands of applications in production.

You use Solr every day.

Page 5: Solr for Data Science

Solr’s Key Features

• Full text search (Info Retr.)

• Facets/Guided Nav galore!

• Lots of data types

• Spelling, auto-complete, highlighting

• Cursors

• More Like This

• De-duplication

• Apache Lucene

• Grouping and Joins

• Stats, expressions, transformations and more

• Lang. Detection

• Extensible

• Massive Scale/Fault tolerance
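As an illustration of the full-text search and faceting features listed above, here is a minimal sketch that builds a Solr select URL with facet counts. The host, the collection name `products`, and the field names `category` and `brand` are assumptions made for the example; `q`, `rows`, `wt`, `facet` and `facet.field` are standard Solr request parameters.

```python
from urllib.parse import urlencode, parse_qs

# Hypothetical Solr location and collection name; adjust for your install.
SOLR_SELECT = "http://localhost:8983/solr/products/select"


def facet_query_url(text, facet_fields, rows=10):
    """Build a /select URL that runs a full-text query and asks for facet counts."""
    params = [
        ("q", text),        # full-text query
        ("rows", rows),     # number of documents to return
        ("wt", "json"),     # response format
        ("facet", "true"),  # turn faceting on
    ]
    # facet.field may be repeated, once per field to count.
    params += [("facet.field", field) for field in facet_fields]
    return SOLR_SELECT + "?" + urlencode(params)


# "category" and "brand" are made-up field names for this sketch.
url = facet_query_url("ipod", ["category", "brand"])
print(url)
```

Fetching that URL from a running Solr instance should return the matching documents plus a count per facet value, which is usually enough for a first look at how a data set is distributed.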

Page 6: Solr for Data Science
Page 7: Solr for Data Science

It is increasingly important to know what is important!

Corollary: The faster you know what is important, the better

Page 8: Solr for Data Science

Data Exploration

Page 9: Solr for Data Science

SiLK

• Solr - Logstash - Kibana

• http://lucidworks.com/product/integrations/silk/

• Open source at:

  • https://github.com/LucidWorks/banana

  • https://github.com/LucidWorks/solrlogmanager

Page 10: Solr for Data Science
Page 11: Solr for Data Science

Feature Selection and Data Reduction

• Feature Selection

  • Analyzers for all types

  • Easily get weights for terms

  • Term Vectors

• Data Reduction

  • Filters

  • Analyzers

  • Data quality tools
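To make "weights for terms" concrete, here is a small sketch of term weighting used for feature selection. It computes a textbook TF-IDF in plain Python over already-tokenized toy documents; it is not Lucene's scoring formula, and in practice the analyzers would do the tokenizing and the index would supply the term and document frequencies.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document using textbook TF-IDF."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


def top_terms(doc_weights, k):
    """Keep the k highest-weighted terms (ties broken alphabetically)."""
    ranked = sorted(doc_weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Toy, made-up documents standing in for analyzed field values.
docs = [
    ["solr", "search", "fast", "search"],
    ["solr", "facets", "stats"],
    ["solr", "spark", "stats"],
]
weights = tf_idf(docs)
print(top_terms(weights[0], 2))
```

A term such as "solr" that appears in every document gets weight zero and drops out, which is the data-reduction effect in miniature.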

Page 12: Solr for Data Science

Classification and Clustering

• Quick and dirty:

  • kNN, others

• Carrot^2 integration for search result clustering

• Integration with Mahout

• Lucene provides Bayesian classifiers built on index

• Easily build training and test sets via filter queries
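One way to read "training and test sets via filter queries" is sketched below: two requests share a filter that keeps only labeled documents and differ in a date-range filter. The host, the collection name `docs`, the field names `label_s` and `timestamp_dt`, and the time-based split itself are all assumptions for illustration; `fq` and the range syntax are standard Solr.

```python
from urllib.parse import urlencode, parse_qs

# Hypothetical Solr location and collection name.
SOLR_SELECT = "http://localhost:8983/solr/docs/select"


def split_urls(cutoff, label_field="label_s", time_field="timestamp_dt"):
    """Return (train_url, test_url) that split labeled docs at a cutoff date."""
    labeled = label_field + ":[* TO *]"              # only docs that have a label
    before = time_field + ":[* TO " + cutoff + "}"   # exclusive upper bound
    after = time_field + ":[" + cutoff + " TO *]"    # inclusive lower bound

    def build(split_filter):
        # fq may be repeated; Solr intersects all filter queries.
        params = [("q", "*:*"), ("fq", labeled), ("fq", split_filter), ("wt", "json")]
        return SOLR_SELECT + "?" + urlencode(params)

    return build(before), build(after)


train_url, test_url = split_urls("2014-01-01T00:00:00Z")
print(train_url)
print(test_url)
```

Because the two range filters do not overlap, no document lands in both sets. Solr caches filter queries, so re-running either request with a different main query stays cheap.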

Page 13: Solr for Data Science

Math

• Built in expressions, stats, function queries make custom ranking a snap!

• Search is essentially vector * matrix

• Lucene index is a ranking optimized matrix

• More coming!
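The "vector * matrix" view can be shown with a toy example: treat the query as a vector of term weights, the index as a document-by-term matrix, and the score of each document as a dot product. The terms and counts below are made up, and a real Lucene index stores this matrix sparsely, term by term, rather than as dense rows.

```python
def score(query_vec, doc_term_matrix):
    """Dot product of the query vector with every document row."""
    return [sum(q * w for q, w in zip(query_vec, row)) for row in doc_term_matrix]


# Columns correspond to these terms (toy vocabulary).
terms = ["solr", "search", "spark", "stats"]

# One row per document, one raw term count per column.
matrix = [
    [1, 2, 0, 0],  # doc 0
    [1, 0, 0, 1],  # doc 1
    [1, 0, 1, 1],  # doc 2
]

# The query "search stats" as a vector over the same vocabulary.
query = [0, 1, 0, 1]

scores = score(query, matrix)
# Rank document ids by descending score; the sort is stable, so ties keep id order.
ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
print(scores, ranked)
```

Custom ranking then amounts to changing the weights in the matrix or the query vector, which is roughly the role similarity models and function queries play.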

Page 14: Solr for Data Science

Signals power relevance

Clicks, tweets, ratings, locations and much more can all be leveraged to provide high quality recommendations to users and deeper insight for data scientists.

Index Time Enrichment: Increase the findability of documents and records with automatic creation of tags, fields and meta-data.

Query Modification: Perform real time decision making and routing in order to map a user's intention or enterprise policy.

Result Manipulation: Curate the user experience in your application with artificial result ranking, document injections and obfuscation.

Page 15: Solr for Data Science

Lucidworks Fusion

• http://www.lucidworks.com/products/fusion

• Ships w/ built-in Solr-based Recommender OOTB, but easy to extend

• Demo: eCommerce data set

  • ~1.2M products

  • ~4M clicks

Page 16: Solr for Data Science

Solr and Your Tools

• Data ingest:

  • JSON, CSV, XML, Rich types (PDF, etc.), custom

• Clients for Python, R, Java, .NET and more

  • http://cran.r-project.org/web/packages/solr/index.html, amongst others

• Output formats: JSON, CSV, XML, custom
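Since Solr speaks HTTP and can answer in JSON or CSV, a dedicated client is optional. The sketch below uses only the Python standard library: it builds a select URL for either output format and pulls the documents out of a JSON response. The host, the collection name `products`, the field names, and the canned response body are made up for illustration; the response layout follows Solr's JSON response writer.

```python
import json
from urllib.parse import urlencode

# Hypothetical Solr location and collection name.
SOLR_SELECT = "http://localhost:8983/solr/products/select"


def select_url(query, wt="json", fields=("id", "name")):
    """Build a /select URL; wt picks the output format (json, csv, xml)."""
    params = {"q": query, "wt": wt, "fl": ",".join(fields)}
    return SOLR_SELECT + "?" + urlencode(params)


def docs_from_response(body):
    """Extract the list of documents from a Solr JSON response body."""
    return json.loads(body)["response"]["docs"]


# Canned, made-up response standing in for what a live server would return.
sample = """
{"responseHeader": {"status": 0, "QTime": 3},
 "response": {"numFound": 2, "start": 0,
              "docs": [{"id": "1", "name": "iPod"},
                       {"id": "2", "name": "iPod case"}]}}
"""

print(select_url("ipod", wt="csv"))
print([doc["name"] for doc in docs_from_response(sample)])
```

Against a live server, the body would come from `urllib.request.urlopen(select_url("ipod")).read()`. The CSV form can be handed to any tool that reads CSV.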

Page 17: Solr for Data Science

• Vector Space or Probabilistic, it’s your choice!

• Killer FST

• Wicked fast

• Pluggable compression, queries, indexing and more

• Advanced Similarity Models

• Lang. Modeling, Divergence from Random, more

• Easy to plug-in ranking

for Data Science

Page 18: Solr for Data Science

But what about?

Page 19: Solr for Data Science

What’s coming?

• More Facets/Stats

  • Combine pivots, ranges and stats

  • Percentiles via t-digest

  • hyper-log-log

• Deeper Spark integration for Solr

• Custom distributed computation and aggregations/maths

• Advanced schema on read options

• Time series? Trends? Anomaly Detection?

• Learn to rank?
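For a sense of what the percentile and cardinality stats look like from the client side, here is a sketch of a stats request. It assumes the local-parameter syntax of the Stats Component as it shipped in later Solr 5.x releases (`percentiles` and `cardinality` on `stats.field`), which is not something the slides specify. The host, the collection name `products`, and the field `price` are also hypothetical.

```python
from urllib.parse import urlencode, parse_qs

# Hypothetical Solr location and collection name.
SOLR_SELECT = "http://localhost:8983/solr/products/select"


def stats_url(field, percentiles=(50, 95, 99)):
    """Build a stats-only request for approximate percentiles and distinct counts."""
    pct = ",".join(str(p) for p in percentiles)
    # Assumed syntax: local params on stats.field, as in later Solr 5.x releases.
    local_params = "{!percentiles='" + pct + "' cardinality=true}"
    params = {
        "q": "*:*",
        "rows": 0,          # no documents needed, only the statistics
        "wt": "json",
        "stats": "true",
        "stats.field": local_params + field,
    }
    return SOLR_SELECT + "?" + urlencode(params)


url = stats_url("price")
print(url)
```

Both values are estimates by design: t-digest and HyperLogLog trade a small error for bounded memory, which is what makes them workable across a distributed index.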

Page 20: Solr for Data Science

Lucidworks Open Source

• Logstash for Solr:

  • https://github.com/LucidWorks/solrlogmanager

• Banana (Kibana for Solr):

  • https://github.com/LucidWorks/banana

• Effortless AWS deployment and monitoring:

  • http://www.github.com/lucidworks/solr-scale-tk

• Data Quality Toolkit:

  • https://github.com/LucidWorks/data-quality

• Spark Integration:

  • https://github.com/LucidWorks/spark-solr

Page 21: Solr for Data Science

Resources

• This code: http://github.com/lucidworks/solr-for-datascience

• Company: http://www.lucidworks.com

• Our blog: http://www.lucidworks.com/blog

• Book: http://www.manning.com/ingersoll

• Solr: http://lucene.apache.org/solr

• Fusion: http://www.lucidworks.com/products/fusion

• Twitter: @gsingers

Page 22: Solr for Data Science