Search-based business intelligence and reverse data engineering with Apache Solr
Solr for Data Science
Solr for Data Science: Scalable search and analytics in one
Grant Ingersoll, CTO: @gsingers
http://github.com/lucidworks/solr-for-datascience
Solr in a nutshell
8M+ total downloads
Solr is both established & growing
250,000+ monthly downloads
Largest community of developers.
2,500+ open Solr jobs.
Solr is the most widely used search solution on the planet.
Lucidworks: Unmatched Solr expertise.
1/3 of the active committers
70% of the open source code is committed
Lucene/Solr Revolution: world’s largest open source user conference dedicated to Lucene/Solr.
Solr has tens of thousands of applications in production.
You use Solr every day.
Solr’s Key Features
• Full text search (Info Retr.)
• Facets/Guided Nav galore!
• Lots of data types
• Spelling, auto-complete, highlighting
• Cursors
• More Like This
• De-duplication
• Apache Lucene
• Grouping and Joins
• Stats, expressions, transformations and more
• Lang. Detection
• Extensible
• Massive Scale/Fault tolerance
It is increasingly important to know what is important!
Corollary: The faster you know what is important, the better
Data Exploration
SiLK
• Solr - Logstash - Kibana
• http://lucidworks.com/product/integrations/silk/
• Open source at:
  • https://github.com/LucidWorks/banana
  • https://github.com/LucidWorks/solrlogmanager
Feature Selection and Data Reduction
• Feature Selection
  • Analyzers for all types
  • Easily get weights for terms
  • Term Vectors
• Data Reduction
  • Filters
  • Analyzers
  • Data quality tools
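As a rough illustration of the "Term Vectors" and "weights for terms" bullets, the sketch below builds a request URL for Solr's TermVectorComponent, which can return per-document term frequency, document frequency and TF-IDF. It only constructs the URL and makes no network call. The host, the `products` collection, the `name` field and the `/tvrh` handler path are assumptions (the handler is registered under that path in Solr's sample configuration, but your setup may differ).

```python
from urllib.parse import urlencode


def term_vector_url(base_url, collection, query, field, rows=10):
    """Build a TermVectorComponent request URL (nothing is sent over the network)."""
    params = {
        "q": query,
        "rows": rows,
        "fl": "id",
        "tv": "true",         # switch the term vector component on
        "tv.fl": field,       # field whose term vectors we want
        "tv.tf": "true",      # term frequency within each document
        "tv.df": "true",      # document frequency across the index
        "tv.tf_idf": "true",  # tf/df weight per term
        "wt": "json",
    }
    return f"{base_url}/{collection}/tvrh?{urlencode(params)}"


# Hypothetical host, collection and field names, for illustration only.
url = term_vector_url("http://localhost:8983/solr", "products", "name:solr", "name")
print(url)
```

The field has to be indexed with term vectors enabled for the component to return anything; the returned weights can then feed a feature-selection step in the tool of your choice.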
Classification and Clustering
• Quick and dirty:
  • kNN, others
• Carrot2 integration for search result clustering
• Integration with Mahout
• Lucene provides Bayesian classifiers built on the index
• Easily build training and test sets via filter queries
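One possible way to read "training and test sets via filter queries": if each document carries a precomputed bucket number, two `fq` range filters carve the index into disjoint sets. The sketch below only builds the query parameters. The `split_bucket` field (an integer from 0 to 99 assigned at index time) and the `category` label field are hypothetical names introduced for illustration, not something the deck specifies.

```python
from urllib.parse import urlencode


def split_params(label_field, bucket_field="split_bucket", train_pct=80):
    """Return (train, test) Solr query parameters using range filter queries.

    Assumes bucket_field holds an integer in [0, 99] assigned at index time.
    """
    base = {"q": "*:*", "fl": f"id,{label_field}", "wt": "json"}
    train = dict(base, fq=f"{bucket_field}:[0 TO {train_pct - 1}]")
    test = dict(base, fq=f"{bucket_field}:[{train_pct} TO 99]")
    return train, test


train, test = split_params("category")
print(urlencode(train))
print(urlencode(test))
```

Because filter queries are cached independently of the main query, the same two filters can be reused across many experiments with different `q` values.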
Math
• Built-in expressions, stats, and function queries make custom ranking a snap!
• Search is essentially vector * matrix
• Lucene index is a ranking-optimized matrix
• More coming!
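The "vector * matrix" view can be made concrete with a toy example: treat the index as a term-by-document weight matrix and the query as a vector over the same terms, so scoring is a single product. The terms and weights below are made up for illustration, and a real Lucene index stores this matrix sparsely as an inverted index rather than as dense rows.

```python
# Toy term-by-document weight matrix: one row per term, one column per document.
terms = ["solr", "search", "data", "science"]
index = [
    [2.0, 0.0, 1.0],  # solr
    [1.0, 1.0, 0.0],  # search
    [0.0, 2.0, 1.0],  # data
    [0.0, 1.0, 3.0],  # science
]


def score(query_vector, matrix):
    """Multiply a query vector by the term-document matrix, giving one score per document."""
    n_docs = len(matrix[0])
    return [
        sum(q * row[d] for q, row in zip(query_vector, matrix))
        for d in range(n_docs)
    ]


query = [1.0, 0.0, 0.0, 1.0]  # the query "solr science"
scores = score(query, index)
ranking = sorted(range(len(scores)), key=lambda d: scores[d], reverse=True)
print(scores)   # [2.0, 1.0, 4.0]
print(ranking)  # [2, 0, 1]
```

Only rows for terms that appear in the query contribute to the product, which is why an inverted index that jumps straight to those rows is such a good fit for ranking.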
Signals power relevance
Clicks, tweets, ratings, locations and much more can all be leveraged to provide high-quality recommendations to users and deeper insight for data scientists.
Index Time Enrichment: Increase the findability of documents and records with automatic creation of tags, fields and metadata.
Query Modification: Perform real-time decision making and routing in order to map a user's intention or enterprise policy.
Result Manipulation: Curate the user experience in your application with artificial result ranking, document injections and obfuscation.
Lucidworks Fusion
• http://www.lucidworks.com/products/fusion
• Ships with a built-in Solr-based recommender out of the box, but easy to extend
• Demo: eCommerce data set
  • ~1.2M products
  • ~4M clicks
Solr and Your Tools
• Data ingest:
  • JSON, CSV, XML, rich types (PDF, etc.), custom
• Clients for Python, R, Java, .NET and more
  • http://cran.r-project.org/web/packages/solr/index.html, among others
• Output formats: JSON, CSV, XML, custom
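To show how little glue the JSON output format needs, here is a minimal sketch that flattens the `docs` array of a Solr JSON response into CSV text that a data frame library can read. The response string is a made-up sample in the usual `response.docs` shape, not real output, and the field names are hypothetical.

```python
import csv
import io
import json

# Made-up sample in the standard Solr JSON response shape (illustrative only).
sample_response = """
{"responseHeader": {"status": 0, "QTime": 3},
 "response": {"numFound": 2, "start": 0, "docs": [
   {"id": "1", "name": "widget", "price": 9.99},
   {"id": "2", "name": "gadget", "price": 24.5}]}}
"""


def docs_to_csv(response_text, fields):
    """Flatten the docs of a Solr JSON response into CSV text with the given columns."""
    docs = json.loads(response_text)["response"]["docs"]
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    for doc in docs:
        # Missing fields become empty cells; fields not requested are dropped.
        writer.writerow({f: doc.get(f, "") for f in fields})
    return buf.getvalue()


table = docs_to_csv(sample_response, ["id", "name", "price"])
print(table)
```

When only a flat table is needed, requesting `wt=csv` from Solr directly skips this step; going through JSON is mainly useful when the same response also carries facets or stats.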
Lucene for Data Science
• Vector Space or Probabilistic, it’s your choice!
• Killer FST
• Wicked fast
• Pluggable compression, queries, indexing and more
• Advanced Similarity Models
  • Language Modeling, Divergence from Randomness, more
• Easy to plug in ranking
But what about?
What’s coming?
• More Facets/Stats
  • Combine pivots, ranges and stats
  • Percentiles via t-digest
  • HyperLogLog
• Deeper Spark integration for Solr
• Custom distributed computation and aggregations/maths
• Advanced schema-on-read options
• Time series? Trends? Anomaly Detection?
• Learn to rank?
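For the percentile and HyperLogLog items, a request might look roughly like the sketch below. It assumes the local-parameter syntax of the Stats Component as it appears in later Solr 5.x releases (`percentiles` and `cardinality` on `stats.field`); the deck itself only lists these as upcoming, so check the documentation for your version. The `price` field is a hypothetical name.

```python
from urllib.parse import urlencode


def stats_params(field, percentiles=(50, 95, 99), cardinality=True):
    """Build Stats Component parameters asking for approximate percentiles
    and, optionally, an approximate distinct count.

    Assumption: the local-params syntax follows later Solr 5.x releases.
    """
    local = "percentiles='{}'".format(",".join(str(p) for p in percentiles))
    if cardinality:
        local += " cardinality=true"
    return {
        "q": "*:*",
        "rows": 0,  # only the statistics are wanted, not the documents
        "stats": "true",
        "stats.field": "{!" + local + "}" + field,
        "wt": "json",
    }


params = stats_params("price")
print(params["stats.field"])
print(urlencode(params))
```

Both statistics are approximations by design: t-digest trades a small error at the percentile for bounded memory, and HyperLogLog does the same for distinct counts, which is what makes them practical across shards.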
Lucidworks Open Source
• Logstash for Solr:
  • https://github.com/LucidWorks/solrlogmanager
• Banana (Kibana for Solr):
  • https://github.com/LucidWorks/banana
• Effortless AWS deployment and monitoring:
  • http://www.github.com/lucidworks/solr-scale-tk
• Data Quality Toolkit:
  • https://github.com/LucidWorks/data-quality
• Spark Integration:
  • https://github.com/LucidWorks/spark-solr
Resources
• This code: http://github.com/lucidworks/solr-for-datascience
• Company: http://www.lucidworks.com
• Our blog: http://www.lucidworks.com/blog
• Book: http://www.manning.com/ingersoll
• Solr: http://lucene.apache.org/solr
• Fusion: http://www.lucidworks.com/products/fusion
• Twitter: @gsingers