A Practical Data Science Workbench: spark-solr
Jake Mannix
@pbrane
Lead Data Engineer, Lucidworks
$ whoami
Now: Lucidworks, Office of the CTO: applied ML / data engineering R&D
Previously:
• Allen Institute for AI: Semantic Search on academic research publications
• Twitter: account search, user interest modeling, content recommendations
• LinkedIn: profile search, generic entity-to-entity recommender systems
Prehistory:
• other software companies, algebraic topology, particle cosmology
Cold Start
Imagine you jumped into a new Data Lake…
• What is the “Minimum Viable Big Data Science Toolkit”?
  • DB? Distributed FS? NoSQL store?
  • ML libraries / frameworks (scripting? notebook? REPL?)
  • text analysis or graph libraries?
  • dataviz package?
  • hosting layer (for models and/or POC apps)?
Cold Start
• Spark and Solr for Data Engineering
  • Why Solr?
  • Why Spark?
• Example rapid-turnaround workflow: Searchhub
  • data exploration
  • clustering: unsupervised ML
  • classification: supervised ML
  • recommenders: collaborative filtering + content-based + “mixed-mode”
Overview
Practical Data Science with Spark and Solr
Why does Solr need Spark?
Why does Spark need Solr?
Typical Hadoop / Spark data-engineering task, start with some data on HDFS:
$ hdfs dfs -ls /user/jake/mail/lucene-solr-user/2015
…
-rw-r--r-- 1 jake staff 63043884 Feb 4 18:22 part-00001.lzo
-rw-r--r-- 1 jake staff 79770856 Feb 4 18:22 part-00002.lzo
-rw-r--r-- 1 jake staff 72108179 Feb 4 18:22 part-00003.lzo
-rw-r--r-- 1 jake staff 12150481 Feb 4 18:22 part-00004.lzo
Now what? What’s in these files?
Solr gives you:
• random access data store
• full-text search
• fast aggregate statistics
• just starting out: no HDFS / S3 necessary!
• world-class multilingual text analytics:
• no more: tokens = str.toLowerCase().split("\\s+")
• relevancy / ranking
• realtime REST service layer / web console
• Apache Lucene
• Grouping and Joins
• Streaming parallel SQL
• Stats, expressions, transformations and more
• Lang. Detection
• Extensible
• Massive Scale/Fault tolerance
Solr Key Features
• Full text search (Info Retr.)
• Facets/Guided Nav galore!
• Lots of data types
• Spelling, auto-complete, highlighting
• Cursors
• More Like This
• De-duplication
Why Spark for Solr?
• spark-shell: a Big Data REPL with all your fave JVM libs!
• Build the index in parallel very, very quickly
• Aggregations
• Boosts, stats, iterative global computations
• Offline compute to update index with additional info (e.g. PageRank, popularity)
• Whole corpus analytics and ML: clustering, classification, CF, rankers
• General-purpose distributed computation
• Joins with other storage (Cassandra, HDFS, DB, HBase)
Why do data engineering with Solr and Spark?
Solr:
• Data exploration and visualization
• Easy ingestion and feature selection
• Powerful ranking features
• Quick and dirty classification and clustering
• Simple operation and scaling
• Stats and math built in

Spark:
• General-purpose batch/streaming compute engine
• Whole-collection analysis!
• Fast, large-scale iterative algorithms
• Advanced machine learning: MLlib, Mahout, Deeplearning4j
• Lots of integrations with other big data systems
and together: http://github.com/lucidworks/spark-solr
• Free data! ASF mailing-list archives + GitHub + JIRA
• https://github.com/lucidworks/searchhub
• Index it into Solr
• Explore a bit deeper: unsupervised Spark ML
• Exploit labels: predictive analytics
• Build a recommender, mix & match with search
Example workflow: Searchhub
• Initial exploration of ASF mailing-list archives
• index into Solr: just need to turn your records into json
• facet:
• fields with low cardinality or with sensible ranges
• document size histogram
• projects, authors, dates
• find: broken fields, automated content, expected data missing, errors
• now: load into a Spark RDD via SolrRDD:
Searchhub: Initial Exploration
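The load step above can be sketched with the spark-solr DataSource. A minimal sketch, assuming a running SolrCloud: the option names (`zkhost`, `collection`, `query`, `fields`) follow the spark-solr README, but the ZooKeeper address, collection name, and field names here are placeholders, and the actual read is shown commented out because it needs a live cluster and SparkSession.

```python
# Placeholder connection details for a hypothetical Searchhub cluster:
solr_options = {
    "zkhost": "localhost:9983",       # ZooKeeper ensemble for SolrCloud
    "collection": "searchhub",        # collection holding the mail archives
    "query": "project_s:lucene-solr", # restrict to one mailing list (assumed field)
    "fields": "id,subject,body,author_s",
}

# Requires a live SparkSession + SolrCloud, so shown commented out:
# df = spark.read.format("solr").options(**solr_options).load()
# df.groupBy("author_s").count().orderBy("count", ascending=False).show(10)
```

From here the DataFrame behaves like any other Spark source: facet-style checks (counts by author, by date) become ordinary `groupBy` aggregations.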
• try other text analyzers (no more str.split("\\s+")!)
Smarter Text Analysis in Spark
ref: Lucidworks blog on LuceneTextAnalyzer by Steve Rowe
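To make the "no more naive split" point concrete, here is a pure-Python contrast (not the LuceneTextAnalyzer itself, which adds stemming, stopwords, and per-language rules): whitespace splitting leaves punctuation glued to tokens, while even a simple regex tokenizer cleans that up.

```python
import re

def naive_tokens(text):
    # the baseline being retired: lowercase + whitespace split
    return text.lower().split()

def slightly_smarter_tokens(text):
    # keep runs of word characters only, so punctuation no longer
    # sticks to tokens; a real Lucene analyzer goes much further
    return re.findall(r"\w+", text.lower())

msg = "Re: Solr 6.1, spark-solr -- does it work?"
naive = naive_tokens(msg)       # contains 're:', '6.1,', 'work?'
smarter = slightly_smarter_tokens(msg)
```

The difference matters downstream: "work?" and "work" become distinct features under the naive scheme, fragmenting counts before any ML even starts.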
• Unsupervised machine learning with MLlib or Mahout:
  • cluster documents with KMeans
  • extract topics with Latent Dirichlet Allocation
  • learn word vectors with Word2Vec
• Write the results back to Solr:
Searchhub: Exploratory Data Science
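For intuition about what MLlib's KMeans parallelizes, here is the core Lloyd iteration in plain Python on toy 2-D points. This is a sketch of the algorithm only, not the MLlib API; at scale the assign-and-average loop runs over an RDD of feature vectors.

```python
import random

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20, seed=0):
    """Lloyd's algorithm: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    centroids = random.Random(seed).sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist2(p, centroids[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep old centroid if a cluster empties out
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids

# two obvious blobs: near (0, 0) and near (10, 10)
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
cents = kmeans(pts, k=2)
```

On document vectors the same loop runs over TF-IDF or Word2Vec features; the heavy lifting MLlib adds is distributing the assignment step and the partial sums.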
• can also do something more like real Data Science:
Searchhub Classification: “Many Newsgroups”
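The "Many Newsgroups" framing works because each message's mailing list is a free label. A minimal multinomial Naive Bayes with add-one smoothing, on hypothetical toy data, shows the shape of the task; in practice you would train an equivalent model with MLlib over the whole corpus.

```python
import math
from collections import Counter, defaultdict

class TinyNaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing: a baseline
    'which list did this come from?' classifier."""
    def fit(self, docs, labels):
        self.word_counts = defaultdict(Counter)
        self.label_counts = Counter(labels)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, doc):
        def log_score(label):
            counts = self.word_counts[label]
            total = sum(counts.values())
            s = math.log(self.label_counts[label])
            for w in doc.lower().split():
                s += math.log((counts[w] + 1) / (total + len(self.vocab)))
            return s
        return max(self.label_counts, key=log_score)

# toy corpus; the labels come for free from the list each message was sent to
train = [("tokenizer analyzer query parser", "lucene-solr-user"),
         ("rdd shuffle executor memory", "spark-user"),
         ("faceting schema copyfield analyzer", "lucene-solr-user"),
         ("dataframe executor broadcast join", "spark-user")]
model = TinyNaiveBayes().fit([d for d, _ in train], [l for _, l in train])
```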
Recommender Systems with Spark and Solr
• Recommender Systems
• content-based:
  • mail thread as “item”, head messages grouped by replier as “user” profile
  • search query of users against items to recommend
• collaborative filtering:
  • users replying to a head message “rate” it positively
  • train a Spark ML ALS recommender model
• both can generate item-item similarity models
Spark+Solr RecSys
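The collaborative-filtering side starts by turning raw messages into implicit ratings: each reply to a head message is a positive (user, item) signal, and repeated replies strengthen it. A sketch with hypothetical message dicts; the ALS training call is shown commented since it needs a SparkSession.

```python
from collections import Counter

def thread_ratings(messages):
    """Turn mail messages into implicit-feedback triples for ALS:
    replying to a thread counts as a positive 'rating' of it."""
    counts = Counter()
    for msg in messages:
        if msg["in_reply_to"] is not None:   # a reply, not a head message
            counts[(msg["author"], msg["in_reply_to"])] += 1
    return [(user, thread, float(n)) for (user, thread), n in counts.items()]

msgs = [
    {"author": "alice", "in_reply_to": None},  # head message of thread t1
    {"author": "bob",   "in_reply_to": "t1"},
    {"author": "carol", "in_reply_to": "t1"},
    {"author": "bob",   "in_reply_to": "t1"},  # bob keeps the thread going
]
ratings = thread_ratings(msgs)

# With a live SparkSession, these triples would feed spark.ml ALS, roughly:
# ALS(implicitPrefs=True).fit(
#     spark.createDataFrame(ratings, ["user", "item", "rating"]))
```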
• With the top-K closest items by both CF and content:
  • store them back into a Solr collection!
  • fetch your (or a generic user’s) recent items
  • query with them:
• “q=(cf:123^1.1 cf:39^2.3 cf:93^0.7)^alpha (ct:912^2.9 ct:123^1.8 ct:99^2.2)^(1-alpha)”
Experimenting with mixed-mode Recommenders
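Building that blended query is just string assembly: one boosted clause per CF neighbor, one per content neighbor, weighted against each other by alpha (1.0 = pure CF). A sketch where the `cf`/`ct` field names and item IDs follow the example query on the slide:

```python
def mixed_mode_query(cf_items, ct_items, alpha):
    """Blend collaborative-filtering and content-based neighbors into
    one Solr boost query; each item is an (id, weight) pair."""
    cf_clause = " ".join(f"cf:{i}^{w}" for i, w in cf_items)
    ct_clause = " ".join(f"ct:{i}^{w}" for i, w in ct_items)
    return f"q=({cf_clause})^{alpha} ({ct_clause})^{1 - alpha}"

q = mixed_mode_query([(123, 1.1), (39, 2.3), (93, 0.7)],
                     [(912, 2.9), (123, 1.8), (99, 2.2)],
                     alpha=0.5)
```

Sweeping alpha between 0 and 1 is then the whole experiment: pure content, pure CF, or anything in between, without retraining either model.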
Resources
• spark-solr: http://github.com/lucidworks/spark-solr
• searchhub: http://github.com/lucidworks/searchhub
• Company: http://www.lucidworks.com
• Our blog: http://www.lucidworks.com/blog
• Fusion: http://www.lucidworks.com/products/fusion
• Twitter: @pbrane