Thoth - Real-time Solr Monitor and Search Analysis Engine: Presented by Damiano Braga & Praneet...
Thoth
Real-time Solr Monitor and Search Analysis Engine
Damiano Braga Sr. Software Engineer
Praneet Mhatre Data Mining Engineer
Overview
- What is Thoth? - Data Collection and Thoth Core Indexing - Thoth API & Thoth Dashboard - Thoth Monitor - Thoth ML: Prediction and Topic Modeling - Special Thanks & Q/A
Demo
What is Thoth? - Innovation project at Trulia - Understand our search infrastructure without touching logs - Troubleshoot search performance issues
- Designed as a modular system - Set of tools that help gather information about, monitor, and understand a search infrastructure - Open source projects:
thoth thoth-ml thoth-api thoth-dashboard thoth-monitor thoth-demo
Problem: Know Your Search Infrastructure - Solr logs are a good source, but sometimes hold only partial information - Decentralized data (at least 1 log per search server) - Log rotation - Not searchable
If we could index all the information... Let’s use Solr! - We can search on it - We get some handy features for free: facets, stats, etc. - It’s scalable
Thoth Document
1 Solr Request = 1 Thoth (Solr) Document
- Server Info: hostname, port number, core name, pool name
- Query Info: timestamp, actual query, qtime, hits, exception?
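To make the document shape concrete, here is a minimal sketch of one such record as a Python dataclass. The field names (`core_name`, `qtime_ms`, and so on) and the sample values are illustrative assumptions; the slide lists the kinds of information captured, not the actual Thoth schema.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class ThothDocument:
    """One Solr request captured as one document (hypothetical field names)."""
    # Server info
    hostname: str
    port: int
    core_name: str
    pool_name: str
    # Query info
    timestamp: str                   # ISO-8601 UTC, e.g. "2014-09-16T18:00:02Z"
    query: str                       # the actual query string sent to Solr
    qtime_ms: int                    # QTime reported by Solr, in milliseconds
    hits: int                        # number of matching documents
    exception: Optional[str] = None  # set only when the request failed


# Made-up sample values, for illustration only.
doc = ThothDocument(
    hostname="search01", port=8983, core_name="collection", pool_name="default",
    timestamp="2014-09-16T18:00:02Z", query="q=*:*&rows=10", qtime_ms=12, hits=95,
)
fields = asdict(doc)  # plain dict, ready to be sent to an indexer
```

Keeping the exception optional lets successful and failed requests share one document type, which is what makes facets such as "exceptions per server" straightforward.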
Data Collection (1/2) - Should be smooth: must not slow down search traffic - We care about near real-time data - We care about historical data - Dataset is growing fast
- Interceptor on each search server - We use a SolrComponent attached to a Request Handler - Queue system (e.g. ActiveMQ) to relay and temporarily store messages - Each search server has a manifest in the solrconfig.xml
Data Collection (2/2)
<requestHandler name="select" class="com.solr2activemq.SolrToActiveMQHandler">
  <arr name="last-components">
    <str>solr2activemq</str>
  </arr>
</requestHandler>
<searchComponent name="solr2activemq" class="com.solr2activemq.SolrToActiveMQComponent">
  <str name="activemq-broker-uri">localhost</str>
  <int name="activemq-broker-port">61616</int>
  <str name="activemq-broker-destination-type">queue</str>
  <str name="activemq-broker-destination-name">test-queue</str>
  <str name="solr-hostname">localhost</str>
  <int name="solr-port">8983</int>
  <str name="solr-poolname">default</str>
  <str name="solr-corename">collection</str>
  <int name="solr2activemq-buffer-size">1000</int>
  <int name="solr2activemq-dequeuing-buffer-polling">500</int>
  <int name="solr2activemq-check-activemq-polling">5000</int>
</searchComponent>
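The "no traffic slowing down" requirement suggests that the interceptor should never block the request path. The sketch below shows one way a bounded buffer with a background drain could achieve that. It is an assumption about how a buffer size and polling interval might be used, not a description of the real solr2activemq component; `RequestBuffer` and its methods are hypothetical names.

```python
from collections import deque


class RequestBuffer:
    """Bounded in-memory buffer (hypothetical, not the solr2activemq code)."""

    def __init__(self, max_size):
        self.max_size = max_size
        self._items = deque()
        self.dropped = 0  # messages discarded because the buffer was full

    def offer(self, message):
        """Called on the request path: never blocks, drops when full."""
        if len(self._items) >= self.max_size:
            self.dropped += 1
            return False
        self._items.append(message)
        return True

    def drain(self, send):
        """Called by a background poller: forward everything buffered."""
        sent = 0
        while self._items:
            send(self._items.popleft())
            sent += 1
        return sent


buffer = RequestBuffer(max_size=2)
accepted = [buffer.offer(m) for m in ("a", "b", "c")]  # third one is dropped
forwarded = []
sent = buffer.drain(forwarded.append)  # stand-in for a queue producer
```

Dropping on overflow trades completeness of the monitoring data for predictable search latency, which matches the stated priority of not slowing down traffic.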
Sizing of Data - Need for granular information for near real-time data - Less granularity for historical data
Too much data = slow search, space problem - Shrinking feature:
- Create shrunk document - Real-time core cleanup
- Shrinking time is configurable
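A minimal sketch of what shrinking could look like: granular per-request documents are rolled up into one summary document per server, core, and time bucket. The summary fields (`nqueries`, `avg_qtime_ms`, `zero_hits`) and the 15-minute bucket are assumptions for illustration; the slides do not specify the shrunk document's schema. The bucket size is assumed to divide 60 evenly.

```python
from collections import defaultdict
from datetime import datetime

TS_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def bucket_start(ts, bucket_minutes):
    """Round a timestamp down to the start of its bucket within the hour."""
    t = datetime.strptime(ts, TS_FORMAT)
    minute = (t.minute // bucket_minutes) * bucket_minutes
    return t.replace(minute=minute, second=0).strftime(TS_FORMAT)


def shrink(docs, bucket_minutes=15):
    """Aggregate per-request docs into one summary per (host, core, bucket)."""
    groups = defaultdict(list)
    for d in docs:
        key = (d["hostname"], d["core_name"],
               bucket_start(d["timestamp"], bucket_minutes))
        groups[key].append(d)
    shrunk = []
    for (host, core, start), items in sorted(groups.items()):
        qtimes = [d["qtime_ms"] for d in items]
        shrunk.append({
            "hostname": host,
            "core_name": core,
            "timestamp": start,
            "nqueries": len(items),
            "avg_qtime_ms": sum(qtimes) / len(qtimes),
            "max_qtime_ms": max(qtimes),
            "zero_hits": sum(1 for d in items if d["hits"] == 0),
        })
    return shrunk


# Made-up sample documents.
docs = [
    {"hostname": "s1", "core_name": "c", "timestamp": "2014-09-16T18:02:10Z", "qtime_ms": 10, "hits": 3},
    {"hostname": "s1", "core_name": "c", "timestamp": "2014-09-16T18:14:59Z", "qtime_ms": 30, "hits": 0},
    {"hostname": "s1", "core_name": "c", "timestamp": "2014-09-16T18:15:00Z", "qtime_ms": 50, "hits": 7},
]
summary = shrink(docs)
```

After the summaries are indexed, the granular documents older than the configured shrinking time can be deleted from the real-time core.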
Thoth Index - Solr 4.7 - Soft commit for near real-time search - Soft commit maxTime set to 1s - Auto commit set to 15s - Update chain set to enforce UUID as PkID - Use of SolrJ to index data and query
Thoth API - Abstraction for Thoth index and Thoth data - Read only REST-like API - JSON response - Written in Node.js to accommodate socket.io
Example (response truncated):
{"numFound":95,"values":[{"timestamp":"2014-09-16T18:00:02Z","value":45337},{"timestamp":"2014-09-16T18:15:02Z","value":77325},{"timestamp":"2014-09-16T18:30:02Z","value":109523},{"timestamp":"2014-09-16T18:45:02Z","value":112279},{"timestamp":"2014-09-16T19:00:02Z","value":115334}
thoth:3001/api/server/foo/core/bar/port/portbar/start/NOW-1DAY/end/NOW/count/nqueries
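As a rough illustration of consuming the API, the sketch below builds a URL following the path pattern of the example and parses a response of the same shape. The `http://` scheme, the helper name `series_url`, and the shortened two-point sample response are assumptions; only the path segments and the `numFound`/`values` keys come from the example.

```python
import json


def series_url(host, server, core, port, start, end, metric_path="count/nqueries"):
    """Build a URL following the path pattern of the slide's example.

    The http scheme and the meaning of `metric_path` are assumptions.
    """
    return (f"http://{host}/api/server/{server}/core/{core}/port/{port}"
            f"/start/{start}/end/{end}/{metric_path}")


url = series_url("thoth:3001", "foo", "bar", "portbar", "NOW-1DAY", "NOW")

# Hand-made sample shaped like the example response, shortened to two points.
sample = (
    '{"numFound":2,"values":['
    '{"timestamp":"2014-09-16T18:00:02Z","value":45337},'
    '{"timestamp":"2014-09-16T18:15:02Z","value":77325}]}'
)
payload = json.loads(sample)
points = [(v["timestamp"], v["value"]) for v in payload["values"]]
```

The resulting list of (timestamp, value) pairs is the kind of series a dashboard graph would plot.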
Thoth Dashboard (1/5) - Visual insight on Thoth data - Useful graphs divided by server or pool - Handy list of slow queries and exceptions - Real-time view for server - Selecting data based on time - Sharable URLs (to Ops team, QA team, Release Eng.)
Thoth Monitor - Continuously monitoring metrics - Stateless - Alerting through email or Nagios - Examples: QTime, number of zero hits, predictor model health - Possibility to implement custom monitors - Reuse StatsComponent [http://wiki.apache.org/solr/StatsComponent] if possible
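A stateless monitor can be sketched as a pure function from the latest statistics to a list of alerts. The input below mimics a few of the fields StatsComponent reports for a numeric field (`count`, `mean`, `max`); the function name, thresholds, and alert wording are hypothetical.

```python
def check_qtime(stats, max_mean_ms, max_peak_ms):
    """Return alert messages for one polling cycle; keeps no state between calls."""
    alerts = []
    if stats["count"] == 0:
        return alerts  # nothing was queried in this window
    if stats["mean"] > max_mean_ms:
        alerts.append(f"mean QTime {stats['mean']:.1f} ms exceeds {max_mean_ms} ms")
    if stats["max"] > max_peak_ms:
        alerts.append(f"max QTime {stats['max']:.1f} ms exceeds {max_peak_ms} ms")
    return alerts


# Made-up statistics for the QTime field over the last polling window.
window = {"count": 1200, "mean": 48.5, "max": 2300.0}
alerts = check_qtime(window, max_mean_ms=100, max_peak_ms=1000)
quiet = check_qtime({"count": 0, "mean": 0.0, "max": 0.0}, 100, 1000)
```

Because each check depends only on its input, the monitor can be restarted or run on several machines without coordination; delivering the alerts by email or Nagios is a separate step.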
Thoth ML What can we do with all this data? • Rich source of information • Can we turn it into knowledge? • How about machine learning?
1. Query time prediction 2. Query pattern recognition 3. Server sizing and resource allocation
1. Query Time Prediction (1/4) • Goal: appropriately route queries to slow/fast pool • Look at query attributes
• Query text • Start parameter • Facets, range queries, geospatial searches etc
• Train a supervised learning model • Use learned model to predict if a query will be slow v/s fast • H2O Machine Learning Library
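The query attributes listed above could be turned into model features roughly as follows. This is a sketch under stated assumptions: the feature names are hypothetical, and range and geospatial detection use simple heuristics (a `[x TO y]` pattern and the `{!geofilt` / `{!bbox` local-params prefixes), not necessarily the features Thoth ML extracts.

```python
import re
from urllib.parse import parse_qs

# Matches Solr range syntax such as price:[100 TO 200]
RANGE_RE = re.compile(r"\[[^\]]+ TO [^\]]+\]")


def extract_features(query_string):
    """Map a Solr query string to a flat feature dict (hypothetical features)."""
    params = parse_qs(query_string)
    q = params.get("q", [""])[0]
    clauses = [q] + params.get("fq", [])
    return {
        "num_tokens": len(q.split()),                      # whitespace tokens in q
        "start": int(params.get("start", ["0"])[0]),       # pagination offset
        "has_facets": params.get("facet", ["false"])[0] == "true",
        "num_facet_fields": len(params.get("facet.field", [])),
        "has_range": any(RANGE_RE.search(c) for c in clauses),
        "has_geo": any("{!geofilt" in c or "{!bbox" in c for c in clauses),
    }


# Made-up query for illustration.
features = extract_features(
    "q=city:sf AND price:[100 TO 200]&start=20&facet=true"
    "&facet.field=beds&facet.field=baths"
    "&fq={!geofilt sfield=loc pt=37.7,-122.4 d=5}"
)
```

Boolean and small-integer features like these are a natural fit for tree-based models, since no scaling or encoding is needed before training.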
1. Query Time Prediction (2/4) Challenges • Imbalanced dataset • Frequency of model training • Type of model • Minimal delay requirement
1. Query Time Prediction (3/4) Challenges Addressed
• Imbalanced dataset
  • Stratified sampling
• Frequency of model training
  • Auto identify relearning frequency
• Type of model
  • Boolean, categorical features -> Tree based
  • High accuracy
  • Gradient Boosted Machine
• Minimal delay requirement
  • User pool queries: 45-50 ms
  • Prediction: 1-3 ms
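Stratified sampling draws from each class separately so that the rare class (slow queries) is never lost by chance. A minimal sketch, assuming a fixed sampling fraction per class and at least one row kept per class; the slides do not say which sampling ratios were used in practice.

```python
import random
from collections import defaultdict


def stratified_sample(rows, label_of, fraction, seed=0):
    """Sample `fraction` of the rows from each class independently."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[label_of(row)].append(row)
    sample = []
    for label in sorted(strata):
        members = strata[label]
        k = max(1, round(len(members) * fraction))  # keep at least one per class
        sample.extend(rng.sample(members, k))
    return sample


# Made-up imbalanced data: 95 fast queries and 5 slow ones.
rows = [{"id": i, "label": "fast"} for i in range(95)]
rows += [{"id": 95 + i, "label": "slow"} for i in range(5)]
sample = stratified_sample(rows, lambda r: r["label"], fraction=0.2)
n_slow = sum(1 for r in sample if r["label"] == "slow")
```

A variant of the same function could use a larger fraction for the minority class to rebalance the training set rather than just preserve its proportions.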
1. Query Time Prediction (4/4) • 1000 Gradient Boosted Trees • Slow queries: QTime > 100 ms (configurable) • Experimental Results
• Training on ~3.1 million • Test on ~1.4 million • AUC: 0.94542 • Accuracy: 0.9202223
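For reference, the two reported metrics can be computed as below. This is a plain-Python sketch on made-up data, not the evaluation code behind the numbers above: labels come from the slow-query threshold, accuracy uses a 0.5 score cutoff (an assumption), and AUC is the fraction of (slow, fast) pairs that the model ranks correctly, with ties counted as half.

```python
def label_slow(qtime_ms, threshold_ms=100):
    """1 if the query counts as slow under the configurable threshold."""
    return 1 if qtime_ms > threshold_ms else 0


def accuracy(labels, scores, cutoff=0.5):
    """Share of queries whose thresholded score matches the label."""
    preds = [1 if s >= cutoff else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def auc(labels, scores):
    """Probability that a random slow query outscores a random fast one."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up QTimes and model scores for five queries.
qtimes = [12, 250, 40, 130, 95]
labels = [label_slow(q) for q in qtimes]
scores = [0.1, 0.9, 0.6, 0.4, 0.2]
acc = accuracy(labels, scores)
area = auc(labels, scores)
```

On imbalanced data AUC is the more informative of the two, because a model that always predicts "fast" can still reach high accuracy. The pairwise AUC above is quadratic in the number of queries and is only meant for small illustrations.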
2. Query Pattern Recognition • Exceptions, zero hit queries • Analyze and find out why • Probabilistic Topic Modeling • Using MALLET open source toolkit
Future Direction - Thoth ML improvements:
• Predicting query time buckets • Regression v/s classification • Exceptions and zero hit query analysis • Sizing and resource allocation
- SolrCloud integration - Dashboard integration with SolrCloud - More standard metrics on Thoth Monitor - More data collection (load, GC)
Contributors and Special Thanks Damiano : [email protected] Praneet: [email protected]
Fork us on GitHub! github.com/trulia/thoth
JD Cantrell (API, Dashboard)
Giulio Grillanda (API, Dashboard)
Rajendra Shioramwar (Core)
Ying Wang (Design)
Girish Gudla (Monitor)
Alexander Kanarsky
Alex Burmester