Otis gospodnetic Search Analytics Lucene Eurocon 2011

Search Analytics: Business Value & NoSQL Backend
Otis Gospodnetić, Sematext International
@otisg, @sematext, sematext.com, sematext.com/search-analytics

See conference video - http://www.lucidimagination.com/devzone/events/conferences/ApacheLuceneEurocon2011

Transcript of Otis gospodnetic Search Analytics Lucene Eurocon 2011

Page 1: Otis gospodnetic Search Analytics Lucene Eurocon 2011

Search Analytics: Business Value & NoSQL Backend

Otis Gospodnetić, Sematext International
@otisg ◦ @sematext ◦ sematext.com

sematext.com/search-analytics

Page 2: Otis gospodnetic Search Analytics Lucene Eurocon 2011

Copyright 2011 Sematext Int'l. All rights reserved.

About Otis Gospodnetić

• ASF Member: Lucene, Solr, Nutch, Mahout

• Author: Lucene in Action 1 & 2

• Entrepreneur: Sematext, Simpy

Page 3: Otis gospodnetic Search Analytics Lucene Eurocon 2011

Sematext Metrics

• 100% organic: no GMO, no VC
• 4 years old
• < 10 people
• 7 countries
• 3 timezones
• 2 continents
• > 100 customers

Page 4: Otis gospodnetic Search Analytics Lucene Eurocon 2011

About Sematext

Products & Services

Consulting, Development, Tech Support:

• Search (Lucene, Solr, ElasticSearch...)
• Big Data (Hadoop, HBase, Voldemort...)
• Web Crawling (Nutch, Droids)
• Machine Learning (Mahout)

Page 5: Otis gospodnetic Search Analytics Lucene Eurocon 2011

Agenda

• What is Search Analytics and why it matters
• Example reports and their value
• What we built, why, and how

Page 6: Otis gospodnetic Search Analytics Lucene Eurocon 2011

Communication

• twitter.com/sematext
• twitter.com/otisg
• Hash tags: #stsa or #stanalytics
• http://sematext.com/search-analytics/index.html
• Raise your hand!
• [email protected]

Page 7: Otis gospodnetic Search Analytics Lucene Eurocon 2011

The Compass

Search logs are your Map

Search Analytics is your Compass

Page 8: Otis gospodnetic Search Analytics Lucene Eurocon 2011

High Level Why

search users

search providers

search experience

Page 9: Otis gospodnetic Search Analytics Lucene Eurocon 2011

High Level Why

search providers

search experience

This search sucks! It takes 17 tries to find anything here!

F!?@#$%^&?!?

search users

Cool, the latest search tweaks made our site really sticky!

Awesome!

Page 10: Otis gospodnetic Search Analytics Lucene Eurocon 2011

Don't Be Like This Dude

Page 11: Otis gospodnetic Search Analytics Lucene Eurocon 2011

Got Clue?

Search Analytics

Performance Monitoring

Quality Assurance

Tuning UI

Page 12: Otis gospodnetic Search Analytics Lucene Eurocon 2011

More Concrete Why

• Measure and monitor everything. Introspection.
• Supports (re)design, navigation choices
• Helps with content acquisition & enhancement
• Improve search experience
• Mula

Page 13: Otis gospodnetic Search Analytics Lucene Eurocon 2011

The Moment of Truth

Question for the audience #1

What do you use for Search Analytics?

a) Home grown stuff
b) Google Analytics
c) Omniture
d) Webtrends
e) Other
f) Nothing

Page 14: Otis gospodnetic Search Analytics Lucene Eurocon 2011

Search Analytics Outline

• Collect: queries & clicks & interactions & ...
• Analyze: actions / xactions / conversions
• Output: reports – over time
• Output++: feedback loop

• The means, not the goal
• Ongoing, not one-off

Remember this.

Page 15: Otis gospodnetic Search Analytics Lucene Eurocon 2011

Search vs. Web Analytics

• User intent and information needs vs. inferring
• Hand in hand
• Ideally you can relate data from both or even unify it

Page 16: Otis gospodnetic Search Analytics Lucene Eurocon 2011

Example Core Reports

• Rate & Volume, Latency (mean, avg, 90%)
• Click Through Rate, Mean Reciprocal Rank
• Top Queries by count, clicks, 0 hits...
• Query Trending
• Top Seen Docs, Top Clicked Docs (msft)
• Page & Click Depth
• Facet & Sort Usage
• ...
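
To make two of these reports concrete, here is a minimal Python sketch of Click Through Rate, Mean Reciprocal Rank, and a "0 hits" top-queries list. The `SearchEvent` record and its fields are hypothetical, not the format Sematext actually logs, and MRR is computed here from the rank of the first click, which is one common convention rather than the only one.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class SearchEvent:
    """One logged search (hypothetical record shape)."""
    query: str
    hits: int                           # number of results returned
    clicked_rank: Optional[int] = None  # 1-based rank of first click, None if no click


def click_through_rate(events: Sequence[SearchEvent]) -> float:
    """Fraction of searches that led to at least one click."""
    if not events:
        return 0.0
    clicked = sum(1 for e in events if e.clicked_rank is not None)
    return clicked / len(events)


def mean_reciprocal_rank(events: Sequence[SearchEvent]) -> float:
    """Average of 1/rank of the first click; searches with no click count as 0."""
    if not events:
        return 0.0
    total = sum(1.0 / e.clicked_rank for e in events if e.clicked_rank)
    return total / len(events)


def zero_hit_queries(events: Sequence[SearchEvent]) -> List[Tuple[str, int]]:
    """Queries that returned no results, most frequent first."""
    return Counter(e.query for e in events if e.hits == 0).most_common()


if __name__ == "__main__":
    log = [
        SearchEvent("solr", 10, clicked_rank=1),
        SearchEvent("hbase", 5, clicked_rank=2),
        SearchEvent("flume", 0),
        SearchEvent("nutch", 3),
    ]
    print(click_through_rate(log), mean_reciprocal_rank(log), zero_hit_queries(log))
```

A low MRR with a healthy CTR suggests users do find something, but not near the top of the result list.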

Page 17: Otis gospodnetic Search Analytics Lucene Eurocon 2011

More Reports in More Detail

See Search Analytics What? Why? How?

http://blog.sematext.com/tag/analytics/

Page 18: Otis gospodnetic Search Analytics Lucene Eurocon 2011

Part Dos

Switching gears...

Juno digs NoSQL

Page 19: Otis gospodnetic Search Analytics Lucene Eurocon 2011

What We've Built

• Search Analytics SaaS
• Numerous reports (e.g. query volume, rate, latency, term frequencies / comparisons, hit buckets, search origins, etc.)
• Trending over time
• Comparisons of time periods
• Top N reports
• Filter, slice and dice

Page 20: Otis gospodnetic Search Analytics Lucene Eurocon 2011

Who Needs a Compass?

We need it: search-hadoop.com & search-lucene.com

Our customers need it!

You?

Page 21: Otis gospodnetic Search Analytics Lucene Eurocon 2011

Sematext Search Analytics

Page 22: Otis gospodnetic Search Analytics Lucene Eurocon 2011

Big Dreams

• SaaS
• Multitenant
• Large Scale – Massive Data
• Cloud

Page 23: Otis gospodnetic Search Analytics Lucene Eurocon 2011

Storage Choices

• RDBMS: MySQL, PostgreSQL
• HDFS
• Hive
• HBase
• Cassandra

Page 24: Otis gospodnetic Search Analytics Lucene Eurocon 2011

SaaS vs. In-House

Question for the audience #2

SaaS vs in-house Search Analytics?

a) SaaS
b) In-house

Page 25: Otis gospodnetic Search Analytics Lucene Eurocon 2011

Sematext Search Analytics

Page 26: Otis gospodnetic Search Analytics Lucene Eurocon 2011

Sematext Search Analytics

Page 27: Otis gospodnetic Search Analytics Lucene Eurocon 2011

Sematext Search Analytics

Page 28: Otis gospodnetic Search Analytics Lucene Eurocon 2011

Sematext Search Analytics

Page 29: Otis gospodnetic Search Analytics Lucene Eurocon 2011

Data Flow

See "Search Analytics with Flume and HBase":

http://blog.sematext.com/2010/10/16/search-analytics-hadoop-world-flume-hbase/

Page 30: Otis gospodnetic Search Analytics Lucene Eurocon 2011

Data Collection

See "Search Analytics with Flume and HBase":

http://blog.sematext.com/2010/10/16/search-analytics-hadoop-world-flume-hbase/

Page 31: Otis gospodnetic Search Analytics Lucene Eurocon 2011

Core Tech

• JavaScript Beacons
• Metric Capture Web App aka Receiver
• Flume Agents, Collectors, Sinks
• HBase
• MapReduce Aggregations
• Search Analytics Reporting Web App

Page 32: Otis gospodnetic Search Analytics Lucene Eurocon 2011

What is Flume

• Distributed data/log collection service
• Scalable, configurable, extensible
• Centrally manageable, open source
• Agents get data from app, Collectors save it
• Abstractions: Source → Decorator(s) → Sink
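
The Source → Decorator(s) → Sink abstraction can be illustrated with a small Python sketch. This is not Flume's API or configuration syntax; every name here (`list_source`, `add_host`, `drop_empty`, `ListSink`, `run_flow`) is hypothetical and only models the shape of the idea: events come out of a source, pass through zero or more decorators that filter or enrich them, and end up in a sink.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Event = Dict[str, str]
Decorator = Callable[[Iterator[Event]], Iterator[Event]]


def list_source(lines: Iterable[str]) -> Iterator[Event]:
    """Source: turns raw lines (e.g. beacon requests) into events."""
    for line in lines:
        yield {"body": line}


def drop_empty(events: Iterator[Event]) -> Iterator[Event]:
    """Decorator: filters out events with an empty body."""
    for event in events:
        if event["body"].strip():
            yield event


def add_host(host: str) -> Decorator:
    """Decorator factory: tags each event with the host it came from."""
    def decorate(events: Iterator[Event]) -> Iterator[Event]:
        for event in events:
            yield {**event, "host": host}
    return decorate


class ListSink:
    """Sink: stores events in memory (a real sink would write to storage)."""
    def __init__(self) -> None:
        self.received: List[Event] = []

    def write(self, event: Event) -> None:
        self.received.append(event)


def run_flow(source: Iterable[Event], decorators: List[Decorator], sink: ListSink) -> int:
    """Chain source -> decorators -> sink and return the number of events written."""
    stream = iter(source)
    for decorator in decorators:
        stream = decorator(stream)
    written = 0
    for event in stream:
        sink.write(event)
        written += 1
    return written


if __name__ == "__main__":
    sink = ListSink()
    run_flow(list_source(["q=solr", "", "q=hbase"]), [drop_empty, add_host("web1")], sink)
    print(sink.received)
```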

Page 33: Otis gospodnetic Search Analytics Lucene Eurocon 2011

What is HBase

• Scalable, reliable, distributed, column-oriented DB
• On top of HDFS
• MapReducable

Page 34: Otis gospodnetic Search Analytics Lucene Eurocon 2011

Data Flow, Detailed

Page 35: Otis gospodnetic Search Analytics Lucene Eurocon 2011

Why Flume

• Reliable delivery, e.g. queue msgs locally if destination unreachable
• Easy, centralized management via Web UI or console
• Good community, good progress, now @ASF
• But: more complex, more moving parts
• On Flume: slideshare.net/cloudera/inside-flume
• Alternatives: Kafka, Scribe...
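
The "queue msgs locally if destination unreachable" point can be sketched as follows. This is an in-memory illustration only, not how Flume implements it; `BufferingSender` and the `send` callable are hypothetical, and a real agent would persist the queue to disk so that messages survive a restart.

```python
from collections import deque
from typing import Callable, Deque


class BufferingSender:
    """Queues messages locally and delivers them in order once the destination is reachable."""

    def __init__(self, send: Callable[[str], None], max_buffered: int = 10_000) -> None:
        # `send` is assumed to raise ConnectionError when the destination is down.
        self._send = send
        # A bounded deque silently drops the OLDEST message when full.
        self._pending: Deque[str] = deque(maxlen=max_buffered)

    def submit(self, message: str) -> None:
        """Enqueue a message, then try to drain the queue."""
        self._pending.append(message)
        self.flush()

    def flush(self) -> int:
        """Deliver pending messages in order; stop at the first failure."""
        delivered = 0
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                break  # destination unreachable: keep the rest queued
            self._pending.popleft()
            delivered += 1
        return delivered

    @property
    def pending(self) -> int:
        return len(self._pending)


if __name__ == "__main__":
    state = {"up": False}
    received = []

    def send(message: str) -> None:
        if not state["up"]:
            raise ConnectionError("collector unreachable")
        received.append(message)

    sender = BufferingSender(send)
    sender.submit("q=solr")     # queued, collector is down
    state["up"] = True
    sender.submit("q=hbase")    # both messages are delivered, in order
    print(received, sender.pending)
```

A message is removed from the queue only after `send` returns, so a failure mid-delivery can cause a retry but not a loss.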

Page 36: Otis gospodnetic Search Analytics Lucene Eurocon 2011

Why HBase

• Scalable raw & aggregate data storage
• MapReduce data input
• Fast scans for time ranges, fast key lookups
• Easy storage and compute power expansion
• Good looking roadmap, community, progress
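
"Fast scans for time ranges" follows from HBase keeping rows sorted by row key: if the key starts with a tenant id and a fixed-width timestamp, one tenant's events for a time window sit next to each other. The Python sketch below simulates that with a sorted list and `bisect`. The key layout (`tenant|timestamp|event id`) is an assumption for illustration, not Sematext's actual schema, and it assumes tenant ids never contain the `|` separator.

```python
import bisect
from typing import List


def row_key(tenant: str, epoch_ms: int, event_id: str) -> str:
    """Hypothetical row key: zero-padded timestamp keeps lexicographic order == time order."""
    return f"{tenant}|{epoch_ms:013d}|{event_id}"


def scan_time_range(sorted_keys: List[str], tenant: str, start_ms: int, stop_ms: int) -> List[str]:
    """Return keys for `tenant` with start_ms <= timestamp < stop_ms.

    Mimics a start-row / stop-row scan over a table sorted by row key.
    """
    start = f"{tenant}|{start_ms:013d}|"
    stop = f"{tenant}|{stop_ms:013d}|"
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, stop)
    return sorted_keys[lo:hi]


if __name__ == "__main__":
    table = sorted([
        row_key("acme", 1000, "e1"),
        row_key("acme", 2000, "e2"),
        row_key("acme", 3000, "e3"),
        row_key("zeta", 1500, "e9"),
    ])
    print(scan_time_range(table, "acme", 1000, 3000))
```

One known trade-off of time-ordered keys is that new writes cluster at the end of each tenant's key range, which is why key prefixing or salting schemes are often layered on top.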

Page 37: Otis gospodnetic Search Analytics Lucene Eurocon 2011

Open Sourcing

2 open-source projects:

• github.com/sematext/HBaseWD
• github.com/sematext/HBaseHUT

See sematext.com/open-source/index.html

Patches for Flume and HBase: blog.sematext.com/tag/flume/

Page 38: Otis gospodnetic Search Analytics Lucene Eurocon 2011

Challenges

• Data size. Solutions:
  ◦ Compression (4-5x smaller with lzo)
  ◦ Data pruning (variable levels)
• Query string distribution: very long-tail
• Lots of data to process, update, aggregate
• Young tools: Flume, HBase
• Poor IO on EC2 Hadoop distributions
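
The slide does not say how pruning is done. One plausible reading, given the long-tail query distribution, is to keep per-query counts only for queries above a threshold and roll the rest into a single aggregate. The Python sketch below shows that idea under that assumption; `prune_long_tail` and its parameters are hypothetical.

```python
from collections import Counter
from typing import Tuple


def prune_long_tail(counts: Counter, min_count: int) -> Tuple[Counter, int, int]:
    """Keep queries seen at least `min_count` times; summarize the rest.

    Returns (kept counts, total searches in the pruned tail, distinct queries pruned).
    """
    kept: Counter = Counter()
    tail_total = 0
    tail_distinct = 0
    for query, n in counts.items():
        if n >= min_count:
            kept[query] = n
        else:
            tail_total += n
            tail_distinct += 1
    return kept, tail_total, tail_distinct


if __name__ == "__main__":
    counts = Counter({"solr": 50, "hbase": 20, "solr 3.1 faceting bug": 1,
                      "flume sink hbase": 1, "nutch crawl depth": 2})
    print(prune_long_tail(counts, min_count=5))
```

Raising `min_count` for older time periods would give the "variable levels" effect: recent data stays detailed while history shrinks.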

Page 39: Otis gospodnetic Search Analytics Lucene Eurocon 2011

Output++

• AutoComplete - $MM improvement
• Better DYM Spellchecker
• Related Searches
• Recommendations
• Relevance Feedback
• ...
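
As an example of feeding analytics output back into search, here is a minimal Python sketch of AutoComplete driven by logged query counts. It is an illustration, not Sematext's implementation: `build_counts` and `suggest` are hypothetical names, and a linear scan over all queries only works for small data sets (a production system would use a prefix index).

```python
from collections import Counter
from typing import Iterable, List


def build_counts(logged_queries: Iterable[str]) -> Counter:
    """Count queries after light normalization (trim + lowercase)."""
    return Counter(q.strip().lower() for q in logged_queries if q.strip())


def suggest(counts: Counter, prefix: str, limit: int = 5) -> List[str]:
    """Return the most frequently searched queries that start with `prefix`."""
    prefix = prefix.strip().lower()
    matches = [(q, n) for q, n in counts.items() if q.startswith(prefix)]
    # Most popular first; ties broken alphabetically for stable output.
    matches.sort(key=lambda item: (-item[1], item[0]))
    return [q for q, _ in matches[:limit]]


if __name__ == "__main__":
    counts = build_counts(["Solr", "solr", "solr cloud", "HBase", "solr faceting", " solr "])
    print(suggest(counts, "sol", limit=2))
```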

Page 40: Otis gospodnetic Search Analytics Lucene Eurocon 2011

Closing the Loop

search users

search providers

search experience

Page 41: Otis gospodnetic Search Analytics Lucene Eurocon 2011

Resource

http://rosenfeldmedia.com/books/searchanalytics/

Search Analytics for Your Site, Louis Rosenfeld

Page 42: Otis gospodnetic Search Analytics Lucene Eurocon 2011

We're Hiring

Dig Search?

Dig Analytics?

Dig Big Data?

Dig Performance?

Dig working with and in open-source?

We're hiring world-wide!

http://sematext.com/about/jobs.html

Page 43: Otis gospodnetic Search Analytics Lucene Eurocon 2011

Contact

• sematext.com
• blog.sematext.com
• @sematext
• @otisg
• [email protected]

Want SA? Grab me or go to: sematext.com/search-analytics

Hash tags: #stsa or #stanalytics