Clustering Search Log Data

Presented at the Harvard ABCD-WWW/CMS session, Nov. 15, 2012. A previous version of this talk was presented at Enterprise Search Europe, May 2012.

Page 1: Clustering Search Log Data

Copyright © President & Fellows of Harvard College.

Sophy Bishop & Ravi Mynampaty

Clustering Search Query Log Data to Improve Search

Page 2: Clustering Search Log Data

Agenda

Background

Five W’s of Clustering

• What, why, who, how, when

Is it really repeatable?

Questions

Page 3: Clustering Search Log Data

About Information Management Services (IMS)

- Standards - Best Practices - User Needs - Service Models

Analytics

Metadata Mgmt.

Taxonomy Dev.

Search

Lifecycle Mgmt.

Page 4: Clustering Search Log Data

Inspired by…

Chapters 8 & 9

Page 5: Clustering Search Log Data

About this talk…

Case study on how we are improving search and browse by performing clustering exercises on your search query data

Not rocket science

High-level overview

You can follow this method, with your own insights and tweaks

You can kick this off next week at your work

Page 6: Clustering Search Log Data

What is clustering?

A process for organizing and analyzing search log data that:

Is repeatable, low-cost, scalable, simple

Yields actionable results

Supports constant incremental improvement to search

Page 7: Clustering Search Log Data

What’s clustering good for?

Ensure results for high-frequency queries

Improve metadata and taxonomy

Inform and validate decision making in site IA

Inform editorial/curatorial activities

Provide feedback for search suggestions

o Autosuggest, synonym lists, no-hits page suggestions

But more on this later...

Page 8: Clustering Search Log Data

So how do I cluster search queries?

A simple set of steps

Create query report

Cluster queries

Determine # queries to analyze

Analyze clusters

Draw conclusions and ACT

Page 9: Clustering Search Log Data

Step 1: Create a query report

We started with the site with the most traffic

• Upper-bound limit

• One year’s data by quarter

• Cut off tail at frequency < 10

HBS Working Knowledge FY12 Use Snapshot

Overall Traffic

Page Views: 6,439,485

Visits: 3,635,746

Unique visitors: 2,734,620

On-site searches: 174,425

Views per Visit: 1.77

Local Search visit rate: 5%

Organic Search visit rate: 46%
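A minimal sketch of Step 1 in Python, assuming the raw queries are already available as a list of strings (the function name `build_query_report` and the sample data are hypothetical, not part of the talk). It counts each query and cuts off the tail at frequency < 10:

```python
from collections import Counter

def build_query_report(raw_queries, min_frequency=10):
    """Count raw query strings and drop the long tail below min_frequency."""
    counts = Counter(q.strip() for q in raw_queries if q.strip())
    # Keep only queries seen at least min_frequency times, most frequent first.
    return [(q, n) for q, n in counts.most_common() if n >= min_frequency]

# Hypothetical sample; in practice this list would come from a search log export.
sample_log = ["innovation"] * 12 + ["leadership"] * 10 + ["rare query"] * 3
report = build_query_report(sample_log)
print(report)
```

Running the same function once per quarter would give the "one year's data by quarter" view described above.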

Page 11: Clustering Search Log Data

Step 2: Cluster the queries

Page 12: Clustering Search Log Data

Step 2 (cont’d): Three levels of clustering

Level | Method | Example
Narrow | Simple normalization | Eliminate grammatical, spelling, typo, and punctuation differences
Mid-level | Group by subject | management, finance, decision making
Broad | Group by facet | topic, name, date, content type
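As a rough illustration of the narrow level, the sketch below merges queries that differ only in case, punctuation, or spacing. The function names and frequencies are hypothetical. Spelling variants and typos are not handled here and would still need manual review or fuzzy matching:

```python
import re
from collections import Counter

def normalize(query):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    q = query.lower()
    q = re.sub(r"[^\w\s]", " ", q)
    return re.sub(r"\s+", " ", q).strip()

def narrow_cluster(report):
    """Merge the frequencies of queries that normalize to the same string."""
    merged = Counter()
    for query, freq in report:
        merged[normalize(query)] += freq
    return merged.most_common()

# Hypothetical frequencies for three surface variants of one query.
sample = [("Balanced Scorecard", 40), ("balanced scorecard", 25), ("balanced-scorecard.", 11)]
clusters = narrow_cluster(sample)
print(clusters)
```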

Page 13: Clustering Search Log Data

Step 2 (cont’d): Levels Tasks Enabled

Level | Improve your base for query analysis | Ensure representation of major clusters on your site | Improve Metadata/Index/Taxonomy | Improve Search Suggestions
Narrow (simple) | X X X
Mid-level (group by subject) | X X X
Broad (group by facet) | X X

Page 14: Clustering Search Log Data

Step 2 (cont’d): Narrow Clustering Example

Page 15: Clustering Search Log Data

Step 2 (cont’d): Mid-level Example

Cluster: brand

Query | Frequency
branding | 245
brand | 160
brand management | 73
consumer branding | 57
global brand | 32
service brands | 24
brand image retail bank | 17
employer branding | 16
brand management professional services | 16
global branding | 13
b2b branding | 13
importance of branding | 12
brand 2002 | 12
brand equity | 11
brand image | 11
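One way to approximate this mid-level grouping in code is a simple prefix match on a seed term. This is a sketch under that assumption, not the method used in the talk, which relied on human judgment. The helper names are hypothetical and the "leadership styles" row is invented to show a non-member:

```python
brand_report = [
    ("branding", 245), ("brand", 160), ("brand management", 73),
    ("consumer branding", 57), ("global brand", 32), ("service brands", 24),
    ("brand image retail bank", 17), ("employer branding", 16),
    ("brand management professional services", 16), ("global branding", 13),
    ("b2b branding", 13), ("importance of branding", 12), ("brand 2002", 12),
    ("brand equity", 11), ("brand image", 11),
]

def in_cluster(query, seed):
    """True if any word in the query starts with the seed term."""
    return any(word.startswith(seed) for word in query.lower().split())

def cluster_total(report, seed):
    """Return the matching (query, frequency) pairs and their summed frequency."""
    members = [(q, n) for q, n in report if in_cluster(q, seed)]
    return members, sum(n for _, n in members)

# "leadership styles" is a hypothetical query that should not join the cluster.
members, total = cluster_total(brand_report + [("leadership styles", 20)], "brand")
print(len(members), total)
```

Prefix matching is crude: it would also pull in an unrelated word such as "brandy", so the resulting list still needs a human pass.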

Page 17: Clustering Search Log Data

Step 2 (cont’d): Mid-level Example

[Bar chart labeled “customer”: values 333, 179, 145, 111, 101, 88, 40, 26, 26, 25, 20, 19, 15, 14, 12, 12, 11, 11, 10, 10, 10; vertical axis from 0 to 350]

Page 18: Clustering Search Log Data

Step 2 (cont’d): Broad Clustering Example

Page 19: Clustering Search Log Data

Step 2 (cont’d): List of facets we used

Facet | Example
content type | case studies, cases, working papers, articles, newspaper
date | 2011, world in 2030
demographic characteristics | women, Gen Y, gender, baby boomers
event | economic crisis
format | podcast, video
geographic area | india, japan, mount everest
industry | global wine industry
job type/role | independent director, entrepreneur, ceo, phd economist
organization name | ikea, zara, toyota
person name | michael porter, kanter, sebenius
product name / brand name | ipad
product/commodity | coffee, wine, cement
topic | this covers the majority of keywords
work | faculty work, ex: publication name, title of a case
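A hedged sketch of how broad, facet-level tagging might be bootstrapped: a small lookup seeded with the examples above, a year pattern for the date facet, and "topic" as the catch-all. The names `FACET_TERMS` and `facet_for` are hypothetical, and real lists would be far longer and curated by hand:

```python
import re

# Hypothetical lookup lists seeded from the examples in the facet table.
FACET_TERMS = {
    "content type": {"case studies", "cases", "working papers", "articles"},
    "format": {"podcast", "video"},
    "geographic area": {"india", "japan", "mount everest"},
    "organization name": {"ikea", "zara", "toyota"},
    "person name": {"michael porter", "kanter", "sebenius"},
}
YEAR = re.compile(r"\b(19|20)\d{2}\b")

def facet_for(query):
    """Return the first matching facet, falling back to the catch-all 'topic'."""
    q = query.lower().strip()
    for facet, terms in FACET_TERMS.items():
        if q in terms:
            return facet
    if YEAR.search(q):
        return "date"
    return "topic"

for example in ["IKEA", "world in 2030", "podcast", "negotiation"]:
    print(example, "->", facet_for(example))
```

The year rule is deliberately naive: a query such as "brand 2002" would be tagged as a date, which is the kind of case that calls for human review.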

Page 20: Clustering Search Log Data

Step 3: Choose #clusters to analyze

Number of Clusters Analyzed | Analyze Top Hits | Improve Metadata/Taxonomy/Index | Supply Search Suggestions
50 | X
150 | X X
300+ | X X X

Page 21: Clustering Search Log Data

Small # Clusters can cover a lot of your data

Number of top clusters % Total Queries

Top 20 clusters 14

Top 30 clusters 18

Top 50 clusters 26

Top 100 clusters 37
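The percentages in a table like this can be computed with a few lines of Python. This sketch assumes you have one summed frequency per cluster plus the total number of queries in the report; the function name and sample numbers are hypothetical:

```python
def coverage(cluster_totals, top_n, total_queries):
    """Percent of all query traffic captured by the top_n largest clusters."""
    top = sorted(cluster_totals, reverse=True)[:top_n]
    return 100.0 * sum(top) / total_queries

# Hypothetical data: four cluster totals out of 2,000 queries overall.
cluster_totals = [100, 400, 200, 300]
share = coverage(cluster_totals, top_n=2, total_queries=2000)
print(share)
```

Here `total_queries` is assumed to include queries that were never clustered, so the result reads as a share of all search traffic in the report.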

Page 22: Clustering Search Log Data

Now you have your clusters…

What do you do with them?

TAKE ACTION!

Page 23: Clustering Search Log Data

Analyze Top (“Short Head”) Clusters

Clustering has created a condensed and reliable list of your top search queries

Are they what you thought they would be?

Does the information on your site accurately represent the top searches?

Are you fulfilling user needs?

Page 24: Clustering Search Log Data

Use your clusters: Improve Site Navigation

Examine the short head of clusters:

For each cluster, add up the frequencies of its queries

Reorder clusters by cumulative frequency, descending

Ensure top clusters are accounted for in your navigation

Use cluster topics as browse/navigation headers/footers for your website
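The steps above can be sketched as follows, assuming each query has already been assigned to a cluster. The function names, per-query frequencies, and navigation labels are all hypothetical:

```python
from collections import defaultdict

def rank_clusters(assignments):
    """assignments: iterable of (cluster, query, frequency) tuples."""
    totals = defaultdict(int)
    for cluster, _query, freq in assignments:
        totals[cluster] += freq
    # Order clusters by cumulative frequency, largest first.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

def missing_from_navigation(ranked, nav_labels, top_n=10):
    """Top clusters with no matching navigation label (case-insensitive)."""
    nav = {label.lower() for label in nav_labels}
    return [cluster for cluster, _ in ranked[:top_n] if cluster.lower() not in nav]

# Hypothetical assignments and navigation labels, for illustration only.
assignments = [
    ("innovation", "innovation", 500),
    ("innovation", "open innovation", 100),
    ("ethics", "ethics", 300),
    ("leadership", "leadership", 400),
]
ranked = rank_clusters(assignments)
gaps = missing_from_navigation(ranked, ["Innovation", "Leadership"])
print(ranked, gaps)
```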

Page 25: Clustering Search Log Data

WK Top Clusters

Cluster Frequency

innovation 867

balanced scorecard 794

leadership 570

cases 545

social media 508

negotiation 470

knowledge management 457

ethics 448

apple 430

corporate social responsibility 398

Page 26: Clustering Search Log Data

Use your clusters: Improve Taxonomy

• Missing categories in browse taxonomy

• "Balanced Scorecard"

• “Ethics”

• “Social media”

• Second-level topics in the WK context

Page 29: Clustering Search Log Data

Mid-level clustering:

Informs editorial/curatorial activities

“Featured Topics”

o What topics to highlight this week/month/year

o News items to focus on

o What research guides to create

o How to formulate queries for the topics

Page 30: Clustering Search Log Data

Use your clusters: Improve Synonym Handling

Clustered list provides synonyms for taxonomy

Requires human judgment and standards/guidelines for synonyms – in our case, synonyms are exact

Map to one "like term" in the search engine

Example:

Balanced Scorecard, BSC, Balanced score card, kaplan and norton -> Balanced Scorecard
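A small sketch of exact synonym mapping, using the example above. The dictionary layout and the function name `to_preferred` are assumptions for illustration; a real search engine would have its own synonym configuration format:

```python
# Preferred term -> exact variants, taken from the example above.
SYNONYMS = {
    "balanced scorecard": ["bsc", "balanced score card", "kaplan and norton"],
}

# Invert to a variant -> preferred-term lookup.
LOOKUP = {
    variant: preferred
    for preferred, variants in SYNONYMS.items()
    for variant in variants
}

def to_preferred(query):
    """Map an exact variant to its preferred term; leave other queries unchanged."""
    q = " ".join(query.lower().split())
    return LOOKUP.get(q, q)

for example in ["BSC", "Balanced  Score Card", "ethics"]:
    print(example, "->", to_preferred(example))
```

Because matching is exact after lowercasing and whitespace cleanup, a longer query such as "bsc examples" is left untouched, in line with the "synonyms are exact" rule.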

Page 31: Clustering Search Log Data

Use your clusters: Improve no-hits page

Page 32: Clustering Search Log Data

Time Commitment

• 2 hours to 2 weeks

• Variables include:

• What kind of information you want to gather

• How broad or narrow you want your clusters

• How many queries you analyze

• In our case ~2 person-weeks

• We had Sophy Bishop

• Intern, MSLIS student

Page 33: Clustering Search Log Data

Results vs. Time Invested

Time invested | Analyze top clusters | Update Taxonomy | Create New Metadata | Determine New Search Suggestions
2 Hours | X X
6 Hours | X X X
One Week | X X X X

Page 34: Clustering Search Log Data

Next Steps: Autosuggest

Your top clusters probably make up a large percentage of what people are looking for

o Use them to establish/supplement auto-suggest!

Example: suggestions for “innovation”

o innovation and leadership

o disruptive innovation

o innovation management

o open innovation
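As an illustration, suggestions like these could be drawn from a cluster by returning its most frequent queries that contain what the user typed. The function name `suggest` and the frequencies below are hypothetical, since the talk does not give per-query counts for this cluster:

```python
def suggest(typed, cluster_queries, limit=4):
    """Return up to `limit` cluster queries containing the typed text, by frequency."""
    typed = typed.lower().strip()
    # Skip the query the user has already typed in full.
    matches = [(q, n) for q, n in cluster_queries if typed in q and q != typed]
    matches.sort(key=lambda item: item[1], reverse=True)
    return [q for q, _ in matches[:limit]]

# Hypothetical frequencies for the "innovation" cluster.
innovation_cluster = [
    ("innovation", 300),
    ("open innovation", 40),
    ("innovation and leadership", 90),
    ("innovation management", 60),
    ("disruptive innovation", 80),
]
suggestions = suggest("innovation", innovation_cluster)
print(suggestions)
```

Substring matching is used rather than prefix matching so that queries such as "disruptive innovation" can surface for "innovation".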

Page 35: Clustering Search Log Data

Next Steps: New Access Structures

Needed an obvious way to search podcasts

o Put in best bets for now

A lot of people searching for article titles

o Considering simple interface/approach for select field-specific search, e.g. “title”

Consider adding other facets to browse taxonomy where we have entities tagged

o “company name”, “job type/class”, etc.

Page 36: Clustering Search Log Data

Next Steps

SEO Input

o Advise authors to use top cluster terms in Titles, Abstracts, Keywords

o Report on clusters in our monthly analytics reports to faculty (“Top search topics/subjects in May 2012 were…”; “Searchers found your works with the following queries”)

Repeat process on other sites/content

Page 37: Clustering Search Log Data

Summary

Establish a plan/process, but be willing to tweak it as you go

Keep it very simple.

Play with your data – the more we played, the better we understood what benefits could be realized by levels of clustering and effort

Tuning process/results

o Build staging/working prototypes

o Repeat process on other sites

TAKE ACTION!

Page 38: Clustering Search Log Data

Thank you!

Questions?

@sophreads

@ravimynampaty