Search Query Log Analysis Kristina Lerman. What can we learn from web search queries?...


Page 1:

Search Query Log Analysis

Kristina Lerman

Page 2:

What can we learn from web search queries?

• Characteristics
  – Length has steadily grown over the years
    • 1990s: < 2 terms
    • 2001: 2.4 terms
    • 2014: long search queries, e.g., “where is the nearest coffee shop”
  – Heavy-tailed distribution of term frequency
  – Billions of queries
• User intentions
  – Aggregate query words with results of search to learn user’s needs, wants, goals
  – Create a database of commonsense knowledge
    • Cf. Cyc
• Does data exist?
  – AOL search query log
  – Google Trends

Page 3:

2006 AOL search query log dataset

• ~20M web queries
• ~650K users
• 3-month period: March 1 – May 31, 2006
• Data format
  – AnonID: an anonymous user ID number
  – Query: the query issued by the user
  – QueryTime: time the query was submitted
  – ItemRank: rank of the item clicked in the results
  – ClickURL: the domain of the clicked item
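The five fields listed above can be read into a small record type. This is a minimal sketch, assuming one tab-separated record per line and empty ItemRank/ClickURL fields when nothing was clicked; the `QueryRecord` and `parse_record` names, the timestamp format, and the sample rows are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QueryRecord:
    anon_id: int
    query: str
    query_time: str
    item_rank: Optional[int]   # None when the user did not click a result
    click_url: Optional[str]   # None when the user did not click a result


def parse_record(line: str) -> QueryRecord:
    """Parse one tab-separated log line (assumed layout) into a QueryRecord."""
    fields = line.rstrip("\n").split("\t")
    # Pad short rows so that records without a click still yield five fields.
    fields += [""] * (5 - len(fields))
    anon_id, query, query_time, item_rank, click_url = fields[:5]
    return QueryRecord(
        anon_id=int(anon_id),
        query=query,
        query_time=query_time,
        item_rank=int(item_rank) if item_rank else None,
        click_url=click_url or None,
    )


# Made-up example rows, not taken from the real log.
clicked = parse_record(
    "1001\tside effects of xanax\t2006-03-01 10:15:00\t2\thttp://www.example.com"
)
no_click = parse_record("1001\tlong term vioxx use\t2006-03-02 08:00:00\t\t")
```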

Page 4:

Timeline

• 8/4/06: Announcement to SIG-IRList from AOL
• 8/6/06: TechCrunch slams AOL over privacy
• 8/7/06: Dataset removed
• 8/9/06: NYTimes identifies user 4417749
  – Thelma Arnold, 62, from Lilburn, Georgia
• 8/21/06: AOL CTO Maureen Govern resigns
  – AOL researcher and supervisor are fired

Page 5:

Page 6:

Weakly-supervised discovery of named entities using web search queries

Marius Pasca (Google)
CIKM-07: Conference on Information and Knowledge Management, Lisbon, Portugal

Page 7:

Weakly Supervised Discovery of Named Entities using Web Search Queries (2007)

• Goal: Automatically extract knowledge (entities) from texts created by many people
  – Discover new instances of classes
    • Red Alert is a videogame
    • Lilburn is a town
    • Lorazepam is a drug
• For what purpose?
  – Cataloging human knowledge
  – Understanding searching users
    • #399392 in Lilburn takes Lorazepam, plays Red Alert

Page 8:

Intuition

• Templates in queries
  “side effects of xanax pills”
  “side effects of birth control pills”
  “side effects of lipitor pills”
  …
  – Prefix: “side effects of”
  – Postfix: “pills”
• But templates are difficult to specify
  – Cf. extraction patterns in web information retrieval

Page 9:

“Weakly”-supervised approach

• Guided by a small set of known seed instances
  – Input is a target class and some examples
    • Drug: {phentermine, viagra, vicodin, vioxx, xanax}
    • City: {london, paris, san francisco, tokyo, toronto}
    • Food: {chicken, fish, milk, tomatoes, wheat}
  – Identify the patterns seed instances occur in
• Learn many more new instances automatically
  – Use patterns to find more instances

Page 10:

Step 1: Identify query templates

• Identify all queries that contain each known class instance
  • vioxx
• Extract left and right context
  – “long term vioxx use”
    • Prefix: “long term”
    • Postfix: “use”
    • Infix: “vioxx”
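Step 1 can be sketched in a few lines. The version below is only an illustration, assuming whitespace tokenization and exact token matches; the function name `extract_templates` and the toy queries are not from the paper.

```python
def extract_templates(queries, seeds):
    """Return the set of (prefix, postfix) pairs found around seed instances."""
    templates = set()
    for query in queries:
        tokens = query.lower().split()
        for seed in seeds:
            seed_tokens = seed.lower().split()
            n = len(seed_tokens)
            # Slide a window of the seed's length over the query tokens.
            for i in range(len(tokens) - n + 1):
                if tokens[i:i + n] == seed_tokens:
                    prefix = " ".join(tokens[:i])
                    postfix = " ".join(tokens[i + n:])
                    # A query that is only the seed gives no context; skip it.
                    if prefix or postfix:
                        templates.add((prefix, postfix))
    return templates


queries = ["long term vioxx use", "side effects of xanax pills", "vioxx"]
templates = extract_templates(queries, ["vioxx", "xanax"])
```

Here `templates` holds the pairs ("long term", "use") and ("side effects of", "pills"); the bare query "vioxx" contributes nothing because it has no surrounding context.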

Page 11:

Step 2: Generate candidate instances

• Go over the query log again
• Identify all queries that match a template
• Collect query infixes as candidate instances

{low blood pressure, xanax, lamictal, generic birth control, lipitor, vicodin, beta blockers, …}
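A possible sketch of Step 2, assuming a template is a (prefix, postfix) pair of strings and a query matches when it starts with the prefix and ends with the postfix. The name `find_candidates` and the sample queries are illustrative.

```python
def find_candidates(queries, templates):
    """Collect the infix of every query that matches some (prefix, postfix) template."""
    candidates = set()
    for query in queries:
        q = " ".join(query.lower().split())  # normalize case and whitespace
        for prefix, postfix in templates:
            head = prefix + " " if prefix else ""
            tail = " " + postfix if postfix else ""
            if q.startswith(head) and q.endswith(tail):
                # Whatever lies between prefix and postfix is the candidate.
                infix = q[len(head):len(q) - len(tail)].strip()
                if infix:
                    candidates.add(infix)
    return candidates


templates = {("side effects of", "pills"), ("long term", "use")}
queries = [
    "side effects of lipitor pills",
    "side effects of birth control pills",
    "long term lamictal use",
    "cheap flights to paris",
]
candidates = find_candidates(queries, templates)
```

As the slide's example set shows, the raw candidates are noisy ("low blood pressure" is not a drug), which is why the later steps rank them.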

Page 12:

Step 3. Compile search signatures

• Each candidate is represented as a vector
  – Each template is a dimension
  – Weighted by frequency in queries
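One way to picture a search signature is as a sparse count vector keyed by template. The sketch below assumes raw query counts as weights (the slide only says "weighted by frequency") and uses a brute-force match that would be too slow for a real log; all names and data are illustrative.

```python
from collections import Counter, defaultdict


def build_signatures(queries, templates, candidates):
    """Map each candidate to a Counter of {template: number of matching queries}."""
    signatures = defaultdict(Counter)
    for query in queries:
        q = " ".join(query.lower().split())
        for prefix, postfix in templates:
            for cand in candidates:
                # Rebuild the query this template would produce for the candidate.
                expected = " ".join(p for p in (prefix, cand, postfix) if p)
                if q == expected:
                    signatures[cand][(prefix, postfix)] += 1
    return dict(signatures)


queries = [
    "side effects of xanax pills",
    "side effects of xanax pills",
    "long term xanax use",
    "side effects of lipitor pills",
]
templates = [("side effects of", "pills"), ("long term", "use")]
signatures = build_signatures(queries, templates, ["xanax", "lipitor"])
```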

Page 13:

Step 4. Reference signatures

• Vectors for example class instances are combined

• Prototype of search signature for the class
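The slide does not say how the seed vectors are combined. A simple assumption is to sum the seed counts per template and normalize the result so it can be treated as a distribution; the sketch below does exactly that, with made-up template labels and counts.

```python
from collections import Counter


def reference_signature(seed_signatures):
    """Sum the seeds' template counts and normalize them to sum to 1 (assumed scheme)."""
    total = Counter()
    for signature in seed_signatures:
        total.update(signature)  # adds counts template by template
    norm = sum(total.values())
    return {template: count / norm for template, count in total.items()} if norm else {}


# Made-up signatures for two seed drugs.
vioxx = {"side effects of _ pills": 1, "long term _ use": 1}
xanax = {"side effects of _ pills": 2}
reference = reference_signature([vioxx, xanax])
```

With these counts the prototype puts weight 0.75 on the "side effects of _ pills" template and 0.25 on "long term _ use".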

Page 14:

Example

Page 15:

Step 5. Compute signature similarity

• Vector similarity between reference signature and candidate signature
  – Jensen-Shannon similarity function

• Output is rank-ordered list

Drug: {viagra, phentermine, ambien, adderall, vicodin, hydrocodone, xanax, vioxx, oxycontin, cialis, valium, lexapro, ritalin, zoloft, percocet, …}
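For reference, a minimal sketch of the Jensen-Shannon divergence between two normalized signatures, using base-2 logarithms so the value lies between 0 (identical) and 1 (no shared templates). Ranking by ascending divergence is an assumption about how the "similarity function" on the slide is turned into an ordering; the template labels and weights are made up.

```python
import math


def kl(p, m):
    """Kullback-Leibler divergence KL(p || m) in bits; m must cover p's support."""
    return sum(pv * math.log2(pv / m[k]) for k, pv in p.items() if pv > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two distributions given as dicts."""
    keys = set(p) | set(q)
    p = {k: p.get(k, 0.0) for k in keys}
    q = {k: q.get(k, 0.0) for k in keys}
    m = {k: 0.5 * (p[k] + q[k]) for k in keys}  # midpoint distribution
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


reference = {"side effects of _ pills": 0.75, "long term _ use": 0.25}
candidate_signatures = {
    "low blood pressure": {"causes of _": 1.0},
    "ambien": {"side effects of _ pills": 0.7, "long term _ use": 0.3},
}
# Most similar candidates (smallest divergence from the reference) come first.
ranking = sorted(
    candidate_signatures,
    key=lambda c: js_divergence(reference, candidate_signatures[c]),
)
```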

Page 16:

Evaluation

Page 17:

Repeatability

• Need enormous database of search query logs
  – Probably best done at Google or Microsoft

• What can be done with small query databases?

• What types of social media text could this method be applied to?

Page 18:

Classifying the user intent of web queries using k-means clustering

Ashish Kathuria, Bernard J. Jansen, Carolyn Hafernik, and Amanda Spink

Page 19:

Problem Introduction

The WWW plays a vital role in many people’s daily lives

Nearly 70 percent of searchers use a search engine

Search engines receive hundreds of millions of queries per day and return billions of results per week in response to these queries.

Smart users: Novel and increasingly assorted ways of searching!

Page 20:

Understanding intent behind searching

Can help to improve search engine performance via page ranking, result clustering, advertising, and presentation of results

Page 21:

Approach

• Automatically classify a large set of queries from a web search engine log as informational, navigational, or transactional.

• Encode the characteristics of informational, navigational, and transactional queries identified in prior work to develop an automatic classifier using k-means clustering.

• Use data-mining techniques to automatically classify queries by user intent more accurately

• Overcome limitations of previous research:
  – Small datasets
  – Limited methodology

Page 22:

Classification of Queries

Images from http://moz.com/blog/segmenting-search-intent

Page 23:

Research methodology

• Dataset: Transaction log from Dogpile. Each record has fields such as user identification, cookie, time of day, query terms, and source

Step 1: Creating sessions and removing duplicates
• The fields Time of day, User identification, Cookie, and Query were used to locate the initial query of a session and then recreate the series of actions in the session.
• Searches were collapsed using user identification, cookie, and query to eliminate duplicates of result and null queries

Page 24:

Research methodology

Step 2: Generating additional attributes
• Calculated three additional attributes for each record: query length, query reformulation, and result page

Step 3: Assignment of terms
1. Navigational:
  • Queries containing company/business/organization/people names
  • Queries containing portions of URLs or even complete URLs
2. Transactional:
  • Analysis, specifically via the identification of key terms related to transactional domains such as entertainment and ecommerce
3. Informational:
  • Queries that use natural language terms
  • Longer sessions than for informational searching

Page 25:

Research methodology

Step 4: Textual data to numerical data

Step 5: Converting string to vector

Page 26:

K-means Clustering

Clusters: Navigational, Informational, Transactional

The resulting data set had four attributes that could be used for classification: query length, source, query reformulation rate, and user intent weight of the query
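To make the clustering step concrete, here is a small pure-Python sketch of Lloyd's k-means algorithm on four-dimensional points. The feature values are invented and stand in for already-numeric (query length, source, reformulation rate, intent weight) vectors; the slides do not give the actual encoding or initialization, so the fixed initial centroids are an assumption made for reproducibility.

```python
import math


def kmeans(points, centroids, iterations=20):
    """Lloyd's algorithm; `centroids` holds the initial cluster centers."""
    assignment = []
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        assignment = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's center
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return assignment, centroids


# Hypothetical feature vectors, two per intended group.
points = [
    (1.0, 0.0, 0.0, 0.1), (1.0, 0.0, 0.1, 0.1),
    (5.0, 1.0, 0.8, 0.5), (6.0, 1.0, 0.9, 0.5),
    (2.0, 2.0, 0.1, 0.9), (2.0, 2.0, 0.2, 0.9),
]
assignment, centroids = kmeans(points, [points[0], points[2], points[4]])
```

K-means only produces unlabeled clusters; mapping each cluster to navigational, informational, or transactional intent is a separate step that this sketch does not cover.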

Page 27:

Results

• Performed on various datasets and achieved 94% accuracy
• Overall, about 76 percent of the queries were classified as informational, while about 12 percent were classified as transactional, and 12 percent were classified as navigational

Page 28:

Results

• Navigational queries: Low rates of reformulation, typically sessions of just one query.

• Informational queries: Low occurrences of query reformulation, probably indicating relatively easy information needs, such as fact finding

• Transactional queries: Shorter queries

Page 29:

Discussion of approach

• Limitations:
  – Is the Dogpile user population representative of web search engine users in general?
  – What if a prototype has multiple user intents associated with it?
  – Is relying solely on transaction logs sufficient?
• Future scope:
  – Investigate subcategories
  – A laboratory study on how searchers express their underlying intent
  – Develop algorithmic approaches for more in-depth analysis of individual queries

The approach has a high success rate, uses a large data set of queries, and does not depend on external content, which makes it implementable in real time.

Page 30:

Summary

• Identifying the user intent of web queries is very useful for web search engines because it would allow them to provide more relevant results to searchers and more precisely targeted sponsored links.

• Classifying queries helps in focused search:
  – Informational queries: Provide relevant information and ads
  – Navigational queries: Provide links straight to a requested web page
  – Transactional queries: Focus on all commercial links for future purchase as well

• The use of k-means as an automatic clustering and classification technique yielded positive results and opened effective ways to improve performance of web search engines.

Page 31:

-Neha Mundada

Page 32:

Acquiring Explicit Goals from Search Query Logs

• Understanding human goals is necessary to
  – Recognize goals of actions
  – Create a plan
    • E.g., ‘plan a trip to Vienna’ has subgoals
      – ‘contact travel agent’
      – ‘book hotel’
      – ‘buy concert tickets’, etc.
• Automatically acquire human goals from search query logs
  – Acquire and organize commonsense knowledge

Page 33:

Research overview

• Research question:
  – Whether and how search query logs can be utilized to overcome the problem of acquiring knowledge about human goals
• Following an exploratory research style, we intend to show that search query logs:
  – Contain a small but interesting number of user goals
  – Allow these goals to be separated by automatic methods
• Results:
  – Knowledge about the automatic acquisition of goals from search query logs
  – Knowledge about the nature of goals extracted from search query logs

Page 34:

Results of Human Subject Study

• 4 independent raters
• Labeled 3000 queries
• Examples
  – bug killing devices
  – mothers working from home
  – how to lose weight
• Classes appear to be separable

Page 35:

Experimental Setup

• AOL search query log
  – ~20 million search queries
  – Recorded between March 1 and May 31, 2006
  – Ethical issues
• Pre-processing steps to reduce noise
  – 5 million queries
• Labeled queries from the human subject study were utilized as training examples (controversial queries were omitted)

Page 36:

Classification approach

• Part-of-speech tagging
  – Maximum entropy tagger converts a sequence of words into a sequence of POS tags
  – Example
    • Query “buy a car” → buy/VB a/DT car/NN
    • Set of words: {buy, car}
    • Part-of-speech trigrams: $ VB DT NN $ → {$ VB DT, VB DT NN, DT NN $}
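The feature construction in the example can be reproduced with a short sketch. The tags are hard-coded here rather than produced by a tagger, and the small stopword set used to drop "a" is a guess, since the slide does not say how the word set {buy, car} was derived.

```python
def pos_trigrams(tags, boundary="$"):
    """Pad the tag sequence with boundary markers and return all tag trigrams."""
    padded = [boundary] + list(tags) + [boundary]
    return [" ".join(padded[i:i + 3]) for i in range(len(padded) - 2)]


# Tagged query from the slide: buy/VB a/DT car/NN.
tagged = [("buy", "VB"), ("a", "DT"), ("car", "NN")]

STOPWORDS = {"a", "an", "the"}  # hypothetical; not specified in the slides
words = {word for word, _ in tagged if word not in STOPWORDS}
trigrams = pos_trigrams([tag for _, tag in tagged])
```

For this query, `words` is {buy, car} and `trigrams` is ["$ VB DT", "VB DT NN", "DT NN $"], matching the slide.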

Page 37:

Classification approach (2)

• Linear Support Vector Machine [Dumais98]
  – Robust and effective in the area of text classification
  – Weka Machine Learning Toolkit: http://www.cs.waikato.ac.nz/ml/weka/
• Performance:
  – 10 trials
  – 3-fold cross validation
  – Precision, recall, and F1-measure for the class “queries containing goals”
    • Precision = 0.77
    • Recall = 0.63
    • F1-Measure = 0.69
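The reported F1-measure is the harmonic mean of precision and recall, so the three numbers can be checked against each other. The function name below is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


f1 = f1_score(0.77, 0.63)
print(round(f1, 2))  # 0.69, since 2 * 0.77 * 0.63 / 1.40 = 0.693
```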

Page 38:

N-fold cross validation

• Problem: limited amount of labeled data
• Solution: N-fold cross validation
  • Divide data into N equal segments (folds)
  • Training data: N-1 folds
  • Testing data: remaining fold
  • Repeat for remaining test folds and average results
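The procedure can be sketched without any machine-learning library. The helper names are illustrative, `evaluate` stands for any function that trains on one list and scores on the other, and in practice the data would usually be shuffled before splitting (for example once per trial).

```python
def n_fold_splits(items, n):
    """Yield (train, test) pairs; each fold serves as the test set exactly once."""
    folds = [items[i::n] for i in range(n)]  # near-equal folds by striding
    for i in range(n):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


def cross_validate(items, n, evaluate):
    """Average the score returned by evaluate(train, test) over the n folds."""
    scores = [evaluate(train, test) for train, test in n_fold_splits(items, n)]
    return sum(scores) / len(scores)


data = list(range(9))
splits = list(n_fold_splits(data, 3))
# Toy "score": the fraction of the data held out in each fold.
mean_score = cross_validate(data, 3, lambda train, test: len(test) / len(data))
```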

Page 39:

Goals are diverse

• Rank-frequency plot of goals is heavy-tailed
  – A few goals are shared by many users
  – The majority of goals are shared by only a few users
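A rank-frequency table of this kind can be built by counting how often each goal occurs and sorting by count. The goal strings and counts below are invented for illustration and are not taken from the study.

```python
from collections import Counter


def rank_frequency(goals):
    """Return (rank, goal, frequency) triples, most frequent goal first."""
    counts = Counter(goals)
    return [
        (rank, goal, freq)
        for rank, (goal, freq) in enumerate(counts.most_common(), start=1)
    ]


# Made-up goal occurrences: one popular goal and a tail of rare ones.
goals = (
    ["lose weight"] * 5
    + ["get pregnant"] * 2
    + ["buy a car", "learn to dance", "quit smoking"]
)
table = rank_frequency(goals)
singletons = sum(1 for _, _, freq in table if freq == 1)
```

Plotting frequency against rank on log-log axes is the usual way to see the heavy tail: a handful of high-frequency goals followed by many goals that occur only once.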

Page 40:

Most frequent goals

Page 41:

Most frequent goals with “get”, “make”, “change” and “be”

Page 42:

Summary

• Web search queries are an abundant, but very sparse and very noisy, source of data about needs, desires, intentions of people

• Clever methods can learn from these diverse data
  – Named entities
  – Goals

• Can these methods be used in social media?