SearchInFocus: Exploratory Study on Query Logs and Actionable Intelligence

SearchInFocus: Exploratory Study on Query Logs and Actionable Intelligence
Marina Santini
Exploratory Query-log Analysis Workshop, organized by Findwise AB - www.findwise.com/
Thursday, October 25, 2012 from 10:00 AM to 12:00 PM (CEST), Lund, Sweden
SLTC 2012: Fourth Swedish Language Technology Conference, October 24-26, 2012, Lund.
LAST UPDATED: 26 OCTOBER 2012

description

Query logs are an important source of information for surmising users' intents. Although Karlgren (2010) points out that “There are several reasons to be cautious in drawing too far-reaching conclusions: we cannot say for sure what the users were after; [...]”, some linguistic problems could be sorted out by applying more advanced text/content analytics, such as register/sublanguage identification and terminology classification (see Friberg Heppin, 2011). In this presentation, I will argue that query logs can be considered a digital textual genre like emails, blogs, chats, tweets and so forth. All these genres contain unstructured information that, still today, is difficult to leverage satisfactorily. The hypothesis that I would like to put forward in this workshop is that query logs might be easier to exploit than other digital genres for extracting useful information and actionable intelligence.

Transcript of SearchInFocus: Exploratory Study on Query Logs and Actionable Intelligence

Page 1: SearchInFocus: Exploratory Study on Query Logs and Actionable Intelligence

SearchInFocus: Exploratory Study on Query Logs and Actionable Intelligence

Marina Santini

Exploratory Query-log Analysis Workshop
Organized by Findwise AB - www.findwise.com/

Thursday, October 25, 2012 from 10:00 AM to 12:00 PM (CEST)
Lund, Sweden

SLTC 2012: Fourth Swedish Language Technology Conference, October 24-26, 2012, Lund.

LAST UPDATED: 26 OCTOBER 2012

Page 2: SearchInFocus: Exploratory Study on Query Logs and Actionable Intelligence

Query Logs and Actionable Intelligence: Questions to LinkedIn-ers

• “Can anyone suggest references about mining query logs for BI and CEM?” (3rd May 2012) [BI=Business Intelligence; CEM=Customer Experience Management]

• Applying Findability to Mine Query Logs for BI: Preliminaries “How can I profitably use query logs for making better business decisions and predict future trends?” (14th May 2012)

• Mining Query Logs: Query Disambiguation & Understanding through a KB “some linguistic problems can be sorted out -- for example those related to sublanguage, terminology, multi-word expressions, etc. -- through a dictionary-shaped knowledge base where the different uses of language are stored and continually updated. I will call this knowledge base DaisyKB” (21st May 2012)

Page 3: SearchInFocus: Exploratory Study on Query Logs and Actionable Intelligence

My preliminary reflections based on this info…

• “The average length of a search query was 2.4 terms”

• "A 2005 study of Yahoo's query logs revealed 33% of the queries from the same user were repeat queries and that 87% of the time the user would click on the same result. This suggests that many users use repeat queries to revisit or re-find information. This analysis is confirmed by a Bing search engine blog post telling about 30% queries are navigational queries."

• “… much research has shown that query term frequency distributions conform to the power law, or long tail distribution curves. That is, a small portion of the terms observed in a large query log (e.g. > 100 million queries) are used most often, while the remaining terms are used less often individually.”

• “… in a recent study in 2011 it was found that the average length of queries has grown steadily over time and average length of non-English languages queries had increased more than English queries.”

Page 4: SearchInFocus: Exploratory Study on Query Logs and Actionable Intelligence

Then came the corpus…

• Enterprise query logs: VGR (27 August 2012) – easier to handle and interpret than general-purpose search engines’ query logs!

Page 5: SearchInFocus: Exploratory Study on Query Logs and Actionable Intelligence

So… that’s the Outline

1. The query log genre
2. Actionable Intelligence
3. A possible use case
4. Preliminary conclusions

Page 6: SearchInFocus: Exploratory Study on Query Logs and Actionable Intelligence

What is a (textual) genre?

• Simply, simply, simply put:
  – A genre is a class of texts

Page 7: SearchInFocus: Exploratory Study on Query Logs and Actionable Intelligence

What characterizes a genre?

1. Must have a name
2. Must be recognized within a community
3. Must be produced during a task
4. Must have conventions
5. Must raise expectations
6. Can change over time. It is a cultural artifact (culture here includes society, media, technology, etc.)

Page 8: SearchInFocus: Exploratory Study on Query Logs and Actionable Intelligence

Genre Characterization

1. Name formation: a genre must indicate a class, a family (for genre name formation, see Görlach, 2004). Recent webgenres: blogs, tweets, chatlogs, etc.

2. Community: a genre is not something individual. A genre is a textual form that is used and recognized by a community (whereas style can be individual). Ex: blogs → bloggers and blog readers; academic home pages → academics; etc.

3. Task: a genre meets a RECURRENT communication need. Ex: the personal home page genre tells us something about a person; a technical blog is informative about a specific technology; etc.

4. Conventions: ex: a personal blog is made of posts organized in chronological order where a blogger communicates personal and subjective views on some facts.

5. Expectations: when reading a personal blog, readers expect to read something personal (personal facts or personal opinions) and expect the possibility to leave a comment if they wish to do so.

6. A genre is a cultural artifact: it might evolve over time (see the history of blogs by Rebecca Blood, 2000) or might disappear if society changes (ex: chansons de geste). New genres emerge with new media, new technologies, new information needs.

Page 9: SearchInFocus: Exploratory Study on Query Logs and Actionable Intelligence

The query log genre is… a novel and fully-emerged webgenre

1. Name: in line with other digital genres (ex: web log → blog)
2. Community: internet users, IR practitioners
3. Task: information needs specified in a search engine
4. Conventions: short texts written in “keywordese”
5. Expectations: to find relevant information
6. Cultural artifact: a product of our media-based, internet-based society OR a by-product of search engines

Page 10: SearchInFocus: Exploratory Study on Query Logs and Actionable Intelligence

The query log genre: Linguistic and Textual Conventions

• Length: short text (a query log can be seen as a corpus of very short texts, shorter than tweets, mobile text messages, chat logs, etc.)

• Sublanguage/Jargon: “keywordese”
• Register: neutral
• Morphology: LITTLE
• Syntax: OCCASIONALLY (usually no articles, no prepositions, no subclauses, etc.)

Page 11: SearchInFocus: Exploratory Study on Query Logs and Actionable Intelligence

Query Log Genre: The Benefits

• Expressed in a “lean” sublanguage, keywordese:
  – reduced morphology
  – reduced syntax
  – short texts
  – mostly nouns and verbs

• Reduced size: compare a 2-year collection of emails vs. a 2-year collection of query logs

• = REDUCED SIZE, REDUCED PRE-PROCESSING; NO DATA CLEANING!

Page 12: SearchInFocus: Exploratory Study on Query Logs and Actionable Intelligence

Expectations: a text written by a user for a search engine to find relevant information

• The texts (queries) must express information needs aka users’ intents

• It is good practice to be cautious with the interpretation of users’ intents. However…

• If we mine query logs with a simple quantitative approach, it is possible to extract recurrent information needs and build upon them…

Page 13: SearchInFocus: Exploratory Study on Query Logs and Actionable Intelligence

Actionable Intelligence

• It must be accurate and verifiable
• It must be timely
• It must be comprehensive
• It must be comprehensible
• It must give the ability to act on that information straightaway

Page 14: SearchInFocus: Exploratory Study on Query Logs and Actionable Intelligence

I would argue: a Query Log is an “Actionable” Corpus

• Let’s see…

Page 15: SearchInFocus: Exploratory Study on Query Logs and Actionable Intelligence

Mining query logs for actionable intelligence: Description and Basic Statistics

• Corpus Time frame: 2010-2011 (2 years)

• “These logs come from the search at hittavard.vgregion.se. The biggest bulk should come from 1177.se. The rest should be from vgregion.se. The target audience are both VGR (Västra Götalands Region) users/employees as well as the general public, as it is a public site. The internal files are searches made from within the VGR…”

• Corpus size:
  – size = 3,167 KB (only queries) (BIG DATA is usually > 1TB)
  – number of queries = 249,243
  – number of words = 306,453

• Average query length: 1.23 words
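The basic statistics above can be reproduced with a few lines of code. The following is a minimal sketch, assuming the log is available as a plain list of query strings (one query per entry) and that words are separated by whitespace; the function name `basic_stats` and the toy sample are illustrative and are not taken from the VGR data.

```python
def basic_stats(queries):
    """Return (number of queries, number of words, average words per query).

    Assumption: a "word" is any whitespace-separated token.
    """
    n_queries = len(queries)
    n_words = sum(len(q.split()) for q in queries)
    avg_length = n_words / n_queries if n_queries else 0.0
    return n_queries, n_words, avg_length


# Toy sample for illustration only (not the actual VGR log).
sample = ["egenremiss", "mina vårdkontakter", "webbisar", "förnya recept"]
print(basic_stats(sample))  # (4, 6, 1.5)

# Sanity check on the corpus figures reported on the slide.
print(round(306_453 / 249_243, 2))  # 1.23
```

The last line simply divides the reported number of words by the reported number of queries, which agrees with the average query length of 1.23 words.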

Page 16: SearchInFocus: Exploratory Study on Query Logs and Actionable Intelligence

Case study enterprise search – VGR

FINDWISE SLIDESHARE: http://www.slideshare.net/findwise/case-study-enterprise-search-vgr

http://www.vgregion.se/en/Vastra-Gotalandsregionen/Home/

Page 17: SearchInFocus: Exploratory Study on Query Logs and Actionable Intelligence

Business Decision: Improve Search Quality and Usability to Increase Users’ Satisfaction & Competitiveness

Page 18: SearchInFocus: Exploratory Study on Query Logs and Actionable Intelligence

How?

• The simplest approach…

ANALYZE THE HEAD

Page 19: SearchInFocus: Exploratory Study on Query Logs and Actionable Intelligence

(1) Take the Top-Ranked Queries
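Taking the head of the distribution amounts to counting identical queries and keeping the most frequent ones. A minimal sketch follows, assuming the log is a list of raw query strings; lowercasing and trimming whitespace before counting are assumptions of this sketch, not steps described in the slides, and `top_queries` is a hypothetical helper name.

```python
from collections import Counter


def top_queries(queries, k=30):
    """Return the k most frequent queries as (query, frequency) pairs."""
    # Assumption: case and surrounding whitespace are not meaningful.
    counts = Counter(q.strip().lower() for q in queries if q.strip())
    return counts.most_common(k)


# Toy log for illustration only.
log = ["egenremiss", "Egenremiss", "webbisar", "egenremiss ", "sjukresor", "webbisar"]
for query, freq in top_queries(log, k=2):
    print(freq, query)
# 3 egenremiss
# 2 webbisar
```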

Page 20: SearchInFocus: Exploratory Study on Query Logs and Actionable Intelligence

(2) Use them as TAGS (metadata creation)

1. egenremiss
2. mina vårdkontakter
3. webbisar
4. sjukresor
5. vårdgaranti
6. sjukresa
7. mammografi
8. vårdval
9. influensa
10. urinvägsinfektion
11. halsfluss
12. förnya recept
13. magkatarr
14. vattkoppor
15. byta vårdcentral
16. blanketter
17. svinkoppor
18. reseersättning
19. klamydia
20. feber
21. högkostnadsskydd
22. vinterkräksjukan
23. patientombudsman
24. öroninflammation
25. logga in
26. frikort
27. hosta
28. magsjuka
29. njursten
30. als

Tags are keywords describing the content

Page 21: SearchInFocus: Exploratory Study on Query Logs and Actionable Intelligence

(3) Use TAG metadata to automatically annotate only documents selected by users
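One way to picture this step: whenever a user issues a top-ranked query and then selects a document, the query is attached to that document as a tag. The sketch below assumes a click log of (query, document id) pairs is available; the slides do not describe the log format, so the data layout, the document ids and the function name `annotate_clicked_documents` are all hypothetical.

```python
from collections import defaultdict


def annotate_clicked_documents(click_log, tags):
    """Attach a tag to every document that users selected after issuing it.

    click_log: iterable of (query, doc_id) pairs (assumed format).
    tags: the top-ranked queries chosen as tags.
    """
    tag_set = {t.lower() for t in tags}
    annotations = defaultdict(set)
    for query, doc_id in click_log:
        normalized = query.strip().lower()
        if normalized in tag_set:  # only tag queries produce annotations
            annotations[doc_id].add(normalized)
    return dict(annotations)


# Hypothetical data for illustration only.
tags = ["egenremiss", "sjukresor"]
click_log = [
    ("egenremiss", "doc-17"),
    ("Sjukresor", "doc-42"),
    ("hosta", "doc-99"),       # not a tag, so doc-99 stays unannotated
    ("egenremiss", "doc-42"),
]
annotations = annotate_clicked_documents(click_log, tags)
print(annotations)
```

Documents that nobody selected receive no tags, which matches the idea of annotating only documents selected by users.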

Page 22: SearchInFocus: Exploratory Study on Query Logs and Actionable Intelligence

Watch out!

Page 23: SearchInFocus: Exploratory Study on Query Logs and Actionable Intelligence

Mismatch or Ambiguity?

Page 24: SearchInFocus: Exploratory Study on Query Logs and Actionable Intelligence

(4) Use most frequent queries to create a query suggester
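A very simple suggester can be built directly on the query frequencies: given what the user has typed so far, return the most frequent logged queries that start with that prefix. This is a minimal sketch under that assumption (prefix matching, ranking by frequency); the function name `suggest` is hypothetical, and the counts are a small subset of the top query frequencies listed later in this deck.

```python
def suggest(prefix, query_counts, limit=5):
    """Return up to `limit` logged queries starting with `prefix`, most frequent first."""
    p = prefix.strip().lower()
    if not p:
        return []
    matches = [(q, n) for q, n in query_counts.items() if q.startswith(p)]
    # Sort by descending frequency, then alphabetically for stable output.
    matches.sort(key=lambda item: (-item[1], item[0]))
    return [q for q, _ in matches[:limit]]


# Subset of the top query frequencies from the VGR log.
query_counts = {
    "sjukresor": 8787,
    "sjukresa": 3938,
    "vårdgaranti": 7345,
    "vårdval": 3723,
    "vattkoppor": 2394,
}
print(suggest("sjuk", query_counts))  # ['sjukresor', 'sjukresa']
print(suggest("vård", query_counts))  # ['vårdgaranti', 'vårdval']
```

A linear scan is enough for a sketch; a production suggester over a large log would more likely use a prefix index such as a trie.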

Page 25: SearchInFocus: Exploratory Study on Query Logs and Actionable Intelligence

(5) If you want, you can sort queries automatically into query types and build…

• a taxonomy

• The categories of the taxonomy can also be used to annotate existing documents automatically (another layer of METADATA)
  – TAGS describe the content
  – CATEGORIES IN A TAXONOMY organize the content
  – Categories can be hierarchical whereas tags cannot
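As an illustration of sorting queries into query types, the sketch below uses hand-written keyword rules. This is only one possible approach and is not the method used in the study; the category names and the keyword lists are invented for the example, and only the Swedish query terms come from the VGR frequency list.

```python
# Hypothetical taxonomy: category names and rules are assumptions for illustration.
TAXONOMY_RULES = {
    "conditions": ["influensa", "halsfluss", "feber", "hosta"],
    "travel": ["sjukres", "reseersättning"],
    "e-services": ["egenremiss", "förnya recept", "logga in"],
}


def categorize(query, rules=TAXONOMY_RULES):
    """Return every category whose keywords occur in the query (multi-label)."""
    q = query.strip().lower()
    labels = [cat for cat, keys in rules.items() if any(k in q for k in keys)]
    return labels or ["uncategorized"]


print(categorize("Sjukresor"))   # ['travel']
print(categorize("influensa"))   # ['conditions']
print(categorize("blanketter"))  # ['uncategorized']
```

Because a query can match several categories, the output is a list of labels, which fits the multi-label annotation idea discussed on the next slide.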

Page 26: SearchInFocus: Exploratory Study on Query Logs and Actionable Intelligence

If you want, you can give the taxonomy to document creators, so they can annotate the text with metadata

• … in short you will have a multilabelled corpus that can be used with machine learning.

Page 27: SearchInFocus: Exploratory Study on Query Logs and Actionable Intelligence

The importance of metadata to structure unstructured data & to extract actionable intelligence

• From Unstructured Data to Actionable Intelligence by Ramana Rao, 2003

• “We access information for various purposes and in various ways according to our purpose. Sometimes we’re surveying an area of knowledge, trying to get a general understanding of what it’s about or what’s available. At other times we’re searching for specific answers. […] It is this range of purpose and context that we can better address by providing a richer set of information access tools based on exploiting metadata.”

Page 28: SearchInFocus: Exploratory Study on Query Logs and Actionable Intelligence

Linguistic Remarks

• At the top of the frequency list:
  – Nouns
  – Compounds
  – A+N
  – V+N

• More complex constructions at the bottom

Page 29: SearchInFocus: Exploratory Study on Query Logs and Actionable Intelligence

Syntactic Patterns

Page 30: SearchInFocus: Exploratory Study on Query Logs and Actionable Intelligence

In this case, automatic annotation can help a lot

Page 31: SearchInFocus: Exploratory Study on Query Logs and Actionable Intelligence

Benefit for the Search Provider

• Mining query logs to extract user-created knowledge, i.e. queries that can be used as tags (metadata)

• Quickly create domain-specific taxonomies you can capitalize upon, especially for new client companies working in related fields

• Enhancements of current search products

• Inexpensive creation of annotated corpora: document annotation through query logs is a simple technique that in a short time will build massive annotated corpora to use for machine learning, which will allow more sophisticated search refinements.

Page 32: SearchInFocus: Exploratory Study on Query Logs and Actionable Intelligence

Benefits for Clients & End Users

• Somebody said: SEARCH MUST BE A MIND READER!
• BUT ALSO faster, friendlier, more exhaustive and more accurate.
• If this happens, clients will spend less on customer care. If you find what you need online, there is no need to call a helpdesk or customer care service.

Page 33: SearchInFocus: Exploratory Study on Query Logs and Actionable Intelligence

Query Pre-processing?

Absolutely YES

• Normalization
  – egen remiss & egenremiss
  – Spelling correction

• Terminology expansion (domain-dependent)
  – anemi & blodbrist (ex: taken from Friberg Heppin, 2010; ex: painkiller & analgesic)
  – Stemming/Lemmatization (blanketter → blankett; sjukresor → sjukresa)

If you want… Nj

• Compound decomposition & Tokenization. Text chunks (such as queries) are more informative and less ambiguous than single words. No need to tokenize or decompose, if RECALL is ok.

• Ontology? Uhm… not sure we need a semantic structure here…
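The pre-processing steps listed under "Absolutely YES" can be sketched with small lookup tables. The example below is an assumption-laden illustration: it uses hand-made dictionaries containing only the examples from the slide, where a real system would rely on a proper lemmatizer and a domain terminology resource. Spelling correction is left out, and all names (`VARIANTS`, `LEMMAS`, `SYNONYMS`, `normalize`, `expand`) are hypothetical.

```python
import re

# Hand-made lookup tables holding only the examples given on the slide.
VARIANTS = {"egen remiss": "egenremiss"}                      # spelling variants
LEMMAS = {"blanketter": "blankett", "sjukresor": "sjukresa"}  # plural -> lemma
SYNONYMS = {"anemi": ["blodbrist"], "blodbrist": ["anemi"]}   # domain terms


def normalize(query):
    """Lowercase, collapse whitespace, merge known variants, lemmatize known words."""
    q = re.sub(r"\s+", " ", query.strip().lower())
    q = VARIANTS.get(q, q)
    return " ".join(LEMMAS.get(token, token) for token in q.split())


def expand(query):
    """Return the normalized query followed by any known domain synonyms."""
    q = normalize(query)
    return [q] + SYNONYMS.get(q, [])


print(normalize("Egen  remiss"))  # egenremiss
print(normalize("Sjukresor"))     # sjukresa
print(expand("anemi"))            # ['anemi', 'blodbrist']
```

Note that the whole query is normalized as a chunk before any per-word step, in line with the remark that queries are more informative than single words.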

Page 34: SearchInFocus: Exploratory Study on Query Logs and Actionable Intelligence

Tokenization? Domain-dependent?

Top query frequencies

• 21388 egenremiss
• 17360 mina vårdkontakter
• 10553 webbisar
• 8787 sjukresor
• 7345 vårdgaranti
• 3938 sjukresa
• 3734 mammografi
• 3723 vårdval
• 3653 influensa
• 2908 urinvägsinfektion
• 2803 halsfluss
• 2542 förnya recept
• 2460 magkatarr
• 2394 vattkoppor
• 2274 byta vårdcentral
• 2256 blanketter
• 1878 svinkoppor
• 1840 reseersättning
• 1653 klamydia
• 1559 feber
• 1525 högkostnadsskydd
• 1420 vinterkräksjukan
• 1405 patientombudsman
• 1326 öroninflammation
• 1252 logga in
• 1251 frikort
• 1199 hosta
• 1193 magsjuka
• 1184 njursten
• 1167 als

Top word frequencies

• 21565 egenremiss
• 17717 vårdkontakter
• 17407 mina
• 10567 webbisar
• 8880 sjukresor
• 7357 vårdgaranti
• 4044 sjukresa
• 3763 vårdcentral
• 3754 mammografi
• 3732 influensa
• 3730 vårdval
• 2932 urinvägsinfektion
• 2819 halsfluss
• 2805 recept
• 2543 förnya
• 2463 magkatarr
• 2413 vattkoppor
• 2349 i
• 2296 byta
• 2269 blanketter
• 1881 svinkoppor
• 1840 reseersättning
• 1802 feber
• 1666 klamydia
• 1571 högkostnadsskydd
• 1422 vinterkräksjukan
• 1405 patientombudsman
• 1383 hepatit
• 1338 öroninflammation
• 1331 frikort
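The difference between the two lists comes from the counting unit: whole queries versus whitespace-separated words. A minimal sketch of both counts follows, assuming a list of raw query strings; the function name and the toy log are illustrative only.

```python
from collections import Counter


def query_and_word_counts(queries):
    """Count whole queries and individual whitespace-separated words."""
    cleaned = [q.strip().lower() for q in queries if q.strip()]
    query_counts = Counter(cleaned)
    word_counts = Counter(word for q in cleaned for word in q.split())
    return query_counts, word_counts


# Toy log for illustration only.
log = [
    "mina vårdkontakter",
    "mina vårdkontakter",
    "vårdkontakter",
    "byta vårdcentral",
    "vårdcentral",
]
query_counts, word_counts = query_and_word_counts(log)
print(query_counts["mina vårdkontakter"])  # 2
print(word_counts["vårdkontakter"])        # 3
print(word_counts["mina"])                 # 2
```

Tokenizing splits a multi-word query such as "mina vårdkontakter" into words that are far less informative on their own ("mina", and in the real list even "i"), which is the reason for questioning tokenization on this slide.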

Page 35: SearchInFocus: Exploratory Study on Query Logs and Actionable Intelligence

Different search users’ behaviour: Enterprise vs. Web?

VGR: Swedish – Enterprise Search

Första hjälpen till psykisk hälsa (MHFA-Sverige): Swedish – Web Search

Page 36: SearchInFocus: Exploratory Study on Query Logs and Actionable Intelligence

Preliminary reflections revisited…

• “The average length of a search query was 2.4 terms” … uhm… it depends: enterprise vs. web

• “A 2005 study of Yahoo's query logs revealed 33% of the queries from the same user were repeat queries and that 87% of the time the user would click on the same result. This suggests that many users use repeat queries to revisit or re-find information. This analysis is confirmed by a Bing search engine blog post telling about 30% queries are navigational queries.” … not investigated

• “much research has shown that query term frequency distributions conform to the power law, or long tail distribution curves. That is, a small portion of the terms observed in a large query log (e.g. > 100 million queries) are used most often, while the remaining terms are used less often individually.” … definitely yes

• “in a recent study in 2011 it was found that the average length of queries has grown steadily over time and average length of non-English languages queries had increased more than English queries.” … uhm… it depends: enterprise vs. web + language

Page 37: SearchInFocus: Exploratory Study on Query Logs and Actionable Intelligence

Conclusions from this Exploration

• Query logs are a genre that is easier to exploit than other digital genres for extracting actionable intelligence.

• Query logs are a good, handy and economical source of information for actionable business decisions, such as:
  – keeping a cutting-edge profile on the market,
  – enhancing enterprise search usability (query suggester/autofill),
  – disambiguation,
  – annotation and taxonomy creation,
  – preventing huge costs for customer helpdesk and similar services through a cutting-edge search functionality!

• Future: More and diversified use cases…

Page 38: SearchInFocus: Exploratory Study on Query Logs and Actionable Intelligence

QUESTIONS?

THANK YOU FOR YOUR ATTENTION