Search, Signals & Sense: An Analytics Fueled Vision

Post on 26-Jan-2015


Keynote presented by Seth Grimes at the Open Source Search Conference, October 2, 2012

Transcript of Search, Signals & Sense: An Analytics Fueled Vision

Search, Signals & Sense: An Analytics Fueled Vision

Seth Grimes, @sethgrimes

A Sense Making Story

New York Times, September 30, 2012

New York Times, September 8, 1957

Valium: Starting a Chain of Connections

H.P. Luhn

By H.P. Luhn, in IBM Journal, April 1958

http://altaplana.com/ibm-luhn58-LiteratureAbstracts.pdf

Modelling Text

“Statistical information derived from word frequency and distribution is used by the machine to compute a relative measure of significance, first for individual words and then for sentences. Sentences scoring highest in significance are extracted and printed out to become the auto-abstract.”

-- H.P. Luhn, The Automatic Creation of Literature Abstracts, IBM Journal, 1958.
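The idea in the quote above can be sketched in a few lines of Python. This is a simplified illustration, not Luhn's exact procedure: his paper scores sentences by clusters of significant words, whereas this sketch just averages word frequencies per sentence. The stop-word list, the sample text, and the function names are assumptions made for the example.

```python
import re
from collections import Counter

# Assumed stop-word list (illustrative only; not taken from Luhn's paper).
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are",
              "it", "that", "for", "on", "by", "with", "as", "be"}

# Invented sample text for the demonstration.
SAMPLE = ("Nerve cells send signals. "
          "Chemical messengers carry signals between nerve cells. "
          "The weather was pleasant. "
          "Drugs can alter how chemical messengers act on nerve cells.")


def split_sentences(text):
    # Naive splitter: break on whitespace that follows ., ! or ?
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def words(text):
    # Lowercased alphabetic tokens.
    return re.findall(r"[a-z']+", text.lower())


def auto_abstract(text, num_sentences=2):
    """Return the highest-scoring sentences, kept in document order."""
    sentences = split_sentences(text)
    # Word significance = frequency across the whole document.
    freq = Counter(w for w in words(text) if w not in STOP_WORDS)

    def score(sentence):
        tokens = words(sentence)
        if not tokens:
            return 0.0
        # Sentence significance = mean significance of its words.
        return sum(freq[w] for w in tokens if w not in STOP_WORDS) / len(tokens)

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:num_sentences])]


print(auto_abstract(SAMPLE, 2))
```

Dividing by sentence length keeps long sentences from winning on size alone; other normalizations would be just as defensible.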

Luhn’s analysis of Messengers of the Nervous System, a Scientific American article

http://wordle.net, applied to the NY Times article

New York Times, September 8, 1957

Luhn’s Example

Close Reading

Can Software Make the Connection?

Mark Lombardi, George W. Bush, Harken Energy and Jackson Stephens, c. 1979-90, Detail

There and Back Again: Modelling Text, 2

The text content of a document can be considered an unordered “bag of words.”

Particular documents are points in a high-dimensional vector space.

Salton, Wong & Yang, “A Vector Space Model for Automatic Indexing,” November 1975.

Modelling Text, 3

We might construct a document-term matrix...
• D1 = “I like databases”
• D2 = “I hate hate databases”

and use a weighting such as TF-IDF (term frequency–inverse document frequency)…

in computing the cosine of the angle between weighted doc-vectors to determine similarity.

      I   like   hate   databases
D1    1   1      0      1
D2    1   0      2      1

http://en.wikipedia.org/wiki/Term-document_matrix
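A minimal sketch of the computation described on this slide, using only the Python standard library. The two documents and the column order come from the slide; the smoothed IDF formula is an assumption of this example. With only two documents, the plain log(N/df) weight is zero for every term the documents share, which would drive the weighted similarity to zero.

```python
import math

# Documents and vocabulary order as shown on the slide.
DOCS = {"D1": "I like databases", "D2": "I hate hate databases"}
VOCAB = ["I", "like", "hate", "databases"]


def term_counts(text, vocab):
    # One row of the document-term matrix: raw term frequencies.
    tokens = text.split()
    return [tokens.count(term) for term in vocab]


def smoothed_idf(matrix):
    # Assumed smoothing: ln((1 + N) / (1 + df)) + 1, so shared terms keep some weight.
    n_docs = len(matrix)
    weights = []
    for col in range(len(matrix[0])):
        df = sum(1 for row in matrix if row[col] > 0)
        weights.append(math.log((1 + n_docs) / (1 + df)) + 1.0)
    return weights


def cosine(u, v):
    # Cosine of the angle between two vectors; 0.0 if either is all zeros.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


matrix = [term_counts(DOCS[name], VOCAB) for name in ("D1", "D2")]
idf = smoothed_idf(matrix)
weighted = [[tf * w for tf, w in zip(row, idf)] for row in matrix]

raw_similarity = cosine(matrix[0], matrix[1])
tfidf_similarity = cosine(weighted[0], weighted[1])
print(matrix)
print(round(raw_similarity, 4), round(tfidf_similarity, 4))
```

The weighted similarity comes out lower than the raw-count one because TF-IDF boosts the terms that distinguish the documents (“like”, “hate”) relative to the terms they share.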

Modelling Text, 4

In the form of query-document similarity, this is Information Retrieval 101.
• See, for instance, Salton & Buckley, “Term-Weighting Approaches in Automatic Text Retrieval,” 1988.
• A useful basic tech paper: Russ Albright, SAS, “Taming Text with the SVD,” 2004.

Given the complexity of human language, statistical models may fall short.

“Reading from text in general is a hard problem, because it involves all of common sense knowledge.”

-- Expert systems pioneer Edward A. Feigenbaum

From Text to Data: Features

Analytical methods make text tractable.

Latent semantic indexing utilizing singular value decomposition for term reduction / feature selection.

Classification technologies / methods:
• Naive Bayes.
• Support Vector Machine.
• K-nearest neighbor.
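As an illustration of the first method in the list, here is a small multinomial Naive Bayes classifier over bag-of-words counts with add-one smoothing. The class name, the labels, and the four training sentences are all invented for this sketch; a real system would use a tested library and far more data.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesText:
    """Multinomial Naive Bayes on whitespace tokens (hypothetical helper)."""

    def fit(self, docs, labels):
        self.class_doc_counts = Counter(labels)   # documents per class
        self.word_counts = defaultdict(Counter)   # token counts per class
        self.vocab = set()
        for text, label in zip(docs, labels):
            tokens = text.lower().split()
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        self.total_docs = len(labels)
        return self

    def predict(self, text):
        # Tokens never seen in training are ignored.
        tokens = [t for t in text.lower().split() if t in self.vocab]
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_doc_counts.items():
            # Log prior plus summed log likelihoods (add-one smoothing).
            score = math.log(n_docs / self.total_docs)
            total = sum(self.word_counts[label].values())
            for t in tokens:
                count = self.word_counts[label][t]
                score += math.log((count + 1) / (total + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented training data, for illustration only.
train_docs = ["I like databases", "I love fast search",
              "I hate hate databases", "I hate slow search"]
train_labels = ["pos", "pos", "neg", "neg"]

model = NaiveBayesText().fit(train_docs, train_labels)
print(model.predict("I like search"))
print(model.predict("hate slow databases"))
```

Working in log space avoids numeric underflow when documents are long; the smoothing keeps a word seen in only one class from zeroing out the other class entirely.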

Thus the Orb he roam'd
With narrow search; and with inspection deep
Consider'd every Creature, which of all
Most opportune might serve his Wiles.

-- John Milton, Paradise Lost

“Reading from Text is a Hard Problem”

Eugène Delacroix, St. Michael Defeats the Devil


Data, Search, Analysis, and Discovery

Data Space

For features

Analysis

Intent, Goals

The User Interface

“Search is the UI for data today.”

-- Grant Ingersoll, Chief Scientist, LucidWorks

Quoted by Gil Press in Forbes, “LucidWorks: Bringing Search to Big Data”

http://www.forbes.com/sites/gilpress/2012/09/24/lucidworks-bringing-search-to-big-data/

What’s beyond?

Search and Sensemaking

“It is convenient to divide the entire information access process into two main components: information retrieval through searching and browsing, and analysis and synthesis of results. This broader process is often referred to in the literature as sensemaking. Sensemaking refers to an iterative process of formulating a conceptual representation from a large volume of information. Search plays only one part in this process.”

-- Marti Hearst, 2009

http://searchuserinterfaces.com/

Senseless Search

New but old: Dumb and siloed

Better?

Searcher Supplied Sense

Siloed signals.

More better?

Semantic Search Engines

Meh.

Clustered Clarity

Carrot2 (open source)

Semanticized (Web) Search

Google Knowledge Graph

Search Fronted Analysis & Discovery

Fusions, Signals

             Old Search                       Sensemaking
Search on:   keywords                         + identity, history & context
Sources:     content/type silos               Unified
Indexed:     terms                            + metadata (properties)
Returned:    hit lists                        Categories / clusters / answers first
Relevance:   PageRank                         (Inferred) intent
Prevalence:  plenty of new platforms with     Plenty of established search with
             old(ish) search                  new(ish) capabilities, also wanna-bes.

Toward Semantic Search Sensemaking

Platforms and ecosystems.

APIs and services.

Text and content analytics: discerns and extracts features, including relationships, from source materials.

Features = entities, key-value pairs, concepts, topics, events, sentiment, etc.

Provide (for) BI on content-sourced data.

Data integration, record linkage, data fusion.
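To make “features” concrete, the sketch below turns one sentence into a small structured record of the kind a BI tool could consume. Everything here is a deliberately crude assumption: entities are guessed from runs of capitalized words, key-value pairs from `key=value` patterns, and sentiment from a tiny hand-made word list. Production text analytics relies on trained models rather than these heuristics.

```python
import re
from dataclasses import dataclass, field

# Tiny illustrative lexicons (assumed, not from any standard resource).
POSITIVE = {"good", "great", "like", "love"}
NEGATIVE = {"bad", "poor", "hate", "slow"}


@dataclass
class Features:
    entities: list = field(default_factory=list)
    key_values: dict = field(default_factory=dict)
    sentiment: int = 0


def extract_features(text):
    # Candidate entities: one or more consecutive capitalized words.
    entities = re.findall(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b", text)
    # Key-value pairs written as key=value.
    key_values = dict(re.findall(r"\b(\w+)=(\w+)", text))
    # Sentiment: positive-word count minus negative-word count.
    tokens = re.findall(r"[a-z']+", text.lower())
    sentiment = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return Features(entities, key_values, sentiment)


# Invented example sentence.
record = extract_features(
    "Arthur Dent called support on Thursday about order=1234 "
    "and said the search was slow."
)
print(record)
```

Once text is reduced to records like this, the record-linkage and data-fusion steps named above operate on ordinary fields instead of free text.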

The Back End

Text/content analytics generates semantics to bridge search, BI, and applications, enabling next-generation information systems.

Search, BI, Applications

Search based applications (search + text + apps)

Information access (search + text + BI)

Integrated analytics (text + BI)

Text analytics (inner circle)

Semantic search (search + text)

NextGen CRM, EFM, MR, marketing, …

Text+ Technology Mashups

Analytical Assets (Open Source)

>>> import nltk
>>> sentence = """At eight o'clock on Thursday morning
... Arthur didn't feel very good."""
>>> tokens = nltk.word_tokenize(sentence)
>>> tokens
['At', 'eight', "o'clock", 'on', 'Thursday', 'morning',
'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']
>>> tagged = nltk.pos_tag(tokens)
>>> tagged[0:6]
[('At', 'IN'), ('eight', 'CD'), ("o'clock", 'JJ'), ('on', 'IN'),
('Thursday', 'NNP'), ('morning', 'NN')]

http://nltk.org/

tm: Text Mining Package

A framework for text mining applications within R.

A Big Data Analytics Architecture

http://www.geeklawblog.com/2011/12/lexis-advance-platform-launch-two.html

http://hpccsystems.com/ (GNU Affero GPL)

Commercial (Non-OS) Solutions Plug In

Drivers and Trends

Social media!… and personal-social-enterprise integration.

Via-API cloud services.

Big Data (even if you don’t like the term).
Volume and velocity mean new analytical approaches.
Variety: new types and a new fusion imperative.

Sentiment: Mood, opinions, emotions, intent.

Question answering.

Text Tech Initiatives

Now and near future.
• Broader & deeper international language support.
• Sentiment analysis, beyond polarity.
  Emotions, intent signals, etc.
• Identity resolution & profile extraction.
  Online-social-enterprise data integration.
• Semantic data integration, Complex Data.
• Speech analytics.
• Discourse analysis.
  Because isolated messages are not conversations.
• Rich-media content analytics.
• Augmented reality; new human-computer interfaces.

http://timoelliott.com/blog/2010/10/sap-businessobjects-augmented-explorer-now-available-resources-to-test-it.html

Personal. Mobile. Intelligent?

A Focus on Information & Applications

Now and near future.
• Signal detection.
  Sentiment, emotion, identity, intent.
• Semanticized applications.
  Linkable, mashable, enrichable.
• Rich information.
  Context sensitive, situational.

Σ = Sensemaking.

Onward… to Q&A
