SLA Summer 2008

Post on 07-Dec-2014


My presentation to SLA, summer 2008


Mining Solutions: A New Approach to Making the Most of Your Research Time

SLA, Strategic Technology Alliance, Seattle, 2008

Joe Buzzanga, Product Manager, Elsevier Science and Technology

June 17, 2008

Agenda

•Challenges and Framework for Information Retrieval (IR)

•Using Natural Language Processing (NLP) in IR (illumin8)

•Product Demo

Digital Universe: 10x bigger in 5 years

“Searching for meaning in the content of unstructured data like images, video clips, documents, and the numbers and characters in databases is the rocket science of the digital universe.” IDC

Source: IDC Whitepaper, The Diverse and Exploding Digital Universe, March 2008

Today’s Researcher?

Search for Meaning?

Impact on Information Retrieval

•Separate the Signal from Noise

•Signal processing

Our Goal

•Make you successful through superior information retrieval tools

Framework for Information Retrieval

[Diagram: Simple Model. Search, Human Index, Content]

•Simple Model: single book

[Diagram: Print Collections. Search, Human Index, Surrogate Record, Meta Data, Content]

•Traditional: card catalog, periodical index…

Framework for Information Retrieval

[Diagram: Digital Bibliographic A&I. Search, Human Index, Digital Index (Hybrid Index), Surrogate Record, Meta Data, Content, Results]

•Digital bibliographic A&I

•Semi-structured records

•Content under editorial control

•Application of controlled terms

•Application of digital indexing

•Results need to be organized and ranked

•Additional access points (e.g., facets, tags…)

Framework for Information Retrieval

•No human intervention

•Content unstructured, uncontrolled and unmeasurable

•Crawling is inherently imperfect

•Typically keyword indexing

•Ranking of results becomes critical

[Diagram: Web Search. Search, Crawl, Digital Index, Content, Results]

Content: How Big is the Web?

Today

170 million websites across all domains

Source: Netcraft

2 years ago

80 million websites across all domains

Content: Plumbing the Depths

Source: Mills Davis, Project 10X

Content: How Big is the Web?

~10 Billion pages (2003 estimate)

http://www2.sims.berkeley.edu/research/projects/how-much-info-2003/

Crawling in the Dark

The Key in Keyword?

• Keyword is a misnomer in the context of an index

• Keyword is in the mind of the searcher

• Every word is indexed, since the computer is not smart enough to know significant words (i.e., the “key” in “keyword”)

• Brute force approach, feasible with compute power
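The brute-force idea above can be illustrated with a minimal sketch of an inverted index that stores every word, significant or not. The function names (`build_inverted_index`, `search`) and the three sample documents are made up for illustration and are not taken from any product described in this talk.

```python
from collections import defaultdict
import re


def build_inverted_index(docs):
    """Map every word (no notion of 'key' words) to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every word of the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical sample documents, for illustration only.
docs = {
    1: "Compostable film for food packaging",
    2: "The film festival opens in Seattle",
    3: "A compostable bag made of film",
}
index = build_inverted_index(docs)
hits = search(index, "compostable film")
```

Note that words such as "the" and "of" end up in the index alongside "compostable"; the index has no way of telling which word the searcher considers the key.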

Results: Mystery Equation

mystery clip

Results: Facets

Research and its Discontents

5.5 hours / week* Searching and gathering information

4.7 hours / week* Organizing, analyzing and applying information

* Source: 2007 survey of 6,300 knowledge workers, Outsell, Inc.

Introducing illumin8

•Cut through the noise

•Rapid summary/overview

•Cross domain view

•Integrated content

•Web-based

•Sharing results

Applies Natural Language Processing at Internet Scale!

Typical Search

Current general search: get millions of documents to sift through

[Screenshot: search for “compostable film”; results run from Page 1 and Page 2 to Page 180,000]

There is just no way any researcher can read through all this information. It just takes too long!

illumin8 Uses Natural Language Processing to “read” text

Enter search terms, then generate an Organized Result Set

Products, Companies/Organizations, Technical Approaches

•Results grouped into meaningful classes

•System generates list of solutions, not records

•Quickly see interesting and useful areas for investigation
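As a rough sketch of what "results grouped into meaningful classes" could look like in data terms, the snippet below groups already-extracted items by class label. The item names, the `extracted` list and the `group_by_class` helper are hypothetical placeholders; how illumin8 actually extracts and classifies items is not shown here.

```python
from collections import defaultdict

# Hypothetical, already-extracted results as (class label, name) pairs.
extracted = [
    ("Products", "compostable shopping bag"),
    ("Companies/Organizations", "Example Packaging Corp"),
    ("Technical Approaches", "starch-based polymer blend"),
    ("Products", "compostable food wrap"),
    ("Products", "compostable shopping bag"),  # duplicate mention
]


def group_by_class(items):
    """Group item names under their class label, dropping repeated mentions."""
    groups = defaultdict(list)
    for label, name in items:
        if name not in groups[label]:
            groups[label].append(name)
    return dict(groups)


grouped = group_by_class(extracted)
for label, names in grouped.items():
    print(f"{label}: {', '.join(names)}")
```

The researcher then scans a handful of class headings rather than thousands of individual records.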

Our Approach

• Premium Scientific

• Patent

• Web

[Diagram: Search-Crawl-Load, Semantic Index, Content, Results]

NLP Applied

Problems, Solutions, Benefits

NLP Applied

Fuse, Classify, Summarize

NLP Applied

NLP applied throughout the system: index, query, result set

Full Text

Abstracts

illumin8 searches on solutions. The solutions are extracted from full-text sources, abstracts, the web, and patents.

Internet

Patents

illumin8 Solution Database: 1.1 billion

5 Billion web pages, blogs and forums

3 Million full-text scientific and technical articles from 1,800 Elsevier journals

33 Million scientific records from 15,000 peer reviewed journals & more than 4,000 publishers

21 Million patents from 5 world-wide patent offices

Extract and Summarize Solutions

Search

How does illumin8 work?

WEB JOURNAL PATENT

• Summarizing information about Companies, Products, etc., for technologies that researchers care about

• Organizing results from the world's most trusted scientific content and billions of web pages

A Uniform Lens (index) Across Content Sources

Keyword Indexing

• Meaning is lost

Taking Search Beyond Keyword Indexing

Sentence processing

• Meaning is maintained

• Identify & classify problems, solutions and benefits

Example: “Neural Network used in handwriting recognition” (Neural Network: Solution; handwriting recognition: Problem)
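A small sketch of why meaning is lost under keyword indexing: two sentences with opposite meanings reduce to the same set of keywords, while even a crude sentence-level split keeps them apart. The sample sentences and the helpers `keyword_set` and `naive_svo` are illustrative assumptions; the splitter is a toy that expects the verb to be given and is not a real parser.

```python
import re


def keyword_set(text):
    """Keyword view: an unordered set of words, so word order is discarded."""
    return frozenset(re.findall(r"[a-z]+", text.lower()))


def naive_svo(sentence, verb):
    """Toy sentence view: split into (subject, verb, object) around a known verb."""
    words = sentence.lower().rstrip(".").split()
    i = words.index(verb)
    return " ".join(words[:i]), verb, " ".join(words[i + 1:])


a = "Coating protects film from moisture"
b = "Moisture protects film from coating"

same_keywords = keyword_set(a) == keyword_set(b)
same_structure = naive_svo(a, "protects") == naive_svo(b, "protects")
print(same_keywords, same_structure)
```

To a keyword index the two sentences are indistinguishable; once subject and object are kept apart, they are not.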

Natural Language Parsing

Help_patterns: Succeed2, Correct_problem, treat, Person_SAVS, positively_influence, have_positive_influence, protect_sb_against_sth, Product_would_do_good, provide_sb_with_sth, Product_is_shown_to, talented_at, use_sth_to_do_sth, approve_sth, rely_on_product_to, application_is, Product_allows_sb_to, VG2, ensure_protagonist, A_makes_B_good, benefit_of

...

Thousands of rules

Plus statistical models

illumin8 Rules

Example sentence: “Capacitive deionization with carbon aerogel electrodes provides an economical and efficient method for removing salt and impurities from water.”

Grammatical Role, Role Test, Role Assignment:

•Verb: “provides”. Role test: Modal? (no: continue)

•Subject: “Capacitive deionization”. Role test: Negated? (no: continue). Role assignment: Solution

•Object: “an economical and efficient method for removing salt and impurities from water”. Role test: Antagonistic? (no: continue). Role assignment: Benefit

Role tests:

•Modal? Check that Verb polarity is positive; this rule would not match if the Verb were modal (i.e., only in certain cases), for example if it said “should provide … but”

•Negated? Check that Subject is not negated; this rule would not match if Subject were not positive, for example if it said “no process provides an economical and efficient …”

•Antagonistic? Check that Object is not antagonistic; this rule would not match if Object were, for example, “provides a costly and complicated method”
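The three role tests for the “provides” example can be sketched in code. This is only an illustration under strong assumptions: the clause is taken as already parsed into subject, verb group and object, and the word lists `MODALS`, `NEGATORS` and `ANTAGONISTIC` as well as the `Clause` type and `apply_provides_rule` function are hypothetical stand-ins, not the actual illumin8 rules or statistical models.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical word lists; a real system would need far richer resources.
MODALS = {"should", "could", "might", "may", "would", "can"}
NEGATORS = {"no", "not", "never", "none"}
ANTAGONISTIC = {"costly", "complicated", "expensive", "inefficient"}


@dataclass
class Clause:
    subject: str
    verb: str  # full verb group, e.g. "should provide"
    obj: str


def words(text):
    """Lowercased set of words in a phrase."""
    return set(text.lower().replace(",", " ").split())


def apply_provides_rule(clause: Clause) -> Optional[dict]:
    """Assign Solution/Benefit roles only if all three role tests pass."""
    verb_words = words(clause.verb)
    if not verb_words & {"provide", "provides"}:
        return None  # rule does not apply to this verb
    if verb_words & MODALS:
        return None  # Modal? yes: no match
    if words(clause.subject) & NEGATORS:
        return None  # Negated? yes: no match
    if words(clause.obj) & ANTAGONISTIC:
        return None  # Antagonistic? yes: no match
    return {"Solution": clause.subject, "Benefit": clause.obj}


benefit = "an economical and efficient method for removing salt and impurities from water"
match = apply_provides_rule(Clause("Capacitive deionization", "provides", benefit))
modal = apply_provides_rule(Clause("Capacitive deionization", "should provide", benefit))
negated = apply_provides_rule(Clause("No process", "provides", benefit))
antagonistic = apply_provides_rule(
    Clause("Capacitive deionization", "provides", "a costly and complicated method")
)
```

Only the first clause yields a role assignment; the other three each fail one test and return `None`.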

Analyzing A Sentence

Carrier’s Infinity™ Air Purifier uses ultraviolet light to eliminate germs such as viruses, molds, bacteria, mildew and mold spores from the indoor air of homes and offices, ensuring a higher indoor air quality.

Carrier [Organization]

Infinity Air Purifier [Product]

Ultraviolet light [Technology]

Germ [Problem]

Virus, Mold, Bacteria, Mildew, Mold spore

Indoor air quality [Benefit]

Relations: Makes, Uses, Solves, Provides, Kind of

Concepts, ideas and entities extracted from a single sentence.
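One way to picture the output of this analysis is as a small set of typed entities and subject-relation-object triples. The entity types and relation names below come from the slide; which entity each relation connects is an assumption read off the example sentence, and the `objects_of` and `subjects_of` helpers are hypothetical.

```python
# Entity types as labeled on the slide.
entities = {
    "Carrier": "Organization",
    "Infinity Air Purifier": "Product",
    "Ultraviolet light": "Technology",
    "Germ": "Problem",
    "Indoor air quality": "Benefit",
}

# Assumed edges, inferred from the sentence rather than stated on the slide.
triples = [
    ("Carrier", "Makes", "Infinity Air Purifier"),
    ("Infinity Air Purifier", "Uses", "Ultraviolet light"),
    ("Infinity Air Purifier", "Solves", "Germ"),
    ("Infinity Air Purifier", "Provides", "Indoor air quality"),
]
triples += [
    (kind, "Kind of", "Germ")
    for kind in ("Virus", "Mold", "Bacteria", "Mildew", "Mold spore")
]


def objects_of(subject, relation):
    """Everything the subject is linked to by the given relation."""
    return [o for s, r, o in triples if s == subject and r == relation]


def subjects_of(relation, obj):
    """Everything linked to the object by the given relation."""
    return [s for s, r, o in triples if r == relation and o == obj]


solved = objects_of("Infinity Air Purifier", "Solves")
germ_kinds = subjects_of("Kind of", "Germ")
```

Stored this way, a question such as "which products solve a germ problem?" becomes a lookup over triples instead of a keyword match.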

DEMO