Database-Inspired Search. David Konopnicki and Oded Shmueli, IBM Haifa, Technion.


Back in 1994-95…

• Went live in Dec. 1995 with 18 million documents

• Started as “Jerry and David's Guide to the World Wide Web”
• Funded in April 1995 with an initial investment of $2 million

• Went live in 1994 with 54,000 documents
• Had indexed 1.5 million in the beginning of 1995

W3QL – W3QS: A database approach to Web data

• A way to “improve” search results
• A database language for searching the web
• Using full-text indexes as starting points
• Had conditions on “semi-structured” formats:

  n1.format eq “Latex File” && n1.section[3].content =~ /zoo/

• Would record form fillings and re-execute them automatically
• Basically, a way to define personal crawlers
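The W3QL condition above can be read as a predicate over a retrieved node. A minimal Python sketch of that reading follows; it is not W3QS code. The node representation (a dict with a `format` string and a list of `section` dicts) and the function name `matches` are assumptions made for illustration, as is treating `section[3]` as a 0-based index.

```python
import re

def matches(node):
    """True if the node is a LaTeX file whose section[3] content mentions 'zoo'.

    `node` is a hypothetical dict: {"format": str, "section": [{"content": str}, ...]}.
    """
    sections = node.get("section", [])
    if node.get("format") != "Latex File" or len(sections) <= 3:
        return False
    # Equivalent of the `=~ /zoo/` regular-expression test.
    return re.search(r"zoo", sections[3].get("content", "")) is not None

latex_node = {
    "format": "Latex File",
    "section": [{"content": ""}, {"content": ""}, {"content": ""},
                {"content": "a day at the zoo"}],
}
html_node = {"format": "HTML", "section": []}
```

A personal crawler in this spirit would follow links from an index's results and keep only the nodes for which the predicate holds.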

Contemporary Systems

• First generation languages: WebSQL (Mihaila, Mendelzon and Milo)
• Second generation languages: Weblog (Lakshmanan, Sadri, and Subramanian), Florid (Ludascher, Himmeroder, Lausen, May and Schlepphorst)
• Web restructuring languages: WebOQL (Arocena and Mendelzon), StruQL (Fernandez, Florescu, Kang, Levy and Suciu), Araneus (Mecca, Atzeni, Masci, Merialdo and Sindoni)
• Lorel (Abiteboul, Quass, McHugh, Widom and Wiener)

Present Trends

• Certainly, nowadays search engines are bigger and faster and more accurate
• A few new features:
  [Screenshots: Clusty, Teoma]

Is searching the web easier?

Limitations Remain the Same

• Visually parsing results
• Search in context
• Searching beyond the first page of results
• Integrated search from my desktop, my enterprise and on to the world

Visually Parsing Results

What is best?

Lots of times we search for real-world objects not documents

Merging Documents and Object Retrieval

Document

Email

Person

Need to understand objects, attributes, etc.

Search in Context

Hard to do using keywords only…

Search Only the First Page of Results

From a recent study on 12,500 queries:
• 73.9% of Ask Jeeves first page results were unique to Ask Jeeves
• 71.2% of Yahoo first page results were unique to Yahoo
• 70.8% of MSN search first page results were unique to MSN search
• 66.4% of Google first page results were unique to Google

Need an automated way to search beyond the first page on several search engines simultaneously
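The "unique to one engine" figures above are overlap statistics. As a rough sketch of how such a number can be computed for a single query, assuming each engine's first page is available as a list of URLs (the function name, engine names and URLs below are placeholders, and the study's actual methodology is not described in these slides):

```python
from collections import Counter

def unique_share(first_pages):
    """For each engine, the fraction of its first-page URLs that no other engine returned.

    first_pages: {engine_name: [url, ...]} for one query.
    """
    # Count, for every URL, how many engines returned it (duplicates within an engine ignored).
    counts = Counter(url for urls in first_pages.values() for url in set(urls))
    return {
        engine: sum(1 for url in set(urls) if counts[url] == 1) / len(set(urls))
        for engine, urls in first_pages.items()
        if urls
    }

pages = {
    "engine_a": ["u1", "u2", "u3", "u4"],
    "engine_b": ["u1", "u5"],
}
shares = unique_share(pages)
```

Averaging such per-query shares over a query log would give a percentage of the kind quoted on the slide; a high share is what makes querying several engines at once worthwhile.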

Full-text indexes are just starting points

Desktop Search

• Quite different than web search
• No links - cannot use link analysis
• Information discovery versus locating information

Enterprise Search

• Quite different too:
  • Data integration from lots of systems
  • Critical intranet service
• IBM Intranet Search
  • 10,000 websites
  • 6 million indexed documents
• A new product called OmniFind

Search Architectures in the Enterprise

[Figure: three layers]
• Applications: Intranet Search, Employee Portals, Employee Directories, Corporate Info & Commerce Search, Customer Services, Sales Force InfoCenter
• Search Services: Enterprise Search, Collections
• Content Sources: E-mail Systems, Web Servers, News Servers, Content Servers, File Servers, Portal Servers, CRM Systems, Directory Servers

Information integration without a schema!

Really?! What about schema mappings, joins…

An Example: DB2 Crawling in OmniFind

• For every table, select fields: for each field, define whether it should be full-text searchable, if it should support range conditions, etc.
• Full Boolean operations are supported
• The next frontier: Fast index building!

UIMA: Unstructured Information Management Architecture

• An open architecture
• A software framework for processing unstructured information
• Plug-n-Play with back-end Search Technologies
• Freely Available on IBM AlphaWorks

UIMA’s Basic Building Blocks are Annotators

[Figure: layers of annotations over the sentence “Fred Center is the CEO of Center Micros”, held in the CAS]
• Parser: NP, VP, PP
• Named Entity: Person, Organization
• Relationship: CeoOf (Arg1: Person, Arg2: Org)
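The annotator idea can be illustrated with stand-off annotations: each annotator reads the shared analysis structure and adds typed spans to it. The sketch below is a toy model in Python and does not use the UIMA API; the classes `Annotation` and `Cas`, the hard-coded annotators, and the example sentence "Fred Center is the CEO of Center Micros" are all assumptions for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    type: str
    begin: int
    end: int
    features: dict = field(default_factory=dict)

@dataclass
class Cas:
    """Toy stand-in for a common analysis structure: the text plus its annotations."""
    text: str
    annotations: list = field(default_factory=list)

    def add(self, type_, covered, **features):
        # Locate the first occurrence of the covered text and record its span.
        begin = self.text.index(covered)
        ann = Annotation(type_, begin, begin + len(covered), features)
        self.annotations.append(ann)
        return ann

    def covered_text(self, ann):
        return self.text[ann.begin:ann.end]

def named_entity_annotator(cas):
    # Hard-coded for the example; a real annotator would run a recognizer.
    cas.add("Person", "Fred Center")
    cas.add("Organization", "Center Micros")

def relation_annotator(cas):
    # Builds on the annotations left by the previous annotator.
    person = next(a for a in cas.annotations if a.type == "Person")
    org = next(a for a in cas.annotations if a.type == "Organization")
    cas.annotations.append(
        Annotation("CeoOf", person.begin, org.end, {"arg1": person, "arg2": org})
    )

cas = Cas("Fred Center is the CEO of Center Micros")
for annotator in (named_entity_annotator, relation_annotator):
    annotator(cas)
ceo_of = cas.annotations[-1]
```

The point of the design is that later annotators (here the relationship one) depend only on the annotations in the shared structure, not on which component produced them.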

UIMA Component Architecture from “Source to Sink”

[Figure: Collection Processing Engine (CPE)]
• Sources: Text, Chat, Email, Audio, Video
• Collection Reader
• Aggregate Analysis Engine, containing Analysis Engines, each with an Annotator
• CAS passed between the components
• CAS Consumers
• Sinks: Ontologies, Indices, DBs, Knowledge Bases

Future Search Integration Service

Requirements
• Index Integration
• Object Aware (“schema”)
• Correlation Aware (“flexible” joins)
• Context Aware (“language”)

[Figure: a Search Integration Service on top of a Desktop Index, Enterprise Indexes 1-3 and Web Indexes 1-4]

Search Integration Services Capabilities

• Need APIs for querying and control
• Control capabilities
  • Specifying the number of results, result chunks
  • Total size of results
  • Degree of validity, recency, trust, security-level…
  • Time constraints, cost constraints, privacy constraints, security constraints
  • May specify tradeoffs
• Semantic capabilities: APIs
  • Relevant ontologies
  • Description of resources
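One way to picture the control part of such an API is a small parameter block sent along with each query. The sketch below is purely illustrative: `SearchControl`, its field names and `result_chunks` are hypothetical and do not correspond to any existing product interface.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class SearchControl:
    """Hypothetical control block accompanying a query."""
    max_results: int = 20                 # number of results
    chunk_size: int = 10                  # results are delivered in chunks of this size
    min_trust: float = 0.0                # required degree of trust, in [0, 1]
    max_age_days: Optional[int] = None    # recency; None means no limit
    max_seconds: Optional[float] = None   # time constraint
    max_cost: Optional[float] = None      # cost constraint

def result_chunks(results, control):
    """Apply the result-count limit, then split the results into chunks."""
    limited = results[:control.max_results]
    return [limited[i:i + control.chunk_size]
            for i in range(0, len(limited), control.chunk_size)]

control = SearchControl(max_results=5, chunk_size=2)
chunks = result_chunks(list(range(8)), control)
```

Making the block immutable (`frozen=True`) is a design choice of this sketch: the same constraints can then be forwarded unchanged to every underlying index.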

A Changing Landscape

• Search Integration Services
• Semantic web capabilities
• Technologies for Supporting Comprehensive Search:
  • XML search
  • NL annotation servers
  • collaborative bookmarks
  • domain-specific services

What kind of Applications are we considering?

• Generally involves a comprehensive answer to a question
• Not the kind you can perform by viewing a single result page, although these are very important
• Very time consuming with current tools
• May involve public and proprietary information
• May involve information from various sources
• May involve personal information
• May involve payment for certain resources
• May be time constrained
• May be of adjustable levels of dependability, clarity, recency

Kinds of Questions

• Informational: U.S. educational spending in cities with population of at least one million
• Recommendation: What treatment is recommended for X
• Technical: detailed techniques for water purification
• Workflow: How do I organize a trip to Y: visa, flights, vaccinations, money exchange, cellular service, consulate, emergencies
• Compositional: How do I perform a task electronically by composing various services
• These are difficult to answer with current tools

Towards a Comprehensive Platform

• A language and a system supporting it
• Why an additional language?
  • To take advantage of a collection of sophisticated services: search engines, semantics, collaborative tools, advanced techniques …
  • To provide a context to search services
  • To enable better result presentation services
  • To enable personalization of the task at hand
  • When required, look at ‘raw’ data rather than only derived products
  • To enable optimization

[Figure: Search Integration System. Labels shown:]
• Search & Control
• Full Text Search
• XML & DB Search
• Semantic Sources
• Desktop, Enterprise, Web Search
• P2P, RSS, BLOG, Wikis search
• Files, Databases
• Semantic KB, Semantic search engines
• Neighborhood Querying, Ranking, Preferences…
• Annotations, NLA of documents
• Natural Language Analysis of Queries

Semantic Web: Search and Integration

• Look at mixed resources, involving traditional as well as semantic layers (annotation).
  • Search the semantic web (as in Swoogle)
  • Use ontologies to resolve ambiguities
  • Include reasoning capabilities
  • Use various measures for semantic proximity
  • Combine information from multiple sources and resolve conflicts (trust, easier for intranets)
  • Use ontologies to organize results in human readable form
  • Supply explanations: how is information deduced

Semantic Web: Search and Integration

• Search semantic data (KB) to obtain access to described traditional resources (as in TAP)
  • Resolve ambiguities at the data level
  • Deduce keywords for traditional search engines to obtain additional information
  • Examine likely sources (e.g., IMDB)
  • Continue further exploration of described resources

Swoogle (extracted from the site)

• Swoogle is a crawler-based indexing and retrieval system for the Semantic Web -- RDF and OWL documents encoded in XML or N3
• Swoogle extracts metadata for each discovered document, and computes relations among them
• Swoogle is intended as a resource to support services needed by software agents and programs via web service interfaces and also for semantic web researchers to use directly via the web interface
• It is not designed to support casual users seeking to answer queries on the web (e.g., "what is the population of the capital of India?")

TAP (extracted from the site)

• The TAP KB is a shallow but broad knowledge base containing basic lexical and taxonomic information about a wide range of popular objects
• Our goal is to bootstrap the Semantic Web by providing a comprehensive source of basic information about popular objects
• The KB currently includes knowledge about:
  • Music: Popular music, musicians & groups, instruments, styles, composers
  • Movies: Top Movies, actors, television shows
  • Authors: Top book authors, classic books
  • Sports: Athletes, sports, sports teams, equipment
  • …

The KB

<tap:UnitedStatesSenator rdf:ID="http://tap.stanford.edu/data/PoliticianDodd,_Christopher">
  <rdfs:label xml:lang="en">Christopher Dodd</rdfs:label>
  <tap:representsPlace rdf:resource="http://tap.stanford.edu/data/ConnecticutState"/>
  <tap:memberOf rdf:resource="http://tap.stanford.edu/data/USDemocraticParty"/>
</tap:UnitedStatesSenator>

</rdfs:Class>
<rdfs:Class rdf:ID="http://tap.stanford.edu/data/UnitedStatesSenator">
  <rdfs:label xml:lang="en">Sen.</rdfs:label>
  <rdfs:label xml:lang="en">Senator</rdfs:label>
  <rdfs:subClassOf rdf:resource="http://tap.stanford.edu/data/Politician"/>
  <tap:plural>senator</tap:plural>
</rdfs:Class>

Semantic Web: Task Formation

• Use ontologies to deduce a workflow for performing a task
• Applicable to composing web services
• The task itself may involve a number of sites
• Parts may be executable:
  • on the web
  • via other means
  • via web services
• The output may be a complete or partial task fulfillment

Business Trip Planner Agent Example-1

• Present coherent information for trip planning
  • Dates, constraints, preferences, organizational policy
  • Company resources and clients in the area
    • History of contacts, clients, deals, prospects
  • Destination conditions based on historical data
    • weather, tourist information, official holidays
  • Latest news at destination and vicinity
    • commercial, political, religious, security, crime, medical

Business Trip Planner Agent Example-2

• Additional information for trip planning
  • Airline, hotel, car rental data
  • Suggest itinerary based on constraints
  • Prepare to make reservations on-line
  • Personal friends, family in the area
  • Must-visit tourist attractions
    • dates, rates, photos, video, historical background, links
  • Major seasonal attractions
    • festivals, concerts, theatre
• Once information is machine “understandable” one should be able to construct a trip planner agent

Technologies for Supporting Comprehensive Search

1. Querying Modes and Control
  • The exact structure may not always be known and relationships need to be specified in a flexible way; various semantics are possible
  • Declaratively stating priorities
2. Ranking
  • Ranking is a critical component, both in weighting different scores and in controlling the ordering of result presentation
3. Neighborhood Querying
  • Imprecise querying mode in which similar or near entities/objects are retrieved

1. Querying Modes and Control

• NL understanding
  • Web pages contain phrases whose similarity is not just based on syntactical matching; the meaning may depend on context, language usage and more
• Flexible Querying
  • The exact structure may not always be known and relationships need to be specified in a flexible way; various semantics are possible
• Query control: Preferences
  • A search may involve resources and tradeoffs may need to be specified; preferences may also address quality, recency, amount, language and other factors

Querying Modes and Controls Example

• Trying to locate information about a movie based on fairly vague recollections
  • It is based on a book
  • It deals with military political issues, maybe a coup or a coup attempt, or a kidnapping
  • From the fifties or sixties
  • The lead role is a famous movie star of that time
  • It’s not the one with Peter Sellers and it’s not Failsafe and not the one with submarines
  • The plot involves Generals, Colonels and the President, maybe not all of them and there might also be a Senator or two

Querying Modes and Controls Example

• Solving the above may utilize
  • a movie database with an associated ontology
  • a flexible querying language that attempts at maximal subset satisfaction
  • a web search engine with some NL understanding (of the plot)
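"Maximal subset satisfaction" can be read in more than one way. The sketch below takes one simple reading, namely returning the candidates that satisfy the largest number of the stated conditions, and should be taken as an illustration rather than as the semantics the authors have in mind. The movie records and field names are invented placeholders, not data from any real database.

```python
def best_matches(candidates, conditions):
    """Return (count, titles): the highest number of satisfied conditions
    and the titles of the candidates that reach it."""
    scored = [(sum(1 for cond in conditions if cond(m)), m["title"]) for m in candidates]
    best = max((score for score, _ in scored), default=0)
    return best, [title for score, title in scored if score == best]

# Vague recollections from the example, expressed as independent predicates.
conditions = [
    lambda m: m["based_on_book"],
    lambda m: 1950 <= m["year"] <= 1969,
    lambda m: bool({"coup", "kidnapping"} & m["themes"]),
    lambda m: "Peter Sellers" not in m["cast"],
]

# Hypothetical records for illustration only.
movies = [
    {"title": "Movie A", "year": 1972, "based_on_book": True,
     "themes": {"coup", "military"}, "cast": {"Star X"}},
    {"title": "Movie B", "year": 1964, "based_on_book": True,
     "themes": {"satire"}, "cast": {"Peter Sellers"}},
    {"title": "Movie C", "year": 1990, "based_on_book": False,
     "themes": {"submarine"}, "cast": set()},
]

result = best_matches(movies, conditions)
```

Here "Movie A" wins although it violates the decade condition, which is the behavior one wants when a recollection may be wrong.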

Querying Modes and Controls Example Cont.

• While I’m really interested, please
  • Work on it for no more than an hour
  • Don’t spend more than a dollar finding the answer
  • Use only highly trusted sources
  • Obtain photos and video clips if possible, especially those involving the lead star, Washington sites, trucks and airplanes
• The most important items are how much the movie grossed and whether the lead star was nominated for an Oscar for this movie
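Preferences such as "no more than an hour", "no more than a dollar" and "only highly trusted sources" can be treated as budgets and thresholds when choosing which sources to consult. The following is a minimal sketch under strong assumptions: every source advertises an estimated time, a cost and a trust score, and a greedy pass by decreasing trust is good enough. The function name, source names and numbers are all hypothetical.

```python
def plan_sources(sources, max_seconds, max_cost, min_trust):
    """Greedily pick sources by decreasing trust while both budgets hold."""
    chosen, seconds, cost = [], 0.0, 0.0
    for src in sorted(sources, key=lambda s: -s["trust"]):
        if src["trust"] < min_trust:
            break  # all remaining sources are even less trusted
        if seconds + src["seconds"] <= max_seconds and cost + src["cost"] <= max_cost:
            chosen.append(src["name"])
            seconds += src["seconds"]
            cost += src["cost"]
    return chosen

sources = [
    {"name": "movie_db", "trust": 0.9, "seconds": 600, "cost": 0.60},
    {"name": "archive", "trust": 0.8, "seconds": 1200, "cost": 0.50},
    {"name": "web_search", "trust": 0.7, "seconds": 900, "cost": 0.0},
    {"name": "forum", "trust": 0.3, "seconds": 60, "cost": 0.0},
]

# One hour, one dollar, trusted sources only.
plan = plan_sources(sources, max_seconds=3600, max_cost=1.00, min_trust=0.6)
```

A greedy pass is not optimal in general (this is a knapsack-like problem); it is used here only to show where the declared preferences enter the evaluation.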

2. Ranking

• Composition
  • Various “judges” may score differently; allow scoring of search terms, services, relevancy
• Top-k Queries
  • Multidimensional objects; monotone aggregation function on attributes; on each attribute, a list in rank order; find k top ranked objects
  • Many variations; e.g., applications for finding “best” pages based on ranking by various services
• Ranked Query Results
  • Ranking query results in desired order also applies to the semantic web, important for retaining user attention as well as in specifying sub queries during compilation/execution
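The top-k setting on the Ranking slide (one ranked list per attribute, a monotone aggregation function, find the k best objects) is the setting addressed by threshold-style algorithms such as Fagin's Threshold Algorithm. The slides do not name a specific algorithm, so the sketch below is one possible instance, assuming every object has a score in every list and that `aggregate` is monotone (the default is `sum`).

```python
import heapq

def threshold_topk(lists, k, aggregate=sum):
    """Top-k objects by aggregated score.

    lists: one dict {object: score} per attribute; every object appears in each dict.
    Returns up to k (score, object) pairs, best first.
    """
    # Sorted access: each attribute list in descending score order.
    ranked = [sorted(scores.items(), key=lambda kv: (-kv[1], kv[0])) for scores in lists]
    top, seen = [], set()              # `top` is a min-heap holding the best k seen so far
    for depth in range(len(ranked[0])):
        for r in ranked:
            obj = r[depth][0]
            if obj not in seen:
                seen.add(obj)
                # Random access: fetch the object's score in every list.
                total = aggregate(scores[obj] for scores in lists)
                heapq.heappush(top, (total, obj))
                if len(top) > k:
                    heapq.heappop(top)
        # No unseen object can score above the aggregate of the current row.
        threshold = aggregate(r[depth][1] for r in ranked)
        if len(top) == k and top[0][0] >= threshold:
            break
    return sorted(top, reverse=True)

relevance = {"a": 0.9, "b": 0.8, "c": 0.1}
freshness = {"a": 0.2, "b": 0.7, "c": 0.9}
best_two = [obj for _, obj in threshold_topk([relevance, freshness], k=2)]
```

The early stop is what distinguishes this from scoring everything: with k=1 on the data above the loop ends after two rows, once the best total seen reaches the threshold.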

Ranking Example

• Continuing the previous example, textual information may be provided by various search engines; rank the information based on the weights awarded to these engines
• Various photos may score differently on the star, Washington sites, airplanes and trucks, find best
• Rank results, for example those that answer the most conditions that are judged to be the most important
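Ranking by "the weights awarded to these engines" can be sketched as a weighted merge of the engines' result lists. The scoring rule used below (engine weight divided by rank position, summed over engines) is an assumption chosen for simplicity; the slides do not prescribe a formula, and the engine names and URLs are placeholders.

```python
def weighted_merge(results, weights):
    """Merge ranked lists from several engines into one ranking.

    results: {engine: [url, ...]} with the best result first.
    weights: {engine: weight}; engines without a weight contribute nothing.
    """
    scores = {}
    for engine, urls in results.items():
        for rank, url in enumerate(urls, start=1):
            # Higher-weighted engines and better positions contribute more.
            scores[url] = scores.get(url, 0.0) + weights.get(engine, 0.0) / rank
    # Sort by score, breaking ties alphabetically for a deterministic order.
    return sorted(scores, key=lambda u: (-scores[u], u))

merged = weighted_merge(
    {"engine_1": ["x", "y"], "engine_2": ["y", "z"]},
    {"engine_1": 1.0, "engine_2": 2.0},
)
```

In the example, "y" comes first because both engines return it and the more trusted engine ranks it at the top.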

3. Neighborhood Querying - flexibility

• k Nearest Neighbors
  • Locate near-by objects in a multidimensional space, objects may be pages, or traditional objects, where each dimension corresponds to a property (attribute)
• Complex Similarity Queries
  • Identify similar objects, to a given object set
  • Detecting “identical objects”
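The k-nearest-neighbors mode can be sketched with a brute-force scan, assuming each object is already represented as a tuple of numeric attribute values and that Euclidean distance is an acceptable measure of nearness. Both assumptions are for illustration only; a real system would likely use an index structure and a domain-specific distance.

```python
import heapq
import math

def k_nearest(query, objects, k):
    """Names of the k objects closest to `query`, nearest first.

    objects: {name: tuple of attribute values}, same dimensionality as `query`.
    """
    return heapq.nsmallest(k, objects, key=lambda name: math.dist(query, objects[name]))

points = {"a": (0.0, 0.0), "b": (1.0, 1.0), "c": (5.0, 5.0)}
nearest = k_nearest((0.9, 0.9), points, k=2)
```

`math.dist` requires Python 3.8 or later.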

Neighborhood Querying - Example

• Continuing the example, if a coup or kidnapping plot is not found, a close one may be a plot of some other type, for example an overthrow, and instead of the military it may involve the secret service
• Maybe it was some other vehicle rather than trucks or planes
• Perhaps the movie was an Oscar candidate in some other category or its director/star were Oscar winners for other movies

Moving on…

• The landscape is complex
  • Sophisticated tagging and information aggregation
  • Merging object and document retrieval
  • Focused search
  • New “sources” including RSS, Blogs, Wikis …
  • Useful result presentation
  • Cooperative bookmarks management
• We explored some ways to take advantage of this emerging landscape for sophisticated search and integration tasks

Thank You!