The Web in the Year 2010: Challenges and Opportunities for Database Research

Gerhard Weikum

weikum@cs.uni-sb.de

http://www-dbs.cs.uni-sb.de


Importance of Database Technology


What Have We Done to the Web?

Brave New World

• Information at your fingertips

• Electronic commerce

• Interactive TV

• Digital libraries

• Terabyte servers

Back to Reality

• Flooded by junk & ads

• Poor responsiveness and vulnerable to load surges

• Needles in haystacks

• Unreliable services

• Success stories require special care & investment


The Grand Challenge: Service Quality Guarantees

"Our ability to analyze and predict the performance of the enormously complex software systems ... are painfully inadequate"

(PITAC Report)

• Continuous Service Availability

• Money-back Performance Guarantees

• Guaranteed Search Result Quality

Observation: The importance of quality guarantees is not limited to the Web

DFG graduate program at U Saarland

Prediction for 2010: Either we will have succeeded in achieving these qualities, or we will face world-wide information chaos!


Outline

Why I’m Speaking Here

• Money-back Performance Guarantees

• Observations and Predictions

• Continuous Service Availability

• Guaranteed Search Result Quality

• Summary of My Message


From Best Effort to Performance Guarantees

"Internal Server Error. Our system administrator has been notified. Please try later again."

Observations:

• Web service performance is best-effort only

• Response time is unacceptable during peak load because of queueing delays

• Performance is mostly unpredictable!

Example: Check Availability (Look-Up Will Take 8-25 Seconds)

Users (and providers) need performance guarantees! Unacceptably slow servers are like unavailable servers. With a huge number of clients, guarantees may be stochastic.


Example: Video (& Audio) Server

Partitioning of continuous data objects with variable bit rate into fragments of constant time length T

Periodic scheduling in rounds of duration T

[Figure: timeline of rounds 0, T, 2T, 3T, 4T; the server delivers fragment streams to clients with deadlines for QoS]

Admission control to ensure QoS: "yes, go ahead" or "no way"

Stochastic model:

$$T_{serv} = T_{seek} + \sum_{i=1}^{N} T_{rot,i} + \sum_{i=1}^{N} T_{trans,i}$$

$$f^*_{serv} = f^*_{seek} \cdot (f^*_{rot})^N \cdot (f^*_{trans})^N$$

Chernoff bound:

$$P[T_{serv} \ge t] \le \inf \{ e^{-\theta t} f^*_{serv}(-\theta) \mid \theta \ge 0 \}$$

...

Auto-configure server: admission control, #disks, etc.
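The admission-control idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the model used in the talk: the total seek time per round is taken as a constant, rotational latency as uniform on [0, rot_max], and transfer time as Gamma-distributed, and all numeric defaults are made up. The function names `chernoff_tail_bound` and `max_streams` are hypothetical.

```python
import math

def log_mgf_uniform(theta, a):
    """log E[exp(theta * X)] for X uniform on [0, a]."""
    x = theta * a
    if x < 1e-12:
        return 0.0
    return math.log(math.expm1(x) / x)

def log_mgf_gamma(theta, shape, scale):
    """log E[exp(theta * X)] for X ~ Gamma(shape, scale); needs theta < 1/scale."""
    return -shape * math.log1p(-theta * scale)

def chernoff_tail_bound(n, t, seek=0.05, rot_max=0.008,
                        shape=4.0, scale=0.0025, steps=2000):
    """Chernoff upper bound on P[T_serv >= t] when n requests share one round.

    Assumed model (all times in seconds): constant total seek time, i.i.d.
    uniform rotational latencies, i.i.d. Gamma transfer times.
    """
    theta_max = 1.0 / scale          # the Gamma MGF only exists below this value
    best = 0.0                       # theta = 0 yields the trivial bound exp(0) = 1
    for k in range(1, steps):        # simple grid search over theta
        theta = theta_max * k / steps
        log_mgf = (theta * seek
                   + n * log_mgf_uniform(theta, rot_max)
                   + n * log_mgf_gamma(theta, shape, scale))
        best = min(best, log_mgf - theta * t)
    return math.exp(best)

def max_streams(round_length, eps, **model):
    """Largest n such that the bound on P[round overruns] stays <= eps."""
    n = 0
    while chernoff_tail_bound(n + 1, round_length, **model) <= eps:
        n += 1
    return n
```

Calling `max_streams(1.0, 0.01)` returns the number of concurrent streams the server could admit if a round of one second may overrun with probability at most 1%, given the assumed distributions. A real configuration tool would fit these distributions to measured disk behavior.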


Observations and Predictions

Observations:

• stochastic modeling is a crucial asset, but realistic modeling is difficult and sometimes impossible

• resource dedication can simplify the problem

• "low-hanging fruit" engineering: 90% solution with 10% intellectual effort

Predictions for 2010:

• special-purpose, self-tuning servers with predictable performance

• "Web engineering" for end-to-end QoS will rediscover stochastic modeling or will fail

• stochastic guarantees for all data and services, e.g., of the form $P[\text{response time} \le 2\,\mathrm{s}] \ge 0.95$

• money-back guarantees after trial phase

• asap alerting about necessary resource upgrading
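A guarantee of the form P[response time ≤ 2 s] ≥ 0.95 can be checked against a queueing model. The sketch below assumes the simplest such model, an M/M/1 queue with FCFS service, in which the response time is exponentially distributed with rate (service rate minus arrival rate). The talk does not prescribe this model, and the function names are hypothetical.

```python
import math

def guarantee_probability(arrival_rate, service_rate, limit_s):
    """P[response time <= limit_s] in an M/M/1 FCFS queue (rates per second)."""
    if arrival_rate >= service_rate:
        return 0.0                      # overloaded: no stationary guarantee
    return 1.0 - math.exp(-(service_rate - arrival_rate) * limit_s)

def max_arrival_rate(service_rate, limit_s, target):
    """Highest arrival rate that still meets P[response <= limit_s] >= target."""
    return max(0.0, service_rate + math.log(1.0 - target) / limit_s)

def needs_upgrade(arrival_rate, service_rate, limit_s, target):
    """Alert condition: the observed load no longer satisfies the guarantee."""
    return guarantee_probability(arrival_rate, service_rate, limit_s) < target
```

With a server that completes 10 requests per second, `max_arrival_rate(10.0, 2.0, 0.95)` is about 8.5 requests per second. Beyond that load, `needs_upgrade` would signal that resources must be added to keep the guarantee.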


Outline

Why I’m Speaking Here

Money-back Performance Guarantees

• Observations and Predictions

Continuous Service Availability

• Guaranteed Search Result Quality

• Summary of My Message




Vector Space Model for Content Relevance

Query (set of weighted features): $q \in [0,1]^{|F|}$

Search engine

Documents are feature vectors: $d_i \in [0,1]^{|F|}$

Ranking by descending relevance

Similarity metric:

$$sim(d_i, q) := \frac{\sum_{j=1}^{|F|} d_{ij} q_j}{\sqrt{\sum_{j=1}^{|F|} d_{ij}^2} \, \sqrt{\sum_{j=1}^{|F|} q_j^2}}$$

e.g., using:

$$d_{ij} := w_{ij} \Big/ \sqrt{\sum_k w_{ik}^2}$$

tf*idf formula:

$$w_{ij} := \frac{freq(f_j, d_i)}{\max_k freq(f_k, d_i)} \cdot \log \frac{\#docs}{\#docs \text{ with } f_j}$$

generalizes to multimedia search
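The ranking scheme on this slide can be sketched in plain Python. Documents are given as lists of terms, weights follow the tf*idf formula, and the ranking uses cosine similarity. The function names and the toy corpus in the usage note are illustrative assumptions.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each document (a list of terms) to normalized tf*idf weights d_ij."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))   # #docs with f_j
    vectors = []
    for doc in docs:
        freq = Counter(doc)
        max_freq = max(freq.values(), default=1)
        w = {t: (f / max_freq) * math.log(n / df[t]) for t, f in freq.items()}
        norm = math.sqrt(sum(x * x for x in w.values()))
        vectors.append({t: x / norm for t, x in w.items()} if norm else w)
    return vectors

def cosine(d, q):
    """Cosine similarity between two sparse feature vectors."""
    dot = sum(w * q.get(t, 0.0) for t, w in d.items())
    nd = math.sqrt(sum(w * w for w in d.values()))
    nq = math.sqrt(sum(w * w for w in q.values()))
    return dot / (nd * nq) if nd and nq else 0.0

def rank(docs, query):
    """Return (score, doc index) pairs sorted by descending relevance."""
    vectors = tfidf_vectors(docs)
    return sorted(((cosine(v, query), i) for i, v in enumerate(vectors)),
                  reverse=True)
```

For example, `rank([["chernoff", "bound", "theorem"], ["fermat", "last", "theorem"], ["chernoff", "model", "site"]], {"chernoff": 1.0, "bound": 1.0})` puts the first document on top, the third next, and the Fermat page last with a score of zero.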


Link Analysis for Content Authority

Query (set of weighted features): $q \in [0,1]^{|F|}$

Search engine

Ranking by descending relevance & authority

+ Consider in-degree and out-degree of Web nodes: Authority Rank ($d_i$) := stationary visit probability [$d_i$] in random walk on the Web

Reconciliation of relevance and authority based on ad hoc weighting
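The stationary visit probability of a random walk can be approximated by power iteration. The sketch below adds a random-jump probability so that the stationary distribution is well defined on any link graph. This damping step, the parameter values, and the function name `authority_rank` are assumptions added for illustration and are not taken from the slide.

```python
def authority_rank(links, jump=0.15, iterations=100):
    """Approximate stationary visit probabilities of a random surfer.

    links maps each page to the list of pages it points to. With probability
    `jump` the surfer moves to a uniformly chosen page (assumed damping).
    """
    nodes = sorted(set(links) | {v for targets in links.values() for v in targets})
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        nxt = {u: jump / n for u in nodes}
        for u in nodes:
            targets = links.get(u, [])
            if targets:
                share = (1.0 - jump) * rank[u] / len(targets)
                for v in targets:
                    nxt[v] += share
            else:                       # dead end: spread the mass uniformly
                share = (1.0 - jump) * rank[u] / n
                for v in nodes:
                    nxt[v] += share
        rank = nxt
    return rank
```

On the toy graph `{"a": ["c"], "b": ["c"], "c": ["a"]}`, page c collects the most authority, a follows because c links to it, and b ranks last because nothing links to it. An ad hoc reconciliation with relevance could then be a weighted sum of the two scores.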


Web Search Engines: State of the Art

q = "Chernoff theorem"

AltaVista: Fermat's last theorem. Previous topic. Next topic. ... URL: www-groups.dcs.st-and.ac.uk/~history/His...st_theorem.html

Google: ...strong convergence \cite{Chernoff}. \begin{theorem}\label{T1} Let... http://mpej.unige.ch/mp_arc/p/00-277

Yahoo: Moment-generating Functions; Chernoff's Theorem; The Kullback-... http://www.siam.org/catalog/mcc10/bahadur.htm

Lycos: SIAM Journal on Computing Volume 26, Number 2 Contents Fail-Stop Signatures ... http://epubs.siam.org/sam-bin/dbq/toc/SICOMP/26/2

Mathsearch: No matches found.

Northernlight: J. D. Biggins - Publications. Articles on the Branching Random Walk http://www.shef.ac.uk/~st1jdb/bibliog.html

Excite: The Official Web Site of Playboy Lingerie Model Mikki Chernoff http://www.mikkichernoff.com/


But There Is Hope

Observation: Starting from an (intellectually maintained) directory or from promising but still unsatisfactory query results, and exploring the neighborhood of these URLs, would eventually lead to useful documents

But intellectual time is expensive!

Research Avenues:

• Leverage advanced IR: automatic classification

• Organize information and leverage IT megatrends: XML


Ontologies and Statistical Learning for Content Classification

[Figure: category tree with categories $c_k$, e.g., Science > Mathematics > Probability and Statistics > Large Deviation, Hypotheses Testing, ...; Algebra appears next to Probability and Statistics; a training sample is attached to each category and new docs are to be assigned]

Feature space $[0,1]^{|F|}$: term frequencies $f_j$ ...

Naive Bayes classifier:

$$P[d \in c_k \mid f] = \frac{P[f \mid d \in c_k] \; P[d \in c_k]}{P[f]} = \binom{length(d)}{f_1 \; \ldots \; f_{|F|}} \, p_{1k}^{f_1} \cdots p_{|F|k}^{f_{|F|}} \, q_k \,/\, P[f]$$

with multinomial prior and estimated $p_{1k}, \ldots, p_{|F|k}, q_k$

or other classifiers (e.g., SVM)

assign to highest-likelihood category

Good for query expansion and user relevance feedback
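A multinomial Naive Bayes classifier of this kind fits in a short Python sketch. It works in log space and drops the multinomial coefficient and P[f], since both are the same for every category. Add-one smoothing and the skipping of unseen terms are assumptions added here, and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train(samples, smoothing=1.0):
    """samples: list of (terms, category). Returns per-category log q_k and log p_jk."""
    vocab = {t for terms, _ in samples for t in terms}
    doc_counts = Counter(category for _, category in samples)
    term_counts = defaultdict(Counter)
    for terms, category in samples:
        term_counts[category].update(terms)
    model = {}
    for category, n_docs in doc_counts.items():
        total = sum(term_counts[category].values()) + smoothing * len(vocab)
        log_p = {t: math.log((term_counts[category][t] + smoothing) / total)
                 for t in vocab}
        model[category] = (math.log(n_docs / len(samples)), log_p)
    return model

def classify(model, terms):
    """Assign the document to the highest-likelihood category."""
    def log_score(category):
        log_q, log_p = model[category]
        # terms never seen in training are ignored (assumed policy)
        return log_q + sum(log_p[t] for t in terms if t in log_p)
    return max(model, key=log_score)
```

Trained on a handful of labeled term lists for "Large Deviation" and "Algebra", `classify(model, ["chernoff", "tail", "bound"])` would pick the category whose training sample shares the most vocabulary with the new document.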


www.links2go.com: Chernoff theorem


For Better Search Quality: XML to the Rescue

<travelguide>
  <place> Zion National Park
    <location> Utah </>
    <activities> hiking, canyoneering </activities>
    <lodging>
      <lodge price = $ 80-140> Zion Lodge ... </>
      <motel price = $55> ... </>
    </lodging>
    <hikes>
      <hike type=backcountry, level=difficult> Kolob Creek ... class 5.2 ... </hike>
      ...
  </place>
  <place> Yosemite NP
    <location> California </>
    <activities> hiking, climbing ...

[Figure: the same data as a labeled tree: travelguide; place: Zion NP; location: Utah; activities: hiking, canyoneering; lodging with motel price=$55; hikes with hike type=backcountry, level=difficult: Kolob Creek ... class 5.2 ...; trip report]

Semistructured data: elements, attributes, links organized as labeled graph


Querying XML

Regular expressions over path labels + logical conditions over element contents

Example query: difficult hikes in affordable national parks

Select P
From //travelguide.com
Where place Like "%Park%" As P
And P.#.lodging.#.(price|rate) < $70
And P.#.hike+.level? Like "%difficult%"
And ... #.tripreport Like ...

[Figure: the query's path pattern (travelguide, place, lodging, hike) matched against the travel guide tree; the trip report reads "... many technical obstacles ... 15 feet dropoff ... need 100 feet rope ..."]
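To make the example query concrete, the sketch below evaluates a simplified version of it with Python's standard `xml.etree.ElementTree`. Several things are assumptions made for the illustration: the travel guide is rewritten as well-formed XML, prices are single numbers, the Yosemite details are invented, `price`, `rate`, and `level` are treated as attributes, and the `#` wildcard is approximated by searching all descendants.

```python
import xml.etree.ElementTree as ET

# Well-formed stand-in for the slide's travel guide (values partly made up).
GUIDE = """
<travelguide>
  <place>Zion National Park
    <location>Utah</location>
    <lodging>
      <lodge price="80">Zion Lodge</lodge>
      <motel price="55">Motel</motel>
    </lodging>
    <hikes>
      <hike type="backcountry" level="difficult">Kolob Creek</hike>
    </hikes>
  </place>
  <place>Yosemite NP
    <location>California</location>
    <lodging><lodge price="60">Lodge</lodge></lodging>
    <hikes><hike level="difficult">Hike</hike></hikes>
  </place>
</travelguide>
"""

def difficult_hikes_in_affordable_parks(root, max_price=70.0):
    """Places matching: name Like %Park%, some price|rate < max_price, a difficult hike."""
    results = []
    for place in root.iter("place"):
        name = (place.text or "").strip()
        if "Park" not in name:                      # place Like "%Park%"
            continue
        prices = [float(e.get(attr))
                  for lodging in place.iter("lodging")
                  for e in lodging.iter()
                  for attr in ("price", "rate")
                  if e.get(attr) is not None]
        affordable = any(p < max_price for p in prices)
        difficult = any("difficult" in (h.get("level") or "")
                        for h in place.iter("hike"))
        if affordable and difficult:
            results.append(name)
    return results

root = ET.fromstring(GUIDE.strip())
print(difficult_hikes_in_affordable_parks(root))
```

Only "Zion National Park" qualifies. "Yosemite NP" is rejected by the exact string condition even though it is a national park, which is the kind of miss that similarity-based matching is meant to avoid.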


XXL: Reconciling XML and IR Technologies

Result ranking of XML data based on semantic similarity

<travelguide>
  <place> Zion National Park
    <location> Utah </>
    <activities> hiking, canyoneering </activities>
    <lodging>
      <lodge price = $ 80-140> Zion Lodge ... </>
      <motel price = $55> ... </>
    </lodging>
    <hikes>
      <hike type=backcountry, level=difficult> Kolob Creek ... class 5.2 ... </hike>
      ...
  </place>
  <place> Yosemite NP
    <location> California </>
    <activities> hiking, climbing ...

<outdoors>
  <region> Zion NP
    <state> Utah </>
    <things-to-do> hiking, canyoneering </things-to-do>
    <campground fee = $15> ... </>
    <backcountry trips>
      <trip> Kolob Creek ... challenging hike ... </trip>
      ...
  </region>
  ...

Example query: difficult hikes in affordable national parks

Select P
From //all-roots
Where ~place ~ "Park" As P
And P.#.~lodge.#.~price < $70
And P.#.~hike ~ "difficult"
And P.#.~activities ~ "climbing"

[Figure: similar terms, e.g., climbing ~ canyoneering]
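The similarity operator `~` can be illustrated with a toy scoring function. Everything in this sketch is a hypothetical stand-in rather than XXL's actual machinery: the similarity table and its values are invented, documents are flattened to element-name/text pairs, and a result's score is simply the product of its per-condition scores.

```python
from math import prod

# Invented similarity values standing in for an ontology lookup.
SIMILARITY = {
    frozenset(("place", "region")): 0.9,
    frozenset(("hike", "trip")): 0.7,
    frozenset(("activities", "things-to-do")): 0.9,
    frozenset(("park", "np")): 0.8,
    frozenset(("difficult", "challenging")): 0.8,
    frozenset(("climbing", "canyoneering")): 0.7,
}

def sim(a, b):
    """Similarity of two terms: 1.0 if equal, table value otherwise, else 0.0."""
    a, b = a.lower(), b.lower()
    return 1.0 if a == b else SIMILARITY.get(frozenset((a, b)), 0.0)

def content_sim(text, term):
    """Best similarity between the term and any word of the element content."""
    return max((sim(word, term) for word in text.split()), default=0.0)

def condition_score(doc, label, term):
    """Score of one condition `~label ~ term` over all elements of a document."""
    return max((sim(name, label) * content_sim(text, term)
                for name, text in doc.items()), default=0.0)

def score(doc, conditions):
    """Overall relevance: product of the scores of all query conditions."""
    return prod(condition_score(doc, label, term) for label, term in conditions)

QUERY = [("place", "Park"), ("hike", "difficult"), ("activities", "climbing")]
GUIDE_DOC = {"place": "Zion National Park",
             "hike": "Kolob Creek difficult class 5.2",
             "activities": "hiking canyoneering"}
OUTDOORS_DOC = {"region": "Zion NP",
                "trip": "Kolob Creek challenging hike",
                "things-to-do": "hiking canyoneering"}
```

Here `score(GUIDE_DOC, QUERY)` is 0.7 and `score(OUTDOORS_DOC, QUERY)` is lower but still positive, so both documents appear in the ranked result. With exact matching, the second document would not be found at all.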


Ontologies, Statistical Learning, and XML: The Big Synergy

Research Avenue:

• build domain-specific and personalized ontologies

• leverage XML momentum!

• automatically classify XML documents into ontology

• expand query by mapping query into ontology, by adding category-specific path conditions, ex.: #.math?.#.(~large deviation)?.#.theorem.~Chernoff

• exploit user feedback

Research Issues:

• Which kind of ontology (tree, lattice, HOL, ...)?

• Which feature selection for classification? Which classifier?

• Information-theoretic foundation?

• Theory of "optimal" search results?


The Mega Challenge: Scalability

Observations:

• search engines cover only the "surface web": 1 billion docs, 20 TBytes

• most data is in the "deep web" behind gateways: 500 billion docs, 8 PBytes

Research Avenue: future search engines need a new paradigm in a new world with > 90% of information in XML and a "deep web" with > 90% of information behind gateways


Predictions for 2010

XXX (Cross-Web XML Explorer), aka Deep Search

• future search engines will combine pre-computation (enhanced with richer forms of indexing), additional path traversal starting from index seeds (topic-specific crawling with "semantic" pattern matching), and dynamic creation of subqueries at gateways

• should carry out large-scale experiments

• will be able to find, for every search, within one day and with < 1 min of intellectual effort, the results that the best human experts can find with infinite time

• should have a theory of search result "optimality"


Outline

Why I’m Speaking Here

Money-back Performance Guarantees

Observations and Predictions

Continuous Service Availability

Guaranteed Search Result Quality

• Summary of My Message


Strategic Research Avenues

inspired by Jim Gray's Turing Award lecture: trouble-free systems, always up, world memex

Conceivable killer arguments:

• Infinite RAM & network bandwidth and zero latency for free

• Smarter people don't need a better Web

Challenges for 2010:

• Self-tuning servers with response time guarantees by reviving stochastic modeling and combining it with "low-hanging fruit" engineering

• Continuously available servers by new theory of recovery contracts for multi-tier federations in combination with better engineering

• Breakthrough on search quality (incl. optimality theory?) from synergy of ontologies, statistical learning, and XML