Cloud based Web Intelligence

"How I Learned to Stop Worrying and Love the Bomb" WIVE 2010

description

Presentation given at the 2nd International Workshop on Web Intelligence & Virtual Enterprises (WIVE'10), held at the 11th IFIP Working Conference on Virtual Enterprises (PRO-VE'10) http://www.emse.fr/wive/

Transcript of Cloud based Web Intelligence

Page 1: Cloud based Web Intelligence

"How I Learned to Stop Worrying and Love the Bomb"

WIVE 2010

Page 2: Cloud based Web Intelligence

An unhinged computer scientist, known as TimBL, has invented the WWW, plunging the world into an information vortex…

Now everybody is fighting to prevent the knowledge apocalypse...

Recently, Dr. StrangeCloud, a former mainframe virtualization specialist, has been called to the rescue…

(story inspired by Kubrick's Dr. Strangelove)

Page 3: Cloud based Web Intelligence

Is Dr. StrangeCloud going to save the planet from the ever-increasing danger of death by information overload?

Page 4: Cloud based Web Intelligence
Page 5: Cloud based Web Intelligence

Virtual Organization

Web Intelligence

within/for VE

resource sharing

Cloud based WI for VE

Page 6: Cloud based Web Intelligence

Looking for YACCP ?

You'd better check the specialists instead…

Yet Another Cloud Computing Presentation?

Page 7: Cloud based Web Intelligence

NOT (only)

"Intelligence on Web data"

Page 8: Cloud based Web Intelligence

(COMPSAC 2000, Taiwan)

as a research & application field

Page 9: Cloud based Web Intelligence

Crossing hot topics

Information Technology

Artificial Intelligence

cloud computing

web mining

semantic web

web information retrieval

WEB INTELLIGENCE

our focus today

Page 10: Cloud based Web Intelligence

Hot, mild or cold? (based on Wikipedia article popularity)

Web Intelligence

1 632

Cloud Computing

1 911 127

Lady Gaga

11 344 529

Cumulative Wikipedia page views, January to June 2010 (source: access statistics from Wikipedia's squid cluster as compiled by http://stats.grok.se/)

6-month trends for the Wikipedia pages Web Intelligence, Cloud Computing and Lady Gaga

(chart: page views from 100 to 10 000 000 on a power-of-ten axis, January to June)
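
To put the three counts on this slide in proportion, the ratios can be worked out directly. This is a minimal sketch in Python; the variable names are illustrative, and the figures are the ones quoted on the slide.

```python
import math

# Cumulative Wikipedia page views, January to June 2010 (figures from the slide).
page_views = {
    "Web Intelligence": 1_632,
    "Cloud Computing": 1_911_127,
    "Lady Gaga": 11_344_529,
}

baseline = page_views["Web Intelligence"]
for topic, views in page_views.items():
    ratio = views / baseline        # times more views than Web Intelligence
    magnitude = math.log10(views)   # position on a power-of-ten axis
    print(f"{topic:17s} {views:>12,d}  x{ratio:,.0f}  10^{magnitude:.1f}")
```

Cloud Computing drew roughly 1,170 times the views of Web Intelligence, and Lady Gaga about six times those of Cloud Computing, which is presumably why the chart uses a power-of-ten axis.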

Page 11: Cloud based Web Intelligence

www.web-intelligence-rhone-alpes.org

But hot local Web Intelligence recipe

Page 12: Cloud based Web Intelligence
Page 13: Cloud based Web Intelligence

cloud based domain knowledge repository enrichment (use case in FP7 project proposal)

web crawler

semantic extractor

triple store

90M triples initially

manual input

2.5M triples/year

millions of crawls/month

publication in LOD cloud
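
The crawler, extractor and triple store chain on this slide can be sketched in a few lines. The slide does not say how the proposed system is built, so everything below is an assumption made for illustration: the class and function names, the `hasTitle` and `linksTo` predicates, and the use of a plain Python set as the "triple store". The crawler is replaced by a hard-coded page so the sketch runs without network access.

```python
from html.parser import HTMLParser


class LinkAndTitleExtractor(HTMLParser):
    """Toy 'semantic extractor': turns a page's title and links into triples."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.triples = set()
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.triples.add((self.page_url, "linksTo", href))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title and data.strip():
            self.triples.add((self.page_url, "hasTitle", data.strip()))


def enrich(triple_store, crawled_pages):
    """Feed crawled (url, html) pairs through the extractor into the store."""
    for url, html in crawled_pages:
        extractor = LinkAndTitleExtractor(url)
        extractor.feed(html)
        triple_store |= extractor.triples
    return triple_store


# Stand-in for the crawler output: a real crawler would fetch these pages.
pages = [(
    "http://example.org/",
    "<html><title>Example</title>"
    "<a href='http://example.org/about'>About</a></html>",
)]
store = enrich(set(), pages)
for triple in sorted(store):
    print(triple)
```

At the volumes quoted on the slide, each of the three stages would be a distributed service rather than an in-process call; the sketch only shows how data flows between them.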

Page 14: Cloud based Web Intelligence

(demo on www.webarchive.org.uk/analytics/analytics.htm)

(details on news.cnet.com/8301-13846_3-10459507-62.html)

UKWA by The British Library

powered by

IBM BigSheets

on British Library private cloud

crawl, annotate, preserve

visual analysis & navigation

Page 15: Cloud based Web Intelligence

Public Terabyte Dataset by Bixo Labs

not yet available (09/2010)

big corpus ready for AWS based analysis

(WI research, evaluation, ...)

powered by

Bixo

on AWS cloud

Elastic MapReduce

Hadoop

Tika

SimpleDB

S3

Avro

50-200M pages from the 1M top US domains

Cascading

Page 16: Cloud based Web Intelligence
Page 17: Cloud based Web Intelligence

the Web Intelligence paradox

All the Web data is at hand, ready for WI research and applications

2 simple steps:

1. pick up the data

2. process it with all those marvelous ML algorithms...

Wait a minute, it's not that simple ! What about :

scale ? heterogeneity?

(aka "crappiness")

copyright ?

politeness ?

Page 18: Cloud based Web Intelligence

Looking for semantic annotations in 82k web pages (Squido production systems, 01/2010)

Use the Semantic Web?

less than 3%
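
The slide does not say which annotation formats were counted or how they were detected. As one possible approach, the sketch below flags a page as annotated when any tag carries an RDFa or HTML microdata attribute, then computes the share of annotated pages. The attribute list, the function names and the sample pages are assumptions; microformats, which rely on class names, are not covered.

```python
from html.parser import HTMLParser

# Attribute names used by RDFa and HTML microdata markup (assumed selection).
SEMANTIC_ATTRS = {"typeof", "property", "vocab",
                  "itemscope", "itemprop", "itemtype"}


class AnnotationDetector(HTMLParser):
    """Sets `annotated` once any tag carries a semantic attribute."""

    def __init__(self):
        super().__init__()
        self.annotated = False

    def handle_starttag(self, tag, attrs):
        if any(name in SEMANTIC_ATTRS for name, _ in attrs):
            self.annotated = True


def has_semantic_annotations(html):
    detector = AnnotationDetector()
    detector.feed(html)
    return detector.annotated


def annotated_share(pages):
    """Fraction of pages carrying at least one semantic attribute."""
    if not pages:
        return 0.0
    return sum(has_semantic_annotations(page) for page in pages) / len(pages)


# Made-up sample pages, not the 82k-page corpus mentioned on the slide.
sample = [
    "<div itemscope itemtype='http://example.org/Person'>"
    "<span itemprop='name'>Ada</span></div>",
    "<p>Plain page</p>",
    "<p>Another plain page</p>",
    "<span property='dc:title'>RDFa title</span>",
]
share = annotated_share(sample)
print(f"{share:.0%} of sample pages carry semantic annotations")
```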

Page 19: Cloud based Web Intelligence

kind of real-world WI process

dedicated bandwidth

lots of memory

lots of I/O

lots of threads

lots of CPU

crawl

clean (ML, ...)

process (ML, ...)

millions of pages

Page 20: Cloud based Web Intelligence

typical load pattern for distributed computing

Page 21: Cloud based Web Intelligence

consider Cloud Computing when

testing/calibrating

in production if no crawl limits

Load may scale up or down considerably with crawl size

Page 22: Cloud based Web Intelligence

1 .45 automatic.

2 boxes of ammunition.

4 days' concentrated emergency rations.

1 drug issue containing antibiotics, morphine, vitamin pills, pep pills, sleeping pills, tranquilizer pills.

1 miniature combination Russian phrase book and Bible.

100 dollars in rubles.

100 dollars in gold.

9 packs of chewing gum.

1 issue of prophylactics.

3 lipsticks.

3 pairs of nylon stockings.

Page 23: Cloud based Web Intelligence

Build from others' Top 10 Lessons Learned from Deploying Hadoop in a Private Cloud

(Rod Cope, OpenLogic's CTO, CloudSlam'10)

Page 24: Cloud based Web Intelligence
Page 25: Cloud based Web Intelligence

"Cloud computing is a trap" warns GNU founder Richard Stallman

"It's stupidity. It's worse than stupidity: it's a marketing hype campaign."

(www.guardian.co.uk/technology/2008/sep/29/cloud.computing.richard.stallman)

=> we can still consider private cloud+OSS

Page 26: Cloud based Web Intelligence

Cloud OSS on the rise

OSS stack for DC/DML under active development

(www.blackducksoftware.com/oss/projects/#cloud)

web-scale distributed crawl OSS not mature

(Heritrix Cluster Controller build server exception)

Page 27: Cloud based Web Intelligence

Compare prices

(Rod Cope, OpenLogic's CTO, CloudSlam'10)

$330k/year

$33k/year

Page 28: Cloud based Web Intelligence

Crawling is the launch pad in Web Intelligence

Don't take it easy!

Get yourself a decent crawler

Page 29: Cloud based Web Intelligence

Crawling by millions is not trivial...

customizable revisit policy?

many large objects in memory: transient? persistent?

www crappiness means endless ugly special cases

politeness is challenging

Page 30: Cloud based Web Intelligence

DDoS is around the corner with (poor) cloud based crawling
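
One common way to keep a crawler from hammering a site is to enforce a minimum delay between two requests to the same host. The sketch below is an assumption about how that could look, not the crawler discussed in the talk: the `PoliteScheduler` class, its method names and the default delay are all hypothetical. The clock is injectable so the behaviour can be checked without sleeping.

```python
import time
from urllib.parse import urlsplit


class PoliteScheduler:
    """Tracks, per host, the earliest time the next request may be sent."""

    def __init__(self, delay_seconds=1.0, clock=time.monotonic):
        self.delay_seconds = delay_seconds
        self.clock = clock
        self._next_allowed = {}  # host -> earliest allowed fetch time

    def wait_time(self, url):
        """Seconds to wait before `url` may be fetched (0.0 if allowed now)."""
        host = urlsplit(url).netloc
        return max(0.0, self._next_allowed.get(host, 0.0) - self.clock())

    def record_fetch(self, url):
        """Call right after fetching `url` to start that host's delay."""
        host = urlsplit(url).netloc
        self._next_allowed[host] = self.clock() + self.delay_seconds


# Usage with a fake clock, so the example is deterministic.
now = [100.0]
scheduler = PoliteScheduler(delay_seconds=2.0, clock=lambda: now[0])
scheduler.record_fetch("http://example.org/a")
print(scheduler.wait_time("http://example.org/b"))     # same host: must wait
print(scheduler.wait_time("http://other.example/"))    # other host: no wait
```

In a cloud deployment the per-host state has to be shared between workers, or each host assigned to a single worker. Otherwise every node is polite on its own while the cluster as a whole is not.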

Page 31: Cloud based Web Intelligence

Infrastructure is not always key to performance

(ken-blog.krugler.org)

1,264,539 URLs from 41,978 unique domains

10-slave cluster

4000 active fetch threads max

fetch rate drops over time

brute force

Organic effect of politeness on performance

opportunity to scale down!
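
Why a polite crawl slows down regardless of cluster size can be illustrated with a toy model. It is an assumption made for illustration, not a reproduction of the measurements on this slide: each tick allows at most one fetch per domain, capped by the number of threads, so once the small domains are exhausted only a few large ones remain and most threads sit idle. The function name and the sample domain sizes are hypothetical.

```python
def polite_fetch_rates(urls_per_domain, max_threads):
    """Return the number of pages fetched at each tick of a toy polite crawl."""
    remaining = list(urls_per_domain)
    rates = []
    while any(remaining):
        # Politeness: at most one fetch per domain per tick,
        # and never more fetches than there are threads.
        active = [i for i, count in enumerate(remaining) if count > 0]
        active = active[:max_threads]
        for i in active:
            remaining[i] -= 1
        rates.append(len(active))
    return rates


# Seven domains of uneven size, crawled with four threads.
rates = polite_fetch_rates([1, 1, 1, 2, 2, 5, 10], max_threads=4)
print(rates)
```

The rate starts at the thread limit and decays to one fetch per tick as the crawl is left with a single large domain. Adding threads would not change that tail, which is the scale-down opportunity the slide points at.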

Page 32: Cloud based Web Intelligence

a. Cloud Computing is worth considering for WI

b. Have a cloud survival kit

c. Consider private cloud & OSS

d. Compare prices

e. Get yourself a decent crawler

f. Don't turn into a DDoS

g. Infrastructure is not always key to performance

Page 33: Cloud based Web Intelligence

"SaaS intelligence on web data, for professionals"

collect, filter, share, analyse, monitor

www.squido.fr

Page 34: Cloud based Web Intelligence

www.ixxo.fr

www.slideshare.net/fpouilloux

www.linkedin.com/in/fpouilloux

Page 35: Cloud based Web Intelligence

Photos:

1. National Nuclear Security Administration/Nevada Site Office

2. Dr. Strangelove/Original film poster by Tomi Ungerer

3. Dr. Strangelove/movie still

4. Dr. Strangelove/movie still

6. cloudslam10.com/Gartner keynote slide, National Institute of Standards and Technology web site screenshot

7. cia.gov/OHB lobby seal picture

8. amazon.com/Computational Web Intelligence book cover

10. Wikimedia Commons/Lady Gaga by petercruise

12. Wikimedia Commons/Operation Crossroads Baker in color.jpg

13. Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/

14. flickr/British Library III/jovike, ibm.com/The_British_Library_and_IBM_Bi.jpg

16. Dr. Strangelove/movie still

21. Wikimedia Commons/Castle Bravo Blast.jpg

22. Dr. Strangelove/movie still

23. cloudslam10.com/OpenLogic slide

24. Dr. Strangelove/movie still

25. Wikimedia Commons/RMS iGNUcius techfest iitb.JPG

27. cloudslam10.com/OpenLogic slide

28. Wikimedia Commons/Peacekeeper_missile_after_silo_launch.jpg

31. kkrugler.files.wordpress.com/2009/05/fetch-performance2.png

32. Dr. Strangelove/movie still

Websites: wikipedia.org

www.emse.fr/wive/

csrc.nist.gov

cloudslam10.com

www.web-intelligence-rhone-alpes.org

stats.grok.se

www.ibm.com/software/ebusiness/jstart/bigsheets

bixolabs.com/datasets/public-terabyte-dataset-project

www.openlogic.com

www.blackducksoftware.com

crawler.archive.org

www.apache.org

twitter.com

ken-blog.krugler.org