Cloud based Web Intelligence
Uploaded by francois-pouilloux (category: Technology)
"How I Learned to Stop Worrying and Love the Bomb"
WIVE 2010
An unhinged computer scientist, known as TimBL, has invented the WWW, plunging the world into an information vortex…
Now everybody is fighting to prevent the knowledge apocalypse...
Recently, Dr. StrangeCloud, a former mainframe virtualization specialist, has been called to the rescue…
(story inspired by Kubrick's Dr. Strangelove)
Is Dr. StrangeCloud going to save the planet from the ever-increasing danger of death by information overload?
Virtual Organiza
Web Intelligence
within/for VE
resource sharing
Cloud based WI for VE
Looking for YACCP?
You'd better check the specialists instead…
Yet Another Cloud Computing Presentation? NOT (only)
"Intelligence on Web data"
(COMPSAC 2000, Taiwan)
as a research & application field
Information Technology
Artificial Intelligence
cloud computing
web mining
semantic web
web information retrieval
WEB INTELLIGENCE
our focus today
Crossing hot topics
Hot, mild or cold? (based on Wikipedia article popularity)
Web Intelligence: 1 632
Cloud Computing: 1 911 127
Lady Gaga: 11 344 529
Cumulated Wikipedia page views, Jan to June 2010 (source: access statistics from Wikipedia's squid cluster as compiled by http://stats.grok.se/)
[Chart: 6-month trends for the Wikipedia pages Web Intelligence, Cloud Computing and Lady Gaga; page views on a log scale from 100 to 10 000 000, Jan to Jun]
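The gap between these page-view totals is easier to read as ratios and orders of magnitude, which is also why the trend chart needs a log scale. A minimal sketch using only the three totals quoted on the slide; the function and variable names are illustrative.

```python
import math

# Cumulated Wikipedia page views, Jan to June 2010 (figures from the slide).
page_views = {
    "Web Intelligence": 1_632,
    "Cloud Computing": 1_911_127,
    "Lady Gaga": 11_344_529,
}

baseline = page_views["Web Intelligence"]


def summarize(views):
    """Return (topic, ratio to Web Intelligence, log10 of views) rows."""
    return [
        (topic, count / baseline, math.log10(count))
        for topic, count in views.items()
    ]


rows = summarize(page_views)
for topic, ratio, magnitude in rows:
    # The log10 column is the position of each topic on a log-scale axis.
    print(f"{topic:<18} x{ratio:>8.0f}   10^{magnitude:.1f}")
```

On these figures, Cloud Computing draws roughly a thousand times the views of Web Intelligence, and Lady Gaga roughly six times those of Cloud Computing.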
www.web-intelligence-rhone-alpes.org
But hot local
Web Intelligence
recipe
cloud based domain knowledge repository enrichment (use case in FP7 project proposal)
web crawler
semantic extractor
triple store
90M triples initially
manual input
2.5M triples/year
millions of crawls/month
publication in LOD cloud
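The recipe chains three stages: crawler, semantic extractor, triple store. A toy, in-memory sketch of that chain, not the project's implementation: the crawler is stubbed with hard-coded pages, the "semantic extractor" is a naive title regex, and `FAKE_WEB`, `fetch`, `extract_triples`, `TripleStore` and the `dc:title` predicate are names chosen here for illustration.

```python
import re

# Stub for the web crawler stage: a real crawler would fetch over HTTP.
FAKE_WEB = {
    "http://example.org/a": "<html><head><title>Page A</title></head></html>",
    "http://example.org/b": "<html><head><title>Page B</title></head></html>",
}


def fetch(url):
    """Return the (stubbed) HTML for a URL."""
    return FAKE_WEB[url]


TITLE_RE = re.compile(r"<title>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def extract_triples(url, html):
    """Naive extractor: yield one (subject, predicate, object) per page title."""
    match = TITLE_RE.search(html)
    if match:
        yield (url, "dc:title", match.group(1).strip())


class TripleStore:
    """Minimal set-backed triple store with wildcard pattern matching."""

    def __init__(self):
        self._triples = set()

    def add(self, triple):
        self._triples.add(triple)

    def match(self, s=None, p=None, o=None):
        """Return sorted triples matching the pattern; None is a wildcard."""
        return sorted(
            t for t in self._triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)
        )

    def __len__(self):
        return len(self._triples)


# Crawl, extract, store.
store = TripleStore()
for url in FAKE_WEB:
    for triple in extract_triples(url, fetch(url)):
        store.add(triple)
```

At the volumes quoted on the slide, each stage would be a distributed service rather than a loop, but the data flow stays the same.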
UKWA by The British Library
powered by IBM BigSheets on British Library private cloud
crawl, annotate, preserve
visual analysis & navigation
(demo on www.webarchive.org.uk/analytics/analytics.htm)
(details on news.cnet.com/8301-13846_3-10459507-62.html)
Public Terabyte Dataset by Bixo Labs
not yet available (09/2010)
big corpus ready for AWS based analysis (WI research, evaluation, ...)
50-200M pages from the 1M top US domains
powered by Bixo on AWS cloud
Elastic MapReduce
Hadoop
Tika
SimpleDB
S3
Avro
Cascading
the Web Intelligence paradox
All the Web data is at hand, ready for WI research and applications
2 simple steps:
1. pick up the data
2. process it with all those marvelous ML algorithms...
Wait a minute, it's not that simple! What about:
scale?
heterogeneity (aka "crappiness")?
copyright?
politeness?
Use the Semantic Web?
Looking for semantic annotations in 82k web pages (Squido production systems, 01/2010)
less than 3%
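A measurement of this kind can be approximated by scanning each page for annotation markers. A minimal sketch, with loud assumptions: the slide does not say which annotation formats were counted, so the RDFa attributes, microdata attributes and microformat class names below are choices made here for illustration, and `has_semantic_annotation` and `annotated_share` are hypothetical helper names.

```python
from html.parser import HTMLParser

# Assumed markers; the slide does not say which formats were counted.
RDFA_ATTRS = {"property", "typeof", "about"}
MICRODATA_ATTRS = {"itemscope", "itemprop", "itemtype"}
MICROFORMAT_CLASSES = {"vcard", "vevent", "hentry", "hreview"}


class AnnotationDetector(HTMLParser):
    """Sets `found` as soon as any start tag carries an assumed marker."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        names = {name for name, _ in attrs}
        classes = set()
        for name, value in attrs:
            if name == "class" and value:
                classes.update(value.split())
        if names & (RDFA_ATTRS | MICRODATA_ATTRS) or classes & MICROFORMAT_CLASSES:
            self.found = True


def has_semantic_annotation(html):
    """True if the page contains at least one assumed annotation marker."""
    detector = AnnotationDetector()
    detector.feed(html)
    return detector.found


def annotated_share(pages):
    """Fraction of pages with at least one annotation (0.0 for no pages)."""
    pages = list(pages)
    if not pages:
        return 0.0
    return sum(has_semantic_annotation(page) for page in pages) / len(pages)


sample_pages = [
    '<div class="vcard"><span class="fn">Jane Doe</span></div>',
    "<p>plain page</p>",
    '<span property="dc:title">A title</span>',
    '<p class="intro">no annotation here</p>',
]
print(f"{annotated_share(sample_pages):.0%} of sample pages are annotated")
```

The marker sets are the main design decision: a wider set raises the measured share, so any such percentage only makes sense together with the list of formats it counts.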
kind of real-world WI process
dedicated bandwidth
lots of memory
lots of I/O
lots of threads
lots of CPU
crawl
clean (ML, ...)
process (ML, ...)
millions of pages
typical load pattern for distributed computing
consider Cloud Computing when:
testing/calibrating
in production if no crawl limits
Load may scale up or down considerably with crawl size
1 .45 automatic.
2 boxes of ammunition.
4 days' concentrated emergency rations.
1 drug issue containing antibiotics, morphine, vitamin pills, pep pills, sleeping pills, tranquilizer pills.
1 miniature combination Russian phrase book and Bible.
100 dollars in rubles.
100 dollars in gold.
9 packs of chewing gum.
1 issue of prophylactics.
3 lipsticks.
3 pairs of nylon stockings.
Build from others' "Top 10 Lessons Learned from Deploying Hadoop in a Private Cloud"
(Rod Cope, OpenLogic's CTO, CloudSlam'10)
"Cloud computing is a trap" warns GNU founder Richard Stallman
"It's stupidity. It's worse than stupidity: it's a marketing hype campaign."
(www.guardian.co.uk/technology/2008/sep/29/cloud.computing.richard.stallman)
=> we can still consider private cloud + OSS
OSS stack for DC/DML under active development
(www.blackducksoftware.com/oss/projects/#cloud)
Cloud OSS on the rise
web-scale distributed crawl OSS not mature
(Heritrix Cluster Controller build server exception)
Compare prices
(Rod Cope, OpenLogic's CTO, CloudSlam'10)
$330k/year
$33k/year
Crawling is the launch pad in Web Intelligence
Don't take it easy!
Get yourself a decent crawler
customizable revisit policy?
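One common form of customizable revisit policy adapts the interval per page: revisit sooner when the last visit found a change, later when it did not. A minimal sketch of that idea, not the policy of any particular crawler; the bounds, the default factors and the names `RevisitState` and `next_interval` are assumptions made for illustration.

```python
from dataclasses import dataclass

# Assumed bounds on the revisit interval, in hours.
MIN_INTERVAL_H = 1.0
MAX_INTERVAL_H = 24.0 * 30


@dataclass
class RevisitState:
    """Per-URL revisit interval, in hours."""
    interval_h: float = 24.0


def next_interval(state, changed, speedup=0.5, backoff=2.0):
    """Shrink the interval after a change, grow it otherwise, within bounds."""
    factor = speedup if changed else backoff
    interval = state.interval_h * factor
    interval = min(MAX_INTERVAL_H, max(MIN_INTERVAL_H, interval))
    return RevisitState(interval_h=interval)


# A page that changed twice, then stayed stable for three visits.
state = RevisitState()
for changed in (True, True, False, False, False):
    state = next_interval(state, changed)
print(f"next revisit in {state.interval_h:g} hours")
```

The two factors and the two bounds are the customization points: a news site would get a small minimum interval, an archive a large maximum.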
Crawling by millions is not trivial...
many large objects in memory: transient? persistent?
www crappiness means endless ugly special cases
politeness is challenging
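At its simplest, politeness means two checks before each fetch: is the URL allowed by the host's robots.txt, and has enough time passed since the last request to that host? A minimal single-process sketch using the standard library's `urllib.robotparser`; the `PolitenessGate` class, the `example-crawler` user agent and the one-second default delay are assumptions made here, and a distributed crawler would also need to share this state across workers.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-crawler"  # hypothetical user agent name
DEFAULT_DELAY_S = 1.0           # assumed delay when robots.txt gives none


class PolitenessGate:
    """Per-host robots.txt rules plus a minimum delay between fetches."""

    def __init__(self, default_delay=DEFAULT_DELAY_S):
        self.default_delay = default_delay
        self._robots = {}        # host -> RobotFileParser
        self._next_allowed = {}  # host -> earliest time of the next fetch

    def set_robots(self, host, robots_txt):
        """Register the already-downloaded robots.txt text for a host."""
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self._robots[host] = parser

    def delay_for(self, host):
        """Crawl-delay from robots.txt if present, else the default."""
        parser = self._robots.get(host)
        delay = parser.crawl_delay(USER_AGENT) if parser else None
        return float(delay) if delay is not None else self.default_delay

    def try_fetch_slot(self, url, now):
        """Return True and reserve the host's slot if `url` may be fetched at `now`."""
        host = urlsplit(url).netloc
        parser = self._robots.get(host)
        if parser is not None and not parser.can_fetch(USER_AGENT, url):
            return False
        if now < self._next_allowed.get(host, 0.0):
            return False
        self._next_allowed[host] = now + self.delay_for(host)
        return True


gate = PolitenessGate()
gate.set_robots(
    "example.org",
    "User-agent: *\nCrawl-delay: 5\nDisallow: /private/\n",
)
```

Passing `now` explicitly instead of reading the clock inside the gate keeps the sketch deterministic; a crawler would pass `time.monotonic()`.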
DDoS is around the corner with (poor) cloud based crawling
(ken-blog.krugler.org)
1,264,539 URLs from 41,978 unique domains
10-slave cluster
4000 active fetch threads max
Infrastructure is not always key to performance
fetch rate drops over time
brute force
Organic effect of politeness on performance
opportunity to scale down!
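One plausible reading of this effect: if each domain accepts at most one fetch per politeness interval, the achievable fetch rate is capped by the number of domains that still have URLs left, and that number shrinks as small domains are exhausted. With 1,264,539 URLs over 41,978 domains (about 30 URLs per domain on average), a long tail of large domains ends up setting the pace whatever the thread count. A toy simulation of that reading; the domain sizes and thread cap below are invented for illustration and are not the figures of the crawl cited on the slide.

```python
def simulate_fetch_rate(urls_per_domain, max_threads):
    """Return URLs fetched per politeness tick.

    Each tick, every domain with URLs left may serve one fetch,
    and at most `max_threads` fetches run in total.
    """
    remaining = list(urls_per_domain)
    rates = []
    while any(remaining):
        fetched = 0
        for i, left in enumerate(remaining):
            if fetched == max_threads:
                break  # thread cap reached for this tick
            if left > 0:
                remaining[i] -= 1
                fetched += 1
        rates.append(fetched)
    return rates


# Invented workload: six tiny domains, three small ones, one large one.
domains = [1] * 6 + [3] * 3 + [10]
rates = simulate_fetch_rate(domains, max_threads=8)
print(rates)
```

In this toy run the thread cap binds only during the first tick; after that the active-domain count is the limit, which is where adding infrastructure stops helping and scaling down becomes an option.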
a. Cloud Computing is worth considering for WI
b. Have a cloud survival kit
c. Consider private cloud & OSS
d. Compare prices
e. Get yourself a decent crawler
f. Don't turn into a DDoS
g. Infrastructure is not always key to performance
"SaaS intelligence on web data, for professionals"
collect, filter, share, analyse, monitor
www.squido.fr
www.ixxo.fr
www.slideshare.net/fpouilloux
www.linkedin.com/in/fpouilloux
Photos:
1. National Nuclear Security Administration/Nevada Site Office
2. Dr. Strangelove/Original film poster by Tomi Ungerer
3. Dr. Strangelove/movie still
4. Dr. Strangelove/movie still
6. cloudslam10.com/Gartner keynote slide, National Institute of Standards and Technology web site screenshot
7. cia.gov/OHB lobby seal picture
8. amazon.com/Computational Web Intelligence book cover
10. Wikimedia Commons/Lady Gaga by petercruise
12. Wikimedia Commons/Operation Crossroads Baker in color.jpg
13. Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
14. flickr/British Library III/jovike, ibm.com/The_British_Library_and_IBM_Bi.jpg
16. Dr. Strangelove/movie still
21. Wikimedia Commons/Castle Bravo Blast.jpg
22. Dr. Strangelove/movie still
23. cloudslam10.com/OpenLogic slide
24. Dr. Strangelove/movie still
25. Wikimedia Commons/RMS iGNUcius techfest iitb.JPG
27. cloudslam10.com/OpenLogic slide
28. Wikimedia Commons/Peacekeeper_missile_after_silo_launch.jpg
31. kkrugler.files.wordpress.com/2009/05/fetch-performance2.png
32. Dr. Strangelove/movie still
Websites:
wikipedia.org
www.emse.fr/wive/
csrc.nist.gov
cloudslam10.com
www.web-intelligence-rhone-alpes.org
stats.grok.se
www.ibm.com/software/ebusiness/jstart/bigsheets
bixolabs.com/datasets/public-terabyte-dataset-project
www.openlogic.com
www.blackducksoftware.com
crawler.archive.org
www.apache.org
twitter.com
ken-blog.krugler.org