Elastic Web Mining
Web Mining in the Cloud
Ken Krugler, Bixo Labs, Inc.
ACM Silicon Valley Data Mining Camp
01 November 2009
Hadoop/Cascading/Bixo in EC2
About me
Background in vertical web crawl
– Krugle search engine for open source code
– Bixo open source web mining toolkit
Consultant for companies using EC2
– Web mining
– Data processing
Founder of Bixo Labs
– Elastic web mining platform
– http://bixolabs.com
Typical Data Mining
Data Mining Victory!
Meanwhile, Over at McAfee…
Web Mining 101
Extracting & Analyzing Web Data
More Than Just Search
Business intelligence, competitive intelligence, events, people, companies, popularity, pricing, social graphs, Twitter feeds, Facebook friends, support forums, shopping carts…
4 Steps in Web Mining
Collect - fetch content from web
Parse - extract data from formats
Analyze - tokenize, rate, classify, cluster
Produce - “useful data”
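The four steps can be sketched end to end on a single page. This is a minimal illustration, not code from the talk: the in-memory `PAGES` dictionary stands in for real HTTP fetching, and all function names are hypothetical.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Parse step helper: collects text nodes and ignores the markup."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def collect(pages, url):
    # Collect: a real crawler fetches over HTTP; `pages` is an in-memory stand-in.
    return pages[url]


def parse(html):
    # Parse: extract plain text from the HTML format.
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    return " ".join(extractor.chunks)


def analyze(text):
    # Analyze: tokenize into lowercase words and count them.
    return Counter(re.findall(r"[a-z]+", text.lower()))


def produce(counts, top=3):
    # Produce: the "useful data" here is just the most frequent terms.
    return counts.most_common(top)


PAGES = {
    "http://example.com/": "<html><body><h1>Web mining</h1>"
                           "<p>Mining the web at scale</p></body></html>"
}
result = produce(analyze(parse(collect(PAGES, "http://example.com/"))))
```

In a real workflow each step runs over millions of pages in parallel, which is what the rest of the stack is for.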
Web Mining versus Data Mining
Scale - 10 million isn’t a big number
Access - public but restricted
– Special implicit rules apply
Structure - not much
How to Mine Large Scale Web Data?
Start with scalable map-reduce platform
Add a workflow API layer
Mix in a web crawling toolkit
Write your custom data processing code
Run in an elastic cloud environment
One Solution - the HECB Stack
Bixo
Cascading
Hadoop
EC2
EC2 - Amazon Elastic Compute Cloud
True cost of non-cloud environment
– Cost of servers & networking (2 year life)
– Cost of colo (6 servers/rack)
– Cost of ops salary (15% of FTE/cluster)
– Managing servers is no fun
Web mining is perfect for the cloud
– “bursty” => savings are even greater
– Data is distilled, so no transfer $$$ pain
Why Hadoop?
Perfect for processing lots of data
– Map-reduce
– Distributed file system
Open source, large community, etc.
Runs well in EC2 clusters
Elastic Map Reduce as option
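For readers new to map-reduce, the model can be sketched in a few lines of single-process Python. This is not Hadoop's API; the function names are hypothetical, and the in-memory shuffle stands in for what Hadoop does across the cluster.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    # Map: emit one (key, value) pair per word in the document.
    for word in text.lower().split():
        yield word, 1


def shuffle(pairs):
    # Shuffle: group all values by key (Hadoop does this between map and reduce).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    # Reduce: collapse the values for one key into a single result.
    return key, sum(values)


def run_job(docs):
    mapped = (pair for doc_id, text in docs.items()
              for pair in map_phase(doc_id, text))
    return dict(reduce_phase(key, values)
                for key, values in shuffle(mapped).items())
```

Calling `run_job({"a": "web mining", "b": "web data"})` counts each word across both documents. Because map calls are independent per document and reduce calls are independent per key, the same job can be spread over many machines.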
Why Cascading?
API on top of Hadoop
Supports efficient, reliable workflows
Reduces painful low-level MR details
Build workflow using “pipe” model
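The “pipe” idea, where a workflow is assembled from small stages that records flow through, can be illustrated with plain Python generators. This sketch only mimics the concept; `pipe`, `each` and `keep` are hypothetical helpers, not Cascading classes or methods.

```python
from functools import reduce


def each(fn):
    """Stage that applies fn to every record in the stream."""
    def stage(stream):
        return (fn(record) for record in stream)
    return stage


def keep(predicate):
    """Stage that drops records for which predicate is false."""
    def stage(stream):
        return (record for record in stream if predicate(record))
    return stage


def pipe(*stages):
    """Chain stages so the output of one becomes the input of the next."""
    def run(records):
        return reduce(lambda stream, stage: stage(stream), stages, records)
    return run


# Assemble a small flow: trim whitespace, drop empty records, lowercase.
clean_terms = pipe(each(str.strip), keep(bool), each(str.lower))
```

Running `list(clean_terms(["  Hadoop ", "", "EC2"]))` trims, filters and lowercases the records in order. The workflow is described once as a chain of stages, and the low-level execution details stay out of the application code.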
Why Bixo?
Plugs into Cascading-based workflow
– Scales with Hadoop cluster
– Runs well in EC2
Handles grungy web crawling details
– Polite yet efficient fetching
– Errors, web servers that lie
– Parsing lots of formats, broken HTML
Open source toolkit for web mining apps
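One way to picture “polite yet efficient” fetching: space out requests to the same host, but let different hosts proceed in parallel. The sketch below only computes a schedule and fetches nothing. The 30-second delay and the function name are assumptions for illustration, not Bixo's actual policy or API.

```python
from urllib.parse import urlsplit


def schedule_fetches(urls, delay_seconds=30.0):
    """Return (start_time, url) pairs, spacing same-host requests by delay_seconds."""
    next_free = {}  # host -> earliest time the next request may start
    schedule = []
    for url in urls:
        host = urlsplit(url).hostname
        start = next_free.get(host, 0.0)
        schedule.append((start, url))
        next_free[host] = start + delay_seconds
    # Sorting by start time gives the order in which fetches would be issued.
    return sorted(schedule)
```

For two URLs on `a.com` and one on `b.com`, the first request to each host starts at time 0 and the second `a.com` request waits 30 seconds, so no single server is hammered while overall throughput stays high.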
SEO Keyword Data Mining
Example of typical web mining task
Find common keywords (1,2,3 word terms)
– Do domain-centric web crawl
– Parse pages to extract title, meta, h1, links
– Output keywords sorted by frequency
Compare to competitor site(s)
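The parse step of this task (extracting title, meta, h1 and links) can be sketched with Python's standard `html.parser`. This is an illustration only; the talk's workflow uses Bixo's parsing, and the class and function names here are hypothetical.

```python
from html.parser import HTMLParser


class KeyFieldParser(HTMLParser):
    """Collects title text, meta keywords/description, h1 text and link targets."""

    def __init__(self):
        super().__init__()
        self.fields = {"title": [], "meta": [], "h1": [], "links": []}
        self._current = None  # "title" or "h1" while inside that element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("title", "h1"):
            self._current = tag
        elif tag == "meta":
            name = (attrs.get("name") or "").lower()
            if name in ("keywords", "description"):
                self.fields["meta"].append(attrs.get("content") or "")
        elif tag == "a" and attrs.get("href"):
            self.fields["links"].append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current and data.strip():
            self.fields[self._current].append(data.strip())


def extract_fields(html):
    parser = KeyFieldParser()
    parser.feed(html)
    parser.close()
    return parser.fields
```

The extracted links feed the next round of the crawl, while the title, meta and h1 text feed keyword generation.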
Workflow
Custom Code for Example
Filtering URLs inside domain
– Non-English content
– User-generated content (forums, etc)
Generating keywords from text
– Special tokenization
– One, two, three word phrases
But 95% of code was generic
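The two custom pieces can be sketched as follows. This is a simplified illustration under stated assumptions: the URL filter only checks the domain (the language and user-generated-content filters from the slide are not shown), the tokenizer is a plain alphanumeric split rather than the “special tokenization” mentioned, and all names are hypothetical.

```python
import re
from collections import Counter
from urllib.parse import urlsplit


def in_domain(url, domain):
    """True if the URL's host is the domain itself or one of its subdomains."""
    host = (urlsplit(url).hostname or "").lower()
    return host == domain or host.endswith("." + domain)


def ngrams(text, max_words=3):
    """Yield every one-, two- and three-word phrase found in the text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    for n in range(1, max_words + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])


def keyword_counts(texts, max_words=3):
    """Count phrases across all texts; most_common() then sorts by frequency."""
    counts = Counter()
    for text in texts:
        counts.update(ngrams(text, max_words))
    return counts
```

For example, `keyword_counts(["cheap flights", "cheap hotels"]).most_common(1)` puts `cheap` first with a count of 2.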
End Result in Data Mining Tool
What Next?
Another example - mining mailing lists
Go straight to Summary/Q&A
Talk about Public Terabyte Dataset
Write tweets, posts & emails
Find people to meet in the lobby
Another Example - HUGMEE
Hadoop Users who Generate the Most Effective Emails
Helpful Hadoopers
Use mailing list archives for data (collect)
Parse mbox files and emails (parse)
Score based on key phrases (analyze)
End result is score/name pair (produce)
Scoring Algorithm
Very sophisticated point system
“thanks” == 5
“owe you a beer” == 50
“worship the ground you walk on” == 100
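The point system maps directly onto a lookup table. In this minimal sketch the phrase values come from the slide; the case-insensitive substring matching and the decision to count every occurrence are assumptions.

```python
# Points per key phrase, as listed on the slide.
PHRASE_POINTS = {
    "thanks": 5,
    "owe you a beer": 50,
    "worship the ground you walk on": 100,
}


def score_text(text):
    """Sum the points for every occurrence of each key phrase (case-insensitive)."""
    lowered = text.lower()
    return sum(points * lowered.count(phrase)
               for phrase, points in PHRASE_POINTS.items())
```

A reply reading “Thanks! I owe you a beer.” scores 5 + 50 = 55 under these assumptions.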
High Level Steps
Collect emails
– Fetch mod_mbox generated page
– Parse it to extract links to mbox files
– Fetch mbox files
– Split into separate emails
Parse emails
– Extract key headers (messageId, email, etc)
– Parse body to identify quoted text
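The “parse emails” step can be sketched with Python's standard `email` package. This assumes a single plain-text message that has already been split out of the mbox file, and treats any line starting with `>` as quoted text; the function name and the returned field names are hypothetical.

```python
from email import message_from_string
from email.utils import parseaddr


def parse_email(raw):
    """Extract key headers and separate new text from quoted text."""
    msg = message_from_string(raw)
    body = msg.get_payload()
    # Multipart messages return a list of parts; this sketch skips them.
    lines = body.splitlines() if isinstance(body, str) else []
    quoted = [line for line in lines if line.lstrip().startswith(">")]
    fresh = [line for line in lines if not line.lstrip().startswith(">")]
    return {
        "message_id": msg.get("Message-ID"),
        "in_reply_to": msg.get("In-Reply-To"),
        "email": parseaddr(msg.get("From", ""))[1],
        "new_text": "\n".join(fresh),
        "quoted_text": "\n".join(quoted),
    }
```

Separating quoted text matters for the analysis that follows: key phrases should be credited only when they appear in what the replier actually wrote, not in the text being quoted.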
High Level Steps
Analyze emails
– Find key phrases in replies (ignore signoff)
– Score emails by phrases
– Group & sum by message ID
– Group & sum by email address
Produce ranked list
– Toss email addresses with no love
– Sort by summed score
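The two group-and-sum passes and the final ranking can be sketched in plain Python. This reflects one reading of the slide, namely that points found in a reply are credited to the message being replied to, and then to that message's author. The function name and input shapes are hypothetical; in the talk these steps run as a Cascading workflow on Hadoop.

```python
from collections import defaultdict


def rank_authors(authors_by_message, reply_scores):
    """Rank authors by the total points their messages earned in replies.

    authors_by_message: message ID -> author email address
    reply_scores: iterable of (replied-to message ID, points)
    """
    # Group & sum by message ID.
    by_message = defaultdict(int)
    for message_id, points in reply_scores:
        by_message[message_id] += points

    # Group & sum by email address; replies to unknown messages are skipped.
    by_author = defaultdict(int)
    for message_id, total in by_message.items():
        author = authors_by_message.get(message_id)
        if author is not None:
            by_author[author] += total

    # Toss addresses with no love, then sort by summed score (highest first).
    ranked = [(author, total) for author, total in by_author.items() if total > 0]
    return sorted(ranked, key=lambda pair: (-pair[1], pair[0]))
```

Ties are broken alphabetically by address so the output is deterministic.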
Workflow
Building the Flow
mod_mbox Page
Custom Operation
Validate
This Hug’s for Ted!
Produce
Public Terabyte Dataset
Sponsored by Concurrent/Bixolabs
High quality crawl of top domains
– HECB Stack using Elastic Map Reduce
Hosted by Amazon in S3, free to EC2 users
Crawl & processing code available
Questions, input? http://bixolabs.com/PTD/
Summary
HECB stack works well for web mining
– Cheaper than typical colo option
– Scales to hundreds of millions of pages
– Reliable and efficient workflow
Web mining has high & increasing value
– Search engine optimization, advertising
– Social networks, reputation
– Competitive pricing
– Etc, etc, etc.
Any Questions?
My email:
Bixo mailing list:
http://tech.groups.yahoo.com/group/bixo-dev/