TRACKING LIVE WIKIPEDIA [email protected]
Insight Data Engineering Week 4 - January 2015
DRAFT WIKIWATCH.ANDREWMO.COM
MOTIVATION
• Raw dumps of Wikipedia data are available for analysis on a monthly basis, but what about changes between these intervals?
• Data Collection: Live edits for Wikimedia projects are broadcast to nearly 882 IRC channels
• Goals: Collect, filter, format, transform, and produce information about live edit data
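As a rough illustration of the collect / filter / format steps, the sketch below strips IRC color codes from one broadcast line and turns it into a record. The message layout, the sample line, the flag meanings (N = new page, B = bot), and the names `parse_edit` and `SAMPLE` are assumptions made for this example, not details taken from the project.

```python
import re

# mIRC color codes: \x03 optionally followed by a 1-2 digit color number.
COLOR_CODE = re.compile(r"\x03(?:\d{1,2}(?:,\d{1,2})?)?")

# Assumed layout once colors are stripped:
#   [[Title]] flags url * user * (+delta) comment
EDIT_LINE = re.compile(
    r"\[\[(?P<title>.+?)\]\]\s+(?P<flags>\S*)\s+(?P<url>\S*)\s+\*\s+"
    r"(?P<user>.+?)\s+\*\s+\((?P<delta>[+-]\d+)\)\s*(?P<comment>.*)"
)

def parse_edit(line):
    """Return a dict describing one edit, or None if the line does not match."""
    clean = COLOR_CODE.sub("", line).strip()
    match = EDIT_LINE.match(clean)
    if match is None:
        return None  # filter step: drop anything that is not an edit line
    flags = match.group("flags")
    return {
        "title": match.group("title"),
        "user": match.group("user"),
        "delta": int(match.group("delta")),  # size change in bytes
        "is_new": "N" in flags,              # assumed flag meaning
        "is_bot": "B" in flags,              # assumed flag meaning
        "comment": match.group("comment"),
    }

# Hypothetical sample line, written by hand for this sketch.
SAMPLE = (
    "\x0314[[\x0307Ada Lovelace\x0314]]\x034 N\x0310 "
    "\x0302https://en.wikipedia.org/w/index.php?oldid=1\x03 \x035*\x03 "
    "\x0303ExampleUser\x03 \x035*\x03 (+120) \x0310created page\x03"
)

record = parse_edit(SAMPLE)
print(record)
```

Returning `None` for unmatched lines keeps the filter step explicit, so downstream stages only see well-formed records.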
STREAM PROCESSING
Multiple Topologies (10 sec, 10 min, 1 hr)
Multiple Metrics (events, size, new pages, topics, users)
Python + Storm (Pyleus)
MySQL
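The window sizes and metrics listed above can be sketched in plain Python as fixed (tumbling) windows keyed by their start time. This is not Storm or Pyleus code; the event fields (`ts`, `user`, `title`, `delta`, `is_new`) and the function names are hypothetical, and "size" is assumed here to mean the net byte change per window.

```python
from collections import defaultdict

# Window widths in seconds, mirroring the three topologies.
WINDOWS = {"10s": 10, "10m": 600, "1h": 3600}

def window_start(ts, width):
    """Round an integer epoch timestamp down to the start of its window."""
    return ts - (ts % width)

def summarize(events, width):
    """Group events into tumbling windows and compute summary metrics."""
    buckets = defaultdict(
        lambda: {"events": 0, "size": 0, "new_pages": 0,
                 "users": set(), "topics": set()}
    )
    for e in events:
        b = buckets[window_start(e["ts"], width)]
        b["events"] += 1
        b["size"] += e["delta"]          # net size change (assumption)
        b["new_pages"] += 1 if e["is_new"] else 0
        b["users"].add(e["user"])
        b["topics"].add(e["title"])
    # Report distinct counts rather than the raw sets.
    return {
        start: {**b, "users": len(b["users"]), "topics": len(b["topics"])}
        for start, b in buckets.items()
    }

# Hypothetical events for illustration only.
events = [
    {"ts": 1000, "user": "a", "title": "X", "delta": 50, "is_new": True},
    {"ts": 1005, "user": "b", "title": "X", "delta": -20, "is_new": False},
    {"ts": 1012, "user": "a", "title": "Y", "delta": 5, "is_new": False},
]

for name, width in WINDOWS.items():
    print(name, summarize(events, width))
```

Running the same function with three widths mirrors the idea of one topology per window while sharing the aggregation code.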
API ACCESS
Time Series Summary Metrics for Multiple Windows
New Pages
Detailed User Activity
Detailed Topic Activity
Top Topics, Top Users, Top Bots, etc
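A "top N" ranking like the ones listed above can be sketched with `collections.Counter`. The record fields (`user`, `title`, `is_bot`) and the helper name `top_n` are hypothetical and stand in for whatever the API actually stores and exposes.

```python
from collections import Counter

def top_n(events, key, n=3, bots=None):
    """Rank values of `key` by edit count.

    bots=None counts every event, bots=True only bot edits,
    bots=False only non-bot edits.
    """
    counts = Counter(
        e[key] for e in events
        if bots is None or e.get("is_bot", False) == bots
    )
    return counts.most_common(n)

# Hypothetical edit records for illustration only.
edits = [
    {"user": "alice", "title": "Storm", "is_bot": False},
    {"user": "alice", "title": "Kafka", "is_bot": False},
    {"user": "fixbot", "title": "Storm", "is_bot": True},
    {"user": "bob", "title": "Storm", "is_bot": False},
]

print(top_n(edits, "title", n=1))             # top topic
print(top_n(edits, "user", n=1, bots=False))  # top human user
print(top_n(edits, "user", n=1, bots=True))   # top bot
```

Using one helper with a `bots` switch keeps the topic, user, and bot rankings on a single code path.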
Thanks
Apache Software Foundation
Wikimedia Foundation
Insight Data Science
LinkedIn (Kafka)
Twitter (Storm)
Yelp (Pyleus)
ABOUT MO
A Project Manager that Writes Code!
Worked at: RAND Corporation, Booz Allen Hamilton
Studied at: Pardee RAND Graduate School, UC San Diego - Electrical Engineering
Alphabet Soup: PMP, PMI-ACP, CISSP, ISSEP, CSEP, CSEP-ACQ
[email protected]
GitHub: https://github.com/moandcompany
LinkedIn: http://linkedin.com/in/andrewmo
VELOCITY AND OUR NEXT SPRINT
Sprint 1 (MVP Development)
18 Jan - 31 Jan 2015
Address the need + Simplify
API-query elicitation and discovery
Novel feature focus - Realtime
Maximize common-code (Python)
Sprint 2 (MVP Validation)
Engage users + Complete Features
API enhancement
Batch Integration
NoSQL Optimization
Preempt Technical Debt - Refactoring
Velocity Chart