1
What’s up?
Bouvet BigOne, 2011-10-27
Lars Marius Garshol, <[email protected]>
http://twitter.com/larsga
2
The problem with RSS readers
• Many feeds (like newspapers’) are too busy, and so unread stories pile up
• In most feeds you are only interested in a small subset of the posts
• Staying on top of the flow of news and digging up the interesting stuff is hard work
3
What’s up?
• A newsreader that tries to solve this for you
• It uses statistics to figure out which stories are the most interesting to you
  – the statistics are based on feedback from you
• Everything is collected into a single list, sorted by relevance and freshness
• Stories sink slowly as they age, so if you don’t read them they gradually fade away
4
Like
Dislike
Not used
Mark as read
5
Very interesting post about beer, but the word “beer” doesn’t actually appear anywhere
Probabilities combined with Bayes’s theorem.
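The slide only says that probabilities are combined with Bayes's theorem. One common way to do that is the naive Bayes combination sketched below, assuming each word in a post carries an independent probability that the post is interesting, and assuming equal priors. The function name `combine_probabilities` and the sample values are hypothetical, not taken from the whazzup source.

```python
import math


def combine_probabilities(probs):
    """Combine per-word probabilities into one probability for a post.

    Naive Bayes combination, assuming independent words and equal priors:
    P = prod(p) / (prod(p) + prod(1 - p)).
    """
    probs = list(probs)
    if not probs:
        return 0.5  # no evidence either way
    interesting = math.prod(probs)
    boring = math.prod(1.0 - p for p in probs)
    total = interesting + boring
    if total == 0.0:
        return 0.5  # contradictory certainties cancel out
    return interesting / total
```

This is how a post can score highly without containing the word “beer”: several moderately indicative words reinforce each other, so `combine_probabilities([0.9, 0.9])` is higher than either input.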
6
Utterly irrelevant post about sports
7
Adding feeds
Probably the usability Achilles heel of the system right now
8
Three implementations
• In-memory single-user version
  – worked well for me for several years
  – wanted to try it out with more users
• Google AppEngine version
  – easy to build and deploy
  – used way too much CPU
• “Traditional” version
  – PostgreSQL backend, ordinary web hosting
  – seems to scale much better
9
The goal
• Make the site pay for its own hosting
  – currently solved by running it on my personal web server
  – expect system to outgrow that server soon
• Move to cloud hosting
  – candidates: Amazon EC2, Heroku, Google AppEngine w/ MySQL
• Income from Google Ads
  – income per user likely to be very low
  – scaling challenge: support enough users to pay for computing resources
10
Data structure
• Good
  – fully normalized, no redundancy
  – simple and natural
• Bad
  – showing main page requires many joins
  – limited possibilities for caching
User
Feed
Subscription
Post
Rated post
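The five entities above can be sketched as normalized tables, together with the kind of multi-join query the main page would need. This is an illustration only: it uses an in-memory SQLite database instead of PostgreSQL, and every table name, column name, and sample row is an assumption rather than the actual whazzup schema.

```python
import sqlite3

SCHEMA = """
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE feeds (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE subscriptions (
    user_id INTEGER NOT NULL REFERENCES users(id),
    feed_id INTEGER NOT NULL REFERENCES feeds(id),
    PRIMARY KEY (user_id, feed_id)
);
CREATE TABLE posts (
    id INTEGER PRIMARY KEY,
    feed_id INTEGER NOT NULL REFERENCES feeds(id),
    title TEXT NOT NULL
);
CREATE TABLE rated_posts (
    user_id INTEGER NOT NULL REFERENCES users(id),
    post_id INTEGER NOT NULL REFERENCES posts(id),
    points REAL NOT NULL,
    PRIMARY KEY (user_id, post_id)
);
"""

# Three joins just to list one user's stories with their feed titles.
MAIN_PAGE = """
SELECT p.title, f.title, r.points
FROM rated_posts r
JOIN posts p ON p.id = r.post_id
JOIN feeds f ON f.id = p.feed_id
JOIN subscriptions s ON s.feed_id = f.id AND s.user_id = r.user_id
WHERE r.user_id = ?
ORDER BY r.points DESC
"""


def demo():
    """Build a tiny sample database and return the main-page rows."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    db.execute("INSERT INTO users VALUES (1, 'alice')")
    db.execute("INSERT INTO feeds VALUES (1, 'Beer blog')")
    db.execute("INSERT INTO subscriptions VALUES (1, 1)")
    db.executemany("INSERT INTO posts VALUES (?, 1, ?)",
                   [(1, "New stout"), (2, "Football scores")])
    db.executemany("INSERT INTO rated_posts VALUES (1, ?, ?)",
                   [(1, 80.0), (2, 5.0)])
    return db.execute(MAIN_PAGE, (1,)).fetchall()
```

Nothing is stored twice, which is the "good" side; the "bad" side is visible in `MAIN_PAGE`, where a single page view touches four tables.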
11
Queueing
• The original version would respond to clicks in real time
  – meant recomputing all stories on each up/down vote, before showing the page again
  – not really a very pleasant user experience
• Changed over to a queue approach
  – user clicks added to a queue
  – queue worker retrieves tasks, processes them, may add more
  – scheduled tasks injected into queue
  – admin command-line tool 1) to inject tasks when needed
  – works beautifully
1) http://code.google.com/p/whazzup/source/browse/send.py
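A minimal sketch of the queue approach, assuming tasks are `(name, argument)` tuples and that a handler may return follow-up tasks. It uses Python's in-process `queue.Queue` for brevity, not the IPC message queue of the real system, and the task names `RecordVote` and `RecalculateUser` are hypothetical.

```python
import queue


def run_queue(tasks, handlers):
    """Process tasks until the queue is empty.

    A handler may return an iterable of follow-up tasks, which are put
    back on the queue. Returns the tasks in the order they were handled.
    """
    processed = []
    while True:
        try:
            name, arg = tasks.get_nowait()
        except queue.Empty:
            break
        followups = handlers[name](arg) or []
        for task in followups:
            tasks.put(task)
        processed.append((name, arg))
    return processed


def record_vote(arg):
    """Hypothetical handler: store the vote, then ask for a recalculation."""
    user, post, vote = arg
    return [("RecalculateUser", user)]


def recalculate_user(user):
    """Hypothetical handler: the expensive rescoring would happen here."""
    return []


HANDLERS = {"RecordVote": record_vote, "RecalculateUser": recalculate_user}
```

The web request only has to enqueue `("RecordVote", ...)` and return; the expensive recalculation happens later in the worker.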
12
Google AppEngine experience
• Easy to build, painless to deploy
  – web.py and Python well supported
  – good queue and scheduled tasks APIs
• Datastore and GQL too primitive
  – high latency registers as high CPU usage (costly)
  – very, very limited support for letting the database do the work, leads to poor performance
• AppEngine apps require heavy caching to work
  – not really possible with this application
• Would have hit limit of free usage at 4 users
  – not a realistic proposition
13
Example problem
• How to implement aging of posts?
  – that is, reducing score as the posts get older
• Could compute score when loading story list
  – not possible in GQL (no expression language)
• Could run scheduled tasks once an hour
  – in GQL this requires loading all RatedPost objects into main process
  – way too resource-intensive
• Just didn’t scale at all
(probability * 1000.0) / math.log(ageinsecs)
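The expression above can be wrapped in a small function to show how a score sinks with age. The function name `aged_score` is hypothetical, and the clamp for very young posts is an assumption added here, because `math.log(1)` is 0 and the logarithm of anything smaller is negative.

```python
import math


def aged_score(probability, age_in_secs):
    """Score from the slide: (probability * 1000.0) / math.log(ageinsecs)."""
    # Assumption: treat posts younger than 2 seconds as 2 seconds old,
    # which avoids division by zero and negative scores.
    age = max(age_in_secs, 2.0)
    return (probability * 1000.0) / math.log(age)
```

Because the divisor is a logarithm, a post loses points quickly in its first hours and then only slowly: with the same probability, a one-hour-old post still outranks a one-day-old post, but not by a large factor.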
14
Current architecture
[Architecture diagram; components shown:]
• Apache w/ mod_python (×4)
• IPC message queue 1)
• Queue worker
• Download thread (×3)
• PostgreSQL
• DBM files (×4)
• cron
100% Python, based on web.py, single server so far
1) http://semanchuk.com/philip/sysv_ipc/
15
Aging posts with Postgres
• First attempt
  – load posts, compute in Python, save to DB
  – took 1.1 seconds per subscription
  – with ~50 subscriptions per user, that’s much too slow
• Second attempt
  – do calculation in the SQL update statement 1)
  – takes 0.5 seconds per user
  – more than 100 times faster
  – may still be too slow
    • with 7200 users it would take an hour
1) http://code.google.com/p/whazzup/source/browse/dbqueue.py?r=4e86bc419d8f397b65cbfb2b39e08db3bf398a32#111
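A sketch of the second attempt, where the aging calculation lives in one UPDATE statement instead of a Python loop. It is not the code from dbqueue.py: it runs against in-memory SQLite rather than PostgreSQL, the table and column names are assumptions, and `ln()` has to be registered by hand here because SQLite does not always ship it (PostgreSQL has it built in).

```python
import math
import sqlite3


def age_posts(db, user_id, now):
    """Recompute points for all of one user's posts in a single UPDATE."""
    db.execute(
        """
        UPDATE rated_posts
        SET points = (probability * 1000.0) / ln(MAX(? - published, 2))
        WHERE user_id = ?
        """,
        (now, user_id),
    )


def demo():
    """Age two sample posts and return (post_id, points) rows."""
    db = sqlite3.connect(":memory:")
    # Stand-in for PostgreSQL's built-in ln().
    db.create_function("ln", 1, math.log)
    db.execute(
        "CREATE TABLE rated_posts (user_id INTEGER, post_id INTEGER, "
        "probability REAL, published INTEGER, points REAL)"
    )
    db.executemany(
        "INSERT INTO rated_posts VALUES (1, ?, ?, ?, 0)",
        [(1, 0.9, 1000), (2, 0.9, 90000)],
    )
    age_posts(db, user_id=1, now=100000)
    return db.execute(
        "SELECT post_id, points FROM rated_posts ORDER BY post_id"
    ).fetchall()
```

The point of the design is that no rows travel to the application and back: the database reads, computes, and writes in one pass.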
16
More performance tricks
• Loading story pages is a bit expensive
  – because of SQL joins required
  – now handling votes with AJAX, so page doesn’t have to be reloaded for every vote
  – next step: caching feed titles and story titles?
• Separate worker threads for feed downloading
  – because feeds may be slow to respond
  – threads save feed XML to disk, then queue task to process feed
  – ParseFeed task doesn’t have any network latency
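The download-thread idea can be sketched as follows, assuming a `fetch` callable that returns the feed bytes for a URL. Only the ParseFeed task name comes from the slides; the function names, the in-process `queue.Queue`, and the temporary-file naming are assumptions for illustration.

```python
import os
import queue
import tempfile
import threading


def download_worker(urls, tasks, fetch, directory):
    """Fetch each URL, save the XML to disk, and queue a ParseFeed task."""
    while True:
        try:
            url = urls.get_nowait()
        except queue.Empty:
            return
        data = fetch(url)  # the only slow, network-bound step
        fd, path = tempfile.mkstemp(suffix=".xml", dir=directory)
        with os.fdopen(fd, "wb") as out:
            out.write(data)
        tasks.put(("ParseFeed", path))


def download_all(feed_urls, fetch, directory, threads=3):
    """Run the download threads and return the queued ParseFeed tasks."""
    urls, tasks = queue.Queue(), queue.Queue()
    for url in feed_urls:
        urls.put(url)
    workers = [
        threading.Thread(target=download_worker,
                         args=(urls, tasks, fetch, directory))
        for _ in range(threads)
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return [tasks.get_nowait() for _ in range(tasks.qsize())]
```

A slow feed only stalls one download thread; the queue worker that later handles ParseFeed reads a local file and never waits on the network.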
17
Statistics
No perceptible server load
Bottlenecks:
• loading story list pages
• parsing feeds, calculating points
18
Future architecture
[Architecture diagram; components shown:]
• Web frontend (×3)
• Message queue (Gearman?)
• cron
• Queue worker (×3)
• DB cluster (PostgreSQL?)
• DBM files?
• memcached?
19
More information
• Blog post
  – http://www.garshol.priv.no/blog/216.html
• Source code
  – http://code.google.com/p/whazzup/
• Pre-alpha trial
  – open to anyone; sign up if you’re interested
  – no guarantees about anything
  – http://whazzup.garshol.priv.no/
  – currently limited to 100 users (87 accounts available)