What's up?

Bouvet BigOne, 2011-10-27
Lars Marius Garshol, <[email protected]>
http://twitter.com/larsga

description

About an RSS reader that uses statistics to show the news most interesting to you. Based on Python and web.py.

Page 1: What's up?

What’s up?

Bouvet BigOne, 2011-10-27
Lars Marius Garshol, <[email protected]>
http://twitter.com/larsga

Page 2: What's up?

The problem with RSS readers

• Many feeds (like newspapers’) are too busy, and so unread stories pile up

• In most feeds you are only interested in a small subset of the posts

• Staying on top of the flow of news and digging up the interesting stuff is hard work

Page 3: What's up?

What’s up?

• A newsreader that tries to solve this for you

• It uses statistics to figure out which news items are the most interesting to you
  – the statistics are based on feedback from you

• Everything is collected into a single list, sorted by relevance and freshness

• Stories sink slowly as they age, so if you don’t read them they gradually fade away

Page 4: What's up?

[Screenshot; callout labels: Like, Dislike, Not used, Mark as read]

Page 5: What's up?

Very interesting post about beer, but the word “beer” doesn’t actually appear anywhere

Probabilities combined with Bayes’ theorem.
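
The slide does not say how the probabilities are combined, only that Bayes’ theorem is used. As a rough sketch, assuming each word (or other feature) of a post yields an independent probability that the post is interesting, and assuming equal priors, the combination could look like the naive Bayes rule known from spam filters. The function name `combine` and the clamping limits are assumptions for illustration, not taken from the whazzup source.

```python
from math import prod


def combine(probabilities, low=0.01, high=0.99):
    """Combine per-feature probabilities into one probability (naive Bayes).

    Assumption: features are independent and the prior is 50/50.
    Values are clamped to [low, high] so a single 0.0 or 1.0 cannot
    decide the result or cause a division by zero.
    """
    clamped = [min(max(p, low), high) for p in probabilities]
    p_interesting = prod(clamped)
    p_boring = prod(1.0 - p for p in clamped)
    return p_interesting / (p_interesting + p_boring)
```

With this rule a post can score high even if an obvious keyword such as “beer” never appears, as long as enough of its other words have been seen in posts the user liked.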

Page 6: What's up?

Utterly irrelevant post about sports

Page 7: What's up?

Adding feeds

Probably the usability Achilles heel of the system right now

Page 8: What's up?

Three implementations

• In-memory single-user version
  – worked well for me for several years
  – wanted to try it out with more users

• Google AppEngine version
  – easy to build and deploy
  – used way too much CPU

• “Traditional” version
  – PostgreSQL backend, ordinary web hosting
  – seems to scale much better

Page 9: What's up?

The goal

• Make the site pay for its own hosting
  – currently solved by running it on my personal web server
  – expect system to outgrow that server soon

• Move to cloud hosting
  – candidates: Amazon EC2, Heroku, Google AppEngine w/ MySQL

• Income from Google Ads
  – income per user likely to be very low
  – scaling challenge: support enough users to pay for computing resources

Page 10: What's up?

Data structure

• Good
  – fully normalized, no redundancy
  – simple and natural

• Bad
  – showing main page requires many joins
  – limited possibilities for caching

[Diagram of the entities: User, Feed, Subscription, Post, Rated post]
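
To illustrate why the main page needs several joins, here is a minimal sketch of a normalized schema with the five entities from the diagram. All table and column names are hypothetical, and SQLite (in memory) stands in for the real database purely so the example is self-contained.

```python
import sqlite3

# Hypothetical normalized schema: one table per entity on the slide.
SCHEMA = """
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE feeds (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE subscriptions (user_id INTEGER REFERENCES users(id),
                            feed_id INTEGER REFERENCES feeds(id));
CREATE TABLE posts (id INTEGER PRIMARY KEY,
                    feed_id INTEGER REFERENCES feeds(id), title TEXT);
CREATE TABLE rated_posts (user_id INTEGER REFERENCES users(id),
                          post_id INTEGER REFERENCES posts(id), points REAL);
"""

# The story list needs feed title, post title and points: three joins.
MAIN_PAGE = """
SELECT f.title, p.title, r.points
FROM rated_posts r
JOIN posts p ON p.id = r.post_id
JOIN feeds f ON f.id = p.feed_id
JOIN subscriptions s ON s.feed_id = f.id AND s.user_id = r.user_id
WHERE r.user_id = ?
ORDER BY r.points DESC
"""


def demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO users VALUES (1, 'alice')")
    conn.execute("INSERT INTO feeds VALUES (1, 'Beer blog')")
    conn.execute("INSERT INTO subscriptions VALUES (1, 1)")
    conn.executemany("INSERT INTO posts VALUES (?, ?, ?)",
                     [(1, 1, "Stout review"), (2, 1, "Football scores")])
    conn.executemany("INSERT INTO rated_posts VALUES (?, ?, ?)",
                     [(1, 1, 95.0), (1, 2, 3.0)])
    return conn.execute(MAIN_PAGE, (1,)).fetchall()
```

Because every title lives in exactly one place, nothing is redundant, but each page view has to walk all of these tables.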

Page 11: What's up?

Queueing

• The original version would respond to clicks in real time
  – meant recomputing all stories on each up/down vote, before showing the page again
  – not a very pleasant user experience

• Changed over to a queue approach
  – user clicks added to a queue
  – queue worker retrieves tasks, processes them, may add more
  – scheduled tasks injected into queue
  – admin command-line tool 1) to inject tasks when needed
  – works beautifully

1) http://code.google.com/p/whazzup/source/browse/send.py
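
The queue approach can be sketched as follows. This is an in-process stand-in using Python’s standard `queue` module, not the actual implementation; the task names `record_vote` and `recalculate` are made up for illustration.

```python
import queue


def record_vote(user_id, post_id, vote):
    """Task for a user click; processing it queues a follow-up task."""
    def task(tasks, log):
        log.append(("vote", user_id, post_id, vote))
        tasks.put(recalculate(user_id))  # a task may add more tasks
    return task


def recalculate(user_id):
    """Task that would recompute the scores of one user's stories."""
    def task(tasks, log):
        log.append(("recalculate", user_id))
    return task


def drain(tasks):
    """Process tasks in order until the queue is empty; return what ran."""
    log = []
    while True:
        try:
            task = tasks.get_nowait()
        except queue.Empty:
            return log
        task(tasks, log)


def demo():
    tasks = queue.Queue()
    tasks.put(record_vote(1, 42, "up"))  # added by the web frontend
    tasks.put(recalculate(2))            # e.g. injected by cron or an admin tool
    return drain(tasks)
```

The web request only has to enqueue the click and can return immediately; the expensive recomputation happens later in the worker.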

Page 12: What's up?

Google AppEngine experience

• Easy to build, painless to deploy
  – web.py and Python well supported
  – good queue and scheduled tasks APIs

• Datastore and GQL too primitive
  – high latency registers as high CPU usage (costly)
  – very, very limited support for letting the database do the work, leads to poor performance

• AppEngine apps require heavy caching to work
  – not really possible with this application

• Would have hit limit of free usage at 4 users
  – not a realistic proposition

Page 13: What's up?

Example problem

• How to implement aging of posts?
  – that is, reducing score as the posts get older

• Could compute score when loading story list
  – not possible in GQL (no expression language)

• Could run scheduled tasks once an hour
  – in GQL this requires loading all RatedPost objects into main process
  – way too resource-intensive

• Just didn’t scale at all

(probability * 1000.0) / math.log(ageinsecs)
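
Wrapped in a function, the aging formula above might look like this. The function name and the lower bound on the age are assumptions: the slide does not say how very young posts are handled, but `math.log` is zero at one second and negative below it, so some guard is needed.

```python
import math

MIN_AGE_SECS = 2.0  # assumption: keeps math.log(age) positive


def aged_score(probability, age_in_secs):
    """Score from the slide: (probability * 1000.0) / math.log(ageinsecs)."""
    age = max(float(age_in_secs), MIN_AGE_SECS)
    return (probability * 1000.0) / math.log(age)
```

Because the divisor grows only logarithmically, stories sink slowly: a post with probability 0.9 scores about 110 after one hour and about 79 after one day.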

Page 14: What's up?

Current architecture

[Architecture diagram. Components: Apache w/ mod_python (four shown), IPC message queue 1), queue worker, download threads (three shown), PostgreSQL, DBM files, cron]

100% Python
Based on web.py
Single server so far

1) http://semanchuk.com/philip/sysv_ipc/

Page 15: What's up?

Aging posts with Postgres

• First attempt
  – load posts, compute in Python, save to DB
  – took 1.1 seconds per subscription
  – with ~50 subscriptions per user (about 55 seconds per user), that’s much too slow

• Second attempt
  – do calculation in the SQL update statement 1)
  – takes 0.5 seconds per user
  – more than 100 times faster
  – may still be too slow
    • with 7200 users it would take an hour

1) http://code.google.com/p/whazzup/source/browse/dbqueue.py?r=4e86bc419d8f397b65cbfb2b39e08db3bf398a32#111
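
The idea behind the second attempt, letting the database evaluate the aging formula inside one UPDATE instead of round-tripping every row through Python, can be sketched like this. The table and column names are hypothetical and SQLite stands in for PostgreSQL so the example runs on its own; the real statement is in the linked `dbqueue.py`. SQLite may lack a logarithm function, so one is registered under the name `ln`.

```python
import math
import sqlite3


def age_posts(conn, user_id, now):
    """Recompute points for all of one user's rated posts in one statement."""
    conn.execute(
        """
        UPDATE rated_posts
        SET points = (prob * 1000.0) / ln(max(? - pubdate, 2))
        WHERE user_id = ?
        """,
        (now, user_id),
    )


def demo():
    conn = sqlite3.connect(":memory:")
    conn.create_function("ln", 1, math.log)  # natural logarithm for SQLite
    conn.execute("CREATE TABLE rated_posts "
                 "(user_id INTEGER, post_id INTEGER, prob REAL, "
                 "pubdate INTEGER, points REAL)")
    now = 1_000_000
    conn.executemany(
        "INSERT INTO rated_posts VALUES (?, ?, ?, ?, NULL)",
        [(1, 1, 0.5, now - 3600), (1, 2, 0.5, now - 86400)],
    )
    age_posts(conn, 1, now)
    return conn.execute(
        "SELECT post_id, points FROM rated_posts ORDER BY post_id").fetchall()
```

No rows leave the database, which is where the speed-up over the load, compute, save loop comes from.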

Page 16: What's up?

More performance tricks

• Loading story pages is a bit expensive
  – because of SQL joins required
  – now handling votes with AJAX, so page doesn’t have to be reloaded for every vote
  – next step: caching feed titles and story titles?

• Separate worker threads for feed downloading
  – because feeds may be slow to respond
  – threads save feed XML to disk, then queue task to process feed
  – ParseFeed task doesn’t have any network latency
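
A minimal sketch of the download-thread pattern described above, using only the standard library. The `fetch` callable is an assumption injected by the caller (in a real system it would perform the HTTP request), and the tuple `("ParseFeed", url, path)` is an illustrative task format, not the one whazzup uses.

```python
import os
import queue
import tempfile
import threading


def download_worker(urls, tasks, fetch, directory):
    """Take feed URLs off `urls`, save the XML to disk, queue a parse task."""
    while True:
        url = urls.get()
        if url is None:  # sentinel: no more work for this thread
            return
        xml = fetch(url)  # the only step that waits on the network
        fd, path = tempfile.mkstemp(suffix=".xml", dir=directory)
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(xml)
        tasks.put(("ParseFeed", url, path))


def demo(fetch, feed_urls, thread_count=3):
    urls, tasks = queue.Queue(), queue.Queue()
    with tempfile.TemporaryDirectory() as directory:
        threads = [
            threading.Thread(target=download_worker,
                             args=(urls, tasks, fetch, directory))
            for _ in range(thread_count)
        ]
        for t in threads:
            t.start()
        for url in feed_urls:
            urls.put(url)
        for _ in threads:
            urls.put(None)  # one sentinel per thread
        for t in threads:
            t.join()
        # Stand-in for the queue worker: read each saved file back from disk.
        parsed = []
        while not tasks.empty():
            name, url, path = tasks.get()
            with open(path, encoding="utf-8") as f:
                parsed.append((name, url, f.read()))
    return sorted(parsed)
```

A slow feed only blocks the thread that is downloading it; the queue worker reads from local disk and never waits on the network.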

Page 17: What's up?

Statistics

No perceptible server load

Bottlenecks:
• loading story list pages
• parsing feeds, calculating points

Page 18: What's up?

Future architecture

[Architecture diagram. Components: web frontends (three shown), message queue (Gearman?), cron, queue workers (three shown), DB cluster (PostgreSQL?), DBM files?, memcached?]

Page 19: What's up?

More information

• Blog post
  – http://www.garshol.priv.no/blog/216.html

• Source code
  – http://code.google.com/p/whazzup/

• Pre-alpha trial
  – open to anyone; sign up if you’re interested
  – no guarantees about anything
  – http://whazzup.garshol.priv.no/
  – currently limited to 100 users (87 accounts available)