Speeding up Django and other Python apps with - 2ndQuadrant

Page 1

Remoting Django (and other Python) methods to be run inside the PostgreSQL database

PgWest 2010

Hannu Krosing

Page 2

About me

Hannu Krosing

[email protected]

skype: hkrosing

● Using Python since v. 1.1

● Using Linux since v. 0.9

● PostgreSQL since it was Postgres 4.2 (using POSTQUEL as its query language instead of SQL)

Senior PostgreSQL Consultant, www.2ndQuadrant.com

Technical Advisor, Ambient Sound Investments, www.asi.ee

Page 3

Python & PostgreSQL Background

A large portion of the Python code I have written has had to do with databases, be it binary protocol or file format hacking, or using Python to automate data loading, backups, or just plain data manipulation.

I was the first (and for some years the only) DBA for Skype, where we used Python for most data backend manipulation tasks, and where I invented pl/proxy, a proxying and database partitioning language which lets you build unlimited scalability into your database backends with relative ease, from both the developer and administrator viewpoints.

For the last few years I have been helping companies manage and scale their PostgreSQL databases as a Database Consultant at 2ndQuadrant (www.2ndQuadrant.com), as well as evaluating the technical merits of startups at ASI, the investment company of Skype's founding engineers (www.asi.ee).

Page 4

The 1st problem to solve

● Many people use PostgreSQL from Python, but don't want to dive deeply into SQL

● Even when you are good at using SQL from another language, it is still cumbersome, due to the different idioms and expectations of the two worlds (an "impedance mismatch")

Page 5

Solution to the 1st problem

● Use an Object-Relational Mapper (ORM)

● ORMs replace SQL with something syntactically closer to the host language

● Django has a good and flexible ORM

Page 6

ORM vs DB API

● Sample DB API update to a table row:

cur.execute('update users set password=%s where name=%s',
            ('frog', 'bob'))

● The same using the Django ORM:

usr = User.objects.get(pk='bob')
usr.password = 'frog'
usr.save()

Page 7

Other available ORMs

● SQLAlchemy: keeps the object definition and the mapper separate, joining them at runtime

● Storm (GPL)

● SQLObject (mostly superseded by the more flexible SQLAlchemy)

Page 8

Main problem with ORMs

● Most ORMs tend to generate lots of distinct database requests and make no use of most database features, at least when used naively.

● Not using good database features also tilts the choice of database towards ones with fewer features.

● Most ORMs have ways to circumvent this, up to writing your own SQL (Django's version of this is sketched below), but then why use the ORM in the first place?
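That escape hatch looks like this in Django (a hedged sketch, not from the original slides; raw() exists from Django 1.2 on, and it assumes a User model mapped to the users table from the DB API example):

for user in User.objects.raw('select * from users where name = %s', ['bob']):
    print user.password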

Page 9

More problems with ORMs

The next ones are due to not using functions to access the database:

● There is no easy way for a DBA to go and fix inefficient functionality without changing the source code of client application.

● You can't use pl/proxy, so it is hard to use partitioning for scaling.

Page 10

Is this really a problem?

● How bad is the "many db requests" problem?

● As each db request has about 0.4 ms of overhead even when doing nothing, you can easily get a large total runtime for simple things (see the sketch below).

● This usually does not show up at development time, when you are the only user.
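A rough way to observe this per-request overhead yourself (a sketch, not from the original slides; it assumes a local PostgreSQL reachable via psycopg2):

import time
import psycopg2

con = psycopg2.connect('dbname=pyrpcdb user=bob')
cur = con.cursor()

t0 = time.time()
N = 1000
for i in range(N):
    cur.execute('select 1')   # a statement that does essentially no work
    cur.fetchone()
elapsed_ms = (time.time() - t0) * 1000
print 'per-request overhead: %.3f ms' % (elapsed_ms / N)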

Page 11

Sample schema

-- 100000 items
create table items (
    id serial primary key,
    item_name text not null,
    price numeric(10,4),
    sale_price numeric(10,4)
);

-- 100000 items_available
create table items_available (
    id serial primary key references items(id),
    quantity int
);

-- orders (only the primary key matters for the samples here)
create table orders (
    id serial primary key
);

-- 1000000 order_lines
create table order_lines (
    id serial primary key,
    order_id int not null references orders(id),
    item_id int not null references items(id),
    quantity_ordered int,
    quantity_shipped int
);

Page 12

Shipping an order using the Django ORM

order = Order.objects.get(id=1000)
for line in order.order_lines.all():
    item_av = ItemAvailable.objects.get(pk=line.item_id)
    if line.quantity_ordered > item_av.quantity:
        line.quantity_shipped = item_av.quantity
        item_av.quantity = 0
    else:
        line.quantity_shipped = line.quantity_ordered
        item_av.quantity -= line.quantity_shipped
    line.save()
    item_av.save()

● Does 1 query to get the order and another one to get ~100 order lines
● Then does 100 queries to get the corresponding availability data
● And then 200 more to save the order lines and availability back
● That is 300+ queries in total, each doing an RPC call to the database
● Likely to take at least 500 ms or more
● Some of the queries can be avoided by using select_related (see the sketch below), but most can't
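What select_related buys you here (a hedged sketch with the model names assumed above): the related item rows are joined in up front, but every per-line availability update still needs its own statement:

order = Order.objects.get(id=1000)
for line in order.order_lines.all().select_related('item'):
    # line.item is already loaded, so this loop issues no extra queries
    print line.item.item_name, line.quantity_ordered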

Page 13

Shipping an order in pl/python

create or replace function process_order(i_order_id int, out json_result text) as $$
from cjson import encode, decode

if 'order_plan' not in SD:
    q = 'select * from orders where id = $1'
    SD['order_plan'] = plpy.prepare(q, ['int'])
<..cache other plans here..>

order = plpy.execute(SD['order_plan'], [i_order_id])
order_lines = plpy.execute(SD['order_line_plan'], [i_order_id])

availability_query = SD['availability_plan']
update_quantities_query = SD['update_quantities']
update_order_shipping_query = SD['update_order_shipping']

for order_line in order_lines:
    line_id = order_line['id']
    item_id = order_line['item_id']
    availability = plpy.execute(availability_query, [item_id])[0]
    if availability['quantity'] < order_line['quantity_ordered']:
        plpy.execute(update_quantities_query, [item_id, 0])
        plpy.execute(update_order_shipping_query,
                     [line_id, availability['quantity']])
    else:
        plpy.execute(update_quantities_query,
                     [item_id, availability['quantity'] - order_line['quantity_ordered']])
        plpy.execute(update_order_shipping_query,
                     [line_id, order_line['quantity_ordered']])
$$ language plpythonu security definer;

● One RPC call to the database
● Total 30-120 ms to update 2x100 rows

Page 14

Problems with pl/python

● Maintaining code in 2 places

● Hard to write

● Writing this as 1-2 SQL statements would have performed even faster, so why bother? (a sketch of that version follows)
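A sketch of that one-to-two-statement version against the sample schema from page 11 (not from the original slides; it assumes at most one order line per item within an order):

update order_lines ol
   set quantity_shipped = least(ol.quantity_ordered, ia.quantity)
  from items_available ia
 where ia.id = ol.item_id
   and ol.order_id = 1000;

update items_available ia
   set quantity = ia.quantity - ol.quantity_shipped
  from order_lines ol
 where ol.item_id = ia.id
   and ol.order_id = 1000;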

Page 15

Automatic remoting

● You can automate moving code into PostgreSQL to be executed by pl/python

How?

● Write a decorator, which wraps up code and arguments, and passes them for execution to a special pl/python function

Page 16

Writing a remoting decorator

All the following samples need these modules and connections:

import psycopg2
import json, marshal, base64

# user who can create functions
admcon = psycopg2.connect('dbname=pyrpcdb user=admin')
admcur = admcon.cursor()

# connection for using the functions
usrcon = psycopg2.connect('dbname=pyrpcdb user=bob')
usrcur = usrcon.cursor()

Page 17

Writing a remoting decorator

The decorator as a class:

class run_in_database(object):
    def __init__(self, f):
        self.f = f
        self.code = f.func_code
        self.mcode = marshal.dumps(self.code)
        self.code64 = base64.encodestring(self.mcode)
        self.db_name = "%s_%d" % (f.__name__, hash(self.mcode))
    def __call__(self, *args, **kargs):
        json_args = json.dumps((args, kargs))
        usrcur.execute('select * from run_code(%s,%s)',
                       (self.code64, json_args))
        result = json.loads(usrcur.fetchone()[0])
        return result

● serialises the actual code of the function
● passes the serialised arguments and code to pl/python
● deserialises the return value

Page 18

Remote executor in PostgreSQL

create or replace function run_code(in code64 text, in json_args text,
                                    out json_result text) as $$
import json, marshal, base64

args, kargs = json.loads(json_args)
code = marshal.loads(base64.decodestring(code64))

def f(): pass
f.func_code = code

res = f(*args, **kargs)

return json.dumps(res)
$$ language plpythonu security definer;

● deserialises the arguments and code
● creates a dummy function, then attaches the deserialised code to it
● calls the function, serialises the result and returns it (a standalone sketch of this round-trip follows)
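A minimal local sketch of the same marshal round-trip, with no database involved (Python 2, like the rest of the samples):

import marshal, base64

def greet(name):
    return 'hello ' + name

# serialise the function's code object, as the decorator does
blob = base64.encodestring(marshal.dumps(greet.func_code))

# rebuild a callable from it, as run_code() does inside PostgreSQL
def f(): pass
f.func_code = marshal.loads(base64.decodestring(blob))

print f('world')   # prints 'hello world'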

Page 19

Defining and calling the remoted function

@run_in_database
def get_order(i_order_id):
    if 'order_plan' not in SD:
        q = 'select * from orders where id = $1'
        SD['order_plan'] = plpy.prepare(q, ['int'])
    if 'order_line_plan' not in SD:
        q = 'select * from order_lines where order_id = $1'
        SD['order_line_plan'] = plpy.prepare(q, ['int'])
    order = plpy.execute(SD['order_plan'], [i_order_id])[0]
    order_lines = plpy.execute(SD['order_line_plan'], [i_order_id])
    return (order, [order_line for order_line in order_lines])

# call the decorated function; it is executed in the database
res = get_order(1000)

● The main overhead is returning the data, not shipping the code
● Code shipping can be changed to happen only when the code has changed

Page 20

Some speed numbers

● A simple "select * from t where id = n" runs in 0.1-0.7 ms

● get_order(n) takes 4.5 ms in the database

● Calling it with automatic remoting from Python takes 6.5 ms

This means that the overhead of shipping the function code is negligible.
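One way such numbers can be measured (a sketch, not from the original slides):

import time

N = 100
t0 = time.time()
for i in range(N):
    get_order(1000)
print 'avg per call: %.1f ms' % ((time.time() - t0) * 1000 / N)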

Page 21

Problems with simple remoting

● While running everything via code shipping works, it is not the best option

● No caching of functions
● No caching of query plans
● No further optimisation or scaling possible

● Better to create a corresponding db function for each remoted function

Page 22

Advanced remoting

A decorator that runs (and creates, if necessary) a corresponding function in the database:

class run_as_db_function(object):
    def __init__(self, f):
        self.f = f
        self.py_fname = f.__name__
        self.code = f.func_code
        self.mcode = marshal.dumps(self.code)
        self.code64 = base64.encodestring(self.mcode)
        self.db_fname = "%s_%d" % (f.__name__, abs(hash(self.mcode)))
    def __call__(self, *args, **kargs):
        json_args = json.dumps((args, kargs))
        print self.db_fname, (json_args,)
        try:
            # try calling the function in the db
            usrcur.execute("savepoint sp1; select * from %s(%%s)" % self.db_fname,
                           (json_args,))
        except:
            # define the function using the admin connection
            usrcur.execute("rollback to savepoint sp1")
            admcur.execute(remote_function_template % (self.db_fname, self.code64))
            admcon.commit()
            # run the function again
            usrcur.execute("select * from %s(%%s)" % self.db_fname, (json_args,))
        result = json.loads(usrcur.fetchone()[0])
        return result

Page 23

Advanced remoting (cont)

The database function template used when creating the function:

remote_function_template = """\
CREATE OR REPLACE FUNCTION %s (in json_args text, out json_result text) AS
$$
import json, marshal, base64

code = marshal.loads(base64.decodestring(\"\"\"%s\"\"\"))
args, kargs = json.loads(json_args)

def f(): pass
f.func_code = code

res = f(*args, **kargs)
return json.dumps(res)
$$ language plpythonu security definer;"""

Create and call a remoted database function:

@run_as_db_function
def double(strarg):
    return strarg * 2

double('cat')
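The first call finds no matching function in the database, creates it via the admin connection, and then runs it there; double('cat') comes back as 'catcat', with only the JSON-encoded arguments and result crossing the wire.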

Page 24

Advanced remoting (cont)

The function that was actually created in the database:

CREATE FUNCTION double_1980513421(json_args text, OUT json_result text) RETURNS text
    LANGUAGE plpythonu SECURITY DEFINER
    AS $$
import json, marshal, base64

code = marshal.loads(base64.decodestring("""YwEAAAABAAAAAgAAAEMAAABzCAAAAHwAAGQBABRTKAIAAABOaQIAAAAoAAAAACgBAAAAdAYAAABzdHJhcmcoAAAAACgAAAAAcwcAAAA8c3RkaW4+dAYAAABkb3VibGUBAAAAcwIAAAAAAg=="""))
args, kargs = json.loads(json_args)

def f(): pass
f.func_code = code

res = f(*args, **kargs)

return json.dumps(res)
$$;

Page 25

Making it scalable

● The easiest way to achieve scalability with a function-based interface to PostgreSQL is pl/proxy

● But pl/proxy needs a partitioning key in function signature while current approach hides all arguments inside a JSON container

● Time to add the possibility of passing some arguments separately

● While we are at it, let's also cache the internal function

Page 26

Scalable remoting for pl/proxy

A decorator exposing an explicit partitioning key:

def run_as_partitioned_db_function(part_key, part_key_type="text"):
    class _run_as_partitioned_db_function(object):
        def __init__(self, f):
            self.f = f
            self.py_fname = f.__name__
            self.code = f.func_code
            self.mcode = marshal.dumps(self.code)
            self.code64 = base64.encodestring(self.mcode)
            self.db_fname = "%s_%d" % (f.__name__, abs(hash(self.mcode)))
            self.part_key = part_key
            self.part_key_type = part_key_type
            self.part_key_pos = self.code.co_varnames.index(self.part_key)
        def __call__(self, *args, **kargs):
            json_args = json.dumps((args, kargs))
            print 'args:', args
            print 'kargs:', kargs
            print self.db_fname, (args[self.part_key_pos], json_args)
            try:
                # try calling the function in the db
                usrcur.execute("savepoint sp1; select * from %s(%%s, %%s)" % self.db_fname,
                               (args[self.part_key_pos], json_args))
            except:
                # define the function using the admin connection
                usrcur.execute("rollback to savepoint sp1")
                admcur.execute(partitioned_function_template %
                               (self.db_fname, self.part_key_type, self.code64))
                admcon.commit()
                # run the function again
                usrcur.execute("select * from %s(%%s, %%s)" % self.db_fname,
                               (args[self.part_key_pos], json_args))
            result = json.loads(usrcur.fetchone()[0])
            return result
    return _run_as_partitioned_db_function

Page 27

Scalable remoting for pl/proxy (cont.)

A function template for scalable remoting exposing the key field:

partitioned_function_template = """\
CREATE OR REPLACE FUNCTION %s (in part_key %s, in json_args text,
                               out json_result text) AS
$$
import json

if not SD.has_key('_func_'):
    import marshal, base64
    code = marshal.loads(base64.decodestring(\"\"\"%s\"\"\"))

    def f(): pass
    f.func_code = code
    SD['_func_'] = f

args, kargs = json.loads(json_args)
res = SD['_func_'](*args, **kargs)

return json.dumps(res)
$$ language plpythonu security definer;"""

This also caches the function in pl/python's local dictionary SD, so it is not re-created on each call.

Page 28

Scalable remoting function in the database

Defining and running this in Python:

@run_as_partitioned_db_function('username')
def hello(username, age):
    return 'Hello %s, you are nor %d years old' % (username, age)

hello('bob', 32)

Yields the following function in the database

CREATE OR REPLACE FUNCTION hello_678701658(IN part_key text, json_args text,
                                           OUT json_result text)
    LANGUAGE plpythonu SECURITY DEFINER AS $$
import json

if not SD.has_key('_func_'):
    import marshal, base64
    code = marshal.loads(base64.decodestring("""YwIAAAACAAAAAwAAAEMAAABzDgAAAGQBAHwAAHwBAGYCABZTKAIAAABOcyIAAABIZWxsbyAlcywgeW91IGFyZSBub3IgJWQgeWVhcnMgb2xkKAAAAAAoAgAAAHQIAAAAdXNlcm5hbWV0AwAAAGFnZSgAAAAAKAAAAABzBwAAADxzdGRpbj50BQAAAGhlbGxvAQAAAHMCAAAAAAI="""))

    def f(): pass
    f.func_code = code
    SD['_func_'] = f

args, kargs = json.loads(json_args)
res = SD['_func_'](*args, **kargs)
return json.dumps(res)
$$;

Page 29

So how would one scale this?

Once you have a function with an exposed partition key, it is easy to use pl/proxy to do the actual partitioning. Just define a proxy function:

CREATE OR REPLACE FUNCTION hello_678701658(part_key text, json_args text,
                                           OUT json_result text)
LANGUAGE plproxy AS $$
    CLUSTER 'usercluster';
    RUN ON hashtext(part_key);
$$;

This proxy function decides, based on part_key, which partition to run on, calls the real function in that partition with exactly the same arguments it was called with, and returns the result to the original caller.
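For 'usercluster' to resolve, pl/proxy also needs cluster configuration. A minimal sketch using the configuration functions described in the pl/proxy docs (the connect strings here are examples; the number of partitions must be a power of two):

CREATE OR REPLACE FUNCTION plproxy.get_cluster_version(cluster_name text)
RETURNS int AS $$
    SELECT 1;
$$ LANGUAGE sql;

CREATE OR REPLACE FUNCTION plproxy.get_cluster_partitions(cluster_name text)
RETURNS SETOF text AS $$
    SELECT 'dbname=users_p0 host=part0.example.com'
    UNION ALL
    SELECT 'dbname=users_p1 host=part1.example.com';
$$ LANGUAGE sql;

CREATE OR REPLACE FUNCTION plproxy.get_cluster_config(cluster_name text,
        OUT key text, OUT val text)
RETURNS SETOF record AS $$
    SELECT 'connection_lifetime'::text, '1800'::text;
$$ LANGUAGE sql;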

Page 30

How far is Django support for this?

While the aim of this project is support for remoting ORM-based methods, with Django as the first target, we are not there yet.

I have an ugly hacked version of the postgresql database backend which can be run inside the database to execute Django QuerySet objects, but it is far from complete, nor is it easy to use or install.

You can, however, write straight pl/python code inside your Django app and have it remoted using the code from this tutorial, for example:
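A hedged sketch of that approach inside a Django app (the names here are illustrative, not from the original slides):

import json
from django.http import HttpResponse

@run_as_db_function
def order_line_count(order_id):
    # this body runs inside PostgreSQL, so plpy is available here
    q = 'select count(*) as lines from order_lines where order_id = %d' % order_id
    return plpy.execute(q)[0]['lines']

def order_view(request, order_id):
    # the call below is remoted into the database by the decorator
    return HttpResponse(json.dumps({'lines': order_line_count(int(order_id))}),
                        mimetype='application/json')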

Page 31

ToDo for Django support

● Complete Django ORM support on top of pl/python's SPI interface

– The above needs an SPI-specific database adapter

– And a DB-API compliant SPI interface, a long-time ToDo item in pl/python

● Automatic generation of Django models from the database schema (see the note below)
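For the last point, Django's bundled inspectdb management command is already a useful starting point: running python manage.py inspectdb prints model definitions reverse-engineered from the tables of an existing database.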

Page 32

Other nice-to-have features

● Automatic generation of pl/proxy function stubs

● Bypassing SQL when generating the query plans

● Support for other Python ORMs

● Support for ORMs from other languages (Ruby on Rails, anybody?)

Page 33

Conclusion

● It is possible to automatically move functions into the database for execution as demonstrated in this talk.

● It is also possible to move Django methods into the database and make them scalable using pl/proxy

● Making the Django method shipping easy still needs work (donations welcome ;)

Page 34

Where to get more info

● http://www.2ndQuadrant.com/ - if you want professional paid help

● http://www.postSQL.org/ - where this code will be developed and maintained

● http://plproxy.projects.postgresql.org/doc/tutorial.html - a tutorial for using pl/proxy

● How to reach me:

– email [email protected]

– Skype hkrosing