Download - Building data flows with Celery and SQLAlchemy

Building data flows with Celery and SQLAlchemy

PyCon Australia 2013

Roger Barnes@[email protected]://slideshare.net/mindsocket

Coming up

● Data warehousing– AKA data integration

● Processing data flows– SQLAlchemy– Celery

● Tying it all together

About me

● 15 years doing all things software● 11 years at a Business Intelligence vendor● Currently contracting

– This talk based on a real reporting system

Why Data Warehousing?

Why Data Warehousing?

But we need reports that are● Timely● Unambiguous● Accurate● Complete● … and don't impact production systems

What is a Data Warehouse

"… central repository of data which is created by integrating

data from one or more disparate sources" - Wikipedia

Extract, Transform, Load

Source: www.imc.com

Python can help!

● Rapid prototyping● Code re-use● Existing libraries● Decouple

– data flow management– data processing– business logic

Existing solutions

● Not a lot available in the Python space● People roll their own● Bubbles (Brewery 2)

– Framework for Python 3– "Focus on the process, not the data technology"

Ways to move data around

● Flat files● NOSQL data stores● RDBMS

SQLAlchemy is...

Python SQL toolkit

&

Object Relational Mapper

About SQLAlchemy

● Full featured● Mature, robust, documented, maintained● Flexible

Enterprise!

DB support

● SQLite● Postgresql● MySQL● Oracle

● MS-SQL● Firebird● Sybase● ...

Python support

cPython 2.5+cPython 3+Jython 2.5+Pypy 1.5+

Structure

SQLAlchemy Core

● Abstraction over Python's DBAPI● SQL language via generative Python

expressions

SQLAlchemy Core

● Good for DB performance– bulk operations– complex queries– fine-tuning– connection/tx management

Create a table

from sqlalchemy import *engine = create_engine('sqlite:///:memory:')metadata = MetaData()

vehicles_table = Table('vehicles', metadata, Column('model', String), Column('registration', String), Column('odometer', Integer), Column('last_service', Date),)

vehicles_table.create(bind=engine)

Insert data

values = [ {'model': 'Ford Festiva', 'registration': 'HAX00R', 'odometer': 3141 }, {'model': 'Lotus Elise', 'registration': 'DELEG8', 'odometer': 31415 },]rows = engine.execute( vehicles_table.insert(), list(values)).rowcount

Query data

query = select( [vehicles_table] ).where( vehicles_table.c.odometer < 100 )

results = engine.execute(query)

for row in results: print row

Encapsulating a unit of work

Example Processor Types

● Extract– Extract from CSV– Extract from DB table– Scrape web page

● Transform– Copy table from extract layer– Derive column– Join tables

Abstract Processor

class BaseProcessor(object): def dispatch(self): return self._run()

def _run(self): return self.run()

def run(self): raise NotImplementedError

Abstract Database Processorclass DatabaseProcessor(BaseProcessor): db_class = None engine = None metadata = None

@contextlib.contextmanager def _with_session(self): with self.db_class().get_engine() as engine: self.engine = engine self.metadata = MetaData(bind=engine) yield

def _run(self): with self._with_session(): return self.run()

CSV Extract Mixin

class CSVExtractMixin(object): input_file = None def _run(self): with self._with_engine(): self.reader = csv.DictReader( self.input_file ) return self.run()

A Concrete Extractclass SalesHistoryExtract(CSVExtractMixin, DatabaseProcessor): target_table_name = 'SalesHistoryExtract' input_file = SALES_FILENAME def run(self): target_table = Table(self.target_table_name, self.metadata) columns = self.reader.next() [target_table.append_column(Column(column, ...)) for column in columns if column] target_table.create() insert = target_table.insert() new_record_count = self.engine.execute(insert, list(self.reader)).rowcount return new_record_count

An Abstract Derive Transformclass AbstractDeriveTransform(DatabaseProcessor): table_name = None key_columns = None select_columns = None target_columns = None

def process_row(self, row): raise NotImplementedError

... # Profit!

A Concrete Transformfrom business_logic import derive_foo

class DeriveFooTransform(AbstractDeriveTransform): table_name = 'SalesTransform' key_columns = ['txn_id'] select_columns = ['location', 'username'] target_columns = [Column('foo', FOO_TYPE)]

def process_row(self, row): foo = derive_foo(row.location, row.username) return {'foo': foo}

Introducing Celery

Distributed Task Queue

A Processor Task

class AbstractProcessorTask(celery.Task): abstract = True processor_class = None

def run(self, *args, **kwargs): processor = self.processor_class( *args, **kwargs) return processor.dispatch()

class DeriveFooTask(AbstractProcessorTask): processor_class = DeriveFooTransform

DeriveFooTask().apply_async() # Run it!

Canvas: Designing Workflows

● Combines a series of tasks● Groups run in parallel● Chains run in series● Can be combined in different ways

>>> new_user_workflow = (create_user.s() | group(... import_contacts.s(),... send_welcome_email.s()))... new_user_workflow.delay(username='artv',... first='Art',... last='Vandelay',... email='[email protected]')

Sample Data Processing Flow

Extract sales

Extract customers

Extract product

s

Copy sales to transform

Copy customers to transform

Copy products to transform

Join table

s

Aggregate sales by customer

Normalise

currencyAggregate

sales by region

Customer data exception report

Sample Data Processing Flow

extract_flow = group((

ExtractSalesTask().si(),

ExtractCustTask().si(),

ExtractProductTask().si()))

transform_flow = group((

CopySalesTask().si() | NormaliseCurrencyTask().si(),

CopyCustTask().si(),

CopyProductTask().si())) | JoinTask().si()

load_flow = group((

QualityTask().si(),

AggregateTask().si('cust_id'),

AggregateTask().si('region_id')))

all_flow = extract_flow | transform_flow | load_flow

Monitoring – celery events

Monitoring – celery flower

Turning it up to 11

● A requires/depends structure● Incremental data loads● Parameterised flows● Tracking flow history● Hooking into other libraries

– NLTK– SciPy/NumPy– ...

Summary

● Intro to data warehousing● Process data with SQLAlchemy● Task dependencies with Celery

canvas

Resources

● SQLAlchemy core: http://bit.ly/10FdYZo● Celery Canvas: http://bit.ly/MOjazT● http://databrewery.org

– Bubbles: http://bit.ly/14hNsV0– Pipeline: http://bit.ly/15RXvWa

● http://schoolofdata.org

Thank You!

Questions?

http://slideshare.net/mindsocket