Building data flows with Celery and SQLAlchemy


Description

Reporting and analysis systems rely on coherent and reliable data, often from disparate sources. To that end, a set of well-established data warehousing practices has emerged for extracting data and producing a consistent data store. This talk will look at some options for composing such workflows using Python. In particular, we'll explore beyond Celery's asynchronous task processing functionality into its workflow (aka Canvas) system, and how it can be used in conjunction with SQLAlchemy's architecture to provide the building blocks for data stream processing.

Transcript of Building data flows with Celery and SQLAlchemy

Page 1: Building data flows with Celery and SQLAlchemy

Building data flows with Celery and SQLAlchemy

PyCon Australia 2013

Roger Barnes
@mindsocket
[email protected]
http://slideshare.net/mindsocket

Page 2: Building data flows with Celery and SQLAlchemy

Coming up

● Data warehousing
  – AKA data integration

● Processing data flows
  – SQLAlchemy
  – Celery

● Tying it all together

Page 3: Building data flows with Celery and SQLAlchemy

About me

● 15 years doing all things software
● 11 years at a Business Intelligence vendor
● Currently contracting

– This talk based on a real reporting system

Page 4: Building data flows with Celery and SQLAlchemy

Why Data Warehousing?

Page 5: Building data flows with Celery and SQLAlchemy

Why Data Warehousing?

But we need reports that are:
● Timely
● Unambiguous
● Accurate
● Complete
● … and don't impact production systems

Page 6: Building data flows with Celery and SQLAlchemy

What is a Data Warehouse

"… central repository of data which is created by integrating

data from one or more disparate sources" - Wikipedia

Page 7: Building data flows with Celery and SQLAlchemy
Page 8: Building data flows with Celery and SQLAlchemy
Page 9: Building data flows with Celery and SQLAlchemy

Extract, Transform, Load

Source: www.imc.com

Page 10: Building data flows with Celery and SQLAlchemy

Python can help!

● Rapid prototyping
● Code re-use
● Existing libraries
● Decouple
  – data flow management
  – data processing
  – business logic

Page 11: Building data flows with Celery and SQLAlchemy

Existing solutions

● Not a lot available in the Python space
● People roll their own
● Bubbles (Brewery 2)
  – Framework for Python 3
  – "Focus on the process, not the data technology"

Page 12: Building data flows with Celery and SQLAlchemy

Ways to move data around

● Flat files
● NoSQL data stores
● RDBMS

Page 13: Building data flows with Celery and SQLAlchemy

SQLAlchemy is...

Python SQL toolkit

&

Object Relational Mapper

Page 14: Building data flows with Celery and SQLAlchemy

About SQLAlchemy

● Full featured
● Mature, robust, documented, maintained
● Flexible

Page 15: Building data flows with Celery and SQLAlchemy

Enterprise!

Page 16: Building data flows with Celery and SQLAlchemy

DB support

● SQLite
● PostgreSQL
● MySQL
● Oracle
● MS-SQL
● Firebird
● Sybase
● ...

Page 17: Building data flows with Celery and SQLAlchemy

Python support

CPython 2.5+
CPython 3+
Jython 2.5+
PyPy 1.5+

Page 18: Building data flows with Celery and SQLAlchemy

Structure

Page 19: Building data flows with Celery and SQLAlchemy

SQLAlchemy Core

● Abstraction over Python's DBAPI
● SQL language via generative Python expressions

Page 20: Building data flows with Celery and SQLAlchemy

SQLAlchemy Core

● Good for DB performance
  – bulk operations
  – complex queries
  – fine-tuning
  – connection/tx management (a quick sketch follows below)
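The next slides build up a concrete example using engine.execute(). As a side note that is not in the talk, here is a minimal, self-contained sketch of the connection/transaction-management point using SQLAlchemy's engine.begin() context manager; the staging_log table is made up purely for illustration:

from sqlalchemy import *

engine = create_engine('sqlite:///:memory:')
metadata = MetaData()
# 'staging_log' is a hypothetical table, only for this sketch
staging_log = Table('staging_log', metadata, Column('message', String))
metadata.create_all(engine)

# engine.begin() hands out a connection wrapped in a transaction:
# it commits if the block completes and rolls back on an exception.
with engine.begin() as conn:
    conn.execute(staging_log.insert(), [{'message': 'extract started'}])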

Page 21: Building data flows with Celery and SQLAlchemy

Create a table

from sqlalchemy import *

engine = create_engine('sqlite:///:memory:')
metadata = MetaData()

vehicles_table = Table('vehicles', metadata,
    Column('model', String),
    Column('registration', String),
    Column('odometer', Integer),
    Column('last_service', Date),
)

vehicles_table.create(bind=engine)

Page 22: Building data flows with Celery and SQLAlchemy

Insert data

values = [
    {'model': 'Ford Festiva', 'registration': 'HAX00R', 'odometer': 3141},
    {'model': 'Lotus Elise', 'registration': 'DELEG8', 'odometer': 31415},
]

rows = engine.execute(
    vehicles_table.insert(),
    list(values)
).rowcount

Page 23: Building data flows with Celery and SQLAlchemy

Query data

query = select(
    [vehicles_table]
).where(
    vehicles_table.c.odometer < 100
)

results = engine.execute(query)

for row in results:
    print row
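The select() construct composes generatively, which is what makes the "complex queries" point from the Core slides practical. As a small aside of my own (not from the talk), an aggregate over the same vehicles_table might look like this; the vehicle_count label is just illustrative:

query = select([
    vehicles_table.c.model,
    func.count(vehicles_table.c.registration).label('vehicle_count'),
]).group_by(
    vehicles_table.c.model
)

for row in engine.execute(query):
    print row.model, row.vehicle_count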

Page 24: Building data flows with Celery and SQLAlchemy

Encapsulating a unit of work

Page 25: Building data flows with Celery and SQLAlchemy

Example Processor Types

● Extract
  – Extract from CSV
  – Extract from DB table
  – Scrape web page

● Transform
  – Copy table from extract layer
  – Derive column
  – Join tables

Page 26: Building data flows with Celery and SQLAlchemy

Abstract Processor

class BaseProcessor(object):
    def dispatch(self):
        return self._run()

    def _run(self):
        return self.run()

    def run(self):
        raise NotImplementedError

Page 27: Building data flows with Celery and SQLAlchemy

Abstract Database Processor

class DatabaseProcessor(BaseProcessor):
    db_class = None
    engine = None
    metadata = None

    @contextlib.contextmanager
    def _with_session(self):
        with self.db_class().get_engine() as engine:
            self.engine = engine
            self.metadata = MetaData(bind=engine)
            yield

    def _run(self):
        with self._with_session():
            return self.run()

Page 28: Building data flows with Celery and SQLAlchemy

CSV Extract Mixin

class CSVExtractMixin(object):
    input_file = None

    def _run(self):
        with self._with_session():
            self.reader = csv.DictReader(self.input_file)
            return self.run()

Page 29: Building data flows with Celery and SQLAlchemy

A Concrete Extract

class SalesHistoryExtract(CSVExtractMixin, DatabaseProcessor):
    target_table_name = 'SalesHistoryExtract'
    input_file = SALES_FILENAME

    def run(self):
        target_table = Table(self.target_table_name, self.metadata)
        columns = self.reader.next()
        [target_table.append_column(Column(column, ...))
            for column in columns if column]
        target_table.create()
        insert = target_table.insert()
        new_record_count = self.engine.execute(
            insert, list(self.reader)).rowcount
        return new_record_count

Page 30: Building data flows with Celery and SQLAlchemy

An Abstract Derive Transform

class AbstractDeriveTransform(DatabaseProcessor):
    table_name = None
    key_columns = None
    select_columns = None
    target_columns = None

    def process_row(self, row):
        raise NotImplementedError

    ...  # Profit!
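The slide elides the body with "... # Profit!". Purely as a guess at the shape (none of this is from the talk; it assumes the derived columns already exist on the transform table, and all names are mine), the elided run() of a derive transform might look something like:

from sqlalchemy import Table, select, and_

class DeriveTransformSketch(AbstractDeriveTransform):
    # Hypothetical sketch only - not the talk's implementation.
    def run(self):
        # Reflect the existing transform-layer table.
        table = Table(self.table_name, self.metadata, autoload=True)
        # Read just the key and input columns...
        query = select(
            [table.c[name] for name in self.key_columns + self.select_columns])
        updated = 0
        for row in self.engine.execute(query):
            derived = self.process_row(row)   # business logic hook
            where = and_(*[table.c[key] == row[key]
                           for key in self.key_columns])
            # ...and write the derived values back, keyed on key_columns.
            self.engine.execute(
                table.update().where(where).values(**derived))
            updated += 1
        return updated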

Page 31: Building data flows with Celery and SQLAlchemy

A Concrete Transform

from business_logic import derive_foo

class DeriveFooTransform(AbstractDeriveTransform):
    table_name = 'SalesTransform'
    key_columns = ['txn_id']
    select_columns = ['location', 'username']
    target_columns = [Column('foo', FOO_TYPE)]

    def process_row(self, row):
        foo = derive_foo(row.location, row.username)
        return {'foo': foo}

Page 32: Building data flows with Celery and SQLAlchemy

Introducing Celery

Distributed Task Queue

Page 33: Building data flows with Celery and SQLAlchemy

A Processor Task

class AbstractProcessorTask(celery.Task):
    abstract = True
    processor_class = None

    def run(self, *args, **kwargs):
        processor = self.processor_class(*args, **kwargs)
        return processor.dispatch()

class DeriveFooTask(AbstractProcessorTask):
    processor_class = DeriveFooTransform

DeriveFooTask().apply_async()  # Run it!

Page 34: Building data flows with Celery and SQLAlchemy

Canvas: Designing Workflows

● Combines a series of tasks
● Groups run in parallel
● Chains run in series
● Can be combined in different ways

>>> new_user_workflow = (create_user.s() | group(
...                          import_contacts.s(),
...                          send_welcome_email.s()))
... new_user_workflow.delay(username='artv',
...                         first='Art',
...                         last='Vandelay',
...                         email='[email protected]')
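One detail worth flagging before the flow slides (my aside, not from the talk): the flow code later uses .si() rather than .s(). A .s() signature is partial, so in a chain the previous task's return value is prepended to its arguments; .si() is an immutable signature that ignores the parent result, which suits ETL steps that read and write the database rather than passing data through the broker. A tiny illustration with a made-up add task:

from celery import Celery

app = Celery('sketch', broker='memory://')   # hypothetical app, for illustration only

@app.task
def add(x, y):
    return x + y

mutable_flow = add.s(2, 2) | add.s(4)          # second step receives the 4 -> add(4, 4)
immutable_flow = add.s(2, 2) | add.si(10, 10)  # second step ignores the 4 -> add(10, 10)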

Page 35: Building data flows with Celery and SQLAlchemy

Sample Data Processing Flow

[Flow diagram:
Extract sales / Extract customers / Extract products
  → Copy sales to transform (then Normalise currency) / Copy customers to transform / Copy products to transform
  → Join tables
  → Aggregate sales by customer / Aggregate sales by region / Customer data exception report]

Page 36: Building data flows with Celery and SQLAlchemy

Sample Data Processing Flow

extract_flow = group((
    ExtractSalesTask().si(),
    ExtractCustTask().si(),
    ExtractProductTask().si()))

transform_flow = group((
    CopySalesTask().si() | NormaliseCurrencyTask().si(),
    CopyCustTask().si(),
    CopyProductTask().si())) | JoinTask().si()

load_flow = group((
    QualityTask().si(),
    AggregateTask().si('cust_id'),
    AggregateTask().si('region_id')))

all_flow = extract_flow | transform_flow | load_flow
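A brief usage note of my own (not on the slide): the composed flow is itself a signature, so it can be queued like any single task:

# Schedules extract, then transform, then load; each group's tasks
# run in parallel on the workers, while the layers run in series.
all_flow.apply_async()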

Page 37: Building data flows with Celery and SQLAlchemy

Monitoring – celery events

Page 38: Building data flows with Celery and SQLAlchemy

Monitoring – celery flower

Page 39: Building data flows with Celery and SQLAlchemy

Turning it up to 11

● A requires/depends structure
● Incremental data loads
● Parameterised flows (a rough sketch follows below)
● Tracking flow history
● Hooking into other libraries
  – NLTK
  – SciPy/NumPy
  – ...
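For example, a parameterised flow might take a load date so the same chain can be reused for incremental runs. This is a sketch of my own; the since= keyword and the build_daily_flow helper are assumed, not from the talk:

import datetime
from celery import group

def build_daily_flow(load_date):
    # Pass the date into each immutable signature so every task
    # extracts/transforms only that day's data (hypothetical kwarg).
    extract = group((
        ExtractSalesTask().si(since=load_date),
        ExtractCustTask().si(since=load_date),
        ExtractProductTask().si(since=load_date)))
    return extract | JoinTask().si(since=load_date)

build_daily_flow(datetime.date.today()).apply_async()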

Page 40: Building data flows with Celery and SQLAlchemy

Summary

● Intro to data warehousing
● Process data with SQLAlchemy
● Task dependencies with Celery canvas

Page 41: Building data flows with Celery and SQLAlchemy

Resources

● SQLAlchemy Core: http://bit.ly/10FdYZo
● Celery Canvas: http://bit.ly/MOjazT
● http://databrewery.org
  – Bubbles: http://bit.ly/14hNsV0
  – Pipeline: http://bit.ly/15RXvWa
● http://schoolofdata.org

Page 42: Building data flows with Celery and SQLAlchemy

Thank You!

Questions?

http://slideshare.net/mindsocket