Page 1: Building data flows with Celery and SQLAlchemy

Building data flows with Celery and SQLAlchemy

PyCon Australia 2013

Roger Barnes
http://slideshare.net/mindsocket

Page 2: Building data flows with Celery and SQLAlchemy

Coming up

● Data warehousing
  – AKA data integration
● Processing data flows
  – SQLAlchemy
  – Celery
● Tying it all together

Page 3: Building data flows with Celery and SQLAlchemy

About me

● 15 years doing all things software
● 11 years at a Business Intelligence vendor
● Currently contracting
  – This talk is based on a real reporting system

Page 4: Building data flows with Celery and SQLAlchemy

Why Data Warehousing?

Page 5: Building data flows with Celery and SQLAlchemy

Why Data Warehousing?

But we need reports that are

● Timely
● Unambiguous
● Accurate
● Complete
● … and don't impact production systems

Page 6: Building data flows with Celery and SQLAlchemy

What is a Data Warehouse

"… central repository of data which is created by integrating data from one or more disparate sources" - Wikipedia

Page 7: Building data flows with Celery and SQLAlchemy
Page 8: Building data flows with Celery and SQLAlchemy
Page 9: Building data flows with Celery and SQLAlchemy

Extract, Transform, Load

Source: www.imc.com
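
To make the three stages concrete, here is a minimal sketch using only the Python standard library (csv and sqlite3), not the SQLAlchemy and Celery stack the talk goes on to describe. The sample data, the exchange rate and the table name are made up for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical source data; in practice this would be a file or an upstream system.
RAW_CSV = """txn_id,amount,currency
1,10.50,AUD
2,8.00,USD
3,,AUD
"""

# Illustrative fixed rate, not real market data.
USD_TO_AUD = 1.5


def extract(text):
    """Extract: parse CSV text into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows without an amount and normalise everything to AUD."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # incomplete record, skipped
        amount = float(row["amount"])
        if row["currency"] == "USD":
            amount *= USD_TO_AUD
        cleaned.append((int(row["txn_id"]), amount))
    return cleaned


def load(conn, rows):
    """Load: write the cleaned rows into a reporting table."""
    conn.execute("CREATE TABLE sales (txn_id INTEGER PRIMARY KEY, amount_aud REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))
loaded = conn.execute("SELECT txn_id, amount_aud FROM sales ORDER BY txn_id").fetchall()
print(loaded)
```

Keeping each stage a separate function is what later allows the stages to be scheduled independently.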

Page 10: Building data flows with Celery and SQLAlchemy

Python can help!

● Rapid prototyping
● Code re-use
● Existing libraries
● Decouple
  – data flow management
  – data processing
  – business logic

Page 11: Building data flows with Celery and SQLAlchemy

Existing solutions

● Not a lot available in the Python space
● People roll their own
● Bubbles (Brewery 2)
  – Framework for Python 3
  – "Focus on the process, not the data technology"

Page 12: Building data flows with Celery and SQLAlchemy

Ways to move data around

● Flat files
● NoSQL data stores
● RDBMS

Page 13: Building data flows with Celery and SQLAlchemy

SQLAlchemy is...

Python SQL toolkit

&

Object Relational Mapper

Page 14: Building data flows with Celery and SQLAlchemy

About SQLAlchemy

● Full featured
● Mature, robust, documented, maintained
● Flexible

Page 15: Building data flows with Celery and SQLAlchemy

Enterprise!

Page 16: Building data flows with Celery and SQLAlchemy

DB support

● SQLite
● PostgreSQL
● MySQL
● Oracle
● MS-SQL
● Firebird
● Sybase
● ...

Page 17: Building data flows with Celery and SQLAlchemy

Python support

● CPython 2.5+
● CPython 3+
● Jython 2.5+
● PyPy 1.5+

Page 18: Building data flows with Celery and SQLAlchemy

Structure

Page 19: Building data flows with Celery and SQLAlchemy

SQLAlchemy Core

● Abstraction over Python's DBAPI
● SQL language via generative Python expressions
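
"Generative" here means that each method call returns a new expression object, and SQL text is only produced when the expression is compiled. A small sketch, assuming SQLAlchemy 1.4 or later (the `select(column)` calling style is newer than this talk); the table is illustrative and no database connection is involved.

```python
from sqlalchemy import Column, Integer, MetaData, String, Table, select

metadata = MetaData()
vehicles = Table(
    "vehicles", metadata,
    Column("model", String),
    Column("odometer", Integer),
)

# Each call returns a new expression object; nothing touches a database yet.
query = select(vehicles.c.model).where(vehicles.c.odometer < 100)
query = query.order_by(vehicles.c.model)

# str() compiles the expression to SQL text using the default, generic dialect.
sql = str(query)
print(sql)
```

The literal 100 appears in the printed SQL as a bound parameter rather than inline text, which is part of what the DBAPI abstraction handles.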

Page 20: Building data flows with Celery and SQLAlchemy

SQLAlchemy Core

● Good for DB performance
  – bulk operations
  – complex queries
  – fine-tuning
  – connection/tx management
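
A rough sketch of two of these points, bulk operations and transaction management, assuming SQLAlchemy 1.4 or later and an in-memory SQLite database. The `readings` table and its rows are invented for the example.

```python
from sqlalchemy import Column, Integer, MetaData, String, Table, create_engine, func, select
from sqlalchemy.exc import IntegrityError

engine = create_engine("sqlite:///:memory:")
metadata = MetaData()
readings = Table(
    "readings", metadata,
    Column("registration", String, primary_key=True),
    Column("odometer", Integer),
)
metadata.create_all(engine)

# Bulk operation: a list of dicts is executed as one batched insert.
batch = [{"registration": "REG%03d" % i, "odometer": i * 1000} for i in range(500)]
with engine.begin() as conn:  # commits on success, rolls back on an exception
    conn.execute(readings.insert(), batch)

# Transaction management: a failing unit of work is rolled back as a whole.
rolled_back = False
try:
    with engine.begin() as conn:
        conn.execute(readings.insert(), {"registration": "NEW001", "odometer": 1})
        # Duplicate primary key, so this statement fails.
        conn.execute(readings.insert(), {"registration": "REG000", "odometer": 2})
except IntegrityError:
    rolled_back = True

with engine.connect() as conn:
    row_count = conn.execute(select(func.count()).select_from(readings)).scalar()
print(rolled_back, row_count)
```

Because the second block is rolled back, the `NEW001` row inserted just before the failure does not survive either.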

Page 21: Building data flows with Celery and SQLAlchemy

Create a table

from sqlalchemy import (Column, Date, Integer, MetaData, String, Table,
                        create_engine, select)

engine = create_engine('sqlite:///:memory:')
metadata = MetaData()

vehicles_table = Table('vehicles', metadata,
    Column('model', String),
    Column('registration', String),
    Column('odometer', Integer),
    Column('last_service', Date),
)

vehicles_table.create(bind=engine)

Page 22: Building data flows with Celery and SQLAlchemy

Insert data

values = [
    {'model': 'Ford Festiva', 'registration': 'HAX00R', 'odometer': 3141},
    {'model': 'Lotus Elise', 'registration': 'DELEG8', 'odometer': 31415},
]
rows = engine.execute(vehicles_table.insert(), list(values)).rowcount

Page 23: Building data flows with Celery and SQLAlchemy

Query data

query = select(
    [vehicles_table]
).where(
    vehicles_table.c.odometer < 100
)

results = engine.execute(query)

for row in results:
    print(row)
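
The create, insert and query slides use the API of the time: `engine.execute()` and `select([...])` with a list were both removed in SQLAlchemy 2.0. For readers on current versions, the same three steps might look like the sketch below, which assumes SQLAlchemy 1.4 or later. The odometer threshold is raised from 100 to 10000 here, purely so that the sample data returns a row.

```python
from sqlalchemy import Column, Date, Integer, MetaData, String, Table, create_engine, select

engine = create_engine("sqlite:///:memory:")
metadata = MetaData()
vehicles_table = Table(
    "vehicles", metadata,
    Column("model", String),
    Column("registration", String),
    Column("odometer", Integer),
    Column("last_service", Date),
)
metadata.create_all(engine)

values = [
    {"model": "Ford Festiva", "registration": "HAX00R", "odometer": 3141},
    {"model": "Lotus Elise", "registration": "DELEG8", "odometer": 31415},
]

# Statements are executed on an explicit connection; begin() commits at the end.
with engine.begin() as conn:
    conn.execute(vehicles_table.insert(), values)

# select() now takes the table (or columns) positionally rather than as a list.
query = select(vehicles_table).where(vehicles_table.c.odometer < 10000)
with engine.connect() as conn:
    models = [row.model for row in conn.execute(query)]
print(models)
```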

Page 24: Building data flows with Celery and SQLAlchemy

Encapsulating a unit of work

Page 25: Building data flows with Celery and SQLAlchemy

Example Processor Types

● Extract
  – Extract from CSV
  – Extract from DB table
  – Scrape web page
● Transform
  – Copy table from extract layer
  – Derive column
  – Join tables

Page 26: Building data flows with Celery and SQLAlchemy

Abstract Processor

class BaseProcessor(object):
    def dispatch(self):
        return self._run()

    def _run(self):
        return self.run()

    def run(self):
        raise NotImplementedError
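
The three layers give each level of the hierarchy one hook: `dispatch()` is the public entry point, `_run()` is where intermediate classes wrap set-up and tear-down, and `run()` holds the unit of work itself. A small sketch of how that plays out; `LoggedProcessor` and `CountRows` are hypothetical classes, not part of the talk.

```python
class BaseProcessor(object):
    def dispatch(self):
        return self._run()

    def _run(self):
        return self.run()

    def run(self):
        raise NotImplementedError


class LoggedProcessor(BaseProcessor):
    """Hypothetical intermediate class: wraps run() with set-up and tear-down."""
    log = None

    def _run(self):
        self.log = ["setup"]
        try:
            return self.run()
        finally:
            self.log.append("teardown")


class CountRows(LoggedProcessor):
    """Hypothetical concrete processor: contains only the unit of work."""
    rows = ["a", "b", "c"]

    def run(self):
        self.log.append("run")
        return len(self.rows)


processor = CountRows()
result = processor.dispatch()
print(result, processor.log)
```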

Page 27: Building data flows with Celery and SQLAlchemy

Abstract Database Processor

import contextlib

from sqlalchemy import MetaData

class DatabaseProcessor(BaseProcessor):
    db_class = None
    engine = None
    metadata = None

    @contextlib.contextmanager
    def _with_session(self):
        with self.db_class().get_engine() as engine:
            self.engine = engine
            self.metadata = MetaData(bind=engine)
            yield

    def _run(self):
        with self._with_session():
            return self.run()

Page 28: Building data flows with Celery and SQLAlchemy

CSV Extract Mixin

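import csv

class CSVExtractMixin(object):
    input_file = None  # path to the CSV file

    def _run(self):
        # Reuse the session helper defined on DatabaseProcessor and
        # open the file so DictReader iterates over lines, not characters
        with self._with_session(), open(self.input_file) as csv_file:
            self.reader = csv.DictReader(csv_file)
            return self.run()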

Page 29: Building data flows with Celery and SQLAlchemy

A Concrete Extract

class SalesHistoryExtract(CSVExtractMixin, DatabaseProcessor):
    target_table_name = 'SalesHistoryExtract'
    input_file = SALES_FILENAME

    def run(self):
        target_table = Table(self.target_table_name, self.metadata)
        # fieldnames gives the header row without consuming a data row
        columns = self.reader.fieldnames
        for column in columns:
            if column:
                target_table.append_column(Column(column, ...))
        target_table.create()
        insert = target_table.insert()
        new_record_count = self.engine.execute(insert, list(self.reader)).rowcount
        return new_record_count
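
The order of the base classes matters: with the mixin listed first, Python's method resolution order picks the mixin's `_run()` ahead of the database processor's. The sketch below shows this with stand-in classes that record events instead of opening a real engine. It assumes the mixin is meant to reuse the `_with_session()` helper, and it feeds the reader a file-like object rather than a filename.

```python
import contextlib
import csv
import io


class BaseProcessor(object):
    def dispatch(self):
        return self._run()

    def _run(self):
        return self.run()


class DatabaseProcessor(BaseProcessor):
    """Stand-in for the slide's class: records events instead of opening an engine."""

    def __init__(self):
        self.events = []

    @contextlib.contextmanager
    def _with_session(self):
        self.events.append("session opened")
        try:
            yield
        finally:
            self.events.append("session closed")

    def _run(self):
        with self._with_session():
            return self.run()


class CSVExtractMixin(object):
    input_file = None  # a file-like object in this sketch

    def _run(self):
        with self._with_session():
            self.reader = csv.DictReader(self.input_file)
            return self.run()


class DemoExtract(CSVExtractMixin, DatabaseProcessor):
    # Class-level buffer is fine for a one-shot demo; it can only be read once.
    input_file = io.StringIO("txn_id,amount\n1,10\n2,20\n")

    def run(self):
        self.events.append("run")
        return [row["txn_id"] for row in self.reader]


extract = DemoExtract()
ids = extract.dispatch()
mro_names = [cls.__name__ for cls in DemoExtract.__mro__]
print(mro_names)
print(ids, extract.events)
```

Swapping the base class order would make `DatabaseProcessor._run()` win, and `self.reader` would never be set.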

Page 30: Building data flows with Celery and SQLAlchemy

An Abstract Derive Transform

class AbstractDeriveTransform(DatabaseProcessor):
    table_name = None
    key_columns = None
    select_columns = None
    target_columns = None

    def process_row(self, row):
        raise NotImplementedError

    ... # Profit!

Page 31: Building data flows with Celery and SQLAlchemy

A Concrete Transform

from business_logic import derive_foo

class DeriveFooTransform(AbstractDeriveTransform):
    table_name = 'SalesTransform'
    key_columns = ['txn_id']
    select_columns = ['location', 'username']
    target_columns = [Column('foo', FOO_TYPE)]

    def process_row(self, row):
        foo = derive_foo(row.location, row.username)
        return {'foo': foo}
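
The slides leave out how the abstract transform drives `process_row()`. One plausible shape, and only a guess rather than the speaker's implementation, is: select the key and input columns, call the row function, then write the returned values back by key. The sketch assumes SQLAlchemy 1.4 or later, a single key column, and a target column that already exists. `derive_column` and this `derive_foo` are hypothetical stand-ins.

```python
from sqlalchemy import Column, Integer, MetaData, String, Table, create_engine, select

engine = create_engine("sqlite:///:memory:")
metadata = MetaData()
sales = Table(
    "SalesTransform", metadata,
    Column("txn_id", Integer, primary_key=True),
    Column("location", String),
    Column("username", String),
    Column("foo", String),  # derived column, created up front here for brevity
)
metadata.create_all(engine)


def derive_foo(location, username):
    """Hypothetical business rule standing in for business_logic.derive_foo."""
    return "%s@%s" % (username.lower(), location.upper())


def derive_column(conn, table, key_column, select_columns, process_row):
    """Read key and input columns, derive values per row, write them back by key."""
    wanted = [table.c[key_column]] + [table.c[name] for name in select_columns]
    rows = conn.execute(select(*wanted)).fetchall()
    for row in rows:
        new_values = process_row(row)
        key = getattr(row, key_column)
        conn.execute(
            table.update().where(table.c[key_column] == key).values(**new_values)
        )
    return len(rows)


with engine.begin() as conn:
    conn.execute(sales.insert(), [
        {"txn_id": 1, "location": "syd", "username": "Alice"},
        {"txn_id": 2, "location": "mel", "username": "Bob"},
    ])
    count = derive_column(
        conn, sales, "txn_id", ["location", "username"],
        lambda row: {"foo": derive_foo(row.location, row.username)},
    )
    derived = conn.execute(
        select(sales.c.txn_id, sales.c.foo).order_by(sales.c.txn_id)
    ).fetchall()

result = [tuple(row) for row in derived]
print(count, result)
```

A row-by-row update keeps the business logic in plain Python, at the cost of one statement per row; larger tables would probably want batching.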

Page 32: Building data flows with Celery and SQLAlchemy

Introducing Celery

Distributed Task Queue

Page 33: Building data flows with Celery and SQLAlchemy

A Processor Task

import celery

class AbstractProcessorTask(celery.Task):
    abstract = True
    processor_class = None

    def run(self, *args, **kwargs):
        processor = self.processor_class(*args, **kwargs)
        return processor.dispatch()

class DeriveFooTask(AbstractProcessorTask):
    processor_class = DeriveFooTransform

DeriveFooTask().apply_async()  # Run it!

Page 34: Building data flows with Celery and SQLAlchemy

Canvas: Designing Workflows

● Combines a series of tasks
● Groups run in parallel
● Chains run in series
● Can be combined in different ways

>>> new_user_workflow = (create_user.s() | group(
...                      import_contacts.s(),
...                      send_welcome_email.s()))
>>> new_user_workflow.delay(username='artv',
...                         first='Art',
...                         last='Vandelay',
...                         email='[email protected]')

Page 35: Building data flows with Celery and SQLAlchemy

Sample Data Processing Flow

Diagram nodes:

● Extract sales, Extract customers, Extract products
● Copy sales to transform, Copy customers to transform, Copy products to transform
● Normalise currency
● Join tables
● Aggregate sales by customer, Aggregate sales by region
● Customer data exception report

Page 36: Building data flows with Celery and SQLAlchemy

Sample Data Processing Flow

extract_flow = group((
    ExtractSalesTask().si(),
    ExtractCustTask().si(),
    ExtractProductTask().si()))

transform_flow = group((
    CopySalesTask().si() | NormaliseCurrencyTask().si(),
    CopyCustTask().si(),
    CopyProductTask().si())) | JoinTask().si()

load_flow = group((
    QualityTask().si(),
    AggregateTask().si('cust_id'),
    AggregateTask().si('region_id')))

all_flow = extract_flow | transform_flow | load_flow

Page 37: Building data flows with Celery and SQLAlchemy

Monitoring – celery events

Page 38: Building data flows with Celery and SQLAlchemy

Monitoring – celery flower

Page 39: Building data flows with Celery and SQLAlchemy

Turning it up to 11

● A requires/depends structure
● Incremental data loads
● Parameterised flows
● Tracking flow history
● Hooking into other libraries
  – NLTK
  – SciPy/NumPy
  – ...
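
As one illustration of the first point, a requires/depends structure lets the flow be derived from declarations instead of being wired by hand. The sketch below uses the standard library's `graphlib` (Python 3.9+) rather than anything from Celery or the talk. The task names and the `REQUIRES` mapping are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical "requires" declarations: each task lists the tasks it depends on.
REQUIRES = {
    "extract_sales": set(),
    "extract_customers": set(),
    "copy_sales": {"extract_sales"},
    "copy_customers": {"extract_customers"},
    "normalise_currency": {"copy_sales"},
    "join_tables": {"normalise_currency", "copy_customers"},
    "aggregate_by_customer": {"join_tables"},
}


def plan_stages(requires):
    """Group tasks into stages; every task in a stage has all its dependencies met."""
    sorter = TopologicalSorter(requires)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        stages.append(ready)
        sorter.done(*ready)
    return stages


stages = plan_stages(REQUIRES)
for number, stage in enumerate(stages, start=1):
    print(number, stage)
```

Each stage could then be mapped to a group and the list of stages to a chain, though that mapping is left out here.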

Page 40: Building data flows with Celery and SQLAlchemy

Summary

● Intro to data warehousing
● Process data with SQLAlchemy
● Task dependencies with Celery canvas

Page 41: Building data flows with Celery and SQLAlchemy

Resources

● SQLAlchemy core: http://bit.ly/10FdYZo
● Celery Canvas: http://bit.ly/MOjazT
● http://databrewery.org
  – Bubbles: http://bit.ly/14hNsV0
  – Pipeline: http://bit.ly/15RXvWa
● http://schoolofdata.org

Page 42: Building data flows with Celery and SQLAlchemy

Thank You!

Questions?

http://slideshare.net/mindsocket