Building data flows with Celery and SQLAlchemy


Reporting and analysis systems rely on coherent and reliable data, often from disparate sources. To that end, a series of well established data warehousing practices have emerged to extract data and produce a consistent data store. This talk will look at some options for composing workflows using Python. In particular, we'll explore beyond Celery's asynchronous task processing functionality into its workflow (aka Canvas) system and how it can be used in conjunction with SQLAlchemy's architecture to provide the building blocks for data stream processing.


PyCon Australia 2013

Roger Barnes
@mindsocket
roger@mindsocket.com.au
http://slideshare.net/mindsocket

Coming up

● Data warehousing
  – AKA data integration

● Processing data flows
  – SQLAlchemy
  – Celery

● Tying it all together

About me

● 15 years doing all things software
● 11 years at a Business Intelligence vendor
● Currently contracting
  – This talk is based on a real reporting system

Why Data Warehousing?

But we need reports that are:
● Timely
● Unambiguous
● Accurate
● Complete
● … and don't impact production systems

What is a Data Warehouse?

"… central repository of data which is created by integrating

data from one or more disparate sources" - Wikipedia

Extract, Transform, Load

[ETL diagram. Source: www.imc.com]

Python can help!

● Rapid prototyping
● Code re-use
● Existing libraries
● Decouple
  – data flow management
  – data processing
  – business logic

Existing solutions

● Not a lot available in the Python space
● People roll their own
● Bubbles (Brewery 2)
  – Framework for Python 3
  – "Focus on the process, not the data technology"

Ways to move data around

● Flat files
● NoSQL data stores
● RDBMS

SQLAlchemy is...

Python SQL toolkit & Object Relational Mapper

About SQLAlchemy

● Full featured
● Mature, robust, documented, maintained
● Flexible

Enterprise!

DB support

● SQLite
● PostgreSQL
● MySQL
● Oracle
● MS-SQL
● Firebird
● Sybase
● ...

Python support

● CPython 2.5+
● CPython 3+
● Jython 2.5+
● PyPy 1.5+

Structure

SQLAlchemy Core

● Abstraction over Python's DBAPI
● SQL language via generative Python expressions

SQLAlchemy Core

● Good for DB performance
  – bulk operations
  – complex queries
  – fine-tuning
  – connection/tx management (see the sketch below)
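As a taste of the last point: engine.begin() yields a connection whose transaction commits if the block succeeds and rolls back if it raises. A minimal sketch, reusing the engine and vehicles_table defined on the next slides; the row data here is illustrative only:

with engine.begin() as conn:  # one transaction for the whole block
    conn.execute(vehicles_table.insert(),
                 [{'model': 'Mini', 'registration': 'MINI01'}])
    # commits here on success; rolls back if anything raised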

Create a table

from sqlalchemy import *

engine = create_engine('sqlite:///:memory:')
metadata = MetaData()

vehicles_table = Table('vehicles', metadata,
    Column('model', String),
    Column('registration', String),
    Column('odometer', Integer),
    Column('last_service', Date),
)

vehicles_table.create(bind=engine)

Insert data

values = [
    {'model': 'Ford Festiva', 'registration': 'HAX00R',
     'odometer': 3141},
    {'model': 'Lotus Elise', 'registration': 'DELEG8',
     'odometer': 31415},
]

rows = engine.execute(
    vehicles_table.insert(), list(values)
).rowcount

Query data

query = select([vehicles_table]).where(
    vehicles_table.c.odometer < 100
)

results = engine.execute(query)

for row in results:
    print row
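This is the "generative" style mentioned earlier: each method call returns a new statement object, so a query can be composed step by step. A small illustration (only_low_km is a hypothetical flag, not from the talk):

query = select([vehicles_table])
if only_low_km:  # hypothetical flag, for illustration
    query = query.where(vehicles_table.c.odometer < 100)
query = query.order_by(vehicles_table.c.model)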

Encapsulating a unit of work

Example Processor Types

● Extract
  – Extract from CSV
  – Extract from DB table
  – Scrape web page

● Transform
  – Copy table from extract layer
  – Derive column
  – Join tables

Abstract Processor

class BaseProcessor(object):
    def dispatch(self):
        return self._run()

    def _run(self):
        return self.run()

    def run(self):
        raise NotImplementedError
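The dispatch() / _run() / run() split is a template-method hook: concrete processors implement run(), while base classes and mixins override _run() to wrap it with setup and teardown (as the next slides do with database sessions and CSV readers). A trivial illustration, not from the talk:

class HelloProcessor(BaseProcessor):
    def run(self):
        return 'hello'

HelloProcessor().dispatch()  # -> 'hello'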

Abstract Database Processor

import contextlib

class DatabaseProcessor(BaseProcessor):
    db_class = None
    engine = None
    metadata = None

    @contextlib.contextmanager
    def _with_session(self):
        with self.db_class().get_engine() as engine:
            self.engine = engine
            self.metadata = MetaData(bind=engine)
            yield

    def _run(self):
        with self._with_session():
            return self.run()

CSV Extract Mixin

import csv

class CSVExtractMixin(object):
    input_file = None  # path to the CSV file, e.g. SALES_FILENAME

    def _run(self):
        with self._with_session():
            self.reader = csv.DictReader(open(self.input_file))
            return self.run()

A Concrete Extract

class SalesHistoryExtract(CSVExtractMixin, DatabaseProcessor):
    target_table_name = 'SalesHistoryExtract'
    input_file = SALES_FILENAME

    def run(self):
        target_table = Table(self.target_table_name, self.metadata)
        # build a column per CSV header field
        for column in self.reader.fieldnames:
            if column:
                target_table.append_column(Column(column, ...))
        target_table.create()
        insert = target_table.insert()
        new_record_count = self.engine.execute(
            insert, list(self.reader)).rowcount
        return new_record_count

An Abstract Derive Transform

class AbstractDeriveTransform(DatabaseProcessor):
    table_name = None
    key_columns = None
    select_columns = None
    target_columns = None

    def process_row(self, row):
        raise NotImplementedError

    ...  # Profit!
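The base run() is elided above. Purely as a sketch of one plausible shape (the query/update logic below is an assumption, not the talk's code, and it presumes the target columns already exist on the table): read the key and select columns, feed each row to process_row(), and write the derived values back:

# Hypothetical sketch of the elided run(); not the original implementation
def run(self):
    table = Table(self.table_name, self.metadata, autoload=True)
    rows = self.engine.execute(select(
        [table.c[name]
         for name in self.key_columns + self.select_columns])).fetchall()
    count = 0
    for row in rows:
        derived = self.process_row(row)  # e.g. {'foo': ...}
        self.engine.execute(
            table.update()
                 .where(and_(*[table.c[k] == row[k]
                               for k in self.key_columns]))
                 .values(**derived))
        count += 1
    return count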

A Concrete Transform

from business_logic import derive_foo

class DeriveFooTransform(AbstractDeriveTransform):
    table_name = 'SalesTransform'
    key_columns = ['txn_id']
    select_columns = ['location', 'username']
    target_columns = [Column('foo', FOO_TYPE)]

    def process_row(self, row):
        foo = derive_foo(row.location, row.username)
        return {'foo': foo}

Introducing Celery

Distributed Task Queue
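For anyone new to Celery: a task is a plain callable registered with an app; calling delay() puts a message on a broker, and a worker process executes it. A minimal example (the broker URL is illustrative):

import celery

app = celery.Celery('flows', broker='amqp://localhost')  # illustrative broker URL

@app.task
def add(x, y):
    return x + y

add.delay(2, 3)  # queued; a worker picks it up and runs it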

A Processor Task

class AbstractProcessorTask(celery.Task):
    abstract = True
    processor_class = None

    def run(self, *args, **kwargs):
        processor = self.processor_class(*args, **kwargs)
        return processor.dispatch()


class DeriveFooTask(AbstractProcessorTask):
    processor_class = DeriveFooTransform


DeriveFooTask().apply_async()  # Run it!

Canvas: Designing Workflows

● Combines a series of tasks
● Groups run in parallel
● Chains run in series
● Can be combined in different ways

>>> new_user_workflow = (create_user.s() | group(... import_contacts.s(),... send_welcome_email.s()))... new_user_workflow.delay(username='artv',... first='Art',... last='Vandelay',... email='art@vandelay.com')

Sample Data Processing Flow

[Flow diagram: Extract sales / Extract customers / Extract products → Copy sales to transform (then Normalise currency) / Copy customers to transform / Copy products to transform → Join tables → Aggregate sales by customer / Aggregate sales by region / Customer data exception report]

Sample Data Processing Flow

extract_flow = group((
    ExtractSalesTask().si(),
    ExtractCustTask().si(),
    ExtractProductTask().si()))

transform_flow = group((
    CopySalesTask().si() | NormaliseCurrencyTask().si(),
    CopyCustTask().si(),
    CopyProductTask().si())) | JoinTask().si()

load_flow = group((
    QualityTask().si(),
    AggregateTask().si('cust_id'),
    AggregateTask().si('region_id')))

all_flow = extract_flow | transform_flow | load_flow
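Note si() rather than s(): si() builds an immutable signature, so each task ignores its predecessor's return value, whereas s() would pass results down the chain. Launching the combined flow is then one call (reading the result assumes a result backend is configured):

result = all_flow.apply_async()  # schedules the whole ETL graph
print result.get()               # block until the final group finishes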

Monitoring – celery events

Monitoring – celery flower

Turning it up to 11

● A requires/depends structure
● Incremental data loads
● Parameterised flows
● Tracking flow history
● Hooking into other libraries
  – NLTK
  – SciPy/NumPy
  – ...

Summary

● Intro to data warehousing
● Process data with SQLAlchemy
● Task dependencies with Celery Canvas

Resources

● SQLAlchemy Core: http://bit.ly/10FdYZo
● Celery Canvas: http://bit.ly/MOjazT
● http://databrewery.org
  – Bubbles: http://bit.ly/14hNsV0
  – Pipeline: http://bit.ly/15RXvWa

● http://schoolofdata.org

Thank You!

Questions?

http://slideshare.net/mindsocket