Data herding

Data herding
Brian Luft <[email protected]>
DjangoCon 2010, Portland, OR

Description

The primary focus of this presentation is the migration of a large, legacy data store into a new schema built with Django. It covers how to structure a migration script so that it runs efficiently and scales, and how to recognize and evaluate trouble spots. Also includes some general tips and tricks for working with data and establishing a productive workflow.

Transcript of Data herding

Page 1: Data herding

Data herding
Brian Luft <[email protected]>

DjangoCon 2010 Portland, OR

Page 2: Data herding

Enough About Me
Working with Django for a few years.

With Lincoln Loop since 2008.

Clients include National Geographic, PBS, Nasuni, and redbeacon.

Page 3: Data herding

The Big Squeeze
Companies and organizations that have been operating for at least a few years have amassed large amounts of content and data.

For most organizations that also means systems and tools that have been around for a while:

• Desktop applications

• Expensive, licensed, proprietary applications

• Frustration mounts as staff can't operate the way they'd like to

Page 4: Data herding

Agenda
Dealing With Data in a Django Environment

• Front matter: A Few Tools and Tips

• Dealing With a Large Legacy Migration

• South in Team Environments

Page 5: Data herding

Django Data-Oriented Commands
• dbshell

• dumpdata / loaddata

• inspectdb

• flush / reset

• sql, sqlall, sqlclear, sqlcustom, sqlflush, sqlindexes, sqlreset, sqlsequencereset

--database option

Page 6: Data herding

Fixtures
Natural keys: a new feature in Django 1.2.

Helps make fixtures much more portable.

Page 7: Data herding

Customizing Data Installation
• post-syncdb

• sqlcustom

◦ appname/sql/modelname.sql


• More powerful syncdb

◦ Handle migrations, sync, and other fixture installations all in one shot

Page 8: Data herding

IPython Tips
1. shell_plus

2. Input/Output Caches

3. Macros

Page 9: Data herding

shell_plus
Part of django-extensions; autoloads your Django models into the interpreter namespace.

python bin/manage.py shell_plus
From 'auth' autoload: Permission, Group, User, Message
From 'contenttypes' autoload: ContentType
From 'sessions' autoload: Session
From 'sites' autoload: Site
From 'admin' autoload: LogEntry
From 'redirects' autoload: Redirect
From 'south' autoload: MigrationHistory
From 'categories' autoload: Category
From 'content' autoload: Page, Content, Guide, WikiName, WikiLink, Template, Attraction
From 'menus' autoload: Menu, MenuItemGroup, MenuItem

Page 10: Data herding

Input Cache
The input cache lets you access and re-evaluate previously entered commands in a flexible manner.

In [5]: _i3

Page 11: Data herding

Output Cache
The output cache lets you access the results of previous statements.

In [23]: range(5)
Out[23]: [0, 1, 2, 3, 4]
...
In [30]: _23
Out[30]: [0, 1, 2, 3, 4]

Page 12: Data herding

Large Output and the Output Cache
You can suppress output using a ';' at the end of a line.

The output cache prevents Python's garbage collector from reclaiming previous results. This can quickly use up memory. Use the cache_size setting to raise or lower the cache size (0 disables it).

Page 13: Data herding

Macros
You can easily use the %macro feature to capture previous input lines into a single command. %hist is handy for this.

You can also %store your macros so you have them available in the IPython namespace across sessions.

Page 14: Data herding

Macros as Workflow Commands
Ever seen this?

SQL Error: 1452: Cannot add or update a child row:
a foreign key constraint fails (`myapp`.`categories`,
CONSTRAINT `fk_categories_categories`
FOREIGN KEY (`parent_id`) REFERENCES `categories` (`id`)
ON DELETE NO ACTION ON UPDATE NO ACTION)

Page 15: Data herding

Macros as Workflow Commands
Mix shell commands and Python:

In [1]: !echo "set foreign_key_checks=0; drop table customers" | manage.py dbshell
In [2]: Accounts.objects.all().delete()
In [3]: import sys
In [4]: print "I'm getting distracted"
In [5]: reload(sys.modules['apps.tickets'])
In [6]: %macro resetaccts
In [7]: %store resetaccts
...
In [45]: resetaccts
In [46]: migrate

Page 16: Data herding

Agenda
Dealing With Data in a Django Environment

• Front matter: A Few Tools and Tips

• Dealing With a Large Legacy Migration

• South in Team Environments

Page 17: Data herding

Legacy Data Migration
Moving data from a big ol' crufty database to your shiny new Django application can be a wedge in agile development processes.

Page 18: Data herding

Prerequisites
• Production snapshot or a reasonable imitation (this is often a pain to get)

• People who know the legacy schema and all the convoluted business requirements

Page 19: Data herding

Get a Bird's-Eye View
Not all data is created alike.

• Table inventory

• Table data types

• Tables with several relations

Page 20: Data herding

Getting Started
• inspectdb

• ORM Multi-DB support

• Create an app dedicated to the legacy schema

• Give your project a migration role (settings file dedicated to the migration)

Page 21: Data herding

Migration Project Role

#migration_settings.py
from settings import *

DATABASES = {}
DATABASES['default'] = {
    'NAME': 'newscompany',
    'ENGINE': 'django.db.backends.mysql',
    'USER': 'joe',
    'PASSWORD': 'schmoe',
}

DATABASES['legacy'] = {
    'NAME': 'newscompany_legacy',
    'ENGINE': 'django.db.backends.mysql',
    'USER': 'root',
    'PASSWORD': 'youllneverguess',
}

Page 22: Data herding

Migration Project Role (continued)

DATABASE_ROUTERS = ['apps.legacy.db_router.LegacyRouter',]

INSTALLED_APPS += (
    'apps.legacy',
)

DEBUG = False

Page 23: Data herding

Debug False?
Wait...

DEBUG = False

???

Page 24: Data herding

What about South?

Page 25: Data herding

"I wouldn't use South for this type of project." - Andrew Godwin, author of South

Page 26: Data herding

Let's Write a Migration!
We'd like to start pulling legacy Articles into the new application. Running inspectdb has given us this:

#apps/legacy/models.py
from django.db import models

class News_Section(models.Model):
    name = models.CharField(max_length=50)
    ...

class News_Topic(models.Model):
    title = models.CharField(max_length=50)
    ...

class News_Article(models.Model):
    section = models.ForeignKey(News_Section)
    topic = models.ForeignKey(News_Topic)
    ...

Page 27: Data herding

Our New Schema
We've decided that the legacy schema works pretty well here and we'll only make minor modifications.

#apps/content/models.py
from django.db import models

class Section(models.Model):
    name = models.CharField(max_length=50)
    ...

class Topic(models.Model):
    name = models.CharField(max_length=50)
    ...

class Article(models.Model):
    section = models.ForeignKey(Section)
    topic = models.ForeignKey(Topic)
    ...

Page 28: Data herding

Mapping Table-to-Table
First we'll migrate Sections:

#apps/legacy/migrations/sections.py
from apps.legacy import models as legacy_models
from apps.content.models import Section

for s in legacy_models.News_Section.objects.all():
    section = Section(name=s.name)
    section.save()

Page 29: Data herding

Another Table-to-Table
Next we handle Topics:

#apps/legacy/migrations/topics.py
from apps.legacy import models as legacy_models
from apps.content.models import Topic

for t in legacy_models.News_Topic.objects.all():
    topic = Topic()

    #map to new field name
    topic.name = t.title

    topic.save()

We're on a roll now!

Page 30: Data herding

Generalizing Table-to-Table
For these types of one-to-one mappings it is easy to generalize the operation and reduce a bunch of repetitive code. (Declarative code FTW!)

def map_table(field_map, src, dst):
    for src_col, dst_col in field_map:
        setattr(dst, dst_col, getattr(src, src_col))
    return dst

MAP = (('name', 'name'),
       ('title', 'headline'),
       ('create_date', 'date_created'))

for some_old_object in legacy_models.SomeOldModel.objects.all():
    map_table(MAP, some_old_object, ShinyNewObject()).save()

Page 31: Data herding

Migrating Tables with Relations
Now, time for the Article.

In this case we have foreign keys to fill so our table-mapping pattern won't get us all the way there. We'll need to account for the relations manually.

for a in legacy_models.News_Article.objects.all():
    article = Article()
    article.headline = a.headline

    #get the Section
    Section.objects.get(...)  #OOPS! How do we know which one?

A snag. We need to look up the Section in the new database that corresponds to the old Article's News_Section. How can we reliably tell which one?

Page 32: Data herding

Preserve a Few Legacy Bits
We need to be able to tell which row in the legacy DB an object in the new system came from. In addition to keeping the original ID, you may want to preserve other fields even if you don't have a definite plan for them. Reasons in favor:

• Reports and historical documents might reference legacy IDs

• Affiliate APIs, third parties, and business partners might have legacy IDs in their systems

• New business rules deprecate some data, but the data might still be useful in the future

I'm in favor of sticking these directly on the models, unless it is more than a few extra fields.

Page 33: Data herding

Introducing Models with Baggage
Our new Section, Topic, and Article models:

#apps/content/models.py

class Section(models.Model):
    legacy_id = models.IntegerField()
    ...

class Topic(models.Model):
    legacy_id = models.IntegerField()
    ...

class Article(models.Model):
    section = models.ForeignKey(Section)
    topic = models.ForeignKey(Topic)
    legacy_id = models.IntegerField()
    ...

Page 34: Data herding

Articles Migration: Take 2

for a in legacy_models.News_Article.objects.all():
    article = Article()
    article.headline = a.headline

    #get the Section
    section = Section.objects.get(legacy_id=a.section.id)
    article.section = section

    #get the Topic
    topic = Topic.objects.get(legacy_id=a.topic.id)
    article.topic = topic
    article.save()

OK, so we're getting close. Now we just rinse and repeat with the rest of the tables in the DB.

Page 35: Data herding

Reality Check
Turns out the legacy system contains 75,000 Articles.

Page 36: Data herding

Reality Check
Turns out the legacy system contains 75,000 Articles.

By the way, we haven't put much attention into the 3,000,000 user comments, the 700,000 user accounts, the user activity stream, the media assets, and a few other things.

Page 37: Data herding

Reality Check
Turns out the legacy system contains 75,000 Articles.

By the way, we haven't put much attention into the 3,000,000 user comments, the 700,000 user accounts, the user activity stream, the media assets, and a few other things.

Also, the Articles mapping is going to need work because there are different article "types" that were shoehorned into the system.

Page 38: Data herding

What Are We Up Against?
Let's look at our Article migration again:

for a in legacy_models.News_Article.objects.all():
    ...

How big of a QuerySet can we actually handle? Beyond a few thousand objects things might get dicey. (Don't try this at home.)

Page 39: Data herding

What Are We Up Against?
We can switch to using ModelManager.iterator():

for a in legacy_models.News_Article.objects.iterator():
    ...

Memory crisis averted! (Not really...)

Page 40: Data herding

What Are We Up Against?
But wait, now we're making at least one query per News_Article:

for a in legacy_models.News_Article.objects.iterator():
    ...

75,000 article queries

Page 41: Data herding

What Are We Up Against?

for a in legacy_models.News_Article.objects.iterator():
    article = Article()
    article.headline = a.headline

    #get the Section
    section = Section.objects.get(legacy_id=a.section.id)
    article.section = section

75,000 section queries

Page 42: Data herding

What Are We Up Against?

for a in legacy_models.News_Article.objects.iterator():
    article = Article()
    article.headline = a.headline

    #get the Section
    section = Section.objects.get(legacy_id=a.section.id)
    article.section = section

    #get the Topic
    topic = Topic.objects.get(legacy_id=a.topic.id)
    article.topic = topic
    article.save()

75,000 topic queries

Page 43: Data herding

Survey Says...
• 1 huge News_Article query

• 75,000 Section queries

• 75,000 Topic queries

• 150,000 total queries

150,000 queries x 5 ms connection latency = 750 sec = 12.5 minutes spent on network latency alone.
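One way out of the per-row lookups is to prefetch the legacy-ID mappings once and resolve foreign keys from memory. A minimal sketch, using plain dicts and fake rows in place of real querysets; in practice each dict would come from one query, something like dict(Section.objects.values_list('legacy_id', 'id')):

```python
# Prefetch {legacy_id: new_id} maps once, then resolve FKs with dict lookups.
# These literals stand in for two real one-shot queries.
section_ids = {10: 1, 11: 2}
topic_ids = {20: 5, 21: 6}

# Fake legacy rows standing in for the News_Article iterator.
legacy_articles = [
    {"headline": "Hello", "section_id": 10, "topic_id": 21},
    {"headline": "World", "section_id": 11, "topic_id": 20},
]

new_articles = []
for row in legacy_articles:
    new_articles.append({
        "headline": row["headline"],
        # Two dict lookups replace two per-article SELECTs.
        "section_id": section_ids[row["section_id"]],
        "topic_id": topic_ids[row["topic_id"]],
    })
# Total round trips: 2 prefetch queries + 1 article scan, instead of 150,001.
```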

Page 44: Data herding

Now What?
1. ModelManager.all() is problematic once row counts get into the thousands.

Page 45: Data herding

Now What?
1. ModelManager.all() is problematic once row counts get into the thousands.

2. ModelManager.iterator() might not be enough.

Page 46: Data herding

Now What?
1. ModelManager.all() is problematic once row counts get into the thousands.

2. ModelManager.iterator() might not be enough.

Neither option is very attractive as we deal with large tables.

Page 47: Data herding

Let's Write a Smarter Migration
What should guide our decision making?

• Step out of web request mental mode. We're optimizing for different needs here.

• Want to maximize Write throughput to the application DB

• Want to maximize Read throughput from the legacy DB

• Cram the RAM

• Maximize Objects per query

• Fill the connection packets

• Migration jobs as atomic units of work

Page 48: Data herding

KISS
• Database performance features (delayed inserts, etc.)

• Database import/export data formats (Postgres COPY, MySQL LOAD INFILE)

Don't worry about getting too exotic until you've maxed out other options. A well-designed job system will give you a ton of mileage.

Page 49: Data herding

Difference Makers
Don't work blind. Make sure you know how to:

• View the query log

• Profile queries, measure throughput

Also helps:

• Disable indexes on the new database until after the migration is done

• Turn on connection compression if your client/server support it and you're going over the wire.
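Measuring throughput doesn't need anything fancy. A stopwatch sketch like the hypothetical Timer below (not from any library) is enough to compare batch sizes and spot regressions:

```python
import time

class Timer:
    """Minimal stopwatch for profiling a migration batch (illustrative helper)."""

    def __enter__(self):
        self.start = time.time()
        return self

    def __exit__(self, *exc):
        self.elapsed = time.time() - self.start
        return False

    def rate(self, rows):
        # Rows processed per second; guard against a zero-length interval.
        return rows / max(self.elapsed, 1e-9)

with Timer() as t:
    total = sum(range(100000))  # stand-in for one migration batch

print("batch took %.3fs, %.0f rows/sec" % (t.elapsed, t.rate(100000)))
```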

Page 50: Data herding

Bring Friends
Grab some handy tools:

• SQL Editor, GUI Console (if you're not a CLI ninja)

• Maatkit

• SQLAlchemy

Page 51: Data herding

A Few Key Features
1. Pause / Graceful Stop

2. Resume

3. Timing

4. Logging

5. Partial Jobs

6. Strict Mode vs. Continue on Fail

Page 52: Data herding

Graceful Stop
The ability to cancel the process and leave data in a consistent state.

Page 53: Data herding

Resume
The ability to restart the process from a specific point.

Page 54: Data herding

Timing
The ability to record how long a job takes.

Page 55: Data herding

Logging
The ability to record what was done, and what went wrong.

Page 56: Data herding

Partial Jobs
The ability to run a job against a single row, a range of rows, or a single table.

Page 57: Data herding

Strict Mode
The ability to have the migration ignore errors (logging them, of course) or stop on any exception.
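The features above can hang off a small job-runner skeleton. A sketch, with all names (JobRunner, the state-file checkpointing) invented for illustration:

```python
import os

class JobRunner:
    """Sketch of a migration job loop with graceful stop and resume.

    The last fully completed batch id is checkpointed to a state file,
    so a stopped run can resume where it left off without redoing or
    half-doing work. All names here are illustrative.
    """

    def __init__(self, state_path):
        self.state_path = state_path
        self.stop_requested = False

    def request_stop(self, *args):
        # Hook this to SIGINT/SIGTERM: finish the current batch, then exit.
        self.stop_requested = True

    def last_done(self):
        if os.path.exists(self.state_path):
            return int(open(self.state_path).read() or 0)
        return 0

    def run(self, batches):
        # batches is an iterable of (batch_id, callable) pairs.
        for batch_id, work in batches:
            if batch_id <= self.last_done():
                continue          # resume: skip already-recorded batches
            if self.stop_requested:
                break             # graceful stop: no partial batch
            work()                # the actual migration work for this batch
            with open(self.state_path, "w") as f:
                f.write(str(batch_id))  # checkpoint only after a full batch
```

Wiring `signal.signal(signal.SIGINT, runner.request_stop)` turns Ctrl-C into a graceful stop instead of a mid-batch crash.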

Page 58: Data herding

Articles Migration: Take 3
One more look at our naive first stab at it:

for a in legacy_models.News_Article.objects.all():
    article = Article()
    article.headline = a.headline

    #get the Section
    section = Section.objects.get(legacy_id=a.section.id)
    article.section = section

    #get the Topic
    topic = Topic.objects.get(legacy_id=a.topic.id)
    article.topic = topic
    article.save()

Problems:

• The legacy articles query pulls the whole table at once, and we process it one row at a time

• Even when we fix that, we can still only deal with one row at a time because we need to query the Section and Topic per old Article.

• Not easy to make this work in parallel.

Page 59: Data herding

Work in Batches
We can mediate between the "all-or-little" extremes of all() and iterator() using set batch sizes.

Take a guess at a reasonable batch size. 1000 rows should be a reasonable starting point for most situations.
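The batching idea can be sketched as keyset pagination over the primary key. in_batches and fetch here are hypothetical stand-ins for a query like Model.objects.filter(pk__gt=last_pk).order_by('pk')[:batch_size]:

```python
def in_batches(fetch, batch_size=1000):
    """Yield lists of rows by walking the table in primary-key order.

    fetch(last_pk, limit) stands in for a pk-ordered, pk-filtered query;
    keyset pagination keeps memory flat regardless of table size and
    avoids ever-growing OFFSET scans.
    """
    last_pk = 0
    while True:
        batch = fetch(last_pk, batch_size)
        if not batch:
            return
        yield batch
        last_pk = batch[-1]["pk"]

# Demo against an in-memory "table" of 25 rows.
table = [{"pk": i} for i in range(1, 26)]

def fetch(last_pk, limit):
    return [r for r in table if r["pk"] > last_pk][:limit]

sizes = [len(b) for b in in_batches(fetch, batch_size=10)]
# sizes == [10, 10, 5]
```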

Page 60: Data herding

A More Declarative Style
We can move the work of mapping rows out to a runner script.

from apps.content.models import Topic
from legacy.migration.runner import MigrationBase
from apps.legacy.models import News_Topic

class Migration(MigrationBase):
    model = Topic
    legacy_model = News_Topic

    # legacy field -> new field
    MAP = (('title', 'name'),)

This lets us vary the batch size and resource usage independently of the individual jobs.

Page 61: Data herding

Non-trivial Row Transformations

from apps.content.models import Topic
from legacy.migration.runner import MigrationBase
from apps.legacy.models import News_Topic

class Migration(MigrationBase):
    model = Topic
    legacy_model = News_Topic

    # legacy field -> new field
    MAP = (('title', 'name'),)

    def process_row(self, row):
        return (row.title.upper(),)

Page 62: Data herding

Job Objects
Generalize construction of multi-value INSERT statements:

class MigrationBase(object):

    @property
    def column_list(self):
        # join the destination column names from MAP's (src, dst) pairs
        return ','.join(dst for src, dst in self.MAP)

    @property
    def values_placeholder(self):
        return 'DEFAULT,' + ','.join(['%s'] * len(self.MAP))

    @property
    def insert_stmt(self):
        return "INSERT INTO %s VALUES (%s)" % (self.model._meta.db_table,
                                               self.values_placeholder)
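To see the batched-INSERT payoff end to end, here's a sketch using the stdlib sqlite3 module as a stand-in backend; executemany plays the role of the multi-value INSERT that the slide's MigrationBase builds by hand for MySQL:

```python
import sqlite3

# sqlite3 stands in for the real backend; executemany batches all rows
# into one call instead of one INSERT round trip per row.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE content_section "
    "(id INTEGER PRIMARY KEY, name TEXT, legacy_id INTEGER)"
)

rows = [("Sports", 10), ("Politics", 11), ("Weather", 12)]
conn.executemany(
    "INSERT INTO content_section (name, legacy_id) VALUES (?, ?)", rows
)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM content_section").fetchone()[0]
# count == 3
```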

Page 63: Data herding

Handling Related Tables
A little extra work since we need to collect the legacy IDs.

Remember how we needed to look up the related model in the new database using the legacy ID?

for a in legacy_models.News_Article.objects.all():
    article = Article()
    article.headline = a.headline

    #get the Section
    Section.objects.get(...)  #OOPS! How do we know which one?

Page 64: Data herding

Related Tables Pattern

class Migration(MigrationBase):
    model = Article
    legacy_model = News_Article
    related = [{'model': Section,
                'map_index': 1}]

    MAP = ('title',
           'section_id',
           'legacy_id')

    def process_row(self, row):
        return {'values': [row.title,
                           None,
                           row.id],
                'Section': row.section_id}

Now we can grab the related objects in a batch (one query), apply the correct new IDs in bulk, and preserve our batch INSERT for the new objects.

Page 65: Data herding

Progress Report
In a real-world example, a job that was taking a few minutes was reduced to less than a second.

Page 66: Data herding

Run in Parallel
The advantage of the atomic-style jobs is that they can run independently.

This means we can use Queue from multiprocessing and run jobs in parallel.

• A little added complexity, since we need to make sure jobs for related tables finish before the tables that depend on them.
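The queue-of-jobs shape can be sketched with the stdlib. Threads are used here only to keep the example self-contained and portable; you'd swap in multiprocessing.Process / multiprocessing.Queue (or Celery tasks) with the same structure for real parallelism:

```python
import queue
import threading

def worker(jobs, results):
    # Pull atomic migration jobs off the queue until a sentinel arrives.
    while True:
        job = jobs.get()
        if job is None:
            return
        results.put(job())

jobs, results = queue.Queue(), queue.Queue()
threads = [threading.Thread(target=worker, args=(jobs, results))
           for _ in range(4)]
for t in threads:
    t.start()

for n in range(10):
    jobs.put(lambda n=n: n * n)   # stand-ins for independent migration jobs
for _ in threads:
    jobs.put(None)                # one sentinel per worker
for t in threads:
    t.join()

totals = sorted(results.get() for _ in range(10))
# totals == [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```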

Page 67: Data herding

No Mercy Write Throughput
If you need big-league performance, replace the local Queue with Celery.

Now we can also run jobs on multiple network nodes and even use multiple copies of the legacy DB for improved read throughput.

We can also write to multiple application DBs for increased write throughput. Merge them at the end.

Use cloud servers (EC2 / Rackspace)

Page 68: Data herding

In Case You're Still Bored
Other lovely things you'll run into:

• Weird primary keys

• Data inconsistencies

• Creative user solutions to limitations in the old schema

• All kinds of special conditions and edge cases

• Mismatched data types, abused data types

Page 69: Data herding

Outside the Confines of Tidy Talk
Don't get stuck in the Django ORM tunnel. This is a very appropriate domain for alternative approaches. Examples:

• ModelManager.raw()

• Using the cursor will let you write more elegant JOIN queries

• SQLAlchemy / Unit of Work

Page 70: Data herding

Agenda
Dealing With Data in a Django Environment

• Front matter: A Few Tools and Tips

• Dealing With a Large Legacy Migration

• South in Team Environments

Page 71: Data herding

Managing Migrations
• South has good momentum and good intentions

• Does a job well and gets out of your way

• Best if everyone has a decent understanding of how it works

Page 72: Data herding

Common Complaints
1. Merging branches brings migration conflicts

2. Two team members create identically numbered migrations

Turns out the solution is to Talk To Your Teammates!

Page 73: Data herding

Other Solutions
ChronicDB is a new product with an innovative approach to schema migrations.

Built in Python and C. A free version is available for small databases.

Page 74: Data herding

Questions?
Brian Luft - @unbracketed / @zen_of_python
Thank you for your attention.

Page 75: Data herding