
Marty: Application Development and Testing with Production Data in PostgreSQL

Baldur Þór Emilsson

Faculty of Industrial Engineering, Mechanical Engineering and Computer Science
University of Iceland

2014


MARTY: APPLICATION DEVELOPMENT AND TESTING WITH PRODUCTION DATA IN POSTGRESQL

Baldur Þór Emilsson

60 ECTS thesis submitted in partial fulfillment of a

Magister Scientiarum degree in Computer Science

Advisors

Hjálmtýr Hafsteinsson

Ebba Þóra Hvannberg

Faculty Representative

Snorri Agnarsson

Faculty of Industrial Engineering, Mechanical Engineering and Computer Science

School of Engineering and Natural Sciences

University of Iceland

Reykjavik, May 2014


Marty: Application Development and Testing with Production Data in PostgreSQL

Development and Testing with Production Data

60 ECTS thesis submitted in partial fulfillment of an M.Sc. degree in Computer Science

Copyright © 2014 Baldur Þór Emilsson

All rights reserved

Faculty of Industrial Engineering, Mechanical Engineering and Computer Science

School of Engineering and Natural Sciences

University of Iceland

Hjarðarhaga 2-6

107 Reykjavik

Iceland

Telephone: 525 4700

Bibliographic information:

Baldur Þór Emilsson, 2014, Marty: Application Development and Testing with Production Data in PostgreSQL, M.Sc. thesis, Faculty of Industrial Engineering, Mechanical Engineering and Computer Science, University of Iceland.

Printing: Háskólaprent, Fálkagata 2, 107 Reykjavík

Reykjavik, Iceland, May 2014


Abstract

Marty is a proof-of-concept prototype for a framework that offers convenient application development and testing against production data stored in the PostgreSQL database management system. It is designed for minimal overhead and configuration on production servers while offering quick and simple database initialization on development and testing servers. This opens the possibility of testing applications against production data with minimal effort, which complements conventional testing datasets and helps prevent bugs that the conventional datasets do not catch from entering production code.

Útdráttur

Marty er hugbúnaðarlausn sem býður upp á þægilegt þróunar- og prófunarumhverfi fyrir forrit sem nota PostgreSQL gagnagrunnskerfið. Hún er hönnuð til að nota gögn úr gagnagrunnum sem keyra í raunumhverfi án þess að hafa neikvæð áhrif á afköst netþjónanna sem grunnarnir keyra á og án mikilla breytinga á uppsetningu þeirra en bjóða á sama tíma upp á fljótlega og einfalda uppsetningu þróunar- og prófunargagnagrunna. Það opnar fyrir möguleikann á hugbúnaðarprófunum með raungögnum án mikillar fyrirhafnar sem geta keyrt samhliða prófunum með hefðbundin prófunargagnasett og hjálpað við að uppræta villur sem koma ekki í ljós með hefðbundnum prófunum.


Contents

List of Figures

List of Tables

List of Listings

Acknowledgments

1. Introduction
   1.1. Goals and purpose of Marty
   1.2. Similar solutions
   1.3. Postgres
        1.3.1. Technical information
   1.4. Thesis Overview

2. Architecture
   2.1. Overview
   2.2. The clone databases
   2.3. The history database
        2.3.1. Populating and updating the history database
   2.4. Advantages and Drawbacks

3. Implementation
   3.1. The slave instance
        3.1.1. Reading the WAL
   3.2. The history database
        3.2.1. Schema information
        3.2.2. Data tables
        3.2.3. CTID columns
        3.2.4. History versions
   3.3. The clone databases
   3.4. Source code

4. Current status and future work
   4.1. Limitations of the current version
   4.2. Future work
        4.2.1. Logical replication
        4.2.2. Data obfuscation
        4.2.3. VCS integration
        4.2.4. Time travel

A. Source code

B. The Origin of the Name


List of Figures

2.1. An overview of the architecture of Marty
2.2. The clone part of the architecture
2.3. Table layout in a clone database
2.4. The history part of the architecture
2.5. An example of the schema of a data table in the history database
2.6. The slave part of the architecture
3.1. The slave part of the architecture
3.2. The history part of the architecture
3.3. Example of the contents of the schema information tables
3.4. An example of the schema of a data table in the history database
3.5. An example of the data in a data table in the history database
3.6. The clone part of the architecture
3.7. Table layout in a clone database


List of Tables

3.1. Configuration parameters in postgresql.conf for the master database
3.2. Configuration parameters in postgresql.conf for the slave database
3.3. The columns of the marty_schemas table
3.4. The columns of the marty_tables table
3.5. The columns of the marty_columns table
3.6. The columns of the marty_updates table
3.7. The columns of the bookkeeping table


List of Listings

3.1. WAL replay log example
3.2. WAL update and delete example
3.3. The view_select function
3.4. An example of a SELECT query with a table alias

A.1. clone.py
A.2. history.py
A.3. utils/__init__.py
A.4. utils/dbobjects.py
A.5. utils/inspector.py
A.6. utils/populator.py
A.7. postgres-9.3.3.patch


Acknowledgments

This thesis would be incomplete if not for the guidance of my advisor, Hjálmtýr Hafsteinsson. His comments and suggestions helped with the creation of this project and I would like to thank him for all his help.

I would also like to thank Professor Kristján Jónasson, who provided me with his own personal computer on which Marty was designed and developed.

Lastly I would like to thank my girlfriend, Valborg, for her support and encouragement. You enrich my days with joy and wonder.


1. Introduction

Database management systems (DBMS) are used as data stores in many different systems in various fields. They are rarely used as standalone products and are typically used to store data from other applications. These applications are often in constant development with short development cycles, which include both manual and automated testing. Those tests are often run against datasets that are created to test for specific conditions, and ideally they help catch all bugs in the applications before they enter production. However, many projects can benefit from tests that are run against data from the production environment, either to complement the testing datasets or to provide data to test against in situations where no testing datasets exist. The main disadvantage of using production data in testing is that cloning a large database can take a long time, which slows down testing and development, and the cloning adds overhead to the production database, which can degrade the performance of the application in the production environment.

The goal of Marty is to offer a convenient and relatively efficient way to run tests for applications that use the PostgreSQL (Postgres) DBMS against live data on the production servers without adding overhead to them. This is achieved by creating a testing database with empty tables that are populated when they are first queried. This saves time as only the tables which are used in the tests are populated and no time is spent copying the data for the other tables, which remain empty. The data is not copied directly from the production server but from another instance of Postgres that stores a copy of the production data. This ensures the consistency of the data in the cloned databases and it also minimizes the load on the production database.

The architecture of Marty enables users to inspect the state of the production database as it was at certain points in time in the past. This is similar to the time travel feature which was once a part of Postgres but was removed due to performance and storage space issues. This can be beneficial in situations where the state of the database caused anomalies or bugs in the application, bugs which have since stopped appearing because the state of the database has changed. The user can then run the application with data from different points in time to debug it.


1.1. Goals and purpose of Marty

The goal of the development of Marty is to create an application that enables its users to clone a running database quickly. The original idea was that software developers and testers would be able to clone a database that is used in production and stores large amounts of data that would normally take a considerable time to copy to another server. Marty speeds up development and testing by reducing the time it takes to clone the production database and also uses techniques that reduce or prevent any negative impact that the cloning would have on the performance of the production database.

Although the initial idea was for production databases to be cloned, there is nothing that prevents Marty from being used for other kinds of databases, such as databases that are dedicated to storing test datasets that never enter production, as long as those databases fulfill the requirements for Marty.

The emphasis in the design and development of Marty is to minimize the time from when the cloning of a database is initiated until the newly created clone can be used for testing. The performance of the newly created database has not been a high priority as it is not intended to be used in a performance-critical environment. Thus Marty is not a solution that should be used to create clones of a database that are to be used for load balancing or failover or serve any other role in the production environment of an application.

Marty is intended to be used in an environment where a single database or a few databases need to be cloned regularly. The architecture that was chosen for Marty requires a system administrator to set up and configure Marty for the environment where it is used. This involves running a dedicated Postgres instance that is used as a reference when the clones are created and also configuring the production server to work with this dedicated instance in a certain way, which might require the production server to be restarted. It should therefore be clear that Marty is not suited for cloning a database that only needs to be cloned a limited number of times. It is most useful when the database to be cloned is large enough that the time saved by using Marty justifies the initial setup.

1.2. Similar solutions

The development of Marty was started in part to solve a problem that did not have any solutions available. When a user wanted to replicate a Postgres database she needed to copy the whole database. Tools like pg_dump exist to aid with database replication but any optimizations had to be created manually for each setup. If the user wanted some tables to be empty or not included in the replication she needed to create her own solution, such as a script, that implemented that behavior. Postgres does not natively support lazy loading of data in the way the clone databases in Marty require.

Heroku, a web hosting company, offers Postgres database hosting1. It has implemented a feature called forking which is based on the same idea as Marty but implemented differently. The documentation for forking includes:

Preparing a fork can take anywhere from several minutes to several hours, depending on the size of your dataset.2

The fork is a complete replica that contains all the tables and database objects of the original database as well as all of its data. This is unlike the way that Marty creates its clone databases; their tables start empty and are not populated with data until they are queried. In that sense the forking of Postgres databases in Heroku serves a different purpose than Marty, which puts emphasis on the short initialization time of the clone databases.

There exist numerous clustering solutions for Postgres, such as Slony-I3 and pgpool-II4. Many of these solutions can possibly be tailored to suit the needs of developers and testers who need replicas of the production database. They are, however, not developed with this use case in mind so their usage for this situation can be problematic and can involve much configuration and setup, if they can be used at all. Marty is developed for a specific use case and is tailored to satisfy the requirements of that use case. This makes it a better choice in the situation where developers and testers need to be able to quickly create replicas of the production database.

Software development includes testing in various stages of the development. Various methods are used to test the software, such as automated unit tests and integration tests and manual testing that is performed by the developer or a dedicated tester. It should be possible to use Marty for all stages of software testing. The creation and initialization of the clone databases should be fast enough to make it feasible to create a new clone for each feature that is being developed or tested. This includes creating a new clone database for each unit test and integration test that is executed, even if there are hundreds of them and they are executed multiple times per day.

1 https://www.heroku.com/postgres
2 https://devcenter.heroku.com/articles/heroku-postgres-fork
3 http://slony.info
4 http://www.pgpool.net/mediawiki/index.php/Main_Page


1.3. Postgres

Postgres5 is an SQL-based DBMS that originated at the University of California, Berkeley in the 1980s. It was based on another DBMS, Ingres, and was released as free and open source software in 1995. It is developed by a global community under the name PostgreSQL Global Development Group, with a core team consisting of half a dozen members and a large number of other contributors. It is written in C and runs on multiple platforms.

Postgres is very mature and has a large number of features, including conformance with a large part of the SQL standard and support for extension modules. Many modules have been created to add new data types, offer new scripting languages for stored procedures and add functionality for specific types of data, such as geographical information. It has very extensive documentation and an active community that offers support for users through mailing lists and IRC channels.

Many companies offer commercial support and products based on Postgres, with many more using it as a part of their internal systems. It is used by government organizations, universities and many free and open source projects.

1.3.1. Technical information

There are a few implementation details that are used in this thesis for the discussion of the architecture and implementation of Marty. Postgres also sometimes uses terminology to describe objects or ideas that is different from what is commonly in use.

A database cluster in Postgres is a directory in the file system of the server that runs the Postgres instance6. This directory contains files and subfolders that store all contents of every database that runs on that instance of Postgres. The files contain binary data that is generally not readable by programs other than the Postgres DBMS.

A relation is a database object that stores some data. One example of a relation is an ordinary table that is created in a Postgres database. Its contents are stored in the database cluster directory in a relation file. This file consists of blocks of data, each of which contains one or many tuples that store the values of the relation. Other types of relations are e.g. views, sequences and foreign tables.
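
The mapping from a relation to its file in the cluster directory can be inspected from SQL. The following query is only an illustration; the table name persons is hypothetical and the returned path depends on the installation.

    -- Path of the relation file, relative to the database cluster directory
    SELECT pg_relation_filepath('persons');
    -- e.g. base/16384/16385  (database directory / file of the table)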

5 http://www.postgresql.org
6 http://www.postgresql.org/docs/9.3/static/creating-cluster.html


Postgres keeps a log of all changes that are made to the relation files, and other files and directories in the database cluster, in a so-called write-ahead log or WAL7. The WAL contains the binary records that Postgres inserts into the files in the cluster. It is used for recovery after a server crash and can also be used to replicate the database in another instance of Postgres. Marty uses the WAL to inspect the changes that have been made to the master database which the clone databases replicate.

Postgres is a transactional database. This means that every operation that a user executes in the database is wrapped in a transaction. Inside a transaction the state of the database is unaffected by the changes that are made in other transactions, even if they run concurrently. This is important to make it possible for users to query the database even though another user is updating the same tables at the same time. Postgres implements this by using multiversion concurrency control (MVCC)8. It enables a database to contain multiple versions of the same data at the same time to ensure that the data in the tables in each transaction is correct and consistent.

The details of how Postgres implements MVCC are complex, but the parts most relevant for the description of Marty are the xmin and xmax columns. These are hidden system columns which Postgres adds to every table. They are used to define which transactions can see which rows in the table. Every transaction that runs on the server has a transaction ID. The ID of the transaction that inserts a row into a table is recorded in the xmin column. When a row is deleted the ID of the transaction that deleted it is recorded in the xmax column; until then it contains zero, which is an invalid transaction ID. A transaction that updates the values of a row actually inserts a new row with the updated values and marks the old row as deleted. Roughly speaking, a row is visible to a transaction if the inequality xmin ≤ transaction ID < xmax holds.
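
The hidden columns can be selected explicitly, which makes the mechanism easy to observe. The table name persons below is again hypothetical; this is only a sketch of how the columns appear to a user.

    -- xmin: transaction that created the row version; xmax: transaction that
    -- deleted or superseded it (zero while the row version is still live)
    SELECT xmin, xmax, ctid, * FROM persons;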

Marty uses a similar technique when it stores multiple versions of the data from the master database.

1.4. Thesis Overview

The rest of the thesis is organized as follows: Chapter 2 contains a description of the architecture of Marty and explains the purpose of each part of the system. Chapter 3 contains a detailed description of the implementation of each part, with references to and explanations of the relevant parts of Postgres. Chapter 4 contains the conclusion of the thesis along with a description of the limitations of the current design and ideas for future work. The source code for the current version of Marty is included in Appendix A. Appendix B explains the origins of the name Marty.

7 http://www.postgresql.org/docs/9.3/static/wal-intro.html
8 http://www.postgresql.org/docs/9.3/static/mvcc.html


2. Architecture

This chapter contains a detailed description of the architecture and design of Marty.

2.1. Overview

Marty consists of a few parts that serve different purposes, see Figure 2.1. Developers and testers that use Marty in their work create database clones. These databases are clones of the master database, which can be a database that is used in a production environment. When the clones are created they are initialized with a copy of all the tables in the master, but the tables remain empty until they are queried by a user or an application. The design of the clone databases is discussed in Chapter 2.2.

Figure 2.1: An overview of the architecture of Marty

Marty does not inspect the schema of the master database directly when it creates the tables in the clone databases. Instead it queries another database that is called history. The history database contains information about the schema of the master database as well as a copy of its data. When the clones need to populate their tables they also use this history database as a reference. The reason for using another database to store a copy of the schema and data of the master database is discussed in Chapter 2.3, along with a description of the design of the history database.

As the name suggests the history database contains not only a copy of the current version of the master but also of previous versions. To update the history database with new versions of the master and to keep it in sync with the changes that are made on the master, Marty uses a log from the master that is called the write-ahead log or WAL. It does not read this log directly but uses a specially patched instance of Postgres to read the contents of the log. This instance is called slave and it outputs information that Marty can use to update the history database with all the changes that have been made in the master database. The reason for keeping old versions of the master database and the relationship between the master, slave and history databases is described in detail in Chapter 2.3.

2.2. The clone databases

Figure 2.2: The clone part of the architecture

A clone database is a standard Postgres database. It uses two Postgres extensions: the PL/pgSQL extension that enables users to create stored procedures, and the dblink extension which enables users to query another database directly from the clone database without using any external scripts or programs. The clone can run on a local instance of Postgres on the developers' or testers' computer as long as the history database is accessible from that computer. More than one clone database can run in parallel on the same instance of Postgres so each user can use many clones at the same time.

To create a clone the user creates a new, empty database. She then initializes it with Marty. After the clone has been initialized it contains all the schemas that are found in the master database and a copy of all the tables from each schema. The tables remain empty until they are first queried, which saves time in the initialization as the user does not have to wait for Marty to finish copying all the data in the tables before she can start querying the clone. This behavior is implemented by creating views instead of tables in the clone. The views look like the tables that the user expects to find and when they are queried they call a PL/pgSQL function, view_select, that returns the appropriate data. This function looks for the data in the actual data tables, which Marty creates in the clone, and if these tables are empty the function populates them with data from the history database before returning their contents.

The data tables are created in a special schema called marty which is created in the clone database. It contains the data tables as well as another table called bookkeeping. The view_select function uses this table to keep track of which data tables have been populated and which ones are still unpopulated. The table also contains the query strings that view_select uses when it fetches the data from the history database. See Figure 2.3 for an example of a table layout in a clone database.

The views, bookkeeping table, data tables and the view_select function are described in detail in Chapter 3.3; a simplified sketch of the pattern follows Figure 2.3.

Figure 2.3: Table layout in a clone database

The table persons is actually a view that returns results from the table data_myschema_persons_1, which is in the marty schema.
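
As a rough illustration of this pattern, the sketch below declares a data table, a set-returning function and a view for the persons example from Figure 2.3. It is not Marty's actual code: the real view_select function is generic and is listed in Chapter 3.3 and Appendix A, and the bookkeeping columns shown here (table_name, populated) as well as the dblink connection string and the history query are placeholders.

    -- Prerequisites in the clone (normally created by clone.py)
    CREATE EXTENSION IF NOT EXISTS dblink;
    CREATE SCHEMA marty;
    CREATE SCHEMA myschema;
    CREATE TABLE marty.bookkeeping (table_name name, populated boolean DEFAULT false);
    INSERT INTO marty.bookkeeping VALUES ('data_myschema_persons_1', false);

    -- Data table that backs the persons view (kept empty until first use)
    CREATE TABLE marty.data_myschema_persons_1 (data_age_1 int4, data_name_1 text);

    -- Populate the data table from the history database on first use, then return it
    CREATE FUNCTION marty.view_select_persons()
    RETURNS SETOF marty.data_myschema_persons_1 AS $$
    BEGIN
        IF NOT EXISTS (SELECT 1 FROM marty.bookkeeping
                       WHERE table_name = 'data_myschema_persons_1' AND populated) THEN
            INSERT INTO marty.data_myschema_persons_1 (data_age_1, data_name_1)
                SELECT * FROM dblink('dbname=history',
                                     'SELECT data_age_1, data_name_1
                                        FROM data_myschema_persons_1
                                       WHERE stop IS NULL')
                       AS t(data_age_1 int4, data_name_1 text);
            UPDATE marty.bookkeeping SET populated = true
             WHERE table_name = 'data_myschema_persons_1';
        END IF;
        RETURN QUERY SELECT * FROM marty.data_myschema_persons_1;
    END;
    $$ LANGUAGE plpgsql;

    -- The "table" the user queries is really a view over the function
    CREATE VIEW myschema.persons AS
        SELECT data_age_1 AS age, data_name_1 AS name
        FROM marty.view_select_persons();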

2.3. The history database

Figure 2.4: The history part of the architecture

The history database is a standard Postgres database. It is created by a system administrator and contains data that the developers and testers use when they create clone databases. When the history database has been initialized it contains information about the schema of the master database and a copy of its data. After the initialization Marty updates the history database with all the changes that are made in the master, both to its data and schema. Its contents are versioned and, as the name suggests, the user can look up previous states of the master in the history database.

The reason for keeping old versions of the master is the delayed population of the data tables in the clones. From the moment that a clone database is initialized and until its tables are populated the master database might change. Tables might be dropped or renamed and rows might be updated or deleted, which could lead to inconsistency in the clone as foreign key relations might break. The history database offers access to a particular version of the contents of the master database and thus prevents errors of this kind in the clones.


Marty initializes the history database with four tables: marty_schemas, marty_tables, marty_columns and marty_updates. The first three store information about the schemas, tables and table columns in the master database. Their contents are mostly copied from the tables pg_namespace, pg_class and pg_attribute, respectively. The fourth table, marty_updates, keeps a log of version timestamps. Each version of the contents of the history database has two timestamps associated with it; the local time of the history server when that version was created in the history database and the time of the transaction on the master database that created that version.
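
A possible shape for this version log is sketched below. The column names are illustrative only; the actual columns of marty_updates are documented in Table 3.6.

    -- Hypothetical sketch of the marty_updates version log
    CREATE TABLE marty_updates (
        version      integer,      -- history version number
        history_time timestamptz,  -- local time on the history server when the version was created
        master_time  timestamptz   -- time of the transaction on the master that created the version
    );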

Marty copies the data from the tables in the master and stores it in special data tables in the history database. Each table in the master has a corresponding data table in the history database. Its schema is similar to the original table but a few columns are added. They are used for versioning and as a reference when rows are deleted or updated. There are also no constraints on the data tables or their columns. They are unnecessary as the data table is only used to store the data that has already been validated on the master. Constraints might also get in the way, e.g. when the tables need to store different versions of the same row that has a unique constraint on some of its columns. Another example might be a table that is altered and a not-null constraint is added to one of its columns where null values have been stored in the past. Therefore there are no constraints on the data tables in the history database. For an example of the schema of a data table see Figure 2.5, and for further details see Chapter 3.2.2.

Figure 2.5: An example of the schema of a data table in the history database

2.3.1. Populating and updating the history database

Figure 2.6: The slave part of the architecture

When the history database is initialized its contents are not read directly from the master database. Instead there is a dedicated Postgres instance that replicates the master database and is used as a reference for the history. This instance is called slave. The reason for using another instance to read the schema and data from the master is the format of the write-ahead log, or WAL.

The contents of the WAL are used to update the history database with the changes that are made to the master after the history has been initialized. This approach was chosen because it makes it possible to read the changes that are made to the master database without inspecting it directly, e.g. with triggers. That was considered important in the design process for Marty because any change on the master might introduce bugs and reduce its performance. However, the WAL contains binary information and to be able to read it and use its contents it is necessary to have the database cluster files from the master as a reference. Instead of implementing a complex algorithm to read the contents of the WAL it was decided to leverage the recovery feature of Postgres that reads the WAL and applies it to the database cluster. The slave is therefore started with a copy of the cluster files from the master database and it then replays the WAL into the cluster as it arrives from the master. As the WAL is replayed the slave logs information about the operations that it replays and Marty uses this replay log to read the new version of the data from the slave.

The replay log is not enabled in the default build of Postgres so the slave must be compiled with a special flag to enable it. The replay must also be paused after each transaction that has been replayed to give Marty time to read the data from that transaction before the next one is applied. Marty includes a patch for the Postgres source code to enable this pause. The slave therefore runs on a specially patched version of Postgres. More information about the slave can be found in Chapter 3.1.

2.4. Advantages and Drawbacks

The current architecture of Marty that is described in this chapter was chosen because of its simplicity and because it could be implemented in high level code (PL/pgSQL instead of C), which sped up prototyping and simplified the development of Marty. However, it has a few drawbacks which make it unsuitable for a production ready version of Marty. The main drawback is the lack of optimization for queries from the clones to the history database; when a user queries a table in a clone database it fetches the complete contents of that table from the history database even if the query should only return a small part of it to the user. Another issue is the creation of indexes for the tables in the clones; the user cannot create them like she would on the master database. This is because the tables that the user expects to find in the clone database are actually views and it is not possible to create indexes for views in Postgres.


See Chapter 4 for a discussion of the current status of Marty, the limitations of the current version and ideas for future work and improvements.


3. Implementation

Marty is written in Python and PL/pgSQL with a small patch to the Postgres source code written in C. Python and PL/pgSQL are both high-level programming languages and ideal for rapid prototyping. The source code for Marty contains two scripts, clone.py and history.py, which are used to create and populate the clone databases and the history database, respectively. The patch to Postgres is necessary for Marty to be able to read the changes from the write-ahead log (WAL) with the slave instance. It is for version 9.3.3 of Postgres and might not work with other versions.

This chapter describes the implementation of Marty. It explains which parts of Postgres Marty uses to create the history database and keep track of the changes that are made to the master database. It starts by explaining how the slave instance is used and why it is patched. Next it describes the history database and its design and then continues with a description of the clone databases and how they use the history database. The last part of this chapter briefly describes how the source code for Marty is organized.

3.1. The slave instance

Figure 3.1: The slave part of the architecture

Marty uses the slave instance to initialize the history database and to inspect the contents of the write-ahead log from the master. The slave is configured to act as a hot standby for the master; it starts with a copy of the master database and updates it with the WAL. When the slave database is first started Marty copies its schema and data to the history database. It then inspects the changes from the WAL as they are applied and updates the history database accordingly.

Before the slave instance is started a database administrator must configure the master. This includes configuring a few parameters in the postgresql.conf file, see Table 3.1. Note that the archive command in the table is an example. When the master is configured the archive command must copy the WAL files to a location where they are accessible by the slave. Also note that max_wal_senders must be at least 1, but can be higher.

Parameter          Value
wal_level          hot_standby
archive_mode       on
archive_command    'cp %p /path/to/archive/%f'
max_wal_senders    1

Table 3.1: Configuration parameters in postgresql.conf for the master database
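
After the master has been restarted with these settings, they can be verified from any SQL session; this is just a sanity check, not part of Marty itself.

    SHOW wal_level;        -- expected: hot_standby
    SHOW archive_mode;     -- expected: on
    SHOW max_wal_senders;  -- expected: 1 (or higher)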

Next the administrator must create a base backup of the master database. A base backup is a copy of the cluster files that store all the data in the database. It can be created with the program pg_basebackup and might require changes to the pg_hba.conf file, see the Postgres documentation for further reference1.

As previously noted the slave runs on a patched version of Postgres that must be compiled with a special flag that enables Postgres to log the WAL replay actions. When the patch has been applied to the Postgres source code it must be compiled with the WAL_DEBUG CPP flag.

Instead of creating a new database cluster for the slave with the initdb command the administrator uses the base backup from the master. When it has been copied to the correct place the postgresql.conf file must be updated, see Table 3.2. It is then necessary to add a recovery.conf file with a command to fetch the WAL files from the master database, see the Postgres documentation for further reference2.

Parameter      Value
hot_standby    on
wal_debug      on

Table 3.2: Configuration parameters in postgresql.conf for the slave database

1 http://www.postgresql.org/docs/9.3/static/app-pgbasebackup.html
2 http://www.postgresql.org/docs/9.3/static/recovery-config.html

3.1.1. Reading the WAL

The write-ahead log contains all the changes that are applied to the master database. They are stored in binary records that can be applied directly to the files in the slave database cluster to repeat the changes on the slave. There are a few different kinds of WAL records that store information about different operations on the master. The ones that Marty looks for are heap records and commit records.

Heap records contain changes to the tables in the database: inserts, updates and deletes. The commit records signal the slave to commit and close the current transaction that is being replayed from the WAL. Other kinds of WAL records include B-tree records for B-tree indexes and XLOG records that contain information about transaction logs.

Marty can read information about the heap records in the WAL replay log. The log contains information such as the type of the heap record (insert, update or delete) and identifiers of the database and table that the record alters. The log also contains references to the row which was inserted, updated or deleted.

1 LOG: REDO @ 0/800F1A0; LSN 0/800F248: prev 0/800F160; xid 741; len 139: Heap - insert: rel 1663/16384/11829; tid 39/63
2 LOG: REDO @ 0/900390C; LSN 0/9003944: prev 0/9002158; xid 742; len 26: Heap - delete: rel 1663/16384/11829; tid 39/38 KEYS_UPDATED
3 LOG: REDO @ 0/8011D80; LSN 0/8011F0C: prev 0/8011D54; xid 741; len 368: Transaction - commit: 2014-03-06 23:36:33.937958+00

Listing 3.1: WAL replay log example

Listing 3.1 has an example of the WAL replay log. The insert and delete heap records in lines 1 and 2 in this example alter the same table. This table can be looked up with the rel values in the log: 1663 is the tablespace ID, 16384 is the database ID and 11829 is the ID of the table's relation file. Marty uses the database and table IDs to look up the table in the slave database and queries it with the tid values, which in this case are 39/63 and 39/38. The tid values reference the row which was inserted into or deleted from the table. The last line of the example, line 3, logs a commit record. It closes the transaction and applies the changes from lines 1 and 2 to the slave database.
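
The lookup can be illustrated with two queries against the slave. They are only a sketch: the numbers come from Listing 3.1, the queries must be run while connected to the database with ID 16384, and some_table stands for whatever table the first query returns.

    -- Resolve the last number of the "rel" triple to a schema and table name
    SELECT n.nspname, c.relname
    FROM pg_class c
    JOIN pg_namespace n ON n.oid = c.relnamespace
    WHERE c.relfilenode = 11829;

    -- Fetch the affected row by its tuple identifier from the replay log
    SELECT * FROM some_table WHERE ctid = '(39,63)';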

The next section describes the schema of the history database and how the slave database is queried for the data that is inserted into the history.

3.2. The history database

The history database contains information about the schema of the master database and a copy of its data. It contains multiple versions of the schema and data from the master; each transaction that alters the master database creates a history version. It is used as a reference when the user creates a new clone database; the schema of the clone is created according to the information in the history database and its data is copied from the history.

Figure 3.2: The history part of the architecture

The history database is created when the slave instance has been configured. It is a standard Postgres database that must have the same major version as the master database. That means that if the master runs on Postgres version 9.3.3 the history version must start with 9.3, see the Postgres documentation for further reference3.

3.2.1. Schema information

To initialize the history database the administrator runs the script history.py. It starts by creating the schema information tables which Marty uses to store information about the schema of the master database:

marty_schemas Contains the name and ID of all schemas in the slave database.

marty_tables Contains the name and ID of the tables in the slave database and a reference to the schema they are in. It also stores the internal name which is used for the data table in the history and clone databases.

marty_columns Contains the name, number, type and length of each column in the tables. It also stores an internal name for the column that is used in the data tables in the history and clone databases.

Tables 3.3 to 3.5 show information about the columns of these tables. Note that each table contains a column called _ctid, see Chapter 3.2.3 for more information about the value in this column. The tables also have two columns called start and stop. These columns define which history version each row is part of, see Chapter 3.2.4 for further information. Figure 3.3 shows an example of the contents of the schema information tables.

Postgres uses system catalogs to store information about the schema of its databases. They are used internally, e.g. when Postgres reads from or writes to tables. Information such as which file in the database cluster represents which table and the column names and types of each table are stored in the system catalogs. They are ordinary tables which are stored in a schema called pg_catalog and can be queried by a user just as any other table.

3 http://www.postgresql.org/docs/9.3/static/upgrading.html

marty_schemas
Column   Type      Description
_ctid    tid       A reference to the pg_namespace table in the slave database
oid      oid       The ID of the schema in the slave database
name     name      The name of the schema
start    integer   First version where this schema is present in the database
stop     integer   First version where this schema stops being present in the database

Table 3.3: The columns of the marty_schemas table

marty_tables
Column          Type      Description
_ctid           tid       A reference to the pg_class table in the slave database
oid             oid       The ID of the table in the slave database
name            name      The name of the table
schema_oid      oid       A reference to the schema that this table belongs to
internal_name   name      The name of the data table, see Chapter 3.2.2
start           integer   First version where this table is present in the database
stop            integer   First version where this table stops being present in the database

Table 3.4: The columns of the marty_tables table

Marty reads information from four system catalogs in the slave database:

pg_namespace Contains information about the schemas in the slave database. Information from this catalog is saved in marty_schemas in the history database.

pg_class Contains information about the tables in the slave database. Information from this catalog is saved in marty_tables in the history database.

pg_attribute Contains information about the columns of the tables in the slave database. Information from this catalog is saved in marty_columns in the history database.

pg_type Contains additional information about the columns. Marty reads the name of the column type from this table (integer, text etc.) and stores it in marty_columns along with the other column-related information. A sketch of this kind of catalog query is shown after this list.
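
The query below illustrates how column information of this kind can be collected from the catalogs on the slave. It is only an illustration of the idea; the queries that Marty actually issues are part of the source code in Appendix A.

    -- Column name, position, type name, type modifier and row location for user columns
    SELECT a.attrelid  AS table_oid,
           a.attname   AS name,
           a.attnum    AS number,
           t.typname   AS type,
           a.atttypmod AS length,
           a.ctid
    FROM pg_attribute a
    JOIN pg_type t ON t.oid = a.atttypid
    WHERE a.attnum > 0            -- skip system columns
      AND NOT a.attisdropped;     -- skip dropped columns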


marty_columns
Column          Type      Description
_ctid           tid       A reference to the pg_attribute table in the slave database
table_oid       oid       A reference to the table this column is in
name            name      The name of the column
number          int2      The index of this column in the table (first, second, third etc.)
type            name      The type of the column (int, text, boolean etc.)
length          int4      The length of the column. Some data types, such as character
                          columns, e.g. char(26), require the user to specify a length.
                          Columns without a length usually have the value -1
internal_name   name      The name of this column in the data table, see Chapter 3.2.2
start           integer   First version where this column is part of the table
stop            integer   First version where this column stops being part of the table

Table 3.5: The columns of the marty_columns table
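
For illustration, DDL along the following lines could create the first two of these schema information tables; the column lists are taken directly from Tables 3.3 and 3.4, but the real tables are created by history.py (Appendix A).

    CREATE TABLE marty_schemas (
        _ctid  tid,      -- reference into pg_namespace on the slave
        oid    oid,      -- ID of the schema on the slave
        name   name,     -- name of the schema
        start  integer,  -- first history version that contains the schema
        stop   integer   -- first history version where it is gone (NULL while present)
    );

    CREATE TABLE marty_tables (
        _ctid          tid,      -- reference into pg_class on the slave
        oid            oid,      -- ID of the table on the slave
        name           name,     -- name of the table
        schema_oid     oid,      -- schema this table belongs to
        internal_name  name,     -- name of the corresponding data table
        start          integer,
        stop           integer
    );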

The history.py script creates the schema information tables and then it populates them with information about the schema of the slave database. This is of course also the schema of the master database so the history database really contains information about the master.

Marty inspects the schema of the slave database before any WAL records have been applied to it. It writes the schema information to the history database, which creates the first history version, see Chapter 3.2.4 for more about history versions. Marty then starts the WAL replay in the slave database and reads the replay log.

When a new schema is created the WAL contains an insert heap record for the pg_namespace table. Marty sees this in the replay log and queries the slave about this new schema. It then creates a new history version that includes it. The same happens when a new table is created or a new column is added to a table; a new history version is created that includes the new table or column.

If a schema, table or column is altered, e.g. when it is renamed, the WAL has an update heap record. Similarly, when a schema, table or column is dropped the WAL has a delete heap record. Marty sees this in the replay log and creates new history versions with the altered database schema.

See Figure 3.3 for an example of the contents of the schema information tables.


Figure 3.3: Example of the contents of the schema information tables

In the example the master database has a schema called myschema. Its object identifier (oid) is 2200. The pg_namespace system catalog in the slave database stores information about this schema in a row with the ctid value of (0,5). This is reflected in the marty_schemas table in the history database.

The master contains one table, persons, which belongs to myschema. Information about this table is stored in marty_tables. Its oid is 16385 and the slave stores information about this table in a row with ctid value of (0,46) in the pg_class system catalog. Its internal name is data_myschema_persons_1, see Chapter 3.2.2 for information about internal names of tables and columns.

The persons table has two columns, age of type integer or int4, and name of type text. The slave stores information about these columns in the pg_attribute system catalog in rows with ctid values of (39,32) and (39,33). The age column is the first column in the table, the name column is the second one. This is reflected in the number value in marty_columns. Columns of type integer and text do not use the length attribute, so Postgres puts -1 there. The internal names of the columns are data_age_1 and data_name_1.


The schema, table and columns are all part of the first history version. This is reflected in the start columns in the schema information tables. They have not yet been altered or dropped and are thus part of every history version after the first one, which is reflected with the NULL value in the stop column. See Chapter 3.2.4 for information about history versions.

3.2.2. Data tables

The history database stores a copy of all the data from the slave database tables. Marty creates a table in the history database for each table in the slave. This happens immediately after Marty has saved the schema information about the original table to the history database. These tables are the data tables of the history database.

The schema of a data table is not identical to the schema of the original table. Marty adds three columns for metadata: data_ctid, start and stop. These columns are also present in the schema information tables (where data_ctid is called _ctid) where they serve the same purpose, see Chapters 3.2.3 and 3.2.4 for more information.

When a column is added to the original table in the slave database Marty also adds it to the corresponding data table. If a column is dropped from the original table Marty still keeps the column in the data table. This is necessary because Marty keeps old versions of the data in the table and must therefore keep all columns that have been added to the original table, regardless of whether they have since been dropped or not.

When a row is deleted from the original table Marty marks it as deleted in the data table. The row is still kept in the data table as part of an old history version because the data must still be accessible. When a row is updated in the original table Marty inserts a new row into the data table with the updated values and marks the old row as deleted. This behavior is similar to the multiversion concurrency control (MVCC) that Postgres uses to allow more than one transaction to use the same table at the same time.

Marty does not copy any constraints from the original tables to the data tables. The only data that the data tables contain has been copied from the original tables in the slave database. This data conforms to the constraints of the original tables and it is therefore not necessary to replicate the constraints in the data tables. The constraints would also cause trouble, e.g. when Marty stores updated rows from a table with a unique constraint. If the values of the unique columns are not updated but only values in other columns then the data table could not store both the old and the new versions of the rows, as the values in the unique columns would be the same in both rows. For these reasons Marty does not copy any constraints from the original tables.

Figure 3.4: An example of the schema of a data table in the history database

Marty creates all data tables in the master schema of the history database. To avoid naming conflicts new names are created for them: data_[schema]_[table]_[version]. They all have the prefix data_ followed by the name of the schema they are part of in the slave database. Next comes the name of the original table and the name ends with the ID of the history version where this table was created. This makes it possible for Marty to keep all the data tables in one schema in the history database, even if more than one table from different schemas share the same name in the slave database. It also handles name reuse when a table is dropped and another table is created with the same name. These two tables are not part of the same history version and thus the names of the data tables are different. The columns of the data tables use similar names. They have the data_ prefix and are suffixed with the history version where they were added to the table.

Figure 3.5: An example of the data in a data table in the history database

See Figure 3.4 for an example of the schema of a data table in the history database. The internal name of the data table in the history database is data_myschema_persons_1; it belongs to the schema myschema and was created in the first history version. It has the three metadata columns: data_ctid, start and stop. The age and name columns of the original table in the master database are called data_age_1 and data_name_1 in the data table.
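
Written out as DDL, the data table from Figure 3.4 could look roughly as follows; this is only an illustration of the layout described above, since Marty generates these tables itself and, as noted, gives them no constraints.

    CREATE TABLE data_myschema_persons_1 (
        data_ctid   tid,      -- ctid of the corresponding row in the slave's persons table
        start       integer,  -- first history version that contains this row version
        stop        integer,  -- history version in which it was deleted (NULL while live)
        data_age_1  int4,     -- original column: age
        data_name_1 text      -- original column: name
    );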

Figure 3.5 shows an example of the contents of a data table. The original table in the master database contains one row where age is 45 and name is John Doe. This row has the ctid value of (0,1), which Marty stores in the data_ctid column in the data table. It was inserted into the table in history version 2 and deleted again in history version 4. This is reflected in the columns start and stop.

3.2.3. CTID columns

The schema information tables have a column called _ctid and the data tables have an identical column called data_ctid. The values in these columns reference the values in the ctid columns in the corresponding tables in the slave database. For the schema information tables these are the system catalogs, and for the data tables the corresponding tables are the original tables in the slave database.

The ctid column is a hidden system column that Postgres adds to all tables when they are created. It is of type tid, or tuple identifier, which is a pair of numbers that identify the physical location of the row inside the relation file in the database cluster. The relation files are made up of blocks which store the contents of the relations. Each block contains one or more tuples which store the actual data of the table rows. The tuple identifier consists of a block index and a tuple index. The block index is zero-based while the tuple index is one-based, so the first tuple ID is (0,1).
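
The ctid of any row can be inspected with an ordinary query by naming the column explicitly. For the persons example used in this chapter the output might look like this (the rows shown are illustrative):

    SELECT ctid, age, name FROM myschema.persons;

     ctid  | age | name
    -------+-----+----------
     (0,1) |  45 | John Doe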

1  LOG: REDO @ 0/60080CC; LSN 0/6008198: prev 0/600806C; xid 735; len 37; bkpb0: Heap - update: rel 1663/16384/16385; tid 0/1 xmax 735; new tid 0/2 xmax 0
2  LOG: REDO @ 0/600822C; LSN 0/6008264: prev 0/6008204; xid 737; len 26: Heap - delete: rel 1663/16384/16385; tid 0/3 KEYS_UPDATED

Listing 3.2: WAL update and delete example

The values in the ctid column can be used to identify each row in the database. Marty saves this value in the schema information tables and data tables to be able to identify deleted and updated rows. The WAL replay log from the slave instance contains the tuple identifiers of the rows that are updated or deleted in the WAL records. Marty uses the tid value to locate the correct row in the schema information tables or data tables in the history database and marks these rows as deleted. See Listing 3.2 for an example. In the example, the update record in the first line inserts a new row with the tid value (0,2) to replace the old values in the row with the tid value (0,1). The delete record in the second line deletes a row with the tid value (0,3).
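
In SQL terms, marking such a row as deleted amounts to a single update of the stop column in the data table, keyed on data_ctid. A sketch, using the table name and version numbers from the running example (the actual statement is generated by the HistoryPopulator class in Appendix A):

    UPDATE data_myschema_persons_1 SET stop = 4 WHERE data_ctid = '(0,1)';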

Note that the value of the ctid column can change, e.g. when Postgres vacuums the table or when a row is updated. Therefore it is not a viable long-term key to identify the row. This does not affect the way Marty uses it because it only needs to know the tid value of the row until it changes. When it is changed in the slave database Marty reflects this change in the history database and saves the new tid value.

Figure 3.3 shows an example of the values in the _ctid column in the schema information tables. The table marty_schemas contains information that Marty reads from the pg_namespace system catalog. In the example this table contains one row with values from the row with ctid (0,5) in pg_namespace. The table marty_tables contains one row with information from the pg_class system catalog. It contains values from the row with ctid (0,46) in pg_class. The table marty_columns contains two rows with information from the pg_attribute catalog. They contain values from the rows with ctids (39,32) and (39,33) in pg_attribute.

Figure 3.5 shows an example of a value in the data_ctid column in a data table. The data table is called data_myschema_persons_1 and the row it contains has values from a row in the original table in the master database that has the ctid value (0,1).

3.2.4. History versions

The history database stores multiple versions of the data from the slave database. This is necessary because of the delayed population of the tables in the clone databases. From the time when a clone is initialized until its tables are populated with data, the slave database can change, causing inconsistency in the clone. Therefore the clone must have access to the data as it was when the user initialized the clone database.

When the history database is initialized, Marty creates the table marty_updates; see Table 3.6 for a list of its columns. It stores the ID of each version along with two timestamps, the local time and the master time. The local timestamp (in the column time) is the time when this version was created in the history database. The master time is the time when the transaction that created this version was executed on the master database. The ID is a simple counter that starts at 1 and increments by one for each new version.

The schema information tables and data tables in the history database have two columns that tell Marty which versions a particular row is part of. These columns are start and stop, and they are foreign keys that reference the version ID in the marty_updates table. See Tables 3.3 to 3.5 for a list of columns in the schema information tables and Figure 3.4 for an example of the columns in a data table.

marty_updates

Column      Type       Description
id          serial     Primary key - a unique ID for each history version
time        timestamp  The date and time when this version was created in the history database
mastertime  timestamp  The date and time of the transaction that created this version on the master database

Table 3.6: The columns of the marty_updates table

When a row is inserted into one of the tables in the history database it is part of the current history version. Marty inserts the ID of the current version into the start column to reflect this. When a row is deleted or updated in the slave database the corresponding row in the history database is marked as deleted. Marty does this by updating the value in the stop column of the deleted row to the current history version ID. An example of this is a row where the start ID is 3 and the stop ID is 6. This row is part of versions 3, 4 and 5. A row that has not been deleted or updated has null in the stop column. If the current version ID is 10 and a row has start ID 7 and stop ID null, then this row is part of versions 7, 8, 9 and 10 and all future versions until it is marked as deleted.
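
A row therefore belongs to a given version when its start is at or before that version and its stop is either null or after it. This is a sketch of the filter used when reading a specific version from a data table (table and column names come from the earlier example; version 5 is arbitrary), matching the queries in the HistoryInspector class in Appendix A:

    SELECT data_age_1, data_name_1
    FROM data_myschema_persons_1
    WHERE start <= 5 AND (stop IS NULL OR stop > 5);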

This method is similar to the way Postgres implements multiversion concurrency control (MVCC). MVCC is used to allow more than one transaction to use the same table at the same time. Postgres creates hidden system columns on all tables called xmin and xmax. They store the transaction ID of the transactions that created the row and deleted or updated it, respectively. This corresponds to the way Marty uses the start and stop columns. Marty does not, however, use the values in the xmin and xmax columns to determine the version numbers for the rows in the history database. Instead the WAL replay log is used to inspect when each row is created and when it is deleted.
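
These MVCC columns can be inspected in any Postgres table by naming them explicitly, which makes the analogy to start and stop easy to see:

    SELECT xmin, xmax, ctid, * FROM myschema.persons;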

When the history database is initialized, Marty inspects the slave database and creates the first history version. It contains all schemas and tables in the slave database before any WAL records have been applied to it. When Marty is finished inspecting the slave it starts the WAL replay. The slave pauses the replay when the first transaction has been applied and Marty creates the second version with all the changes from that transaction. Marty then restarts the WAL replay and repeats the process for each new transaction.


3.3. The clone databases

Figure 3.6: The clone part of the architecture

A clone database is a replica of the master database. The user can query the clone just like she would query the master. It is a standard Postgres database that uses the PL/pgSQL and dblink extensions. The user creates a new, empty database and runs the script clone.py to initialize it as a clone. The script reads the schema information from the history database and initializes the clone accordingly.

Like the history database, the clone database must have the same major version as the master. If the master has the version 9.3.3, that means that the clone databases must have a version number that starts with 9.3.

The initialization script creates schemas in the clone database to match those in the master database. The tables in the clone are lazy-loading. This means that a table is empty until a user queries it for the first time. The clone then fetches its content from the history database. This is implemented with a view which the user queries and an accompanying data table which stores the data once it has been fetched from the history.

The data tables live in a schema called marty. They have the same name as the data tables in the history database (see Chapter 3.2.2). The columns are identical to the columns in the original table in the master database: there are no extra columns and their names are identical to the column names in the original table. See Figure 3.7 for an example of a data table.

Figure 3.7: Table layout in a clone database

The table persons is actually a view that returns results from the table data_myschema_persons_1, which is in the marty schema.

When the user queries a table in the clone database like she would query it in the master, she is actually querying the view that Marty creates. It executes a PL/pgSQL function, view_select, which returns the contents of the original table in the master. The function is created by Marty when the clone is initialized. It receives one argument, the name of the view that is being queried, and fetches the data for it from the corresponding data table. When the view is queried for the first time the view_select function fetches the data from the history database and inserts it into the data table before returning the results. See Listing 3.3 for the source code of the view_select function.

 1  CREATE FUNCTION view_select(my_view_name text) RETURNS SETOF RECORD AS $$
 2  DECLARE
 3      view_info RECORD;
 4  BEGIN
 5      SELECT * FROM marty.bookkeeping WHERE view_name = my_view_name INTO view_info;
 6      IF NOT view_info.cached THEN
 7          RAISE NOTICE 'fetching %', view_info.view_name;
 8          EXECUTE 'INSERT INTO ' || view_info.local_table ||
 9              ' SELECT ' || view_info.coldef ||
10              ' FROM dblink(''' || coninfo() || ''', ''' || view_info.remote_select_stmt || ''')'
11              ' AS ' || view_info.temp_table_def;
12          UPDATE marty.bookkeeping SET cached = true WHERE view_name = my_view_name;
13      END IF;
14      RETURN QUERY EXECUTE 'SELECT ' || view_info.coldef || ' FROM ' || view_info.local_table;
15  END;
16  $$ LANGUAGE plpgsql;

Listing 3.3: The view_select function
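
The view itself is not shown in the listing; it is essentially a thin wrapper around view_select that supplies the column definition list required for a function declared to return SETOF RECORD. A sketch of what such a view could look like for the persons example (the exact definition Marty generates may differ):

    CREATE VIEW myschema.persons AS
        SELECT * FROM view_select('persons') AS t(age int, name text);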

Marty keeps track of which data tables have been populated in the table bookkeeping. It is created alongside the data tables in the marty schema when the clone database is initialized and is used to store information about the data tables and views. It stores which data table keeps the data for which view and whether it has been initialized, as well as information about how to query the history database to fetch the data for each view. See Table 3.7 for a list of columns in the bookkeeping table.

bookkeeping

Column              Type     Description
view_name           name     The name of the view
local_table         name     The name of the data table that contains the data for the view
cached              boolean  True if the data table has been populated with data from the history database
coldef              text     A list of columns in the data table, used by the view_select function when querying the data table
remote_select_stmt  text     The select statement for the data table in the history database
temp_table_def      text     Table definition for a temporary table, used by view_select

Table 3.7: The columns of the bookkeeping table

The cached column is a boolean which tells the view_select function whether the data table has been populated with data from the history database (line 6 in Listing 3.3). It defaults to false as all data tables start empty. The function uses the coldef value when it queries the data table. It is a comma-separated list of the columns in the data table and is used to construct the select query (line 14). When the view_select function queries the history database it uses the remote_select_stmt as the query (lines 7 to 11). It is a select query that returns all rows from the data table in the history. The dblink function from the dblink extension is used to query the history database (line 10). It receives two parameters: a connection string and the query to run.

The connection string is generated when the clone database is initialized. Marty creates a function, coninfo, that returns this string. This makes it simple to update the connection string later in a running clone database if the setup of the history database changes, as only this one-line function needs to be redefined instead of the whole view_select function. The query that returns the data from the history database is created when Marty initializes the clone and is saved in the remote_select_stmt column in the bookkeeping table.
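
A minimal sketch of such a function, assuming placeholder connection details (the real string is built from the arguments given to clone.py):

    CREATE OR REPLACE FUNCTION coninfo() RETURNS text AS $$
        SELECT 'host=historyhost port=5432 dbname=history user=marty password=secret'::text;
    $$ LANGUAGE sql;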

SELECT * FROM dblink('dbname=mydb', 'SELECT age, name FROM persons') AS p(age int, name text)

Listing 3.4: An example of a SELECT query with a table alias

The dblink query must have an alias part where the names and types of the columns in the result rows are specified. This is because the dblink function is declared to return a set of records. Postgres does not know the format of the records so the user must provide this information with the alias4. See Listing 3.4 for an example. The part after AS tells Postgres what to expect in the query results. This part is unique to each data table in the clone database. It is saved in the temp_table_def column in the bookkeeping table and is appended to the dblink query in the view_select function (line 11).

4http://www.postgresql.org/docs/9.3/static/contrib-dblink-function.html


3.4. Source code

The source code for the current version of Marty can be found in Appendix A. It contains two Python scripts that can be run from the command line. Those are history.py, which a database administrator uses to initialize the history database, and clone.py, which a user runs to initialize a clone database. The rest of the code consists of three Python files which contain classes that the scripts use.

inspector.py Contains the classes SlaveInspector and HistoryInspector. The SlaveInspector is used by history.py to gather information about the schemas and tables in the slave database. The HistoryInspector is used by clone.py to read the schema and table information from the history database.

populator.py Contains the classes HistoryPopulator and ClonePopulator. The HistoryPopulator is used by history.py to insert the information from the SlaveInspector into the history database. The ClonePopulator is used by clone.py to create schemas and tables in the clone database according to the information from the HistoryInspector.

dbobjects.py This file contains a few classes that the inspectors use to supply the populators with data. They include Schema, Table and Column, which contain the name and other data about each one of those objects. The Table and Column classes also have methods to create the names of the data tables and their columns.

The history.py script receives ten arguments:

--slave-host          The hostname or IP address of the slave database
--slave-port          The port number of the slave database
--slave-user          The username for the slave database
--slave-password      The password for the slave database
--slave-database      The name of the slave database
--history-host        The hostname or IP address of the history database
--history-port        The port number of the history database
--history-user        The username for the history database
--history-password    The password for the history database
--history-database    The name of the history database

The clone.py script receives the same arguments for the history database and similar arguments for the clone database that should be initialized.
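
A hypothetical invocation of clone.py could look like this; all host names, ports, user names, passwords and database names are placeholders:

    python clone.py \
        --history-host localhost --history-port 5434 \
        --history-user marty --history-password secret --history-database history \
        --clone-host localhost --clone-port 5432 \
        --clone-user dev --clone-password secret --clone-database myclone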

The clone.py script only uses the data in the history database to populate the clone. The history.py script receives the WAL replay log from the slave database as standard input. It uses a couple of classes to consume and parse the log, Worker and RegExer. The Worker class reads the log and keeps a list of all heap records it encounters. When it reads a commit record it creates a new history version in the history database and runs through the heap records, saving the changes to the new history version. To parse the log records it uses the RegExer class, which is a wrapper around regular expressions which are used to match and extract data from the replay log. See Chapter 3.1.1 for an explanation of the WAL replay log, heap records and commit records.
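
As an illustration of the parsing, the following is a simplified excerpt of the pattern RegExer uses for delete records (see history.py in Appendix A), applied to the delete record from Listing 3.2:

    import re

    rel_tid = r'rel (?P<spc_node>\d+)/(?P<db_node>\d+)/(?P<rel_node>\d+); tid (?P<block>\d+)/(?P<offset>\d+)'
    delete_re = re.compile(r'Heap - delete: ' + rel_tid)

    record = 'Heap - delete: rel 1663/16384/16385; tid 0/3 KEYS_UPDATED'
    match = delete_re.match(record)
    # The relation file node identifies the table, the tid identifies the row.
    print(match.group('rel_node'), match.group('block'), match.group('offset'))  # 16385 0 3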

All SQL and PL/pgSQL code in Marty is part of the Python scripts. Marty is not a large project and the current version contains a little over 1100 lines of code. Included in that number is the file that contains the patch to the Postgres source code.


4. Current status and future work

The current version of Marty is a proof-of-concept prototype. It is written in Python and PL/pgSQL for version 9.3.3 of Postgres. It supports replicating a database with ordinary tables without any default values or constraints.

Marty replicates a master database and the replicas are called clone databases. They are standard Postgres databases that the user initializes with a Python script provided by Marty. The tables in the clone databases are empty until they are queried. This reduces the time it takes to create the clone database, but with the drawback of a longer query time when the tables in the clones are first queried. The lazy-loading of the tables is implemented with views. When the user queries a table in a clone database like she would query it in the master database, she is actually querying a view. When a view is queried for the first time it calls a PL/pgSQL function that fetches the data and caches it on the clone database before it is returned. Subsequent queries to the view use the cached data and thus they take less time.

The clone databases do not fetch the data directly from the master database. In the time between initializing the clone with the empty tables and fetching their data when the user queries the tables in the database, the state of the master database might have changed. This could cause inconsistency in the clone database. To prevent that, it fetches the data from another database, the history database. It is a standard Postgres database that contains information about the schema and a copy of the data of the master database. As the name suggests, it contains a history of the changes that have been made on the master database, both to the schema and the data. The clone can therefore query the history database for the data as it was at a certain point in time. This guarantees the consistency of the data that the clone database fetches and returns to the user.

Marty uses the write-ahead log (WAL) from the master database to monitor how it changes. This log contains the changes of the master database in binary records that can be directly applied to the cluster files that store the contents of the database. This makes it difficult or impossible to read the WAL without having a copy of the database cluster files as a reference. Marty uses a dedicated instance of Postgres to read the contents of the WAL. This is the slave instance, which is configured as a hot-standby for the master database. While it reads the WAL and replays it to the cluster files, Marty inspects the changes that the WAL makes and records those changes in the history database.

4.1. Limitations of the current version

Marty can be used to quickly create databases for the development and testing of applications that use Postgres. However, the current version is limited to a few core elements in a Postgres database. Those are schemas, ordinary tables and the standard data types of the columns. Marty does not replicate default values on columns or any column or table constraints, such as foreign keys or UNIQUE constraints.

Before Marty is ready for use in a production environment these limitations must be addressed. Support must be added for default values and constraints, along with support for more types of database objects. Marty should be able to replicate views as well as ordinary tables. Primary keys in tables are often created with the serial type, which uses a sequence to supply the table with values for the primary key. Marty must be able to replicate sequences.

Ordinary tables, sequences and views are all examples of relations. Other relation types are TOAST tables, materialized views, composite types and foreign tables1. Adding support for these types of relations would make Marty usable in more environments. Other types of database objects that could be supported in future versions are functions and operators, data types and domains, triggers and rewrite rules.

Another limitation of the current version is the design of the clone databases. They use views to imitate tables that lazy-load their contents from another database. This complicates some operations that a user might want to perform in the database, such as creating new indexes or altering the tables. The user should be able to execute as many operations as possible in the clone databases in exactly the same way as they are executed in the master database. This would likely require the user to run a patched version of Postgres for the clone databases.

The clone databases do not optimize their queries to the history database in any way. When the user queries a table in the clone database it fetches the complete contents of that table from the history, even though the user limits the query results to only a few rows. A patched version of Postgres could mitigate this and inspect the queries from the user before it fetches the data from the history database. This could speed up the execution of queries, especially for tables that have many thousands, or even millions, of rows.

1http://www.postgresql.org/docs/9.3/static/catalog-pg-class.html

4.2. Future work

There are many features that can be added to Marty to make it more attractive to developers. These include features that minimize the overhead of using Marty, make it more secure or simplify its setup. Below are a few ideas for improvements or additions for future versions of Marty.

4.2.1. Logical replication

Logical log streaming replication is a feature that is being implemented for Postgres and will be part of future versions. It provides the possibility of shipping the change log of one database instead of shipping the changes of the whole database cluster, as the write-ahead log currently does. The receiving database uses plugins to read the contents of the logical log. One plugin is used to apply the changes from the log, but another one is available that creates SQL statements from the changes without applying them. Marty could provide one such plugin for the history database that would read the log and create new history versions. This would make the slave instance in Marty unnecessary and would simplify its setup and maintenance.

4.2.2. Data obfuscation

Many databases contain sensitive information, or information that should not be used in a development or testing environment. Examples of such data are credit card numbers and e-mail addresses. Marty could include the possibility to change the values of certain columns in certain tables when it inserts data into the history database. This could be in the form of a Python script, which would allow for a very flexible way of obfuscating or changing the values before they are inserted into the history. This would prevent the sensitive data from entering the development or testing environment, thus preventing any accidental use.
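
A hypothetical obfuscation hook could look like the sketch below. Marty does not currently expose such an interface; the function signature, table names and column names are illustrative only.

    import hashlib

    def obfuscate(schema, table, column, value):
        """Rewrite sensitive values before they are written to the history database."""
        if table == 'users' and column == 'email':
            # Keep values distinct but unrecognizable.
            digest = hashlib.sha256(value.encode('utf-8')).hexdigest()[:12]
            return '{}@example.com'.format(digest)
        if table == 'orders' and column == 'credit_card':
            return 'XXXX-XXXX-XXXX-0000'
        return value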


4.2.3. VCS integration

Version control systems (VCS) such as Git2 and Mercurial3 allow developers to work on many different features or fixes for a program simultaneously while keeping the changes for each feature isolated. The developers create branches of the source code where each branch contains the changes for only one feature or fix. They can then switch between branches without polluting one branch with the changes from another one.

It is possible to create hooks for these systems that run specific commands when the user executes certain actions. Marty could include hooks that would automatically create and initialize a new clone database whenever a developer created a new branch. They could also be configured to update the settings of the project that the developer is working on. The developer would then always use the correct database for the active branch without needing to change manually from one database to another. This could ease development as the developer would always get a new development database for each new branch.
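
As a sketch, a Git post-checkout hook could create and initialize a clone named after the branch that was just checked out. Everything here is hypothetical: the database naming scheme, the connection details and the way clone.py is invoked would depend on the local setup.

    #!/usr/bin/env python
    # .git/hooks/post-checkout (hypothetical)
    import subprocess

    def main():
        branch = subprocess.check_output(
            ['git', 'rev-parse', '--abbrev-ref', 'HEAD']).strip().decode('utf-8')
        dbname = 'clone_{}'.format(branch.replace('/', '_'))
        subprocess.call(['createdb', dbname])  # create an empty database for the clone
        subprocess.call(['python', 'clone.py',
                         '--history-host', 'localhost', '--history-database', 'history',
                         '--clone-host', 'localhost', '--clone-database', dbname])

    if __name__ == '__main__':
        main()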

4.2.4. Time travel

Time travel is a feature that was once a part of Postgres4. It allowed users to query the contents of the database as it was at a certain point in time in the past. It was removed from Postgres in version 6.2 due to its performance impact and the amount of extra storage that was needed to support it.

It can, however, be useful in some situations to be able to query the database for historical versions of its data. This is possible in the current version of Postgres by using the write-ahead log, but it is not a very practical solution. The WAL must be applied to an old base backup of the master database and Postgres must be configured to stop the WAL replay at the right moment. This can be time consuming as the WAL replay can take considerable time. It can also be hard to leap between different points in time when replaying the WAL as there is no way to rewind it.
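
For reference, this is roughly the recovery.conf that would be needed to stop WAL replay on a base backup at a chosen moment in Postgres 9.3; the archive path and the target timestamp are placeholders:

    restore_command = 'cp /path/to/wal_archive/%f "%p"'
    recovery_target_time = '2014-04-01 12:00:00'
    pause_at_recovery_target = true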

The information in the history database can instead be used for this task. It would be a relatively quick operation to jump backwards or forwards in time and inspect the different states of the database. It is already possible to use Marty in such a way with minimal changes. The user selects which version to inspect and creates a new clone database that is initialized for that version. This includes some manual labor as the user must first find the correct history version and then configure Marty to initialize a clone database with that specific version ID. It is also cumbersome to inspect different versions as the user would need to create a new clone database for each version and then query all of them and inspect the results manually.

2http://git-scm.com
3http://mercurial.selenic.com
4http://www.postgresql.org/docs/6.3/static/c0503.htm

Future versions of Marty could include a tool that would enable a user to quickly scan the history database and locate the correct history version. The user could inspect the database as it was at that time and even compare two or more versions. This would make it possible to debug anomalies that were caused by erroneous data in the master database even after the database state has been altered and the anomalies have stopped.


A. Source code

This is the source code for Marty. It is accessible online on Github1.

Listing A.1: clone.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import sys
import argparse
import psycopg2
from utils import HistoryInspector, ClonePopulator, get_logger


def connect(history, clone):
    histcon = psycopg2.connect(**history)
    clonecon = psycopg2.connect(**clone)
    histcon.autocommit = True
    clonecon.autocommit = True
    return histcon, clonecon


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--history-host', help='Hostname or IP of the history database')
    parser.add_argument('--history-port', help='Port number for the history database')
    parser.add_argument('--history-user', help='Username for the history database')
    parser.add_argument('--history-password', help='Password for the history database')
    parser.add_argument('--history-database', help='Name of the history database')

    parser.add_argument('--clone-host', help='Hostname or IP of the clone database')
    parser.add_argument('--clone-port', help='Port number for the clone database')
    parser.add_argument('--clone-user', help='Username for the clone database')
    parser.add_argument('--clone-password', help='Password for the clone database')
    parser.add_argument('--clone-database', help='Name of the clone database')

    args = parser.parse_args()

    history = {
        'host': args.history_host,
        'port': args.history_port,
        'user': args.history_user,
        'password': args.history_password,
        'database': args.history_database
    }

    clone = {
        'host': args.clone_host,
        'port': args.clone_port,
        'user': args.clone_user,
        'password': args.clone_password,
        'database': args.clone_database
    }

    histcon, clonecon = connect(history, clone)

    inspector_logger = get_logger('inspector')
    populator_logger = get_logger('populator')

    inspector = HistoryInspector(histcon, logger=inspector_logger)
    populator = ClonePopulator(clonecon, inspector.update, history, logger=populator_logger)
    populator.initialize()
    for schema in inspector.schemas():
        populator.create_schema(schema)
        for table in inspector.tables(schema):
            inspector.columns(table)
            populator.create_table(table)
    clonecon.commit()


if __name__ == '__main__':
    main()

1https://github.com/baldurthoremilsson/marty/commit/f9a88d7

Listing A.2: history.py1 #!/usr/bin/env python2 # -*- coding: utf-8 -*-3

4 import psycopg25 import sys6 import argparse7 import re8

9 from utils import SlaveInspector, HistoryPopulator, get_logger10 from utils.dbobjects import Schema11

12

13 class RegExer(object):14 def __init__(self):15 self.m = None16

17 rel_tid = ’rel (?P<spc_node>\d+)/(?P<db_node>\d+)/(?P<rel_node>\d+); tid (?P<block>\d+)/(?P<offset>\d+)’18

19 self.regexes = {20 ’insert’: re.compile(r’Heap - insert(?:\(init\))?: {}’.format(rel_tid)),21 ’update’: re.compile(r’Heap - (?:hot_)?update: {} xmax \d+ (?:[A-Z_]+ )?; new tid (?P<new_block>\d+)/(?

P<new_offset>\d+) xmax \d+’.format(rel_tid)),22 ’delete’: re.compile(r’Heap - delete: {}’.format(rel_tid)),23 ’lastup’: re.compile(r’LOG: database system was interrupted; last known up at (?P<timestamp>\d{4}-\d

{2}-\d{2} \d{2}:\d{2}:\d{2})’),24 ’connect’: re.compile(r’LOG: database system is ready to accept read only connections’),25 ’paused’: re.compile(r’LOG: recovery has paused’),26 ’redo’: re.compile(r’LOG: REDO @ [0-9A-F]+/[0-9A-F]+; LSN [0-9A-F]+/[0-9A-F]+: prev [0-9A-F]+/[0-9A-F

]+; xid [0-9]+; len [0-9]+(?:; bkpb[0-9]+)?: (.*)’),27 ’commit’: re.compile(r’Transaction - commit: (?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)’),28 }29

30 def match(self, regex, pattern):31 self.m = self.regexes[regex].match(pattern)32 return self.m33

34 @property35 def groupdict(self):36 return self.m.groupdict()37

38 def __getitem__(self, key):39 return self.groupdict[key]40

41 def get(self, key, val):42 try:43 return self[key]44 except KeyError:45 return val46

47

48

49 class Worker(object):50 def __init__(self, infile, regexer, connect_callback):51 self.infile = infile52 self.regexer = regexer53 self.connect_callback = connect_callback54 self.inspector = None55 self.populator = None56 self.slavecon = None57 self._work = []58 self._commited = False59 self._timestamp = None60

61 def consume(self):62 self.infile.flush()63 line = self.infile.readline()64 if self.regexer.match(’lastup’, line):65 self._timestamp = self.regexer.m.groupdict()[’timestamp’]66 elif self.regexer.match(’connect’, line):67 self.slavecon, self.inspector, self.populator = self.connect_callback(self._timestamp)68 self.inspector.resume()69 self._timestamp = None70 elif self.regexer.match(’paused’, line):71 if self.inspector:72 self.inspector.resume()


73 elif self.regexer.match(’redo’, line):74 work = self.regexer.m.groups()[0]75 if self.regexer.match(’commit’, work):76 if self.inspector:77 self._commited = True78 self._timestamp = self.regexer.m.groupdict()[’timestamp’]79 elif self._commited:80 self.populator.update(self._timestamp)81 for w in self._work:82 self.work(w)83 self._work = []84 self._commited = False85 self._timestamp = None86 self._work.append(work)87

88 def work(self, work):89 for action in ’insert’, ’update’, ’delete’:90 if self.regexer.match(action, work):91 break92 else:93 # If the work is not an insert, update or delete action we leave94 # (we only run the else part if the for loop does not break)95 return96

97 db_node = int(self.regexer.get(’db_node’, 0))98 rel_node = int(self.regexer.get(’rel_node’, 0))99 block = int(self.regexer.get(’block’, 0))

100 offset = int(self.regexer.get(’offset’, 0))101 new_block = int(self.regexer.get(’new_block’, 0))102 new_offset = int(self.regexer.get(’new_offset’, 0))103

104 if db_node != self.inspector.db_oid:105 return106

107 if rel_node in self.inspector.system_tables:108 table = self.inspector.system_tables[rel_node]109 if table.name == ’pg_namespace’:110 self.schema_change(action, block, offset, new_block, new_offset)111 elif table.name == ’pg_class’:112 self.table_change(action, block, offset, new_block, new_offset)113 elif table.name == ’pg_attribute’:114 self.column_change(action, block, offset, new_block, new_offset)115 return116

117 table = self.inspector.tabledict.get(rel_node, None)118 if not table:119 return120

121 if action == ’insert’:122 self.insert(table, block, offset)123 elif action == ’update’:124 self.update(table, block, offset, new_block, new_offset)125 elif action == ’delete’:126 self.delete(table, block, offset)127

128 def ctid(self, block, offset):129 return ’({},{})’.format(block, offset)130

131 def schema_change(self, action, block, offset, new_block, new_offset):132 if action == ’insert’:133 schema = self.inspector.get_schema(self.ctid(block, offset))134 self.populator.add_schema(schema)135 elif action == ’update’:136 schema = self.inspector.get_schema(self.ctid(new_block, new_offset))137 self.populator.add_schema(schema)138 self.populator.remove_schema(self.ctid(block, offset))139 elif action == ’delete’:140 self.populator.remove_schema(self.ctid(block, offset))141

142 def table_change(self, action, block, offset, new_block, new_offset):143 if action == ’insert’:144 table = self.inspector.get_table(self.ctid(block, offset))145 if table:146 self.populator.add_table(table)147 self.populator.create_table(table)148 elif action == ’update’:149 table = self.inspector.get_table(self.ctid(new_block, new_offset))150 if table:151 self.populator.add_table(table)152 self.populator.remove_table(self.ctid(block, offset))153 elif action == ’delete’:154 table = self.populator.get_table(self.ctid(block, offset))155 if table:156 self.populator.delete_all(table)157 self.populator.remove_table(self.ctid(block, offset))158

159 def column_change(self, action, block, offset, new_block, new_offset):160 update = self.populator.update_id161 if action == ’insert’:162 column = self.inspector.get_column(self.ctid(block, offset), update=update)


163 if column:164 self.populator.add_column(column)165 self.populator.add_data_column(column)166 if action == ’update’:167 old_column = self.populator.get_column(self.ctid(block, offset))168 column = self.inspector.get_column(self.ctid(new_block, new_offset),169 update=update, internal_name=old_column.internal_name)170 if column:171 self.populator.add_column(column)172 self.populator.remove_column(self.ctid(block, offset))173 if action == ’delete’:174 self.populator.remove_column(self.ctid(block, offset))175

176 def insert(self, table, block, offset):177 row = self.inspector.get(table, block, offset)178 self.populator.insert(table, block, offset, row)179

180 def update(self, table, block, offset, new_block, new_offset):181 self.delete(table, block, offset)182 self.insert(table, new_block, new_offset)183

184 def delete(self, table, block, offset):185 self.populator.delete(table, block, offset)186

187

188 def connect(slave, history):189 slavecon = psycopg2.connect(**slave)190 histcon = psycopg2.connect(**history)191 slavecon.autocommit = True192 histcon.autocommit = True193 return slavecon, histcon194

195 def connect_callback(timestamp):196 parser = argparse.ArgumentParser()197

198 parser.add_argument(’--slave-host’, help=’Hostname or IP of the slave database’)199 parser.add_argument(’--slave-port’, help=’Port number for the slave database’)200 parser.add_argument(’--slave-user’, help=’Username for the slave database’)201 parser.add_argument(’--slave-password’, help=’Password for the slave database’)202 parser.add_argument(’--slave-database’, help=’Name of the slave database’)203

204 parser.add_argument(’--history-host’, help=’Hostname or IP of the history database’)205 parser.add_argument(’--history-port’, help=’Port number for the history database’)206 parser.add_argument(’--history-user’, help=’Username for the history database’)207 parser.add_argument(’--history-password’, help=’Password for the history database’)208 parser.add_argument(’--history-database’, help=’Name of the history database’)209

210 args = parser.parse_args()211

212 slave = {213 ’host’: args.slave_host,214 ’port’: args.slave_port,215 ’user’: args.slave_user,216 ’password’: args.slave_password,217 ’database’: args.slave_database218 }219

220 history = {221 ’host’: args.history_host,222 ’port’: args.history_port,223 ’user’: args.history_user,224 ’password’: args.history_password,225 ’database’: args.history_database226 }227

228

229 slavecon, histcon = connect(slave, history)230

231 inspector_logger = get_logger(’inspector’)232 populator_logger = get_logger(’populator’)233

234 inspector = SlaveInspector(slavecon, logger=inspector_logger)235 populator = HistoryPopulator(histcon, logger=populator_logger)236

237 populator.create_tables()238 populator.update(timestamp)239 for schema in inspector.schemas():240 populator.add_schema(schema)241 for table in inspector.tables(schema):242 inspector.columns(table)243 populator.add_table(table)244 populator.create_table(table)245 populator.fill_table(table)246

247 return slavecon, inspector, populator248

249

250 def main():251 worker = Worker(sys.stdin, RegExer(), connect_callback)252 while True:


253 worker.consume()254

255

256 if __name__ == "__main__":257 main()

Listing A.3: utils/__init__.py

# -*- coding: utf-8 -*-

import sys
import logging

from inspector import SlaveInspector, HistoryInspector
from populator import HistoryPopulator, ClonePopulator


__all__ = ('SlaveInspector', 'HistoryInspector', 'HistoryPopulator',
           'ClonePopulator', 'get_logger')


def get_logger(name):
    formatter = logging.Formatter(logging.BASIC_FORMAT)
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(formatter)
    logger = logging.getLogger(name)
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG)
    return logger

Listing A.4: utils/dbobjects.py1 # -*- coding: utf-8 -*-2

3 class Schema(object):4

5 def __init__(self, ctid, oid, name):6 self.ctid = ctid7 self.oid = oid8 self.name = name9

10 def __repr__(self):11 return u’<Schema {} ({})>’.format(self.name, self.oid)12

13

14 class Table(object):15

16 def __init__(self, schema, ctid, oid, name, con=None, internal_name=None):17 self.schema = schema18 self.ctid = ctid19 self.oid = oid20 self.name = name21 self.columns = []22 self.con = con23 self._internal_name = internal_name24 self.update = None25

26 def __repr__(self):27 return u’<Table {} ({})>’.format(self.name, self.oid)28

29 @property30 def long_name(self):31 return ’{}.{}’.format(self.schema.name, self.name)32

33 @property34 def internal_name(self):35 if not self._internal_name:36 self._internal_name = ’data_{}_{}_{}’.format(self.schema.name, self.name, self.update)37 return self._internal_name38

39 @property40 def internal_columns(self):41 yield CTIDColumn()42 for column in self.columns:43 yield column44 yield StartColumn()45 yield StopColumn()46

47 def add_column(self, ctid, name, number, type, length, internal_name=None):48 self.columns.append(Column(self, ctid, name, number, type, length, internal_name=internal_name))49


50 def data(self):51 with self.con.cursor() as curs:52 curs.execute(’SELECT ctid, * FROM {}’.format(self.long_name))53 for row in curs:54 yield row55

56

57 class Column(object):58

59 def __init__(self, table, ctid, name, number, type, length, internal_name=None):60 self.ctid = ctid61 self.table = table62 self.name = name63 self.number = number64 self.type = type65 self.length = length66 self._internal_name = internal_name67

68 def __repr__(self):69 return u’<Column {} {}({})>’.format(self.name, self.type, self.length)70

71 @property72 def internal_name(self):73 if not self._internal_name:74 self._internal_name = ’data_{}_{}’.format(self.name, self.table.update)75 return self._internal_name76

77

78 class CTIDColumn(object):79 internal_name = ’data_ctid’80 type = ’tid’81 length = -182

83

84 class StartColumn(object):85 internal_name = ’start’86 type = ’integer REFERENCES marty_updates(id) NOT NULL’87 length = -188

89

90 class StopColumn(object):91 internal_name = ’stop’92 type = ’integer REFERENCES marty_updates(id)’93 length = -1

Listing A.5: utils/inspector.py1 # -*- coding: utf-8 -*-2

3 from dbobjects import Schema, Table, Column, StartColumn, StopColumn4

5 class SlaveInspector(object):6

7 def __init__(self, con, logger=None):8 self.con = con9 self.db_oid = self._get_db_oid()

10 self.tabledict = {}11 self._system_tables = None12 self.pg_namespace = None13 if logger:14 self.logger = logger15 else:16 self.logger = logging.getLogger()17 self.logger.addHandler(logging.NullHandler())18

19 def _get_db_oid(self):20 with self.con.cursor() as curs:21 curs.execute(’SELECT oid FROM pg_database WHERE datname = current_database()’)22 row = curs.fetchone()23 return row[0]24

25 def schemas(self):26 with self.con.cursor() as curs:27 curs.execute("""28 SELECT ctid, oid, nspname29 FROM pg_namespace30 WHERE nspname NOT LIKE ’information_schema’ AND nspname NOT LIKE ’pg_%’31 """)32 for ctid, oid, name in curs:33 self.logger.info(’schema {}, {}, {}’.format(ctid, oid, name))34 yield Schema(ctid, oid, name)35

36 def tables(self, schema):37 """38 Missing:


39 indexes (relkind = i)40 sequences (relkind = S)41 views (relkind = v)42 materialized views (relkind = m)43 composite type (relkind = c)44 TOAST tables (relkind = t)45 foreign tables (relkind = f)46 """47 with self.con.cursor() as curs:48 curs.execute("""49 SELECT ctid, oid, relname, pg_catalog.pg_relation_filenode(oid) AS filenode50 FROM pg_class51 WHERE relnamespace = %s AND relkind = ’r’52 """, (schema.oid,))53 for ctid, oid, name, filenode in curs:54 self.logger.info(’table {}, {} ({})’.format(oid, name, filenode))55 table = Table(schema, ctid, oid, name, con=self.con)56 self.tabledict[filenode] = table57 yield table58

59 def columns(self, table):60 """61 Missing:62 arrays (attndims)63 data in TOAST tables (attstorage)64 not null (attnotnull)65 default value (atthasdef)66 attislocal?67 attinhcount?68 collation (attcollation)69 attoptions?70 attfdwoptions?71 """72 with self.con.cursor() as curs:73 curs.execute("""74 SELECT pg_attribute.ctid, attname, attnum, typname, atttypmod75 FROM pg_attribute76 LEFT JOIN pg_type ON pg_attribute.atttypid = pg_type.oid77 WHERE attrelid = %s AND attisdropped = false AND attnum > 078 ORDER BY attnum ASC79 """, (table.oid,))80 for ctid, name, number, type, length in curs:81 self.logger.info(’column {} {}({})’.format(name, type, length))82 table.add_column(ctid, name, number, type, length)83

84 @property85 def system_tables(self):86 """87 This looks up tables88 pg_namespace89 pg_class90 """91 if self._system_tables == None:92 self._system_tables = {}93 schema = Schema(None, None, ’pg_catalog’)94 with self.con.cursor() as curs:95 curs.execute("""96 SELECT ctid, oid, relname, pg_catalog.pg_relation_filenode(oid) as filenode97 FROM pg_class98 WHERE relname IN (’pg_namespace’, ’pg_class’, ’pg_attribute’)99 """)

100 for ctid, oid, name, filenode in curs:101 self.logger.info(’system table {}, {} ({})’.format(oid, name, filenode))102 table = Table(schema, ctid, oid, name, con=self.con)103 self._system_tables[filenode] = table104 return self._system_tables105

106 def get_schema(self, ctid=None, oid=None):107 query = ’SELECT ctid, oid, nspname FROM pg_namespace ’108 if oid:109 query += ’WHERE oid = %s’110 values = (oid,)111 else:112 query += ’WHERE ctid = %s’113 values = (ctid,)114 with self.con.cursor() as curs:115 curs.execute(query, values)116 ctid, oid, nspname = curs.fetchone()117 return Schema(ctid, oid, nspname)118

119 def get_table(self, ctid=None, oid=None):120 query = ’SELECT ctid, oid, relname, relnamespace, relkind FROM pg_class WHERE relkind = %s AND ’121 values = [’r’]122 if oid:123 query += ’oid = %s’124 values.append(oid)125 else:126 query += ’ctid = %s’127 values.append(ctid)128 with self.con.cursor() as curs:


129 curs.execute(query, values)130 ctid, oid, relname, relnamespace, relkind = curs.fetchone()131 schema = self.get_schema(oid=relnamespace)132 return Table(schema, ctid, oid, relname)133

134 def get_column(self, ctid=None, oid=None, update=None, internal_name=None):135 query = """136 SELECT pg_attribute.ctid, attrelid, attname, attnum, typname, atttypmod137 FROM pg_attribute138 LEFT JOIN pg_type ON pg_attribute.atttypid = pg_type.oid139 WHERE %s AND attisdropped = false AND attnum > 0140 ORDER BY attnum ASC141 """142 if oid:143 query %= ’attrelid = %s’144 values = (oid,)145 else:146 query %= ’pg_attribute.ctid = %s’147 values = (ctid,)148 with self.con.cursor() as curs:149 curs.execute(query, values)150 row = curs.fetchone()151 if not row:152 return None153 ctid, attrelid, attname, attnum, typname, atttypmod = row154 table = self.get_table(oid=attrelid)155 table.update = update156 return Column(table, ctid, attname, attnum, typname, atttypmod, internal_name=internal_name)157

158 def resume(self):159 self.logger.info(’resuming’)160 with self.con.cursor() as curs:161 curs.execute(’SELECT pg_xlog_replay_resume()’)162

163 def get(self, table, block, offset, cols=None):164 with self.con.cursor() as curs:165 if cols:166 cols = ’, ’.join(cols)167 else:168 cols = ’*’169 query = "SELECT {} FROM {} WHERE ctid = ’({},{})’"170 query = query.format(cols, table.long_name, block, offset)171 curs.execute(query)172 row = curs.fetchone()173 return row174

175

176 class HistoryInspector(object):177

178 def __init__(self, con, logger=None):179 self.con = con180 if logger:181 self.logger = logger182 else:183 self.logger = logging.getLogger()184 self.logger.addHandler(logging.NullHandler())185 self.update = self._update()186

187 def _update(self):188 with self.con.cursor() as curs:189 curs.execute("""190 SELECT id, time191 FROM marty_updates192 ORDER BY time DESC LIMIT 1193 """)194 update_id, time = curs.fetchone()195 self.logger.debug(’got update id {} from {}’.format(update_id, time))196 return update_id197

198 def schemas(self):199 with self.con.cursor() as curs:200 curs.execute("""201 SELECT _ctid, oid, name202 FROM marty_schemas203 WHERE start <= %(update_id)s AND (stop IS NULL OR stop > %(update_id)s)204 """, {’update_id’: self.update})205 for ctid, oid, name, in curs:206 yield Schema(ctid, oid, name)207

208 def tables(self, schema):209 with self.con.cursor() as curs:210 curs.execute("""211 SELECT oid, ctid, name, internal_name212 FROM marty_tables213 WHERE schema = %(schema_id)s214 AND start <= %(update_id)s AND (stop IS NULL OR stop > %(update_id)s)215 """, {’schema_id’: schema.oid, ’update_id’: self.update})216 for oid, ctid, name, internal_name in curs:217 yield Table(schema, ctid, oid, name, internal_name=internal_name)218


219 def columns(self, table):220 with self.con.cursor() as curs:221 curs.execute("""222 SELECT ctid, name, number, type, length, internal_name223 FROM marty_columns224 WHERE table_oid = %(table_oid)s225 AND start <= %(update_id)s AND (stop IS NULL OR stop > %(update_id)s)226 ORDER BY number ASC227 """, {’table_oid’: table.oid, ’update_id’: self.update})228 for ctid, name, number, type, length, internal_name in curs:229 table.add_column(ctid, name, number, type, length, internal_name=internal_name)

Listing A.6: utils/populator.py1 # -*- coding: utf-8 -*-2

3 import logging4

5 from dbobjects import Table, Column6

7

8 class HistoryPopulator(object):9

10 def __init__(self, con, logger=None):11 self.con = con12 self.update_id = None13 if logger:14 self.logger = logger15 else:16 self.logger = logging.getLogger()17 self.logger.addHandler(logging.NullHandler())18

19 def create_tables(self):20 self.logger.info(’creating tables’)21

22 with self.con.cursor() as curs:23 # marty_updates24 curs.execute("""25 CREATE TABLE IF NOT EXISTS marty_updates(26 id SERIAL PRIMARY KEY,27 time TIMESTAMP DEFAULT current_timestamp NOT NULL,28 mastertime TIMESTAMP NOT NULL29 )30 """)31

32 # marty_schemas33 curs.execute("""34 CREATE TABLE IF NOT EXISTS marty_schemas(35 _ctid tid NOT NULL,36 oid oid NOT NULL,37 name name NOT NULL,38 start integer REFERENCES marty_updates(id) NOT NULL,39 stop integer REFERENCES marty_updates(id)40 )41 """)42

43 # marty_tables44 curs.execute("""45 CREATE TABLE IF NOT EXISTS marty_tables(46 _ctid tid NOT NULL,47 oid oid NOT NULL,48 name name NOT NULL,49 schema oid NOT NULL,50 internal_name name NOT NULL,51 start integer REFERENCES marty_updates(id) NOT NULL,52 stop integer REFERENCES marty_updates(id)53 )54 """)55

56 # marty_columns57 curs.execute("""58 CREATE TABLE IF NOT EXISTS marty_columns(59 _ctid tid NOT NULL,60 table_oid oid NOT NULL,61 name name NOT NULL,62 number int2 NOT NULL,63 type name NOT NULL,64 length int4 NOT NULL,65 internal_name name NOT NULL,66 start integer REFERENCES marty_updates(id) NOT NULL,67 stop integer REFERENCES marty_updates(id)68 )69 """)70

71 def update(self, mastertime):


72 with self.con.cursor() as curs:73 curs.execute("""74 INSERT INTO marty_updates(mastertime) VALUES(%s) RETURNING id75 """, (mastertime,))76 self.update_id = curs.fetchone()[0]77 self.logger.debug(’new update id {}’.format(self.update_id))78

79 def add_schema(self, schema):80 self.logger.info(’adding schema {}’.format(schema.name))81

82 with self.con.cursor() as curs:83 curs.execute("""84 INSERT INTO marty_schemas(_ctid, oid, name, start) VALUES(%s, %s, %s, %s)85 """, (schema.ctid, schema.oid, schema.name, self.update_id))86

87 def remove_schema(self, ctid):88 self.logger.info(’removing schema {}’.format(ctid))89

90 with self.con.cursor() as curs:91 curs.execute("""92 UPDATE marty_schemas SET stop = %s WHERE _ctid = %s93 """, (self.update_id, ctid))94

95 def add_table(self, table):96 self.logger.info(’adding table {}’.format(table.long_name))97

98 update = self.update_id99 table.update = update

100 with self.con.cursor() as curs:101 curs.execute("""102 INSERT INTO marty_tables(_ctid, oid, name, schema, internal_name, start)103 VALUES(%s, %s, %s, %s, %s, %s)104 """, (table.ctid, table.oid, table.name, table.schema.oid, table.internal_name, self.update_id))105

106 self.logger.debug(curs.query)107

108 for column in table.columns:109 self.add_column(column)110

111 def remove_table(self, ctid):112 self.logger.info(’removing table {}’.format(ctid))113

114 with self.con.cursor() as curs:115 curs.execute("""116 UPDATE marty_tables SET stop = %s WHERE _ctid = %s117 """, (self.update_id, ctid))118

119 def add_column(self, column):120 self.logger.info(’adding column {} to {}’.format(column.name, column.table.long_name))121

122 with self.con.cursor() as curs:123 curs.execute("""124 INSERT INTO marty_columns(_ctid, table_oid, name, number, type, length, internal_name, start)125 VALUES(%s, %s, %s, %s, %s, %s, %s, %s)126 """, (column.ctid, column.table.oid, column.name, column.number, column.type,127 column.length, column.internal_name, self.update_id))128

129 self.logger.debug(curs.query)130

131 def remove_column(self, ctid):132 self.logger.info(’removing column {}’.format(ctid))133

        with self.con.cursor() as curs:
            curs.execute("""
                UPDATE marty_columns SET stop = %s WHERE _ctid = %s
            """, (self.update_id, ctid))

    def create_table(self, table):
        self.logger.info('creating table {}'.format(table.internal_name))

        with self.con.cursor() as curs:
            cols = ','.join('\n {} {}'.format(column.internal_name, column.type)
                            for column in table.internal_columns)
            curs.execute('CREATE TABLE {}({})'.format(table.internal_name, cols))

            self.logger.debug(curs.query)

            curs.execute('SELECT oid FROM pg_class WHERE relname = %s', (table.internal_name,))
            table_oid, = curs.fetchone()

            for column in table.columns:
                curs.execute("""
                    UPDATE pg_attribute
                    SET atttypmod = %s
                    WHERE attrelid = %s AND attname = %s
                """, (column.length, table_oid, column.internal_name))

    def add_data_column(self, column):
        with self.con.cursor() as curs:
            curs.execute("""
                ALTER TABLE {} ADD COLUMN {} {}
            """.format(column.table.internal_name, column.internal_name, column.type))

            curs.execute('SELECT oid FROM pg_class WHERE relname = %s', (column.table.internal_name,))
            table_oid, = curs.fetchone()

            curs.execute("""
                UPDATE pg_attribute
                SET atttypmod = %s
                WHERE attrelid = %s AND attname = %s
            """, (column.length, table_oid, column.internal_name))

    def fill_table(self, table):
        self.logger.info('filling table {}'.format(table.internal_name))

        table_name = table.internal_name
        column_names = ', '.join(column.internal_name for column in table.internal_columns)
        value_list = ', '.join('%s' for column in table.internal_columns)
        query = 'INSERT INTO {}({}) VALUES({})'.format(table_name, column_names, value_list)

        with self.con.cursor() as curs:
            for line in table.data():
                values = list(line)
                values.extend([self.update_id, None])
                curs.execute(query, values)

            self.logger.debug(curs.query)

    def insert(self, table, block, offset, row):
        self.logger.info('inserting to table {}'.format(table.internal_name))
        table_name = table.internal_name
        column_names = ', '.join(column.internal_name for column in table.internal_columns)
        value_list = ', '.join('%s' for column in table.internal_columns)
        query = 'INSERT INTO {}({}) VALUES({})'.format(table_name, column_names, value_list)

        values = ['({},{})'.format(block, offset)] + list(row) + [self.update_id, None]
        with self.con.cursor() as curs:
            curs.execute(query, values)
            self.logger.debug(curs.query)

    def delete(self, table, block, offset):
        self.logger.info('deleting from table {}'.format(table.internal_name))
        query = 'UPDATE {} SET stop = %s WHERE data_ctid = %s'.format(table.internal_name)
        values = [self.update_id, '({},{})'.format(block, offset)]
        with self.con.cursor() as curs:
            curs.execute(query, values)
            self.logger.debug(curs.query)

    def delete_all(self, table):
        query = 'UPDATE {} SET stop = %s WHERE stop IS NULL'.format(table.internal_name)
        values = (self.update_id,)
        with self.con.cursor() as curs:
            curs.execute(query, values)

    def get_table(self, ctid):
        with self.con.cursor() as curs:
            curs.execute("""
                SELECT _ctid, oid, name, internal_name
                FROM marty_tables
                WHERE _ctid = %s""", (ctid,))
            row = curs.fetchone()
            if not row:
                return
            ctid, oid, name, internal_name = row
            return Table(None, ctid, oid, name, internal_name=internal_name)

    def get_column(self, ctid):
        with self.con.cursor() as curs:
            curs.execute("""
                SELECT _ctid, table_oid, name, number, type, length, internal_name
                FROM marty_columns
                WHERE _ctid = %s""", (ctid,))
            row = curs.fetchone()
            if not row:
                return
            ctid, table_oid, name, number, type, length, internal_name = row
            return Column(None, ctid, name, number, type, length, internal_name=internal_name)


class ClonePopulator(object):

    def __init__(self, con, update, history_coninfo, logger=None):
        self.con = con
        self.update = update
        self.history_coninfo = history_coninfo
        if logger:
            self.logger = logger
        else:
            self.logger = logging.getLogger()
            self.logger.addHandler(logging.NullHandler())


    def initialize(self):
        with self.con.cursor() as curs:
            curs.execute('CREATE SCHEMA IF NOT EXISTS marty')
            curs.execute('CREATE EXTENSION IF NOT EXISTS dblink')
            curs.execute("""
                CREATE TABLE marty.bookkeeping(
                    view_name name UNIQUE,
                    local_table name,
                    cached boolean DEFAULT false,
                    coldef text,
                    remote_select_stmt text,
                    temp_table_def text
                )
            """)

            curs.execute("""
                CREATE FUNCTION coninfo() RETURNS text AS $$
                BEGIN
                    RETURN '{coninfo}';
                END;
                $$ LANGUAGE plpgsql;
            """.format(coninfo=self._dblink_connstr()))

            curs.execute("""
                CREATE FUNCTION view_select(my_view_name text) RETURNS SETOF RECORD AS $$
                DECLARE
                    view_info RECORD;
                BEGIN
                    SELECT * FROM marty.bookkeeping WHERE view_name = my_view_name INTO view_info;
                    IF NOT view_info.cached THEN
                        RAISE NOTICE 'fetching %', view_info.view_name;
                        EXECUTE ' INSERT INTO ' || view_info.local_table ||
                                ' SELECT ' || view_info.coldef ||
                                ' FROM dblink(''' || coninfo() || ''', ''' || view_info.remote_select_stmt || ''')'
                                ' AS ' || view_info.temp_table_def;
                        UPDATE marty.bookkeeping SET cached = true WHERE view_name = my_view_name;
                    END IF;
                    RETURN QUERY EXECUTE 'SELECT ' || view_info.coldef || ' FROM ' || view_info.local_table;
                END;
                $$ LANGUAGE plpgsql;
            """)

    def _dblink_connstr(self):
        parts = {
            'host': 'host={}',
            'port': 'port={}',
            'user': 'user={}',
            'password': 'password={}',
            'database': 'dbname={}',
        }
        return ' '.join(parts[key].format(value) for key, value in self.history_coninfo.iteritems())

    def create_schema(self, schema):
        self.logger.info('Creating schema {}'.format(schema.name))
        with self.con.cursor() as curs:
            curs.execute('CREATE SCHEMA IF NOT EXISTS {}'.format(schema.name))

    def create_table(self, table):
        self.logger.info('Creating table {}'.format(table.long_name))

        # Create table for local data
        table.update = self.update
        query = 'CREATE TABLE marty.{table}({cols})'
        cols = ','.join('\n "{name}" {type}'.format(name=column.name, type=column.type)
                        for column in table.columns)
        with self.con.cursor() as curs:
            curs.execute(query.format(table=table.internal_name, cols=cols))
            for column in table.columns:
                curs.execute("""
                    UPDATE pg_attribute
                    SET atttypmod = %(column_length)s
                    WHERE attrelid = %(table_name)s::regclass::oid AND attname = %(column_name)s
                """, {'column_length': column.length,
                      'table_name': 'marty.{}'.format(table.internal_name),
                      'column_name': column.name})


            # Create view that combines local and remote data
            my_cols = ', '.join(['"{}"'.format(col.name) for col in table.columns])
            temp_columns = ['"{name}" {type}'.format(name=col.name, type=col.type) for col in table.columns]
            temp_table_def = 't1({columns})'.format(columns=', '.join(temp_columns))

            view_query = """
                CREATE VIEW {view_name}
                AS SELECT {cols} FROM view_select('{view_name}')
                AS {tabledef};
            """
            curs.execute(view_query.format(view_name=table.long_name, cols=my_cols, tabledef=temp_table_def))

            bookkeeping_query = """
                INSERT INTO marty.bookkeeping(view_name, local_table, coldef, remote_select_stmt, temp_table_def)
                VALUES(%(view_name)s, %(local_table)s, %(coldef)s, %(remote_select_stmt)s, %(temp_table_def)s);
            """
            local_cols = ', '.join(['"{}"'.format(col.name) for col in table.columns])
            internal_cols = ', '.join([col.internal_name for col in table.columns])
            remote_select_stmt = 'SELECT {cols} FROM {table} WHERE start <= {update} and (stop IS NULL or stop > {update})'
            bookkeeping_values = {
                'view_name': table.long_name,
                'local_table': 'marty.' + table.internal_name,
                'coldef': local_cols,
                'remote_select_stmt': remote_select_stmt.format(cols=internal_cols, table=table.internal_name,
                                                                update=self.update),
                'temp_table_def': temp_table_def,
            }
            curs.execute(bookkeeping_query, bookkeeping_values)

            trigger_queries_values = {
                'trigger_name': table.long_name.replace('.', '_'),
                'local_table': 'marty.' + table.internal_name,
                'local_columns': my_cols,
                'new_values_insert': ', '.join(['NEW.' + col.name for col in table.columns]),
                'new_values_update': ', '.join(['"{name}" = NEW.{name}'.format(name=col.name) for col in table.columns]),
                'old_values': ' AND '.join(['"{name}" = OLD.{name}'.format(name=col.name) for col in table.columns]),
                'view_name': table.long_name,
            }

            # Create insert trigger for view
            insert_query = """
                CREATE FUNCTION {trigger_name}_insert() RETURNS trigger AS $$
                BEGIN
                    INSERT INTO {local_table}({local_columns}) VALUES({new_values_insert});
                    RETURN NEW;
                END;
                $$ LANGUAGE plpgsql;

                CREATE TRIGGER {trigger_name}_insert_trigger
                INSTEAD OF INSERT ON {view_name}
                FOR EACH ROW EXECUTE PROCEDURE {trigger_name}_insert();
            """
            curs.execute(insert_query.format(**trigger_queries_values))

            # Create update trigger for view
            update_query = """
                CREATE FUNCTION {trigger_name}_update() RETURNS trigger AS $$
                BEGIN
                    UPDATE {local_table} SET {new_values_update} WHERE {old_values};
                    RETURN NEW;
                END;
                $$ LANGUAGE plpgsql;

                CREATE TRIGGER {trigger_name}_update_trigger
                INSTEAD OF UPDATE ON {view_name}
                FOR EACH ROW EXECUTE PROCEDURE {trigger_name}_update();
            """
            curs.execute(update_query.format(**trigger_queries_values))

            # Create delete trigger for view
            delete_query = """
                CREATE FUNCTION {trigger_name}_delete() RETURNS trigger AS $$
                BEGIN
                    DELETE FROM {local_table} WHERE {old_values};
                    RETURN OLD;
                END;
                $$ LANGUAGE plpgsql;

                CREATE TRIGGER {trigger_name}_delete_trigger
                INSTEAD OF DELETE ON {view_name}
                FOR EACH ROW EXECUTE PROCEDURE {trigger_name}_delete();
            """
            curs.execute(delete_query.format(**trigger_queries_values))
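
The INSTEAD OF triggers above conclude the clone-side setup: the first read of a cloned view goes through view_select, which fetches the rows from the history database over dblink and caches them in the local marty schema, while subsequent writes are redirected into the same local tables. As a minimal usage sketch of how a finished clone is meant to be queried (the connection parameters and the public.users table are hypothetical and not part of the listings), a session against a clone could look roughly like this:

import psycopg2

# Connect to a clone database prepared by ClonePopulator (hypothetical credentials).
con = psycopg2.connect(dbname='clone_db', user='dev', host='localhost')
con.autocommit = True

with con.cursor() as curs:
    # The first read of a view calls view_select(), which caches the remote rows
    # in the corresponding marty.<internal_name> table via dblink.
    curs.execute('SELECT * FROM public.users LIMIT 10')
    print(curs.fetchall())

    # Writes fire the INSTEAD OF triggers and only modify the clone's local
    # tables; the master and history databases are left untouched.
    curs.execute('INSERT INTO public.users(id, name) VALUES (%s, %s)', (42, 'test user'))

con.close()

In this workflow the clone behaves like an ordinary PostgreSQL database from the application's point of view, and any divergence from production stays local to the clone.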

Listing A.7: postgres-9.3.3.patch

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 93ee070..20d6806 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4666,7 +4666,7 @@ recoveryPausesHere(void)

 	while (RecoveryIsPaused())
 	{
-		pg_usleep(1000000L);	/* 1000 ms */
+		pg_usleep(10000L);	/* 10 ms */
 		HandleStartupProcInterrupts();
 	}
 }
@@ -4867,6 +4867,7 @@ StartupXLOG(void)
 	XLogReaderState *xlogreader;
 	XLogPageReadPrivate private;
 	bool		fast_promoted = false;
+	bool		first_run = true;

 	/*
 	 * Read control file and check XLOG status looks valid.
@@ -5696,6 +5697,17 @@ StartupXLOG(void)
 			if (!recoveryContinue)
 				break;

+			/*
+			 * Pause the recovery after a transaction commit and also
+			 * at the start of the recovery
+			 */
+			if (first_run || record->xl_rmid == RM_XACT_ID) {
+				//&& (record->xl_info & ~XLR_INFO_MASK == XLOG_XACT_COMMIT_COMPACT ||
+				//    record->xl_info & ~XLR_INFO_MASK == XLOG_XACT_COMMIT)) {
+				SetRecoveryPause(true);
+				first_run = false;
+			}
+
 			/* Else, try to fetch the next WAL record */
 			record = ReadRecord(xlogreader, InvalidXLogRecPtr, LOG, false);
 		} while (record != NULL);


B. The Origin of the Name

When one builds a time machine and travels back in time, he or she can change the course of history. Even a small change in the historical events that led to today's reality can have a big impact on the progression of time and can lead to a completely different reality from the one we know. This is due to a phenomenon called the butterfly effect.

This phenomenon plays a large role in the plot of the 1989 movie Back to the Future Part II. In that movie Marty McFly, the protagonist, travels forward in time with his friend Doc Brown, where they discover a very different reality from the one they are used to. This can be attributed to their actions in the previous film, where they traveled back in time. Their previous time travel caused reality to diverge from the path it originally followed and sent it down an alternative timeline instead.

This is similar to what happens when a user creates a clone database. The clone diverges from the original master database as the user executes different queries than are executed on the master. The similarity between this behavior and the plot of Back to the Future Part II was the inspiration for the name Marty, which is the name of the protagonist in the film.
