Postgres and the Genome

Jeff Pennington
Director, Translational Informatics
Center for Biomedical Informatics and Department of Pathology
The Children’s Hospital of Philadelphia

Outline

• Background
• Genome analysis in the clinic
• Application
• Database
• DB Tuning

DNA as Data

• 4 letter ‘alphabet’ of bases – A T C G
• 3,000,000,000 base pairs

• Sequence codes for biological function
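
To make the "DNA as data" scale concrete, here is a rough Python sketch that packs bases at 2 bits each and estimates the raw size of 3,000,000,000 base pairs. It assumes a plain A/C/G/T alphabet with no ambiguity codes; the bit mapping and function names are illustrative and not taken from Varify.

```python
# Illustrative 2-bit encoding of the four bases (the mapping is an assumption).
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def pack(seq: str) -> int:
    """Pack a DNA string into one integer, 2 bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | BASE_TO_BITS[base]
    return value


def unpack(value: int, length: int) -> str:
    """Reverse of pack(); length is needed because leading 'A's are zeros."""
    bases = []
    for _ in range(length):
        bases.append(BITS_TO_BASE[value & 0b11])
        value >>= 2
    return "".join(reversed(bases))


def packed_size_bytes(n_bases: int) -> int:
    """Bytes needed to store n_bases at 2 bits each, rounded up."""
    return (n_bases * 2 + 7) // 8


GENOME_BASES = 3_000_000_000
print(packed_size_bytes(GENOME_BASES))  # 750000000 bytes, about 750 MB
```

The point of the arithmetic is that the sequence itself is not large by database standards; the volume comes from variants and their annotations.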

Mutations

Clinical Mutation = ‘Variant’

Sequencing = 100K – 4M Variants
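
As a toy illustration of what a "variant" is, the following Python sketch compares a sample sequence against a reference and records each single-base difference. It is deliberately simplified: it assumes equal-length, already-aligned sequences and only handles substitutions. The `Variant` fields and `find_substitutions` are hypothetical names, not Varify's data model.

```python
from typing import List, NamedTuple


class Variant(NamedTuple):
    position: int  # 1-based position in the reference
    ref: str       # base in the reference
    alt: str       # base observed in the sample


def find_substitutions(reference: str, sample: str) -> List[Variant]:
    """Return one Variant per position where the sample differs."""
    if len(reference) != len(sample):
        raise ValueError("toy example only handles equal-length sequences")
    return [
        Variant(pos, ref_base, alt_base)
        for pos, (ref_base, alt_base) in enumerate(zip(reference, sample), start=1)
        if ref_base != alt_base
    ]


print(find_substitutions("GATTACA", "GACTACA"))
# [Variant(position=3, ref='T', alt='C')]
```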

VARIFY

VARIFY Architecture

• Varify Architecture
  – Three-tier web application
  – Harvest (http://harvest.research.chop.edu)
• JavaScript client
• Python server using Django ORM
• Postgres 9.2

http://harvest.research.chop.edu/

Database

• Physical – 9.2, RHEL VM, VMware w/ storage on host
• Round 1 – 4G RAM, 80G disk
• Round 2 – 32G RAM, 250G disk

Tuning

• max_connections – too big
• shared_buffers – amount of memory allocated to PG
• work_mem – amount of memory available to each sort
• default_statistics_target – gives the query planner something to work with

Resources

• Book: PostgreSQL 9.0 High Performance
  – Ch 5 and 6
  – Page 145
• Tools: pg_buffercache
• Benchmarking:
  – \timing
  – EXPLAIN
  – log_min_duration_statement = 5000
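
With log_min_duration_statement = 5000, Postgres logs every statement that runs for 5 seconds or longer, along with its duration. A minimal Python sketch for pulling those entries out of a log and ranking them follows. The sample log lines, the `variant` table name, and the helper name are invented for illustration, and the regular expression assumes the usual "duration: ... ms  statement: ..." wording, possibly preceded by a log_line_prefix.

```python
import re
from typing import Iterable, List, Tuple

# Matches the duration/statement part of a log line, whatever prefix precedes it.
DURATION_RE = re.compile(r"duration: (\d+(?:\.\d+)?) ms\s+statement: (.*)")


def slow_statements(
    log_lines: Iterable[str], threshold_ms: float = 5000.0
) -> List[Tuple[float, str]]:
    """Return (duration_ms, statement) pairs at or above threshold_ms, slowest first."""
    hits = []
    for line in log_lines:
        match = DURATION_RE.search(line)
        if match and float(match.group(1)) >= threshold_ms:
            hits.append((float(match.group(1)), match.group(2).strip()))
    return sorted(hits, reverse=True)


# Invented sample lines, for illustration only.
sample_log = [
    "LOG:  duration: 5210.550 ms  statement: SELECT * FROM variant LIMIT 25",
    "LOG:  checkpoint starting: time",
    "LOG:  duration: 7321.004 ms  statement: SELECT count(*) FROM variant",
]

for duration_ms, statement in slow_statements(sample_log):
    print(f"{duration_ms:10.1f} ms  {statement}")
```

Raising `threshold_ms` narrows the list to the worst offenders, which are the natural candidates for EXPLAIN.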

Tuning Round 1 (4G RAM)

• max_connections = 100
• shared_buffers = 1024MB (default 32MB)
• work_mem = 200MB (default 1MB)
  – Tried 1G, bad trade-off on count (slow) vs. list (not much faster)

Tuning Round 2 (32G RAM)

• max_connections = 100
• shared_buffers = 24576MB (Increased from 1024MB)
• work_mem = 150MB (Decreased from 200MB)
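
One way to reason about these numbers is a back-of-envelope memory budget. work_mem is granted per sort (or hash) operation, so a pessimistic upper bound is shared_buffers plus work_mem times max_connections. The Python sketch below applies that bound to the Round 1 and Round 2 settings. The formula is a simplifying assumption (every connection sorting at full work_mem at once), not a measurement from this system, and the function name is illustrative.

```python
def worst_case_mb(shared_buffers_mb: int, work_mem_mb: int,
                  max_connections: int, sorts_per_query: int = 1) -> int:
    """Pessimistic memory bound: every connection sorts at full work_mem at once."""
    return shared_buffers_mb + work_mem_mb * max_connections * sorts_per_query


# Settings copied from the Round 1 and Round 2 slides.
ROUNDS = {
    "Round 1 (4G RAM)": dict(ram_mb=4096, shared_buffers_mb=1024,
                             work_mem_mb=200, max_connections=100),
    "Round 2 (32G RAM)": dict(ram_mb=32768, shared_buffers_mb=24576,
                              work_mem_mb=150, max_connections=100),
}

for name, r in ROUNDS.items():
    total = worst_case_mb(r["shared_buffers_mb"], r["work_mem_mb"],
                          r["max_connections"])
    share = r["shared_buffers_mb"] / r["ram_mb"]
    print(f"{name}: shared_buffers is {share:.0%} of RAM, "
          f"worst case {total} MB vs {r['ram_mb']} MB available")
```

In both rounds the pessimistic bound exceeds physical RAM, so the settings only hold up if few connections sort at the same time. That is also why lowering work_mem looks prudent once shared_buffers takes most of the machine.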

Tuning Round 3

• Everything in Round 2
• default_statistics_target = 1000 (default 100)
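
Put together, the Round 3 settings amount to four lines of postgresql.conf. The short Python sketch below renders them from a dictionary, which can be handy when the same values are kept in deployment scripts; `render_conf` is a hypothetical helper, and the values are the ones from the slides.

```python
# Round 3 = Round 2 settings plus a larger statistics target.
ROUND_3 = {
    "max_connections": "100",
    "shared_buffers": "24576MB",
    "work_mem": "150MB",
    "default_statistics_target": "1000",
}


def render_conf(settings: dict) -> str:
    """Render settings as 'name = value' lines in postgresql.conf style."""
    return "\n".join(f"{name} = {value}" for name, value in settings.items())


print(render_conf(ROUND_3))
```

A higher default_statistics_target only affects the planner after the tables have been analyzed again, so an ANALYZE should follow the change.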
