Bioinformatics of TB: A case study in big data
Bioinformatics of TB: A case study in big data
Peter van Heusden ([email protected]) and Alan Christoffels
South African National Bioinformatics Institute, University of the Western Cape
Bellville, South Africa
January 2015
M. tuberculosis
- Widespread pathogen, responsible for 1.3 million deaths annually
- Genome size ~4 megabases
- Illumina NGS sequencing run ~2 gigabytes (uncompressed)
- Typical student project (2014):
  1. Gather data (on hard disk / over network)
  2. Run annotation pipeline (compute time < 1 week, disk used 20 to 40 GB)
  3. Examine significance of variation compared to "reference sequence"
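Step 3 of the project above can be sketched as a toy comparison of a sample genome against a reference. This is illustrative only: real pipelines call variants from mapped short reads rather than pre-aligned full sequences, and the sequences below are invented.

```python
# Toy SNP finder: report positions where an aligned sample sequence
# differs from the reference. A sketch, not a real variant caller.

def find_snps(reference: str, sample: str):
    """Return (position, ref_base, sample_base) for each mismatch."""
    assert len(reference) == len(sample), "sequences must be aligned"
    return [
        (i, r, s)
        for i, (r, s) in enumerate(zip(reference, sample))
        if r != s
    ]

# Invented example sequences, not real M. tuberculosis data
reference = "ATGCGTACGT"
sample = "ATGCGAACGT"
print(find_snps(reference, sample))  # [(5, 'T', 'A')]
```

A real analysis would then ask whether each variant falls in a gene of interest (e.g. a drug-resistance locus) rather than just listing positions.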
What's coming down the pipe
- In South Africa alone we have access to samples from several thousand strains of TB
- Low cost of sequencing means:
  1. More depth: capture the population of pathogens in a single patient
  2. More length: study the progression of infection in a patient
  3. More breadth: build an in-depth regional or global picture of pathogen sequence
Mapping a virulent TB strain
- "Evolutionary history and global spread of the Mycobacterium tuberculosis Beijing lineage", Merker et al. (2015)
- Beijing lineage strains, associated with Multi-Drug Resistant (MDR) TB, have spread worldwide
- Studied 4,987 isolates, fully sequenced 110 representatives
- Mapped 6 clonal complexes and an ancestral basal sublineage
- Paper presents a wealth of different data types:
  1. DNA reads
  2. Genotyping
  3. Phylogeny
  4. Geospatial
  5. Time series data
  6. Metadata on samples and experiments
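A study like this ties all six data types back to individual isolates. A minimal record for one isolate might look like the sketch below; every field name and value is hypothetical (rpoB S450L is used as an example rifampicin-resistance marker).

```python
# Hypothetical record for one isolate, combining the data types
# listed above. Names and values are invented for illustration.
isolate = {
    "id": "TB-0001",                   # sample metadata
    "reads": "TB-0001_R1.fastq.gz",    # DNA reads (file reference)
    "genotype": {"rpoB_S450L": True},  # genotyping (resistance marker)
    "lineage": "Beijing/CC1",          # phylogeny placement
    "location": (-33.93, 18.42),       # geospatial (lat, lon)
    "collected": "2013-07-15",         # time series
}

def is_mdr_candidate(record: dict) -> bool:
    """Flag isolates carrying the example rifampicin-resistance marker."""
    return record["genotype"].get("rpoB_S450L", False)

print(is_mdr_candidate(isolate))  # True
```

The point is not the structure itself but that analyses cut across all six fields at once, which is hard if each data type lives in a separate, unlinked file.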
More data: not more of the same
- Existing publishing puts the focus on results, not data
- Research data is very seldom FAIR:
  1. Findable
  2. Accessible
  3. Interoperable
  4. Reusable
(j.mp/fairdata1)
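The four principles can be made concrete as a checklist over a dataset record. The field names and the DOI below are invented for the sketch; real FAIR assessment is richer than four fields.

```python
# Illustrative checklist: which FAIR principles does a dataset
# record fail to support? Field names are invented.
REQUIRED = {
    "doi": "findable: persistent identifier",
    "download_url": "accessible: retrieval protocol",
    "format": "interoperable: standard data format",
    "license": "reusable: clear terms of reuse",
}

def fair_gaps(record: dict):
    """Return descriptions of the FAIR requirements a record lacks."""
    return [why for field, why in REQUIRED.items() if not record.get(field)]

record = {
    "doi": "10.1234/example-tb-dataset",  # hypothetical DOI
    "download_url": "https://example.org/tb.vcf",
    "format": "VCF",
    "license": None,  # no reuse terms stated
}
print(fair_gaps(record))  # ['reusable: clear terms of reuse']
```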
Change data handling, change research results

"In the 21st century, much of the vast volume of scientific data captured by new instruments on a 24/7 basis, along with information generated in the artificial worlds of computer models, is likely to reside forever in a live, substantially publicly accessible, curated state for the purposes of continued analysis. This analysis will result in the development of many new theories!" (Jim Gray)
- "Big" in "Big Data" is not (only) about data volume
- Cheap pathogen sequencing is driving the complexity of questions that can be asked of data
- ...but only if data is FAIR
Why we’re not all riding to work on unicorns
"[W]e now have terrible data management tools for most of the science disciplines. ... When you go and look at what scientists are doing, day in and day out, in terms of data analysis, it is truly dreadful." (Jim Gray)

- Who curates your data?
- How is it managed?
- Where is it analysed?
- And who gets access?
Future directions for SANBI (data management) research
- Research programme is necessarily modest:
  1. Cross-institution authentication, authorisation and movement of data
  2. New storage technologies
  3. Data repositories in addition to filesystems
  4. Storing and querying data on sequence collections, not individual samples
- Individual institutes can only prototype solutions: the scale of the challenge will require much broader collaborative development
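Point 4 above, querying a collection rather than sample-by-sample, can be sketched as an in-memory inverted index from variant to samples. This is a toy: a real system would index thousands of genomes in an on-disk store, and the sample names and variants here are invented.

```python
# Toy collection-level query: which samples carry a given variant?
from collections import defaultdict

class VariantIndex:
    """Inverted index mapping each variant to the samples carrying it."""

    def __init__(self):
        self._samples_by_variant = defaultdict(set)

    def add_sample(self, sample_id: str, variants: set):
        for v in variants:
            self._samples_by_variant[v].add(sample_id)

    def samples_with(self, variant: str):
        """All samples in the collection carrying `variant`, sorted."""
        return sorted(self._samples_by_variant[variant])

idx = VariantIndex()
idx.add_sample("TB-0001", {"rpoB_S450L", "katG_S315T"})
idx.add_sample("TB-0002", {"katG_S315T"})
print(idx.samples_with("katG_S315T"))  # ['TB-0001', 'TB-0002']
```

The design choice is to pay indexing cost once at ingest so that collection-wide queries avoid re-reading every sample's data.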