Bioinformatics of TB: A case study in big data
Bioinformatics of TB: A case study in big data
Peter van Heusden ([email protected]) and Alan Christoffels
South African National Bioinformatics Institute, University of the Western Cape
Bellville, South Africa
January 2015
M. tuberculosis
- Widespread pathogen, responsible for 1.3 million deaths annually
- Genome size ~4 megabases
- Illumina NGS sequencing run ~2 gigabytes (uncompressed)
- Typical student project (2014):
  1. Gather data (on hard disk / over network)
  2. Run annotation pipeline (compute time < 1 week, disk used 20 to 40 GB)
  3. Examine significance of variation compared to "reference sequence"
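Step 3 of the project above can be sketched as a toy comparison of a sample genome against a reference. This is illustrative only: real pipelines call variants from mapped short reads rather than pre-aligned full sequences, and the sequences below are invented.

```python
# Toy SNP finder: report positions where an aligned sample sequence
# differs from the reference. A sketch, not a real variant caller.

def find_snps(reference: str, sample: str):
    """Return (position, ref_base, sample_base) for each mismatch."""
    assert len(reference) == len(sample), "sequences must be aligned"
    return [
        (i, r, s)
        for i, (r, s) in enumerate(zip(reference, sample))
        if r != s
    ]

# Invented example sequences, not real M. tuberculosis data
reference = "ATGCGTACGT"
sample = "ATGCGAACGT"
print(find_snps(reference, sample))  # [(5, 'T', 'A')]
```

A real analysis would then ask whether each variant falls in a gene of interest (e.g. a drug-resistance locus) rather than just listing positions.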
What's coming down the pipe
- In South Africa alone we have access to samples from several thousand strains of TB
- Low cost of sequencing means:
  1. More depth: capture the population of pathogens in a single patient
  2. More length: study the progression of infection in a patient
  3. More breadth: build an in-depth regional or global picture of pathogen sequence
Mapping a virulent TB strain
- "Evolutionary history and global spread of the Mycobacterium tuberculosis Beijing lineage", Merker et al. (2015)
- Beijing lineage strains, associated with Multi-Drug Resistant (MDR) TB, have spread worldwide
- Studied 4,987 isolates, fully sequenced 110 representatives
- Mapped 6 clonal complexes and an ancestral basal sublineage
- Paper presents a wealth of different data types:
  1. DNA reads
  2. Genotyping
  3. Phylogeny
  4. Geospatial
  5. Time series data
  6. Metadata on samples and experiments
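A study like this ties all six data types back to individual isolates. A minimal record for one isolate might look like the sketch below; every field name and value is hypothetical (rpoB S450L is used as an example rifampicin-resistance marker).

```python
# Hypothetical record for one isolate, combining the data types
# listed above. Names and values are invented for illustration.
isolate = {
    "id": "TB-0001",                   # sample metadata
    "reads": "TB-0001_R1.fastq.gz",    # DNA reads (file reference)
    "genotype": {"rpoB_S450L": True},  # genotyping (resistance marker)
    "lineage": "Beijing/CC1",          # phylogeny placement
    "location": (-33.93, 18.42),       # geospatial (lat, lon)
    "collected": "2013-07-15",         # time series
}

def is_mdr_candidate(record: dict) -> bool:
    """Flag isolates carrying the example rifampicin-resistance marker."""
    return record["genotype"].get("rpoB_S450L", False)

print(is_mdr_candidate(isolate))  # True
```

The point is not the structure itself but that analyses cut across all six fields at once, which is hard if each data type lives in a separate, unlinked file.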
More data: not more of the same
- Existing publishing puts the focus on results, not data
- Research data is very seldom FAIR:
  1. Findable
  2. Accessible
  3. Interoperable
  4. Reusable
(j.mp/fairdata1)
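The four principles can be made concrete as a checklist over a dataset record. The field names and the DOI below are invented for the sketch; real FAIR assessment is richer than four fields.

```python
# Illustrative checklist: which FAIR principles does a dataset
# record fail to support? Field names are invented.
REQUIRED = {
    "doi": "findable: persistent identifier",
    "download_url": "accessible: retrieval protocol",
    "format": "interoperable: standard data format",
    "license": "reusable: clear terms of reuse",
}

def fair_gaps(record: dict):
    """Return descriptions of the FAIR requirements a record lacks."""
    return [why for field, why in REQUIRED.items() if not record.get(field)]

record = {
    "doi": "10.1234/example-tb-dataset",  # hypothetical DOI
    "download_url": "https://example.org/tb.vcf",
    "format": "VCF",
    "license": None,  # no reuse terms stated
}
print(fair_gaps(record))  # ['reusable: clear terms of reuse']
```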
Change data handling, change research results

"In the 21st century, much of the vast volume of scientific data captured by new instruments on a 24/7 basis, along with information generated in the artificial worlds of computer models, is likely to reside forever in a live, substantially publicly accessible, curated state for the purposes of continued analysis. This analysis will result in the development of many new theories!" (Jim Gray)
- "Big" in "Big Data" is not (only) about data volume
- Cheap pathogen sequencing is driving the complexity of questions that can be asked of data
- ...but only if data is FAIR
Why we’re not all riding to work on unicorns
"[W]e now have terrible data management tools for most of the science disciplines. ... When you go and look at what scientists are doing, day in and day out, in terms of data analysis, it is truly dreadful." (Jim Gray)

- Who curates your data?
- How is it managed?
- Where is it analysed?
- And who gets access?
Future directions for SANBI (data management) research
- Research programme is necessarily modest:
  1. Cross-institution authentication, authorisation and movement of data
  2. New storage technologies
  3. Data repositories in addition to filesystems
  4. Storing and querying data on sequence collections, not individual samples
- Individual institutes can only prototype solutions: the scale of the challenge will require much broader collaborative development
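Point 4 above, querying a collection rather than sample-by-sample, can be sketched as an in-memory inverted index from variant to samples. This is a toy: a real system would index thousands of genomes in an on-disk store, and the sample names and variants here are invented.

```python
# Toy collection-level query: which samples carry a given variant?
from collections import defaultdict

class VariantIndex:
    """Inverted index mapping each variant to the samples carrying it."""

    def __init__(self):
        self._samples_by_variant = defaultdict(set)

    def add_sample(self, sample_id: str, variants: set):
        for v in variants:
            self._samples_by_variant[v].add(sample_id)

    def samples_with(self, variant: str):
        """All samples in the collection carrying `variant`, sorted."""
        return sorted(self._samples_by_variant[variant])

idx = VariantIndex()
idx.add_sample("TB-0001", {"rpoB_S450L", "katG_S315T"})
idx.add_sample("TB-0002", {"katG_S315T"})
print(idx.samples_with("katG_S315T"))  # ['TB-0001', 'TB-0002']
```

The design choice is to pay indexing cost once at ingest so that collection-wide queries avoid re-reading every sample's data.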