Analyzing Time Course Data: How can we pick the disappearing needle across multiple haystacks?...

Analyzing Time Course Data: How can we pick the disappearing needle across multiple haystacks? IEEE-HPEC Bioinformatics Challenge Day Dr. C. Nicole Rosenzweig 10 September 2012 UNCLASSIFIED//APPROVED FOR PUBLIC RELEASE

Analyzing Time Course Data: How can we pick the disappearing needle across multiple haystacks?

IEEE-HPEC Bioinformatics Challenge Day

Dr. C. Nicole Rosenzweig

10 September 2012

UNCLASSIFIED//APPROVED FOR PUBLIC RELEASE

Problem Statement

• Environmental sampling is computationally complex for several reasons:

– The reference database contains limited coverage of plant, fungal, and other material

– Organism representation changes over time and across distance in ways that we have not adequately modeled or captured

• To respond to these limitations, time course studies can be used to identify the genomic component of interest

Objective

Invent a novel analytical method to identify changes observed among samples.

Subgoals:

• Changes observed need some measure of confidence. Changes can be clustered in confidence “windows”.

• Generate time series representation of multiple datasets.

• Compress data in order to efficiently handle huge datasets.
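The time-series and confidence subgoals could fit together roughly as sketched below. This is a minimal sketch, not the method proposed in the talk. It assumes each timepoint has already been reduced to a table of feature counts (taxa, read clusters, or similar). The feature names, the counts, and the choice of a two-proportion z-test as the confidence measure are all illustrative assumptions.

```python
import math


def relative_abundance(counts):
    """Convert raw feature counts for one timepoint into fractions of the total."""
    total = sum(counts.values())
    return {feature: n / total for feature, n in counts.items()}


def time_series(samples):
    """Build feature -> [abundance at t0, t1, ...]; absent features count as 0."""
    features = sorted(set().union(*samples))
    profiles = [relative_abundance(s) for s in samples]
    return {f: [p.get(f, 0.0) for p in profiles] for f in features}


def two_proportion_z(x1, n1, x2, n2):
    """z statistic for the difference between proportions x1/n1 and x2/n2."""
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (x1 / n1 - x2 / n2) / se if se > 0 else 0.0


def two_sided_p(z):
    """Two-sided p-value under a standard normal approximation."""
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical counts for three daily samples (not real data).
day1 = {"taxon_A": 500, "taxon_B": 480, "needle": 20}
day2 = {"taxon_A": 510, "taxon_B": 485, "needle": 5}
day3 = {"taxon_A": 505, "taxon_B": 495}

series = time_series([day1, day2, day3])

# Compare the "needle" between the first and last day.
z = two_proportion_z(day1["needle"], sum(day1.values()),
                     day3.get("needle", 0), sum(day3.values()))
p = two_sided_p(z)
```

A per-feature p-value of this kind is only one possible confidence measure. With many features it would need a multiple-testing correction, and the normal approximation is weak when counts are very small.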

Environmental Sample Described

Metagenomic datasets (FASTQ format) from environmental samples taken at 3 timepoints.

• Environmental sample taken once a day for 3 days.

• Each day, biological material and particulates captured in a buffer that is not DNA-free.

• Particulates (inorganic or plant material) removed by centrifugation, DNA extracted, and material sequenced on an Illumina HiSeq 2000.

Issues to Consider

• BLAST is computationally expensive and time consuming. For many time-course samples, the system will be bogged down in this analysis.

– What can be done to simplify the dataset to reduce the computational burden of BLAST/megaBLAST?

• Bacteria, viruses, and fungi that do not cause human disease are poorly represented in reference databases. Therefore, much of the genomic data will appear as ‘unknown’.

– How can we cluster or categorize genomic data without a good reference?

• Confidence assessment is a difficult problem.
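One simple way to shrink a dataset before a BLAST-style search is to collapse exact duplicate reads and carry their counts forward, so that each distinct sequence is searched only once. The sketch below illustrates that idea under stated assumptions: FASTQ records are exactly four lines each (no wrapped sequences), and the three reads in the example are made up. It is not a step described in the talk.

```python
import io
from collections import Counter


def read_fastq(handle):
    """Yield the sequence line of each 4-line FASTQ record."""
    while True:
        header = handle.readline()
        if not header:          # end of file
            break
        seq = handle.readline().strip()
        handle.readline()       # '+' separator line
        handle.readline()       # quality line (ignored here)
        if seq:
            yield seq


def dereplicate(handle):
    """Map each distinct read sequence to the number of times it occurs."""
    return Counter(read_fastq(handle))


# Hypothetical three-read FASTQ file held in memory.
example = io.StringIO(
    "@r1\nACGT\n+\nIIII\n"
    "@r2\nACGT\n+\nIIII\n"
    "@r3\nTTGA\n+\nIIII\n"
)
unique = dereplicate(example)
```

Only the keys of `unique` would then need to be searched, with the counts used to restore abundances afterwards. Exact-match collapsing ignores sequencing errors and quality scores, so it is a lower bound on what clustering-based reduction could achieve.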

Tools to provide preliminary analysis: Improved speed with lower accuracy

Tools to analyze genomic data that do not rely on BLAST:

• Analytically equivalent (almost), but computationally improved approaches

– Reference mapping tools, such as Bowtie (SourceForge)

Tools to provide preliminary analysis: Improved speed with lower accuracy

Tools to analyze genomic data that do not rely on BLAST:

• Heuristics

– 16S/23S rDNA analysis. Because these genomic sequences are highly conserved, organisms that have not been sequenced can still be included in a relative quantitation of bacterial species in a sample.

– Pro: The 16S database is smaller than NCBI RefSeq. Also, this approach could be joined with other heuristics to selectively evaluate a subset of reads within a large dataset.

– Con: In the best of cases, the granularity of analysis is low. Additional analysis would be required based on the output, and this would likely rely on BLAST.

Tools to provide preliminary analysis: Improved speed with lower accuracy

Tools to analyze genomic data that do not rely on BLAST:

• Heuristics

– CloVR, from the Institute for Genome Sciences, University of Maryland: completes 16S classification and alignment to capture diversity. Developed for 16S ribosomal RNA amplicon sequencing.

– The CloVR-16S pipeline employs several well-known phylogenetic tools and protocols:

1. QIIME – a Python-based workflow package, allowing for sequence processing and phylogenetic analysis using different methods including the phylogenetic distance metric UniFrac, UCLUST, PyNAST and the RDP Bayesian classifier;

2. UCHIME – a tool for rapid identification of chimeric 16S sequence fragments;

3. Mothur – a C++-based software package for 16S analysis;

4. Metastats and custom R scripts used to generate additional statistical and graphical evaluations.

Tools to provide preliminary analysis: Improved speed with lower accuracy

http://clovr.org/wp-content/uploads/2010/07/clovrfig11.png

Tools to provide preliminary analysis: Improved speed with lower accuracy

Tools to analyze genomic data that do not rely on BLAST:

• Pattern-matching programs:

– K-mer analysis: K-mer distributions have been observed to be well-preserved among related strains/species. These could be clustered into groups, allowing for directed post-k-mer analysis.

– Amino acid K-mers can be used to identify homologous genes.

– Microbes are present everywhere, but the reference material available in NCBI is not a uniform representation of existing flora and fauna.
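The k-mer idea above can be illustrated with a small reference-free comparison: count the k-mers in each sequence and compare the resulting profiles. The sketch below uses cosine similarity and k = 4 purely as illustrative choices, and the sequences are made up. It also ignores reverse complements, which a real analysis would likely need to handle.

```python
import math
from collections import Counter

DNA = set("ACGT")


def kmer_profile(seq, k=4):
    """Count overlapping k-mers, skipping any that contain non-ACGT symbols."""
    seq = seq.upper()
    kmers = (seq[i:i + k] for i in range(len(seq) - k + 1))
    return Counter(kmer for kmer in kmers if set(kmer) <= DNA)


def cosine_similarity(a, b):
    """Cosine of the angle between two k-mer count vectors (0.0 if either is empty)."""
    dot = sum(a[kmer] * b[kmer] for kmer in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical sequences (not real data).
profile_1 = kmer_profile("ACGTACGT")
profile_2 = kmer_profile("ACGTACGA")
profile_3 = kmer_profile("CCCCCCCC")

similar = cosine_similarity(profile_1, profile_2)
dissimilar = cosine_similarity(profile_1, profile_3)
```

Pairwise similarities of this kind could feed a standard clustering step, so that only one representative per cluster goes on to slower, reference-based analysis.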

Objective

Invent a novel analytical method to identify changes observed among samples.

Subgoals:

• Changes observed need some measure of confidence. Changes can be clustered in confidence “windows”.

• Generate time series representation of multiple datasets.

• Compress data in order to efficiently handle huge datasets.