2013 alumni-webinar

I’ve got the Big Data Blues C. Titus Brown [email protected] Microbiology, Computer Science, and BEACON

Transcript of 2013 alumni-webinar

Page 1: 2013 alumni-webinar

I’ve got the Big Data Blues

C. Titus Brown [email protected]

Microbiology, Computer Science, and BEACON

Page 2: 2013 alumni-webinar

Outline

1. Genetics 101 and 102 – what you need to know.
2. Marek’s Disease – chicken cancer.
3. Generating lots of data – the sequencing revolution.
4. The problems of data analysis and data integration.
5. Some preliminary results on Marek’s Disease.
6. An apparent digression: chess and computers.
7. My actual research :)

Page 3: 2013 alumni-webinar

Genetics 101: DNA to RNA to protein to phenotype…

http://commons.wikimedia.org/wiki/File:Spombe_Pop2p_protein_structure_rainbow.png; http://commons.wikimedia.org/wiki/File:Protein_CA2_PDB_12ca.png

Page 4: 2013 alumni-webinar

…plus diploidy (2x each chromosome)

Page 5: 2013 alumni-webinar

…plus regulation and interaction.

Page 6: 2013 alumni-webinar

PHYSICAL AGENTS

INFECTIOUS AGENTS

HORMONES

RADIATION

CANCER

GENETIC FACTORS

CHEMICAL CARCINOGENS

LIFESTYLE FACTORS

(slide courtesy Suga Subramanian)

Page 7: 2013 alumni-webinar

Herpesvirus and Cancer

• Epstein-Barr Virus
– Burkitt’s lymphoma
– Hodgkin’s lymphoma
– Nasopharyngeal carcinoma

• Herpes Virus-8
– Kaposi’s sarcoma
– Multicentric lymphoma

• Mardivirus
– Marek’s Disease

• Viral neoplastic disease
• Alpha-herpesvirus
• Model for Burkitt’s lymphoma

(slide courtesy Suga Subramanian)

Page 8: 2013 alumni-webinar

Clinical Signs: Asymmetric Paralysis

http://partnersah.vet.cornell.edu/avian-atlas/

Page 9: 2013 alumni-webinar

Visceral Lymphoma: Liver

(image panels: NORMAL, LYMPHOMA)

Courtesy: John Dunn, USDA

Page 10: 2013 alumni-webinar

Importance of Marek’s Disease

• Agricultural Impact
– Economic losses (2 billion)
– Viral evolution: Increased virulence
– Current Vaccines: Not enough
– Long term viral persistence

• Model System
– Human herpes viral infections
– Viral induced lymphoma

(slide courtesy Suga Subramanian)

Page 11: 2013 alumni-webinar

MAREK’S DISEASE VIRUS (MDV)

INBRED CHICKEN LINES

MD-RESISTANT LINE: LINE 62

MD-SUSCEPTIBLE LINE: LINE 73

GENETIC RESISTANCE TO MAREK’S DISEASE?

(slide courtesy Suga Subramanian)

Page 12: 2013 alumni-webinar

What happens when we infect?

Page 13: 2013 alumni-webinar

…how does the virus specifically interact with genes?

Page 14: 2013 alumni-webinar

…and what are the mechanisms of resistance?

Page 15: 2013 alumni-webinar

Digression: DNA sequencing

• Observation of actual DNA sequence
• Counting of molecules

Image: Werner Van Belle

Page 16: 2013 alumni-webinar

Fast, cheap, and easy to generate.

Image: Werner Van Belle

Page 17: 2013 alumni-webinar

Applying sequencing to Marek’s Disease

Page 18: 2013 alumni-webinar

Back to Marek’s disease:

Differentially expressed genes (DEG) due to infection

Gene GO Analysis, IPA Pathway Analysis

DEGs in Md5-infected and not in Md5ΔMeq-infected groups?
– YES: Meq-dependent DEGs
– NO: DEGs not dependent on Meq

Meq-dependent DEGs, by chicken line:
– DEGs in Line 6 and not in Line 7: Meq-dependent DEGs involved in MD resistance
– DEGs in Line 7 and not in Line 6: Meq-dependent DEGs involved in MD susceptibility
– Neither: Meq-dependent DEGs common to both lines

(slide courtesy Suga Subramanian)
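One way to read the DEG flowchart on this slide is as a handful of set operations. The Python sketch below follows that reading only; it is not the analysis pipeline that was actually used. The function name, the input layout, and the toy gene names are all hypothetical.

```python
def classify_degs(degs):
    """Split DEGs following one reading of the flowchart.

    `degs` maps (line, virus) -> set of gene names, where line is "6" or "7"
    and virus is "Md5" or "Md5dMeq" (a hypothetical layout for illustration).
    """
    # Meq-dependent: differentially expressed with Md5 but not with Md5 lacking Meq.
    meq_dep = {
        line: degs[(line, "Md5")] - degs[(line, "Md5dMeq")]
        for line in ("6", "7")
    }
    return {
        "resistance": meq_dep["6"] - meq_dep["7"],      # Line 6 only
        "susceptibility": meq_dep["7"] - meq_dep["6"],  # Line 7 only
        "common": meq_dep["6"] & meq_dep["7"],          # both lines
    }


# Toy input with made-up gene names.
toy = {
    ("6", "Md5"): {"geneA", "geneB", "geneC"},
    ("6", "Md5dMeq"): {"geneC"},
    ("7", "Md5"): {"geneB", "geneD", "geneE"},
    ("7", "Md5dMeq"): {"geneE"},
}
result = classify_degs(toy)
```

With the toy input, geneC and geneE drop out as not Meq-dependent, and the remaining genes fall into the three Meq-dependent groups.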

Page 19: 2013 alumni-webinar

LINE 6

MD-RESISTANCE: ROLE OF MEQ

MDV MDV-no Meq

Genes involved in MD-resistance that are regulated by Meq

Genes involved in MD-resistance that are not regulated by Meq

1031 1670

(slide courtesy Suga Subramanian)

Page 20: 2013 alumni-webinar

Pathway Analysis: MD resistance

(slide courtesy Suga Subramanian)

Page 21: 2013 alumni-webinar

LINE 7

MD-SUSCEPTIBILITY: ROLE OF MEQ

MDV MDV-no Meq

Genes involved in MD-susceptibility that are regulated by Meq

Genes involved in MD-susceptibility that are not regulated by Meq

650 540

(slide courtesy Suga Subramanian)

Page 22: 2013 alumni-webinar

Pathway Analysis: MD susceptibility

(slide courtesy Suga Subramanian)

Page 23: 2013 alumni-webinar

Next problem: data analysis & integration!

• Once you can generate virtually any data set you want…

• …the next problem becomes finding your answer in the data set!

• Think of it as a gigantic NSA treasure hunt: you know there are terrorists out there, but to find them you have to hunt through 1 bn phone calls a day…

Page 24: 2013 alumni-webinar

Digression: “Heuristics”

• What do computers do when the answer is either really, really hard to compute exactly, or actually impossible?

• They approximate! Or guess!

• The term “heuristic” refers to a guess, or shortcut procedure, that usually returns a pretty good answer.

Page 25: 2013 alumni-webinar

Often explicit or implicit tradeoffs between compute “amount” and quality of result

http://www.infernodevelopment.com/how-computer-chess-engines-think-minimax-tree
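The tradeoff between compute "amount" and quality can be made concrete with a depth-limited game-tree search. The sketch below is an illustration only: it uses a toy take-away game (take 1 to 3 stones, whoever takes the last stone wins) rather than chess, and the function name, the node counter, and the "guess 0 at the cutoff" heuristic are assumptions chosen for brevity.

```python
def negamax(stones, depth, counter):
    """Score the position for the player to move: +1 win, -1 loss, 0 unknown.

    `depth` is the search budget; `counter` is a one-element list that
    tracks how many positions were examined.
    """
    counter[0] += 1
    if stones == 0:
        return -1  # the opponent took the last stone, so the mover has lost
    if depth == 0:
        return 0   # out of budget: fall back to a crude guess ("unknown")
    return max(
        -negamax(stones - take, depth - 1, counter)
        for take in (1, 2, 3)
        if take <= stones
    )


shallow_nodes, deep_nodes = [0], [0]
shallow_score = negamax(5, 1, shallow_nodes)  # cheap, but only a guess
deep_score = negamax(5, 10, deep_nodes)       # more work, exact answer
```

The shallow search looks at far fewer positions but can only report "unknown", while the deep search spends more compute and finds that the player to move can force a win.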

Page 26: 2013 alumni-webinar

My actual research focus

What we do is think about ways to get computers to play chess better, by:

– Identifying better ways to guess;
– Speeding up the guessing process;
– Improving people’s ability to use the chess playing computer

Now, replace “play chess” with “analyze biological data”...

Page 27: 2013 alumni-webinar

My actual research focus…

We build tools that help experimental biologists work efficiently and correctly with large amounts of data, to help answer their scientific questions.

This touches on many problems, including:
• Computational and scientific correctness.
• Computational efficiency.
• Cultural divides between experimental biologists and computational scientists.
• Lack of training (biology and medical curricula devoid of math and computing).

Page 28: 2013 alumni-webinar

Not-so-secret sauce: “digital normalization”

• One primary step of one type of data analysis becomes 20-200x faster, 20-150x “cheaper”.

Page 29: 2013 alumni-webinar

http://en.wikipedia.org/wiki/JPEG

Lossy compression

Page 30: 2013 alumni-webinar

http://en.wikipedia.org/wiki/JPEG

Lossy compression

Page 31: 2013 alumni-webinar

http://en.wikipedia.org/wiki/JPEG

Lossy compression

Page 32: 2013 alumni-webinar

http://en.wikipedia.org/wiki/JPEG

Lossy compression

Page 33: 2013 alumni-webinar

http://en.wikipedia.org/wiki/JPEG

Lossy compression

Page 34: 2013 alumni-webinar

Restated:

Can we use lossy compression approaches to make downstream analysis faster and better? (Yes.)

~2 GB – 2 TB of single-chassis RAM
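To show what "lossy compression" of sequencing reads can look like, here is a simplified Python sketch of the general digital normalization idea: a read is kept only if the k-mers it contains have not yet been seen often, judged by their median count so far. This is an illustration under stated assumptions, not the lab's implementation: it uses exact dictionary counts rather than a memory-efficient counting structure, and the function names, `k`, and `cutoff` values are illustrative.

```python
from collections import Counter
from statistics import median


def kmers(read, k):
    """Return all overlapping substrings of length k in a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def normalize_reads(reads, k=4, cutoff=3):
    """Keep a read only while its median k-mer count is below `cutoff`."""
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k: nothing to estimate from
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)  # only kept reads add to the counts
    return kept


# Ten copies of an abundant read plus one rare read (toy sequences).
reads = ["ACGTTGCA"] * 10 + ["GGATCCAA"]
kept = normalize_reads(reads)
```

Redundant copies of the abundant read are discarded once it has been seen `cutoff` times, while the rare read is retained, so downstream steps see much less data with most of the information intact.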

Page 35: 2013 alumni-webinar

Some diginorm examples:

1. Assembly of the H. contortus parasitic nematode genome.

2. Assembly of two Midwest soil metagenomes, Iowa corn and Iowa prairie.

3. Reference-free assembly of the lamprey (P. marinus) transcriptome.

Page 36: 2013 alumni-webinar

1. The H. contortus problem

• A sheep parasite.

• ~350 Mbp genome

• Sequenced DNA from 6 individuals after whole genome amplification, estimated 10% heterozygosity (!?)

• Significant bacterial contamination.

(w/Robin Gasser, Paul Sternberg, and Erich Schwarz)

Page 37: 2013 alumni-webinar

H. contortus life cycle

Refs.: Nikolaou and Gasser (2006), Int. J. Parasitol. 36, 859-868; Prichard and Geary (2008), Nature 452, 157-158.

Page 38: 2013 alumni-webinar

Assembly after digital normalization

• Diginorm readily enabled assembly of a 404 Mbp genome with N50 of 15.6 kb;

• Post-processing led to 73-94% complete genome.

• Diginorm helped by making analysis possible.
– Highly variable population.
– Lots of contamination from microbes.
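For readers unfamiliar with the N50 statistic quoted above: it is the contig length such that contigs of that length or longer contain at least half of the assembled bases. A minimal Python sketch of that standard definition follows; the function name and the toy contig lengths are illustrative.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("N50 is undefined for an empty assembly")


toy_n50 = n50([2, 2, 2, 3, 3, 4, 8, 8])
```

In the toy assembly the two longest contigs (8 + 8) already cover half of the 32 total bases, so the N50 is 8.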

Page 39: 2013 alumni-webinar

Next steps with H. contortus

• Publish the genome paper

• Identification of antibiotic targets for treatment in agricultural settings (animal husbandry).

• Serving as “reference approach” for a wide variety of parasitic nematodes, many of which have similar genomic issues.

Page 40: 2013 alumni-webinar

2. Soil metagenome assembly

Page 41: 2013 alumni-webinar

A “Grand Challenge” dataset (DOE/JGI)

Page 42: 2013 alumni-webinar

Putting it in perspective:
Total equivalent of ~1200 bacterial genomes
Human genome ~3 billion bp

Assembly results for Iowa corn and prairie (2x ~300 Gbp soil metagenomes)

Total Assembly | Total Contigs (> 300 bp) | % Reads Assembled | Predicted protein coding
2.5 bill | 4.5 mill | 19% | 5.3 mill
3.5 bill | 5.9 mill | 22% | 6.8 mill

Adina Howe
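The "~1200 bacterial genomes" comparison can be checked with quick arithmetic. The Python sketch below assumes a typical bacterial genome of about 5 Mbp; that figure is an assumption for illustration and does not appear on the slide.

```python
# Total assembled bases across the two soil metagenomes (from the table).
total_assembly_bp = 2_500_000_000 + 3_500_000_000

# Assumed typical bacterial genome size (not stated on the slide).
assumed_bacterial_genome_bp = 5_000_000

# Human genome size as given on the slide.
human_genome_bp = 3_000_000_000

equivalent_bacterial_genomes = total_assembly_bp / assumed_bacterial_genome_bp
human_genome_multiples = total_assembly_bp / human_genome_bp
```

Under that assumption the two assemblies together come to about 1200 bacterial genomes' worth of sequence, or roughly twice the length of the human genome.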

Page 43: 2013 alumni-webinar

3. Sea lamprey gene expression

• Non-native
• Parasite of medium to large fishes

• Caused populations of host fishes to crash

Li Lab / Y-W C-D

Page 44: 2013 alumni-webinar

Transcriptome results

• Started with 5.1 billion reads from 50 different tissues.

(4 years of computational research, and about 1 month of compute time, GO HERE)

• Final assembly contains ~95% of genes (est.)
• This is an extra 40% over previous work.
• Enabling studies in –
– Basal vertebrate phylogeny
– Biliary atresia
– Evolutionary origin of brown fat (previously thought to be mammalian only!) – J Exp Biol. 2013
– Pheromonal response in adults

Page 45: 2013 alumni-webinar

What are the tissue level changes in gene expression that support regeneration?

Transcriptome analysis of a regenerating vertebrate after SCI

brain, spinal cord

RNA-Seq to determine differential expression profile after injury

Sampling >weekly

-/+ Dex

Ona Bloom

Page 46: 2013 alumni-webinar

Challenges ahead

• We need more people working at the interface
– “Priesthood” model doesn’t scale!
– Cultural shifts in biology needed…

• We need more data!
– Data often only makes sense in context of other data
– This is a hard sell: “if you give us 1000x as much data, we might start to develop some idea of what it means.”

• We actually know very little about biology still!

Page 47: 2013 alumni-webinar

Open science & sharing

• Science, and biology in particular, is in the middle of a transition to a “data intensive” field.

• The sharing ethos is not incentivized properly; you get more credit for discovering new stuff than for discoveries resulting from sharing.

• We are focused on sharing: methods, programs, educational materials…

Page 48: 2013 alumni-webinar

Being disruptive?

Possible initiative from my lab:

“We will analyze your data for you if we can make your data openly available in 1 yr.”

Will it work, or sink like a stone? Ask me in a year.

Page 49: 2013 alumni-webinar

MSU’s role in my research

• MSU provides nice infrastructure, great administrative support, and a truly excellent community (students, profs, and other researchers).

• MSU is also uniquely interdisciplinary in many ways; very few “hard” boundaries in biology research.

Page 50: 2013 alumni-webinar

Credits

• Marek’s Disease: Suga Subramanian and Hans Cheng (USDA)
• Haemonchus: Erich Schwarz (Caltech/Cornell), Paul Sternberg (Caltech), Robin Gasser (U. Melbourne)
• Lamprey: Weiming Li (MSU), Ona Bloom (Feinstein), Jen Morgan (MBL/Woods Hole)
• Great Prairie: Jim Tiedje (MSU), Janet Jansson (LBL), Susanna Tringe (Joint Genome Inst.)

Funding: MSU; USDA; NSF; NIH.

Drop me a line – [email protected]