STAT115 Introduction to Computational Biology and Bioinformatics Spring 2012 Jun Liu & Xiaole Shirley Liu

Page 1: STAT115 Introduction to Computational Biology and Bioinformatics Spring 2012 Jun Liu & Xiaole Shirley Liu.

STAT115: Introduction to Computational Biology and Bioinformatics

Spring 2012

Jun Liu & Xiaole Shirley Liu

Page 2: STAT115 Introduction to Computational Biology and Bioinformatics Spring 2012 Jun Liu & Xiaole Shirley Liu.

Outline

• Course information

• Computational biology problems revolve around the Central Dogma of Molecular Biology

• Course structure (syllabus)

• Q&A

Page 3: STAT115 Introduction to Computational Biology and Bioinformatics Spring 2012 Jun Liu & Xiaole Shirley Liu.

STAT115 Lectures

• Instructors:

– Jun Liu: 617-495-1600, [email protected]

– Xiaole Shirley Liu: 617-632-2472, [email protected]

• Lecture: Tuesdays and Thursdays 11:30-1

– NWB, B-108 (Cambridge); Kresge 213 (Boston)

– Selected lecture notes available online after lecture

• Office hours

– J Liu: Tu 1-3 PM, SC 715

– XS Liu: Thu 2-4 PM, CLSB (3 Blackfan Circle) 11022, Boston

Page 4: STAT115 Introduction to Computational Biology and Bioinformatics Spring 2012 Jun Liu & Xiaole Shirley Liu.

STAT115 Labs and Web

• Teaching Fellows:

– Alejandro Zarat: [email protected]

– Daniel Fernandes: [email protected]

– Lab in Science Center FL 418D, Harvard Yard, W 6-8 pm (Google Maps link in the course syllabus).

• Course website: www.stat115.com

• Lecture notes (also on the course website): http://CompBio.pbwiki.com

Page 5: STAT115 Introduction to Computational Biology and Bioinformatics Spring 2012 Jun Liu & Xiaole Shirley Liu.

STAT115 Recommended Texts

Page 6: STAT115 Introduction to Computational Biology and Bioinformatics Spring 2012 Jun Liu & Xiaole Shirley Liu.

STAT115 Recommended Texts

Page 7: STAT115 Introduction to Computational Biology and Bioinformatics Spring 2012 Jun Liu & Xiaole Shirley Liu.

STAT115 Grading

• Homework: 80 pts

– 6 HW, 14*5+10=80 pts in total

– Problems to be solved by hand, running some software online to obtain results, and some coding (Python and R)

– 6 total late days, <= 3 days for a single HW

• Quizzes at selected lectures: 2*10=20 pts

– 10 highest normalized scores, 2 pts each

– All short answers, true/false, multiple choice

Page 8: STAT115 Introduction to Computational Biology and Bioinformatics Spring 2012 Jun Liu & Xiaole Shirley Liu.

Genome and gene

Entity    Definition                          Molecular mechanisms
Genome    Unit of information transmission    DNA replication
Gene      Unit of information expression      Transcription to RNA; translation to protein

Page 9: STAT115 Introduction to Computational Biology and Bioinformatics Spring 2012 Jun Liu & Xiaole Shirley Liu.

Nucleic acid and proteins

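Macromolecule         Backbone              Repeating unit                     Length       Role
Nucleic acid: DNA     Phosphodiester bonds  Deoxyribonucleotides (A, C, G, T)  10^3-10^8    Genome
Nucleic acid: RNA     Phosphodiester bonds  Ribonucleotides (A, C, G, U)       10^3-10^5    Genome
Nucleic acid: RNA     Phosphodiester bonds  Ribonucleotides (A, C, G, U)       10^3-10^4    Messenger
Nucleic acid: RNA     Phosphodiester bonds  Ribonucleotides (A, C, G, U)       10^2-10^3    Gene product
Protein               Peptide bonds         Amino acids (20 one-letter codes)  10^2-10^3    Gene product

Amino acid one-letter codes: A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y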

Page 10: STAT115 Introduction to Computational Biology and Bioinformatics Spring 2012 Jun Liu & Xiaole Shirley Liu.

  1 cctcttttcc gtggcgcctc ggaggcgttc agctgcttca agatgaagct gaacatctcc
 61 ttcccagcca ctggctgcca gaaactcatt gaagtggacg atgaacgcaa acttcgtact
121 ttctatgaga agcgtatggc cacagaagtt gctgctgacg ctctgggtga agaatggaag
181 ggttatgtgg tccgaatcag tggtgggaac gacaaacaag gtttccccat gaagcagggt
241 gtcttgaccc atggccgtgt ccgcctgcta ctgagtaagg ggcattcctg ttacagacca
301 aggagaactg gagaaagaaa gagaaaatca gttcgtggtt gcattgtgga tgcaaatctg
361 agcgttctca acttggttat tgtaaaaaaa ggagagaagg atattcctgg actgactgat
421 actacagtgc ctcgccgcct gggccccaaa agagctagca gaatccgcaa acttttcaat
481 ctctctaaag aagatgatgt ccgccagtat gttgtaagaa agcccttaaa taaagaaggt
541 aagaaaccta ggaccaaagc acccaagatt cagcgtcttg ttactccacg tgtcctgcag
601 cacaaacggc ggcgtattgc tctgaagaag cagcgtacca agaaaaataa agaagaggct
661 gcagaatatg ctaaactttt ggccaagaga atgaaggagg ctaaggagaa gcgccaggaa
721 caaattgcga agagacgcag actttcctct ctgcgagctt ctacttctaa gtctgaatcc
781 agtcagaaat aagatttttt gagtaacaaa taaataagat cagactctg

RPS6 (ribosomal protein S6) gene

The information in a gene is encoded by its DNA sequence

Page 11: STAT115 Introduction to Computational Biology and Bioinformatics Spring 2012 Jun Liu & Xiaole Shirley Liu.

  1 mklnisfpat gcqklievdd erklrtfyek rmatevaada lgeewkgyvv risggndkqg
 61 fpmkqgvlth grvrlllskg hscyrprrtg erkrksvrgc ivdanlsvln lvivkkgekd
121 ipgltdttvp rrlgpkrasr irklfnlske ddvrqyvvrk plnkegkkpr tkapkiqrlv
181 tprvlqhkrr rialkkqrtk knkeeaaeya kllakrmkea kekrqeqiak rrrlsslras
241 tsksessqk

RPS6 (ribosomal protein S6) protein sequence:

The structure of a protein is encoded by its amino acid sequence
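As a rough illustration of how a DNA sequence encodes a protein, the following Python sketch translates the first 120 nucleotides of the RPS6 sequence above with the standard genetic code. The function name `translate_from_first_atg` and the rule of starting at the first ATG are assumptions made for this example, not something stated in the lecture; real gene prediction is considerably more involved.

```python
# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate_from_first_atg(dna):
    """Translate from the first ATG until a stop codon or the end of the sequence.

    Hypothetical helper for illustration; assumes only A, C, G, T (any case)
    and spaces in the input.
    """
    dna = dna.upper().replace(" ", "")
    start = dna.find("ATG")
    if start == -1:
        return ""
    protein = []
    for pos in range(start, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[pos:pos + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


# First 120 nucleotides of the RPS6 sequence shown on the slide.
rps6_fragment = (
    "cctcttttcc gtggcgcctc ggaggcgttc agctgcttca agatgaagct gaacatctcc "
    "ttcccagcca ctggctgcca gaaactcatt gaagtggacg atgaacgcaa acttcgtact"
)
print(translate_from_first_atg(rps6_fragment))
```

For this fragment the translation agrees with the beginning of the RPS6 protein sequence listed above (mklnisfpat gcqklievdd erklrt).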

Page 12: STAT115 Introduction to Computational Biology and Bioinformatics Spring 2012 Jun Liu & Xiaole Shirley Liu.

Nucleotide codes

A  Adenine              W  Weak (A or T)
G  Guanine              S  Strong (G or C)
C  Cytosine             M  Amino (A or C)
T  Thymine              K  Keto (G or T)
U  Uracil               B  Not A (G or C or T)
R  Purine (A or G)      H  Not G (A or C or T)
Y  Pyrimidine (C or T)  D  Not C (A or G or T)
N  Any nucleotide       V  Not T (A or G or C)
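One common use of these ambiguity codes is to describe a degenerate sequence pattern and search for it. The sketch below, in Python, turns a pattern written in the codes above into a regular expression. The helper names `iupac_to_regex` and `find_matches` and the pattern `GAYW` are made up for this illustration.

```python
import re

# Ambiguity codes from the table above, mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "R": "AG", "Y": "CT", "W": "AT", "S": "CG", "M": "AC", "K": "GT",
    "B": "CGT", "H": "ACT", "D": "AGT", "V": "ACG", "N": "ACGT",
}


def iupac_to_regex(pattern):
    """Convert a pattern of nucleotide codes into a regular expression string."""
    parts = []
    for code in pattern.upper():
        bases = IUPAC[code]
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)


def find_matches(pattern, sequence):
    """Return 0-based start positions of non-overlapping matches in a DNA sequence."""
    regex = re.compile(iupac_to_regex(pattern))
    return [match.start() for match in regex.finditer(sequence.upper())]


print(iupac_to_regex("GAYW"))              # GA[CT][AT]
print(find_matches("GAYW", "ttgacagatt"))  # [2, 6]
```

Note that `re.finditer` reports non-overlapping matches only; overlapping occurrences would need a lookahead or a manual scan.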

Page 13: STAT115 Introduction to Computational Biology and Bioinformatics Spring 2012 Jun Liu & Xiaole Shirley Liu.

The Four Nucleosides of DNA

dA dG dC dT

A nucleoside is a sugar, here deoxyribose, plus a base

dA = deoxyadenosine, etc.

PURINES: dA, dG; PYRIMIDINES: dC, dT

DNA is built from nucleotides

Page 14: STAT115 Introduction to Computational Biology and Bioinformatics Spring 2012 Jun Liu & Xiaole Shirley Liu.

Structure of DNA: Double helix

Page 15: STAT115 Introduction to Computational Biology and Bioinformatics Spring 2012 Jun Liu & Xiaole Shirley Liu.

Base Pairing

Page 16: STAT115 Introduction to Computational Biology and Bioinformatics Spring 2012 Jun Liu & Xiaole Shirley Liu.

A nucleotide is a phosphate, a sugar, and a purine or a pyrimidine base.

The monomeric units of nucleic acids are called nucleotides.

Page 17: STAT115 Introduction to Computational Biology and Bioinformatics Spring 2012 Jun Liu & Xiaole Shirley Liu.

Amino acid codes

Ala  A  Alanine
Arg  R  Arginine
Asn  N  Asparagine
Asp  D  Aspartic acid
Cys  C  Cysteine
Gln  Q  Glutamine
Glu  E  Glutamic acid
Gly  G  Glycine
His  H  Histidine
Ile  I  Isoleucine
Leu  L  Leucine
Lys  K  Lysine
Met  M  Methionine
Phe  F  Phenylalanine
Pro  P  Proline
Ser  S  Serine
Thr  T  Threonine
Trp  W  Tryptophan
Tyr  Y  Tyrosine
Val  V  Valine
Asx  B  Asn or Asp
Glx  Z  Gln or Glu
Sec  U  Selenocysteine
Unk  X  Unknown

Proteins are built from amino acids

Page 18: STAT115 Introduction to Computational Biology and Bioinformatics Spring 2012 Jun Liu & Xiaole Shirley Liu.

http://web.mit.edu/esgbio/www/lm/proteins/peptidebond.html

Page 19: STAT115 Introduction to Computational Biology and Bioinformatics Spring 2012 Jun Liu & Xiaole Shirley Liu.
Page 20: STAT115 Introduction to Computational Biology and Bioinformatics Spring 2012 Jun Liu & Xiaole Shirley Liu.

The diversity of protein structure

Page 21: STAT115 Introduction to Computational Biology and Bioinformatics Spring 2012 Jun Liu & Xiaole Shirley Liu.

Anfinsen 1961 ribonuclease re-naturing experiments: Sequence determines structure

Page 22: STAT115 Introduction to Computational Biology and Bioinformatics Spring 2012 Jun Liu & Xiaole Shirley Liu.

Central Dogma of Molecular Biology

[Diagram: DNA (DNA replication) → Transcription → RNA → Translation → Protein → Folded with function → Physiology]

Page 23: STAT115 Introduction to Computational Biology and Bioinformatics Spring 2012 Jun Liu & Xiaole Shirley Liu.

Central Dogma of Molecular Biology

• DNA → RNA → Protein

• Genome sequencing, assembly and annotation

– Sequence alignment (pairwise & multiple)

– Gene prediction

• Genome variation:

– Single base difference (SNP) and big copy number duplication / deletions

– Association studies

• Comparative genomics and phylogenies

Page 24: STAT115 Introduction to Computational Biology and Bioinformatics Spring 2012 Jun Liu & Xiaole Shirley Liu.

Case Study I: The Human Genome Race

• Human Genome Project: 1990-2003

– Originally 1990-2005

– Boosted by technology improvement (automation improved throughput and quality with reduced cost)

– Competition from Celera

• Informatics essential for both the public and private sequencing efforts

– Sequence assembly and gene prediction

– Working draft finished simultaneously in spring 2000

Page 25: STAT115 Introduction to Computational Biology and Bioinformatics Spring 2012 Jun Liu & Xiaole Shirley Liu.

Competing Sequencing Strategies

• Clone-by-clone and whole-genome shotgun

Page 26: STAT115 Introduction to Computational Biology and Bioinformatics Spring 2012 Jun Liu & Xiaole Shirley Liu.

Retail DNA Test

• TIME's Best Inventions (2008)

“Your genome used to be a closed book. Now a simple, affordable (399 USD) test can shed new light on everything from your intelligence to your biggest health risks. Say hello to your dna — if you dare” -- time.com

Page 27: STAT115 Introduction to Computational Biology and Bioinformatics Spring 2012 Jun Liu & Xiaole Shirley Liu.

1000 Genomes Project

• Sequencing the genomes of at least a thousand people from around the world to create the most detailed and medically useful picture to date of human genetic variation

Page 28: STAT115 Introduction to Computational Biology and Bioinformatics Spring 2012 Jun Liu & Xiaole Shirley Liu.

Central Dogma of Molecular Biology

• DNA → RNA → Protein

• RNA structure prediction

• Differential gene expression:

– Gene expression microarray and analysis, normalization, clustering, gene ontology and classification

• Transcription regulation

– Transcription factor motif finding, epigenetic regulation, transcription regulatory network

• Post-transcriptional regulation: mi/siRNA

Page 29: STAT115 Introduction to Computational Biology and Bioinformatics Spring 2012 Jun Liu & Xiaole Shirley Liu.

Case Study II: Cancer Classifications Using Microarrays

• A microarray contains hundreds to millions of tiny probes

• Simultaneously detect how much each gene is “on”

• Cancer type classification

– AML: acute myeloid leukemia

– ALL: acute lymphoblastic leukemia

– Check multiple samples of each type on microarrays

– Find good gene markers

Page 30: STAT115 Introduction to Computational Biology and Bioinformatics Spring 2012 Jun Liu & Xiaole Shirley Liu.

ALL vs AML

• Golub et al, Science 1999.

Page 31: STAT115 Introduction to Computational Biology and Bioinformatics Spring 2012 Jun Liu & Xiaole Shirley Liu.

ALL vs AML

Page 32: STAT115 Introduction to Computational Biology and Bioinformatics Spring 2012 Jun Liu & Xiaole Shirley Liu.

Central Dogma of Molecular Biology

• DNA → RNA → Protein

• Protein sequence motifs

• Protein structure prediction

• Mass spectrometry proteomics

• Protein interaction networks

Page 33: STAT115 Introduction to Computational Biology and Bioinformatics Spring 2012 Jun Liu & Xiaole Shirley Liu.

Case Study III: Is Tamiflu for you?

• Roche’s Oseltamivir (Tamiflu) is the only orally administered drug available for avian influenza (bird flu)

• 75 pediatric severe adverse events

– Fatalities, neuropsychiatric, and skin

– 69 in Japan

• Inhibits neuraminidase of flu

– The structure of its active site is homologous to human sialidases (HsNEU2)

– An Asian-specific SNP (~10%) changes R41 to Q

Page 34: STAT115 Introduction to Computational Biology and Bioinformatics Spring 2012 Jun Liu & Xiaole Shirley Liu.

Is Tamiflu for you?

• Tamiflu binds to R41Q much more strongly

– Molecular simulations

– Decreased sialidase activity → severe side effects

– Li et al, Cell Res, 2007

Page 35: STAT115 Introduction to Computational Biology and Bioinformatics Spring 2012 Jun Liu & Xiaole Shirley Liu.

Study of HIV drug resistance

Protease inhibitors (PIs) target the HIV-1 protease enzyme, which is responsible for the posttranslational processing of the viral gag- and gag-pol-encoded polyproteins to yield the structural proteins and enzymes of the virus.

Page 36: STAT115 Introduction to Computational Biology and Bioinformatics Spring 2012 Jun Liu & Xiaole Shirley Liu.

Data: can we detect drug resistance mutations?

• Protease sequences from treated patients (949 cases)

VVTIRIGGQLKEALLDTGAD

IVTIRIGGQLKEALLDTGAD

RVTIRIGGQLREALLDTGAD

• Sequences from untreated patients (4146 controls)

LVTIRIGGQLREALLDTGAD

IVTIRIGGQLKEALLDTGAD

LVTIRIGGQLKEALLDTGAD

Which mutations contribute to drug resistance?
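A natural first step toward answering this question is to tabulate, position by position, which residues appear in treated versus untreated patients. The Python sketch below does this for the six example fragments shown above. The function name `residue_counts_by_position` is hypothetical, positions are numbered within the 20-residue fragment rather than in protease coordinates, and six sequences are far too few for any inference; the real comparison would need the full 949 cases and 4146 controls plus a proper statistical test.

```python
from collections import Counter

# Example fragments copied from the slide (not the full data set).
treated = [
    "VVTIRIGGQLKEALLDTGAD",
    "IVTIRIGGQLKEALLDTGAD",
    "RVTIRIGGQLREALLDTGAD",
]
untreated = [
    "LVTIRIGGQLREALLDTGAD",
    "IVTIRIGGQLKEALLDTGAD",
    "LVTIRIGGQLKEALLDTGAD",
]


def residue_counts_by_position(cases, controls):
    """Return {position: (case_counts, control_counts)} for variable positions.

    Positions are 1-based within the aligned fragment. A position is kept
    only if more than one residue is observed across both groups.
    """
    table = {}
    for i in range(len(cases[0])):
        case_counts = Counter(seq[i] for seq in cases)
        control_counts = Counter(seq[i] for seq in controls)
        if len(case_counts + control_counts) > 1:
            table[i + 1] = (case_counts, control_counts)
    return table


table = residue_counts_by_position(treated, untreated)
for position, (case_counts, control_counts) in table.items():
    print(position, dict(case_counts), dict(control_counts))
```

In this toy sample only fragment positions 1 and 11 vary, and position 11 has identical residue counts in both groups, which shows why variability alone does not indicate an association with treatment.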

Page 37: STAT115 Introduction to Computational Biology and Bioinformatics Spring 2012 Jun Liu & Xiaole Shirley Liu.

Drug resistance mutations

• The IAS-USA Drug Resistance Mutations list in HIV-1 updated in Fall 2006

• For IDV, mutations on the list are 10, 20, 24, 32, 36, 46, 54, 71, 73, 77, 82, 84, 90

• The ones we detect: 10, 24, 32, 46, 54, 71, 73, 82, 90
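The overlap between the two lists can be checked with simple set operations. This is only a bookkeeping sketch in Python; the variable names are chosen for this example.

```python
# Positions copied from the two bullets above.
ias_usa_idv = {10, 20, 24, 32, 36, 46, 54, 71, 73, 77, 82, 84, 90}
detected = {10, 24, 32, 46, 54, 71, 73, 82, 90}

confirmed = sorted(detected & ias_usa_idv)  # detected and on the list
missed = sorted(ias_usa_idv - detected)     # on the list but not detected
novel = sorted(detected - ias_usa_idv)      # detected but not on the list

print(len(confirmed), "of", len(ias_usa_idv), "listed positions detected")
print("not detected:", missed)
print("detected but not listed:", novel)
```

Every detected position is on the IAS-USA list (9 of 13), and the listed positions not detected are 20, 36, 77 and 84.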

Page 38: STAT115 Introduction to Computational Biology and Bioinformatics Spring 2012 Jun Liu & Xiaole Shirley Liu.

Interactions

What is known:

• The occurrence of changes at L10, L24, M46, I54, A71, V82, I84, L90 was highly significantly correlated with phenotypic resistance.

• Minor mutations influence drug resistance only in combination with other mutations.

• 73+90, 32+47, 84+90, 46+54+82, 88+90

Our results are consistent with the above.

The story about the mutation combination {46, 54, 82}:

• Conditional independence: 46 – 82 – 54.

• A single mutation at 54 has no effect.

• The V82A mutation is the key: without it, the others have small effect.

Page 39: STAT115 Introduction to Computational Biology and Bioinformatics Spring 2012 Jun Liu & Xiaole Shirley Liu.

Zhang et al. (2010, PNAS)

Page 40: STAT115 Introduction to Computational Biology and Bioinformatics Spring 2012 Jun Liu & Xiaole Shirley Liu.

Human genome sequencing

• Human Genome Project: 13 years (1990-2003), $3 billion, 6 countries, thousands of researchers and technicians

• 2011: 4 genomes in 8 days, costing $3000 each.

• In 2-3 years: each genome in 1-2 days, for hundreds of dollars, producing huge amounts of data

• Bioinformatics: turn data to knowledge

Page 41: STAT115 Introduction to Computational Biology and Bioinformatics Spring 2012 Jun Liu & Xiaole Shirley Liu.

Gene expression microarrays

• In the 90s, gene chip, $2000/sample

• 2011: chips for multiple copies of 1000 genes, $5-10/sample

• Using a computational approach to infer the expression of ~20K genes from the observed expression of the 1000 genes.

• Used for medical diagnosis, large scale drug target screening

Page 42: STAT115 Introduction to Computational Biology and Bioinformatics Spring 2012 Jun Liu & Xiaole Shirley Liu.

Statistics?

Page 43: STAT115 Introduction to Computational Biology and Bioinformatics Spring 2012 Jun Liu & Xiaole Shirley Liu.

Quotes

• True logic of this world is in the calculus of probabilities --- J. C. Maxwell

• What we see is the solution to a computational problem, our brains compute the most likely causes from the photon absorptions within our eyes --- H. Helmholtz

Page 44: STAT115 Introduction to Computational Biology and Bioinformatics Spring 2012 Jun Liu & Xiaole Shirley Liu.

Beauty, Mathematics, Statistics, and Science

• Statistics: the only systematic way (that I know of) to connect mathematics with ordinary life activities

• Focus: studying and quantifying uncertainty; optimally extracting information; prediction

• Models: All models are wrong, but

– Even those imperfect ones are very useful!

– Used as a powerful mathematical framework for organizing our thoughts and integrating information

• Mathematicians and physicists take care of the “beauty-only” part, and we take care of the rest

Page 45: STAT115 Introduction to Computational Biology and Bioinformatics Spring 2012 Jun Liu & Xiaole Shirley Liu.

Recent Success Stories

• Mapping disease genes – genetics and genomics

• Random walk, Markov, page rank and

• Jim Simons making many billions of $$$

• Compressive sensing, sparsity, random matrix and …

Obama

Page 46: STAT115 Introduction to Computational Biology and Bioinformatics Spring 2012 Jun Liu & Xiaole Shirley Liu.

Two schools of thought in statistics

• Bayesian: using probability distribution as a direct measure of uncertainty

– Bayes Theorem:

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$

$$P(\theta \mid \text{Data}) = \frac{P(\text{Data} \mid \theta)\, P(\theta)}{P(\text{Data})} = \frac{P(\text{Data} \mid \theta)\, P(\theta)}{\int P(\text{Data} \mid \theta)\, P(\theta)\, d\theta}$$

$$P(\text{HeartDisease} \mid \text{BloodP}, C) = \frac{P(\text{BloodP}, C \mid \text{HeartD})\, P(\text{HeartD})}{P(\text{BloodP}, C)}$$

• Frequentist: embedding the observed event in a sequence of “imaginary replications”, like a false-positive / false-negative evaluation
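To make the Bayes theorem formula concrete, here is a small numeric sketch in Python for a two-hypothesis case such as the heart disease example. The function name `posterior` and all three probabilities are hypothetical values chosen for illustration, not figures from the lecture. The denominator P(B) is expanded by the law of total probability over "disease" and "no disease".

```python
def posterior(prior, p_evidence_given_h, p_evidence_given_not_h):
    """Return P(H | evidence) for a binary hypothesis H using Bayes theorem.

    P(evidence) = P(evidence | H) P(H) + P(evidence | not H) (1 - P(H)).
    """
    numerator = p_evidence_given_h * prior
    evidence = numerator + p_evidence_given_not_h * (1.0 - prior)
    return numerator / evidence


# Hypothetical numbers: 10% prior probability of heart disease; the observed
# blood pressure and C readings occur in 60% of patients with the disease
# and in 20% of patients without it.
p = posterior(prior=0.10, p_evidence_given_h=0.60, p_evidence_given_not_h=0.20)
print(round(p, 4))  # 0.25
```

With these assumed inputs the evidence raises the probability of disease from 0.10 to 0.25, because 0.06 / (0.06 + 0.18) = 0.25.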

Page 47: STAT115 Introduction to Computational Biology and Bioinformatics Spring 2012 Jun Liu & Xiaole Shirley Liu.

Q&A

• Is this course for me?

– Upper undergraduate and entry graduate students interested in computational biology

• Do I have the background?

– Biology knowledge is easy to accumulate

– Statistics: basic stats tests, probability, some linear algebra helps

– Programming: prior programming helps, although good logic and willingness to learn and work for it are more important

Page 48: STAT115 Introduction to Computational Biology and Bioinformatics Spring 2012 Jun Liu & Xiaole Shirley Liu.

Q&A

• STAT115 or STAT215?

– STAT215 if:

– You want to work on an exploratory research problem (either from the professors or on your own)

– You have better coding skills

Page 49: STAT115 Introduction to Computational Biology and Bioinformatics Spring 2012 Jun Liu & Xiaole Shirley Liu.

All biology is becoming computational, much the same way it has become molecular … Otherwise “low input, high throughput and no output science”

--- Sydney Brenner

2002 Nobel Prize