CS 598SS Probabilistic Methods in Biological Sequence Analysis


Page 1: CS 598SS Probabilistic Methods in Biological Sequence Analysis

CS 598SS: Probabilistic Methods in Biological Sequence Analysis

Saurabh Sinha

Page 2: CS 598SS Probabilistic Methods in Biological Sequence Analysis

What is the course about?

• Bioinformatics / Computational Biology

• Tools for analyzing genomes

• Probabilistic methods

Page 3: CS 598SS Probabilistic Methods in Biological Sequence Analysis

What is the course format?

• Research course

• Lectures by instructor

• Student presentations of research papers
– 1 or 2 paper(s) per student

• Research project & presentation
– Typically, 2 students per project
– 30-minute presentation at end of course

Page 4: CS 598SS Probabilistic Methods in Biological Sequence Analysis

Grading

• Project: 40%

• Paper presentation: 25%

• Assignments and/or tests: 25%

• Participation: 10%

• Grade distribution

Page 5: CS 598SS Probabilistic Methods in Biological Sequence Analysis

Expectations

• Programming skills (for the project)

• Basic exposure to probability theory

• Basic exposure to algorithms

Page 6: CS 598SS Probabilistic Methods in Biological Sequence Analysis

What you can do at the end of the course

• Start working on research projects in bioinformatics: biological sequence analysis

• Use principled approaches, supported by probability theory, instead of ad hoc methods

• Join me as a graduate advisee?

Page 7: CS 598SS Probabilistic Methods in Biological Sequence Analysis

Administrative Details

• Instructor:
– Saurabh Sinha
– Room 2122, Siebel Center
– Email: [email protected]

• Class hrs: Tue & Thurs, 2:00pm - 3:15pm, 1131SC

• CRN: 43781

• Credits: 4 graduate hrs

• Welcome to sit in, if not taking for credit

Page 8: CS 598SS Probabilistic Methods in Biological Sequence Analysis

Books

• Not required

1. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids -- Durbin, Eddy, Krogh, Mitchison

2. Bioinformatics: The Machine Learning Approach -- Baldi, Brunak

3. Statistical Methods in Bioinformatics -- Ewens and Grant

4. Bioinformatics -- Polanski and Kimmel

Page 9: CS 598SS Probabilistic Methods in Biological Sequence Analysis

Why study bioinformatics?

• Molecular biology is the new frontier of 21st century science

• Computer science is the crown prince of 20th century engineering

• Bioinformatics is the application and development of computer science with the goal of supporting molecular biology

Page 10: CS 598SS Probabilistic Methods in Biological Sequence Analysis

Why study bioinformatics?

• Flood of data: several gigabytes (terabytes?) of sequence and gene expression data

• Noise in the data
– Biological
– Experimental

• Algorithms needed to make discoveries
– Probabilistic methods
– Need for efficiency

Page 11: CS 598SS Probabilistic Methods in Biological Sequence Analysis

Why study bioinformatics?

• The big picture:
– Human health and quality of life
– Fundamental science

• Billions of dollars being spent
– Health research gets the major chunk of the US Govt’s funds
– Fundamental health research is at the molecular level
– Molecular biology research increasingly a quantitative science

Page 12: CS 598SS Probabilistic Methods in Biological Sequence Analysis

Why study bioinformatics?

• Recent issue of Science: top 25 questions
> What Is the Universe Made Of?
> What is the Biological Basis of Consciousness?
> Why Do Humans Have So Few Genes?
> To What Extent Are Genetic Variation and Personal Health Linked?
> Can the Laws of Physics Be Unified?
> How Much Can Human Life Span Be Extended?
> What Controls Organ Regeneration?
> How Can a Skin Cell Become a Nerve Cell?
> How Does a Single Somatic Cell Become a Whole Plant?
> How Does Earth's Interior Work?
> Are We Alone in the Universe?
> How and Where Did Life on Earth Arise?
> What Determines Species Diversity?
> What Genetic Changes Made Us Uniquely Human?
> How Are Memories Stored and Retrieved?
> How Did Cooperative Behavior Evolve?
> How Will Big Pictures Emerge from a Sea of Biological Data?
> How Far Can We Push Chemical Self-Assembly?
> What Are the Limits of Conventional Computing?
> Can We Selectively Shut Off Immune Responses?
> Do Deeper Principles Underlie Quantum Uncertainty and Nonlocality?
> Is an Effective HIV Vaccine Feasible?
> How Hot Will the Greenhouse World Be?
> What Can Replace Cheap Oil -- and When?
> Will Malthus Continue to Be Wrong?

Page 13: CS 598SS Probabilistic Methods in Biological Sequence Analysis

Basic Molecular Biology

Page 14: CS 598SS Probabilistic Methods in Biological Sequence Analysis

Life, Cells, Proteins

• The study of life is the study of cells

• Cells are born, do their job, duplicate, die
– What is “their job”?
– Break down nutrients, produce energy, produce required molecules

• All these processes are controlled by proteins

Page 15: CS 598SS Probabilistic Methods in Biological Sequence Analysis

Protein functions

• “Enzymes” (catalysts)
– Control chemical reactions in cell

• Transfer of signals/molecules between and inside cells
– E.g., sensing of environment

• Regulate production of other proteins

Page 16: CS 598SS Probabilistic Methods in Biological Sequence Analysis

Protein molecule

• Protein is a sequence of amino-acids

• 20 possible amino acids

• The amino-acid sequence “folds” into a 3-D structure called protein

Page 17: CS 598SS Probabilistic Methods in Biological Sequence Analysis

Protein Structure

Protein

DNA

The DNA repair protein MutY (blue) bound to DNA (purple).

PNAS cover, courtesy Amie Boal

Page 18: CS 598SS Probabilistic Methods in Biological Sequence Analysis

DNA

• Deoxyribonucleic acid: a molecule that is involved in production of proteins

• Double helical structure (discovered by Watson, Crick, Wilkins & Franklin)

• Chromosomes are densely coiled and packed DNA

Page 19: CS 598SS Probabilistic Methods in Biological Sequence Analysis

SOURCE: http://www.microbe.org/espanol/news/human_genome.asp

Chromosome

DNA

Page 20: CS 598SS Probabilistic Methods in Biological Sequence Analysis

The DNA Molecule

G -- C A -- T T -- A G -- C C -- G G -- C T -- A G -- C T -- A T -- A A -- T A -- T C -- G T -- A

Base = Nucleotide

5’

3’
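The ladder above follows the base-pairing rule (A with T, G with C), so either strand determines the other. A minimal Python sketch of that rule follows. The function names are illustrative, and treating the left-hand column as a strand read 5' to 3' from top to bottom is an assumption about the figure, not something the slide states.

```python
# Watson-Crick pairing: A pairs with T, G pairs with C.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand: str) -> str:
    """Return the partner base for each position, in the same left-to-right order."""
    return "".join(PAIR[base] for base in strand)


def reverse_complement(strand: str) -> str:
    """Return the partner strand written in its own 5'-to-3' direction."""
    return complement(strand)[::-1]


# Left-hand column of the ladder on this slide, read top to bottom.
left = "GATGCGTGTTAACT"

print(complement(left))          # CTACGCACAATTGA (the right-hand column)
print(reverse_complement(left))  # AGTTAACACGCATC
```

Because the two strands run in opposite directions, sequence tools usually report the partner strand as the reverse complement rather than the plain complement.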

Page 21: CS 598SS Probabilistic Methods in Biological Sequence Analysis

SRC:http://www.biologycorner.com/resources/DNA-RNA.gif

Cell

From DNA to Amino-acid sequence

Page 22: CS 598SS Probabilistic Methods in Biological Sequence Analysis

From DNA to Protein: In words

1. DNA = nucleotide sequence
• Alphabet size = 4 (A,C,G,T)

2. DNA --> mRNA (single stranded)
• Alphabet size = 4 (A,C,G,U)

3. mRNA --> amino acid sequence
• Alphabet size = 20

4. Amino acid sequence “folds” into 3-dimensional molecule called protein

Page 23: CS 598SS Probabilistic Methods in Biological Sequence Analysis

Central Dogma

• “Information” flows from DNA to RNA to Protein

• Why “information” ?

• The DNA in a cell has complete information about which proteins will be present in the cell

Page 24: CS 598SS Probabilistic Methods in Biological Sequence Analysis

DNA and genes

• DNA is a very “long” molecule

• Human DNA has 3 billion base-pairs
– String of 3 billion characters!

• DNA harbors “genes”
– A gene is a substring of the DNA string
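Treating DNA as a string makes the sizes concrete. The sketch below works out how much storage a 3-billion-character string needs and models a gene as a slice of that string. The 2-bits-per-base packing and the helper `gene_at` are illustrative assumptions, not part of the slides.

```python
import math

GENOME_LENGTH = 3_000_000_000  # base pairs, the figure quoted on this slide
ALPHABET_SIZE = 4              # A, C, G, T

# Four symbols need log2(4) = 2 bits each.
bits_per_base = math.ceil(math.log2(ALPHABET_SIZE))

bytes_as_text = GENOME_LENGTH                      # one 8-bit character per base
bytes_packed = GENOME_LENGTH * bits_per_base // 8  # four bases packed per byte

print(bytes_as_text / 1e9, "GB as plain text")  # 3.0
print(bytes_packed / 1e6, "MB when packed")     # 750.0


def gene_at(dna: str, start: int, end: int) -> str:
    """Model a gene as the substring dna[start:end] (0-based, end exclusive)."""
    return dna[start:end]


print(gene_at("TTACAGTGAGG", 2, 9))  # ACAGTGA
```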

Page 25: CS 598SS Probabilistic Methods in Biological Sequence Analysis

Genes code for proteins

• DNA --> mRNA --> protein can actually be written as Gene --> mRNA --> protein

• A gene is typically a few hundred base-pairs (bp) long

Page 26: CS 598SS Probabilistic Methods in Biological Sequence Analysis

Transcription

• Process of making a single stranded mRNA using double stranded DNA as template
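As a string operation, transcription can be sketched in a few lines of Python. The sketch assumes the template strand is written 3' to 5', so that pairing base by base yields the mRNA in 5' to 3' order. The function names and the example sequence are illustrative.

```python
# Pairing used when RNA is built against a DNA template: U takes the place of T.
DNA_TO_RNA_PAIR = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(template: str) -> str:
    """Build the mRNA (5'->3') against a DNA template strand written 3'->5'."""
    return "".join(DNA_TO_RNA_PAIR[base] for base in template)


def transcribe_coding(coding: str) -> str:
    """Equivalent shortcut: copy the other (coding) strand, replacing T with U."""
    return coding.replace("T", "U")


# Two strands of one hypothetical double-stranded fragment.
coding = "ATGGCC"    # 5'->3'
template = "TACCGG"  # 3'->5'

print(transcribe(template))       # AUGGCC
print(transcribe_coding(coding))  # AUGGCC
```

Both routes give the same mRNA, which is why sequence databases typically store only the coding strand.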

Page 29: CS 598SS Probabilistic Methods in Biological Sequence Analysis

Translation

• Process of making an amino acid sequence from (single stranded) mRNA

• Each triplet of bases translates into one amino acid: each such triplet is called “codon”

• The translation is basically a table lookup
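The table-lookup view can be written down directly. The Python sketch below builds the standard genetic code from a 64-letter string (codons ordered with bases U, C, A, G, third position changing fastest) and translates an mRNA codon by codon. It assumes the input already starts in the right reading frame, and the example mRNA is made up for illustration.

```python
from itertools import product

BASES = "UCAG"
# One-letter amino acid codes for the 64 codons UUU, UUC, UUA, UUG, UCU, ...
# "*" marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def translate(mrna: str) -> str:
    """Translate mRNA codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]  # the table lookup
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(len(CODON_TABLE))           # 64 codons
print(translate("AUGGCCUGGUAA"))  # MAW  (AUG GCC UGG, then stop codon UAA)
```

With 4 bases there are 4^3 = 64 possible triplets but only 20 amino acids, so several codons map to the same amino acid.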

Page 30: CS 598SS Probabilistic Methods in Biological Sequence Analysis
Page 31: CS 598SS Probabilistic Methods in Biological Sequence Analysis

The Genetic Code

SOURCE: http://www.bioscience.org/atlases/genecode/genecode.htm

Page 32: CS 598SS Probabilistic Methods in Biological Sequence Analysis

Step 2: mRNA to Amino acid sequence

Translation

Page 33: CS 598SS Probabilistic Methods in Biological Sequence Analysis

Review so far

• Proteins: important molecules, amino acid sequences

• DNA: structure, base-pairing.

• Genes: substrings of DNA

• Gene --> mRNA (transcription)

• mRNA --> amino acid sequence (translation), genetic code.

Page 34: CS 598SS Probabilistic Methods in Biological Sequence Analysis

Gene expression

• Process of making a protein from a gene as template

• Transcription, then translation

• Can be regulated

Page 35: CS 598SS Probabilistic Methods in Biological Sequence Analysis

GENE

ACAGTGA

TRANSCRIPTION FACTOR

PROTEIN

Transcriptional regulation
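In the diagram, the transcription factor recognizes the short DNA word ACAGTGA near the gene. A minimal Python sketch of locating such a site follows. It uses exact string matching on both strands, which is a simplification: real binding sites vary from copy to copy, which is one motivation for probabilistic models. The promoter sequence and function names are hypothetical.

```python
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq: str) -> str:
    """Return the opposite strand, read in its own 5'-to-3' direction."""
    return "".join(PAIR[base] for base in reversed(seq))


def find_sites(dna: str, motif: str) -> list:
    """Return (position, strand) for each exact occurrence of motif on either strand."""
    rc = reverse_complement(motif)
    hits = []
    for i in range(len(dna) - len(motif) + 1):
        window = dna[i:i + len(motif)]
        if window == motif:
            hits.append((i, "+"))
        elif window == rc:  # site lies on the opposite strand
            hits.append((i, "-"))
    return hits


# Hypothetical promoter region with one site on each strand.
promoter = "GGACAGTGATTTCACTGTCC"
print(find_sites(promoter, "ACAGTGA"))  # [(2, '+'), (11, '-')]
```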

Page 37: CS 598SS Probabilistic Methods in Biological Sequence Analysis

The importance of gene regulation

Page 38: CS 598SS Probabilistic Methods in Biological Sequence Analysis

Genetic regulatory network controlling the development of the body plan of the sea urchin embryo

Davidson et al., Science, 295(5560):1669-1678.

Page 39: CS 598SS Probabilistic Methods in Biological Sequence Analysis

• That was the “circuit” responsible for development of the sea urchin embryo

• Nodes = genes

• Switches = gene regulation

• Change the switches and the circuit changes

• Gene regulation significance:
– Development of an organism
– Functioning of the organism
– Evolution of organisms

Page 40: CS 598SS Probabilistic Methods in Biological Sequence Analysis

Genome

• The entire sequence of DNA in a cell

• All cells have the same genome
– All cells came from repeated duplications starting from initial cell (zygote)

• Human genome is 99.9% identical among individuals

• Human genome is 3 billion base-pairs (bp) long
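Taken together, the two figures above give a rough count of how many positions differ between two people. The Python sketch below does that arithmetic and shows one simple way to compute percent identity for two aligned sequences. The helper `percent_identity` and its toy inputs are illustrative and assume equal-length, already aligned strings.

```python
GENOME_LENGTH = 3_000_000_000  # base pairs
DIFFERING_PER_THOUSAND = 1     # 99.9% identical means 1 position in 1000 differs

expected_differences = GENOME_LENGTH * DIFFERING_PER_THOUSAND // 1000
print(expected_differences)  # 3000000


def percent_identity(a: str, b: str) -> float:
    """Percentage of positions at which two equal-length sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


print(percent_identity("ACGT", "ACGA"))  # 75.0
```

So even at 99.9% identity, two genomes are expected to differ at roughly 3 million positions.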

Page 41: CS 598SS Probabilistic Methods in Biological Sequence Analysis

Genome features

• Genes

• Regulatory sequences

• The above two make up 5% of human genome

• What’s the rest doing?
– We don’t know for sure

• “Annotating” the genome
– Task of bioinformatics