Introduction to Bioinformatics - Craig...

Post on 21-Jun-2020


Introduction to Bioinformatics

Contents

• Cell biology
  – Organisms and cells
  – Building blocks of cells
  – How do genes encode proteins?
• Bioinformatics
  – What is bioinformatics?
  – Practical applications
  – Tools and databases

Cell Biology

Lineage tree of life on earth

• Prokaryotes
  – Bacteria
  – Archaebacteria
• Eukaryotes
  – Plants
  – Animals
  – Fungi

Prokaryotic cells

• Single-celled organisms
• Consist of cytosol bounded by the plasma membrane
• Most possess a cell wall
• Gram-negative bacteria have a thin cell wall and an outer membrane
• Gram-positive bacteria have a thick cell wall and no outer membrane
• DNA is condensed toward the cell center and is not enclosed in a defined nucleus
• Ribosomes are found in the DNA-free region
• Relatively simple internal organization
• Some can grow in extreme conditions (temperature, pH, salt concentration)

Eukaryotic cells

• Single-celled (unicellular fungi and protozoans) or multicellular organisms (plants and animals)
• Plant and fungal cells both possess a cell wall, but of different compositions
• Surrounded by a plasma membrane, like the prokaryotes
• Contain a defined nucleus
• Structurally more complex: organelles, cytoskeleton
• Organelles are enclosed compartments separated from the cytoplasm, defined by internal membranes
• The cytoskeleton is made of structural proteins that give the cell strength and rigidity; it can be connected to organelles and provides tracks for organelle movement

Animal cell structure

Plant cell structure

Building blocks of cells

• Macromolecules
  – Nucleic acids (e.g. DNA, RNA)
  – Proteins (e.g. collagen)
  – Carbohydrates (e.g. glycogen, a polymer of the sugar glucose)
  – Lipids (e.g. cholesterol)
• Other molecules
  – Water
  – Ions

Central dogma

• Genetic information flow: DNA → mRNA → Protein

Deoxyribonucleic acid (DNA)

• Contains genetic information arranged in units termed “genes”
• In an organism, nearly all cells contain the same DNA content
• Basic subunits
  – adenine (A)
  – guanine (G)
  – cytosine (C)
  – thymine (T)

Native DNA is a double helix of complementary antiparallel chains
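As a small illustration of what "complementary antiparallel" means in practice, the sketch below derives the partner strand of a short sequence. The function name `reverse_complement` and the example strand are illustrative assumptions, not taken from the slides.

```python
# Base-pairing rules for DNA: A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(strand: str) -> str:
    """Return the partner strand, written 5'->3'.

    The partner chain runs in the opposite direction (antiparallel),
    so the complemented bases are read in reverse order.
    """
    return "".join(COMPLEMENT[base] for base in reversed(strand.upper()))


if __name__ == "__main__":
    top = "ATGGCT"  # hypothetical example strand, written 5'->3'
    print(top, "pairs with", reverse_complement(top))
```

Applying the function twice returns the original strand, which is a quick sanity check on the pairing table.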

DNA is packaged into chromosomes

The total DNA in the chromosomes of an organism is its genome

Ribonucleic acid (RNA)

• Carries genetic information as messenger RNAs (mRNA)
• In an organism, different cells contain different sets of mRNAs
• Basic subunits
  – adenine (A)
  – guanine (G)
  – cytosine (C)
  – uracil (U)

Protein

• Carries genetic information as an amino acid sequence
• Basic subunits are 20 amino acids
• A protein’s amino acid sequence determines its 3D structure, which in turn determines the function of that protein

Question: What are essential amino acids?
Amino acids that cannot be synthesized by the body’s cells and therefore have to be included in the diet. Soybeans and corn are rich in essential amino acids.

The genetic code is a triplet code

How do genes encode proteins?

• Example:

Gene X    ATG GCT TGT TTA CGA ATT TAG
mRNA      AUG GCU UGU UUA CGA AUU UAG
Protein   Met Ala Cys Leu Arg Ile *
          M   A   C   L   R   I   *
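The Gene X example can be reproduced with a few lines of code. This is a minimal sketch, assuming the gene is written as the coding strand (so the mRNA is the same sequence with U in place of T) and using a partial codon table that covers only the codons in the example. The names `CODON_TABLE`, `transcribe` and `translate` are illustrative.

```python
# Partial codon table: only the codons that occur in the Gene X example.
CODON_TABLE = {
    "AUG": "M",  # Met
    "GCU": "A",  # Ala
    "UGU": "C",  # Cys
    "UUA": "L",  # Leu
    "CGA": "R",  # Arg
    "AUU": "I",  # Ile
    "UAG": "*",  # stop
}


def transcribe(coding_dna: str) -> str:
    """Write the mRNA for a coding-strand DNA sequence (T becomes U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Read the mRNA three bases at a time and stop at a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    gene_x = "ATGGCTTGTTTACGAATTTAG"
    mrna = transcribe(gene_x)
    print(mrna)
    print(translate(mrna))
```

A full translator would need all 64 codons; the lookup here raises `KeyError` for any codon outside the example.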


Bioinformatics

What is Bioinformatics?

• Background
  – Massive explosion in the amount of biological information available due to huge advances in the fields of molecular biology and genomics.
• What is bioinformatics?
  – Bioinformatics is the application of computer technology to the management, interpretation and analysis of biological data.
  – An interdisciplinary research area at the interface between the biological and computational sciences.
• Goals
  – To uncover the wealth of biological information hidden in the mass of data.
  – To provide improvements in research fields such as human health, agriculture, the environment, energy and biotechnology.

Data generation

• Large scale sequencing projects
  – Genome sequencing
    • Examples: microbial or human genome sequencing
    • Determine the DNA sequence of an organism
    • Discover “genes” in the genome using bioinformatics tools
  – EST (Expressed Sequence Tag) sequencing
    • Examples: a specific tissue or cell type from a given organism
    • Determine the mRNA sequences found in a specific tissue or cell type
    • Determine the “genes” expressed in a specific tissue or cell type


DNA sequencing

Genome sizes

                        Arabidopsis   Rice     Maize    Human    E. coli
Genome size (Mb)        125           400      2,500    3,000    4.6
Repetitive DNA          10%           40%      80%      80%      -
Estimated gene count    27,000        40,000   45,000   35,000   4,000
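One way to read the genome size figures is to compare gene density across organisms. The sketch below divides the estimated gene count by the genome size, using the values as read from the genome sizes slide; the dictionary layout and the function name `genes_per_mb` are illustrative assumptions.

```python
# (genome size in Mb, estimated gene count), as read from the genome sizes slide.
GENOMES = {
    "Arabidopsis": (125, 27_000),
    "Rice": (400, 40_000),
    "Maize": (2_500, 45_000),
    "Human": (3_000, 35_000),
    "E. coli": (4.6, 4_000),
}


def genes_per_mb(size_mb: float, gene_count: int) -> float:
    """Average number of genes per megabase of genome."""
    return gene_count / size_mb


if __name__ == "__main__":
    for name, (size_mb, gene_count) in GENOMES.items():
        density = genes_per_mb(size_mb, gene_count)
        print(f"{name:12s} {density:8.1f} genes per Mb")
```

With these figures the small E. coli genome comes out far more gene-dense than the large, repeat-rich maize and human genomes.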

Genome sequencing strategies

• Whole genome shotgun sequencing
  – For genomes of relatively small size (e.g. bacteria)
  – Break up the genome into small DNA fragments
  – Rely on computer algorithms to assemble the fragments
  – Examples: microbial genomes, Drosophila, Human (Celera)
• Hierarchical sequencing
  – For genomes of large size (e.g. human, maize)
  – Break the genome into many long pieces
  – Map each long piece onto the chromosome (physical mapping)
  – Select and sequence pieces with minimal overlaps
  – Examples: Rice, Human

Whole Genome Shotgun

1. Cut genomic DNA into pieces (the genome is cut many times at random)
2. Sequence random fragments
3. Put the sequences together into one piece, relying on computer algorithms
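The assembly step can be illustrated with a toy greedy algorithm that repeatedly merges the two fragments sharing the longest suffix-to-prefix overlap. This is only a sketch under strong assumptions (error-free reads, a single strand, no repeats); it is not the algorithm used by any particular assembler, and the names `overlap` and `greedy_assemble` are hypothetical.

```python
def overlap(left: str, right: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `left` that is a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(fragments: list[str], min_len: int = 3) -> list[str]:
    """Merge the best-overlapping pair until no overlaps remain."""
    contigs = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_size, best_i, best_j = 0, -1, -1
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i == j:
                    continue
                size = overlap(left, right, min_len)
                if size > best_size:
                    best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # nothing overlaps any more: leave the contigs separate
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


if __name__ == "__main__":
    # Hypothetical reads covering a 21-base sequence, given in random order.
    reads = ["ACGAATTTAG", "ATGGCTTGT", "TTGTTTACGA"]
    print(greedy_assemble(reads))
```

Real genomes contain repeats and sequencing errors, which is where a simple greedy merge like this one breaks down.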

Hierarchical Sequencing

1. Genomic DNA is cut into pieces
2. Assign a chromosomal location to each DNA fragment
3. Sequence fragments originating from known locations
4. Stitch together the fragments from each chromosomal location

Figure labels: Genome; Chr 1, region 1; Chr 1, region 5

Genome facts

• The order of the nucleotide bases contains the instructions for making an organism.
• There are 4 types of nucleotide base: A (adenine), T (thymine), C (cytosine), G (guanine).
• Every three bases code for an amino acid.
• There are 20 different amino acids that, combined in different ways, make different proteins.
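A short enumeration shows why three bases per amino acid is enough: four bases taken three at a time give 4 × 4 × 4 = 64 possible triplets, comfortably more than the 20 amino acids, so several triplets can stand for the same amino acid. The variable names below are illustrative.

```python
from itertools import product

BASES = "ATCG"

# Every ordered triplet that can be built from the four bases.
codons = ["".join(triplet) for triplet in product(BASES, repeat=3)]

if __name__ == "__main__":
    print(len(codons), "possible triplets for 20 amino acids")
```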

Human genome facts

• The human genome is composed of more than 3 billion nucleotide bases.
• Almost all nucleotide bases (99.9%) are exactly the same in all people.
• Our DNA is 98% identical to that of chimpanzees.
• Less than 2% of the genome codes for proteins.
• The vast majority of the DNA in the genome (>97%) has no known function.
• The functions remain unknown for over 50% of discovered genes.
• Chromosome 1 has the most genes (2,968) and chromosome Y has the fewest (231).
• If unwound and tied together, the strands of DNA in one cell would stretch 6 feet.
• The total number of human genes is estimated to be between 30,000 and 40,000.
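The percentages above are easier to grasp as absolute numbers. The sketch below is back-of-the-envelope arithmetic only, assuming a round genome size of 3 billion bases and treating "less than 2%" as an upper bound; the constant names are illustrative.

```python
GENOME_SIZE = 3_000_000_000   # round figure for "more than 3 billion" bases
DIFFERING_PER_THOUSAND = 1    # 99.9% shared leaves 0.1% that differs
CODING_PERCENT_MAX = 2        # "less than 2%" codes for proteins

# Integer arithmetic avoids floating-point rounding in the results.
differing_bases = GENOME_SIZE * DIFFERING_PER_THOUSAND // 1000
coding_bases_upper_bound = GENOME_SIZE * CODING_PERCENT_MAX // 100

if __name__ == "__main__":
    print(f"Bases that differ between two people: about {differing_bases:,}")
    print(f"Protein-coding bases: fewer than {coding_bases_upper_bound:,}")
```

So a 0.1% difference still corresponds to roughly three million bases.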


CTCTAGCTATCTTGGTCTCCTACACAGCCTATGCACATGAGCCCATGCCTCTCCTCTCCTTGCGCCTGCATAGAGAGGTGGTATGATCACCTGGAAAGTTTTTAACTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTTACAAGCCTAGACCTTATGCATGGTCGGACGGACACATCTGATCATAGGACATATGAGTAGGCCACACTCCTCCTGCCCCTCTCTCGTAGAGATCAACACACACTGCTCTTAGTGCCAGGACCTAGAGAGGGGAGCGTGGAGAGGGCATCAGGGGGCCTTGGAGTCCCATCAGTAAAGCACATGTTTCCTTTCTGTGATTCCTCAAGCCCCATGGACTTACCGCTTTACCAACAACTGCAGCTAAGCCCGTCTTCCCCAAAGACGGACCAATCCAGCAGCTTCTACTGCTACCCATGCTCCCCTCCCTTCGCCGCCGCCGACGCCAGCTTTCCCCTCAGCTACCAGATCGGTAGTGCCGCGGCCGCCGACGCCACCCCTCCACAAGCCGTGATCAACTCGCCGGACCTGCCGGTGCAGGCGCTGATGGACCACGCGCCGGCGCCGGCTACAGAGCTGGGCGCCTGCGCCAGTGGTGCAGAAGGATCCGGCGCCAGCCTCGACAGGGCGGCTGCCGCGGCGAGGAAAGACCGGCACAGCAAGATATGCACCGCCGGCGGGATGAGGGACCGCCGGATGCGGCTCTCCCTTGACGTCGCGCGCAAATTCTTCGCGCTGCAGGACATGCTTGGCTTCGACAAGGCAAGCAAGACGGTACAGTGGCTCCTCAACACGTCCAAGTCCGCCATCCAGGAGATCATGGCCGACGACGCGTCTTCGGAGTGCGTGGAGGACGGCTCCAGCAGCCTCTCCGTCGACGGCAAGCACAACCCGGCAGAGCAGCTGGGAGGAGGAGGAGATCAGAAGCCCAAGGGTAATTGCCGCGGCGAGGGGAAGAAGCCGGCCAAGGCAAGTAAAGCGGCGGCCACCCCGAAGCCGCCAAGAAAATCGGCCAATAACGCACACCAGGTCCCCGACAAGGAGACGAGGGCGAAAGCGAGGGAGAGGGCGAGGGAGCGGACCAAGGAGAAGCACCGGATGCGCTGGGTAAAGCTTGCTTCAGCAATTGACGTGGAGGCGGCGGCTGCCTCGGGGCCGAGCGACAGGCCGAGCTCGAACAATTTGAGCCACCACTCATCGTTGTCCATGAACATGCCGTGTGCTGCCGCTGAATTGGAGGAGAGGGAGAGGTGTTCATCAGCTCTCAGCAATAGATCAGCAGGTAGGATGCAAGAAATCACAGGGGCGAGCGACGTGGTCCTGGGCTTTGGCAACGGAGGAGGAGGATACGGCGACGGCGGCGGCAACTACTACTGCCAAGAGCAATGGGAACTCGGTGGAGTCGTCTTTCAGCAGAACTCACGCTTCTACTGAACACTACGGGCGCACTAGGTACTAGAACTACTCTTTCGACTTACATCTATCTCCTTTCCCTCAACGTGAGCTTCTCAATAATTTGCTGTCTTAATCTATGCGTGTGTTTCTCTTTCTAGACTTCGTAATTGGCTGTGTGACGATGAACT

A piece of DNA sequence

A piece of DNA sequence carrying a gene unit (the same sequence as above)

Data analysis (tools)

• Sequence properties
  – Length, base composition, GC content, etc.
• Sequence assembly
  – Put sequences together based on similarity
• Gene prediction
  – Find gene units in a given DNA sequence
• Repeat finding
  – Find repeated units in a given DNA sequence
• Sequence similarity search
  – Find other similar sequences based on DNA or protein sequences
• Protein function analysis
  – Predict protein function based on known functional units (domains) found in the protein sequence
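Two of the analyses listed above are simple enough to sketch directly: basic sequence properties and a naive form of gene prediction. The code below assumes an uppercase or lowercase DNA string on a single strand and looks only for ATG-to-stop open reading frames; real gene finders are considerably more sophisticated. All function names are illustrative.

```python
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}


def base_composition(seq: str) -> Counter:
    """Count how often each base occurs."""
    return Counter(seq.upper())


def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def find_orfs(seq: str, min_codons: int = 2) -> list[tuple[int, int]]:
    """Return (start, end) slices of ATG...stop stretches in the three forward frames."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                # Keep the ORF only if it has enough codons before the stop.
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs


if __name__ == "__main__":
    dna = "CCATGGCTTGTTTACGAATTTAGCC"  # hypothetical sequence with one small ORF
    print("length:", len(dna))
    print("composition:", dict(base_composition(dna)))
    print("GC content:", round(gc_content(dna), 2))
    for start, end in find_orfs(dna):
        print("ORF:", dna[start:end])
```

A fuller version would also scan the reverse complement, since genes can lie on either strand.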

Data storage (databases)

• Databases
  – Research articles
    • What is the latest research with regard to genes involved in horse coat color?
  – Taxonomy
    • How many plant or animal genomes have been sequenced?
  – Nucleotide
    • What is the nucleotide sequence of the maize domestication gene teosinte branched 1 (tb1)?
  – Protein
    • What is the protein sequence of the maize domestication gene tb1?
  – Genome
    • Where are the human diabetes genes located in the human genome? On which chromosome?
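Questions like these are typically answered by querying a public database. The slides do not name a provider, so the sketch below assumes the NCBI E-utilities search service and its database names (`pubmed`, `protein`) purely for illustration. It only builds the query URLs and does not contact the server; the helper `build_search_url` and the search terms are hypothetical.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint (an assumption; the slides name no provider).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_search_url(database: str, term: str, max_results: int = 5) -> str:
    """Build a search URL for one database and one free-text query."""
    params = {"db": database, "term": term, "retmax": max_results}
    return f"{ESEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # Research articles: genes involved in horse coat color.
    print(build_search_url("pubmed", "horse coat color"))
    # Protein: the maize domestication gene tb1.
    print(build_search_url("protein", "teosinte branched 1 maize"))
```

Opening either URL in a browser or fetching it with an HTTP client would return a list of matching record identifiers.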

Long term goals

• Agriculture
  – Improve insect resistance
  – Improve nutritional quality
  – Improve drought resistance and/or environmental adaptability
• Animals
  – Improve production and nutrition of farm animals
• Molecular medicine
  – Preventative medicine
  – Gene therapy
• Microbial genome
  – Waste cleanup
  – Climate change
  – Alternative energy sources
  – Biotechnology
  – Antibiotic resistance
  – Forensic analysis of microbes
  – Metagenomics