
Introduction Bioinformatics (Ba Biochemistry and Biotechnology)

Kathleen Marchal

CHAPTER 1: INTRODUCTION TO BIOINFORMATICS

Table of contents: Introduction to bioinformatics

Chapter 1: Introduction to bioinformatics
  Foreword: bioinformatics
  Bioinformatics, a research domain at the crossroads of different disciplines
  Driving force for bioinformatics
    The high-throughput-ization of molecular biology
    The sequencing revolution
  Different subfields in bioinformatics research
  Structural genomics
    Assembly
    Structural annotation
    Biological application: genome sequencing
  Comparative genomics
    Overview
    Biological application 1: evolutionary biology
    Biological application 2: studying genome evolution
    Biological application 3: metagenomics (G. Venter)
  Functional genomics & Systems Biology
    Systems biology
    Biological application: synthetic biology from an engineering point of view: rational design
  Conclusion

Updated 23/01/2015


FOREWORD: BIOINFORMATICS

Bioinformatics, although a relatively recent term, has existed for more than 400 years. After all, Galileo wrote: “the book of nature is written in the language of mathematics!”.

The use of mathematical models to explain biological phenomena and to analyze data is certainly not new. Until recently, however, it was common practice only in certain subdomains of biology (e.g. population genetics, phylogeny, molecular modeling).

Important technological innovations in molecular biology in the early 1990s changed this fundamentally. The application of high-throughput technologies (genomics, transcriptomics, proteomics, metabolomics) makes it possible, in a very short time, to map the DNA sequence of entire genomes, to analyze the expression of thousands of genes or proteins in an organism, to assess the nature and concentration of all metabolites, and to identify the interactions between these different genetic entities. This has led to an unparalleled data explosion. An Excel spreadsheet no longer suffices to analyze these data; an interdisciplinary approach is necessary.

This data explosion has also drastically broadened “biological” thinking (also called the new biology). The ultimate goal of molecular biology, “gaining insight into the functioning and evolution of organisms”, has remained the same. The way to reach this goal has changed. Until a few years ago, functional molecular biological research studied genes, proteins and other molecules one by one, as isolated entities. The new technologies now place the function of a gene in a global context, namely as part of a complex regulatory network. From this new perspective the organism is regarded as a system that interacts with its environment. Its behavior is determined by the complex dynamic interactions between genes, proteins and metabolites at the level of the regulatory network. Moreover, because data are available for different model organisms, cellular mechanisms can be compared between organisms.

Fig. An organism represented as a system that interacts with its environment. Through its regulatory networks, an organism continuously adapts to changing environmental signals. These adaptations result in an altered behavior or phenotype. The regulatory networks can be regarded as the biological signal-processing systems.

Traditional studies of biological systems were largely descriptive. The systems approach to biology, however, implies a thorough quantitative and integrated analysis of complex data. Under the influence of this new trend the term "bioinformatics" emerged (first used around 1993), and high-throughput functional molecular biology became part of “systems biology”.

Like molecular biology, systems biology and bioinformatics are research domains with many subdisciplines (structural, functional, comparative bioinformatics).

Bioinformatics questions arise from biology. The computational sciences provide an arsenal of standard algorithms and principles. Both have to be combined in a meaningful way, taking into account the specific properties of the algorithm used as well as those of the biological problem. Reconciling algorithms from the exact sciences with experimental data from stochastic biological systems is the main challenge here. For that reason, solving a biological problem computationally can easily take a few years of research, but it leads to valuable results that in some cases surpass those of traditional biological research.

Future of bioinformatics

Bioinformatics is therefore not a “hype”. As molecular biological technology evolves, it will continue to grow in importance. The most successful molecular biology laboratories will undoubtedly be those that steer their “wet lab” research by the predictions of advanced computational research. The future of both molecular biology and bioinformatics lies in developing research in which the boundary between the “wet lab” and the computational aspect fades.

Aim of the bioinformatics course

The aim of the course is twofold:

The first and probably most important objective is to convince you that bioinformatics is an essential part of your curriculum. With a number of examples and achievements from the field I hope to convince you that ‘bioinformatics’ and ‘systems biology’ will change our lives and our thinking. The molecular biologist of the 21st century will not only have well-developed biological knowledge, but must also be familiar with important principles from mathematics, statistics and information technology. Such integration of biological insight with analytical and problem-solving thinking is characteristic of bioinformatics.

A second aim of the course is to familiarize you with the use of bioinformatics tools. The bioinformatics field is, however, very broad and rapidly expanding. It is therefore impossible to cover all tools and subfields. We will discuss a number of important and widely used examples. It is important that you realize that meaningful bioinformatics is more than just applying tools: it is essential to also understand the underlying mathematical principles of these tools and, at the same time, to gain insight into the data-generation protocols and the biological complexity. This implies that bioinformatics is more than an aid to molecular biological research; it is a fully fledged research domain in its own right.

BIOINFORMATICS, A RESEARCH DOMAIN AT THE CROSSROADS OF DIFFERENT DISCIPLINES

Bioinformatics is an interdisciplinary research area at the interface between the biological and computational sciences. It is the scientific field that deals with the computational management of all kinds of molecular biological information. Most bioinformatics work deals either with analyzing biological data or with organizing biological information. The ultimate goal of the field is to discover new biological insights and to create a global perspective from which unifying principles in biology can be discerned. Bioinformatics is the application of computer science to biological science, but it is also the creation of computer science for biological science. Bioinformatics emerged as an important discipline shortly after the development of DNA sequencing technologies in the 1970s, although the word “bioinformatics” did not start appearing in the biomedical literature until around 1990.

DRIVING FORCE FOR BIOINFORMATICS:


THE HIGH-THROUGHPUT-IZATION OF MOLECULAR BIOLOGY

Traditional genetics and molecular biology were directed toward understanding the role of a particular gene or protein in a molecular biological process: a gene is sequenced to predict its function or to manipulate its activity or expression. The focus was thus on single genes. With the advent of novel molecular biological techniques such as genome-scale sequencing, large-scale expression analysis (gene and protein expression: microarrays, 2D electrophoresis, mass spectrometry), and large-scale identification of protein-protein interactions (yeast two-hybrid, protein chips) or protein-DNA interactions (chromatin immunoprecipitation), the scale of molecular biology has changed. One no longer focuses on a single gene; instead, many genes or proteins are analyzed simultaneously, i.e. at a high-throughput level (transcriptomics, translatomics, interactomics, metabolomics). This approach has a clear advantage: the function or expression of a gene can be studied in the global context of the cell. A gene does not act on its own but is always embedded in a larger network (systems biology). Such holistic approaches allow a better understanding of fundamental molecular biological processes.

On the other hand, high-throughput approaches pose several novel challenges to molecular biology. The analysis of such large-scale data is no longer trivial: simple spreadsheet tools such as Excel are no longer sufficient, and more advanced data-mining procedures become necessary. Another urgent problem is how to store and organize all the information.

There is, in fact, an inseparable relationship between the experimental and the computational aspects.

On the one hand, data resulting from high-throughput experimentation require intensive computational interpretation and evaluation.

On the other hand, computational methods produce predictions that remain uncertain and should be reviewed and confirmed through experiments.

THE SEQUENCING REVOLUTION

The evolution of sequencing technologies provides the most prominent example of a technological revolution of the 21st century.


1953: James Watson and Francis Crick publish their classic paper describing the double-helical structure of DNA. In 1962, Watson and Crick, together with Maurice Wilkins, receive the Nobel prize for their discovery.

1955: Frederick Sanger (a British biochemist) determines, together with colleagues, the complete amino acid sequence of bovine insulin. He concludes that each protein has a unique sequence. This achievement earned him his first Nobel prize in chemistry in 1958.

1966: Over the course of several years, Marshall Nirenberg, Har Gobind Khorana and Severo Ochoa and their colleagues elucidate the genetic code.

1967: Walter Fitch and Emanuel Margoliash publish a paper on the ‘Construction of phylogenetic trees’. This marks the start of phylogenetics based on sequence data.

1972: The first complete nucleotide sequence of a gene is determined by a group at Ghent University: the gene coding for the bacteriophage MS2 coat protein. The encoded protein is 129 amino acids long.

1976: The genome of the enterobacterial phage MS2 is the first genome to be completely sequenced. This was accomplished by Walter Fiers and his team, building upon their 1972 milestone of the first completely sequenced gene.

1977: Frederick Sanger and colleagues develop what was later called the “Sanger sequencing” method. It became the most widely used sequencing method for approximately 25 years.


2001: Special issues of Science (Feb. 16, 2001) and Nature (Feb. 15, 2001) publish the first drafts of the human genome sequence. The working draft had already been announced on 26 June 2000 at a press conference given by then-president Bill Clinton together with Francis Collins and Craig Venter. To everyone’s (?) surprise, the human genome contains only about 23,000 protein-coding genes.

2006: Introduction of next (second) generation sequencing technologies. These (will) lead to a huge increase in the amount of genome sequence data produced.

NGS technology

Sequencing the human genome: 13 years / 3 billion dollars

Watson's genome (454 technology): 20 people / 2 months. Total cost 1,000,000 dollars

Expected: the 1000-dollar human genome


2008: The 1000 Genomes Project, an international research effort to establish a detailed catalogue of human genetic variation, is launched. Scientists plan to sequence the genomes of at least one thousand anonymous participants from a number of different ethnic groups within the following three years, using the newly developed sequencing technologies that are faster and less expensive.

The revolution goes on….


DIFFERENT SUBFIELDS IN BIOINFORMATICS RESEARCH

This close merger of molecular biology and computational biology has given rise to new research fields and applications. Each of these research fields requires specific bioinformatics expertise.

Three main fields can be distinguished:

Structural genomics

o Input: raw sequence data

o Goal: annotation

o Bioinformatics Tools: genome assembly, gene/promoter/intron prediction

Comparative genomics

o Input: annotated genomes

o Goal: annotation, evolutionary genomics

o Bioinformatics Tools: sequence alignment, tree construction

Functional genomics

o Input: experimental information

o Goal: function assignment, systems biology

o Bioinformatics Tools: microarray analysis, network reconstruction, data integration

Note that the field of molecular dynamics and protein modeling is not covered in this course.


For some purposes, different subfields have to be combined, i.e. the distinction is not always as clear-cut as it seems.

For instance, for genome annotation:

As genomes are collected they need to be annotated. This means that we have

to identify the location of the genes on the genome (structural annotation)

to assign a function to each of the potential genes (functional annotation).

In structural annotation, the question to be answered is 'where are the genes?'. One needs to localize the gene elements on the sequence (chromosome) and find the coding sequences, intergenic sequences, exon/intron boundaries, promoters, 5'UTR and 3'UTR regions, and so on.
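As a rough illustration of the 'where are the genes?' question, the sketch below scans the forward strand of a DNA string for open reading frames (an ATG followed in frame by a stop codon). This is a deliberately naive stand-in for real structural annotation: it ignores the reverse strand, introns, promoters and UTRs, and the function name `find_orfs` and the `min_codons` threshold are hypothetical choices made for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return sorted (start, end) pairs of ATG...stop ORFs on the forward strand.

    Coordinates are 0-based, end-exclusive, and include the stop codon.
    Illustrative only: no reverse strand, no introns, no scoring.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                   # close it and keep scanning
    return sorted(orfs)


print(find_orfs("CCATGAAATTTTAGGG"))  # → [(2, 14)]
```

In eukaryotes such a scan is far from sufficient, which is why trained models are used for gene prediction.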

In functional annotation, one tries to get information on the function of genes. Often, it is possible to get

hints on the biochemical function of the gene products by finding homologs in protein databases or by

studying the biochemical characteristics of the gene (proteome, transcriptome analysis).

In the following, each of the bioinformatics subfields will be briefly described and illustrated with a biological

case study.

Fig. Functional genomics/systems biology, structural genomics/annotation, comparative genomics/evolutionary biology.


STRUCTURAL GENOMICS

ASSEMBLY

Structural genomics starts from raw sequence data. Its first step consists of assembling raw sequence fragments into contigs or whole genomes. The complexity of the assembly process depends on the sequencing technique used. Two major sequencing approaches are described below.

For more information see also

http://www.genomenewsnetwork.org/articles/06_00/sequence_primer.shtml

http://www.bio.davidson.edu/courses/genomics/method/shotgun.html

Sequencing and assembly

TOP DOWN SEQUENCING/ DE NOVO ASSEMBLY

The first genome sequencing approach, “top down”, relies on knowing the order of the DNA fragments. To sequence large molecules such as human chromosomes:

1) Individual chromosomes are broken into random fragments of approximately 150 kb.

2) These fragments are then cloned into BACs (bacterial artificial chromosomes, used as vectors).

3) In an intensive but largely automated laboratory procedure, the resulting library is screened for clusters of fragments, called contigs, which have overlapping or common sequences. These contigs are then joined to produce an integrated physical map of the genome based on the order of the BACs. Once the correct map has been identified, unique overlapping clones are chosen for sequencing.

4) These clones are, however, too large for direct sequencing. They are therefore subcloned further into smaller fragments of a size suitable for sequencing (about 500 bp).

5) From the DNA sequences of approximately 500 bp, genome sequences are assembled using the fragment order on the physical map as a guide.


This method of first creating a physical map of the genome and then using this map to guide the sequencing was used by the public Human Genome Project consortium to create a draft of the human genome. This carefully crafted but laborious procedure was a top-down approach that used the physical map at each stage to guide the placement of sequences (Lander, Nature 2001). The reasoning behind this strategy was to avoid the problems caused by sequence repeats, which might otherwise confound obtaining the correct genome sequence.
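The last step of the top-down approach, assembling sequences whose order is already known from the physical map, can be sketched in a few lines. The example below assumes error-free reads that are supplied in map order; the function names `overlap` and `assemble_in_order`, the `min_len` threshold and the toy reads are all hypothetical and only serve the illustration.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if < min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def assemble_in_order(ordered_reads, min_len=3):
    """Merge reads whose left-to-right order is already known (e.g. from a map).

    Each read is joined to the growing contig by its suffix-prefix overlap.
    If no overlap of at least min_len is found, the read is simply appended,
    which in a real project would correspond to a gap to be closed later.
    """
    contig = ordered_reads[0]
    for read in ordered_reads[1:]:
        k = overlap(contig, read, min_len)
        contig += read[k:]
    return contig


reads_in_map_order = ["ATGGCGT", "GCGTACG", "ACGTTAG"]
print(assemble_in_order(reads_in_map_order))  # → ATGGCGTACGTTAG
```

Because the order is given, only neighbouring reads have to be compared, so a repeat elsewhere in the genome cannot mislead the merge.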

Fig. Top-down sequencing: 1. genome fragmentation, 2. BAC library, 3. physical map, 4. subclone library, 5. genome assembly.


SHOTGUN SEQUENCING/ DE NOVO ASSEMBLY

A contrasting “bottom-up” method has been devised in which the genome sequence is derived solely from overlaps among large numbers of random sequences, without using a physical map as a guide. This alternative method, called shotgun sequencing, attempts to assemble a linear map from subclone sequences without knowing their order on the chromosome. Contigs are assembled computationally, based on the alignment of all possible sequence pairs. This method is now routinely used to sequence microbial genomes and the cloned fragments of larger clones (see also metagenomics).
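To make the idea of assembling contigs from all pairwise comparisons concrete, here is a minimal greedy sketch: it repeatedly merges the two reads with the longest exact suffix-prefix overlap. It is not the algorithm of any particular assembler; it assumes error-free reads from a single strand, and the names `overlap`, `greedy_assemble` and `min_len` are hypothetical.

```python
from itertools import permutations


def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if < min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Greedy shotgun assembly: merge the best-overlapping pair until none is left.

    Returns the list of resulting contigs (one contig if everything joins up).
    """
    reads = list(dict.fromkeys(reads))      # drop exact duplicates, keep order
    while len(reads) > 1:
        best_k, best_a, best_b = 0, None, None
        for a, b in permutations(reads, 2):  # all ordered read pairs
            k = overlap(a, b, min_len)
            if k > best_k:
                best_k, best_a, best_b = k, a, b
        if best_k == 0:
            break                            # no overlaps left: separate contigs
        reads.remove(best_a)
        reads.remove(best_b)
        reads.append(best_a + best_b[best_k:])
    return reads


shuffled_reads = ["ACGTTAG", "ATGGCGT", "GCGTACG"]
print(greedy_assemble(shuffled_reads))  # → ['ATGGCGTACGTTAG']
```

Comparing every pair of reads scales quadratically with the number of reads, and a repeated sequence can make the greedy choice wrong, which is one reason practical assemblers use more elaborate strategies.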

The shotgun method was used by Celera Genomics to sequence the human genome (Venter, Science 2001; http://www.jcvi.org/). There has since been controversy as to whether the Venter group's draft of the human genome owed significantly to their use of the public data, or was derived from the overlaps in a highly redundant set of fragments by automatic computational methods alone (the shotgun method).

Fig. Shotgun sequencing: 1. genome fragmentation, 2. library, 3. sequences, 4. genome assembly.


Fuss about the public versus the commercial effort (Lander versus Venter):

http://www.dnalc.org/view/15326-Analysis-in-public-and-private-Human-Genome-Projects-Eric-Lander-.html

For large genomes a combination of top-down and bottom-up sequencing is used, as illustrated below.

NEXT-GENERATION SEQUENCING TECHNOLOGY

With next-generation sequencing technology, however, shotgun sequencing is customarily used. It relies on short reads (Illumina, 150 bp) and on libraries of different insert sizes, e.g. a standard 500 bp library plus a 3 kb library. With the latest PacBio technology very long reads become possible, making shotgun sequencing even more feasible. The figure below shows why assembly can get difficult and why a good design is important. Assembly is complicated by the presence of repeats, which make the alignment between reads ambiguous. The problem is aggravated by the fact that reads can contain errors (some technologies are more error prone than others). Reads generated from libraries with different insert lengths, or long reads, can resolve the ambiguities. In general, the quality of the assembly depends on the coverage and the length of the reads. The longer the reads, the less pronounced the problem of ambiguous alignments between reads. In addition, a high coverage is needed to guarantee sufficient overlap between reads.
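The notion of coverage can be made quantitative with a small calculation. The sketch below computes the average coverage as (number of reads × read length) / genome size and, under the idealised assumption that reads land uniformly at random (a Poisson model that ignores repeats and sequencing errors), the expected fraction of the genome covered by at least one read. The 5 Mb genome and the read count are hypothetical numbers chosen for the example; only the 150 bp read length comes from the text.

```python
import math


def coverage(n_reads, read_length, genome_size):
    """Average number of times each base is sequenced."""
    return n_reads * read_length / genome_size


def expected_fraction_covered(c):
    """Expected fraction of bases hit by at least one read at coverage c.

    Assumes reads are placed uniformly at random (Poisson approximation).
    """
    return 1.0 - math.exp(-c)


# Hypothetical design: a 5 Mb bacterial genome, 150 bp reads.
genome_size = 5_000_000
read_length = 150

for n_reads in (33_000, 330_000, 1_000_000):
    c = coverage(n_reads, read_length, genome_size)
    print(f"{n_reads:>9} reads: coverage {c:5.1f}x, "
          f"expected fraction covered {expected_fraction_covered(c):.4f}")
```

At 1x coverage roughly a third of the genome is still expected to be unsequenced under this model, which illustrates why de novo projects aim for much higher coverage.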


From Nagarajan Nature reviews, 2013

RESEQUENCING/ REFERENCE-BASED ASSEMBLY

Sequencing for de novo assembly is generally expensive, as it requires multiple libraries, high coverage and preferably long reads. Once a good reference genome has been assembled de novo, it can be used as a scaffold (replacing the physical map) to guide reference-based assembly. Reference-based assembly is used for resequencing, e.g. when the genome of a human individual is sequenced to search for genetic variants (mutations compared to the reference genome). Because a reference genome is available, resequencing is cheaper (a lower coverage suffices).
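A toy version of reference-based analysis is sketched below: each read is placed on the reference at the ungapped position with the fewest mismatches, and the remaining differences are reported as candidate variants. Real read mappers use indexed, gapped alignment and real variant callers weigh evidence from many reads; the function names, the `max_mismatches` cut-off and the sequences here are hypothetical.

```python
def best_position(reference, read):
    """Return (position, mismatches) of the best ungapped placement of read."""
    best = None
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(1 for r, q in zip(window, read) if r != q)
        if best is None or mismatches < best[1]:
            best = (pos, mismatches)
    return best


def candidate_variants(reference, reads, max_mismatches=1):
    """Collect (position, reference_base, read_base) differences.

    Reads whose best placement has more than max_mismatches are skipped
    as unreliable. Positions are 0-based on the reference.
    """
    variants = set()
    for read in reads:
        pos, mismatches = best_position(reference, read)
        if mismatches > max_mismatches:
            continue
        for offset, base in enumerate(read):
            if reference[pos + offset] != base:
                variants.add((pos + offset, reference[pos + offset], base))
    return sorted(variants)


reference = "ATGGCGTACGTTAGC"
reads = ["GGCGTAC", "GTACCTT", "CGTTAGC"]   # the second read carries a G>C change
print(candidate_variants(reference, reads))  # → [(9, 'G', 'C')]
```

No overlaps between reads have to be computed here, since the reference already fixes where each read belongs.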


STRUCTURAL ANNOTATION

Once the genome is assembled, structural elements such as genes, introns, exons, splice sites, promoters and repeated elements need to be located in it (the structural analysis). Distinct gene prediction algorithms have been developed. Methods for ab initio gene prediction are based on supervised machine learning techniques(1). A model (e.g. a hidden Markov model or a neural network) is trained on a set of known genes (or promoters or introns) and subsequently used to predict the location of unknown genes (or promoters or introns) in an organism. Features (properties of the genome) that are extracted from the training set and that thus help to recognize genes include, for instance, specific codon usage (which differs between coding and non-coding regions) and splice-site recognition sequences (when predicting splice sites). Because codon usage and splice junctions differ between organisms, a model must be trained for each novel genome.
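The train-then-predict idea can be illustrated with the codon usage feature alone. The sketch below estimates codon frequencies from a few known coding sequences and scores a candidate region by its average log-odds against a uniform background. It is a much simpler model than the hidden Markov models or neural networks mentioned above, and the training sequences, pseudocount and function names are hypothetical.

```python
import math
from collections import Counter
from itertools import product

CODONS = ["".join(c) for c in product("ACGT", repeat=3)]
VALID = set(CODONS)


def codons(seq):
    """Split a sequence into in-frame codons, dropping an incomplete last one."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def train_codon_model(coding_seqs, pseudocount=1.0):
    """Estimate codon probabilities from known genes (with pseudocounts)."""
    counts = Counter()
    for seq in coding_seqs:
        counts.update(c for c in codons(seq) if c in VALID)
    total = sum(counts[c] for c in CODONS) + pseudocount * len(CODONS)
    return {c: (counts[c] + pseudocount) / total for c in CODONS}


def coding_score(seq, model):
    """Average log-odds per codon versus a uniform background (1/64)."""
    background = 1.0 / len(CODONS)
    cs = [c for c in codons(seq) if c in model]
    if not cs:
        return 0.0
    return sum(math.log(model[c] / background) for c in cs) / len(cs)


# Hypothetical training set of known coding sequences.
known_genes = ["ATGGCTGCTGAAGCTTAA", "ATGGAAGCTGCTGAATAA"]
model = train_codon_model(known_genes)

print(coding_score("GCTGAAGCT", model))   # positive: resembles the training genes
print(coding_score("CCCGGGTTT", model))   # negative: codons never seen in training
```

A model trained on one organism would score the genes of another organism poorly if their codon preferences differ, which is the reason for retraining per genome.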

Once the complete genome is known, genome maps can be constructed that indicate the position of each gene on the genome. Comparing the gene maps of different organisms allows the identification of translocations or other chromosomal rearrangements (important in cancer research).

Fig. genome map of the bacterium A. tumefaciens
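Comparing gene maps can be sketched as a search for breakpoints: pairs of genes that are neighbours in one genome but not in the other. The example below assumes both maps contain the same genes, each exactly once, and that gene names are shared between the two organisms; the gene labels and the function name `breakpoints` are hypothetical.

```python
def breakpoints(order_a, order_b):
    """Return adjacent gene pairs of order_b that are not adjacent in order_a.

    Assumes both maps list the same genes once each; orientation is ignored.
    """
    position_in_a = {gene: i for i, gene in enumerate(order_a)}
    shared = [g for g in order_b if g in position_in_a]
    result = []
    for left, right in zip(shared, shared[1:]):
        if abs(position_in_a[left] - position_in_a[right]) != 1:
            result.append((left, right))
    return result


genome_a = ["g1", "g2", "g3", "g4", "g5", "g6"]
genome_b = ["g1", "g2", "g5", "g4", "g3", "g6"]   # block g3..g5 is inverted

print(breakpoints(genome_a, genome_b))  # → [('g2', 'g5'), ('g3', 'g6')]
```

The two breakpoints flank the inverted block, so their positions point to where the rearrangement occurred.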


BIOLOGICAL APPLICATION: GENOME SEQUENCING

GENOME SEQUENCES OF MODEL ORGANISMS

The first bacterial genome to be sequenced was that of Haemophilus influenzae (sequenced by the TIGR

institute (http://www.tigr.org) in 1995). The success of sequencing this genome in relatively short time

heralded the sequencing of a large number of additional prokaryotic organisms. To data the genomes 96 of

these species have been sequenced among which the model organisms E. coli and B. subtilis.

Later on, eukaryotic genomes were sequenced. In 2001, draft sequences of the human genome were published in parallel by two distinct research groups: the commercial company Celera and an academic sequencing consortium (including the Sanger Centre). Nowadays the sequences of several eukaryotic model organisms have been determined and the number of sequenced genomes is steadily increasing.

Microbial genomes http://www.ncbi.nlm.nih.gov/genomes/static/micr.html

Genome resources at ncbi:

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Genome

Vertebrate genomes:

http://www.ncbi.nlm.nih.gov/mapview/map_search.cgi?taxid=7227
http://www.ncbi.nlm.nih.gov/genome/guide/human/
http://www.ncbi.nlm.nih.gov/genome/guide/mouse/index.html
http://www.ncbi.nlm.nih.gov/genome/guide/rat/index.html
http://www.ncbi.nlm.nih.gov/genome/guide/zebrafish/index.html
http://www.ensembl.org/

yeast genomes

http://www.ncbi.nlm.nih.gov/mapview/map_search.cgi?chr=scerevisiae.inf

Plant genomes:

http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/Resources_1.html#arab

From Nature Reviews Genetics 4, 251-262 (2003);

GENOME SEQUENCES AT SPECIES LEVEL

In the early days most genomes were sequenced by the classical Sanger sequencing approach (see figure below), but nowadays next-generation sequencing (NGS) has taken over. Developments in nanotechnology in particular have given rise to novel technologies for sequencing and synthesizing DNA. Next-generation sequencing can process millions of sequence reads in parallel rather than 96 at a time. All NGS platforms share a common technological feature: massively parallel sequencing of clonally amplified or single DNA molecules that are spatially separated in a flow cell (for a review see Metzker, M.L. (2010) Nature Reviews Genetics 11:31-46) (see figure below). This design is a paradigm shift from that of Sanger sequencing, which is based on the electrophoretic separation of chain-termination products produced in individual sequencing reactions.

Fig. Sanger sequencing methodology: The dideoxynucleotide termination DNA sequencing technology

invented by Fred Sanger and colleagues in 1977, formed the basis for DNA sequencing from its inception

through 2004. Originally based on radioactive labeling, the method was automated by the use of fluorescent

labeling coupled with excitation and detection on dedicated instruments, with fragment separation by slab

gel and ultimately by capillary gel electrophoresis.

Overview of next generation sequencing technologies.

For instance Illumina:

http://www.illumina.com/documents/products/techspotlights/techspotlight_sequencing.pdf

The breakthroughs in these technologies are unprecedented and follow Moore's law. For sequencing technology, it is expected that within a few years we will have the 1000 dollar genome, meaning that the genome of a human can be sequenced within a few hours for 1000 dollars.

(For comparison: the human genome is 3.4 Gb = 3.4 billion base pairs and contains 20,000-25,000 genes.)

As a result, recent sequencing projects have started to focus on sequencing different individuals of the same species (e.g. the 1000 Genomes Project, http://www.1000genomes.org/) rather than only sequencing representatives of different species. This has been made possible by the lower cost of next-generation sequencing approaches. It opens novel perspectives for, amongst others, personalized medicine, sequence-based trait selection and evolution experiments.

The ‘1000 genomes project’

The 1000 Genomes Project is the first project to sequence the genomes of a large number of people, to

provide a comprehensive resource on human genetic variation. As with other major human genome

reference projects, data from the 1000 Genomes Project will be made available quickly to the worldwide

scientific community through freely accessible public databases.

The goal of the 1000 Genomes Project is to find the genetic variants that have frequencies of at least 1% in

the populations studied. This goal can be attained by sequencing many individuals lightly. To sequence a

person's genome, many copies of the DNA are broken into short pieces and each piece is sequenced. The

many copies of DNA mean that the DNA pieces are more-or-less randomly distributed across the genome.

The pieces are then aligned to the reference sequence and joined together. To find the complete genomic

sequence of one person with current sequencing platforms requires sequencing that person's DNA the

equivalent of about 28 times (called 28X). If the amount of sequence done is only an average of once across

the genome (1X), then much of the sequence will be missed, because some genomic locations will be covered

by several pieces while others will have none. The deeper the sequencing coverage, the more of the genome

will be covered at least once. Also, people are diploid; the deeper the sequencing coverage, the more likely

that both chromosomes at a location will be included. In addition, deeper coverage is particularly useful for

detecting structural variants, and allows sequencing errors to be corrected.
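The relation between sequencing depth and the fraction of the genome that is actually covered can be made concrete with a small calculation. The sketch below assumes that reads land uniformly at random, so that the number of reads covering a position follows a Poisson distribution with the average coverage as its mean. This is an idealization that is not stated in the text above; real data deviate from it (e.g. through sequence-dependent biases). The function name and the chosen depths are illustrative.

```python
import math

def fraction_covered(coverage, min_depth=1):
    """Expected fraction of positions covered by at least `min_depth` reads,
    assuming the read depth per position is Poisson-distributed with mean `coverage`."""
    p_below = sum(math.exp(-coverage) * coverage ** k / math.factorial(k)
                  for k in range(min_depth))
    return 1.0 - p_below

for c in (1, 4, 28):
    print(f"{c:>2}X: >=1 read {fraction_covered(c):.4f}, "
          f">=2 reads {fraction_covered(c, 2):.4f}")
```

Under this assumption about 63% of the positions are covered at least once at 1X, so more than a third of the genome is missed, while at 28X essentially every position is covered several times.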

The data now available to scientists contains 99% of all genetic variants that occur in the populations studied,

down to the level of rare variations that only occur in 1 out of every 100 people. "The whole point of this

resource is that we're moving to a point where individuals are being sequenced in clinical settings and what

you want to do there is sift through the variants you find in an individual and interpret them," said Professor

Gil McVean of Oxford University, a lead author for the study.

The information will be pored over by thousands of researchers, who will analyze and interpret the DNA

variations between people in a bid to work out which ones are implicated in disease. In addition to the DNA

sequences, the 1,000 Genomes Project has stored cell samples from all the people it has sequenced, to allow

future scientific projects to look at the biological effect of the DNA variations they might want to study. One way to do this is through a genome-wide association study (GWAS).

Genome wide association studies (GWAS): Any two human genomes differ in millions of different ways.

There are small variations in the individual nucleotides of the genomes (SNPs) as well as many larger

variations, such as deletions, insertions and copy number variations. Any of these may cause alterations in

an individual's traits, or phenotype, which can be anything from disease risk to physical properties such as

height. In a genetic association study one asks whether an allele of a genetic variant is found more often in individuals with the phenotype of interest (e.g. with the disease being studied) than in individuals without it.

Fig. Overview of a genome-wide association study, from W. Gregory Feero et al. (2010), The New England Journal of Medicine.

The most common approach in GWA studies is the case-control setup, which compares two large groups of individuals: one healthy control group and one case group affected by a disease. All individuals in each group are genotyped for the majority of common known SNPs. The exact number of SNPs depends on the genotyping technology, but is typically one million or more. For each of these SNPs it is then investigated whether the allele frequency differs significantly between the case and the control group. In such setups, the fundamental unit for reporting effect sizes is the odds ratio. The odds ratio is the ratio between two odds, which in the context of GWA studies are the odds of carrying a specific allele in the case group and the odds of carrying the same allele in the control group. When the allele frequency in the case group is much higher than in the control group, the odds ratio will be higher than 1, and vice versa for a lower allele frequency. Additionally, a P-value for the significance of the odds ratio is typically calculated. Finding odds ratios that are significantly different from 1 is the objective of the GWA study, because this shows that a SNP is associated with the disease (see course Integrative biology).
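For a single SNP, the calculation described above can be sketched as follows. The allele counts are made up, and Pearson's chi-square test on the 2x2 table is used here as one common choice of significance test; the text does not prescribe a specific test, so this choice and the function names are assumptions.

```python
import math

def odds_ratio(case_allele, case_other, ctrl_allele, ctrl_other):
    """Odds of the allele among cases divided by odds of the allele among controls."""
    return (case_allele / case_other) / (ctrl_allele / ctrl_other)

def chi2_p_value(a, b, c, d):
    """P-value of Pearson's chi-square test (1 degree of freedom) on a 2x2 table."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return math.erfc(math.sqrt(chi2 / 2))  # survival function of chi-square, 1 d.o.f.

# Hypothetical allele counts for one SNP (made-up numbers)
cases = (600, 1400)      # (allele of interest, other allele)
controls = (450, 1550)

print(odds_ratio(*cases, *controls))
print(chi2_p_value(*cases, *controls))
```

In a real study this test is repeated for every genotyped SNP, so the P-value threshold has to be corrected for the very large number of tests performed.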

COMPARATIVE GENOMICS

OVERVIEW

The basic idea of comparative genomics is the comparison of sequences between genomes. Sequence alignment methodologies form the basic tools of comparative genomics (BLAST, ClustalW, Needleman-Wunsch, Markov models…) (see chapter sequence alignment).

VALIDATE GENE PREDICTION

1) Comparative genomics can be used to aid or validate gene predictions: Since gene prediction methods

based on sequence features only (ab initio gene prediction) are only partially accurate, gene identification is

facilitated by high-throughput sequencing of partial cDNA copies of expressed genes (called expressed sequence tags or ESTs). The presence of ESTs confirms that the predicted gene is transcribed. More thorough sequencing of full-length cDNA clones may be necessary to confirm the structure of genes. Gene prediction methods that take into account not only sequence features (codon usage, intron-exon recognition sites) but also sequence homology (with ESTs, cDNAs, proteins) are called extrinsic gene finding methods (and are in fact a combination of structural and comparative genomics). An example is the Genscan method discussed in Ensembl.

HOMOLOGY BASED ANNOTATION:

The amino acid sequence of proteins encoded by the predicted genes can be used as a query sequence in a

database similarity search. A match of a predicted protein sequence to one or more database sequences not

only serves to validate the gene prediction, but also can give indications on the function of the gene.

Not all genes will give hits in database searches. Some proteins might be unique for a certain organism or

might not have been characterized before. In such cases it might also be important to search for characteristic

domains (conserved amino acid patterns that can be aligned) that represent a structural fold or a biochemical

feature (see chapter pattern searches).

STUDYING GENE FAMILIES

Another important goal of comparative genomics is the study of protein families: proteins come in families. In theory all species inherit their genetic material from a common ancestor. In the simplest scenario, all descendants have the same number of genes, and each species has one copy of the same ancestral gene that underwent changes during evolution (such genes are referred to as orthologs). Orthologs are genes that are so highly conserved in sequence in different genomes that the proteins they encode are strongly predicted to have the same structure and function and to have arisen from a common ancestor through speciation. Some parts of the gene/protein that are under selective pressure will not tolerate changes during evolution and will remain conserved. These usually correspond to the functional regions of the protein. Multiple alignment can thus be used to infer the important domains in a protein.
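How conserved regions can be read from a multiple alignment is shown in the minimal sketch below. It assumes that the alignment has already been computed (e.g. with a tool such as ClustalW) and simply scores every column by the fraction of sequences that share the most common residue. The scoring scheme, the thresholds and the toy sequences are assumptions for illustration; real methods use more refined conservation scores.

```python
from collections import Counter

def column_conservation(alignment):
    """Fraction of sequences sharing the most common residue in each alignment column.
    Gap characters ('-') are counted as mismatches."""
    length = len(alignment[0])
    assert all(len(s) == length for s in alignment), "rows must have equal length"
    scores = []
    for col in zip(*alignment):
        residues = Counter(r for r in col if r != "-")
        top = residues.most_common(1)[0][1] if residues else 0
        scores.append(top / len(col))
    return scores

def conserved_blocks(scores, threshold=1.0, min_len=3):
    """Return (start, end) index pairs (end exclusive) of runs with score >= threshold."""
    blocks, start = [], None
    for i, s in enumerate(scores + [-1.0]):      # sentinel closes a trailing run
        if s >= threshold and start is None:
            start = i
        elif s < threshold and start is not None:
            if i - start >= min_len:
                blocks.append((start, i))
            start = None
    return blocks

# Hypothetical, already-aligned orthologous protein fragments
aln = ["MKTLHEDGCAA",
       "MRTLHEDGC-S",
       "MQSLHEDGCTA"]
scores = column_conservation(aln)
print(conserved_blocks(scores))
```

In this toy alignment the only fully conserved stretch of at least three columns is the block LHEDGC, which would be the candidate functional region.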

However, during evolution part of a genome (or a complete genome) can sometimes duplicate. When multiple homologs of the same gene are present in one genome, we refer to them as paralogs.

Although the definition of paralogs and orthologs is in theory straightforward their occurrence in genomes

can give rise to complex evolutionary relations within a gene family (see figure). Genes in two species that

have directly evolved from a single gene in the last common ancestor (orthologs by definition) are most likely

to share the function. However, often, these sequences have duplicated after the speciation event (i.e. after

the two species diverged from each other). In this case one-to-many or many-to-many relationships between

genes originate. In such cases, it is non-trivial to determine which of the orthologs is functionally equivalent

to the ortholog in the other species. It may be only one, but several genes could also have redundant

functions (especially if the duplication event took place recently).

Fig. Example of a cluster of a protein family (sigma factors in E. coli). Orthologs are often determined as reciprocal best

blast hits.

This is an important issue to account for when attempting to extrapolate the function of a gene to its

homologs in other species.
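The reciprocal best hit criterion mentioned in the figure legend above can be sketched in a few lines. The sketch assumes that all genes of species A have been compared against all genes of species B and vice versa, and that the similarity scores are available as nested dictionaries. The gene names and bit scores are hypothetical, and the functions are not part of BLAST itself.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring hit.
    `scores` is {query: {subject: similarity score}}."""
    return {q: max(hits, key=hits.get) for q, hits in scores.items() if hits}

def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)

# Hypothetical bit scores between genes of species A and species B
a_vs_b = {"A1": {"B1": 950, "B2": 400},
          "A2": {"B1": 380, "B2": 310},
          "A3": {"B3": 120}}
b_vs_a = {"B1": {"A1": 940, "A2": 390},
          "B2": {"A1": 410, "A2": 300},
          "B3": {"A3": 115}}

print(reciprocal_best_hits(a_vs_b, b_vs_a))
```

Note that A2 and B2 are not assigned to any pair: both have A1 or B1 as their best hit, which illustrates why the reciprocal best hit criterion is too simple when duplications have created one-to-many or many-to-many relationships.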

Fig. A schematic gene tree showing speciation (S), duplication (D) and loss events (L). Human1, Mouse 1, Human 2 and

Worm 1 are all orthologs as the last common event in their evolution was a speciation. Human 1 and Human 2 are

paralogs as the last event in their common evolution was a duplication. Adapted from (Alexeyenko et al., 2006). In Figure

1 Human 1 and Human 2 are out-paralogs relative to mouse since they resulted from a duplication event that occurred

before their speciation with Mouse.

Identifying conserved regions in multiple sequence alignments to infer functional importance does not have

to be limited to the coding region and can equally well be applied to non-coding regions, in which case we

refer to Phylogenetic footprinting. Functional regions here often refer to regulatory motifs. The principle is

the same as for protein coding regions except that different types of alignment procedures need to be used.

BIOLOGICAL APPLICATION 1: EVOLUTIONARY BIOLOGY

With all these genome sequences at hand, comparative genomics allows studying our own evolution.

In September 2005, an international team published the genome of our closest relative, the chimpanzee. With the human genome already in hand, researchers could begin to line up chimp and human DNA and examine, one by one, the 40 million evolutionary events that separate chimps from us. The genome data confirm our close kinship with chimps: we differ by only about 1% in the nucleotide bases that can be aligned between our two species, and the average protein differs by less than two amino acids.

Given the dramatic behavioral and developmental differences that have arisen since their divergence from a

common ancestor 6-7 million years ago, the question arises of how these phenotypic differences are

reflected at the genome sequence level.

Recent studies have shown that mainly genes involved in smell and hearing are significantly different

between humans and chimpanzees.

Also changes in gene regulatory binding sequences (promoters, enhancers, and silencers) are likely to have

contributed to divergence between humans and chimps. Using a comparative approach, it has been shown

that regulatory binding sites lost in human, but still present in chimp are located in specific genomic regions

and are associated with genes involved in sensory perception.

In this figure, two examples are given of regulatory binding sites that changed between human and chimp.

Note how small differences in sequences can have such large phenotypic influences.

BIOLOGICAL APPLICATION 2: STUDYING GENOME EVOLUTION

Studying genomes has revealed that during evolution genes, and sometimes also full genomes, can duplicate, giving rise to multiple copies of the same gene in a single species (paralogs).

These duplicates are often the source of novel genetic material on which evolution can act. The most likely fate of a duplicated gene is gene loss; indeed, most duplications are deleterious. In that case redundant genes are removed from the genome or evolve into pseudogenes.

However, through mutation and natural selection, one of the copies can develop a new function, leaving the second copy to cover the original function (evolution can experiment because there is a spare copy). If paralogs (duplicated genes) coexist during evolution, they therefore usually have divergent or complementary functions. Either their proteins become involved in novel biochemical pathways or the paralogs have distinct expression domains. The latter means, for example, that the original gene was expressed in the roots and leaves of a plant, but that after duplication one copy is expressed exclusively in the roots while the other copy becomes responsible for expression in the leaves. Both copies become complementary because their expression domains differ (which can be caused by mutations in the regulatory elements). This phenomenon is referred to as subfunctionalization.

Fig. Gene duplication over time: an ancestral gene with Function 1 is duplicated into Copy 1 and Copy 2. Initially both copies have Function 1; later one copy retains Function 1 while the other acquires a new Function 2.

By comparing genomes against themselves, researchers have found that all flowering plants underwent 3 full

genome duplication events.

Because a multiplication level of 8 was observed for some genes, the occurrence of 3 whole-genome duplication events was assumed (1->2->4->8 copies).

These whole-genome duplications could be dated, and the oldest could be mapped to the fossil record. This duplication coincided with the origin and specialization of flowering plants during the Early Cretaceous.

Fossil records of plant evolution

The extant (or modern) angiosperms, i.e. flowering plants did not appear until the Early Cretaceous (145–

125 Mya), when the final combination of angiosperm features occurred, as supported by evidence from

micro- and macrofossils. During the Aptian, 125–112Mya (Figure below) species diversity was low and pollen

and megafossils were rare components of terrestrial floras. Angiosperm fossils show a dramatic increase in

diversity between the Albian (112–99.6 Mya) and the Cenomanian (99.6–93.5 Mya) at a global scale (Figure

1). The angiosperm radiation yielded species with new growth architectures and new ecological roles. Early

angiosperms had small flowers with a limited number of parts that were probably pollinated by a variety of

insect taxa but specialized for none. Accordingly, Cenomanian flowers do not yet provide strong evidence for

specialization of pollination syndromes. However, by the Turonian (93.5–89.3 Mya), flowering plants had a

wide variety of features that are, in extant species, closely associated with several types of specialized insect

pollination and with high species diversity within angiosperm subclades. The evolution of larger seed size in many angiosperm lineages during the early Cenozoic (from 65 Mya) points to animal-mediated dispersal and shade-tolerant life-history strategies. In summary, fossils with affinities to diverse angiosperm lineages, including monocots, are all found in Early Cretaceous floras. However, the question remains why this was such a decisive time in the evolution of plants. Could whole-genome duplication events have had a key role in the origin of angiosperms and their morphological and ecological diversification?

Evidence:

o Many angiosperms have experienced one or more episodes of polyploidy in their ancestry.

o Duplicated genes and genomes can provide the raw material for evolutionary diversification and the functional divergence of duplicated genes.

o The dates of the duplication events correspond to time periods of large expansion in angiosperms as recorded in the fossil record.

Bioinformatics methods used

Field of comparative genomics and phylogeny. Methodologies mainly based on sequence alignment and

phylogenetic tree construction.

BIOLOGICAL APPLICATION 3: METAGENOMICS (G. VENTER)

METAGENOMICS: DNA SEQUENCING OF ENVIRONMENTAL SAMPLES

Nature Reviews Genetics 6, 805-814 (2005); doi:10.1038/nrg1709

Although genomics has classically focused on pure, easy-to-obtain samples, such as microbes that grow

readily in culture or large animals and plants, these organisms represent only a fraction of the living or once-

living organisms of interest. Many species are difficult to study in isolation because they fail to grow in

laboratory culture, depend on other organisms for critical processes, or have become extinct. Methods that

are based on DNA sequencing circumvent these obstacles, as DNA can be isolated directly from living or dead

cells in various contexts. Such methods have led to the emergence of a new field, which is referred to as

metagenomics.

DNA sequencing can provide insights into organisms that are difficult to study because they are inaccessible by conventional methods such as laboratory culture. Examples include organisms that exist only in tight association with other organisms (various obligate symbionts and pathogens), members of natural microbial consortia, and an extinct cave bear.

Isolation and sequencing of DNA from mixed communities of organisms (metagenomics) has

revealed surprising insights into diversity and evolution.

Partially assembled or unassembled genomic sequence from complex microbial communities has

revealed the existence of novel and environment-specific genes.

The application of high-throughput shotgun sequencing to environmental samples has recently provided global

views of those communities not obtainable from 16S rRNA or BAC clone–sequencing surveys. The sequence

data have also posed challenges to genome assembly, which suggests that complex communities will demand

enormous sequencing expenditure for the assembly of even the most predominant members.

However, for metagenomic data, this complete assembly may not always be necessary or feasible.

Determining the proteins encoded by a community, rather than the types of organisms producing them,

suggests a means to distinguish samples on the basis of the functions selected for by the local environment

and reveals insights into features of that environment.

For instance, examination of higher-order processes reveals known differences in energy production (e.g., photosynthesis in the oligotrophic waters of the Sargasso Sea and starch and sucrose metabolism in soil) or in population density and interspecies communication (overrepresentation of conjugation systems, plasmids, and antibiotic biosynthesis in soil) (Fig. 4, lower left). The predicted metaproteome, based on such fragmented metagenomic analyses, could be used to predict features of the sampled environments such as energy sources or even pollution levels.
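Comparing samples by the functions they encode, rather than by the organisms present, can be sketched as follows. The sketch assumes that each predicted protein has already been assigned a functional category (for instance through a database similarity search). The category labels, the counts and the function names are invented for illustration and do not come from the studies cited above.

```python
from collections import Counter

def functional_profile(annotations):
    """Relative frequency of each functional category among predicted proteins."""
    counts = Counter(annotations)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}

def enrichment(profile_a, profile_b, floor=1e-6):
    """Ratio of category frequencies (sample A / sample B), highest first.
    `floor` avoids division by zero for categories absent from one sample."""
    categories = set(profile_a) | set(profile_b)
    ratios = {c: max(profile_a.get(c, 0.0), floor) / max(profile_b.get(c, 0.0), floor)
              for c in categories}
    return sorted(ratios.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical category labels for proteins predicted from two samples
sea = ["photosynthesis"] * 30 + ["transport"] * 50 + ["starch_metabolism"] * 20
soil = ["photosynthesis"] * 5 + ["transport"] * 45 + ["starch_metabolism"] * 50

result = enrichment(functional_profile(sea), functional_profile(soil))
for category, ratio in result:
    print(f"{category:<18} sea/soil = {ratio:.2f}")
```

Categories with a ratio far from 1 are the ones that distinguish the two environments; no genome needs to be assembled to obtain such a functional fingerprint.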

Metagenomics databases are currently being set up, for instance:

http://www.megx.net/index.php?navi=EasyGenomes

EXAMPLE G. VENTER SARGASSO SEA

Boston (04/16/04)—This Spring, J. Craig Venter is sailing around the French Polynesian Islands scooping up

bucketfuls (figuratively) of seawater in an ambitious voyage to sample microbial genomes found in the

world's oceans. His 95-foot yacht, Sorcerer II, has been outfitted with all manner of technical equipment to

accommodate the task, as well as a few surfboards should that opportunity arise.

Venter and colleagues report finding 1.2 million genes, including almost 70,000 entirely novel genes, from an

estimated 1,800 genomic species, including 148 novel bacterial phylotypes. This diversity is staggering and

to a large extent unexpected. "We chose the Sargasso seas because it was supposed to be a marine desert,"

says Venter wryly. "The assumption was low diversity there because of the extremely low nutrients." His

team sequenced a total of 1.045 billion base pairs of non-redundant sequence. At the height of the work,

"over 100 million letters of genetic code were sequenced every 24 hours." The results have been deposited

in GenBank. You can go and search for them.

PALEOGENOMICS

Mammoth genome

A very recent application is the use of metagenomics approaches to sequence the mammoth genome:

Usually mitochondrial genomes are sequenced from extinct species, as mitochondrial DNA is abundantly present in eukaryotic cells and thus easier to sequence. In permafrost settings, theoretical calculations predict DNA fragment survival up to 1 million years (11, 12). When preserved in such conditions, sequencing of genomic DNA is still possible.

1 g of bone was used to extract DNA, which was subsequently used for library construction and sequenced with a sequencing technology that had recently become available (13, 19).

The mammalian fraction dominated the identifiable fraction of the metagenome.

Nonvertebrate eukaryotic and prokaryotic species occur at approximately equal ratios, with a paucity of fungal species and nematodes.

Hits against grass species outnumber those from Brassicales by a ratio of 3:1, which could be indicative of ancient pastures on which the mammoth is believed to have grazed.

From Poinar et al., Science 2006.

Ancient salt crystals

Bacteria have been found associated with a variety of ancient samples; however, few studies are generally accepted due to questions about sample quality and contamination. In 1995, Cano and Borucki isolated a strain of Bacillus sphaericus from an extinct bee trapped in 25-30 million-year-old amber. More recently, the isolation of a 250 million-year-old halotolerant bacterium from a primary salt crystal has been reported. Halite crystals from the dissolution pipe at the 569 m level of the Salado Formation were taken for sampling. A fluid volume of 9 µl was recovered from an inclusion in the crystal and inoculated into two different media: casein-derived amino acids medium (CAS) and glycerol-acetate medium (GA). Only the CAS enrichment yielded bacteria, designated 2-9-3.

Once the bacterium has been isolated, the next step in the research is its taxonomic classification. Two important genotypic markers widely used in bacterial taxonomy are 16S rRNA gene sequence data and DNA-DNA hybridization data. Many researchers have reported a correlation between 16S rRNA gene sequence similarity values and genomic DNA relatedness. It has been proposed that phenotypically related bacterial strains showing 70% or greater genomic DNA relatedness constitute a single bacterial species. In contrast, those having <70% but >20% similarity are considered to be different species within a genus.

These analyses showed that the organism was most similar to Bacillus marismortui (99% similarity, S) and Virgibacillus pantothenticus (97.5% S). Phylogenetic analysis showed that isolate 2-9-3 is part of a distinct lineage within the larger Bacillus cluster.

Additional info

Metagenomics and industrial applications. Nat Rev Microbiol. 2005 Jun;3(6):510-6. Review.

The metagenomics of soil. Nat Rev Microbiol. 2005 Jun;3(6):470-8. Review.

http://www.bio-itworld.com/news/041604_report4889.html

The gut microbiome project

http://en.wikipedia.org/wiki/Microbiome

FUNCTIONAL GENOMICS & SYSTEMS BIOLOGY

SYSTEMS BIOLOGY

Systems biology is a field that originated in the early 1990s: it stems from ‘molecular biology’ but reflects a novel holistic way of

thinking: understanding complex biological phenomena in their entirety. In systems biology a cell is

considered as a system that interacts with its environment. It receives dynamically changing environmental

cues and transduces these signals into the observed behavior (phenotype or dynamically changing

physiological responses). This signal transduction is mediated by the regulatory network (below). Genetic

entities (proteins), located on top of a regulation cascade, are activated by external cues. They further

transduce the signal downstream in the cascade via protein-protein interactions, chemical modifications of

intermediate proteins, etc into transcriptional activation and subsequent translation. Ultimately, these

processes turn the genetic code into functional entities, the proteins. The action of regulatory networks

determines how well cells can react or adapt to novel conditions. This signaling network in a cell can be

compared with the electronic circuitry on a microchip. It also consists of individual components (often called

modules). Systems biology is the science that tries to decode the design principles of biological systems. It

can be used for both fundamental and applied purposes. A typical example of a fundamental application is

the domain of evolutionary systems biology which has as a goal studying the impact of network rewiring on

adaptive behavior and organism evolution.

Figure: The cell as a signal transduction system. The signaling circuitry can be considered as having a modular composition, in which each part forms an individual functional unit.
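To make the idea of a signal travelling down a regulatory cascade more concrete, the network can be sketched as a directed graph in which each component activates its downstream targets. The example below is only an illustration: the component names (`receptor`, `kinase_A`, `tf_1`, ...) and the wiring are hypothetical and do not describe a real pathway, and the traversal ignores kinetics, inhibition and feedback.

```python
from collections import deque

# Hypothetical toy network (an assumption for illustration only):
# each key activates the components listed as its value.
REGULATORY_NETWORK = {
    "receptor": ["kinase_A"],
    "kinase_A": ["kinase_B", "tf_1"],
    "kinase_B": ["tf_2"],
    "tf_1": ["gene_X"],
    "tf_2": ["gene_X", "gene_Y"],
    "gene_X": [],
    "gene_Y": [],
}


def propagate_signal(network, start):
    """Return all components reached from `start`, in breadth-first order."""
    reached = [start]
    seen = {start}
    queue = deque([start])
    while queue:
        component = queue.popleft()
        for target in network.get(component, []):
            if target not in seen:
                seen.add(target)
                reached.append(target)
                queue.append(target)
    return reached


if __name__ == "__main__":
    # An external cue activates the component at the top of the cascade.
    for component in propagate_signal(REGULATORY_NETWORK, "receptor"):
        print(component)
```

Starting the traversal at an intermediate component (for example `kinase_B`) shows which part of the network lies downstream of it, which is one simple way to think about a module.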

BIOLOGICAL APPLICATION: SYNTHETIC BIOLOGY FROM AN ENGINEERING POINT OF VIEW:

RATIONAL DESIGN

Synthetic biology best reflects the paradigm shift in biological thinking that occurred during the last decades. It reflects how biology is approached from a computational/engineering point of view.

The major difference between ‘genetic engineering/biotechnology’ and ‘synthetic biology’ is a novel mindset: the idea of rational design. Synthetic biology relies on the identification, reuse or adaptation of existing parts of systems to construct reduced systems tailored to an aim whose starting assumptions might be very different from those of the natural system. The idea of using parts stems from the parallel between electronic circuits and biological systems. Each component of the system can be seen as an individual transistor. By combining the different signaling components, a circuit can be designed that has operational characteristics, or that gives rise to functionalities, which do not occur as such in nature.

The premise of synthetic biology is thus built on the modularity of signal transduction pathways. This modularity allows artificial biological systems to be constructed and synthesized by combining “microchip design principles” with libraries of molecular modules to obtain a desired microbial functionality. According to this vision, synthetic biology should be able to rely on a list of standardized parts (amino acids, bases, proteins, genes, circuits, cells, etc.) whose properties have been characterized quantitatively, and on software modeling tools that help put parts together to create a new biological function.

The idea behind the MIT ‘Registry of Standard Biological Parts’ (http://parts.mit.edu) is that, as more libraries of parts are constructed, and provided that all these parts are well documented and standardized, one can eventually select the appropriate part directly from the library. The tedious step of making a mutant library, or of synthesizing all possible sequences and characterizing their input and output characteristics, can then be omitted.
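Selecting a part from a well-characterized library can be pictured as a simple lookup. The sketch below is hypothetical throughout: the `Part` record, the part names and the `strength` values (in arbitrary units) are assumptions made for illustration, and it does not use or represent the actual data model or interface of the Registry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Part:
    """A hypothetical standardized part with one quantitative property."""
    name: str
    kind: str        # e.g. "promoter" or "rbs"
    strength: float  # assumed characterization, arbitrary units


# Invented example library; the values are not measured data.
LIBRARY = [
    Part("P_weak", "promoter", 0.1),
    Part("P_medium", "promoter", 1.0),
    Part("P_strong", "promoter", 5.0),
    Part("RBS_std", "rbs", 1.0),
]


def select_part(library, kind, target_strength):
    """Return the part of the given kind whose strength is closest to the target."""
    candidates = [part for part in library if part.kind == kind]
    if not candidates:
        raise ValueError(f"no part of kind {kind!r} in the library")
    return min(candidates, key=lambda part: abs(part.strength - target_strength))


if __name__ == "__main__":
    chosen = select_part(LIBRARY, "promoter", target_strength=0.8)
    print(chosen.name)
```

The point of the sketch is that, once parts are documented with quantitative properties, choosing one becomes a query rather than an experiment; in practice a part would be described by many more properties than a single number.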

Currently, the Registry is a collection of ~3200 genetic parts that can be mixed and matched to build synthetic biology devices and systems. Founded in 2003 at MIT, the Registry is part of the Synthetic Biology community's efforts to make biology easier to engineer. It provides a resource of available genetic parts to iGEM teams and academic labs.

Current challenges in synthetic biology:

The premise of synthetic biology is built on the modularity of signal transduction pathways. Artificial biological systems are synthesized by combining parts with desired functionalities and kinetic behavior, as predicted by a model-based design. However, generating parts with the proper characteristics is still very laborious and ad hoc: large libraries are made randomly (see figure below), and all parts within such a library need to be characterized experimentally.

A fundamental systems understanding of how regulation or a certain kinetic behavior is encoded could further rationalize the design of modules and contribute to better standardization. Such understanding concerns the key features in the primary sequence that drive a specific expression behavior: motifs, motif spacing, nucleosome positioning, etc.

Modeling is the key to both systems and synthetic biology (see figure below). For systems biology, modeling aims at a fundamental understanding of the host cellular behavior, while for synthetic biology ‘model-based design’ is used to determine the circuit topology and its parameters subject to predefined design requirements. Such design requirements should consider not only the desired input/output characteristics (linear, oscillating behavior, bistability), but also the ease with which certain parts can be manipulated in the lab. A challenging task is building design models that determine design parameters conditioned on systems properties of the global cellular system.
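A minimal sketch of what ‘model-based design’ can mean in practice is a parameter scan against a design requirement. The model below is an assumption chosen for illustration and is not taken from the course text: a single protein x that activates its own production, described by dx/dt = alpha + beta·x^n/(K^n + x^n) − gamma·x. The design requirement is bistability, which for this one-dimensional model corresponds to three steady states (two stable, with an unstable one in between). All parameter values are invented.

```python
def net_rate(x, alpha, beta, K=1.0, n=2, gamma=1.0):
    """dx/dt for the assumed self-activating gene: production minus decay."""
    return alpha + beta * x**n / (K**n + x**n) - gamma * x


def count_steady_states(alpha, beta, K=1.0, n=2, gamma=1.0, steps=10000):
    """Count sign changes of dx/dt on a grid over [0, x_max].

    Production never exceeds alpha + beta, so dx/dt is negative beyond
    (alpha + beta) / gamma and no steady state can lie past x_max.
    """
    x_max = (alpha + beta) / gamma + 1.0
    count = 0
    previous_positive = net_rate(0.0, alpha, beta, K, n, gamma) > 0
    for i in range(1, steps + 1):
        x = x_max * i / steps
        positive = net_rate(x, alpha, beta, K, n, gamma) > 0
        if positive != previous_positive:
            count += 1
        previous_positive = positive
    return count


def scan_for_bistability(betas, alpha=0.05):
    """Return the candidate beta values that meet the bistability requirement."""
    return [beta for beta in betas if count_steady_states(alpha, beta) == 3]


if __name__ == "__main__":
    # Candidate maximal production rates (hypothetical design parameter).
    print(scan_for_bistability([1.0, 3.0]))
```

A real design problem would add further requirements, such as which parameter values are actually reachable with the available parts, and would have to account for the host cell; the grid-based root count is also only a crude numerical check.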

CONCLUSION

Bioinformatics has several subdisciplines (functional genomics/systems biology; comparative genomics; structural genomics).

Each subdomain is a research domain in evolution, with novel tools being produced at an ever increasing pace.

The real-life examples of the different subdomains show how computational analysis (bioinformatics), in combination with novel data generation procedures, will change the way ‘molecular biologists’ think and perform research (referred to as the new biology).

How can bioinformatics change the world?

Bioinformatics has numerous application domains. It will, for instance, revolutionize the medical field by making personalized medicine possible. Now that everyone's genome (the blueprint of life) can (and will) be sequenced, we can start investigating why some people are susceptible to certain diseases while others are not, and why a certain treatment works for some and not for others. In agriculture, we can for instance study why certain plants are more resistant to drought than others, which is important in these days of global warming and climate change. Bioinformatics can also address more fundamental questions in evolutionary biology, such as whether Neanderthals and the ancestors of modern humans ever had sex (the answer is yes); such questions can only be addressed with bioinformatics or computational biology.

If bioinformatics becomes so prominent and is referred to as ‘the new biology’, how will this affect the more classical wet lab science?

Of course, without data there is no bioinformatics. But there is indeed a tendency for data generation to become increasingly robotized or outsourced. As a consequence, wet lab scientists have more time left to spend on the design of their experiments and will be confronted at a much earlier stage with the analysis of their data and the problems related to it. What do you hope to get out of your data, how will you summarize all these data, what is the hypothesis you want to formulate, and so on? So rather than focusing on a single gene, they will need to start thinking more globally, looking at the bigger picture, and that is what the term ‘new biology’ refers to. This is now often considered the problem of the bioinformatician, but obviously the wet lab scientist of the future will have to adopt at least some of those skills. So the distinction between a bioinformatician and a wet lab scientist (systems biologist) will become fuzzier, and in the coming decades we expect that about one third of the people in the life sciences will be bioinformaticians or at least use some sort of bioinformatics in their research. However, although genome hackers and number crunchers can learn a lot from the loads of data generated, wet lab work will always be necessary. Bioinformatics is also often about making predictions, and of course these will still need to be validated in the lab. On the other hand, for some specific fields such as evolutionary research, bioinformatics is often sufficient or even the only way to obtain results.

What is the most important skill of a bioinformatician?

First of all you need to be a generalist rather than a specialist. You need to know a bit of everything but nothing in too much detail (that can even be disadvantageous, I think). To give an example: wet lab scientists typically have a very detailed view of biology. Biological systems have randomly evolved into emerging complex systems that cannot be captured in a few rules; there are more exceptions than fixed rules in biology. Engineers, on the other hand, model systems, and these models depend on predefined rules. As a bioinformatician you need to keep both parties happy: a good formalization of a biological question should reduce the problem to a model that is mathematically tractable but that still captures the intricacies the biologist is interested in. Finding the right assumptions and simplifications builds on this generic knowledge.

This generic knowledge is also key to the scientific intuition you need to have as a bioinformatician. As was already mentioned, with bioinformatics we can solve research questions that could not be addressed before. There is so much data out there that, when you integrate it all, you can tackle research questions that go far beyond what was accessible to, or could be dreamt of by, a single person or even a single lab. The difficulty often lies in defining these novel research questions or hypotheses that no one has ever thought of before. This again requires very good interdisciplinary knowledge of how the data was generated, what type of information it contains, how it can be integrated, etc. Biological scientists need to know some bioinformatics.