Adriana Maggi DOCENTE DI BIOTECNOLOGIE FARMACOLOGICHE CORSO DI LAUREA SPECIALISTICA IN BIOTECNOLOGIE...

58
Adriana Maggi DOCENTE DI BIOTECNOLOGIE FARMACOLOGICHE CORSO DI LAUREA SPECIALISTICA IN BIOTECNOLOGIE DEL FARMACO AA 2011/2012 Lezione 5 Bioinformatica nel processo di drug discovery

Transcript of Adriana Maggi DOCENTE DI BIOTECNOLOGIE FARMACOLOGICHE CORSO DI LAUREA SPECIALISTICA IN BIOTECNOLOGIE...

Adriana Maggi

DOCENTE DI BIOTECNOLOGIE FARMACOLOGICHECORSO DI LAUREA SPECIALISTICA IN BIOTECNOLOGIE DEL FARMACO

AA 2011/2012Lezione 5

Bioinformatica nel processo di drug discovery

Bioinformatics in the drug discovery process

Alessandro VillaCenter of Excellence on Neurodegenerative Diseases

Department of Pharmacological SciencesUniversity of Milan

Bioinformatics

Bioinformatics is about finding and interpreting

biological data using informatic tools, with the goal of

enabling and accelerating biological research

Bioinformatics spans a wide range of activities

- Data capture- Automated recording of experimental results- Data storage- Visualization of raw data and analytical results

- Access to data using a multitude of databases and query tools

WorkflowExperimental

Design

Sample collection and analysis

Data collection, filtering, and input

Data analysis

Output results

Human based

Computer aided

Critical analysis of the results

Definitely human based

Bioinformatics is web-centric

Most academic groups and many companies make computing tools and data available on the web

based on the contribution of worldwide researchers

Most of them are free

Focus on

Bioinformatics strategies for disease gene identification

Traditional Methods of Drug Discovery

natural (plant-derived) treatment

for illness

isolation of active compound

(small, organic)

synthesisof compound

manipulation of structure to get better

drug(greater efficacy, fewer side effects)

Modern Methods of Drug Discovery

What’s different?

• Drug discovery process begins with a disease (rather than a treatment)

• Use disease model to pinpoint relevant genetic/biological components (i.e. possible drug targets)

Defining genetic disease

Genetic disorders are caused by abnormalities in the genetic material

Abnormalities can range from a small mutation in a single gene to the addition or subtraction of an entire chromosome or set of

chromosomes.

In general, four types of genetic disorders can be distinguished

Monogenetic

Monogenetic (also called Mendelian or single gene) disorders are caused by a mutation in one particular pair of gene.

A mutated gene can result in a mutated protein, which can no longer carry out its normal function.

Over 10,000 human diseases are known to be caused by defects in single genes, affecting about 1% of the population as a whole.

Monogenetic disorders often have simple and predictable inheritance patterns.

Thalassaemia

Sickle cell anemia

Haemophilia

Cystic Fibrosis

Tay sachs disease

Fragile X syndrome

Huntington's disease

Monogenetic disorders

Polygenic

Polygenic disorders are due to mutations in multiple genes in combination with external factors, such as lifestyle and environment

Heritability presents the contritution of genetic factors in the formation of multiple gene diseases. Higher heritability is generally interpreted as a larger contribution of genes.

Examples of polygenic diseases include coronary heart disease, diabetes, hypertension, and peptic ulcers.

At present, there are still many difficulties in prenatal diagnosis for multiple-gene diseases, however, as technology develops, prenatal diagnosis for common multiple-gene diseases will be available in the near future.

Type 1 diabetes

Multiple sclerosis

Autism

Asthma

Celiac disease

Polygenic disorders

Chromosomal

Abnormalities in the chromosomal number or structure, e.g. (partial) deletion, extra copies, breakage, and (partial) rearrangements, can result in disease.

Down syndrome

Klinefelter's syndrome

Prader–Willi syndrome

Turner syndrome

Chromosomal

Mitochondrial

Mitochondria, like the cell nucleus, contains DNA (mtDNA), which is the biggest difference between mitochondria and other sub-units. mtDNA is only inherited from the mother and exhibits higher mutation rate than that of nuclear DNA as well as low repair capacity.

Mitochondrial diseases have threshold effects. That means mitochondrial diseases could occur only if the abnormal mtDNA exceeds the threshold. Although sometimes diseases would not happen in the female carriers, for their underthreshold abnormal mtDNA or certain nuclear effects, mutant mtDNA can also be passed from generation to generation.

Kearns-Sayre syndrome

Chronic progressive external ophthalmoplegia

Mitochondrial encephalomyopathy with lactic acidosis

Leigh syndrome

Mitochondrial disorders

DISEASE

Gene identification/finding of inherited disease

Every gene has a specific task

Identification of disease genes is similar to finding genes responsible for normal functions

The mutation may be within a gene/protein or within a regulatory part of the genome that, e.g., affects the amount

of protein being produced.

The mutation changes the protein, which alters the way the task is usually performed

Gene identification/finding of inherited disease

Timeline

1983 – Invention of Polymerase Chain Reaction (PCR) technique by Kary Mullis

1989 – the National Center for Human Genome Research is created

1990 – the Human Genome Project (HGP) starts to map and sequence human DNA

1996 – the DNA sequence of the first eukaryotic genome (S. Cerevisiae) is completed

2003 – the human genome sequence is completed

Now – the genome sequences are still frequently updated with new and rearranged sequences, and some parts are still missing.

2002 – the mouse genome sequence is completed

Gene identification/finding of inherited disease

We have a huge amount of genetic data in place.

And now?

Find a candidate gene!

Candidate gene definition

A candidate gene is a gene that is suspected to be involved in a genetic disease

It is located in a chromosome region suspected of being involved in the expression of a trait such as a disease, whose protein product suggests that it could be the gene in question.

Disease genes identification in complex disorders

Complex disorders are multifactorial and many such diseases, like heart and vascular disease are quite

common.

Five steps are applicable to research of a complex disease:

I. Establish that the disease is indeed (partially ) caused by genetic factors

To prove that the candidate is in fact a gene, demonstration of a genetic mutation is needed.

Mutation analysis can be done by direct sequencing. Changes in the splicing process of the gene may be missed when screening protein-coding DNA sequences only, but are detectable at the RNA level using RT-PCR. With RT-PCR and related methods it is possible to evaluate whether the spatio-temporal gene expression pattern is compatible with the phenotype of interest.

Final proof may require the examination of the effect of induced mutation in model organisms.

Mutation analysis

II. Perform segregation analysis on individual pedigree to determine the type of inheritance.

Inheritance can vary from Mendelian to polygenic, depending on penetrance and environment.

The mode of inheritance determines the linkage analysis methods applicable (next step).

Segregation analysis E.g. Genetic Association Interaction Analysis Software (GAIA)

http://www.bbu.cf.ac.uk/html/research/biostats.htm

III. Perform linkage analysis to map susceptibility loci

Genetic linkage analysis is a statistical method that is used to associate functionality of genes to their location on chromosomes. The main idea is that markers which are found in vicinity on the chromosome have a tendency to stick together when passed on to offsprings. Thus, if some disease is often passed to offsprings along with specific markers, then it can be concluded that the gene(s) which are responsible for the disease are located close on the chromosome to these markers.

Parametric, useful for Mendelian traits;

Non parametric, useful for complex diseases.

Pedigree Analysis Software

E.g. MERLIN

http://www.sph.umich.edu/csg/abecasis/Merlin/index.html

IV. Fine mapping of the susceptibility gene by population-association studies

Association studies require the use of DNA from many individuals. However, association studies do not use families. Rather, they look at DNA from affected individuals compared to DNA of controls (non-affected individuals who do not have to be relatives.)

After using linkage to get an idea where disease genes may be located, use association to try to better locate the gene. Association allows to test candidate genes, or very small genetic regions, to see if they are associated with the phenotype in study. These tests can result in the location of a risk gene.

V. Elucidation of the DNA sequences/genes

Confirm their molecular and biochemical action and involvement.

Relatively easy in case of Mendelian disorders, because the disease is due to a single change.

In complex disorders the susceptibility is often modelled as a quantitative trait locus (QTL)

In silico positional cloning

Once the critical region for a genetic disease has been determined by linkage analysis, population-association, etc., the human genome sequence can be used to identify positional candidate disease genes.

Genome browsers, biological databases, and other bioinformatics tools all contribute to the gene finding strategy.

Bioinformatics approach to disease gene identification

The release of genomic sequences, full-lenght cDNA sequences, expressed sequence tags (ESTs), and large-scale expression micro-array data of human and model organisms (e.g. Mus Musculus) offer invaluable resources for studying genetic diseases.

This huge amount of data is stored in numerous different databases, thus making the use of high performance computing an essential tool for decoding the information contained in these databases.

DNA Data Bank of Japan

http://www.ddbj.nig.ac.jp/index-e.html

European Molecular Biology Laboratory database

http://www.ensembl.org/

GenBank

http://www.ncbi.nlm.nih.gov/genbank/

First StepTo search for all genes between two genetic markers on the

chromosome under study

Essential is a proper description of the location of genes and other annotations like regulatory elements

Databases and computational tools have been developed to identify all genes on the human genome sequence. None is perfect and genes may be missed, or false genes may be annotated manual evaluation is necessary or..

Multiple sequence analyses on different databases should be performed

USCS Genome Browser

http://genome.ucsc.edu/

Second Step

Functional cloning and candidate gene selection

We identified all the genes between the genetic markers

In theory, every gene within the disease critical region can cause the disease.When the critical region is large, or the gene density is high,

positional candidates are many.

Strategies:

•There may be already a suspicion on the biochemical/pathogenic background of the disease

•If a genetic disorder affects e.g. the liver, select only genes expressed in liver

•For known genes, the knowledge in literature can be used to select the candidate genes

•Genes located within the critical disease region that have a functional similarity to genes involved in related diseases are good candidates

The Gene Ontology project is a major bioinformatics initiative with the aim

of standardizing the representation of gene and gene product attributes

across species and databases. The project provides a controlled vocabulary

of terms for describing gene product characteristics and gene product

annotation data from GO Consortium members, as well as tools to access

and process this data.

Database for Annotation, Visualization and Integrated Discovery

http://david.abcc.ncifcrf.gov/home.jsp

Systematic and integrative analysis of large gene lists using DAVID Bioinformatics Resources. (2009) Nat Protoc. 4(1):44 -57.

Further, knowledge of model organisms makes comparative candidate selection possible

This situation applies when a gene is known, which causes a similar phenotype in other species.

Transfer of knowledge by phenotype is strightforward in Mus Musculus, being evolutionarily close to humans

This grid, called Oxford grid, shows the relationship between human and mouse chromosomes. Chromosome location of either of the species often predicts the

chromosome location in the other species.

When none of the known genes has mutations, it is possible to try to find new genes in the critical region.

Comparative genome analysis of related species present us with a wealth of opportunities for studying

evolution and gene/protein function.

Homology-based function-prediction transfers information from known genes/proteins to unknow sequences and remains the

primary method to determine the function of a new gene

Example of homology-based methods are Basic Local Alignment Search Tool (BLAST)Evolutionary annotation database (EVOLA)

BLAST

http://blast.ncbi.nlm.nih.gov/

EVOLA

http://www.h-invitational.jp/evola/search.html

EVOLA

http://www.h-invitational.jp/evola/search.html

Biological Networks

Over the last years, the wealth of information derived from high-throughput interaction screening methods have been used to map different biological interactions.

These maps provide a vision of the molecular networks in biological systems.

C

B

A

D

F E

Protein-protein interaction networks represent the interaction between proteins such as the building of protein complexes and the activation of one protein by another protein.

Gene regulatory and signal transduction networks describe how genes can be activated or repressed and therefore which proteins are produced in a cell at a particular time.

Metabolic networks show how metabolites are transformed, for example to produce energy or synthesize specific substances.

Biological Networks

Gene regulatory, protein-protein interaction and metabolic networks interact with each other and build a complex network of interactions.

The study of biological network is essential to understand the role of candidate genes in genetic diseases

Biological Networks

Finally they are very useful to identify the genotypes that are associated with phenotypes,

a major goal in genetic research

Ingenuity Pathway Analysis Software

http://www.ingenuity.com/index.html

Confirming a candidate gene

Selected genes have to be tested individually to see if there is evidence that mutations in them do cause the disease in question.

Mutation screening. Identifying mutations in several unrelated affected individuals strongly suggests that the correct candidate gene has been chosen, but formal proof requires additional evidence.

Restoration of normal phenotype in vitro. If a cell line that displays the mutant phenotype can be cultured from the cells of a patient, transfection of a cloned normal allele into the cultured disease cells may result in restoration of the normal phenotype by complementing the genetic deficiency.

Production of a mouse model of the disease. Once a putative disease gene is identified, a transgenic mouse model can be constructed. If the human phenotype is known to result from loss of function, gene targeting can be used to generate a germline knockout mutation in the mouse ortholog. The mutant mice are expected to show some resemblance to humans with the disease.

Suggested readings

1. A. L. Barabási, N. Gulbahce, J. Loscalzo, Network medicine: a network-based approach to human disease. Nature Reviews Genetics 12, 56 (2011).

2. J. K. DiStefano, Disease Gene Identification: Methods and Protocols. (Humana Press, 2011).

3. S. D. Mooney, V. G. Krishnan, U. S. Evani, Bioinformatic tools for identifying disease gene and SNP candidates. Methods Mol. Biol 628, 307 (2010).

4. A. Schlicker, T. Lengauer, M. Albrecht, Improving disease gene prioritization using the semantic similarity of Gene Ontology terms. Bioinformatics 26, i561 (2010).

5. N. Tiffin et al., Integration of text-and data-mining using ontologies successfully selects disease gene candidates. Nucleic acids research 33, 1544 (2005).

6. Y. Zhang et al., Systematic analysis, comparison, and integration of disease based human genetic association data and mouse genetic phenotypic information. BMC medical genomics 3, 1 (2010).