Dr J Boateng BIOT 1011 Bioinformatics
GENOMICS AND PROTEOMICS ANALYSES
Dr Joshua Boateng
21/11/2011
The biotechnology/IT market will increase at a compound annual growth rate (CAGR) of 24% to nearly $38 billion by 2006.
– Source: IDC Research
Biotech and pharmaceutical companies spent $10 billion on hardware, software, and services in 2002.
– Source: Gartner
Reference: Prof. A.S. Kolaskar Vice Chancellor, University of Pune
GENOMICS
Genetics: the science of genes, heredity, and the variation of organisms. In modern research, genetics provides tools for investigating the function of a particular gene, e.g. analysis of genetic interactions.
Genomics: the study of large-scale genetic patterns across the genome for a given species. It deals with the systematic use of genome information to provide answers in biology, medicine, and industry.
Genomics covers the study of sequences, gene organization and mutations at the DNA level, i.e. the study of information flow within a cell.
Genomics has the potential of offering new therapeutic methods for the treatment of some diseases, as well as new diagnostic methods.
Major tools and methods related to genomics are bioinformatics, genetic analysis, measurement of gene expression, and determination of gene function.
GENOME COMPARISONS
Species            Chromosomes   Genes            Base pairs
Human              46            28,000-35,000    3.1 billion
Mouse              40            22,500-30,000    about 2.7 billion
Puffer fish        44            31,000           about 365 million
Malaria mosquito   6             14,000           about 278 million
Fruit fly          8             14,000           137 million
Roundworm          12            19,000           97 million
E. coli            1             5,000            4.1 million
GENOMIC ANALYSIS
• Many diverse studies require determining the abundance of large numbers of specific DNA or RNA molecules in complex mixtures, for example the changes in mRNA levels of many genes.
– Genome analysis entails the prediction of genes in uncharacterized genomic sequences.
– The 21st century has seen the announcement of the draft version of the human genome sequence. Model organisms have been sequenced in both the plant and animal kingdoms.
GENOMIC ANALYSIS
• However, the pace of genome annotation is not matching the pace of genome sequencing.
• Experimental genome annotation is slow and time consuming. There is therefore a demand for computational tools for gene prediction.
• Computational gene prediction is relatively simple for prokaryotes, where each gene is transcribed into the corresponding mRNA and then translated into protein without interruption.
• The process is more complex for eukaryotic cells, where the coding DNA sequence is interrupted by non-coding sequences called introns.
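The prokaryotic case can be illustrated with a minimal sketch of an open reading frame (ORF) scan. It is a simplification, not a real gene finder: it assumes uninterrupted genes, scans only the forward strand, and uses hypothetical names (find_orfs, min_codons) that do not come from the lecture.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end) index pairs of forward-strand ORFs.

    An ORF here runs from an ATG to the first in-frame stop codon.
    min_codons counts the stop codon as well (illustrative threshold).
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                   # close it and keep scanning
    return sorted(orfs)


if __name__ == "__main__":
    sequence = "CCATGAAATGATAGCC"
    for begin, end in find_orfs(sequence):
        print(begin, end, sequence[begin:end])
```

A scan like this fails on eukaryotic genes precisely because introns break the reading frame, which is why the tools for those genomes rely on statistical models or homology.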
BIOLOGICAL QUESTIONS
Some of the questions biologists want to answer today are:
• Which part of a DNA sequence codes for a protein, and which part of it is junk DNA?
• Classify the junk DNA as intron, untranslated region, transposon, dead gene or regulatory element.
• Divide a newly sequenced genome into genes (coding) and non-coding regions.
Biological Research in 21st Century
“The new paradigm, now emerging, is that all the 'genes' will be known (in the sense of being resident in databases available electronically), and that the starting point of a biological investigation will be theoretical.”
- Walter Gilbert
IMPORTANCE OF GENOME ANALYSIS
• The importance of genome analysis can be understood by comparing the human and chimpanzee genomes.
• The chimp and human genomes differ by an average of just 2%, i.e. just about 160 enzymes. A complete analysis of the two genomes would give strong insight into the mechanisms responsible for the differences.
COMPLEXITY IS AN UNDERSTATEMENT?
GENOMIC ANALYSIS: basics
• Techniques used to estimate the relative abundance of two or more sets of mRNA:
– differential screening of cDNA libraries,
– subtractive hybridization,
– differential display.
• However, more advanced methods have recently been developed.
GENOMIC ANALYSIS: advances
• Advanced methods are particularly amenable to organisms whose entire genome sequences are known, such as S. cerevisiae.
• It is now practicable to investigate changes of mRNA levels of all yeast open reading frames (ORFs) in one experiment.
Advanced genomic analysis techniques
• DNA sequencing
• DNA microarray technology – analysis of gene expression profiles at the mRNA level
• Bioinformatic tools to organize and analyze such data
• Chip-based analysis of samples
• Models of gene networks
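As a rough illustration of what analysing expression profiles at the mRNA level can mean in practice, the sketch below computes log2 fold changes between two conditions. The gene labels, the intensity values, the pseudocount and the two-fold threshold are all illustrative assumptions, not values from the lecture.

```python
import math


def log2_fold_changes(control, treated, pseudocount=1.0):
    """Map gene -> log2(treated / control) for genes measured in both samples.

    The pseudocount (an assumption) avoids taking the logarithm of zero.
    """
    return {
        gene: math.log2((treated[gene] + pseudocount) / (control[gene] + pseudocount))
        for gene in control
        if gene in treated
    }


def changed_genes(fold_changes, threshold=1.0):
    """Genes whose absolute log2 fold change reaches the threshold (1.0 = two-fold)."""
    return sorted(g for g, fc in fold_changes.items() if abs(fc) >= threshold)


if __name__ == "__main__":
    # Hypothetical spot intensities for three ORFs.
    control = {"ORF_A": 99, "ORF_B": 49, "ORF_C": 10}
    treated = {"ORF_A": 399, "ORF_B": 49, "ORF_C": 4}
    changes = log2_fold_changes(control, treated)
    print(changes)
    print(changed_genes(changes))
```

Real microarray pipelines add background correction and normalization before this step; the sketch shows only the comparison itself.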
Microarray Technology
Post-genomic Era
• Series of “omics”
– Comparative genomics
– Structural and functional genomics
– Transcriptomics
– Proteomics
– Metabolomics
Bioinformatics tools needed for analysis of data from these “omics”…
Data Mining
Development of new tools for data mining
– Sequence alignment
– Genome sequencing
– Genome comparison
– Microarray data analysis
– Proteomics data analysis
– Small molecular array analysis
To derive “information” and gain “knowledge” from the data
COMPARATIVE GENOMICS
• Analyzing & comparing genetic material from different species to study evolution, gene function, and inherited disease
• Understand the uniqueness between different species
• Comparative genomics involves the use of computer programs that can line up multiple genomes and look for regions of similarity among them.
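Looking for regions of similarity can be sketched with exact shared k-mers, the seeding idea that many alignment programs build on. This is a toy under stated assumptions: the function names and the choice of k are hypothetical, and real genome aligners extend such seeds into gapped alignments rather than stopping here.

```python
def kmer_positions(seq, k):
    """Index every k-mer of seq by the positions where it starts."""
    positions = {}
    for i in range(len(seq) - k + 1):
        positions.setdefault(seq[i:i + k], []).append(i)
    return positions


def shared_kmers(seq_a, seq_b, k=8):
    """Return (pos_in_a, pos_in_b, kmer) for every k-mer present in both sequences."""
    index_b = kmer_positions(seq_b, k)
    hits = []
    for i in range(len(seq_a) - k + 1):
        kmer = seq_a[i:i + k]
        for j in index_b.get(kmer, []):
            hits.append((i, j, kmer))
    return hits


if __name__ == "__main__":
    # Two short made-up sequences sharing one 6-base region.
    print(shared_kmers("AAACGTACGGTT", "TTCGTACGAA", k=6))
```

Indexing one sequence and streaming the other keeps the work roughly linear in the combined length, which matters once the inputs are whole genomes.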
When we BLAST a sequence, is that comparative genomics?
The difference is in scale and direction.
Other “omics”:
• One or several genes compared against all other known genes.
• Use the genome to inform us about the entire organism.
Comparative genomics:
• Entire genome compared to other entire genomes.
• Use information from many genomes to learn more about the individual genes.
Background on Comparative Genomic Analysis
• Sequencing the genomes of the human, the mouse and a wide variety of other organisms, from yeast to chimpanzees, is the driving force behind the development of a new field of biological research called comparative genomics.
BACKGROUND
• Comparing the human genome with the genomes of different organisms helps to better understand gene structure and function and thereby develop new strategies in the battle against human disease.
• Comparative genomics also provides a powerful new tool for studying evolutionary changes among organisms.
• This helps to identify the genes that are conserved among species along with the genes that give each organism its own unique characteristics.
• Using computer-based analysis to zero in on the genomic features that have been preserved in multiple organisms over millions of years, researchers will be able to pinpoint the signals that control gene function.
• This should in turn translate into innovative approaches for treating human disease and improving human health.
BACKGROUND
• The evolutionary perspective may prove extremely helpful in understanding disease susceptibility. For example, chimpanzees do not suffer from some of the diseases that strike humans, such as malaria and AIDS.
• A comparison of the sequence of genes involved in disease susceptibility may reveal the reasons for this species barrier, thereby suggesting new pathways for prevention of human disease.
BACKGROUND
• Although living creatures look and behave in many different ways, all of their genomes consist of DNA, the chemical chain that makes up the genes that code for thousands of different kinds of proteins.
• Precisely which protein is produced by a given gene is determined by the sequence in which four chemical building blocks - adenine (A), thymine (T), cytosine (C) and guanine (G) - are laid out along DNA's double-helix structure.
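How the order of the four bases determines the protein can be shown with a short translation sketch using the standard genetic code. The table is built from the conventional TCAG ordering; the function name translate and the example sequence are illustrative choices, and the sketch assumes an uninterrupted coding sequence containing only A, C, G and T.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a coding DNA sequence, codon by codon, up to the first stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]   # raises KeyError on non-ACGT symbols
        if amino_acid == "*":                    # stop codon ends translation
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(translate("ATGCGTTGTAAAACTGAATAA"))
```

Changing a single base can change the amino acid (AAA codes for K, ACA for T) or leave it unchanged (GAA and GAG both code for E), which is why sequence comparison is done at both the DNA and the protein level.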
BACKGROUND
• In order for researchers to most efficiently use an organism's genome in comparative studies, data about its DNA must be in large, contiguous segments, anchored to chromosomes and, ideally, fully sequenced.
• Furthermore, the data needs to be organized for easy access and high-speed analysis by sophisticated computer software.
• Organisms that have been completely sequenced include: mouse (Mus musculus), human (Homo sapiens), fruit fly (Drosophila melanogaster); and ....................
BACKGROUND
• The fledgling field of comparative genomics has already yielded some dramatic results.
• For example, a March 2000 study comparing the fruit fly genome with the human genome discovered that about 60 percent of genes are conserved between fly and human.
• Simply put, the two organisms appear to share a core set of genes. Researchers have found that two-thirds of human cancer genes have counterparts in the fruit fly.
BACKGROUND
• More surprisingly, when scientists inserted a human gene associated with early-onset Parkinson's disease into fruit flies, the flies displayed symptoms similar to those seen in humans with the disorder.
• This raises the possibility that the tiny insects could serve as a new model for testing therapies aimed at Parkinson's.
Comparative Genomics: what should one look for?
Genomes compared: Human, P. falciparum (P.f.), Mosquito
Proteins that are shared –
• By all genomes
• Exclusively by Human & P.f.
• Exclusively by Human & Mosquito
• Exclusively by P.f. & Mosquito
Unique proteins in –
• Human
• P.f. (targets for anti-malarial drugs)
• Mosquito
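The seven categories above are the regions of a three-set Venn diagram, so they can be sketched with plain set operations. The sketch assumes each proteome has already been reduced to a set of protein family identifiers (in practice that step needs homology searches); the single-letter identifiers and the function name are hypothetical.

```python
def partition_proteins(human, parasite, mosquito):
    """Split three sets of protein family identifiers into the seven Venn regions."""
    human, parasite, mosquito = set(human), set(parasite), set(mosquito)
    return {
        "all": human & parasite & mosquito,
        "human_parasite_only": (human & parasite) - mosquito,
        "human_mosquito_only": (human & mosquito) - parasite,
        "parasite_mosquito_only": (parasite & mosquito) - human,
        "human_unique": human - parasite - mosquito,
        "parasite_unique": parasite - human - mosquito,   # candidate drug targets
        "mosquito_unique": mosquito - human - parasite,
    }


if __name__ == "__main__":
    # Made-up family labels, one per Venn region.
    regions = partition_proteins(
        human={"A", "B", "C", "H"},
        parasite={"A", "B", "D", "P"},
        mosquito={"A", "C", "D", "M"},
    )
    for name, members in regions.items():
        print(name, sorted(members))
```

Families found only in the parasite are of interest because a drug acting on them is less likely to hit a human protein.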
Comparative Gene Prediction
• GenScan: ab initio gene prediction.
• GeneWise, Procrustes: homology guided.
• Rosetta, SGP-1 (Syntenic Gene Prediction), CEM (Conserved Exon Method): gene prediction and sequence alignment are clearly separated.
• GenomeScan: ab initio modified by BLAST homologies.
• SGP-2, TwinScan, SLAM, DoubleScan: modification of the GenScan scoring scheme to incorporate similarity to known proteins.
Proteome – by the dictionary
• The term proteome was coined in 1994 as a linguistic equivalent to the concept of genome.
• Proteome: the complete set of proteins that is expressed, and modified, by the entire genome in the lifetime of a cell.
• Practical: the complement of proteins expressed by a cell at any one time.
Proteomics – by the dictionary
Proteomics (practical): the study of the proteome using technologies of large-scale protein separation and identification.
Large-scale separation: 2DE, liquid chromatography
Identification: MALDI MS, tandem MS/MS, FT-MS ...
• http://www.bio-itworld.com/archive/031704/horizons_horizons_comm.html
Proteomics according to Medline: development of proteomics
From 220 publications in the previous millennium (’94-’99) to 21,350 (!!!) publications in this millennium (’00-’05)
[Figure: Medline publications on proteomics per year, 1997-2004; series: Papers and Reviews; vertical axis 0-9000]
Proteomics – by Google
THE REALISTIC TRUTH.
Proteomics: 886,000 hits (2004); 4,700,000 hits (2005)
Genomics: 2,070,000 hits (2004); 16,000,000 hits (2005)
Comparing Proteomics & Genomics
Genome analysis (genomics):
• DNA, nc-RNA, mRNA, coding DNA
• Linear; dynamic (up/down)
• Completion
• Archived (EST, cDNA, GEO)
Proteome analysis:
• Proteins, peptides, glyco and other modifications
• 3D; dynamic (up/down); variants
• No notion of completion
• Poorly archived
Proteomics – Genomics: more differences…
Gene/RNA dynamic:
• Handle: stable molecules; handling cheap/easy; minimal modification; works in isolation
• Tech: sequencing (established)
• HTP: DNA array / genotyping / expression / CGH
Protein dynamic:
• Handle: fragile molecules; handling dependent; labile modification; protein interaction; localization dependent
• Tech: MS related (not yet)
• HTP: protein chip (not yet); antibody array (not yet)
Proteomics:
– Original definition: study of the proteins encoded by the genome of a biological sample
– Current definition: study of the whole protein complement of a biological sample (cell, tissue, animal, biological fluid [urine, serum])
– Usually involves high resolution separation of polypeptides at front-end, followed by mass spectrometry identification and analysis
Challenges facing Proteomic Technologies
• Limited/variable sample material
• Sample degradation (occurs rapidly, even during sample preparation)
• Vast dynamic range required
• Post-translational modifications (often skew results)
• Specificity among tissue, developmental and temporal stages
• Perturbations by environmental (disease/drugs) conditions
• Researchers have deemed sequencing the genome “easy,” as PCR was able to assist in overcoming many of these issues in genomics.
The Proteomics Tool Kit
• technologies for separating and visualizing proteins and peptides
• technologies for assessing protein-protein interactions
• technologies for identifying proteins*
• technologies for quantifying protein expression*
• bioinformatic tools for assessment and communication
Proteomic Technologies
• Amino Acid Composition
• Array-based Proteomics
• 2D PAGE
• Mass Spectrometry
• Structural Proteomics
• Informatics (and the challenges facing the Human Proteome Project)
Amino Acid Composition (Edman)
• Pioneering method of obtaining information from proteins.
• Cumbersome and tedious by today’s standards.
• Requires the use of terrible-smelling β-mercaptoethanol.
• Not “high-throughput” by today’s standards; hence, amino acid composition analysis is no longer the most widely used technique.
Protein Sequencing: step 1, fragmenting into peptides
Protein Sequencing: step 2, sequencing the peptides by Edman degradation.
Separate by HPLC and detect by absorbance at 269 nm.
Array-based Proteomics
• Employ two-hybrid assays
• Use GFP, FRET, and GST
– GFP = green fluorescent protein
– FRET = fluorescence resonance energy transfer
– GST = glutathione S-transferase, a well-characterized protein used as a marker protein.
Array-based Proteomics
Array-based Proteomics
• Offer a high-throughput technique for proteome analysis.
• These small plates are able to hold many different samples at a time.
• Research is ongoing at ORNL to interface array methodologies with mass spectrometry.
2D PAGE
• 2-D gel electrophoresis is a multi-step procedure that can be used to separate hundreds to thousands of proteins with extremely high resolution.
• It separates proteins by their pI in the first dimension, using an immobilized pH gradient (isoelectric focusing), and then by their MW in the second dimension.
• The core technology of proteomics is 2-DE.
• At present, there is no other technique that is capable of simultaneously resolving thousands of proteins in one separation procedure. (cited in 2000)
Evolution of 2-DE methodology: in the past
Traditional IEF procedure:
• Isoelectric focusing (IEF) is run in thin polyacrylamide gel rods in glass or plastic tubes.
• Gel rods contain: 1. urea, 2. detergent, 3. reductant, and 4. carrier ampholytes (which form the pH gradient).
• Problems: 1. tedious, 2. not reproducible.
Evolution of 2-DE methodology
SDS-PAGE gel size:
• This “O’Farrell” technique has been used for 20 years without major modification.
• 20 x 20 cm gels have become a standard for 2-DE.
• Assumption: 100 bands can be resolved by a 20 cm long 1-DE gel.
• Therefore a 20 x 20 cm gel can resolve 100 x 100 = 10,000 proteins, in theory.
Evolution of 2-DE methodology
Problems with traditional 1st dimension IEF
• Works well for native proteins but not for denatured proteins, because:
1. It takes longer to run.
2. The technique is cumbersome (the soft, thin, long gel rods need excellent experimental technique).
3. Carrier ampholytes vary from batch to batch.
4. Patterns are not reproducible enough.
5. Most basic proteins and some acidic proteins are lost.
OPERATOR DEPENDENT
2D PAGE
• The 2-D gel electrophoresis process consists of these steps:
– Sample preparation
– First dimension: isoelectric focusing
– Second dimension: gel electrophoresis
– Staining
– Imaging analysis via software
Challenges for 2-DE
1. Spot number:
– 10,000-150,000 gene products in a cell.
– PTM makes it difficult to predict real number.
– Sensitivity and dynamic range of 2-DE must be adequate.
– It is impossible to display all proteins on a single gel.
Challenges for 2-DE
2. Isoelectric point spectrum:
– pI of proteins: ranges from pH 3 to 13 (from in vitro translated ORFs).
– PTM would not alter the pI outside this range.
– A pH gradient from 3 to 13 does not exist.
– Proteins with pI > 11.5 need to be handled separately.
Challenges for 2-DE
3. Molecular weights:
– Small proteins or peptides can be analysed by modifying the gel and buffer conditions of SDS-PAGE.
– Proteins > 250 kDa do not enter the 2nd-dimension SDS-PAGE properly.
– 1-DE (SDS-PAGE) can be run in a lane at the side of 2-DE.
Challenges for 2-DE
4. Hydrophobic proteins:
– Some very hydrophobic proteins do not go into solution.
– Some hydrophobic proteins are lost during sample preparation and isoelectric focusing (IEF).
–More chemical developments are required.
Challenges for 2-DE
5. Sensitivity of detection:
– Low copy number proteins are very difficult to detect, even with the most sensitive staining methods.
– Sensitivity of staining methods:
1. Silver staining
2. Fluorescent staining
3. Dye binding staining (CBR)
Challenges for 2-DE
6. Loading capacity:
– To detect low-abundance proteins, more sample needs to be loaded.
– A wide dynamic range of the SDS-PAGE is required to prevent merging of highly abundant proteins.
– Loading capacity: IEF > SDS-PAGE.
Challenges for 2-DE
7. Quantitation:
– The detection method must give reliable quantitative information.
– Silver staining does not give reliable quantitative data.
Challenges for 2-DE
8. Reproducibility:
– Of the highest importance in a 2-DE experiment.
– Immobilized pH gradient strips have greatly improved 1st dimension consistency.
– Variation mostly comes from sample preparation.
“A good-looking spot pattern –streak and smear free – is not a guarantee for best 2-DE protocol”
Technologies for identifying proteins
• Western blotting
• Chemical (Edman) sequencing of proteins
• mass spectrometry
– peptide mass fingerprint
– mass spec decay
– databases and search engines
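A peptide mass fingerprint can be sketched as an in-silico digest followed by a mass calculation. The sketch assumes trypsin as the protease (cleavage after K or R, except before P), uses monoisotopic residue masses, and ignores missed cleavages and modifications; the function names are hypothetical. A search engine would compare such theoretical masses with the measured peak list.

```python
# Monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide for the free N- and C-termini


def tryptic_digest(protein):
    """Cleave after K or R unless the next residue is P (no missed cleavages)."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[residue] for residue in peptide) + WATER


def fingerprint(protein):
    """Sorted list of theoretical tryptic peptide masses."""
    return sorted(round(peptide_mass(p), 4) for p in tryptic_digest(protein))


if __name__ == "__main__":
    print(tryptic_digest("MRCKTETGAR"))
    print(fingerprint("MRCKTETGAR"))
```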
Mass Spectrometry
• Mass spectrometry is another tool to analyze the proteome.
• In general, a mass spectrometer consists of:
– Ion source
– Mass analyzer
– Detector
• Mass spectrometers are used to measure the mass-to-charge (m/z) ratios of substances.
• From this quantification, a mass is determined, proteins are identified, and further analysis is performed.
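The step from a measured m/z to a mass can be sketched for the common case of positive ions formed by adding protons. That ionization mode is an assumption (other adducts would change the constant), and the function names are hypothetical.

```python
PROTON = 1.007276  # mass of a proton in daltons


def mz(neutral_mass, charge):
    """m/z of a molecule carrying `charge` extra protons: (M + z * proton) / z."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return (neutral_mass + charge * PROTON) / charge


def neutral_mass(mz_value, charge):
    """Invert the relation: M = z * (m/z - proton)."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return charge * (mz_value - PROTON)


if __name__ == "__main__":
    # The same 1000 Da peptide appears at different m/z for each charge state.
    for z in (1, 2, 3):
        print(z, round(mz(1000.0, z), 4))
```

Because the instrument reports m/z rather than mass, the charge state has to be known or inferred before the mass can be assigned.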
MASS SPECTROMETRY
MORE DETAILED MASS SPECTROMETRY APPLICATIONS IN MORNING LECTURE ON 28TH NOVEMBER 2011
Application of bioinformatics in the fields of genomics and proteomics
What is Bioinformatics?
Conceptualizing biology in terms of molecules and then applying “informatics” techniques from math, computer science, and statistics to understand and organize the information associated with these molecules on a large scale
How do we use Bioinformatics?
• Store/retrieve biological information (databases)
• Retrieve/compare gene sequences
• Predict function of unknown genes/proteins
• Search for previously known functions of a gene
• Compare data with other researchers
• Compile/distribute data for other researchers
Sequence retrieval:
– National Center for Biotechnology Information
– GenBank and other genome databases
Protein structure:
– 3D modeling programs: RasMol, Protein Explorer
Sequence comparison programs:
– BLAST, GCG, MacVector
Similarity Search: BLAST
A tool for searching gene or protein sequence databases for related genes of interest
The structure, function, and evolution of a gene may be determined by such comparisons
Alignments between the query sequence and any given database sequence, allowing for mismatches and gaps, indicate their degree of similarity
http://www.ncbi.nlm.nih.gov/BLAST/
% identity
MRCKTETGAR
MRCGTETGAR   90%

CATTATGATA
GTTTATGATT   70%
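The two percentages can be reproduced with a short sketch that counts matching positions in a pair of already aligned, equal-length sequences. It assumes an ungapped alignment; BLAST itself also scores gaps and substitutions, which this sketch does not attempt. The function name is hypothetical.

```python
def percent_identity(seq_a, seq_b):
    """Percentage of positions at which two aligned, equal-length sequences agree."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


if __name__ == "__main__":
    print(percent_identity("MRCKTETGAR", "MRCGTETGAR"))  # protein pair
    print(percent_identity("CATTATGATA", "GTTTATGATT"))  # DNA pair
```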
Strengths:
• Accessibility
• Growing rapidly
• User friendly
Weaknesses:
• Sometimes not up-to-date
• Limited possibilities
• Limited comparisons and information
• Not accurate
Need for improved Bioinformatics
Genomics:
• Human Genome Project
• Gene array technology
• Comparative genomics
• Functional genomics
Proteomics:
• Global view of protein function/interactions
• Protein motifs
• Structural databases
Data Mining
Handling enormous amounts of data
Sort through what is important and what is not
Manipulate and analyze data to find patterns and variations that correlate with biological function
Proteomics
• Uses information determined by biochemical/crystal structure methods
• Visualization of protein structure
• Make protein-protein comparisons
• Used to determine:
- conformation/folding
- antibody binding sites
- protein-protein interactions
- computer aided drug design
bioinformatics
students educators
researchers institutions