Gene Expression Analysis Unit 19 BIOL221T: Advanced Bioinformatics for Biotechnology Irene...

Gene Expression Analysis, Unit 19. BIOL221T: Advanced Bioinformatics for Biotechnology. Irene Gabashvili, PhD


Major challenge in biology

Gain an understanding of the workings of the cell by integrating available information from the various fields of molecular and cellular biology and physiology into an accurate model to generate hypotheses for testing

In previous lectures we mostly talked about Genome (Sequence) informatics

-Omes & -Omics:

Genome - all the genes of an organism
Transcriptome - all the transcripts (mRNAs) of an organism
Proteome - all the proteins of an organism
Metabolome - all metabolites (low molecular weight molecules participating in general metabolic reactions required for the maintenance and growth) of an organism

Sequencing Successes

T7 bacteriophage: completed in 1983; 39,937 bp, 59 coded proteins

Escherichia coli: completed in 1996; 4,639,221 bp, 4,293 ORFs

Saccharomyces cerevisiae: completed in 1996; 12,069,252 bp, 5,800 genes

Completed sequences

1995 – First complete bacterial genomes
2002 – About 35 bacterial genomes; 0.5-5 Mb; hundreds to 2000 genes
1996 April – Yeast (Saccharomyces cerevisiae) 12 Mb, 5,500 genes
1998 Dec. – Worm (Caenorhabditis elegans) 97 Mb, 19,000 genes
2000 March – Fly (Drosophila melanogaster) 137 Mb, 13,500 genes
2000 Dec. – Mustard (Arabidopsis thaliana) 125 Mb, 25,498 genes
2000 June – Human (Homo sapiens) 1st rough draft
2001 Feb 15/16 – Human, “working draft” 3000 Mb, 35,000~40,000 genes

Mouse, rat, chimp

BAC-by-BAC shotgun (public sequence): clone contig is a prerequisite

Total shotgun from the BAC ends (Celera): no prerequisites

Human Genome Organization

HUMAN GENOME

Nuclear genome: 3000 Mb, 25-35-40-65-80K genes
    Genes and gene-related sequences (30%), unique or moderately repetitive
        Coding DNA (10%)
        Noncoding DNA (90%): pseudogenes, gene fragments, introns, untranslated sequences, etc.
    Extragenic DNA (70%)
        Unique or low copy number (80%)
        Moderate to highly repetitive (20%): tandemly repeated or clustered repeats, interspersed repeats

Mitochondrial genome: 16.6 kb, 37 genes
    Two rRNA genes
    22 tRNA genes
    13 polypeptide-encoding genes

Human RNA genes (non-coding RNA transcripts)

About 100,000 RNA genes in the human genome (rough estimate)

rRNA
tRNA
Small nuclear RNA
Small nucleolar RNA
SRP RNA
MicroRNA
Antisense RNA
Non-coding gene mRNA isoforms
RNAs from transcribed pseudogenes

Human pseudogenes

Non-processed pseudogenes
Contain introns
Arise by duplications
Frequency of transfer depends on chromosomal context (pericentromeric fragments are transferred more often)

Processed pseudogenes
Do not contain introns
Arise by retrotransposition
Frequency of transfer depends on initial level of gene expression (highly expressed genes are transferred more often)

Complete or partial

Both types of pseudogenes are raw material for evolution

Molecular Biology Tools

Northern/Southern Blotting
Differential Display
RNAi (small RNA interference)
Serial Analysis of Gene Expression (SAGE)
DNA Microarrays or Gene Chips
Yeast two-hybrid analysis
Immuno-precipitation/pull-down
GFP Tagging & Microscopy

SAGE

Every mRNA molecule is converted into a short (10-14 base), unique tag. Equivalent to reducing all the people in a city into a telephone book with surnames

After creating the tags, these are assembled or concatenated into a long “list”

The list can be read using a DNA sequencer and the list compared to a database to ID genes or proteins and their frequency

SAGE

Convert mRNA to dsDNA
Digest with NlaIII
Split into 2 aliquots
Attach linkers
Linkers have PCR & tagging endonuclease (TE) sites
Cut with TE BsmFI
Mix both aliquots
Blunt-end ligate to make “ditag”
Concatenate & sequence
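The tag-counting idea behind SAGE can be sketched in a few lines of Python. This is a minimal illustration, not the laboratory protocol: it assumes each transcript is given as a plain cDNA string, takes the 10 bases following the 3'-most CATG (the NlaIII recognition sequence) as the tag, and ignores ditags and concatemers. The sequences and the function names `sage_tag` and `count_tags` are hypothetical.

```python
from collections import Counter
from typing import List, Optional

ANCHOR_SITE = "CATG"  # NlaIII recognition sequence
TAG_LENGTH = 10       # assumed tag length; the slides give a 10-14 base range


def sage_tag(cdna: str, tag_length: int = TAG_LENGTH) -> Optional[str]:
    """Return the tag that follows the 3'-most CATG site, or None if there is none."""
    cdna = cdna.upper()
    pos = cdna.rfind(ANCHOR_SITE)
    if pos == -1:
        return None
    start = pos + len(ANCHOR_SITE)
    tag = cdna[start:start + tag_length]
    # Discard tags that run off the end of the sequence.
    return tag if len(tag) == tag_length else None


def count_tags(transcripts: List[str]) -> Counter:
    """Count tags across a pool of transcripts (one string per mRNA molecule)."""
    tags = (sage_tag(t) for t in transcripts)
    return Counter(t for t in tags if t is not None)


# Toy pool: three molecules of one hypothetical transcript and one of another.
gene_a = "GGGCATGAAACCCGGGTTTAAA"
gene_b = "TTTCATGCCCCATGACGTACGTACGG"
counts = count_tags([gene_a, gene_a, gene_a, gene_b])
```

The tag counts stand in for transcript abundance: a tag seen three times as often suggests roughly three times as many mRNA molecules, provided the tag maps to a single gene.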

Hybridization

Nucleic acid hybridization is a fundamental tool in molecular genetics. It takes advantage of the complementary nature of double-stranded nucleic acids: DNA to DNA, RNA to DNA, or even RNA to RNA.

Nucleic acid probes are used extensively in many different diagnostic tests.

Hybridization is also used in cloning and PCR

Principles of hybridization

A probe is added to a complex mixture of target DNA. The mix is incubated under conditions that promote the formation of hydrogen bonds between complementary strands.

Factors that affect hybridization characteristics
Strand length
Base composition
Chemical environment

Principles of nucleic acid hybridization

Types of probes

Stringency

Strand length
The longer the probe, the more stable the duplex

Base composition
G:C base pairs are more stable than A:T, so the higher the % G:C, the more stable the duplex

Chemical environment
Na+ ions stabilize the duplex
Chemical denaturants (formamide or urea) destabilize hydrogen bonds.
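The three stringency factors above (length, base composition, chemical environment) can be combined into a rough melting-temperature estimate. The sketch below uses two widely quoted empirical approximations: the Wallace rule for short oligonucleotides and a salt-adjusted formula for longer duplexes. The exact coefficients vary between references, so the constants here are assumptions for illustration and the results should be read as ballpark figures only.

```python
import math


def gc_percent(seq: str) -> float:
    """Percentage of G and C bases in the sequence."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def wallace_tm(oligo: str) -> float:
    """Wallace rule for short oligos: 2 degrees C per A/T, 4 degrees C per G/C."""
    oligo = oligo.upper()
    at = oligo.count("A") + oligo.count("T")
    gc = oligo.count("G") + oligo.count("C")
    return 2.0 * at + 4.0 * gc


def salt_adjusted_tm(seq: str, na_molar: float = 0.05,
                     formamide_percent: float = 0.0) -> float:
    """Approximate Tm (degrees C) for longer duplexes.

    Assumed empirical form: higher Na+ and higher %GC raise Tm,
    shorter strands and formamide lower it.
    """
    n = len(seq)
    return (81.5
            + 16.6 * math.log10(na_molar)
            + 0.41 * gc_percent(seq)
            - 675.0 / n
            - 0.65 * formamide_percent)
```

Raising stringency (less salt, more formamide, higher wash temperature) moves the conditions closer to Tm, so only well-matched duplexes survive.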

Reassociation Kinetics

When double-stranded DNA is denatured by heat, the speed at which the strands re-form double-stranded DNA depends on the starting concentration of DNA. If there is a high concentration of complementary DNA, the time required will be reduced. Reassociation kinetics is the speed at which complementary single strands form duplexes. The two parameters are starting concentration (Co) and time (t) in sec.; their product is Cot. This dictates that single-copy genes hybridize more slowly than multicopy sequences, and therefore give weaker signals on a Southern.
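A small sketch of the idealized second-order model that usually underlies Cot analysis: the fraction of DNA still single-stranded is 1 / (1 + k * Co * t). The rate constant `k` and the concentrations below are hypothetical values chosen only to show why a multicopy sequence reassociates sooner than a single-copy one.

```python
def fraction_single_stranded(c0: float, t: float, k: float) -> float:
    """Fraction of strands still single-stranded after time t.

    Idealized second-order reassociation: C / C0 = 1 / (1 + k * C0 * t).
    c0: starting concentration of the sequence, t: time in seconds,
    k: rate constant (hypothetical value in the example below).
    """
    return 1.0 / (1.0 + k * c0 * t)


def cot_half(k: float) -> float:
    """Cot value at which half of the strands have reassociated (1 / k)."""
    return 1.0 / k


# Same total DNA, but the repeat is present in 1000 copies per genome,
# so its effective starting concentration is 1000 times higher.
k = 0.5
single_copy = fraction_single_stranded(c0=0.001, t=100.0, k=k)
repeat = fraction_single_stranded(c0=1.0, t=100.0, k=k)
```

After the same incubation time, almost all of the repeat has re-formed duplexes while most of the single-copy sequence is still single-stranded, which matches the weaker single-copy signal described above.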

Bioinformatics & Hybridization Techniques

Software tools to design probes, calculate melting temperature, GC content, stability, folding …

Tools to design Primers: short DNA sequences used to initiate the synthesis of DNA

Tools to design Probes: sequences of DNA or RNA used to detect complementary sequences by hybridization
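As a minimal illustration of what "detecting complementary sequences" means computationally, the sketch below finds where a probe could base-pair perfectly with a target strand by searching for the probe's reverse complement. It assumes plain DNA strings with only A, C, G and T and exact matches; real design tools also score mismatches, secondary structure and melting temperature. The function names are hypothetical.

```python
from typing import List

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def probe_binding_sites(target_strand: str, probe: str) -> List[int]:
    """0-based positions on the target strand where the probe pairs perfectly."""
    site = reverse_complement(probe)
    target = target_strand.upper()
    positions = []
    start = target.find(site)
    while start != -1:
        positions.append(start)
        start = target.find(site, start + 1)
    return positions


# Toy example: the probe ATGC pairs with every GCAT on the target strand.
sites = probe_binding_sites("AAAGCATTTTGCATAA", "ATGC")
```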

Tools to design Primers, Probes & cloning strategies:

Matlab
VectorNTI
MacVector: www.macvector.com/
http://array.iis.sinica.edu.tw/ups/
http://frodo.wi.mit.edu/
http://genome.jouy.inra.fr/cgi-bin/CloneIt/CloneIt

Tools to design Primers and Probes:

In-Silico PCR - search for a pair of primers
http://genome.ucsc.edu/cgi-bin/hgPcr
http://bioinfo.ut.ee/index.php?pid=1
http://bioinfo.ut.ee/mprimer3/
http://bioinfo.ut.ee/genometester/
http://bioinfo.ut.ee/maphdesigner/
https://vectordesigner.invitrogen.com/
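The core of an in-silico PCR search can be sketched as follows. This is not how the tools listed above are implemented; it is a toy version that assumes exact primer matches, a single linear template given as its top strand, and a hypothetical `max_product` size limit. The forward primer must match the top strand, and the reverse primer must match the bottom strand downstream of it.

```python
from typing import List

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def in_silico_pcr(template: str, forward: str, reverse: str,
                  max_product: int = 5000) -> List[str]:
    """Return predicted amplicons (top-strand sequence, primers included)."""
    template = template.upper()
    fwd = forward.upper()
    # On the top strand, the reverse primer shows up as its reverse complement.
    rev_site = reverse_complement(reverse)
    products = []
    f = template.find(fwd)
    while f != -1:
        r = template.find(rev_site, f + len(fwd))
        while r != -1:
            end = r + len(rev_site)
            if end - f <= max_product:
                products.append(template[f:end])
            r = template.find(rev_site, r + 1)
        f = template.find(fwd, f + 1)
    return products


# Toy template with one forward site (ACGTAC) and one reverse site (TTCAGA).
template = "TTTT" + "ACGTAC" + "GGGGGG" + "TTCAGA" + "CCCC"
amplicons = in_silico_pcr(template, forward="ACGTAC", reverse="TCTGAA")
```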

Dot blot or slot blot

Southern Blot

Northern Blot

Mutation detection by RFLP

Assay of RFLP (restriction site polymorphism)

This has a variety of applications, including VNTR RFLPs and DNA fingerprinting.
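A short sketch of why a restriction site polymorphism changes the band pattern. It digests two hypothetical alleles with EcoRI (recognition sequence GAATTC, cut after the first G) and compares fragment lengths. The allele sequences are invented for illustration, and the digest assumes a linear molecule and complete cutting.

```python
from typing import List


def digest(seq: str, site: str, cut_offset: int) -> List[int]:
    """Fragment lengths after cutting a linear DNA at every occurrence of site.

    cut_offset is the number of bases from the start of the site to the cut.
    """
    seq = seq.upper()
    cuts = []
    pos = seq.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)
        pos = seq.find(site, pos + 1)
    bounds = [0] + cuts + [len(seq)]
    return [b - a for a, b in zip(bounds, bounds[1:])]


ECORI_SITE, ECORI_OFFSET = "GAATTC", 1  # G^AATTC

# Allele A carries two EcoRI sites; allele B has lost the second one
# through a single base change (GAATTC -> GAGTTC).
allele_a = "A" * 10 + "GAATTC" + "C" * 10 + "GAATTC" + "T" * 10
allele_b = "A" * 10 + "GAATTC" + "C" * 10 + "GAGTTC" + "T" * 10

fragments_a = digest(allele_a, ECORI_SITE, ECORI_OFFSET)
fragments_b = digest(allele_b, ECORI_SITE, ECORI_OFFSET)
```

Allele A yields three fragments and allele B only two, one of them longer, which is the length difference a Southern blot probe would reveal.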

Detection of gene deletions by restriction mapping

In situ hybridization

Chromosome in situ hybridization
Metaphase or prometaphase chromosomes are probed with labeled DNA. The DNA can be labeled with a fluorochrome (FISH).

Tissue in situ hybridization
Sliced or whole-mounted preparations can be probed with RNA probes to detect mRNA expression

Hybridization: Summary

Hybridization is due to complementarity of DNA strands.
DNA can be labeled in various ways
Hybridization can detect identical or similar sequences.
A variety of techniques utilize hybridization of DNA or RNA probes: Southern Blot, RFLP, VNTRs, Mutation detection, deletion detection, Northern Blot, tissue specific expression, In situ hybridization
Microarrays are miniaturized hybridization platforms

DNA Microarrays

Principle is to analyze gene (mRNA) or protein expression through large scale non-radioactive Northern (RNA) or Southern (DNA) hybridization analysis
Brighter the spot, the more DNA
Microarrays are like Velcro chips made of DNA fragments attached to a substrate
Requires robotic arraying device and fluorescence microarray reader

Microarrays

“Probe”: single-stranded DNA with a defined identity tethered to a solid medium
“Target”: the labeled DNA or RNA

History & Types of Arrays

The first arrays, created in the mid 80s, were called macro arrays. They were fabricated by spotting DNA probes on a membrane-type material with spot sizes of about 300 microns, which limited the density of the spots to about 2000 probes. They mostly were used for DNA clones, PCR products or oligonucleotides and typically were used with radioactively-labeled targets.

History & Types of Arrays

Next came microarrays, which were created by using pin spotters. These are pin-based robotic systems that can dispense an accurate volume of a DNA solution in a spot of about 150 microns onto a glass slide. DNA clones, PCR products or pre-synthesized oligonucleotides can be bound to the glass surface to create high-density arrays

History & Types of Arrays

By the mid 90s, researchers were using 2 channel microarrays: Templates for genes of interest were obtained and amplified by PCR. Following purification and quality control, aliquots were printed on coated glass microscope slides. Total RNA from both the test and reference sample was fluorescently labeled with either Cy3– or Cy5–dUTP using a single round of reverse transcription. The fluorescent targets were pooled and allowed to hybridize, under stringent conditions, to the clones on the array.

History & Types of Arrays

Rather than making arrays in the laboratory using spotters, oligonucleotides can be synthesized in situ on a surface, creating high-density arrays with up to 500,000 probe sequences. The first company to commercialize this type of technology was Affymetrix, which uses a proprietary light-directed oligonucleotide synthesis approach (Affy GeneChips).

History & Types of Arrays

Agilent Technologies uses inkjet printing technology to build the oligonucleotides on standard format glass slides using phosphoramidite chemistry.

Nanogen developed an electronic microarray, utilizing the natural charge of the DNA

Illumina BeadChips - The Sentrix BeadChip technology is set up to perform multiple hybridizations in parallel. Probes: 50mer oligonucleotides

B&O: Chapter 16

Annotating array probes
Designing the Experiment
Data Collection and Management
Image processing
Measures of Expression
Normalization
Finding Significant Genes

B&O: Chapter 16, cont

Expression Vectors
Clustering Approaches
Beyond Statistical Significance and Clustering
The Classification Problem
Distances
Fisher Exact Test
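One common use of the Fisher exact test in expression analysis is asking whether a functional category is over-represented in a list of significant genes. The sketch below computes the one-sided (right-tailed) p-value as a hypergeometric tail probability. It is an illustration of the test itself, not of any particular package; the function name and the gene counts are hypothetical.

```python
from math import comb


def enrichment_p_value(hits_in_list: int, list_size: int,
                       hits_in_genome: int, genome_size: int) -> float:
    """One-sided Fisher exact p-value for over-representation.

    Probability of drawing at least hits_in_list category members when
    list_size genes are sampled without replacement from genome_size genes,
    of which hits_in_genome belong to the category.
    """
    max_hits = min(list_size, hits_in_genome)
    total = comb(genome_size, list_size)
    tail = sum(
        comb(hits_in_genome, k) * comb(genome_size - hits_in_genome, list_size - k)
        for k in range(hits_in_list, max_hits + 1)
    )
    return tail / total


# Hypothetical numbers: 40 of 200 significant genes fall in a category
# that contains 500 of the 20,000 genes on the array.
p = enrichment_p_value(hits_in_list=40, list_size=200,
                       hits_in_genome=500, genome_size=20000)
```

With 5 category genes expected by chance (200 x 500 / 20,000), observing 40 gives a very small p-value; when many categories are tested, a multiple-testing correction is still needed.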

The starting point: Annotating Array Probes

Approaches to construct DNA arrays: in situ synthesis, randomly assembled bead-based arrays, mechanically spotted arrays (cDNA clone, PCR-amplified amplicon or other material)

Annotation Resources: SOURCE, DRAGON, DAVID, RESOURCERER, TIGR Gene Indices, EGO databases (some no longer exist, see ex. links)

Mapping software tools: IPA

Designing the Experiment

2-color microarrays:
Plenty of RNA sample – direct comparison with dye swap (flip dye pairs)
Limited sample: balanced block design
More than 2 samples are compared: Reference design (common reference needed)

One color

Power calculations for statistically significant measures of gene expression
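A rough power calculation for a single gene can be sketched with the standard normal-approximation formula for comparing two group means: n per group = 2 * ((z_alpha + z_beta) * sigma / delta)^2. This is a simplification, assuming equal variance in both groups and a two-sided test. The effect size `delta` and standard deviation `sigma` are on the log2 scale, and the numbers below are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def replicates_per_group(delta: float, sigma: float,
                         alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate number of replicate arrays per group for one gene.

    delta: smallest log2 expression difference worth detecting
    sigma: standard deviation of log2 expression between replicates
    alpha: two-sided significance level; power: desired power
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)


# Detect a two-fold change (1 log2 unit) with replicate SD of 0.5.
n_single_test = replicates_per_group(delta=1.0, sigma=0.5)
# Stricter per-gene threshold, as needed when thousands of genes are tested.
n_strict = replicates_per_group(delta=1.0, sigma=0.5, alpha=0.001)
```

The stricter threshold roughly doubles the replicate count in this example, which is why the number of genes tested matters when planning an array experiment.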

Bioinformatics of Gene Expression

Data Collection and Management (MIAME, MAGE-ML)
Estimating Background
Measures of Expression (log ratio)
Normalization (2-channel arrays)
Filtering
Finding Significant Genes
Clustering
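The background, log-ratio and normalization steps listed above can be sketched for a two-channel array as follows. This is a minimal example under simple assumptions: local background is subtracted per spot, corrected intensities are floored at 1 to avoid taking the log of zero or a negative number, and normalization is a global median-centering of the log ratios. Intensity-dependent methods such as lowess are often preferred in practice. All intensities are invented.

```python
from math import log2
from statistics import median
from typing import List, Sequence


def log_ratios(red: Sequence[float], green: Sequence[float],
               red_bg: Sequence[float], green_bg: Sequence[float],
               floor: float = 1.0) -> List[float]:
    """Background-corrected log2(red / green) for each spot."""
    ratios = []
    for r, g, rb, gb in zip(red, green, red_bg, green_bg):
        r_corr = max(r - rb, floor)  # floor keeps the ratio positive
        g_corr = max(g - gb, floor)
        ratios.append(log2(r_corr / g_corr))
    return ratios


def median_center(values: Sequence[float]) -> List[float]:
    """Global normalization: shift log ratios so that their median is zero."""
    m = median(values)
    return [v - m for v in values]


# Four hypothetical spots: Cy5 (red) test sample, Cy3 (green) reference.
red = [1100.0, 2100.0, 4100.0, 8100.0]
green = [1100.0, 1100.0, 1100.0, 1100.0]
background = [100.0, 100.0, 100.0, 100.0]

m_raw = log_ratios(red, green, background, background)
m_norm = median_center(m_raw)
```

A log2 ratio of +1 means two-fold higher in the red channel and -1 means two-fold lower, so up- and down-regulation are treated symmetrically.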

Internet Resources

Expression Databases
Array Express: www.ebi.ac.uk/arrayexpress/
CIBEX: cibex.nig.ac.jp/
Gene Expression Omnibus: www.ncbi.nlm.nih.gov/geo/

Annotation
The Source database: source.stanford.edu
DAVID: http://david.abcc.ncifcrf.gov/
Gene Ontology Database, KEGG

Internet Resources

Expression Software
http://david.abcc.ncifcrf.gov/
BASE: base.thep.lu.se
Bioconductor: bioconductor.org
TM4 software: http://www.tm4.org/
SAM: http://www-stat.stanford.edu/~tibs/SAM/
Cluster/Treeview: http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/software.htm
HCE: http://www.cs.umd.edu/hcil/hce/
http://ihome.cuhk.edu.hk/~b400559/arraysoft_mining_specific.html

Commercial Software

Spotfire: spotfire.tibco.com
GeneSpring: www.genespring.com/
Partek Pro: http://www.partek.com/
IPA: http://www.ingenuity.com/

Beyond Statistics

IPA – looks for significant functional associations, GO- and literature based associations, canonical pathways, predefined and custom gene lists, creates networks, reconstructs significant processes, finds biomarkers

IPA, How to:

• Upload Data
• Analyze Gene Expression Data
• Compare Gene Expression Experiments
o Interpret results
• Functions, Diseases, Pathways, Networks, Lists, Molecules
o Explore Networks
• Highlight, Overlay, Merge, Export, Share

Learning IPA

Workshop in Stanford on April 21st: http://lane.stanford.edu/howto/index.html?id=_2608

Learning IPA

o Merging Networks
• Simple search for genes/proteins/chemicals
• Using Node View Pages
• Q&A
• Hands-on exercises

From Previous Lecture

Intermolecular Interactions
Interaction and Pathway Databases
Search and Explore in IPA (Simple search for genes/proteins/chemicals, Advanced Search for diseases, molecule types, locations)
Finding interaction partners and closest path in networks, in IPA
Quick functional assessments in IPA