Crash course in Omics

Post on 05-Jan-2016

EuPathDB.org, University of Pennsylvania

Genome and Annotation

Structure

Phylogenetics

BLAST Similarities

SNPs

Expression profiling arrays (many platforms)

Comparative Genomics

Isolates

Phenotype

Subcellular Location

Protein Expression

Orthologs

RNA Sequencing

Putative Function (GO)

Interactions

ChIP-Chip

Pathways

Genome assembly

• 5X random genome shotgun
• Library insert size
• Paired-end? (Mated end pairs)
• Contigs
• Scaffolds

[Figure: paired end reads (mates) from a library insert. Labels: Primer, Sequence, End Reads (Mates), 550bp, Mean & Std. Dev. is known, 2-pair, Scaffold]

Gaps in scaffolds are traditionally indicated by 100 “N’s”
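The N-run convention above can be used to recover the contigs of a scaffold. The following is a minimal sketch, assuming gaps are written as runs of "N"; the function name `split_scaffold` and the example sequence are hypothetical and not part of the original slides.

```python
import re

def split_scaffold(scaffold, min_gap=1):
    """Split a scaffold into contigs at runs of N.

    Returns (contigs, gaps), where gaps are (start, end) spans of each N run.
    min_gap is an illustrative parameter: the shortest N run treated as a gap.
    """
    gap_pattern = re.compile(r"N{%d,}" % min_gap, re.IGNORECASE)
    contigs, gaps = [], []
    pos = 0
    for match in gap_pattern.finditer(scaffold):
        if match.start() > pos:
            contigs.append(scaffold[pos:match.start()])
        gaps.append((match.start(), match.end()))
        pos = match.end()
    if pos < len(scaffold):
        contigs.append(scaffold[pos:])
    return contigs, gaps

# Hypothetical scaffold: two short contigs joined by a conventional 100-N gap.
scaffold = "ACGTACGT" + "N" * 100 + "TTGACA"
contigs, gaps = split_scaffold(scaffold)
print(contigs)
print(gaps)
```

Because the run of 100 N's is a convention, the length of the run should not be read as a measured gap size.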

Pairs Give Order & Orientation

[Figure labels: Consensus, Reads, SNPs, Distance?, Plasmid, Fosmid, NGS]

Anatomy of a WGS Assembly

[Figure labels: Chromosome, STS, STS-mapped Scaffolds, Contig, Gap (mean & std. dev. known), Read pair (mates), Consensus, Reads, SNPs]

Six Frame Translation

ORF-finding

ORFs ≠ Genes

>Translation Frame 1
MQKPVCLVVAMTPKRGIGINNGLPWPHLTTDFKHFSRVTKTTPEEASRLN
GWLPRKFAKTGDSGLPSPSVGKRFNAVVMGRKTWESMPRKFRPLVDRLNI
VVSSSLKEEDIAAEKPQAEGQQRVRVCASLPAALSLLEEEYKDSVDQIFV
VGGAGLYEAALSLGVASHLYITRVAREFPCDVFFPAFPGDDILSNKSTAA
QAAAPAESVFVPFCPELGREKDNEATYRPIFISKTFSDNGVPYDFVVLEK
RRKTDDAATAEPSNAMSSLTSTRETTPVHGLQAPSSAAAIAPVLAWMDEE
DRKKREQKELIRAVPHVHFRGHEEFQYLDLIADIINNGRTMDDRT
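Six-frame translation and ORF-finding can be sketched in a few lines. This is an illustration only, assuming the standard genetic code and a simple definition of an ORF (from a methionine to the next stop codon or the end of the sequence); the function names and the tiny input sequence are hypothetical. As the slide says, ORFs found this way are not necessarily genes: the sketch ignores introns and makes no attempt to judge coding potential.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(zip((a + b + c for a in BASES for b in BASES for c in BASES),
                       AMINO_ACIDS))
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]

def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))

def six_frame_translations(dna):
    """Return translations for frames +1, +2, +3, -1, -2, -3."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames

def find_orfs(protein, min_len=2):
    """Return stretches that start at the first M between stop codons ('*')."""
    orfs = []
    for segment in protein.split("*"):
        start = segment.find("M")
        if start != -1 and len(segment) - start >= min_len:
            orfs.append(segment[start:])
    return orfs

frames = six_frame_translations("ATGAAATAG")  # hypothetical toy sequence
for name, protein in frames.items():
    print(name, protein, find_orfs(protein))
```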

Terminology

Eukaryotic Relationships ca. 2005

Keeling, 2005

Homology

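[Figure: an early globin gene undergoes gene duplication, giving an α-chain gene and a β-chain gene. The α genes of frog, chick and mouse are ORTHOLOGS of one another, as are the β genes; α and β genes are PARALOGS; all of them are HOMOLOGS.]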

Evolutionary relationships

• Homology - related by evolutionary descent; not equivalent to similarity

• Orthology - same gene in different organisms, e.g. alpha hemoglobin in humans and chimps

• Paralogy - genes within an organism related by gene duplication, e.g. alpha and beta hemoglobin in humans

• Xenology - genes related by gene transfer

Synteny = large regions of chromosomes containing the same genes

Synteny among Plasmodia

Expression Profiles

• The pattern of expression of one or more genes over time or a set of experimental conditions, e.g. during development or a drug treatment or in a genetic mutant such as a gene knock-out.

• Always… has a time and space component

Microarrays

• cDNA microarrays
• “GeneChip” in situ synthesized oligonucleotide arrays
• Oligomer (~70mer) arrays

Experiments are almost always competitions between conditions or stages

The RNA samples from the test and the control are labeled with different colors in a reverse-transcription reaction and then hybridized, together, competitively to a slide or chip containing gene sequences in multiple copies.

Ratios of experimental to control expression are often expressed as colors rather than numbers
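The ratio-to-color idea can be made concrete with a small sketch. It assumes the common two-color display in which red marks higher expression in the experimental sample and green marks higher expression in the control; the slides do not state which colors are used, and the function names, the threshold and the intensity values are hypothetical.

```python
import math

def log2_ratio(test, control):
    """Log2 of experimental over control signal; 0 means no change."""
    return math.log2(test / control)

def ratio_color(test, control, threshold=1.0):
    """Map a log2 ratio to a display color.

    threshold=1.0 (a two-fold change) is an illustrative cutoff, and the
    red/green assignment is an assumed convention.
    """
    value = log2_ratio(test, control)
    if value >= threshold:
        return "red"    # higher in the experimental sample
    if value <= -threshold:
        return "green"  # higher in the control sample
    return "black"      # little or no change

# Hypothetical spot intensities (test, control).
for test, control in [(400, 100), (100, 400), (120, 100)]:
    print(test, control, round(log2_ratio(test, control), 2), ratio_color(test, control))
```

Working in log2 makes a two-fold increase (+1) and a two-fold decrease (-1) symmetric around zero, which a raw ratio does not.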

Clustered Microarray Data

Genes with similar expression profiles are grouped together
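One way to see what "similar expression profiles" means is to compare profiles by correlation. The sketch below is a deliberately simple greedy grouping by Pearson correlation, not the clustering method behind the slide's figure (which is not specified); the gene names, values and the 0.9 cutoff are all hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length profiles.

    Raises ZeroDivisionError for a constant profile (zero variance).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def group_by_profile(profiles, min_r=0.9):
    """Put each gene into the first group whose seed gene it correlates with."""
    groups = []
    for gene, profile in profiles.items():
        for group in groups:
            if pearson(profile, profiles[group[0]]) >= min_r:
                group.append(gene)
                break
        else:
            groups.append([gene])
    return groups

# Hypothetical log ratios over four time points.
profiles = {
    "geneA": [0.1, 1.0, 2.1, 3.0],
    "geneB": [0.0, 0.9, 2.0, 3.2],
    "geneC": [3.0, 2.0, 1.1, 0.2],
}
groups = group_by_profile(profiles)
print(groups)
```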

Other RNA expression

• Expressed Sequence Tags, ESTs
– Usually represent partial cDNA
– Often clustered
– Come from libraries that may, or may not, be normalized
– Often used to identify genes in genomes and locations of introns

• SAGE-tags (Serial Analysis of Gene Expression)
– Primary purpose is relative levels of gene expression

• RNA-Seq (NGS)
– Little sequence bias
– Quantitative
– Can be strand-specific

Genes can be located on either DNA strand

Overview of transcription: Either strand can serve as a template for a gene

Figure 8-4

Sequences of DNA and transcribed RNA

Convention: Gene location = non-template strand, i.e. same as the mRNA
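The convention can be checked with a short sketch: transcribing the template strand gives an mRNA with the same sequence as the non-template strand, with U in place of T. The function names and the example sequence are hypothetical; both strands are written 5' to 3'.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Return the opposite strand, written 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]

def transcribe_from_template(template):
    """mRNA (5' to 3') produced from a template strand given 5' to 3'."""
    return reverse_complement(template).replace("T", "U")

non_template = "ATGGCCTTAA"                # hypothetical gene sequence as annotated
template = reverse_complement(non_template)
mrna = transcribe_from_template(template)

print(template)  # strand read by the polymerase
print(mrna)      # same letters as non_template, with U for T
```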

Complex patterns of eukaryotic mRNA splicing: What is a Gene?

Figure 8-14

Bioinformatics uses algorithms

• Algorithms are sets of rules for solving problems or identifying patterns

• Algorithms can be general or case specific and often need to be trained

• Computational analyses, like wet-bench analyses, are only as good as the tools, techniques and materials allow, and all interpretations come with caveats (like the experimental conditions, often called parameters in bioinformatics).

How to find an intron

• Usually begins with GT and ends with AG
• Must be longer than 19 nucleotides
• Must contain a branchpoint “A”
• Donor GT often followed by a sequence pattern. This pattern is species-specific
• Acceptor AG often preceded by pyrimidine stretch
• Has a mean length of “X” as is observed in this species
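The rules above can be turned into a simple pattern filter. This is a sketch under stated assumptions, not a gene-prediction method: it checks only GT...AG boundaries, the minimum length, the presence of an "A" inside the intron, and a pyrimidine stretch before the AG. The function name, the six-base stretch length and the example sequence are hypothetical, and the species-specific donor pattern and mean length are left out because they would have to be trained.

```python
import re

def candidate_introns(dna, min_len=20, min_pyrimidines=6):
    """Return (start, end, sequence) for spans that satisfy the simple rules.

    min_len=20 encodes "longer than 19 nucleotides"; min_pyrimidines is an
    assumed length for the pyrimidine stretch before the acceptor AG.
    """
    dna = dna.upper()
    donors = [m.start() for m in re.finditer("(?=GT)", dna)]
    acceptor_ends = [m.start() + 2 for m in re.finditer("(?=AG)", dna)]
    hits = []
    for start in donors:
        for end in acceptor_ends:
            if end - start < min_len:
                continue
            intron = dna[start:end]
            interior = intron[2:-2]                      # between GT and AG
            tract = intron[-2 - min_pyrimidines:-2]      # bases just before AG
            if "A" in interior and all(base in "CT" for base in tract):
                hits.append((start, end, intron))
    return hits

# Hypothetical sequence: exon, a 22-nt intron, exon.
dna = "ATGGCC" + "GTAAGTCTAACTTTTCTTTCAG" + "GCTTGA"
hits = candidate_introns(dna)
for start, end, intron in hits:
    print(start, end, intron)
```

On real genomic sequence a filter like this returns many false candidates, which is one reason different prediction methods disagree.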

Different prediction methods often generate different results

Prediction 1

Prediction 2

Protein Expression/Sequence

Data
• MW-Isoelectric point
• MW
• Sequence/spans

Technology
• 2D gel electrophoresis
• Mass spectrometry
• Tandem MS (MS-MS, LC MS-MS etc)

Typical 2D gel

High throughput mass spectrometry

• Direct identification of proteins from biological sample.
• Capillary liquid chromatography apparatus (LC) coupled with...
• Electrospray tandem mass spectrometry (MS/MS)
• Sequest, Mascot, or other software links mass spectra with genomic sequence database.

How Tandem MS Works

[Figure: a complex protein mixture is broken into peptides, which are separated by liquid chromatography and ionized. The figure labels the steps Measurement, Isolation and Fragmentation, the last by Collision-Induced Dissociation (CID).]

Sequest Database Search

[Figure: a tandem mass spectrum from the mass spectrometer is compared by correlation analysis with theoretical mass spectra derived from a protein database, nucleic acid database or EST database, giving a ranked score of matched peptides.]

ENNPCKLQYDYNTNVTHGFGQEYPCETDIVERFSDTEGAQCDKKKIKDNSEGACAPYRRLHVCVRNLENINDYSKINNKHNLLVEVCLAAKYEGESITGRYPQHQETNPDTKSQLCTVLARSFADIGDIIRGKDLYRGGNTKEKKKRKKLEENLKTIFGHIYDELKNGKTNGEEELQKRYRGDKDNDFYQLREDWWDANRETVWKAITCNAGSYQYSQPTCGRGEIPYVTLSKCQCIAGEVPTYFDYVPQYLRWFEEWAEDFCRKKKKKIPNVKTNCRQVQRGKEKYCDRDGYNCDGTIRKQYIYRLDTDCTKCSLACKTFAEWIDNQKEQFDKQKQKYQNEISGGGGRRQKRSTHSTKEYEGYEKHFNEELRNEGKDVRSFLQLLSKEKICKERIQVGEETANYGNFENESNTFSHTEYCDRCPLCGVDCSSDNCRKKPDKSCDEQITDKEYPPENTTKIPKLTAEKRKTGILKKYEKFCKNSDGNNGGQIKKWECHYEKNDKDDGNGDINNCIQGDWKTSKNVYYPISYYSFFYGSIIDMLNESIEWRERLKSCINDAKLGKCRKGCKNPCECYKRWVEKKKDEWDKIKEFFRKQKDLLKDIAGMDAGELLEFYLENIFLEDMKNANGDPKVIEKFKEILGKENEEVQDPLKTKKTIDDFLEKELNEAKNCVEKNPDNECPKQKAPGDGAAPSDPPREDITHHDGEHSSDEDEEEEEEEEQQPPAEGTEQGEEKSESKEVVEQQETPQKDTEKTVPTTTPTVDVCDTVKTALADTGSLNAACSLKYVTGKNYGWRCIAPSGTTSGKDGAICVPPRTQELCLYYLKELSDTTQKGLREAFIKTAAQETYLLWQKYKEDKQNETASTELDIDDPQTQLNGGEIPEDFKRQMFYTFGDYRDLFLGRYIGNDLDKVNNNITAVFQNGDHIPNGQKTDRQRQEFWGTYGKDIWKGMLCALQEAGGKKTLTETYNYSNVTFNGHLTGTKLNEFASRPSFLRWMTEWGDQFCRERITQLQILKERCMVYQYNGDKGKDDKKEKCTEACTYYKEWLTNWQDNYKKQNQRYTEVKGTSPYKEDSDVKESKYAHGYLRKILKNIICTSGTDIAYCNCMEGTSTTDSSNNDNIPESLKYPPIEIEEGCTCKDPSPGEVIPEKKVPEPKVLPKPPKLPKRQPKERDFPTPALKNAMLSSTIMWSIGIGFATFTYFYLKKKTKSTIDLLRVINIPKSDYDIPTKLSPNRYIPYTSGKYRGKRYIYLEGDSGTDSGYTDHYSDITSSSESEYEELDINDIYAPRAPKYKTLIEVVLEPSGNNTTASGNNTPSDTQNDIQNDGIPSSKITDNEWNTLKDEFISQYLQSEQPNDVPNDYSSGDIPLNTQPNTLYFDNPDEKPFITSIHDRDLYSGEEYSYNVNMVNTNNDIPISGKNGTYSGIDLINDSLNSNN

Peptide database

Note: ORFs in addition to predicted genes must be searched
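Building a peptide database from protein or ORF translations can be sketched as an in silico digest followed by a mass calculation. The sketch assumes trypsin as the protease (cleavage after K or R, but not before P), which the slides do not specify, and it matches only on whole-peptide monoisotopic mass. Search software such as Sequest goes further and correlates the measured fragment spectrum with a theoretical one; that step is not shown. Function names and the tolerance are hypothetical; the input is the start of the frame 1 translation shown earlier.

```python
# Monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056

def tryptic_digest(protein):
    """Cleave after K or R unless the next residue is P (assumed trypsin rule)."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        following = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and following != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides

def peptide_mass(peptide):
    """Monoisotopic mass of the neutral peptide."""
    return sum(RESIDUE_MASS[residue] for residue in peptide) + WATER

def match_mass(observed, peptides, tolerance=0.01):
    """Peptides whose mass lies within an illustrative tolerance (Da)."""
    return [p for p in peptides if abs(peptide_mass(p) - observed) <= tolerance]

protein = "MQKPVCLVVAMTPKRGIGINNGLPWPHLTTDFK"
peptides = tryptic_digest(protein)
for peptide in peptides:
    print(peptide, round(peptide_mass(peptide), 4))
```

Running the same digest over ORF translations as well as predicted genes is what lets a spectrum match a peptide from a gene that the annotation missed.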

Mass spectrometry can also be used to measure metabolites and other chemical compounds

Homologous chromosomes (in a diploid)

Loci, alleles and SNPs in a population

A

a

AAGCCTCATC

ACGCCTCATC

SNP = Single Nucleotide Polymorphism
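The two sequences on this slide differ at a single position, which a short comparison makes explicit. This is a minimal sketch assuming the two alleles are already aligned and of equal length; the function name `find_snps` is hypothetical.

```python
def find_snps(allele_1, allele_2):
    """Return (position, base_1, base_2) for each mismatch, 1-based positions."""
    if len(allele_1) != len(allele_2):
        raise ValueError("sequences must be aligned and of equal length")
    return [(i + 1, x, y)
            for i, (x, y) in enumerate(zip(allele_1, allele_2))
            if x != y]

snps = find_snps("AAGCCTCATC", "ACGCCTCATC")
print(snps)
```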

Alleles and Phenotype

• Some phenotypes are caused by a single locus in the genome and a single allele at that locus (e.g. some flower colors, or Drosophila eye color)

• Other phenotypes (Type-I diabetes, heart disease) are multi-locus or “complex” (i.e. many genes are involved, each potentially with many alleles)

Population data

Data
• Single Nucleotide Polymorphisms, SNPs
• Alleles
• Allele frequency
• Haplotypes

Technology
• Chip-Seq
• NGS

Alleles have frequencies in different populations
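An allele frequency is the share of all sampled allele copies that carry a given allele. The sketch below assumes diploid genotypes written as two-letter strings; the populations and genotypes are invented for illustration and the function name is hypothetical.

```python
from collections import Counter

def allele_frequencies(genotypes):
    """Frequency of each allele among all allele copies in the sample."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())
    return {allele: count / total for allele, count in counts.items()}

# Hypothetical genotypes at one locus in two populations.
population_1 = ["AA", "Aa", "Aa", "aa"]
population_2 = ["AA", "AA", "Aa", "AA"]

freq_1 = allele_frequencies(population_1)
freq_2 = allele_frequencies(population_2)
print(freq_1)
print(freq_2)
```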

Populations and alleles have geographic boundaries

A parasite isolate comes from a particular population and a particular location, and will have a specific haplotype (i.e. a representation of alleles), often characterized via SNPs

Parasite Isolates

Data
• Species, Strain
• Isolate
• Location, Date
• SNP
• Sequence
• Allele
• Phenotype

Technology
• PCR-RFLP
• Sequencing
• SNP chip
• GPS

Experimental systems

Pathogen

Vector

Host

Infectious Disease Paradigm