Gibbs Variable Selection Xiaomei Pan, Kellie Poulin, Jigang Yang, Jianjun Zhu.
CSCE555 Bioinformatics Lecture 9 Gene Finding & Comparative genomics Meeting: MW 4:00PM-5:15PM...
Transcript of CSCE555 Bioinformatics Lecture 9 Gene Finding & Comparative genomics Meeting: MW 4:00PM-5:15PM...
CSCE555 BioinformaticsCSCE555 Bioinformatics
Lecture 9 Gene Finding & Comparative genomics
Meeting: MW 4:00PM-5:15PM SWGN2A21Instructor: Dr. Jianjun HuCourse page: http://www.scigen.org/csce555
University of South CarolinaDepartment of Computer Science and Engineering2008 www.cse.sc.edu.
HAPPY CHINESE NEW YEAR
OutlineOutline
Performance Evaluation of Gene Finding programs
Comparative genomics: ◦What to do◦Tools◦Databases◦Application case
04/20/23 2
Accuracy Measures of Gene-Finding Accuracy Measures of Gene-Finding ProgramsPrograms
Sensitivity vs. Specificity (adapted from Burset&Guigo 1996)
Sensitivity (Sn) Fraction of actual coding regions that are correctly predicted as coding
Specificity (Sp) Fraction of the prediction that is actually correct
Correlation
Coefficient (CC)
Combined measure of Sensitivity & Specificity
Range: -1 (always wrong) +1 (always right)
TP FP TN FN TP FN TN
Actual
Predicted
Coding / No Coding
TNFN
FPTP
Pre
dic
ted
Actual
No
Cod
ing
/ C
odin
g
Test DatasetsTest DatasetsSample Tests reported by
Literature◦Test on the set of 570 vertebrate
gene seqs (Burset&Guigo 1996) as a standard for comparison of gene finding methods.
◦Test on the set of 195 seqs of human, mouse or rat origin (named HMR195) (Rogic 2001).
Table: Relative Performance (adapted from Rogic 2001)
# of seqs - number of seqs effectively analyzed by each program; in parentheses is the number of seqs where the absence of gene was predicted;
Sn -nucleotide level sensitivity; Sp - nucleotide level specificity;
CC - correlation coefficient;
ESn - exon level sensitivity; ESp - exon level specificity
Results: Accuracy Statistics
Complicating Factors for Comparison
• Gene finders were trained on data that had genes homologous to test seq.
• Percentage of overlap is varied
• Some gene finders were able to tune their methods for particular data
• Methods continue to be developedNeeded
• Train and test methods on the same data.
• Do cross-validation (10% leave-out)
Why not Perfect?Why not Perfect? Gene Number
usually approximately correct, but may not
Organismprimarily for human/vertebrate seqs; maybe lower accuracy for non-vertebrates. ‘Glimmer’ & ‘GeneMark’ for prokaryotic or yeast seqs
Exon and Feature Type
Internal exons: predicted more accurately than Initial or Terminal exons;Exons: predicted more accurately than Poly-A or Promoter signals
Biases in Test Set (Resulting statistics may not be
representative)
Eukaryotic Gene Finding Eukaryotic Gene Finding ToolsTools Genscan (ab initio), GenomeScan (hybrid) (http://genes.mit.edu/) Twinscan (hybrid) (http://genes.cs.wustl.edu/) FGENESH (ab initio) (http://www.softberry.com/berry.phtml?topic=gfind) GeneMark.hmm (ab initio) (http://opal.biology.gatech.edu/GeneMark/eukhmm.cgi) MZEF (ab initio) (http://rulai.cshl.org/tools/genefinder/) GrailEXP (hybrid) (http://grail.lsd.ornl.gov/grailexp/) GeneID (hybrid) (http://www1.imim.es/geneid.html)
Outline for Comparative Outline for Comparative GenomicsGenomicsOverviewWhy do comparative genomic analysis?Assumptions/LimitationsGenome Analysis and Annotation Standard
ProcedureGeneral Purposes Databases for
Comparative GenomicsOrganism Specific DatabasesGenome Analysis EnvironmentsGenome Sequence Alignment ProgramsGenomic Comparison Visualization Tools
What is comparative What is comparative genomics?genomics?Analyzing & comparing genetic
material from different species to study evolution, gene function, and inherited disease
Understand the uniqueness between different species
What is compared?What is compared?Gene locationGene structure
◦ Exon number◦ Exon lengths◦ Intron lengths◦ Sequence similarity
Gene characteristics◦ Splice sites◦ Codon usage◦ Conserved synteny
Figure 1 Regions of the human and mouse homologous genes: Coding exons (white), noncoding exons (gray}, introns (dark gray), and intergenic regions (black). Corresponding strong (white) and weak (gray) alignment regions of GLASS are shown connected with arrows. Dark lines connecting the alignment regions denote very weak or no alignment. The predicted coding regions of ROSETTA in human, and the corresponding regins in mouse, are shown (white) between the genes and the alignment regions.
Sequenced prokaryotic Sequenced prokaryotic genomesgenomes
Bacteroides fragilis Opportunistic In progress Bordetella bronchiseptica Veterinary In progress Bordetella parapertussis Whooping cough Bordetella pertussis Whooping cough Complete Burkholderia cepacia Lung infections in CF In progress Burkholderia pseudomallei Melliodosis In progress Chlamidophila abortus Veterinary Funded Clostridium botulinum Botulism Funded Clostridium difficile Colitis In progress Corynebacterium diphtheriae Diphtheria Complete Erwinia carotovora Plant pathogen Funded Escherichia/Shigella spp. (5) Various In progress Mycobacterium bovis Tuberculosis In progress Mycobacterium marinum Various In progress Neisseria meningitidis (serogroup C) Bacterial meningitis In progress Salmonella typhi Typhoid fever Complete Salmonella spp. (5) Various In progress Staphylococcus aureus (MRSA) Various (Nosocomial) Complete Staphylococcus aureus (MSSA) Various (Community acquired) In progress Streptococcus pneumoniae Bacterial meningitis In progress Streptococcus pyogenes Various (ARF-associated) In progress Streptococcus suis Veterinary In progress Streptococcus uberis Veterinary In progress Streptomyces coelicolor Non-pathogenic Complete Tropheryma whipelli Whipple’s disease In progress Wolbachia (Culex quinquefasciatus) Vector (Bancroftian filariasis) In progress Wolbachia (Onchocerca volvulus) River Blindness Funded Yersinia enterocolitica Food poisoning In progress Yersinia pestis Plague Complete
Complete
Sequenced eukaryotic Sequenced eukaryotic genomesgenomes
Aspergillus fumigatus Farmer’s lung In progress Dictyostelium discoideum Soil amoeba In progress Entamoeba histolitica Amoebic dysentry In progress Leishmania major Leishmaniasis In progress Plasmodium falciparum Malaria In progress Schistosoma mansoni Bilharzia In progress Schizosaccharomyces pombe Fission yeast Complete Theileria annulata Veterinary In progress Toxoplasma gondii Toxoplasmosis In progress Trypanosoma brucei Sleeping sickness In progress
Bioinformatics Flow ChartBioinformatics Flow Chart
6. Gene & Protein expression data
7. Drug screening
Ab initio drug design ORDrug compound screening in database of molecules
8. Genetic variability
1a. Sequencing
1b. Analysis of nucleic acid seq.
2. Analysis of protein seq.
3. Molecular structure prediction
4. molecular interaction
5. Metabolic and regulatory networks
Complete sequence
Shotgun reads
Contigs
Genomic DNA
Shearing/Sonication
Subclone and Sequence
Assembly
Finishing
Finishing read
Genome Sequencing Genome Sequencing ProcessProcess
Genome Sequencing - Genome Sequencing - ReviewReview
Libraries
Sequencing
Release
Assembly
Annotation
Closure
Strategy
•Most genome will be sequenced and can be sequenced;
few problem are unsolvable.
Clone by clone vs whole genome shotgun
•Problem lies in understanding what you have:
•Gene prediction/gene finding
•Annotation
Subcloning; generate small insert libraries
Assembly: Process of taking raw single-pass reads into contiguous consensus sequence (Phred/Phrap)
Assembly
Libraries
Strategy
Sequencing
Closure: Process of ordering and merging consensus sequences into a single contiguous sequence
Closure
Annotation -DNA features (repeats/similarities)-Gene finding-Peptide features-Initial role assignment-Others- regulatory regions
Release Release data to the public e.g. EMBL or GenBank
Annotation of eukaryotic genomesAnnotation of eukaryotic genomes
transcription
RNA processing
translation
AAAAAAA
Genomic DNA
Unprocessed RNA
Mature mRNA
Nascent polypeptide
folding
Reactant A Product BFunction
Active enzyme
ab initio gene prediction
Comparative gene prediction
Functional identification
Gm3
Why do comparative Why do comparative genomics?genomics? Many of the genes encoded in each genome from the
genome projects had no known or predictable function
Analysis of protein set from completely sequenced genomes
Uniform evolutionary conservation of proteins in microbial genomes, 70% of gene products from sequenced genomes have homologs in distant genomes (Koonin et al., 1997)
Function of many of these genes can be predicted by comparing different genomes of known functional annotation and transferring functional annotation of proteins from better studied organisms to their orthologs in lesser studied organisms.
Cross species comparison to help reveal conserved coding regions
No prior knowledge of the sequence motif is necessary
Complement to algorithmic analysis
Assumptions/LimitationAssumptions/LimitationHomologous genes are relatively well
preserved while noncoding regions tend to show varying degrees of conservation. Conserved noncoding regions are believed to be important in regulating gene expression, maintaiing structural organization of the genome and most likely other possible functions.
Cross species comparative genomics is influenced by the evolutionary distance of the compared species.
Genome Analysis and Annotation: General Genome Analysis and Annotation: General ProcedureProcedure
Basic procedure to determine the functional and structural annotation of uncharacterized proteins:
Use a sequence similarity search programs such as BLAST or FASTA to identify all the functional regions in the sequence. If greater sensitivity is required then the Smith-Waterman algorithm based programs are preferred with the trade-off greater analysis time.
Identify functional motifs and structural domains by comparing the protein sequence against PROSITE, BLOCKS, SMART, CDD, or Pfam.
Predict structural features of the protein such as signal peptides, transmembrane segments, coiled-coil regions, and other regions of low sequence complexity
Generate a secondary and tertiary (if possible) structure prediction
Annotation:
◦ Transfer of function information from a well-characterized organism to a lesser studied organism and/or
◦ Use phylogenetic patterns (or profiles) and/or ◦ Use the phylogenetic pattern search tools (e.g. through COGs) to perform a
systematic formal logical operations (AND, OR, NOT) on gene sets -- differential genome display (Huynen et al., 1997).
Automated Genome AnnotationAutomated Genome Annotation
GeneQuiz – limited number of searches/day
MAGPIE – outside users cannot submit own seq
PEDANT – commercial version allow for full capacity
SEALS – semi automated
General Databases Useful for General Databases Useful for Comparative GenomicsComparative Genomics
Locus Link/RefSeq: http://www.ncbi.nih.gov/LocusLink/
PEDANT -Protein Extraction Description ANalysis Tool http://pedant.gsf.de/
MIPS – http://mips.gsf.de/COGs - Cluster of Orthologous Groups (of proteins)
http://www.ncbi.nih.gov/COG/KEGG - Kyoto Encyclopedia of Genes and Genomes
http://www.genome.ad.jp/kegg/MBGD - Microbial Genome Database
http://mbgd.genome.ad.jp/GOLD - Genome OnLine Database
http://wit.integratedgenomics.com/GOLD/TOGA – http://www.tigr.org/xxxxx
Problems with existing sequence Problems with existing sequence alignments algorithms for genomic alignments algorithms for genomic analysisanalysis Most algorithms were developed for comparing single
protein sequences or DNA sequences containing a single gene
Most algorithms were based on assigning a score to all the possible alignments (usually by the sum of the similarity/identity values for each aligned residue minus a penalty for the introduction of gaps) and then finding the optimal or near-optimal alignment based on the chosen scoring scheme.
Unfortunately, most of these programs cannot accurately handle long alignments.
Linear-space type of Smith-Waterman variants are too computationally intensive requiring specialized hardware (memory-limited) or very time-consuming. Higher speed vs increased sensitivity.
Genome-size comparative alignment toolsGenome-size comparative alignment tools ASSIRC - Accelerated Search for SImilarity Regions in Chromosomes
◦ ftp://ftp.biologie.ens.fr/pub/molbio/ (Vincens et al. 1998) BLAT –
◦ http://genome.ucsc.edu/cgi-bin/hgBlat?command=start (Kent xxx) DIALIGN - DIagonal ALIGNment
◦ http://www.gsf.de/biodv/dialign.html (Morgenstern et al. 1998; Morgenstern 1999( DBA - DNA Block Aligner
◦ http://www.sanger.ac.uk/Software/Wise2/dba.shtml (Jareborg et al. 1999( GLASS - GLobal Alignment SyStem
◦ http://plover.lcs.mit.edu/ (Batzoglou et al. 2000) LSH-ALL-PAIRS - Locality -Sensitve Hashing in ALL PAIRS
◦ Email: [email protected] (Buhler 2001) MegaBlast
◦ http://www.ncbi.nih.gov/blast/ (Zhang 2000) MUMmer - Maximal Unique Match (mer)
◦ http://www.tigr.org/softlab/ (Delcher et al. 1999) PIPMaker - Percent Identity Plot MAKER
◦ http://biocse.psu.edu/pipmaker/ (Schwartz et al. 2000) SSAHA – Sequence Search and Alignment by Hashing Algorithm
◦ http://www.sanger.ac.uk/Software/analysis/SSAHA/ WABA - Wobble Aware Bulk Aligner
◦ http://www.cse.ucsc.edu/~kent/xenoAli/ (Kent & Zahler 2000)
SSAHASSAHASequence Search and Alignment by Hashing
AlgorithmSoftware tool for very fast matching and
alignment of DNA sequences.Achieves fast search speed by converting
sequence information into a hash table data structure which can then be searched very rapidly for matches
http://www.sanger.ac.uk/Software/analysis/SSAHA/Run from the Unix command lineNeed > 1GB RAM (needs a lot of memory)SSAHA algorithm best for application requiring
exact or “almost exact” matches between two sequences – e.g. SNP detection, fast sequence assembly, ordering and orientation of contigs
Genome Analysis Genome Analysis EnvironmentEnvironmentMAGPIE - Automated Genome
Project Investigation EnvironmentPEDANTSEALS
Problems with Visualizing GenomesProblems with Visualizing Genomes
Alignment programs output often were visualized by text file, which can be intuitively difficult to interpret when comparing genomes.
Visualization tools needed to handle the complexity and volume of data and present the information in a comprehensive and comprehensible manner to a biologist for interpretation.
Genome Alignment Visualization tools need to provide: ◦ interpretable alignments,
◦ gene prediction and database homologies from different sources
◦ Interactive features: real time capabilities, zooming, searching specific regions of homologies
◦ Represent breaks in synteny
◦ Multiple alignments display
◦ Displaying contigs of unfinished genomes with finished genomes
◦ Handle various data formats
◦ Software availabilty (no black box)
Genome Comparison Visualization ToolGenome Comparison Visualization Tool
ACT - Artemis Comparison Tool (displays parsed BLAST alignments; based on Artemis – an annotation tool)◦ http://www.sanger.ac.uk/Software/ACT/
Alfresco (displays DBA alignments and ...)◦ http://www.sanger.ac.uk/Software/Alfresco/ (Jareborg & Durbin 2000)
PipMaker (displays BlastZ alignments)◦ http://bio.cse.psu.edu/pipmaker/ (Schwartz et al. 2000)
Enteric/Menteric/Maj (displays Blastz alignments)◦ http://glovin.cse.psu.edu/enterix/ (Florea et al. 2000; McClelland et al.
2000) Intronerator (displays WABA alignments and ...)
◦ http://www.cse.ucsc.edu/~kent/intronerator/ (Kent & Zahler 2000b) VISTA (Visualization Tool for Alignment) (displays GLASS
alignments)◦ http://www-gsd.lbl.gov/vista/
SynPlot (displays DIALIGN and GLASS alignments)◦ http://www.sanger.ac.uk/Users/igrg/SynPlot/
Artemis Comparison Tool Artemis Comparison Tool (ACT)(ACT)- ACT is a DNA sequence comparison viewer
based on Artemis- Can read complete EMBL and GenBank
entries or sequence in FASTA or raw format- Additional sequence feature can be in EMBL,
GenBank, GFF format- ACT is free software and is distributed under
the GNU Public License- Java based software- Latest release 2.0 better support Eukaryotic
Genome Comparison
http://www.sanger.ac.uk/Software/ACT/
A case A case Study:Comparison of Study:Comparison of mouse chromosome mouse chromosome 16 and the human 16 and the human genome:genome:
Mural et al., Science, 2002, 296:1661
Celera groupSynteny with human chr.’s
3,8,12,16,21,22 and rat chr.’s 10,11
Q: Why more breakpoints in mouse-human than in mouse-rat?
Q: Why more conserved genes in human than in rat?
•This also can occur between chromosomes•The longer the divergence time between 2 species, the more recombination has occurred•100 million years since human-mouse divergence•40 million years since rat-mouse divergence
Whole-genome shotgun sequencing:
1. Genome is cut into small sections2. Each section is hundreds or a few
thousand bp of DNA3. Each section is sequenced and put
in a database4. A computer aligns all sequences
together (millions of them from each chromosome) to form contigs
5. Contigs are arranged (using markers, etc) to form scaffolds
Q: What are the advantages of this over the traditional method?
Q: What are the potential sources of error?
1. Assembly of Mmu161. Assembly of Mmu16 Total size: 99Mbp1. Not one contiguous sequence (contig)2. 8,635 contigs on 20 “scaffolds”3. Average scaffold size: 10Mbp4. Number of gaps: 86155. Total size of gaps: ~6Mbp6. Total coverage: ~93Mbp
2. Identify genes in Mmu162. Identify genes in Mmu161. Scaffolds of >10kbp were examined (scaffolds larger than
1Mbp were chopped)2. Regions with repeat motifs were ignored using
RepeatMasker3. Several gene prediction engines use (GenScan, Grail,
Fgenes)4. Amino acid sequences from open reading frames searched
against nr protein db (NCBI)5. Nucleotide searchers (using DNA from across scaffolds)
performed against:1. Celera’s gene clusters2. Mmu, Rno, & Hsa EST db’s3. NCBI’s RefSeq mRNA db4. Celera’s dog genomic db5. Public pufferfish genomic db
2. Identify genes in Mmu162. Identify genes in Mmu166. 1055 genes with high & medium confidence were
predicted7. Other efforts have identified 1142 genes8. After visual annotation inspection, psuedogenes and
annotation errors removed, leaving 731 homologues genes
9. The genes found were mostly orthologues because they were reciprocal best matches by BLAST searches.
3. Identify regions of conserved 3. Identify regions of conserved synteny between Mmu16 and Hsasynteny between Mmu16 and Hsa
Regions of conserved synteny predicted by sequence similarity and by protein comparisons
Synteny based on sequence comparisons:1. Syntenic anchors were located - regions with high (80%)
similarity over short distances (~200bp or more).2. Average distance between anchors is 8kbp, but there are
gaps as large as 707kbp in the mouse and 3.4Mbp in the human
3. Identify regions of conserved 3. Identify regions of conserved synteny between Mmu16 and Hsasynteny between Mmu16 and Hsa
3. 56% of anchors were in mouse genes - exons mostly4. 44% in intergenic regions5. Relatively density is independent of coding/noncoding - making
the anchors an important marker of synteny (in addition to genes)
Human chr. Mmu len. Hsa len. No. anchors bad anch. (% incon.) Orthologues16 10,461 12,329 1,429 21 (1.5) 878 1,284 1,491 121 1 (0.8) 612 363 306 31 3 (9.7) 322 2,081 2,273 418 8 (1.9) 303q27-29 13,557 16,461 1,714 18 (1.0) 1073q11.1-13.3 41,660 46,493 5,485 63 (1.1) 16521 22,327 28,421 2,127 27 (1.3) 111
SummarySummaryPerformance evaluation of gene-finding
programsComparative genomicsComparative genomics analysis example