CSCE555 Bioinformatics Lecture 9 Gene Finding & Comparative genomics Meeting: MW 4:00PM-5:15PM...

CSCE555 BioinformaticsCSCE555 Bioinformatics

Lecture 9 Gene Finding & Comparative genomics

Meeting: MW 4:00PM-5:15PM SWGN2A21Instructor: Dr. Jianjun HuCourse page: http://www.scigen.org/csce555

University of South CarolinaDepartment of Computer Science and Engineering2008 www.cse.sc.edu.

HAPPY CHINESE NEW YEAR

OutlineOutline

Performance Evaluation of Gene Finding programs

Comparative genomics: ◦What to do◦Tools◦Databases◦Application case

04/20/23 2

Accuracy Measures of Gene-Finding Accuracy Measures of Gene-Finding ProgramsPrograms

Sensitivity vs. Specificity (adapted from Burset&Guigo 1996)

Sensitivity (Sn) Fraction of actual coding regions that are correctly predicted as coding

Specificity (Sp) Fraction of the prediction that is actually correct

Correlation

Coefficient (CC)

Combined measure of Sensitivity & Specificity

Range: -1 (always wrong) +1 (always right)

TP FP TN FN TP FN TN

Actual

Predicted

Coding / No Coding

TNFN

FPTP

Pre

dic

ted

Actual

No

Cod

ing

/ C

odin

g

Test DatasetsTest DatasetsSample Tests reported by

Literature◦Test on the set of 570 vertebrate

gene seqs (Burset&Guigo 1996) as a standard for comparison of gene finding methods.

◦Test on the set of 195 seqs of human, mouse or rat origin (named HMR195) (Rogic 2001).

Table: Relative Performance (adapted from Rogic 2001)

# of seqs - number of seqs effectively analyzed by each program; in parentheses is the number of seqs where the absence of gene was predicted;

Sn -nucleotide level sensitivity; Sp - nucleotide level specificity;

CC - correlation coefficient;

ESn - exon level sensitivity; ESp - exon level specificity

Results: Accuracy Statistics

Complicating Factors for Comparison

• Gene finders were trained on data that had genes homologous to test seq.

• Percentage of overlap is varied

• Some gene finders were able to tune their methods for particular data

• Methods continue to be developedNeeded

• Train and test methods on the same data.

• Do cross-validation (10% leave-out)

GenScan compared to other GenScan compared to other gene-finding programsgene-finding programs

Why not Perfect?Why not Perfect? Gene Number

usually approximately correct, but may not

Organismprimarily for human/vertebrate seqs; maybe lower accuracy for non-vertebrates. ‘Glimmer’ & ‘GeneMark’ for prokaryotic or yeast seqs

Exon and Feature Type

Internal exons: predicted more accurately than Initial or Terminal exons;Exons: predicted more accurately than Poly-A or Promoter signals

Biases in Test Set (Resulting statistics may not be

representative)

Eukaryotic Gene Finding Eukaryotic Gene Finding ToolsTools Genscan (ab initio), GenomeScan (hybrid) (http://genes.mit.edu/) Twinscan (hybrid) (http://genes.cs.wustl.edu/) FGENESH (ab initio) (http://www.softberry.com/berry.phtml?topic=gfind) GeneMark.hmm (ab initio) (http://opal.biology.gatech.edu/GeneMark/eukhmm.cgi) MZEF (ab initio) (http://rulai.cshl.org/tools/genefinder/) GrailEXP (hybrid) (http://grail.lsd.ornl.gov/grailexp/) GeneID (hybrid) (http://www1.imim.es/geneid.html)

Comparative GenomicsComparative Genomics

Outline for Comparative Outline for Comparative GenomicsGenomicsOverviewWhy do comparative genomic analysis?Assumptions/LimitationsGenome Analysis and Annotation Standard

ProcedureGeneral Purposes Databases for

Comparative GenomicsOrganism Specific DatabasesGenome Analysis EnvironmentsGenome Sequence Alignment ProgramsGenomic Comparison Visualization Tools

What is comparative What is comparative genomics?genomics?Analyzing & comparing genetic

material from different species to study evolution, gene function, and inherited disease

Understand the uniqueness between different species

What is compared?What is compared?Gene locationGene structure

◦ Exon number◦ Exon lengths◦ Intron lengths◦ Sequence similarity

Gene characteristics◦ Splice sites◦ Codon usage◦ Conserved synteny

Figure 1 Regions of the human and mouse homologous genes: Coding exons (white), noncoding exons (gray}, introns (dark gray), and intergenic regions (black). Corresponding strong (white) and weak (gray) alignment regions of GLASS are shown connected with arrows. Dark lines connecting the alignment regions denote very weak or no alignment. The predicted coding regions of ROSETTA in human, and the corresponding regins in mouse, are shown (white) between the genes and the alignment regions.

Sequenced prokaryotic Sequenced prokaryotic genomesgenomes

Bacteroides fragilis Opportunistic In progress Bordetella bronchiseptica Veterinary In progress Bordetella parapertussis Whooping cough Bordetella pertussis Whooping cough Complete Burkholderia cepacia Lung infections in CF In progress Burkholderia pseudomallei Melliodosis In progress Chlamidophila abortus Veterinary Funded Clostridium botulinum Botulism Funded Clostridium difficile Colitis In progress Corynebacterium diphtheriae Diphtheria Complete Erwinia carotovora Plant pathogen Funded Escherichia/Shigella spp. (5) Various In progress Mycobacterium bovis Tuberculosis In progress Mycobacterium marinum Various In progress Neisseria meningitidis (serogroup C) Bacterial meningitis In progress Salmonella typhi Typhoid fever Complete Salmonella spp. (5) Various In progress Staphylococcus aureus (MRSA) Various (Nosocomial) Complete Staphylococcus aureus (MSSA) Various (Community acquired) In progress Streptococcus pneumoniae Bacterial meningitis In progress Streptococcus pyogenes Various (ARF-associated) In progress Streptococcus suis Veterinary In progress Streptococcus uberis Veterinary In progress Streptomyces coelicolor Non-pathogenic Complete Tropheryma whipelli Whipple’s disease In progress Wolbachia (Culex quinquefasciatus) Vector (Bancroftian filariasis) In progress Wolbachia (Onchocerca volvulus) River Blindness Funded Yersinia enterocolitica Food poisoning In progress Yersinia pestis Plague Complete

Complete

Sequenced eukaryotic Sequenced eukaryotic genomesgenomes

Aspergillus fumigatus Farmer’s lung In progress Dictyostelium discoideum Soil amoeba In progress Entamoeba histolitica Amoebic dysentry In progress Leishmania major Leishmaniasis In progress Plasmodium falciparum Malaria In progress Schistosoma mansoni Bilharzia In progress Schizosaccharomyces pombe Fission yeast Complete Theileria annulata Veterinary In progress Toxoplasma gondii Toxoplasmosis In progress Trypanosoma brucei Sleeping sickness In progress

Bioinformatics Flow ChartBioinformatics Flow Chart

6. Gene & Protein expression data

7. Drug screening

Ab initio drug design ORDrug compound screening in database of molecules

8. Genetic variability

1a. Sequencing

1b. Analysis of nucleic acid seq.

2. Analysis of protein seq.

3. Molecular structure prediction

4. molecular interaction

5. Metabolic and regulatory networks

Complete sequence

Shotgun reads

Contigs

Genomic DNA

Shearing/Sonication

Subclone and Sequence

Assembly

Finishing

Finishing read

Genome Sequencing Genome Sequencing ProcessProcess

Genome Sequencing - Genome Sequencing - ReviewReview

Libraries

Sequencing

Release

Assembly

Annotation

Closure

Strategy

•Most genome will be sequenced and can be sequenced;

few problem are unsolvable.

Clone by clone vs whole genome shotgun

•Problem lies in understanding what you have:

•Gene prediction/gene finding

•Annotation

Subcloning; generate small insert libraries

Assembly: Process of taking raw single-pass reads into contiguous consensus sequence (Phred/Phrap)

Assembly

Libraries

Strategy

Sequencing

Closure: Process of ordering and merging consensus sequences into a single contiguous sequence

Closure

Annotation -DNA features (repeats/similarities)-Gene finding-Peptide features-Initial role assignment-Others- regulatory regions

Release Release data to the public e.g. EMBL or GenBank

Annotation of eukaryotic genomesAnnotation of eukaryotic genomes

transcription

RNA processing

translation

AAAAAAA

Genomic DNA

Unprocessed RNA

Mature mRNA

Nascent polypeptide

folding

Reactant A Product BFunction

Active enzyme

ab initio gene prediction

Comparative gene prediction

Functional identification

Gm3

Why do comparative Why do comparative genomics?genomics? Many of the genes encoded in each genome from the

genome projects had no known or predictable function

Analysis of protein set from completely sequenced genomes

Uniform evolutionary conservation of proteins in microbial genomes, 70% of gene products from sequenced genomes have homologs in distant genomes (Koonin et al., 1997)

Function of many of these genes can be predicted by comparing different genomes of known functional annotation and transferring functional annotation of proteins from better studied organisms to their orthologs in lesser studied organisms.

Cross species comparison to help reveal conserved coding regions

No prior knowledge of the sequence motif is necessary

Complement to algorithmic analysis

Assumptions/LimitationAssumptions/LimitationHomologous genes are relatively well

preserved while noncoding regions tend to show varying degrees of conservation. Conserved noncoding regions are believed to be important in regulating gene expression, maintaiing structural organization of the genome and most likely other possible functions.

Cross species comparative genomics is influenced by the evolutionary distance of the compared species.

Genome Analysis and Annotation: General Genome Analysis and Annotation: General ProcedureProcedure

Basic procedure to determine the functional and structural annotation of uncharacterized proteins:

Use a sequence similarity search programs such as BLAST or FASTA to identify all the functional regions in the sequence. If greater sensitivity is required then the Smith-Waterman algorithm based programs are preferred with the trade-off greater analysis time.

Identify functional motifs and structural domains by comparing the protein sequence against PROSITE, BLOCKS, SMART, CDD, or Pfam.

Predict structural features of the protein such as signal peptides, transmembrane segments, coiled-coil regions, and other regions of low sequence complexity

Generate a secondary and tertiary (if possible) structure prediction

Annotation:

◦ Transfer of function information from a well-characterized organism to a lesser studied organism and/or

◦ Use phylogenetic patterns (or profiles) and/or ◦ Use the phylogenetic pattern search tools (e.g. through COGs) to perform a

systematic formal logical operations (AND, OR, NOT) on gene sets -- differential genome display (Huynen et al., 1997).

Automated Genome AnnotationAutomated Genome Annotation

GeneQuiz – limited number of searches/day

MAGPIE – outside users cannot submit own seq

PEDANT – commercial version allow for full capacity

SEALS – semi automated

General Databases Useful for General Databases Useful for Comparative GenomicsComparative Genomics

Locus Link/RefSeq: http://www.ncbi.nih.gov/LocusLink/

PEDANT -Protein Extraction Description ANalysis Tool http://pedant.gsf.de/

MIPS – http://mips.gsf.de/COGs - Cluster of Orthologous Groups (of proteins)

http://www.ncbi.nih.gov/COG/KEGG - Kyoto Encyclopedia of Genes and Genomes

http://www.genome.ad.jp/kegg/MBGD - Microbial Genome Database

http://mbgd.genome.ad.jp/GOLD - Genome OnLine Database

http://wit.integratedgenomics.com/GOLD/TOGA – http://www.tigr.org/xxxxx

Problems with existing sequence Problems with existing sequence alignments algorithms for genomic alignments algorithms for genomic analysisanalysis Most algorithms were developed for comparing single

protein sequences or DNA sequences containing a single gene

Most algorithms were based on assigning a score to all the possible alignments (usually by the sum of the similarity/identity values for each aligned residue minus a penalty for the introduction of gaps) and then finding the optimal or near-optimal alignment based on the chosen scoring scheme.

Unfortunately, most of these programs cannot accurately handle long alignments.

Linear-space type of Smith-Waterman variants are too computationally intensive requiring specialized hardware (memory-limited) or very time-consuming. Higher speed vs increased sensitivity.

Genome-size comparative alignment toolsGenome-size comparative alignment tools ASSIRC - Accelerated Search for SImilarity Regions in Chromosomes

◦ ftp://ftp.biologie.ens.fr/pub/molbio/ (Vincens et al. 1998) BLAT –

◦ http://genome.ucsc.edu/cgi-bin/hgBlat?command=start (Kent xxx) DIALIGN - DIagonal ALIGNment

◦ http://www.gsf.de/biodv/dialign.html (Morgenstern et al. 1998; Morgenstern 1999( DBA - DNA Block Aligner

◦ http://www.sanger.ac.uk/Software/Wise2/dba.shtml (Jareborg et al. 1999( GLASS - GLobal Alignment SyStem

◦ http://plover.lcs.mit.edu/ (Batzoglou et al. 2000) LSH-ALL-PAIRS - Locality -Sensitve Hashing in ALL PAIRS

◦ Email: [email protected] (Buhler 2001) MegaBlast

◦ http://www.ncbi.nih.gov/blast/ (Zhang 2000) MUMmer - Maximal Unique Match (mer)

◦ http://www.tigr.org/softlab/ (Delcher et al. 1999) PIPMaker - Percent Identity Plot MAKER

◦ http://biocse.psu.edu/pipmaker/ (Schwartz et al. 2000) SSAHA – Sequence Search and Alignment by Hashing Algorithm

◦ http://www.sanger.ac.uk/Software/analysis/SSAHA/ WABA - Wobble Aware Bulk Aligner

◦ http://www.cse.ucsc.edu/~kent/xenoAli/ (Kent & Zahler 2000)

SSAHASSAHASequence Search and Alignment by Hashing

AlgorithmSoftware tool for very fast matching and

alignment of DNA sequences.Achieves fast search speed by converting

sequence information into a hash table data structure which can then be searched very rapidly for matches

http://www.sanger.ac.uk/Software/analysis/SSAHA/Run from the Unix command lineNeed > 1GB RAM (needs a lot of memory)SSAHA algorithm best for application requiring

exact or “almost exact” matches between two sequences – e.g. SNP detection, fast sequence assembly, ordering and orientation of contigs

Genome Analysis Genome Analysis EnvironmentEnvironmentMAGPIE - Automated Genome

Project Investigation EnvironmentPEDANTSEALS

Problems with Visualizing GenomesProblems with Visualizing Genomes

Alignment programs output often were visualized by text file, which can be intuitively difficult to interpret when comparing genomes.

Visualization tools needed to handle the complexity and volume of data and present the information in a comprehensive and comprehensible manner to a biologist for interpretation.

Genome Alignment Visualization tools need to provide: ◦ interpretable alignments,

◦ gene prediction and database homologies from different sources

◦ Interactive features: real time capabilities, zooming, searching specific regions of homologies

◦ Represent breaks in synteny

◦ Multiple alignments display

◦ Displaying contigs of unfinished genomes with finished genomes

◦ Handle various data formats

◦ Software availabilty (no black box)

Genome Comparison Visualization ToolGenome Comparison Visualization Tool

ACT - Artemis Comparison Tool (displays parsed BLAST alignments; based on Artemis – an annotation tool)◦ http://www.sanger.ac.uk/Software/ACT/

Alfresco (displays DBA alignments and ...)◦ http://www.sanger.ac.uk/Software/Alfresco/ (Jareborg & Durbin 2000)

PipMaker (displays BlastZ alignments)◦ http://bio.cse.psu.edu/pipmaker/ (Schwartz et al. 2000)

Enteric/Menteric/Maj (displays Blastz alignments)◦ http://glovin.cse.psu.edu/enterix/ (Florea et al. 2000; McClelland et al.

2000) Intronerator (displays WABA alignments and ...)

◦ http://www.cse.ucsc.edu/~kent/intronerator/ (Kent & Zahler 2000b) VISTA (Visualization Tool for Alignment) (displays GLASS

alignments)◦ http://www-gsd.lbl.gov/vista/

SynPlot (displays DIALIGN and GLASS alignments)◦ http://www.sanger.ac.uk/Users/igrg/SynPlot/

Artemis Comparison Tool Artemis Comparison Tool (ACT)(ACT)- ACT is a DNA sequence comparison viewer

based on Artemis- Can read complete EMBL and GenBank

entries or sequence in FASTA or raw format- Additional sequence feature can be in EMBL,

GenBank, GFF format- ACT is free software and is distributed under

the GNU Public License- Java based software- Latest release 2.0 better support Eukaryotic

Genome Comparison

http://www.sanger.ac.uk/Software/ACT/

G+C

tRNAphage/IS genesPseudogenes

Salmonella typhi vs. E. coli – SPI-2

Blast hits

S.typhi

E.coli

Neisseria meningitidis - A vs. B comparison - ACT

A case A case Study:Comparison of Study:Comparison of mouse chromosome mouse chromosome 16 and the human 16 and the human genome:genome:

Mural et al., Science, 2002, 296:1661

Celera groupSynteny with human chr.’s

3,8,12,16,21,22 and rat chr.’s 10,11

Q: Why more breakpoints in mouse-human than in mouse-rat?

Q: Why more conserved genes in human than in rat?

•This also can occur between chromosomes•The longer the divergence time between 2 species, the more recombination has occurred•100 million years since human-mouse divergence•40 million years since rat-mouse divergence

Whole-genome shotgun sequencing:

1. Genome is cut into small sections2. Each section is hundreds or a few

thousand bp of DNA3. Each section is sequenced and put

in a database4. A computer aligns all sequences

together (millions of them from each chromosome) to form contigs

5. Contigs are arranged (using markers, etc) to form scaffolds

Q: What are the advantages of this over the traditional method?

Q: What are the potential sources of error?

1. Assembly of Mmu161. Assembly of Mmu16 Total size: 99Mbp1. Not one contiguous sequence (contig)2. 8,635 contigs on 20 “scaffolds”3. Average scaffold size: 10Mbp4. Number of gaps: 86155. Total size of gaps: ~6Mbp6. Total coverage: ~93Mbp

2. Identify genes in Mmu162. Identify genes in Mmu161. Scaffolds of >10kbp were examined (scaffolds larger than

1Mbp were chopped)2. Regions with repeat motifs were ignored using

RepeatMasker3. Several gene prediction engines use (GenScan, Grail,

Fgenes)4. Amino acid sequences from open reading frames searched

against nr protein db (NCBI)5. Nucleotide searchers (using DNA from across scaffolds)

performed against:1. Celera’s gene clusters2. Mmu, Rno, & Hsa EST db’s3. NCBI’s RefSeq mRNA db4. Celera’s dog genomic db5. Public pufferfish genomic db

2. Identify genes in Mmu162. Identify genes in Mmu166. 1055 genes with high & medium confidence were

predicted7. Other efforts have identified 1142 genes8. After visual annotation inspection, psuedogenes and

annotation errors removed, leaving 731 homologues genes

9. The genes found were mostly orthologues because they were reciprocal best matches by BLAST searches.

3. Identify regions of conserved 3. Identify regions of conserved synteny between Mmu16 and Hsasynteny between Mmu16 and Hsa

Regions of conserved synteny predicted by sequence similarity and by protein comparisons

Synteny based on sequence comparisons:1. Syntenic anchors were located - regions with high (80%)

similarity over short distances (~200bp or more).2. Average distance between anchors is 8kbp, but there are

gaps as large as 707kbp in the mouse and 3.4Mbp in the human

3. Identify regions of conserved 3. Identify regions of conserved synteny between Mmu16 and Hsasynteny between Mmu16 and Hsa

3. 56% of anchors were in mouse genes - exons mostly4. 44% in intergenic regions5. Relatively density is independent of coding/noncoding - making

the anchors an important marker of synteny (in addition to genes)

Human chr. Mmu len. Hsa len. No. anchors bad anch. (% incon.) Orthologues16 10,461 12,329 1,429 21 (1.5) 878 1,284 1,491 121 1 (0.8) 612 363 306 31 3 (9.7) 322 2,081 2,273 418 8 (1.9) 303q27-29 13,557 16,461 1,714 18 (1.0) 1073q11.1-13.3 41,660 46,493 5,485 63 (1.1) 16521 22,327 28,421 2,127 27 (1.3) 111

SummarySummaryPerformance evaluation of gene-finding

programsComparative genomicsComparative genomics analysis example

AcknowledgementAcknowledgementChuong Huynh (NIH)

CSCE555 Bioinformatics Lecture 9 Gene Finding & Comparative genomics Meeting: MW 4:00PM-5:15PM...

Documents

Transcript of CSCE555 Bioinformatics Lecture 9 Gene Finding & Comparative genomics Meeting: MW 4:00PM-5:15PM...