Automatic annotation of biological sequences: ontologies and pipeline systems

Automatic annotation of biological sequences: ontologies and pipeline systems. Arthur Gruber, Instituto de Ciências Biomédicas, Universidade de São Paulo. AG-ICB-USP.

  • Automatic annotation of biological sequences: ontologies and pipeline systems
    Arthur Gruber
    Instituto de Ciências Biomédicas, Universidade de São Paulo
    AG-ICB-USP

  • Sequence annotation
    Annotation is the process of adding information to a DNA sequence.
    The information usually has DNA coordinates.
    Features can be repeats, genes, promoters, protein domains, etc.
    Features can be linked to other databases, e.g. Pfam or PubMed.

  • Public databases
    GenBank, EMBL and DDBJ.
    All databases update each other automatically.

  • Feature table
    http://www.ncbi.nlm.nih.gov/projects/collab/FT/
    Format definition
    Covers DDBJ/EMBL/GenBank
    Defines all accepted annotation terms and their hierarchy

  • Annotation file
    Contains:
    A header with:
      Information about the sequence
      Organism
      Authors
      References
      Comments
    A feature table containing:
      Sequence features and coordinates

  • Header (EMBL)
    ID   PFMAL1P4   standard; DNA; INV; 66441 BP.
    XX
    AC   AL031747;
    XX
    SV   AL031747.8
    XX
    DT   24-SEP-1998 (Rel. 57, Created)
    DT   27-APR-2000 (Rel. 63, Last updated, Version 13)
    XX
    DE   Plasmodium falciparum DNA from MAL1P4
    XX
    KW   HTG; rifin; telomere; var; var-like hypothetical protein.
    XX
    OS   Plasmodium falciparum (malaria parasite P. falciparum)
    OC   Eukaryota; Alveolata; Apicomplexa; Haemosporida; Plasmodium.
    XX
    RN   [1]
    RA   Oliver K., Bowman S., Churcher C., Harris B., Harris D., Lawson D.,
    RA   Quail M., Rajandream M., Barrell B.;
    RT   ;
    RL   Submitted (24-SEP-1998) to the EMBL/GenBank/DDBJ databases.
    RL   P.falciparum Genome Sequencing Consortium, The Sanger Centre, Wellcome
    RL   Trust Genome Campus, Hinxton, Cambridge CB10 1S.

  • NCBI Header
    LOCUS       PFMAL1P4   66442 bp   DNA   linear   INV 02-DEC-2004
    DEFINITION  Plasmodium falciparum DNA from MAL1P4, complete sequence.
    ACCESSION   AL031747 AL844501
    VERSION     AL031747.9  GI:23477012
    KEYWORDS    HTG; rifin; telomere; var; var-like hypothetical protein.
    SOURCE      Plasmodium falciparum 3D7
      ORGANISM  Plasmodium falciparum 3D7
                Eukaryota; Alveolata; Apicomplexa; Haemosporida; Plasmodium.
    REFERENCE   1
      AUTHORS   Hall,N., Pain,A., Berriman,M., Churcher,C., Harris,B., Harris,D.,
      TITLE     Sequence of Plasmodium falciparum chromosomes 1, 3-9 and 13
      JOURNAL   Nature 419 (6906), 527-531 (2002)
      PUBMED    12368867
    REFERENCE   2
      AUTHORS   Oliver,K., Pain,A., Berriman,M., Bowman,S., Churcher,C.,
                Harris,B., Harris,D., Lawson,D., Quail,M., Rajandream,M.,
                Hall,N. and Barrell,B.
      TITLE     Direct Submission
      JOURNAL   Submitted (24-SEP-1998) P.falciparum Genome Sequencing
                Consortium, The Sanger Centre, Wellcome Trust Genome Campus,
                Hinxton, Cambridge CB10 1SA, UK
    COMMENT     On Oct 2, 2002 this sequence version replaced gi:7670004.
                For more information about this sequence or the Malaria
                Project, see http://www.sanger.ac.uk/Projects/P_falciparum.

  • Feature
    A region of DNA that was annotated with a key and qualifiers.
    Keys: CDS, intron, miscellaneous, etc.
    Qualifier: notes or extra information about a feature.
    e.g. exon (key) /gene=adh (qualifier)

  • Feature keys
    attenuator, C_region, CAAT_signal, CDS, conflict, D-loop, D_segment, enhancer, exon, GC_signal, gene, iDNA, intron, J_segment, LTR, mat_peptide, misc_binding, misc_difference, misc_feature, misc_recomb, misc_RNA, misc_signal, misc_structure, modified_base, mRNA, N_region, old_sequence, polyA_signal, polyA_site, precursor_RNA, prim_transcript, primer_bind, promoter, protein_bind, RBS, repeat_region, repeat_unit, rep_origin, rRNA, S_region, satellite, scRNA, sig_peptide, snRNA, snoRNA, source, stem_loop, STS, TATA_signal, terminator, transit_peptide, tRNA, unsure, V_region, V_segment, variation, 3'clip, 3'UTR, 5'clip, 5'UTR, -10_signal, -35_signal

  • Feature qualifier
    Additional information about a feature

    /allele="text"
    /citation=[number]
    /codon=(seq:"text",aa:)
    /codon_start=

  • Features (EMBL)

  • Features (NCBI)
    FEATURES             Location/Qualifiers
         source          1..66442
                         /organism="Plasmodium falciparum 3D7"
                         /mol_type="genomic DNA"
                         /isolate="3D7"
                         /db_xref="taxon:36329"
                         /chromosome="1"
         repeat_region   1..583
                         /note="telomeric repeat"
         repeat_region   584..1641
                         /note="14bp repeat"
         gene            join(29733..34985,36111..37349)
                         /gene="MAL1P4.01"
                         /note="synonyms: PFA0005w, VAR"
         CDS             join(29733..34985,36111..37349)
                         /gene="MAL1P4.01"
                         /note="Subtelomeric var gene Pfam hit to PF03011
                         Similar to Plasmodium falciparum VaR, mal1p4.01 vaR
                         SWALL:Q9NFB6 (EMBL:AL031747) (2163 aa) fasta scores:
                         E(): 0, 100% id in 2163 aa"
                         /codon_start=1
                         /product="erythrocyte membrane protein 1 (PfEMP1)"
                         /protein_id="CAB89209.1"
                         /db_xref="GI:7670005"
                         /db_xref="GOA:Q9NFB6"
                         /db_xref="UniProtKB/TrEMBL:Q9NFB6"
                         /translation="MVTQSSGGGAAGSSGEEDAKHVLDEFGQQVYNEKVEKYANSKIY
                         KEALKGDLSQASILSELAGTYKPCALEYEYYKHTNGGGKGKRYPCTELGEKVEPRFSDTLGGQCTNKKIEGNKYIKGKDVGACAPYRRLHLCSHNLESIQ
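Location strings in the feature table, such as `join(29733..34985,36111..37349)`, can be unpacked with a few lines of code. The following is a minimal sketch, not an official parser: the function names `parse_location` and `spliced_length` are hypothetical, and the code assumes only plain `start..end` ranges, optionally wrapped in `join(...)`. It ignores `complement(...)`, partial markers (`<`, `>`) and remote references, which the feature table definition also allows.

```python
import re

# Matches one simple "start..end" range; more exotic location syntax is out of scope.
RANGE_RE = re.compile(r"(\d+)\.\.(\d+)")

def parse_location(location):
    """Return a list of (start, end) tuples for a simple feature location."""
    return [(int(start), int(end)) for start, end in RANGE_RE.findall(location)]

def spliced_length(location):
    """Total number of bases covered by all segments (coordinates are inclusive)."""
    return sum(end - start + 1 for start, end in parse_location(location))

cds_location = "join(29733..34985,36111..37349)"
print(parse_location(cds_location))   # [(29733, 34985), (36111, 37349)]
print(spliced_length(cds_location))   # 6492
```

The spliced length of 6492 bases corresponds to 2164 codons, which is consistent with the 2163 amino acids quoted in the /note qualifier plus one stop codon.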

  • CDS features
    CDS stands for coding sequence and is used to denote genes and pseudogenes.
    These features are automatically translated on submission and the protein is added to the protein databases.

  • /note
    The note field contains all the evidence for a gene call, plus anything else:
    Similarity (fasta or blast)
    Domain/motif information (Pfam, TMHMM, etc.)
    Unusual features (repeats, amino acid richness)

  • /product
    The name of the gene product, e.g. alcohol dehydrogenase.
    Unless there is proof, we must qualify the name: "putative", "possible".
    Always be conservative, e.g. "putative dehydrogenase" or "dehydrogenase-like protein".
    This is the only piece of annotation added to the protein databases.

  • Naming protocols
    Hypothetical protein: unknown function and no homology.
    Conserved hypothetical protein: unknown function WITH homology.
    Alcohol dehydrogenase-like: looks a bit like it, but may not be.
    Putative alcohol dehydrogenase: probably an alcohol dehydrogenase.
    Alcohol dehydrogenase: this has previously been characterised and shown to be alcohol dehydrogenase in this organism.
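The naming protocol can be read as a small decision rule. The sketch below is only an illustration of that rule: the function name `suggest_product_name` and the support labels ("experimental", "strong", "weak", "none") are assumptions introduced here, not part of any database standard, and deciding which label applies remains the annotator's judgment.

```python
def suggest_product_name(function=None, has_homology=False, support="none"):
    """Pick a conservative /product value.

    function     -- name of the best-matching characterised protein, or None
    has_homology -- True if the sequence has any database matches
    support      -- hypothetical labels: "experimental", "strong", "weak", "none"
    """
    if function is None or support == "none":
        # Unknown function: only the presence of homology changes the name.
        return "conserved hypothetical protein" if has_homology else "hypothetical protein"
    if support == "experimental":
        return function                    # characterised in this organism
    if support == "strong":
        return "putative " + function      # probably this function
    if support == "weak":
        return function + "-like protein"  # looks a bit like it, but may not be
    raise ValueError(f"unknown support level: {support!r}")

print(suggest_product_name())                                        # hypothetical protein
print(suggest_product_name(has_homology=True))                       # conserved hypothetical protein
print(suggest_product_name("alcohol dehydrogenase", True, "weak"))   # alcohol dehydrogenase-like protein
print(suggest_product_name("alcohol dehydrogenase", True, "strong")) # putative alcohol dehydrogenase
```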

  • /gene
    The gene name, e.g. ADH1.
    Only transfer a gene name if it is meaningful.
    Never transfer a gene name like PfB0024.
    Is it a gene family? Make sure two genes have the same name.

  • Transitive Annotation
    AKA annotation catastrophe
    Junk in = Junk out
    Mis-annotations spread through incorrect database submissions.

  • How can we standardize the annotation terms?

  • Through a dynamic controlled vocabulary

  • So what does that mean?
    From a practical view, an ontology is the representation of something we know about. Ontologies consist of a representation of things that are detectable or directly observable, and the relationships between those things.

  • Ontology Structure
    [Figure: graph with the nodes cell, membrane, chloroplast, mitochondrial membrane and chloroplast membrane]
    Directed Acyclic Graph (DAG): multiple parentage allowed

  • GO topology
    The ontologies are structured as directed acyclic graphs.
    These are similar to hierarchies but differ in that a more specialized term (child) can be related to more than one less specialized term (parent).
    For example, hexose biosynthetic process has two parents: hexose metabolic process and monosaccharide biosynthetic process.
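Multiple parentage is easy to see when the graph is stored as a mapping from each term to its list of parents. The following is a minimal sketch under stated assumptions: the `PARENTS` dictionary is a hand-written fragment, the shared grandparent "monosaccharide metabolic process" is added here for illustration, and the function name `ancestors` is hypothetical. A real application would load the full ontology from an OBO file.

```python
# child term -> list of parent terms (a term may have more than one parent)
PARENTS = {
    "hexose biosynthetic process": [
        "hexose metabolic process",
        "monosaccharide biosynthetic process",
    ],
    # The entries below are illustrative additions, not taken from the slide.
    "hexose metabolic process": ["monosaccharide metabolic process"],
    "monosaccharide biosynthetic process": ["monosaccharide metabolic process"],
    "monosaccharide metabolic process": [],
}

def ancestors(term, parents=PARENTS):
    """Return the set of all terms reachable by walking parent links upward."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, []):
            if parent not in found:   # a shared ancestor is visited only once
                found.add(parent)
                stack.append(parent)
    return found

for name in sorted(ancestors("hexose biosynthetic process")):
    print(name)
# hexose metabolic process
# monosaccharide biosynthetic process
# monosaccharide metabolic process
```

Both paths from the child reach the same grandparent, which is what distinguishes a DAG from a simple tree.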

  • True Path Violations Create Incorrect Definitions
    "...the pathway from a child term all the way up to its top-level parent(s) must always be true".
    [Diagram: chromosome part_of nucleus]

  • True Path Violations
    [Diagram: mitochondrial chromosome is_a chromosome]

  • True Path Violations
    [Diagram: mitochondrial chromosome is_a chromosome; chromosome part_of nucleus]
    A mitochondrial chromosome is not part of a nucleus!

  • True Path Violations
    [Diagram: nuclear chromosome is_a chromosome and part_of nucleus; mitochondrial chromosome is_a chromosome and part_of mitochondrion]
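The true path rule can be explored by listing everything a term inherits along its is_a and part_of edges. The sketch below is an assumption-laden illustration: the two small graphs are typed in by hand from the diagrams, and the helper `inherited_terms` is a hypothetical name. It cannot decide by itself what is biologically false; it only exposes the inherited statements so that a curator can check them.

```python
def inherited_terms(term, edges):
    """Return every term reachable from `term` by following is_a/part_of edges upward."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relation, parent in edges.get(current, []):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found

# Flawed structure: part_of nucleus is attached to the general term "chromosome".
flawed = {
    "mitochondrial chromosome": [("is_a", "chromosome")],
    "chromosome": [("part_of", "nucleus")],
}

# Corrected structure: part_of is attached to the specific child terms.
corrected = {
    "nuclear chromosome": [("is_a", "chromosome"), ("part_of", "nucleus")],
    "mitochondrial chromosome": [("is_a", "chromosome"), ("part_of", "mitochondrion")],
}

print("nucleus" in inherited_terms("mitochondrial chromosome", flawed))      # True
print("nucleus" in inherited_terms("mitochondrial chromosome", corrected))   # False
```

In the flawed graph the mitochondrial chromosome inherits "nucleus", which is the violation; in the corrected graph it inherits only "chromosome" and "mitochondrion".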

  • GO Definitions: each GO term has two definitions
    A definition written by a biologist: necessary and sufficient conditions; a written definition (not computable).
    The graph structure: necessary conditions; formal (computable).

  • Term-term relationship: is_a
    The is_a relationship is a simple class-subclass relationship, where A is_a B means that A is a subclass of B.
    For example, nuclear chromosome is_a chromosome.
    GO:0043232 : intracellular non-membrane-bound organelle
      GO:0005694 : chromosome
        GO:0000228 : nuclear chromosome

  • Term-term relationship: part_of
    C part_of D means that whenever C is present, it is always a part of D, but C does not always have to be present.
    For example, periplasmic flagellum part_of periplasmic space.
    GO:0044464 : cell part
      GO:0042995 : cell projection
        GO:0019861 : flagellum
          GO:0009288 : flagellin-based flagellum
            GO:0055040 : periplasmic flagellum
      GO:0042597 : periplasmic space
        GO:0055040 : periplasmic flagellum

  • Current Ontologies
    Molecular function: tasks performed by a gene product
    Biological process: broad biological goals accomplished by ordered assemblies of molecular functions
    Cellular component: subcellular structures, locations and macromolecular complexes

  • Search result for toxin

  • Relationships in GO: is_a, part_of

  • GO paths to terms

  • GO definitions

  • Pyruvate dehydrogenase

  • Why the interest in GO?
    Universal ontology
    Functional classification scheme with many different levels in a DAG
    Widespread interest from the scientific community
    Mappings to Swiss-Prot (SP) keywords and gene-product annotations already exist for some organisms

  • GO Evidence codes
    Experimental Evidence Codes
      EXP: Inferred from Experiment
      IDA: Inferred from Direct Assay
      IPI: Inferred from Physical Interaction
      IMP: Inferred from Mutant Phenotype
      IGI: Inferred from Genetic Interaction
      IEP: Inferred from Expression Pattern
    Computational Analysis Evidence Codes
      ISS: Inferred from Sequence or Structural Similarity
      ISO: Inferred from Sequence Orthology
      ISA: Inferred from Sequence Alignment
      ISM: Inferred from Sequence Model
      IGC: Inferred from Genomic Context
      RCA: Inferred from Reviewed Computational Analysis
    Author Statement Evidence Codes
      TAS: Traceable Author Statement
      NAS: Non-traceable Author Statement
    Curator Statement Evidence Codes
      IC: Inferred by Curator
      ND: No biological Data available
    Automatically-assigned Evidence Codes
      IEA: Inferred from Electronic Annotation
    Obsolete Evidence Codes
      NR: Not Recorded
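Evidence codes are often used to filter annotations, for instance to set aside those that were assigned automatically. The following sketch groups the codes exactly as listed on the slide; the function name `evidence_category`, the category labels and the three example annotations (including the gene names) are hypothetical and serve only as an illustration.

```python
EVIDENCE_CODES = {
    "experimental": {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"},
    "computational analysis": {"ISS", "ISO", "ISA", "ISM", "IGC", "RCA"},
    "author statement": {"TAS", "NAS"},
    "curator statement": {"IC", "ND"},
    "automatically assigned": {"IEA"},
    "obsolete": {"NR"},
}

def evidence_category(code):
    """Return the category of a GO evidence code, e.g. 'IDA' -> 'experimental'."""
    for category, codes in EVIDENCE_CODES.items():
        if code.upper() in codes:
            return category
    raise KeyError(f"unknown evidence code: {code!r}")

# Hypothetical annotations: (gene product, GO term, evidence code)
annotations = [
    ("geneA", "GO:0005694", "IDA"),
    ("geneB", "GO:0005694", "IEA"),
    ("geneC", "GO:0000228", "ISS"),
]

# Keep only annotations that were not assigned purely electronically.
reviewed = [a for a in annotations if evidence_category(a[2]) != "automatically assigned"]
print([gene for gene, _term, _code in reviewed])   # ['geneA', 'geneC']
```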

  • Current Mappings to GO
    Consortium mappings: MGD, SGD, FlyBase
    Swiss-Prot keywords
    EC numbers
    InterPro entries
    Medline ID
    Commercial companies: CompuGen, Proteome

  • InterPro-to-GO

  • EC number-to-GO

  • SP keyword-to-GO

  • GO doesn't cover
    Gene products: e.g. cytochrome c is not in the ontologies, but attributes of cytochrome c, such as oxidoreductase activity, are.
    Processes, functions or components that are unique to mutants or diseases: e.g. oncogenesis is not a valid GO term because causing cancer is not the normal function of any gene.
    Attributes of sequence such as intron/exon parameters: these are not attributes of gene products and will be described in a separate sequence ontology (see Sequence Ontology).
    Protein domains or structural features.
    Protein-protein interactions.
    Environment, evolution and expression.
    Anatomical or histological features above the level of cellular components, including cell types.

  • Sequence Ontology
    The four major aspects of the complete Sequence Ontology are:
    located sequence features, for objects that can be located on a sequence in coordinates;
    sequence attributes, for describing the properties of features;
    consequences of mutation, for the annotation of the effects of a mutation;
    chromosome variation, to describe large-scale variations.

  • Sequence Ontology
    How to edit an ontology file?
    OBO-Edit: an ontology editor for biologists
    OBO-Edit compliant format

  • Generic feature format 3
    Generic format for sequence annotation interchange
    Tab-delimited text file
    Represents features in a hierarchical view
    Uses a controlled vocabulary compliant with the Sequence Ontology

  • Generic feature format 3
    The tab-delimited file has 9 columns:
    Column 1: "seqid"
    Column 2: "source"
    Column 3: "type"
    Columns 4 & 5: "start" and "end"
    Column 6: "score"
    Column 7: "strand" (the strand of the feature: + for the positive strand relative to the landmark, - for the minus strand)
    Column 8: "phase"
    Column 9: "attributes"
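The nine-column layout maps naturally onto a small record type. The following is a minimal sketch rather than a complete GFF3 reader: the names `Gff3Feature` and `parse_gff3_line` are hypothetical, and the code assumes a well-formed feature line. It does not handle `##` directives, comments, or the URL-style escaping of special characters.

```python
from dataclasses import dataclass

@dataclass
class Gff3Feature:
    seqid: str
    source: str
    type: str
    start: int
    end: int
    score: str       # kept as text because "." means "no score"
    strand: str
    phase: str       # kept as text because "." means "not applicable"
    attributes: dict

def parse_gff3_line(line):
    """Split one tab-delimited GFF3 feature line into its nine columns."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(columns)}")
    seqid, source, ftype, start, end, score, strand, phase, attr_text = columns
    attributes = {}
    for pair in attr_text.split(";"):
        if pair:
            key, _, value = pair.partition("=")
            attributes[key] = value.split(",")   # e.g. Parent may list several IDs
    return Gff3Feature(seqid, source, ftype, int(start), int(end),
                       score, strand, phase, attributes)

line = "\t".join(["ctg123", ".", "exon", "1050", "1500", ".", "+", ".",
                  "ID=exon00002;Parent=mRNA00001,mRNA00002"])
feature = parse_gff3_line(line)
print(feature.type, feature.start, feature.end)   # exon 1050 1500
print(feature.attributes["Parent"])               # ['mRNA00001', 'mRNA00002']
```

Every attribute value is stored as a list so that single-valued and multi-valued attributes can be handled the same way.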

  • How can these splicing variants be annotated using Sequence Ontology terms and GFF3?

  • The annotated genome region is named ctg123.
    A gene named EDEN extends from coordinate 1000 to 9000.
    The gene encodes three alternatively spliced variants: EDEN.1, EDEN.2 and EDEN.3.
    Transcript EDEN.3 has two alternative translation start points.
    There is a transcription factor binding site (a promoter) located 50 bp upstream of the transcriptional start site of EDEN.1.

  • ##gff-version 3
    ##sequence-region ctg123 1 1497228
    ctg123 . gene            1000 9000 . + . ID=gene00001;Name=EDEN
    ctg123 . TF_binding_site 1000 1012 . + . ID=tfbs00001;Parent=gene00001
    ctg123 . mRNA            1050 9000 . + . ID=mRNA00001;Parent=gene00001;Name=EDEN.1
    ctg123 . mRNA            1050 9000 . + . ID=mRNA00002;Parent=gene00001;Name=EDEN.2
    ctg123 . mRNA            1300 9000 . + . ID=mRNA00003;Parent=gene00001;Name=EDEN.3

  • ctg123 . exon            1300 1500 . + . ID=exon00001;Parent=mRNA00003
    ctg123 . exon            1050 1500 . + . ID=exon00002;Parent=mRNA00001,mRNA00002
    ctg123 . exon            3000 3902 . + . ID=exon00003;Parent=mRNA00001,mRNA00003
    ctg123 . exon            5000 5500 . + . ID=exon00004;Parent=mRNA00001,mRNA00002,mRNA00003
    ctg123 . exon            7000 9000 . + . ID=exon00005;Parent=mRNA00001,mRNA00002,mRNA00003

    ctg123 . CDS             1201 1500 . + 0 ID=cds00001;Parent=mRNA00001;Name=edenprotein.1
    ctg123 . CDS             3000 3902 . + 0 ID=cds00001;Parent=mRNA00001;Name=edenprotein.1
    ctg123 . CDS             5000 5500 . + 0 ID=cds00001;Parent=mRNA00001;Name=edenprotein.1
    ctg123 . CDS             7000 7600 . + 0 ID=cds00001;Parent=mRNA00001;Name=edenprotein.1

  • ctg123 . CDS             1201 1500 . + 0 ID=cds00002;Parent=mRNA00002;Name=edenprotein.2
    ctg123 . CDS             5000 5500 . + 0 ID=cds00002;Parent=mRNA00002;Name=edenprotein.2
    ctg123 . CDS             7000 7600 . + 0 ID=cds00002;Parent=mRNA00002;Name=edenprotein.2

    ctg123 . CDS             3301 3902 . + 0 ID=cds00003;Parent=mRNA00003;Name=edenprotein.3
    ctg123 . CDS             5000 5500 . + 1 ID=cds00003;Parent=mRNA00003;Name=edenprotein.3
    ctg123 . CDS             7000 7600 . + 1 ID=cds00003;Parent=mRNA00003;Name=edenprotein.3

    ctg123 . CDS             3391 3902 . + 0 ID=cds00004;Parent=mRNA00003;Name=edenprotein.4
    ctg123 . CDS             5000 5500 . + 1 ID=cds00004;Parent=mRNA00003;Name=edenprotein.4
    ctg123 . CDS             7000 7600 . + 1 ID=cds00004;Parent=mRNA00003;Name=edenprotein.4
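Column 8 (phase) of each CDS line can be derived from the segment lengths. The sketch below assumes the GFF3 specification's definition of phase, namely the number of bases to skip at the start of a CDS segment to reach the first base of the next complete codon. The function name `cds_phases` is hypothetical, and the code handles only plus-strand segments given in transcript order.

```python
def cds_phases(segments, first_phase=0):
    """Return the phase of each CDS segment.

    segments    -- list of (start, end) pairs, inclusive, in 5'->3' order (+ strand)
    first_phase -- phase of the first segment (0 when it begins with a start codon)
    """
    phases = []
    phase = first_phase
    for start, end in segments:
        phases.append(phase)
        length = end - start + 1
        # Bases left over after the last complete codon in this segment...
        leftover = (length - phase) % 3
        # ...must be completed by the first bases of the next segment.
        phase = (3 - leftover) % 3
    return phases

# edenprotein.1: every segment length is a multiple of three.
print(cds_phases([(1201, 1500), (3000, 3902), (5000, 5500), (7000, 7600)]))  # [0, 0, 0, 0]

# edenprotein.3: the first segment is 602 bases long, leaving 2 bases over.
print(cds_phases([(3301, 3902), (5000, 5500), (7000, 7600)]))                # [0, 1, 1]
```

For edenprotein.3 the first segment ends two bases into a codon, so one base of the next segment completes that codon and the following segments have phase 1.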

  • Generic feature format 3
    If you write a GFF file, you can test it! There is an online validator:
    http://dev.wormbase.org/db/validate_gff3/validate_gff3_online

  • Testing the GFF3 Validator

  • Testing the GFF3 Validator
    Let's change the feature names

  • Annotation viewing and editing: Artemis
    Artemis is a free genome viewer and annotation tool developed by Kim Rutherford (Sanger Institute, UK).
    It allows visualization of sequence features and the results of analyses in the context of the sequence and its six-frame translation.

  • Annotation viewing and editing: Artemis
    Artemis is written in Java and is available for UNIX, GNU/Linux, BSD, Macintosh and MS-Windows systems.
    It can read complete EMBL and GenBank database entries, or sequences in FASTA or raw format.
    Extra sequence features can be in EMBL, GenBank or GFF format.

  • AG-FMVZ-USP
