Annotation, Databases, GO, Pathways, and all those things
Information on microarray data
• Consists of different types of information:
– Gene annotations
– Sample annotations
– Gene expression levels
– Covariates
– Experimental design
– etc.
Annotation: Relating probesets to genes
Use of microarray clone annotation
• Often, the result of microarray data analysis is a list of genes.
• The list has to be summarized with respect to its biological meaning.
• For this, information about the genes and the related proteins has to be gathered.
– If the list is small (let’s say, 1–30), this is easily done by reading database information and/or the available literature.
– Sometimes, lists are longer (100s or even 1000s of genes). Automatic parsing and extracting of information is needed.
From clone information to genes and proteins (1)
• Microarrays are produced using information on expressed sequences such as EST clones, cDNAs, partial cDNAs, etc.
• At the other end, functional information is generated (and available) for proteins.
• Hence, there is a need to map a clone sequence ID to a protein ID.
• This is a non-trivial task.
From clone information to genes and proteins: a non-trivial task
• First, there are usually hundreds of ESTs (and several cDNA sequences) that map to the same gene.
– The UniGene database tries to resolve this multiplicity problem by sequence clustering.
– An alternative approach is taken by LocusLink. This is a fairly stable repository of genomic loci, each of which is supposed to be a single gene.
– Since the emphasis is on well-characterised loci, LocusLink is not complete.
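The multiplicity problem can be pictured with a toy lookup table. The sketch below is illustrative only: all clone and cluster IDs are invented, and real clustering works on sequence similarity, not on a ready-made table.

```r
# Hypothetical clone-to-cluster table; every ID here is invented.
clone2cluster <- c(
  EST001 = "Hs.1", EST002 = "Hs.1", EST003 = "Hs.1",
  cDNA01 = "Hs.1", EST004 = "Hs.2", EST005 = "Hs.3"
)

# Group the clones by cluster: one entry per putative gene.
cluster2clones <- split(names(clone2cluster), clone2cluster)

# Cluster sizes show how many sequences collapse onto each gene.
cluster_sizes <- lengths(cluster2clones)
print(cluster_sizes)
```

Many clones collapse onto a few clusters, which is why a gene list derived from clone IDs is usually shorter than the clone list itself.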
From clone information to genes and proteins: multiple ways to go
• There are other projects like RefSeq (NCBI) or TIGR Gene Indices.
• Depending on the cross-references available for a certain microarray, one or the other may be advantageous.
An example: The human genome
• With the completion of the human genome sequence, you’d think that such ambiguities could be resolved. In fact, that is not the case.
– Part of the problem is that it is hard to predict gene structure (intron/exon) without knowing the entire mRNA sequence, a situation that applies to about two-thirds of all genes.
– Then, there are errors in the assembly (putting together the sequence snippets). A typical symptom is that a gene appears to map to multiple loci on the same chromosome, with very high sequence similarity.
– But there are also sequences that are nearly identical and genuinely duplicated. Such duplications happened not long ago in evolution, by means of transposable elements.
The human genome: Some figures
• Currently, it’s estimated that the human genome contains about 25,000–30,000 genes that code for 50,000–100,000 different transcripts (and thus, proteins).
• UniGene (human section) contains 105,680 clusters, but 45,999 of them are of size 2 or less.
• RefSeq DNA contains 28,097 human sequences.
• ENSEMBL contains 21,787 predicted genes and 31,609 predicted transcripts.
• Fully computational methods like Genscan produce more than 65,000 predictions.
• LocusLink contains 15,248 genes with known function, and a further 6,038 genes without function annotation.
Function annotation
• Probably, the most important thing you want to know is what the genes or their products are concerned with, i.e. their function.
• Function annotation is difficult:
– Different people use different words for the same function, or
– may mean different things by the same word.
– The context in which a gene was found (e.g. “TGF-induced gene”) may not be particularly associated with its function.
• Inference of function from sequence alone is error-prone and sometimes unreliable.
• The best function annotation systems (GO, SwissProt) rely on human curators who read the literature before assigning a function to a gene.
The Gene Ontology
• To overcome some of these problems, an annotation system has been created: the Gene Ontology (http://www.geneontology.org).
– It represents a unified, consistent system, i.e. terms occur only once, and there is a dictionary of allowed words.
– Furthermore, terms are related to each other: the hierarchy goes from very general terms to very detailed ones.
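To make the idea of related terms concrete, here is a minimal sketch of a toy ontology in plain R. The term names and the `ancestors` helper are hypothetical and are not real GO identifiers or a GO API. The sketch assumes that a term may have more than one parent, as in a directed acyclic graph.

```r
# Toy ontology: each term lists its direct parents (possibly several).
# Term names are illustrative only, not real GO identifiers.
parents <- list(
  "process"             = character(0),
  "cell death"          = "process",
  "signaling"           = "process",
  "apoptosis"           = "cell death",
  "apoptotic signaling" = c("apoptosis", "signaling")
)

# Collect all ancestors of a term by walking up the parent links.
ancestors <- function(term, parents) {
  found <- character(0)
  todo <- parents[[term]]
  while (length(todo) > 0) {
    current <- todo[1]
    todo <- todo[-1]
    if (!(current %in% found)) {
      found <- c(found, current)
      todo <- c(todo, parents[[current]])
    }
  }
  found
}

print(ancestors("apoptotic signaling", parents))
```

A gene annotated with a detailed term is implicitly annotated with every more general ancestor term. Term-level statistics often rely on this.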
Cross-references with GO
• The GO database exists independently of other annotation databases.
• Cross-references (GOA) make it possible to relate other annotations to those contained in GO.
Bioconductor and annotations
• Annotation information is managed in Bioconductor through metadata packages.
• These packages contain one-to-one and one-to-many mappings for frequently used chips, especially Affymetrix.
• Information available includes gene names, gene symbols, database accession numbers, Gene Ontology function descriptions, enzyme classification (EC) numbers, relations to PubMed abstracts, and others.
• There are several packages implementing functionality to deal with annotation information: annotate, AnnBuilder, ontoTools, GOstats and many more.
Static vs. Dynamic Annotation
Static Annotation:
• Bioconductor packages containing annotation information that are installed locally on a computer
• well-defined structure
• reproducible analyses
• no need for network connection
Dynamic Annotation:
• stored in a remote database
• more frequent updates, so results may differ when an analysis is repeated
• more information
• one needs to know about the structure of the database, the API of the web service, etc.
Available Metadata
• EntrezGene is a catalog of genetic loci that connects curated sequence information to official nomenclature. It replaced LocusLink.
• UniGene defines sequence clusters. UniGene focuses on protein-coding genes of the nuclear genome (excluding rRNA and mitochondrial sequences).
• RefSeq is a non-redundant set of transcripts and proteins of known genes for many species, including human, mouse and rat.
• Enzyme Commission (EC) numbers are assigned to different enzymes and linked to genes through EntrezGene.
Available Metadata
• Gene Ontology (GO) is a structured vocabulary of terms describing gene products according to molecular function, biological process, or cellular component.
• PubMed is a service of the U.S. National Library of Medicine. PubMed provides a rich resource of data and tools for papers in journals related to medicine and health. While large, the data source is not comprehensive, and not all papers have been abstracted.
Available Metadata
• OMIM: Online Mendelian Inheritance in Man is a catalog of human genes and genetic disorders.
• NetAffx: Affymetrix’ NetAffx Analysis Center provides annotation resources for Affymetrix GeneChip technology.
• KEGG: Kyoto Encyclopedia of Genes and Genomes; a collection of data resources including a rich collection of pathway data.
• IntAct: Protein interaction data, mainly derived from experiments.
• Pfam: Pfam is a large collection of multiple sequence alignments and hidden Markov models covering many common protein domains and families.
Available Metadata
• Chromosomal Location: Genes are identified with chromosomes, and where appropriate with strand.
• Data Archives: The NCBI coordinates the Gene Expression Omnibus (GEO); TIGR provides the Resourcerer database, and the EBI runs ArrayExpress.
Annotation Packages
• An early design decision was to provide metadata on a per-chip-type basis (e.g. hgu133a, hgu95av2).
• Each annotation package contains objects that provide mappings between identifiers (genes, probes, …) and different types of annotation data.
• One can list the content of a package:
> library("hgu133a")
> ls("package:hgu133a")
 [1] "hgu133a"             "hgu133aACCNUM"
 [3] "hgu133aCHR"          "hgu133aCHRLENGTHS"
 [5] "hgu133aCHRLOC"       "hgu133aENTREZID"
 [7] "hgu133aENZYME"       "hgu133aENZYME2PROBE"
 [9] "hgu133aGENENAME"     "hgu133aGO"
[11] "hgu133aGO2ALLPROBES" "hgu133aGO2PROBE"
[13] "hgu133aLOCUSID"      "hgu133aMAP"
[15] "hgu133aMAPCOUNTS"    "hgu133aOMIM"
[17] "hgu133aORGANISM"     "hgu133aPATH"
[19] "hgu133aPATH2PROBE"   "hgu133aPFAM"
[21] "hgu133aPMID"         "hgu133aPMID2PROBE"
[23] "hgu133aPROSITE"      "hgu133aQC"
[25] "hgu133aREFSEQ"       "hgu133aSUMFUNC_DEPRECATED"
[27] "hgu133aSYMBOL"       "hgu133aUNIGENE"
A little bit of history... (the pre-SQL era)
before: hgu95av2
now: hgu95av2.db
Annotation Packages
• Objects in annotation packages used to be environments (hash tables for mapping); now things are stored in an SQLite DB.
• Environments map only from one identifier to another and are hard to reverse.
• They are quite inflexible.
• The user interface still supports many of the old environment-specific interactions: you can access the data directly using any of the standard subsetting or extraction tools for environments: get, mget, $ and [[.
> get("201473_at", hgu133aSYMBOL)
[1] "JUNB"
> mget(c("201473_at","201476_s_at"), hgu133aSYMBOL)
$`201473_at`
[1] "JUNB"
$`201476_s_at`
[1] "RRM1"
> hgu133aSYMBOL$"201473_at"
[1] "JUNB"
> hgu133aSYMBOL[["201473_at"]]
[1] "JUNB"
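Why an environment is easy to query in one direction but awkward to reverse can be seen with a plain R environment. This is a minimal sketch and does not use the annotation packages. The first two entries mirror the probe IDs shown on the slide; the third probe ID is hypothetical.

```r
# Toy probe-to-symbol map held in an environment (a hash table).
probe2symbol <- new.env(hash = TRUE)
assign("201473_at", "JUNB", envir = probe2symbol)
assign("201476_s_at", "RRM1", envir = probe2symbol)
assign("000000_at", "JUNB", envir = probe2symbol)  # hypothetical second probe

# Forward lookup is direct: one key, one hash access.
get("201473_at", envir = probe2symbol)
mget(c("201473_at", "201476_s_at"), envir = probe2symbol)

# Reverse lookup needs the whole map to be unpacked and regrouped.
forward <- unlist(as.list(probe2symbol))
symbol2probe <- split(names(forward), forward)
print(symbol2probe[["JUNB"]])
```

The reverse map has to be rebuilt from every entry each time the forward map changes. A database table with two indexed columns avoids that cost.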
Working with Metadata
Suppose we are interested in the gene BAD.
> gsyms <- unlist(as.list(hgu133aSYMBOL))
> whBAD <- grep("^BAD$", gsyms)
> gsyms[whBAD]
  1861_at 209364_at
    "BAD"     "BAD"
> hgu133aGENENAME$"1861_at"
[1] "BCL2-antagonist of cell death"
Working with Metadata
Find the pathways that BAD is associated with.
> BADpath <- hgu133aPATH$"1861_at"
> kegg <- mget(BADpath, KEGGPATHID2NAME)
> unlist(kegg)
01510
"Neurodegenerative Disorders"
04012
"ErbB signaling pathway"
04210
"Apoptosis"
04370
…
"Colorectal cancer"
05212
"Pancreatic cancer"
05213
"Endometrial cancer"
05215
Working with Metadata
We can get the GeneChip probes and the unique EntrezGene loci in each of these pathways. First, we obtain the Affymetrix IDs.
> allProbes <- mget(BADpath, hgu133aPATH2PROBE)
> length(allProbes)
[1] 15
> allProbes[[1]][1:10]
 [1] "206679_at"   "209462_at"   "203381_s_at" "203382_s_at"
 [5] "212874_at"   "212883_at"   "212884_x_at" "200602_at"
 [9] "211277_x_at" "214953_s_at"
> sapply(allProbes, length)
01510 04012 04210 04370 04510 04910 05030 05210 05212 05213
   85   169   162   137   413   243    39   167   156   111
05215 05218 05220 05221 05223
  194   137   160   117   110
Working with Metadata
And then we can map these to their Entrez Gene values.
> getEG = function(x) unique(unlist(mget(x, hgu133aENTREZID)))
> allEG = sapply(allProbes, getEG)
> sapply(allEG, length)
01510 04012 04210 04370 04510 04910 05030 05210 05212 05213
   37    84    81    67   187   130    18    82    72    51
05215 05218 05220 05221 05223
   85    68    74    53    53
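The gene counts per pathway are smaller than the probe counts because several probes can map to the same gene. The sketch below reproduces that effect with toy data in plain R. All pathway names, probe IDs and gene IDs are invented, and `get_genes` is a hypothetical stand-in for the `getEG` helper above.

```r
# Hypothetical pathway-to-probe and probe-to-gene maps (all IDs invented).
path2probe <- list(
  pathA = c("p1_at", "p2_at", "p3_s_at"),
  pathB = c("p3_s_at", "p4_at")
)
probe2gene <- c(p1_at = "101", p2_at = "101", p3_s_at = "202", p4_at = "303")

# Same idea as getEG: look up each probe, then drop duplicate genes.
get_genes <- function(probes) unique(unname(probe2gene[probes]))
genes_per_path <- lapply(path2probe, get_genes)

# Probe counts versus unique gene counts per pathway.
print(rbind(probes = lengths(path2probe), genes = lengths(genes_per_path)))
```

Counting unique genes instead of probes avoids over-weighting genes that are represented by several probes on the chip.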
.db Packages
• Data in the new .db annotation packages is stored in SQLite databases: much more efficient and flexible.
• The old environment-style access is provided by objects of class Bimap (package AnnotationDbi).
[Figure: pairs of left objects and right objects]
.db Packages
[Figure: the left objects and right objects of a Bimap drawn as a bipartite graph; labels: name, attr1 = value1, attr2 = value2]
DBI
• A collection of classes and methods for database interaction.
• They abstract the particular implementations of common standard operations on different types of databases.
• resultSet: operations are performed on the database, and the user controls how much information is returned.
– dbSendQuery: create a result set
– dbGetQuery: get all results
– dbGetQuery(connection, sql query)
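The difference between the two query functions can be tried on a throwaway database. This is a minimal sketch and rests on assumptions: the RSQLite backend is not named on the slides, the in-memory table is a toy, and its layout is not the schema of any annotation package.

```r
library(DBI)

# In-memory SQLite database with a toy table (layout assumed for illustration).
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "gene_info", data.frame(
  symbol = c("A2M", "NAT1", "NAT2", "SERPINA3"),
  stringsAsFactors = FALSE
))

# dbGetQuery: send the query and fetch every row in one step.
all_rows <- dbGetQuery(con, "SELECT symbol FROM gene_info")

# dbSendQuery: create a result set, then fetch only as many rows as wanted.
res <- dbSendQuery(con, "SELECT symbol FROM gene_info")
first_two <- dbFetch(res, n = 2)
dbClearResult(res)

dbDisconnect(con)
print(all_rows)
print(first_two)
```

Fetching from a result set in chunks keeps memory use bounded when a query returns many rows. dbGetQuery is the convenient choice for small results.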
.db Packages
Notice that there are a few more entries here. They give you access to a connection to the database.
> library("hgu133a.db")
> ls("package:hgu133a.db")
 [1] "hgu133aACCNUM"       "hgu133aALIAS2PROBE"
 [3] "hgu133aCHR"          "hgu133aCHRLENGTHS"
 [5] "hgu133aCHRLOC"       "hgu133aENTREZID"
 [7] "hgu133aENZYME"       "hgu133aENZYME2PROBE"
 [9] "hgu133aGENENAME"     "hgu133aGO"
[11] "hgu133aGO2ALLPROBES" "hgu133aGO2PROBE"
[13] "hgu133aMAP"          "hgu133aMAPCOUNTS"
[15] "hgu133aOMIM"         "hgu133aORGANISM"
[17] "hgu133aPATH"         "hgu133aPATH2PROBE"
[19] "hgu133aPFAM"         "hgu133aPMID"
[21] "hgu133aPMID2PROBE"   "hgu133aPROSITE"
[23] "hgu133aREFSEQ"       "hgu133aSYMBOL"
[25] "hgu133aUNIGENE"      "hgu133a_dbInfo"
[27] "hgu133a_dbconn"      "hgu133a_dbfile"
[29] "hgu133a_dbschema"
> con <- hgu133a_dbconn()
> q1 <- "select symbol from gene_info"
> head(dbGetQuery(con, q1))
    symbol
1      A2M
2     NAT1
3     NAT2
4 SERPINA3
Extract information from a database table as data.frame:
> toTable(hgu133aSYMBOL)[1:3,]
   probe_id symbol
1 217757_at    A2M
2 214440_at   NAT1
3 206797_at   NAT2
Reverse mapping:
> revmap(hgu133aSYMBOL)$BAD
[1] "1861_at" "209364_at"
Lkeys, Rkeys: get left and right keys of a Bimap object
> head(Lkeys(hgu133aSYMBOL))
[1] "1007_s_at" "1053_at"   "117_at"    "121_at"    "1255_g_at" "1294_at"
> head(Rkeys(hgu133aSYMBOL))
[1] "A2M"      "NAT1"     "NAT2"     "SERPINA3" "AADAC"    "AAMP"
nhit: number of hits for every left key in a Bimap object
> table(nhit(revmap(hgu133aSYMBOL)))
   1    2    3    4    5    6    7    8    9   10   11   12   13   18   19
8101 2814 1273  475  205   77   19   15    5    3    4    1    2    1    1
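The left-key, right-key and hit-count ideas can be mimicked in plain R with a two-column table of links. This is only an analogy under stated assumptions and not the AnnotationDbi implementation. The two BAD rows come from the slides; the other rows are hypothetical.

```r
# Toy mapping stored as a table of links (a bipartite graph).
links <- data.frame(
  probe_id = c("1861_at", "209364_at", "x1_at", "x2_at", "x2_at"),
  symbol   = c("BAD", "BAD", "GENE1", "GENE2", "GENE3"),
  stringsAsFactors = FALSE
)

# Left keys, right keys, and hits per key, all computed from the links.
left_keys  <- unique(links$probe_id)
right_keys <- unique(links$symbol)
hits_left  <- table(links$probe_id)  # symbols per probe
hits_right <- table(links$symbol)    # probes per symbol (the reversed map)

# Reverse lookup: which probes point at BAD?
bad_probes <- links$probe_id[links$symbol == "BAD"]
print(bad_probes)

# Distribution of hit counts, analogous to table(nhit(revmap(...))).
hit_distribution <- table(as.vector(hits_right))
print(hit_distribution)
```

Because both directions are read from the same link table, reversing the mapping needs no separate data structure.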