Cell: State 1, State 2, … State i; Input → Response
State: genes, proteins, metabolites, ions, …
The Parts-List Problem
•proteins
•peptides
•amino acids
•nucleotides
•retinoids
Gene Expression
2003
[Diagram: Mouse Genome; Cell-Specific Genes; Cell-Specific Gene Products; Cell-Specific Gene Products Invoked by Input; Signaling Proteins Invoked in Input-Specific Response; Signaling Proteins Molecule Pages]
The Parts-List: AfCS Strategy
2003
• Annotation Pipeline: Brian Saunders
• B-Cell Gene List from Agilent Array Data: Dennis Mock
• B-Cell Gene List from Affy Array Data: Eugene Ke, Chris Benner
• AfCS Protein List: Several People at UTSWMED, DUKE, etc.
The Parts-List Problem
• Array contains an aggregate of Riken (Fantom and non-Fantom), NIA, Research Genetics, and Genome Systems clones
• Provided with clone ID as basis for analysis – no resequencing was performed
• Sequence information from clone ID
  – Full-length cDNA
    • Genbank
    • Non-Genbank (from Fantom or NIA databases)
  – 3’ and 5’ ESTs
    • Genbank ID
    • Non-Genbank (NIA database)
2003
Agilent cDNA Microarray
2003
Agilent cDNA Microarray Details
Clone Type           Unique    Total
Riken                 14828    14890
NIA                     447      723
Research Genetics       155      155
Genome Systems           64       64
Total                 15494    15832
Clone type distribution
2003
Agilent cDNA Microarray Details
Sequence type                        Clones
Riken Fantom cDNA (Genbank)           13982
Riken Fantom cDNA (non-Genbank)         604
NIA cDNA (non-Genbank)                  270
No cDNA (EST only)                      728
Total                                 15494
2003
Annotation Procedure
[Diagram: annotation flow linking Clone ID (NIA or Fantom databases), Sequences (Genbank accessions), MGI, LocusLink, Unigene, Gene Ontology (GO), InterPro, Ensembl, protein records, and chromosome, connected by Blast or by reference; LocusLink annotation is merged with AfCS Molecule Page proteins]
Annotation Procedure
• Choose representative sequence if possible
• Choose representative gene
  – LocusLink or MGI membership
  – BlastN against database of all nucleotides in LocusLink and MGI – use all sequences if no representative has been chosen
  – Unigene membership
• Choose representative sequence if necessary
  – Sequence used to choose representative gene
  – Sequence length if gene-selection method fails
2003
Annotation Procedure (contd..)
• If the gene must be determined by BlastN and more than one gene matches above a given threshold (around 30% identity), Unigene agreement is used to choose the “best” gene.
• If no gene matches above threshold, then Unigene is used to choose best gene
• If no genes match the above criteria, the top Blast hit regardless of threshold is chosen
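The three fallback rules above can be sketched in a few lines. This is a minimal illustration, not the AfCS pipeline: the function name, the (gene_id, identity) hit shape, the Unigene set, and the 0.30 default threshold are all assumptions made for the example, and "Unigene is used to choose the best gene" is interpreted here as preferring hits whose gene also appears in the clone's Unigene cluster.

```python
from typing import List, Optional, Set, Tuple

Hit = Tuple[str, float]  # (gene_id, fractional identity); hypothetical shape


def choose_representative_gene(
    hits: List[Hit],
    unigene_genes: Set[str],
    threshold: float = 0.30,  # assumed stand-in for "around 30% identity"
) -> Optional[str]:
    """Pick one gene for a clone from its BlastN hits (illustrative only)."""
    # Rank hits from best to worst identity.
    ranked = sorted(hits, key=lambda h: h[1], reverse=True)
    above = [gene for gene, identity in ranked if identity >= threshold]

    # A single gene above threshold needs no tie-breaking.
    if len(above) == 1:
        return above[0]

    # Several genes above threshold: Unigene agreement picks the "best" one.
    # No gene above threshold: Unigene is consulted over all hits instead.
    candidates = above if above else [gene for gene, _ in ranked]
    for gene in candidates:
        if gene in unigene_genes:
            return gene

    # Nothing agreed with Unigene: take the top Blast hit regardless of threshold.
    return ranked[0][0] if ranked else None
```

For example, `choose_representative_gene([("A", 0.9), ("B", 0.8)], {"B"})` returns `"B"`, because both hits pass the threshold and only B agrees with Unigene.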
2003
Annotation: Choice of Representative Gene
• Build database of (potentially) related genes
  – LocusLink MGI
  – LocusLink Unigene
  – MGI Unigene
  – Fantom MGI
  – NIA LocusLink/MGI
  – Top Blast hit (when above threshold, and not the representative gene)
• Challenges
  – Outdated data sources (especially Fantom)
  – Incorrect annotation or errors
  – ESTs clustering to different Unigene IDs
2003
Further Gene Annotation: Relationships
• Ensembl
  – LocusLink
  – Blast (top hits above threshold)
• Protein Database Records
  – MGI
  – LocusLink
• Chromosome
  – LocusLink
  – MGI
  – Unigene
2003
Other Annotations
• Gene Ontology
  – MGI
  – LocusLink
  – Fantom
• InterPro
  – MGI
  – Protein database records (Swissprot/Trembl)
• AfCS
  – LocusLink merge with Molecule Page annotation
• Other miscellaneous gene annotation
  – Fantom
  – NIA
2003
Other Annotations
2003
Annotation Schema
2003
Viewing and Searching Gene Annotations
2003
Viewing and Searching Annotations
Go to “data searches” from the “data center” page
2003
Querying Annotation
2003
Choosing Query Responses of Annotation
Annotation of 4931440G06
• H3091H05
  – Non-Genbank cDNA used for representative sequence
• H3150F07
  – No cDNA available, 3’ EST used for representative sequence (also an example of multiple Ensembl transcripts)
• 0610007B22
  – Example of more than one potential gene for a clone
2003
Other Annotation Examples
• Of the 15,494 unique clones
  – 10,734 unique “genes”
    • 4,576 have meaningful gene symbols
    • 6,156 have “Riken” gene symbols (e.g. 8030469F12Rik)
      – Some of those are homologs to known human genes
        » Example: 1810037O03
  – 2,172 map to multiple LocusLink IDs
    • Potential for “incorrect” gene choice
      – Example: 0610005A07
  – 2,116 matches to AfCS
    • 1,490 unique AfCS IDs
2003
Agilent cDNA Microarray Annotation Summary
• Of the 15,494 unique clones
  – Over 7,800 have GO and InterPro annotation
  – 13,243 are matched to at least one Ensembl gene
    • 1,323 clones have multiple Ensembl matches
    • 3,574 clones match genes with multiple transcripts
      – Example: H3150F07
    • 9,706 unique Ensembl genes in all
      – 2,071 of the unique Ensembl genes have transcript variants
2003
Agilent cDNA Microarray Annotation Summary
• There is a wide disparity between predicted transcript variants depending on which database one uses (which makes sense, since they use different draft genomes and different gene prediction programs)
• Within a gene, the databases may present a different number of potential variants, with little overlap between the databases
• Variants grouped as one gene in one database may be grouped into multiple genes in another database
2003
Splice Variants: Database Disparities
• Example: GNAS
  – Ensembl
    • 6 transcripts in one record
  – LocusLink (NCBI Evidence)
    • 4 transcripts split across 2 records, with 1 transcript not aligning with the draft contig
  – Only one translation (the “main” gene) is shared between the two sources
• Take-home message: need to pick one (the best if we are lucky) reference
  – Ensembl seems to have the most available features in a digestible form
2003
Splice Variants: Database Disparities
Types of classification
• Domain or motif
  – Need to consider specific regions before assigning attributes to an entire class
  – Automatic class assignment should be relatively safe
• Sequence identity clustering
  – No notion of function, but for high identity should give a conservative class prediction
  – Results cannot be entirely automated; cutoffs that are used for one class of proteins might not be strict enough for another class of proteins
2003
AfCS Protein Classification
• Genes that are expressed in untreated cells and are statistically differentially expressed in at least one other treatment (at 4 hours)
• This method is not based on ratios or intensity levels
• Reduces false positive predictions (e.g. hemoglobin gene is not picked up!)
• Provides a conservative estimate of B-cell gene parts list
2003
Which genes are expressed in B-Cells?
Note: the ligands cluster according to early vs. late conditions with 90-100% accuracy (metrics: sample = Euclidean; gene = Pearson).
[Figure: two-way dendrogram; cluster labels in order of appearance: late 2-4 hr; early 0.5-1 hr; 0 hr; early 0.5-1 hr (non-mitogenic); late 2-4 hr; mitogenic; Interleukins]
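The two metrics named in the note can be written out directly. The sketch below only shows the distance functions that a two-way clustering would be built on (Euclidean between samples, one minus Pearson correlation between genes); the function names are illustrative, and the linkage method used on the slides is not stated, so no clustering step is shown.

```python
import math
from typing import Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two sample (array) profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - Pearson correlation between two gene profiles (0 = same shape)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        # A flat profile has no defined correlation; treat it as uncorrelated.
        return 1.0
    return 1.0 - cov / math.sqrt(var_a * var_b)
```

Pearson distance ignores scale, so two genes that rise and fall together cluster even when their absolute expression levels differ, while Euclidean distance keeps samples with similar overall intensities together.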
2003
Two-way hierarchical clustering: unsupervised, n=33 (0.5, 1, 2, 4 hrs)
(R. Tibshirani, G. Chu 2002)
Objective: The replicated expression values for each gene at the 4 hr time condition (untreated vs. ligand) are used to determine whether the gene is statistically differentially up- or down-regulated.
The t-statistics for all the genes are ordered and noted. The labels are then permuted and the t-statistics are calculated again. After many iterations, the accumulated t-statistics are averaged for each gene. Finally, for a given false positive rate (called the “False Discovery Rate” or FDR), the significant genes are selected.
For each gene, define the adjusted “t-statistic” as follows:
$$d = \frac{\bar{x}_{\text{treated}} - \bar{x}_{\text{untreated}}}{s + s_0}$$
where $\bar{x}$ is the mean of replicates, $s$ is the standard deviation for the gene, and $s_0$ is the adjustment factor.
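The adjusted statistic and the label-permutation idea can be illustrated with a small pure-Python sketch. This is a deliberate simplification and not the published SAM procedure: it counts permuted statistics above a fixed cutoff instead of comparing ordered statistics, and the pooled standard error, the `s0` default, the cutoff, and all function names are assumptions made for the example.

```python
import random
from statistics import mean, stdev
from typing import Dict, List, Sequence, Tuple


def adjusted_t(treated: Sequence[float], untreated: Sequence[float],
               s0: float = 0.1) -> float:
    """(mean treated - mean untreated) / (s + s0); s0 is an assumed constant."""
    n1, n2 = len(treated), len(untreated)
    pooled_var = ((n1 - 1) * stdev(treated) ** 2
                  + (n2 - 1) * stdev(untreated) ** 2) / (n1 + n2 - 2)
    s = (pooled_var * (1 / n1 + 1 / n2)) ** 0.5
    return (mean(treated) - mean(untreated)) / (s + s0)


def estimate_fdr(data: Dict[str, List[float]], n_treated: int, cutoff: float,
                 n_perm: int = 200, seed: int = 0) -> Tuple[List[str], float]:
    """data maps gene -> values, with the treated replicates listed first."""
    rng = random.Random(seed)
    observed = {g: adjusted_t(v[:n_treated], v[n_treated:])
                for g, v in data.items()}
    called = [g for g, d in observed.items() if abs(d) >= cutoff]

    n_cols = len(next(iter(data.values())))
    false_calls = 0
    for _ in range(n_perm):
        # Permute the treated/untreated labels by shuffling column order.
        order = list(range(n_cols))
        rng.shuffle(order)
        for values in data.values():
            shuffled = [values[i] for i in order]
            d = adjusted_t(shuffled[:n_treated], shuffled[n_treated:])
            if abs(d) >= cutoff:
                false_calls += 1

    # Average number of calls under permuted labels, relative to real calls.
    expected_false = false_calls / n_perm
    fdr = expected_false / len(called) if called else 0.0
    return called, min(fdr, 1.0)
```

Raising the cutoff shrinks the list of called genes and lowers the estimated FDR, which is the trade-off behind the per-ligand FDR values quoted for the 4 hr comparisons.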
2003
Significance Analysis of Microarrays (SAM) Method
Differentially expressed genes for ligands vs. UNTREATED at 4 hr [SAM; False Discovery Rate in parentheses]
[Bar chart: x-axis: ligand (4 hr); y-axis: number of genes (probes) differentially expressed, ticks from 0 to 1100; bars split into down-regulated and up-regulated]
Ligands (FDR): 40L (1%), LPS (1%), AIG (1%), IL4 (1%), CPG (1%), IFB (1.5%), GRH (1%), 2MA (18%), LPA (17%), CGS (2.9%), BOM (35%), IGF (8%), S1P (38%), PAF (2.4%), 70L (6%), NPY (10%), DIM (9%), LB4 (23%), M3A (3.5%), FML (11%), TGF (2.5%), TER (35%), IL10 (20%), ELC (26%), PGE (11%), BAFF (11%), BLC (57%), NGF (42%), TNF (33%), SDF (20%), IFG (25%), NEB (25%), SLC (NA)
2003
Differentially Expressed Genes: Ligands 4 hr vs. Untreated
[Figure annotations: “mitogenic” ligands, FDR = 1%; FDR = 35%; FDR = 18%; FDR = 1%-3%]
Two-way dendrogram using significantly expressed genes (4 hrs)
[Histogram: x-axis: number of ligands (0 to 12); y-axis: number of genes, ticks from 0 to 14000. Each bar gives the number of genes that are significantly different from UNTREATED in that many ligands at 4 hr; the bar at 0 gives the genes that were not significantly differentially expressed in any of the 33 ligands at 4 hr]
D. Fambrough, K. McClure, A. Kazlauskas, and E.S. Lander (1999). Diverse signaling pathways activated by growth factor receptors induce broadly overlapping, rather than independent sets of genes. Cell 97: 727-741
2003
Expressed Genes: Significant vs. non-significant
metabolism
cell growth and/or maintenance
cell communication
response to external stimulus
cell death
cell differentiation
cell motility
morphogenesis
response to stress
2003
Count   Category
  149   cell communication
   36   cell death
   13   cell differentiation
  369   cell growth and/or maintenance
   13   cell motility
    1   digestion
    1   embryonic development
  491   metabolism
   11   morphogenesis
    1   reproduction
  128   response to external stimulus
    9   response to stress
A Conservative Gene Parts-List
• Create a splice-variant gene DB using ENSEMBL
• Identify 60-80 mer oligo sequences that are splice specific using the criteria
  – Appropriate GC content
  – Appropriate melting temperature
  – Appropriate 5’ and 3’ ends
• For exons that are small use extended window method to obtain 60-80 mer sequences
• Validate mouse-specific oligos against human genome sequences.
• Explore motif-specific oligos where splice variation is known but exon sequences are as yet undetermined.
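The GC-content and melting-temperature criteria above lend themselves to a simple sliding-window filter. The sketch below is only an illustration: the acceptable GC and Tm ranges are placeholder assumptions, the melting temperature uses the common basic approximation Tm = 64.9 + 41 (G + C - 16.4) / N for longer oligos, and neither splice specificity nor the 5’/3’ end criteria are checked here.

```python
from typing import List, Tuple


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def basic_tm(seq: str) -> float:
    """Basic melting-temperature approximation for oligos longer than ~13 nt."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    return 64.9 + 41.0 * (gc - 16.4) / len(seq)


def candidate_oligos(
    seq: str,
    length: int = 60,                              # 60-80 mers in the source
    gc_range: Tuple[float, float] = (0.40, 0.60),  # assumed "appropriate" GC
    tm_range: Tuple[float, float] = (70.0, 85.0),  # assumed "appropriate" Tm
) -> List[Tuple[int, str]]:
    """Return (start, oligo) for every window passing the GC and Tm filters."""
    passing = []
    for start in range(len(seq) - length + 1):
        window = seq[start:start + length]
        if (gc_range[0] <= gc_fraction(window) <= gc_range[1]
                and tm_range[0] <= basic_tm(window) <= tm_range[1]):
            passing.append((start, window))
    return passing
```

In a real design the surviving windows would still need to be screened for uniqueness against the other transcript variants and, as the slide notes, against human genome sequences.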
2003
Design of Splice-Specific Oligo Arrays
Sequence Identity Clustering
• Pairwise BlastP (no filtering), with a homology ratio defined by raw score divided by the self-Blast raw score of the shorter sequence
• Single-linkage clustering on homology
  – 3347 AfCS proteins
  – 2913 cluster with a homology of 0.1 or better
  – 2308 with a homology of 0.3 or better
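The homology ratio and the single-linkage step can be sketched as follows. This is a minimal illustration under stated assumptions: the function names and input shapes are hypothetical, the pairwise and self-Blast raw scores are taken as already computed, and single-linkage clustering at a fixed cutoff is implemented as connected components using a union-find structure.

```python
from typing import Dict, Iterable, List, Tuple


def homology_ratio(raw_score: float,
                   self_score_a: float, len_a: int,
                   self_score_b: float, len_b: int) -> float:
    """Pairwise raw score divided by the self-Blast score of the shorter sequence."""
    shorter_self = self_score_a if len_a <= len_b else self_score_b
    return raw_score / shorter_self


def single_linkage_clusters(proteins: Iterable[str],
                            ratios: Dict[Tuple[str, str], float],
                            cutoff: float) -> List[List[str]]:
    """Group proteins linked by any chain of pairs with ratio >= cutoff."""
    proteins = list(proteins)
    parent = {p: p for p in proteins}

    def find(x: str) -> str:
        # Follow parent links to the cluster root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), ratio in ratios.items():
        if ratio >= cutoff:
            parent[find(a)] = find(b)  # merge the two clusters

    clusters: Dict[str, List[str]] = {}
    for p in proteins:
        clusters.setdefault(find(p), []).append(p)
    return sorted(sorted(members) for members in clusters.values())
```

Because linkage is single, one sufficiently similar pair is enough to merge two groups, which is why a looser cutoff such as 0.1 pulls more proteins into clusters than 0.3.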
2003
AfCS Protein Classification
2003
Example Tree from Identity Clustering
The AfCS Molecule Pages
2002
A Comprehensive Expert-Curated Resource For Signaling Proteins
What Are The Molecule Pages?
2002
• The “AfCS Molecule Pages” is a website containing comprehensive information about selected signaling proteins
• Each protein has a dedicated “Molecule Page”, the public’s one-stop shop for everything pertinent to that protein
• A “Molecule Page” is continuously updated with data from an expert author, and with automated data obtained from the public databases
• Each published update becomes an official “Molecule Page Version”
• The author is responsible for entering information about
  – The AfCS protein’s functional states
  – Interactions of their protein with other proteins and small molecules
  – Mutations of the protein, and their consequences and/or phenotypes
  – Relevant experimental information
Molecule Pages – Author-Entered Data
2002
• “Automated Data” for each protein is provided to both the public and the authors, and can be referenced by the author when entering their own data
• The types of automated data available will include:
  – Summaries of and links to external database records that correspond to, or are related to, the author’s protein
    • (e.g., Genbank, SwissProt and PDB records)
Molecule Pages – Automated Data
2002
Molecule Pages
2002
[Diagram: functional-state scheme with states RG, RG*T, RGD, GA, G*AT, GAD, G, G*T, GD, RGA, RG*AT, RGAD; GTP and GDP exchange steps; transitions labeled T1-T4, P1-P4, D1-D4, A1-A6, R1-R6]
Mini molecule page documentation for AfCS protein A002002 (Rac2)
Mini Molecule Page
AfCS Protein ID: A002002
Protein Name: Rac2
Protein Synonyms: EN-7; RAS-related C3 botulinum substrate 2; Rac2; RacB; Ras-related C3 botulinum toxin substrate 2; p21-Rac2
Author: Gary M Bokoch
Co-Authors: -
Protein Function: Rac2 is a member of the Rho family of small GTPases. Rac2 regulates several cellular functions by cycling between its inactive GDP-bound state (Rac2-GDP) and its active GTP-bound state (Rac2-GTP).
Protein Regulation: The activity of Rac2 is controlled by several regulators. In its inactive state, Rac2-GDP is bound to RhoGDI (GDP dissociation inhibitor). The signal(s) that lead to dissociation of the Rac2-GDP-GDI complex still remain to be determined. The exchange of GDP for GTP of Rho GTPases is regulated by a group of over 30 proteins collectively called GEFs (guanine nucleotide exchange factors). The intrinsic rate of GTP hydrolysis by Rho GTPases is controlled by another group of proteins known as GAPs (GTPase activating proteins). Rac2 is post-translationally processed at its carboxyl terminal CAAX motif with a geranylgeranyl lipid modification that allows it to bind to membranes. RhoGDI, however, appears to sequester Rac and prevents Rac from interacting with membranes.
Concentration Regulation: Unknown
Subcellular Localization: The Rac2-RhoGDI complex is located in the cytoplasm. Upon receiving a stimulatory signal, Rac2 is released from RhoGDI and is regulated by RhoGDI that prevents the interaction of Rac with membranes.
Phenotypes: Neutrophils of Rac2-deficient mice displayed significant defects in chemotaxis, in shear-dependent L-selectin-mediated capture on the endothelial substrate Glycam-1, in F-actin generation, and in p38 and p42/44 MAP kinase activation induced by chemoattractants. Superoxide generation by bone marrow neutrophils was significantly reduced, but it was normal in activated peritoneal exudate neutrophils. These defects were reflected in vivo by baseline neutrophilia, reduced inflammatory peritoneal exudate formation, and increased mortality when challenged with Aspergillus fumigatus.
Splice Variants: unknown
Mouse Gene Symbol: Rac2
Genbank Accession: 6679600
Genbank Organism: Mouse
Major Sites of Expression: T-cells, B-lymphocytes, hematopoietic cells
Cardiac Myocyte Expression: no (-)
B Lymphocyte Expression: yes (Dorseuil, O. et al. (1992) J. Biol. Chem. 267:20540-20542)
Interactions:
• Ligands: GTP, GDP
• Proteins: p67phox, p21-activated kinase (PAK), GAPs (Bcr, Abr), GEFs, smgGDS (small GTPase guanine nucleotide dissociation stimulator), RhoGDI, D4GDI
Antibodies: Rabbit polyclonal available from Santa Cruz Biotechnology, Inc. Mouse monoclonal available from Upstate Biotechnology, Inc.
References:
• Bokoch, G.M. (1995) Immunol. Res. 21:139-148
• Bishop, A.L. and Hall, A. (2000) Biochem. J. 348:241-255
• Roberts, A.W. et al. (1999) Immunity 10:183-196
• Williams, D.A. et al. (2000) Blood 96:1646-1654
• Scheffzek, K., Stephan, J., Jensen, O.N., Illenberger, D., and Gierschik, P. (2000) Nat. Struct. Biol. 7:122-126
Determining Gene Parts List from Affymetrix Data
For Shankar Subramaniam
Eugene Ke
May 15th
2003
Differentially Expressed Genes
• A typical Affymetrix experiment consists of two microarrays
  – Control
  – Experiment
• Comparing two chips
  – Generates ratios
  – Generates a p-value estimate
    • Empirically corrected significance value
2003
Calling Significant Genes
• Affymetrix suggests a consensus scheme
  – Consider all aggregate measures
  – If greater than some percentage of arrays agree, call the change truly significant
  – All criteria are ultimately arbitrary
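A consensus rule of this kind is easy to state in code. The sketch below is an assumption-laden illustration rather than the Affymetrix scheme itself: the call labels ("up", "down", "none"), the function name, and the 75% agreement default are placeholders, and as the slide says, any such threshold is ultimately arbitrary.

```python
from collections import Counter
from typing import Optional, Sequence


def consensus_call(calls: Sequence[str],
                   min_fraction: float = 0.75) -> Optional[str]:
    """Return the majority call if enough arrays agree, else None."""
    if not calls:
        return None
    # Most frequent per-array call and how many arrays made it.
    label, count = Counter(calls).most_common(1)[0]
    return label if count / len(calls) >= min_fraction else None
```

For instance, `consensus_call(["up", "up", "up", "none"])` returns `"up"` because three of four comparisons agree, while `consensus_call(["up", "down"])` returns `None`.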
2003
Identifying Nondifferentially Expressed Genes
• Affymetrix returns signal values for each transcript
  – Estimate of transcript number
• Quality controls are important to consider
  – Statistical measure that signal is provably different from zero
  – Need to adjust for the multiple testing problem
2003
Generating a parts list
• Differentially expressed genes (D)
  – Significant genes
• Nondifferentially expressed genes (N)
  – Adjust detection p-values
  – Determine reasonable threshold
• Apply union of sets
  – Take minimum of possible p-values
• Determine reasonable cutoff
Gene       D       N       Parts List
Gene 1     0.001   0.03    0.001
Gene 2     0.4     0.3     0.3
Gene 3     0.46    0.08    0.08
Gene 4     0.61    0.006   0.006
Gene 5     0.11    0.309   0.11
Gene 6     0.01    0.69    0.01
Gene 7     0.15    0.023   0.023
Gene 8     0.43    0.5     0.43
Gene 9     0.72    0.087   0.087
Gene 10    0.043   0.45    0.043
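The table's "Parts List" column is the per-gene minimum of the D and N p-values, followed by a cutoff. A minimal sketch of that union step is shown below; the function name and the 0.05 cutoff are assumptions for illustration, since the source leaves the "reasonable cutoff" open.

```python
from typing import Dict, List, Tuple


def parts_list(d_pvalues: Dict[str, float],
               n_pvalues: Dict[str, float],
               cutoff: float = 0.05) -> Tuple[Dict[str, float], List[str]]:
    """Union of sets D and N: keep the smaller p-value, then apply a cutoff."""
    combined = {gene: min(d_pvalues[gene], n_pvalues[gene])
                for gene in d_pvalues}
    selected = sorted(gene for gene, p in combined.items() if p <= cutoff)
    return combined, selected


# Values taken from the table above.
D = {"Gene 1": 0.001, "Gene 2": 0.4, "Gene 3": 0.46, "Gene 4": 0.61,
     "Gene 5": 0.11, "Gene 6": 0.01, "Gene 7": 0.15, "Gene 8": 0.43,
     "Gene 9": 0.72, "Gene 10": 0.043}
N = {"Gene 1": 0.03, "Gene 2": 0.3, "Gene 3": 0.08, "Gene 4": 0.006,
     "Gene 5": 0.309, "Gene 6": 0.69, "Gene 7": 0.023, "Gene 8": 0.5,
     "Gene 9": 0.087, "Gene 10": 0.45}

combined, selected = parts_list(D, N)
```

With the assumed 0.05 cutoff, Genes 1, 4, 6, 7, and 10 make the parts list; Gene 4 and Gene 7 enter through detection (N) alone, which is the point of taking the union.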
2003
Issues to Consider
• How comparable are detection and change p-values?
  – Detection p-values are “normal” p-values
  – Change p-values are “estimated” p-values
• How to adequately compensate for the multiple testing problem?
  – Traditional methods are much too stringent
  – Affymetrix applies some empirical approaches
  – What are other approaches?
• Replication
  – Yields better results
  – Apply joint probabilities
    • Probability of appearing in all arrays
• How to weight arrays?
  – Each array is uniquely affected by noise
  – Ideally, would have some method to weight “good” and “bad” arrays
• Ill-posed problem
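On the open question of other approaches to multiple testing, one widely used option that is less stringent than a Bonferroni-style correction is the Benjamini-Hochberg step-up adjustment, which controls the false discovery rate. The slides do not say this method was used; the sketch below is offered only as an example of the kind of alternative the question points at, and the function name is illustrative.

```python
from typing import List, Sequence


def benjamini_hochberg(pvalues: Sequence[float]) -> List[float]:
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvalues)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Genes whose adjusted value falls below a chosen FDR level (for example 0.05) would be kept; unlike a Bonferroni correction, the penalty applied to each p-value depends on its rank rather than on the total number of tests alone.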
2003
Comparing to other technologies
• How to relate probes from an Affymetrix chip to another chip?
  – Annotation
  – Sequence Comparison
• Annotation is fluid
  – Original probe sequences may not reflect current realities
  – In-house annotation poses synchronization problems
• Sequence Comparison
  – What is considered a reasonable match?
2003
References
• Detailed Statistical Algorithms white paper
  – http://www.affymetrix.com/support/technical/whitepapers/sadd_whitepaper.pdf
• Affymetrix Probe Sequences
  – http://www.affymetrix.com/analysis/download_center.affx
2003
Y2H Physical Data Model
[ER diagram; relationship multiplicities are shown as 0..*]

AFCS_PROT
  AFCS_PID             VARCHAR2(12)
  PROT_NAME            VARCHAR2(200)
  PROT_SYNONYMS        VARCHAR2(2000)
  PROT_CATEGORY        VARCHAR2(200)
  Keys: <pk>

BAIT
  BAIT_ID              NUMBER
  BAIT_AFCS_ID         VARCHAR2(12)
  AFCS_PROT_VERSION    NUMBER(2)
  PROT_GI              VARCHAR2(12)
  NUCLEOTIDE_GI        VARCHAR2(12)
  BAIT_SEQ_START       NUMBER
  BAIT_SEQ_END         NUMBER
  BAIT_SEQ             VARCHAR2(2000)
  Keys: <pk>, <fk>

BAIT_PREY
  BAIT_ID              NUMBER          <pk,fk1>
  LIBRARY_NAME         VARCHAR2(15)    <pk,fk3>
  PREY_ID              NUMBER(12)      <pk,fk2>

BAIT_STATUS
  BAIT_STATUS          VARCHAR2(30)
  DESCRIPTION          VARCHAR2(500)
  Keys: <pk>

FILE_BAIT
  BAIT_ID              NUMBER          <pk,fk1>
  Y2H_FILE_ID          NUMBER          <pk,fk3>
  BAIT_STATUS          VARCHAR2(30)    <fk2>

PREY
  PREY_ID              NUMBER(12)
  PREY_AFCS_ID         VARCHAR2(12)
  PREY_SEQ_ID          NUMBER(12)
  PREY_NAME            VARCHAR2(300)
  PREY_TYPE            VARCHAR2(12)
  NUCLEOTIDE_GI        VARCHAR2(12)
  PREY_SEQ_START       NUMBER
  PREY_SEQ_END         NUMBER
  Keys: <pk>, <fk1>, <fk2>

PREY_LIBRARY
  LIBRARY_NAME         VARCHAR2(15)
  DESCRIPTION          VARCHAR2(500)
  Keys: <pk>

PREY_SEQ
  PREY_SEQ_ID          NUMBER(12)
  PROT_GI              VARCHAR2(12)
  PROT_SEQ             CLOB
  NUCLEOTIDE_SEQ       CLOB
  NOVEL_NUC_CHECKSUM   VARCHAR2(12)
  NUC_SEQ_TYPE         VARCHAR2(12)
  NOVEL_SEQ_START      NUMBER(12)
  NOVEL_SEQ_END        NUMBER(12)
  Keys: <pk>

Y2H_FILE
  Y2H_FILE_ID          NUMBER
  FILE_NAME            VARCHAR2(200)
  FILE_LOCATION        VARCHAR2(25)
  EXPT_DATE            DATE
  DATE_RECEIVED        DATE
  DATE_INSERTED        VARCHAR2(12)
  Keys: <pk>
Molecule Page
Y2H Database Views
AFCS_BAIT
  AFCS_PID, PROTEIN_NAME, PROTEIN_SYNONYMS, PROTEIN_CATEGORY, AFCS_PROTEIN_VERSION, BAIT_ID, BAIT_PROTEIN_GI, BAIT_NUCLEOTIDE_GI, BAIT_N_TERMINAL_START, BAIT_C_TERMINAL_END, BAIT_SEQUENCE, Y2H_FILE_ID, BAIT_STATUS

BP_PROTEIN
  AFCS_ID, PROTEIN_NAME, PROTEIN_SYNONYMS, PROTEIN_CATEGORY

B_PROTEIN
  AFCS_ID, PROTEIN_NAME, PROTEIN_SYNONYMS, PROTEIN_CATEGORY

Y2H_INTERACTION
  BAIT_AFCS_ID, BAIT_PROTEIN_NAME, BAIT_ID, BAIT_AFCS_VERSION, BAIT_PROTEIN_GI, BAIT_NUCLEOTIDE_GI, BAIT_N_TERMINAL_START, BAIT_C_TERMINAL_END, PREY_ID, PREY_PROTEIN_NAME, PREY_LIBRARY_NAME, PREY_NUCLEOTIDE_GI, PREY_N_TERMINAL_START, PREY_C_TERMINAL_END

prey_V
  PREY_DB_ID, Protein_AFCS_ID, Protein_NAME, Prey_Protein_TYPE, Prey_LIBRARY, NUCLEOTIDE_GI, Prey_Protein_N_Termimal, PREY_protein_C_Terminal, PREY_Sequence_ID, PROTein_GI, PROTein_SEQuence, NUCLEOTIDE_SEQuence, NOVEL_NUC_CHECKSUM, NUC_SEQuence_TYPE, NOVEL_Nuc_SEQ_START, NOVEL_nuc_SEQ_END
  Source tables: Y2H.BAIT_PREY, Y2H.PREY, Y2H.PREY_SEQ

AFCS_PREY
  PREY_AFCS_ID, PREY_AFCS_NAME, PREY_ID, PREY_TYPE, PREY_LIBRARY, PREY_NUCLEOTIDE_GI, PREY_N_TERMINAL_START, PREY_C_TERMINAL_END

Views in y2h
2003