Protein Sequence Databases
Nathan Edwards
Department of Biochemistry and Mol. & Cell. Biology
Georgetown University Medical Center
Protein Sequence Databases
• Link between mass spectra and proteins
• A protein’s amino-acid sequence provides a basis for interpreting
  • Enzymatic digestion
  • Separation protocols
  • Fragmentation
  • Peptide ion masses
• We must interpret database information as carefully as mass spectra.
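As a rough illustration of how a sequence supports the interpretation of enzymatic digestion and peptide ion masses, the sketch below digests a sequence in silico and computes peptide m/z values. It assumes a simple trypsin rule (cleave after K or R unless the next residue is P), no missed cleavages, and no modifications; the function names and the toy sequence are made up for this example.

```python
import re

# Standard monoisotopic residue masses in daltons.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # added once per peptide (free N- and C-termini)
PROTON = 1.00728   # added once per positive charge


def tryptic_peptides(sequence):
    """Split after K or R, except before P. Assumes no missed cleavages."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]


def mz(peptide, charge=1):
    """Monoisotopic m/z of an unmodified peptide at the given charge."""
    neutral = sum(MONO[aa] for aa in peptide) + WATER
    return (neutral + charge * PROTON) / charge


# Toy sequence, not a real protein.
for pep in tryptic_peptides("AAKPGGRTTK"):
    print(pep, round(mz(pep, charge=2), 4))
```

Real search engines also consider missed cleavages, modifications, and other enzymes, so this is only the simplest case.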
More than sequence…
Protein sequence databases provide much more than sequence:
• Names
• Descriptions
• Facts
• Predictions
• Links to other information sources
Protein databases provide a link to the current state of our understanding about a protein.
Much more than sequence
Names
• Accession, Name, Description
Biological Source
• Organism, Source, Taxonomy
Literature
Function
• Biological process, molecular function, cellular component
• Known and predicted
Features
• Polymorphism, Isoforms, PTMs, Domains
Derived Data
• Molecular weight, pI
Database types
Curated
• Swiss-Prot
• UniProt
• RefSeq NP
Translated
• TrEMBL
• RefSeq XP, ZP
Omnibus
• NCBI’s nr
• MSDB
• IPI
Other
• PDB
• HPRD
• EST
• Genomic
SwissProt
• From ExPASy
  • Expert Protein Analysis System
  • Swiss Institute of Bioinformatics
• ~ 515,000 protein sequence “entries”
• ~ 12,000 species represented
• ~ 20,000 Human proteins
• Highly curated
• Minimal redundancy
• Part of UniProt Consortium
TrEMBL
• Translated EMBL nucleotide sequences
  • European Molecular Biology Laboratory
  • European Bioinformatics Institute (EBI)
• Computer annotated
• Only sequences absent from SwissProt
• ~ 10.5 M protein sequence “entries”
• ~ 230,000 species
• ~ 75,000 Human proteins
• Part of UniProt Consortium
UniProt
• Universal Protein Resource
• Combination of sequences from
  • Swiss-Prot
  • TrEMBL
• Mixture of highly curated (Swiss-Prot) and computer annotation (TrEMBL)
• “Similar sequence” clusters are available
  • 50%, 90%, 100% sequence similarity
RefSeq
• Reference Sequence
• From NCBI (National Center for Biotechnology Information), NLM, NIH
• Integrated genomic, transcript, and protein sequences.
• Varying levels of curation
  • Reviewed, Validated, …, Predicted, …
• ~ 9.7 M protein sequence “entries”
  • ~ 209,000 reviewed, ~ 90,000 validated
• ~ 39,000 Human proteins
RefSeq
• Particular focus on major research organisms
  • Tightly integrated with genome projects.
• Curated entries: NP accessions
• Predicted entries: XP accessions
• Others: YP, ZP, AP
IPI
• International Protein Index
• From EBI
• For a specific species, combines
  • UniProt, RefSeq, Ensembl
  • Species specific databases
    • HInv-DB, VEGA, TAIR
• ~ 87,000 (from ~ 307,000) human protein sequence entries
• Human, mouse, rat, zebra fish, arabidopsis, chicken, cow
MSDB
• From the Imperial College (London)
• Combines
  • PIR, TrEMBL, GenBank, SwissProt
• Distributed with Mascot
  • …so well integrated with Mascot
• ~ 3.2M protein sequence entries
• “Similar sequences” suppressed
  • 100% sequence similarity
• Not updated since September 2006 (obsolete)
NCBI’s nr
• “non-redundant”
• Contains
  • GenBank CDS translations
  • RefSeq Proteins
  • Protein Data Bank (PDB)
  • SwissProt, TrEMBL, PIR
  • Others
• “Similar sequences” suppressed
  • 100% sequence similarity
• ~ 10.5 M protein sequence “entries”
Others
• HPRD
  • Manually curated integration of literature
• PDB
  • Focus on protein structure
• dbEST
  • Part of GenBank - EST sequences
• Genome Sequences
Human Sequences
• Number of Human genes is believed to be between 20,000 and 25,000
SwissProt ~ 20,000
RefSeq ~ 39,000
TrEMBL ~ 75,000
IPI-HUMAN ~ 87,000
MSDB ~130,000
nr ~230,000
DNA to Protein Sequence
Derived from http://online.itp.ucsb.edu/online/infobio01/burge
Genome Browsers
• Link genomic, transcript, and protein sequence in a graphical manner
  • Genes, ESTs, SNPs, cross-species, etc.
• UC Santa Cruz
  • http://genome.ucsc.edu
• Ensembl
  • http://www.ensembl.org
• NCBI Map View
  • http://www.ncbi.nlm.nih.gov/mapview
UCSC Genome Browser
• Shows many sources of protein sequence evidence in a unified display
PeptideMapper Web Service
I’m Feeling Lucky
Unannotated Splice Isoform
Accessions
• Permanent labels
• Short, machine readable
• Enable precise communication
• Typos render them unusable!
• Each database uses a different format
  • Swiss-Prot: P17947
  • Ensembl: ENSG00000066336
  • PIR: S60367; S60367
  • GO: GO:0003700;
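Because each database uses its own accession format, a format check can catch many typos before a lookup fails. The sketch below is one way to do that; the patterns reflect commonly documented formats for UniProt/Swiss-Prot accessions, Ensembl human gene IDs, and GO term IDs, but the `PATTERNS` table and `classify_accession` helper are assumptions for illustration, not part of any database's tooling.

```python
import re

# Assumed patterns; check each database's own documentation before relying on them.
PATTERNS = {
    "Swiss-Prot/UniProt": re.compile(
        r"(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
        r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})"
    ),
    "Ensembl human gene": re.compile(r"ENSG[0-9]{11}"),
    "GO term": re.compile(r"GO:[0-9]{7}"),
}


def classify_accession(text):
    """Return the names of all formats the whole token matches."""
    token = text.strip()
    return [name for name, pattern in PATTERNS.items() if pattern.fullmatch(token)]


print(classify_accession("P17947"))           # matches the Swiss-Prot pattern
print(classify_accession("ENSG00000066336"))  # matches the Ensembl pattern
print(classify_accession("P1794O"))           # letter O typed for a digit: no match
```

A well-formed accession can still be the wrong one, so a format check complements, rather than replaces, looking the accession up.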
Names / IDs
• Compact mnemonic labels
• Not guaranteed permanent
• Require careful curation
• Conceptual objects
• ALBU_HUMAN
  • Serum Albumin
• RT30_HUMAN
  • Mitochondrial 28S ribosomal protein S30
• CP3A7_HUMAN
  • Cytochrome P450 3A7
Description / Name
• Free text description
• Human readable
• Space limited
• Hard for computers to interpret!
• No standard nomenclature or format
• Often abused….
• COX7R_HUMAN
  • Cytochrome c oxidase subunit VIIa-related protein, mitochondrial [Precursor]
FASTA Format
• >
• Accession number
  • No uniform format
  • Multiple accessions separated by |
• One line of description
  • Usually pretty cryptic
• Organism of sequence?
  • No uniform format
  • Official Latin name not necessarily used
• Amino-acid sequence in single-letter code
  • Usually spread over multiple lines.
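The layout described above can be read with a few lines of code. This is a minimal sketch, assuming the identifier is everything up to the first space on the `>` line and that the rest of that line is the description; `read_fasta` is a hypothetical helper, and the sequence in the example is made up.

```python
def read_fasta(lines):
    """Yield (identifier, description, sequence) for each FASTA record."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip()
        if line.startswith(">"):
            if header is not None:
                yield _record(header, chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield _record(header, chunks)


def _record(header, chunks):
    # Assumption: identifier ends at the first space; no uniform format exists.
    identifier, _, description = header.partition(" ")
    return identifier, description.strip(), "".join(chunks)


example = """>sp|Q9H0A8|COM4_HUMAN COMM domain containing protein 4
MKTAYIAKQR
GGSTTK
""".splitlines()

for identifier, description, sequence in read_fasta(example):
    print(identifier.split("|"), description, len(sequence))
```

Splitting the identifier on `|` recovers the multiple accessions, but which field means what still depends on the source database.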
Organism / Species / Taxonomy
• The protein’s organism…
  • …or the source of the biological sample
• The most reliable sequence annotation available
  • Useful only to the extent that it is correct
• NCBI’s taxonomy is widely used
  • Provides a standard of sorts; hierarchical
  • Other databases don’t necessarily keep up
• Organism-specific sequence databases are starting to become available.
Organism / Species / Taxonomy
• Buffalo rat
• Gunn rats
• Norway rat
• Rattus PC12 clone IS
• Rattus norvegicus
• Rattus norvegicus8
• Rattus norwegicus
• Rattus rattiscus
• Rattus sp.
• Rattus sp. strain Wistar
• Sprague-Dawley rat
• Wistar rats
• brown rat
• laboratory rat
• rat
• rats
• zitter rats
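One way to cope with a list like the one above is to map free-text organism names onto a single taxonomy identifier. The sketch below assumes a small hand-built synonym table (hypothetical and deliberately incomplete) and uses 10116, the NCBI Taxonomy ID for Rattus norvegicus. Names that cannot be resolved confidently, such as "Rattus sp.", are left unmapped rather than guessed.

```python
import re

RATTUS_NORVEGICUS = 10116  # NCBI Taxonomy ID

# Hypothetical curated synonym table; a real one needs human review.
SYNONYMS = {
    "rattus norvegicus": RATTUS_NORVEGICUS,
    "rattus norwegicus": RATTUS_NORVEGICUS,  # misspelling from the list above
    "norway rat": RATTUS_NORVEGICUS,
    "brown rat": RATTUS_NORVEGICUS,
    "laboratory rat": RATTUS_NORVEGICUS,
    "wistar rat": RATTUS_NORVEGICUS,
    "sprague-dawley rat": RATTUS_NORVEGICUS,
    "buffalo rat": RATTUS_NORVEGICUS,
}


def normalize(name):
    """Lowercase, drop trailing digits, and singularize 'rats'."""
    key = name.strip().lower()
    key = re.sub(r"\d+$", "", key)          # e.g. "norvegicus8"
    key = re.sub(r"\brats\b", "rat", key)   # e.g. "Wistar rats"
    return key.strip()


def taxid(name):
    """Return a taxonomy ID, or None when the name is not in the table."""
    return SYNONYMS.get(normalize(name))


for name in ["Rattus norvegicus8", "Wistar rats", "Rattus sp.", "Rattus PC12 clone IS"]:
    print(name, "->", taxid(name))
```

Returning `None` for unresolved names keeps the uncertainty visible instead of silently assigning a species.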
Controlled Vocabulary
• Middle ground between computers and people
• Provides precision for concepts
  • Searching, sorting, browsing
  • Concept relationships
• Vocabulary / Ontology must be established
  • Human curation
• Link between concept and object:
  • Manually curated
  • Automatic / Predicted
Ontology Structure
• NCBI Taxonomy
  • Tree
• Gene Ontology (GO)
  • Molecular function
  • Biological process
  • Cellular component
  • Directed, Acyclic Graph (DAG)
• Unstructured labels
  • Overlapping?
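The practical difference between a tree and a DAG is that a DAG term may have more than one parent, so collecting everything a term implies means following every path upward. The sketch below shows this with two toy parent maps; the term names are simplified illustrations (intermediate ranks omitted, relations not copied from the actual Gene Ontology), and `ancestors` is a hypothetical helper.

```python
def ancestors(term, parents):
    """Collect every term reachable by following parent links upward."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Tree: each node has at most one parent (simplified lineage).
TREE = {
    "Rattus norvegicus": ["Rattus"],
    "Rattus": ["Muridae"],
}

# DAG: a node may have several parents (illustrative labels only).
DAG = {
    "transcription factor activity": ["DNA binding", "transcription regulator activity"],
    "DNA binding": ["molecular function"],
    "transcription regulator activity": ["molecular function"],
}

print(sorted(ancestors("Rattus norvegicus", TREE)))
print(sorted(ancestors("transcription factor activity", DAG)))
```

The `seen` set matters in the DAG case: without it, shared ancestors such as the root would be visited once per path.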
Protein Families
• Similar sequence implies similar function
• Similar structure implies similar function
• Common domains imply similar function
• Bootstrap up from small sets of proteins with well understood characteristics
• Usually a hybrid manual / automatic approach
Protein Families
• PROSITE, PFam, InterPro, PRINTS
• Swiss-Prot keywords
• Differences:
  • Motif style, ontology structure, degree of manual curation
• Similarities:
  • Primarily sequence based, cross species
Gene Ontology
• Hierarchical
  • Molecular function
  • Biological process
  • Cellular component
• Describes the vocabulary only!
  • Protein families provide GO association
• Not necessarily any appropriate GO category.
• Not necessarily in all three hierarchies.
• Sometimes general categories are used because none of the specific categories are correct.
Protein Family / Gene Ontology
Sequence Variants
• Protein sequence can vary due to
  • Polymorphism
  • Alternative splicing
  • Post-translational modification
• Sequence databases typically do not capture all versions of a protein’s sequence
Sequence Variants
Swiss-Prot; a curated protein sequence database which strives to provide a high level of annotation (such as the description of the function of a protein, its domains structure, post-translational modifications, variants, etc.), a minimal level of redundancy and high level of integration with other databases
- Swiss-Prot web site front page
Sequence Variants
b) Minimal redundancy
Many sequence databases contain, for a given protein sequence, separate entries which correspond to different literature reports. In Swiss-Prot we try as much as possible to merge all these data so as to minimize the redundancy of the database. If conflicts exist between various sequencing reports, they are indicated in the feature table of the corresponding entry.
- Swiss-Prot User Manual, Section 1.1
Sequence Variants
IPI provides a top level guide to the main databases that describe the proteomes of higher eukaryotic organisms. IPI:
1. effectively maintains a database of cross references between the primary data sources
2. provides minimally redundant yet maximally complete sets of proteins for featured species (one sequence per transcript)
3. maintains stable identifiers (with incremental versioning) to allow the tracking of sequences in IPI between IPI releases.
- IPI web site front page
Swiss-Prot Variant Annotations
Peptides to Proteins
Nesvizhskii et al., Anal. Chem. 2003
• A peptide sequence may occur in many different protein sequences
  • Variants, paralogues, protein families
• Separation, digestion and ionization are not well understood
• Proteins in sequence databases are extremely non-random, and far from independent of one another
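The first point above is easy to see with a small lookup: a peptide that occurs in several database entries cannot, on its own, say which protein was in the sample. The sketch below uses plain substring matching over made-up protein IDs and sequences; `map_peptides` is a hypothetical helper, not a method from the cited paper.

```python
def map_peptides(peptides, proteins):
    """Return {peptide: sorted IDs of proteins whose sequence contains it}."""
    hits = {peptide: [] for peptide in peptides}
    for peptide in peptides:
        for protein_id, sequence in proteins.items():
            if peptide in sequence:
                hits[peptide].append(protein_id)
    return {peptide: sorted(ids) for peptide, ids in hits.items()}


# Made-up entries: a protein, a variant of it, and an unrelated protein
# that happens to share a C-terminal stretch.
PROTEINS = {
    "PROT_A": "MKTAYIAKQR",
    "PROT_A_isoform": "MKTAYIAKGG",
    "PROT_B": "MSSLLIAKQR",
}

for peptide, ids in map_peptides(["TAYIAK", "IAKQR", "SSLL"], PROTEINS).items():
    kind = "unique" if len(ids) == 1 else "shared"
    print(peptide, ids, kind)
```

Only the unique peptide distinguishes its protein; the shared ones are consistent with more than one entry.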
Omnibus Database Redundancy Elimination
• Source databases often contain the same sequences with different descriptions
• Omnibus databases keep one copy of the sequence, and
  • An arbitrary description, or
  • All descriptions, or
  • Particular description, based on source preference
• Good definitions can be lost, including taxonomy
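The three description policies above can be sketched as a single merge step over entries with identical sequences. This is an illustration only: `merge_identical`, the policy names, and the source-preference order are assumptions, and `SEQ` is a placeholder standing in for a shared sequence.

```python
def merge_identical(entries, policy="all", preference=()):
    """Collapse (source, description, sequence) entries that share a sequence.

    policy: "arbitrary" keeps the first description seen,
            "all" keeps every description,
            "preferred" keeps the one from the best-ranked source.
    """
    groups = {}
    for source, description, sequence in entries:
        groups.setdefault(sequence, []).append((source, description))
    rank = {source: i for i, source in enumerate(preference)}
    merged = {}
    for sequence, labels in groups.items():
        if policy == "all":
            kept = [description for _, description in labels]
        elif policy == "preferred":
            best = min(labels, key=lambda label: rank.get(label[0], len(rank)))
            kept = [best[1]]
        elif policy == "arbitrary":
            kept = [labels[0][1]]
        else:
            raise ValueError(f"unknown policy: {policy}")
        merged[sequence] = kept
    return merged


SEQ = "MSEQ"  # placeholder, not a real sequence
ENTRIES = [
    ("emb", "hypothetical protein [Homo sapiens]", SEQ),
    ("gb", "COMMD4 protein [Homo sapiens]", SEQ),
    ("ref", "COMM domain containing 4 [Homo sapiens]", SEQ),
    ("sp", "COMM domain containing protein 4", SEQ),
]

print(merge_identical(ENTRIES, "arbitrary")[SEQ])
print(merge_identical(ENTRIES, "preferred", preference=("sp", "ref"))[SEQ])
```

With the arbitrary policy the surviving label here is "hypothetical protein", which shows how a good definition can be lost; the preferred Swiss-Prot label, in turn, carries no organism.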
Description Elimination
• gi|12053249|emb|CAB66806.1| hypothetical protein [Homo sapiens]
• gi|46255828|gb|AAH68998.1| COMMD4 protein [Homo sapiens]
• gi|42632621|gb|AAS22242.1| COMMD4 [Homo sapiens]
• gi|21361661|ref|NP_060298.2| COMM domain containing 4 [Homo sapiens]
• gi|51316094|sp|Q9H0A8|COM4_HUMAN COMM domain containing protein 4
• gi|49065330|emb|CAG38483.1| COMMD4 [Homo sapiens]
Description Elimination
• gi|2947219|gb|AAC39645.1| UDP-galactose 4' epimerase [Homo sapiens]
• gi|1119217|gb|AAB86498.1| UDP-galactose-4-epimerase [Homo sapiens]
• gi|14277913|pdb|1HZJ|B Chain B, Human Udp-Galactose 4-Epimerase: Accommodation Of Udp-N-Acetylglucosamine Within The Active Site
• gi|14277912|pdb|1HZJ|A Chain A, Human Udp-Galactose 4-Epimerase: Accommodation Of Udp-N-Acetylglucosamine Within The Active Site
• gi|2494659|sp|Q14376|GALE_HUMAN UDP-glucose 4-epimerase (Galactowaldenase) (UDP-galactose 4-epimerase)
• gi|1585500|prf||2201313A UDP galactose 4'-epimerase
Description Elimination
• gi|4261710|gb|AAD14010.1| chlordecone reductase [Homo sapiens]
• gi|2117443|pir||A57407 chlordecone reductase (EC 1.1.1.225) / 3alpha-hydroxysteroid dehydrogenase (EC 1.1.1.-) I [validated] – human
• gi|1839264|gb|AAB47003.1| HAKRa product/3 alpha-hydroxysteroid dehydrogenase homolog [human, liver, Peptide, 323 aa]
• gi|1705823|sp|P17516|AKC4_HUMAN Aldo-keto reductase family 1 member C4 (Chlordecone reductase) (CDR) (3-alpha-hydroxysteroid dehydrogenase) (3-alpha-HSD) (Dihydrodiol dehydrogenase 4) (DD4) (HAKRA)
• gi|7328948|dbj|BAA92885.1| dihydrodiol dehydrogenase 4 [Homo sapiens]
• gi|7328971|dbj|BAA92893.1| dihydrodiol dehydrogenase 4 [Homo sapiens]
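Deflines like those listed above can be split into fields, which makes the differences between sources easier to compare. The sketch below assumes the `gi|number|db|accession|rest` layout seen in these examples and treats a trailing `[...]` as the organism; `parse_gi_defline` is a hypothetical helper, and entries that put other text in brackets will defeat it.

```python
import re


def parse_gi_defline(defline):
    """Split a gi-style defline into gi, db tag, accession, description, organism."""
    parts = defline.lstrip(">").split("|", 4)
    if len(parts) != 5 or parts[0] != "gi":
        raise ValueError("not a gi-style defline")
    _, gi, db, accession, rest = parts
    rest = rest.strip()
    # Assumption: an organism, when present, is a trailing [bracketed] name.
    match = re.search(r"\[([^\[\]]+)\]$", rest)
    organism = match.group(1) if match else None
    description = rest[:match.start()].strip() if match else rest
    return {
        "gi": gi,
        "db": db,
        "accession": accession,
        "description": description,
        "organism": organism,
    }


for line in [
    "gi|21361661|ref|NP_060298.2| COMM domain containing 4 [Homo sapiens]",
    "gi|51316094|sp|Q9H0A8|COM4_HUMAN COMM domain containing protein 4",
]:
    print(parse_gi_defline(line))
```

The Swiss-Prot line yields no organism and keeps the entry name inside the description, which is exactly the kind of inconsistency that makes omnibus databases hard to interpret automatically.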
Summary
• Protein sequence databases should be interpreted with as much care as mass spectra
• Protein sequences come from genes
• Use controlled vocabularies
• Understand the structure of ontologies
• Take advantage of computational predictions
• Look for sequence variants
• Peptides to proteins not as simple as it seems
• Be careful with omnibus databases