Accessing information on molecular sequences Bio 224 Dr. Tom Peavy Sept 1, 2010.
![Page 1: Accessing information on molecular sequences Bio 224 Dr. Tom Peavy Sept 1, 2010.](https://reader031.fdocuments.in/reader031/viewer/2022020417/56649f165503460f94c2c451/html5/thumbnails/1.jpg)
Accessing information on molecular sequences
Bio 224
Dr. Tom Peavy
Sept 1, 2010
![Page 2: Accessing information on molecular sequences Bio 224 Dr. Tom Peavy Sept 1, 2010.](https://reader031.fdocuments.in/reader031/viewer/2022020417/56649f165503460f94c2c451/html5/thumbnails/2.jpg)
What is an accession number?
An accession number is a label that is used to identify a sequence. It is a string of letters and/or numbers that corresponds to a molecular sequence.
Examples (all for retinol-binding protein, RBP4):
DNA
• X02775: GenBank genomic DNA sequence
• NT_030059: Genomic contig
• rs7079946: dbSNP (single nucleotide polymorphism)
RNA
• N91759.1: An expressed sequence tag (1 of 170)
• NM_006744: RefSeq DNA sequence (from a transcript)
Protein
• NP_007635: RefSeq protein
• AAC02945: GenBank protein
• Q28369: SwissProt protein
• 1KT7: Protein Data Bank structure record
![Page 3: Accessing information on molecular sequences Bio 224 Dr. Tom Peavy Sept 1, 2010.](https://reader031.fdocuments.in/reader031/viewer/2022020417/56649f165503460f94c2c451/html5/thumbnails/3.jpg)
NCBI's RefSeq project: accessions for genomic, mRNA, and protein sequences

| Accession | Molecule | Method | Note |
|---|---|---|---|
| AC_123456 | Genomic | Mixed | Alternate complete genomic |
| AP_123456 | Protein | Mixed | Protein products; alternate |
| NC_123456 | Genomic | Mixed | Complete genomic molecules |
| NG_123456 | Genomic | Mixed | Incomplete genomic regions |
| NM_123456 | mRNA | Mixed | Transcript products; mRNA |
| NM_123456789 | mRNA | Mixed | Transcript products; 9-digit |
| NP_123456 | Protein | Mixed | Protein products |
| NP_123456789 | Protein | Curation | Protein products; 9-digit |
| NR_123456 | RNA | Mixed | Non-coding transcripts |
| NT_123456 | Genomic | Automated | Genomic assemblies |
| NW_123456 | Genomic | Automated | Genomic assemblies |
| NZ_ABCD12345678 | Genomic | Automated | Whole genome shotgun data |
| XM_123456 | mRNA | Automated | Transcript products |
| XP_123456 | Protein | Automated | Protein products |
| XR_123456 | RNA | Automated | Transcript products |
| YP_123456 | Protein | Auto. & Curated | Protein products |
| ZP_12345678 | Protein | Automated | Protein products |
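Because the two-letter prefix encodes the molecule type and the annotation method, a RefSeq accession can be classified with a simple lookup. The sketch below is illustrative only: the dictionary is transcribed from the table on this slide, and the function name `classify_refseq` is a made-up helper, not part of any NCBI tool.

```python
# Prefix -> (molecule, method), transcribed from the RefSeq table on this slide.
# Note: the slide lists 9-digit NP_ accessions under "Curation"; this sketch
# keeps only the 6-digit row for simplicity.
REFSEQ_PREFIXES = {
    "AC": ("Genomic", "Mixed"),
    "AP": ("Protein", "Mixed"),
    "NC": ("Genomic", "Mixed"),
    "NG": ("Genomic", "Mixed"),
    "NM": ("mRNA", "Mixed"),
    "NP": ("Protein", "Mixed"),
    "NR": ("RNA", "Mixed"),
    "NT": ("Genomic", "Automated"),
    "NW": ("Genomic", "Automated"),
    "NZ": ("Genomic", "Automated"),
    "XM": ("mRNA", "Automated"),
    "XP": ("Protein", "Automated"),
    "XR": ("RNA", "Automated"),
    "YP": ("Protein", "Auto. & Curated"),
    "ZP": ("Protein", "Automated"),
}


def classify_refseq(accession):
    """Return (molecule, method) for a RefSeq-style accession, or None.

    Hypothetical helper: it only inspects the prefix before the underscore.
    """
    prefix, sep, rest = accession.strip().upper().partition("_")
    if not sep or not rest:
        return None  # no underscore, so not a RefSeq-style accession
    return REFSEQ_PREFIXES.get(prefix)


print(classify_refseq("NM_006744"))  # ('mRNA', 'Mixed')
print(classify_refseq("X02775"))     # None (a GenBank accession, not RefSeq)
```

Accessions without an underscore, such as the GenBank record X02775, fall outside the RefSeq scheme and return `None`.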
![Page 4: Accessing information on molecular sequences Bio 224 Dr. Tom Peavy Sept 1, 2010.](https://reader031.fdocuments.in/reader031/viewer/2022020417/56649f165503460f94c2c451/html5/thumbnails/4.jpg)
Six ways to access DNA and protein sequences
1) Entrez Gene with RefSeq database (NCBI)
2) UniGene
3) Nucleotide or Protein databases (NCBI)
4) European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI)
5) ExPASy Sequence Retrieval System (separate from NCBI)
6) UCSC Genome Browser
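The NCBI Nucleotide and Protein databases (option 3) can also be queried from a script through NCBI's E-utilities, which the slides do not cover. As a rough sketch, the code below builds an `efetch` URL that requests a record in FASTA format. The helper names `efetch_fasta_url` and `fetch_fasta` are invented for this example, and the actual download is left uncalled because it needs network access.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address of the NCBI E-utilities efetch service.
EFETCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_fasta_url(accession, db="nucleotide"):
    """Build an efetch URL asking for one record as plain-text FASTA.

    Hypothetical helper; db is "nucleotide" or "protein".
    """
    params = {"db": db, "id": accession, "rettype": "fasta", "retmode": "text"}
    return EFETCH_BASE + "?" + urlencode(params)


def fetch_fasta(accession, db="nucleotide"):
    """Download the FASTA text for an accession (requires network access)."""
    with urlopen(efetch_fasta_url(accession, db)) as response:
        return response.read().decode("utf-8")


# Building the URL needs no network; NM_006744 is the RBP4 RefSeq
# transcript accession from the earlier slide.
print(efetch_fasta_url("NM_006744"))
```

Calling `fetch_fasta("NM_006744")` would retrieve the record itself; for a protein accession, pass `db="protein"`.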
![Page 5: Accessing information on molecular sequences Bio 224 Dr. Tom Peavy Sept 1, 2010.](https://reader031.fdocuments.in/reader031/viewer/2022020417/56649f165503460f94c2c451/html5/thumbnails/5.jpg)
What is an EST?
• Expressed Sequence Tag sequence
• “A short strand of DNA that is part of a cDNA molecule and can act as an identifier of a gene.”
• In essence, a single-pass DNA sequencing read of a particular cDNA
![Page 6: Accessing information on molecular sequences Bio 224 Dr. Tom Peavy Sept 1, 2010.](https://reader031.fdocuments.in/reader031/viewer/2022020417/56649f165503460f94c2c451/html5/thumbnails/6.jpg)
UniGene: unique genes via ESTs
• UniGene at NCBI: www.ncbi.nlm.nih.gov/UniGene
• UniGene clusters contain many ESTs, which are DNA sequences (typically 500 base pairs in length) corresponding to the mRNA from an expressed gene. ESTs are sequenced from a complementary DNA (cDNA) library.
• UniGene data come from many cDNA libraries. Thus, when you look up a gene in UniGene you get information on its abundance and its regional distribution.
Pages 20-21
![Page 7: Accessing information on molecular sequences Bio 224 Dr. Tom Peavy Sept 1, 2010.](https://reader031.fdocuments.in/reader031/viewer/2022020417/56649f165503460f94c2c451/html5/thumbnails/7.jpg)
Cluster sizes in UniGene
This is a gene with 1 EST associated; the cluster size is 1
![Page 8: Accessing information on molecular sequences Bio 224 Dr. Tom Peavy Sept 1, 2010.](https://reader031.fdocuments.in/reader031/viewer/2022020417/56649f165503460f94c2c451/html5/thumbnails/8.jpg)
Cluster sizes in UniGene
This is a gene (or 1 cluster) with 10 ESTs associated; the cluster size is 10
Note: HTC = high-throughput cDNAs
![Page 9: Accessing information on molecular sequences Bio 224 Dr. Tom Peavy Sept 1, 2010.](https://reader031.fdocuments.in/reader031/viewer/2022020417/56649f165503460f94c2c451/html5/thumbnails/9.jpg)
FASTA format
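A FASTA record consists of a header line that starts with ">" followed by one or more lines of sequence. The sketch below shows one way to read such text in Python; `parse_fasta` is a hypothetical helper written for this example, and the two records are invented for illustration, not real database entries.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Made-up records for illustration only (one nucleotide, one protein).
EXAMPLE = """>example_1 hypothetical record for illustration
ATGGCGTACC
GGTTAA
>example_2 second hypothetical record
MKWVWALLLL
"""

for header, seq in parse_fasta(EXAMPLE):
    print(header.split()[0], len(seq))
# example_1 16
# example_2 10
```

The first word of the header is usually the identifier (often an accession number); the rest is free-text description.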
![Page 10: Accessing information on molecular sequences Bio 224 Dr. Tom Peavy Sept 1, 2010.](https://reader031.fdocuments.in/reader031/viewer/2022020417/56649f165503460f94c2c451/html5/thumbnails/10.jpg)
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=homologene
Orthologous genes for various model species can be easily identified using this site (curated database)