Accessing information on molecular sequences Bio 224 Dr. Tom Peavy Sept 1, 2010.

Page 1

Accessing information on molecular sequences

Bio 224
Dr. Tom Peavy
Sept 1, 2010

Page 2

What is an accession number?

An accession number is a label that is used to identify a sequence. It is a string of letters and/or numbers that corresponds to a molecular sequence.

Examples (all for retinol-binding protein, RBP4):

DNA
X02775 GenBank genomic DNA sequence
NT_030059 Genomic contig
rs7079946 dbSNP (single nucleotide polymorphism)

RNA
N91759.1 An expressed sequence tag (1 of 170)
NM_006744 RefSeq DNA sequence (from a transcript)

Protein
NP_007635 RefSeq protein
AAC02945 GenBank protein
Q28369 SwissProt protein
1KT7 Protein Data Bank structure record
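Some of the identifiers above, such as N91759.1, carry a ".version" suffix after the accession itself. The following is a minimal sketch of how such an identifier could be split into its two parts. The function name `split_accession` and the regular expression are illustrative assumptions, not part of any NCBI tool.

```python
import re

# Accession characters, optionally followed by ".<version>" (e.g. N91759.1).
# This pattern is a simplifying assumption for illustration only.
ACCESSION_RE = re.compile(r"^(?P<accession>[A-Za-z0-9_]+)(?:\.(?P<version>\d+))?$")


def split_accession(identifier):
    """Return (accession, version); version is None if no suffix is present."""
    match = ACCESSION_RE.match(identifier.strip())
    if match is None:
        raise ValueError(f"not a recognizable accession: {identifier!r}")
    version = match.group("version")
    if version is None:
        return match.group("accession"), None
    return match.group("accession"), int(version)


if __name__ == "__main__":
    for example in ["N91759.1", "NM_006744", "1KT7"]:
        print(example, split_accession(example))
```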

Page 3

Accession        Molecule  Method           Note
AC_123456        Genomic   Mixed            Alternate complete genomic
AP_123456        Protein   Mixed            Protein products; alternate
NC_123456        Genomic   Mixed            Complete genomic molecules
NG_123456        Genomic   Mixed            Incomplete genomic regions
NM_123456        mRNA      Mixed            Transcript products; mRNA
NM_123456789     mRNA      Mixed            Transcript products; 9-digit
NP_123456        Protein   Mixed            Protein products;
NP_123456789     Protein   Curation         Protein products; 9-digit
NR_123456        RNA       Mixed            Non-coding transcripts
NT_123456        Genomic   Automated        Genomic assemblies
NW_123456        Genomic   Automated        Genomic assemblies
NZ_ABCD12345678  Genomic   Automated        Whole genome shotgun data
XM_123456        mRNA      Automated        Transcript products
XP_123456        Protein   Automated        Protein products
XR_123456        RNA       Automated        Transcript products
YP_123456        Protein   Auto. & Curated  Protein products
ZP_12345678      Protein   Automated        Protein products

NCBI’s RefSeq project: accession numbers for genomic, mRNA, and protein sequences
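Because the two-letter RefSeq prefix indicates the molecule type, the table can be turned into a simple lookup. The sketch below copies only the prefix and molecule columns from the table; the names `REFSEQ_MOLECULE` and `refseq_molecule` are hypothetical and chosen for illustration.

```python
# Prefix -> molecule type, copied from the RefSeq table on this slide.
REFSEQ_MOLECULE = {
    "AC": "Genomic", "AP": "Protein", "NC": "Genomic", "NG": "Genomic",
    "NM": "mRNA", "NP": "Protein", "NR": "RNA", "NT": "Genomic",
    "NW": "Genomic", "NZ": "Genomic", "XM": "mRNA", "XP": "Protein",
    "XR": "RNA", "YP": "Protein", "ZP": "Protein",
}


def refseq_molecule(accession):
    """Return the molecule type for a RefSeq-style accession, or None."""
    prefix, underscore, _rest = accession.partition("_")
    if not underscore:
        # No underscore, so this is not a RefSeq-style accession (e.g. X02775).
        return None
    return REFSEQ_MOLECULE.get(prefix.upper())


if __name__ == "__main__":
    for example in ["NM_006744", "NP_007635", "NT_030059", "X02775"]:
        print(example, refseq_molecule(example))
```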

Page 4

Six ways to access DNA and protein sequences

1) Entrez Gene with RefSeq database (NCBI)
2) UniGene
3) Nucleotide or Protein databases (NCBI)
4) European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI)
5) ExPASy Sequence Retrieval System (separate from NCBI)
6) UCSC Genome Browser
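The NCBI Nucleotide and Protein databases (item 3) can also be queried programmatically. The sketch below assumes NCBI's E-utilities `efetch` endpoint, which is not mentioned in the slides, and builds a request URL for a FASTA record. The helper names `build_efetch_url` and `fetch_fasta` are hypothetical. The fetch itself needs network access and is not called here.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the public E-utilities efetch endpoint at NCBI.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession, db="nucleotide"):
    """Build an efetch URL requesting one record as plain-text FASTA."""
    params = {"db": db, "id": accession, "rettype": "fasta", "retmode": "text"}
    return f"{EFETCH_URL}?{urlencode(params)}"


def fetch_fasta(accession, db="nucleotide"):
    """Download the FASTA text for an accession (requires network access)."""
    with urlopen(build_efetch_url(accession, db), timeout=30) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Only print the URLs; call fetch_fasta(...) to actually download.
    print(build_efetch_url("NM_006744"))
    print(build_efetch_url("NP_007635", db="protein"))
```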

Page 5

What is an EST?

• Expressed Sequence Tag sequence

• “A short strand of DNA that is part of a cDNA molecule and can act as an identifier of a gene.”

• In essence, a single pass DNA sequencing reaction for a particular cDNA

Page 6

UniGene: unique genes via ESTs

• UniGene at NCBI: www.ncbi.nlm.nih.gov/UniGene

• UniGene clusters contain many ESTs, which are DNA sequences (typically 500 base pairs in length) corresponding to the mRNA from an expressed gene. ESTs are sequenced from a complementary DNA (cDNA) library.

• UniGene data come from many cDNA libraries. Thus, when you look up a gene in UniGene you get information on its abundance and its regional distribution.

Pages 20-21

Page 7

Cluster sizes in UniGene

This is a gene with 1 EST associated; the cluster size is 1

Page 8

Cluster sizes in UniGene

This is a gene (or 1 cluster) with 10 ESTs associated; the cluster size is 10

Note: HTC = high-throughput cDNAs
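Cluster size is simply the number of ESTs assigned to a cluster. The sketch below illustrates the counting with made-up placeholder identifiers. The EST and cluster names are hypothetical, not real UniGene records, and `cluster_sizes` is an illustrative helper.

```python
from collections import Counter

# Hypothetical (EST id, cluster id) assignments; placeholders only.
assignments = [("est_01", "cluster_A")]
assignments += [(f"est_{n:02d}", "cluster_B") for n in range(2, 12)]


def cluster_sizes(pairs):
    """Count how many ESTs are assigned to each cluster."""
    return Counter(cluster_id for _est_id, cluster_id in pairs)


if __name__ == "__main__":
    for cluster_id, size in sorted(cluster_sizes(assignments).items()):
        print(cluster_id, size)
```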

Page 9

FASTA format
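In FASTA format, each record is a header line beginning with ">" followed by one or more lines of sequence. The following is a minimal parsing sketch. The function name `parse_fasta` and the example records are hypothetical placeholders, not real database entries.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header = None
    chunks = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]  # drop the leading ">"
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Made-up records for illustration only.
EXAMPLE = """>seq1 hypothetical DNA record
ACGTACGT
TTGA
>seq2 hypothetical protein record
MKW
"""

if __name__ == "__main__":
    for header, sequence in parse_fasta(EXAMPLE):
        print(header, len(sequence))
```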

Page 10

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=homologene

Orthologous genes for various model species can be easily identified using this site (curated database)