Computational Biology and Bioinformatics
• Computational biology
– Development of algorithms to solve problems in biology
• Bioinformatics
– Application of computational biology to the analysis and management of biological data
• Applied bioinformatics
– Intelligent use of tools to navigate the sequence space
Better experiments can be designed by a careful Bioinformatics analysis before the bench work
Bioinformatics 7,850,000 hits
DNA sequencing
• DNA sequences are the most abundant type of sequences (5 × 10^10)
• Generated by the chain termination method (Sanger sequencing)
• Based on the action of a DNA polymerase that adds nucleotides to the complementary strand
• Fluorescently labeled ddNTPs (dideoxynucleotides) stop synthesis by acting as chain terminators. They are included in amounts such that, across all products, termination occurs at every position where the base appears in the template
• Requires template DNA, a primer, and one ddNTP for each base: A, C, G, and T
• Products are separated by electrophoresis
Figure: primer extended into chain-terminated products ending in A, G, C, or T
Summary of chain termination sequencing
In the past, four different reactions, one for each ddNTP, were separated on a gel that could resolve one-base differences. The sequence was then read from the bottom of the gel to the top.
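The read-out logic above can be sketched in Python. This is a toy model, not a simulation of the chemistry: each terminated product is reduced to its length and the ddNTP at its end, and the names `read_gel` and `fragments` are made up for illustration. Sorting by length stands in for electrophoresis, where the shortest products run furthest, so reading from the bottom of the gel upward yields the newly synthesized strand.

```python
def read_gel(fragments):
    """Read a sequence from (length, terminal_base) pairs.

    Shortest fragments migrate furthest, so sorting by length
    corresponds to reading the gel from bottom to top.
    """
    ordered = sorted(fragments, key=lambda frag: frag[0])
    return "".join(base for _, base in ordered)


# Hypothetical products of the four ddNTP reactions, pooled together.
fragments = [(3, "G"), (1, "A"), (4, "T"), (2, "C"), (5, "A")]
synthesized = read_gel(fragments)
print(synthesized)
```

The result is the synthesized strand, which is complementary to the template.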
Sequence reading of fluorescently labeled reactions
• Fluorescently labeled products are scanned by a laser as they pass a particular point
• Color picked up by detector
• Output sent directly to computer
In its simplest form, a sequence can be represented as a string of nucleotides preceded by a line holding a basic tag or identifier after a greater-than character “>”: the FASTA format
Definition line (commonly called “def line”)
>U54469.1
CGGTTGCTTGGGTTTTATAACATCAGTCAGTGACAGGCATTTCCAGAGTTGCCCTGTTCAACAATCGATA
GCTGCCTTTGGCCACCAAAATCCCAAACTTAATTAAAGAATTAAATAATTCGAATAATAATTAAGCCCAG
TAACCTACGCAGCTTGAGTGCGTAACCGATATCTAGTATACATTTCGATACATCGAAATCATGGTAGTGT
TGGAGACGGAGAAGGTAAGACGATGATAGACGGCGAGCCGCATGGGTTCGATTTGCGCTGAGCCGTGGCA
GGGAACAACAAAAACAGGGTTGTTGCACAAGAGGGGAGGCGATAGTCGAGCGGAAAAGAGTGCAGTTGGC
GTGGCTACATCATCATTGTGTTCACCGATTATTTTTTGCACAATTGCTTAATATTAATTGTACTTGCACG
CTATTGTCTACGTCATAGCTATCGCTCATCTCTGTCTGTCTCTATCAAGCTATCTCTCTTTCGCGGTCAC
TCGTTCTCTTTTTTCTCTCCTTTCGCATTTGCATACGCATACCACACGTTTTCAGTGTTCTCGCTCTCTC
TCTCTTGTCAAGACATCGCGCGCGTGTGTGTGGGTGTGTCTCTAGCACATATACATAAATAGGAGAGCGG
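A minimal sketch of reading this format in Python, assuming the simplest case shown here: a definition line starting with ">" followed by one or more sequence lines. The function name `parse_fasta` is hypothetical, and the example holds only the first two sequence lines of the record above.

```python
def parse_fasta(text):
    """Return a list of (identifier, sequence) pairs from FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new definition line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>U54469.1
CGGTTGCTTGGGTTTTATAACATCAGTCAGTGACAGGCATTTCCAGAGTTGCCCTGTTCAACAATCGATA
GCTGCCTTTGGCCACCAAAATCCCAAACTTAATTAAAGAATTAAATAATTCGAATAATAATTAAGCCCAG
"""
for identifier, sequence in parse_fasta(example):
    print(identifier, len(sequence))
```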
More information can be added to the FASTA definition line
>gb|U54469.1|DMU54469 Drosophila melanogaster eukaryotic initiation factor 4E (eIF4E) gene.
Fields: GenBank | Accession.version | Locus | Description
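The fields of such a definition line can be pulled apart with a few string operations. The sketch below assumes exactly the layout shown above (three "|"-separated fields, then a space and the description); other definition-line layouts exist and are not handled. The name `parse_defline` is made up for illustration.

```python
def parse_defline(defline):
    """Split a 'gb|accession.version|locus description' definition line."""
    if defline.startswith(">"):
        defline = defline[1:]
    ids, _, description = defline.partition(" ")
    database, accession_version, locus = ids.split("|")
    accession, _, version = accession_version.partition(".")
    return {
        "database": database,
        "accession": accession,
        "version": int(version) if version else None,
        "locus": locus,
        "description": description.strip(),
    }


fields = parse_defline(
    ">gb|U54469.1|DMU54469 Drosophila melanogaster "
    "eukaryotic initiation factor 4E (eIF4E) gene."
)
for name, value in fields.items():
    print(f"{name}: {value}")
```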
Editing Errors
Capillary electrophoresis
• Newer automated sequencers use very thin capillary tubes
• Run all four fluorescently tagged reactions in the same capillary
• Can have 96 capillaries running at the same time
Figure labels: 96-well plate; robotic arm and syringe; 96 glass capillaries; load bar
New sequencing technologies
454 Life Sciences: FLX Titanium: 1,000,000 reads × 400 bp = 400,000,000 high-quality bp in a 10 h run
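The throughput figure is a simple product, checked here in Python. The three input numbers come from the slide; the per-hour rate is derived from them and is not a vendor specification.

```python
reads_per_run = 1_000_000   # reads per run, from the slide
read_length_bp = 400        # bases per read, from the slide
run_hours = 10              # run time, from the slide

bases_per_run = reads_per_run * read_length_bp
bases_per_hour = bases_per_run // run_hours  # derived, not from the slide

print(f"{bases_per_run:,} bp per run")
print(f"{bases_per_hour:,} bp per hour")
```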
http://www.454.com/products-solutions/how-it-works/index.asp
Present and Future of sequencing
• Sequencing costs
– Dropping each year
• Opens the possibility of sequencing genomes of individuals
• Greatly facilitates comparative genomics
Figure labels: DNA, RNA, protein, phenotype; genomic DNA databases; cDNA, ESTs, UniGene; protein sequence databases
There are three major public DNA databases
• GenBank: housed at NCBI (National Center for Biotechnology Information)
• EMBL: housed at EBI (European Bioinformatics Institute)
• DDBJ: housed in Japan
Taxonomy at NCBI: >200,000 species are represented in GenBank
http://www.ncbi.nlm.nih.gov/Taxonomy/txstat.cgi
The most sequenced organisms in GenBank
Homo sapiens 13.1 billion bases
Mus musculus 8.4b
Rattus norvegicus 6.1b
Bos taurus 5.2b
Zea mays 4.6b
Sus scrofa 3.6b
Danio rerio 3.0b
Oryza sativa (japonica) 1.5b
Strongylocentrotus purpuratus 1.4b
Nicotiana tabacum 1.1b
GenBank release 168.0
Sequence databases
• Examples of sequence databases
– Primary databases (archival)
• GenBank
• EMBL (European Molecular Biology Laboratory)
• DDBJ (DNA Data Bank of Japan)
– Secondary databases (curated)
• RefSeq
• EMBL Genome Reviews
• Protein databases
• TPA (Third Party Annotation)
• What is a database?
– An indexed set of records
– Records retrieved using a query language
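The idea of "an indexed set of records" can be illustrated with a Python dictionary. This is only a sketch of the concept: real sequence databases use dedicated storage and query engines, and the record fields and the `query` function here are chosen for illustration.

```python
records = [
    {"accession": "U54469", "description": "Drosophila melanogaster eIF4E gene"},
    {"accession": "X02775", "description": "retinol-binding protein genomic DNA"},
]

# The index maps a key (the accession) to its record, so a lookup does not
# have to scan every record.
index = {record["accession"]: record for record in records}


def query(accession):
    """Return the record for an accession, or None if it is not indexed."""
    return index.get(accession)


print(query("U54469"))
print(query("ZZ000000"))
```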
Figure labels: biologist query; web browser; web server; BLAST search engine; database
The client–server model has made access to sequence databases fast and easy
Data flow of submissions between primary databases (Chp. 1)
• GenBank
– NCBI (National Center for Biotechnology Information), NIH (National Institutes of Health, USA)
– Submissions and updates (Sequin)
– Entrez
• EMBL
– EBI (European Bioinformatics Institute), EMBL (European Molecular Biology Laboratory)
– Submissions and updates
– Ensembl
• DDBJ (DNA Data Bank of Japan)
– CIB (Center for Information Biology), NIG (National Institute of Genetics, Japan)
– Submissions and updates (Sakura)
– getentry
International Nucleotide Sequence Database Collaboration: updated every 24 h
http://getentry.ddbj.nig.ac.jp/getstart-e.html
Integrated information retrieval system: it is an interface, not a database
http://www.ensembl.org/index.html
Nucleotide sequence flatfiles
Header: Locus (10 characters, 1st letter, arbitrary name, not useful), Length, Molecule type (DNA or RNA), Division code (includes functional divisions such as expressed sequence tags, sequence-tagged sites, whole-genome sequences, etc.), Last release date (EMBL).
Definition: Summary of biological information
Accession number: Primary key to reference a record in the database. Used in publications (1 letter + 5 digits or 2 letters + 6 digits).
Version: The version is increased every time the sequence is updated. Each version has a different GI (GenInfo identifier), which is specific to GenBank and not very useful
Keywords: abandoned because of the absence of a controlled vocabulary.
Source / Organism: taxonomic information, top-down.
Reference: submission credit and published paper.
Feature Table: direct representation of the biological information in the record:
– Source: organism, chromosome, map
– Gene: may include regulatory sequences
– mRNA: from start of transcription to polyadenylation signal
– CDS (coding sequence): from start codon to stop codon; provides exon coordinates. Every CDS is assigned a protein_id.version (3 letters + 5 digits)
Sequence: actual sequence, ending in //
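Two of the conventions above lend themselves to a short Python sketch: records end with a "//" line, and nucleotide accessions follow the 1+5 or 2+6 pattern. The record below is hand-written and heavily abbreviated for illustration, and the helper names `split_records` and `field` are hypothetical. Accession formats other than the two described here are deliberately not covered.

```python
import re

# Nucleotide accessions as described above:
# 1 letter + 5 digits, or 2 letters + 6 digits.
ACCESSION_RE = re.compile(r"^(?:[A-Z]\d{5}|[A-Z]{2}\d{6})$")


def split_records(flatfile_text):
    """Split flatfile text into records, each terminated by a '//' line."""
    records, current = [], []
    for line in flatfile_text.splitlines():
        if line.strip() == "//":
            records.append(current)
            current = []
        else:
            current.append(line)
    return records


def field(record_lines, keyword):
    """Return the text after a top-level keyword such as ACCESSION, or None."""
    for line in record_lines:
        if line.startswith(keyword):
            return line[len(keyword):].strip()
    return None


# Abbreviated, hand-written record for illustration only.
text = """LOCUS       DMU54469
DEFINITION  Drosophila melanogaster eukaryotic initiation factor 4E (eIF4E) gene.
ACCESSION   U54469
VERSION     U54469.1
ORIGIN
        1 cggttgcttg ggttttataa
//
"""
for record in split_records(text):
    accession = field(record, "ACCESSION")
    print(accession, bool(ACCESSION_RE.match(accession)))
```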
Secondary databases
RefSeq
– Comprehensive, integrated, and non-redundant set of sequences
– Genomic DNA, RNA, and protein explicitly linked
– 2_6 format (the underscore is never present in GenBank accessions):
• Experimental [Predicted]
• NT_123456 (genomic contig)
• NM_123456 (mRNAs) [XM_123456 (model mRNAs)]
• NP_123456 (proteins) [XP_123456 (model proteins)]
– Undergoes continuous curation: most up-to-date sequence
– Each RefSeq is a synthesis of information, not a piece of primary research: equivalent to a “review article”
– Message for
• Removed
• Secondary
Secondary databases
Third Party Annotation (TPA)
– Includes
• Reannotations
• Combinations of novel and existing primary entries
• Annotations of trace archives
• Whole-genome shotgun data
– Provides
• GenBank accession.version numbers and nucleotide locations for all primary entries to which the TPA sequence relates
EMBL Genome Reviews
– Includes
• Added information from UniProt Knowledgebase, Gene Ontology Annotation, InterPro, and others
• Curated versions of entries representing complete genomes
• Standardized annotations
Protein databases
• GenPept: translations of all CDS. Not curated
• UniProt (Swiss-Prot/TrEMBL/PIR-PSD)
– UniParc: most comprehensive, public nonredundant protein database
• Swiss-Prot (manual)/TrEMBL (computer)/PIR-PSD
• GenBank, patents, International Protein Index (IPI)
• Protein Data Bank
– UniProt Knowledgebase: curated subset of UniParc
• Function, post-translational modifications
• Domains, catalytic sites
• Structures, associated diseases
• Pathways, etc.
– UniRef: UniProt nonredundant reference database: 95%, 90% and 50% sets
• Functional groups
– Pfam
– Prosite
– InterPro
Merge 95% =
Merge 90% =
Merge 50% =
All predicted coding regions
http://www.ebi.ac.uk/blast2/index.html?UniProt
www.ncbi.nlm.nih.gov
PubMed is…
• National Library of Medicine's search service
• 19 million citations in MEDLINE
• links to participating online journals
• PubMed tutorial
Entrez integrates…
• the scientific literature
• DNA and protein sequence databases
• 3D protein structure data
• population study data sets
• assemblies of complete genomes
Entrez is a search and retrieval system that integrates NCBI databases
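Entrez can also be queried programmatically. The sketch below only builds a request URL for NCBI's E-utilities `efetch` endpoint; it does not send it. The endpoint and the `db`, `id`, `rettype`, and `retmode` parameters belong to E-utilities and are not mentioned in the slides, and the helper name `efetch_url` is made up for illustration.

```python
from urllib.parse import urlencode

# NCBI E-utilities efetch endpoint (an addition, not part of the slides).
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession, db="nuccore", rettype="fasta"):
    """Build (but do not send) an efetch request for one accession."""
    params = {"db": db, "id": accession, "rettype": rettype, "retmode": "text"}
    return f"{EFETCH}?{urlencode(params)}"


print(efetch_url("NM_000518"))
```

Opening the printed URL in a browser, or fetching it with any HTTP client, should return the record in FASTA format.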
BLAST is…
• Basic Local Alignment Search Tool
• NCBI's sequence similarity search tool
• supports analysis of DNA and protein databases
• 100,000 searches per day
OMIM is…
• Online Mendelian Inheritance in Man
• catalog of human genes and genetic disorders
• created by Dr. Victor McKusick; led by Dr. Ada Hamosh at JHMI
Bookshelf is…
• searchable resource of on-line books
Taxonomy Browser is…
• browser for the major divisions of living organisms (archaea, bacteria, eukaryota, viruses)
• taxonomy information such as genetic codes
• molecular data on extinct organisms
• practically useful to find a protein or gene from a species
Structure site includes…
• Molecular Modeling Database (MMDB)
• biopolymer structures obtained from the Protein Data Bank (PDB)
• Cn3D (a 3D-structure viewer)
• vector alignment search tool (VAST)
Accession numbers are labels for sequences
NCBI includes databases (such as GenBank) that contain information on DNA, RNA, or protein sequences.
You may want to acquire information beginning with a query such as the name of a protein of interest, or the raw nucleotides comprising a DNA sequence of interest.
DNA sequences and other molecular data are tagged with accession numbers that are used to identify a sequence or other record relevant to molecular data.
Page 26
What is an accession number?
An accession number is a label used to identify a sequence. It is a string of letters and/or numbers that corresponds to a molecular sequence.
Examples (all for retinol-binding protein, RBP4):
X02775 GenBank genomic DNA sequence
NT_030059 Genomic contig
rs7079946 dbSNP (single nucleotide polymorphism)
N91759.1 An expressed sequence tag (1 of 170)
NM_006744 RefSeq DNA sequence (from a transcript)
NP_006735 RefSeq protein
AAC02945 GenBank protein
Q28369 SwissProt protein
1KT7 Protein Data Bank structure record
These examples span DNA, RNA, and protein records.
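The shape of an accession often hints at its source database. The Python sketch below guesses the source of the example accessions from their format. The patterns are simplified heuristics written for these examples, not official specifications, and the order of the list matters because some formats overlap (a Swiss-Prot accession such as Q28369 also looks like 1 letter + 5 digits).

```python
import re

# Checked in order; the first full match wins. Simplified, heuristic patterns.
PATTERNS = [
    ("RefSeq", re.compile(r"[A-Z]{2}_\d+(\.\d+)?")),
    ("dbSNP", re.compile(r"rs\d+")),
    ("PDB structure", re.compile(r"\d[A-Za-z0-9]{3}")),
    ("Swiss-Prot", re.compile(r"[OPQ]\d[A-Z0-9]{3}\d")),
    ("GenBank nucleotide", re.compile(r"([A-Z]\d{5}|[A-Z]{2}\d{6})(\.\d+)?")),
    ("GenBank protein", re.compile(r"[A-Z]{3}\d{5}(\.\d+)?")),
]


def classify(accession):
    """Guess the source database of an accession from its format."""
    for label, pattern in PATTERNS:
        if pattern.fullmatch(accession):
            return label
    return "unknown"


for acc in ["X02775", "NT_030059", "rs7079946", "N91759.1",
            "NM_006744", "AAC02945", "Q28369", "1KT7"]:
    print(acc, "->", classify(acc))
```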
Page 27
NCBI’s important RefSeq project: best representative sequences
RefSeq (accessible via the main page of NCBI) provides an expertly curated accession number that corresponds to the most stable, agreed-upon “reference” version of a sequence.
RefSeq identifiers include the following formats:
Complete genome NC_######
Complete chromosome NC_######
Genomic contig NT_######
mRNA (DNA format) NM_###### e.g. NM_006744
Protein NP_###### e.g. NP_006735
Page 27
Accession        Molecule  Method           Note
AC_123456        Genomic   Mixed            Alternate complete genomic
AP_123456        Protein   Mixed            Protein products; alternate
NC_123456        Genomic   Mixed            Complete genomic molecules
NG_123456        Genomic   Mixed            Incomplete genomic regions
NM_123456        mRNA      Mixed            Transcript products; mRNA
NM_123456789     mRNA      Mixed            Transcript products; 9-digit
NP_123456        Protein   Mixed            Protein products
NP_123456789     Protein   Curation         Protein products; 9-digit
NR_123456        RNA       Mixed            Non-coding transcripts
NT_123456        Genomic   Automated        Genomic assemblies
NW_123456        Genomic   Automated        Genomic assemblies
NZ_ABCD12345678  Genomic   Automated        Whole genome shotgun data
XM_123456        mRNA      Automated        Transcript products
XP_123456        Protein   Automated        Protein products
XR_123456        RNA       Automated        Transcript products
YP_123456        Protein   Auto. & Curated  Protein products
ZP_12345678      Protein   Automated        Protein products
NCBI’s RefSeq project: accession for genomic, mRNA, protein sequences
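The two-letter prefix alone identifies the molecule type, so the table reduces to a lookup. In this Python sketch the mapping is copied from the "Molecule" column of the table; the names `REFSEQ_MOLECULE` and `refseq_molecule` are hypothetical.

```python
# Prefix -> molecule type, taken from the "Molecule" column of the table.
REFSEQ_MOLECULE = {
    "AC": "Genomic", "NC": "Genomic", "NG": "Genomic",
    "NT": "Genomic", "NW": "Genomic", "NZ": "Genomic",
    "NM": "mRNA", "XM": "mRNA",
    "NR": "RNA", "XR": "RNA",
    "AP": "Protein", "NP": "Protein", "XP": "Protein",
    "YP": "Protein", "ZP": "Protein",
}


def refseq_molecule(accession):
    """Return the molecule type for a RefSeq accession, or None."""
    prefix, underscore, _ = accession.partition("_")
    if not underscore:
        return None  # no underscore, so not a RefSeq-style accession
    return REFSEQ_MOLECULE.get(prefix)


for acc in ["NM_000518", "NP_000509", "NT_030059", "X02775"]:
    print(acc, "->", refseq_molecule(acc))
```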
Access to sequences: Entrez Gene at NCBI
Entrez Gene is a great starting point: it collects key information on each gene/protein from major databases. It covers all major organisms.
RefSeq provides a curated, optimal accession number for each DNA (NM_000518 for beta globin DNA corresponding to mRNA) or protein (NP_000509)
Page 29
From the NCBI home page, type “beta globin” and hit “Search”
Follow the link to “Gene”
Entrez Gene is in the header
Note the “Official Symbol” HBB for beta globin
Note the “limits” option
By applying limits, there are now far fewer entries
Entrez Gene (top of page)
Note that links to many other HBB database entries are available
Entrez Gene (middle of page): genomic region, bibliography
Entrez Gene (middle of page, continued): phenotypes, function
Entrez Gene (bottom of page): RefSeq accession numbers
Entrez Protein: accession, organism, literature…
Entrez Protein: …features of a protein, and its sequence in the one-letter amino acid code
FASTA format: versatile, compact, with one header line followed by a string of nucleotides or amino acids in the single-letter code
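Writing a record in this format takes only a few lines of Python. The sketch below wraps the sequence at 70 characters per line, matching the U54469.1 example earlier; that width is a convention rather than a requirement. The function name `format_fasta` and the short peptide are arbitrary and made up for illustration.

```python
def format_fasta(identifier, sequence, width=70):
    """Return one FASTA record: a '>' header line, then wrapped sequence lines."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([f">{identifier}"] + lines) + "\n"


# Arbitrary peptide in the one-letter amino acid code, for illustration only.
print(format_fasta("example_peptide", "MKTAYIAKQR" * 10), end="")
```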