Introduction to Databases
Shifra Ben-Dor
Irit Orr
Lecture Outline
• Introduction
– Data and Database types
– Database components
• Data Formats
• Sample databases
• How to text search databases
What “units of information” do we deal
with in bioinformatics?
• DNA
• RNA
• Protein
• Sequence
• Structure
• Evolution
• Pathways
• Interactions
• Mutations
AAGTGCCACTGCATAAATGACCATGAGTGGGCACCGGTAAGGGAGGGTGATGCTATCTGGTCTGAAG
Nucleotide sequence
Genes
mRNA
Protein primary sequence
Protein 3D structure
Protein function
Acts as a tumor suppressor in
many tumor types. Induces growth
arrest or apoptosis depending on the
physiological circumstances or cell
type, but both activities are
involved in tumor suppression.
Involved in the transport of
chloride ions. Defects in CFTR
are the cause of cystic fibrosis.
It is the most common genetic disease
in the Caucasian population, with a
prevalence of about 1 in 2000 live
births. CF, an autosomal recessive
disorder, is a common generalized
disorder of exocrine gland function.
SNPs
Slide provided by Dr. Vered Caspi
• What do we want from databases?
All of these have databases and tools
that were created to work with them
Information retrieval from
sequence databases
Biological databases contain enormous amounts of data.
• Databases need to be well annotated.
• Databases need to be easily searched.
• Data found in databases should be easily retrieved.
• Data in databases should be in standard formats.
Integrated Information Retrieval
• Many databases contain logical relations between specific entries.
• One interface can connect many biological databases.
• For example: a database that connects protein sequence, protein domain, protein structure and reference databases (InterPro).
• Another example: a connection between reference, protein sequence, DNA sequence and structure databases (Entrez).
Slide provided by Dr. Vered Caspi
Core Data and Annotation
Databases generally have (at least) two types of
data:
Core data: The data the database was generated
to organize
Annotation: Extra information that rounds out
our picture of the core data
For example, in a genome database, the sequence
is the core data, and the location of genes is the
annotation.
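The split between core data and annotation can be pictured as a small data structure. The sketch below is only an illustration: the class names, the gene name "geneX" and its coordinates are hypothetical and do not come from any real database.

```python
from dataclasses import dataclass, field


@dataclass
class GeneAnnotation:
    """Annotation: where a gene lies on the sequence (1-based, inclusive)."""
    name: str
    start: int
    end: int


@dataclass
class GenomeEntry:
    """One hypothetical genome database entry."""
    sequence: str                               # core data
    genes: list = field(default_factory=list)   # annotation

    def gene_sequence(self, name):
        """Use the annotation to pull a gene out of the core data."""
        for gene in self.genes:
            if gene.name == name:
                return self.sequence[gene.start - 1:gene.end]
        raise KeyError(name)


# Made-up example: a short sequence with one annotated region.
entry = GenomeEntry("AAGTGCCACTGCATAAATGACC", [GeneAnnotation("geneX", 4, 9)])
print(entry.gene_sequence("geneX"))
```

The sequence can stand on its own, while the annotation only makes sense as a pointer into it.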
Database Issues
• Printed journals vs. databases
• Direct submission to databases (e.g.
GenBank, GDB, PDB)
• Archival vs. curated databases
• Databases that publish experimental
results of large genomic centers.
• Public vs. private databases.
For Example: Classification of Genomic Databases
Database scope:
• Many genomes
• One genome
• One subject
• One gene
Information source:
• Direct submission from scientific community
• Scientific literature
• Genome center’s experimental results
• Other databases
Information type:
• Mapping
• Sequence & annotation
• Protein structure & function
• Variations
• Comparative genomics
• Gene networks
Slide provided by Dr. Vered Caspi
User Interface
• Database search
– free text
– field-specific
– sequence-based
• Database output
– text
– graphics
– dynamic
Data Formats
There are many data formats used for
sequences (both nucleic and amino acid)
• Fasta Format
• GenBank Format
• EMBL Format
• GCG Format
Fasta Format
• Simplest format
• Least information
• Starts with a > and sequence name
on one line
• The sequence in plain text follows
>OB2T2
GTGACAACATGTACAGCTGTGAGCGGTGTAAGAAGCTGCGGAACGGAGTGAAGTACTGCA
AAGTCCTGCGGTTGCCCGAGATCCTGTGCATTCACCTAAAGCGCTTTCGGCACGAGGTGA
TGTACTCATTCAAGATCAACAGCCACGTCTCCTTGCCCTCGAGGGGCTCGACCTGCGCCC
CTTCCTTGCCAAGGAGTGCACATCCCAGATCACCACCTACGACCTCCTCTCGGTCATCTG
CCACCACGGCACGGCAGGCA
>TNRC_HUMAN P36941 (tumor necrosis factor c receptor)
MLLPWATSAPGLAWGPLVLGLFGLLAASQPQAVPPYASENQTCRDQEKEYYEPQHRICCS
RCPPGTYVSAKCSRIRDTVCATCAENSYNEHWNYLTICQLCRPCDPVMGLEEIAPCTSKR
KTQCRCQPGMFCAAWALECTHCELLSDCPPGTEAELKDEVGKGNNHCVPCKAGHFQNTSS
PSARCQPHTRCENQGLVEAAPGTAQSDTTCKNPLEPLPPEMSGTMLMLAVLLPLAFFLLL
ATVFSCIWKSHPSLCRKLGSLLKRRPQGEGPNPVAGSWEPPKAHPYFPDLVQPLLPISGD
VSPVSTGLPAAPVLEAGVPQQQSPLDLTREPQLEPGEQSQVAHGTNGIHVTGGSMTITGN
IYIYNGPVLGGPPGPGDLPATPEPPYPIPEEGDPGPPGLSTPHQEDGKAWHLAETEHCGA
TPSNRGPRNQFITHD
>TNRC_MOUSE P50284 lymphotoxin-beta receptor precursor
MRLPRASSPCGLAWGPLLLGLSGLLVASQPQLVPPYRIENQTCWDQDKEYYEPMHDVCCS
RCPPGEFVFAVCSRSQDTVCKTCPHNSYNEHWNHLSTCQLCRPCDIVLGFEEVAPCTSDR
KAECRCQPGMSCVYLDNECVHCEEERLVLCQPGTEAEVTDEIMDTDVNCVPCKPGHFQNT
SSPRARCQPHTRCEIQGLVEAAPGTSYSDTICKNPPEPGAMLLLAILLSLVLFLLFTTVL
ACAWMRHPSLCRKLGTLLKRHPEGEESPPCPAPRADPHFPDLAEPLLPMSGDLSPSPAGP
PTAPSLEEVVLQQQSPLVQARELEAEPGEHGQVAHGANGIHVTGGSVTVTGNIYIYNGPV
LGGTRGPGDPPAPPEPPYPTPEEGAPGPSELSTPYQEDGKAWHLAETETLGCQDL
>TNR1_RAT P22934 tumor necrosis factor receptor 1 precursor (p60)
MGLPIVPGLLLSLVLLALLMGIHPSGVTGLVPSLGDREKRDNLCPQGKYAHPKNNSICCT
KCHKGTYLVSDCPSPGQETVCEVCDKGTFTASQNHVRQCLSCKTCRKEMFQVEISPCKAD
MDTVCGCKKNQFQRYLSETHFQCVDCSPCFNGTVTIPCKEKQNTVCNCHAGFFLSGNECT
PCSHCKKNQECMKLCLPPVANVTNPQDSGTAVLLPLVIFLGLCLLFFICISLLCRYPQWR
PRVYSIICRDSAPVKEVEGEGIVTKPLTPASIPAFSPNPGFNPTLGFSTTPRFSHPVSST
PISPVFGPSNWHNFVPPVREVVPTQGADPLLYGSLNPVPIPAPVRKWEDVVAAQPQRLDT
ADPAMLYAVVDGVPPTRWKEFMRLLGLSEHEIERLELQNGRCLREAHYSMLEAWRRRTPR
HEATLDVVGRVLCDMNLRGCLENIRETLESPAHSSTTHLPR
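Because the format is just a ">" header line followed by sequence lines, it is easy to read in a few lines of code. The following is a minimal sketch, not a full parser: the function name parse_fasta is our own, and the example records are shortened fragments of the entries shown above.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            if header is not None:        # close the previous record
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)           # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Shortened fragments of the records shown above (not the full sequences).
example = """>OB2T2
GTGACAACATGTACAGCTGT
CCACCACGGCACGGCAGGCA
>TNRC_HUMAN P36941 (tumor necrosis factor c receptor)
MLLPWATSAPGLAWGPLVLG
"""
for header, seq in parse_fasta(example):
    name = header.split()[0]              # first word is the sequence name
    print(name, len(seq))
```

Note that the header carries only free text, so anything beyond the sequence name (accession, description) has to be guessed from conventions rather than read from defined fields.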
GenBank sequence format
NM_000394. Homo sapiens crys...[gi:14043059]
LOCUS       NM_000394    1114 bp    mRNA            PRI       15-MAY-2001
DEFINITION  Homo sapiens crystallin, alpha A (CRYAA), mRNA.
ACCESSION   NM_000394
VERSION     NM_000394.2  GI:14043059
KEYWORDS    .
SOURCE      human.
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 1114)
  AUTHORS   Jaworski,C.J. and Piatigorsky,J.
  TITLE     A pseudo-exon in the functional human alpha A-crystallin gene
  JOURNAL   Nature 337 (6209), 752-754 (1989)
  MEDLINE   89143747
   PUBMED   2918909
REFERENCE   2  (bases 1 to 1114)
  AUTHORS   Jaworski,C.J.
  TITLE     A reassessment of mammalian alpha A-crystallin sequences using
            DNA sequencing: implications for anthropoid affinities of tarsier
FEATURES             Location/Qualifiers
     source          1..1114
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
                     /chromosome="21"
                     /map="21q22.3"
     gene            1..1114
                     /gene="CRYAA"
                     /note="CRYA1"
                     /db_xref="LocusID:1409"
                     /db_xref="MIM:123580"
     misc_feature    70..234
                     /note="crystallin; Region: Alpha crystallin A chain"
     CDS             70..591
                     /gene="CRYAA"
                     /note="human alphaA-crystallin; crystallin, alpha-1"
                     /codon_start=1
                     /db_xref="LocusID:1409"
                     /db_xref="MIM:123580"
                     /product="crystallin, alpha A"
                     /protein_id="NP_000385.1"
                     /db_xref="GI:4503055"
                     /translation="MDVTIQHPWFKRTLGPFYPSRLFDQFFGEGLFEYDLLPFL
                     SSTISPYYRQSLFRTVLDSGISEVRSDRDKFVIFLDVKHFSP
                     EDLTVKVQDDFVEIHGKHNERQDDHGYISREFHRRYRLPS
                     NVDQSALSCSLSADGMLTFCGPKIQTGLDATHAERAIPVSR
                     EEKPTSAPSS"
     misc_feature    244..555
                     /note="HSP20; Region: Hsp20/alpha crystallin family"
     polyA_signal    1092..1097
BASE COUNT      183 a    400 c    309 g    222 t
ORIGIN
        1 acactgcgct gcccagaggc cccgctgact cctgccagcc tccaggtccc cgtggtacca
       61 aagctgaaca tggacgtgac catccagcac ccctggttca agcgcaccct ggggcccttc
      121 taccccagcc ggctgttcga ccagtttttc ggcgagggcc tttttgagta tgacctgctg
      181 cccttcctgt cgtccaccat cagcccctac taccgccagt ccctcttccg caccgtgctg
      241 gactccggca tctctgaggt tcgatccgac cgggacaagt tcgtcatctt cctcgatgtg
      301 aagcacttct ccccggagga cctcaccgtg aaggtgcagg acgactttgt ggagatccac
      361 ggaaagcaca acgagcgcca ggacgaccac ggctacattt cccgtgagtt ccaccgccgc
      421 taccgcctgc cgtccaacgt ggaccagtcg gccctctctt gctccctgtc tgccgatggc
      481 atgctgacct tctgtggccc caagatccag actggcctgg atgccaccca cgccgagcga
      541 gccatccccg tgtcgcggga ggagaagccc acctcggctc cctcgtccta agcaggcatt
      601 gcctcggctg gctcccctgc agccctggcc catcatgggg ggagcaccct gagggcgggg
      661 tgtctgtctt cctttgcttc ccttttttcc tttccacctt ctcacatgga atgagggttt
      721 gagagagcag ccaggagagc ttagggtctc agggtgtccc agaccccgac accggccagt
      781 ggcggaagtg accgcacctc acactccttt agatagcagc ctggctcccc tggggtgcag
      841 gcgcctcaac tctgctgagg gtccagaagg agggggtgac ctccggccag gtgcctcctg
      901 acacacctgc agcctccctc cgcggcgggc cctgcccaca cctcctgggg cgcgtgaggc
      961 ccgtggggcc ggggcttctg tgcacctggg ctctcgcggc ctcttctctc agaccgtctt
     1021 cctccaaccc ctctatgtag tgccgctctt ggggacatgg gtcgcccatg agagcgcagc
     1081 ccgcggcaat caataaacag caggtgatac aagc
//
Revised: October 24, 2001.
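The BASE COUNT line can be checked against the sequence itself: its four numbers should add up to the length given in the LOCUS line (here 183 + 400 + 309 + 222 = 1114). The sketch below shows one way to recount bases from the numbered sequence lines. The function names are our own, the input is assumed to contain only the numbered sequence lines (without the ORIGIN keyword or the closing //), and only the first 80 bases of the record are used.

```python
from collections import Counter


def origin_sequence(origin_lines):
    """Strip position numbers and spaces from GenBank sequence lines.

    Assumes the text holds only the numbered sequence lines, not the
    ORIGIN keyword itself or the terminating // line.
    """
    return "".join(ch for ch in origin_lines if ch.isalpha())


def base_count(sequence):
    """Count each base, as summarised in the BASE COUNT line."""
    return Counter(sequence.lower())


# First 80 bases of the record above (shortened for illustration).
block = """
        1 acactgcgct gcccagaggc cccgctgact cctgccagcc tccaggtccc cgtggtacca
       61 aagctgaaca tggacgtgac
"""
seq = origin_sequence(block)
counts = base_count(seq)
print(len(seq), sum(counts.values()))
```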
EMBL sequence format
ID   AJ279484 standard; DNA; FUN; 581 BP.
XX
AC   AJ279484;
XX
SV   AJ279484.1
XX
DT   14-JAN-2000 (Rel. 62, Created)
DT   14-JAN-2000 (Rel. 62, Last updated, Version 2)
XX
DE   Unidentified ascomycota sp. 4/97-9 5.8S rRNA gene and ITS 1 and 2
XX
KW   5.8S ribosomal RNA; 5.8S rRNA gene; internal transcribed spacer 1;
KW   internal transcribed spacer 2; ITS1; ITS2.
XX
OS   ascomycota sp. 4/97-9
OC   Eukaryota; Fungi; Ascomycota.
XX
RN   [1]
RP   1-581
RA   Wirsel S.G.R.;
RT   ;
RL   Submitted (21-DEC-1999) to the EMBL/GenBank/DDBJ databases.
RL   Wirsel S.G.R., Fakultaet fuer Biologie, Universitaet Konstanz,
RL   Universitaetsstr. 10, Konstanz 78434, Germany.
XX
RN   [2]
RA   Wirsel S.G.R., Leibinger W., Mendgen K.W.;
RT   "Genetic diversity of fungi associated with common reed (Phragmites
RT   australis)";
RL   Unpublished.
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..581
FT                   /db_xref="taxon:112223"
FT                   /organism="ascomycota sp. 4/97-9"
FT                   /isolate="4/97-9"
FT   misc_feature    64..226
FT                   /note="internal transcribed spacer 1, ITS1"
FT   rRNA            227..385
FT                   /gene="5.8S rRNA"
FT                   /product="5.8S ribosomal RNA"
FT   misc_feature    386..529
FT                   /note="internal transcribed spacer 2, ITS2"
XX
SQ   Sequence 581 BP; 132 A; 164 C; 145 G; 140 T; 0 other;
     ccatttagag gaagtaaaag tcgtaacaag gtctccgttg gtgaaccagggagggatc         60
     ttacgagagt gtcaccactc ccaacccact gtttacctac ccgtccaccg tgcttcggca       120
     ggcagtcctg tgggacaggg cctcgccccc ctccgggggg tgcctgccgc
EMBL entry
• Each line in the entry begins with a two-character line code, which
indicates the type of information contained in the line.
• The currently used line types, along with their respective line codes, are
listed below:
• ID - identification (begins each entry; 1 per entry)
• AC - accession number (>=1 per entry)
• SV - sequence version (1 per entry)
• DT - date (2 per entry)
• DE - description (>=1 per entry)
• KW - keyword (>=1 per entry)
• OS - organism species (>=1 per entry)
• OC - organism classification (>=1 per entry)
• OG - organelle (0 or 1 per entry)
• RN - reference number (>=1 per entry)
• RC - reference comment (>=0 per entry)
• RP - reference positions (>=1 per entry)
• RX - reference cross-reference (>=0 per entry)
• RA - reference author(s) (>=1 per entry)
• RT - reference title (>=1 per entry)
• RL - reference location (>=1 per entry)
• DR - database cross-reference (>=0 per entry)
• FH - feature table header (0 or 2 per entry)
• FT - feature table data (>=0 per entry)
• CC - comments or notes (>=0 per entry)
• XX - spacer line (many per entry)
• SQ - sequence header (1 per entry)
• bb - (blanks) sequence data (>=1 per entry)
• // - termination line (ends each entry; 1 per entry)
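Since every line announces its own type in the first two characters, an entry can be sorted into fields without knowing anything else about its layout. The sketch below illustrates the idea; the function name and the "sequence" key are our own choices, and the example is a shortened extract of the entry shown earlier.

```python
from collections import defaultdict


def group_by_line_code(entry_text):
    """Map each two-character line code to the list of its line contents."""
    fields = defaultdict(list)
    for line in entry_text.splitlines():
        if line.startswith("//"):
            break                      # termination line ends the entry
        code, content = line[:2], line[2:].strip()
        if code == "XX" or not line.strip():
            continue                   # spacer lines carry no data
        if code == "  ":
            code = "sequence"          # blank line code marks sequence data
        fields[code].append(content)
    return dict(fields)


# Shortened extract of the EMBL entry shown earlier.
example_entry = """AC   AJ279484;
XX
SV   AJ279484.1
XX
OS   ascomycota sp. 4/97-9
OC   Eukaryota; Fungi; Ascomycota.
XX
RL   Submitted (21-DEC-1999) to the EMBL/GenBank/DDBJ databases.
RL   Unpublished.
//
"""
fields = group_by_line_code(example_entry)
print(fields["AC"], len(fields["RL"]))
```

Line codes that may occur more than once (for example RL or FT) simply collect several items in their list.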
GCG Format
• Has space for comments and space for data, separated by two dots (..)
• Can contain full sequence data like GenBank or EMBL
• Has a minimum of sequence name, length, date, type (nucleic or amino acid) and checksum
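The minimum fields listed above all sit on the single line that ends with the two dots, so they can be picked out with one pattern. The sketch below assumes the field order seen in the sample on this slide (name, Length, date, Type, Check); the function name is our own, and files written by other programs may lay the line out differently.

```python
import re


def parse_gcg_header(line):
    """Parse the GCG line that ends in '..' into a dict of header fields."""
    if not line.rstrip().endswith(".."):
        raise ValueError("GCG header line must end with two dots")
    match = re.search(
        r"^\s*(?P<name>\S+)\s+Length:\s*(?P<length>\d+)\s+(?P<date>.*?)\s+"
        r"Type:\s*(?P<type>\S+)\s+Check:\s*(?P<check>\d+)\s*\.\.\s*$",
        line,
    )
    if match is None:
        raise ValueError("unrecognised GCG header layout")
    fields = match.groupdict()
    fields["length"] = int(fields["length"])   # sequence length
    fields["check"] = int(fields["check"])     # checksum
    return fields


# Header line taken from the sample on this slide.
header = "5B3.seq Length: 744 March 18, 1999 10:43 Type: N Check: 2586 .."
print(parse_gcg_header(header))
```

Everything before this line is treated as comments, and everything after it as sequence data.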
!!NA_SEQUENCE 1.0
5B3.seq Length: 744 March 18, 1999 10:43 Type: N Check: 2586 ..
1 TCTAGAGGAG AYATYGTWAT GACCCAGTCT CCATCCTCCC TGAGTGTGTC
51 AGCAGGAGAG AAGGTCACTA TGAGCTGCAA GTCCAGTCAG AGTCTGTTAA
101 ACAGTAGAAA TCAAAAGAAC TACTTGGCCT GGTACCAGCA GAAACCAGGA
151 CAGCCTCCTA AACTTTTGAT CTACGGGGTA TTTATTAGGG ATTCTGGGGT
201 CCCTGATCGC TTCACAGGCA GTGGATCTGG AACCGATTTC ACTCTTACCA
251 TCAGCAGTGT GCAGGCTGAA GACCTGGCAG TTTATTACTG TCAGAATGAT
301 CATATTTATC CGTACACGTT CGGAGGGGGC ACWAAGCTGG AAATTAAAGG
351 GTCGACTTCC GGTAGCGGCA AATCCTCTGA AGGCAAAGGT SAGGTSCAGC
401 TGCAGGAGTC TGGACCTGGC CTGGTGAAGC CTTCCCAGTC TCTGTCCCTC
451 ACCTGCTCTG TCACTGGTTA CTCAATCACC AGTGGTTATG CCTGGAACTG
501 GATCCGGCAG TTTCCAGGAA ACAAACTGGA GTGGATGGGC TACATAAGCT
551 ACAGTGGTTT CACTAGCTAC AACCCATCTC TCAGAAGTCG AATCTCTTTC