Genome Analysis: Databases, Sequence Formats and Visualization Tools Lecture 2. Genome Analysis: Databases, Seq Formats & Visualization

Page 1

Lecture 2. Genome Analysis: Databases, Seq Formats & Visualization

Genome Analysis: Databases, Sequence Formats and Visualization Tools

Page 2

Objectives of today's lecture

• Understand the purpose and use of bioinformatics database resources such as GenBank, UniProt/Swiss-Prot, Entrez and Ensembl.
• Be able to recognize common database data formats, sequence features, and sequence and genome browsers.
• Know what kinds of tools are available to visualize sequence data.
• Appreciate the issues surrounding bioinformatic database updating.

Page 3

Biological Databases and Data Models

Page 4

Databases in general

http://www.oxfordjournals.org/nar/database/c/

Also check out the annual “web-software” issue of NAR every July.

Page 5

Databases

• Organized array of information
• On the WWW or local
• Place where you put things in, and (if all goes well!) you should be able to get them out again.
• Allows you to make discoveries.

Page 6

Useful Databases

• Primary (archival)
– GenBank/EMBL/DDBJ (seqs)
– PDB (protein structures)
– Medline (literature)
– IMEx databases (protein interactions)

• Secondary (curated)
– RefSeq (seqs)
– UniProt - SwissProt (seqs)
– Taxon (taxonomy)
– PROSITE (binding sites)
– OMIM (genetics literature/reviews)
– IMEx databases (protein interactions)

Page 7

Sequence Databases

• DNA
– NCBI: GenBank -> RefSeq (National Center for Biotechnology Information, www.ncbi.nlm.nih.gov)
– EBI: EMBL (European Bioinformatics Institute, www.ebi.ac.uk)

• Protein
– NCBI: GenPept
– EBI: UniProt: TrEMBL -> UniProt: Swiss-Prot (TrEMBL = “translated EMBL”)

Page 8

NCBI: GenBank -> RefSeq

Page 9

Further Readings!!!

http://www.ncbi.nlm.nih.gov/books/NBK21105/

Page 10

EBI: EMBL

Page 11

UniProt: Swiss-Prot, TrEMBL

Page 12

UniProt: Swiss-Prot: An example of curated, reviewed annotation

Incorporates:
• Function of the protein
• Subcellular localization of the protein
• Post-translational modification
• Domains and sites
• Secondary structure
• Quaternary structure
• Similarities to other proteins
• Diseases associated with deficiencies in the protein
• Sequence conflicts, variants, etc.

Page 13

INSDC - International Nucleotide Sequence Database Collaboration

[Figure: the three INSDC partner databases, each accepting submissions and updates and sharing records with the other two: GenBank (NCBI, NIH; searched with Entrez), EMBL (EBI; searched with SRS) and DDBJ (CIB, NIG - National Institute of Genetics; searched with getentry).]

Page 14

File Formats

Page 15

GenBank Flat File

The record has three parts:
• Header: title, taxonomy, citation
• Features (AA seq)
• DNA sequence

LOCUS       AF115338      591 bp    DNA     linear   BCT 19-AUG-1999
DEFINITION  Pseudomonas fluorescens ECF sigma factor SigX (sigX) gene, complete cds.
ACCESSION   AF115338
VERSION     AF115338.1  GI:4959391
KEYWORDS    .
SOURCE      Pseudomonas fluorescens.
  ORGANISM  Pseudomonas fluorescens
            Bacteria; Proteobacteria; gamma subdivision; Pseudomonadaceae;
            Pseudomonas.
REFERENCE   1  (bases 1 to 591)
  AUTHORS   Brinkman,F.S., Schoofs,G., Hancock,R.E. and De Mot,R.
  TITLE     Influence of a putative ECF sigma factor on expression of the major
            outer membrane protein, OprF, in Pseudomonas aeruginosa and
            Pseudomonas fluorescens
  JOURNAL   J. Bacteriol. 181 (16), 4746-4754 (1999)
  MEDLINE   99369842
   PUBMED   10438740
REFERENCE   2  (bases 1 to 591)
  AUTHORS   De Mot,R.
  TITLE     Direct Submission
  JOURNAL   Submitted (04-DEC-1998) F.A. Janssens Laboratory of Genetics,
            Applied Plant Sciences, K. Mercierlaan 92, Heverlee B-3001, Belgium
FEATURES             Location/Qualifiers
     source          1..591
                     /organism="Pseudomonas fluorescens"
                     /strain="M114"
                     /db_xref="taxon:294"
     gene            1..591
                     /gene="sigX"
     CDS             1..591
                     /gene="sigX"
                     /codon_start=1
                     /transl_table=11
                     /product="ECF sigma factor SigX"
                     /protein_id="AAD34329.1"
                     /db_xref="GI:4959392"
                     /translation="MNKAQTLSTRYDPRELSDEELVARSHTELFHVTRAYEELMRRYQ
                     RTLFNVCARYLGNDRDADDVCQEVMLKVLYGLKNLEGKSKFKTWLYSITYNECITQYR
                     KERRKRRLMDALSLDPLEEASEEKALQPEEKGGLDRWLVYVNPIDRGILVLRFVAELE
                     FQEIADIMHMGLSATKMRYKRALDKLREKFAGETET"
BASE COUNT      157 a    133 c    170 g    131 t
ORIGIN
        1 atgaataaag cccaaacgct atccacgcgc tacgaccccc gcgagctctc tgatgaggag
       61 ttggtcgcgc gctcgcatac cgagcttttt cacgtaacgc gcgcctatga agaactgatg
      121 cggcgttacc agcgaacatt atttaacgtt tgtgcgagat atcttgggaa cgatcgcgac
      181 gcagacgatg tctgtcagga agtcatgttg aaggtgctgt atggcctgaa gaacctcgag
      241 gggaaatcga agttcaaaac gtggctctac agcatcacgt acaacgaatg tattacgcag
      301 tatcggaagg aacggcgaaa gcgtcgcttg atggacgcat tgagtcttga ccccctcgag
      361 gaagcgtccg aagaaaaggc gcttcaaccc gaggagaagg gcgggcttga tcgctggctg
      421 gtgtatgtga acccgattga ccgtggaatt ctggtgcttc gatttgtcgc agagctggaa
      481 tttcaggaga tcgcagacat catgcacatg ggtttgagtg cgacaaaaat gcgttacaaa
      541 cgtgctctag ataaattgcg tgagaaattt gcaggcgaga ctgaaactta g
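As a rough illustration of how the DNA sequence part of a GenBank flat file can be read back by a program, the sketch below collects the bases that follow the ORIGIN keyword. It assumes the record is laid out one field per line, as GenBank distributes it; the function name `origin_to_sequence` and the shortened two-line sample are made up for this example.

```python
def origin_to_sequence(record_text):
    """Return the DNA sequence stored after the ORIGIN keyword of a GenBank record."""
    bases = []
    in_origin = False
    for line in record_text.splitlines():
        if line.startswith("ORIGIN"):
            in_origin = True
            continue
        if line.startswith("//"):  # record terminator
            break
        if in_origin:
            # Keep letters only: this drops the position counter and the
            # spaces between the 10-base blocks.
            bases.extend(ch for ch in line if ch.isalpha())
    return "".join(bases)


# Shortened sample: only the first two sequence lines of AF115338.
sample = """LOCUS       AF115338      591 bp    DNA     linear   BCT 19-AUG-1999
ORIGIN
        1 atgaataaag cccaaacgct atccacgcgc tacgaccccc gcgagctctc tgatgaggag
       61 ttggtcgcgc gctcgcatac cgagcttttt cacgtaacgc gcgcctatga agaactgatg
//
"""

seq = origin_to_sequence(sample)
print(len(seq), seq[:10])
```

For real work a maintained parser (for example one of the Open Bioinformatics Foundation toolkits) is usually a safer choice than hand-written code like this.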

Page 16

EMBL Flat File

The record has three parts:
• Header: title, taxonomy, citation
• Features (AA seq)
• DNA sequence

ID   AF115338   standard; DNA; PRO; 591 BP.
AC   AF115338;
SV   AF115338.1
DT   03-JUN-1999 (Rel. 59, Created)
DT   23-AUG-1999 (Rel. 60, Last updated, Version 2)
DE   Pseudomonas fluorescens ECF sigma factor SigX (sigX) gene, complete cds.
KW   .
OS   Pseudomonas fluorescens
OC   Bacteria; Proteobacteria; gamma subdivision; Pseudomonadaceae; Pseudomonas.
RN   [1]
RP   1-591
RX   MEDLINE; 99369842.
RA   Brinkman F.S., Schoofs G., Hancock R.E., De Mot R.;
RT   "Influence of a putative ECF sigma factor on expression of the major outer
RT   membrane protein, OprF, in Pseudomonas aeruginosa and Pseudomonas
RT   fluorescens";
RL   J. Bacteriol. 181(16):4746-4754(1999).
RN   [2]
RP   1-591
RA   De Mot R.;
RT   ;
RL   Submitted (04-DEC-1998) to the EMBL/GenBank/DDBJ databases.
RL   F.A. Janssens Laboratory of Genetics, Applied Plant Sciences, K.
RL   Mercierlaan 92, Heverlee B-3001, Belgium
DR   SPTREMBL; Q9X4L7; Q9X4L7.
FH   Key             Location/Qualifiers
FH
FT   source          1..591
FT                   /db_xref="taxon:294"
FT                   /organism="Pseudomonas fluorescens"
FT                   /strain="M114"
FT   CDS             1..591
FT                   /codon_start=1
FT                   /db_xref="SPTREMBL:Q9X4L7"
FT                   /transl_table=11
FT                   /gene="sigX"
FT                   /product="ECF sigma factor SigX"
FT                   /protein_id="AAD34329.1"
FT                   /translation="MNKAQTLSTRYDPRELSDEELVARSHTELFHVTRAYEELMRRYQR
FT                   TLFNVCARYLGNDRDADDVCQEVMLKVLYGLKNLEGKSKFKTWLYSITYNECITQYRKE
FT                   RRKRRLMDALSLDPLEEASEEKALQPEEKGGLDRWLVYVNPIDRGILVLRFVAELEFQE
FT                   IADIMHMGLSATKMRYKRALDKLREKFAGETET"
SQ   Sequence 591 BP; 157 A; 133 C; 170 G; 131 T; 0 other;
     atgaataaag cccaaacgct atccacgcgc tacgaccccc gcgagctctc tgatgaggag        60
     ttggtcgcgc gctcgcatac cgagcttttt cacgtaacgc gcgcctatga agaactgatg       120
     cggcgttacc agcgaacatt atttaacgtt tgtgcgagat atcttgggaa cgatcgcgac       180
     gcagacgatg tctgtcagga agtcatgttg aaggtgctgt atggcctgaa gaacctcgag       240
     gggaaatcga agttcaaaac gtggctctac agcatcacgt acaacgaatg tattacgcag       300
     tatcggaagg aacggcgaaa gcgtcgcttg atggacgcat tgagtcttga ccccctcgag       360
     gaagcgtccg aagaaaaggc gcttcaaccc gaggagaagg gcgggcttga tcgctggctg       420
     gtgtatgtga acccgattga ccgtggaatt ctggtgcttc gatttgtcgc agagctggaa       480
     tttcaggaga tcgcagacat catgcacatg ggtttgagtg cgacaaaaat gcgttacaaa       540
     cgtgctctag ataaattgcg tgagaaattt gcaggcgaga ctgaaactta g                591
//
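Every annotation line in an EMBL flat file starts with a two-letter line code (ID, AC, DT, DE, ...), which makes the format easy to split by field. The sketch below groups lines by that code. It assumes one field per line as in the distributed format; `group_by_line_code` and the shortened sample are hypothetical names and data for illustration only.

```python
from collections import defaultdict


def group_by_line_code(record_text):
    """Collect the text of each EMBL line under its two-letter line code."""
    fields = defaultdict(list)
    for line in record_text.splitlines():
        if line.startswith("//"):  # end of record
            break
        code, _, rest = line.partition(" ")
        # Sequence lines start with spaces, so their "code" is empty and they are skipped.
        if len(code) == 2 and code.isupper():
            fields[code].append(rest.strip())
    return dict(fields)


# Shortened sample taken from the header of AF115338.
sample = """ID   AF115338   standard; DNA; PRO; 591 BP.
AC   AF115338;
DT   03-JUN-1999 (Rel. 59, Created)
DT   23-AUG-1999 (Rel. 60, Last updated, Version 2)
OS   Pseudomonas fluorescens
//
"""

fields = group_by_line_code(sample)
print(fields["AC"])
print(fields["DT"])
```

Repeated codes such as DT are kept in file order, so multi-line fields can be rejoined by the caller.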

Page 17

UniProt: Swiss-Prot (a curated DB)

Abbreviated view of the entry:

ID   CYS3_YEAST STANDARD; PRT; 393 AA.
AC   P31373;
DT   01-JUL-1993 (REL. 26, CREATED)
DE   CYSTATHIONINE GAMMA-LYASE (EC 4.4.1.1) (GAMMA-CYSTATHIONASE).
GN   CYS3 OR CYI1 OR STR1 OR YAL012W OR FUN35.
OS   TAXONOMY
OC   SACCHAROMYCETACEAE; SACCHAROMYCES.
RX   CITATION
CC   -!- CATALYTIC ACTIVITY: L-CYSTATHIONINE + H(2)O = L-CYSTEINE +
CC       NH(3) + 2-OXOBUTANOATE.
CC   -!- COFACTOR: PYRIDOXAL PHOSPHATE.
CC   -!- PATHWAY: FINAL STEP IN THE TRANS-SULFURATION PATHWAY SYNTHESIZING
CC       L-CYSTEINE FROM L-METHIONINE.
CC   -!- SUBUNIT: HOMOTETRAMER.
CC   -!- SUBCELLULAR LOCATION: CYTOPLASMIC.
CC   -!- SIMILARITY: BELONGS TO THE TRANS-SULFURATION ENZYMES FAMILY.
CC   --------------------------------------------------------------------------
CC   Disclaimer
CC   --------------------------------------------------------------------------
DR   DATABASE cross-reference
KW   CYSTEINE BIOSYNTHESIS; LYASE; PYRIDOXAL PHOSPHATE.
FT   INIT_MET 0 0
FT   BINDING 203 203 PYRIDOXAL PHOSPHATE (BY SIMILARITY).
SQ   SEQUENCE 393 AA; 42411 MW; 55BA2771 CRC32;
     TLQESDKFAT KAIHAGEHVD VHGSVIEPIS LSTTFKQSSP ANPIGTYEYS RSQNPNRENL
     ERAVAALENA QYGLAFSSGS ATTATILQSL PQGSHAVSIG DVYGGTHRYF TKVANAHGVE
     TSFTNDLLND LPQLIKENTK LVWIETPTNP TLKVTDIQKV ADLIKKHAAG QDVILVVDNT
     FLSPYISNPL NFGADIVVHS ATKYINGHSD VVLGVLATNN KPLYERLQFL QNAIGAIPSP
     FDAWLTHRGL KTLHLRVRQA ALSANKIAEF LAADKENVVA VNYPGLKTHP NYDVVLKQHR
     DALGGGMISF RIKGGAEAAS KFASSTRLFT LAESLGGIES LLEVPAVMTH GGIPKEAREA
     SGVFDDLVRI SVGIEDTDDL LEDIKQALKQ ATN
//

Full entry:

ID   CYS3_YEAST STANDARD; PRT; 393 AA.
AC   P31373;
DT   01-JUL-1993 (REL. 26, CREATED)
DT   01-JUL-1993 (REL. 26, LAST SEQUENCE UPDATE)
DT   01-NOV-1995 (REL. 32, LAST ANNOTATION UPDATE)
DE   CYSTATHIONINE GAMMA-LYASE (EC 4.4.1.1) (GAMMA-CYSTATHIONASE).
GN   CYS3 OR CYI1 OR STR1 OR YAL012W OR FUN35.
OS   SACCHAROMYCES CEREVISIAE (BAKER'S YEAST).
OC   EUKARYOTA; FUNGI; ASCOMYCOTA; HEMIASCOMYCETES; SACCHAROMYCETALES;
OC   SACCHAROMYCETACEAE; SACCHAROMYCES.
RN   [1]
RP   SEQUENCE FROM N.A., AND PARTIAL SEQUENCE.
RX   MEDLINE; 92250430. [NCBI, ExPASy, Israel, Japan]
RA   ONO B.-I., TANAKA K., NAITO K., HEIKE C., SHINODA S., YAMAMOTO S.,
RA   OHMORI S., OSHIMA T., TOH-E A.;
RT   "Cloning and characterization of the CYS3 (CYI1) gene of
RT   Saccharomyces cerevisiae.";
RL   J. BACTERIOL. 174:3339-3347(1992).
CC   -!- CATALYTIC ACTIVITY: L-CYSTATHIONINE + H(2)O = L-CYSTEINE +
CC       NH(3) + 2-OXOBUTANOATE.
CC   -!- COFACTOR: PYRIDOXAL PHOSPHATE.
CC   -!- PATHWAY: FINAL STEP IN THE TRANS-SULFURATION PATHWAY SYNTHESIZING
CC       L-CYSTEINE FROM L-METHIONINE.
CC   -!- SUBUNIT: HOMOTETRAMER.
CC   -!- SUBCELLULAR LOCATION: CYTOPLASMIC.
CC   -!- SIMILARITY: BELONGS TO THE TRANS-SULFURATION ENZYMES FAMILY.
CC   --------------------------------------------------------------------------
CC   This SWISS-PROT entry is copyright. It is produced through a collaboration
CC   between the Swiss Institute of Bioinformatics and the EMBL outstation -
CC   the European Bioinformatics Institute. There are no restrictions on its
CC   use by non-profit institutions as long as its content is in no way
CC   modified and this statement is not removed. Usage by and for commercial
CC   entities requires a license agreement (See http://www.isb-sib.ch/announce/
CC   or send an email to [email protected]).
CC   --------------------------------------------------------------------------
DR   EMBL; L05146; AAC04945.1; -. [EMBL / GenBank / DDBJ] [CoDingSequence]
DR   EMBL; L04459; AAA85217.1; -. [EMBL / GenBank / DDBJ] [CoDingSequence]
DR   EMBL; D14135; BAA03190.1; -. [EMBL / GenBank / DDBJ] [CoDingSequence]
DR   PIR; S31228; S31228.
DR   YEPD; 5280; -.
DR   SGD; L0000470; CYS3. [SGD / YPD]
DR   PFAM; PF01053; Cys_Met_Meta_PP; 1.
DR   PROSITE; PS00868; CYS_MET_METAB_PP; 1.
DR   DOMO; P31373.
DR   PRODOM [Domain structure / List of seq. sharing at least 1 domain]
DR   PROTOMAP; P31373.
DR   PRESAGE; P31373.
DR   SWISS-2DPAGE; GET REGION ON 2D PAGE.
KW   CYSTEINE BIOSYNTHESIS; LYASE; PYRIDOXAL PHOSPHATE.
FT   INIT_MET 0 0
FT   BINDING 203 203 PYRIDOXAL PHOSPHATE (BY SIMILARITY).
SQ   SEQUENCE 393 AA; 42411 MW; 55BA2771 CRC32;
     TLQESDKFAT KAIHAGEHVD VHGSVIEPIS LSTTFKQSSP ANPIGTYEYS RSQNPNRENL
     ERAVAALENA QYGLAFSSGS ATTATILQSL PQGSHAVSIG DVYGGTHRYF TKVANAHGVE
     TSFTNDLLND LPQLIKENTK LVWIETPTNP TLKVTDIQKV ADLIKKHAAG QDVILVVDNT
     FLSPYISNPL NFGADIVVHS ATKYINGHSD VVLGVLATNN KPLYERLQFL QNAIGAIPSP
     FDAWLTHRGL KTLHLRVRQA ALSANKIAEF LAADKENVVA VNYPGLKTHP NYDVVLKQHR
     DALGGGMISF RIKGGAEAAS KFASSTRLFT LAESLGGIES LLEVPAVMTH GGIPKEAREA
     SGVFDDLVRI SVGIEDTDDL LEDIKQALKQ ATN
//

Page 18

PDB - Protein Data Bank

Page 19

PDB – Provides?

• Protein Data Bank
– Protein and nucleic acid 3D structures
– X-ray, NMR, computationally predicted
– Sequence present

Page 20

PDB

Main record types:
• HEADER
• COMPND
• SOURCE
• AUTHOR
• DATE
• JRNL
• REMARK
• SEQRES
• ATOM COORDINATES

HEADER LEUCINE ZIPPER 15-JUL-93 1DGC 1DGC 2
COMPND GCN4 LEUCINE ZIPPER COMPLEXED WITH SPECIFIC 1DGC 3
COMPND 2 ATF/CREB SITE DNA 1DGC 4
SOURCE GCN4: YEAST (SACCHAROMYCES CEREVISIAE); DNA: SYNTHETIC 1DGC 5
AUTHOR T.J.RICHMOND 1DGC 6
REVDAT 1 22-JUN-94 1DGC 0 1DGC 7
JRNL AUTH P.KONIG,T.J.RICHMOND 1DGC 8
JRNL TITL THE X-RAY STRUCTURE OF THE GCN4-BZIP BOUND TO 1DGC 9
JRNL TITL 2 ATF/CREB SITE DNA SHOWS THE COMPLEX DEPENDS ON DNA 1DGC 10
JRNL TITL 3 FLEXIBILITY 1DGC 11
JRNL REF J.MOL.BIOL. V. 233 139 1993 1DGC 12
JRNL REFN ASTM JMOBAK UK ISSN 0022-2836 0070 1DGC 13
REMARK 1 1DGC 14
REMARK 2 1DGC 15
REMARK 2 RESOLUTION. 3.0 ANGSTROMS. 1DGC 16
REMARK 3 1DGC 17
REMARK 3 REFINEMENT. 1DGC 18
REMARK 3 PROGRAM X-PLOR 1DGC 19
REMARK 3 AUTHORS BRUNGER 1DGC 20
REMARK 3 R VALUE 0.216 1DGC 21
REMARK 3 RMSD BOND DISTANCES 0.020 ANGSTROMS 1DGC 22
REMARK 3 RMSD BOND ANGLES 3.86 DEGREES 1DGC 23
REMARK 3 1DGC 24
REMARK 3 NUMBER OF REFLECTIONS 3296 1DGC 25
REMARK 3 RESOLUTION RANGE 10.0 - 3.0 ANGSTROMS 1DGC 26
REMARK 3 DATA CUTOFF 3.0 SIGMA(F) 1DGC 27
REMARK 3 PERCENT COMPLETION 98.2 1DGC 28
REMARK 3 1DGC 29
REMARK 3 NUMBER OF PROTEIN ATOMS 456 1DGC 30
REMARK 3 NUMBER OF NUCLEIC ACID ATOMS 386 1DGC 31
REMARK 4 1DGC 32
SEQRES 1 A 62 ILE VAL PRO GLU SER SER ASP PRO ALA ALA LEU LYS ARG 1DGC 60
SEQRES 2 A 62 ALA ARG ASN THR GLU ALA ALA ARG ARG SER ARG ALA ARG 1DGC 61
SEQRES 3 A 62 LYS LEU GLN ARG MET LYS GLN LEU GLU ASP LYS VAL GLU 1DGC 62
SEQRES 4 A 62 GLU LEU LEU SER LYS ASN TYR HIS LEU GLU ASN GLU VAL 1DGC 63
SEQRES 5 A 62 ALA ARG LEU LYS LYS LEU VAL GLY GLU ARG 1DGC 64
SEQRES 1 B 19 T G G A G A T G A C G T C 1DGC 65
SEQRES 2 B 19 A T C T C C 1DGC 66
HELIX 1 A ALA A 228 LYS A 276 1 1DGC 67
CRYST1 58.660 58.660 86.660 90.00 90.00 90.00 P 41 21 2 8 1DGC 68
ORIGX1 1.000000 0.000000 0.000000 0.00000 1DGC 69
ORIGX2 0.000000 1.000000 0.000000 0.00000 1DGC 70
ORIGX3 0.000000 0.000000 1.000000 0.00000 1DGC 71
SCALE1 0.017047 0.000000 0.000000 0.00000 1DGC 72
SCALE2 0.000000 0.017047 0.000000 0.00000 1DGC 73
SCALE3 0.000000 0.000000 0.011539 0.00000 1DGC 74
ATOM 1 N PRO A 227 35.313 108.011 15.140 1.00 38.94 1DGC 75
ATOM 2 CA PRO A 227 34.172 107.658 15.972 1.00 39.82 1DGC 76

ATOM 842 C5 C B 9 57.692 100.286 22.744 1.00 29.82 1DGC 916
ATOM 843 C6 C B 9 58.128 100.193 21.465 1.00 30.63 1DGC 917
TER 844 C B 9 1DGC 918
MASTER 46 0 0 1 0 0 0 6 842 2 0 7 1DGC 919
END 1DGC 920
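PDB files are fixed-column text, so ATOM records are normally read by character position rather than by splitting on spaces. The sketch below shows that idea for the first atom of 1DGC. Because the column alignment is lost in this transcript, the sample line is rebuilt with a format string; the helper name `parse_atom_line` is made up, and the column positions are the standard ones for ATOM records as far as I know them, so treat this as an assumption to check against the PDB format guide.

```python
def parse_atom_line(line):
    """Read the main fields of a fixed-column PDB ATOM record."""
    return {
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "b_factor": float(line[60:66]),
    }


# Rebuild the first ATOM record of 1DGC with proper column widths.
atom_format = "ATOM  {:>5d} {:<4s} {:>3s} {:1s}{:>4d}    {:>8.3f}{:>8.3f}{:>8.3f}{:>6.2f}{:>6.2f}"
line = atom_format.format(1, " N", "PRO", "A", 227, 35.313, 108.011, 15.140, 1.00, 38.94)

atom = parse_atom_line(line)
print(atom["name"], atom["res_name"], atom["chain"], atom["res_seq"], atom["x"])
```

Splitting on whitespace would work for this particular line, but it tends to break when neighbouring numeric columns run together, which is why the slices are used here.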

Page 21

Data Formats

• Flat files √
• FASTA – simplest!
• Many other formats for particular uses: XML, Clustal (for multiple sequence alignments), GFF (for sequence annotation), etc.
• High-throughput data file formats: BAM, etc.

Page 22

FASTA

>gi|121066|sp|P03069|GCN4_YEAST GENERAL CONTROL PROTEIN GCN4
MSEYQPSLFALNPMGFSPLDGSKSTNENVSASTSTAKPMVGQLIFDKFIKTEEDPIIKQDTPSNLDFDFALPQTATAPDAKTVLPIPELDDAVVESFFSSSTDSTPMFEYENLEDNSKEWTSLFDNDIPVTTDDVSLADKAIESTEEVSLVPSNLEVSTTSFLPTPVLEDAKLTQTRKVKKPNSVVKKSHHVGKDDESRLDHLGVVAYNRKQRSIPLSPIVPESSDPAALKRARNTEAARRSRARKLQRMKQLEDKVEELLSKNYHLENEVARLKKLVGER

Header line (starts with “>”): >gi|121066|sp|P03069|GCN4_YEAST GENERAL CONTROL PROTEIN GCN4

Page 23

FASTA

> Your favourite gene 1 - yfg1
MSEYQPSLFALNPMGFSPLDGSKSTNENVSASTSTAKPMVGQLIFDKFIKTEEDPIIKQDTPSNLDFDFALPQTATAPDAKTVLPIPELDDAVVESFFSSSTDSTPMFEYENLEDNSKEWTSLFDNDIPVTTDDVSLADKAIESTEEVSLVPSNLEVSTTSFLPTPVLEDAKLTQTRKVKKPNSVVKKSHHVGKDDESRLDHLGVVAYNRKQRSIPLSPIVPESSDPAALKRARNTEAARRSRARKLQRMKQLEDKVEELLSKNYHLENEVARLKKLVGER
> Your favourite gene 2 - yfg2
MQPSLFALNPMGFSPLDGSKSTNENVSASTSTAKPMVGQLIFDKFIKTEEDPIVIVDTPSNLDFDFALPQTATAPDAKTVLPIPELDDAVVESFFSSSTDSTPMFEYENWTITSLFDNDIPVTTDDVSLADKAIESTEEVSLVPSNLEVSTTSFLPTPVLLEDNSKEWEDAKLTQTRKVKKPNSVVKKSHHVGKDDESRLDHLGVVAYNRKQRSIPLSPIV
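A multi-record FASTA file like the one above is simple enough to read with a few lines of code: each ">" line starts a new record and every following line, up to the next ">", belongs to its sequence. The sketch below is a minimal reader under that assumption; `parse_fasta` is a made-up name and the sequences in the sample are shortened.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)  # sequence may be wrapped over several lines
    if header is not None:
        yield header, "".join(chunks)


# Shortened versions of the two records on the slide.
sample = """> Your favourite gene 1 - yfg1
MSEYQPSLFALNPMGFSPLDGSKSTNENVS
ASTSTAKPMVGQLIFDKF
> Your favourite gene 2 - yfg2
MQPSLFALNPMGFSPLDG
"""

records = list(parse_fasta(sample))
for header, sequence in records:
    print(header, len(sequence))
```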

Page 24

In GenBank, records are organized for various reasons. Understanding the rationale behind “groupings” and “numbering” systems for such databases is the key to fully taking advantage of database resources - appropriately!

Page 25

LOCUS vs Accession vs PID vs protein_id: What’s the difference?

LOCUS: Unique string of 10 letters and numbers in the database. Not maintained amongst databases.
ACCESSION: A unique identifier for that record (particular sequence) in GenBank/EMBL/DDBJ that does not change when the record is updated.
Nucleotide gi: GenInfo identifier (gi), a unique integer specific to GenBank which will change every time the sequence changes.
VERSION: System started in 1999 for GenBank/EMBL/DDBJ where the accession and version play the same role as the accession and gi number. Format: accession.version
PID: Protein identifier: g, e or d prefix to gi number. Can have one or two on one CDS (coding sequence).
Protein gi: GenInfo identifier (gi), a GenBank unique integer which will change every time the sequence changes.
protein_id: Identifier which has the same structure and function as the nucleotide accession with version numbers.

Page 26

LOCUS, Accession, NID, gi and PID

LOCUS       HSU40282     1789 bp    mRNA            PRI       21-MAY-1998
DEFINITION  Homo sapiens integrin-linked kinase (ILK) mRNA, complete cds.
ACCESSION   U40282
VERSION     U40282.1  GI:3150001

     CDS             157..1515
                     /gene="ILK"
                     /note="protein serine/threonine kinase"
                     /codon_start=1
                     /product="integrin-linked kinase"
                     /protein_id="AAC16892.1"
                     /db_xref="PID:g3150002"
                     /db_xref="GI:3150002"

LOCUS: HSU40282
ACCESSION: U40282
VERSION: U40282.1
GI: 3150001
PID: g3150002
Protein gi: 3150002
protein_id: AAC16892.1
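To make the accession.version convention concrete, the sketch below pulls the nucleotide and protein identifiers out of a trimmed copy of this record and splits each into its accession and version parts. The regular expressions and the helper `split_versioned` are illustrative assumptions, not part of any NCBI tool.

```python
import re

# Trimmed copy of the HSU40282 excerpt, one field per line.
record = """LOCUS       HSU40282     1789 bp    mRNA            PRI       21-MAY-1998
ACCESSION   U40282
VERSION     U40282.1  GI:3150001
     CDS             157..1515
                     /protein_id="AAC16892.1"
                     /db_xref="GI:3150002"
"""


def split_versioned(identifier):
    """Split 'accession.version' into (accession, version number)."""
    accession, _, version = identifier.partition(".")
    return accession, int(version)


version_line = re.search(r"^VERSION\s+(\S+)\s+GI:(\d+)", record, re.MULTILINE)
nucleotide_id = version_line.group(1)
nucleotide_gi = int(version_line.group(2))
protein_id = re.search(r'/protein_id="([^"]+)"', record).group(1)

print(split_versioned(nucleotide_id), nucleotide_gi)
print(split_versioned(protein_id))
```

The accession part stays fixed for the life of the record, while the version number increases when the sequence itself changes.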

Page 27

Which of these would you use to cite a sequence in a paper?

Can you think of situations where you would use one over another?

Page 28

Which of these would you use to cite a sequence? When would you use one over another?

LOCUS: Unique string of 10 letters and numbers in the database. Not maintained amongst databases.
ACCESSION: A unique identifier for that record (particular sequence) in GenBank/EMBL/DDBJ that does not change when the record is updated.
Nucleotide gi: GenInfo identifier (gi), a unique integer specific to GenBank which will change every time the sequence changes (and can disappear!).
VERSION: System started in 1999 for GenBank/EMBL/DDBJ where the accession and version play the same role as the accession and gi number. Format: accession.version
PID: Protein identifier: g, e or d prefix to gi number. Can have one or two on one CDS (coding sequence).
Protein gi: GenInfo identifier (gi), a GenBank unique integer which will change every time the sequence changes.
protein_id: Identifier which has the same structure and function as the nucleotide accession with version numbers.

Page 29

Briefly… Examples of Functional Divisions

PAT  Patent
EST  Expressed Sequence Tags
STS  Sequence Tagged Site
GSS  Genome Survey Sequence
HTG  High Throughput Genome (unfinished)
HTC  High Throughput cDNA (unfinished)

GenBank overview: http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=handbook&part=ch1

Page 30

Other Sequence (& related) File Formats

Historically, a number of other sequence and annotation file formats have been proposed, see:
http://emboss.sourceforge.net/docs/themes/SequenceFormats.html

The demands of representing NGS data have given rise to additional file formats and data compression standards, some of which you will encounter in this course. The next few slides will present an overview of a few of these emergent NGS formats and standards. See:
http://www.broadinstitute.org/software/igv/FileFormats
http://www.broadinstitute.org/software/igv/RecommendedFileFormats
http://genome.ucsc.edu/FAQ/FAQformat

Page 31

Other Sequence (& Annotation) File Formats

FASTQ – FASTA with quality data
2bit – compressed DNA sequence format
SAM/BAM – Sequence Alignment/Map
GFF/GTF – General Feature Format
BED/WIG – annotation track data formats

Page 32

FASTQ

FASTQ – FASTA “with an attitude” (embedded quality scores). Originally developed at the Sanger Institute to couple (Phred) quality data with sequence, it is now common to specify raw read output data from NGS machines in this format.

Various flavors, differing in the format of the sequence identifier and in the valid range of quality scores:
fastq-sanger
fastq-illumina
fastq-solexa

See:
http://en.wikipedia.org/wiki/FASTQ_format
http://maq.sourceforge.net/fastq.shtml
http://nar.oxfordjournals.org/content/early/2009/12/16/nar.gkp1137.full

“…the Sanger version of the FASTQ format has found the broadest acceptance, supported by many assembly and read mapping tools …Therefore, most users will do this conversion very early in their workflows…”

@EAS54_6_R1_2_1_443_348
GTTGCTTCTGGCGTGGGTGGGGGGG
+EAS54_6_R1_2_1_443_348
*-+*''))**55CCF>>>>>>CCCC
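A FASTQ record is four lines: an "@" title, the sequence, a "+" separator and a quality string with one character per base. The sketch below decodes the example record, assuming the fastq-sanger convention in which each character encodes a Phred score as its ASCII code minus 33. Whether this particular read was produced with that offset is an assumption; the other flavors use different offsets or score scales. `parse_fastq_record` is a made-up helper name.

```python
def parse_fastq_record(lines):
    """Parse one four-line FASTQ record into (read id, sequence, Phred scores)."""
    title, sequence, separator, quality = (line.rstrip("\n") for line in lines)
    if not title.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality strings differ in length")
    # fastq-sanger stores each Phred score as chr(score + 33).
    scores = [ord(ch) - 33 for ch in quality]
    return title[1:], sequence, scores


record = [
    "@EAS54_6_R1_2_1_443_348",
    "GTTGCTTCTGGCGTGGGTGGGGGGG",
    "+EAS54_6_R1_2_1_443_348",
    "*-+*''))**55CCF>>>>>>CCCC",
]

read_id, seq, scores = parse_fastq_record(record)
print(read_id, len(seq), min(scores), max(scores))
```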

Page 33

http://hannonlab.cshl.edu/fastx_toolkit/

Linux, MacOSX or Unix only

Page 34

2bit File Format

A highly compressed sequence file that stores multiple DNA sequences (up to 4 Gb total) in a compact, randomly accessible format. The file contains masking information as well as the DNA itself.
http://genome.ucsc.edu/FAQ/FAQformat#format7
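The compression comes from storing each base in two bits, so four bases fit in one byte. The sketch below shows only that packing step, using the base codes given in the UCSC format description (T=00, C=01, A=10, G=11, first base in the most significant bits). It is not a reader or writer for real .2bit files, which also carry a header, a sequence index and the N and masking blocks; `pack_bases` and `unpack_bases` are made-up names.

```python
CODES = {"T": 0b00, "C": 0b01, "A": 0b10, "G": 0b11}
BASES = "TCAG"  # index in this string equals the 2-bit code


def pack_bases(dna):
    """Pack an A/C/G/T string into bytes, four bases per byte."""
    packed = bytearray()
    for start in range(0, len(dna), 4):
        chunk = dna[start:start + 4].upper()
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        byte <<= 2 * (4 - len(chunk))  # pad a short final chunk with zero bits
        packed.append(byte)
    return bytes(packed)


def unpack_bases(packed, length):
    """Recover the first `length` bases from packed bytes."""
    bases = []
    for byte in packed:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 0b11])
    return "".join(bases[:length])


dna = "ACGATGTGGT"
packed = pack_bases(dna)
print(len(dna), "bases stored in", len(packed), "bytes")
```

The sequence length has to be stored separately, because the padding bits in the last byte are indistinguishable from real T bases.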

Page 35

SAM/BAM

SAM – a tab-delimited text file that contains a compact and indexable representation of nucleotide sequence alignments
http://samtools.sourceforge.net/SAM1.pdf
http://samtools.sourceforge.net/

BAM – binary version of SAM (preferred by IGV). I/O format of several NGS tools, see:
http://samtools.sourceforge.net/swlist.shtml

See also:
Li H.*, Handsaker B.*, Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R. and 1000 Genome Project Data Processing Subgroup (2009) The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics, 25, 2078-9.
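Because SAM is tab-delimited, one alignment line can be split into its eleven mandatory fields plus any optional tags. The sketch below does this for a single line. The alignment shown is invented for illustration (read name, contig and values are not from real data), and `parse_sam_line` is a hypothetical helper, not part of SAMtools.

```python
from collections import namedtuple

SamRecord = namedtuple(
    "SamRecord",
    "qname flag rname pos mapq cigar rnext pnext tlen seq qual tags",
)


def parse_sam_line(line):
    """Split one SAM alignment line into its mandatory fields and optional tags."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line has at least 11 tab-separated fields")
    qname, flag, rname, pos, mapq, cigar, rnext, pnext, tlen, seq, qual = fields[:11]
    return SamRecord(
        qname, int(flag), rname, int(pos), int(mapq), cigar,
        rnext, int(pnext), int(tlen), seq, qual, fields[11:],
    )


# Invented example alignment: a 10-base read matching ctg123 at position 1201.
line = "read1\t0\tctg123\t1201\t60\t10M\t*\t0\t0\tACGATGTGGT\tIIIIIIIIII\tNM:i:0"
rec = parse_sam_line(line)
is_reverse = bool(rec.flag & 0x10)  # flag bit 0x10 marks a reverse-strand alignment
print(rec.rname, rec.pos, rec.cigar, is_reverse)
```

Header lines, which start with "@", would need to be skipped before calling this function on a whole file.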

Page 36

Gene/General/Generic Feature Formats (GFF)

A General Feature Format (GFF) file is a relatively simple tab-delimited text file for describing genomic features. Many genome browsers (GBrowse, IGV, etc.) take GFF as input for annotation data.

There are several slightly but significantly different GFF file formats (GFF, GFF2, GFF3, GTF). The current primary standard is GFF3:

http://www.sequenceontology.org/gff3.shtml

Page 37

Excerpt of a GFF File

##gff-version 3
##sequence-region ctg123 1 1497228
ctg123 . gene 1000 9000 . + . ID=gene00001;Name=EDEN
ctg123 . mRNA 1050 9000 . + . ID=mRNA00001;Parent=gene00001;Name=EDEN.1
ctg123 . exon 1300 1500 . + . ID=exon00001;Parent=mRNA00003
ctg123 . CDS 1201 1500 . + 0 ID=cds00001;Parent=mRNA00001;Name=edenprotein.1
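Each GFF3 feature line has nine tab-separated columns (seqid, source, type, start, end, score, strand, phase, attributes), with the last column holding key=value pairs separated by semicolons. The sketch below parses two lines from the excerpt under that layout; `parse_gff3_line` is a made-up name, and URL-escaped or multi-valued attributes are not handled.

```python
def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dictionary."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 9:
        raise ValueError("a GFF3 feature line has nine tab-separated columns")
    seqid, source, ftype, start, end, score, strand, phase, attributes = columns
    return {
        "seqid": seqid,
        "source": source,
        "type": ftype,
        "start": int(start),  # 1-based, inclusive
        "end": int(end),      # 1-based, inclusive
        "score": score,
        "strand": strand,
        "phase": None if phase == "." else int(phase),
        "attributes": dict(
            item.split("=", 1) for item in attributes.split(";") if item
        ),
    }


lines = [
    "ctg123\t.\tgene\t1000\t9000\t.\t+\t.\tID=gene00001;Name=EDEN",
    "ctg123\t.\tCDS\t1201\t1500\t.\t+\t0\tID=cds00001;Parent=mRNA00001;Name=edenprotein.1",
]
features = [parse_gff3_line(line) for line in lines]
for feature in features:
    print(feature["type"], feature["start"], feature["end"], feature["attributes"]["ID"])
```

The Parent attribute is what ties a CDS or exon back to its mRNA and gene, so a fuller reader would build that hierarchy from the parsed dictionaries.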

Page 38

BED File Format

BED format provides a flexible way to define the data lines that are displayed in an annotation track in a genome browser.
http://genome.ucsc.edu/FAQ/FAQformat#format1

If your data set is BED-like, but it is very large and you would like to keep it on your own server, you should use the bigBed data format.
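One practical detail when moving between annotation formats is that BED uses a 0-based start and an exclusive end, whereas GFF coordinates are 1-based and inclusive. The sketch below reads the three required BED columns and converts them; the sample line is invented, and `bed_to_one_based` is a hypothetical helper name.

```python
def bed_to_one_based(line):
    """Return (chrom, start, end, length) with 1-based inclusive coordinates."""
    chrom, start, end = line.rstrip("\n").split("\t")[:3]
    start, end = int(start), int(end)
    # BED: chromStart is 0-based, chromEnd is exclusive, so length is end - start.
    return chrom, start + 1, end, end - start


# Invented BED line covering the same region as a GFF feature at 1000..9000.
line = "ctg123\t999\t9000\tEDEN"
chrom, start, end, length = bed_to_one_based(line)
print(chrom, start, end, length)
```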

Page 39

WIGgle format

The Wiggle format is for display of dense, continuous data such as GC percent, probability scores, and transcriptome data.
http://genome.ucsc.edu/goldenPath/help/wiggle.html

If you need to display continuous data that is sparse or contains elements of varying size, use the BedGraph format instead.

If you have a very large data set and you would like to keep it on your own server, you should use the bigWig data format.

Page 40

EMBOSS Sequence Analysis Suite

emboss.sourceforge.net

Page 41

Open Bioinformatics Foundation

bioperl / biojava / biopython / bioruby / biosql etc.

www.open-bio.org

Page 42

Sequence Databases: “Roll your Own”?

GMOD BioSQL: a lightweight database schema for storing and retrieving (annotated) sequence records using OpenBio software tools.

GMOD “Chado”: a more complex database schema for storing sequence data, genome feature annotation and a host of other related biological data (initially inspired by Drosophila genome annotation and genetics; supported by many GMOD software tools)

Page 43

Retrieving Sequence Information: Using integrated database resources such as Entrez

What you may be looking for:

• Heard on CBC about a disease gene that was recently discovered, and you want to know more about it.

• Want to build a dataset of DNA sequences upstream of a set of co-expressed genes, to identify common regulatory element sequences

• Evolutionary, functional, structural analyses, etc…
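The upstream-sequence dataset mentioned above can be sketched in a few lines of Python. This is an illustration under stated assumptions: gene coordinates are taken to be 1-based and inclusive on the forward strand, and the function names and toy sequence are hypothetical. The main point is that "upstream" depends on strand, so for a minus-strand gene the region lies after the gene end and must be reverse-complemented.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def upstream_region(chrom_seq, gene_start, gene_end, strand, length):
    """Return up to `length` bases immediately upstream of a gene.

    gene_start and gene_end are 1-based, inclusive, forward-strand
    coordinates (an assumption of this sketch). The result is written
    5'->3' on the gene's own strand and is shorter than `length` when
    the gene sits near the end of the sequence.
    """
    if strand == "+":
        begin = max(0, gene_start - 1 - length)
        return chrom_seq[begin:gene_start - 1]
    if strand == "-":
        end = min(len(chrom_seq), gene_end + length)
        return reverse_complement(chrom_seq[gene_end:end])
    raise ValueError("strand must be '+' or '-'")


def fasta_record(name, seq):
    """Format one sequence as a FASTA record."""
    return f">{name}\n{seq}\n"


# Toy chromosome and gene coordinates (hypothetical data).
chrom = "AAACCCGGGTTT"
print(fasta_record("geneA_upstream", upstream_region(chrom, 7, 9, "+", 3)), end="")
print(fasta_record("geneB_upstream", upstream_region(chrom, 7, 9, "-", 3)), end="")
```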

Entrez: Initial version of this “Pathway to Discovery”

Figure: the original Entrez databases and the kinds of links between them

Databases: MEDLINE abstracts, Nucleotide sequences, Protein sequences

Links: amino acid sequence similarity; coding region features; nucleotide sequence similarity; term frequency statistics; literature citations in sequence databases

PubMed Text Neighboring

Figure: two article titles, "Genetic Analysis of Cancer in Families" and "The Genetic Predisposition to Cancer"

• Common terms could indicate similar subject matter

• Statistical method

• Weights based on term frequencies within document and within the database as a whole

• Some terms are better than others
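The weighting idea above can be illustrated with a generic TF-IDF and cosine-similarity sketch. This is not NCBI's actual neighboring algorithm, only a textbook stand-in: a term counts for more when it is frequent in a document and rare in the database as a whole. The four-title "database" is a hypothetical toy; only the first two titles come from the slide.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Weight each term by (count in document) * log(N / documents containing it)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append(
            {term: count * math.log(n_docs / doc_freq[term])
             for term, count in counts.items()}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest_neighbor(index, vectors):
    """Index of the document most similar to vectors[index]."""
    others = [i for i in range(len(vectors)) if i != index]
    return max(others, key=lambda i: cosine(vectors[index], vectors[i]))


titles = [
    "genetic analysis of cancer in families",
    "the genetic predisposition to cancer",
    "the analysis of protein structure in yeast",
    "the annotation of genomes in bacteria",
]
vectors = tfidf_vectors(titles)
print(titles[nearest_neighbor(0, vectors)])
```

Words such as "of" and "in" appear in most titles and so receive low weights, while "genetic" and "cancer" are rarer and dominate the match, which is the sense in which some terms are better than others.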

Entrez began to integrate more data…

Figure labels: Genomes; Structures; Protein Sequences; GenBank; MEDLINE; Expression Data; Accession Numbers; PubMed online Journals; Full text; SNP Data; Accession Numbers - Map; MMDB structure:function; VAST; Links. The figure also shows an example protein sequence and an example nucleotide sequence.

Entrez

Entrez Help: http://www.ncbi.nlm.nih.gov/books/NBK3837/

Also check out What’s New (http://www.ncbi.nlm.nih.gov/books/NBK1969/) or @NCBI on Twitter to keep up with new features (like the recently released Database of Genomic Structural Variation).
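Entrez can also be queried from scripts. The sketch below builds request URLs for NCBI's E-utilities (esearch and efetch), which are not covered on these slides and are introduced here as an assumption; it only constructs the URLs and makes no network call. The search term and accession are illustrative values, and the Entrez Help pages remain the reference for current parameters and usage limits.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities (an addition of this sketch).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """URL for a text search of one Entrez database."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"


def efetch_url(db, record_id, rettype, retmode="text"):
    """URL to download a single record in the requested format."""
    query = urlencode(
        {"db": db, "id": record_id, "rettype": rettype, "retmode": retmode}
    )
    return f"{EUTILS_BASE}/efetch.fcgi?{query}"


# Illustrative values only: a gene-name search and one nucleotide accession.
print(esearch_url("nucleotide", "BRCA2[Gene Name] AND Homo sapiens[Organism]"))
print(efetch_url("nucleotide", "NM_000546", "fasta"))
```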

SFU’s Cenk Sahinalp - international leader in structural variation bioinformatics research

BLink

Other Sequence Databases and Sequence Data Visualization

www.ensembl.org

The Ensembl Genomes Database: Focuses on humans and select vertebrates

(but a plant version is also available…)

What is Ensembl?

• Publicly available, automated annotation of selected eukaryotic genomes (initially with mammalian focus)

– Open source software (but slightly complicated to set up…)

– Multiple different ways to access data, including programmatic (Perl API)

– Provides access to additional data from other groups (distributed annotation system or DAS)

ENSEMBL – Region in Detail

Check out the “Printable mini-course” at http://uswest.ensembl.org/info/website/tutorials/index.html

Generic Model Organism Database (GMOD) Project

www.gmod.org

BioMart (EnsMart)

A powerful querying system

(later: we’ll learn about Ensembl’s Perl API)
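To give a feel for what a BioMart query contains (a dataset, some filters, and the attributes to return), here is a minimal sketch that assembles a query in BioMart's XML format with Python's standard library. The dataset, filter and attribute names are assumptions for illustration and should be checked against the BioMart web interface, which can export the XML for any query built there.

```python
import xml.etree.ElementTree as ET


def biomart_query_xml(dataset, filters, attributes):
    """Build a BioMart-style XML query string.

    dataset:    dataset name (assumed example: 'hsapiens_gene_ensembl')
    filters:    dict of filter name -> value, restricting the rows returned
    attributes: list of attribute names, i.e. the columns to return
    """
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "1",
        "uniqueRows": "1",
        "datasetConfigVersion": "0.6",
    })
    dataset_node = ET.SubElement(
        query, "Dataset", {"name": dataset, "interface": "default"}
    )
    for name, value in filters.items():
        ET.SubElement(dataset_node, "Filter", {"name": name, "value": value})
    for name in attributes:
        ET.SubElement(dataset_node, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")


# Hypothetical example: gene IDs and names for genes on chromosome 21.
xml_text = biomart_query_xml(
    "hsapiens_gene_ensembl",
    {"chromosome_name": "21"},
    ["ensembl_gene_id", "external_gene_name"],
)
print(xml_text)
```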

Distributed Annotation System (DAS)

• Allows Third-Party annotation

• Users choose the annotation they are interested in

• Good for specialized feature annotation or for comparison of different methodologies

• Allows you to view different data in a consistent user interface/display

Ensembl: open source display focused on eukaryotes

Gbrowse: open source display for any dataset

Gbrowse: Another genome data viewer with DAS

Gene track

Protein track

Metabolic pathways track

Regulons track

3D structures track

Intergenic sequences track

Terminators track

DNA sequence track

Translation track

http://gmod.org/wiki/GBrowse

Gbrowse is used to display genomic data for many projects

• Mouse, Rat, Fly, C. elegans and other animals

• Rice and a number of other plants

• S. cerevisiae and other yeasts

• A number of unicellular eukaryotes

• Many, many prokaryotes

• Other types of data: HapMap, Segmental Duplications, RNA-seq data-specific or other type-specific data

** Open source package ** (slightly simpler to set up than Ensembl)

Entrez, Ensembl, Gbrowse: What’s the difference?

• Entrez

– Search and retrieval system for major databases, including PubMed, Sequences (including genomes), Structures, Taxonomy, etc.

– NCBI (Maryland, USA) centrally hosts Entrez and they decide what to host and maintain

– Not open source

• Ensembl

– Automated annotation of selected eukaryotic genomes

– EMBL-EBI and the Sanger Institute (Cambridge/Hinxton, UK) centrally host most resources and they decide what data to host and maintain

– Open source and can obtain a local copy plus access other DAS data

• Gbrowse

– Genome/genomic data viewer

– Very decentralized – anyone can set it up and publicly display any data

– Open source and can set up a local copy plus access other DAS data

Entrez, Ensembl, Gbrowse: Benefits/Disadvantages of each?

• Entrez

– Reputable institution – trust in the data

– Maintained by well established group with a lot of capital

– Perceived greater consistency

– Limited to what they make available

– They make the call on how to display it, analyze it, and classify it

– Some of the analyses are definitely a black box

• Ensembl

– Open source – can see how the data is analyzed/processed – NOT necessarily an issue with lower quality data – a lot of eyes are watching you (wooahh haa haa…)

– Reputable institution – trust in the data

• Gbrowse

– Easy to use and set up

– Open source – can see how the data is analyzed/processed

– Anybody can release their data to the world

– Anybody can analyze the data in any way they want and release it to the world

Local Visualization of NGS Data

http://www.broadinstitute.org/igv/

How do I update or correct errors in the Databases?

Example: For gene names, citations, new protein names, sequencing errors in GenBank…

[email protected]

But most people don’t bother to correct things that they notice are wrong…

increased need for more focused community-based projects

Community Assisted Curation of Subsets of Datasets

Core curators continually update annotation of a data subset (e.g. a genome)

– Literature review

– Input from the community

Updates sent in batches to centralized databases -> additional review -> becomes, for example, an NCBI RefSeq

Examples: WormBase.org, Pseudomonas.com

Ethical issues with bioinformatics databases

• How public and/or open source should biomolecular data be?

• How much should researchers be forced to release data as soon as possible?

• How much analysis of a genome can a researcher publish before the genome sequence is published?

• How do we best organize the data? BIG issue! e.g. biomolecular pathway classifications can bias analyses of which pathways are found to be upregulated or downregulated by gene expression analysis

Resources

http://www.ncbi.nlm.nih.gov/
http://www.ebi.ac.uk/
http://www.expasy.ch/
http://www.ensembl.org/
http://www.rcsb.org/pdb/
http://www.pseudomonas.com/
http://www.wormbase.org/
http://biodas.org/
http://nar.oupjournals.org/
http://www.gmod.org/
http://www.broadinstitute.org/igv/