Sequence File Parsing using Biopython

Page 1: Sequence File Parsing using Biopython

10/7/2013 BCHB524 - 2013 - Edwards

Sequence File Parsing using Biopython

BCHB524 2013

Lecture 11

Page 2: Sequence File Parsing using Biopython

Review

Modules in the standard Python library:
- sys, os, os.path: access files, program environment
- zipfile, gzip: access compressed files directly
- urllib: access web resources (URLs) as files
- csv: read delimited, line-based records from files

Plus lots, lots more.

Page 3: Sequence File Parsing using Biopython

Biopython

Additional modules that make many common bioinformatics tasks easier:
- File parsing (many formats) and web retrieval
- Formal biological alphabets, codon tables, etc.
- Lots of other stuff…

Has to be installed separately:
- Not part of standard Python, or Enthought

biopython.org

Page 4: Sequence File Parsing using Biopython

Biopython: FASTA format

Most common biological sequence data format.

Header/description line:
- >accession description
- Multi-accession sometimes represented as accession1|accession2|accession3
- Lots of variations, no standardization
- No prescribed format for the description

Other lines:
- Sequence, one chunk per line.
- Usually all lines, except the last, are the same length.
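The layout described above can be sketched with the standard library alone. This is only an illustration of the format, not a replacement for Bio.SeqIO: the function names parse_fasta and split_header, and the toy records, are made up for this example, and splitting the first header word on "|" is just one of the many conventions in use. The sketch is written in Python 3 syntax.

```python
def split_header(header):
    """Split 'acc1|acc2 description' into (['acc1', 'acc2'], 'description')."""
    first, _, description = header.partition(" ")
    return first.split("|"), description


def parse_fasta(lines):
    """Yield (accessions, description, sequence) for each FASTA record."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header line closes the previous record, if any.
            if header is not None:
                yield split_header(header) + ("".join(chunks),)
            header, chunks = line[1:], []
        else:
            # Sequence chunks are concatenated in order.
            chunks.append(line)
    if header is not None:
        yield split_header(header) + ("".join(chunks),)


# Hypothetical toy records, invented for illustration.
example = """>acc1|acc2 hypothetical example protein
MKTAYIAKQR
QISFVKSH
>acc3 second record
MSE
""".splitlines()

records = list(parse_fasta(example))
for accessions, description, sequence in records:
    print(accessions, description, sequence)
```

Because header conventions vary so much, real code usually has to be adapted to the specific database that produced the file.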

Page 5: Sequence File Parsing using Biopython

Biopython: Bio.SeqIO

import Bio.SeqIO
import sys

# Check the input
if len(sys.argv) < 2:
    print >>sys.stderr, "Please provide a sequence file"
    sys.exit(1)

# Get the sequence filename
seqfilename = sys.argv[1]

# Open the FASTA file and iterate through its sequences
seqfile = open(seqfilename)
for seq_record in Bio.SeqIO.parse(seqfile, "fasta"):
    # Print out the various elements of the SeqRecord
    print "\n------NEW SEQRECORD------\n"
    print "seq_record.id:\n\t", seq_record.id
    print "seq_record.description:\n\t", seq_record.description
    print "seq_record.seq:\n\t", seq_record.seq
seqfile.close()
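The slide code uses Python 2 print statements and reads a file named on the command line. As a minimal sketch of the same loop in Python 3 syntax, assuming Biopython is installed, the example below parses a small in-memory FASTA string instead; the two toy records are invented for illustration.

```python
from io import StringIO

from Bio import SeqIO

# Made-up FASTA text standing in for a file on disk.
fasta_text = """>seq1 first toy record
ACGTACGT
ACGT
>seq2 second toy record
GGCC
"""

# SeqIO.parse accepts any text handle, so StringIO works like an open file.
records = list(SeqIO.parse(StringIO(fasta_text), "fasta"))
for seq_record in records:
    print("id:", seq_record.id)
    print("description:", seq_record.description)
    print("seq:", seq_record.seq)
```

For FASTA input, the id is the first word of the header line and the description is the whole header line without the leading ">".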

Page 6: Sequence File Parsing using Biopython

Biopython: Other formats

GenBank format:
- From NCBI, also the format for RefSeq sequences

UniProt/SwissProt flat-file format:
- From UniProt for SwissProt and TrEMBL

UniProt-XML format:
- From UniProt for SwissProt and TrEMBL

Use the gzip module to handle compressed sequence databases.

Page 7: Sequence File Parsing using Biopython

Biopython: Bio.SeqIO

import Bio.SeqIO
import sys

# Check the input
if len(sys.argv) < 2:
    print >>sys.stderr, "Please provide a sequence file"
    sys.exit(1)

# Get the sequence filename
seqfilename = sys.argv[1]

# Open the GenBank file and iterate through its sequences
seqfile = open(seqfilename)
for seq_record in Bio.SeqIO.parse(seqfile, "genbank"):
    # Print out the various elements of the SeqRecord
    print "\n------NEW SEQRECORD------\n"
    print "seq_record.id:\n\t", seq_record.id
    print "seq_record.description:\n\t", seq_record.description
    print "seq_record.seq:\n\t", seq_record.seq
seqfile.close()

Page 8: Sequence File Parsing using Biopython

Biopython: Bio.SeqIO

import Bio.SeqIO
import sys

# Check the input
if len(sys.argv) < 2:
    print >>sys.stderr, "Please provide a sequence file"
    sys.exit(1)

# Get the sequence filename
seqfilename = sys.argv[1]

# Open the SwissProt flat file and iterate through its sequences
seqfile = open(seqfilename)
for seq_record in Bio.SeqIO.parse(seqfile, "swiss"):
    # Print out the various elements of the SeqRecord
    print "\n------NEW SEQRECORD------\n"
    print "seq_record.id:\n\t", seq_record.id
    print "seq_record.description:\n\t", seq_record.description
    print "seq_record.seq:\n\t", seq_record.seq
seqfile.close()

Page 9: Sequence File Parsing using Biopython

Biopython: Bio.SeqIO

import Bio.SeqIO
import sys

# Check the input
if len(sys.argv) < 2:
    print >>sys.stderr, "Please provide a sequence file"
    sys.exit(1)

# Get the sequence filename
seqfilename = sys.argv[1]

# Open the UniProt-XML file and iterate through its sequences
seqfile = open(seqfilename)
for seq_record in Bio.SeqIO.parse(seqfile, "uniprot-xml"):
    # Print out the various elements of the SeqRecord
    print "\n------NEW SEQRECORD------\n"
    print "seq_record.id:\n\t", seq_record.id
    print "seq_record.description:\n\t", seq_record.description
    print "seq_record.seq:\n\t", seq_record.seq
seqfile.close()

Page 10: Sequence File Parsing using Biopython

Biopython: Bio.SeqIO and gzip

import Bio.SeqIO
import sys
import gzip

# Check the input
if len(sys.argv) < 2:
    print >>sys.stderr, "Please provide a sequence file"
    sys.exit(1)

# Get the sequence filename
seqfilename = sys.argv[1]

# Open the gzip-compressed FASTA file and iterate through its sequences
seqfile = gzip.open(seqfilename)
for seq_record in Bio.SeqIO.parse(seqfile, "fasta"):
    # Print out the various elements of the SeqRecord
    print "\n------NEW SEQRECORD------\n"
    print "seq_record.id:\n\t", seq_record.id
    print "seq_record.description:\n\t", seq_record.description
    print "seq_record.seq:\n\t", seq_record.seq
seqfile.close()
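One detail worth knowing when porting this slide to Python 3: gzip.open defaults to binary mode there, so a text-based format such as FASTA needs the file opened with mode "rt". The sketch below uses only the standard library; the helper name open_sequence_file and the toy record are hypothetical, and the handle it returns is the kind of object one would expect to pass on to Bio.SeqIO.parse(handle, "fasta").

```python
import gzip
import os
import tempfile


def open_sequence_file(filename):
    """Open a plain or gzip-compressed file as a text-mode handle."""
    if filename.endswith(".gz"):
        # In Python 3, gzip.open defaults to binary mode; "rt" requests text.
        return gzip.open(filename, "rt")
    return open(filename)


# Round-trip a tiny, made-up FASTA record through a temporary .gz file.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "toy.fasta.gz")
with gzip.open(path, "wt") as out:
    out.write(">toy1 made-up record\nACGT\n")

with open_sequence_file(path) as handle:
    lines = handle.read().splitlines()
print(lines)

# Clean up the temporary file and directory.
os.remove(path)
os.rmdir(tmpdir)
```

Choosing the opener from the file extension keeps the parsing loop itself identical for compressed and uncompressed input.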

Page 11: Sequence File Parsing using Biopython

What about the other "stuff"?

Biopython makes it easy to get access to non-sequence information stored in "rich" sequence databases:
- Annotations
- Cross-references
- Sequence features
- Literature

Page 12: Sequence File Parsing using Biopython

Biopython: Bio.SeqIO

import Bio.SeqIO
import sys
import gzip

# Check the input
if len(sys.argv) < 2:
    print >>sys.stderr, "Please provide a sequence file"
    sys.exit(1)

# Get the sequence filename
seqfilename = sys.argv[1]

# Open the gzip-compressed UniProt-XML file and inspect its first sequence record
seqfile = gzip.open(seqfilename)
for seq_record in Bio.SeqIO.parse(seqfile, "uniprot-xml"):
    # What else is available in the SeqRecord?
    print "\n------NEW SEQRECORD------\n"
    print "repr(seq_record)\n\t", repr(seq_record)
    print "dir(seq_record)\n\t", dir(seq_record)
    break
seqfile.close()

Page 13: Sequence File Parsing using Biopython

Biopython: Bio.SeqRecord

import Bio.SeqIO
import sys
import gzip

# Check the input
if len(sys.argv) < 2:
    print >>sys.stderr, "Please provide a sequence file"
    sys.exit(1)

# Get the sequence filename
seqfilename = sys.argv[1]

# Open the gzip-compressed UniProt-XML file and inspect its first sequence record
seqfile = gzip.open(seqfilename)
for seq_record in Bio.SeqIO.parse(seqfile, "uniprot-xml"):
    # Print out the non-sequence elements of the SeqRecord
    print "\n------NEW SEQRECORD------\n"
    print "seq_record.annotations\n\t", seq_record.annotations
    print "seq_record.features\n\t", seq_record.features
    print "seq_record.dbxrefs\n\t", seq_record.dbxrefs
    print "seq_record.format('fasta')\n", seq_record.format('fasta')
    break
seqfile.close()

Page 14: Sequence File Parsing using Biopython

Biopython: Random access

Sometimes you want to access the sequence records "randomly"…
- …to pick out the ones you want (by accession)

Why not make a dictionary, with accessions as keys and SeqRecords as values?
- Use SeqIO.to_dict(…)

What if you don't want to hold it all in memory?
- Use SeqIO.index(…)

Page 15: Sequence File Parsing using Biopython

Biopython: Bio.SeqIO.to_dict(…)

import Bio.SeqIO
import sys

# Check the input
if len(sys.argv) < 2:
    print >>sys.stderr, "Please provide a sequence file"
    sys.exit(1)

# Get the sequence filename
seqfilename = sys.argv[1]

# Open the sequence database
seqfile = open(seqfilename)

# Use to_dict to make a dictionary of sequence records
sprot_dict = Bio.SeqIO.to_dict(Bio.SeqIO.parse(seqfile, "uniprot-xml"))

# Close the file
seqfile.close()

# Access and print a sequence record
print sprot_dict['Q6GZV8']

Page 16: Sequence File Parsing using Biopython

Biopython: Bio.SeqIO.index(…)

import Bio.SeqIO
import sys

# Check the input
if len(sys.argv) < 2:
    print >>sys.stderr, "Please provide a sequence file"
    sys.exit(1)

# Get the sequence filename
seqfilename = sys.argv[1]

# Use index to make an out-of-core dict of sequence records
sprot_index = Bio.SeqIO.index(seqfilename, "uniprot-xml")

# Access and print a sequence record
print sprot_index['Q6GZV8']
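To picture what an "out-of-core" dictionary means, the sketch below records only the byte offset of each record in a FASTA file and reads a record from disk when it is requested. It is a conceptual illustration using the standard library in Python 3, not Biopython's actual implementation; the function names build_offset_index and fetch_record and the toy records are hypothetical.

```python
import os
import tempfile


def build_offset_index(filename):
    """Map each record's accession to the byte offset of its header line."""
    offsets = {}
    with open(filename, "rb") as handle:
        while True:
            position = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                fields = line[1:].split()
                if fields:
                    # Only the small (accession, offset) pair is kept in memory.
                    offsets[fields[0].decode("ascii")] = position
    return offsets


def fetch_record(filename, offsets, accession):
    """Seek to one record and read only its lines."""
    lines = []
    with open(filename, "rb") as handle:
        handle.seek(offsets[accession])
        lines.append(handle.readline())
        for line in iter(handle.readline, b""):
            if line.startswith(b">"):
                break
            lines.append(line)
    return b"".join(lines).decode("ascii")


# Toy FASTA file with made-up records.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "toy.fasta")
with open(path, "wb") as out:
    out.write(b">a1 first\nACGT\nAC\n>b2 second\nGGCC\n")

index = build_offset_index(path)
record_b2 = fetch_record(path, index, "b2")
print(index)
print(record_b2)

# Clean up the temporary file and directory.
os.remove(path)
os.rmdir(tmpdir)
```

The trade-off is the one the slides describe: a full dictionary of records gives the fastest lookups but holds everything in memory, while an offset-style index keeps memory use small at the cost of a disk read per lookup.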

Page 17: Sequence File Parsing using Biopython

Exercises

1. Read through and try the examples from Chapters 2-5 of Biopython's Tutorial.
2. Download human proteins from RefSeq and compute amino-acid frequencies for the (RefSeq) human proteome. Which amino acid occurs the most? The least?
   Hint: access RefSeq human proteins from ftp://ftp.ncbi.nih.gov/refseq
3. Download human proteins from SwissProt and compute amino-acid frequencies for the SwissProt human proteome. Which amino acid occurs the most? The least?
   Hint: access SwissProt human proteins from http://www.uniprot.org/downloads -> "Taxonomic divisions"
4. How similar are the human amino-acid frequencies in RefSeq and SwissProt?
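As a starting point for the frequency exercises, the counting step might look like the sketch below, written in Python 3 with only the standard library. The function name amino_acid_frequencies and the two toy sequences are made up; a real run would instead feed in the sequence of every record parsed from the downloaded RefSeq or SwissProt file.

```python
from collections import Counter


def amino_acid_frequencies(sequences):
    """Return {residue: fraction} pooled over all sequences."""
    counts = Counter()
    for sequence in sequences:
        # str() lets the function accept plain strings or sequence objects.
        counts.update(str(sequence).upper())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {residue: count / total for residue, count in counts.items()}


# Made-up toy sequences standing in for a proteome.
toy_sequences = ["MKLLA", "MALK"]
frequencies = amino_acid_frequencies(toy_sequences)
for residue, fraction in sorted(frequencies.items(), key=lambda item: -item[1]):
    print(residue, round(fraction, 3))
```

Pooling counts over all sequences weights long proteins more heavily than short ones, which is usually what "frequency in the proteome" is taken to mean.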