Short-reads Custom Tracks - Hannon Laboratory
cancan.cshl.edu/labmembers/gordon/files/viz2.pdf
Contents
I Visualization
1 BED/Interval files
2 SAM files
3 PSLX (blat) files
4 Coverage
4.1 Coverage of BED files
4.2 Coverage of SAM/BAM files
4.3 Coverage by Strand
4.4 Coverage by Exons
II Genome Browser Custom Tracks
5 Uploading tracks using FTP
6 Using CSHL’s local Genome Browser server (http://foxtrot.cshl.edu)
7 Track display options
III Technical Details
8 Formatting conventions
9 [CHROM_SIZE] file
10 BlueHelix setup
11 Direct MySQL access
12 Programs reference
13 Compiling programs from source code
IV Troubleshooting
Introduction
Visualizing a small number of intervals (up to 1,000,000 intervals, or files smaller than 50MB) is easily done by simply uploading the file to the UCSC Genome Browser. Visualizing a large number of reads presents some technical difficulties.
This document shows how to visualize large files of millions of short-reads (long reads will work just as well).
• Text in fixed-font shows unix commands. See section 8 for more details.
• [CHROM_SIZE] is a text file containing the names and sizes of chromosomes. See section 9 for more details.
• All programs mentioned here are available on BlueHelix. See section 10 for more details.
Part I
Visualization
The general method is the same for all file formats:
1. Convert your input files into one of several ’common’ textual formats (BED, BedGraph, Wiggle, SAM).
2. Convert the BED/BedGraph/Wiggle/SAM files into a binary file (BigBed, BigWig, BAM).
3. Add a Custom Track in the UCSC Genome Browser, pointing to the binary files:
• With the public UCSC Genome Browser: Upload the binary files to an FTP site, and point the custom track to the correct URL. See section 5 for FTP usage.
• With the CSHL Mirror Genome Browser: Put the binary files on BlueHelix, and point the custom track to the correct path. See section 6 for BlueHelix usage.
1 BED/Interval files
Task
Display a BED file:
chr2L 13774500 13774548 seq-1 0 +
chr2L 17984104 17984148 seq-2 0 +
chr2L 13675851 13675900 seq-3 0 +
chr2L 18884003 18884049 seq-4 0 +
chr2L 3358603 3358646 seq-5 0 +
As genomic intervals on the UCSC Genome Browser:
[Figure: UCSC Genome Browser view (50-base scale, positions 816850-816960) with the dummy.bb custom track showing each read (seq-375826, seq-457175 and others) as an interval, above the Gap Locations, FlyBase Protein-Coding Genes, FlyBase Noncoding Genes, RefSeq Genes and D. melanogaster mRNAs from GenBank tracks.]
Solution
1. Sort the input file by chromosome name and start position (the input file must be sorted to be converted into a binary bigBed file).
2. Create a binary bigBed file from the BED file.
Commands:
$ sort -k1,1 -k2,2n < [SAMPLE.BED] > [SAMPLE.SORTED.BED]
$ bedToBigBed [SAMPLE.SORTED.BED] [CHROM_SIZE] [SAMPLE.BB]
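Because the conversion depends on the sort order, it can help to verify the order before running bedToBigBed. The following is a minimal sketch using only the standard sort command; the file names and the two sample lines are made up for illustration.

```shell
# Work in a scratch directory so no real files are touched.
cd "$(mktemp -d)"

# Hypothetical two-line BED file, deliberately out of order.
printf 'chr2L\t17984104\t17984148\tseq-2\t0\t+\n' >  sample.bed
printf 'chr2L\t13774500\t13774548\tseq-1\t0\t+\n' >> sample.bed

# Same keys as above: chromosome name, then numeric start position.
sort -k1,1 -k2,2n < sample.bed > sample.sorted.bed

# "sort -c" only checks the order; it exits non-zero if the file is unsorted.
if sort -c -k1,1 -k2,2n sample.sorted.bed; then
    echo "sample.sorted.bed is sorted"
else
    echo "sample.sorted.bed is NOT sorted"
fi
```

Running the same check on the unsorted sample.bed reports the first out-of-order line instead.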
The file [SAMPLE.BB] can be uploaded to the UCSC genome browser as a custom track:
[Figure: the UCSC Genome Browser "Add Custom Tracks" page (clade: Insect, genome: D. melanogaster, assembly: Apr. 2006 (BDGP R5/dm3)), with the following line in the "Paste URLs or data" box:]
track name="BigBed Track" type=bigBed bigDataUrl=http://myserver.edu/sample.bb
2 SAM files
Task
Display a SAM file [1]:
ZAPHOD_FC42T13AAXX:9:1:503:868 0 chr3L 3390 255 44M * 0 0 ACATATTGTGCTAATGAGTGCCTCTCGTTCTCTGTCTTATATTA
ZAPHOD_FC42T13AAXX:9:1:877:655 16 chr3L 11651753 255 43M * 0 0 TAATATAAGACAGAGAACGANAGGCACTCATTAGCACAATATG
ZAPHOD_FC42T13AAXX:9:1:839:364 16 chr3L 12316404 255 44M * 0 0 TAATATAAGACAGAGAACGAGAGGCACTCANTAGCACAATATGT
ZAPHOD_FC42T13AAXX:9:1:125:213 0 chr2L 12315946 255 44M * 0 0 TAATATAAGACAGAGAACGAGNGGCACTCATTAGCACAATATGT
As a custom track in the UCSC Genome Browser:
[Figure: UCSC Genome Browser view (500-base scale) with the BAMTrack custom track above the UCSC Genes, Vertebrate Multiz Alignment & Conservation (44 Species), Placental Mammal Basewise Conservation by PhyloP and Multiz Alignments of 44 Vertebrates tracks.]
Solution
1. Convert the SAM file to a BAM file.
2. Sort the BAM file.
3. Create an index (BAI file) to accompany the BAM file.
Commands:
$ samtools view -S -b -o [SAMPLE.BAM] [SAMPLE.SAM]
# .BAM extension will be added automatically to the ’SAMPLE.SORTED’ file.
$ samtools sort [SAMPLE.BAM] [SAMPLE.SORTED]
# A new index file will be created with the same name and a .BAI extension
$ samtools index [SAMPLE.SORTED.BAM]
[1] SAM (Sequence Alignment/Map) format is a generic format for storing large nucleotide sequence alignments. See http://samtools.sourceforge.net/
The two files [SAMPLE.SORTED.BAM] and [SAMPLE.SORTED.BAM.BAI] can be uploaded to the UCSC genome browser as a custom track:
[Figure: the UCSC Genome Browser "Add Custom Tracks" page (clade: Insect, genome: D. melanogaster, assembly: Apr. 2006 (BDGP R5/dm3)), with the following line in the "Paste URLs or data" box:]
track name="BAMTrack" type=bam bigDataUrl=http://myserver.edu/sample.sorted.bam
3 PSLX (blat) files
Task
Display a PSL file (output of the BLAT program):
50 0 0 0 0 0 0 0 + 4-77 50 0 50 chrU 10049037 6678238 6678288
50 0 0 0 0 0 0 0 + 4-77 50 0 50 chrU 10049037 6680033 6680083
50 0 0 0 0 0 0 0 + 4-77 50 0 50 chrU 10049037 6878866 6878916
50 0 0 0 0 0 0 0 + 4-77 50 0 50 chrU 10049037 6879800 6879850
50 0 0 0 0 0 0 0 + 4-77 50 0 50 chrU 10049037 6880518 6880568
50 0 0 0 0 0 0 0 + 4-77 50 0 50 chrU 10049037 7817848 7817898
50 0 0 0 0 0 0 0 + 4-77 50 0 50 chrU 10049037 7818749 7818799
50 0 0 0 0 0 0 0 + 4-77 50 0 50 chrU 10049037 7860041 7860091
As a custom track in the UCSC Genome Browser:
[Figure: UCSC Genome Browser view (500-base scale, positions 688100-689200) with the "PSLX example" custom track showing each aligned sequence as an interval, above the FlyBase Protein-Coding Genes, FlyBase Noncoding Genes, RefSeq Genes and Repeating Elements by RepeatMasker tracks.]
Solution
1. Convert the PSLX file to a BED file.
2. Sort the BED file by chromosome/start position.
3. Convert the BED file to a BigBed file.
Commands:
$ pslToBed [SAMPLE.PSLX] [SAMPLE.BED]
$ sort -k1,1 -k2,2n < [SAMPLE.BED] > [SAMPLE.SORTED.BED]
$ bedToBigBed [SAMPLE.SORTED.BED] [CHROM_SIZE] [SAMPLE.BB]
The file [SAMPLE.BB] can be uploaded to the UCSC genome browser as a custom track:
[Figure: the UCSC Genome Browser "Add Custom Tracks" page (clade: Insect, genome: D. melanogaster, assembly: Apr. 2006 (BDGP R5/dm3)), with the following line in the "Paste URLs or data" box:]
track name="BigBed Track" type=bigBed bigDataUrl=http://myserver.edu/sample.bb
4 Coverage
4.1 Coverage of BED files
Task
Display the nucleotide coverage of a BED file:
chr2L 13774500 13774548 seq-1 0 +
chr2L 17984104 17984148 seq-2 0 +
chr2L 13675851 13675900 seq-3 0 +
chr2L 18884003 18884049 seq-4 0 +
chr2L 3358603 3358646 seq-5 0 +
chr2L 3212400 3212446 seq-6 0 +
As a custom Wiggle track in the UCSC Genome Browser:
[Figure: UCSC Genome Browser view (500-base scale, positions 816200-817000) with the Coverage (BigWig) custom track above the FlyBase Protein-Coding Genes, FlyBase Noncoding Genes and RefSeq Genes (CG3639) tracks.]
Solution
1. Sort the BED file (unlike for bedToBigBed, sorting by chromosome name is sufficient; there is no need to sort by start position).
2. Use genomeCoverageBed to calculate coverage over each genomic position.
3. Use bedGraphToBigWig to convert the textual BedGraph into a BigWig file.
Commands:
$ sort -k1,1 < [SAMPLE.BED] > [SAMPLE.SORTED.BED]
$ genomeCoverageBed -bg -i [SAMPLE.SORTED.BED] -g [CHROM_SIZE] > [SAMPLE.BEDGRAPH]
$ bedGraphToBigWig [SAMPLE.BEDGRAPH] [CHROM_SIZE] [SAMPLE.BW]
The [SAMPLE.BW] file is the BigWig file, which can be used as the custom track:
[Figure: the UCSC Genome Browser "Add Custom Tracks" page (clade: Insect, genome: D. melanogaster, assembly: Apr. 2006 (BDGP R5/dm3)), with the following line in the "Paste URLs or data" box:]
track name="Wiggle Track" type=bigWig bigDataUrl=http://myserver.edu/sample.bw
4.2 Coverage of SAM/BAM files
Task
Display the nucleotide coverage of a SAM (or BAM) file:
ZAPHOD_FC42T13AAXX:9:1:503:868 0 chr3L 3390 255 44M * 0 0 ACATATTGTGCTAATGAGTGCCTCTCGTTCTCTGTCTTATATTA
ZAPHOD_FC42T13AAXX:9:1:877:655 16 chr3L 11651753 255 43M * 0 0 TAATATAAGACAGAGAACGANAGGCACTCATTAGCACAATATG
ZAPHOD_FC42T13AAXX:9:1:839:364 16 chr3L 12316404 255 44M * 0 0 TAATATAAGACAGAGAACGAGAGGCACTCANTAGCACAATATGT
ZAPHOD_FC42T13AAXX:9:1:935:985 16 chr3L 11651753 255 43M * 0 0 TAATATAAGACAGAGAACGAGAGGCNCTCATTAGCACAATATG
ZAPHOD_FC42T13AAXX:9:1:125:213 16 chr2L 12315946 255 44M * 0 0 TAATATAAGACAGAGAACGAGNGGCACTCATTAGCACAATATGT
ZAPHOD_FC42T13AAXX:9:1:953:789 16 chr2L 11651753 255 44M * 0 0 TAATATAAGACAGAGAACGAGAGGCACTCATTNGCACAATATGT
ZAPHOD_FC42T13AAXX:9:1:31:108 16 chr2L 12315946 255 44M * 0 0 TAATATAAGACAGAGAACGAGAGGCACTCATTAGCACNATATGT
ZAPHOD_FC42T13AAXX:9:1:454:503 16 chr2L 11651254 255 45M * 0 0 GCTCTCTCTCTCTGTCGTGTATTGTCTTTNTGGGTTTGCGGTAAT
We want to view coverage of genomic positions as a custom track:
[Figure: UCSC Genome Browser view (500-base scale, positions 816200-817000) with the Coverage (BigWig) custom track above the FlyBase Protein-Coding Genes, FlyBase Noncoding Genes and RefSeq Genes (CG3639) tracks.]
Solution
1. Convert the SAM file to a BAM file (if needed)
2. Sort the BAM file.
3. Use genomeCoverageBed to calculate coverage over each genomic position.
4. Use bedGraphToBigWig to convert the textual BedGraph into a BigWig file.
Commands:
$ samtools view -S -b -o [SAMPLE.BAM] [SAMPLE.SAM]
$ samtools sort [SAMPLE.BAM] [SAMPLE.SORTED]
$ genomeCoverageBed -bg -ibam [SAMPLE.SORTED.BAM] \
    -g [CHROM_SIZE] > [SAMPLE.BEDGRAPH]
$ bedGraphToBigWig [SAMPLE.BEDGRAPH] [CHROM_SIZE] [SAMPLE.BW]
The [SAMPLE.BW] file is the BigWig file, which can be used as the custom track:
[Figure: the UCSC Genome Browser "Add Custom Tracks" page (clade: Insect, genome: D. melanogaster, assembly: Apr. 2006 (BDGP R5/dm3)), with the following line in the "Paste URLs or data" box:]
track name="Wiggle Track" type=bigWig bigDataUrl=http://myserver.edu/sample.bw
4.3 Coverage by Strand
Task
Display the coverage of intervals in a BED (or BAM) file:
chr2L 13774500 13774548 seq-1 0 +
chr2L 17984104 17984148 seq-2 0 -
chr2L 13675851 13675900 seq-3 0 +
chr2L 18884003 18884049 seq-4 0 -
chr2L 3358603 3358646 seq-5 0 +
As two tracks in the UCSC Genome browser - one for positive-strand reads, one for negative-strand reads:
[Figure: UCSC Genome Browser view (100-base scale, positions 19047600-19047950) with the positive_strand and negative_strand custom tracks above the RefSeq Genes track.]
Solution
1. Sort the BED file (sorting by chromosome name is sufficient)
2. Use genomeCoverageBed with the -strand option to calculate coverage of each strand.
3. For the negative strand track, use awk to negate the coverage values.
4. Use bedGraphToBigWig to create BigWig track for each strand.
Commands:
$ sort -k1,1 < [SAMPLE.BED] > [SAMPLE.SORTED.BED]
$ genomeCoverageBed -bg -i [SAMPLE.SORTED.BED] -g [CHROM_SIZE] -strand + \
    > [SAMPLE.POS.BEDGRAPH]
$ genomeCoverageBed -bg -i [SAMPLE.SORTED.BED] -g [CHROM_SIZE] -strand - | \
    awk '{ $4 = - $4 ; print $0 }' > [SAMPLE.NEG.BEDGRAPH]
$ bedGraphToBigWig [SAMPLE.POS.BEDGRAPH] [CHROM_SIZE] [SAMPLE.POS.BW]
$ bedGraphToBigWig [SAMPLE.NEG.BEDGRAPH] [CHROM_SIZE] [SAMPLE.NEG.BW]
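To see what the awk negation step does in isolation, here is a small sketch on a made-up two-line BedGraph file (the file names and values are illustrative only). Assigning to a field makes awk rebuild the line with its output separator, which defaults to a single space; this variant sets OFS to a tab so the output stays tab-separated. That is a deliberate difference from the command above, not a requirement stated by the tools.

```shell
# Work in a scratch directory so no real files are touched.
cd "$(mktemp -d)"

# Hypothetical negative-strand coverage: chrom, start, end, coverage.
printf 'chr2L\t100\t150\t3\n' >  neg.bedgraph
printf 'chr2L\t150\t200\t1\n' >> neg.bedgraph

# Negate column 4 so the track is drawn below the axis; keep tabs on output.
awk -v OFS='\t' '{ $4 = -$4; print }' neg.bedgraph > neg.negated.bedgraph

cat neg.negated.bedgraph
```

The coverage column becomes -3 and -1 while the coordinates are left unchanged.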
The two files ([SAMPLE.POS.BW] and [SAMPLE.NEG.BW]) can be used as custom tracks in the Genome Browser. A special parameter (color) will show each strand in a different color:
[Figure: the UCSC Genome Browser "Add Custom Tracks" page (clade: Insect, genome: D. melanogaster, assembly: Apr. 2006 (BDGP R5/dm3)), with the following lines in the "Paste URLs or data" box:]
track name="Positive Strand" color=0,0,255 type=bigWig bigDataUrl=http://myserver.edu/SAMPLE.POS.BW
track name="Negative Strand" color=255,0,0 type=bigWig bigDataUrl=http://myserver.edu/SAMPLE.NEG.BW
4.4 Coverage by Exons
Task
Display a BED (or BAM) file containing blocked intervals [2]:
chr2L 67753 67927 seq1 0 + 67753 67927 255,0,0 2 9,36 0,138
chr2L 107813 108589 seq2 0 - 107813 108589 255,0,0 2 25,2 0,774
chr2L 107813 108589 seq3 0 - 107813 108589 255,0,0 2 25,2 0,774
chr2L 108308 108591 seq4 0 + 108308 108591 255,0,0 2 38,4 0,279
chr2L 113350 113473 seq5 0 + 113350 113473 255,0,0 2 19,40 0,83
chr2L 118291 118362 seq6 0 - 118291 118362 255,0,0 2 13,2 0,69
chr2L 119550 119870 seq7 0 - 119550 119870 255,0,0 2 4,43 0,277
chr2L 119550 119870 seq8 0 - 119550 119870 255,0,0 2 4,43 0,277
chr2L 120070 120477 seq9 0 - 120070 120477 255,0,0 2 10,57 0,350
As a custom track in the UCSC Genome Browser (with only the exonic blocks of each interval displayed):
[Figure: UCSC Genome Browser view (1 kb scale, positions 229500-232500) with the Spliced Reads and Exon Coverage custom tracks above the RefSeq Genes track.]
Solution
1. Sort the BED file
2. Use genomeCoverageBed to calculate coverage of the exonic regions (with the -split option)
3. Use bedGraphToBigWig to create BigWig track.
Commands:
$ sort -k1,1 < [SAMPLE.BED] > [SAMPLE.SORTED.BED]
$ genomeCoverageBed -bg -i [SAMPLE.SORTED.BED] -g [CHROM_SIZE] \
    -split > [SAMPLE.BEDGRAPH]
$ bedGraphToBigWig [SAMPLE.BEDGRAPH] [CHROM_SIZE] [SAMPLE.BW]
The [SAMPLE.BW] file is the BigWig file, which can be used as the custom track.
[2] BED files with 12 columns, or SAM/BAM files which have CIGAR strings with N/D: the result of mapping spliced junctions.
[Figure: the UCSC Genome Browser "Add Custom Tracks" page (clade: Insect, genome: D. melanogaster, assembly: Apr. 2006 (BDGP R5/dm3)), with the following line in the "Paste URLs or data" box:]
track name="Wiggle Track" type=bigWig bigDataUrl=http://myserver.edu/sample.bw
Part II
Genome Browser Custom Tracks
Once you have a BigBed/BigWig/BAM custom track file, you need to load it into the Genome Browser server (by clicking on the "Add Custom Track" or "Manage Custom Tracks" buttons). This part explains how to load track files into the Genome Browser server, and how to set track display options.
5 Uploading tracks using FTP
An FTP (File Transfer Protocol) server is a computer that stores files and gives access to them from anywhere on the internet [3].
To use BigBed/BigWig/BAM custom tracks with the public UCSC Genome Browser, you’ll have to put the files on a public FTP server, and instruct the UCSC Genome Browser to read the files from the FTP server (unlike our local Genome Browser mirror server, which can read files directly from BlueHelix).
Using an FTP server is the easiest way to make those files public, but an HTTP server can also be used, if you know how to upload files to one (this document does not deal with HTTP servers).
Getting an FTP server account
• All CSHL members - All CSHL members can request an FTP account from the I.T. department. To request an FTP account, fill out this form: http://intranet.cshl.edu/it/requests/account_request.html . Put "FTP" in the field "...I would like to access the following server(s)".
• Hannon Lab members - Email [email protected] for an FTP account on ftp://cancan.cshl.edu .
• Other alternatives - Any public HTTP/FTP server will work just fine, if you have access to one.
Uploading a file to an FTP server
If the custom track file (BigWig, BigBed, BAM) is stored on your local computer (Mac/Windows), use one of the following friendly programs to upload the file to the FTP server:
• Cyberduck (for Mac OS)
• FileZilla (for Mac, Windows, Linux)
• Apple’s Classic FTP for Mac
• WinSCP (for Windows)
• and many many more...
If the custom track file is stored on BlueHelix, or if you prefer to use the command line FTP program, see the following commands as an example. Text in bold shows the commands you should type. Replace gordon with your FTP username. Replace dummy.bb with the file name of your custom track.
[3] This is a gross over-simplification, but it’ll do for now.
$ ftp -p ftp2.cshl.edu
Connected to ftp2.cshl.edu.
220 (vsFTPd 2.0.5)
Name (ftp2.cshl.edu:gordon): gordon
331 Please specify the password.
Password: TYPE PASSWORD AND PRESS ENTER
230 Login successful.
Remote system type is UNIX.
Using binary mode to transfer files.
ftp> bin
200 Switching to Binary mode
ftp> put dummy.bb
local: dummy.bb remote: dummy.bb
227 Entering Passive Mode (143,48,220,121,171,132)
150 Ok to send data.
226 File receive OK.
5937928 bytes sent in 0.56 secs (10444.5 kB/s)
ftp> quit
221 Goodbye.
Loading a Custom-Track from an FTP server
A URL (Uniform Resource Locator) is a way to locate files on the internet [4].
The syntax of the URL is ftp://USER:PASSWORD@SERVER/FILE. Assuming the following details:
FTP server: ftp2.cshl.edu
FTP Username: gordon
FTP password: 12345678
Custom Track file name: sample.bam
The full URL to access this file will be:
ftp://gordon:12345678@ftp2.cshl.edu/sample.bam
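The same URL can be assembled from its parts in the shell, which avoids typing mistakes when several tracks share one server. This is only a sketch: the variable names are made up, and the values are the example details listed above. A password containing characters such as @, : or / would need to be percent-encoded first, which this sketch does not attempt.

```shell
# Example values from the text above; the variable names are illustrative.
FTP_USER="gordon"
FTP_PASS="12345678"
FTP_SERVER="ftp2.cshl.edu"
TRACK_FILE="sample.bam"

# ftp://USER:PASSWORD@SERVER/FILE
TRACK_URL="ftp://${FTP_USER}:${FTP_PASS}@${FTP_SERVER}/${TRACK_FILE}"

# Print a ready-to-paste custom track line.
echo "track type=bam bigDataUrl=${TRACK_URL}"
```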
When adding a custom track in the public UCSC Genome Browser (http://genome.ucsc.edu), use the URL of the file with the bigDataUrl keyword, like so:
[Figure: the UCSC Genome Browser "Add Custom Tracks" page (clade: Insect, genome: D. melanogaster, assembly: Apr. 2006 (BDGP R5/dm3)), with the following line in the "Paste URLs or data" box:]
track type=bam bigDataUrl=ftp://gordon:12345678@ftp2.cshl.edu/sample.bam
If all went well, when you click "Submit" the new custom track will be added. If there was any error, see Part IV (Troubleshooting).
[4] Again, a gross over-simplification that will do for now. See ?? for a more accurate description.
6 Using CSHL’s local Genome Browser server (http://foxtrot.cshl.edu)
http://foxtrot.cshl.edu is our local mirror of the UCSC Genome Browser.
It contains several common organisms/builds (hg18, hg19, mm9, dm3, panTro2, strPur2) and several other custom builds.
Advantages of using our local server:
1. Faster track uploads (for BED/Wiggle files)
2. Sessions and custom tracks are saved for longer periods
3. BLAT with less stringent matching parameters (suitable for short-reads)
4. Can read custom tracks directly from BlueHelix storage (no need to upload files to an HTTP/FTP server). See below for details.
7 Track display options
See this short tutorial: http://tango.cshl.edu/compskills/gb_tutorial7.pdf .
Part III
Technical Details
8 Formatting conventions
Fixed-font sections (such as the one below) depict a unix session, as typed on a terminal. This will usually be on BlueHelix.
• lines starting with ’#’ are comments
• lines starting with ’$’ are unix shell commands. These should be typed by the user.
• other lines are the program output: they will be printed on the screen when the user executes the commands.
The following example shows a unix session, where the user runs the ls command (print file list):
# This is a comment. The next line shows executing the "ls" command
# followed by the output of the "ls" command (the four files).
$ ls
file1
file2
file3
file4
Where input or output files are involved, they will appear in UPPER CASE, surrounded by square brackets. These should be replaced by real file names when the command is executed by the user.
# The following command copies a file
# The command has no output - nothing is printed after the command is executed.
$ cp [INPUT.TXT] [OUTPUT.TXT]
9 [CHROM_SIZE] file
The programs bedClip, genomeCoverageBed, bedGraphToBigWig and bedToBigBed require a text file containing the names and sizes of each chromosome (for the organism/build used). The examples in this document use the [CHROM_SIZE] placeholder for this file.
On BlueHelix, files are available for the most common builds:
$ cd /data/hannon/gordon/databases/chrom_sizes
$ ls -l
total 128
-rw-r--r-- 1 gordon hannon 513 Mar 9 19:55 dm3_chromInfo.txt
-rw-r--r-- 1 gordon hannon 2229 Mar 9 19:54 hg18_chromInfo.txt
-rw-r--r-- 1 gordon hannon 3924 Mar 9 19:54 hg19_chromInfo.txt
-rw-r--r-- 1 gordon hannon 1249 Mar 9 19:54 mm9_chromInfo.txt
Each file contains three columns: chromosome, size, file (the file column can be safely ignored):
$ cat dm3_chromInfo.txt
chr2L 23011544 /gbdb/dm3/dm3.2bit
chr2LHet 368872 /gbdb/dm3/dm3.2bit
chr2R 21146708 /gbdb/dm3/dm3.2bit
chr2RHet 3288761 /gbdb/dm3/dm3.2bit
chr3L 24543557 /gbdb/dm3/dm3.2bit
chr3LHet 2555491 /gbdb/dm3/dm3.2bit
chr3R 27905053 /gbdb/dm3/dm3.2bit
chr3RHet 2517507 /gbdb/dm3/dm3.2bit
chr4 1351857 /gbdb/dm3/dm3.2bit
chrU 10049037 /gbdb/dm3/dm3.2bit
chrUextra 29004656 /gbdb/dm3/dm3.2bit
chrX 22422827 /gbdb/dm3/dm3.2bit
chrXHet 204112 /gbdb/dm3/dm3.2bit
chrYHet 347038 /gbdb/dm3/dm3.2bit
chrM 19517 /gbdb/dm3/dm3.2bit
Files for every organism/build available on the UCSC Genome Browser can be downloaded from:
http://hgdownload.cse.ucsc.edu/goldenPath/ORG/database/chromInfo.txt.gz
Example (for hg18):
http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/chromInfo.txt.gz
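Since the third column can be ignored, it can also be dropped to leave a plain two-column name/size file. The sketch below assumes the file is tab-separated (as UCSC database dumps are) and uses a made-up two-line excerpt; the output name chrom.sizes is arbitrary. For a downloaded chromInfo.txt.gz, the same cut command can be fed from gunzip -c.

```shell
# Work in a scratch directory so no real files are touched.
cd "$(mktemp -d)"

# Two-line excerpt in chromInfo format: chromosome, size, file.
printf 'chr4\t1351857\t/gbdb/dm3/dm3.2bit\n' >  chromInfo.txt
printf 'chrM\t19517\t/gbdb/dm3/dm3.2bit\n'   >> chromInfo.txt

# Keep only the chromosome name and size columns.
cut -f1,2 chromInfo.txt > chrom.sizes

cat chrom.sizes
```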
10 BlueHelix setup
On BlueHelix, the relevant programs are available in:
/data/hannon/gordon/ucsc_genome_browser/bin
A required library (libmysqlclient.so) is available here:
/data/hannon/gordon/usr/lib/mysql/
When using BASH, run the following commands:
export PATH=$PATH:/data/hannon/gordon/ucsc_genome_browser/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/data/hannon/gordon/usr/lib/mysql
When using TCSH, run the following commands:
# The braces keep tcsh from reading ":/" as a variable modifier.
setenv PATH ${PATH}:/data/hannon/gordon/ucsc_genome_browser/bin
setenv LD_LIBRARY_PATH ${LD_LIBRARY_PATH}:/data/hannon/gordon/usr/lib/mysql
TODO: make a friendly script (set agnostic)
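One possible shape for such a script is sketched below. It is for bash only (so it is not shell-agnostic), it is meant to be sourced rather than executed, and the helper name append_once and the two variable names are invented for this example. It appends each directory only if it is not already listed, so sourcing it twice does no harm.

```shell
# Directories from the text above; the variable names are illustrative.
HANNON_BIN=/data/hannon/gordon/ucsc_genome_browser/bin
HANNON_LIB=/data/hannon/gordon/usr/lib/mysql

# Print list $1 with directory $2 appended, unless $2 is already in the list.
append_once() {
    case ":$1:" in
        *":$2:"*) printf '%s\n' "$1" ;;          # already present: unchanged
        *)        printf '%s\n' "${1:+$1:}$2" ;; # append, no leading colon
    esac
}

PATH=$(append_once "$PATH" "$HANNON_BIN")
LD_LIBRARY_PATH=$(append_once "${LD_LIBRARY_PATH:-}" "$HANNON_LIB")
export PATH LD_LIBRARY_PATH
```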
11 Direct MySQL access
The UCSC Genome Browser allows direct access to the back-end MySQL database containing all the annotation tracks (see http://genome.ucsc.edu/FAQ/FAQdownloads.html#download29 for more details).
Our local mirror (http://foxtrot.cshl.edu) contains the same annotation tables for several common builds (mainly: hg18, hg19, mm9, dm3, panTro2). Contact [email protected] to set up direct access to the database server (could be faster than connecting to UCSC’s MySQL server).
12 Programs reference
bedClip
bedToBigBed
bedGraphToBigWig
genomeCoverageBed
samtools
gb custom track line
13 Compiling programs from source code
Jim Kent’s Tools
Don’t.
Download the pre-compiled binaries from http://genome-test.cse.ucsc.edu/~kent/exe/.
If you insist on building it from source, you’ll find it on BlueHelix:
/home/hannon/gordon/source/kent_genome_browser_source/kent
And the build instructions here: http://genome.ucsc.edu/admin/jk-install.html.
If you have an I.T.-managed server with CentOS 5.4 and Linux kernel 2.6.18, send me an email and I can send you the compiled binaries for that platform.
samtools
The source code for samtools v0.1.7a is on BlueHelix:
/home/hannon/gordon/source/samtools-0.1.7a
Or on the official web site: http://samtools.sourceforge.net/
bedtools
The examples in this document require a patched version of Aaron Quinlan’s BEDTools package, available on BlueHelix:
/home/hannon/gordon/source/BEDTools_bedgraph
The official web site: http://code.google.com/p/bedtools/
Future versions (probably 2.5.5) might incorporate these patches.
Part IV
Troubleshooting
SAM no header
$ samtools view -S -b dummy.sam
[samopen] no @SQ lines in the header.
[sam_read1] missing header? Abort!
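The messages indicate that the SAM file has no @SQ header lines (one per chromosome, giving its name and length). One way to supply them, sketched here under the assumption that a chromInfo-style file for the right build is at hand, is to generate the @SQ lines with awk and prepend them to the SAM file. The file names and the single stand-in alignment line are made up for illustration. samtools view also has a -t option that takes a list of reference names and lengths; check the usage message of your samtools version before relying on it.

```shell
# Work in a scratch directory so no real files are touched.
cd "$(mktemp -d)"

# Two-line excerpt in chromInfo format: chromosome, size, file.
printf 'chr2L\t23011544\t/gbdb/dm3/dm3.2bit\n' >  chromInfo.txt
printf 'chr3L\t24543557\t/gbdb/dm3/dm3.2bit\n' >> chromInfo.txt

# Stand-in for a headerless SAM file (one alignment line).
printf 'read1\t0\tchr3L\t3390\t255\t4M\t*\t0\t0\tACAT\t*\n' > dummy.sam

# Build one @SQ header line per chromosome: @SQ, SN:<name>, LN:<length>.
awk -v OFS='\t' '{ print "@SQ", "SN:" $1, "LN:" $2 }' chromInfo.txt > header.sam

# Header first, then the original alignments.
cat header.sam dummy.sam > dummy.with_header.sam
```

The resulting dummy.with_header.sam can then be passed to samtools view -S -b in place of the original file.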
Errors with FTP and custom tracks
ftp server response timed out > 1000000 microsec - wrong password
Error Couldn’t find host ccan.cshl.edu. h_errno 1 - bad server name
Error ftp server error on cmd=[SIZE /end221.bb ] response=[550 Could not get file size. ] - wrong file name
Error Missing bigDataUrl setting from track of type=bigBed - multiline track file.