Short-reads Custom Tracks

Assaf Gordon
gordon@cshl.edu

Hannon Lab
CSHL

July 8, 2010


Contents

I Visualization

1 BED/Interval files
2 SAM files
3 PSLX (blat) files
4 Coverage
4.1 Coverage of BED files
4.2 Coverage of SAM/BAM files
4.3 Coverage by Strand
4.4 Coverage by Exons

II Genome Browser Custom Tracks

5 Uploading tracks using FTP
6 Using CSHL’s local Genome Browser server (http://foxtrot.cshl.edu)
7 Track display options

III Technical Details

8 Formatting conventions
9 [CHROM_SIZE] file
10 Bluehelix setup
11 direct MySQL access
12 Programs reference
13 Compiling programs from source code

IV Troubleshooting


Introduction

Visualizing a small number of intervals (up to 1,000,000 intervals, or files smaller than 50MB) is easily done by simply uploading the file to the UCSC Genome Browser. Visualizing a large number of reads presents some technical difficulties.

This document shows how to visualize large files of millions of short-reads (long reads will work just as well).

• Text in fixed-font shows unix commands. See section 8 for more details.

• [CHROM_SIZE] is a text file containing the names and sizes of chromosomes. See section 9 for more details.

• All programs mentioned here are available on BlueHelix. See section 10 for more details.


Part I

Visualization

The general method is the same for all file formats:

1. Convert your input files into one of several ’common’ textual formats (BED, BedGraph, Wiggle, SAM).

2. Convert the BED/BedGraph/Wiggle/SAM files into a binary file (BigBed, BigWig, BAM).

3. Add a Custom Track in the UCSC Genome Browser, pointing to the binary files:

• With the public UCSC Genome Browser: Upload the binary files to an FTP site, and point the custom track to the correct URL. See section 5 for FTP usage.

• With the CSHL Mirror Genome Browser: Put the binary files on BlueHelix, and point the custom track to the correct path. See section 6 for BlueHelix usage.


1 BED/Interval files

Task

Display a BED file:

chr2L 13774500 13774548 seq-1 0 +
chr2L 17984104 17984148 seq-2 0 +
chr2L 13675851 13675900 seq-3 0 +
chr2L 18884003 18884049 seq-4 0 +
chr2L 3358603 3358646 seq-5 0 +

As genomic intervals on the UCSC Genome Browser:

[Figure: UCSC Genome Browser view (50-base scale) of the custom track dummy.bb, with individual reads drawn as labeled intervals above the Gap Locations, FlyBase Protein-Coding Genes, FlyBase Noncoding Genes, RefSeq Genes and D. melanogaster mRNAs from GenBank tracks.]

Solution

1. Sort the input file by chromosome name and start position (the input file must be sorted before it can be converted into a binary bigBed file).

2. Create a binary bigBed file from the BED file.

Commands:

$ sort -k1,1 -k2,2n < [SAMPLE.BED] > [SAMPLE.SORTED.BED]
$ bedToBigBed [SAMPLE.SORTED.BED] [CHROM_SIZE] [SAMPLE.BB]
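Because bedToBigBed requires sorted input, it can be worth checking the sort order before converting. The following is a minimal sketch, not part of the original workflow: bed_is_sorted is a hypothetical helper name, the temporary file holds the five sample intervals shown above, and LC_ALL=C is an assumption added here so that chromosome names are compared byte by byte regardless of the locale.

```shell
# Hypothetical helper (not one of the UCSC tools): succeeds only if the BED
# file is already sorted by chromosome name, then numerically by start.
bed_is_sorted() {
    LC_ALL=C sort -c -k1,1 -k2,2n "$1" 2>/dev/null
}

# Throw-away copy of the five sample intervals (tab-separated).
tmpdir=$(mktemp -d)
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
    chr2L 13774500 13774548 seq-1 0 + \
    chr2L 17984104 17984148 seq-2 0 + \
    chr2L 13675851 13675900 seq-3 0 + \
    chr2L 18884003 18884049 seq-4 0 + \
    chr2L 3358603 3358646 seq-5 0 + > "$tmpdir/sample.bed"

# Same sort keys as in the document's command.
LC_ALL=C sort -k1,1 -k2,2n < "$tmpdir/sample.bed" > "$tmpdir/sample.sorted.bed"

bed_is_sorted "$tmpdir/sample.bed"        && unsorted_ok=yes || unsorted_ok=no
bed_is_sorted "$tmpdir/sample.sorted.bed" && sorted_ok=yes   || sorted_ok=no
first_name=$(head -n 1 "$tmpdir/sample.sorted.bed" | cut -f4)

echo "unsorted input passes the check: $unsorted_ok"
echo "sorted output passes the check:  $sorted_ok"
echo "first interval after sorting:    $first_name"
rm -rf "$tmpdir"
```

Running the check on [SAMPLE.SORTED.BED] before calling bedToBigBed gives an early failure on a small text file rather than an error in the middle of a long conversion.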

The file [SAMPLE.BB] can be uploaded to the UCSC genome browser as a custom track:

[Screenshot: the UCSC Genome Browser "Add Custom Tracks" page (clade: Insect, genome: D. melanogaster, assembly: Apr. 2006 (BDGP R5/dm3)), with the following line pasted into the "Paste URLs or data" box:]

track name="BigBed Track" type=bigBed bigDataUrl=http://myserver.edu/sample.bb


2 SAM files

Task

Display a SAM file [1]:

ZAPHOD_FC42T13AAXX:9:1:503:868 0 chr3L 3390 255 44M * 0 0 ACATATTGTGCTAATGAGTGCCTCTCGTTCTCTGTCTTATATTA
ZAPHOD_FC42T13AAXX:9:1:877:655 16 chr3L 11651753 255 43M * 0 0 TAATATAAGACAGAGAACGANAGGCACTCATTAGCACAATATG
ZAPHOD_FC42T13AAXX:9:1:839:364 16 chr3L 12316404 255 44M * 0 0 TAATATAAGACAGAGAACGAGAGGCACTCANTAGCACAATATGT
ZAPHOD_FC42T13AAXX:9:1:125:213 0 chr2L 12315946 255 44M * 0 0 TAATATAAGACAGAGAACGAGNGGCACTCATTAGCACAATATGT

As a custom track in the UCSC Genome Browser:

[Figure: UCSC Genome Browser view (500-base scale) of the BAMTrack custom track, above the UCSC Genes, Vertebrate Multiz Alignment & Conservation (44 Species) and Placental Mammal Basewise Conservation by PhyloP tracks.]

Solution

1. Convert the SAM file to a BAM file.

2. Sort the BAM file.

3. Create an index (BAI file) to accompany the BAM file.

Commands:

$ samtools view -S -b -o [SAMPLE.BAM] [SAMPLE.SAM]

# .BAM extension will be added automatically to the ’SAMPLE.SORTED’ file.
$ samtools sort [SAMPLE.BAM] [SAMPLE.SORTED]

# A new index file will be created with the same name and a .BAI extension
$ samtools index [SAMPLE.SORTED.BAM]

[1] SAM (Sequence Alignment/Map) format is a generic format for storing large nucleotide sequence alignments. See http://samtools.sourceforge.net/


The two files [SAMPLE.SORTED.BAM] and [SAMPLE.SORTED.BAM.BAI] can be uploaded to the UCSC genome browser as a custom track:

[Screenshot: the UCSC Genome Browser "Add Custom Tracks" page (clade: Insect, genome: D. melanogaster, assembly: Apr. 2006 (BDGP R5/dm3)), with the following line pasted into the "Paste URLs or data" box:]

track name="BAMTrack" type=bam bigDataUrl=http://myserver.edu/sample.sorted.bam


3 PSLX (blat) files

Task

Display a PSL file (output of the BLAT program):

50 0 0 0 0 0 0 0 + 4-77 50 0 50 chrU 10049037 6678238 6678288
50 0 0 0 0 0 0 0 + 4-77 50 0 50 chrU 10049037 6680033 6680083
50 0 0 0 0 0 0 0 + 4-77 50 0 50 chrU 10049037 6878866 6878916
50 0 0 0 0 0 0 0 + 4-77 50 0 50 chrU 10049037 6879800 6879850
50 0 0 0 0 0 0 0 + 4-77 50 0 50 chrU 10049037 6880518 6880568
50 0 0 0 0 0 0 0 + 4-77 50 0 50 chrU 10049037 7817848 7817898
50 0 0 0 0 0 0 0 + 4-77 50 0 50 chrU 10049037 7818749 7818799
50 0 0 0 0 0 0 0 + 4-77 50 0 50 chrU 10049037 7860041 7860091

As a custom track in the UCSC Genome Browser:

[Figure: UCSC Genome Browser view (500-base scale) of the "PSLX example" custom track, with the BLAT hits drawn as labeled intervals above the FlyBase Protein-Coding Genes, FlyBase Noncoding Genes, RefSeq Genes and Repeating Elements by RepeatMasker tracks.]

Solution

1. Convert the PSLX file to a BED file.

2. Sort the BED file by chromosome/start position.

3. Convert the BED file to a BigBed file.

Commands:

$ pslToBed [SAMPLE.PSLX] [SAMPLE.BED]
$ sort -k1,1 -k2,2n < [SAMPLE.BED] > [SAMPLE.SORTED.BED]
$ bedToBigBed [SAMPLE.SORTED.BED] [CHROM_SIZE] [SAMPLE.BB]

The file [SAMPLE.BB] can be uploaded to the UCSC genome browser as a custom track:

[Screenshot: the UCSC Genome Browser "Add Custom Tracks" page (clade: Insect, genome: D. melanogaster, assembly: Apr. 2006 (BDGP R5/dm3)), with the following line pasted into the "Paste URLs or data" box:]

track name="BigBed Track" type=bigBed bigDataUrl=http://myserver.edu/sample.bb


4 Coverage

4.1 Coverage of BED files

Task

Display the nucleotide coverage of a BED file:

chr2L 13774500 13774548 seq-1 0 +
chr2L 17984104 17984148 seq-2 0 +
chr2L 13675851 13675900 seq-3 0 +
chr2L 18884003 18884049 seq-4 0 +
chr2L 3358603 3358646 seq-5 0 +
chr2L 3212400 3212446 seq-6 0 +

As a custom Wiggle track in the UCSC Genome Browser:

[Figure: UCSC Genome Browser view (500-base scale) of the "Coverage (BigWig)" custom track, above the FlyBase Protein-Coding Genes, FlyBase Noncoding Genes and RefSeq Genes (CG3639) tracks.]

Solution

1. Sort the BED file (unlike with bedToBigBed, sorting by chromosome name is sufficient; there is no need to sort by start position).

2. Use genomeCoverageBed to calculate coverage over each genomic position.

3. Use bedGraphToBigWig to convert the textual BedGraph into a BigWig file.

Commands:

$ sort -k1,1 < [SAMPLE.BED] > [SAMPLE.SORTED.BED]
$ genomeCoverageBed -bg -i [SAMPLE.SORTED.BED] -g [CHROM_SIZE] > [SAMPLE.BEDGRAPH]
$ bedGraphToBigWig [SAMPLE.BEDGRAPH] [CHROM_SIZE] [SAMPLE.BW]

The [SAMPLE.BW] file is the BigWig file, which can be used as the custom track:

[Screenshot: the UCSC Genome Browser "Add Custom Tracks" page (clade: Insect, genome: D. melanogaster, assembly: Apr. 2006 (BDGP R5/dm3)), with the following line pasted into the "Paste URLs or data" box:]

track name="Wiggle Track" type=bigWig bigDataUrl=http://myserver.edu/sample.bw


4.2 Coverage of SAM/BAM files

Task

Display the nucleotide coverage of a SAM (or BAM) file:

ZAPHOD_FC42T13AAXX:9:1:503:868 0 chr3L 3390 255 44M * 0 0 ACATATTGTGCTAATGAGTGCCTCTCGTTCTCTGTCTTATATTA
ZAPHOD_FC42T13AAXX:9:1:877:655 16 chr3L 11651753 255 43M * 0 0 TAATATAAGACAGAGAACGANAGGCACTCATTAGCACAATATG
ZAPHOD_FC42T13AAXX:9:1:839:364 16 chr3L 12316404 255 44M * 0 0 TAATATAAGACAGAGAACGAGAGGCACTCANTAGCACAATATGT
ZAPHOD_FC42T13AAXX:9:1:935:985 16 chr3L 11651753 255 43M * 0 0 TAATATAAGACAGAGAACGAGAGGCNCTCATTAGCACAATATG
ZAPHOD_FC42T13AAXX:9:1:125:213 16 chr2L 12315946 255 44M * 0 0 TAATATAAGACAGAGAACGAGNGGCACTCATTAGCACAATATGT
ZAPHOD_FC42T13AAXX:9:1:953:789 16 chr2L 11651753 255 44M * 0 0 TAATATAAGACAGAGAACGAGAGGCACTCATTNGCACAATATGT
ZAPHOD_FC42T13AAXX:9:1:31:108 16 chr2L 12315946 255 44M * 0 0 TAATATAAGACAGAGAACGAGAGGCACTCATTAGCACNATATGT
ZAPHOD_FC42T13AAXX:9:1:454:503 16 chr2L 11651254 255 45M * 0 0 GCTCTCTCTCTCTGTCGTGTATTGTCTTTNTGGGTTTGCGGTAAT

We want to view coverage of genomic positions as a custom track:

[Figure: UCSC Genome Browser view (500-base scale) of the "Coverage (BigWig)" custom track, above the FlyBase Protein-Coding Genes, FlyBase Noncoding Genes and RefSeq Genes (CG3639) tracks.]

Solution

1. Convert the SAM file to a BAM file (if needed)

2. Sort the BAM file.

3. Use genomeCoverageBed to calculate coverage over each genomic position.

4. Use bedGraphToBigWig to convert the textual BedGraph into a BigWig file.

Commands:

$ samtools view -S -b -o [SAMPLE.BAM] [SAMPLE.SAM]
$ samtools sort [SAMPLE.BAM] [SAMPLE.SORTED]
$ genomeCoverageBed -bg -ibam [SAMPLE.SORTED.BAM] \
    -g [CHROM_SIZE] > [SAMPLE.BEDGRAPH]
$ bedGraphToBigWig [SAMPLE.BEDGRAPH] [CHROM_SIZE] [SAMPLE.BW]

The [SAMPLE.BW] file is the BigWig file, which can be used as the custom track:

[Screenshot: the UCSC Genome Browser "Add Custom Tracks" page (clade: Insect, genome: D. melanogaster, assembly: Apr. 2006 (BDGP R5/dm3)), with the following line pasted into the "Paste URLs or data" box:]

track name="Wiggle Track" type=bigWig bigDataUrl=http://myserver.edu/sample.bw


4.3 Coverage by Strand

Task

Display the coverage of intervals in a BED (or BAM) file:

chr2L 13774500 13774548 seq-1 0 +
chr2L 17984104 17984148 seq-2 0 -
chr2L 13675851 13675900 seq-3 0 +
chr2L 18884003 18884049 seq-4 0 -
chr2L 3358603 3358646 seq-5 0 +

As two tracks in the UCSC Genome browser - one for positive-strand reads, one for negative-strand reads:

[Figure: UCSC Genome Browser view (100-base scale) of the positive_strand and negative_strand coverage tracks, above the RefSeq Genes track.]

Solution

1. Sort the BED file (sorting by chromosome name is sufficient)

2. Use genomeCoverageBed with the -strand option to calculate coverage of each strand.

3. For the negative strand track, use awk to negate the coverage values.

4. Use bedGraphToBigWig to create BigWig track for each strand.

Commands:

$ sort -k1,1 < [SAMPLE.BED] > [SAMPLE.SORTED.BED]
$ genomeCoverageBed -bg -i [SAMPLE.SORTED.BED] -g [CHROM_SIZE] -strand + \
    > [SAMPLE.POS.BEDGRAPH]

$ genomeCoverageBed -bg -i [SAMPLE.SORTED.BED] -g [CHROM_SIZE] -strand - | \
    awk '{ $4 = - $4 ; print $0 }' > [SAMPLE.NEG.BEDGRAPH]

$ bedGraphToBigWig [SAMPLE.POS.BEDGRAPH] [CHROM_SIZE] [SAMPLE.POS.BW]
$ bedGraphToBigWig [SAMPLE.NEG.BEDGRAPH] [CHROM_SIZE] [SAMPLE.NEG.BW]
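The awk step that negates the coverage values can be tried on its own. The following is a minimal sketch on made-up bedGraph lines (the coordinates and values are illustrative, not real coverage). negate_bedgraph is a hypothetical helper name, and setting OFS to a tab is an assumption added here so that the rewritten lines stay tab-separated; the command in the document rebuilds each line with single spaces instead.

```shell
# Hypothetical helper: negate the 4th column (the coverage value) of a
# bedGraph stream and keep the output tab-separated.
negate_bedgraph() {
    awk 'BEGIN { OFS = "\t" } { $4 = -$4; print }'
}

# Two made-up bedGraph lines: chromosome, start, end, coverage.
neg=$(printf 'chr2L\t100\t150\t3\nchr2L\t150\t200\t1\n' | negate_bedgraph)
printf '%s\n' "$neg"
```

Plotting the negative strand with negative values makes the two BigWig tracks mirror each other around zero when they are displayed one above the other.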

The two files ([SAMPLE.POS.BW] and [SAMPLE.NEG.BW]) can be used as custom tracks in the Genome Browser. A special parameter (color) will show each strand in a different color:

[Screenshot: the UCSC Genome Browser "Add Custom Tracks" page (clade: Insect, genome: D. melanogaster, assembly: Apr. 2006 (BDGP R5/dm3)), with the following lines pasted into the "Paste URLs or data" box:]

track name="Positive Strand" color=0,0,255 type=bigWig bigDataUrl=http://myserver.edu/SAMPLE.POS.BW

track name="Negative Strand" color=255,0,0 type=bigWig bigDataUrl=http://myserver.edu/SAMPLE.NEG.BW


4.4 Coverage by Exons

Problem

Display a BED (or BAM) file containing blocked intervals [2]:

chr2L 67753 67927 seq1 0 + 67753 67927 255,0,0 2 9,36 0,138
chr2L 107813 108589 seq2 0 - 107813 108589 255,0,0 2 25,2 0,774
chr2L 107813 108589 seq3 0 - 107813 108589 255,0,0 2 25,2 0,774
chr2L 108308 108591 seq4 0 + 108308 108591 255,0,0 2 38,4 0,279
chr2L 113350 113473 seq5 0 + 113350 113473 255,0,0 2 19,40 0,83
chr2L 118291 118362 seq6 0 - 118291 118362 255,0,0 2 13,2 0,69
chr2L 119550 119870 seq7 0 - 119550 119870 255,0,0 2 4,43 0,277
chr2L 119550 119870 seq8 0 - 119550 119870 255,0,0 2 4,43 0,277
chr2L 120070 120477 seq9 0 - 120070 120477 255,0,0 2 10,57 0,350

As a custom track in the UCSC Genome Browser (with only the exonic blocks of each interval displayed):

[Figure: UCSC Genome Browser view (1 kb scale) of the "Spliced Reads" and "Exon Coverage" custom tracks, above the RefSeq Genes track.]

Solution

1. Sort the BED file.

2. Use genomeCoverageBed to calculate coverage of the exonic regions (with the -split option).

3. Use bedGraphToBigWig to create a BigWig track.

Commands:

$ sort -k1,1 < [SAMPLE.BED] > [SAMPLE.SORTED.BED]
$ genomeCoverageBed -bg -i [SAMPLE.SORTED.BED] -g [CHROM_SIZE] \
    -split > [SAMPLE.BEDGRAPH]
$ bedGraphToBigWig [SAMPLE.BEDGRAPH] [CHROM_SIZE] [SAMPLE.BW]

The [SAMPLE.BW] file is the BigWig file, which can be used as the custom track.

[2] BED files with 12 columns, or SAM/BAM files whose CIGAR strings contain N/D operations (the result of mapping reads across splice junctions).
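To see which regions count as "blocks" in a 12-column BED line, the block columns can be expanded by hand. The following is a minimal sketch and not a replacement for the -split option: bed12_to_blocks is a hypothetical helper name, and it relies on the standard BED12 layout in which column 10 is the block count, column 11 the comma-separated block sizes, and column 12 the comma-separated block starts relative to column 2.

```shell
# Hypothetical helper: print one 6-column BED line per block of each
# 12-column BED record read from standard input.
bed12_to_blocks() {
    awk 'BEGIN { OFS = "\t" }
    {
        n = split($11, size, ",")      # block sizes
        split($12, offset, ",")        # block starts, relative to column 2
        for (i = 1; i <= $10 && i <= n; i++) {
            start = $2 + offset[i]
            print $1, start, start + size[i], $4, $5, $6
        }
    }'
}

# First record of the sample file above.
blocks=$(printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    chr2L 67753 67927 seq1 0 + 67753 67927 255,0,0 2 9,36 0,138 | bed12_to_blocks)
printf '%s\n' "$blocks"
```

For this record the two blocks cover 67753-67762 and 67891-67927; the 129 bases in between belong to the interval but not to any block, which is the part the -split option is described as leaving out of the coverage.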

[Screenshot: the UCSC Genome Browser "Add Custom Tracks" page (clade: Insect, genome: D. melanogaster, assembly: Apr. 2006 (BDGP R5/dm3)), with the following line pasted into the "Paste URLs or data" box:]

track name="Wiggle Track" type=bigWig bigDataUrl=http://myserver.edu/sample.bw


Part II

Genome Browser Custom Tracks

Once you have a BigBed/BigWig/BAM custom track file, you need to load it to the Genome Browser server (by clicking on the "Add Custom Track" or "Manage Custom Tracks" buttons). This section explains how to load track files into the Genome Browser server, and how to set track display options.

5 Uploading tracks using FTP

An FTP (File Transfer Protocol) server is a computer that stores files and gives access to them from anywhere on the internet [3].

To use BigBed/BigWig/BAM custom tracks with the public UCSC Genome Browser, you’ll have to put the files on a public FTP server, and instruct the UCSC Genome Browser to read the files from the FTP server (unlike our local Genome Browser mirror server, which can read files directly from BlueHelix).

Using an FTP server is the easiest way to make those files public, but an HTTP server can also be used if you know how to upload files to one (this document does not deal with HTTP servers).

Getting an FTP server account

• All CSHL members - All CSHL members can request an FTP account from the I.T. department. To request an FTP account, fill out this form: http://intranet.cshl.edu/it/requests/account_request.html . Put "FTP" in the field "...I would like to access the following server(s)".

• Hannon Lab members - Email [email protected] for an FTP account on ftp://cancan.cshl.edu .

• Other alternatives - Any public HTTP/FTP server will work just fine, if you have access to one.

Uploading a file to an FTP server

If the custom track file (BigWig, BigBed, BAM) is stored on your local computer (Mac/Windows), use one of the following friendly programs to upload the file to the FTP server:

• Cyberduck (for Mac OS)

• FileZilla (For Mac, Windows, Linux)

• Apple’s Classic FTP for Mac

• WinSCP (For windows)

• and many many more...

If the custom track file is stored on BlueHelix, or if you prefer to use the command line FTP program, see the following commands as an example. Text in bold are commands you should type. Replace gordon with your FTP username. Replace dummy.bb with the file name of your custom track.

[3] This is a gross over-simplification, but it’ll do for now.


$ ftp -p ftp2.cshl.edu
Connected to ftp2.cshl.edu.
220 (vsFTPd 2.0.5)
Name (ftp2.cshl.edu:gordon): gordon
331 Please specify the password.
Password: TYPE PASSWORD AND PRESS ENTER
230 Login successful.
Remote system type is UNIX.
Using binary mode to transfer files.
ftp> bin
200 Switching to Binary mode
ftp> put dummy.bb
local: dummy.bb remote: dummy.bb
227 Entering Passive Mode (143,48,220,121,171,132)
150 Ok to send data.
226 File receive OK.
5937928 bytes sent in 0.56 secs (10444.5 kB/s)
ftp> quit
221 Goodbye.

Loading a Custom-Track from an FTP server

A URL (Uniform Resource Locator) is a method to find files on the internet [4].

The syntax of the URL is ftp://USER:PASSWORD@SERVER/FILE. Assuming the following details:

FTP server: ftp2.cshl.edu
FTP Username: gordon
FTP password: 12345678
Custom Track file name: sample.bam

The full URL to access this file will be:

ftp://gordon:12345678@ftp2.cshl.edu/sample.bam
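The URL can also be assembled from its four parts on the command line. The following is a minimal sketch using the example details above; ftp_url is a hypothetical helper name, and the sketch assumes the user name and password contain no characters that are special in a URL (such as @, : or /), which would otherwise need percent-encoding.

```shell
# Hypothetical helper: build ftp://USER:PASSWORD@SERVER/FILE from its parts.
# Arguments: user, password, server, file.
ftp_url() {
    printf 'ftp://%s:%s@%s/%s' "$1" "$2" "$3" "$4"
}

url=$(ftp_url gordon 12345678 ftp2.cshl.edu sample.bam)

# The custom track line that carries the URL in the bigDataUrl setting.
track_line="track type=bam bigDataUrl=$url"
printf '%s\n' "$track_line"
```

The printed line is what gets pasted into the "Paste URLs or data" box. Note that the password is part of the URL in clear text, so anyone who sees the track line can read it.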

When adding a custom track in the public UCSC Genome Browser (http://genome.ucsc.edu), use the URL of the file with the bigDataUrl keyword, like so:

[Screenshot: the UCSC Genome Browser "Add Custom Tracks" page (clade: Insect, genome: D. melanogaster, assembly: Apr. 2006 (BDGP R5/dm3)), with the following line pasted into the "Paste URLs or data" box:]

track type=bam bigDataUrl=ftp://gordon:12345678@ftp2.cshl.edu/sample.bam

If all went well, when you click "Submit" the new custom track will be added. If there was any error, see Part IV (Troubleshooting).

[4] Again, a gross over-simplification that will do for now. See ?? for a more accurate description.


6 Using CSHL’s local Genome Browser server (http://foxtrot.cshl.edu)

http://foxtrot.cshl.edu is our local mirror of the UCSC Genome Browser.

It contains several common organisms/builds (hg18, hg19, mm9, dm3, panTro2, strPur2) and several other custom builds.

Advantages of using our local server:

1. Faster tracks upload (for BED/Wiggle files)

2. Sessions and Custom tracks are saved for longer periods

3. BLAT with less stringent matching parameters (suitable for short-reads)

4. Can read custom tracks directly from BlueHelix storage (no need to upload files to an HTTP/FTP server). See below for details.

7 Track display options

See this short tutorial: http://tango.cshl.edu/compskills/gb_tutorial7.pdf .


Part III

Technical Details

8 Formatting conventions

Fixed-font sections (such as the one below) depict a unix session, as typed on a terminal. This will usually be on BlueHelix.

• lines starting with ’#’ are comments

• lines starting with ’$’ are unix shell commands. These should be typed by the user.

• other lines are the program output: they will be printed on the screen when the user executes the commands.

The following example shows a unix session, where the user runs the ls command (print file list):

# This is a comment. The next line shows executing the "ls" command
# followed by the output of the "ls" command (the four files).
$ ls
file1
file2
file3
file4

Where input or output files are involved, they will appear in UPPER CASE, surrounded by square brackets. These should be replaced by real file names when the command is executed by the user.

# The following command copies a file
# The command has no output - nothing is printed after the command is executed.
$ cp [INPUT.TXT] [OUTPUT.TXT]

9 [CHROM_SIZE] file

The programs bedClip, genomeCoverageBed, bedGraphToBigWig, bedToBigBed require a textual file containing the names and sizes of each chromosome (for the organism/build used). The examples in this document use the [CHROM_SIZE] placeholder for this file.

On BlueHelix, files are available for the most common builds:

$ cd /data/hannon/gordon/databases/chrom_sizes
$ ls -l
total 128
-rw-r--r-- 1 gordon hannon 513 Mar 9 19:55 dm3_chromInfo.txt
-rw-r--r-- 1 gordon hannon 2229 Mar 9 19:54 hg18_chromInfo.txt
-rw-r--r-- 1 gordon hannon 3924 Mar 9 19:54 hg19_chromInfo.txt
-rw-r--r-- 1 gordon hannon 1249 Mar 9 19:54 mm9_chromInfo.txt

Each file contains three columns: chromosome, size, file (the file column can be safely ignored):


$ cat dm3_chromInfo.txt
chr2L 23011544 /gbdb/dm3/dm3.2bit
chr2LHet 368872 /gbdb/dm3/dm3.2bit
chr2R 21146708 /gbdb/dm3/dm3.2bit
chr2RHet 3288761 /gbdb/dm3/dm3.2bit
chr3L 24543557 /gbdb/dm3/dm3.2bit
chr3LHet 2555491 /gbdb/dm3/dm3.2bit
chr3R 27905053 /gbdb/dm3/dm3.2bit
chr3RHet 2517507 /gbdb/dm3/dm3.2bit
chr4 1351857 /gbdb/dm3/dm3.2bit
chrU 10049037 /gbdb/dm3/dm3.2bit
chrUextra 29004656 /gbdb/dm3/dm3.2bit
chrX 22422827 /gbdb/dm3/dm3.2bit
chrXHet 204112 /gbdb/dm3/dm3.2bit
chrYHet 347038 /gbdb/dm3/dm3.2bit
chrM 19517 /gbdb/dm3/dm3.2bit

Files for every organism/build available on the UCSC Genome Browser can be downloaded from:

http://hgdownload.cse.ucsc.edu/goldenPath/ORG/database/chromInfo.txt.gz

Example (for hg18):

http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/chromInfo.txt.gz
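Since the third column can be ignored, a downloaded chromInfo.txt can optionally be trimmed to a plain two-column name/size file. The following is a minimal sketch: chrom_sizes is a hypothetical helper name, the two input lines are copied from the dm3 listing above, and the download itself is only shown as a comment because it needs network access (wget is an assumption; any download tool would do).

```shell
# Not run here (needs network access); shown for the hg18 example URL:
#   wget http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/chromInfo.txt.gz
#   gunzip chromInfo.txt.gz

# Hypothetical helper: keep only the chromosome name and size columns of a
# tab-separated chromInfo file (reads the named files, or standard input).
chrom_sizes() {
    cut -f1,2 "$@"
}

# Two lines taken from the dm3 listing, tab-separated as in the download.
sizes=$(printf 'chr4\t1351857\t/gbdb/dm3/dm3.2bit\nchrM\t19517\t/gbdb/dm3/dm3.2bit\n' | chrom_sizes)
printf '%s\n' "$sizes"
```

With a real file the call would look like chrom_sizes chromInfo.txt > [CHROM_SIZE].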

10 Bluehelix setup

On BlueHelix, the relevant programs are available in:

/data/hannon/gordon/ucsc_genome_browser/bin

A required library (libmysqlclient.so) is available here:

/data/hannon/gordon/usr/lib/mysql/

When using BASH, run the following commands:

export PATH=$PATH:/data/hannon/gordon/ucsc_genome_browser/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/data/hannon/gordon/usr/lib/mysql
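The BASH commands above append the directories every time they are run. A guarded variant is sketched below as one possible approach, not as the document's own setup script: append_dir is a hypothetical helper name, and only the two directories come from this document. It leaves the list unchanged when the directory is already present, and avoids a stray leading colon when the variable starts out empty.

```shell
# Hypothetical helper: append_dir CURRENT_VALUE DIR
# Prints CURRENT_VALUE with DIR appended, unless DIR is already listed.
append_dir() {
    case ":$1:" in
        *":$2:"*) printf '%s' "$1" ;;          # already present: unchanged
        "::")     printf '%s' "$2" ;;          # empty list: no stray colon
        *)        printf '%s:%s' "$1" "$2" ;;  # otherwise append
    esac
}

PATH=$(append_dir "$PATH" /data/hannon/gordon/ucsc_genome_browser/bin)
LD_LIBRARY_PATH=$(append_dir "${LD_LIBRARY_PATH:-}" /data/hannon/gordon/usr/lib/mysql)
export PATH LD_LIBRARY_PATH
```

Placed in a shell startup file, these lines can be sourced repeatedly without the two variables growing.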

When using TCSH, run the following commands:

setenv PATH $PATH:/data/hannon/gordon/ucsc_genome_browser/bin
setenv LD_LIBRARY_PATH $LD_LIBRARY_PATH:/data/hannon/gordon/usr/lib/mysql

TODO: make a friendly script (set agnostic)

11 direct MySQL access

The UCSC Genome Browser allows direct access to the back-end MySQL database containing all the annotation tracks (see http://genome.ucsc.edu/FAQ/FAQdownloads.html#download29 for more details).

Our local mirror (http://foxtrot.cshl.edu) contains the same annotation tables for several common builds (mainly: hg18, hg19, mm9, dm3, panTro2). Contact [email protected] to set up direct access to the database server (could be faster than connecting to UCSC’s MySQL server).


12 Programs reference

bedClip

bedToBigBed

bedGraphToBigWig

genomeCoverageBed

samtools

gb custom track line

13 Compiling programs from source code

Jim Kent’s Tools

Don’t.

Download the pre-compiled binaries from http://genome-test.cse.ucsc.edu/~kent/exe/.

If you insist on building it from source, you’ll find it on BlueHelix:

/home/hannon/gordon/source/kent_genome_browser_source/kent

And the build instructions here: http://genome.ucsc.edu/admin/jk-install.html.

If you have an I.T.-managed server with CentOS 5.4 and Linux kernel 2.6.18, send me an email and I can send you the compiled binaries for that platform.

samtools

The source code for samtools v0.1.7a is on BlueHelix:

/home/hannon/gordon/source/samtools-0.1.7a

Or on the official web site: http://samtools.sourceforge.net/

bedtools

The examples in this document require a patched version of Aaron Quinlan’s BEDTools package, available on BlueHelix:

/home/hannon/gordon/source/BEDTools_bedgraph

The official web site: http://code.google.com/p/bedtools/

Future versions (probably 2.5.5) might incorporate these patches.


Part IV

Troubleshooting

SAM no header

$ samtools view -S -b dummy.sam
[samopen] no @SQ lines in the header.
[sam_read1] missing header? Abort!
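The message points at missing @SQ header lines, which name each reference sequence and its length. This document does not give a fix; one possible approach, sketched here as an assumption, is to generate the @SQ lines from the [CHROM_SIZE] file and put them in front of the SAM records. sq_header is a hypothetical helper name, and the two input lines are copied from the dm3 listing in section 9.

```shell
# Hypothetical helper: turn "name size ..." lines into SAM @SQ header lines
# (tab-separated: @SQ, SN:<name>, LN:<length>).
sq_header() {
    awk 'BEGIN { OFS = "\t" } { print "@SQ", "SN:" $1, "LN:" $2 }' "$@"
}

header=$(printf 'chr4\t1351857\t/gbdb/dm3/dm3.2bit\nchrM\t19517\t/gbdb/dm3/dm3.2bit\n' | sq_header)
printf '%s\n' "$header"

# To apply it to a real file (not run here):
#   ( sq_header [CHROM_SIZE]; cat [SAMPLE.SAM] ) > [SAMPLE.WITH_HEADER.SAM]
```

The chromosome names in the header must match the names used in the third column of the SAM records, so the [CHROM_SIZE] file has to belong to the same build the reads were mapped to.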

Errors with FTP and custom tracks

ftp server response timed out > 1000000 microsec - wrong password

Error Couldn’t find host ccan.cshl.edu. h_errno 1 - bad server name

Error ftp server error on cmd=[SIZE /end221.bb ] response=[550 Could not get file size. ] - wrong file name

Error Missing bigDataUrl setting from track of type=bigBed - multiline track file.
