
Part # 1005020, Rev. A
May 2008

Using Pipeline Output Data for Whole Genome Alignment

FOR RESEARCH ONLY

Topics

Introduction
    Pipeline
    Maq
    GBrowse
    Hardware Requirements
    Workflow
Preparing to Run Maq
    UNIX/Linux Environment
    Testing PERL
    Installing Maq
    Getting Reference Sequences
    Reference Genome with Multiple Chromosomes
Output File from Pipeline
    Required Pipeline Output File
    Format of Sequence.txt File
    Quality Values
Getting Consensus, Identifying SNPs and Indels
    Building Consensus
    Extracting Consensus Information
    SNP Calling
    Indel Discovery
Viewing SNPs and Indels with GBrowse
    GBrowse
    Reformatting Data
    Using GBrowse
Appendix A: Installing Maq Yourself
Appendix B: Quality Value Tables
    Illumina Symbolic ASCII Quality Values
    Sanger Symbolic ASCII Quality Values


This publication and its contents are proprietary to Illumina, Inc., and are intended solely for the contractual use of its customers and for no other purpose than to operate the system described herein. This publication and its contents shall not be used or distributed for any other purpose and/or otherwise communicated, disclosed, or reproduced in any way whatsoever without the prior written consent of Illumina, Inc.

For the proper operation of this system and/or all parts thereof, the instructions in this guide must be strictly and explicitly followed by experienced personnel. All of the contents of this guide must be fully read and understood prior to operating the system or any of the parts thereof.

FAILURE TO COMPLETELY READ AND FULLY UNDERSTAND AND FOLLOW ALL OF THE CONTENTS OF THIS GUIDE PRIOR TO OPERATING THIS SYSTEM, OR PARTS THEREOF, MAY RESULT IN DAMAGE TO THE EQUIPMENT, OR PARTS THEREOF, AND INJURY TO ANY PERSONS OPERATING THE SAME.

Illumina, Inc. does not assume any liability arising out of the application or use of any products, component parts, or software described herein. Illumina, Inc. further does not convey any license under its patent, trademark, copyright, or common-law rights nor the similar rights of others. Illumina, Inc. further reserves the right to make any changes in any processes, products, or parts thereof, described herein without notice. While every effort has been made to make this guide as complete and accurate as possible as of the publication date, no warranty or fitness is implied, nor does Illumina accept any liability for damages resulting from the information contained in this guide.

© 2008 Illumina, Inc. All rights reserved. Illumina, Solexa, Making Sense Out of Life, Oligator, Sentrix, GoldenGate, DASL, BeadArray, Array of Arrays, Infinium, BeadXpress, VeraCode, IntelliHyb, iSelect, CSPro, iScan, and GenomeStudio are registered trademarks or trademarks of Illumina. All other brands and names contained herein are the property of their respective owners.


Introduction

The Genome Analyzer can generate several Gb of data a week. Converting these huge amounts of sequence data into usable information requires fast and efficient downstream analysis. This document describes how to align Genome Analyzer Pipeline sequence data to a known genome using the Mapping and Assembly with Quality (Maq) application. Results can then be assessed by opening the output files, or imported into a GBrowse implementation to view them in the genomic context.

The key sections of this guide are:
• Preparing to Run Maq on page 6: Gives information on installing Maq.
• Output File from Pipeline on page 9: Describes the fields in the relevant Pipeline files and the various metrics.
• Getting Consensus, Identifying SNPs and Indels on page 11: Explains how to get a consensus sequence, SNPs, and indels from Maq.
• Viewing SNPs and Indels with GBrowse on page 18: Explains how to use GBrowse to view SNPs and indels.

Pipeline

The Genome Analyzer Pipeline software is a highly customizable workflow engine capable of taking the raw image data generated by the Genome Analyzer and producing intensity scores, base calls, quality metrics, and quality scored alignments. This software is the result of extensive collaborations with many of the world’s leading sequencing centers.

Maq

Maq is a third-party open source software tool that builds mapping assemblies from short reads generated by next-generation sequencing machines. Maq was developed specifically for the Genome Analyzer by Heng Li and Richard Durbin from the Sanger Institute. Maq runs on UNIX/Linux, so you will need a computer that uses Linux or UNIX as the operating system.

GBrowse

GBrowse is an open source genome viewer, developed as part of the Generic Model Organism Database project (GMOD). Many genome centers and universities have implemented GBrowse to enable you to view their genomic data.

Hardware Requirements

At minimum, you will need 1 GB of memory. This should be enough to map 2 million reads to a bacterial genome, though 4 GB is preferable. For mammalian-sized genome alignments, you will need to map many batches of about 2 million reads, and you will be better served with 16 GB of memory.

NOTE

This guide does not explain how to use Pipeline, and only provides limited information for the use of Maq and GBrowse. The main goal is to provide a path to efficiently use Pipeline output for whole genome alignment.


Workflow

The workflow for generating consensus, SNPs and indels is illustrated in Figure 1.

Figure 1 Workflow Generating Consensus, SNPs and Indels


Preparing to Run Maq

Before you can install Maq, you need to fulfill a number of requirements. This section lists these requirements and gives some options for meeting them.

UNIX/Linux Environment

You need to install Maq in an environment that runs on UNIX or Linux (a version of UNIX).

Workstation

Your best option is to run Maq on a dedicated UNIX or Linux workstation. See if you can find such a workstation in your department where you can install and run Maq.

You may need to install Linux on a computer from scratch. Talk to your IT department to see what is required, and whether they can help.

Linux Distributions

If you do not have access to a workstation running UNIX/Linux and you need to install Linux, there are many different distributions of Linux available, paid or free. Good choices are Red Hat Linux (paid) and Fedora Linux (free), but others should work too. Use the documentation provided with your Linux distribution for installation.

Testing PERL

Maq uses a number of scripts that are written in the programming language Perl. Many UNIX/Linux distributions already have Perl installed, so first check whether Perl is installed in your UNIX/Linux environment:

1. Go to your UNIX/Linux environment.

2. At the command prompt, enter:
perl -v

3. Evaluate whether you have Perl installed:
• If Perl has been installed, you will get a message stating the version of Perl, copyright and other information. Continue with the section Installing Maq.
• If Perl is not installed yet, you will get a message like this:
perl: command not found

If Perl is not installed yet, go to www.activestate.com and install the most recent fully released version of Perl for Linux and your hardware configuration.

Installing Maq

When your Linux environment is set up, ask your IT department to install Maq. The download is available from maq.sourceforge.net (Figure 2). We used Maq versions 0.6.5 and 0.6.6 to test the application.

Figure 2 Maq Home Page (labeled items: Download page, Maq FAQ, Maq User’s Manual, Maq Reference Manual, Maq Wiki)

NOTE

If you have to install Maq yourself, refer to Appendix A: Installing Maq Yourself on page 25.

Getting Reference Sequences

You need to download a reference genome for the organism you sequenced, so that you can compare your reads to it. Many are available from the NCBI website.

1. Open your browser and navigate to www.ncbi.nlm.nih.gov.

2. Click on the link Genomic Biology in the left navigation bar.

3. Browse to your species under Genome Projects Database in the right navigation bar.

4. Navigate to or search for the species you are looking for, and click on Project data | Genomic.

5. Download the genomic files in fasta format (*.fasta, *.fa or *.fna). Download each chromosome of your organism.

6. Make sure to keep track of the exact build of the genome you are using. You can find this in the GenBank file, in the Comments section.

NOTE

Another good source for reference genomes is UCSC (hgdownload.cse.ucsc.edu).

Reference Genome with Multiple Chromosomes

If you use a reference genome with multiple chromosomes, you may only find it as one fasta file per chromosome. You will need to combine these fasta files into one file for the reference genome; otherwise your alignment scores may be affected. Perform the following:

1. Open the command line (Terminal) in Linux.

2. Go to the directory containing the downloaded reference genome files using the cd command.

3. Enter the following:
cat chr1.fa chr2.fa chr3.fa >ref.fa

where:
• chr1.fa, chr2.fa, and chr3.fa are the fasta input files.
• ref.fa is the fasta reference genome output file.
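The same concatenation can be sketched in Python when a quick sanity check is wanted. The following is a minimal sketch, not part of Maq: the function name combine_fasta is hypothetical, and it assumes that each input file is a plain-text fasta file. It returns the number of sequence headers written, which should equal the number of chromosomes you downloaded.

```python
def combine_fasta(inputs, output):
    """Concatenate fasta files into one file and return the number of '>' headers."""
    headers = 0
    with open(output, "w") as out:
        for path in inputs:
            last = "\n"
            with open(path) as src:
                for line in src:
                    if line.startswith(">"):
                        headers += 1
                    out.write(line)
                    last = line
            # Guard against a missing final newline, which would glue the
            # next header onto the last sequence line of this file.
            if not last.endswith("\n"):
                out.write("\n")
    return headers
```

For example, combine_fasta(["chr1.fa", "chr2.fa", "chr3.fa"], "ref.fa") would be expected to return 3 if each input holds a single chromosome.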


Output File from Pipeline

After you have called the bases in Pipeline, Pipeline saves files containing the sequence information. This section specifies which file you need from Pipeline for alignment in Maq, and explains the different elements in this file.

Required Pipeline Output File

The Pipeline output file you should use for alignment in Maq has the following naming scheme:

s_N_R_sequence.txt (for paired-end sequence files)

or

s_N_sequence.txt (for single-read sequence files)

where:
• N stands for the lane.
• R stands for the read, in case of paired-end sequencing.

An example file name is s_3_2_sequence.txt; this file contains the sequencing reads from read 2 of lane 3.
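If you handle many lanes, the naming scheme can be checked programmatically. The sketch below is an illustration only; the function name and the regular expression are assumptions derived from the two file name patterns shown above.

```python
import re

# Matches s_N_sequence.txt (single read) and s_N_R_sequence.txt (paired end).
_NAME = re.compile(r"^s_(\d+)(?:_(\d+))?_sequence\.txt$")


def parse_sequence_filename(name):
    """Return (lane, read); read is None for single-read files."""
    match = _NAME.match(name)
    if match is None:
        raise ValueError(f"not a Pipeline sequence file name: {name!r}")
    lane = int(match.group(1))
    read = int(match.group(2)) if match.group(2) else None
    return lane, read


print(parse_sequence_filename("s_3_2_sequence.txt"))  # (3, 2)
```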

Format of Sequence.txt File

The s_N_R_sequence.txt file contains sequence and quality information for one read from one sequencing lane. The files are in FASTQ format.

An example of an entry for one read is shown below:

@SLXA-B3_604:2:1:512:767/1
GCCTAACCTTTCTGAACCTCATGCGGAAAAACTGTTT
+SLXA-B3_604:2:1:512:767/1
ccccccccccccchKhcchcU`]`LPVRTINKSNLAA

Every entry contains the following lines:

Read Identifier: The line @SLXA-B3_604:2:1:512:767/1 contains the read identifier, which has the following elements:

Description    Element
Abbreviated run name    SLXA-B3_604
Lane    2
Tile    1
Coordinates of the cluster on tile    512,767
Indicates the read of a paired-end run    /1

The read identifier line starts with an '@', which indicates this line is going to be followed by a sequence line.

Sequence: The line GCCTAACCTTTCTGAACCTCATGCGGAAAAACTGTTT contains the called sequence for this entry.

Read Identifier: The line +SLXA-B3_604:2:1:512:767/1 contains the same read identifier as above, but this time the line starts with a '+', which indicates it is going to be followed by a quality score line.

Quality scores: The line ccccccccccccchKhcchcU`]`LPVRTINKSNLAA contains the quality scores for this entry. Every base call in an entry has a corresponding quality score, i.e., the nth position in the quality scores line corresponds to the nth nucleotide in the sequence line.
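The four-line layout and the identifier elements described above can be read with a few lines of Python. This is a minimal sketch under the assumption that every entry follows exactly this layout; the names ReadId, parse_read_id, and parse_entry are hypothetical and not part of Pipeline or Maq.

```python
from collections import namedtuple

ReadId = namedtuple("ReadId", "run lane tile x y read")


def parse_read_id(header):
    """Split a line such as '@SLXA-B3_604:2:1:512:767/1' into its elements."""
    body = header[1:]                     # drop the leading '@' or '+'
    name, _, read = body.partition("/")   # '/1' or '/2' marks the paired-end read
    run, lane, tile, x, y = name.split(":")
    return ReadId(run, int(lane), int(tile), int(x), int(y),
                  int(read) if read else None)


def parse_entry(lines):
    """Parse one four-line entry into (ReadId, sequence, quality line)."""
    header, sequence, plus, quality = (line.rstrip("\n") for line in lines)
    if not header.startswith("@") or not plus.startswith("+"):
        raise ValueError("entry does not follow the four-line layout")
    return parse_read_id(header), sequence, quality
```

Applied to the example entry, parse_read_id would give run SLXA-B3_604, lane 2, tile 1, coordinates 512 and 767, and read 1.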

Quality Values

The quality scores are in Illumina symbolic ASCII format, according to the following formula:

Quality value = (ASCII character code) - 64.

The values of the characters in the Illumina symbolic ASCII format are listed in the Appendix, section Illumina Symbolic ASCII Quality Values on page 26.

For a single basecall, a Q value of 30 is great, Q20 is a good score, while Q10 is still usable.
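The formula above can be applied to a whole quality line at once. The sketch below is illustrative; the function names are hypothetical, and the grade labels simply restate the Q30, Q20, and Q10 guidance from the preceding paragraph as cut-offs, which is an assumption about how that guidance would be used.

```python
ILLUMINA_OFFSET = 64  # Quality value = (ASCII character code) - 64


def illumina_qualities(quality_line):
    """Decode an Illumina symbolic ASCII quality line into integer Q values."""
    return [ord(ch) - ILLUMINA_OFFSET for ch in quality_line]


def grade(q):
    """Label a Q value using the rule of thumb given in the text."""
    if q >= 30:
        return "great"
    if q >= 20:
        return "good"
    if q >= 10:
        return "usable"
    return "low"


print(illumina_qualities("chKA"))  # [35, 40, 11, 1]
```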

Difference of Illumina and Phred Scoring Scheme

The Illumina quality scoring scheme and the Phred quality scoring scheme are different:

Illumina: Q = 10 x log10((1 - e) / e)
Phred: Q = -10 x log10(e)

where: e = error probability.

The two definitions round to the same value from approximately Q15 and above; however, our scores can go as low as -5.
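The relationship between the two schemes can be explored numerically. The following sketch implements the two formulas as given; the conversion function is not taken from this guide but follows from eliminating e between them: (1 - e)/e = 10^(Q_illumina/10), so 1/e = 10^(Q_illumina/10) + 1. The function names are hypothetical.

```python
import math


def illumina_quality(e):
    """Illumina score for error probability e (0 < e < 1)."""
    return 10 * math.log10((1 - e) / e)


def phred_quality(e):
    """Phred score for error probability e (0 < e <= 1)."""
    return -10 * math.log10(e)


def illumina_to_phred(q_illumina):
    """Phred score for the same error probability as the given Illumina score."""
    return 10 * math.log10(10 ** (q_illumina / 10) + 1)
```

For an error probability of 0.001 both functions round to 30, while for an error probability of 0.5 the Illumina score is 0 and the Phred score is about 3.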

Difference of Illumina and Sanger FASTQ

The Sanger FASTQ format, which is used by Maq, differs slightly from the Illumina FASTQ format. The main difference is that the quality of the base calls is scored using different scales (Illumina versus Phred quality scores). Maq comes with tools to convert Illumina FASTQ (also often called Solexa FASTQ) to Sanger FASTQ; see Preparing to Run Maq on page 6 and the Maq documentation for more information.


Getting Consensus, Identifying SNPs and Indels

Maq aligns your sequence reads to a reference sequence, builds a consensus and calls single nucleotide polymorphisms (SNPs), and can identify insertion/deletions (indels) if you have performed paired-end sequencing. This section explains briefly how to perform these actions, and what output files you will get when you call SNPs and identify indels.

A lot of this information has been summarized from the Maq user’s manual and the Maq reference manual, available at maq.sourceforge.net (see Figure 2). For more detailed instructions and comprehensive descriptions of the commands in Maq, see these documents; additional information is present in the FAQ section and in the Maq Wiki.

Generating Analysis Folder

You need to generate a folder in which you run the analysis. Copy the following files to this folder:

• Read files (Illumina FASTQ format).
• Reference sequence file (FASTA format).

All output files that Maq generates will be stored in this folder (unless you specifically direct Maq to another folder).

Building Consensus

The first thing you need to do is align the reads to the reference, and build a consensus. This is described in this section.

Converting Illumina FASTQ to Sanger FASTQ

As described in Quality Values on page 10, the FASTQ format used by Maq is different from the Illumina FASTQ format. To use Maq, you need to first convert the format for all read files by entering:

maq sol2sanger s_N_R_sequence.txt s_N_R_sequence.fastq

where:
• s_N_R_sequence.txt is the Illumina read sequence file.
• s_N_R_sequence.fastq is the output file in Sanger FASTQ.

Converting Sanger FASTQ to BFQ

Next you need to convert Sanger FASTQ to binary FASTQ (bfq) for all read files by entering:

maq fastq2bfq s_N_R_sequence.fastq s_N_R_sequence.bfq

where:
• s_N_R_sequence.fastq is the Sanger FASTQ read sequence file.
• s_N_R_sequence.bfq is the output file in binary FASTQ.

NOTE

For small sequencing projects (1 lane of sequence data from a procaryote), many of these steps can be combined as a batch using the easyrun command. See the Maq user’s manual for information.


Converting Reference FASTA to BFA

Next you need to convert FASTA to binary FASTA (bfa) for the reference sequence by entering:

maq fasta2bfa ref.fasta ref.bfa

where:
• ref.fasta is the FASTA reference sequence file.
• ref.bfa is the output reference file in binary FASTA.

Aligning Reads to Reference

For single-read sequencing, you align the reads from one file to the reference sequence by entering:

maq map s_N_sequence.map ref.bfa s_N_sequence.bfq

For paired-end sequencing, you align the reads from two matching paired-end files to the reference sequence by entering:

maq map s_N_sequence.map ref.bfa s_N_1_sequence.bfq s_N_2_sequence.bfq

where:
• s_N_sequence.map is the mapped alignment output file.
• ref.bfa is the reference file in binary FASTA.
• s_N_sequence.bfq is the single-read input file in binary FASTQ.
• s_N_1_sequence.bfq is the paired-end first read input file in binary FASTQ.
• s_N_2_sequence.bfq is the paired-end second read input file in binary FASTQ.
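When paired-end reads are aligned, Maq reports a pairing summary of the form (total, isPE, mapped, paired) = (4316000, 1, 4226477, 6142). A paired count that is very low compared with the mapped count suggests that the pairing settings need attention. The sketch below extracts these numbers from such a line; the function name is hypothetical, and the pattern assumes the message looks exactly like the example in this guide, which may not hold for every Maq version.

```python
import re

_SUMMARY = re.compile(
    r"\(total, isPE, mapped, paired\)\s*=\s*"
    r"\((\d+),\s*(\d+),\s*(\d+),\s*(\d+)\)"
)


def paired_fraction(message):
    """Return paired/mapped for a pairing summary, or None if none is found."""
    match = _SUMMARY.search(message)
    if match is None:
        return None
    total, is_pe, mapped, paired = map(int, match.groups())
    return paired / mapped if mapped else 0.0
```

For the example message the fraction is well below 1 percent, which is the situation in which a maximum fragment length should be specified.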

NOTE

When you align paired-end reads, you will get a message that indicates the success of the pairing:

(total, isPE, mapped, paired) = (4316000, 1, 4226477, 6142)

The number of mapped reads should be close to the number of paired reads. If the number of paired reads is very low (6142 in the example above), and you have done long-distance paired-end reads, you need to specify the maximum fragment length (which should be slightly longer than the average paired-end fragment length). For example, for paired-end reads from 500 bp fragments, add a maximum fragment length of 550 bp by adding the argument -a 550, i.e. enter the following:

maq map -a 550 s_N_sequence.map ref.bfa s_N_1_sequence.bfq s_N_2_sequence.bfq

Merging Map Files

Maq works best with 1 to 3 million reads as input when aligning reads to the reference sequence. If you have a big sequencing project with multiple lanes, you should perform the alignment per lane first, and then combine the map files using mapmerge.

So if you used multiple lanes to sequence the same sample, you can combine the mapped alignments now by entering:

maq mapmerge s_123_sequence.map s_1_sequence.map s_2_sequence.map s_3_sequence.map

where:
• s_123_sequence.map is the combined mapped alignment output file for lanes 1, 2, and 3.
• s_N_sequence.map is the mapped alignment file for lane N.

Building Consensus

Now you can assemble the consensus from the (merged) map files:

maq assemble s123.cns ref.bfa s_123_sequence.map

where:
• s123.cns is the consensus output file.
• ref.bfa is the reference file in binary FASTA.
• s_123_sequence.map is the merged mapped alignment file.
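For projects with several lanes it can help to generate the command lines for the steps above rather than typing them one by one. The following sketch only builds the argument lists and does not run Maq. It uses only the Maq subcommands shown in this section; the function name maq_commands and the way output names are derived from the lane numbers are assumptions modeled on the examples in this guide.

```python
def maq_commands(lanes, reference="ref.fasta", paired=True):
    """Build the Maq command lines, as argument lists, from FASTA/FASTQ to consensus."""
    if not lanes:
        raise ValueError("at least one lane is required")
    commands = [["maq", "fasta2bfa", reference, "ref.bfa"]]
    lane_maps = []
    for lane in lanes:
        if paired:
            stems = [f"s_{lane}_1_sequence", f"s_{lane}_2_sequence"]
        else:
            stems = [f"s_{lane}_sequence"]
        for stem in stems:
            commands.append(["maq", "sol2sanger", f"{stem}.txt", f"{stem}.fastq"])
            commands.append(["maq", "fastq2bfq", f"{stem}.fastq", f"{stem}.bfq"])
        lane_map = f"s_{lane}_sequence.map"
        commands.append(["maq", "map", lane_map, "ref.bfa"]
                        + [f"{stem}.bfq" for stem in stems])
        lane_maps.append(lane_map)
    tag = "".join(str(lane) for lane in lanes)
    merged = lane_maps[0]
    if len(lane_maps) > 1:
        # Merging is only needed when more than one lane was aligned.
        merged = f"s_{tag}_sequence.map"
        commands.append(["maq", "mapmerge", merged] + lane_maps)
    commands.append(["maq", "assemble", f"s{tag}.cns", "ref.bfa", merged])
    return commands
```

Each returned list could be passed to subprocess.run(command, check=True) on a system where Maq is installed; printing " ".join(command) instead gives the lines to type by hand.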

Extracting Consensus Information

Once you have built the consensus, you can extract the new consensus sequence in FASTA format, or in FASTQ format (containing Sanger quality scores).

Extracting Consensus in FASTA Format

To extract the consensus in FASTA format, enter the following:

maq cns2ref s123.cns >s123.cns.fasta

where:
• s123.cns is the consensus file.
• s123.cns.fasta is the output consensus file in FASTA.

Extracting Consensus in FASTQ Format

To extract the consensus in Sanger FASTQ format, enter the following:

maq cns2fq s123.cns >s123.cns.fastq

where:
• s123.cns is the consensus file.
• s123.cns.fastq is the output consensus file in FASTQ.

The files are saved in the Sanger FASTQ format, with quality scores in the Sanger symbolic ASCII format (see Quality Values on page 10 for differences with the Illumina quality scheme).

The quality scores are in Sanger symbolic ASCII format, according to the following formula:

Quality value = (ASCII character code) - 33

The values of the characters in the Sanger symbolic ASCII format are listed in the Appendix, section Sanger Symbolic ASCII Quality Values on page 27.
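Sanger quality lines can be decoded with the formula above and turned back into error probabilities by inverting the Phred definition Q = -10 x log10(e). The sketch below is illustrative; the function names and the default threshold of 20 are assumptions, not settings used by Maq.

```python
SANGER_OFFSET = 33  # Quality value = (ASCII character code) - 33


def sanger_qualities(quality_line):
    """Decode a Sanger symbolic ASCII quality line into integer Q values."""
    return [ord(ch) - SANGER_OFFSET for ch in quality_line]


def error_probability(q):
    """Invert Q = -10 x log10(e) to get the error probability e."""
    return 10 ** (-q / 10)


def low_quality_positions(quality_line, threshold=20):
    """Return 1-based positions whose quality is below the threshold."""
    return [i for i, q in enumerate(sanger_qualities(quality_line), start=1)
            if q < threshold]


print(sanger_qualities("!5I"))  # [0, 20, 40]
```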

SNP Calling

Extracting SNP Calls

Once you have built the consensus, extract SNPs the following way:

maq cns2snp s123.cns >s123.snp

where:
• s123.cns is the consensus file.
• s123.snp is the tab-delimited output SNP file.

SNP File

To view the SNP calls, open the SNP file in Excel (Figure 3).

Figure 3 SNP File Opened in Excel

The columns contain the following information:

Column    Name    Description
A    Chromosome / Reference    Chromosome or reference sequence.
B    Position    Position of SNP on the reference sequence.
C    Reference Base    The base as present in the reference sequence.
D    Consensus Base    The base called in the consensus of your sequencing reads.
E    Consensus Quality    The quality of the base called in the consensus. This is the Sanger quality, which is different from the Illumina quality scores (see Difference of Illumina and Phred Scoring Scheme on page 10).
F    Read Depth    The number of reads covering the position.
G    Average # Hits    The average number of hits of reads covering this position, which roughly equals the copy number of the flanking region in the reference genome.
H    Highest Mapping Quality    The highest mapping quality of the reads covering the position.
I    Quality Difference    The quality difference between the strong allele and the weak allele. If the quality difference is close to the highest mapping quality, you may be looking at a read error.

For the consensus bases, heterozygotes are designated using IUB codes:

IUB code    Bases
A    A
C    C
G    G
T    T
M    A/C
K    G/T
Y    C/T
R    A/G
W    A/T
S    G/C
D    A/G/T
B    C/G/T
H    A/C/T
V    A/C/G
N    A/C/G/T

Improving SNP Quality

In addition, the following commands are useful for filtering SNP calls:

SNPfilter

SNPfilter removes SNPs that are covered by just one read, fall in a repetitive region, or fall in a 10 bp region with at least 3 SNPs. Enter the following:

perl maq.pl SNPfilter s123.snp >s123.filtered.snp

where:
• s123.snp is the input SNP file.


• s123.filtered.snp is the tab-delimited output filtered SNP file.

rmdup

Rmdup removes pairs with identical ends, which could have been caused by PCR at sample prep. Removing duplicates may improve SNP calling accuracy. This filter needs to be applied before the consensus is assembled (Building Consensus on page 13); use it as follows:

maq rmdup s_123_rmdup.map s_123_sequence.map

where:
• s_123_rmdup.map is the output filtered mapped alignment file.
• s_123_sequence.map is the input mapped alignment file.
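As an alternative to opening the SNP file in Excel, the tab-delimited rows can be read in Python. This is a minimal sketch that assumes the first nine columns are the columns A to I described in the SNP File section and ignores any further columns; the names SnpCall, parse_snp_line, and is_heterozygous are hypothetical. The IUB dictionary restates the IUB code table.

```python
from collections import namedtuple

SnpCall = namedtuple(
    "SnpCall",
    "reference position ref_base cons_base cons_quality depth "
    "avg_hits max_map_quality quality_diff",
)

# IUB codes and the bases they stand for, as listed in the IUB table.
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "K": "GT", "Y": "CT", "R": "AG", "W": "AT", "S": "GC",
    "D": "AGT", "B": "CGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def parse_snp_line(line):
    """Parse columns A to I of one tab-delimited SNP row."""
    f = line.rstrip("\n").split("\t")
    return SnpCall(f[0], int(f[1]), f[2], f[3], int(f[4]), int(f[5]),
                   float(f[6]), int(f[7]), int(f[8]))


def is_heterozygous(call):
    """True when the consensus base is an IUB code for more than one base."""
    return len(IUB[call.cons_base.upper()]) > 1
```

A row with consensus base R, for example, would be reported as heterozygous, because R stands for A/G.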

Indel Discovery

Extracting Indels

Once you have built the consensus, you can extract the indels the following way:

maq indelpe ref.bfa s_123_sequence.map >s_123_sequence.indelpe

where:
• ref.bfa is the reference file in binary FASTA.
• s_123_sequence.map is the merged mapped alignment file.
• s_123_sequence.indelpe is the tab-delimited output indel file.

Indel File

To view the indels found, open the indel file in Excel (Figure 4).

Figure 4 Indel File Opened in Excel

NOTE

You can only find indels using Maq with paired-end data.

The columns contain the following information:

Column    Name    Description
A    Chromosome / Reference    Chromosome or reference sequence.
B    Start Position    Start position of indel on reference sequence.
C    Indel Type    * Indicates the indel is confirmed by reads from both strands.
                   + Means the indel is hit by at least two reads but from the same strand.
                   - Shows the indel is only found on one read.
                   . Means the indel is too close to another indel and is filtered out.
D    # Ref Reads    The number of reads across the indel.
E    Indel Size    Size of indel.
F    Forward Reads    Number of reads on the forward strand confirming the consensus.
G    Reverse Reads    Number of reads on the reverse strand confirming the consensus.

NOTE

If you want to concentrate on the most promising indels, filter the indel file in Excel for * in the Indel Type field.
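The filter on the Indel Type field can also be applied outside Excel. The sketch below assumes the seven tab-delimited columns A to G described above; the names Indel, parse_indel_line, and confirmed_indels are hypothetical. The indel size is kept as text because this guide does not specify its exact format.

```python
from collections import namedtuple

Indel = namedtuple(
    "Indel",
    "reference start indel_type ref_reads size forward_reads reverse_reads",
)


def parse_indel_line(line):
    """Parse columns A to G of one tab-delimited indel row."""
    f = line.rstrip("\n").split("\t")
    return Indel(f[0], int(f[1]), f[2], int(f[3]), f[4], int(f[5]), int(f[6]))


def confirmed_indels(lines):
    """Keep indels of type '*', i.e. confirmed by reads from both strands."""
    return [rec for rec in map(parse_indel_line, lines) if rec.indel_type == "*"]
```

Calling confirmed_indels(open("s_123_sequence.indelpe")) would then return only the rows that the NOTE above calls the most promising indels.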


Viewing SNPs and Indels with GBrowse

Once you have files with SNPs and indels, you may want to view them in a genomic context. Many genome centers have implemented GBrowse, an open source genome viewer. This section helps you view your results in a GBrowse viewer.

You will need to perform the following steps:

1. Find a GBrowse implementation for the organism and build you are interested in.

2. Transfer your SNP or indel data to the proper file format.

3. Upload the file to GBrowse.

Now you are ready to look at your SNPs and indels as annotations in a genomic context.

GBrowse

GBrowse is an open source genome viewer, generated as part of the Generic Model Organism Database project (GMOD). Many genome centers and universities have implemented GBrowse to enable you to view their genomic data.

Finding Suitable GBrowse Implementation

Lists of implementations can be found at the following two websites:

http://www.gmod.org/wiki/index.php/GMOD_Users

http://www.gmod.org/wiki/index.php/Gbrowse

Browse through these lists and see if there is a GBrowse implementation for the organism and build you are interested in. These lists are not comprehensive; if you can’t find one you can use, try searching for GBrowse and your particular build in Google, and see if you can find an appropriate implementation that way.

Alternative Solutions

If no suitable implementation of GBrowse exists, you can do two things:
• Redo your alignments with a build that is supported in a GBrowse implementation.
• Install GBrowse locally. This is possible, but requires more work and skill. See http://www.gmod.org/wiki/index.php/GBrowse for instructions.

Reformatting Data

The SNP and indel files do not have the appropriate format for GBrowse to recognize. Fortunately, they are usually not extremely big, and can be handled in Excel, and you do not need a Perl script to change the format. This section explains how to reformat your SNP or indel data.

Annotation File Format

GBrowse can read a number of different file formats. Here we explain the annotation file format that works well with our data (Figure 5).

Figure 5 GBrowse File

The annotation file is a text file, and has to start with the following line:

reference=landmark name

The reference line has the following properties:
• The line starts with reference= (in lowercase).
• The line refers to the chromosome (reference=chr1) or the accession number of the organism (reference=NC_000913).
• No spaces are allowed.
• The reference applies to all entries below it, until a new reference is found. Multiple reference lines are allowed.

The reference line is followed by data lines, which have the following fields:

Column    Entry    Description
A    Feature Type    In our case SNP or INDEL.
B    Feature Name    A unique name for each entry.
C    Feature Position    One or more ranges in the format 123-456,987-654 or 123...456,987...654.
D    Description (optional)    A description that will be displayed in the viewer.
E    URL (optional)    If you have a hyperlink, provide it here.

NOTE

Do not use spaces, unless you put quotation marks around the field entry.


Reformatting SNP Files

To reformat the SNP file, perform the following steps:

1. Open the SNP file in Excel.

2. To get a unique SNP name, enter SNP1 in the top field of the empty column J.

3. You need to have a range of nucleotides for the feature position field. In the top field of the empty column K, enter:
=CONCATENATE(B1,"-",B1)

4. You need one field with an informative description for every SNP. In the top field of the empty column L, enter:
=CONCATENATE(C1,">",D1,",Q",E1,",",B1)

The SNP description will consist of the following information:
reference base>consensus base,quality score,position

5. To copy all formulas and calculate values for every entry:
a. Select fields J1, K1, and L1.
b. Drag down the selected fields by the bottom right corner (Figure 6).

Figure 6 Drag Down Bottom Right Corner

The values in columns K and L should automatically recalculate, and column J should be filled with unique names (SNP1, SNP2, and so on).

6. Save the file in Excel format (*.xls).

7. Open a new workbook. This will be the annotation file.

8. Copy the values from columns J, K and L of the modified SNP file to columns B, C and D of the annotation file (paste values only).

9. Enter “SNP” in the top field of the empty column A of the annotation file. Copy “SNP” all the way down to the last data line.

10. Select the first row and insert an empty line by pressing Ctrl+Shift+Plus (+).

11. Enter the reference line in field A1, for example:
reference=chr1
or
reference=NC_000913


The SNP annotation file should look like this (Figure 7):

Figure 7 SNP Annotation File

12. Save the SNP annotation file as a text (tab delimited) file (*.txt).
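For larger SNP files, the spreadsheet steps above could also be scripted. The following Python sketch mirrors the Excel formulas: it assumes a tab-delimited SNP file in which columns B to E hold the position, reference base, consensus base, and quality score. The function name snp_rows_to_annotation and the example values are hypothetical.

```python
import csv
import io


def snp_rows_to_annotation(rows, reference):
    """Turn SNP rows into annotation lines (a sketch of the Excel steps).

    Each row is a list of column values; columns B..E (indexes 1..4) are
    assumed to be position, reference base, consensus base, and quality.
    """
    lines = [f"reference={reference}"]            # step 11: reference line
    for number, row in enumerate(rows, start=1):
        position, ref_base, cns_base, quality = row[1:5]
        name = f"SNP{number}"                     # column J: unique name
        feature_range = f"{position}-{position}"  # column K: position range
        description = f"{ref_base}>{cns_base},Q{quality},{position}"  # column L
        lines.append("\t".join(["SNP", name, feature_range, description]))
    return lines


# Hypothetical two-line SNP file, tab delimited.
example = "chr1\t1042\tA\tG\t30\t12\nchr1\t2077\tC\tT\t45\t20\n"
rows = list(csv.reader(io.StringIO(example), delimiter="\t"))
annotation = snp_rows_to_annotation(rows, "chr1")
print("\n".join(annotation))
```

To use this on a real file, the rows would be read from the SNP file with csv.reader and the resulting lines written to a text file (*.txt) for upload.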

Reformatting Indel Files

To reformat the indel file, perform the following steps:

1. Open the indel file in Excel.

2. To get a unique indel name, enter INDEL1 in the top field of the empty column H.

3. You need to have a range of nucleotides for the feature position field. In the top field of the empty column I, enter:
=CONCATENATE(B1,"-",B1)

4. You need one field with an informative description for every indel. In the top field of the empty column J, enter:
=CONCATENATE(C1,",",E1,",f",F1,",r",G1)

The indel description will consist of the following information:
indel type,indel size,f forward reads,r reverse reads

5. To copy all formulas and calculate values for every entry:
a. Select fields H1, I1, and J1.
b. Drag down the selected fields by the bottom right corner (Figure 6).

The values in columns I and J should automatically recalculate, and column H should be filled with unique names (INDEL1, INDEL2, and so on).

NOTE

You can refer to multiple chromosomes per file; just insert a reference line with the new chromosome above the data line where the next chromosome starts. The reference applies to all entries below it, until a new reference is found.

NOTE

If you want to concentrate on the most promising indels, filter the indel file in Excel for * in the Indel Type field (column C), and copy all the promising indels to a new workbook.


6. Save the file in Excel format (*.xls).

7. Open a new workbook. This will be the annotation file.

8. Copy the values from columns H, I, and J of the modified indel file to columns B, C and D of the annotation file (paste values only).

9. Enter “INDEL” in the top field of the empty column A of the annotation file. Copy “INDEL” all the way down to the last data line.

10. Select the first row and insert an empty line by pressing Ctrl+Shift+Plus (+).

11. Enter the reference line in field A1, for example:
reference=chr1
or
reference=NC_000913

The indel annotation file should look like this (Figure 8):

Figure 8 Indel Annotation File

12. Save the indel annotation file as a text (tab delimited) file (*.txt).
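The indel steps can be sketched in Python in the same spirit. This sketch assumes that column A of the indel file holds the chromosome name (an assumption not stated in the steps above), and that columns B, C, E, F, and G hold the position, indel type, indel size, forward reads, and reverse reads, as in the Excel formulas. It writes a new reference line whenever the chromosome changes and can optionally keep only the indels marked *. The function name and the example values are hypothetical.

```python
def indel_rows_to_annotation(rows, promising_only=False):
    """Turn indel rows into annotation lines (a sketch of the Excel steps)."""
    lines = []
    current_reference = None
    number = 0
    for row in rows:
        chromosome, position, indel_type = row[0], row[1], row[2]
        size, forward, reverse = row[4], row[5], row[6]
        # Optional filter: keep only indels with * in the Indel Type field.
        if promising_only and indel_type != "*":
            continue
        # A reference line applies to all entries below it, so a new one
        # is only needed when the chromosome changes.
        if chromosome != current_reference:
            lines.append(f"reference={chromosome}")
            current_reference = chromosome
        number += 1
        name = f"INDEL{number}"                   # column H: unique name
        feature_range = f"{position}-{position}"  # column I: position range
        description = f"{indel_type},{size},f{forward},r{reverse}"  # column J
        lines.append("\t".join(["INDEL", name, feature_range, description]))
    return lines


# Hypothetical indel rows (columns A to G).
rows = [
    ["chr1", "1500", "*", "8", "-2", "3", "4"],
    ["chr1", "2600", "+", "5", "1", "2", "1"],
    ["chr2", "310", "*", "9", "3", "5", "4"],
]
all_indels = indel_rows_to_annotation(rows)
promising = indel_rows_to_annotation(rows, promising_only=True)
print("\n".join(promising))
```

Numbering the indels after filtering keeps the names consecutive in the output; if the names need to match the unfiltered file, the counter would have to be incremented before the filter instead.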

Using GBrowse

When you have generated your annotation file and found a suitable GBrowse implementation, you can start viewing your indels or SNPs in a genomic context.

For comprehensive GBrowse help, FAQs and a tutorial, see http://www.gmod.org/wiki/index.php/Gbrowse.

Upload the Annotation File

1. Point your web browser to the web site running GBrowse.

2. Scroll down to the bottom of the page, where you can upload your own annotations (Figure 9). Different GBrowse implementations may look slightly different.


Figure 9 Upload Annotation File

3. Click Browse, go to the annotation file, select the file, and click Open.

4. Click Upload.

Viewing SNPs and Indels

Once your annotation file is uploaded you will see the file appear with the separate features (Figure 10).

Figure 10 Uploaded Annotation File

Make sure the annotation check box is selected. You can now edit the uploaded annotation file, or click on the separate features (SNPs or indels). This will display the feature in the viewer panel (Figure 11 and Figure 12).

Figure 11 Your Favorite SNP in the GBrowse Viewer



Figure 12 Your Favorite Indels in the GBrowse Viewer



Appendix A: Installing Maq Yourself

If you decide to install Maq yourself, do the following:

1. Open your browser in Linux and navigate to maq.sourceforge.net.

2. Click on the link download page (see Figure 2).

3. Click on the link Download for the most recent version of Maq.

4. Click on the package for your Linux and hardware configuration. If you are not sure which one is best, choose platform independent.

5. Click Save to download the package.

6. Repeat steps 3 to 5 for Maqview and Maq-Data.

7. Open the command line (Terminal).

8. Go to the directory containing the downloaded files using the cd command. The exact location depends on how your Linux is set up.

9. To unzip the packages, type the following in the command line:
bunzip2 *.bz2

10. List the directory contents by using the ls command.

11. To extract the files from the archive, type the following for every *.tar file in the directory:
tar xvf name.tar

You should get three new directories (check by using the ls command).

12. Go to the directory containing the Maq files:
cd maq-x.x.x

13. Install the package by entering the following three commands in succession:
./configure
make
make install

14. If you get a message that access is denied to the default install directory, you need to specify a directory that you do have access to. Enter the following two commands:
./configure --prefix=/home/share/yourfolder
(where /home/share/yourfolder is a directory you have access to)
make install

15. Go one directory up:
cd ..

16. Test whether Maq is working by entering:
maq

You should get a message explaining Maq usage. If the command maq is not recognized, try the second method described in the Maq User Manual, or ask a Linux expert for help.
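If the maq command is not recognized after installing with --prefix, one possible cause is that the install directory is not on your PATH. The Python sketch below checks this; it assumes that make install placed the executable in the bin subdirectory of the prefix, which is the usual convention but is not confirmed here. The function name find_maq is hypothetical.

```python
import os
import shutil


def find_maq(prefix=None):
    """Return the path to the maq executable, or None if it is not found.

    prefix is the directory passed to ./configure --prefix, if any. This
    sketch assumes the binary was installed in its bin subdirectory.
    """
    found = shutil.which("maq")  # search the directories on PATH first
    if found is None and prefix is not None:
        found = shutil.which("maq", path=os.path.join(prefix, "bin"))
    return found


location = find_maq(prefix="/home/share/yourfolder")
if location is None:
    print("maq not found; check your PATH or the install directory")
else:
    print(f"maq found at {location}")
```

If the executable is found only under the prefix, adding that bin directory to PATH should make the plain maq command work.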


Appendix B: Quality Value Tables

Illumina Symbolic ASCII Quality Values

The quality value of each character in the Illumina symbolic ASCII format is listed in the table below:

Table 1 Quality Value of Characters in the Illumina Symbolic ASCII Format

Char.  Qual.   Char.  Qual.   Char.  Qual.   Char.  Qual.   Char.  Qual.   Char.  Qual.
;      -5      C      3       K      11      S      19      [      27      c      35
<      -4      D      4       L      12      T      20      \      28      d      36
=      -3      E      5       M      13      U      21      ]      29      e      37
>      -2      F      6       N      14      V      22      ^      30      f      38
?      -1      G      7       O      15      W      23      _      31      g      39
@      0       H      8       P      16      X      24      `      32      h      40
A      1       I      9       Q      17      Y      25      a      33
B      2       J      10      R      18      Z      26      b      34
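Table 1 follows a simple rule: the quality value is the ASCII code of the character minus 64. A minimal Python sketch of that lookup is shown below; the function names are hypothetical.

```python
ILLUMINA_OFFSET = 64  # '@' (ASCII code 64) encodes quality value 0 in Table 1


def illumina_quality(char):
    """Return the quality value for one Illumina symbolic ASCII character."""
    value = ord(char) - ILLUMINA_OFFSET
    # Table 1 covers the characters ';' (-5) through 'h' (40).
    if not -5 <= value <= 40:
        raise ValueError(f"character {char!r} is outside the range of Table 1")
    return value


def decode_illumina(quality_string):
    """Decode a whole quality string into a list of quality values."""
    return [illumina_quality(char) for char in quality_string]


print(decode_illumina(";@Hh"))  # [-5, 0, 8, 40]
```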


Sanger Symbolic ASCII Quality Values

The quality value of each character in the Sanger symbolic ASCII format is listed in the table below:

Table 2 Quality Value of Characters in the Sanger Symbolic ASCII Format

Char.  Qual.   Char.  Qual.   Char.  Qual.   Char.  Qual.   Char.  Qual.   Char.  Qual.   Char.  Qual.
!      0       /      14      =      28      K      42      Y      56      g      70      u      84
"      1       0      15      >      29      L      43      Z      57      h      71      v      85
#      2       1      16      ?      30      M      44      [      58      i      72      w      86
$      3       2      17      @      31      N      45      \      59      j      73      x      87
%      4       3      18      A      32      O      46      ]      60      k      74      y      88
&      5       4      19      B      33      P      47      ^      61      l      75      z      89
'      6       5      20      C      34      Q      48      _      62      m      76      {      90
(      7       6      21      D      35      R      49      `      63      n      77      |      91
)      8       7      22      E      36      S      50      a      64      o      78      }      92
*      9       8      23      F      37      T      51      b      65      p      79      ~      93
+      10      9      24      G      38      U      52      c      66      q      80
,      11      :      25      H      39      V      53      d      67      r      81
-      12      ;      26      I      40      W      54      e      68      s      82
.      13      <      27      J      41      X      55      f      69      t      83
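In Table 2 the quality value is the ASCII code of the character minus 33. The Python sketch below encodes and decodes Sanger symbolic ASCII characters using that offset; the function names are hypothetical. It only translates between characters and the values listed in Table 2 and does not convert between the Illumina and Sanger quality scales.

```python
SANGER_OFFSET = 33  # '!' (ASCII code 33) encodes quality value 0 in Table 2


def sanger_quality(char):
    """Return the quality value for one Sanger symbolic ASCII character."""
    value = ord(char) - SANGER_OFFSET
    # Table 2 covers the characters '!' (0) through '~' (93).
    if not 0 <= value <= 93:
        raise ValueError(f"character {char!r} is outside the range of Table 2")
    return value


def sanger_char(quality):
    """Return the Sanger symbolic ASCII character for a quality value."""
    if not 0 <= quality <= 93:
        raise ValueError(f"quality {quality} is outside the range of Table 2")
    return chr(quality + SANGER_OFFSET)


print(sanger_quality("I"))  # 40
print(sanger_char(20))      # 5
```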


Illumina, Inc. 9885 Towne Centre Drive San Diego, CA 92121-1975 +1.800.809.ILMN (4566) +1.858.202.4566 (outside North America) [email protected]