Pipeline and CASAVA Quick Reference Booklet


ILLUMINA PROPRIETARY

Part #15001325 Rev. A
January 2009

Pipeline and CASAVA Quick Reference Booklet

FOR RESEARCH ONLY

Topics

Pipeline
    Pipeline Concepts
    Running Analysis
    Command Line Options
    GERALD Parameters
    GERALD Configuration File
    ELAND Alignments
    Run Folder Changes in Pipeline v1.3
CASAVA
    Estimating Build Depth
    Running CASAVA
    Running Specific Use Cases
    CASAVA Output Files


This publication and its contents are proprietary to Illumina, Inc., and are intended solely for the contractual use of its customers and for no other purpose than to operate the system described herein. This publication and its contents shall not be used or distributed for any other purpose and/or otherwise communicated, disclosed, or reproduced in any way whatsoever without the prior written consent of Illumina, Inc.

For the proper operation of this system and/or all parts thereof, the instructions in this guide must be strictly and explicitly followed by experienced personnel. All of the contents of this guide must be fully read and understood prior to operating the system or any of the parts thereof.

FAILURE TO COMPLETELY READ AND FULLY UNDERSTAND AND FOLLOW ALL OF THE CONTENTS OF THIS GUIDE PRIOR TO OPERATING THIS SYSTEM, OR PARTS THEREOF, MAY RESULT IN DAMAGE TO THE EQUIPMENT, OR PARTS THEREOF, AND INJURY TO ANY PERSONS OPERATING THE SAME.

Illumina, Inc. does not assume any liability arising out of the application or use of any products, component parts, or software described herein. Illumina, Inc. further does not convey any license under its patent, trademark, copyright, or common-law rights nor the similar rights of others. Illumina, Inc. further reserves the right to make any changes in any processes, products, or parts thereof, described herein without notice. While every effort has been made to make this guide as complete and accurate as possible as of the publication date, no warranty of fitness is implied, nor does Illumina, Inc., accept any liability for damages resulting from the information contained in this guide.

© 2008, 2009 Illumina, Inc. All rights reserved. Illumina, Solexa, Making Sense Out of Life, Oligator, Sentrix, GoldenGate, DASL, BeadArray, Array of Arrays, Infinium, BeadXpress, VeraCode, IntelliHyb, iSelect, CSPro, iScan, and GenomeStudio are registered trademarks or trademarks of Illumina. All other brands and names contained herein are the property of their respective owners.


Pipeline

Pipeline Concepts

Analysis Modules

The Pipeline is divided into modules, each a collection of Perl or Python scripts and C++ executables. The first two scripts (goat_pipeline.py and bustard.py) can each invoke the next script automatically.

Typically, the analysis begins with the base calling script bustard.py, using intensity data generated by IPAR.

Pipeline Workflow

The standard workflow for invoking the Pipeline modules is as follows:

1. Navigate (via the command line) to the Run Folder location.

2. Create a configuration file that specifies what analysis should be done for each lane.

3. Run a check on the Run Folder.

4. Add command line options, then generate the analysis folder and the corresponding makefiles.

5. Navigate to the analysis directory and start your analysis by executing makefiles.

[Figure: Pipeline modules and scripts. Firecrest (goat_pipeline.py, GOAT) performs image analysis; Bustard (bustard.py) performs base calling; GERALD (GERALD.pl) performs sequence analysis, visualization, and alignment using ELAND or PhageAlign. Each module has its own makefile. Image analysis is performed either by IPAR or by the Pipeline.]


Running Analysis

Although several different software programs are involved in an analysis run, a single command generates the analysis folders, and a second command (make recursive) starts a complete analysis.

Starting with IPAR Image Analysis Data

Usage

/<PL1.3path>/bin/bustard.py <run-folder-directory>/Data/IPAR_1.3
    [--matrix=mymatrix.txt|auto|auto<n>]
    [--phasing=0.01|auto|auto<n>]
    [--prephasing=0.01]
    [--with-sig2] [--with-seq] [--with-prb] [--with-qhg] [--with-qval]
    [--directory=/path/C1-14_Firecrest1.8.20_01-08-2006_user]
    [--make]
    [--GERALD=/path/config.txt]
    [--control-lane=5]

Data Analysis

1. Generate pipeline makefiles and analysis structure:

/<PL1.3path>/bin/bustard.py <RunFolder>/Data/IPAR_1.3 --make

All standard pipeline parameters are available for use.

2. Navigate to the Bustard subdirectory generated in the IPAR_1.3 directory and execute the makefiles:

• To perform base calling only:

make all

• For base calling and alignment (GERALD analysis):

make recursive

Starting with Image Analysis

Usage

/path-to-pipeline/bin/goat_pipeline.py <run-folder-directory> [<run-folder-directory2>]
    [--cycles=1-25|auto]
    [--tiles=s_1,s_2_0003,...]
    [--matrix=mymatrix.txt|auto|auto<n>]
    [--offsets=/path/default_offsets.txt|auto]
    [--phasing=0.01|auto|auto<n>]
    [--prephasing=0.01]
    [--with-sig2] [--with-seq] [--with-prb] [--with-qhg] [--with-qval]
    [--directory=/path/C1-14_Firecrest1.8.20_01-08-2006_user]
    [--make]
    [--GERALD=/path/config.txt]
    [--control-lane=5]


Data Analysis

1. Type the following command to run a check on the Run Folder:

/path-to-pipeline/bin/goat_pipeline.py --GERALD=/data/070813_ILMN-1_0217_1234/config.txt /data/070813_ILMN-1_0217_1234

2. Add --make to the command listed above to create an analysis directory in the Run Folder. If you specify the --GERALD option, you will create the GERALD analysis folder and the corresponding makefile.

/path-to-pipeline/bin/goat_pipeline.py --GERALD=/data/070813_ILMN-1_0217_1234/config.txt --make /data/070813_ILMN-1_0217_1234

3. Change to the newly generated analysis directory and start the analysis:

make recursive
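The check and build invocations above differ only in the --make flag, which a small helper can make explicit. This is a minimal sketch, not part of the Pipeline: the function name goat_command is hypothetical, and it only assembles the argument list using the options and example paths shown in this section.

```python
from typing import List, Optional


def goat_command(pipeline_bin: str, run_folder: str,
                 gerald_config: Optional[str] = None,
                 make: bool = False) -> List[str]:
    """Assemble a goat_pipeline.py argument list (hypothetical helper)."""
    cmd = [f"{pipeline_bin}/goat_pipeline.py"]
    if gerald_config:
        # Also generate the GERALD analysis folder and makefile.
        cmd.append(f"--GERALD={gerald_config}")
    if make:
        # Without --make the script only checks the Run Folder.
        cmd.append("--make")
    cmd.append(run_folder)
    return cmd


run_folder = "/data/070813_ILMN-1_0217_1234"
config = f"{run_folder}/config.txt"

check_cmd = goat_command("/path-to-pipeline/bin", run_folder, config)
build_cmd = goat_command("/path-to-pipeline/bin", run_folder, config, make=True)
print(" ".join(check_cmd))
print(" ".join(build_cmd))
```

Either list could be passed to subprocess.run once the check step reports no problems.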

Paired Reads

The simplest way to use paired-read data assumes that you have a single Run Folder containing the images for both reads. For Genome Analyzer software SCS 1.0 and Pipeline version 0.3 and later, the Pipeline automatically knows where the second read starts.

An alternative way assumes that the two reads of a pair are stored in two separate Run Folders. Specify both folders as arguments to goat_pipeline.py. The two-folder approach does not work with IPAR data.

Parallelization Switch

If your system supports automatic load-sharing to multiple CPUs, you can parallelize the analysis run across <n> processes by using the parallelization switch of the “make” utility.

make recursive -j n

Nohup Command

Use the Unix nohup command to redirect the standard output and keep the “make” process running even if your terminal is interrupted or you log out.

nohup make recursive -j n &
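When the analysis is launched from a script, the value of n can be taken from the machine rather than typed by hand. The sketch below is an illustration only: the helper name make_recursive_command is hypothetical, and defaulting to the local core count is an assumption that may not suit a shared or load-balanced system.

```python
import os
from typing import List, Optional


def make_recursive_command(jobs: Optional[int] = None,
                           use_nohup: bool = True) -> List[str]:
    """Build the 'make recursive -j n' argument list (hypothetical helper)."""
    # Assumption: fall back to the number of local cores when jobs is not given.
    n = jobs if jobs is not None else (os.cpu_count() or 1)
    if n < 1:
        raise ValueError("jobs must be at least 1")
    cmd = ["make", "recursive", "-j", str(n)]
    # nohup keeps make running after the terminal session ends.
    return ["nohup"] + cmd if use_nohup else cmd


print(" ".join(make_recursive_command(4)))
```

Passing the list to subprocess.Popen with cwd set to the analysis directory returns immediately, which plays the role of the trailing & in the shell form.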


Command Line Options

You can invoke the goat_pipeline.py and bustard.py scripts with a number of optional command line arguments.

General Options

--make

The --make option creates the analysis directory and a makefile in the relevant analysis directory.

--new-read-cycle=<cycle>

Use this option to start a new read at the given cycle in a paired-end run.

--GERALD=<config.txt>

Use this option to start the GERALD makefile generator after the Bustard folder is created.

--tiles=<tile>|<lane>[,<tile>|<lane>,...]

Use this option to select certain tiles for analysis.

--cycles=<cycle>[-<cycle>[,<cycle>[-<cycle>...]]]

Use this option to select certain cycles for analysis. --cycles cannot be used when starting from IPAR analysis files.

--compression=<method>

Allowed values are “none” and “gzip” (the default).
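The --cycles syntax combines single cycles and inclusive ranges separated by commas. A short sketch of how such a specification expands may help when checking a value before a run. The function expand_cycles is hypothetical and reflects only the syntax shown above; it does not handle the auto value that appears in the usage line.

```python
from typing import List


def expand_cycles(spec: str) -> List[int]:
    """Expand a spec such as '1-3,7,10-12' into a list of cycle numbers."""
    cycles: List[int] = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            # A range is inclusive at both ends.
            first, last = (int(x) for x in part.split("-", 1))
        else:
            first = last = int(part)
        if first < 1 or last < first:
            raise ValueError(f"invalid cycle range: {part!r}")
        cycles.extend(range(first, last + 1))
    return cycles


print(expand_cycles("1-3,7,10-12"))
```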

GOAT Options

--nobasecall

Use --nobasecall to skip the base calling step in the analysis.

--offsets=<filename> | auto | default

Use --offsets=<filename> to specify a particular default offsets file.

GOAT and Bustard Options

--control-lane=<n>

Use this option to select a lane <n> that is used to estimate phasing and matrix correction for all other lanes.

--matrix=<filename> | auto | auto<n> | lane

Use the --matrix option to specify the frequency cross-talk matrix file, where <filename> is the path of the matrix file.

--phasing=<x> | auto | auto<n>

Use the --phasing option to apply a particular phasing correction.


--prephasing=<x>

Use the --prephasing option to apply a particular correction for prephasing.

--with-sig2, --with-seq, --with-prb, --with-qhg, --with-qval

Use these options to generate the sig2, seq, prb, qhg, and qval files, respectively.

Paired Reads

--phasing=<read>:value, --phasing=<read>:<read>

Use this option to specify phasing options for one specific read of a pair.

--matrix=<read>:value, --matrix=<read>:<read>

Use this option to specify matrix options for one specific read of a pair.

Makefile Targets

all

The default makefile target. It runs the complete analysis in the current directory (image analysis or base calling).

-j <n>

This parallelization switch can be used with the “make” command to execute the Pipeline run in parallel over <n> number of processor cores.

clean

This target removes all analysis output files.

recursive

This target performs the analysis in the current directory and in all available subdirectories.

compress

This target uses gzip to apply a loss-less compression to the output files after an analysis run.

uncompress

This target uncompresses a folder that has previously been compressed and returns it to its original state.

compress_images

This target uses bzip2 to compress the image data in the Images folder.

uncompress_images

This target uncompresses the Images folder that has previously been compressed and returns it to its original state.


GERALD Parameters

GERALD can be run in various analysis modes. Customize your analysis by specifying variables, parameters, and options.

Table 1 ANALYSIS Variables

ANALYSIS eland_extended

Alignment program: ELAND. Application: single reads. Aligns single-read data against a target using ELAND alignments.

ANALYSIS eland_pair

Alignment program: ELAND. Application: paired reads. Aligns paired-end reads against a target using ELAND alignments. A single-read alignment is done for each half of the pair, and then the best-scoring alignments are compared to find the best paired-read alignment.

ANALYSIS eland_tag

Alignment program: ELAND. Application: gene expression. Aligns reads to a non-redundant reference set of separate sequence tags and produces exact matches.

ANALYSIS none

Alignment program: none. Application: any. Omits the indicated lane from the analysis. Setting the parameter 8:ANALYSIS none ignores lane 8.

ANALYSIS default

Alignment program: PhageAlign. Application: single reads. Aligns each read against a reference sequence using PhageAlign. This mode is suitable only for small genome references.

ANALYSIS eland_rna

Alignment program: ELAND. Application: single reads. Aligns each read against a large reference genome, splice junctions, and contaminants using ELAND.

Table 2 Analysis Parameters

USE_BASES

Use this parameter to identify bases to be used for alignment analysis. The USE_BASES string uses an asterisk (*) to indicate “fill up the read as far as possible with the preceding character.” If USE_BASES all is set, all sequenced bases will show up in the analysis results. Otherwise, only cycles that have a Y at the corresponding position in the USE_BASES string will appear in the results.

SEQUENCE_FORMAT

This parameter specifies what format to use for data export in the s_N_sequence.txt file. Allowed values are --fasta, --fastq, or --scarf.

QCAL_SOURCE

This parameter specifies the base call calibration that is used. Allowed values are auto, auto<n>, upstream, or /path/to/qtable.txt.


Make Option

The --make option creates GERALD directories and makefiles. Without this option, GERALD does not create any directories or files and operates only in a diagnostic mode. You must specify this option to generate the GERALD analysis folder and subsequently run the analysis.

Rerunning the Analysis

The config.txt file used to generate an analysis is copied to the analysis folder so it can be used by GERALD if a reanalysis of the same data is required. To change parameters and rebuild the analysis, modify the configuration file and run the following command:

GERALD.pl config.txt --make

Building an SRF Archive

With version 1.3, the Pipeline is distributed with a modified version of io_lib and allows the generation of SRF archives. This is done by adding the following line in the config.txt file:

1:SRF_ARCHIVE_REQUIRED yes

This will create an SRF archive containing the sequences and the quality values. To create different archives, for instance including the signal and noise, or with a different filter, the tool illumina2srf should be used manually.

Table 3 Lane-by-Lane Parameters

6:ELAND_GENOME /directory/genome

ELAND_GENOME points to a directory of squashed genome files. Specify the name of the folder containing the reference sequence(s) for lane 6.

67:ELAND_GENOME /directory/genome

Specify the name of the folder containing the reference sequence(s) to use for lanes 6 and 7.


GERALD Configuration File

This section describes a typical GERALD configuration file that uses the current features and parameters.

Table 4 GERALD Configuration File Parameters

EXPT_DIR data/070813_ILMN-1_0217_FC1234/Data/C1-27_Firecrest1.9.0_23-08-2007-user/Bustard1.9.0_23-08-2007_user/

Provide the path to the experiment directory, if not specified on the command line or auto-completed by goat_pipeline.py.

USE_BASES nY*n

Ignore the first and last base of the read. The USE_BASES string contains a character for each cycle.

• If the character is “Y”, the cycle is used for alignment.
• If the character is “n”, the cycle is ignored.
• Wild cards (*) are expanded to the full length of the read.

ELAND_GENOME /home/user/Genomes/Eland/BAC_plus_vector/

Specify the genome reference for alignment with ELAND.

GENOME_DIR /home/user/Genomes
GENOME_FILE BAC_plus_vector.fa

Specify the genome reference directory and file for alignment with PhageAlign.

ANALYSIS eland_extended

Specify the type of alignment that should be performed.
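The USE_BASES rules in Table 4 can be made concrete by expanding a string into one character per cycle. The sketch below is an interpretation of the rules described in this booklet, not GERALD's own parser. The function name expand_use_bases is hypothetical, and two behaviors are assumptions: a wildcard fills whatever length the other characters leave over, and cycles beyond the end of the string are treated as ignored.

```python
import re


def expand_use_bases(spec: str, read_length: int) -> str:
    """Expand a single-read USE_BASES string into a per-cycle Y/n mask."""
    if spec == "all":
        return "Y" * read_length
    # Each token is Y or n, optionally followed by '*' or a repeat count.
    tokens = re.findall(r"([Yn])(\*|\d+)?", spec)
    if not tokens or "".join(ch + rep for ch, rep in tokens) != spec:
        raise ValueError(f"unrecognized USE_BASES string: {spec!r}")
    if sum(rep == "*" for _, rep in tokens) > 1:
        raise ValueError("this sketch supports at most one wildcard")
    fixed = sum(int(rep) if rep else 1 for _, rep in tokens if rep != "*")
    if fixed > read_length:
        raise ValueError("USE_BASES string is longer than the read")
    mask = ""
    for ch, rep in tokens:
        if rep == "*":
            mask += ch * (read_length - fixed)  # assumption: fill the remainder
        elif rep:
            mask += ch * int(rep)
        else:
            mask += ch
    # Assumption: cycles not covered by the string are ignored.
    return mask.ljust(read_length, "n")


print(expand_use_bases("nY*n", 6))
print(expand_use_bases("nY20", 25))
```

For a paired-end value such as Y*,nY*n, each comma-separated part can be expanded separately with the length of its own read.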

Table 5 GERALD Configuration File Lane-Specific Options

7:USE_BASES nY20

Align only 20 cycles for lane 7, starting with the second cycle.

567:ANALYSIS eland_extended
567:USE_BASES all

Align lanes 5, 6, and 7 only against a genomic sample.

8:ANALYSIS none

Omit lane 8, which contains only primers.

3:QCAL_SOURCE auto8

Lane 3 will use the lane 8 qtable.

123:QCAL_SOURCE auto8

Lanes 1–3 will use the lane 8 qtable.
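The lane prefix convention in Table 5 (digits before a colon select the lanes a setting applies to) can be illustrated with a small parser. This is a sketch under stated assumptions rather than GERALD's actual behavior: it assumes eight lanes, assumes that a line without a prefix applies to all lanes, and assumes that later lines override earlier ones. The name parse_gerald_config is hypothetical.

```python
from typing import Dict, Iterable

ALL_LANES = "12345678"  # assumption: eight lanes per flow cell


def parse_gerald_config(lines: Iterable[str]) -> Dict[int, Dict[str, str]]:
    """Resolve config lines into a per-lane dictionary of settings."""
    settings: Dict[int, Dict[str, str]] = {int(lane): {} for lane in ALL_LANES}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        lanes = ALL_LANES  # assumption: no prefix means every lane
        head, _, rest = line.partition(":")
        if rest and head and set(head) <= set(ALL_LANES):
            # A prefix such as '567' restricts the setting to those lanes.
            lanes, line = head, rest
        parts = line.split(None, 1)
        key = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        for lane in lanes:
            settings[int(lane)][key] = value  # assumption: last line wins
    return settings


cfg = parse_gerald_config([
    "ANALYSIS eland_extended",
    "USE_BASES nY*n",
    "7:USE_BASES nY20",
    "8:ANALYSIS none",
    "123:QCAL_SOURCE auto8",
])
print(cfg[7])
print(cfg[8])
```

A value that itself contains a colon, such as WEB_DIR_ROOT file://server.example.com/share, is left intact because the text before the colon is not a run of lane digits.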

Table 6 GERALD Configuration File Optional Parameters

EMAIL_LIST [email protected] [email protected]

Send a notification to the user at the end of an analysis run.

WEB_DIR_ROOT file://server.example.com/share

Include hyperlinks with a specific prefix to the Run Folder.

BAD_TILES s_1_0001 s_2_0003

Identify bad tiles. These tiles will be aligned but excluded from coverage.

POST_RUN_COMMAND /yourPath/yourCommand yourArgs

Allows user-defined scripts to be run after all GERALD targets have been built.

Table 7 GERALD Configuration File Paired-End Analysis Options

ANALYSIS eland_pair

Use the paired-end alignment mode of ELAND to align paired reads against a target.

USE_BASES Y*,nY*n

Use all bases on the first read and ignore the first and last base of the second read.

6:USE_BASES nY25

Ignore the first base on both the first and second read; use 25 bases each and ignore any other bases.


ELAND Alignments

ANALYSIS eland_extended

Two parameters affect the output of the alignment: ELAND_SEED_LENGTH and ELAND_MAX_MATCHES. Both parameters can be specified lane by lane.

Table 8 Parameters for ANALYSIS eland_extended

ELAND_SEED_LENGTH

By default, the first 32 bases of the read are used as a “seed” alignment. Setting ELAND_SEED_LENGTH to 25 will use 25 bases.

ELAND_MAX_MATCHES

By default, ANALYSIS eland_extended will consider at most ten alignments of each read. This can be varied between 1 and 255.

ANALYSIS eland_pair

ANALYSIS eland_pair allows the analysis of a paired-read run using ELAND alignments. The following table describes the parameters for ANALYSIS eland_pair.

Table 9 Parameters for ANALYSIS eland_pair

--circular

This causes pickBestPair to treat each chromosome as circular and not linear.

--min-percent-unique-pairs

The number of unique pairs, expressed as a percentage of the total number of clusters passing filters, must exceed a certain percentage. Otherwise, no pairing is attempted and the two reads are effectively treated as two sets of single reads.

--min-percent-consistent-pairs

Of the unique pairs, the vast majority should have the same orientation with respect to each other. By default, the threshold for this parameter is set to 70%.

--min-paired-read-alignment-score

For each cluster, all possible pairings of alignments between the two reads are compared. This is the score of the best one.

--min-single-read-alignment-score

If a read has a zero paired-read alignment score, but a single-read alignment score that exceeds this threshold, its alignment will still go in the sorted.txt files.

--add-shadow-to-singleton-threshold

If one read has a score exceeding --min-single-read-alignment-score but the other read does not, then the non-aligning “shadow” read is added to the sorted.txt file with a zero alignment score, if the combined base quality of the shadow read exceeds this threshold.


ANALYSIS eland_tag

Using ANALYSIS eland_tag to align experimental reads to a reference set reports not only exact matches but also matches with one or two mismatches. ANALYSIS eland_tag uses ELAND to align to a non-redundant set of annotation tags. Illumina provides human and mouse annotation that consists of a non-redundant set of all possible GATC+16 or CATG+17 sequences in the genome and transcriptome, choosing the best annotation for each distinct sequence. You may also use publicly available annotation for SAGE tags or generate your own.

ANALYSIS eland_rna

Eland_rna is the ELAND module built specifically for RNA Sequencing.

Prerequisites

Three sets of data files are needed:

• An NCBI genome sequence.
• A set of splice junction sequences.
• A set of contaminant sequences for the genome.

All three of these datasets must be squashed into the 2-bits-per-base format that the ELAND aligner understands.

Running an Eland_rna Analysis

The ANALYSIS parameter within the GERALD configuration file specifies what analysis to perform on the sequences. Set this parameter as follows (example shown):

ANALYSIS eland_rna

ELAND_GENOME /data/Genome/ELAND/hg18

ELAND_RNA_GENOME_SPLICE /data/Genome/ELAND_RNA/Human/human.34.splice

ELAND_RNA_GENOME_CONTAM /data/Genome/ELAND_RNA/Human/MT_Ribo_Filter

Table 10 Parameters for ANALYSIS eland_rna

ELAND_GENOME

Must point to a squashed version of the human genome.

ELAND_RNA_GENOME_SPLICE

Must point to a squashed version of the splice junction file.

ELAND_RNA_GENOME_CONTAM

Must point to a squashed version of the files of ultra-abundant sequences. Any read that aligns to these is ignored.


Run Folder Changes in Pipeline v1.3

[Figure: Run Folder structure in Pipeline v1.0 compared with Pipeline v1.3. In both versions the Run Folder (<ExperimentName>, named YYMMDD_machinename_XXXX) contains an Images folder (L001 by lane, C1.1 (C Lane.Cycle), .tif files) and a Data folder holding the image analysis folder, the Bustard base calling folder, and the GERALD folder with alignment and visualization files. The figure annotates the following changes in v1.3:]

• The image analysis folder is generated by IPAR or Firecrest.
• _pos.txt files are moved up to the image analysis folder.
• .params information is moved to a config.xml file in each sub-folder.
• _seq.txt, _sig2.txt, and _prb.txt files are replaced by _qseq.txt files.
• Filtering results are produced at the Bustard stage.


CASAVA

The CASAVA workflow is illustrated below.

Estimating Build Depth

Estimating the build depth enables you to keep track of your progress with large resequencing projects. To estimate the depth for paired 35-base reads in a short-insert human project:

1. Obtain the yield of purity-filtered (PF) data from the Chip Results Summary or Lane Results Summary in the Summary.htm or run_analyses.html file.

2. For paired-end runs, assume that 80% of the data will align uniquely and 20% will not. Subtract 20% from the yield.

3. CASAVA automatically removes PCR duplicates in DNA sequencing projects. To allow for this, subtract a further 10% from the yield.

4. Divide the remaining yield of PF data by the genome size to estimate the sequence depth.

You may want to consider the following adjustments:

• For resequencing projects on other genomes, adjust the percentage aligned according to the Pipeline results summary.
• If your sample deviates from the reference genome, you will get a lower percentage of aligned reads.
• A larger fragment size may give slightly more duplicates, while a longer read length will give fewer.
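The four steps above amount to a short calculation. The sketch below works one example. The function name and the input figures (30 Gb of PF yield, a 3 Gb genome) are hypothetical round numbers chosen for illustration. It also assumes the 10% duplicate allowance is taken from what remains after the 20% reduction; if both percentages are instead taken from the original yield, the usable fraction is 0.70 rather than 0.72.

```python
def estimate_build_depth(pf_yield_bases: float,
                         genome_size_bases: float,
                         unique_align_fraction: float = 0.80,
                         duplicate_fraction: float = 0.10) -> float:
    """Estimate sequence depth from PF yield (hypothetical helper)."""
    # Step 2: keep only the fraction expected to align uniquely.
    aligned = pf_yield_bases * unique_align_fraction
    # Step 3: allow for PCR duplicates (assumed to apply to the remainder).
    usable = aligned * (1.0 - duplicate_fraction)
    # Step 4: divide by the genome size.
    return usable / genome_size_bases


depth = estimate_build_depth(30e9, 3e9)
print(f"estimated depth: {depth:.1f}x")
```

For genomes other than human, unique_align_fraction can be replaced with the percentage aligned reported in the Pipeline results summary.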


Running CASAVA

Usage

To run CASAVA, enter the following on the command line:

./run.pl [options]

Options

The options that define the CASAVA analysis are listed below (SE = single end (single read), PE = paired end).

-e, --exportDir=DIR (SE, PE; mandatory)

Source directory, also known as the GERALD directory or run directory.

-g, --genomeSize=PATH (SE, PE; mandatory)

Full PATH to the XML file with chromosome/genome sizes.

-l, --lanes=NUMBER_LIST (SE, PE; mandatory)

List of the lanes to use.

-p, --projectDir=DIR (SE, PE; mandatory)

Project directory (where CASAVA keeps all intermediate files).

-r, --runId=STRING (SE, PE; mandatory)

Unique identifier for each run.

-a, --applicationType=TYPE (SE, PE)

Type of analysis [DNA, RNA]; default is DNA.

-b, --buildDir=DIR (SE, PE)

The destination directory of the final build; default is projectDir.

--currentBuildDir=DIR (SE, PE)

The current build directory; default is Parsed_DATE.

-f, --force (SE, PE)

Ignore errors from previous CASAVA execution.

-h, --help (SE, PE)

Prints this information.

--regExpScaffold=REGEXP (SE, PE)

Match all scaffold names to REGEXP and change to EXP. Default is ^(c)(M)T$|^(c)hr(\S+)$|^(c)(\S+)$|^(.+)()$

--replaceScaffold=EXP (SE, PE)

All names matching regExpScaffold will be changed to EXP. Default is $1$2

-rt, --removeTemps=ON/OFF (SE, PE)

Removes all temporary data as soon as possible. Default is OFF.

-t, --targets=LIST (SE, PE)

List of targets to run [export, duplicates, sort, allele, snp]. Default is all targets. The option allows for running only part of CASAVA. Targets: configure, listRuns, removeRun, export, duplicates, sort, alelle, snp, allClean, snpClean, sortClean, duplicatesClean, alelleClean.

-w, --workflow (SE, PE)

Instead of running CASAVA, generates the task definition file tasks-DATE.txt.

--workflowAuto (SE, PE)

Instead of running CASAVA, generates the task definition file and runs it using all processors on the current machine.

--workflowFile=FILE (SE, PE)

Used with -w. Changes the default name of the task definition file from tasks-DATE.txt to FILE.

--verbose=NUMBER (SE, PE)

Sets the verbose level (default is 0, which is the minimum).

--version (SE, PE)

Prints version information.

--spliceJunction=NAME (SE, for RNA Sequencing)

NAME of the splice junction set in the features directory. Default for the human genome is splice_sites-34. The name of this splice file needs to correspond to the name of the squashed fasta file used in eland_rna.

--featureFileName=FILE (SE, for RNA Sequencing)

FILE name of the exon definition file in the features directory. Default for the human genome is exon_coords.txt.

-ref, --refSequences=PATH (SE, PE)

PATH of the reference genome sequences. Default is projectDir/genomes/. The fasta files should not be squashed for CASAVA.

-genes, --genesListPath=PATH (SE, for RNA Sequencing)

PATH of the reference sequences for genes. Default is CASAVA/features/ (not supported).

--snpThreshold=NUMBER (SE, PE)

Sets the SNP caller threshold to NUMBER. This is the minimum allele call score required to call a SNP. For a heterozygous SNP to be called, the score for both alleles must exceed this value. Default is 10.

--snpMaxRatio=NUMBER (SE, PE)

Sets the SNPCaller max ratio to NUMBER. This is used to evaluate possible heterozygous SNPs. This sets the maximum ratio between the first and second allele call scores. Situations where the first allele is much stronger than the second allele should be called as homozygous SNPs, as the second allele may simply be noise. Default is 3.

-rm, --readMode=MODE (SE, PE)

Run read mode for all runs:

• paired for paired end (default).
• single for single end (single read).

--QVCutoff=NUMBER (SE, PE)

Sets the alleleCaller QVCutoff to NUMBER (default 6).

--snpCovCutoff=NUMBER (SE, PE)

Sets the SNPCaller coverage cutoff to NUMBER (default 3: SNPs are called only at the positions where the depth is no greater than three times the chromosomal mean). This prevents SNP calling in regions with extreme depth, such as near the centromere of a human chromosome. --snpCovCutoff=-1 turns off the filter.
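The interaction of --snpThreshold, --snpMaxRatio, and --snpCovCutoff is easier to follow as explicit rules. The sketch below is only one reading of the option descriptions above and is not CASAVA's SNP caller. Both function names are hypothetical, the strict comparison against the threshold is an assumption, and the comparison against the reference base (which decides whether a call is a SNP at all) is not modeled.

```python
def classify_genotype(first_score: float, second_score: float,
                      snp_threshold: float = 10.0,
                      snp_max_ratio: float = 3.0) -> str:
    """Classify a position from its two strongest allele call scores."""
    if first_score < second_score:
        raise ValueError("first_score must be the stronger allele")
    if first_score <= snp_threshold:
        return "no call"  # assumption: a score must exceed the threshold
    # Heterozygous only if both alleles exceed the threshold and the
    # stronger allele is at most snp_max_ratio times the weaker one.
    if (second_score > snp_threshold
            and first_score <= snp_max_ratio * second_score):
        return "heterozygous"
    return "homozygous"


def passes_coverage_filter(depth: float, chromosome_mean_depth: float,
                           snp_cov_cutoff: float = 3.0) -> bool:
    """Return True if a position is shallow enough for SNP calling."""
    if snp_cov_cutoff == -1:
        return True  # -1 turns the filter off
    return depth <= snp_cov_cutoff * chromosome_mean_depth


print(classify_genotype(40, 20))
print(classify_genotype(90, 20))
print(passes_coverage_filter(100, 30))
```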


Running Specific Use Cases

Large Genomes

For bigger builds, configure the build before running CASAVA.

Configuring One Run

1. Check the available disk space.

2. Go to /opt/GOAT/CASAVA_1_0.

3. Run (change run parameters to match your situation):

run.pl --projectDir=/Human/ --runId=STRING_R1 --exportDir=/Human/STRING_GERALD/ --lanes=1,2,3,4,6,7,8 --genomeSize=./conf/human_genomes_size.xml --refSequences=/data/Genome/CASAVA_1_0/hg18/ --target=configure

Configuring Multiple Runs

There are two methods to add multiple runs:

Command line method

Run the command as in Configuring One Run, but list all runs, export directories, and lanes, as follows:

run.pl -p projectDir -t configure -r RunId1 -e exportDir1 -l laneList1 -r RunId2 -e exportDir2 -l laneList2 ... -r RunIdn -e exportDirn -l laneListn

XML file editing method

a. Go to /Human/conf/run.conf.xml.

b. Add appropriate entries to the run.conf.xml file.

c. Run the configuration again by executing:

run.pl -p /Human/ -t configure
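For the command line method, the repeated -r, -e, and -l groups are easy to get out of step when typed by hand. The sketch below assembles them from a list of runs. The helper configure_command and the example run names and paths are hypothetical; only the option letters come from this booklet.

```python
from typing import List, Sequence, Tuple

Run = Tuple[str, str, str]  # (run id, export directory, lane list)


def configure_command(project_dir: str, runs: Sequence[Run]) -> List[str]:
    """Build a 'run.pl ... -t configure' argument list for several runs."""
    cmd = ["run.pl", "-p", project_dir, "-t", "configure"]
    for run_id, export_dir, lanes in runs:
        # Each run contributes one -r/-e/-l group, in that order.
        cmd += ["-r", run_id, "-e", export_dir, "-l", lanes]
    return cmd


cmd = configure_command("/Human/", [
    ("RunId1", "/Human/GERALD_1/", "1,2,3"),
    ("RunId2", "/Human/GERALD_2/", "4"),
])
print(" ".join(cmd))
```

With an empty run list the helper produces the same command as step c of the XML file editing method.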

Browse Configuration

You should be able to see the build web page in projectDir/html. Open home.html and look for all runs in the list, or execute:

run.pl -p /Human/ -t listRuns

Running CASAVA on SGE

1. Run the following command:

run.pl -p /Human/ --workflow

After up to 10 minutes, this produces output similar to the following:

perl /opt/GOAT/CASAVA_1_0/TaskManager/runTasks.pl -t /Human/tasks.DATE.txt -h hostname.domainname -p 8001 --sge=X -s

2. Copy the result.


3. Add “--queue=custom.q” to specify the queue name of the cluster. This defaults to isilon.q.

4. Change --sge=X to --sge=100 to set the number of slots (a maximum of 100 CPUs will be used).

5. Run the command:

perl /opt/GOAT/CASAVA_1_0/TaskManager/runTasks.pl -t /Human/tasks.DATE.txt -h hostname.domainname -p 8001 --sge=100 -s

After a few minutes, you should be able to see subcommands being executed.

6. You can monitor progress and errors on SGE by typing:

/opt/GOAT/CASAVA_1_0/TaskManager/monitor.pl -t tasks.DATE.txt
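Steps 2 through 4 above (copy the printed command, set the slot count, optionally name a queue) can be scripted. The following sketch edits the printed command as a token list. The helper name customize_sge_command is hypothetical, and placing --queue directly after --sge is an assumption made so that the trailing -s flag is left untouched.

```python
import shlex
from typing import List


def customize_sge_command(printed_command: str, slots: int,
                          queue: str = "") -> List[str]:
    """Replace the --sge=X placeholder and optionally add --queue."""
    args = shlex.split(printed_command)
    if "--sge=X" not in args:
        raise ValueError("expected a --sge=X placeholder in the command")
    out: List[str] = []
    for arg in args:
        if arg == "--sge=X":
            out.append(f"--sge={slots}")
            if queue:
                # Assumption: option order does not matter to runTasks.pl.
                out.append(f"--queue={queue}")
        else:
            out.append(arg)
    return out


printed = ("perl /opt/GOAT/CASAVA_1_0/TaskManager/runTasks.pl "
           "-t /Human/tasks.DATE.txt -h hostname.domainname -p 8001 --sge=X -s")
sge_cmd = customize_sge_command(printed, 100, queue="custom.q")
print(" ".join(sge_cmd))
```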

Small Genomes

Running CASAVA

1. Check the available disk space.

2. Go to /opt/GOAT/CASAVA_1_0.

3. Run (change run parameters to match your situation):

run.pl --projectDir=/E_coli/ --runId=Ecoli --exportDir=/Ecoli/STRING_GERALD/ --lanes=4 --genomeSize=./conf/E_coli_gs.xml --refSequences=/GENOMES/EcoliFasta/

The command will make a CASAVA build using STRING_GERALD and the fourth lane in projectDir.

Browse Configuration

You should be able to see the build web page in projectDir/html. Open home.html and look for all runs in the list, or execute:

run.pl -p /Ecoli/ -t listRuns

RNA Sequencing

Running CASAVA

1. Check the available disk space.

2. Go to /opt/GOAT/CASAVA_1_0/.

3. Run runRNA.pl (change the run parameters to match your situation):

runRNA.pl --projectDir=/HUMAN_BRAIN/ --runId=HUMAN_BRAIN --exportDir=/HUMAN_BRAIN/STRING_GERALD/ --lanes=2,4,7 --genomeSize=./conf/human_genomes_size.xml --refSequences=/data/Genome/CASAVA_1_0/hg18/ --spliceJunction=splice_sites-34


The command will make a CASAVA build using STRING_GERALD and the second, fourth, and seventh lanes in projectDir.

Browse Configuration

You should be able to see the build web page in projectDir/html. Open home.html and look for all runs in the list.


CASAVA Output Files

The CASAVA output files contain run information, statistical analysis, sequence tags, SNP information, and (for RNA Sequencing) gene counts, exon counts, and splice junction counts.

Build Directory

An outline of the CASAVA build directory is shown below.

Build Web Page

The build web page is located in projectDir/html. When you open the file home.html, you will find a list of all runs and a link to statistics.


CASAVA Build

The CASAVA build, containing sequence, SNP, and (for RNA Sequencing) counts information, is located in the projectDir/Parsed_xx_xx_xx folder.


Illumina, Inc.
9885 Towne Centre Drive
San Diego, CA 92121-1975
+1.800.809.ILMN (4566)
+1.858.202.4566 (outside North America)
[email protected]