SeqMan NGen User's Manual

SeqMan NGen User's Manual For Windows® and Macintosh® Version: 1.2 DNASTAR, Inc. 2008


Contents

Copyright © 2007-2008 by DNASTAR, Inc.
Introduction
Technical Requirements
    Technical Requirements for Windows®
    Technical Requirements for Macintosh®
Installing SeqMan NGen
    Installing on Windows®
    Installing on Macintosh®
How to Use SeqMan NGen
    Practice Data and Script for Windows®
    Practice Data and Script for Macintosh®
Writing Scripts for SeqMan NGen
    Conventions Used in this Manual
    Differences in scripts for Macintosh® and Windows®
Commands
    Project Management
        newProject
        saveProject
        saveReport
        writeUnassembledSeqs
        closeProject
        openProject
        quit
    File Loading
        setDefaultDirectory
        loadSeq
        loadTemplate
        loadVector
        load454DualEnd
        loadRepeat
        loadContaminant
        loadConstraint
    Parameter Settings
        setParam
        setVectorParam
        setQualityParam
        setRepeatParam
        setContaminantParam
    Preprocessing and Assembling
        assemble
        realignContigs
        splitTemplates
        removeSmallContigs
        setPairSpecifier
Annotating Template Sequence Prior to Assembly
Viewing Assembly Results in SeqMan Pro
    Viewing Areas Exceeding the Maximum Depth of Coverage
Components of SeqMan NGen
    Mer Tags
    Match Percentage
    Repeat Handling
Assembling Dual-End Data
Appendix I
    Alphabetical List of Commands
Appendix II
Appendix III
    Example Regular Expressions in PERL
        Special characters
        Numerical modifiers
        Example expressions and their meanings
Index


Copyright © 2007-2008 by DNASTAR, Inc.

All rights reserved. Reproduction, adaptation, or translation without prior written permission is prohibited, except as allowed under the copyright laws or with the permission of DNASTAR, Inc.

4th Edition, September 2008. SeqMan NGen Version 1.2. Printed in Madison, Wisconsin, USA.

Trademark Information: DNASTAR, Lasergene, SeqMan NGen, SeqMan Pro, and SeqBuilder, are trademarks or registered trademarks of DNASTAR, Inc. Windows® is a registered trademark of Microsoft Corp. Macintosh® is a registered trademark of Apple Computers, Inc. Illumina® is a registered trademark of Illumina, Inc. 454® is a registered trademark of the 454 Life Sciences™ Corporation. All other copyrights, trademarks, and registered trademarks are the property of their respective owners.

DNASTAR, Inc. does not encourage the infringement of any patents owned by any other party.

Disclaimer & Liability: DNASTAR, Inc. makes no warranties, expressed or implied, including without limitation the implied warranties of merchantability and fitness for a particular purpose, regarding the software. DNASTAR does not warrant, guaranty, or make any representation regarding the use or the results of the use of the software in terms of correctness, accuracy, reliability, up-to-date status, or otherwise. The entire risk as to the results and performance of the software is assumed by you. The exclusion of implied warranties is not permitted by some states. The above exclusion may not apply to you.

In no event will DNASTAR, Inc. and their directors, officers, employees, or agents (collectively DNASTAR) be liable to you for any consequential, incidental or indirect damages (including damages for loss of business profits, business interruption, loss of business information and the like) arising out of the use of, or the inability to use the software even if DNASTAR, Inc. has been advised of the possibility of such damages. Because some states do not allow the exclusion or limitation of liability for consequential or incidental damages, the above limitations may not apply to you.

DNASTAR, Inc. reserves the right to revise this publication and to make changes to it from time to time without obligation of DNASTAR, Inc. to notify any person or organization of such revision or changes. The screen and other illustrations in this publication are meant to be representative of those that appear on your monitor or printer.


Introduction

SeqMan NGen is a command-line application that uses a unique algorithm to assemble fragment data sequenced using Illumina®, 454®, and Sanger technology. SeqMan NGen offers complete flexibility in adjusting assembly parameters to meet the needs of your specific data set. As the user, you may also decide on a number of preprocessing options, including vector and end trimming, marking known repeats, and excluding contaminant sequences, such as primer reads, from your assembly.

Data sets can be assembled de novo or with a template sequence. Multiple templates can also be used if desired. Annotating your template sequences in SeqBuilder for known SNPs, CDSs, and other features prior to assembly will enhance the analysis of identified putative SNPs in SeqMan Pro. Following a templated assembly, any remaining unassembled sequences can be assembled into contigs, if desired. We recommend using dual-end data, if available, when assembling a data set de novo.

SeqMan NGen offers three output options for saving your finished assembly: as a SeqMan Pro project file, as a Phrap assembly, or in FASTA format. Following assembly in SeqMan NGen, your saved project can be viewed in SeqMan Pro for analysis, including identifying SNPs and analyzing coverage. SeqMan NGen can also export a report summarizing your assembly statistics, including the number of assembled/unassembled (matched/unmatched) sequences and contigs in your project, the parameters used, the average quality scores, and the number of sequences excluded from the assembly due to exceeding the maximum coverage parameter (maxAssemblyCoverage).

Although the upper limit for project size depends on many factors, including the amount of your computer’s RAM and processor speed, a computer meeting our recommended technical requirements should be able to use SeqMan NGen to easily assemble 1-2 lanes of Illumina® data (about 8-10 million reads), one million 454® reads, or 200,000 Sanger reads.


The following is a general workflow for using SeqMan NGen:

1. Annotate template sequence(s) in SeqBuilder (optional)
2. Write script for assembly
3. Run the script in SeqMan NGen to assemble sequencing data
4. View assembly statistics
5. Analyze assembly in SeqMan Pro

Technical Requirements

Technical Requirements for Windows®

SeqMan NGen on Windows requires at minimum:

• Windows XP x64 or Vista™ x64
• 1.5 GB RAM
• 3 GB available hard disk space
• 1.6 GHz processor speed

RAM requirements vary depending on the volume of data being analyzed. Running large assemblies and opening the resulting project file uses both your computer’s RAM and its hard disk space. In order to optimize performance for large assemblies, especially datasets generated using Illumina® and 454® technologies, we recommend:

• Windows XP x64 or Vista™ x64
• 8 GB RAM (Additional RAM may be useful to speed up large assemblies where much data divergence is expected.)
• 10 GB available hard disk space
• 2.19 GHz processor speed

Note: SeqMan Pro can open project files up to 4 GB in size.

Technical Requirements for Macintosh®

SeqMan NGen on Macintosh requires at minimum:

• Mac OS X 10.4 or 10.5
• Intel® 64-bit processor
• 1.5 GB RAM
• 3 GB available hard disk space
• 1.6 GHz processor speed

RAM requirements vary depending on the volume of data being analyzed. Running large assemblies and opening the resulting project file uses both your computer’s RAM and its hard disk space. In order to optimize performance for large assemblies, especially datasets generated using Illumina® and 454® technologies, we recommend:

• Mac OS X 10.4* or 10.5
• Intel® 64-bit processor
• 8 GB RAM (Additional RAM may be useful to speed up large assemblies where much data divergence is expected.)
• 10 GB available hard disk space
• 2.66 GHz processor speed

Note: SeqMan Pro can open project files up to 4 GB in size.

*For large de novo Illumina® assemblies, Mac OS X 10.5 is recommended.

Installing SeqMan NGen

Installing on Windows®

Before installing, consult our Technical Requirements.

To install SeqMan NGen on Windows, launch SeqMan NGen 1.2 Install.exe and then follow the on-screen instructions. After installation, you will be prompted to restart your computer. By default, SNG.exe will be installed to the following directory:

C:\Program Files (x86)\DNASTAR\SeqMan NGen 1

Installing on Macintosh®

Before installing, consult our Technical Requirements.

To install SeqMan NGen on Macintosh, launch SeqMan NGen 1.2.pkg and then follow the on-screen instructions. By default, SNG will be installed to:

/usr/bin

Note: /usr/bin is a standard install location for executable files. The /usr directory is hidden. To view it, run the command open /usr/bin from Terminal.


How to Use SeqMan NGen

SeqMan NGen is an application that is run via the command line.

On Windows:

To launch Command Prompt, click Start, click Run, type cmd, and then click OK. Once Command Prompt is launched:

1. Type SNG at the command prompt.
2. Enter a space and then drag and drop the script you want to run into the command window. Alternatively, you can enter the directory and file name of the script you want to run, enclosed in quotes.

   Note: Windows Vista does not support drag and drop for Command Prompt windows.

3. Finally, make sure Command Prompt is the active window, and then press Enter to execute the script.

Example:

C:\>SNG "C:\Sequencing Data\ABC_Sample.script"

On Macintosh:

Launch Terminal from Applications:Utilities. Once Terminal is launched:

1. Type SNG at the command prompt.
2. Enter a space and then drag and drop the script you want to run into the Terminal window. Alternatively, you can enter the directory and file name of the script you want to run, enclosed in quotes if the path contains spaces.
3. Finally, make sure Terminal is the active window, and then press Enter to execute the script.

Example:

myMac:~ macuser$ SNG "/library/Sequencing Data/ABC_Sample.script"


When your assembly has been successfully completed, the message “Script Complete” will appear in the Command Prompt/Terminal window. Your saved project file is now ready to be viewed in SeqMan Pro.
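For unattended runs, the launch step described above could be wrapped in a small helper. The following is a minimal Python sketch and is not part of SeqMan NGen: the function names are hypothetical, and it assumes that SNG is on the system path and writes the “Script Complete” message to standard output, which this manual does not confirm.

```python
import subprocess


def build_sng_command(script_path, executable="SNG"):
    """Return the argument list that runs one script with SNG."""
    # Passing the path as its own list item avoids quoting problems with spaces.
    return [executable, str(script_path)]


def script_completed(output):
    """Return True if the captured output contains the completion message."""
    return "Script Complete" in output


def run_script(script_path, executable="SNG"):
    """Run SNG on a script and report whether the completion message appeared."""
    # Assumption: the completion message is written to standard output.
    result = subprocess.run(
        build_sng_command(script_path, executable),
        capture_output=True,
        text=True,
    )
    return script_completed(result.stdout)
```

Calling run_script with the path of a saved script would start the assembly and return True only if the completion message was seen.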

Practice Data and Script for Windows®

A small set of practice data and a corresponding script are included with your SeqMan NGen User’s Manual. To gain familiarity with how to execute a script, please follow the steps below.

1. Copy the Win_Examples folder (distributed with your SeqMan NGen User’s Manual), open My Computer, and then paste the folder at the root of your C:\ drive (i.e., not within any subfolder).
2. Open Command Prompt by clicking the Start button in the bottom left of your screen and choosing Run.
3. Type cmd and click OK. The Command Prompt window appears.
4. At the prompt, type SNG followed by a space.
5. Drag and drop the Win_Practice_Script.script file from the Win_Examples folder into the Command Prompt window. Or, type "C:\Win_Examples\Win_Practice_Script.script" at the prompt.
6. Make sure that Command Prompt is the active window, and then press Enter. The assembly will run, and when it is finished, a “Script Complete” message will appear at the bottom of the Command Prompt window.
7. Go back to the C:\Win_Examples\ directory. You should now see two additional files: Practice Project.sqd, a SeqMan Pro project file, and Practice assembly report.txt, a report summarizing the assembly statistics.
8. Open Practice assembly report.txt to view the assembly statistics.
9. Launch SeqMan Pro, located by default in: C:\Program Files (x86)\DNASTAR\Lasergene 8\SeqMan.exe
10. Go to File>Open and select Practice Project.sqd from C:\Win_Examples\. The assembled project will open and is ready for analysis.

Practice Data and Script for Macintosh®

A small set of practice data and a corresponding script are included with your SeqMan NGen User’s Manual. To gain familiarity with how to execute a script, please follow the steps below.

1. Copy the Mac_Examples folder (distributed with your SeqMan NGen User’s Manual), and paste it onto your Desktop.
2. Launch Terminal from Applications:Utilities.
3. At the prompt, type SNG followed by a space.
4. Drag and drop the Mac_Practice_Script.script file from the Mac_Examples folder into the Terminal window.
5. Make sure that Terminal is the active window, and then press Enter. The assembly will run, and when it is finished, a “Script Complete” message will appear at the bottom of the Terminal window.
6. Go back to the Desktop/Mac_Examples directory. You should now see two additional files: Practice Project.sqd, a SeqMan Pro project file, and Practice assembly report.txt, a report summarizing the assembly statistics.
7. Open Practice assembly report.txt to view the assembly statistics.
8. Launch SeqMan Pro, located by default in: Applications/DNASTAR/Lasergene 8/SeqMan
9. Go to File>Open and select Practice Project.sqd from Desktop/Mac_Examples. The assembled project will open and is ready for analysis.

Writing Scripts for SeqMan NGen

To create a script for use in SeqMan NGen, type a list of commands in the SeqMan NGen scripting language using a text editor, and then save the list as a text file with a *.script extension. Scripting commands for SeqMan NGen should be written in the following format:

commandname parameter:value

Commands that have multiple parameters should be written with all of the parameters on one line, each separated by a space:

commandname parameter1:value parameter2:value parameter3:value

The exceptions to this rule are the Parameter Settings commands, such as setParam and setVectorParam, which can be repeated on a new line for each parameter:

commandname parameter1:value
commandname parameter2:value
commandname parameter3:value

For example:

setParam matchSize:15
setParam minMatchPercent:90
setParam maxGap:15

Parameters that have more than one value should be written with the group of values enclosed in { } brackets, and each value separated by a space:

commandname parameter:{value1 value2 value3 value4}

Currently, only one command in the SeqMan NGen scripting language, setPairSpecifier, accepts more than one value for a parameter.

To add a comment to a script, precede the comment with a semicolon:

;sample comment

To prevent a command from being executed without deleting it from the script, type a semicolon immediately before the command:

;commandname parameter:value

All parameters that require a file name or directory value must be written using English language characters only and have the file name/directory enclosed in quotes, as shown in the following example:

On Windows:

saveProject file: "C:\Assembled Projects\abc_project.sqd"

On Macintosh:

saveProject file: "/Library/Assembled Projects/abc_project.sqd"
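When scripts are generated by another program rather than typed by hand, the formatting rules in this section can be captured in a few lines. The following is a minimal Python sketch, not part of SeqMan NGen: the helper names and the PATH_PARAMS set are hypothetical, and the set only lists path parameters that appear in this manual's examples.

```python
# Parameters whose values are file names or directories and must be quoted.
# Assumption: this list is illustrative, not a complete list of path parameters.
PATH_PARAMS = {"file", "defaultWinDirectory", "defaultMacDirectory"}


def format_param(name, value):
    """Format one parameter as it would appear on a script line."""
    if name in PATH_PARAMS:
        text = str(value)
        # The manual requires English language characters only in paths.
        if not text.isascii():
            raise ValueError("paths must use English language characters only")
        return f'{name}: "{text}"'
    if isinstance(value, bool):
        value = "true" if value else "false"
    elif isinstance(value, (list, tuple)):
        # Multiple values are grouped in { } and separated by spaces.
        value = "{" + " ".join(str(v) for v in value) + "}"
    return f"{name}:{value}"


def command(name, **params):
    """Build a single script line: the command, then its parameters."""
    return " ".join([name] + [format_param(k, v) for k, v in params.items()])


def comment(text):
    """Turn text (or a whole command line) into a script comment."""
    return ";" + text
```

For instance, command("setParam", matchSize=15) yields the line setParam matchSize:15, and wrapping any generated line in comment() disables it without removing it from the script.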


Many parameters have default values that are used when the parameter is not specified. The chart shown in Appendix I lists all the default values used by SeqMan NGen. If you want to use the default value for a parameter, it is not necessary to write that parameter in your script. For example, the following two sets of scripts (Win/Mac) will use all of the same parameters when run in SeqMan NGen. Sample_1 shows the script written with the fewest commands required for the assembly. Sample_2 shows the same script written with all of the default parameters.

Win_Sample_1

setDefaultDirectory defaultWinDirectory: "C:\Assemblies\ABC_proj\"
loadTemplate file: "TemplateSeq.seq"
loadSeq file: "ABC_Sequences.fas"
loadContaminant file: "PrimerFragments.fas"
assemble
saveProject file: "ABC_assembly.sqd"
saveReport file: "ABC_report.txt"
closeProject

Win_Sample_2

setDefaultDirectory defaultWinDirectory: "C:\Assemblies\ABC_proj\"
loadTemplate file: "TemplateSeq.seq"
loadSeq file: "ABC_Sequences.fas"
loadContaminant file: "PrimerFragments.fas"
setParam useRepeatHandling:true
setParam coverageType:fixed
setParam fixedCoverage:6
setParam matchSize:15
setParam minMatchPercent:90
setParam matchSpacing:10
setParam matchRepeatPercent:150
setParam maxUsableCount:25
setParam maxGap:15
setParam matchWindowLength:50
setParam matchScore:10
setParam maxAssemblyCoverage:500
setParam gapPenalty:30
setParam mismatchPenalty:20
setParam min454SeqLen:50
setParam max454SeqLen:350
setParam defaultQuality:15
setParam templateDefaultQuality:500
setParam splitFalseJoins:True
setParam falseJoinMinColDepth:4
setParam falseJoinMinInconsistent:4
setParam falseJoinMinFraction:25
setParam falseJoinMinMatches:2
setParam falseJoinUniformQual:true
setParam falseJoinQualThresh:15
setParam allowConstraintBased:true
setParam skipRealign:false
setParam useSeqMan7Format:false
setParam splitTemplateContigs:false
setParam assembleBoneyard:false
setParam minContigSeqs:0
setParam snpPasses:2
setParam snpMatchPercentage:90
assemble trimEnds:false vectScan:false repeatScan:false contamScan:false doAssemble:true
saveProject file: "ABC_assembly.sqd" saveUnassembled:false format:seqman
saveReport file: "ABC_report.txt"
closeProject

Mac_Sample_1

setDefaultDirectory defaultMacDirectory: "/Library/Assemblies/ABC proj/"
loadTemplate file: "TemplateSeq.seq"
loadSeq file: "ABC_Sequences.fas"
loadContaminant file: "PrimerFragments.fas"
assemble
saveProject file: "ABC_assembly.sqd"
saveReport file: "ABC_report.txt"
closeProject

Mac_Sample_2

setDefaultDirectory defaultMacDirectory: "/Library/Assemblies/ABC proj/"
loadTemplate file: "TemplateSeq.seq"
loadSeq file: "ABC_Sequences.fas"
loadContaminant file: "PrimerFragments.fas"
setParam useRepeatHandling:true
setParam coverageType:fixed
setParam fixedCoverage:6
setParam matchSize:15
setParam minMatchPercent:90
setParam matchSpacing:10
setParam matchRepeatPercent:150
setParam maxUsableCount:25
setParam maxGap:15
setParam matchWindowLength:50
setParam matchScore:10
setParam maxAssemblyCoverage:500
setParam gapPenalty:30
setParam mismatchPenalty:20
setParam min454SeqLen:50
setParam max454SeqLen:350
setParam defaultQuality:15
setParam templateDefaultQuality:500
setParam splitFalseJoins:True
setParam falseJoinMinColDepth:4
setParam falseJoinMinInconsistent:4
setParam falseJoinMinFraction:25
setParam falseJoinMinMatches:2
setParam falseJoinUniformQual:true
setParam falseJoinQualThresh:15
setParam allowConstraintBased:true
setParam skipRealign:false
setParam useSeqMan7Format:false
setParam splitTemplateContigs:false
setParam assembleBoneyard:false
setParam minContigSeqs:0
setParam snpPasses:2
setParam snpMatchPercentage:90
assemble trimEnds:false vectScan:false repeatScan:false contamScan:false doAssemble:true
saveProject file: "ABC_assembly.sqd" saveUnassembled:false format:seqman
saveReport file: "ABC_report.txt"
closeProject

Note: Both of these sets of scripts (Mac and Win) are distributed with your SeqMan NGen User’s Manual. Sample_2.script is a good starting place for creating your own scripts. Make sure to replace the file names and directories with your own before using the script. If you find it necessary to adjust any of the default parameters, simply edit the values and re-save the script.
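Because defaults never need to be written out, a script generator only has to emit the setParam lines that differ from them. The following is a minimal Python sketch of that idea, not part of SeqMan NGen: the function name is hypothetical, and the DEFAULTS dictionary copies only a handful of the values shown in Sample_2 rather than the full chart in Appendix I.

```python
# A few default values copied from the Sample_2 scripts above.
# Assumption: this subset is for illustration; Appendix I holds the full list.
DEFAULTS = {
    "matchSize": 15,
    "minMatchPercent": 90,
    "maxGap": 15,
    "maxAssemblyCoverage": 500,
}


def set_param_lines(settings):
    """Return setParam lines only for settings that differ from the defaults."""
    lines = []
    for name, value in settings.items():
        # Parameters missing from DEFAULTS are always written out.
        if name in DEFAULTS and DEFAULTS[name] == value:
            continue
        lines.append(f"setParam {name}:{value}")
    return lines
```

With this sketch, asking for matchSize 15 and minMatchPercent 95 produces a single line for minMatchPercent, since matchSize is already at its default.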

Conventions Used in this Manual

As mentioned in the previous section, commands and their parameters should be written on one line, with each parameter:value pair separated from the next by a space:

commandname parameter1:value parameter2:value parameter3:value

However, for ease of reading, the examples in this manual will be shown with parameters indented on separate lines under each command, as shown below:

commandname
    parameter1:value
    parameter2:value
    parameter3:value

Also, the commands and parameters written in this manual have been written using a mixture of upper and lowercase letters to make them easier to read. However, commands in scripts are case-insensitive and can be written in either case.

Differences in scripts for Macintosh® and Windows®

SeqMan NGen scripts can easily be transferred between Macintosh and Windows platforms. However, it is important to know the two main differences between writing SeqMan NGen scripts to be run on Macintosh and Windows computers:

1. The parameter for the setDefaultDirectory command must be set to the appropriate platform: defaultWinDirectory or defaultMacDirectory.
2. On Macintosh, all directories are written with forward slashes “/” separating each folder. On Windows, all directories must be written with backslashes “\”. For example:

   On Macintosh: "/Library/Assemblies/abc_project.sqd"
   On Windows: "C:\Assemblies\abc_project.sqd"
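The two platform differences can be checked mechanically before a script is moved from one platform to the other. The following is a minimal Python sketch, not part of SeqMan NGen: the function name and the "win"/"mac" labels are hypothetical, and the separator check simply rejects the other platform's slash.

```python
def default_directory_command(directory, platform):
    """Return a setDefaultDirectory line for the 'win' or 'mac' platform."""
    if platform == "win":
        name, wrong_separator = "defaultWinDirectory", "/"
    elif platform == "mac":
        name, wrong_separator = "defaultMacDirectory", "\\"
    else:
        raise ValueError("platform must be 'win' or 'mac'")
    # Reject paths written with the other platform's separator.
    if wrong_separator in directory:
        raise ValueError(f"{directory!r} uses the wrong separator for {platform}")
    return f'setDefaultDirectory {name}: "{directory}"'
```

Generating the first line of a script through a check like this catches a Windows path pasted into a Macintosh script before SeqMan NGen is ever run.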

Commands

This section lists all of the available commands within the SeqMan NGen scripting language, organized into the following functional groups:

• Project Management
• File Loading
• Parameter Settings
• Preprocessing and Assembling

Note: An alphabetical list of commands and their parameters is provided in Appendix I for ease of reference once you are familiar with the function of each command.

Project Management

newProject

This is an optional command to show that a new project is being created.

Parameters

None

saveProject

This command saves your assembly to a project file. By default, the SeqMan Pro project file format (*.sqd) is used. Phrap (*.ace) and FASTA (*.fas) formats may also be specified.

Note: As a command-line tool, SeqMan NGen will not prompt you if you try to save a new project file with the same name as an existing file in the same location. When you run a script multiple times, be sure to change the file name of the project to be saved each time to prevent existing project files from being overwritten.

Parameters

file
    This required parameter specifies the directory and file name of the project file to be saved. The directory and file name must be enclosed in quotes. See example below.

saveUnassembled
    This optional parameter may have a value of true or false, and is available for projects saved in the SeqMan Pro format (*.sqd) only. If true, the unassembled sequences in your project will be saved, and displayed in the Unassembled Sequences window when you open your project in SeqMan Pro. If false, the sequences that are not assembled will not be saved within your project file. Default false.

    Note:
    • This parameter is currently not available when assembling Illumina® data.
    • Saving the unassembled sequences within your project will increase its size.

format
    This optional parameter specifies the output file format. The following three values are allowed: SeqMan, Phrap, or Fasta. SeqMan will save a SeqMan Pro .sqd file, Phrap will save an .ace file, and Fasta will save a .fas file and a .qual file of the consensus sequence for each contig. Default SeqMan.

Example (Win)

saveProject
    file: "C:\My projects\ABC_project.sqd"
    saveUnassembled:false
    format:seqman

Example (Mac)

saveProject
    file: "/Library/My projects/ABC_project.sqd"
    saveUnassembled:false
    format:seqman
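Since saveProject overwrites an existing file of the same name without prompting, a script generator can pick an unused file name before writing the saveProject line. The following is a minimal Python sketch, not part of SeqMan NGen: the function name and the numeric suffix scheme are hypothetical choices.

```python
import os


def unused_path(path):
    """Return path itself if it is free, else the first free numbered variant."""
    if not os.path.exists(path):
        return path
    stem, extension = os.path.splitext(path)
    counter = 1
    # Try ABC_project_1.sqd, ABC_project_2.sqd, ... until one is free.
    while os.path.exists(f"{stem}_{counter}{extension}"):
        counter += 1
    return f"{stem}_{counter}{extension}"
```

The returned path can then be placed in the file parameter of saveProject so that repeated runs of the same script leave earlier project files intact.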

saveReport

This optional command exports a report as a text file that summarizes the statistics of your assembly, including the parameters used, the number of assembled/unassembled sequences and contigs, average quality scores, and the number of sequences excluded from the assembly due to exceeding the maxAssemblyCoverage parameter.

Note: The same information contained within this report is also saved within your SeqMan Pro project file (*.sqd) regardless of whether you choose to export the report using this command. The report can be viewed in SeqMan Pro by going to Project>Report.

Parameters

file
    This required parameter specifies the directory and file name of the report to be saved. The directory and file name must be enclosed in quotes. See example below.

Example (Win)

saveReport
    file: "C:\abc_Project\abc_report.txt"

Example (Mac)

saveReport
    file: "/Library/abc_Project/abc_report.txt"

writeUnassembledSeqs

This optional command saves all sequences that were not assembled in the project as *.fas and *.qual files.

Parameters

file
    This required parameter specifies the directory and file name of the unassembled sequences to be saved.

saveTrimmed
    Value may be true or false. When true, only the trimmed portion of the unassembled sequences will be saved. Default false.

closeProject

This optional command closes the current project and frees the memory in use so that your system is ready for additional assemblies. This can be useful if you want to run multiple assemblies in one script.

Parameters

None
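To illustrate running multiple assemblies in one script, the lines for each assembly can be generated in a loop, with closeProject ending each one. The following is a minimal Python sketch, not part of SeqMan NGen: the function name and the file names in the usage note are hypothetical, and it assumes a de novo assembly with default parameters needs only loadSeq, assemble, and saveProject.

```python
def batch_script(jobs):
    """Build one script that runs several assemblies back to back.

    jobs is a list of (sequence_file, project_file) pairs.
    """
    lines = []
    for sequence_file, project_file in jobs:
        lines.append(f'loadSeq file: "{sequence_file}"')
        lines.append("assemble")
        lines.append(f'saveProject file: "{project_file}"')
        # Free memory before the next assembly starts.
        lines.append("closeProject")
    return "\n".join(lines) + "\n"
```

For example, batch_script([("A_reads.fas", "A.sqd"), ("B_reads.fas", "B.sqd")]) returns eight script lines, four per assembly, which can be saved with a *.script extension.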

openProject

This optional command loads a SeqMan Pro (*.sqd) or Phrap (*.ace) project file.

Parameters

file
    This required parameter specifies the directory and file name of the project file to be opened. The directory and file name must be enclosed in quotes. See example below.

Example (Win)

openProject
    file: "C:\My Projects\ABC_assembly.sqd"

Example (Mac)

openProject
    file: "/Library/My Projects/ABC_assembly.sqd"

quit

This optional command closes the current project and exits the SeqMan NGen program.

Parameters

None

File Loading

setDefaultDirectory This optional command defines the default directory for your project. When a default directory is specified, files located in that directory only need to be identified by their subfolder and/or file name in subsequent commands. Parameters

defaultWinDirectory This required parameter specifies the directory to be used as your default on a Windows computer. The directory must be enclosed in quotes. See example below.

defaultMacDirectory This required parameter specifies the directory to be used as your default on a Macintosh computer. The directory must be enclosed in quotes. See example below.

Example (Win)

setDefaultDirectory

defaultWinDirectory: "C:\ABC_proj\"

Example (Mac)

setDefaultDirectory

defaultMacDirectory: "/Library/ABC_proj/"


loadSeq This command loads a sequence file or files for assembly. See Appendix II for a list of supported file types.

Parameters

file This required parameter specifies the directory and file name of the sequence file(s) to be loaded. The directory and file name must be enclosed in quotes. (See example 1 below). A folder may also be specified, in which case all of the sequence files within that folder will be loaded. If a folder is specified, a final backslash “\” (or forward slash “/” on Mac) is necessary after the folder name. (See example 2).

Example 1 (Win)

loadSeq

file: "C:\ABC_project\ABC_sequences.fas"

Example 2 (Win)

loadSeq

file: "C:\ABC_project\"

Example 1 (Mac)

loadSeq

file: "/Library/ABC_project/ABC_sequences.fas"

Example 2 (Mac)

loadSeq

file: "/Library/ABC_project/"

loadTemplate This command loads a sequence file to be used as the template to which all other sequences are assembled. A template is required when assembling Illumina® data. Your template sequence will be displayed as a “reference” sequence in SeqMan Pro for SNP analysis. See Appendix II for supported file types.

Parameters


file This required parameter specifies the directory and file name of the template sequence file to be loaded. A folder may also be specified, in which case all of the sequence files within that folder will be loaded and treated as template sequences. If a folder is specified, a final backslash “\” (or forward slash “/” on Mac) is necessary after the folder name. The directory and file name must be enclosed in quotes. See example below.

Example (Win)

loadTemplate

file: "C:\abc_Project\abc_template.seq"

Example (Mac)

loadTemplate

file: "/Library/abc_Project/abc_template.seq"

loadVector This command loads a vector sequence file to be used for vector trimming. See Appendix II for supported file types.

Parameters

file This required parameter specifies the directory and file name of the vector sequence file to be used for vector trimming. The directory and file name must be enclosed in quotes. See example below.

cloneSite This parameter specifies the position of the cloning site on the vector where insertion occurs.

Example (Win)

loadVector

file: "C:\vectors\123_vector.seq"
cloneSite:826

Example (Mac)

loadVector

file: "/Library/vectors/123_vector.seq"
cloneSite:826

load454DualEnd This command loads a file of 454 sequences and checks for the presence of a linker defining the dual end sequences. If the linker is found, the linker is removed and the remaining portion is split into two sequences linked with a dual end constraint.

Parameters

file The directory and file name of the .fas, .fna, or .sff file containing the 454 sequences.

linker The directory and file name of the .fas, .fna, or .sff file containing the 454 linker sequences. If not specified, SeqMan NGen will use its default 454 linker sequence:

GTTGGAACCGAAAGGGTTTGAATTCAAACCCTTTCGGTTCCAAC

min The minimum distance for the dual end constraint. Default 0.

max The maximum distance for the dual end constraint. Default 10000.

discardLinkerless Value may be true or false. When true, reads that do not have a linker sequence will be discarded from the assembly. Default False.

Example (Win)

load454DualEnd

file: "C:\454 data\123_dualend.fas"
linker: "C:\454 data\123_linkerseqs.fas"
min: 0
max: 10000
discardLinkerless: false

Example (Mac)

load454DualEnd

file: "/Library/454 data/123_dualend.fas"
linker: "/Library/454 data/123_linkerseqs.fas"
min: 0
max: 10000
discardLinkerless: false
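The linker handling described for load454DualEnd can be pictured with a minimal Python sketch. It is an illustration only, not SeqMan NGen's implementation: the function name split_on_linker is hypothetical, and the exact-match string search is an assumption (the program may well tolerate mismatches in the linker).

```python
# Default 454 linker sequence quoted in the manual.
DEFAULT_LINKER = "GTTGGAACCGAAAGGGTTTGAATTCAAACCCTTTCGGTTCCAAC"


def split_on_linker(read, linker=DEFAULT_LINKER, discard_linkerless=False):
    """Return the sequences derived from one 454 read.

    Assumption: the linker is located by exact string search.
    """
    pos = read.find(linker)
    if pos == -1:
        # No linker found: keep the read whole unless the caller
        # asked for linkerless reads to be discarded.
        return [] if discard_linkerless else [read]
    # Linker found: drop it and return the two flanking halves,
    # which would then be linked by a dual end constraint.
    return [read[:pos], read[pos + len(linker):]]


if __name__ == "__main__":
    print(split_on_linker("ACGT" + DEFAULT_LINKER + "TTGCA"))
```

The discard_linkerless argument mirrors the discardLinkerless parameter: when true, a read without a linker contributes nothing to the assembly.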

loadRepeat This command loads a sequence file to be used to identify repeat sequences in your assembly. All sequences identified as repeats will be added to the assembly last, after all non-repeats have been assembled. See Appendix II for supported file types.

Parameters

file This required parameter specifies the directory and file name of the repeat sequence file. A folder may also be specified, in which case all of the sequence files within that folder will be loaded and used as repetitive sequences. If a folder is specified, a final backslash “\” (or forward slash “/” on Mac) is necessary after the folder name. The directory and file name must be enclosed in quotes. See example below.

Example (Win)

loadRepeat

file: "C:\repetitive_seqs\123_repeat.seq"

Example (Mac)

loadRepeat

file: "/Library/repetitive_seqs/123_repeat.seq"

loadContaminant This command loads a contaminant sequence file to be used to identify known contaminants, such as primers, in your assembly. Sequences that contain at least 12 matching 17-mers are flagged as contaminant sequences and will be removed from the assembly. See Appendix II for supported file types.

Parameters

file The directory and file name of the contaminant sequence file. A folder may also be specified, in which case all of the sequence files within that folder will be loaded and used for contaminant screening. If a folder is specified, a final backslash “\” (or forward slash “/” on Mac) is necessary after the folder name. The directory and file name must be enclosed in quotes. See example below.

Example (Win)

loadContaminant

file: "C:\contaminants\123_abc.seq"

Example (Mac)

loadContaminant

file: "/Library/contaminants/123_abc.seq"

loadConstraint This command loads a constraint file. The file can be in the NCBI ancillary file format, or in the CAP3 constraint file format. SeqMan NGen uses constraint files to identify dual-end reads, similar to using the setPairSpecifier command. Constraint files in the NCBI ancillary file format also contain trimming information, which SeqMan NGen will load and use.

Note: SeqMan NGen will create a CAP3 file when saving a Phrap project (*.ace) that used dual-end constraints.

Parameters

file The directory and file name of the constraint sequence file. The directory and file name must be enclosed in quotes. See example below.

Example (Win)

loadConstraint

file: "C:\constraints\123_xyz.con"

Example (Mac)

loadConstraint

file: "/Library/constraints/123_xyz.con"

Parameter Settings

setParam This command allows you to adjust the stringency of one or more of the assembly parameters for your project. SeqMan NGen will use the default values for any parameter that is not specified within your script.

Parameters

matchSize The minimum number of matching consecutive bases required to determine the overlap of sequence reads. For further information, see the “Mer Tags” section. Default 15.

Note: The matchSize value must be an odd number. If an even number is entered, SeqMan NGen will automatically increase the value to the next odd number.
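The rounding rule in the note above amounts to the following one-line check. This is a sketch in Python for illustration; the function name effective_match_size is hypothetical and not part of SeqMan NGen.

```python
def effective_match_size(match_size):
    """Return the matchSize actually used: even values go up to the next odd number."""
    return match_size if match_size % 2 == 1 else match_size + 1


if __name__ == "__main__":
    for requested in (15, 16, 20):
        print(requested, "->", effective_match_size(requested))
```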

minMatchPercent The minimum percentage of matches in an overlap required to join two sequences in the same contig. For further information, see the “Match Percentage” section. Default 90.

matchWindowLength The size of the window used to calculate the match percentage. Default 50.

matchSpacing The length of the window of a sequence read where at least one mer tag will be chosen. For further information, see the “Mer Tags” section. Default 10.

useRepeatHandling Value may be true or false. When true, the assembler uses the repeat probabilities to determine if a mer occurs too frequently to use. For further information, see the “Repeat Handling” section. Default is True.

coverageType Specifies the type of coverage to be used for repeat handling. Value may be Genome, which uses the length of the genome being assembled to calculate the expected coverage, or Fixed which uses a fixed value as the expected coverage. For further information, see the “Repeat Handling” section. Default is Fixed.

Note: If you know the length of the genome/fragment being assembled, we recommend using Genome for the coverageType and then specifying the length using the genomeLength parameter. If you do not know the genome/fragment length, use Fixed for coverageType and provide your most accurate estimate of expected coverage for the fixedCoverage value.

fixedCoverage The estimated depth of the sequencing, which can be used instead of the genome length for repeat handling. Default 6.

Note: Use caution when estimating the value for fixedCoverage. If the value you use is significantly lower than the actual depth, the assembly may take a much longer time to complete and may have too many mers flagged as repeats.

genomeLength Specifies the length of the genome or fragment being assembled. This is used to calculate expected coverage in determining repeat handling. Default 0.

matchRepeatPercent The percent frequency a mer occurs compared to its expected frequency. Mers exceeding this value are flagged as repeated and not used as mer tags in determining overlaps. Default 150.

maxUsableCount Any mers occurring more frequently than fixedCoverage multiplied by maxUsableCount are disregarded as mer tags in the assembly. Default 25.

maxGap The maximum number of gaps allowed per 1000 bases in the alignment. Default 15.


maxAssemblyCoverage The maximum depth of coverage allowed in your assembly. SeqMan NGen will not exceed the coverage specified by this threshold. 0 = unlimited coverage. Default 500.

Note: This parameter is only available for templated assemblies.

matchScore The score for a base match during an alignment. This score contributes to the pairwise score used to calculate match percentage. Increasing the matchScore value will allow for longer or more frequent gaps, thus forcing bases that match to be assembled together. Default 10.

gapPenalty The penalty for opening or extending a gap during an alignment. This penalty is deducted from the pairwise score used to calculate match percentage. A high gap penalty suppresses gapping, while a low value promotes gapping. Default 30.

mismatchPenalty The penalty for a base mismatch during an alignment. This penalty is deducted from the pairwise score used to calculate match percentage. Default 20.

min454SeqLen Any sequences within the range of this minimum length value and the max454SeqLen value will have homopolymeric regions sorted by quality score so that the highest value appears first in the sequence as it is assembled. This improves gapping when sequences are mixed in orientation. Default 50.

max454SeqLen Any sequences within the range of this maximum length value and the min454SeqLen value will have homopolymeric regions sorted by quality score so that the highest value appears first in the sequence as it is assembled. This improves gapping when sequences are mixed in orientation. Default 350.

defaultQuality The value used for the base quality of sequences without quality scores. Default 15.


templateDefaultQuality The value used for the base quality of template sequences without quality scores. Default 500.

splitFalseJoins Value may be true or false. When true, the assembler identifies and splits false joins based on the set of false join parameters specified. Default True.

falseJoinMinColDepth The minimum depth of coverage necessary in a column of bases to consider the bases for possible false joins. Default 4.

falseJoinMinInconsistent The minimum number of inconsistent bases at a position to determine a possible false join. Default 4.

falseJoinMinFraction The minimum percentage of the alternate base at a position required to be considered in splitting false joins. Default 25.

falseJoinMinMatches The minimum number of matching bases used in splitting a false join. Default 2.

falseJoinUniformQual Value may be true or false. When true, the assembler assumes a uniform quality for bases during false join splitting. Default True.

falseJoinQualThresh The minimum quality score for a base to be considered a different base during false join splitting. Default 15.

allowConstraintBased Value may be true or false. When true, the assembler uses constraints during assembly. Default True.

skipRealign Value may be true or false. When true, the assembler skips the realignment step of the assembly. The realignment step analyzes each sequence at the nucleotide level to determine the exact position of each sequence in the alignment. Default False.


useSeqMan7Format Value may be true or false. When true, SeqMan NGen will create a SeqMan Pro project file (*.sqd) that is compatible with SeqMan Pro version 7.2. However, the project file will be much bigger than the same project file created for SeqMan Pro version 8. Default is False.

splitTemplateContigs Value may be true or false. When true, after a templated assembly has been completed, the template will be split into contigs at areas where there is zero coverage. Split contigs will be grouped into scaffolds with a defined position to allow for easy sorting when the project is viewed in SeqMan Pro. Annotations on the template sequence will also be split, and any /codon_start qualifiers will be adjusted to stay in frame. Default is False.

assembleBoneyard Value may be true or false. When true, after a templated assembly has been completed, the unassembled sequences remaining will be assembled into contigs. If the template has been split, SeqMan NGen will attempt to join the split contigs together in new arrangements. Default is False.

minContigSeqs The minimum number of sequences in a contig. After an assembly has been completed, any contigs without a template sequence will be disassembled if they contain fewer sequences than the number specified. The use of this parameter is recommended when performing de novo assemblies using data from Next Generation sequencing technologies, such as Illumina®, as these types of assemblies can produce tens of thousands of very small contigs. Default 0.

snpPasses The number of times SeqMan NGen will cycle through a templated assembly, attempting to fill in regions with zero coverage due to SNPs. Default 2.

snpMatchPercentage The minimum match percentage required during passes to fill in SNP regions. See the snpPasses parameter. Default 90.

Example

setParam useRepeatHandling:true
setParam coverageType:fixed
setParam fixedCoverage:6
setParam matchSize:15
setParam minMatchPercent:90
setParam matchSpacing:10
setParam matchRepeatPercent:150
setParam maxUsableCount:25
setParam maxGap:15
setParam matchWindowLength:50
setParam matchScore:10
setParam maxAssemblyCoverage:500
setParam gapPenalty:30
setParam mismatchPenalty:20
setParam min454SeqLen:50
setParam max454SeqLen:350
setParam defaultQuality:15
setParam templateDefaultQuality:500
setParam splitFalseJoins:true
setParam falseJoinMinColDepth:4
setParam falseJoinMinInconsistent:4
setParam falseJoinMinFraction:25
setParam falseJoinMinMatches:2
setParam falseJoinUniformQual:true
setParam falseJoinQualThresh:15
setParam allowConstraintBased:true
setParam skipRealign:false
setParam useSeqMan7Format:false
setParam splitTemplateContigs:false
setParam assembleBoneyard:false
setParam minContigSeqs:0
setParam snpPasses:2
setParam snpMatchPercentage:90

setVectorParam This command allows you to adjust the parameters used for vector trimming. In order to be applied, this command must appear in your script before the loadVector command, and the vectScan parameter for the assemble command must be set to true.

Parameters

merLength The minimum length of a mer required to be considered an exact match when searching for vector. Default 9.

minMerMatch The minimum number of matching mers required to start an alignment. Default 3.

minTrimLength The minimum length required for a mer to be considered as a match for vector trimming. Default 30.

Example

setVectorParam merLength:9
setVectorParam minMerMatch:3
setVectorParam minTrimLength:30

setQualityParam This command allows you to adjust the parameters used for quality trimming. In order to be applied, the trimEnds parameter for the assemble command must be set to true.

Parameters

winLength The length of the window used for averaging quality scores. Default 30.

minAveLowQual The minimum averaged quality score of the evaluated window required to be considered low-quality. Default 14.

Example

setQualityParam winLength:30
setQualityParam minAveLowQual:14
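The windowed averaging behind winLength and minAveLowQual can be sketched in Python as follows. This is an illustration under stated assumptions, not the program's trimming algorithm: the function name low_quality_windows is hypothetical, the window is assumed to slide one base at a time, and a window is assumed to count as low-quality when its mean score falls below minAveLowQual. How SeqMan NGen turns such windows into trim positions is not described in this manual.

```python
def low_quality_windows(quals, win_length=30, min_ave_low_qual=14):
    """Return the start index of every window whose mean quality is below the cutoff.

    Assumptions: the window slides one base at a time, and "low-quality"
    means a mean score strictly below min_ave_low_qual.
    """
    starts = []
    for start in range(len(quals) - win_length + 1):
        window = quals[start:start + win_length]
        if sum(window) / win_length < min_ave_low_qual:
            starts.append(start)
    return starts


if __name__ == "__main__":
    # Hypothetical read: 30 poor bases followed by 30 good ones.
    scores = [5] * 30 + [40] * 30
    print(low_quality_windows(scores))
```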

setRepeatParam This command allows you to adjust the parameters used for scanning for repetitive sequences. In order to be applied, this command must appear in your script before the loadRepeat command, and the repeatScan parameter for the assemble command must be set to true.

Parameters

merLength The minimum length of a mer required to be considered an exact match when scanning for repeats. Default 17.

minMerMatch The minimum number of matching mers required to start an alignment. Default 2.

minFlagLength The minimum length required for a mer to be flagged as a repeat. Default 50.

Example

setRepeatParam merLength:17
setRepeatParam minMerMatch:2
setRepeatParam minFlagLength:50

setContaminantParam This command allows you to adjust the parameters used for scanning for contaminant sequences. In order to be applied, this command must appear in your script before the loadContaminant command, and the contamScan parameter for the assemble command must be set to true.

Parameters

merLength The minimum length of a mer required to be considered an exact match when scanning for contaminants. Default 17.

minMerMatch The minimum number of matching mers required to mark the sequence as a contaminant. Default 12.

Example

setContaminantParam merLength:17
setContaminantParam minMerMatch:12
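With the defaults, a read is flagged when it contains at least 12 mers of length 17 that also occur in a contaminant sequence. The Python sketch below illustrates that counting rule. It is not SeqMan NGen's implementation: the function names are hypothetical, the comparison is on one strand only, and each read position with a matching mer is assumed to count once.

```python
def mers(seq, mer_length):
    """Return the set of all mers of the given length in seq."""
    return {seq[i:i + mer_length] for i in range(len(seq) - mer_length + 1)}


def is_contaminant(read, contaminant, mer_length=17, min_mer_match=12):
    """Flag a read that shares at least min_mer_match mers with the contaminant.

    Assumptions: exact matches on the given strand only; every read
    position whose mer occurs in the contaminant counts as one match.
    """
    known = mers(contaminant, mer_length)
    hits = sum(
        1
        for i in range(len(read) - mer_length + 1)
        if read[i:i + mer_length] in known
    )
    return hits >= min_mer_match


if __name__ == "__main__":
    # Hypothetical contaminant sequence, for illustration only.
    primer = "ACGTTGCAAGCTTAGGCTAACCGGTATCGATCGGATTACAGC"
    print(is_contaminant(primer[:28], primer))  # 28 bases give 12 matching 17-mers
    print(is_contaminant(primer[:27], primer))  # 27 bases give only 11
```

A stretch of 28 identical bases is therefore the shortest exact match that reaches the default cutoff, since 28 - 17 + 1 = 12.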

Preprocessing and Assembling

assemble This required command preprocesses and assembles the sequences that have been loaded. Preprocessing may include quality trimming, and scanning for vector, repetitive, and contaminant sequences.

Parameters

trimEnds Value may be true or false. If true, the sequences will be trimmed based on quality scores before assembling. Default False.

vectScan Value may be true or false. If true, the sequences will be scanned and trimmed for vector before assembling. Also see loadVector. Default False.

repeatScan Value may be true or false. If true, sequences will be scanned for the specified known repetitive sequences before assembling. Also see loadRepeat. Default False.

contamScan Value may be true or false. If true, sequences will be scanned for the specified contaminant sequences before assembling. Also see loadContaminant. Default False.

doAssemble Value may be true or false. If false, only the preprocessing will be done, and the sequences will not be assembled. Default True.

Example

assemble

trimEnds:false
vectScan:false
repeatScan:false
contamScan:false
doAssemble:true

realignContigs This optional command does another pass through the assembly once the initial assembly is complete, and realigns contigs as needed. Using this command may significantly increase the time to assemble and typically should only be used for correcting occasional misalignments that may occur in gapped regions.

Note: The realignContigs command must appear in your script after the assemble command.

Parameters

None

splitTemplates This command splits template contigs into multiple contigs in areas where there is zero coverage. Split contigs will be grouped into scaffolds with a defined position to allow for easy sorting when the project is viewed in SeqMan Pro. Annotations on the template sequence will also be split, and any /codon_start qualifiers will be adjusted to stay in frame.

Parameters

None

removeSmallContigs This command disassembles any contigs without template sequences that have fewer than the specified number of sequences.

Parameters

minSeqs This required parameter specifies the minimum number of sequences necessary in a contig to prevent it from being disassembled.

setPairSpecifier

This command defines the dual-end pair specifier for the paired Sanger and Illumina® sequences in your assembly.


Note: For more information on assembling 454® dual-end data, see the command load454DualEnd.

This command must appear in the script before the assemble command, but after sequences have been loaded (loadSeq).

Pair specifiers define the naming convention for sequence pairs, as well as your requirement for a minimum and maximum distance between the opposite ends of your inserts.

Expressions for forward and reverse naming conventions should be created using the Dual-end specification language. Forward and reverse sequences must have identical names except for the unique portion that determines the direction of the clone.

Parameters

Pairs This parameter lists the dual-end pair constraints, specified by the following four values. Each value should be separated by a space and the list of values enclosed in double brackets. An additional set of brackets is required around all of the dual-end pair constraints, regardless of whether one or multiple pair constraints are specified. See example below.

Forward A naming pattern to match forward clones. The pattern should be enclosed in quotes.

Reverse A naming pattern to match reverse clones. The pattern should be enclosed in quotes.

Min The minimum distance for the dual end sequences to be separated.

Max The maximum distance for the dual end sequences to be separated.

Example

The following example defines 2 pair specifiers, each with a different size range:

setPairSpecifier
pairs:{
{forward:"(.*)(2kb)(.*)-FP.*$" reverse:"(.*)(2kb)(.*)-RP.*$" min: 1500 max: 2500}
{forward:"(.*)(8kb)(.*)-FP.*$" reverse:"(.*)(8kb)(.*)-RP.*$" min: 7000 max: 9000}
}


Note: In a real script, the text above would appear all on one line. It is written as shown above to make it easier to see the different parts of the parameter, including the { } brackets around each pair specifier, and the { } around the entire list of pair specifiers. The outer { } brackets are required even if only one pair specifier is used.
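To see how naming patterns of this kind separate the shared part of a read name from the direction marker, the two pair specifiers from the example can be tried out with Python's re module. This is only a sketch: the read names below are hypothetical, the function name classify is not part of SeqMan NGen, and Python's regular expression engine is assumed to treat these particular patterns the same way as the Dual-end specification language.

```python
import re

# The two pair specifiers from the setPairSpecifier example.
PAIR_SPECIFIERS = [
    {"forward": r"(.*)(2kb)(.*)-FP.*$", "reverse": r"(.*)(2kb)(.*)-RP.*$",
     "min": 1500, "max": 2500},
    {"forward": r"(.*)(8kb)(.*)-FP.*$", "reverse": r"(.*)(8kb)(.*)-RP.*$",
     "min": 7000, "max": 9000},
]


def classify(name):
    """Return (shared_parts, direction, specifier_index), or None if nothing matches.

    The parenthesised groups hold the parts of the name that both
    members of a pair have in common.
    """
    for index, spec in enumerate(PAIR_SPECIFIERS):
        for direction in ("forward", "reverse"):
            match = re.match(spec[direction], name)
            if match:
                return match.groups(), direction, index
    return None


if __name__ == "__main__":
    # Hypothetical read names, for illustration only.
    for read_name in ("libA_2kb_001-FP.abi", "libA_2kb_001-RP.abi",
                      "libA_8kb_007-RP.abi", "unpaired.abi"):
        print(read_name, classify(read_name))
```

Two reads belong to the same pair when they match the same specifier with identical shared parts and opposite directions.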

Annotating Template Sequence Prior to Assembly

Prior to assembling your sequences in SeqMan NGen, you may want to annotate your template sequence in SeqBuilder for known SNPs/variations and other features. SeqBuilder is the sequence editing and visualization application in the Lasergene suite. Assembling an annotated template sequence in SeqMan NGen will enable you to better analyze the identified putative SNPs when viewing your assembled project in SeqMan Pro.

To annotate your template sequence in SeqBuilder:

1. Launch SeqBuilder, located by default in the following directory:

Win: C:\Program Files (x86)\DNASTAR\Lasergene 8\SeqBuilder.exe

Mac: Applications/DNASTAR/Lasergene 8/SeqBuilder

2. Go to File>Open and select your template sequence.

3. Select the range of sequence where a feature will be added. (Use Edit>Go to Position to navigate quickly up and down your sequence.)

4. Go to Features>New Feature. A new “misc_feature” will be added to your sequence and displayed in the Feature List.

5. Click on “misc_feature” from within the Feature List and select the appropriate feature type from the list provided. For example:

− For SNPs, choose Variation>variation.

− For exons, choose Gene>exon.

− For CDS features, choose Transcript>CDS.

− For rep_origin, choose Structure>rep_origin.

Note: The next feature you create will automatically be of the same feature type you just selected, enabling you to create all the features of one type more quickly.

6. Repeat steps 3-5 until all of your features have been added. Then go to File>Save As and save your sequence as a *.sbd or *.seq file. Your annotated template sequence is now ready for assembly and subsequent analysis in SeqMan Pro.


Viewing Assembly Results in SeqMan Pro

Once your sequences have been assembled and saved using SeqMan NGen, you may view the results in SeqMan Pro. Use File>Open to open a SeqMan Pro project file (*.sqd) or File>Import to import a Phrap assembly (*.ace). Your assembly will be displayed in the Project window. Unless you chose not to save them (see the saveUnassembled command), the unassembled sequences in your assembly will be displayed in the Unassembled Sequences window.

If your project file or original sequence files have been moved, SeqMan Pro will prompt you to locate the sequence files.

The contigs in your project will be named as follows:

− If you assembled your data using a template sequence, the resulting contig will take the name of the template sequence.

− If you used the useRepeatHandling parameter, the contigs in your project made up of sequences flagged as possible repeats will be named: Repeat-00001, Repeat-0002, Repeat-0003, etc.

− If you scanned your assembly for known repeats (see the repeatScan parameter for the assemble command), the contigs containing the known repeated sequences will take the name of the repeated sequence.

− If none of the above applies, the contigs in your project will be named Contig 00001, Contig 00002, Contig 00003, etc.

Detailed descriptions of all of SeqMan Pro’s features can be accessed within SeqMan Pro by going to Help>Contents (Win) or Help>SeqMan Help (Mac). The SeqMan Pro Help topics Discovering SNPs and Working with Features may be particularly useful.

Viewing Areas Exceeding the Maximum Depth of Coverage

If you specified a value for maxAssemblyCoverage and wish to see the areas that likely exceeded this coverage:

1. Select Project>Parameters and choose Strategy Viewing from the list provided.

2. Enter a value in the Maximum Expected Coverage field that is one less than the maxAssemblyCoverage value you used in your assembly. (For example, enter 39 if your maxAssemblyCoverage value was 40.)

3. Click OK to close the Parameters window.

4. Go to the Strategy View by choosing a contig or a scaffold from the Project window and selecting Contig>Strategy View or Contig>Scaffold Strategy View. Areas exceeding this maximum value will be indicated by thick, red areas in the Coverage Threshold bar.


Components of SeqMan NGen

Mer Tags

The SeqMan NGen layout algorithm relies on unique subsequences of bases, or mers, that occur in overlapping regions of fragment reads. Mers that are common to two or more fragment reads are aligned to determine the overall layout of reads. Overlapping reads have many mers in common, but only a few mers per overlapping region are needed to identify the overlap. These mers are called mer tags. The use of mers to tag fragments and identify overlaps is illustrated in the following figure.

Figure: Using Mer Tags to Identify Overlaps


Note: As shown in the above figure, a 54bp original DNA sequence is covered by five overlapping fragment reads. The 6-mer tags for each fragment read are underlined. Matching mer tags are aligned to determine the layout of the reads.

The power of using mer tags relies on the ability of SeqMan NGen to choose mers that are most likely to occur only once in the original DNA sequence. It is important to avoid choosing mers that occur in repeated regions since the result may be fragment reads that are incorrectly aligned together.

Three parameters are involved in choosing mer tags: matchSize, useRepeatHandling, and matchSpacing.

The settings of the matchSize and useRepeatHandling parameters help to choose tags that are most likely to be unique in the original DNA sequence. The matchSize sets the length of the mers. The longer the mer, the more probable that the mer is unique. The useRepeatHandling parameters help to identify which mers are not likely to be unique. If a mer occurs more often than expected in the dataset, the mer may be part of a repeated region.

The matchSpacing parameter specifies the preferred distance between mer tags. The smaller the matchSpacing, the more memory and more time the assembly will take. If a fragment read is shorter than the matchSpacing, multiple mer tags are still chosen for the read.
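The basic idea of using shared mers to detect overlapping reads can be sketched in a few lines of Python. This illustrates the principle only: the function name shared_mers and the toy reads are hypothetical, every mer is indexed rather than a spaced subset of tags, and no repeat filtering is applied, so the sketch does not reproduce what matchSpacing and useRepeatHandling do.

```python
from collections import defaultdict


def shared_mers(reads, mer_size):
    """Map each mer found in two or more reads to the indices of those reads.

    Simplification: every mer is indexed; the real layout algorithm keeps
    only selected mer tags and skips mers that look like repeats.
    """
    index = defaultdict(set)
    for read_id, read in enumerate(reads):
        for i in range(len(read) - mer_size + 1):
            index[read[i:i + mer_size]].add(read_id)
    return {mer: ids for mer, ids in index.items() if len(ids) >= 2}


if __name__ == "__main__":
    # Two toy reads that overlap by the 5 bases "CGGTT".
    toy_reads = ["ACGTACGGTT", "CGGTTAGCAT"]
    print(shared_mers(toy_reads, mer_size=5))
```

A longer mer_size makes a chance match between unrelated reads less likely, which is the reason given above for preferring longer mers.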

Match Percentage

SeqMan NGen uses a local Match Percentage, which requires that the Match Percentage threshold be met in each overlapping window of 50 bases, by default. The size of this window can be adjusted by specifying a different value for the matchWindowLength parameter.

An example containing a repeated region follows. A fragment of a genome has a repeated region, labeled A and A’, and two unique regions, labeled B and C.

When the fragment is sequenced, one of the sequences contains parts of regions A and B, and another contains parts of regions A’ and C:


In this example, a minMatchPercent threshold of 80% is used. When the two sequences are aligned, the 400 bases in the overlapping A and A’ regions match 100%. The 200 bases in the overlapping B and C regions match 42% (84 of 200). Over the entire alignment, 484 bases out of 600 match, yielding a global Match Percentage of 81%. However, SeqMan NGen checks the Match Percentage for every window of 50 bases in the alignment.

The alignment below shows the last 36 overlapping bases of A and A’ and the first 18 overlapping bases of B and C. Each mismatch in the overlap is marked by an X below the alignment. In the first 50 bases shown, there are 41 matches, and the Match Percentage is 82%. This is above the threshold of 80%, so the Match Percentage of the next 50 bases is checked and is also found to be 82%. Each window of 50 bases is checked along the overlap as long as the Match Percentage is at or above the threshold. In this case, the alignment fails once it gets far enough into the overlap of the unique regions, B and C, that the Match Percentage drops to 78%. The sequences will not be assembled together into a contig, which is correct for this data set.
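The difference between a global and a local Match Percentage can be reproduced with a short Python sketch. It is a simplified illustration, not SeqMan NGen's algorithm: the function names are hypothetical, the alignment is reduced to a list of match/mismatch flags, and the step between windows is an assumption (the manual does not state how far the window advances each time). The toy data mirror the totals in the example, 400 matching columns followed by 200 columns that match 42%, but not the exact mismatch positions.

```python
def window_match_percentages(matches, window=50, step=50):
    """Return the match percentage of each window along an alignment.

    matches holds one boolean per aligned column (True = match).
    Assumption: windows advance by `step` columns.
    """
    return [
        100.0 * sum(matches[start:start + window]) / window
        for start in range(0, len(matches) - window + 1, step)
    ]


def passes_local_threshold(matches, min_match_percent, window=50, step=50):
    """True only if every window meets the threshold."""
    return all(
        pct >= min_match_percent
        for pct in window_match_percentages(matches, window, step)
    )


if __name__ == "__main__":
    # 400 matching columns (A/A'), then 200 columns with 84 matches (B/C).
    repeat_region = [True] * 400
    unique_region = ([True] * 21 + [False] * 29) * 4
    alignment = repeat_region + unique_region

    global_pct = 100.0 * sum(alignment) / len(alignment)
    print(round(global_pct))                      # global Match Percentage
    print(passes_local_threshold(alignment, 80))  # local check
```

The global figure clears the 80% threshold, while the windows that fall inside the B and C regions do not, so the local check rejects the join.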

Repeat Handling

Repeat Handling is a set of parameters used to compute a threshold: the number of occurrences of an identical subsequence of bases, or mer, above which the mer is taken to indicate a putative repeat. The useRepeatHandling parameter is turned on by default.

The SeqMan NGen layout algorithm relies on mers that occur in overlapping regions of fragment reads. Mers that are common to two or more fragment reads are aligned to determine the overall layout of reads (see also the “Mer Tags” section). The Repeat Handling parameters control which mers may be chosen as tags for overlapping reads.

The threshold is computed as a percentage of the expected coverage in a project. Coverage can be determined using the length of the genome/fragment being sequenced or specified as a fixed number. If genomeLength is specified, the expected coverage is the total length of all sequences in the project divided by the genomeLength value. If fixedCoverage is specified, then genomeLength is ignored and the fixedCoverage value is used as the expected coverage. The threshold is computed by multiplying the expected coverage by matchRepeatPercent (default is 150), taken as a percentage. For example, with the default fixedCoverage of 6, the threshold is 6 × 150% = 9. Any mer that occurs more frequently than the computed threshold is not considered for use as a mer tag in determining overlaps.

The following assembly parameters are involved in Repeat Handling:

useRepeatHandling
coverageType
fixedCoverage
genomeLength
matchRepeatPercent
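The threshold arithmetic can be sketched in Python as follows. The function names and the mer counts are hypothetical. The sketch also assumes the usual definition of coverage (total read length divided by genome length) and that matchRepeatPercent is applied as a percentage. The defaults (Fixed coverage type, fixedCoverage of 6, matchRepeatPercent of 150) are taken from Appendix I.

```python
def expected_coverage(coverage_type="Fixed", fixed_coverage=6.0,
                      genome_length=0, total_read_length=0):
    """Expected coverage from a fixed value or from the genome length."""
    if coverage_type == "Fixed":
        return float(fixed_coverage)
    if coverage_type == "Genome":
        if genome_length <= 0:
            raise ValueError("genome_length must be positive for Genome coverage")
        # Assumption: coverage = total bases sequenced / genome length.
        return total_read_length / genome_length
    raise ValueError(f"unknown coverage type: {coverage_type!r}")


def repeat_threshold(coverage, match_repeat_percent=150.0):
    """Occurrence count above which a mer is treated as a putative repeat."""
    return coverage * match_repeat_percent / 100.0


def usable_mer_tags(mer_counts, threshold):
    """Keep only mers that do not occur more often than the threshold."""
    return {mer for mer, count in mer_counts.items() if count <= threshold}


threshold = repeat_threshold(expected_coverage())   # 6 * 150% = 9.0
counts = {"ACGTACGTACGTACGTA": 5, "TTTTTTTTTTTTTTTTT": 12}   # hypothetical counts
print(threshold, usable_mer_tags(counts, threshold))
```

With the defaults, a mer seen 12 times exceeds the threshold of 9 and is excluded as a tag, while a mer seen 5 times remains usable.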

Assembling Dual-End Data

SeqMan NGen assembles dual-end data for 454®, Illumina®, and Sanger sequences. A dual-end sequence pair is a pair of reads that are known to be related with respect to orientation and distance. SeqMan NGen assumes the pair will be from opposite ends of the same DNA fragment, and sequenced from the end of the fragment inwards.

For more information on assembling 454® dual-end data, see the command load454DualEnd.

In order to enable SeqMan NGen to identify Illumina® and Sanger pairs, the sequence naming convention must systematically distinguish between different pair reads while specifying which pair reads are associated. Forward and reverse sequences must have identical names except for the unique portion that determines the direction of the clone.

The parts of the sequence names that are the same in a pair of reads are specified inside parentheses, and the parts of the names that distinguish members of the same pair as forward and reverse are placed outside the parentheses.

Expressions for forward and reverse naming conventions are written using a subset of PERL regular expressions, which share elements with the Grep language. For more information, see Appendix III.

Example

The following forward and reverse reads form a pair:

01f.abi
01r.abi


The shared “01” identifies the two reads as members of the same pair. The “f” and “r” that follow it distinguish the orientation. In Grep, the naming convention would be written as follows:

Forward convention: (.*)f\..*$
Reverse convention: (.*)r\..*$

For this example, the setPairSpecifier command would be written as follows (with arbitrary distance values):

setPairSpecifier pairs: {{forward: "(.*)f\..*$" reverse: "(.*)r\..*$" min: 1500 max: 2000}}
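To see how the two conventions group read names, the following Python sketch applies the same expressions with Python's re module. It only illustrates the naming logic: the function names are hypothetical, Python's regular expression engine is not the one SeqMan NGen uses, and the min and max distance values are not modeled.

```python
import re

FORWARD = re.compile(r"(.*)f\..*$")   # forward convention from the example
REVERSE = re.compile(r"(.*)r\..*$")   # reverse convention from the example


def pair_key(name, pattern):
    """Return the parenthesized phrases of a matching name, or None."""
    found = pattern.match(name)
    return found.groups() if found else None


def find_pairs(names):
    """Pair forward and reverse reads whose parenthesized phrases are equal."""
    forward, reverse = {}, {}
    for name in names:
        key = pair_key(name, FORWARD)
        if key is not None:
            forward[key] = name
            continue
        key = pair_key(name, REVERSE)
        if key is not None:
            reverse[key] = name
    shared = forward.keys() & reverse.keys()
    return sorted((forward[key], reverse[key]) for key in shared)


names = ["01f.abi", "01r.abi", "02f.abi", "03r.abi"]
print(find_pairs(names))   # [('01f.abi', '01r.abi')]
```

Reads 02f.abi and 03r.abi stay unpaired because no read with the opposite orientation shares their parenthesized phrase.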


Appendix I

Alphabetical List of Commands

Command               Parameters                Accepted Parameter Values                     Default Parameter Value

assemble              trimEnds                  True/False                                    False
                      vectScan                  True/False                                    False
                      repeatScan                True/False                                    False
                      contamScan                True/False                                    False
                      doAssemble                True/False                                    True
closeProject          None                      N/A                                           N/A
load454DualEnd        file                      A directory and file name enclosed in quotes  N/A
                      linker                    A directory and file name enclosed in quotes  See linker parameter in load454DualEnd.
                      min                       Numerical value                               0
                      max                       Numerical value                               10000
                      discardLinkerless         True/False                                    False
loadConstraint        file                      A directory and file name enclosed in quotes  N/A
loadContaminant       file                      A directory and file name enclosed in quotes  N/A
loadRepeat            file                      A directory and file name enclosed in quotes  N/A
loadSeq               file                      A directory and file name enclosed in quotes  N/A
loadTemplate          file                      A directory and file name enclosed in quotes  N/A
loadVector            file                      A directory and file name enclosed in quotes  N/A
                      cloneSite                 Numerical value                               N/A
newProject            None                      N/A                                           N/A
openProject           file                      A directory and file name enclosed in quotes  N/A
quit                  None                      N/A                                           N/A
realignContigs        None                      N/A                                           N/A
removeSmallContigs    minSeqs                   Numerical value                               N/A
saveProject           file                      A directory and file name enclosed in quotes  N/A
                      saveUnassembled           True/False                                    False
                      format                    SeqMan/Phrap/Fasta                            SeqMan
saveReport            file                      A directory and file name enclosed in quotes  N/A
setContaminantParam   merLength                 Numerical value                               17
                      minMerMatch               Numerical value                               12
setDefaultDirectory   defaultWinDirectory       A directory enclosed in quotes                N/A
                      defaultMacDirectory       A directory enclosed in quotes                N/A
setPairSpecifier      pairs                     Forward, Reverse, Min, Max values*            N/A
setParam              matchSize                 Numerical value                               15
                      minMatchPercent           Numerical value                               90
                      matchSpacing              Numerical value                               10
                      useRepeatHandling         True/False                                    True
                      coverageType              Genome/Fixed                                  Fixed
                      genomeLength              Numerical value                               0
                      fixedCoverage             Numerical value                               6
                      matchRepeatPercent        Numerical value                               150
                      maxUsableCount            Numerical value                               25
                      maxGap                    Numerical value                               15
                      matchWindowLength         Numerical value                               50
                      matchScore                Numerical value                               10
                      maxAssemblyCoverage       Numerical value (0 = unlimited)               500
                      gapPenalty                Numerical value                               30
                      misMatchPenalty           Numerical value                               20
                      min454SeqLen              Numerical value                               50
                      max454SeqLen              Numerical value                               350
                      defaultQuality            Numerical value                               15
                      templateDefaultQuality    Numerical value                               500
                      splitFalseJoins           True/False                                    True
                      falseJoinMinColDepth      Numerical value                               4
                      falseJoinMinInconsistent  Numerical value                               4
                      falseJoinMinFraction      Numerical value                               25
                      falseJoinMinMatches       Numerical value                               2
                      falseJoinUniformQual      True/False                                    True
                      falseJoinQualThresh       Numerical value                               15
                      allowConstraintBased      True/False                                    True
                      skipRealign               True/False                                    False
                      useSeqman7Format          True/False                                    False
                      splitTemplateContigs      True/False                                    False
                      assembleBoneyard          True/False                                    False
                      minContigSeqs             Numerical value                               0
                      snpPasses                 Numerical value                               2
                      snpMatchPercentage        Numerical value                               90
setQualityParam       winLength                 Numerical value                               30
                      minAveLowQual             Numerical value                               14
setRepeatParam        merLength                 Numerical value                               17
                      minMerMatch               Numerical value                               2
                      minFlagLength             Numerical value                               50
setVectorParam        merLength                 Numerical value                               9
                      minMerMatch               Numerical value                               3
                      minTrimLength             Numerical value                               30
splitTemplates        None                      N/A                                           N/A
writeUnassembledSeqs  file                      A directory and file name enclosed in quotes  N/A
                      saveTrimmed               True/False                                    False

*See setPairSpecifier for further information.

SeqMan NGen User's Manual Appendix I • 41

Page 46: SeqMan NGen User's Manual

Appendix II

The following table lists all of the file types supported for import by SeqMan NGen, organized by the commands that support each type.

File Type                   Extension(s)            loadSeq  loadTemplate  loadVector  load454DualEnd  loadRepeat  loadContaminant  openProject

Illumina®                   .txt                    yes      yes           yes         no              yes         yes              no
454®*                       .fas, .fna              yes      yes           yes         yes             yes         yes              no
454®* Standard Flow Files   .sff                    yes      yes           no          yes             no          no               no
SeqMan                      .sqd                    yes      yes           no          no              no          no               yes
Phrap                       .ace                    yes      no            no          no              no          no               yes
Lasergene                   .seq, .sbd              yes      yes           yes         no              yes         yes              no
ABI                         .abi, .ab1, .abd        yes      yes           yes         no              yes         yes              no
FASTA*                      .fas, .nt, .txt, .fna   yes      yes           yes         no              yes         yes              no
FASTQ                       .fasq                   yes      yes           yes         no              yes         yes              no
DNA Multiseq                .mseq                   yes      yes           yes         no              yes         yes              no
File of Filenames           .fof                    yes      yes           yes         no              yes         yes              no
SCF2 and SCF3               .scf                    yes      yes           yes         no              yes         yes              no
Text                        .txt                    yes      yes           yes         no              yes         yes              no
GenBank                     .gbk, .gb               yes      yes           yes         no              yes         yes              no
GCG                         .seq, .gcg              yes      yes           yes         no              yes         yes              no

*Should have an associated .qual file with the same file name and a .qual extension. The .qual file must be in the same folder with the sequence file in order for the quality scores to be used.
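Before loading a sequence file marked with an asterisk, it can be useful to check where its quality file is expected. The Python sketch below is an illustration only. The function name is hypothetical, and it assumes that "the same file name and a .qual extension" means the sequence file's extension is replaced by .qual (for example, run1.fna is paired with run1.qual).

```python
from pathlib import Path


def expected_qual_path(sequence_file):
    """Return the .qual path expected alongside a sequence file.

    Assumption: the sequence file's extension is replaced by .qual and
    the file sits in the same folder as the sequence file.
    """
    return Path(sequence_file).with_suffix(".qual")


print(expected_qual_path("reads/run1.fna"))
```

Checking that this path exists before running a script avoids silently assembling without quality scores.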


Appendix III

Example Regular Expressions in PERL

Examples of expressions you may find useful for dual-end naming specifications follow. Please note this is not a complete list of PERL regular expressions, and the definitions of the terms used are limited to their application to SeqMan NGen dual-end naming specifications.

Special characters

\       A switch that makes special characters literal and literal characters special
[ ]     Character class: used to enclose a list of alternatives
( )     Grouping: used to delimit a string comprising a “phrase.” Phrases are necessary in dual-end specification so you can match a pair of forward and reverse reads while still distinguishing their orientation. In SeqMan NGen, phrases in parentheses must match for two reads to qualify as a pair; phrases outside the parentheses are used to distinguish members of the same pair.
\d      Any digit (0-9)
\D      Any non-digit character
\w      Any alphanumeric “word” character (including “_”)
.       Any character
|       Alternate: either the term before “|” or the term after “|”
^       Match at the beginning of the line only
$       Match at the end of the line only

Numerical modifiers

*       0 or more
+       1 or more
?       0 or 1
{n}     Exactly n
{n,}    At least n
{n,m}   At least n but not more than m


Example expressions and their meanings

d                   Literally the letter d
\d                  Any digit (0-9)
\d*                 Zero or more digits
\d+                 One or more digits
(\d+)               A phrase comprising one or more digits; same as “\d+”, but causes SeqMan NGen to match the names from the string inside the phrase when other characters in the name may not match
\.                  Literally the period symbol (.)
.                   Any character
.+                  One or more of any characters
.*                  Zero or more of any characters
a|b                 a OR b
ab[i1]              abi or ab1
abi$                Ends with abi
[\.\d]              A period OR a digit
[abc]               a OR b OR c
[abc]+              One or more characters from the set a, b, c
.*f                 Any number of any characters followed by the letter “f”
(.*)f               A phrase comprising any number of any characters, followed by the letter “f”; same as “.*f”, but causes SeqMan NGen to match the phrase in parentheses without matching the “f” in a read name
(\D+)r(\d+)         One or more non-digit characters followed by “r” followed by one or more digits
(\d{2,4})f(\.abi)   Two, three or four digits followed by “f” followed by “.abi”
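Several of the expressions above can be tried out with Python's re module, which accepts the same syntax for these constructs. This is only a way to experiment before writing a pair specifier: the sample read names are made up, and SeqMan NGen's own matching may differ in detail.

```python
import re

# (pattern, candidate text, whether the whole text should match)
CASES = [
    (r"\d+", "2008", True),
    (r"ab[i1]", "ab1", True),
    (r"ab[i1]", "abd", False),
    (r"[abc]+", "cab", True),
    (r"(\d{2,4})f(\.abi)", "0123f.abi", True),
    (r"(\d{2,4})f(\.abi)", "1f.abi", False),   # only one digit before "f"
    (r"(\D+)r(\d+)", "plater12", True),
]


def matches_whole(pattern, text):
    """True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None


for pattern, text, expected in CASES:
    assert matches_whole(pattern, text) == expected

# The parenthesized phrases are what a pair specifier compares.
phrases = re.fullmatch(r"(\D+)r(\d+)", "plater12").groups()
print(phrases)   # ('plate', '12')
```

Note that in the last case the orientation letter “r” is left outside both phrases, so a forward read named platef12 would yield the same phrases under the corresponding forward expression.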

