20110524zurichngs 1st pub

Next Generation Sequencing for Model and Non-Model Organisms

1st day

Jun Sese and Kentaro [email protected]

Ph.D. course @ Univ. of Zurich, 25/05/2011


Today’s Menu

• Lecture

• Overview of next generation sequencer’s analysis

• Mapping: Sequence alignment

• Introduction to UNIX to handle NGS data

• Exercise

• UNIX commands

• Mapping real short reads against genomes

• Compute statistics of the mapped reads


Various Types of Sequencers

• Roche 454, Ion Torrent

• Roche: about 400 bp, Ion Torrent: about 200 bp

• Suitable for de novo sequencing

• Illumina HiSeq

• Widely used next-generation sequencer

• 100 bp x 2, up to 600 Gb/run (HiSeq 2000)

• MiSeq uses almost the same technology, but produces fewer reads

• ABI SOLiD

• 75bp, 75bp+35bp or 60bpx2 up to 300 Gb/run (5500xl SOLiD)

• Color Space

• Pacific Biosciences PacBio RS

• Average > 500 bp

• Sequence quality is not high.

Sequencing cost has dropped dramatically

Lincoln Stein, Genome Biology, vol. 11(5), 2010


How large is it?

• Generated file size is more than 300GB/run

• We can read data from hard disks with 100 MB/sec

• 300GB / 100MB/sec

= 300,000MB / 100MB/sec

= 3000 sec

= 50min

• Just reading the data from the HDD takes 50 min!

• Efficient computation is required
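The arithmetic above can be checked with a small sketch. The function name is hypothetical, and it assumes, like the slide, that 1 GB = 1000 MB and that the disk sustains a constant throughput.

```python
def read_time_minutes(size_gb, throughput_mb_per_sec):
    """Minutes needed to stream size_gb of data at a constant disk throughput."""
    size_mb = size_gb * 1000  # the slide uses 1 GB = 1000 MB
    seconds = size_mb / throughput_mb_per_sec
    return seconds / 60


# One 300 GB run read at 100 MB/sec, as on the slide.
print(read_time_minutes(300, 100))
```

Doubling the throughput halves the time, so on real hardware the disk, not the CPU, is often the first bottleneck.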

Applications of DNA Sequencing

• NGS simply reads an enormous number of short sequences, but it has many biological applications.

• Genetic variation

• Gene regulations

• RNA-seq

• ChIP-seq

• Epigenetics

• Population genetics

Science 2007

Sequencerʼs Output

Mapping Program

Genome Sequence

Mapping Result

Visualization / Further analysis (SNPs, RNA-Seq, ...)

Major Pipelines of NGS

• Most applications follow a similar procedure.

Steps: Find region of origin → Filter → Analysis

• Genetic variation: Map (alignment) → SNP call → Find differences

• ChIP-Seq: Map → Check regulatory regions → Same as ChIP-chip analysis

• RNA-Seq: Map → Measure expression → Same as microarray

• Most of them require a whole genome sequence to map the reads against.

Mapping (Pairwise Alignment)

• Find the place from which each read comes

• BLAST is one of the most famous alignment programs.

• Few NGS analyses use BLAST/BLAT because they align too slowly.

• BWA and Bowtie are commonly used to map short reads.

Exact match:

Reference  GATGCTAAGCATATGCGAGGCATGCCATATGGATG
Reads                ATATGCGA

With a mismatch or a gap:

Reference  GATGCTAAGCAAATGCGAGGCATGCCATATGGCGA
Reads                ATATGCGA        ATATG-CGA

• We may find multiple candidate positions. A score matrix (distance) defines which mapping is better.
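To make the idea of "finding the place a read comes from" concrete, here is a deliberately naive sketch that slides the read along the reference and counts mismatches. It is an illustration only: the function name is hypothetical, gaps are not handled, and real mappers such as BWA and Bowtie use an index instead of scanning every position.

```python
def map_read(reference, read, max_mismatches=1):
    """Return (1-based position, mismatches) for every placement of the read
    with at most max_mismatches substitutions, best placements first."""
    hits = []
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        mismatches = sum(1 for ref_base, read_base in zip(window, read)
                         if ref_base != read_base)
        if mismatches <= max_mismatches:
            hits.append((start + 1, mismatches))
    # Fewer mismatches means a better placement.
    return sorted(hits, key=lambda hit: hit[1])


exact_reference = "GATGCTAAGCATATGCGAGGCATGCCATATGGATG"
mutated_reference = "GATGCTAAGCAAATGCGAGGCATGCCATATGGCGA"
print(map_read(exact_reference, "ATATGCGA"))
print(map_read(mutated_reference, "ATATGCGA"))
```

The number of mismatches plays the role of the distance mentioned above: when several placements are found, the one with the smallest distance is reported first.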

For non-model organisms

Steps: Genome/gene sequence → Map → Filter → Analysis

• Genetic variation: Read genome, genome assembly → Map new reads → SNP call → Find differences

• ChIP-Seq: Read genome, genome assembly → Map ChIP-Seq reads → Check regulatory regions → Similar to ChIP-chip

• RNA-Seq: Read normalized library, RNA assembly → Map new RNA-Seq reads → Measure expression → Same as microarray

• Also shown on the slide: count assembled reads; map onto a related species' genome

• Most cases require genome assembly, which is experimentally and computationally costly.

Very Short History of Pairwise Alignment Programs

• More than 100 alignment programs are listed in Wikipedia!!!

• http://en.wikipedia.org/wiki/Sequence_alignment_software

• 1 sequence vs 1 sequence

• Ssearch, FASTA [Lipman and Pearson. 1985]

• 1 sequence vs Whole genes

• BLAST [Altschul et al. 1990]

• Thousands of sequences vs Whole genes or Whole genomes

• BLAT [Kent. 2002]

• Billions of short sequences vs Whole genome

• BWA, Bowtie, SHRiMP, etc...

• Most modern mappers use FM-index [Ferragina and Manzini. 2000] with Burrows-Wheeler transform [Burrows and Wheeler. 1994].


Why have so many alignment programs been developed?

• To computer scientists, alignment looks like an easy task.

• Both indexing and dynamic programming, which are used in sequence alignment, are basic algorithms.

• Good problem for homework

• A little performance tuning can accelerate execution speed dramatically

• In reality, the alignment problem is very hard to solve.

• Mutations, insertions, deletions...

• Each sequencer has its own bias.

• Sequence length. Homopolymers in Roche 454...

• Biologists rely on many heuristics!

• GT-AG rule at splice sites, but not always...

• That is, the problem definition is ambiguous!

Alignment performance varies

• Aligned 12 million single-end reads against the human genome sequence (hg18)

• Differences in algorithm and implementation appear in the total processing time

• In most programs, memory usage depends on genome size.

• Parameter settings affect the number of mapped reads.

• The authors did not mention them.

• In real experiments, we have to tune the parameters of the alignment program.

Bao et al. J Hum Genet, 2011

Sequencerʼs Output

Mapping Program

Genome Sequence

Mapping Result

Visualization

Sequence Format

BWA, Bowtie, etc.


Sequence File Format (1)

• FASTA + Quality File

• Used by Roche 454

Quality file:

>1ST_SEQ length=67 xy=1264_0441 region=1 run=R_2010_07_07_16_23_16_
37 35 35 35 35 35 37 37 37 37 37 39 39 37 36 35 35 36 37 37 37 37 35 35 32 28 27 27 27 27 29 23 21 21 14 14 12 18 19 19 19 19 19 19 16 16 17 20 22 20 12 12 12 12 11 17 17 17 16 19
22 23 24 21 21 21 18
>2ND_SEQ length=88 xy=1264_0564 region=1 run=R_2010_07_07_16_23_16_
29 30 19 19 19 20 19 24 28 27 27 27 27 27 30 19 19 20 20 20 24 33 33 33 33 33 33 33 35 35 37 37 30 30 30 30 32 32 32 32 35 32 32 32 32 33 33 33 33 20 20 20 23 27 30 30 31 31 27 27
27 27 28 23 24 24 23 23 23 24 24 21 17 19 19 18 27 18 17 16 16 16 17 13 18 17 16 12

FASTA file:

>1ST_SEQ length=67 xy=1264_0441 region=1 run=R_2010_07_07_16_23_16_
GCGTTGTGTATGTCTCCTTTGGTATGTCAGGTTTCGTCAGAAGCTTCTATCAAACGGCGCACAGTGA
>2ND_SEQ length=88 xy=1264_0564 region=1 run=R_2010_07_07_16_23_16_
TCGGCCCTATCCGAGAAGGCGTGGTGTATCTCTCTTCTGGTATGCCACGTTACGCAGCAGCTTCTTCCCAAGACACAGAGCGAGTAAG

Sequence File Format (2)

• FASTQ

• Used by Illumina sequencers

• Sequence database sites (SRA (Sequence Read Archive) / ENA (European Nucleotide Archive) / DRA (DDBJ Sequence Read Archive)) provide sequences in this format.

• De facto standard

• CSFasta + Quality file

• Only used by SOLiD sequencers

• Similar to a FASTA file, except that sequences are described in color space.

Quality file:

>SRR038985.100 VAB_AT1deg1_51_269_F3
0 20 23 21 26 20 21 23 21 20 24 25 26 20 23 19 17 27 26 10 16 16 19 23 19 26 28 9 22 18 21 25 25 23 2 20
>SRR038985.200 VAB_AT1deg1_78_430_F3
0 7 19 26 26 24 8 27 29 23 23 21 21 24 26 19 11 21 25 14 10 19 21 21 25 20 28 20 20 15 23 8 25 23 11 25

CSFasta file:

>SRR038985.100 VAB_AT1deg1_51_269_F3
T10303011231130321000333001323122221
>SRR038985.200 VAB_AT1deg1_78_430_F3
T03102101012320213012132121333132011

Color Space

Color Space Analysis in the SOLiD™ System: the Theory, Advantages and Solutions

Introduction

The SOLiD™ System is the only next generation sequencing system to employ ligation based chemistry with di-base labelled probes. This unique approach provides significant advantages in terms of system accuracy and downstream data analysis.

• Unique built-in error checking capability distinguishes between measurement errors and true polymorphisms

• Detection of more complicated genetic variation such as adjacent SNPs, insertions, deletions and structural variations

Properties for a 2 Base Color Code Scheme

The color code scheme is based on the Klein four-group, which is the symmetry group of a rectangle. It was designed to have the following properties which enable the unique error checking capability.

• For each di-base the reverse has the same color

• For each di-base the complement has the same color

• For each di-base the reversed complement has the same color

• Two different di-bases that have the same 1st base have different colors

Figure 1: SOLiD Color Space Code (axes: 1st base, 2nd base). Double interrogation: each base is defined twice.

Figure 2: Using the Klein four-group to Obtain the Code for Strings of Colors as Transformations of Bases

Advantages of 2 base encoding and Color Space

Two base encoding provides higher system accuracy and built-in error checking capability to discriminate between measurement error and true sequence variation. Since each base is interrogated twice the information about each base is included in two adjacent pieces of color space data. In the simplest case a single nucleotide polymorphism will always alter two adjacent colors, and detection of a single color change compared to a reference sequence is therefore likely to be due to measurement error. Additionally some of the color space rules can be applied to detect error during de novo assembly.

Analysis of Color Space

There are an increasing number of tools available for the analysis of SOLiD color space data within a broad range of applications, including resequencing, transcriptomics, epigenomics and de novo assembly. Applied Biosystems provide a range of open source and proprietary tools to enable the researchers to perform the data analysis. All of these tools will use the color space rules to identify measurement error and provide corrected sequence in base space. In addition, we are working with both academic and commercial partners to provide additional tools so the scientists have a range of solutions that take advantage of the benefits of using color space.

ISMB Conference 2009, Stockholm, Sweden. Applied Biosystems Technology Track Session

ABI White Paper: Color Space Analysis in the SOLiD System: the Theory, Advantages and Solutions

T10303011 → TGGCCGGTG

T10203011 → TGGAATTGT

• Format unique to ABI SOLiD.

• Each number (color) encodes a pair of adjacent bases

• Each nucleotide is read twice

• A single miscalled color changes the whole decoded sequence downstream of it (compare the two examples above).

• Some software does not support this format.
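The two decoded examples above can be reproduced with a short sketch. The table below is the usual SOLiD two-base code (0 keeps the base, 1 swaps A/C and G/T, 2 swaps A/G and C/T, 3 swaps A/T and C/G); the function name is hypothetical, and the leading known base is kept in the output only because the slide keeps it.

```python
# For each color, the base that follows a given previous base.
NEXT_BASE = {
    "0": {"A": "A", "C": "C", "G": "G", "T": "T"},
    "1": {"A": "C", "C": "A", "G": "T", "T": "G"},
    "2": {"A": "G", "G": "A", "C": "T", "T": "C"},
    "3": {"A": "T", "T": "A", "C": "G", "G": "C"},
}


def decode_colorspace(read):
    """Translate a color-space read (known first base + colors) into bases."""
    base = read[0]
    bases = [base]
    for color in read[1:]:
        # Each color only tells how to get from the previous base to the next,
        # so one wrong color shifts every base decoded after it.
        base = NEXT_BASE[color][base]
        bases.append(base)
    return "".join(bases)


print(decode_colorspace("T10303011"))
print(decode_colorspace("T10203011"))
```

The two inputs differ in a single color, yet every base after that position differs in the output, which is why mappers for SOLiD data prefer to align in color space.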

FASTQ Format

@SRR013343.216 :3:1:837:436
GCGTGGTATAGGAGGCGGAACGGGCGGTTGGCGGTT
+
I6IIII*II*II+I:+&I)I'&%&%,+0>+'I''[email protected] :3:1:974:526
GCGCATGAGTGGCTTGACTCGTATGCGGATTCCTTC
+
I@II6I<I/III;II+)I*II*DI*I?')+*+8/%[email protected] :3:1:755:341
GTGGAGTAGGTTAGTTGCGGATCGTATGCCGTCTTC
+
IIIIIIIIIIAIIIIII<II6?II3/AD26=:-9I'

• One read consists of four lines: the sequence name (starting with “@”), the sequence, a “+” separator line, and the quality scores.
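A FASTQ file can be read four lines at a time. The following is a minimal sketch that assumes unwrapped records (exactly four lines per read); the function name and the read name `read1` are hypothetical, while the sequence and quality string are taken from the last read shown above.

```python
from io import StringIO


def read_fastq(handle):
    """Yield (name, sequence, quality) from a FASTQ file with 4 lines per read."""
    while True:
        name_line = handle.readline().rstrip("\n")
        if not name_line:
            break  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the "+" separator line carries no information here
        quality = handle.readline().rstrip("\n")
        yield name_line[1:], sequence, quality  # drop the leading "@"


# "read1" is a made-up name used only for this illustration.
example = StringIO(
    "@read1\n"
    "GTGGAGTAGGTTAGTTGCGGATCGTATGCCGTCTTC\n"
    "+\n"
    "IIIIIIIIIIAIIIIII<II6?II3/AD26=:-9I'\n"
)
records = list(read_fastq(example))
print(records[0][0], len(records[0][1]), len(records[0][2]))
```

For a real file, pass an open file object instead of the `StringIO` example; the sequence and quality lines of each read always have the same length.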

PHRED quality encoding

• Q=20: 99% accuracy, Q=30: 99.9% accuracy

• The quality value scale differs slightly between PHRED and Illumina/SOLiD results

• In FASTQ and SAM, each quality value is stored as one character: Q = ASCII code − 33

• The Illumina pipeline versions 1.3 to 1.7 use an offset of 64 instead (Q = ASCII code − 64); version 1.8 and later returned to 33.

$Q = -10 \log_{10} P \iff P = 10^{-Q/10}$

Character and ASCII code:

!  33    '  39    -  45    3  51    9  57    ?  63   ...
"  34    (  40    .  46    4  52    :  58    @  64   ...
#  35    )  41    /  47    5  53    ;  59    A  65   ...
$  36    *  42    0  48    6  54    <  60    B  66   ...
%  37    +  43    1  49    7  55    =  61    C  67   ...
&  38    ,  44    2  50    8  56    >  62    D  68   ...
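The encoding can be sketched in a few lines. The function names are hypothetical; the offset is a parameter because, as noted above, it depends on which pipeline produced the file.

```python
def phred_scores(quality, offset=33):
    """Convert a FASTQ/SAM quality string into a list of quality values Q."""
    return [ord(character) - offset for character in quality]


def error_probability(q):
    """P = 10^(-Q/10): the probability that the base call is wrong."""
    return 10 ** (-q / 10)


# "I", "5" and "+" have ASCII codes 73, 53 and 43 (see the table above).
print(phred_scores("I5+"))
print(error_probability(20), error_probability(30))
```

If decoded values come out implausibly high with offset 33, the file may use offset 64; checking the smallest character in the file is a common way to guess the encoding.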

Sequencerʼs Output

Mapping Program

Genome Sequence

Mapping Result

Visualization

Sequence Format

Output Format

BWA, Bowtie, etc.


SAM Format

• Sequence Alignment / Map format

• Simple tab-delimited text file

• Standardized alignment output format

• Modern alignment tools support this format

• BAM is the binary version of the SAM format.

@HD VN:1.0
@SQ SN:chr20 LN:62435964
@RG ID:L1 PU:SC_1_10 LB:SC_1 SM:NA12891
@RG ID:L2 PU:SC_2_12 LB:SC_2 SM:NA12891
read_28833_29006_6945 99 chr20 28833 20 10M1D25M = 28993 195 \
  AGCTTAGCTAGCTACCTATATCTTGGTCTTGGCCG <<<<<<<<<<<<<<<<<<<<<:<9/,&,22;;<<< \
  NM:i:1 RG:Z:L1
read_28701_28881_323b 147 chr20 28834 30 35M = 28701 -168 \
  ACCTATATCTTGGCCTTGGCCGATGCGGCCTTGCA <<<<<;<<<<7;:<<<6;<<<<<<<<<<<<7<<<< \
  MF:i:18 RG:Z:L2

Overview

<QNAME> <FLAG> <RNAME> <POS> <MAPQ> <CIGAR> <MRNM> <MPOS> \
  <ISIZE> <SEQ> <QUAL> [<TAG>:<VTYPE>:<VALUE> [...]]

read_28833_29006_6945 99 chr20 28833 20 10M1D25M = 28993 195 \
  AGCTTAGCTAGCTACCTATATCTTGGTCTTGGCCG <<<<<<<<<<<<<<<<<<<<<:<9/,&,22;;<<< \
  NM:i:1 RG:Z:L1
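Because SAM is tab-delimited, one alignment line can be split into the eleven mandatory fields plus optional tags. This is a minimal sketch with hypothetical names, using the field names from the overview above; it assumes the columns are separated by tabs, as in a real SAM file.

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "MRNM", "MPOS", "ISIZE", "SEQ", "QUAL"]
INTEGER_FIELDS = {"FLAG", "POS", "MAPQ", "MPOS", "ISIZE"}


def parse_sam_line(line):
    """Split one SAM alignment line into a dict of named fields."""
    columns = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, columns[:11]))
    for field in INTEGER_FIELDS:
        record[field] = int(record[field])
    record["TAGS"] = columns[11:]  # optional TAG:VTYPE:VALUE entries
    return record


# The first alignment from the slide, rebuilt with tab separators.
line = "\t".join([
    "read_28833_29006_6945", "99", "chr20", "28833", "20", "10M1D25M",
    "=", "28993", "195",
    "AGCTTAGCTAGCTACCTATATCTTGGTCTTGGCCG",
    "<<<<<<<<<<<<<<<<<<<<<:<9/,&,22;;<<<",
    "NM:i:1", "RG:Z:L1",
])
record = parse_sam_line(line)
print(record["RNAME"], record["POS"], record["CIGAR"], record["TAGS"])
```

Header lines (those starting with “@”) have a different layout and should be skipped before calling such a function.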

Flag

• Bitwise notation: computer friendly (human non-friendly format :)

• 16 = 0x0010: mapped to the reverse strand

• 4 = 0x0004: unmapped

• 0 = 0x0000: mapped to the forward strand (no bit is set)
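A flag is easier to read once it is split into its bits. The sketch below lists the lower eight bits as defined in the SAM specification; the function name and the wording of the descriptions are my own.

```python
SAM_FLAG_BITS = [
    (0x1, "read is paired"),
    (0x2, "mapped in a proper pair"),
    (0x4, "read is unmapped"),
    (0x8, "mate is unmapped"),
    (0x10, "read maps to the reverse strand"),
    (0x20, "mate maps to the reverse strand"),
    (0x40, "first read in the pair"),
    (0x80, "second read in the pair"),
]


def decode_flag(flag):
    """List the meaning of every bit that is set in a SAM FLAG value."""
    return [description for bit, description in SAM_FLAG_BITS if flag & bit]


for flag in (0, 4, 16, 99, 147):
    print(flag, decode_flag(flag))
```

The values 99 and 147 are the flags of the two paired reads in the SAM example shown earlier; 0 sets no bit at all, which is why it means an unpaired read mapped to the forward strand.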

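CIGAR

• Shows the alignment result compactly

• 8M9D7M

• 8 bp match, 9 bp deletion (reference bases that are absent from the read), and then 7 bp match

• In SAM, “I” marks bases present in the read but not in the reference; a gap in the read, as drawn below, is written “D”.

GATGCTAAGCATATGCGAGGCATGCCATATGGATG   (reference)
         ||||||||         |||||||
         CATATGCG---------ATATGGA     (read)
         8M      9D       7M

• The 4th column, “POS”, gives the leftmost reference position of the alignment (here the “C” at position 10).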
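A CIGAR string can be split into (length, operation) pairs with a regular expression. The sketch below uses hypothetical function names and the CIGAR value `10M1D25M` from the SAM example shown earlier; which operations consume read bases and which consume reference bases follows the SAM specification.

```python
import re

CIGAR_PATTERN = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string into a list of (length, operation) pairs."""
    return [(int(length), op) for length, op in CIGAR_PATTERN.findall(cigar)]


def reference_span(cigar):
    """Number of reference bases covered (M, D, N, = and X consume reference)."""
    return sum(length for length, op in parse_cigar(cigar) if op in "MDN=X")


def read_length(cigar):
    """Number of read bases described (M, I, S, = and X consume the read)."""
    return sum(length for length, op in parse_cigar(cigar) if op in "MIS=X")


cigar = "10M1D25M"
print(parse_cigar(cigar), reference_span(cigar), read_length(cigar))
```

With POS and the reference span, the end coordinate of an alignment is POS + span − 1, which is what most downstream tools need for counting reads per region.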

Summary

• There are no standard tools for analyzing NGS data

• Q&A sites are good resources

• SeqAnswers.com

• biostar.stackexchange.com

• Many algorithms and software tools have been developed.

• See http://www.oxfordjournals.org/our_journals/bioinformatics/nextgenerationsequencing.html

• Most of them work on the UNIX command line

• Few analysis tools have a GUI

• Galaxy (Free, requires server setup)

• BioScope (Only available with SOLiD sequencer)

Sequencerʼs Output

Mapping Program

Genome Sequence

Mapping Result

Visualization

Sequence Format

Output Format

BWA, Bowtie, etc.

Unix Commands

Performed with UNIX commands


Preparation

• An NGS analysis generates many files.

• Even in this lecture, we will generate 50 files.

• We use the directory generated by extracting “ngslec.zip”.

• Extract the zip file in your home directory.

• To move to the directory, type the following commands in Terminal

$ cd ngslec

$ pwd

/Users/YOUR_DIRECTORY/ngslec/


Use “Terminal”

• The operating system (OS) handles everything that happens on the computer.

• Reading files, mouse clicks, displaying characters, ...

• On a UNIX OS, we can use the OS functions through the application “Terminal”

• Applications > Utilities > Terminal

• UNIX: Linux, IBM AIX, Sun OS, Mac OS X

• Not UNIX: Windows and Mac OS 9 or earlier

• In the terminal, we can use shell commands.

• An application consists of a sequence of shell commands.

• A complicated program is made of a set of tiny programs.

• We start by learning how to use tiny programs, and then how to combine them.

Figure: Terminal / Shell / Kernel

Command and Arguments

(A) Command (order): run a command called “rm”

(B), (C) and (D) Arguments: the command and its arguments are separated from each other by space characters

(B) Arguments that change the behavior of the command are called “options.” Options start with “-” or “--”

(C) First argument. Options are not counted when numbering arguments.

(D) Second argument.

$ rm -r arg1 arg2
  (A) (B) (C) (D)

Example: date command

• Type “date” + [Return] to show the current time

• With the option “-u”, the “date” command shows Coordinated Universal Time.

• If you misspell a command, the terminal says “command not found.”

• Commands (and file names) are case sensitive on UNIX, except on Mac OS X.

File System

• You probably always use the file system through “Finder.” In this lecture, we will use it from “Terminal.”

• Tree structure rooted by “/”

• USB memories and DVDs are also managed through file system.

/

usr Volume

bin lib pics

USB zurich

Directories and Files

• Current directory

• Directory in which you are working

• You can check it with the “pwd” command.

• Home directory

• Root (top) of your personal directory

• Denoted by “~” or “$HOME”

• When your current directory is “/Users/sesejun”

• The pwd command shows /Users/sesejun

• /usr/lib indicates * (absolute path, starting from “/”)

• usr/lib indicates ** (relative path, starting from the current directory)

• “.” is equal to “/Users/sesejun”

• “..” is equal to “/Users”

• ../../usr/lib is equal to “/usr/lib”

/
|-- usr
|   |-- bin
|   `-- lib          (*)
`-- Users
    `-- sesejun
        `-- usr
            `-- lib  (**)

cd: Change Directory

• cd destination-dir

• Move your current directory to destination-dir

• When you omit the argument, you move to your home directory.

jsmbp:~ sesejun$ pwd
/Users/sesejun
jsmbp:~ sesejun$ cd /usr/
jsmbp:/usr sesejun$ pwd
/usr
jsmbp:/usr sesejun$ cd lib
jsmbp:/usr/lib sesejun$ pwd
/usr/lib
jsmbp:/usr/lib sesejun$ cd /usr/bin/
jsmbp:/usr/bin sesejun$ pwd
/usr/bin
jsmbp:/usr/bin sesejun$ cd
jsmbp:~ sesejun$ pwd
/Users/sesejun
jsmbp:~ sesejun$

ls (LiSt): Show List of Files

• Shows the files in the current directory when no arguments are given

• Important options

• -a: Show all files (files whose names start with “.” do not appear without this option)

• -l: Show detailed information about files

• -h: Show file sizes in a human-friendly format (usually used with option “-l”)

$ ls
Desktop Music largefile
$ ls -l
drwx------+ 8 sesejun staff 272 5 16 00:09 Desktop
drwx------+ 3 sesejun staff 102 10 27 2010 Movies
-rw-r--r-- 1 sesejun staff 4181139 5 16 08:20 largefile
$ ls -lh
drwx------+ 8 sesejun staff 272B 5 16 00:09 Desktop
drwx------+ 3 sesejun staff 102B 10 27 2010 Movies
-rw-r--r-- 1 sesejun staff 4.0M 5 16 08:20 largefile

cp: Copy Files

• cp [options] source-file ... directory

• cp [options] source-file new-file

• Copy text1.txt to text2.txt

$ cp text1.txt text2.txt

• Copy text1.txt and text2.txt into the “tmp” directory

$ cp text1.txt text2.txt tmp/
$ ls tmp
text1.txt text2.txt

mv: Move files

• Also used to change file names

• mv [options] source-file ... directory

• mv [options] old-path new-path

• Change the file name text1.txt to text2.txt

$ mv text1.txt text2.txt

• Move text1.txt and text2.txt into the tmp directory

$ mv text1.txt text2.txt tmp/
$ ls
tmp
$ ls tmp/
text1.txt text2.txt

rm (ReMove): Delete files

• Options:

• -r: Remove a directory together with all the files in it

• -i: Confirm before removing each file.

• Delete text1.txt and text2.txt

jsmbp:~ sesejun$ rm text1.txt text2.txt

• Delete the tmp directory and all the files within it

jsmbp:~/test sesejun$ ls
tmp
jsmbp:~/test sesejun$ ls tmp/
text1.txt text2.txt
jsmbp:~/test sesejun$ rm -r tmp/
jsmbp:~/test sesejun$ ls
jsmbp:~/test sesejun$

• Note: These files are “really” removed. They never go to “Trash.” We cannot undo the removal.

Exercise (1)

• Run commands

• Run date and date -u, and check the results.

• Run the command “cal”. What is the result?

• Change directory

• Run the examples on the “cd” page

• Check make and remove directory

• Open the directory named after your login name in Finder.

• Move to your home directory in Terminal.

• Just open Terminal.

• Run ls and compare the result with what Finder shows.

Note

• Commands and messages in Terminal are written in “Courier Font”

• Lines starting with “#” are comment lines. You do not need to type them in Terminal.

• A line whose last character is “\” continues on the next line. Enter such lines as one line.

• You can run commands with “cut and paste.”

• When you do so, double quotation (“) characters may cause trouble because the pasted character differs from the one the shell expects. Retyping the double quotation marks solves the problem.

• Bar (|) can be input by Alt + 7.

• In Terminal, you can go back through the history of your commands by pressing the up cursor key.

• The “Tab” key completes your command or file name.

cat (conCATenate)

• cat [options] file ...

• The original usage is file concatenation.

• Details are shown later

• This command is also often used to show the contents of a file.

• Options:

• -n: show line numbers

$ cat text1.txt
How are you ?
$ cat text2.txt
Hello!
Thank you!
Good Bye!
$ cat text1.txt text2.txt
How are you ?
Hello!
Thank you!
Good Bye!
$ cat -n text2.txt
     1  Hello!
     2  Thank you!
     3  Good Bye!

head, tail (Show first or last part of file)

• head [-n num] file ...

• Show the first 10 lines

• -n num: show the first num lines

• tail [-n num] file ...

• Show the last 10 lines

• -n num: show the last num lines

• By setting +num, you can see the file from the num-th line to the last line.

• Because NGS files are large, these commands are used frequently.

• Most editors cannot open NGS files.

$ cat text2.txt
Hello!
Thank you!
Good Bye!
$ head -n2 text2.txt
Hello!
Thank you!
$ tail -n2 text2.txt
Thank you!
Good Bye!
$ tail -n+3 text2.txt
Good Bye!

less

• less <filename>

• Show files interactively

• Space: Next page

• ‘b’: Previous page

• ‘q’: Quit

• ‘/’ + [word]: search for [word] and go to the first matching place. The word is highlighted.

• To move to the next match, press ‘n’.

• Frequently used to check the contents of a (large) file such as a FASTA file

cut -Show columns-

• cut [options] file ...　

• Show selected columns

• Options:

• -f <list of nums>: Show <list of nums>-th columns. We can use -d option to set separator between columns. Default separator is “\t (Tab).”

• -c <list of nums>: Show <list of nums>-th characters.

• Examples of “list of nums”

• 1,3,5: 1st, 3rd and 5th columns

• 1-5: From 1st to 5th columns

• 1,3,5-: 1st, 3rd and from 5th to last columns.

• This command is also frequently used to handle NGS files.

sort

• sort [options] file ...

• Arrange the lines of a file in alphabetical order

• Options:

• -r: reverse order

• -n: order by numerical value

• -k POS: order according to the POS-th column. By default, columns are separated by whitespace (spaces or tabs). We can change the separator with the “-t” option.

$ cat text2.txt
Hello!
Thank you!
Good bye!
$ sort text2.txt
Good bye!
Hello!
Thank you!
$ sort -r text2.txt
Thank you!
Hello!
Good bye!

$ cat nums.tab
11.2 13.2
10.9 7.7
15.2 7.0
9.4 10.9
8.8 9.1
$ cut -f1 nums.tab
11.2
10.9
15.2
9.4
8.8
$ cut -f1 -d . nums.tab
11
10
15
9
8
$ cut -c1-3 nums.tab
11.
10.
15.
9.4
8.8

$ cat nums.tab
11.2 13.2
10.9 7.7
15.2 7.0
9.4 10.9
8.8 9.1
$ sort -n nums.tab
8.8 9.1
9.4 10.9
10.9 7.7
11.2 13.2
15.2 7.0
$ sort -n -k2 nums.tab
15.2 7.0
10.9 7.7
8.8 9.1
9.4 10.9
11.2 13.2
$ sort nums.tab
10.9 7.7
11.2 13.2
15.2 7.0
8.8 9.1
9.4 10.9

Exercise (2)

• Create two files, “text1.txt” and “text2.txt”

• Run the cat, head and tail commands following the examples.

• Create the file “nums.tab”

• The character between numbers (columns) is a “tab.”

• Test the cut and sort commands following the examples.

Redirect (>)

• command > file

• Save the command result into “file.”

• Overwrites the contents of the file.

• The following command saves the result of “sort -n nums.tab” into “nums_sort.tab”

$ sort -n nums.tab > nums_sort.tab

• command >> file

• Append the command result to “file.”

$ sort -n nums.tab >> nums_sort.tab

Pipe (|)

• command1 | command2

• Run command2 on command1’s result

$ sort -n nums.tab
8.8 9.1
9.4 10.9
10.9 7.7
11.2 13.2
15.2 7.0
$ sort -n nums.tab | cat -n
     1  8.8 9.1
     2  9.4 10.9
     3  10.9 7.7
     4  11.2 13.2
     5  15.2 7.0
$ sort -n nums.tab | cat -n | head -n2
     1  8.8 9.1
     2  9.4 10.9

$ sort -n nums.tab | cat -n
produces the same result as
$ sort -n nums.tab > nums_sort.tab
$ cat -n nums_sort.tab

Commands used with pipe

• sort, cut

• less

• wc [options] file...

• Word Count

• Shows the number of lines, words and characters.

$ sort nums.tab | less
$ wc nums.tab
       5      10      45 nums.tab
# columns: #lines #words #chrs
$ wc -l nums.tab
       5 nums.tab
# -l: show only the number of lines

gzip and bzip2

• Source code and sample datasets are usually distributed as tar and gzip/bzip2 files.

• For a single file, gzip/bzip2 alone is used.

• “tar” generates a single file containing multiple files and folders.

• gzip/bzip2 compress a file.

• gzip is the most frequently used. bzip2 files are usually smaller than gzip files.

$ ls -lh chr21.fa.gz
-rw-r--r-- 1 sesejun sesejun 12M May 20 15:09 chr21.fa.gz
$ gzip -d chr21.fa.gz    # Decompress chr21.fa.gz and generate chr21.fa
$ ls -lh chr21.fa
-rw-r--r-- 1 sesejun sesejun 47M May 20 15:09 chr21.fa
$ gzip chr21.fa          # Compress
$ ls -lh chr21.fa.bz2
-rw-r--r-- 1 sesejun sesejun 9.7M May 20 15:09 chr21.fa.bz2

tar (Tape ARchive)

• Generates a single file containing files and folders.

• Frequently used with gzip/bzip2

• Remember the following idioms!

• We will use them to install programs to analyze NGS data.

with gzip

1. $ gzip -dc file.tar.gz | tar xvf -

2. $ tar zxvf file.tar.gz

with bzip2

1. $ bzip2 -dc file.tar.bz2 | tar xvf -

• Older versions of tar have no option to decompress bzip2; GNU tar accepts the “j” option (tar jxvf file.tar.bz2).

grep (g/re/p)

• grep [options] pattern file ...

• Print lines matching the pattern

• Options:

• -v: print non-matching lines

• -e <regular expression>: select lines with a regular expression

• Regular expression

• A pattern language for describing character sequences

• ^: The beginning of a line

• $: The end of a line

• Supported by most programming languages. Very useful for handling various formats, including DNA/protein sequences.

$ cat nums.tab
11.2 13.2
10.9 7.7
15.2 7.0
9.4 10.9
8.8 9.1
$ grep "7" nums.tab
10.9 7.7
15.2 7.0
$ grep -v "7" nums.tab
11.2 13.2
9.4 10.9
8.8 9.1
$ grep -e "^1" nums.tab
11.2 13.2
10.9 7.7
15.2 7.0

Exercise (3)

• Use “TAIR10_chr1.fas”

• A. thaliana chromosome 1 sequence

• Select the annotation lines from the FASTA file.

• FASTA format

• A line starting with “>” is the annotation of a sequence.

• The lines following the annotation contain the nucleotide or amino acid sequence.

• To select the annotations, select the lines starting with “>”

• Count the number of nucleotides in a (multi-)FASTA file

• Lines containing nucleotides do not start with “>”

• Number of nucleotides = number of characters

• Use the “wc” command

• Note that each line ends with a “Return” (newline) character, which wc also counts

>gi|29028877|gb|BT005883|U23535
ATGGAAAGCAAAGGAAGAATCCATCCATCTCATCATCATATGAGGCGTCCTCTTCCAGGTCCCGGTGGCTGTATAGCGCATCCGGAGACTTTCGGTAATCACGGTGCTATACCACCTTCTGCTGCTCAAGGTGTGTATCCTTCCTTCAACATGTTACCTCCACCTGAAGTTATGGAGCAAAAGTTTGTGGCACAACACGGGGAATTACAGAGACTTGCTATAGAGAATCAGAGACTTGGT
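The same counting logic can be sketched in Python, which makes the two pitfalls explicit: annotation lines must be skipped and newline characters must not be counted. The function names and the toy records are hypothetical; the exercise itself is meant to be solved with grep and wc.

```python
def annotations(lines):
    """Return the annotation lines (those starting with '>') without the '>'."""
    return [line.rstrip("\n")[1:] for line in lines if line.startswith(">")]


def count_nucleotides(lines):
    """Count sequence characters, skipping annotation lines and newlines."""
    total = 0
    for line in lines:
        if line.startswith(">"):
            continue  # annotation line, not sequence
        total += len(line.strip())  # strip() drops the trailing newline
    return total


# A toy multi-FASTA input; for a real file, pass an open file object instead,
# e.g. open("TAIR10_chr1.fas").
toy_fasta = [
    ">seq1 first toy record\n",
    "ATGGAAAGCA\n",
    "AAGG\n",
    ">seq2 second toy record\n",
    "GATTACA\n",
]
print(annotations(toy_fasta))
print(count_nucleotides(toy_fasta))
```

Comparing this count with the raw character count of the sequence lines shows the difference: one extra character per line for the newline.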

Let’s start NGS analysis!

• Dataset

• TAIR 10 genome (A.thaliana)

• 1/100 scale SOLiD RNA-Seq reads sets

• Filenames: tha_reads.csfasta & tha_reads_QV.qual

• SRR038985: 41,117,124 reads, 1,439,099,340 bp

• http://trace.ddbj.nig.ac.jp/DRASearch/experiment?acc=SRX018529

• Filenames: lyr_reads.csfasta & lyr_reads_QV.qual

• SRR038987: 41,340,154 reads, 1,446,905,390 bp

• http://trace.ddbj.nig.ac.jp/DRASearch/experiment?acc=SRX018531

• 1/10 scale Roche 454 Read Set (SRR020799)

$ grep -e "^>" tha_reads.csfasta | wc -l
411171








Installing BWA

• In this lecture, we skip this procedure because our computers do not have the “gcc” command needed to compile C programs.

• Download BWA

• http://bio-bwa.sourceforge.net/

• bwa-0.5.8c.tar.bz2 is on the USB memory. Copy the file.

• Extract the file

• Move into the BWA directory

• Compile the source programs

• Make the alias name “bwa” for the bwa-0.5.8c directory

# $ curl -O \
#   http://switch.dl.sourceforge.net/project/bio-bwa/bwa-0.5.8c.tar.bz2
# $ bzip2 -dc bwa-0.5.8c.tar.bz2 | tar xvf -
# ...filenames...
# $ ln -s bwa-0.5.8c bwa # Simplify the directory name
# $ cd bwa
# $ make
# ...compile messages...
# $ cd .. # back to working directory

Prepare A. thaliana Genome

• Download the chromosomes from the TAIR site

• http://www.arabidopsis.org/

• Find the URLs by selecting the “Download” tab > Sequences > whole_chromosomes

• In the current version, each file contains one chromosome.

• TAIR10_chr1.fas, TAIR10_chr2.fas, TAIR10_chr3.fas, TAIR10_chr4.fas, TAIR10_chr5.fas, TAIR10_chrC.fas, TAIR10_chrM.fas

• Because of limited server and network capacity, these files are distributed by USB memory or web site for this lecture.

• Concatenate these chromosomes, except chloroplast and mitochondria, into a single file

# We skip this process
#$ curl -O "ftp://ftp.arabidopsis.org/home/tair/Sequences/whole_chromosomes/TAIR10_chr[1-5].fas"
## 1-5 means consecutive numbers from 1 to 5.
## We do not use chloroplast and mitochondria genomes.
# Instead of the download, we use the files in USB.
# The files are in your working directory.
# Check it by below command.
$ ls TAIR10*
TAIR10_chr1.fas TAIR10_chr3.fas TAIR10_chr5.fas
TAIR10_chr2.fas TAIR10_chr4.fas
# Concatenate all chromosomes into single file
$ cat TAIR10_chr1.fas TAIR10_chr2.fas TAIR10_chr3.fas TAIR10_chr4.fas TAIR10_chr5.fas > TAIR10_chr_all.fas
# Check the result
$ grep -e "^>" TAIR10_chr_all.fas
>Chr1 CHROMOSOME dumped from ADB: Jun/20/09 14:53; last updated: 2009-02-02
>Chr2...
# You can find 5 chromosomes' annotations

Run BWA

• Make an index of the genome sequence

• For SOLiD reads, the “-c” option is required.

• This step is needed only once as long as you use the same genome (it does not depend on the read sequences).

• Convert the color-space reads into the BWA-specific format

• You don’t need this step for Illumina reads.

• Illumina sequencers produce FASTQ files, and most alignment software can handle them directly.

• Map the reads against the genome sequence

• If you use Illumina reads, the -I option may be required. Check your Illumina pipeline version.

• The steps above may take a long time. This lecture’s toy data is 1/100 scale; real data will require more than two hours.

$ ./bwa/bwa index -c TAIR10_chr_all.fas
# running messages. Takes more than 3 mins.
$ python csfasta2fastq.py --bwa tha_reads > tha_reads.bwa
$ ./bwa/bwa aln -c TAIR10_chr_all.fas tha_reads.bwa > tha_reads.sai
# messages...about 1min. Alignment phase.

Run BWA (continued)

• Convert the mapping result into SAM format.

• For paired-end experiments, you have to use “sampe” instead of “samse” so that the mate pair information is written into the SAM file.

• That’s all! Check the contents of the SAM file with the less command.

• How many reads can be mapped against the genome?

$ ./bwa/bwa samse TAIR10_chr_all.fas tha_reads.sai tha_reads.bwa > tha_reads.sam
# messages. Generate summary of alignment.
# If you have paired-end reads, you can use sampe instead of samse.

$ less tha_reads.sam
# Press "q" to quit less command.
# Next page is "space"

Inside of SAM file

@SQ SN:Chr1 LN:30427671
@SQ SN:Chr2 LN:19698289
@SQ SN:Chr3 LN:23459830
@SQ SN:Chr4 LN:18585056
@SQ SN:Chr5 LN:26975502
@PG ID:bwa PN:bwa VN:0.5.9-r16
SRR038985.100 0 Chr5 22828962 37 33M * 0 0 GCCGGTGATGTAATCAAAATATTTGCTACTCTT WZYTWWTW\]YVUOW]OEKNUUX]PJSRY][63 XT:A:U CM:i:0 X0:i:1 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:33
SRR038985.200 0 Chr3 14197678 0 33M * 0 0 ACCTGGTTGATCCTGCCAGTAGTCATATGCTTG X]]KN]]YWUX]XIKYRCHSUYX[[SNQJL[MO XT:A:R CM:i:0 X0:i:2 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:33 XA:Z:Chr2,+3707,33M,0;
SRR038985.300 4 * 0 0 * * 0 0 AAACTGCGGGGTCTCACTTTTTTGGGTTTGGGGT 124,/08/5&6-&,(;/4+%7,+5.:1',*;8:&

• @SQ lines: chromosome (mapped database) information

• @PG line: the program used and its variables

• SRR038985.100 (flag 0): a read mapped in the forward direction on Chr5

• SRR038985.300 (flag 4): an unmapped read

Exercise (4)

• Run BWA

• Compare the file size of the csfasta + qual files with that of the generated SAM file.

• Which is larger? How much disk space do we need for the analysis?

• Check the details of the SAM file

• Format details are described in http://samtools.sourceforge.net/SAM1.pdf

• How many reads are mapped onto chromosomes?

• Select lines containing “Chr” # use grep

• Then, count the number of lines # use wc

• Calculate the ratio of mapped reads to total reads.
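As a cross-check for the grep/wc answer, the mapped ratio can also be computed from the FLAG column. This is a sketch with a hypothetical function name; it counts a read as mapped when bit 0x4 is not set and skips header lines, which a plain search for “Chr” would also match because the @SQ lines contain chromosome names.

```python
def mapped_ratio(sam_lines):
    """Return (mapped, total) read counts from the lines of a SAM file."""
    mapped = total = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line (@SQ, @PG, ...)
        flag = int(line.split("\t")[1])
        total += 1
        if not flag & 0x4:  # bit 0x4 set means the read is unmapped
            mapped += 1
    return mapped, total


# Shortened versions of the lines shown on the "Inside of SAM file" slide;
# for a real file, pass open("tha_reads.sam") instead.
toy_sam = [
    "@SQ\tSN:Chr1\tLN:30427671\n",
    "@PG\tID:bwa\tPN:bwa\tVN:0.5.9-r16\n",
    "SRR038985.100\t0\tChr5\t22828962\t37\t33M\t*\t0\t0\tGCC\tWZY\n",
    "SRR038985.200\t0\tChr3\t14197678\t0\t33M\t*\t0\t0\tACC\tX]]\n",
    "SRR038985.300\t4\t*\t0\t0\t*\t*\t0\t0\tAAA\t124\n",
]
mapped, total = mapped_ratio(toy_sam)
print(mapped, total, mapped / total)
```

Reads with MAPQ 0, such as SRR038985.200 above, are counted as mapped here even though they map equally well to more than one place; whether to keep them is a decision for the filtering step.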





Problems

• The ratio of mapped reads may be much lower than expected.

• Genome quality is (probably) high.

• Various problems

• Wet problems

• Protocols and reagents

• Mitochondria and chloroplast.

• Dry problems

• We used all the sequences. We may need to remove low-quality reads.

• Sequence quality at the 3’-end is low. We might trim these bases.

• We did not take care of reads on splice junctions.

• We did not change any parameters in BWA. The default parameters might not be suitable for our reads.

• No single setting works well for every dataset.

• Note!!! The mapped ratio of current RNA-Seq reads is (much) higher than this result.
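One of the dry-side remedies above, trimming the low-quality 3’-end, can be sketched as follows. The function name and the threshold of Q20 are assumptions for illustration, and the quality string is assumed to be PHRED+33; dedicated trimming tools use more robust rules than this simple scan.

```python
def trim_low_quality_tail(sequence, quality, min_q=20, offset=33):
    """Remove bases from the 3'-end while their quality is below min_q."""
    end = len(sequence)
    while end > 0 and ord(quality[end - 1]) - offset < min_q:
        end -= 1
    return sequence[:end], quality[:end]


# "I" encodes Q40, while "#" and "!" encode Q2 and Q0 with offset 33.
print(trim_low_quality_tail("ACGTAC", "IIII#!"))
```

Reads that become very short after trimming map ambiguously, so a minimum-length filter is usually applied afterwards.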
