
Using BioHPC Lab Software

Qi Sun

Computational Biology Service Unit

Cornell University

3CPG

CBSU/3CPG workshops

Series 1

1. Linux

2. Using the BioHPC lab software

3. BioHPC web site

4. PERL scripting?

Series 2

1. Genetic variation detection

2. RNAseq

Practice in the BioHPC lab

1. You can download the PowerPoint slides and a Word file with step-by-step instructions for two exercises: http://cbsu.tc.cornell.edu/lab/workshops.aspx

2. The data files used for exercises are available on /shared_data/cbsuworkshop_2.

1. SNP/INDEL detection

2. RNA‐seq and ChIP‐seq

3. De‐novo assembly

4. Genotyping‐by‐sequencing

5. Metagenomics

What software is available?

Reference genome dependent

Reference genome independent

Categories:

Not ready yet

Raw data files → Computer software → Excel spreadsheets

Commercial software for next-gen data analysis is not mature yet

A partial list of commercial software:

• Softgenetics NextGENe
• DNAstar Ngen
• GenomeQuest
• CLC Genomics Workbench

Pipeline for SNP/INDEL calling

Step 1: BWA (FASTQ file → SAM file)

Step 2: Samtools view (SAM file → BAM file)

Step 3: Samtools pileup (BAM file → VCF/BCF)

You have more options with open source software

Pipeline for SNP/INDEL calling: the same three steps (BWA, Samtools view, Samtools pileup; FASTQ file → SAM file → BAM file → VCF/BCF), with Bowtie, GATK and Picard as additional open source options.

Software tools installed on the BioHPC labs

BWA
Bowtie/Tophat/Cufflinks
Samtools
Picard
GATK
Annovar
Velvet
gsAssembler
BLAST
BLAT
…


Each project requires a combination of multiple tools

Commonly Used File Formats

Category            File extension  Reference
Sequence            fasta           http://en.wikipedia.org/wiki/FASTA_format
Sequence            fastq           http://en.wikipedia.org/wiki/FASTQ_format
Alignment           SAM/BAM         http://samtools.sourceforge.net/SAM-1.3.pdf
Sequence variation  VCF/BCF         http://www.1000genomes.org/node/101
Genome annotation   gff/gff3        http://gmod.org/wiki/GFF3
Genome annotation   gtf             http://genome.ucsc.edu/FAQ/FAQformat#format4

Most files that you download from a web site are compressed .gz files. Use the gunzip command to decompress the file, e.g.:

gunzip s_1_sequence.txt.gz
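Note that gunzip replaces the .gz file with the decompressed file. If you would rather keep the compressed original as well, a small helper along the following lines can be used. This is only a sketch; the function name decompress_keep is made up for this illustration.

```shell
# Hypothetical helper: decompress FILE.gz to FILE but keep FILE.gz.
# "gzip -dc" writes the decompressed data to standard output.
decompress_keep() {
    gz_file=$1
    out_file=${gz_file%.gz}    # strip the .gz suffix
    gzip -dc "$gz_file" > "$out_file"
}

# Example: decompress_keep s_1_sequence.txt.gz
```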

(Figure: names of file formats and tools: SAM, BAM, VCF, BCF, FASTQ, GFF, GFF3, GTF, BED, gz, BWA, Bowtie, Tophat, Cufflinks, GATK, Samtools, Bamtools, VCFtools, Velvet, IGV, PERL)

User guide and FAQ for BioHPC lab: http://cbsu.tc.cornell.edu/lab/use.aspx

We welcome suggestions and contributions to our documentation library.

Designing a pipeline

RNAseq:
1. Splicing junction identification
2. Alignment to reference genome and junctions
3. Quantify reads aligned to each gene
4. Identify novel transcriptionally active regions

SNP/INDEL detection:
1. Alignment to reference genome
2. Filter out PCR duplicates
3. Realignment around INDELs
4. Variation calling
5. Filtering
6. Annotation

Selection of alignment tools

• Gapped or ungapped alignment: BWA is gapped; Bowtie is ungapped.

• Dealing with ambiguous hits: BWA and Bowtie report ambiguous hits in different ways.

• Alignment to splicing junctions: use Tophat (canonical splicing sites only).

CBSU recommended alignment tools:

BWA (for genomic sequencing)
Tophat (for RNA-seq)

Get reference genome and annotation files

Use the UCSC site to download the genome fasta files: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/

(Use the “cat chr* > allchr.fa” command to concatenate the individual chromosomes into one file.)

Use the UCSC Table Browser to create the GTF file: http://genome.ucsc.edu/cgi-bin/hgTables?command=start
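After concatenating the chromosomes, it is worth checking that the combined file holds the expected number of sequences. The following is a minimal sketch, assuming the per-chromosome files are named chr*.fa and sit in one directory; the function name concat_chromosomes is hypothetical.

```shell
# Hypothetical helper: concatenate all chr*.fa files in DIR into OUT
# and print the number of sequences (FASTA header lines) written.
concat_chromosomes() {
    dir=$1
    out=$2
    cat "$dir"/chr*.fa > "$out"
    grep -c '^>' "$out"
}

# Example: concat_chromosomes hg19_chroms allchr.fa
```

The shell expands chr* in alphabetical order (chr1, chr10, chr11, ..., chr2), so the chromosomes in the combined file are not in numeric order.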

What to do after you receive the data?


Fastq file: post QC filter
Qseq file: pre QC filter

Transfer the files to the workstation

• Use WinSCP (Windows) or Fetch (Mac) to move files to the workstation.

• Decompress the file using the gunzip command:

gunzip s_3_sequence.txt.gz

Check the data quality (optional)

Run the fastx_quality_stats tool (FASTX-Toolkit) to get the quality report.

fastx_quality_stats -i s_3_sequence.txt -o stat_report.xls &
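Before looking at the quality report, a quick sanity check is to count the reads in the file and compare the number with what the sequencing facility reported. This sketch assumes the usual FASTQ layout of exactly four lines per read; the function name count_fastq_reads is made up for this example.

```shell
# Hypothetical helper: print the number of reads in a FASTQ file,
# assuming exactly 4 lines per read.
count_fastq_reads() {
    lines=$(wc -l < "$1")
    echo $((lines / 4))
}

# Example: count_fastq_reads s_3_sequence.txt
```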

FASTX output

Decode the multiplexed data file

1. Use the FASTX-Toolkit (fastx_barcode_splitter.pl):

cat s_3_sequence.txt | fastx_barcode_splitter.pl --bcfile mybarcodes.txt --bol --mismatches 0 --prefix mysample

Note: This tool cannot be applied to paired-end reads. If you have paired-end reads, you can use the NovoBarCode tool, which is free for academic users. http://www.novocraft.com/userfiles/file/NovoBarcode.pdf

2. Buckler Lab GBS barcode system:

GBS_barcode.pl yourBarCodeFile yourEnzymeFile

Note: Documentation for GBS_barcode.pl is on the workstation (/shared_data/doc/GBS_barcode.pdf).
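The barcode file read by fastx_barcode_splitter.pl is a plain text file with one barcode per line: an identifier, a TAB character, and the barcode sequence. A file with spaces instead of TABs is an easy mistake to make, so a check like the one below may save a failed run. It is only a sketch, and the function name check_barcode_file is hypothetical.

```shell
# Hypothetical helper: succeed only if every line of the barcode file
# has exactly two TAB-separated fields and the barcode uses only A/C/G/T.
check_barcode_file() {
    awk -F '\t' '
        NF != 2 || $2 !~ /^[ACGT]+$/ {
            printf "line %d is not \"name<TAB>barcode\": %s\n", NR, $0
            bad = 1
        }
        END { exit bad }
    ' "$1"
}

# Example: check_barcode_file mybarcodes.txt && echo "barcode file looks fine"
```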

Alignment with BWA

1. Reformat and index the genome fasta file.

bwa index -a bwtsw maize.fa &

2. Do alignment.

bwa aln -n 2 -t 2 maize.fa s_1_sequence.txt > s_1.sai &

bwa samse -n 5 maize.fa s_1.sai s_1_sequence.txt > s_1.sam &

Manual: http://bio-bwa.sourceforge.net/bwa.shtml#3
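Each BWA command needs the output of the previous one, so when the commands are started in the background with &, wait for one to finish before starting the next. If you prefer to launch the alignment as a single job, the two alignment commands can be chained. The wrapper below is a sketch with a made-up function name (bwa_single_end); it assumes the genome has already been indexed with bwa index.

```shell
# Hypothetical wrapper around the two alignment commands shown above.
# Usage: bwa_single_end GENOME.fa READS.txt PREFIX
# Assumes "bwa index" has already been run on GENOME.fa.
# "&&" starts samse only after aln has finished successfully.
bwa_single_end() {
    genome=$1
    reads=$2
    prefix=$3
    bwa aln -n 2 -t 2 "$genome" "$reads" > "$prefix.sai" &&
    bwa samse -n 5 "$genome" "$prefix.sai" "$reads" > "$prefix.sam"
}

# Example: bwa_single_end maize.fa s_1_sequence.txt s_1 &
```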

Alignment with Tophat

1. Reformat and index the genome fasta file.

bowtie-build maize.fa maize &

2. Do alignment (with or without annotation).

tophat -p 3 -o s1_guided -G ZmB73_5a_WGS.gtf --no-novel-juncs maize s_1_sequence.txt &

tophat -p 3 -o s1_unguided maize s_1_sequence.txt &

Manual: http://bio-bwa.sourceforge.net/bwa.shtml#3

Convert SAM to BAM

samtools view -bS -o s_1.bam s_1.sam

samtools sort s_1.bam s_1.sorted

samtools index s_1.sorted.bam

• samtools view: sam2bam

• samtools sort: sort by genome coordinates or read names

• samtools index: index the BAM file
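The three samtools commands above have to run in order, each on the output of the previous one. A possible way to run them as one unit is sketched below; the function name sam_to_sorted_bam is hypothetical, and the sort command uses the same syntax as on this slide, where the second argument is an output prefix.

```shell
# Hypothetical wrapper for the three samtools commands shown above.
# Usage: sam_to_sorted_bam PREFIX   (reads PREFIX.sam)
# With the sort syntax used here, the second argument of "samtools sort"
# is an output prefix, so the sorted file is PREFIX.sorted.bam.
sam_to_sorted_bam() {
    prefix=$1
    samtools view -bS -o "$prefix.bam" "$prefix.sam" &&
    samtools sort "$prefix.bam" "$prefix.sorted" &&
    samtools index "$prefix.sorted.bam"
}

# Example: sam_to_sorted_bam s_1
```

Newer samtools releases (1.x) expect the sorted output file to be given with -o instead of an output prefix, so check the usage message of the version installed on your workstation.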

Visualize the sequence alignment

How to use IGV:

1. Copy the .bam and .bai files, as well as the genome fasta and gtf files, to your local computer.

2. Launch the IGV program from http://www.broadinstitute.org/software/igv/download

3. Load the files.

Use Cufflinks to assemble/quantify RNA‐seq

Single sample:

cufflinks -G ZmB73_5a_WGS.gtf s1_tophat.bam

Multiple samples:

cuffdiff ZmB73_5a_WGS.gtf s1_tophat.bam s5_tophat.bam

Cufflinks output

trans_id bundle_id chr left right FPKM FMI frac FPKM_conf_lo FPKM_conf_hi coverage length effective_length status
GRMZM2G060082_T01 99289 1 2 3807 2.05938 0.507199 0.576667 0 5.04574 0.25115 2804 2769 OK
GRMZM2G060082_T02 99289 1 2606 3754 4.0603 1 0.423333 0 8.65942 0.495171 1066 1031 OK
GRMZM2G059865_T01 99290 1 4853 9652 15.6517 1 0.471931 7.73925 23.5641 1.90879 1966 1931 OK
GRMZM2G059865_T03 99290 1 4856 6355 4.18E-09 2.67E-10 7.70E-11 0 0.000129 5.10E-10 1214 1179 OK
GRMZM2G059865_T02 99290 1 4856 9652 14.2274 0.909003 0.528069 6.68358 21.7713 1.7351 2412 2377 OK
GRMZM2G059856_T01 99291 1 9855 10388 0 0 0 0 0 0 533 498 OK
GRMZM5G888250_T01 99291 1 9881 10387 0 0 0 0 0 0 506 471 OK
GRMZM2G059843_T01 99292 1 11454 14988 0 0 0 0 0 0 1788 1788 OK
GRMZM5G866996_T01 99293 1 46227 47746 0 0 0 0 0 0 472 437 OK
GRMZM2G059818_T02 99294 1 50452 54182 0 0 0 0 0 0 3099 3099 OK
GRMZM2G059818_T01 99294 1 50452 56348 0 0 0 0 0 0 4379 4379 OK
GRMZM2G059818_T03 99294 1 52003 52543 0 0 0 0 0 0 540 540 OK
GRMZM2G360269_T01 99295 1 57418 61452 0 0 0 0 0 0 2556 2556 OK
GRMZM2G518629_T01 99296 1 62320 62588 0 0 0 0 0 0 98 98 OK
GRMZM5G811273_T02 99296 1 62501 64014 3.65473 0.943998 0.41328 0 8.63598 0.44571 279 244 OK
GRMZM5G811273_T01 99296 1 62733 64014 3.87154 1 0.58672 0 8.47176 0.472151 362 327 OK
AC177838.2_FGT002 99297 1 70594 71919 0 0 0 0 0 0 633 633 OK
GRMZM2G518627_T01 99298 1 73839 74024 0 0 0 0 0 0 185 185 OK
GRMZM2G059778_T01 99299 1 76119 76752 0 0 0 0 0 0 411 376 OK
GRMZM2G518609_T01 99300 1 90684 90815 0 0 0 0 0 0 131 96 OK
GRMZM2G059745_T01 99301 1 92353 93541 0 0 0 0 0 0 425 390 OK
GRMZM2G093344_T01 99302 1 109518 111769 8.56066 1 1 2.70894 14.4124 1.04401 1012 977 OK
GRMZM2G394757_T01 99302 1 110764 111506 0 0 0 0 0 0 419 419 OK
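Many transcripts in this output have an FPKM of 0. To list only the transcripts with some expression, the table can be filtered with awk. The sketch below assumes a whitespace-separated file with one header line and the column order shown on this slide (trans_id in column 1, FPKM in column 6); the function name expressed_transcripts is made up for this example.

```shell
# Hypothetical helper: print trans_id and FPKM for transcripts with FPKM > 0.
# Assumes whitespace-separated columns with trans_id in column 1 and FPKM
# in column 6; the header line is skipped.
expressed_transcripts() {
    awk 'NR > 1 && $6 + 0 > 0 { print $1, $6 }' "$1"
}

# Example: expressed_transcripts your_cufflinks_table.txt
```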

Some issues with Cufflinks

• Optimized for human RNAseq data.

• Problems in its model for weighting ambiguously mapped reads and for weighting read contributions to different isoforms.

• In our tests, simple average weighting gives better correlation with the RT-PCR data.

CBSU RNAseq pipeline

Available on both the workstations and the CBSU BioHPC web site.

In order to use this tool, we will need to create a customized database for you. Premade databases are available for 5 species, including maize and rice.

Use Samtools to call variations

samtools mpileup -uf bwa/maize.fa s1.sorted.bam | bcftools view -bvcg - > s1.raw.bcf

bcftools view s1.raw.bcf | vcfutils.pl varFilter -D100 > s1.vcf

1. Pileup
2. Variation calling
3. Filtering
(example in next slide)
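The two command lines above cover the three steps: the first does the pileup and the variation calling, the second does the filtering (the -D100 option of varFilter sets the maximum read depth to 100). They can be wrapped into one function so that the filter runs only after the calling step has succeeded. This is a sketch; the function name call_variants is hypothetical, and the commands use the same samtools/bcftools syntax as on this slide.

```shell
# Hypothetical wrapper for the two command lines shown above.
# Usage: call_variants GENOME.fa SORTED.bam PREFIX
# First pipeline: pileup and variation calling into PREFIX.raw.bcf.
# Second pipeline: filtering into PREFIX.vcf (maximum read depth 100).
call_variants() {
    genome=$1
    bam=$2
    prefix=$3
    samtools mpileup -uf "$genome" "$bam" |
        bcftools view -bvcg - > "$prefix.raw.bcf" &&
    bcftools view "$prefix.raw.bcf" |
        vcfutils.pl varFilter -D100 > "$prefix.vcf"
}

# Example: call_variants bwa/maize.fa s1.sorted.bam s1
```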

Pileup:

Chromosome 1, position 7080:
Reference: “C”
Reads: 10 reads covered this position (8 “T”, 2 “C”).

Possible interpretations:
1. PCR or sequencing error
2. Systematic machine error
3. Heterozygous
4. Homologous region
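One number that helps when weighing these interpretations is the fraction of reads that differ from the reference. For the position above it is 8 / (2 + 8) = 0.80. A tiny helper for this arithmetic is sketched below; the function name nonref_fraction is made up for this illustration.

```shell
# Hypothetical helper: print the fraction of reads that differ from the
# reference, given the number of reference and non-reference reads.
nonref_fraction() {
    awk -v ref="$1" -v alt="$2" 'BEGIN { printf "%.2f\n", alt / (ref + alt) }'
}

nonref_fraction 2 8    # slide example: 2 "C" (reference), 8 "T"; prints 0.80
```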

Function annotation of the SNPs

In the coming months …

• Using Picard/GATK to replace Samtools

• De‐novo assembly

• Genotyping‐by‐sequencing
