Using BioHPC Lab Software
Qi Sun
Computational Biology Service Unit
Cornell University
3CPG
CBSU/3CPG workshops
Series 1
1. Linux
2. Using the BioHPC lab software
3. BioHPC web site
4. PERL scripting ?
Series 2
1. Genetic variation detection
2. RNAseq
Practice in the BioHPC lab
1. You can download the PowerPoint slides and a Word file with step-by-step instructions for two exercises: http://cbsu.tc.cornell.edu/lab/workshops.aspx
2. The data files used for exercises are available on /shared_data/cbsuworkshop_2.
1. SNP/INDEL detection
2. RNA‐seq and ChIP‐seq
3. De‐novo assembly
4. Genotyping‐by‐sequencing
5. Metagenomics
What software is available?
Reference genome dependent
Reference genome independent
Categories:
Not ready yet
Raw data files, Excel spreadsheets, computer software
Commercial software for next-gen data analysis is not mature yet
A partial list of commercial software
• Softgenetics NextGENe
• DNAstar Ngen
• GenomeQuest
• CLC Genomics Workbench
Pipeline for SNP/INDEL calling
Step 1: FASTQ file → BWA → SAM file
Step 2: SAM file → Samtools view → BAM file
Step 3: BAM file → Samtools pileup → VCF/BCF
You have more options with open source software
Pipeline for SNP/INDEL calling
Step 1: FASTQ file → BWA → SAM file
Step 2: SAM file → Samtools view → BAM file
Step 3: BAM file → Samtools pileup → VCF/BCF
Alternative tools: Bowtie, GATK, Picard
BWA
Bowtie/Tophat/Cufflinks
Samtools
Picard
GATK
Annovar
Velvet
gsAssembler
BLAST
BLAT…
Software tools installed on the BioHPC labs
Each project requires a combination of multiple tools
Commonly Used File Formats
Category            File extension  Reference
Sequence            fasta           http://en.wikipedia.org/wiki/FASTA_format
Sequence            fastq           http://en.wikipedia.org/wiki/FASTQ_format
Alignment           SAM/BAM         http://samtools.sourceforge.net/SAM-1.3.pdf
Sequence variation  VCF/BCF         http://www.1000genomes.org/node/101
Genome annotation   gff/gff3        http://gmod.org/wiki/GFF3
Genome annotation   gtf             http://genome.ucsc.edu/FAQ/FAQformat#format4
Most files that you download from a web site are compressed .gz files. Use the gunzip command to decompress them, e.g.:
gunzip s_1_sequence.txt.gz
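Before decompressing a large file, it can be useful to check it while it is still compressed. The sketch below counts the reads in a gzipped FASTQ file. It assumes the standard four lines per read; the function name count_fastq_reads is hypothetical and not a BioHPC tool.

```shell
# count_fastq_reads: hypothetical helper, not part of any toolkit.
# Assumes a gzip-compressed FASTQ file with exactly four lines per read.
count_fastq_reads() {
    # gunzip -c streams the decompressed data to stdout and leaves the .gz file in place
    gunzip -c "$1" | awk 'END { print NR / 4 }'
}
```

For example, "count_fastq_reads s_1_sequence.txt.gz" prints the read count without writing the decompressed file to disk.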
SAM, BAM, BWA, VCF, BCF, Bowtie, Cufflinks, Tophat, GATK, PERL, Samtools, Bamtools, IGV, VCFtools, Velvet, FASTQ, GFF, BED, GTF, gz, GFF3
User guide and FAQ for BioHPC lab: http://cbsu.tc.cornell.edu/lab/use.aspx
We welcome suggestions and contributions to our documentation library.
Designing a pipeline
RNAseq:
1. Splicing junction identification
2. Alignment to reference genome and junctions
3. Quantify reads aligned to each gene
4. Identify novel transcriptionally active regions
SNP/INDEL detection:
1. Alignment to reference genome
2. Filter out PCR duplicates
3. Realignment around INDELs
4. Variation calling
5. Filtering
6. Annotation
Selection of alignment tools
• Gapped alignment or ungapped alignment
BWA: gapped; Bowtie: ungapped
• Dealing with ambiguous hits
BWA and Bowtie report ambiguous hits in different ways.
• Alignment to splicing junctions
Use Tophat (canonical splicing sites only).
CBSU recommended alignment tools:
BWA (for genomic sequencing)
Tophat (for RNA-seq)
Get reference genome and annotation files
Use the UCSC site to download the genome fasta file: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/
(Use the “cat chr* > allchr.fa” command to concatenate the individual chromosomes into one file.)
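After concatenating, it is worth confirming that every chromosome ended up in the combined file. A minimal sketch, assuming the per-chromosome files are named chr*.fa and sit in one directory; the function name concat_chromosomes is hypothetical.

```shell
# concat_chromosomes: hypothetical helper.
# $1 = directory holding the chr*.fa files, $2 = combined output file.
# The output name must not itself match chr*.fa inside $1.
concat_chromosomes() {
    cat "$1"/chr*.fa > "$2" || return 1
    # Each FASTA record starts with ">", so this counts the sequences written
    grep -c '^>' "$2"
}
```

The printed number should equal the number of chromosome files. Note that the shell expands chr* in sorted name order (chr1, chr10, chr11, ..., chr2), which determines the order of the sequences in the combined file.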
Use the UCSC Table Browser to create the GTF file: http://genome.ucsc.edu/cgi-bin/hgTables?command=start
What to do after you receive the data?
Fastq file: post QC filter
Qseq file: pre QC filter
Transfer the files to the workstation
• Use WinSCP (Windows) or Fetch (Mac) to move files to the workstation. Decompress the file using the gunzip command:
gunzip s_3_sequence.txt.gz
Check the data quality (optional)
Run the fastx_quality_stats tool (fastx toolkit) to get the quality report.
fastx_quality_stats -i s_3_sequence.txt -o stat_report.xls &
FASTX output
Decode the multiplexed data file
1. Use the fastx toolkit (fastx_barcode_splitter.pl):
cat s_3_sequence.txt | fastx_barcode_splitter.pl --bcfile mybarcodes.txt --bol --mismatches 0 --prefix mysample
Note: This tool cannot be applied to paired-end reads. If you have paired-end reads, you can use the NovoBarCode tool. It is free for academic users. http://www.novocraft.com/userfiles/file/NovoBarcode.pdf
2. Buckler Lab GBS barcode system:
GBS_barcode.pl yourBarCodeFile yourEnzymeFile
Note: Documentation for GBS_barcode.pl is on the workstation (/shared_data/doc/GBS_barcode.pdf).
Alignment with BWA
1. Reformat and index the genome fasta file.
bwa index -a bwtsw maize.fa &
2. Do alignment (after the indexing job has finished)
bwa aln -n 2 -t 2 maize.fa s_1_sequence.txt > s_1.sai &
bwa samse -n 5 maize.fa s_1.sai s_1_sequence.txt > s_1.sam &
Manual: http://bio-bwa.sourceforge.net/bwa.shtml#3
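Each BWA command above ends with "&", so it runs in the background, and each step needs the output of the previous one. One way to avoid starting a step too early is to chain the three commands. This is a sketch only: the wrapper name bwa_align_se is hypothetical, while the bwa options are the ones shown above.

```shell
# bwa_align_se: hypothetical wrapper around the three commands above.
# $1 = genome fasta file, $2 = single-end FASTQ file, $3 = output prefix.
# "&&" starts each step only after the previous one has finished successfully.
bwa_align_se() {
    bwa index -a bwtsw "$1" &&
    bwa aln -n 2 -t 2 "$1" "$2" > "$3.sai" &&
    bwa samse -n 5 "$1" "$3.sai" "$2" > "$3.sam"
}
```

Running "bwa_align_se maize.fa s_1_sequence.txt s_1 &" puts the whole chain in the background as a single job.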
Alignment with Tophat
1. Reformat and index the genome fasta file.
bowtie-build maize.fa maize &
2. Do alignment (with or without annotation)
tophat -p 3 -o s1_guided -G ZmB73_5a_WGS.gtf --no-novel-juncs maize s_1_sequence.txt &
tophat -p 3 -o s1_unguided maize s_1_sequence.txt &
Manual: http://bio-bwa.sourceforge.net/bwa.shtml#3
Convert SAM to BAM
samtools view -bS -o s_1.bam s_1.sam
samtools sort s_1.bam s_1.sorted
samtools index s_1.sorted.bam
• samtools view: convert SAM to BAM
• samtools sort: sort by genome coordinates or read names
• samtools index: index the BAM file
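The "samtools sort input.bam output_prefix" form above is the syntax of early samtools releases. Newer releases (version 1.3 onward) name the sorted file with -o instead. A minimal sketch of the same three steps for a newer installation; the function name sam_to_sorted_bam is hypothetical, and you should check "samtools --version" on your workstation first.

```shell
# sam_to_sorted_bam: hypothetical helper, assuming samtools 1.3 or newer.
# $1 = input SAM file, $2 = output prefix.
sam_to_sorted_bam() {
    samtools view -b -o "$2.bam" "$1" &&           # SAM -> BAM (input format is auto-detected)
    samtools sort -o "$2.sorted.bam" "$2.bam" &&   # sort by genome coordinates
    samtools index "$2.sorted.bam"                 # writes $2.sorted.bam.bai
}
```

For example, "sam_to_sorted_bam s_1.sam s_1" produces s_1.bam, s_1.sorted.bam and s_1.sorted.bam.bai.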
Visualize the sequence alignment
How to use IGV:
1. Copy the .bam and .bai files, as well as the genome fasta and gtf files, to your local computer.
2. Launch the IGV program from http://www.broadinstitute.org/software/igv/download
3. Load the files.
Use Cufflinks to assemble/quantify RNA‐seq
Single sample:
cufflinks -G ZmB73_5a_WGS.gtf s1_tophat.bam
Multiple samples:
cuffdiff ZmB73_5a_WGS.gtf s1_tophat.bam s5_tophat.bam
trans_id           bundle_id  chr  left    right   FPKM      FMI       frac      FPKM_conf_lo  FPKM_conf_hi  coverage  length  effective_length  status
GRMZM2G060082_T01  99289      1    2       3807    2.05938   0.507199  0.576667  0             5.04574       0.25115   2804    2769              OK
GRMZM2G060082_T02  99289      1    2606    3754    4.0603    1         0.423333  0             8.65942       0.495171  1066    1031              OK
GRMZM2G059865_T01  99290      1    4853    9652    15.6517   1         0.471931  7.73925       23.5641       1.90879   1966    1931              OK
GRMZM2G059865_T03  99290      1    4856    6355    4.18E-09  2.67E-10  7.70E-11  0             0.000129      5.10E-10  1214    1179              OK
GRMZM2G059865_T02  99290      1    4856    9652    14.2274   0.909003  0.528069  6.68358       21.7713       1.7351    2412    2377              OK
GRMZM2G059856_T01  99291      1    9855    10388   0         0         0         0             0             0         533     498               OK
GRMZM5G888250_T01  99291      1    9881    10387   0         0         0         0             0             0         506     471               OK
GRMZM2G059843_T01  99292      1    11454   14988   0         0         0         0             0             0         1788    1788              OK
GRMZM5G866996_T01  99293      1    46227   47746   0         0         0         0             0             0         472     437               OK
GRMZM2G059818_T02  99294      1    50452   54182   0         0         0         0             0             0         3099    3099              OK
GRMZM2G059818_T01  99294      1    50452   56348   0         0         0         0             0             0         4379    4379              OK
GRMZM2G059818_T03  99294      1    52003   52543   0         0         0         0             0             0         540     540               OK
GRMZM2G360269_T01  99295      1    57418   61452   0         0         0         0             0             0         2556    2556              OK
GRMZM2G518629_T01  99296      1    62320   62588   0         0         0         0             0             0         98      98                OK
GRMZM5G811273_T02  99296      1    62501   64014   3.65473   0.943998  0.41328   0             8.63598       0.44571   279     244               OK
GRMZM5G811273_T01  99296      1    62733   64014   3.87154   1         0.58672   0             8.47176       0.472151  362     327               OK
AC177838.2_FGT002  99297      1    70594   71919   0         0         0         0             0             0         633     633               OK
GRMZM2G518627_T01  99298      1    73839   74024   0         0         0         0             0             0         185     185               OK
GRMZM2G059778_T01  99299      1    76119   76752   0         0         0         0             0             0         411     376               OK
GRMZM2G518609_T01  99300      1    90684   90815   0         0         0         0             0             0         131     96                OK
GRMZM2G059745_T01  99301      1    92353   93541   0         0         0         0             0             0         425     390               OK
GRMZM2G093344_T01  99302      1    109518  111769  8.56066   1         1         2.70894       14.4124       1.04401   1012    977               OK
GRMZM2G394757_T01  99302      1    110764  111506  0         0         0         0             0             0         419     419               OK
Cufflinks output
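Most transcripts in a table like this have an FPKM of 0, so a quick filter helps when browsing the results. The sketch below assumes the Cufflinks table is tab-delimited, has one header line, and holds FPKM in column 6 as in the columns listed above; the function name expressed_transcripts is hypothetical.

```shell
# expressed_transcripts: hypothetical helper.
# Assumes a tab-delimited Cufflinks table with one header line and FPKM in column 6.
# $1 = table file, $2 = minimum FPKM (default 1).
expressed_transcripts() {
    # Prints the transcript ID and FPKM of every row at or above the threshold
    awk -F '\t' -v min="${2:-1}" 'NR > 1 && $6 + 0 >= min + 0 { print $1 "\t" $6 }' "$1"
}
```

Pass the Cufflinks table as the first argument and, optionally, a different threshold as the second.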
Some issues with Cufflinks
• Optimized for human RNAseq data.
• Its model has problems weighting ambiguously mapped reads and apportioning reads among different isoforms.
• In our tests, simple average weighting correlates better with the RT-PCR data.
CBSU RNAseq pipeline
Available on both the workstations and the CBSU BioHPC web site.
In order to use this tool, we will need to create a customized database for you. Premade databases are available for 5 species, including maize and rice.
Use Samtools to call variations
samtools mpileup -uf bwa/maize.fa s1.sorted.bam | bcftools view -bvcg - > s1.raw.bcf
bcftools view s1.raw.bcf | vcfutils.pl varFilter -D100 > s1.vcf
1. Pileup
2. Variation calling
3. Filtering
(example in next slide)
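In recent bcftools releases, the pileup and calling steps are run as "bcftools mpileup" and "bcftools call" rather than "samtools mpileup" and "bcftools view -vcg". A hedged sketch of the same three steps for such an installation: the function name call_variants is hypothetical, and "-m" selects the newer multiallelic caller, not the consensus caller ("-c") used in the commands above.

```shell
# call_variants: hypothetical helper, assuming a recent bcftools with the
# "mpileup" and "call" subcommands and vcfutils.pl on the PATH.
# $1 = reference fasta file, $2 = sorted BAM file, $3 = output prefix.
call_variants() {
    # Steps 1 and 2: pileup, then variant calling (-v keeps variant sites only)
    bcftools mpileup -Ou -f "$1" "$2" | bcftools call -mv -Ob -o "$3.raw.bcf" &&
    # Step 3: filtering, with the same depth cutoff as above
    bcftools view "$3.raw.bcf" | vcfutils.pl varFilter -D100 > "$3.vcf"
}
```

For example, "call_variants bwa/maize.fa s1.sorted.bam s1" writes s1.raw.bcf and s1.vcf.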
Pileup:
Chromosome 1 - position 7080:
Reference: “C”
Reads: 10 reads covered this position.
8 “T”, 2 “C”
Possible interpretations:
1. PCR or sequencing error;
2. Systematic machine error;
3. Heterozygous;
4. Homologous region;
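For the position above, 8 of the 10 reads disagree with the reference, a non-reference fraction of 0.80. In a diploid sample a heterozygous site would be expected near 0.5 and a homozygous variant near 1.0, so at this depth the counts alone do not settle which interpretation is right. A small sketch of the arithmetic, with a hypothetical function name:

```shell
# alt_fraction: hypothetical helper; prints the fraction of reads that
# disagree with the reference. $1 = non-reference read count, $2 = total depth.
alt_fraction() {
    awk -v alt="$1" -v depth="$2" 'BEGIN { if (depth > 0) printf "%.2f\n", alt / depth; else print "NA" }'
}
```

Running "alt_fraction 8 10" prints 0.80.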
Function annotation of the SNPs
In the coming months …
• Using Picard/GATK to replace Samtools
• De‐novo assembly
• Genotyping‐by‐sequencing