Green Center Computational Core - UT Southwestern
Green Center Computational Core ChIP-Seq Pipeline, Just a Click Away
Venkat Malladi, Computational Biologist, Computational Core, Cecil H. and Ida Green Center for Reproductive Biology Science
Green Center for Reproductive Biology Science
Introduction to the Green Center
● Basic research in female reproductive biology, with a focus on signaling, gene regulation, and genome function.
◦ pregnancy
◦ parturition
◦ stem cells
◦ oncology
◦ inflammation
● Key areas:
◦ Chromatin structure and gene regulation
◦ Epigenetics
◦ Nuclear endpoints of cellular signaling pathways
◦ Genome organization and evolution
◦ DNA replication and repair
Who is in the Green Center?
W. Lee Kraus, Ph.D., Director of the Green Center.
● Associated with the Department of Obstetrics and Gynecology
● Consists of 9 main faculty/labs
● 20 associated faculty/labs
● Computational Core
● Consists of 4 Computational Biologists
● Analysis of Genomic Sequencing Data
● Responsibilities
◦ Data quality assurance
◦ Perform basic analyses
◦ Work with investigators to perform integrative analyses
Green Center Computation Team
Anusha Nagari, Tulip Nandu, Venkat Malladi, Aishwarya Gogate
Role of the Computational Core
Modified from PLoS Biol 9-e1001046, 2011 (M. Pazin)
Challenge: Variety of Assays Supported?
ATAC-seq, RNA-seq, GRO-seq
What is ATAC-seq?
Assay for Transposase-Accessible Chromatin using Sequencing (ATAC-Seq): Genomic method that captures open chromatin sites.
Buenrostro et al. (2013) Nature Methods
What is RNA-Seq?
RNA Sequencing (RNA-Seq): RNA-seq measures the abundance of mature RNA species in the cell. These experiments contribute to the understanding of how RNA-based mechanisms impact gene regulation.
● Types:
◦ Total RNA
◦ polyA mRNA (long and short)
◦ shRNA
◦ small RNA
◦ microRNA
◦ polyA-depleted RNA
What is GRO-Seq?
Global Run-On Sequencing (GRO-Seq): A genomic method that maps the position and orientation of all actively transcribing RNA polymerases.
● Transcription from all three RNA polymerases is captured, providing transcriptional profiles in both annotated and unannotated regions of the genome, including:
◦ protein coding mRNA
◦ long non-coding RNAs (lncRNAs)
◦ enhancer RNAs (eRNAs)
◦ divergent transcription
◦ antisense transcription
◦ intergenic transcription
[Figure: transcript classes labeled Annotated, Divergent, Intergenic, Antisense, ERα Enhancer, Other Genic]
Hah et al. (2011) Cell
What is ChIP-Seq?
Chromatin immunoprecipitation followed by Sequencing (ChIP-Seq): Identifies the binding sites of chromatin-associated proteins.
● Categories:
• Transcription factor ChIP-Seq: targets proteins that associate with specific DNA sequences to influence the rate of transcription
• Histone ChIP-Seq: measures the histone content of chromatin, specifically the incorporation of particular post-translational histone modifications
Park (2009) Nature Reviews
Considerations for Making a Pipeline
1. Who are the users?
2. Define what the pipeline should deliver
3. Identify all input and output files
4. What QA/QC metrics should be available for users?
5. Identify all software used in the pipeline
6. Break the pipeline down into discrete steps (based on deliverable files and metrics)
Users and Goals
● Users:
◦ Wet lab scientists (Grad Students/Post Docs)
◦ Computational Biologists in the Green Center
● Goals:
◦ Allow wet lab scientists to quickly assess the quality of their data and explore it
◦ Allow for easily reproducible analysis within the Green Center
Schema: ChIP-seq Pipeline
[Flow diagram, boxes listed in reading order]
● FASTQ (SE/PE)
● Map (bowtie2), Quality (fastqc) → BAM, QA Metrics
● Remove Duplicates (picard) → BAM, QA Metrics
● Cross-correlation → tagAlign, Fragment size
● Call Peaks (macs2) → bigWig, narrowPeak, QA Metrics
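The steps in the schema can be sketched as one command line per step. This is only an illustration and not the Green Center's actual pipeline: the function name `chipseq_commands`, all file names, the index path, the output directories and the genome-size value `hs` are hypothetical, and single-end reads are assumed. The `picard` wrapper command is also an assumption (some installations invoke it as `java -jar picard.jar`). The SAM-to-sorted-BAM conversion and the cross-correlation step are left out.

```python
from typing import Dict, List


def chipseq_commands(sample: str, fastq: str, control_bam: str,
                     index: str, genome: str = "hs") -> Dict[str, List[str]]:
    """Build (but do not run) one command per pipeline step, single-end reads."""
    sam = f"{sample}.sam"
    bam = f"{sample}.bam"          # assumed: sorted BAM made from the SAM file
    dedup = f"{sample}.dedup.bam"
    return {
        # Read-level quality report
        "quality": ["fastqc", fastq, "-o", "qc"],
        # Align single-end reads against a prebuilt bowtie2 index
        "map": ["bowtie2", "-x", index, "-U", fastq, "-S", sam],
        # Mark and drop duplicate reads
        "remove_duplicates": [
            "picard", "MarkDuplicates",
            f"I={bam}", f"O={dedup}", f"M={sample}.dup_metrics.txt",
            "REMOVE_DUPLICATES=true",
        ],
        # Call peaks for the ChIP sample against its input control
        "call_peaks": [
            "macs2", "callpeak", "-t", dedup, "-c", control_bam,
            "-f", "BAM", "-g", genome, "-n", sample, "--outdir", "peaks",
        ],
    }


if __name__ == "__main__":
    cmds = chipseq_commands("chip_rep1", "chip_rep1.fastq.gz",
                            "input_rep1.dedup.bam", "bowtie2_index/genome")
    for step, cmd in cmds.items():
        print(f"{step}: {' '.join(cmd)}")
```

Keeping each step as a plain argument list makes it easy to log the exact commands, which supports the reproducibility goal stated earlier.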
FASTQ: Quality Metrics
FastQC Report, Tue 19 Feb 2013, HF_K9_GATCAG_L005_R1_001.fastq.gz
Summary
Basic Statistics
Per base sequence quality
Per sequence quality scores
Per base sequence content
Per base GC content
Per sequence GC content
Per base N content
Sequence Length Distribution
Sequence Duplication Levels
Overrepresented sequences
Kmer Content
Basic Statistics
Measure             Value
Filename            HF_K9_GATCAG_L005_R1_001.fastq.gz
File type           Conventional base calls
Encoding            Sanger / Illumina 1.9
Total Sequences     22571166
Filtered Sequences  0
Sequence length     50
%GC                 42
Per Base Sequence Quality
[Figure: per base sequence quality plot, annotated: Good quality calls, Reasonable quality calls, Poor quality calls]
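As a rough illustration of what the per base sequence quality view summarizes, the sketch below decodes FASTQ quality strings and averages the Phred score at each read position. It assumes Phred+33 encoding, which is what FastQC reports as "Sanger / Illumina 1.9". The cutoffs of 28 and 20 used to label positions as good, reasonable or poor are an assumption for this example, and the function names are hypothetical.

```python
from typing import Iterable, List

PHRED_OFFSET = 33  # assumed Phred+33 ("Sanger / Illumina 1.9") encoding


def mean_quality_per_base(quality_lines: Iterable[str]) -> List[float]:
    """Average Phred score at each read position across all reads."""
    totals: List[int] = []
    counts: List[int] = []
    for line in quality_lines:
        for i, ch in enumerate(line.strip()):
            if i == len(totals):      # first read that reaches this position
                totals.append(0)
                counts.append(0)
            totals[i] += ord(ch) - PHRED_OFFSET
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]


def classify(score: float) -> str:
    """Label a mean score using the assumed cutoffs (28 and 20)."""
    if score >= 28:
        return "good"
    if score >= 20:
        return "reasonable"
    return "poor"


if __name__ == "__main__":
    quals = ["IIII5", "III5+"]  # toy quality strings, not real data
    for pos, mean in enumerate(mean_quality_per_base(quals), start=1):
        print(pos, mean, classify(mean))
```

FastQC itself shows a distribution per position rather than only the mean, so this is a simplification.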
Alignment: Quality Metrics
FASTQ File: DNA sequence
Aligned File: DNA sequence + Genomic localization
Alignment % = (No. of aligned reads / Total no. of raw reads) * 100
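The alignment percentage is a single ratio. A minimal sketch, with a hypothetical function name and made-up read counts:

```python
def alignment_percent(aligned_reads: int, raw_reads: int) -> float:
    """Alignment % = (aligned reads / total raw reads) * 100."""
    if raw_reads <= 0:
        raise ValueError("raw_reads must be positive")
    return 100.0 * aligned_reads / raw_reads


if __name__ == "__main__":
    # Illustrative counts only, not taken from a real sample
    print(f"{alignment_percent(9_630_000, 10_000_000):.2f}%")
```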
Uniquely Mapped Reads: Quality Metrics
● Depth
◦ Number of uniquely mapping reads
● Library Complexity
◦ Non-Redundant Fraction (NRF): number of distinct uniquely mapping reads (i.e. after removing duplicates) / total number of reads
● PCR Bottlenecking Coefficient 1 (PBC1)
◦ PBC1 = M1 / M_DISTINCT, where
M1: number of genomic locations where exactly one read maps uniquely
M_DISTINCT: number of distinct genomic locations to which some read maps uniquely
● PCR Bottlenecking Coefficient 2 (PBC2)
◦ PBC2 = M1 / M2, where
M1: number of genomic locations where exactly one read maps uniquely
M2: number of genomic locations where exactly two reads map uniquely
ENCODE Standards https://www.encodeproject.org/data-standards/chip-seq/
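The three complexity metrics can be computed from a tally of how many reads map to each genomic location. The sketch below assumes each uniquely mapped read is represented as a hypothetical `(chromosome, start, strand)` tuple and that the input already contains only uniquely mapping reads, so "total number of reads" here means the length of that list. The function name is illustrative.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Position = Tuple[str, int, str]  # (chromosome, start, strand), hypothetical layout


def library_complexity(positions: Iterable[Position]) -> Dict[str, float]:
    """Compute NRF, PBC1 and PBC2 from uniquely mapped read positions."""
    counts = Counter(positions)          # reads per genomic location
    total = sum(counts.values())         # all reads supplied
    m_distinct = len(counts)             # locations with at least one read
    m1 = sum(1 for n in counts.values() if n == 1)  # exactly one read
    m2 = sum(1 for n in counts.values() if n == 2)  # exactly two reads
    return {
        "NRF": m_distinct / total if total else float("nan"),
        "PBC1": m1 / m_distinct if m_distinct else float("nan"),
        "PBC2": m1 / m2 if m2 else float("inf"),
    }


if __name__ == "__main__":
    reads = (
        [("chr1", 100, "+")] * 1 + [("chr1", 250, "-")] * 1 +
        [("chr1", 400, "+")] * 2 + [("chr2", 50, "+")] * 3 +
        [("chr2", 900, "-")] * 1
    )
    print(library_complexity(reads))
```

In this toy input there are 8 reads at 5 distinct locations, 3 of which carry exactly one read and 1 of which carries exactly two, giving NRF = 5/8, PBC1 = 3/5 and PBC2 = 3/1.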
Uniquely Mapped Reads: Quality Metrics (cont.)
[Tables: NRF Guidelines, PBC1 Guidelines, PBC2 Guidelines]
ENCODE Standards https://www.encodeproject.org/data-standards/chip-seq/
Alignment: Quality Metrics Report
Sample Information      Raw reads     Alignment %
Control Replicate 1     28,259,069    96.30%
Control Replicate 2     28,892,302    96.00%
Sample 2 Replicate 1    23,239,486    96.10%
Sample 2 Replicate 2    25,637,094    96.90%
Sample 3 Replicate 1    22,713,054    96.60%
Sample 3 Replicate 2    20,419,272    95.90%
Sample 4 Replicate 1    22,617,154    96.60%
Sample 4 Replicate 2    20,068,460    96.00%
Cross-correlation: Quality Metrics Report
[Figure panels: Sample 1, R = 0.99; Sample 2, R = 0.99]
R: Pearson correlation coefficient
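For reference, the Pearson correlation coefficient R can be computed as below. The slide does not state which quantities are being correlated, so the inputs in this sketch are simply two equal-length lists of hypothetical values, for example signal in the same bins from two replicates.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    rep1 = [1.0, 2.0, 3.0, 4.0]   # hypothetical binned signal, replicate 1
    rep2 = [2.1, 3.9, 6.2, 7.8]   # hypothetical binned signal, replicate 2
    print(round(pearson_r(rep1, rep2), 3))
```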
Call Peaks: Quality Metrics Report
1. Peak calls for individual replicates
2. Overlapping peaks between the pooled pseudo-replicates
3. bigWig files (UCSC Genome Browser, IGV…)
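The idea of overlapping peaks between two peak sets can be illustrated with a simple interval check. This is a simplified sketch, not the procedure the pipeline uses: it keeps any peak from the first set that shares at least one base with a peak in the second set. Peaks are hypothetical `(chrom, start, end)` tuples with half-open coordinates, matching the first three columns of a narrowPeak (BED-style) file.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Peak = Tuple[str, int, int]  # (chrom, start, end), half-open like BED


def overlapping_peaks(a: List[Peak], b: List[Peak]) -> List[Peak]:
    """Return peaks from `a` that overlap at least one peak in `b`."""
    by_chrom: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for chrom, start, end in b:
        by_chrom[chrom].append((start, end))
    kept: List[Peak] = []
    for chrom, start, end in a:
        # Two half-open intervals overlap when each starts before the other ends
        if any(start < e and s < end for s, e in by_chrom.get(chrom, [])):
            kept.append((chrom, start, end))
    return kept


if __name__ == "__main__":
    pseudo_rep1 = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
    pseudo_rep2 = [("chr1", 150, 250), ("chr2", 200, 300)]
    print(overlapping_peaks(pseudo_rep1, pseudo_rep2))
```

The pairwise scan is quadratic per chromosome, which is fine for a demonstration. For real peak files a sorted sweep or an interval tree would be a better fit.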
Call Peaks: Quality Metrics Report
Visualizing signal tracks (bigWig files) in UCSC Genome Browser:
Franco et al. (2015)
Working With BioHPC and Astrocyte
Creating a Project
Create New Project to run analysis
Adding Data
Select “Add Data to this Project” ...
ChIP-Seq Workflow
ChIP-Input fastq files
ChIP TF or Histone fastq files
Sequence format
Assembly
Run Time of ChIP-Seq Pipeline
Thank you!
Questions?