HTS data file

Sequence and quality information are recorded as multi-FASTQ files.

For efficient storage and transmission, they are converted into SRA (Sequence Read Archive) format.

Practice: transform the SRA file to fastq.

$ fastq-dump.2 path_to_sra_file

HTS - Recording sequence and quality information

FASTQ format = FASTA + Quality

@HWI-EAS209_0006_FC706VJ:5:58:5894:21141#ATCACG/1
TTAATTGGTAAATAAATCTCCTAATAGCTTAGATNTTACCTTNNNNNNNNNNTAGTTTCTT
+HWI-EAS209_0006_FC706VJ:5:58:5894:21141#ATCACG/1
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQabdefghadfda

• Two identification lines (@, +) for each sequence.

• Identification line format depends on the specific sequencing platform.

• Quality line uses characters representing integer values.
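The record layout described above can be sketched in Python. This is a minimal illustration, not a full parser: it assumes each record occupies exactly four lines (no wrapped sequences), and it assumes a Phred+33 quality offset, which depends on the platform and pipeline version and should be checked for real data. The function names `parse_fastq` and `phred_scores` are made up for this example.

```python
def parse_fastq(lines):
    """Yield (name, sequence, quality_string) from an iterable of FASTQ lines.

    Assumes four lines per record: @name, sequence, +[name], quality.
    """
    it = (line.rstrip("\r\n") for line in lines)
    for header in it:
        if not header:
            continue  # skip blank lines between records
        seq = next(it, None)
        plus = next(it, None)
        qual = next(it, None)
        if qual is None:
            raise ValueError("truncated FASTQ record: " + header)
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record: " + header)
        yield header[1:], seq, qual


def phred_scores(quality, offset=33):
    """Convert a quality string to integers (offset=33 is an assumption)."""
    return [ord(ch) - offset for ch in quality]


# Toy record, invented for illustration only.
toy = ["@read1", "ACGT", "+", "!I5+"]
for name, seq, qual in parse_fastq(toy):
    print(name, seq, phred_scores(qual))
```

Older Illumina pipelines used an offset of 64 instead of 33, so the same character can decode to different integer values depending on the source of the file.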

HTS data – map to genome

“bwa” and “bowtie” are two of the most popular programs; both implement a similar strategy based on the Burrows-Wheeler Transform.

Both can benefit from multiple processors.

Practice: map the data to hg19.

bowtie2 [options]* -x <bt2-idx> {-1 <m1> -2 <m2> | -U <r>} [-S <sam>]
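To give a feel for the Burrows-Wheeler Transform mentioned above, here is a toy Python sketch that builds the transform by sorting all rotations of a string and then inverts it. This is only an illustration of the idea: real aligners build compressed indexes with far more efficient algorithms, and the function names and the `$` end marker are choices made for this example.

```python
def bwt(text, end="$"):
    """Burrows-Wheeler Transform via sorted rotations (toy, quadratic memory)."""
    s = text + end  # end marker must sort before every other character
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column, end="$"):
    """Recover the original text by repeatedly prepending and re-sorting."""
    n = len(last_column)
    table = [""] * n
    for _ in range(n):
        table = sorted(last_column[i] + table[i] for i in range(n))
    original = next(row for row in table if row.endswith(end))
    return original[:-1]


print(bwt("banana"))               # annb$aa
print(inverse_bwt(bwt("GATTACA")))  # GATTACA
```

The transform tends to group identical characters together and is reversible, which is what makes it useful as the basis of a compact, searchable genome index.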

Representation of HTS data

The importance of a reference genome

• All coordinates are only meaningful for a given genome assembly.

• One assembly may have multiple releases (annotations).

You need to know which reference genome was used to generate the BED file.

Representation of (HTS) data – BED (Browser Extensible Data) file

Chrom.  Start     End       Name  Score  Strand
chr2    10000192  10000217  U0    0      +
chr2    10000227  10000252  U1    0      -
chr2    10000310  10000335  U2    0      +
chr3    10000496  10000521  U1    0      -
chr2    10000556  10000581  U2    0      +
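The six columns shown above can be read into a small record type. The sketch below is a minimal Python example, assuming whitespace-separated fields and an integer score; the names `BedRecord` and `parse_bed_line` are invented for illustration. BED start coordinates are 0-based and the end is exclusive, so the feature length is simply end minus start.

```python
from typing import NamedTuple


class BedRecord(NamedTuple):
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # exclusive
    name: str
    score: int   # assumed to be an integer here
    strand: str


def parse_bed_line(line):
    """Parse one six-column BED line (whitespace separated)."""
    fields = line.split()
    if len(fields) < 6:
        raise ValueError("expected at least 6 columns: " + line)
    chrom, start, end, name, score, strand = fields[:6]
    return BedRecord(chrom, int(start), int(end), name, int(score), strand)


record = parse_bed_line("chr2 10000192 10000217 U0 0 +")
print(record.chrom, record.end - record.start, record.strand)
```

For the first row of the example this gives a 25 bp tag on the forward strand of chr2.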

With the completion of the genome, there is no need to record the base pair identity (if it is the same as the reference genome).

Detailed description of genomic data formats: http://genome.ucsc.edu/FAQ/FAQformat.html

How to gain knowledge from HTS data

Visualization of HTS data.

Discovering genomic patterns.

Identifying novel mechanism – hypothesis generation.


Simple visualization - distribution of tags (or normalized values).

Barski et al. (2007) Cell

Chr.  ChrStart  ChrEnd  Value
chr4  0         200     0
chr4  200       400     2
chr4  400       600     13
chr4  600       800     35
chr4  800       1000    27

BedGraph file (Wig)
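A tag-distribution track like the one above can be produced by counting tags in fixed-size windows. The following Python sketch assumes one integer position per tag on a single chromosome and a 200 bp window (matching the example rows); the function name `tags_to_bedgraph` and the toy positions are made up for illustration, and no normalization is applied.

```python
from collections import Counter


def tags_to_bedgraph(chrom, positions, bin_size=200):
    """Count tags per fixed-size bin and return bedGraph-style lines.

    Bins with no tags up to the last occupied bin are reported with value 0.
    """
    counts = Counter(pos // bin_size for pos in positions)
    if not counts:
        return []
    last_bin = max(counts)
    return [
        f"{chrom}\t{b * bin_size}\t{(b + 1) * bin_size}\t{counts[b]}"
        for b in range(last_bin + 1)
    ]


# Toy tag positions, invented for illustration only.
for line in tags_to_bedgraph("chr4", [210, 250, 450]):
    print(line)
```

Dividing each count by the total number of mapped tags (or a similar scaling factor) would give the normalized values mentioned above.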

Visualizing Deep Seq data with UCSC genome browser

Practice & Observe I:

1. Load the PolII.H99.Bed file as a custom track in the browser by copying and pasting the URL link.

2. View ‘dense’ and then ‘full’ presentation of the track.

Practice & Observe II:

1. Save the landmark.bed file to your local computer. View the contents with Notepad.

2. Load the local file to the UCSC browser.

3. Edit the color value, save, resubmit, and observe the differences.


Shifting sequence tag positions may be necessary to reflect nucleosome positions. In this example, the mapping positions were shifted +73 bp for the forward strand and -73 bp for the reverse strand to reflect the midpoint of the nucleosome.

Jiang & Pugh, Nat. Rev. Genet., 2009
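The strand-dependent shift described above could be sketched as follows. This is an assumption-laden example rather than the published procedure: it assumes BED-style coordinates (0-based start, exclusive end) and assumes the shift is applied from the 5' end of each tag. The function name `nucleosome_midpoint` is hypothetical.

```python
def nucleosome_midpoint(start, end, strand, shift=73):
    """Estimate the nucleosome midpoint from one mapped tag.

    Assumes BED coordinates and a shift measured from the tag's 5' end:
    forward-strand tags move downstream (+shift), reverse-strand tags
    move upstream (-shift).
    """
    if strand == "+":
        return start + shift
    if strand == "-":
        return (end - 1) - shift  # 5' end of a reverse tag is its last base
    raise ValueError("strand must be '+' or '-', got " + repr(strand))


print(nucleosome_midpoint(10000192, 10000217, "+"))  # 10000265
print(nucleosome_midpoint(10000227, 10000252, "-"))  # 10000178
```

After shifting, tags from both strands of the same nucleosome pile up around a common position instead of forming two separate peaks.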


Advanced visualization – depending on purpose of comparison.

Berger et al. (2011) Nature

Example: a Circos plot depicts genomic location and chromosomal copy number (red, copy gain; blue, copy loss), together with inter-chromosomal translocations (purple) and intra-chromosomal rearrangements (green) observed in primary prostate cancers.

Discovering genomic patterns

Usually requires some programming (scripting).

As a biologist, you need to clearly define your question, and the logic to obtain the data summary.

Barski et al. (2007) Cell

Q: Is H3K4me3 associated with TSS? Is such an association related to gene expression status?

Logic:

1. Group genes based on expression levels obtained with a microarray study (Su et al., 2004).

2. For each gene, obtain the normalized H3K4me3 ChIP-Seq counts within [-2k, +2k] of the TSS.

3. For each expression group, plot the average value along the [-2k, +2k] interval.
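The three steps above can be sketched in Python. This is a minimal illustration under several assumptions: each gene already has a group label and a profile of normalized counts in equally sized bins spanning [-2k, +2k], all profiles have the same length, and plotting is left out. The function name `average_profiles` and the toy numbers are invented for the example and are not data from the study.

```python
from collections import defaultdict


def average_profiles(groups, profiles):
    """Average per-gene TSS profiles within each expression group.

    groups:   gene -> expression group label
    profiles: gene -> list of normalized counts across bins in [-2k, +2k]
    Returns:  group label -> list of per-bin averages
    """
    sums = {}
    sizes = defaultdict(int)
    for gene, profile in profiles.items():
        label = groups.get(gene)
        if label is None:
            continue  # gene without an expression measurement
        if label not in sums:
            sums[label] = [0.0] * len(profile)
        for i, value in enumerate(profile):
            sums[label][i] += value
        sizes[label] += 1
    return {
        label: [total / sizes[label] for total in bins]
        for label, bins in sums.items()
    }


# Toy values, invented for illustration only.
groups = {"g1": "high", "g2": "high", "g3": "low"}
profiles = {"g1": [1, 3, 1], "g2": [3, 5, 1], "g3": [0, 1, 0]}
print(average_profiles(groups, profiles))
```

Each resulting list can then be drawn as one curve per expression group along the [-2k, +2k] axis.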

Functional Analysis of HTS data

Gene Ontology – http://www.geneontology.org/

Regulatory pathways.

Modeling & Systems Biology.

Gene Ontology – hierarchical framework of terms / concepts

Gene Ontology

Goal – “produce a dynamic controlled vocabulary that can be applied to all organisms even as knowledge of gene and protein roles in cells is accumulating and changing” – GO consortium

Ontology:

“The branch of metaphysics that deals with the nature of being” – The American Heritage Dictionary

Implications of Gene Ontology (I)

Monitoring biological processes or molecular functions beyond individual genes.

Example:

1. Which biological process (or molecular function) is activated or suppressed following a treatment?

Gene Expression Profile Differences between the two lung cancer cell lines A549 and H23

extracellular (GO:0005576)                              1.91E-08  169
Cell Communication                                      1.32E-07  690
plasma membrane (GO:0005886)                            1.34E-07  511
Complement and coagulation cascades - Homo sapiens      1.73E-07  20
Metabolism                                              2.10E-06  174
carbohydrate metabolism (GO:0005975)                    2.45E-06  207
cell adhesion molecule activity (GO:0005194)            0.000102  113
Structural Protein                                      0.000231  271
extracellular matrix (GO:0005578)                       0.000235  53
Cell Growth and Maintenance                             0.000569  590
Cell Adhesion                                           0.000917  100

development                                             1.40E-07  596
cell differentiation (GO:0030154)                       6.60E-05  186
regulation of gene expression, epigenetic (GO:0040029)  7.71E-05  442
cell growth (GO:0016049)                                8.37E-05  307
transcription regulator activity (GO:0030528)           0.000307  319
extracellular (GO:0005576)                              0.000515  153

Implications of Gene Ontology (II)

Basis for cross genome comparison and integrating knowledge from different model systems.

Tools associated with GO

• A comprehensive list at the GO web site.

• Tools for browsing: AmiGO, QuickGO at EBI, etc.

• Tools for analyzing array data, such as FuncAssociate, etc.

http://www.geneontology.org/GO.tools.shtml

Using GO to gain comprehensive understanding of cellular differences

Practice: Load a probe set list into FuncAssociate to identify over-represented GO terms.
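One common way to score over-representation of a GO term in a gene list is a one-sided hypergeometric test. The Python sketch below illustrates that general idea only; it is not a description of how FuncAssociate computes its results, it applies no multiple-testing correction, and the function name and the numbers in the usage line are hypothetical.

```python
from math import comb


def hypergeom_pvalue(population, annotated, sample, hits):
    """P(X >= hits) for a hypergeometric draw.

    population: total number of genes considered (e.g. on the array)
    annotated:  genes in the population carrying the GO term
    sample:     size of the submitted gene list
    hits:       genes in the list carrying the GO term
    """
    total = comb(population, sample)
    upper = min(annotated, sample)
    tail = sum(
        comb(annotated, i) * comb(population - annotated, sample - i)
        for i in range(hits, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 10 genes in total, 3 annotated, all 3 in a list of 3.
print(hypergeom_pvalue(10, 3, 3, 3))
```

A small p-value suggests the term appears in the list more often than expected by chance; because many terms are tested at once, real tools adjust for multiple comparisons.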
