ChIP-seq (NGS) Data Formats - GitHub Pages · 2019-05-23 · Common data types and file formats...

ChIP-seq (NGS) Data Formats

NGS analysis workflows

[Figure: NGS analysis workflow. Biological samples → Sequence reads (SRA/SRF, FASTQ) → Quality control → Mapping / Assembly ... (SAM/BAM/Pileup) → DE Analysis (Counts, RPKM), Variant Detection (VCF), Peak Calling ... (BED/narrowPeak/broadPeak)]

Common data types and file formats

• You will encounter 3 major types of data, with several associated file formats:

◇ Sequence data - FASTA, FASTQ

◇ Alignment data - SAM, BAM

◇ Genome feature data - BED, Wiggle, GTF, GFF

• Some file formats are not human-readable (binary).

• Many are human readable, but extremely large! (Never use Word or Excel to open these!)

Formats for “Genome Features/Coordinates” data

• Tab-delimited (text file separated by tabs)

• Contain specific information about genomic coordinates of various genomic “features” (e.g. exon, UTRs, etc.)

• May or may not include sequence data

• Some examples include:

◇ SAM/BAM

◇ UCSC formats (BED, WIG, etc.)

◇ GTF/GFF (GTF v2, and GFF v3)

Genomic coordinates can be represented in 2 ways

[Figure: an example sequence with its bases numbered under each convention. Where is base 1 and where is base 8? Where is ATG?]

• 0-based (half-open), preferred by programmers

◇ ATG: start = 3, end = 6, i.e. [ 3, 6 ) when bases are counted from 0

◇ Len = end - start

• 1-based (closed), preferred by biologists

◇ ATG: [ 4, 6 ]

◇ Len = end - start + 1

https://www.biostars.org/p/84686/
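The two conventions can be converted mechanically. The following is a minimal Python sketch, assuming a made-up 8-base sequence in which ATG occupies bases 4 to 6; the sequence and the function names are illustrative and do not come from the slides.

```python
def zero_to_one_based(start, end):
    """Convert a 0-based half-open interval to a 1-based closed interval."""
    return start + 1, end


def one_to_zero_based(start, end):
    """Convert a 1-based closed interval to a 0-based half-open interval."""
    return start - 1, end


# Hypothetical 8-base sequence; ATG sits at bases 4-6 when counting from 1.
seq = "CCGATGTA"

zb_start = seq.find("ATG")        # Python string indices are 0-based
zb_end = zb_start + len("ATG")    # half-open: the end is not included
ob_start, ob_end = zero_to_one_based(zb_start, zb_end)

zero_based_len = zb_end - zb_start       # Len = end - start
one_based_len = ob_end - ob_start + 1    # Len = end - start + 1

print((zb_start, zb_end), (ob_start, ob_end), zero_based_len, one_based_len)
# prints: (3, 6) (4, 6) 3 3
```

Only the start changes between the two systems; the end value is numerically the same, which is why the length formulas differ by one.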

Genome interval file: BED

• Tab- or whitespace-delimited text file; consists of one line per feature

• 0-based coordinates (half-open: the end position is not included)

• The first three fields/columns in each feature line are required:

◇ chr: chromosome name/ID

◇ start: start position of the feature

◇ end: end position of the feature

• There are nine additional fields that are optional.

• Sometimes the BED format is referenced based on the number of additional fields (e.g. BED 6+4 format = the first 6 columns of a BED file + 4 other columns)

Genome interval file: BED

Three-column example (chromosome ID, start location, end location):

chr1 213941196 213942363
chr1 213942363 213943530
chr1 213943530 213944697

Six-column example (chromosome ID, start location, end location, name, score, strand):

chr7 127471196 127472363 Pos1 0 +
chr7 127472363 127473530 Pos2 0 +
chr7 127473530 127474697 Pos3 0 +
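A BED line can be read with a few lines of Python. This is a minimal sketch, assuming only the first six columns are of interest; `BedRecord` and `parse_bed_line` are hypothetical names chosen for illustration, not part of any library.

```python
from typing import NamedTuple, Optional


class BedRecord(NamedTuple):
    chrom: str
    start: int                    # 0-based, included
    end: int                      # 0-based, not included
    name: Optional[str] = None
    score: Optional[int] = None
    strand: Optional[str] = None

    @property
    def length(self):
        # Half-open coordinates: Len = end - start
        return self.end - self.start


def parse_bed_line(line):
    """Parse one tab- or whitespace-delimited BED feature line."""
    fields = line.split()
    if len(fields) < 3:
        raise ValueError("a BED line needs at least chr, start and end")
    name = fields[3] if len(fields) > 3 else None
    score = int(fields[4]) if len(fields) > 4 else None
    strand = fields[5] if len(fields) > 5 else None
    return BedRecord(fields[0], int(fields[1]), int(fields[2]), name, score, strand)


rec = parse_bed_line("chr7\t127471196\t127472363\tPos1\t0\t+")
print(rec.chrom, rec.name, rec.strand, rec.length)
# prints: chr7 Pos1 + 1167
```

The sketch does not skip `track` or `browser` header lines, which a real file may contain.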

BedGraph format

• Allows the display of continuous-valued data in a track format, especially for data that is sparse or contains elements of varying size

• Based on the BED format, but with a few differences:

◇ The score is placed in column 4 not 5

◇ Track lines must also be included (these are optional in BED files)

• 0-based coordinates

• Preserve data in original format (no compression)

• Often used for displaying density or coverage information

BedGraph format

track type=bedGraph name="BedGraph Format" description="BedGraph format" visibility=full

chr19 49302000 49302300 -1.0

chr19 49302300 49302600 -0.75

chr19 49302600 49302900 -0.50

chr19 49302900 49303200 -0.25
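Because bedGraph keeps the original values, simple summaries can be computed directly from the text. The sketch below reads the four data lines shown above, skips the track line, and computes a length-weighted mean signal; `read_bedgraph` is an illustrative helper name, not an existing tool.

```python
bedgraph_text = """track type=bedGraph name="BedGraph Format" description="BedGraph format" visibility=full
chr19 49302000 49302300 -1.0
chr19 49302300 49302600 -0.75
chr19 49302600 49302900 -0.50
chr19 49302900 49303200 -0.25
"""


def read_bedgraph(text):
    """Yield (chrom, start, end, value) tuples, skipping header and comment lines."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("track", "browser", "#")):
            continue
        chrom, start, end, value = line.split()
        yield chrom, int(start), int(end), float(value)


records = list(read_bedgraph(bedgraph_text))

# 0-based half-open intervals, so each interval covers end - start bases.
covered = sum(end - start for _, start, end, _ in records)
weighted = sum((end - start) * value for _, start, end, value in records)
mean_signal = weighted / covered

print(len(records), covered, mean_signal)
# prints: 4 1200 -0.625
```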

Wiggle format

• Similar to the bedGraph format but:

◇ it's compressed, and exact data values cannot be recovered from the compression

◇ data elements need to be equally sized (i.e. bins of specified size)

• Associates a floating point number with positions in the genome, which is plotted on the track's vertical axis to create a wiggly line

• 1-based coordinates

Wiggle format

fixedStep chrom=chr3 start=400601 step=100
11
22
33

variableStep chrom=chr2
300701 12.5
300702 12.5
300703 12.5
300704 12.5
300705 12.5

(chrom = chromosome ID; start, or the first column in variableStep = start location; the remaining numbers = quantitative values)
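Since wiggle positions are 1-based and bedGraph intervals are 0-based, converting between them involves the same off-by-one shift described earlier. A minimal sketch for a fixedStep block, assuming the wiggle default of span=1 when no span is given; `fixedstep_to_intervals` is a hypothetical helper written for this example.

```python
def fixedstep_to_intervals(header, values):
    """Expand a fixedStep block into 0-based half-open (chrom, start, end, value)."""
    # Header looks like: fixedStep chrom=chr3 start=400601 step=100 [span=N]
    params = dict(item.split("=", 1) for item in header.split()[1:])
    chrom = params["chrom"]
    start = int(params["start"])        # 1-based position of the first value
    step = int(params["step"])
    span = int(params.get("span", 1))   # assumed default: one base per value

    intervals = []
    for i, value in enumerate(values):
        pos = start + i * step          # 1-based start of this data element
        intervals.append((chrom, pos - 1, pos - 1 + span, float(value)))
    return intervals


header = "fixedStep chrom=chr3 start=400601 step=100"
intervals = fixedstep_to_intervals(header, ["11", "22", "33"])
for row in intervals:
    print(*row, sep="\t")
```

Each output row is a bedGraph-style line, for example chr3, 400600, 400601, 11.0 for the first value.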

bigWig format

• An indexed binary format derived from the wiggle file

◇ Initially created for the wiggle file, but now bigWig can also be created from bedGraph files

• Faster than the wiggle or bedGraph formats; good for large datasets

• 1-based coordinates

Commonly used file formats for ChIP-seq

• FASTA

• FASTQ – FASTA with quality

• SAM – Sequence Alignment/Map format

• BAM – Binary Sequence Alignment/Map format

• BED – Basic genome interval

• BedGraph

• Wiggle (wig, bigWig) – delimited text format to represent continuous values; bigWig is the indexed binary version

http://genome.ucsc.edu/FAQ/FAQformat.html

ChIP-seq workflow

These materials have been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
