NGS data format and General Quality Control
Data format “Flowchart”
Sequencer raw data → Fastq → SAM/BAM
Fastq file
• Used to record raw reads coming off the sequencers
• Each record contains four lines
• Parameters such as read length are usually set by the sequencer
Fastq file
• Line 1 begins with a '@' character and is followed by a sequence identifier and an optional description (like a FASTA title line).
• Line 2 is the raw sequence letters. The read length is the length of the string.
• Line 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again.
• Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence.
http://en.wikipedia.org/wiki/FASTQ_format
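The four-line layout above can be sketched as a small reader. This is a minimal illustration, not a replacement for an established parser: it assumes each record occupies exactly four lines (no line-wrapped sequences), and the names `FastqRecord` and `read_fastq` as well as the example read are hypothetical.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str  # line 1 without the leading '@' (ID plus optional description)
    sequence: str    # line 2, the raw sequence letters
    quality: str     # line 4, one quality symbol per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream, assuming strict four-line records."""
    while True:
        header = handle.readline()
        if not header:  # end of file
            return
        header = header.rstrip("\r\n")
        if not header:  # tolerate blank lines between records
            continue
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@"):
            raise ValueError(f"line 1 must start with '@': {header!r}")
        if not separator.startswith("+"):
            raise ValueError(f"line 3 must start with '+': {separator!r}")
        if len(sequence) != len(quality):
            raise ValueError("lines 2 and 4 must have the same length")
        yield FastqRecord(header[1:], sequence, quality)


# Made-up example record: a 20-base read with a uniform quality string.
bases = "GATTTGGGGTTCAAAGCAGT"
example = f"@read1 sample description\n{bases}\n+\n{'I' * len(bases)}\n"
records = list(read_fastq(io.StringIO(example)))
print(records[0].identifier, len(records[0].sequence))
```

The read length is simply `len(record.sequence)`, and the length check enforces the rule that Line 4 must have as many symbols as Line 2 has letters.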
General quality control of raw reads
• Using FASTQC
– A tool that implements some general rules
– Basic Statistics
– Per base sequence quality
– Per sequence quality scores
– Per base sequence content
– Per base GC content
– Per sequence GC content
– Per base N content
– Sequence Length Distribution
– Sequence Duplication Levels
– Overrepresented sequences
– Kmer Content
Quality scores
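As a rough illustration of how the quality symbols in a Fastq record map to numbers: a Phred score Q corresponds to an error probability of 10^(-Q/10), and each score is stored as a single ASCII character. The sketch below assumes the Phred+33 (Sanger) offset; older Illumina pipelines used an offset of 64, so the `offset` parameter should be checked against the data at hand. The function names are hypothetical.

```python
from typing import List


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a quality string into Phred scores (assumes Phred+33 by default)."""
    return [ord(symbol) - offset for symbol in quality]


def error_probability(score: int) -> float:
    """Probability that a base call with this Phred score is wrong."""
    return 10 ** (-score / 10)


scores = phred_scores("II5+!")
print(scores)                        # [40, 40, 20, 10, 0]
print(error_probability(scores[2]))  # Q20 corresponds to a 1% error rate
```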
Per base “N” percentage
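The per base “N” percentage can be sketched as follows: for each position in the read, count how many reads have an uncalled base (N) there and divide by the number of reads that reach that position. This is an illustrative calculation, not FastQC's own implementation; the function name and the example reads are made up.

```python
from typing import List, Sequence


def per_base_n_percentage(reads: Sequence[str]) -> List[float]:
    """Percentage of 'N' calls at each read position across all reads."""
    if not reads:
        return []
    max_length = max(len(read) for read in reads)
    n_counts = [0] * max_length  # number of N calls seen at each position
    totals = [0] * max_length    # number of reads covering each position
    for read in reads:
        for position, base in enumerate(read):
            totals[position] += 1
            if base in "Nn":
                n_counts[position] += 1
    return [100.0 * n / total for n, total in zip(n_counts, totals)]


# Three made-up reads of length 4.
percentages = per_base_n_percentage(["ACGN", "ANGT", "ACNN"])
print([round(p, 1) for p in percentages])
```

Reads shorter than the longest one only contribute to the positions they cover, so the denominator can differ between positions.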
Sample FASTQC reports
Good quality: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/good_sequence_short_fastqc/fastqc_report.html
Bad quality: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc/fastqc_report.html
Data format “Flowchart”
Sequencer → Fastq → SAM/BAM
SAM/BAM
• SAM stands for Sequence Alignment/Map
• BAM is the binary form of SAM
• Used for mapped/aligned reads
• Generated by NGS mappers/aligners
SAM
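A SAM file is plain text: header lines start with '@', and each alignment line has at least 11 tab-separated mandatory fields (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL). The sketch below reads one such line by hand to show the layout; the names `SamAlignment`, `parse_sam_line` and `is_unmapped` and the example line are hypothetical, and real work would normally go through an established SAM/BAM library.

```python
from typing import NamedTuple, Optional


class SamAlignment(NamedTuple):
    qname: str  # read name
    flag: int   # bitwise flag
    rname: str  # reference sequence name
    pos: int    # 1-based leftmost mapping position
    mapq: int   # mapping quality
    cigar: str  # alignment description
    seq: str    # read sequence
    qual: str   # quality string, as in Fastq line 4


def parse_sam_line(line: str) -> Optional[SamAlignment]:
    """Parse one SAM line; return None for header lines (starting with '@')."""
    if line.startswith("@"):
        return None
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    return SamAlignment(
        qname=fields[0], flag=int(fields[1]), rname=fields[2],
        pos=int(fields[3]), mapq=int(fields[4]), cigar=fields[5],
        seq=fields[9], qual=fields[10],
    )


def is_unmapped(alignment: SamAlignment) -> bool:
    """Bit 0x4 of FLAG marks a read that did not map."""
    return bool(alignment.flag & 0x4)


# Made-up alignment line: a 4-base read mapped to chr1 at position 100.
example_line = "read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\n"
alignment = parse_sam_line(example_line)
print(alignment.rname, alignment.pos, is_unmapped(alignment))
```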
BAM