
  • NGS, Cancer and Bioinformatics

    5/3/2015 Yannick Boursin

  • NGS and Clinical Oncology

    • NGS in hereditary cancer genome testing
      • BRCA1/2 (breast/ovarian cancer)
      • XPC (melanoma)
      • ERCC1 (colorectal cancer)

    • NGS for personalized cancer treatment
      • Clinical trials: MOSCATO (GR), SAFIR (GR), SHIVA (Curie), …
      • Ipilimumab (anti-CTLA4), Nivolumab (anti-PD1), Trastuzumab (anti-HER2), Cetuximab (anti-EGFR)

    • Detection of chimeric transcripts
      • Chronic Myeloid Leukemia: Philadelphia chromosome (BCR/ABL)
      • Non-Small-Cell Lung Cancer: EML4-ALK

  • NGS and Oncology

    NGS is now widely used as:

    • A research tool to screen large numbers of cancer samples

    • A clinical/diagnostic tool in daily practice

    These projects require dedicated bioinformatics integration to access and analyze this huge amount of data.

  • Why do we need computers for NGS

    Sequencing data size evolution

    Needs to address:

    • Store petabytes of data (1 PB = 1000 TB)

    • Share data around the world through networks

    • Analyze huge amounts of data with complex algorithms

  • Bioinformatics and Oncology

    • Problem: finding, extracting, and presenting relevant information.

    • Partial solution: designing workflows to ease data analysis.

  • Interdisciplinary collaboration

    Bioinformatics acts as a hub between the different fields. Trust between partners is needed, as is training, for efficient mutual understanding.

  • Standard Workflow for NGS Analysis


    A typical NGS workflow

  • Step 1: Quality Check and improvements


  • NGS data: what do they look like?

    A raw data file (.fastq, .sff, .fa, .csfasta/.qual) contains millions of short reads, either all of the same length (SOLiD, HiSeq) or of varying length (Ion PGM/Proton).

    Enhanced view of the reads in a fastq file

  • FASTQ format

    • 1 sequence = 1 read = 4 lines in the file

    • First line = sequence identifier

    • Fourth line = quality

    • Quality is ASCII-encoded (reduces the file size)
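The four-lines-per-read layout can be illustrated with a small reader. This is a minimal sketch, not a replacement for an established parser: it assumes the common unwrapped FASTQ layout (identifier line starting with `@`, sequence on the second line, a `+` separator on the third, quality on the fourth), and the names `FastqRecord` and `read_fastq` are hypothetical.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str  # first line, without the leading '@'
    sequence: str    # second line
    quality: str     # fourth line, ASCII-encoded quality string


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record per group of 4 lines (assumes unwrapped FASTQ)."""
    while True:
        header = handle.readline()
        if not header:          # end of file
            return
        header = header.rstrip("\r\n")
        if not header:          # tolerate blank lines between records
            continue
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# Two made-up reads, used only to exercise the reader.
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\n!!#I\n")
records = list(read_fastq(example))
```

In practice the same function could be given an open file handle instead of the in-memory example.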

  • Sequence quality encoding

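As an illustration of quality encoding, the sketch below decodes a quality string into Phred scores and error probabilities. It assumes the Phred+33 offset (Sanger and recent Illumina files); older Illumina pipelines used an offset of 64, so the `offset` parameter would need to be checked against the actual data. The function names are hypothetical.

```python
from typing import List


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Convert each ASCII character to its Phred score (assumed offset: 33)."""
    return [ord(char) - offset for char in quality]


def error_probability(score: int) -> float:
    """Probability that the base call is wrong: P = 10 ** (-Q / 10)."""
    return 10 ** (-score / 10)


scores = phred_scores("!+5?I")
probabilities = [error_probability(q) for q in scores]
```

Under this encoding a score of 20 corresponds to a 1% chance of a wrong base call, and a score of 30 to 0.1%.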

  • Why look at sequencing quality?

    • Data quality is very important for various downstream analyses:
      • Sequence assembly or mapping
      • Variant detection
      • Gene expression studies
      • ...

    • If data quality is poor:
      • Try to find the reason
      • Can we correct/improve the quality?
      • It may lead to erroneous conclusions

  • Quality controls on raw reads: which metrics to check?

    Mainly:
      • Quality score per base and over the reads

    But also:
      • Read length distribution
      • Sequence content per base and % GC
      • K-mer content
      • Overrepresented sequences
      • Duplicated reads
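Some of the secondary metrics listed above are simple to compute by hand. The following is a minimal sketch with hypothetical helper names; it treats two reads as duplicates only when their sequences are strictly identical, which is a simplifying assumption.

```python
from collections import Counter
from typing import Dict, List


def gc_percent(sequence: str) -> float:
    """Percentage of G and C bases in one read."""
    if not sequence:
        return 0.0
    gc = sum(1 for base in sequence.upper() if base in "GC")
    return 100.0 * gc / len(sequence)


def length_distribution(reads: List[str]) -> Dict[int, int]:
    """Map each read length to the number of reads of that length."""
    return dict(Counter(len(read) for read in reads))


def duplicate_fraction(reads: List[str]) -> float:
    """Fraction of reads that are exact copies of an earlier read."""
    if not reads:
        return 0.0
    counts = Counter(reads)
    duplicates = sum(n - 1 for n in counts.values())
    return duplicates / len(reads)


reads = ["ACGT", "ACGT", "GGCC", "AT"]  # toy data for illustration
```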

  • Quality scores

    • Per base (box-and-whisker plot): shows whether base calls fall into low quality, commonly towards the end of a read

    • Per sequence (mean quality distribution): shows whether a subset of your sequences has universally low quality values
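The two views of quality scores can be sketched as follows. This is an illustrative example only: it assumes Phred+33 quality strings, the function names are hypothetical, and the median per position stands in for the full box-and-whisker summary.

```python
from statistics import mean, median
from typing import List


def per_base_quality(quality_strings: List[str], offset: int = 33) -> List[List[int]]:
    """For each read position, collect the scores observed at that position."""
    columns: List[List[int]] = []
    for quality in quality_strings:
        for position, char in enumerate(quality):
            if position == len(columns):
                columns.append([])
            columns[position].append(ord(char) - offset)
    return columns


def per_sequence_mean_quality(quality_strings: List[str], offset: int = 33) -> List[float]:
    """Mean score of each read, used to spot reads that are poor throughout."""
    return [mean(ord(c) - offset for c in q) for q in quality_strings if q]


quals = ["IIII5", "III5+", "II?5!"]  # toy quality strings
medians = [median(column) for column in per_base_quality(quals)]
means = per_sequence_mean_quality(quals)
```

On this toy input the per-position medians drop towards the end of the reads, which is the pattern the per-base plot is meant to reveal.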


  • Reads cleaning: removing bad quality bases

    • After QC, we need to remove low-quality bases.

    • This is often done by scanning reads with a sliding window algorithm.

    Read-ends trimming by a quality trimming algorithm. In red: bad quality bases. In blue: good quality bases.
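A sliding window quality trimmer can be sketched as below. This is a simplified illustration and not the exact algorithm of any particular tool: it assumes Phred+33 qualities, scans from the 5' end, and cuts the read at the start of the first window whose mean quality falls below the threshold. The function name and default values are assumptions.

```python
from typing import Tuple


def sliding_window_trim(
    sequence: str,
    quality: str,
    window: int = 4,
    min_mean_q: float = 20.0,
    offset: int = 33,
) -> Tuple[str, str]:
    """Trim the 3' end of a read once the windowed mean quality drops too low."""
    scores = [ord(char) - offset for char in quality]
    if not scores:
        return sequence, quality
    cut = len(scores)
    # Reads shorter than the window are evaluated as a single window.
    last_start = max(len(scores) - window + 1, 1)
    for start in range(last_start):
        chunk = scores[start:start + window]
        if sum(chunk) / len(chunk) < min_mean_q:
            cut = start  # discard everything from this window onwards
            break
    return sequence[:cut], quality[:cut]


trimmed = sliding_window_trim("ACGTACGTAC", "IIIIII##!!")
```

Note that cutting at the window start may discard a good base at the boundary; real tools differ in how they handle this edge.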

  • Reads cleaning: adapters removal

    • An adapter is a small piece of known DNA located at the end of the reads

    • Adapter roles:
      • Attach the read to the sequencer flowcell
      • Allow specific PCR enrichment of reads carrying an adapter
      • Enable multiplex sequencing (several samples mixed in one run)

    • Available tools to trim adapters:
      • Cutadapt
      • Trimmomatic
      • RmAdapter

    In blue: adapters. In orange: informative part of the read.
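To make adapter removal concrete, here is a minimal sketch of 3' adapter trimming by exact matching. It is not how the tools listed above work internally (they tolerate sequencing errors); the function name, the `min_overlap` default, and the adapter sequence used in the example are assumptions chosen for illustration.

```python
def trim_3prime_adapter(read: str, adapter: str, min_overlap: int = 3) -> str:
    """Remove an exact adapter match, or an adapter prefix at the read end."""
    # Case 1: the whole adapter occurs inside the read.
    position = read.find(adapter)
    if position != -1:
        return read[:position]
    # Case 2: only the beginning of the adapter fits at the end of the read.
    longest = min(len(adapter) - 1, len(read))
    for overlap in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[:len(read) - overlap]
    return read  # no adapter found


ADAPTER = "AGATCGGAAGAGC"  # illustrative adapter sequence (assumption)
full = trim_3prime_adapter("ACGTTGCA" + ADAPTER + "TT", ADAPTER)
partial = trim_3prime_adapter("ACGTTGCA" + "AGATC", ADAPTER)
clean = trim_3prime_adapter("ACGTTGCA", ADAPTER)
```

The `min_overlap` threshold is a trade-off: a low value removes more adapter remnants but also trims more reads that end with a few matching bases by chance.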


  • Step 2: Short Reads Alignment


  • Reads alignment - Vocabulary

    Reference genome: a known sequence, assumed to be as close as possible to the input genome, which is used as an anchor to organize the information from individual reads.

    Alignment (mapping): read alignment aims to turn the information from individual reads into an organized and reduced set of information by giving each read a genomic position.

    Mismatch: a difference between two aligned nucleotides.

    Gap: a break within the read alignment (i.e. a small insertion/deletion).

    Indels: insertions/deletions relative to the reference genome.

    Mappability: uniqueness of a region (repeated region = low mappability, unique region = good mappability).

  • Reads alignment – Two strategies

    Read alignment aims to turn the information from individual reads into an organized and reduced set of information.

    Two strategies can be applied:

    - De novo read assembly: used when no reference genome is available. It aims to reconstruct long scaffolds from the information in individual reads.

    - Alignment on a reference genome: the reads are directly compared to a known reference genome.

  • Alignment on a reference genome

    The reference genome is a known sequence, assumed to be as close as possible to the input genome, which is used as an anchor to organize the information from individual reads.

    Alignment of reads against reference genome


  • Alignment on a reference genome - Challenges

    New alignment algorithms must address the requirements and characteristics of NGS reads:

    – Millions of reads per run (30x genome coverage)
    – Reads of different sizes (35 bp to 200 bp)
    – Different types of reads (single-end, paired-end, mate-pair, etc.)
    – Base-calling quality factors
    – Sequencing errors (~1%)
    – Repetitive regions
    – Sequenced organism vs. reference genome
    – Must adjust to evolving sequencing technologies and data formats
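The relation between read count, read length and coverage can be made explicit with a short calculation: mean coverage is the number of reads times the read length divided by the genome size. The sketch below uses a rounded genome size of 3 billion bases and 100 bp reads purely as illustrative assumptions; the function names are hypothetical.

```python
import math


def mean_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of times each base is sequenced: N * L / G."""
    return n_reads * read_length / genome_size


def reads_for_coverage(coverage: float, read_length: int, genome_size: int) -> int:
    """Number of reads needed to reach a target mean coverage."""
    return math.ceil(coverage * genome_size / read_length)


GENOME_SIZE = 3_000_000_000  # rounded human genome size (assumption)
needed = reads_for_coverage(30, 100, GENOME_SIZE)
achieved = mean_coverage(needed, 100, GENOME_SIZE)
```

This ignores unmapped, duplicated and trimmed reads, so actual sequencing depth requirements would be somewhat higher.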

  • Alignment on a reference genome – Bioinformatics tools


  • Finding the best alignment - Rationale

    Given a reference and a set of reads, report at least one “good” local alignment for each read, if one exists.

    What is “good”? For now, we concentrate on:

    – Fewer mismatches is better

    – Failing to align a low-quality base is better than failing to align a high-quality base

    Alignments are ranked with a scoring system, e.g. match score (1), mismatch penalty (3), gap open penalty (5), gap extension penalty (2). The best alignment is the one with the highest score.
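The scoring system can be illustrated by scoring an alignment that has already been computed. This sketch takes the example values from the slide (match 1, mismatch 3, gap open 5, gap extension 2) and assumes that a gap of length n costs the open penalty plus n times the extension penalty; aligners differ on that convention. The function name and the `-` gap symbol are assumptions.

```python
def alignment_score(
    read_aln: str,
    ref_aln: str,
    match: int = 1,
    mismatch: int = 3,
    gap_open: int = 5,
    gap_extend: int = 2,
) -> int:
    """Score two gapped strings of equal length ('-' marks a gap)."""
    if len(read_aln) != len(ref_aln):
        raise ValueError("aligned strings must have the same length")
    score = 0
    current_gap = None  # row holding the gap we are currently inside, if any
    for read_base, ref_base in zip(read_aln, ref_aln):
        if read_base == "-" or ref_base == "-":
            row = "read" if read_base == "-" else "ref"
            if current_gap != row:
                score -= gap_open      # a new gap starts here
            score -= gap_extend        # every gap position pays the extension
            current_gap = row
        else:
            current_gap = None
            score += match if read_base == ref_base else -mismatch
    return score


perfect = alignment_score("ACGTACGT", "ACGTACGT")
one_mismatch = alignment_score("ACGTTCGT", "ACGTACGT")
two_base_gap = alignment_score("ACG--CGT", "ACGTACGT")
```

With these values a single mismatch costs less than opening a gap, so an aligner using this scheme would prefer to explain a difference by a mismatch when both explanations are possible.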

  • Alignment key parameters - Repeats


    Approximately 50% of the human genome consists of repeats.

    Treangen T.J. and Salzberg S.L. 2012. Natu