140127 rtg vcfeval vcf comparison tool

Comparing Variant Calls Francisco M. De La Vega, D.Sc. Visiting Scholar, Department of Genetics Stanford University School of Medicine In collaboration with Real Time Genomics, Inc. GENOME-IN-A-BOTTLE WORKSHOP


Page 1: 140127 rtg vcfeval vcf comparison tool

Comparing Variant Calls

Francisco M. De La Vega, D.Sc.
Visiting Scholar, Department of Genetics
Stanford University School of Medicine

In collaboration with Real Time Genomics, Inc.

GENOME-IN-A-BOTTLE WORKSHOP

Page 2: 140127 rtg vcfeval vcf comparison tool

rtgTools v1.0

A toolkit to compare and analyze VCFs

• vcfeval – comparison of VCFs for ROC curves
• rocplot – draw ROC curves from vcfeval output
• mendelian – counts of Mendelian inheritance errors in pedigrees
• vcfstats – basic statistics of VCF files
• vcffilter – filtering of VCFs by scores, etc.
• vcfannotate – annotation of VCF files
• vcfmerge – merge VCF files

Compiled Java code is freely available at the GiaB repository:

ftp://ftp-trace.ncbi.nih.gov/giab/ftp/tools/RTG/

Page 3: 140127 rtg vcfeval vcf comparison tool


Issues in representation of complex calls

Indel in homopolymer

Reference CAAAAAAG
Baseline  C..AAAAG
Called    CAAAA..G

After replay:

Baseline  CAAAAG
Called    CAAAAG

MNPs

Reference CAACGTAAG
Baseline  CAATGTCAG
Called    CAATGTCAG
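The "replay" idea on this slide can be illustrated with a minimal Python sketch: apply each call set to the reference and compare the resulting sequences rather than the records. The `replay` function and the `(pos, ref, alt)` tuples with 0-based positions are hypothetical names chosen for this illustration, not part of rtgTools.

```python
def replay(reference, variants):
    """Apply non-overlapping variants to a reference string.

    Each variant is a hypothetical (pos, ref, alt) tuple with a 0-based pos.
    """
    pieces = []
    cursor = 0
    for pos, ref, alt in sorted(variants):
        if pos < cursor:
            raise ValueError("variants overlap")
        if reference[pos:pos + len(ref)] != ref:
            raise ValueError("REF allele does not match the reference")
        pieces.append(reference[cursor:pos])  # unchanged stretch before the variant
        pieces.append(alt)                    # substituted allele
        cursor = pos + len(ref)
    pieces.append(reference[cursor:])         # unchanged tail
    return "".join(pieces)


# Homopolymer indel: the same 2-base deletion placed at either end of the A run.
reference = "CAAAAAAG"
baseline = [(0, "CAA", "C")]
called = [(4, "AAA", "A")]
indel_matches = replay(reference, baseline) == replay(reference, called)

# MNP: two separate SNPs versus one multi-nucleotide substitution.
mnp_reference = "CAACGTAAG"
mnp_baseline = [(3, "C", "T"), (6, "A", "C")]
mnp_called = [(3, "CGTA", "TGTC")]
mnp_matches = replay(mnp_reference, mnp_baseline) == replay(mnp_reference, mnp_called)

print(indel_matches, mnp_matches)
```

In both cases the records differ but the replayed sequences are identical, which is why a record-by-record comparison would undercount matches.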

Page 4: 140127 rtg vcfeval vcf comparison tool

Issues in representation of complex calls

Dinucleotide repeat

Reference ACGTACCAGATATCACAACATATATATA
Baseline  ACGGACCAG..ATCACAACATATATATATA
Called    ACGGACCAGAT..CACAACATATATATATA

After replay:

Baseline  ACGGACCAGATCACAACATATATATATA
Called    ACGGACCAGATCACAACATATATATATA

Page 5: 140127 rtg vcfeval vcf comparison tool

Best path Link mutations ROC

Comparison of variant call set with baseline set

Basic rules
• Match the baseline and called sequences so as to maximize true positives and minimize false positives and false negatives.
• True positives + false negatives = total calls in the baseline
• Heterozygous calls match only if both the zygosity and the alleles agree

Path creation
• A path is a selection of a subset of calls
• Best path: the path that maximizes true positives and minimizes errors
• In theory, there is an exponential number of paths; in practice this can be solved by dynamic programming
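The rules above can be sketched in Python for the simplest (haploid, homozygous-only) case. This is a brute-force enumeration of every subset pair, written only to show what "best path" means and why the search space is exponential; it is not the dynamic programming approach the slide refers to. All names (`replay`, `best_path`, the `(pos, ref, alt)` tuples with 0-based positions) are hypothetical.

```python
from itertools import combinations


def replay(reference, variants):
    """Return the sequence after applying (pos, ref, alt) variants, or None if they overlap."""
    pieces, cursor = [], 0
    for pos, ref, alt in sorted(variants):
        if pos < cursor:
            return None
        pieces += [reference[cursor:pos], alt]
        cursor = pos + len(ref)
    return "".join(pieces) + reference[cursor:]


def all_subsets(items):
    """Yield every subset of items; there are 2**len(items) of them."""
    for size in range(len(items) + 1):
        yield from combinations(items, size)


def best_path(reference, baseline, called):
    """Pick the subset pair whose replayed sequences agree and that keeps the most calls."""
    best_score, best = -1, None
    for kept_b in all_subsets(baseline):
        hap_b = replay(reference, kept_b)
        if hap_b is None:
            continue
        for kept_c in all_subsets(called):
            if replay(reference, kept_c) != hap_b:
                continue
            score = len(kept_b) + len(kept_c)  # fewest excluded calls overall
            if score > best_score:
                best_score, best = score, (kept_b, kept_c)
    kept_b, kept_c = best  # the empty pair always matches, so best is never None
    return {
        "TP": len(kept_b),                  # baseline calls on the best path
        "FN": len(baseline) - len(kept_b),  # baseline calls excluded
        "FP": len(called) - len(kept_c),    # called variants excluded
    }


reference = "ACAGTCACGG"
baseline = [(2, "A", "G"), (8, "G", "T")]
called = [(2, "A", "G"), (5, "C", "T")]
result = best_path(reference, baseline, called)
print(result)
```

Here TP is counted on the baseline side, so TP + FN equals the number of baseline calls, as the basic rules require. Heterozygous calls would need two sequences per sample and are left out of this sketch.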

Page 6: 140127 rtg vcfeval vcf comparison tool

Path creation - simple homozygous case

[Figure: Baseline and Called tracks shown against the Reference, with labels a b c d e f g h]

Page 7: 140127 rtg vcfeval vcf comparison tool

Path creation - simple homozygous case

[Figure: Baseline and Called tracks shown against the Reference, with labels a b c d e f g h; the Best Path is marked, along with the excluded false positive and false negative calls]

Page 8: 140127 rtg vcfeval vcf comparison tool

Path creation - simple heterozygous case (non-phased)

[Figure: Baseline and Called tracks shown against the Reference, with labels a b c d e f]

Page 9: 140127 rtg vcfeval vcf comparison tool

Path creation - simple heterozygous case (non-phased)

[Figure: Baseline and Called tracks shown against the Reference, with labels a b c d e f; the Best Path is marked, along with the excluded false positive and false negative calls]

Page 10: 140127 rtg vcfeval vcf comparison tool

Why is weighting needed?

TP + FN = Total_baseline

Reference CAACAACTATCCTC....ATCT....GC
Baseline  CAACAACTATCCTCATCTATCTATCTGC
Called    CAACAACTATCCTCATCTATCTATCTGC

Page 11: 140127 rtg vcfeval vcf comparison tool

Sync points

Reference ACAGTCACGG
Baseline  ACGGTCACTG
Called    ACGGTTACGG

Reference AC AGT CAC GG
Baseline  AC GGT CAC TG
Called    AC GGT TAC GG
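The segmentation shown above can be reproduced with a short Python sketch. It assumes that a sync point falls at the start of each group of overlapping variant positions taken across both call sets; the slide does not state the rule, so this is one reading that happens to yield the same split. The function names and the `(pos, ref, alt)` tuples with 0-based positions are hypothetical.

```python
def sync_points(baseline, called):
    """Return the start position of each cluster of overlapping variants (assumed rule)."""
    spans = sorted((pos, pos + len(ref)) for pos, ref, _alt in baseline + called)
    points = []
    cluster_end = -1
    for start, end in spans:
        if start >= cluster_end:
            points.append(start)      # a new cluster begins here
            cluster_end = end
        else:
            cluster_end = max(cluster_end, end)  # extend the current cluster
    return points


def split_at(sequence, points):
    """Cut a sequence into segments at the given positions."""
    bounds = [0, *points, len(sequence)]
    return [sequence[a:b] for a, b in zip(bounds, bounds[1:])]


baseline = [(2, "A", "G"), (8, "G", "T")]
called = [(2, "A", "G"), (5, "C", "T")]
points = sync_points(baseline, called)

for name, seq in [("Reference", "ACAGTCACGG"),
                  ("Baseline", "ACGGTCACTG"),
                  ("Called", "ACGGTTACGG")]:
    print(name, " ".join(split_at(seq, points)))
```

Cutting all three sequences at the same coordinates only works here because every variant is a SNP; with indels the baseline and called sequences would need their own coordinate mapping.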

Page 12: 140127 rtg vcfeval vcf comparison tool

Weighting

where B is the number of baseline variants between the current sync point (S_n) and the previous sync point (S_{n-1}), and C is the number of called variants between the current and previous sync points.
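The weighting formula itself is not preserved in this transcript. As an illustration only, the Python sketch below assumes each matched called variant in a sync-point interval receives weight B / C, a choice made because it gives two called variants that together match one baseline variant a weight of 0.5 each. The function name and input layout are hypothetical.

```python
from fractions import Fraction


def called_weights(intervals):
    """Assign each matched called variant the assumed weight B / C.

    intervals: list of (B, C) counts, one pair per interval between sync points.
    """
    weights = []
    for b, c in intervals:
        if c == 0:
            continue  # no called variants to weight in this interval
        weights.extend([Fraction(b, c)] * c)
    return weights


# Four intervals with one baseline and one called variant each,
# then one interval where two called variants match one baseline variant.
intervals = [(1, 1)] * 4 + [(1, 2)]
weights = called_weights(intervals)
weighted_tp = sum(weights)

print([float(w) for w in weights], float(weighted_tp))
```

Under this assumption the weighted true positive total equals the number of matched baseline variants, which keeps TP + FN equal to the baseline total.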

Page 13: 140127 rtg vcfeval vcf comparison tool

Simple homozygous weighting

[Figure: Baseline and Called tracks with labels a b c d e f and sync points marked; weights 1 1 1 1 1 1 on the matched calls, weight 1 on the excluded false positive and weight 1 on the excluded false negative]

Type  Weighted total
TP    6
FP    1
FN    1

Page 14: 140127 rtg vcfeval vcf comparison tool

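Simple heterozygous case (non-phased) weighting

[Figure: Baseline and Called tracks with labels a b c d e f and a sync point marked; weights 1 1 1 1 on the matched calls, weight 1 on the excluded false positive and weight 2 on the excluded false negative]

Type  Weighted total
TP    4
FP    1
FN    2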

Page 15: 140127 rtg vcfeval vcf comparison tool

Complex weighting

[Figure: Called and Baseline tracks with labels a b c d e f and a sync point marked; weights 1 1 1 1 0.5 0.5]

Type  Weighted total
TP    5
FP    0
FN    0

Page 16: 140127 rtg vcfeval vcf comparison tool

ROC Plot

Page 17: 140127 rtg vcfeval vcf comparison tool

http://biorxiv.org/content/early/2014/01/24/001958

Page 18: 140127 rtg vcfeval vcf comparison tool

Acknowledgements

RTG, Hamilton, New Zealand
• John Cleary
• Len Trigg
• Mehul Rathoud

Data and tools to compare with phased standard released publicly at NIST Genome-in-a-Bottle repository (s3://giab)

This work was done while the presenter was employed by Real Time Genomics Inc., San Bruno, CA.

© 2014 Real Time Genomics, Inc. All rights reserved.