Analysis of the RNAseq Genome Annotation Assessment Project

Page 1: Analysis  of the RNAseq Genome Annotation Assessment Project

Analysis of the RNAseq Genome Annotation Assessment Project

by Subhajyoti De

Page 2: Analysis  of the RNAseq Genome Annotation Assessment Project

The RNAseq Genome Annotation Assessment Project

Introduction and a summary of submissions

Analysis methodology

Nucleotide level analysis

Exon level analysis

Transcript level analysis

Missing and wrong genes

• The RGASP aims to assess the current progress of automatic gene building using RNAseq as its primary dataset.

• More specifically, we aim to evaluate the status of computational methods to:

  • map human RNAseq data,
  • assemble them into transcripts, and
  • quantify the abundance of those transcripts in particular datasets.

• Promising transcript predictions not covered by the Gencode annotation will be validated by experimental methods.

Page 3: Analysis  of the RNAseq Genome Annotation Assessment Project

3 species: human, worm and fly.

Multiple RNA-seq datasets for each organism.

15 submitters.

304 submissions.

Page 4: Analysis  of the RNAseq Genome Annotation Assessment Project

Analysis methodology

1. We carried out independent evaluations for the coding portions of the mRNA transcripts (CDS-focused) and for the mRNA transcripts as a whole (mRNA-focused).

2. Analysis was carried out at multiple levels:
   1. Nucleotide level
   2. Exon level
   3. Transcript level

3. For each level, we calculated the sensitivity and specificity of the predictions (as discussed later). As a summary measure, we also reported the average of the two statistics.
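As a small illustration of the summary measure, the sketch below averages one sensitivity and one specificity value. The function name `summary_average` is hypothetical (it does not come from the slides); the two input values are the ones reported for Jel_hum_qna_solexa_hummul in the human CDS-focused nucleotide-level table of this presentation.

```python
def summary_average(sensitivity: float, specificity: float) -> float:
    """Arithmetic mean of sensitivity and specificity (both given in percent)."""
    return (sensitivity + specificity) / 2.0


# Values reported for Jel_hum_qna_solexa_hummul (human, CDS-focused, nucleotide level).
average = summary_average(84.936, 82.198)
print(round(average, 3))
```

The printed value can be compared with the Average column of that table (83.567).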

Page 5: Analysis  of the RNAseq Genome Annotation Assessment Project

Nucleotide level analysis

[Figure: overlap between the annotation set and the prediction set, showing true positives, false positives and false negatives.]

Sensitivity = (number of annotated nucleotides correctly predicted) / (number of annotated nucleotides in the annotation set)

Specificity = (number of predicted nucleotides that are also annotated) / (number of predicted nucleotides in the prediction set)

Page 6: Analysis  of the RNAseq Genome Annotation Assessment Project

Nucleotide level analysis

Points to note:

1. Nucleotide predictions had to be on the same strand as the annotations to be considered correct.

2. Individual nucleotides present in multiple transcripts in either the annotation or the predictions are considered only once.

3. As a summary measure, we also calculated the arithmetic average of specificity and sensitivity.
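The nucleotide-level rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the evaluation code used for RGASP: the exon layout `(chromosome, strand, start, end)` with 1-based inclusive coordinates and the function names are hypothetical. Using sets of strand-specific positions makes each nucleotide count once, however many transcripts cover it.

```python
from typing import Iterable, Set, Tuple

# Hypothetical exon layout: (chromosome, strand, start, end), 1-based inclusive.
Exon = Tuple[str, str, int, int]


def covered_positions(exons: Iterable[Exon]) -> Set[Tuple[str, str, int]]:
    """Expand exons into a set of (chromosome, strand, position) entries.

    A set stores each nucleotide once, even if several transcripts cover it.
    """
    positions: Set[Tuple[str, str, int]] = set()
    for chrom, strand, start, end in exons:
        for pos in range(start, end + 1):
            positions.add((chrom, strand, pos))
    return positions


def nucleotide_scores(annotation: Iterable[Exon],
                      prediction: Iterable[Exon]) -> Tuple[float, float, float]:
    """Return (sensitivity, specificity, average) in percent."""
    ann = covered_positions(annotation)
    pred = covered_positions(prediction)
    true_pos = len(ann & pred)  # same chromosome, strand and position
    sensitivity = 100.0 * true_pos / len(ann) if ann else 0.0
    specificity = 100.0 * true_pos / len(pred) if pred else 0.0
    return sensitivity, specificity, (sensitivity + specificity) / 2.0


annotation = [("chr1", "+", 1, 10), ("chr1", "+", 5, 14)]   # covers 1..14 once
prediction = [("chr1", "+", 8, 21), ("chr1", "-", 1, 10)]   # minus strand never matches
print(nucleotide_scores(annotation, prediction))
```

Expanding every position is simple but memory-hungry; for whole genomes, merging intervals per chromosome and strand and summing overlap lengths would give the same counts.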

Page 7: Analysis  of the RNAseq Genome Annotation Assessment Project

Nucleotide level analysis (H. sapiens)

Team                            Sensitivity  Specificity  Average  Focus
Jel_hum_qna_solexa_hummul       84.936       82.198       83.567   cds
Mar_hum_qbo_solexa_             80.161       85.017       82.589   cds
Vic_hum_qna_solexa_hummul_      84.367       68.269       76.318   cds
Tyl_hum_qna_solexa_hummul       75.173       73.467       74.320   cds
Sim_hum_qtr_solexa_hummul       54.280       92.832       73.556   cds
Chr_hum_qbo_solexa_             44.076       57.971       51.024   cds

Team                            Sensitivity  Specificity  Average  Focus
Tyl_hum_qbo_solexa_K562single   69.587       99.308       84.447   exon
Sea_hum_qex_solexa_             48.419       94.289       71.354   exon
Mar_hum_qbo_solexa_K562strand   47.483       82.247       64.865   exon
Sim_hum_qtr_solexa_             32.947       83.172       58.059   exon
Ger_hum_qtr_solexa_hummul       31.904       84.012       57.958   exon
Vic_hum_qbo_solexa_             30.969       84.488       57.729   exon
Lio_hum_qtr_solexa_             34.330       78.622       56.476   exon
Tyl_hum_qna_solexa_hummul       31.668       80.285       55.977   exon
Tho_hum_qbo_solexa_             35.203       69.262       52.233   exon
Chr_hum_qbo_solexa_             44.660       14.236       29.448   exon
Car_hum_qna_solexa_hummul       8.6357       2.4019       5.5188   exon
Jie_hum_qex_solexa_K562single   0.2245       75.762       37.993   exon

93.308

Page 8: Analysis  of the RNAseq Genome Annotation Assessment Project

Nucleotide level analysis (D. melanogaster)

Team                            Sensitivity  Specificity  Average  Focus
Mar_fly_qbo_solexa_MLDmBG3c2    93.129       95.367       94.248   cds
Jel_fly_qna_solexa_flymul       90.746       95.316       93.031   cds
Tyl_fly_qna_solexa_flymul       85.938       93.383       89.661   cds
Vic_fly_qna_solexa_flymul       95.640       83.252       89.446   cds
Gun_fly_qna_solexa_flymul       88.929       72.169       80.549   cds

Team                            Sensitivity  Specificity  Average  Focus
Tyl_fly_qbo_solexa_S2DRSC       94.467       98.472       96.470   exon
Mar_fly_qna_solexa_flymul       85.508       86.379       85.944   exon
Vic_fly_qna_solexa_flymul       70.835       83.555       77.195   exon
Tho_fly_qbo_solexa_MLDmBG3c2    42.051       87.378       64.715   exon
Gun_fly_qtr_solexa_CMEW1CI      72.836       55.486       64.161   exon

Page 9: Analysis  of the RNAseq Genome Annotation Assessment Project

Nucleotide level analysis (C. elegans)

Team                            Sensitivity  Specificity  Average  Focus
Vic_wor_qna_solexa_wormmul      94.658       92.470       93.564   exon
Tyl_wor_qbo_solexa_SRX001872    92.464       90.199       91.331   exon
Mar_wor_qna_helicos_wormmul     90.863       76.515       83.689   exon
Gun_wor_qtr_solexa_SRX001872    90.343       76.350       83.346   exon
Tho_wor_qbo_solexa_wormmul      74.669       72.662       73.665   exon
Ger_wor_qbo_solexa_wormmul      68.993       77.187       73.090   exon
Lio_wor_qtr_solexa_SRX001874    57.334       81.411       69.372   exon

Team                            Sensitivity  Specificity  Average  Focus
Vic_wor_qna_solexa_wormmul      96.455       89.931       93.193   cds
Wol_wor_qex_solexa_SRX004867    92.883       91.719       92.301   cds
Mar_wor_qbo_solexa_SRX004866    91.433       93.062       92.247   cds
Jel_wor_qna_solexa_wormmul      90.805       92.663       91.734   cds
Tyl_wor_qna_solexa_SRX004867    90.328       89.038       89.683   cds
Gun_wor_qtr_solexa_SRX004865    93.610       83.862       88.736   cds
Ger_wor_qbo_solexa_wormmul      75.200       97.186       86.193   cds

Page 10: Analysis  of the RNAseq Genome Annotation Assessment Project

Exon level analysis

[Figure: overlap between the annotation set and the prediction set, showing true positives, false positives and false negatives.]

Sensitivity = (number of annotated exons correctly predicted) / (number of annotated exons in the annotation set)

Specificity = (number of predicted exons that are also annotated) / (number of predicted exons in the prediction set)

Page 11: Analysis  of the RNAseq Genome Annotation Assessment Project

Exon level analysis

Points to note:

1. An exon in the prediction must have identical start and end coordinates and also the same strand as an exon in the annotation to be counted correct.

2. If an exon is present in multiple transcripts in either the annotation or the predictions, it is counted only once.

3. As a summary measure, we also calculated the arithmetic average of specificity and sensitivity.
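The exon-level rules above map naturally onto set operations. The sketch below is an illustration only, with a hypothetical data layout (each transcript is a list of `(chromosome, strand, start, end)` exons) and hypothetical function names; it is not the code used in the assessment.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical layout: an exon is (chromosome, strand, start, end);
# a transcript is a list of exons.
Exon = Tuple[str, str, int, int]
Transcript = List[Exon]


def unique_exons(transcripts: Iterable[Transcript]) -> Set[Exon]:
    """Collect exons across transcripts; shared exons are kept only once."""
    return {exon for transcript in transcripts for exon in transcript}


def exon_scores(annotation: Iterable[Transcript],
                prediction: Iterable[Transcript]) -> Tuple[float, float, float]:
    """Return (sensitivity, specificity, average) in percent."""
    ann = unique_exons(annotation)
    pred = unique_exons(prediction)
    # Tuple equality enforces identical chromosome, strand, start and end.
    true_pos = len(ann & pred)
    sensitivity = 100.0 * true_pos / len(ann) if ann else 0.0
    specificity = 100.0 * true_pos / len(pred) if pred else 0.0
    return sensitivity, specificity, (sensitivity + specificity) / 2.0


annotation = [
    [("chr1", "+", 100, 200), ("chr1", "+", 300, 400)],
    [("chr1", "+", 100, 200), ("chr1", "+", 500, 600)],  # first exon is shared
]
prediction = [
    [("chr1", "+", 100, 200), ("chr1", "+", 300, 405)],  # second exon ends 5 bp late
    [("chr1", "-", 500, 600)],                           # wrong strand
]
print(exon_scores(annotation, prediction))
```

In this toy input only one of the three unique annotated exons is reproduced exactly, which shows how strict the exact-boundary criterion is compared with the nucleotide-level measure.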

Page 12: Analysis  of the RNAseq Genome Annotation Assessment Project

Exon level analysis (H. sapiens)

Team                            Sensitivity  Specificity  Average  Focus
Vic_hum_qbo_solexa_             31.368       65.870       48.619   exon
Mar_hum_qbo_solexa_             32.228       64.186       48.207   exon
Tyl_hum_qbo_solexa_SRX004865    32.932       61.228       47.080   exon
Ger_hum_qtr_solexa_hummul       19.741       58.694       39.217   exon
Sim_hum_qtr_solexa_             16.509       54.381       35.445   exon
Lio_hum_qtr_solexa_             18.151       52.382       35.266   exon
Tho_hum_qbo_solexa_             14.035       33.955       23.995   exon
Chr_hum_qbo_solexa_             2.3731       1.5113       1.9422   exon
Sea_hum_qex_solid_GM12878solid  0.2463       0.8973       0.5718   exon

Team                            Sensitivity  Specificity  Average  Focus
Mar_hum_qbo_solexa_             59.947       78.377       69.162   cds
Vic_hum_qbo_solexa_             57.848       73.337       65.593   cds
Jel_hum_qtr_solexa_             50.251       77.910       64.081   cds
Chr_hum_qbo_solexa_             7.6757       4.8519       6.2638   cds
Tyl_hum_qna_solexa_hummul       49.423       64.729       57.076   cds
Sim_hum_qtr_solexa_             30.725       68.615       49.670   cds

Page 13: Analysis  of the RNAseq Genome Annotation Assessment Project

Exon level analysis (D. melanogaster)

Team                            Sensitivity  Specificity  Average  Focus
Mar_fly_qbo_solexa_Kc167        56.063       64.869       60.466   cds
Vic_fly_qna_solexa_flymul       57.875       53.877       55.876   cds
Jel_fly_qtr_solexa_Kc167        48.299       60.408       54.354   cds
Tyl_fly_qna_solexa_flymul       42.425       57.206       49.815   cds
Gun_fly_qtr_solexa_CMEW1CI      54.588       40.784       47.686   cds

Team                            Sensitivity  Specificity  Average  Focus
Tyl_fly_qbo_solexa_S2DRSC       46.490       56.912       51.701   exon
Mar_fly_qna_solexa_flymul       38.190       49.951       44.071   exon
Vic_fly_qna_solexa_flymul       38.608       44.740       41.674   exon
Gun_fly_qtr_solexa_CMEW1CI      20.878       56.591       38.734   exon
Tho_fly_qbo_solexa_MLDmBG3c2    8.2705       17.651       12.961   exon

Page 14: Analysis  of the RNAseq Genome Annotation Assessment Project

Exon level analysis (C. elegans)

Team                            Sensitivity  Specificity  Average  Focus
Vic_wor_qna_solexa_wormmul      75.471       80.553       78.012   exon
Tyl_wor_qna_solexa_wormmul      60.400       72.415       66.408   exon
Gun_wor_qtr_solexa_SRX001872    42.802       63.928       53.365   exon
Lio_wor_qtr_solexa_SRX001874    23.309       43.959       33.634   exon
Tho_wor_qbo_solexa_SRX004867    9.4368       15.237       12.336   exon

Team                            Sensitivity  Specificity  Average  Focus
Wol_wor_qex_solexa_SRX004867    80.738       78.661       79.699   cds
Vic_wor_qna_solexa_wormmul      71.772       66.978       69.375   cds
Mar_wor_qna_solexa_wormmul      67.100       68.633       67.866   cds
Jel_wor_qna_solexa_wormmul      65.788       67.521       66.655   cds
Tyl_wor_qbo_solexa_SRX001872    60.484       59.180       59.832   cds
Gun_wor_qtr_solexa_SRX004865    61.633       63.122       62.377   cds
Ger_wor_qbo_solexa_SRX004863    20.744       24.157       22.450   cds

Page 15: Analysis  of the RNAseq Genome Annotation Assessment Project

Transcript level analysis

[Figure: overlap between the annotation set and the prediction set, showing true positives, false positives and false negatives.]

Sensitivity = (number of annotated transcripts correctly predicted) / (number of annotated transcripts in the annotation set)

Specificity = (number of predicted transcripts that are also annotated) / (number of predicted transcripts in the prediction set)

Page 16: Analysis  of the RNAseq Genome Annotation Assessment Project

Transcript level analysis

Points to note:

1. We consider a transcript accurately predicted if the number of exons in the transcript and their boundaries match exactly between the annotation and the prediction.

2. For the CDS-focused evaluation, we consider a transcript correctly predicted if the beginning and end of translation are correctly annotated and each of the 5' and 3' splice sites of the coding exons is correct.

3. For the mRNA-focused evaluation, a transcript is counted as correct if all of the exons from the start of transcription to the end of transcription match perfectly between the annotation and prediction sets.
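A minimal sketch of the strict transcript-level comparison is given below. The representation is an assumption made for illustration: a transcript is a hypothetical `(chromosome, strand, exon chain)` tuple, and identical transcripts within one set are collapsed. For a CDS-focused run, one would pass only the coding segments (so the first segment starts at the translation start and the last ends at the translation end); for an mRNA-focused run, the full exons.

```python
from typing import Iterable, List, Tuple

ExonChain = Tuple[Tuple[int, int], ...]
# Hypothetical layout: (chromosome, strand, exon chain sorted by start).
Transcript = Tuple[str, str, ExonChain]


def make_transcript(chrom: str, strand: str,
                    exons: List[Tuple[int, int]]) -> Transcript:
    """Build a hashable transcript with exons sorted by start coordinate."""
    return (chrom, strand, tuple(sorted(exons)))


def transcript_scores(annotation: Iterable[Transcript],
                      prediction: Iterable[Transcript]) -> Tuple[float, float, float]:
    """Return (sensitivity, specificity, average) in percent, exact matching."""
    ann = set(annotation)   # assumption: duplicate transcripts count once
    pred = set(prediction)
    # Equal tuples mean same exon count and identical boundaries on the same strand.
    true_pos = len(ann & pred)
    sensitivity = 100.0 * true_pos / len(ann) if ann else 0.0
    specificity = 100.0 * true_pos / len(pred) if pred else 0.0
    return sensitivity, specificity, (sensitivity + specificity) / 2.0


annotation = [
    make_transcript("chr1", "+", [(100, 200), (300, 400)]),
    make_transcript("chr1", "+", [(100, 200), (300, 400), (500, 600)]),
]
prediction = [
    make_transcript("chr1", "+", [(300, 400), (100, 200)]),  # exact match
    make_transcript("chr1", "+", [(100, 200), (300, 410)]),  # one boundary off
    make_transcript("chr1", "-", [(100, 200), (300, 400)]),  # wrong strand
]
print(transcript_scores(annotation, prediction))
```

A single shifted boundary is enough to turn a prediction into a false positive here, which is why transcript-level scores are expected to be lower than exon-level ones.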

Page 17: Analysis  of the RNAseq Genome Annotation Assessment Project

Transcript level analysis

Human (CDS-focused)

Page 18: Analysis  of the RNAseq Genome Annotation Assessment Project

Relaxed Transcript level analysis

[Figure: overlap between the annotation set and the prediction set, showing true positives, false positives and false negatives.]

Sensitivity = (number of annotated transcripts correctly predicted) / (number of annotated transcripts in the annotation set)

Specificity = (number of predicted transcripts that are also annotated) / (number of predicted transcripts in the prediction set)

Page 19: Analysis  of the RNAseq Genome Annotation Assessment Project

Relaxed Transcript level analysis

Points to note:

1. We consider a transcript ‘accurately’ predicted if the number of exons in the transcript matches exactly between the annotation and the prediction, and their boundaries differ by no more than 5 bp.

2. All other criteria remain the same as for the transcript-level analysis.
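The relaxed criterion can be sketched as a pairwise test between an annotated and a predicted transcript. The code below is illustrative only: the `(chromosome, strand, exon chain)` layout, the function names and the assumption that exon chains are already sorted by start are all hypothetical. Because matching is no longer exact, the sketch counts matched annotated transcripts for sensitivity and matched predicted transcripts for specificity separately.

```python
from typing import Sequence, Tuple

ExonChain = Tuple[Tuple[int, int], ...]
# Hypothetical layout: (chromosome, strand, exon chain sorted by start).
Transcript = Tuple[str, str, ExonChain]


def matches_relaxed(a: Transcript, b: Transcript, tolerance: int = 5) -> bool:
    """Same chromosome and strand, same exon count, boundaries within tolerance."""
    chrom_a, strand_a, exons_a = a
    chrom_b, strand_b, exons_b = b
    if (chrom_a, strand_a) != (chrom_b, strand_b) or len(exons_a) != len(exons_b):
        return False
    return all(
        abs(start_a - start_b) <= tolerance and abs(end_a - end_b) <= tolerance
        for (start_a, end_a), (start_b, end_b) in zip(exons_a, exons_b)
    )


def relaxed_scores(annotation: Sequence[Transcript],
                   prediction: Sequence[Transcript],
                   tolerance: int = 5) -> Tuple[float, float]:
    """Return (sensitivity, specificity) in percent under the relaxed rule."""
    found = sum(
        any(matches_relaxed(a, p, tolerance) for p in prediction) for a in annotation
    )
    supported = sum(
        any(matches_relaxed(p, a, tolerance) for a in annotation) for p in prediction
    )
    sensitivity = 100.0 * found / len(annotation) if annotation else 0.0
    specificity = 100.0 * supported / len(prediction) if prediction else 0.0
    return sensitivity, specificity


annotation = [("chr1", "+", ((100, 200), (300, 400)))]
prediction = [
    ("chr1", "+", ((103, 200), (300, 405))),  # every boundary within 5 bp
    ("chr1", "+", ((100, 200), (300, 406))),  # one boundary off by 6 bp
]
print(relaxed_scores(annotation, prediction))
```

The all-against-all comparison is quadratic; a real evaluation would probably index transcripts by chromosome and strand first.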

Page 20: Analysis  of the RNAseq Genome Annotation Assessment Project

Very relaxed Transcript level analysis

[Figure: overlap between the annotation set and the prediction set, showing true positives, false positives and false negatives.]

Sensitivity = (number of annotated transcripts correctly predicted) / (number of annotated transcripts in the annotation set)

Specificity = (number of predicted transcripts that are also annotated) / (number of predicted transcripts in the prediction set)

Page 21: Analysis  of the RNAseq Genome Annotation Assessment Project

Very relaxed Transcript level analysis

Worm (exon-focused)

Points to note:

1. We consider a transcript ‘accurately’ predicted if
   1. the number of exons in a transcript differs by no more than two (terminal exons only) between the annotation and the prediction, and
   2. the boundaries of all equivalent exons differ by no more than 5 bp between the annotation and the prediction.

2. All other criteria remain the same as for the transcript-level analysis.
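One possible reading of the very relaxed rule is sketched below; the slides do not spell out how "terminal exons only" is applied, so this interpretation is an assumption. Here the longer transcript may carry up to two additional exons, split in any way between its two ends, and the remaining exons must line up one-to-one with the shorter transcript within 5 bp. The data layout and function names are hypothetical, and exon chains are assumed to be sorted by start.

```python
from typing import Tuple

ExonChain = Tuple[Tuple[int, int], ...]
# Hypothetical layout: (chromosome, strand, exon chain sorted by start).
Transcript = Tuple[str, str, ExonChain]


def boundaries_close(a: ExonChain, b: ExonChain, tolerance: int) -> bool:
    """True if both chains have equal length and all boundaries agree within tolerance."""
    return len(a) == len(b) and all(
        abs(sa - sb) <= tolerance and abs(ea - eb) <= tolerance
        for (sa, ea), (sb, eb) in zip(a, b)
    )


def matches_very_relaxed(a: Transcript, b: Transcript,
                         tolerance: int = 5, max_extra: int = 2) -> bool:
    """Allow up to max_extra unmatched terminal exons on the longer transcript."""
    if a[:2] != b[:2]:  # chromosome and strand must agree
        return False
    longer, shorter = (a[2], b[2]) if len(a[2]) >= len(b[2]) else (b[2], a[2])
    extra = len(longer) - len(shorter)
    if not shorter or extra > max_extra:
        return False
    # Slide a window over the longer chain: `left` exons are dropped at one end
    # and `extra - left` at the other, so only terminal exons are ever skipped.
    return any(
        boundaries_close(longer[left:left + len(shorter)], shorter, tolerance)
        for left in range(extra + 1)
    )


annotated = ("chr1", "+", ((100, 200), (300, 400), (500, 600), (700, 800)))
partial = ("chr1", "+", ((302, 400), (500, 603)))              # both terminal exons missing
skipped = ("chr1", "+", ((100, 200), (500, 600), (700, 800)))  # internal exon missing
print(matches_very_relaxed(annotated, partial), matches_very_relaxed(annotated, skipped))
```

With equal exon counts the window has a single position, so the test reduces to the relaxed (5 bp) comparison.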

Page 22: Analysis  of the RNAseq Genome Annotation Assessment Project

'Missing exons' (MEs): annotated exons that do not overlap any predicted exon by at least 1 bp.

'Wrong exons' (WEs): predicted exons that do not overlap any annotated exon by at least 1 bp.

[Figure: annotation set and prediction set, highlighting missed exons and wrong exons.]

'Wrong exons' (WEs) that are predicted independently by more than two predictors are recorded, and some of them will be tested experimentally.
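The definitions above can be sketched as overlap tests. The code is a hedged illustration, not the project's pipeline: the exon layout `(chromosome, strand, start, end)` with inclusive coordinates and all function names are hypothetical. Two further assumptions are labeled in the comments: overlap is only tested on the same strand, and two predictors are taken to agree on a wrong exon when its coordinates are identical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# Hypothetical exon layout: (chromosome, strand, start, end), inclusive coordinates.
Exon = Tuple[str, str, int, int]


def overlaps(a: Exon, b: Exon) -> bool:
    """True if the exons share at least 1 bp (assumption: same strand required)."""
    return a[0] == b[0] and a[1] == b[1] and a[2] <= b[3] and b[2] <= a[3]


def missing_exons(annotation: Iterable[Exon], prediction: Iterable[Exon]) -> Set[Exon]:
    """Annotated exons not overlapped by any predicted exon."""
    predicted = list(prediction)
    return {a for a in annotation if not any(overlaps(a, p) for p in predicted)}


def wrong_exons(annotation: Iterable[Exon], prediction: Iterable[Exon]) -> Set[Exon]:
    """Predicted exons not overlapping any annotated exon."""
    annotated = list(annotation)
    return {p for p in prediction if not any(overlaps(p, a) for a in annotated)}


def recurrent_wrong_exons(annotation: Set[Exon],
                          predictions_by_team: Dict[str, Set[Exon]],
                          min_teams: int = 3) -> Set[Exon]:
    """Wrong exons reported by more than two predictors (min_teams = 3).

    Assumption: predictors agree on an exon only if its coordinates are identical.
    """
    support = defaultdict(set)
    for team, exons in predictions_by_team.items():
        for exon in wrong_exons(annotation, exons):
            support[exon].add(team)
    return {exon for exon, teams in support.items() if len(teams) >= min_teams}


annotation = {("chr1", "+", 100, 200), ("chr1", "+", 300, 400)}
teams = {
    "A": {("chr1", "+", 150, 250), ("chr1", "+", 1000, 1100)},
    "B": {("chr1", "+", 1000, 1100)},
    "C": {("chr1", "+", 1000, 1100), ("chr1", "+", 2000, 2100)},
}
print(missing_exons(annotation, teams["A"]))
print(recurrent_wrong_exons(annotation, teams))
```

Grouping wrong exons by overlap rather than by identical coordinates would be a reasonable alternative and would typically report more recurrent candidates.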

Page 23: Analysis  of the RNAseq Genome Annotation Assessment Project

[Figure: annotation set and prediction set, highlighting dubious wrong exons.]

'Dubious wrong exons' (WEs) that are predicted independently by more than two predictors are reported.

[Screenshot: list of dubious wrong exons.]

15704 dubious wrong exons in the whole human genome.

17678 dubious wrong exons in the whole worm genome.

Page 24: Analysis  of the RNAseq Genome Annotation Assessment Project

Acknowledgement

Jen Harrow

Felix Kokocinski

Tim Hubbard

The RGASP community