Be an All-Star Manuscript Reviewer!
Charles E. Kahn, Jr., MD, MS
Editor, Radiology: Artificial Intelligence
Conflict of Interest
• Nothing to disclose
Learning Objectives
• Understand the manuscript review process
• Describe the most important criteria to evaluate a manuscript
• Define special considerations when reviewing work about AI in
radiology
Overview
• Process
▫ Timeliness
• Content
▫ Guidelines + standards
▫ Critical thinking
• Ethics
▫ The Golden Rule
Review Timeline
• Journal Office (1-3 days)
▫ Complete manuscript?
▫ IRB/ethics statement?
• Deputy Editor (5 days)
▫ Appropriate? Novel? Good enough to review?
▫ Assign reviewers
• Reviewers (14 days)
▫ Important? Scientifically valid?
• Deputy Editor (7 days)
▫ Integrate reviews
▫ Recommend
• Editor (7 days)
▫ Make decision: Accept, Revise, or Reject
• Time to First Decision: 31 – 37 days
Time is Key
• Respond promptly!
▫ Quick “No” better than Slow “Yes”
• Provide dates you’re unavailable
• Honor the deadline
Your Review
• Strengths
▫ Timeliness of topic, novelty, size of study
• Weaknesses
▫ Scientific concerns, lack of generalizability
• Specific comments
▫ Page + line numbers
Focus Your Review
• Give constructive comments
▫ Seek to improve the work
• Help the editors
▫ Weaknesses
▫ Scientific quality
▫ Priority
“Would I want to read this article in this journal?”
The Golden Rule of Reviewing
• “Do unto others as you would have them do unto you”
▫ Be constructive
Try to improve the work
▫ Maintain confidentiality
Your “sneak peek” at another’s work is a privilege!
Don’t quote or share the content
The Rewards
• Reviewer recognition
▫ Journal appreciation
▫ Publons
• Editorial Board appointment
• Honor + Glory
Before You Begin…
• Correct manuscript category?
• Any potential conflict or bias?
• Important problem?
• Previously published?
The Review
• Abstract
• Introduction
• Methods
• Results
• Discussion
• Figures
• Tables
• References
• Summary
The Abstract
• Summarizes the manuscript appropriately
▫ Watch for discrepancies between the Abstract and the rest of the manuscript
• Stands alone
▫ Understandable without reading the manuscript
The Introduction
• Concise
• Clearly defines the purpose of the study
• Explains why the study is important
• Defines terms
• Well-defined and testable hypothesis
The Methods Section
• Reproducible methods
• Justified study design
▫ Choice of model
▫ Cohort size
• Methods test the hypothesis
The Results Section
• Clearly explained
• Order of results parallels the methods
• Reasonable? Unexpected?
• Any results not tied to methods?
The Discussion Section
• Summarize results
• Place results in context
▫ Relate to prior literature
▫ Indicate impact on the field
• Describe limitations
• Envision future work
The Discussion Section
• Concise
• Hypothesis verified?
▫ Research question answered?
• Unexpected results
• Limitations
• Next steps
Figures
• Necessary
• Understandable
• Appropriate
Hypothesis
• Quantitative vs. qualitative
• Retrospective vs. prospective
• Feasibility vs. performance
Data
• De-ID
• Data protection
• Ethics
• Preparation
• Augmentation
• Partitions
• Ground truth
Model
• Architecture
• Software
• Initialization
• Pretraining / transfer learning
• Hyperparameters
• Training rules
Evaluation
• Metrics
• Sensitivity analysis
• External testing
• Statistical analysis
Data
• Where did the data come from?
• How were variables defined?
▫ Common Data Elements (RadElement.org), where applicable
• Inclusion / exclusion criteria
• How was the quantity of data to be used determined?
• How well do the training data match the intended clinical use?
Trouble in Paradise
• Of 516 eligible published studies, only 6% (31 studies)
performed external validation
• None of the 31 studies adopted all three design features:
▫ Diagnostic cohort design
▫ Inclusion of multiple institutions
▫ Prospective data collection for external validation
Kim DW, et al. Korean J Radiol. 2019;20:405-10
doi.org/10.1016/j.jacr.2019.06.009
ACR “TOUCH-AI”
acrdsi.org/DSI-Services/Define-AI/Use-Cases/Acute-Appendicitis
radelement.org/element/RDE195
De-Identification
• Images
▫ DICOM header
Conventional + “private” fields
▫ Image data
Jewelry, implant IDs, burned-in labels, facial views
• Reports
▫ Patient + provider names
“Hide in Plain Sight” and others
▫ Dates
Parks CL, Monson KL. J Digit Imaging 2017; 30:204
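A minimal sketch of header de-identification with pydicom, as a concrete illustration of the checks above; the tag list is an illustrative assumption and is far from exhaustive, and burned-in pixel data and report text require separate handling.

import pydicom

# Illustrative (NOT exhaustive) list of conventional header fields that carry PHI
PHI_TAGS = ["PatientName", "PatientID", "PatientBirthDate",
            "ReferringPhysicianName", "InstitutionName", "AccessionNumber"]

def deidentify(path_in, path_out):
    ds = pydicom.dcmread(path_in)
    for tag in PHI_TAGS:
        if hasattr(ds, tag):
            setattr(ds, tag, "")       # blank conventional identifiers
    ds.remove_private_tags()           # drop vendor "private" fields
    ds.save_as(path_out)               # pixel data (burned-in labels) still needs review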
“Ground Truth”
• Well defined
• Who (or what) annotated the data?
▫ Qualifications / training
▫ Instructions
▫ How was inter-rater variability
measured?
▫ Was intra-rater variability assessed?
• Single vs. multiple annotation
▫ Blinding
▫ Adjudication of discrepancies
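When reviewing how inter-rater variability was measured, a commonly reported statistic is Cohen's kappa; a minimal scikit-learn sketch using hypothetical annotations from two raters:

from sklearn.metrics import cohen_kappa_score

rater_a = [1, 0, 1, 1, 0, 1, 0, 0]   # hypothetical binary annotations, rater A
rater_b = [1, 0, 1, 0, 0, 1, 0, 1]   # hypothetical binary annotations, rater B

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa = {kappa:.2f}")  # 1.0 = perfect agreement; 0 = chance-level agreement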
Data Preparation
• “Data wrangling”
▫ Specific software, version number, specified options
• Normalization
• Resampling
▫ Image matrix size
E.g., 512 × 512 → 224 × 224
▫ Bit depth
E.g., 16-bit grayscale → 8-bit RGB
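A minimal sketch of the resampling and bit-depth conversion above, using NumPy and Pillow; the simple min-max normalization stands in for whatever windowing and preprocessing the authors actually report.

import numpy as np
from PIL import Image

img16 = np.random.randint(0, 4096, (512, 512), dtype=np.uint16)  # stand-in pixel data

lo, hi = img16.min(), img16.max()
img8 = ((img16 - lo) / max(hi - lo, 1) * 255).astype(np.uint8)   # 16-bit -> 8-bit

resized = Image.fromarray(img8).resize((224, 224))               # 512 x 512 -> 224 x 224
rgb = resized.convert("RGB")                                     # grayscale -> 3-channel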
Data Augmentation
• Was it used? If so, how?
▫ Horizontal flip
▫ Vertical flip
▫ Translation (sliding)
▫ Rotation
▫ Affine transformation
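For example, a typical augmentation pipeline might look like the following torchvision sketch; the specific transforms and parameter values are illustrative, and a manuscript should state exactly which were used.

from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomVerticalFlip(p=0.5),
    transforms.RandomAffine(degrees=10, translate=(0.05, 0.05)),  # rotation + sliding
    transforms.ToTensor(),
])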
Data Partitions
• How?
▫ Any differences? If so, why?
Data source, annotation source, or preparation
• Disjoint
▫ By image, study, patient, or institution
▫ Should be at least by patient
• Example split: Training 70%, Tuning (Validation) 20%, Testing 10%
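A minimal sketch of such a patient-level (disjoint) 70/20/10 split using scikit-learn's GroupShuffleSplit; the DataFrame and its patient_id column are assumptions for illustration.

from sklearn.model_selection import GroupShuffleSplit

def split_by_patient(df, seed=0):
    # First split off 70% of patients for training
    gss = GroupShuffleSplit(n_splits=1, test_size=0.30, random_state=seed)
    train_idx, rest_idx = next(gss.split(df, groups=df["patient_id"]))
    rest = df.iloc[rest_idx]
    # Split the remaining 30% into tuning (20%) and testing (10%)
    gss2 = GroupShuffleSplit(n_splits=1, test_size=1/3, random_state=seed)
    tune_idx, test_idx = next(gss2.split(rest, groups=rest["patient_id"]))
    return df.iloc[train_idx], rest.iloc[tune_idx], rest.iloc[test_idx]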
Architectures
• LeNet
• ResNet34
• U-Net
• VGG
• LSTM
• Inception
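For instance, pretraining / transfer learning with one of these architectures often amounts to loading ImageNet weights and replacing the output head, as in this torchvision sketch (recent torchvision assumed; the two-class head is an illustrative assumption).

import torch.nn as nn
from torchvision import models

# Load an ImageNet-pretrained ResNet34 and swap in a new head for a binary task
model = models.resnet34(weights=models.ResNet34_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, 2)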
Model Evaluation
• How many models were evaluated against the test set, and how
were these models selected?
▫ Ideally, one
▫ If greater than one, justify reasoning
Metrics
• Sørensen–Dice coefficient = 2|X ∩ Y| / (|X| + |Y|)
▫ Dice similarity coefficient (DSC) = 2TP / (2TP + FP + FN)
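A minimal NumPy sketch of the Dice coefficient for two binary masks:

import numpy as np

def dice(x, y):
    # x, y: binary masks of the same shape
    x, y = np.asarray(x, bool), np.asarray(y, bool)
    denom = x.sum() + y.sum()
    return 2 * np.logical_and(x, y).sum() / denom if denom else 1.0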
Metrics
• Jaccard index = Intersection over Union (IoU) = |X ∩ Y| / |X ∪ Y|
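And the corresponding NumPy sketch for the Jaccard index / IoU:

import numpy as np

def iou(x, y):
    # x, y: binary masks of the same shape
    x, y = np.asarray(x, bool), np.asarray(y, bool)
    union = np.logical_or(x, y).sum()
    return np.logical_and(x, y).sum() / union if union else 1.0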
Metrics
• Hausdorff distance
▫ Measures how far apart two subsets are
The greatest of all the distances from a point in one set to the closest point in the other set
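SciPy's directed_hausdorff gives one direction; the symmetric Hausdorff distance is the maximum of the two directions. A sketch with illustrative point sets (for segmentation masks, pass the coordinates of their boundary points):

import numpy as np
from scipy.spatial.distance import directed_hausdorff

a = np.array([[0, 0], [0, 1], [1, 0]], dtype=float)  # hypothetical point set A
b = np.array([[0, 0], [2, 0]], dtype=float)          # hypothetical point set B

hd = max(directed_hausdorff(a, b)[0], directed_hausdorff(b, a)[0])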
Model Performance
• Overfitting, underfitting, and irrelevant features
Handelman GS, et al. AJR 2019; 212:38-43
arxiv.org/abs/1807.00431
Leakage
• Unintended use of known information as unknown
• Outcome leakage
▫ Independent variables can be used to infer outcomes
For example, a risk factor whose measurement window extends into the future can
inadvertently be used to predict that future outcome
• Validation leakage
▫ Ground truth from training set propagates to validation set
For example, same patient is used in both training and validation
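A simple guard against the validation-leakage example above is to verify that the partitions are disjoint by patient; a minimal sketch (the partition ID lists are hypothetical inputs):

def assert_disjoint_patients(train_ids, tune_ids, test_ids):
    # Fail loudly if any patient ID appears in more than one partition
    train, tune, test = set(train_ids), set(tune_ids), set(test_ids)
    assert not (train & tune or train & test or tune & test), \
        "Patient overlap between partitions -> validation leakage"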
K-fold Cross-validation
• Split the data into K equal parts
▫ Train the model on K−1 parts
▫ Validate on the remaining part
• Repeat the process K times
• Report average results for K-folds
• For small classes and rare categorical factors, stratified K-fold
splitting ensures equal presence of classes and factors in each fold
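A minimal scikit-learn sketch of stratified 5-fold cross-validation; the data and the logistic-regression model are placeholders for illustration.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

X = np.random.rand(100, 8)            # hypothetical feature matrix
y = np.random.randint(0, 2, 100)      # hypothetical binary labels

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, val_idx in skf.split(X, y):
    clf = LogisticRegression().fit(X[train_idx], y[train_idx])  # train on K-1 folds
    scores.append(clf.score(X[val_idx], y[val_idx]))            # validate on the held-out fold
print(f"Mean accuracy over 5 folds: {np.mean(scores):.2f}")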
Evaluation
• ROC analysis
• Calibration curve
• Confusion matrix
• Review of misclassifications
http://arogozhnikov.github.io/2015/10/05/roc-curve.html
Allen B Jr, et al. J Am Coll Radiol 2019; 16:1179
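A minimal scikit-learn sketch producing the evaluation outputs listed above (ROC analysis, calibration curve, confusion matrix); y_true and y_prob are hypothetical stand-ins for a model's test-set output.

import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve, confusion_matrix
from sklearn.calibration import calibration_curve

y_true = np.random.randint(0, 2, 200)                            # hypothetical ground truth
y_prob = np.clip(y_true * 0.6 + np.random.rand(200) * 0.4, 0, 1)  # hypothetical model scores

fpr, tpr, _ = roc_curve(y_true, y_prob)                          # ROC analysis
auc = roc_auc_score(y_true, y_prob)
frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=10)  # calibration curve
cm = confusion_matrix(y_true, (y_prob >= 0.5).astype(int))       # confusion matrix at one threshold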
AI Need Not Be Superhuman
Calibration
Actual prevalence of malignancy versus estimated risk of malignancy for each decile of the probability scale.
Ayer T et al. Cancer 2010; 116:3310-21
Misclassification
• Example false positives and false negatives
https://doi.org/10.1148/ryai.2019180001
Good Science
• Hypothesis
▫ Well-defined
▫ Testable
▫ Innovative
• Methods
▫ Appropriate to stated problem
▫ Described in detail
▫ Correct metrics
• Results
▫ Appropriate level of detail
• Discussion
▫ Summarize results
▫ Place work into context
▫ Describe limitations
▫ Envision future work
http://jasonya.com/wp/wp-content/uploads/2015/04/car_peer_review_comic_12.jpg