This study showed that, for small sample sizes, model performance does not differ between a holdout approach (approach 2) and a small external dataset (approach 4, n = 100) with similar patient characteristics and image quality. A single small dataset suffers from large uncertainty, suggesting that repeated cross-validation using the full dataset is preferable in this situation. Moreover, external validation has limited additional value when patient and PET characteristics are similar (approach 4). However, our simulations demonstrated that external datasets with different patient or PET characteristics do have added value, and that these differences may call for adjustment for relevant variables in the final model (approaches 5–9), as shown by the calibration slopes of the models. Here too, small datasets result in large uncertainty in model performance.
In line with other studies [9, 13], our simulation results show that no single internal validation method clearly outperformed the others in terms of CV-AUC, standard deviation and calibration slope (approaches 1–3). Bootstrapping (approach 3) resulted in a smaller standard deviation than cross-validation or holdout, yielding slightly lower but more stable estimates of model performance. Although mean calibration slopes and mean CV-AUCs were comparable for all internal validation approaches, large differences in model performance were observed between folds, stressing the need for repeated validation.
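The fold-to-fold variability described above can be illustrated with a minimal sketch. It is not the simulation used in this study: the cohort size (100), event rate (0.3), score distribution and the helper names `auc` and `repeated_kfold_auc` are all assumptions chosen for illustration, and, for brevity, a fixed score is evaluated on each fold rather than refitting a model per fold.

```python
import random
import statistics

def auc(scores, labels):
    """Rank-based (Mann-Whitney) AUC; assumes continuous scores without ties."""
    order = sorted(range(len(scores)), key=scores.__getitem__)
    rank_sum = sum(r for r, i in enumerate(order, start=1) if labels[i] == 1)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def repeated_kfold_auc(scores, labels, k=5, repeats=20, seed=0):
    """Return all fold-level AUCs and the mean AUC of each repeat."""
    rng = random.Random(seed)
    fold_aucs, repeat_means = [], []
    for _ in range(repeats):
        idx = list(range(len(labels)))
        rng.shuffle(idx)                      # new random partition per repeat
        this_repeat = []
        for f in range(k):
            test = idx[f::k]                  # every k-th shuffled index forms a fold
            y = [labels[i] for i in test]
            if 0 < sum(y) < len(y):           # AUC needs both classes in the fold
                this_repeat.append(auc([scores[i] for i in test], y))
        fold_aucs.extend(this_repeat)
        repeat_means.append(statistics.mean(this_repeat))
    return fold_aucs, repeat_means

# Hypothetical small cohort: 100 patients, about 30% events (assumed values).
rng = random.Random(42)
labels = [1 if rng.random() < 0.3 else 0 for _ in range(100)]
scores = [rng.gauss(1.0 * y, 1.0) for y in labels]

fold_aucs, repeat_means = repeated_kfold_auc(scores, labels)
print(f"SD across single folds:  {statistics.stdev(fold_aucs):.3f}")
print(f"SD across repeat means:  {statistics.stdev(repeat_means):.3f}")
```

In this sketch a single fold can give a noticeably different AUC from the next, whereas averaging over folds and repeats gives a much more stable summary.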
A small training set or test set may not be representative of all possible cases. A small training set results in poor generalization ability, and a small test set leads to large confidence intervals. This is also shown in our simulation results, where the confidence interval became smaller as the sample size of the external test set increased (approach 4). Similarly, the uncertainty of the predictions was lower for the cross-validated model, in which all simulated patients were used, than for the holdout approach, in which setting aside a test set reduced the sample size available for cross-validation training. Moreover, using a holdout set as test set is essentially the same as one fold of the cross-validation, because the patient characteristics and metric distributions are identical for the training and test sets in both the cross-validation and holdout approaches. Therefore, a holdout set is effective only when a very large dataset is available. As PET studies often have small sample sizes (< 100 patients), a cross-validation or bootstrap approach is favored over a holdout set in small datasets. Moreover, larger external datasets with similar patient and PET characteristics only result in higher certainty of model predictions, as shown by lower standard deviations in CV-AUC and calibration slope, but do not provide meaningful information about generalizability.
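The relation between test-set size and uncertainty can be reproduced in a minimal Monte Carlo sketch. This is an illustration under assumed settings, not the simulation performed in this study: the event rate (0.3), the unit separation between score distributions, the sample sizes and the function names `simulate_test_set` and `auc_spread` are hypothetical.

```python
import random
import statistics

def auc(scores, labels):
    """Rank-based (Mann-Whitney) AUC; assumes continuous scores without ties."""
    order = sorted(range(len(scores)), key=scores.__getitem__)
    rank_sum = sum(r for r, i in enumerate(order, start=1) if labels[i] == 1)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def simulate_test_set(n, rng, prevalence=0.3, separation=1.0):
    """Draw n patients; the fixed model's score is shifted upward for events."""
    labels = [1 if rng.random() < prevalence else 0 for _ in range(n)]
    scores = [rng.gauss(separation * y, 1.0) for y in labels]
    return scores, labels

def auc_spread(n, n_sim=300, seed=0):
    """Mean and SD of the AUC over n_sim simulated test sets of size n."""
    rng = random.Random(seed)
    aucs = []
    while len(aucs) < n_sim:
        scores, labels = simulate_test_set(n, rng)
        if 0 < sum(labels) < n:               # skip sets with only one class
            aucs.append(auc(scores, labels))
    return statistics.mean(aucs), statistics.stdev(aucs)

for n in (50, 100, 500):
    mean_auc, sd_auc = auc_spread(n)
    print(f"n={n:4d}  mean AUC={mean_auc:.3f}  SD={sd_auc:.3f}")
```

Under these assumptions the mean AUC is similar for all sample sizes, while its standard deviation shrinks as the test set grows, which is the pattern described for approach 4.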
The focus of a validation study should not be on statistical testing of differences in performance but on the generalizability of the model to other settings [8, 10]. Our study showed that PET and patient characteristics, such as EARL reconstruction and Ann Arbor stage, influence model validation (approaches 5–6); this effect is most prominent in the lower calibration slopes of the models. A model with high generalizability is more likely to be implemented in clinical practice. Often, an external dataset is not available, and a training set that is not representative (e.g., due to aberrant patient or PET characteristics) might lead to overfitting of the model to the training set, reducing its performance in the test set and therefore its clinical applicability. It is therefore important to check the influence of patient and PET characteristics within the sample using simulations, if possible. An external dataset allows case-mix differences to be assessed, whereas internal validation approaches only correct for sampling variation.
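As an illustration of how a calibration slope is commonly estimated, the sketch below regresses the observed outcome on a model's linear predictor with a logistic model fitted by Newton-Raphson. It is a generic example, not the procedure or data of this study: the simulated cohort, the factor of 2 used to mimic an overconfident model, and the names `simulate_cohort` and `calibration_slope` are assumptions.

```python
import math
import random

def calibration_slope(linear_predictor, outcomes, n_iter=25):
    """Fit logit P(y=1) = a + b * lp by Newton-Raphson; return (a, b)."""
    a, b = 0.0, 0.0
    for _ in range(n_iter):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for lp, y in zip(linear_predictor, outcomes):
            p = 1.0 / (1.0 + math.exp(-(a + b * lp)))
            w = p * (1.0 - p)
            g_a += y - p                      # gradient w.r.t. intercept
            g_b += (y - p) * lp               # gradient w.r.t. slope
            h_aa += w                         # entries of the information matrix
            h_ab += w * lp
            h_bb += w * lp * lp
        det = h_aa * h_bb - h_ab * h_ab
        a += (h_bb * g_a - h_ab * g_b) / det  # Newton update: theta += H^-1 g
        b += (h_aa * g_b - h_ab * g_a) / det
    return a, b

def simulate_cohort(n, seed=0):
    """Hypothetical cohort whose outcomes follow the true linear predictor."""
    rng = random.Random(seed)
    lp = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = [1 if rng.random() < 1.0 / (1.0 + math.exp(-v)) else 0 for v in lp]
    return lp, y

lp_true, y = simulate_cohort(5000, seed=1)
_, slope_good = calibration_slope(lp_true, y)
# An overconfident model: predictions twice as extreme as the truth (assumed).
_, slope_over = calibration_slope([2.0 * v for v in lp_true], y)
print(f"well-calibrated model: slope = {slope_good:.2f}")
print(f"overconfident model:   slope = {slope_over:.2f}")
```

A slope close to 1 indicates that predictions are neither too extreme nor too moderate, whereas a slope below 1 indicates predictions that are too extreme for the population in which the model is validated.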
It is important to note that for this simulation study we assumed that the model to predict outcome was fixed and that the test set was used only to validate the model developed in the training set. Therefore, our results apply only to the validation of a fixed model. If a new model were trained or feature selection were incorporated in the training set, a holdout set or external set would not be comparable to a cross-validation approach. However, in this scenario a small validation set also results in large uncertainty of model performance. Moreover, a holdout set always results in a smaller training set, thereby leading to larger uncertainty for both the training and validation sets. Therefore, most of our conclusions remain the same when models and/or features are selected in the training set. Note, however, that feature selection leads to overfitting to the training set.
Based on our results, we conclude that for small sample sizes there is no added value of a holdout approach (internal validation) or a very small external dataset with similar patient and PET characteristics. PET studies often have small sample sizes; therefore, a holdout approach is not favored, as it leads to larger uncertainties for both the training set and the validation set. Moreover, a single small external dataset also suffers from large uncertainty. External validation provides important information regarding the generalizability of a model in different settings. Our simulations also demonstrated that it is important to consider the impact of differences in patient population between training, cross-validation and (external) testing data, which may call for adjustment or stratification of relevant variables or recalibration of the models. Therefore, we suggest that for future studies with small sample sizes, a repeated CV or bootstrap approach is preferable to a holdout approach or a single small external test set with similar patient characteristics, and that editors should stress the need for proper internal validation of models.