We investigated how accuracy, precision, and lesion detectability of analogue whole body [18F]FDHT PET-CT are affected by image count statistics and reconstruction protocol, to optimize imaging protocols for research and clinical use. Reducing counts by 50% introduced < 20% SUV intrascan variability for EARL1 images, which only increased test-retest variability to a small extent. Improving image spatial resolution by adhering to EARL2 guidelines might reduce the size-dependent bias in SUV, but it hampers repeatability and increases sensitivity to count statistics. Lesion detectability is only slightly affected by reduced counts and only marginally increased by resolution modelling.
SUVs of 50% count scans correlated highly with SUVs of 100% count scans, indicating that accuracy is preserved at lower count statistics. However, when comparing split scans directly, a variability in SUV ranging from 8.5% (SUVmean EARL1) to 22.2% (SUVmax EARL2) was observed. Hence, while SUV accuracy is maintained at low counts, its precision might be hampered. Still, test-retest variability increased only to a small and non-significant extent, which indicates that statistical (Poisson) image noise is a minor determinant of SUV repeatability for [18F]FDHT.
SUV repeatability of oncological 18F-tracers (i.e. [18F]FDG, [18F]-fluorothymidine, [18F]-fluoromethylcholine, [18F]FDHT) ranges between 10 and 30%, which makes 30% the preferred upper threshold for SUV variability in, for example, response monitoring studies [27,28,29,30]. As expected, repeatability of SUVmax was most affected by count reduction and EARL2 reconstruction, yielding RCs > 30%. In contrast, SUVpeak seemed to be robust to both count statistics and reconstruction protocol, yielding an RC of approximately 30% after count reduction, which was even lower (27.9%) when only lesions > 4.2 mL were considered. The improved repeatability of SUV when excluding small lesions seems to be a direct consequence of the size dependency of intrascan variability at reduced counts (Additional file 1: Figure S1). Note that test-retest variability of [18F]FDHT can be even lower when evaluating only selected target lesions, or when analysing on a per-patient rather than a per-lesion basis [12]. In the current study, all avid lesions were primarily included to avoid selection bias and to evaluate the effect of count reduction on smaller and less avid lesions as well.
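To make the 30% threshold concrete, the following minimal sketch computes a repeatability coefficient from paired test-retest SUVs. It assumes a commonly used definition (RC = 1.96 × the standard deviation of the relative test-retest differences, in percent); the exact definition used in this study is not restated in this section, and the function name and SUV values are hypothetical.

```python
import statistics


def repeatability_coefficient(test, retest):
    """Return the RC (%) as 1.96 x SD of relative test-retest differences.

    Assumed definition: each relative difference is (retest - test) divided
    by the mean of the two measurements, expressed in percent.
    """
    if len(test) != len(retest) or len(test) < 2:
        raise ValueError("need at least two paired measurements")
    rel_diffs = [
        100.0 * (b - a) / ((a + b) / 2.0)
        for a, b in zip(test, retest)
    ]
    # Sample standard deviation (n - 1 in the denominator)
    return 1.96 * statistics.stdev(rel_diffs)


# Hypothetical lesion SUVpeak values, for illustration only
test_suv = [4.0, 6.0, 8.0, 10.0]
retest_suv = [4.4, 5.4, 8.8, 9.0]

rc = repeatability_coefficient(test_suv, retest_suv)
usable_for_response = rc <= 30.0  # threshold discussed in the text
print(f"RC = {rc:.1f}%, within 30% threshold: {usable_for_response}")
```

Under this definition, a change in lesion SUV larger than the RC would be unlikely to arise from measurement variability alone, which is why a higher RC implies a larger minimal detectable change.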
Test-retest variability differed between SUV normalizations, with larger variability in SUVauc-pp (> 30%) compared with SUVbw. While SUV normalized to AUC-PP correlates better with reference pharmacokinetic parameters than SUV normalized to bodyweight [19], deriving it is technically more demanding and less precise than using simpler factors such as dose per bodyweight, making it less suitable for multicentre studies. Hence, a trade-off between accuracy, precision, and ease of use has to be made when selecting the preferred SUV normalization. For example, while SUVpeak normalized to bodyweight had an RC of 30% at half of the counts, it exceeded 30% when normalized to AUC-PP, rendering it unfit for response assessment.
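As an illustration of the simpler of the two normalizations, the sketch below computes a bodyweight-normalized SUV from the standard relation SUVbw = tissue concentration / (injected dose / bodyweight). It assumes decay-corrected inputs and a tissue density of 1 g/mL; the function name and numbers are hypothetical. The AUC-PP variant replaces the dose-per-bodyweight factor with a plasma-derived factor whose derivation is not shown here.

```python
def suv_bodyweight(conc_kbq_per_ml, injected_dose_mbq, bodyweight_kg):
    """Bodyweight-normalized SUV (dimensionless, assuming 1 g/mL tissue).

    Inputs are assumed to be decay-corrected to the same time point.
    """
    if injected_dose_mbq <= 0 or bodyweight_kg <= 0:
        raise ValueError("dose and bodyweight must be positive")
    dose_kbq = injected_dose_mbq * 1000.0   # MBq -> kBq
    weight_g = bodyweight_kg * 1000.0       # kg -> g
    return conc_kbq_per_ml / (dose_kbq / weight_g)


# Hypothetical example: 10 kBq/mL in a lesion, 200 MBq injected, 80 kg patient
suv = suv_bodyweight(10.0, 200.0, 80.0)
print(f"SUVbw = {suv:.2f}")
```

Because the SUV is the tissue concentration divided by the normalization factor, any relative error in that factor propagates directly into the SUV, which is one way to see why a less precise factor degrades repeatability.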
Partial-volume effects generally result in volume-dependent underestimations of tumour SUV and possibly hamper lesion detectability [31]. Correcting for PVE in the reconstruction algorithms might be particularly important in [18F]FDHT due to the high frequency of small (e.g. < 4.2 mL) detected lesions. Novel reconstruction algorithms incorporating the PSF either within or after reconstruction have been proposed to improve image resolution [17]. The EARL2 standards have adopted these algorithms as a step forward in scanner calibration harmonization between centres [15]. However, PSF reconstructions are known to suffer from noise propagation and image artefacts (e.g. Gibbs phenomenon resulting in edge overshoot), which might lead to misinterpretation regarding treatment effects [17, 18, 32]. Indeed, we observed that repeatability was worse for the EARL2 reconstruction with higher sensitivity to count statistics, resulting in a higher minimal detectable change for response assessment.
Previous reports argued that PSF reconstructions should be used for qualitative purposes (i.e. lesion detection) and that non-PSF images (such as EARL1) should be used for tumour quantification [18, 33]. However, Quak et al. found that with additional image filtering the higher lesion detection and image resolution of PSF images do not need to be impaired in order to meet the EARL criteria [34]. In the present study, we observed a very small increase in lesion CNR when PSF was applied. Given the large number of detected lesions (336 lesions in 12 patients), this is unlikely to lead to clinically relevant differences in conclusions regarding the extent of disease or intrapatient heterogeneity (Fig. 1). The small reduction in CNR of < 5% after count reduction is also unlikely to have clinical consequences (Fig. 1). This corresponds to [18F]FDG PET-CT data in several cancer types, where reducing acquisition time from 3 to 1.5 min per bed position reduced image quality, but did not impair lesion detection rates [13].
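For readers unfamiliar with the metric, the sketch below shows one common way to compute a lesion contrast-to-noise ratio: the difference between mean lesion and mean background uptake, divided by the standard deviation of the background. This definition is an assumption, since the CNR definition used in this study is not restated in this section, and the voxel values are hypothetical.

```python
import statistics


def contrast_to_noise(lesion_voxels, background_voxels):
    """CNR as (mean lesion - mean background) / SD of background.

    Assumed definition for illustration; other CNR definitions exist.
    """
    bg_mean = statistics.fmean(background_voxels)
    bg_sd = statistics.stdev(background_voxels)
    if bg_sd == 0:
        raise ValueError("background noise is zero; CNR is undefined")
    return (statistics.fmean(lesion_voxels) - bg_mean) / bg_sd


# Hypothetical voxel values (SUV units), for illustration only
lesion = [8.0, 9.0, 10.0]
background = [1.0, 2.0, 3.0]

cnr = contrast_to_noise(lesion, background)
print(f"CNR = {cnr:.1f}")
```

With this definition, a count reduction mainly raises the background standard deviation, so a small drop in CNR for lesions with high contrast is consistent with the modest effect on detectability described above.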
Another factor affecting image count statistics is the injected tracer dosage. In the present cohort, patients received a relatively low dosage compared with other cohorts from the recent multicentre study [12]. However, while SUV test-retest variability varied between centres, the authors did not observe a direct relationship between injected dosage and repeatability [12]. This might be explained by differences in other factors determining repeatability, such as observer variability in tumour delineation, PET system specifics, adherence to imaging protocols (i.e. uptake interval), and methods for acquiring the SUV normalization factors. Hence, count statistics did not appear to be the main determinant of [18F]FDHT repeatability, which we confirm in the current study, where non-significant increases in test-retest RCs were observed after count reduction. Therefore, a potentially modifiable and important determinant of SUV variability in [18F]FDHT imaging seems to be the choice of normalization factor, which, again, requires a trade-off between accuracy and precision.
The present study has several limitations. First, while splitting data on a count-wise basis enables evaluation of Poisson noise induced by count reduction, the 50% count scans do not fully represent a 50% shorter image acquisition. However, [18F]FDHT kinetics commonly reach a plateau after 20–30 min, yielding stable SUV during the whole body acquisition [8], which limits the impact of this difference. Second, the present study contains data acquired on a PET system of a single vendor. As the overlap between bed positions differs between vendors, count reduction might have a different impact on measurement variability for other PET systems. Also, for novel PET systems, in particular the new digital systems, which may have higher sensitivities and better time-of-flight performance, the impact of reducing acquisition times on measurement variability will be even smaller. Hence, for these systems acquisition times may be reduced even further, but this remains to be investigated for each type of system. As investigated in the present study for analogue PET, a reduction of up to 50% compared with current standard practice seems to be feasible for diagnostic and response assessment purposes, provided that the use of SUVmax is avoided.
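The count-wise splitting mentioned in the first limitation can be pictured as randomly assigning each detected event to one of two subsets. The sketch below is a toy illustration of that idea, not the procedure used in this study (which is not described in this section); the function names are hypothetical. It also shows the general Poisson relation that relative noise scales as 1/sqrt(N), so halving the counts raises relative noise by a factor of sqrt(2).

```python
import math
import random


def split_counts(n_events, fraction=0.5, seed=0):
    """Randomly assign each event to subset A with probability `fraction`.

    Returns the number of events in subsets A and B. A fixed seed keeps
    the illustration reproducible.
    """
    rng = random.Random(seed)
    kept = sum(1 for _ in range(n_events) if rng.random() < fraction)
    return kept, n_events - kept


def relative_poisson_noise(n_counts):
    """Relative standard deviation of a Poisson count: 1 / sqrt(N)."""
    return 1.0 / math.sqrt(n_counts)


total_events = 100_000  # hypothetical number of detected events
half_a, half_b = split_counts(total_events)

# Noise increase expected when going from 100% to 50% of the counts
noise_ratio = relative_poisson_noise(total_events / 2) / relative_poisson_noise(total_events)
print(f"split: {half_a} / {half_b}, relative noise increase: {noise_ratio:.3f}")
```

This splitting keeps the acquisition time window unchanged, which is why it isolates the effect of Poisson noise but does not fully mimic a scan that is actually 50% shorter.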
The current approach for evaluating the sensitivity of whole body PET-CT acquisition to scan statistics can be extended to other tracers currently being investigated and/or implemented in clinical practice, such as PSMA-ligand PET-CT. For adequate evaluation of these tracers, however, test-retest data should be available.