Asphericity of tumor FDG uptake in non-small cell lung cancer: reproducibility and implications for harmonization in multicenter studies

Background: Asphericity (ASP) of the primary tumor's metabolic tumor volume (MTV) in FDG-PET/CT is independently predictive for survival in patients with non-small cell lung cancer (NSCLC). However, comparability between PET systems may be limited. Therefore, reproducibility of ASP was evaluated at varying image reconstruction and acquisition times to assess feasibility of ASP assessment in multicenter studies.
Methods: This is a retrospective study of 50 patients with NSCLC (female 20; median age 69 years) undergoing pretherapeutic FDG-PET/CT (median 3.7 MBq/kg; 180 s/bed position). Reconstruction used OSEM with TOF4/16 (iterations 4; subsets 16; in-plane filter 2.0, 6.4 or 9.5 mm), TOF4/8 (4 it; 8 ss; filter 2.0/6.0/9.5 mm), PSF + TOF2/17 (2 it; 17 ss; filter 2.0/7.0/10.0 mm) or Bayesian-penalized likelihood (Q.Clear; beta, 600/1750/4000). Resulting reconstructed spatial resolution (FWHM) was determined from hot sphere inserts of a NEMA IEC phantom. Data with approx. 5-mm FWHM were retrospectively smoothed to achieve 7-mm FWHM. List mode data were rebinned for acquisition times of 120/90/60 s. Threshold-based delineation of primary tumor MTV was followed by evaluation of relative ASP/SUVmax/MTV differences between datasets and resulting proportions of discordantly classified cases.
Results: Reconstructed resolution for narrow/medium/wide in-plane filter (or low/medium/high beta) was approx. 5/7/9 mm FWHM. Comparing different pairs of reconstructed resolution between TOF4/8, PSF + TOF2/17, Q.Clear and the reference algorithm TOF4/16, ASP differences were lowest at FWHM of 7 versus 7 mm. Proportions of discordant cases (ASP > 19.5% vs. ≤ 19.5%) were also lowest at 7 mm (TOF4/8, 2%; PSF + TOF2/17, 4%; Q.Clear, 10%). Smoothing of 5-mm data to 7-mm FWHM significantly reduced discordant cases (TOF4/8, 38% reduced to 2%; PSF + TOF2/17, 12% to 4%; Q.Clear, 10% to 6%), resulting in proportions comparable to original 7-mm data. Shorter acquisition time only increased proportions of discordant cases at < 90 s.
Conclusions: ASP differences were mainly determined by reconstructed spatial resolution, and multicenter studies should aim at comparable FWHM (e.g., 7 mm; determined by in-plane filter width). This reduces discordant cases (high vs. low ASP) to an acceptable proportion for TOF and PSF + TOF of < 5% (Q.Clear: 10%). Data with better resolution (i.e., lower FWHM) could be retrospectively smoothed to the desired FWHM, resulting in a comparable number of discordant cases.


Background
Patients with early-stage or locally advanced non-small cell lung cancer (NSCLC) are potential candidates for curatively intended therapy; however, management decisions are primarily based on the clinical tumor stage as the single factor [1]. On average, adjuvant chemotherapy has shown only modest survival benefits [2][3][4]; therefore, more effective methods of treatment selection are warranted.
Consequently, numerous additional prognostic or predictive factors [5][6][7], among them image-derived parameters [8][9][10][11][12], have been investigated, aiming at more differentiated outcome prediction and management decisions. Among parameters from positron emission tomography/computed tomography with [18F]fluorodeoxyglucose (FDG-PET/CT), asphericity (ASP) is a parameter that reflects shape irregularity of the primary tumor's metabolic tumor volume (MTV), combining metric and metabolic features of the primary tumor. Three retrospective studies confirmed its independent prognostic value for progression-free (PFS) and overall survival (OS) in patients with NSCLC [13][14][15]. The largest study (311 patients, UICC stage I-III) further showed that ASP, with a cutoff of > 19.5%, could identify patients with UICC stage II treated by surgery and adjuvant chemotherapy with high ASP and reduced PFS (median 11 months vs. not reached) and OS (22 months vs. not reached) [15]. ASP was superior for survival prediction compared to the primary tumor's maximum standardized uptake value (SUVmax) and MTV, two other previously proposed and common PET parameters [8,9,16,17].
Studies on quantitative PET parameters have mostly been monocentric, but the main limitation of any PET parameter is its dependence on numerous technical factors including image reconstruction algorithms. Therefore, results may fail to reproduce in a multicenter approach unless harmonization between centers is ensured [18][19][20]. SUVmax and MTV may vary by > 30% if basic ordered subset expectation maximization (OSEM) reconstruction is combined with time-of-flight (TOF) information and/or scanner-specific compensation for the point spread function (PSF) [19][20][21][22].
Variability of ASP has not been investigated so far, but an impact of different reconstruction methods and resulting levels of image noise can be expected. The definition of ASP includes the MTV and its surface; therefore, a variability of MTV will cause variability of ASP. Since MTV also varies notably depending on the applied delineation algorithm [20,23,24,25], there are two potential sources of variability of ASP: image generation and lesion delineation.
The goal of the current study was to investigate differences in ASP resulting from variability in image generation (common reconstruction methods and acquisition times). The focus was on assessing whether the resulting variation is acceptable for application in multicenter studies and on defining the range of acceptable variation of the influencing factors. Specifically, the goal was not to investigate the trueness of ASP itself, to identify a ground truth or to define a highly optimized reconstruction protocol for a specific PET scanner. On the contrary, this study investigated whether ASP could still be used in multicenter studies under imperfect clinical conditions with different scanners and a certain variation in acquisition protocols (uptake time, acquisition time). Such variability introduced by image generation should be separated from variations in image post-processing, the software for image feature extraction [26] or variation in lesion delineation. Therefore, data were not postprocessed (unless specified), and the same software and delineation method were used as in the preceding studies on ASP in NSCLC [13][14][15]. To facilitate interpretation, SUVmax and MTV were investigated analogously for comparison.

Keywords: FDG-PET, Image reconstruction, Spatial resolution, Asphericity, Non-small cell lung cancer, Reproducibility, Prognosis

Phantom data
A NEMA IEC body phantom was examined using a GE Discovery MI PET scanner (GE Healthcare, General Electric, Boston, MA, USA) with a 3-ring detector with silicon photomultipliers (SiPM) and a reported sensitivity of 7.3 cps/kBq [27]. Total activity in the field of view was approximately 35 MBq. The absolute activities were measured in a certified dose calibrator (ISOMED 2010, MED Dresden GmbH, Germany), which was also used for regular cross calibration of the PET scanner (every 6 months). Sphere inserts (inner diameter 10, 13, 17, 22, 28, and 37 mm) were filled with 24.4 kBq/ml F18-fluoride, while the background was filled with 3.1 kBq/ml (sphere-to-background ratio, approx. 8:1). Acquisition time was 3 min per bed position (transaxial field of view, 70 cm; matrix size, 256 × 256; voxel size, 2.73 × 2.73 × 2.78 mm³). CT data of the phantom were used for attenuation correction. Scatter correction, random correction and dead time correction were also performed.
Reconstructed spatial resolution was assessed as the full width at half maximum (FWHM) of the PSF in the reconstructed phantom images. PSF was modeled by a 3D Gaussian, and FWHM was determined by applying the method described in detail by Hofheinz et al. [28]. This method is based on fitting the analytic solution for the radial activity profile of a homogeneous sphere convolved with a 3D Gaussian to the reconstructed data. In this process, the full 3D vicinity of each sphere is evaluated by transforming the data to spherical coordinates relative to the respective sphere's center. A summary of the reconstructions used, the resulting spatial resolution and image noise (patient data) is given in Table 1. Representative radial profiles are shown in Fig. 1.
To study effects of different acquisition time per bed position, PET list mode data were retrospectively rebinned to reconstruct further datasets representing an acquisition time of 120 s, 90 s or 60 s, respectively. Reconstruction was then performed with the algorithms that resulted in a reconstructed spatial resolution of 7 mm (i.e., TOF 4/8/6, TOF 4/16/6.4, PSF + TOF 2/17/7 and Q.Clear 1750; the numbers denote iterations/subsets/in-plane filter width in mm or, for Q.Clear, the beta value).

Patients and scans
Fifty patients (female 20; median age 69 years; range 46 to 83 years) with histologically proven NSCLC underwent pretherapeutic FDG-PET/CT between July 2018 and February 2019 using the same scanner. Patients were required to fast for at least 6 h prior to tracer administration, and a blood glucose level of ≤ 150 mg/dl was ensured. A median activity of 249 MBq (interquartile range [IQR], 238 to 257 MBq; range 209 to 274 MBq) or 3.7 MBq/kg (IQR 3.1 to 4.2 MBq/kg; range 2.0 to 5.7 MBq/kg) was administered intravenously. Static PET data were acquired after a median uptake time of 65 min (IQR 61 to 70 min; range 55 to 96 min) from the base of skull to the proximal femora in 3D acquisition mode (acquisition time, 180 s per bed position; bed overlap, approx. 25%). Attenuation correction was based on a non-enhanced low-dose CT (automated tube current modulation "Smart mA"; maximum tube current-time product 100 mAs; tube voltage 120 kV; gantry rotation time 0.5 s) or non-enhanced diagnostic CT (maximum tube current-time product, 200 mAs).
PET raw data were reconstructed as described above (patient example in Fig. 2). Furthermore, data with 5-mm FWHM resolution were smoothed with a Gaussian filter (5 mm FWHM). According to
$$\mathrm{FWHM}_{\mathrm{result}} = \sqrt{\mathrm{FWHM}_{\mathrm{image}}^{2} + \mathrm{FWHM}_{\mathrm{filter}}^{2}} \qquad (1)$$
this results in a target spatial resolution of approximately 7 mm. Altogether, 25 image datasets per patient with different spatial resolution and noise (i.e., acquisition time) were generated.
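The retrospective smoothing step can be illustrated with a short sketch. It assumes that both the reconstructed PSF and the post-filter are Gaussian, so that their widths add in quadrature; the function names are hypothetical and not taken from the software used in the study.

```python
import math

# Conversion factor between FWHM and the standard deviation of a Gaussian.
FWHM_TO_SIGMA = 1.0 / (2.0 * math.sqrt(2.0 * math.log(2.0)))


def combined_fwhm(fwhm_image_mm, fwhm_filter_mm):
    """FWHM after convolving a Gaussian PSF with a Gaussian filter (assumption)."""
    return math.hypot(fwhm_image_mm, fwhm_filter_mm)


def filter_fwhm_for_target(fwhm_image_mm, fwhm_target_mm):
    """Filter FWHM needed to degrade an image to a coarser target resolution."""
    if fwhm_target_mm < fwhm_image_mm:
        raise ValueError("smoothing cannot improve spatial resolution")
    return math.sqrt(fwhm_target_mm ** 2 - fwhm_image_mm ** 2)


def fwhm_to_sigma(fwhm_mm):
    """Gaussian sigma (mm) that corresponds to a given FWHM (mm)."""
    return fwhm_mm * FWHM_TO_SIGMA


# 5-mm data smoothed with a 5-mm Gaussian filter:
print(round(combined_fwhm(5.0, 5.0), 2))  # 7.07
```

Under this assumption, reaching exactly 7.0 mm from 5.0 mm would require a filter of about 4.9 mm FWHM, so the 5-mm filter named in the text is a close approximation.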

Data evaluation
Evaluation of the data was performed with a dedicated software (ROVER, version 3.0.34, ABX advanced biochemical compounds GmbH, Radeberg, Germany) by an experienced physician in nuclear medicine. MTV of the primary tumor was delineated in each dataset using the same threshold-based, background-adapted algorithm [29]. Delineation was visually inspected and manually corrected if deemed necessary. Tumoral FDG-avid tissue not related to the primary tumor and delineable from the latter (lymph nodes, metastases) was excluded. If the primary tumor was determined to be multifocal (i.e., separate ipsilateral tumor nodules) or the presence of lymphangitic carcinomatosis was diagnosed by interdisciplinary consensus, all tumor nodules and FDG-avid lymphangitic tissue were included in the MTV (see also [15]). SUVmax and ASP [30] of the MTV were derived. SUV was normalized using the body weight in kg.
ASP was calculated identically to its initial definition by the authors [30], which was unaltered in subsequent publications [13][14][15][31][32][33][34][35][36][37]:
$$\mathrm{ASP} = 100\% \cdot \left(\sqrt[3]{H} - 1\right), \qquad H = \frac{1}{36\pi}\,\frac{S^{3}}{V^{2}} \qquad (2)$$
S and V are the surface area and the volume of the MTV, respectively. S was computed as the sum of all voxel surfaces that form the outer and inner surfaces of the MTV multiplied by the factor 2/3. Note that this corresponds to the approximation of the surface area of discrete 3D objects using six voxel classes as described by [38].
Please note that this definition of the MTV surface area is distinctly different from the definition by the Image Biomarker Standardization Initiative (IBSI), and compliance of both definitions cannot be assumed. The IBSI estimates the MTV surface area using a mesh-based representation after triangulation of the MTV's outer surface [26]. Additional file 1 provides the IBSI checklist for an overview of all methodological aspects of image generation and image processing in the present analysis. Distribution of ASP values in all current 50 patients is illustrated in Fig. 3.
Fig. 1 Sphere activity profiles. a Radial activity profiles of the 37-mm sphere for the reference algorithm with different in-plane filter widths to achieve different levels of reconstructed spatial resolution (FWHM). Acquisition time was 180 s. Substantial noise propagation can be observed at FWHM of approx. 5 mm. b Corresponding profiles for 6.4-mm in-plane filter width at shorter acquisition times. Noise especially increases between 90 and 60 s acquisition time, while reconstructed spatial resolution remains similar
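The voxel-based surface estimate and the resulting ASP can be sketched as follows. This is a minimal illustration, not the ROVER implementation: it assumes the commonly cited form ASP = 100% · (∛H − 1) with H = S³/(36πV²), represents the MTV as a set of integer voxel indices, and counts every voxel face that borders a non-MTV voxel (which covers outer and inner surfaces alike). All function names are hypothetical.

```python
import math

# Offsets to the six face neighbors of a voxel (±x, ±y, ±z).
NEIGHBOR_STEPS = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))


def voxel_surface_area(voxels, spacing=(1.0, 1.0, 1.0)):
    """Sum of exposed voxel faces times 2/3 (voxels: set of (i, j, k) indices)."""
    dx, dy, dz = spacing
    # Face area perpendicular to x, x, y, y, z, z (same order as NEIGHBOR_STEPS).
    face_areas = (dy * dz, dy * dz, dx * dz, dx * dz, dx * dy, dx * dy)
    exposed = 0.0
    for i, j, k in voxels:
        for (di, dj, dk), area in zip(NEIGHBOR_STEPS, face_areas):
            if (i + di, j + dj, k + dk) not in voxels:
                exposed += area
    return exposed * 2.0 / 3.0


def asphericity_percent(surface, volume):
    """ASP in percent; 0 for a perfect sphere (assumed formula, see lead-in)."""
    h = surface ** 3 / (36.0 * math.pi * volume ** 2)
    return 100.0 * (h ** (1.0 / 3.0) - 1.0)


def digital_sphere(radius):
    """Voxel set of a sphere centered on the origin voxel."""
    r = int(radius)
    rng = range(-r, r + 1)
    return {(i, j, k) for i in rng for j in rng for k in rng
            if i * i + j * j + k * k <= radius * radius}


def box(nx, ny, nz):
    """Voxel set of an nx × ny × nz block, e.g. an elongated rod."""
    return {(i, j, k) for i in range(nx) for j in range(ny) for k in range(nz)}


sphere = digital_sphere(10)
rod = box(40, 2, 2)
asp_sphere = asphericity_percent(voxel_surface_area(sphere), float(len(sphere)))
asp_rod = asphericity_percent(voxel_surface_area(rod), float(len(rod)))
```

With unit voxels, the digitized sphere yields an ASP close to zero, whereas the elongated rod yields a clearly higher value, which is the behavior the parameter is meant to capture.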
In each dataset, a spherical volume of interest (VOI) of approx. 19 ml was placed in the unaffected right liver lobe to derive its SUVmean and SUV standard deviation and calculate image noise (SUV standard deviation/SUVmean).
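The noise measure described above is a coefficient of variation and can be sketched in a few lines. The sketch assumes the voxel SUVs inside the liver VOI are already available as a list and uses the sample standard deviation; the text does not state which estimator the software applies, and the function name is hypothetical.

```python
import statistics


def image_noise(voxel_suvs):
    """Image noise as SUV standard deviation divided by SUVmean of a VOI."""
    suv_mean = statistics.fmean(voxel_suvs)
    if suv_mean == 0:
        raise ValueError("SUVmean of the VOI must be non-zero")
    # Sample standard deviation (n - 1 denominator) is an assumption here.
    return statistics.stdev(voxel_suvs) / suv_mean
```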

Statistical analysis
Statistical analysis was performed using SPSS 22 (IBM Corporation, Armonk, NY, USA). Descriptive parameters were expressed as median and IQR. Relative differences between any dataset a and the reference dataset b were calculated as follows: The significance of these differences was assessed with the Wilcoxon signed-rank test for paired data. Proportions (%) of discordantly classified cases (high vs. low ASP/SUVmax/MTV) between algorithms were given with their 95% binomial proportion confidence intervals (95% CI), which included the continuity correction of ± 0.5/n (= ± 0.5/50 = ± 1%). Classification with ASP (> 19.5%) was based on a previously identified cutoff in NSCLC patients [15], while cutoffs for SUVmax (> 10.5) and MTV (> 9.5 ml) were the respective medians among the current 50 patients. Proportions between different pairs of algorithms were compared with the two-sided McNemar's test. Correlation between ASP and MTV was examined using the Pearson correlation coefficient r and interpretation criteria based on [39]. Statistical significance was generally assumed at p < 0.05.
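Counting discordant cases and attaching a confidence interval can be sketched as below. The text does not name the interval type; a normal-approximation (Wald) interval with the stated continuity correction of ± 0.5/n, truncated to [0, 1], is assumed here because it reproduces the 0-6.9% reported for 1 discordant case out of 50. The function names are hypothetical.

```python
import math


def count_discordant(values_a, values_b, cutoff):
    """Number of cases classified high (> cutoff) in one dataset but not the other."""
    if len(values_a) != len(values_b):
        raise ValueError("paired data require equal-length inputs")
    return sum((a > cutoff) != (b > cutoff) for a, b in zip(values_a, values_b))


def wald_ci_continuity(k, n, z=1.959964):
    """95% Wald interval for k/n with continuity correction 0.5/n (assumed form)."""
    p = k / n
    half_width = z * math.sqrt(p * (1.0 - p) / n) + 0.5 / n
    return max(0.0, p - half_width), min(1.0, p + half_width)


lower, upper = wald_ci_continuity(1, 50)
print(round(lower * 100, 1), round(upper * 100, 1))  # 0.0 6.9
```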

Relative differences
To identify the level of reconstructed spatial resolution that provides minimal relative ASP difference to the reference algorithm (TOF 4/16), different combinations of spatial resolution for candidate algorithms (TOF 4/8, PSF + TOF 2/17, Q.Clear) and the reference algorithm were compared pairwise (Table 2).
Relative SUVmax and MTV differences at 7 versus 7 mm were significantly lower than corresponding ASP differences (each p < 0.001; Table 2).

Relative differences and discordant cases (retrospectively smoothed data)
Comparing data that were retrospectively smoothed to achieve 7-mm reconstructed spatial resolution with the original 7-mm data, relative differences between TOF 4/8 and the reference algorithm TOF 4/16 were higher in retrospectively smoothed data for ASP but similar for SUVmax and MTV (details in Table 4). In contrast, relative differences with PSF + TOF 2/17 were comparable for ASP and significantly higher in the smoothed data for SUVmax and MTV. With Q.Clear, relative differences for ASP, SUVmax and MTV were each significantly lower in the smoothed data compared to original 7-mm data. Proportions of discordantly classified cases at 7 versus 7 mm were comparable between retrospectively smoothed data and original 7-mm data for TOF 4/8 (smoothed vs. original, 2% vs. 2%; p = 1.0), for PSF + TOF 2/17 (4% vs. 4%; p = 1.0) and Q.Clear (6% vs. 10%; p = 0.5). The rate of discordant cases between retrospectively smoothed data and original 7-mm data for the reference algorithm TOF 4/16 itself was 2% (95% CI 0-6.9%).

Relative differences and discordant cases (reduced acquisition time)
Relative differences in ASP, SUVmax and MTV at reconstructed spatial resolution of 7 mm (TOF 4/8/6, TOF 4/16/6.4, PSF + TOF 2/17/7 and Q.Clear 1750) and shorter acquisition times are displayed in Additional file 2: Tables S2 to S4. Irrespective of the acquisition time for the candidate algorithms, relative differences were always calculated with regard to the reference algorithm TOF 4/16/6.4 at 180 s. Briefly, relative ASP, SUVmax and MTV differences with TOF 4/8/6 and TOF 4/16/6.4 were significantly higher at any shorter acquisition time (i.e., 120 s, 90 s and 60 s) than at 180 s. Relative differences with PSF + TOF 2/17/7 tended to remain similar between 180 and 90 s but increased significantly at 60 s. Q.Clear 1750 mostly showed similar ASP, SUVmax and MTV differences between all acquisition times.
Proportions of discordantly classified cases of ASP, SUVmax and MTV with TOF 4/8/6, PSF + TOF 2/17/7 and Q.Clear 1750 did not increase significantly with shorter acquisition time (each compared to 180 s; Additional file 2: Tables S5 to S7). Discordant cases with TOF 4/16/6.4 remained similar at 120 s and 90 s but increased with 60 s acquisition time (McNemar's test not applicable).

Discussion
This study found that ASP differences between reconstruction algorithms were significantly higher than corresponding SUVmax and MTV differences (Table 2). This may be explained by a combined effect of changes in SUVmax (suppression of local maxima and therefore a decreasing absolute threshold and increasing MTV size) and changes in MTV surface (smoothed, smaller MTV surface) on the ASP. Coarseness of the MTV surface is likely to differ with variation in reconstructed spatial resolution, which, in conventional iterative reconstruction algorithms, is mainly determined by the width of the in-plane filter. Therefore, if threshold-based MTV delineation is applied, wider filters can be expected to result in lower ASP. In Bayesian-penalized likelihood reconstruction (e.g., GE's Q.Clear), post-processing is not applied, and smoother images are generated by increasing the penalization factor β.
However, since ASP is supposed to serve as part of prognostic/predictive models based on a predefined cutoff, even substantial inter-method differences may be clinically irrelevant if classification of individual patients into groups of high versus low ASP remains concordant. Applying a strict cutoff for ASP of > 19.5% [15], discordantly classified cases compared to the reference algorithm accounted for 2% (TOF 4/8) or 4% (PSF + TOF 2/17) at spatial resolution of approx. 7-mm FWHM. This could be acknowledged as acceptably low for application of ASP in a multicenter study. If a less strict cutoff with ± 5% tolerance (ASP between 18.53% and 20.48%) was applied, no discordant cases at 7-mm FWHM were observed for TOF 4/8 and PSF + TOF 2/17. This underlines that inter-method ASP differences at comparable spatial resolution are clinically relevant only if ASP is close to the predefined cutoff. Furthermore, this range of tolerance is well covered by the range of possible ASP cutoffs (17% to 39%) within which ASP remained significantly prognostic for PFS in previously reported patients with UICC stage II NSCLC [15].
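The effect of a tolerance band around the cutoff can be sketched as follows. How the tolerance was applied in the analysis is not specified in the text; this sketch assumes one possible reading, in which a pair of measurements counts as discordant only if one value lies above the band and the other below it. The function names and labels are hypothetical.

```python
def tolerance_band(cutoff, rel_tol):
    """Lower and upper bound of a relative tolerance band around a cutoff."""
    return cutoff * (1.0 - rel_tol), cutoff * (1.0 + rel_tol)


def classify_asp(asp, cutoff=19.5, rel_tol=0.05):
    """Label an ASP value as 'high', 'low' or 'borderline' (inside the band)."""
    lower, upper = tolerance_band(cutoff, rel_tol)
    if asp > upper:
        return "high"
    if asp < lower:
        return "low"
    return "borderline"


def is_discordant(asp_a, asp_b, cutoff=19.5, rel_tol=0.05):
    """Assumed rule: discordant only if the two values fall on opposite sides of the band."""
    labels = {classify_asp(asp_a, cutoff, rel_tol), classify_asp(asp_b, cutoff, rel_tol)}
    return labels == {"high", "low"}
```

For the 19.5% cutoff and 5% tolerance, the band spans 18.525% to 20.475%, which matches the rounded limits given in the text.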
Relative differences and discordant proportions tended to be higher with Q.Clear. Notably, Q.Clear showed systematically lower image noise at any level of spatial resolution (Table 1 and Fig. 2). In contrast to conventional algorithms, relative ASP differences with Q.Clear compared to the reference algorithm were higher at 7 versus 7 mm than at 5 versus 7 mm (Table 2) or at 7 versus 9 mm (Additional file 2: Table S8). Simultaneously, noise levels at 5 versus 7 mm and 7 versus 9 mm were also more comparable to the reference algorithm than at 7 versus 7 mm. However, the same observation was not true for SUVmax and MTV or with the conventional algorithms. Consequently, similar reconstructed spatial resolution rather than the noise level should guide the choice of reconstruction algorithms for harmonization for multicenter purposes. Furthermore, Q.Clear, or Bayesian-penalized likelihood reconstruction in general, may not be optimal to achieve minimal ASP deviations if the reference is a conventional algorithm.
With the PET scanner used in the present study, variation of image noise between algorithms was especially prominent at spatial resolution of 5-mm FWHM (Table 1, Fig. 1). This partly explains high inter-method differences, which exceeded 100% for TOF 4/8 and TOF 4/16 (Table 2), and frequent discordant cases even if pairs of algorithms with 5 versus 5 mm FWHM were compared. In addition to higher noise, Gibbs artifacts (edge elevations) caused by PSF + TOF and Q.Clear reconstruction increase with narrower in-plane filters or lower β [40]. Consequently, SUVmax differences will be more prominent than at 7 mm or 9 mm FWHM. In contrast, in substantially smoothed data with 9-mm FWHM, PET parameters that are reflective of heterogeneity or irregularity of tracer accumulation, such as ASP, may lose discriminatory power to detect "real" and clinically relevant differences between tumors/patients. Therefore, under the conditions of the current analysis, 7-mm FWHM could be a feasible and reasonable target for harmonization in a multicenter approach. This is underlined by the observation that the MTV threshold for correlation between ASP and MTV was lowest for TOF 4/16/6.4 compared to TOF 4/16/9.5 and especially TOF 4/16/2.
If reconstructed spatial resolution is better than the target resolution (e.g., 5 mm instead of 7-mm FWHM), retrospective smoothing of data using formula (1) can be performed to achieve the anticipated resolution. This enabled inter-method differences and discordant proportions far closer to those observed with the original 7-mm data, irrespective of TOF, PSF + TOF or Q.Clear. Consequently, in a multicenter analysis, retrospective smoothing of data with better spatial resolution would be a valid option to ensure comparability. It is important to note that here the effective reconstructed spatial resolution is relevant [28], which can differ notably from the resolution determined via point sources.
A similar approach by the EANM Research Ltd. (EARL) harmonization project was reported by Kaalep et al., who analyzed SUV and MTV in FDG-PET data of NSCLC and lymphoma patients. Only after applying an additional Gaussian post-reconstruction filter of 6- to 7-mm FWHM to PET data reconstructed with PSF + TOF (compliant with the current EARL 2 standard) could SUV and MTV differences be reduced from approx. 30% to < 10% compared to reconstruction compliant with the former EARL 1 standard [41]. In a different approach to harmonization, Tsutsui et al. examined OSEM + TOF data of a NEMA IEC phantom obtained with a Siemens Biograph mCT and showed that errors compared to a simulated reference phantom were lowest with an in-plane filter of approx. 7- to 8-mm FWHM [42]. In a different study, the group achieved harmonization between 12 different PET scanners using contrast recovery (CR) of NEMA IEC phantom spheres by applying a scanner-specific Gaussian filter of up to 8-mm FWHM [43]. The current results of low SUVmax differences < 5% and MTV differences ≤ 6% at 7 versus 7 mm FWHM imply that both CR and reconstructed spatial resolution may be suitable surrogates for harmonization.
Shorter acquisition times of 120 s, 90 s or 60 s increased inter-method differences compared to 180 s with TOF 4/8/6 and TOF 4/16/6.4, while the increase was insignificant or less prominent with PSF + TOF 2/17/7 and Q.Clear 1750. More importantly, proportions of discordantly classified cases by ASP, SUVmax or MTV remained similar or did not increase significantly, especially between 180 and 90 s. Therefore, equal acquisition times between PET systems/centers may be of secondary importance to achieve comparability in the investigated parameters, and differences as high as 180 s versus 90 s might be tolerable.
Voxel sizes may also vary between PET systems in a multicenter study. However, due to technical restrictions, voxel size could not be freely varied during image reconstruction in this study. Therefore, the influence on ASP, SUVmax and MTV and the correcting effect of retrospective reslicing to the original voxel size could not be assessed. A further limitation of the current analysis is that the variation in reconstruction algorithms and acquisition time may not fully reflect differences between PET scanners beyond these factors. This would require comparative examinations with different scanners in each patient under identical conditions [20,44]. For methodological consistency with the previous studies [13][14][15], the same threshold-based algorithm [29] was used to delineate all lesions. Consequently, the presented results are not necessarily valid when lesions are delineated differently. Furthermore, although the current study demonstrated that the reconstructed spatial resolution can be used as a surrogate for scanner harmonization and showed the lowest inter-method ASP differences and the lowest MTV threshold for correlation between ASP and MTV at 7-mm FWHM, this is not sufficient for a general recommendation of this specific spatial resolution for future studies regarding the ASP. This decision should also consider the performance of all PET scanners used in a specific study (best achievable reconstructed spatial resolution) and, if available, comparative clinical results on the value of ASP at different reconstructed spatial resolution.

Conclusions
Differences in ASP, SUVmax and MTV resulting from TOF 4/8, PSF + TOF 2/17 or Q.Clear compared to the reference algorithm TOF 4/16 were mainly determined by differences in reconstructed spatial resolution. Therefore, harmonization for ASP in multicenter studies should aim at comparable reconstructed spatial resolution between PET systems, which is determined by either in-plane filter width or the penalization factor β. With the PET scanner used in the present study, a resolution of 7-mm FWHM ensured that discordantly classified cases of high versus low ASP were at an acceptable proportion for TOF and PSF + TOF of < 5% (Q.Clear: 10%). Retrospectively smoothing data with better spatial resolution (i.e., lower FWHM) to the desired FWHM resulted in comparable results. These results require confirmation in a multicenter study.