Quantitation of Cancer Treatment Response by FDG PET/CT: Multi-center Assessment of Measurement Variability Using Auto-PERCIST™

Background: The aim of this study was to assess reader variability in quantitatively assessing pre- and post-treatment F-18 fluorodeoxyglucose positron emission tomography/computed tomography (FDG PET/CT) scans in a defined set of images of cancer patients using the same semi-automated analytical software (Auto-PERCIST™), which identifies tumor peak standardized uptake value corrected for lean body mass (SULpeak) to determine 18F-FDG PET quantitative parameters.
Methods: Paired pre- and post-treatment FDG PET/CT images from 30 oncologic patients and the Auto-PERCIST™ semi-automated software were distributed to 13 readers across US and international sites. One reader was aware of the relevant medical history of the patients (read reference), whereas the 12 other readers were blinded to history but had access to the correlative images. Auto-PERCIST™ was set up to first automatically identify the liver and compute the threshold for tumor measurability, (1.5 × liver mean) + (2 × liver standard deviation [SD]), and then detect all sites with SULpeak greater than the threshold. Next, the readers selected the sites they believed to represent tumor lesions. The main performance metric assessed was the percent change in the SULpeak (%ΔSULpeak) of the hottest tumor identified on the baseline and follow-up images.
Results: The intra-class correlation coefficient (ICC) for the %ΔSULpeak of the hottest tumor was 0.87 (95% CI: [0.78, 0.92]) when all reads were included (n=297). Including only the measurements that selected the same target tumor as the read reference (n=224), the ICC for %ΔSULpeak was 1.00 (95% CI: [1.00, 1.00]). The Krippendorff alpha coefficient for response (complete or partial metabolic response versus stable or progressive metabolic disease on PET Response Criteria in Solid Tumors 1.0) was 0.91 for all reads (n=380), and 1.00 for reads with the same target tumor selection (n=270).
Conclusion: Quantitative tumor FDG SUL peak changes measured across multiple global sites and readers utilizing Auto-PERCIST™ show very high correlation. Harmonization of methods to single software, Auto-PERCIST™, resulted in virtually identical extraction of quantitative tumor response data from FDG PET images when the readers select the same target tumor.

Introduction
F-18 fluorodeoxyglucose positron emission tomography/computed tomography (FDG PET/CT) is increasingly applied in monitoring treatment response in patients with cancer. While PET is intrinsically a quantitative imaging technique, many PET assessments of cancer response are qualitative, as for example in lymphoma, where quantitative PET data are converted into a 5-point qualitative scale that is practical and highly useful (1,2). Quantitative PET assessments of response have been deployed in many research imaging studies, especially in examining early treatment-related changes in metabolism, including in breast cancer, where these changes can predict much later pathological outcomes (3,4). The PET Response Criteria in Solid Tumors 1.0 (PERCIST 1.0) were proposed in 2009 as a method to standardize the assessment of tumor response on FDG PET and emphasized use of the peak standardized uptake value corrected for lean body mass (SULpeak) in contrast to the maximum standardized uptake value (SUVmax) (5,6). While the SUVmax is reasonably easy to determine with many forms of software, the SULpeak is more challenging to measure (7).
Thus, despite its attractiveness, quantitative PET is still not routinely performed for assessing response to therapy in patients with cancer in many settings. One way to expand the use of quantitative FDG PET/CT in clinical trials and clinical practice is to reduce reader variability of SUV measurements and make the measurements rapid and automated. In a previous multi-center, multi-reader study we conducted, multiple sites assessed the same paired pre- and post-treatment FDG PET/CT images in cancer patients. The intra-class correlation coefficient (ICC) of percent change in SUVmax was 0.89 (95% confidence interval (CI): [0.81, 0.94]) across multiple performance sites using a variety of analytical software tools. The ICC for the SULpeak was lower at 0.70 (95% CI: [0.54, 0.80]). SULpeak is, in principle, the more statistically sound of the PET parameters and it is the suggested metric in PERCIST (7). However, if there is considerable variability among sites in how SULpeak is generated and measured, then the PERCIST metric may introduce variability into assessments of treatment response rather than reduce it (8).
The aim of the present study was to determine whether the utilization of Auto-PERCIST™, a semi-automated software system for the quantitative assessment of FDG PET images, could lower the reader variability in quantitatively assessing pre- and post-treatment FDG PET/CT studies for response in a multi-center, multi-reader, multi-national study assessing identical images.

Materials And Methods
Pre- and post-treatment FDG PET/CT images of 30 oncologic patients, selected from a group of tumor types having representative patterns of FDG-avidity, contained a mix of single and multiple tumors on the pretreatment scan (1 tumor, n=6; >1 but <10 tumors, n=19; ≥10 tumors, n=6), and a mix of the four major response categories using PERCIST (complete metabolic response, n=4; partial metabolic response, n=11; stable metabolic disease, n=4; and progressive metabolic disease, n=12).
Sites with and without National Cancer Institute Quantitative Imaging Network affiliation that had not participated in the previous study using the same dataset were recruited by email and conference calls.
The dataset was the same one used in the previous study of reader variability (9).
Thirty anonymized cases of pre- and post-treatment FDG PET/CT studies (total 60 studies) were distributed along with directions for installing and utilizing the Auto-PERCIST™ software. Approval from the institutional review board was obtained, and patient consent was waived for this study of anonymized image data.

Measurement
Individual measurements from coupled pre- and post-treatment FDG PET/CT images from one patient were counted as a read. The coupled pre- and post-treatment measurements for all 30 cases from a single reader were counted as a set of reads. One reader from the central site (reader 1) had full knowledge of the primary tumors, treatment histories and subsequent follow-up results, but all other readers had no knowledge of the patients' medical histories, as the reader is often intentionally blinded in the setting of multicenter trials. For statistical purposes, the measurements by reader 1 were considered the reference standard for comparison (read reference).
Each reader determined which tumor to measure. Auto-PERCIST™ loads the PET images and automatically obtains liver measurements from a 3 cm diameter sphere to compute the threshold for lesion detection. The default setting is 1.5 × liver mean + 2 standard deviations (SD) at baseline. For follow-up images, the default setting is 1.0 × liver mean + 2 SD. If a lesion was visually perceptible but not detected using the default threshold settings, the reader had the choice to manually lower the threshold for detection. Auto-PERCIST™ would then detect all sites with SULpeak higher than the threshold (Figure 1).
It was up to the readers to determine whether the detected sites were true tumor lesions or not. The reader could also separate a detected focus of FDG uptake into smaller lesions when needed, either to exclude adjacent physiologic FDG uptake or to break down a large conglomeration of tumors into smaller separate lesions. The reader could also merge smaller foci of FDG uptake into a single lesion if the reader decided that the separate foci were parts of a larger single lesion. The readers were instructed to select up to 5 of the hottest tumors, or more. The readers could view the PET/CT images on any reading software they preferred, but the measurements came only from Auto-PERCIST™. The measurements from Auto-PERCIST™ included SULpeak, maximum and mean SUL, number of counts, geometric mean, exposure, kurtosis, skewness, and metabolic volume. After the readers selected and quantified the lesions, the measurements were saved as text files and sent for central compilation and analysis to the Image Response Assessment Core at Johns Hopkins University.
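The default threshold rule described above can be illustrated with a short sketch. This is not the Auto-PERCIST™ implementation; the function names, the dictionary of candidate sites, and the example liver values are hypothetical and chosen only to show the arithmetic.

```python
def detection_threshold(liver_mean, liver_sd, follow_up=False):
    """Default threshold as described in the text:
    baseline:  1.5 x liver mean + 2 SD
    follow-up: 1.0 x liver mean + 2 SD
    """
    factor = 1.0 if follow_up else 1.5
    return factor * liver_mean + 2.0 * liver_sd


def detect_sites(sul_peaks, threshold):
    """Keep only the candidate sites whose SULpeak exceeds the threshold.

    sul_peaks maps a (hypothetical) site label to its SULpeak value.
    """
    return {site: value for site, value in sul_peaks.items() if value > threshold}


# Hypothetical liver statistics and candidate sites, for illustration only.
baseline_threshold = detection_threshold(liver_mean=1.6, liver_sd=0.2)
candidates = {"site_a": 5.1, "site_b": 2.5, "site_c": 3.0}
detected = detect_sites(candidates, baseline_threshold)
```

With these made-up numbers the baseline threshold is about 2.8, so only the sites above that value would be offered to the reader, who still decides which of them are tumor.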

Statistical analysis
The primary study metric was the percentage change in SULpeak (%ΔSULpeak) from baseline to follow-up.
Percentage change was defined as [(follow-up measurement - baseline measurement) / (baseline measurement)] × 100. Treating both case and site as random effects, a linear random-effects model was fit via the restricted maximum likelihood estimation method, which estimated the variance components of the random effects in the model. As a measure of inter-rater agreement, the intra-class correlation coefficient (ICC) was computed using the variance components of the random effects. The ICC was computed as [inter-subject variance / (inter-subject variance + intra-subject variance + residual variance)]. The bias-corrected and accelerated bootstrap method was implemented with 1,000 bootstrap replicates to construct the 95% confidence interval of the computed ICC. The sampling unit was a read.
To assess agreement between the reference reader (read reference) and each other reader, the ICC was computed for each pairing of the reference reader with one of the 12 other readers. The mean of these ICCs and its range (minimum, maximum) were reported.
The Krippendorff alpha reliability coefficient was computed as a measure of agreement between multiple readers for response outcome, which was classified into four ordered major response categories using PERCIST 1.0: complete metabolic response (CMR), partial metabolic response (PMR), stable metabolic disease (SMD), and progressive metabolic disease (PMD). The measurements were classified as follows: PMD for SULpeak increase ≥ 30% (and 0.8 units) or new lesions; SMD for SULpeak increase or decrease < 30% (or 0.8 units); PMR for SULpeak decrease ≥ 30% (and 0.8 units); and CMR for no perceptible tumor lesion.
Additionally, the Krippendorff coefficient was computed with the response categories dichotomized into two levels: clinical benefit (CMR/PMR/SMD) versus no benefit (PMD), or response (CMR/PMR) versus no response (SMD/PMD). Krippendorff suggests 0.8 as a threshold for satisfactory reliability, but if tentative conclusions are acceptable, 0.667 is the lowest conceivable threshold (10).
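As a minimal sketch of the two formulas above, the following computes the percentage change and the ICC from variance components. It assumes the variance components have already been estimated by a mixed-model fit, which is not reproduced here; the function and argument names and the example numbers are illustrative, not taken from the study.

```python
def percent_change(baseline, follow_up):
    """[(follow-up - baseline) / baseline] x 100, as defined in the text."""
    if baseline == 0:
        raise ValueError("baseline measurement must be non-zero")
    return (follow_up - baseline) / baseline * 100.0


def icc_from_variance_components(inter_subject, intra_subject, residual):
    """ICC = inter-subject variance /
             (inter-subject + intra-subject + residual variance)."""
    total = inter_subject + intra_subject + residual
    if total <= 0:
        raise ValueError("total variance must be positive")
    return inter_subject / total


# Hypothetical values, for illustration only.
change = percent_change(baseline=10.0, follow_up=6.0)      # a 40% decrease
icc = icc_from_variance_components(8.0, 1.0, 1.0)          # 8 / 10
```

The ICC approaches 1 as the share of total variance attributable to differences between cases grows relative to the reader and residual components.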
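The classification and dichotomization rules above can be sketched as follows. This is an illustrative reading of the stated thresholds, not the study's code: the two boolean flags stand in for the reader's visual judgement, the precedence of the new-lesion flag over the CMR flag is an assumption, and all names are hypothetical.

```python
def classify_percist(baseline_sul, follow_up_sul,
                     new_lesion=False, no_perceptible_tumor=False):
    """Assign one of the four response categories described in the text."""
    if new_lesion:                      # assumed to take precedence
        return "PMD"
    if no_perceptible_tumor:            # reader's visual judgement
        return "CMR"
    absolute = follow_up_sul - baseline_sul
    percent = absolute / baseline_sul * 100.0
    if percent >= 30.0 and absolute >= 0.8:
        return "PMD"
    if percent <= -30.0 and absolute <= -0.8:
        return "PMR"
    return "SMD"


def is_response(category):
    """Response (CMR/PMR) versus no response (SMD/PMD)."""
    return category in ("CMR", "PMR")


def is_clinical_benefit(category):
    """Clinical benefit (CMR/PMR/SMD) versus no benefit (PMD)."""
    return category != "PMD"
```

For example, a fall from 2.0 to 1.3 is a 35% decrease but only 0.7 units, so under this reading it is classified as SMD rather than PMR.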

Results

All reads
Reads were received from 13 different sites from January to September of 2018. A single reader (nuclear medicine physician/radiologist/radiological scientist) at each site measured all 30 cases. Measurements were treated as missing when a reader did not submit data. Among a total of 390 possible reads by 13 readers, 347 baseline reads and 329 follow-up reads were reported, of which 297 reads were complete baseline and follow-up pairs. These reads were used to compute the ICC across all readers and the agreement with the read reference for baseline SULpeak, follow-up SULpeak, and percentage change in SULpeak, respectively. The ICC for %ΔSULpeak was 0.87 (95% CI: [0.78, 0.92]), and agreement with the read reference was 0.88 (range: [0.61, 1.00]). The ICC and agreement with the read reference for the other metrics are in Table 1. The overall within-subject coefficient of variation (COV; overall SD / average of the case means) for %ΔSULpeak was computed as 2.293. The Bland-Altman plot of the %ΔSULpeak for all reads is shown in Figure 2, and the corresponding plot for only the reads with the same lesion selected as the read reference is shown in Figure 3.
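The Bland-Altman plots are described in their captions as showing, per case, the mean measurement across readers against the average difference from the reference reader, with lines at the average bias and bias ± 2 SD. A minimal sketch of those quantities follows; the use of the sample standard deviation and all names and numbers are assumptions made for illustration.

```python
from statistics import mean, stdev


def case_point(reference_value, other_values):
    """Return (x, y) for one case: x is the mean over all readers,
    y is the mean difference between the other readers and the reference."""
    all_values = [reference_value] + list(other_values)
    x = mean(all_values)
    y = mean([value - reference_value for value in other_values])
    return x, y


def bias_and_limits(differences):
    """Average bias and the bias +/- 2 SD lines (sample SD assumed)."""
    bias = mean(differences)
    sd = stdev(differences)
    return bias - 2.0 * sd, bias, bias + 2.0 * sd


# Hypothetical per-case differences, for illustration only.
lower, bias, upper = bias_and_limits([1.0, 2.0, 3.0])
x, y = case_point(-40.0, [-40.0, -30.0])
```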

Sum of up to 5 SULpeak
In addition to the SULpeak measurement of a single lesion, the sum of SULpeak measurements of up to 5 of the selected lesions was used to compute the ICC and agreement with the read reference for all reads and for reads with the same target lesion (Table 1). Even when the same lesions were selected, the ICCs and agreement with the read reference were not a perfect 1.00 due to (a) differences in the manual thresholds used for lesion detection and (b) utilization of the 'erosion option' for breaking up FDG uptake volumes by the individual readers.

*Reads with missing response were excluded (10 for all reads, and 1 for reads with same target).

Inter-rater reliability of readers on responses
Among all 390 reads, 380 reported response categories. Among the 271 reads agreeing on target selection with the read reference, 270 reported response categories. The Krippendorff alpha coefficient of the 13 readers for the binary response measure (response (CMR/PMR) versus no response (SMD/PMD)) was 0.91 for all reads, and 1.00 for only the reads with the same target lesion selection. When assessing clinical benefit (SMD/PMR/CMR representing clinical benefit versus PMD representing no benefit), the Krippendorff alpha coefficient was 0.81 for all reads and 1.00 for only the reads with the same target selection. With the four response categories treated on an ordinal scale, the Krippendorff alpha coefficient was 0.86 for all reads and 1.00 for only the reads with the same target selection (Table 2).
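For readers unfamiliar with the statistic, the following is a minimal sketch of Krippendorff's alpha for nominal categories, built from the coincidence matrix and tolerant of missing reads. It corresponds only to the dichotomized analyses; the ordinal version used for the four-category scale needs a different distance metric and is not shown. The function name and input layout are hypothetical, and this is not the code used in the study.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """units: one list per case holding each reader's category,
    with None marking a missing read."""
    coincidences = Counter()
    for ratings in units:
        values = [r for r in ratings if r is not None]
        m = len(values)
        if m < 2:
            continue  # a case rated by fewer than two readers is not pairable
        for i, j in permutations(range(m), 2):
            coincidences[(values[i], values[j])] += 1.0 / (m - 1)

    totals = Counter()
    for (c, _k), weight in coincidences.items():
        totals[c] += weight
    n = sum(totals.values())

    observed = sum(w for (c, k), w in coincidences.items() if c != k)
    expected = sum(totals[c] * totals[k]
                   for c in totals for k in totals if c != k) / (n - 1)
    if expected == 0:
        return float("nan")  # only one category used; alpha is undefined
    return 1.0 - observed / expected


# Two readers, ten cases, binary categories (illustrative data).
reader_a = [0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
reader_b = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]
alpha = krippendorff_alpha_nominal([list(pair) for pair in zip(reader_a, reader_b)])
```

In this small example the two readers disagree on 4 of 10 cases, and alpha comes out just under 0.1, far below the 0.8 threshold cited above.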

Discussion
Variability in measurements across readers and sites is an often-cited hurdle to broader utilization of quantitative FDG PET/CT for response assessment of cancer treatment (11). Test-retest studies have demonstrated high repeatability of FDG and other radiopharmaceutical PET parameters (12)(13)(14)(15). The variance of SUVs could be greater in clinical practice than under ideal study settings (16). In the clinical setting, measurement of SUVmax was demonstrated to have high agreement in our previous paper, while the statistically more robust SULpeak showed suboptimal agreement (9). We wanted to know whether using uniform software could eliminate the variability associated with the computation differences for SULpeak across multiple vendors and software.
The localization of the liver, SUL measurements from the liver, computation of a threshold for lesion detection, and identification of candidate lesions were all performed automatically by Auto-PERCIST™. Following detection of all sites with SULpeak higher than the set threshold, various FDG uptake intensity or pattern measurements and textural features for each of the detected sites were also computed automatically. When the readers chose the same single target tumor, the measurements were identical, as could be expected. For measurements of up to 5 of the hottest lesions, the agreement was near perfect. However, agreement was not a perfect 1.00 even when the readers chose the same tumors, because the readers had the option to break down a single volume of FDG uptake into separate parts, or to merge two or more FDG uptake sites into a single volume, as they determined appropriate. Some readers chose to break down a lesion detected by Auto-PERCIST™ to avoid including physiologic FDG uptake, or to separate a conglomeration of multiple tumor lesions. Some readers also intentionally chose a threshold of detection lower than the default software setting to include lesions with relatively low FDG uptake for assessment on the follow-up PET images.
A previous paper that showed excellent correlation between two different vendor software packages for SULpeak had the tumor sites predefined by the readers to exclude interpretive error (13). Determining which FDG uptake site is true tumor remained a challenge even for experienced readers. In the outlier case in Figure 2 showing an average difference greater than 100%, some readers considered an intense FDG uptake on the follow-up image to be a new tumor lesion, while the read reference considered it physiologic in nature. Of the 360 non-reference baseline reads (including missing measurements) in this study, only 241 reads (67%) chose the same lesion and went on to make the same measurements as the read reference at both baseline and follow-up. Among the 30 cases, the target lesion (hottest tumor) on the post-therapy scan was different from the target lesion noted on the pre-therapy scan in 11 cases. For example, in one case the target lesion was in a mediastinal node on the pre-therapy scan, and then a lung lesion became the hottest tumor on the post-therapy scan. In 3 cases, nodes in different stations were the target lesions at different time points. Among patients with multiple bone or lung metastases, different lesions in the same organ could be observed becoming the target tumors at different time points. As seen in interobserver agreement studies of FDG PET/CT performed in patients with lymphoma after therapy, even experienced readers do not always agree on what is tumor FDG uptake and what is physiologic FDG uptake (17,18). Rather than relying solely on the reading experience of the local site, discussions, consensus meetings and better training methods are necessary to implement FDG PET/CT to its full potential. It is almost certainly the case that the availability of more relevant patient history would result in better accuracy and consistency in tumor detection.
While PERCIST 1.0 is quantitative, the category of CMR is dependent on the reader's judgement, and software quantification alone could not determine the response to be CMR. There were 6 cases considered to have reached CMR by the read reference. The twelve other readers categorized these cases as CMR, in agreement with the read reference, in 44 reads out of 72 (12 readers × 6 cases); PMR was designated in 21 reads, SMD in 5 reads and PMD in 1 read, with 1 missing read. Thus, in addition to selection of a different target tumor from the read reference, the reader's decision between CMR and PMR leaves room for variability in response categorization, even if quantitation produces identical results. A detailed definition of, or consensus on, the findings compatible with the CMR category, or the addition of a quantitative threshold to clarify the CMR category, is necessary for use in trials and in the clinical setting. The threshold computed from liver measurements (liver SUL mean + 2 SD) was viewed by the readers as too high a cutoff for CMR in this study as assessed visually.
Revealing a potential limitation of the software, and of the PERCIST criteria, there was a small tumor with clearly perceptible FDG uptake visually that was not detectable by Auto-PERCIST™ because its volume was below the 1 cubic centimeter sphere in the PERCIST definition of SULpeak (Figure 4).
Auto-PERCIST™ computed additional PET parameters representing tumor features, such as metabolic volume, geometric mean, exposure, kurtosis and skewness, which have been reported as prognostic markers and diagnostic tools (19)(20)(21)(22). Even with the addition of several PET parameters, the measurement took from seconds to, at the longest, a few minutes for cases with many lesions. In addition to reducing variability in measurement, the software reduced the measurement time radically. Auto-PERCIST™ may become adjunct reading software in the way that myocardial perfusion and metabolism studies utilize cardiac image analysis software.

Conclusion
Harmonization of methods to a single software package, Auto-PERCIST™, resulted in virtually identical extraction of quantitative data including the SULpeak when the readers selected the same target tumor, and should promote greater use of FDG PET/CT for response assessment in cancer treatment. Nonetheless, the findings show that caution remains in order, as lesion selection still relies on qualitative assessments of whether a lesion is tumor or physiological uptake.

This retrospective study was approved by our Institutional Review Board. Informed consent was waived.

Consent for publication
Not applicable

Availability of data and material
The datasets used in this study are available from the corresponding author on reasonable request.

Competing interests
No relevant conflicts of interest were identified, except that two of the authors, JL and RW, are co-inventors on a patent underlying the Auto-PERCIST™ software.

Figure 2
Bland-Altman plot of the percentage change of tumor FDG uptake from baseline to follow-up. The plot is for the percentage changes of SULpeak (%ΔSULpeak) for all reads. Each dot represents a case (30 cases in total). The x-axis represents the mean percentage change measurement by all readers. The y-axis represents the average difference between the 12 readers and the reference reader (read reference). The solid line represents the average bias, and the dashed lines represent the corresponding bias ± 2 standard deviations (SD).

Figure 3
Bland-Altman plot of the percentage change of tumor FDG uptake from baseline to follow-up. The plot is for the percentage changes of SULpeak (%ΔSULpeak) for only the reads with the same lesion selected as the read reference. Each dot represents a case (30 cases in total). The x-axis represents the mean %ΔSULpeak measurement by all readers. The y-axis represents the average difference between the 12 readers and the reference reader (read reference), and the y-axis unit is one tenth of one percent. The solid line represents the average bias, and the dashed lines represent the corresponding bias ± 2 standard deviations (SD).