Variability in measurements across readers and sites is an often-cited hurdle to broader utilization of quantitative [18F]FDG PET/CT for response assessment of cancer treatment [11]. Test–retest studies have demonstrated high repeatability of [18F]FDG and other radiopharmaceutical PET parameters [12,13,14,15]. However, the variance of SUVs may be greater in clinical practice than in an ideal study setting [16]. In the clinical setting, our previous paper demonstrated high agreement for measurement of SUVmax, while the statistically more robust SULpeak showed suboptimal agreement [9]. We sought to determine whether uniform software could eliminate the variability associated with differences in SULpeak computation across multiple vendors and software platforms.
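As background, SUL is the SUV normalized to lean body mass (LBM) rather than total body weight. The following is a minimal sketch of the computation, assuming the sex-specific James formula for LBM referenced by PERCIST 1.0; the function and variable names are illustrative, not Auto-PERCIST™ internals:

```python
def lean_body_mass_kg(weight_kg: float, height_cm: float, is_male: bool) -> float:
    """James formula for lean body mass, as referenced by PERCIST 1.0."""
    if is_male:
        return 1.10 * weight_kg - 128.0 * (weight_kg / height_cm) ** 2
    return 1.07 * weight_kg - 148.0 * (weight_kg / height_cm) ** 2


def sul(activity_kbq_per_ml: float, injected_dose_kbq: float, lbm_kg: float) -> float:
    """SUL: decay-corrected activity concentration normalized to LBM.

    Assumes a tissue density of ~1 g/mL, so grams and milliliters cancel.
    """
    return activity_kbq_per_ml / (injected_dose_kbq / (lbm_kg * 1000.0))
```

The dependence on height through the LBM estimate is what makes the patient's height a required input, a point revisited below.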
The localization of the liver, SUL measurements from the liver, computation of a threshold for lesion detection, and identification of candidate lesions were all performed automatically by Auto-PERCIST™. Following detection of all sites with SULpeak higher than the set threshold, various measurements of [18F]FDG uptake intensity or pattern and textural features for each detected site were also performed automatically. When the readers chose the same single target tumor, the measurements were identical, as could be expected. For measurements of up to the five hottest lesions, agreement was near perfect. However, agreement was not a perfect 1.00 even when the readers chose the same tumors, because the readers had the option to break a single volume of [18F]FDG uptake into separate parts, or to merge two or more [18F]FDG uptake sites into a single volume, as they deemed appropriate. Some readers chose to break down a lesion detected by Auto-PERCIST™ to avoid including physiologic [18F]FDG uptake, or to separate a conglomeration of multiple tumor lesions. Some readers also intentionally chose a detection threshold lower than the default software setting to include lesions with relatively low [18F]FDG uptake for assessment on the follow-up PET images. Agreement was lower on follow-up images in the all-reads assessment. The readers disagreed more often on what was tumor and what was physiologic or inflammatory response on the follow-up images.
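To make these automated steps concrete, here is a minimal sketch, not Auto-PERCIST™ internals: it assumes the PERCIST 1.0 conventions of a 1 cm³ sphere (radius ≈ 6.2 mm) for SULpeak and a measurability threshold commonly given as 1.5 × liver SULmean + 2 SD from a spherical liver ROI (the software's exact default may differ); voxel spacing and function names are illustrative.

```python
import numpy as np
from scipy import ndimage


def sphere_kernel(radius_mm: float, voxel_mm: tuple) -> np.ndarray:
    """Binary mask of a sphere on the voxel grid (1 cm^3 -> radius ~6.2 mm)."""
    half = [int(np.ceil(radius_mm / v)) for v in voxel_mm]
    grids = np.indices([2 * n + 1 for n in half])
    dist2 = sum(((g - n) * v) ** 2 for g, n, v in zip(grids, half, voxel_mm))
    return (dist2 <= radius_mm ** 2).astype(float)


def sulpeak_map(sul_volume: np.ndarray, voxel_mm=(2.0, 2.0, 2.0)) -> np.ndarray:
    """Mean SUL within a 1 cm^3 sphere centered at every voxel; a lesion's
    SULpeak is the maximum of this map over the lesion."""
    k = sphere_kernel(6.2, voxel_mm)
    return ndimage.convolve(sul_volume, k / k.sum(), mode="nearest")


def detection_threshold(liver_sul_samples: np.ndarray) -> float:
    """Liver-based measurability threshold (1.5 x liver SULmean + 2 x SD)."""
    return 1.5 * liver_sul_samples.mean() + 2.0 * liver_sul_samples.std()


def candidate_lesions(sul_volume, liver_sul_samples, voxel_mm=(2.0, 2.0, 2.0)):
    """Label connected regions whose SULpeak exceeds the liver-based threshold."""
    peaks = sulpeak_map(sul_volume, voxel_mm)
    labels, n = ndimage.label(peaks > detection_threshold(liver_sul_samples))
    return labels, n
```

The reader interventions described above, splitting or merging uptake volumes and lowering the threshold, correspond to editing the labeled regions and the return value of `detection_threshold`, which is why agreement fell short of 1.00 despite identical computation.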
A previous paper that showed excellent correlation between two different vendor software tools for SULpeak had the tumor sites predefined by the readers to exclude interpretive error [13]. Determining which [18F]FDG uptake site is true tumor remained a challenge even for experienced readers. In the outlier case in Fig. 2, showing an average difference greater than 100%, some readers considered intense [18F]FDG uptake in the colon on the follow-up image to be a new tumor lesion, while the reference read considered it physiologic in nature. Of the 360 non-reference baseline reads (including missing measurements) in this study, only 241 reads (67%) chose the same lesion and went on to make the same measurements as the reference read at both baseline and follow-up. Among the 30 cases, the target lesion (hottest tumor) on the post-therapy scan differed from the target lesion noted on the pre-therapy scan in 11 cases. For example, in one case, the target lesion was in a mediastinal node on the pre-therapy scan, and a lung lesion then became the hottest tumor on the post-therapy scan. In three cases, nodes in different stations were the target lesions at different time points. Among patients with multiple bone or lung metastases, different lesions in the same organ could become the target tumor at different time points. As seen in inter-observer agreement studies of [18F]FDG PET/CT performed in patients with lymphoma after therapy, even experienced readers do not always agree on what is tumor [18F]FDG uptake and what is physiologic [18F]FDG uptake [17, 18]. Rather than relying solely on the reading experience of the local site, discussions, consensus meetings, and better training methods are necessary to implement [18F]FDG PET/CT to its full potential. The availability of more relevant patient history would almost certainly improve the accuracy and consistency of tumor detection.
While PERCIST 1.0 is quantitative, the CMR category depends on the reader’s judgment, and software quantification alone could not determine a response to be CMR. Six cases were considered to have reached CMR by the reference read. The 12 other readers correctly categorized these cases as CMR in 44 of 72 reads (12 readers × 6 cases); PMR was designated in 21 reads, SMD in 5 reads and PMD in 1 read, with 1 missing read. Thus, in addition to the selection of target tumors different from those of the reference read, the reader’s decision between CMR and PMR leaves room for variability in response categorization, even when quantitation produces identical results. A detailed definition of, or consensus on, findings compatible with the CMR category, or the addition of a quantitative threshold to clarify the CMR category, is necessary for use in trials and in the clinical setting. A lesion could be considered present, and thus not CMR, even with very low SULpeak, for example in the lungs, or a lesion could be considered resolved, and thus CMR, even with relatively high SULpeak, for example in the tonsils. The threshold computed from liver measurements (liver SULmean + 2SD) was viewed by the readers as too high a cutoff for CMR in this study, as could be inferred from the readers manually lowering the threshold on the follow-up images.
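For reference, the quantitative core of PERCIST 1.0 categorization is a simple rule on the change in SULpeak (a decrease or increase of at least 30% and at least 0.8 SUL units), while the CMR and new-lesion calls remain reader judgments, which is exactly where the variability above arises. A hedged sketch, with illustrative names:

```python
def percist_category(sulpeak_base: float, sulpeak_follow: float,
                     reader_judges_resolved: bool, new_lesion: bool) -> str:
    """Sketch of PERCIST 1.0 response categories; the CMR and new-lesion
    decisions are reader judgments that quantitation alone cannot make."""
    if new_lesion:
        return "PMD"
    if reader_judges_resolved:
        return "CMR"  # a quantitative CMR threshold is the open question above
    change = sulpeak_follow - sulpeak_base
    pct = 100.0 * change / sulpeak_base
    if pct <= -30.0 and change <= -0.8:
        return "PMR"
    if pct >= 30.0 and change >= 0.8:
        return "PMD"
    return "SMD"
```

Only the `reader_judges_resolved` and `new_lesion` inputs differed among readers in the six CMR cases; the arithmetic itself was identical across all reads.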
Revealing a potential limitation of the software, and of the PERCIST criteria, there was a small tumor with clearly perceptible [18F]FDG uptake on visual inspection that was not detectable by Auto-PERCIST™ because its volume was below the 1 cubic centimeter sphere that PERCIST defines for SULpeak (Fig. 4). A more mundane limitation of applying PERCIST is the need to measure the patient’s height. That many referring physicians and radiologists are not familiar with the SULpeak parameter is another limitation to overcome. When there are multiple lesions showing intense [18F]FDG uptake, the lesion with the worst response may not be the target lesion, and PERCIST needs to specify how to address such poorly behaving lesions when categorizing the overall response.
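To put this detection limit in perspective, a sphere of volume $V = 1\ \text{cm}^3$ has diameter

$$d = \left(\frac{6V}{\pi}\right)^{1/3} \approx 1.24\ \text{cm},$$

so a lesion measuring less than about 1.2 cm across, or with a volume below 1 cm³, cannot fully contain the SULpeak sphere, however conspicuous its uptake appears visually.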
Auto-PERCIST™ can automatically detect potentially new lesions in co-registered studies based on the locations of the classified lesions. Auto-PERCIST™ also computed additional PET parameters representing tumor features, such as metabolic tumor volume, geometric mean, exposure, kurtosis and skewness, which have been reported as prognostic markers and diagnostic tools [19,20,21,22]. Discordance among readers was minimal for the additional PET parameters, and variance arose only when a reader manually changed the tumor boundary. Even with the addition of several PET parameters, the measurement took seconds to, at the longest, a few minutes for cases with many lesions. In addition to reducing variability in measurement, the software radically reduced the measurement time. Auto-PERCIST™ may become adjunct reading software in the way that myocardial perfusion and metabolism studies utilize cardiac image analysis software. Auto-PERCIST™ is available to academic researchers who register their interest with the Johns Hopkins Technology Transfer office.
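A minimal sketch of how such per-lesion parameters can be derived from the voxels inside a lesion boundary, assuming a binary lesion mask; the "exposure" feature is omitted because its definition is not given here, and nothing below reflects Auto-PERCIST™ internals:

```python
import numpy as np
from scipy import stats


def lesion_parameters(sul_volume: np.ndarray, lesion_mask: np.ndarray,
                      voxel_volume_ml: float) -> dict:
    """Per-lesion PET parameters computed from the SUL values inside the mask."""
    vals = sul_volume[lesion_mask]
    return {
        "metabolic_tumor_volume_ml": float(lesion_mask.sum()) * voxel_volume_ml,
        "geometric_mean_sul": float(stats.gmean(vals)),
        "skewness": float(stats.skew(vals)),
        # SciPy reports excess kurtosis (normal distribution -> 0) by default
        "kurtosis": float(stats.kurtosis(vals)),
    }
```

Because every parameter is a function of the voxels selected by `lesion_mask`, editing the tumor boundary changes all of them at once, which is consistent with manual boundary changes being the only source of variance observed for these parameters.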