Repeatability of Two Artificial Intelligence Approaches for Tumor Segmentation in PET

Background: Positron Emission Tomography (PET) is routinely used for cancer staging and treatment follow-up. Metabolic active tumor volume (MATV) as well as total MATV (TMATV, including primary tumor, lymph nodes, and metastases) and/or total lesion glycolysis (TLG) derived from PET images have been identified as prognostic factors and as measures for the evaluation of treatment efficacy in cancer patients. To this end, a segmentation approach with high precision and repeatability is important. However, the implementation of a repeatable and accurate segmentation algorithm remains an ongoing challenge. Methods: In this study, we compare two artificial intelligence (AI) based segmentation methods with conventional segmentation approaches in terms of repeatability. One is a textural feature (TF) based segmentation approach designed for accurate and repeatable segmentation of primary tumors and metastases. In addition, a Convolutional Neural Network (CNN) was trained. The algorithms were trained, validated, and tested using a lung cancer PET dataset. The segmentation accuracy of both approaches was assessed using the Jaccard coefficient (JC). Additionally, the approaches were applied to a fully independent test-retest dataset. The repeatability of the methods was compared with that of two majority vote approaches (MV2, MV3), a 41%SUVMAX threshold, and a SUV>4 segmentation (SUV4). Repeatability was assessed with test-retest coefficients (TRT%) and the intraclass correlation coefficient (ICC). A TRT% of 0 indicates perfect repeatability, and an ICC>0.9 was regarded as representing excellent repeatability.
Results: The accuracy compared with the reference was good (median JC: TF: 0.70, CNN: 0.73). Together with the MV2 approach, both AI based segmentation approaches outperformed the other conventional segmentation methods in terms of test-retest coefficient (mean TRT%: TF: 13.0%, CNN: 13.9%, MV2: 14.1%, MV3: 28.1%, 41%SUVMAX: 28.1%, SUV4: 18.1%) and ICC (TF: 0.98, MV2: 0.97, CNN: 0.99, MV3: 0.73, SUV4: 0.81, and 41%SUVMAX: 0.68). Conclusion: The AI based segmentation approaches used in this study provided better repeatability than conventional segmentation approaches. Moreover, both algorithms lead to accurate segmentations for primary tumors as well as metastases and are therefore good candidates for PET tumor segmentation.


Introduction
Positron Emission Tomography in combination with Computed Tomography (PET/CT) using the tracer fluorodeoxyglucose (FDG) is an important imaging modality for cancer diagnosis, tumor staging, prognosis, and treatment follow-up [1,2]. The volume of the segmented tumor in the PET image, known as the metabolic active tumor volume (MATV), as well as the total MATV (TMATV, including metastases and lymph nodes), is an important metric for the evaluation of therapy response [3].
Observed differences in MATV/TMATV have to be due to biological changes in the tumor tissue and not to segmentation errors. Therefore, a repeatable segmentation is of utmost importance. Here, a repeatable segmentation refers to a segmentation algorithm leading to comparable results when applied to two consecutive PET/CT images of the same patient under the same physiological conditions. The implementation of a repeatable segmentation algorithm is not trivial due to the challenges that come with PET images, among them factors regarding image quality, e.g. the low signal-to-noise ratio, low spatial resolution, and partial volume effects. Especially for smaller lesions, the partial volume effect can reduce the apparent tumor uptake, making the lesion difficult to detect and segment.
Up to now, manual segmentation by an expert or (if available) the consensus segmentation of several experts is considered the gold standard. However, manual segmentations have several drawbacks: they are time-consuming, non-reproducible, and come with a high inter-observer variability [4][5][6]. A recent study also demonstrated that even the consensus of several observers results in a low repeatability [7].
To overcome the limitations of manual segmentations and to increase repeatability, a large number of (semi-)automatic segmentation methods have been proposed. The most basic and most frequently used are simple fixed thresholding algorithms, defining voxels with an intensity value above a certain threshold as part of the tumor [8]. Adaptive and iterative thresholding algorithms are also available, which adapt the threshold according to the actual image characteristics [9]. However, all thresholding approaches depend on the scanner type, reconstruction algorithm, and image noise, and therefore have limitations [10].
Therefore, more robust segmentation algorithms have been developed aiming to improve segmentation accuracy and repeatability. Developed approaches include methods using the statistical properties of the image as well as learning-based methods [11,12]. Nevertheless, most of these approaches have only been tested on limited datasets and are not publicly available. Therefore, the only (semi-)automated segmentation methods used in the clinic are thresholding approaches.
Due to the mentioned limitations of available segmentation algorithms, there is a need for new, more robust segmentation approaches. Artificial intelligence (AI) based segmentations such as Convolutional Neural Networks (CNN) have shown very promising results for various segmentation tasks [13] and hold great promise also for the segmentation of tumors in PET images. However, only a few studies use AI based segmentation approaches for metabolic active tumor segmentation in PET images. Moreover, most studies combine the information of PET and CT images in order to obtain reliable segmentation results [14] or use post-processing to improve CNN segmentations [15]. Classifiers labeling each voxel as tumor or non-tumor using textural features of voxel neighborhoods have been used for the segmentation of e.g. lung carcinoma or head-and-neck cancer [16][17][18]. All of these studies combine the information of PET and CT images. In many cases, the PET/CT is performed with a low-dose CT whose image quality is not optimal for segmentation purposes. Therefore, it is of interest to develop AI based PET segmentation approaches that rely on PET information only. Additionally, in previous papers segmentation approaches were only applied to primary tumors, while for the calculation of TMATV an accurate and repeatable segmentation of metastases and lymph nodes is also important. This task is especially challenging due to the small size of metastases, different tumor-to-background ratios, and the different locations of metastases in the body.
While several studies have already reported on the segmentation accuracy of AI based segmentation algorithms, to the best of our knowledge no study has yet reported on the repeatability of those algorithms. In this study, we investigate the repeatability of two AI approaches specifically built to segment primary tumors and metastases accurately and repeatably. We focus on the segmentation task and do not consider lesion detection. This study includes a textural feature based segmentation approach as well as a 3D CNN. All algorithms are trained, validated, and tested on a dataset of Non-Small-Cell Lung Cancer (NSCLC) patients. As a second step, the algorithms are applied to a fully independent test-retest dataset of ten NSCLC patients scanned on two consecutive days. The repeatability of the AI segmentation approaches is compared with conventional segmentation algorithms used in the clinic.

Datasets
The study was registered at clinicaltrials.gov (NCT02024113), was approved by the Medical Ethics Review Committee of the Amsterdam UMC, and was registered in the Dutch trial register (trialregister.nl, NTR3508). All patients gave informed consent for study participation and the use of their data for (retrospective) scientific research. Two datasets acquired at two institutions were included in this study, both following the recommendations of the EARL accreditation program [19,20]. All images were converted to Standardized Uptake Value (SUV) units before the segmentation process in order to normalize the images for differences in injected tracer dose and patient weight. The focus of this paper is on the segmentation process, not on lesion detection. Therefore, before the start of the segmentation process, a large bounding box was drawn around every lesion, also including a large number of non-tumor voxels, as illustrated in Fig. 1. The bounding box was placed randomly such that the tumor did not always appear in the middle but at different locations in the box. This step was performed in order to avoid the CNN memorizing the location of the object instead of other, more important characteristics.
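The SUV normalization mentioned above can be sketched as follows. This is a minimal illustration using the standard body-weight SUV definition (assuming a tissue density of 1 g/ml); the paper does not spell out its conversion formula, so the function below is an assumption, not the authors' code:

```python
import numpy as np

def to_suv(activity_bq_ml: np.ndarray, injected_dose_bq: float,
           body_weight_kg: float) -> np.ndarray:
    """Convert a PET activity-concentration image (Bq/ml) to SUV.

    Body-weight SUV normalizes for injected dose and patient weight;
    weight in kg is converted to g, assuming 1 g/ml tissue density.
    """
    return activity_bq_ml * (body_weight_kg * 1000.0) / injected_dose_bq

# Example: 370 MBq injected dose, 74 kg patient
img = np.array([5000.0, 10000.0])  # activity concentration in Bq/ml
suv = to_suv(img, injected_dose_bq=370e6, body_weight_kg=74.0)
# suv -> [1.0, 2.0]
```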

Training and testing dataset
For training, validating, and testing the segmentation approaches, 96 images of patients with NSCLC Stage III-IV were included. Patients fasted at least six hours before scan start and were scanned 60 minutes after tracer injection. All images were acquired on a Gemini TF Big Bore (Philips Healthcare, Cleveland, OH, USA). For attenuation correction, a low-dose CT was performed. All images were reconstructed to a voxel size of 4 × 4 × 4 mm using the vendor-provided BLOB-OS-TOF algorithm. More details about the patient cohort can be found in previous studies [21]. The images were split randomly into training, validation, and test sets, where 56 images (286 lesions) were used for training, 14 images (98 lesions) for validation, and 26 images (171 lesions) for independent testing.

Test-Retest dataset
For a fully independent test-retest evaluation, ten PET/CT scans of patients with Stage III and IV NSCLC were analyzed. These ten patients underwent two whole-body PET/CT scans on two consecutive days. Images were acquired on a Gemini TF PET/CT scanner (Philips Healthcare, Cleveland, OH, USA) at a different institution (Amsterdam University Medical Center). Patient fasting time, time between tracer injection and scan start, as well as reconstruction algorithm and voxel size were the same as in the previously described dataset. A total of 28 lesions were included in the analysis.

Reference segmentations
The reference segmentations used for training, validating, and testing the algorithms were obtained by applying an automatic segmentation which identified all voxels with an SUV above 2.5 as tumor (hereafter SUV2.5). The segmentations were manually adjusted by an expert medical physicist (RB) with more than twenty years of experience in PET tumor segmentation. This approach was chosen as it has been demonstrated that the manual adaptation of a (semi-)automatic algorithm is more robust than a purely manual segmentation [22].

Segmentation Algorithms
All segmentation algorithms were implemented in Python 3.6 using the libraries keras and scikit-learn.

Convolutional Neural Network (CNN)
A 3D CNN following the U-Net architecture proposed by Ronneberger et al. [23] was implemented with the keras library. U-Net is one of the best known and most frequently used CNN architectures for biomedical image segmentation, as it was specifically designed for scenarios where only a small number of training examples is available. More details about the architecture and the configuration used can be found in the supplemental material.
In order to increase the amount of training data and to avoid overfitting, data augmentation was performed. This included rotations within −20 to 20 degrees, shifts in width and height direction within 20% of the side length, rescaling of the images within 25%, intensity stretching, as well as adding Gaussian noise to the image.
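The augmentation steps listed above can be sketched with numpy/scipy as follows. This is a minimal illustration, not the actual keras augmentation pipeline; the noise amplitude (5% of the patch's standard deviation) is an assumption made for the example:

```python
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(42)

def augment(volume: np.ndarray) -> np.ndarray:
    """Apply one random augmentation pass to a 3D PET patch."""
    out = volume.copy()
    # rotation within -20..20 degrees in the axial plane
    out = ndimage.rotate(out, rng.uniform(-20, 20), axes=(1, 2),
                         reshape=False, order=1, mode="nearest")
    # shift in width/height direction within 20% of the side length
    max_shift = 0.2 * np.array(out.shape[1:])
    out = ndimage.shift(out, (0, *rng.uniform(-max_shift, max_shift)),
                        order=1, mode="nearest")
    # intensity rescaling within +-25% plus additive Gaussian noise
    out = out * rng.uniform(0.75, 1.25)
    out = out + rng.normal(0.0, 0.05 * out.std(), size=out.shape)
    return out

patch = rng.random((16, 32, 32))   # toy 3D patch
aug = augment(patch)
```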
For training, testing, and applying the CNN, the dataset was divided into smaller (≤ 12.8 ml) and bigger tumors. The threshold was chosen experimentally, as it led to the best performance. For each tumor size, a separate CNN was trained. The split of the dataset by lesion size was performed as this led to more accurate and repeatable segmentations (illustrated in supplemental material Sect. 2.1). For training, the tumor size was determined by calculating the volume of the ground truth mask. For testing and applying the CNN, an initial guess of the tumor size was obtained using the majority vote (MV) segmentation of four established threshold approaches (see supplemental material, Sect. 3). The MV segmentation was chosen for this task as it resulted in the most accurate segmentation in previous work when compared with manual segmentations [7] and is easy to implement.
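The size-based routing described above can be sketched as follows. The 12.8 ml cut-off and the 4 × 4 × 4 mm voxel size (0.064 ml per voxel) are taken from the paper; the function and its labels are otherwise illustrative:

```python
import numpy as np

def pick_cnn(mv_mask: np.ndarray, voxel_ml: float = 0.064) -> str:
    """Route a lesion to the size-specific CNN.

    `mv_mask` is the majority-vote mask used as the initial volume
    guess; with the study's 4 x 4 x 4 mm voxels one voxel is 0.064 ml,
    and 12.8 ml is the empirically chosen split.
    """
    volume_ml = mv_mask.sum() * voxel_ml
    return "small-lesion CNN" if volume_ml <= 12.8 else "large-lesion CNN"

mask = np.zeros((10, 10, 10), dtype=bool)
mask[2:5, 2:5, 2:5] = True            # 27 voxels -> 1.728 ml
```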

Textural feature segmentation (TF)
In this segmentation approach, textural features of voxel neighborhoods were used for the voxel-wise segmentation of the tumor. For every view (axial, sagittal, coronal) a separate segmentation was performed, and the majority vote of the three views was regarded as the final segmentation. The workflow of the TF segmentation for one view is illustrated in Fig. 2. As illustrated, every voxel was regarded as the center of a scanning window. For each scanning window, statistical and textural features were calculated using the open-source software pyradiomics [24]. The feature space was then reduced by selecting the features most important for the segmentation task, which were identified by a random forest.
Next, a random forest classifier was trained to classify each voxel as tumor or non-tumor. The trained random forest was then applied to the testing dataset. The probability images of the three orientations were combined in order to obtain the final classification; a probability image contains, per voxel, how certain the classifier is that it made the right decision. All voxels with a summed probability of more than 1.8 were included in the tumor mask. A more detailed description of the algorithm can be found in the supplemental material and in Pfaehler et al. [25].
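The combination of the three per-view probability images described above reduces to a simple sum-and-threshold step, sketched below with toy probability values (the 1.8 cut-off is from the paper; the arrays are illustrative):

```python
import numpy as np

def combine_views(p_axial: np.ndarray, p_sagittal: np.ndarray,
                  p_coronal: np.ndarray, cutoff: float = 1.8) -> np.ndarray:
    """Combine per-view tumor probabilities into the final mask.

    Each array holds the random forest's per-voxel tumor probability
    for one orientation; a voxel belongs to the tumor if its summed
    probability over the three views exceeds the cut-off.
    """
    return (p_axial + p_sagittal + p_coronal) > cutoff

# Toy per-voxel probabilities for three voxels
p_ax = np.array([0.9, 0.5, 0.1])
p_sag = np.array([0.8, 0.7, 0.2])
p_cor = np.array([0.7, 0.5, 0.1])
mask = combine_views(p_ax, p_sag, p_cor)
# summed probabilities 2.4, 1.7, 0.4 -> [True, False, False]
```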
In order to evaluate how well the AI based segmentations match the reference segmentation used for training, the segmentation results and the reference segmentation were compared in terms of accuracy.

Conventional segmentation algorithms
The repeatability of the AI based segmentations was compared with two established segmentation algorithms:
41%SUVMAX: all voxels with intensity values higher than 41% of the maximum SUV value (SUVMAX) are regarded as tumor.
SUV4: all voxels with an SUV higher than 4 are included in the segmentation.
Moreover, two majority vote (MV) approaches combining four frequently used thresholding approaches were included in the comparison. Both MV approaches have been demonstrated in previous work to be more repeatable than conventional approaches. The underlying segmentation algorithms are explained in supplemental Sect. 3 and are also described in previous work [7]. The two MV segmentation methods are:
MV2: the consensus of at least two of the approaches.
MV3: the consensus of at least three of the approaches.
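The two fixed thresholds and the majority-vote consensus described above can be sketched as follows. The four thresholding methods actually combined in MV2/MV3 are described in the paper's supplement; the third mask in the example is only an illustrative stand-in:

```python
import numpy as np

def seg_41pct_suvmax(suv: np.ndarray) -> np.ndarray:
    """41%SUVMAX: voxels above 41% of the maximum SUV."""
    return suv > 0.41 * suv.max()

def seg_suv4(suv: np.ndarray) -> np.ndarray:
    """SUV4: voxels with an SUV above 4."""
    return suv > 4.0

def majority_vote(masks, min_votes: int) -> np.ndarray:
    """MV2 (min_votes=2) / MV3 (min_votes=3): consensus of the masks."""
    return np.sum(masks, axis=0) >= min_votes

suv = np.array([1.0, 3.0, 5.0, 10.0])
# Illustrative inputs; the actual four MV methods are in the supplement
masks = [seg_41pct_suvmax(suv), seg_suv4(suv), suv > 2.5]
mv2 = majority_vote(masks, min_votes=2)
```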

Evaluation Of Segmentation Algorithms
For the evaluation of the segmentation algorithms, several metrics were combined. The data analysis was performed in Python 3.6.2 using the packages numpy and scipy.

Accordance of AI segmentation and reference segmentation
In order to determine the accordance of the AI and reference segmentations, the Jaccard Coefficient (JC) was calculated. The JC is defined as the ratio between the intersection and the union of two labels, JC = |A ∩ B| / |A ∪ B|, and indicates their overlap: a JC of 1 indicates perfect overlap, while a JC of 0 indicates no overlap at all. Furthermore, as the JC does not contain information about volume differences, the percentage MATV difference between performed and reference segmentation was calculated as the ratio of the segmented to the reference volume. A percentage volume difference above 1 indicates an over- and a percentage volume difference below 1 an under-estimation; a value of 1 represents perfect alignment. Finally, the distance between the centers of mass (barycenter distance) of the segmentations was calculated, where a barycenter distance close to 0 indicates perfect agreement.
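The three accuracy metrics above can be sketched as follows; the 4 mm voxel spacing matches the study's reconstruction, and the toy masks are illustrative:

```python
import numpy as np

def jaccard(a: np.ndarray, b: np.ndarray) -> float:
    """Jaccard coefficient |A intersect B| / |A union B| of two masks."""
    return np.logical_and(a, b).sum() / np.logical_or(a, b).sum()

def volume_ratio(seg: np.ndarray, ref: np.ndarray) -> float:
    """Segmented over reference volume: >1 over-, <1 underestimation."""
    return seg.sum() / ref.sum()

def barycenter_distance(a: np.ndarray, b: np.ndarray,
                        voxel_mm: float = 4.0) -> float:
    """Euclidean distance between the masks' centers of mass, in mm
    (assuming isotropic voxels; 4 mm in this study)."""
    ca = np.argwhere(a).mean(axis=0)
    cb = np.argwhere(b).mean(axis=0)
    return float(np.linalg.norm(ca - cb) * voxel_mm)

# Two toy masks shifted by one voxel along the last axis
a = np.zeros((1, 1, 3), dtype=bool); a[0, 0, :2] = True
b = np.zeros((1, 1, 3), dtype=bool); b[0, 0, 1:] = True
```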

Repeatability evaluation
The repeatability of the segmentation approaches was evaluated by comparing the differences in segmented volume across days. For this purpose, the percentage test-retest difference (TRT%) was calculated as the difference between the two segmented volumes relative to their mean volume. The TRT% gives a measure of the proportional difference in segmented volume between the two consecutive scans. Moreover, the repeatability coefficient (RC), defined as 1.96 × standard deviation(TRT%), was calculated. Additionally, intraclass correlation coefficients (ICC) were calculated using a two-way mixed model with single measures checking for agreement. An ICC between 0.9 and 1 indicates excellent and an ICC between 0.75 and 0.9 good repeatability [26]. If a lesion was completely missed by one segmentation approach, it was discarded from the analysis in order to analyze the same dataset for all segmentation approaches.
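The two repeatability measures above can be sketched as follows. The mean-normalized form of TRT% is the common Bland-Altman convention and is assumed here, as the exact formula did not survive extraction; the volumes are toy values:

```python
import numpy as np

def trt_percent(v_test, v_retest):
    """Percentage test-retest difference per lesion: the volume
    difference relative to the mean of the two measurements
    (0 = perfectly repeatable)."""
    v_test, v_retest = np.asarray(v_test), np.asarray(v_retest)
    return (v_test - v_retest) / ((v_test + v_retest) / 2.0) * 100.0

def repeatability_coefficient(trt):
    """RC = 1.96 x standard deviation of the TRT% values."""
    return 1.96 * np.std(trt, ddof=1)

v_day1 = np.array([10.0, 20.0, 5.0])   # segmented MATV (ml), day 1
v_day2 = np.array([11.0, 18.0, 5.0])   # segmented MATV (ml), day 2
trt = trt_percent(v_day1, v_day2)
rc = repeatability_coefficient(trt)
```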
The accuracy metrics of the AI based segmentations as well as the TRT% values of all approaches were compared using the Friedman test. The Friedman test is a non-parametric test which does not assume a normal distribution of the data or independence of observations. It compares the rank of each data point instead of only comparing mean or median values. This means that if a segmentation algorithm consistently yields more accurate results, it will be ranked higher even if its mean or median is lower. As the Friedman test only indicates whether there was a significant difference in the data, a Nemenyi test was performed in order to assess which methods differed significantly. P-values below 0.01 were considered statistically significant. A Benjamini-Hochberg correction was applied in order to correct for multiple comparisons.
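The omnibus test and the multiple-comparison correction above can be sketched with scipy as follows (the Nemenyi post-hoc step is omitted, as it is not in scipy's stats module). The data here are synthetic, not the study's measurements:

```python
import numpy as np
from scipy import stats

# Synthetic TRT% values per lesion (rows) for three hypothetical
# methods (columns) -- illustration only, not the study's data
rng = np.random.default_rng(0)
data = rng.normal(loc=[13.0, 14.0, 28.0], scale=[5.0, 5.0, 15.0],
                  size=(28, 3))

# Friedman test: rank-based omnibus test over the paired methods
stat, p = stats.friedmanchisquare(data[:, 0], data[:, 1], data[:, 2])

def benjamini_hochberg(pvals, alpha=0.01):
    """Boolean array marking p-values that survive the
    Benjamini-Hochberg false-discovery-rate correction."""
    pv = np.asarray(pvals, dtype=float)
    m = len(pv)
    order = np.argsort(pv)
    passed = pv[order] <= alpha * np.arange(1, m + 1) / m
    k = np.max(np.nonzero(passed)[0]) + 1 if passed.any() else 0
    keep = np.zeros(m, dtype=bool)
    keep[order[:k]] = True
    return keep
```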

Results
Accordance of reference and AI based segmentation
The accuracy metrics of both AI based segmentations are listed in Table 1. The CNN yielded fewer underestimations and more overestimations of tumor volume (higher percentage volume differences, 25th/75th percentile: 0.83/1.34), while the TF approach resulted in more underestimations of tumor volume (25th/75th percentile: 0.59/0.83). The barycentric distances of the TF approach were lower than those of the CNN. The corresponding values for the test-retest dataset can be found in Supplemental Table S1. In general, the accuracy of the segmentations depended on the lesion size, as illustrated in Fig. 4: segmentations of bigger tumors were more accurate than segmentations of smaller lesions. For larger lesions, the CNN resulted in a median JC value of 0.79, while the TF approach yielded a median JC of 0.86. For both approaches, the percentage volume differences were close to 1. Here, the TF approach resulted in lower percentage volume differences than the CNN and therefore more underestimations.
For smaller lesions, the CNN resulted in a median JC value of 0.69, which was higher than the median of the TF approach (0.66). The median percentage volume difference of the CNN was 1.02 (25th/75th percentile: 0.81/1.40), indicating that the CNN resulted more often in overestimations for smaller than for larger lesions, while the TF approach yielded percentage volume differences below 0.7 in the majority of cases and therefore also more underestimations for smaller lesions. Quartile values as well as the corresponding percentage volume differences and barycentric distances for smaller and bigger lesions are listed in Table 2. As displayed in Fig. 4, TF and CNN resulted in three cases in JC values around or below 0.4 for bigger lesions. In these cases, the tumors were located close to the heart, which was incorrectly included in the segmentation; the tumor volume was therefore highly overestimated. A similar effect was observed for smaller lesions: the CNN missed some of the smaller lesions completely, while this was not the case for the TF based approach. All lesions that were completely missed were located close to the kidney, which was wrongly identified as tumor. The TF approach also identified the kidney regions as tumor but additionally detected the lesions. Figure 5 displays the TRT coefficients for all segmentation algorithms. Two lesions were completely missed by the CNN and therefore discarded from the analysis.

Repeatability
CNN based segmentations outperformed the other approaches regarding TRT%, with an absolute mean value of 13.9% and a standard deviation of 16%. TF and MV2 segmentations yielded similar values to the CNN, with absolute mean values of 13.0% and 14.1% and standard deviations of 17% and 21%, respectively. MV3, 41%SUVMAX, and SUV4 segmentations yielded mean values of 28.1%, 28.1%, and 18.1%, and standard deviations of 50%, 51%, and 26%, respectively. The corresponding repeatability coefficients can be found in supplemental Table S2. After applying the Benjamini-Hochberg correction, the differences in TRT% were not statistically significant.
The CNN resulted in a TRT% of more than 10% in 3 out of 28 cases, while the conventional methods resulted in a TRT% higher than 10% in 12 (MV2, SUV4, 41%SUVMAX) or 13 cases (MV3).

Summary of the results
In summary, CNN and TF segmentation resulted in better repeatability when compared with conventional approaches. Furthermore, both approaches resulted in good accuracy when compared with the reference segmentations. The observed differences between the AI based methods were significant neither for accuracy nor for repeatability. Therefore, our results suggest that both methods are equally good candidates for the segmentation of tumors in PET images and are more powerful than conventional approaches in terms of repeatability.

Discussion
In this paper, we evaluated two AI based segmentation approaches in terms of repeatability and analyzed their accordance with the reference segmentation. Both approaches resulted in good accuracy when compared with the reference segmentation. The differences in performance between the two AI approaches were small and statistically non-significant.
The segmentation of smaller lesions remains a challenging task also for these two approaches. One reason might be that with decreasing tumor size, small misclassifications have a higher impact on accuracy metrics, as illustrated in supplemental Table S3. Smaller lesions also come with a lower tumor-to-background ratio and are therefore more difficult to detect, which might be why the CNN missed some smaller lesions completely. Moreover, some of the metastases are located close to other high-uptake regions (such as the kidney), which poses a particular challenge to a segmentation algorithm. Especially for the CNN, the different locations of the metastases, and therefore the differences in background tissue, make for a more challenging learning task than the segmentation of one type of primary tumor.
In terms of accuracy and precision, the CNN trained and tested in this study was comparable with previous CNNs designed for the segmentation of primary tumors in PET images. An important difference between our methods and other published algorithms is that our approaches rely on the PET image information only and can therefore also be used when only a low-dose CT is acquired alongside the PET image [14,16]. Previous studies reported low segmentation performance when using only the PET image for segmentation.
When the tumor was located close to another high-uptake region such as the heart or the kidney, both segmentation approaches also regarded the high-uptake region as tumor. The automatic segmentation methods included in this study are mainly intensity driven and are therefore not capable of distinguishing between two high-uptake regions when they are close to each other. For these cases, it is likely that human interaction will always remain necessary, as mentioned in previous studies [27]. However, in future studies we will investigate whether these segmentation approaches might also be used for lesion detection. A disadvantage of AI based segmentation approaches is the need for reliable training data. The lack of suitable training data is one drawback making the clinical implementation of AI based segmentation algorithms challenging. However, the MV2 approach used in this study was found to result in accurate and robust segmentations in a previous study [7]. Moreover, in our study it also outperformed the conventional segmentation approaches in terms of repeatability without depending on training data. Especially for tasks where segmentation accuracy is important, such as radiotherapy planning, MV2 is a good candidate for clinical use. Yet, regardless of the method used, the final segmentation should always be supervised. In terms of repeatability, especially the CNN segmentation outperformed the MV2 approach and is the method of choice when segmentation repeatability is important, such as for the evaluation of treatment response, i.e. precision may be more important than accuracy for those clinical applications.
One limitation of this study is that the ground truth segmentations were delineated by one, yet experienced, observer, while the consensus of three expert segmentations is considered the gold standard.
To account for this, the segmentation was initiated with a semi-automated delineation method, an approach known to reduce observer variability. Of note, for the test-retest study the same lesions were delineated by five observers in a previous study [7], and it was shown that even the consensus contour of these observers was less repeatable than any of the automated approaches. Finally, our repeatability study included the AI based approaches as well as several conventional methods and showed that our trained AI approaches provided very good results, even if the ground truth segmentations used during training of the AI methods had been suboptimal.
Another limitation is the small dataset used for the repeatability analysis. However, the collection of test-retest scans is limited due to the patient burden of consecutive scans of the same patient. Future studies, especially studies using data from different centers, should confirm our findings.

Conclusion
In this paper, we compared the repeatability of AI based segmentation algorithms with conventional segmentation approaches. Our results illustrate the advantage of AI based segmentation approaches: both resulted in good accuracy when compared with the reference segmentation and in high repeatability. Together with a majority vote approach (combining the results of four conventional segmentation approaches), the proposed segmentation methods were superior to the other segmentation algorithms included in this study in terms of repeatability. This study demonstrates that AI based segmentations not only have the potential to segment lesions accurately but also result in more repeatable segmentations.
Ethics approval and consent to participate: All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki declaration and its later amendments or comparable ethical standards.

Consent for publication:
Not applicable

Availability of data and materials: The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.

Declarations
Competing interests: The authors declare that they have no competing interests.
Financial support