Patients and imaging protocol
The study population consisted of 1334 patients who underwent bone scintigraphy between 2012 and 2021 in four Finnish nuclear medicine units. The data were collected using standard clinical single-photon emission computed tomography (SPECT) scanners: Philips/ADAC Forte (Philips Healthcare, Eindhoven, The Netherlands; N = 100), Philips Brightview (N = 200), Siemens e.cam (Siemens Healthcare, Erlangen, Germany; N = 547), Siemens Symbia (N = 372), GE Infinia Hawkeye (GE Healthcare, Waukesha, Wisconsin, USA; N = 96), and GE Discovery 670 (N = 19). Low-energy high-resolution (LEHR) collimators were used in all scanners.
Of all participants, 1319 were scanned using a whole-body bone scintigraphy protocol, and 15 patients were scanned with thoracic planar scintigraphy as part of the clinical cardiac amyloidosis imaging protocol. Patient selection emphasized the inclusion of patients with positive cardiac uptake, as their prevalence in the overall population is low. All studies were performed using [99mTc]Tc-HMDP, with imaging three hours post-injection. The administered activity was 500–700 MBq. Both the visual and CNN analyses of the bone scintigraphy data were performed for research purposes only. The study was approved by the ethics committee of Helsinki University Hospital and was conducted according to the Declaration of Helsinki.
Visual analysis of cardiac uptake
Three physicians participated in grading the bone scintigraphy images for cardiac uptake. All patients with a positive scan (≥ grade 2) were reviewed by the nuclear medicine physician with the most clinical experience in amyloid imaging. The different grades of cardiac uptake are demonstrated in Fig. 1, which shows both the original whole-body images and the corresponding preprocessed, cropped images used in our further analyses.
Cardiac uptake was graded visually from the images using the standard Perugini grading scale [1, 13]:
- Grade 0: no cardiac uptake
- Grade 1: cardiac uptake less than bone uptake
- Grade 2: cardiac uptake with intensity similar to bone uptake
- Grade 3: cardiac uptake greater than bone uptake
Borderline positive cases (grade 1–2) were graded as positive to optimize the sensitivity of our automated image analysis. We analyzed the reproducibility of the visual grading using 40 anonymized patients. Two nuclear medicine physicians (VU and SM) graded the patients twice. The patients were presented in randomized order, and the physicians were unaware of the prevalence of each Perugini grade in the dataset. Both intra- and interobserver reliability were evaluated using Cohen's kappa coefficient [14].
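As an illustration, this agreement analysis can be computed with Scikit-Learn's cohen_kappa_score; a minimal sketch, assuming the gradings are stored as integer vectors. The grade values below are placeholders, not the actual study data.

```python
# Intra- and interobserver agreement via Cohen's kappa (Scikit-Learn).
from sklearn.metrics import cohen_kappa_score

# Hypothetical Perugini grades (0-3) for the anonymized reproducibility set.
reader1_round1 = [0, 1, 2, 3, 0, 1, 2, 0]  # first grading, reader 1
reader1_round2 = [0, 1, 2, 2, 0, 1, 2, 0]  # repeat grading, reader 1
reader2_round1 = [0, 1, 3, 3, 0, 1, 2, 0]  # first grading, reader 2

intra_kappa = cohen_kappa_score(reader1_round1, reader1_round2)  # intraobserver
inter_kappa = cohen_kappa_score(reader1_round1, reader2_round1)  # interobserver
print(f"Intraobserver kappa: {intra_kappa:.2f}, interobserver kappa: {inter_kappa:.2f}")
```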
Data preprocessing
Only anterior (AP) images were retained for further analyses since the inclusion of posterior (PA) images did not improve classification accuracy in preliminary tests. The preprocessing workflow is illustrated in Fig. 2. Images were cropped into a 128 × 128 matrix centered at the thoracic region using an automated Python workflow. The location of the cropping region in whole-body images was determined as follows: first, we measured a line profile in the y-direction and found all nonzero pixels, corresponding to the position of the patient in the image. Next, the upper edge of the cropped image was positioned at a height corresponding to 0.85 × patient height. The lower edge was set 128 pixels lower than the upper edge. The left and right edges were set 64 pixels to the left and right from the image center, respectively.
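The cropping logic can be sketched as follows, assuming the whole-body image is a 2-D NumPy array with the head at row 0 and the patient height measured upward from the feet; the function name is illustrative.

```python
import numpy as np

def crop_thorax(image: np.ndarray, size: int = 128) -> np.ndarray:
    """Crop a size x size thoracic region from a whole-body image."""
    # Line profile in the y-direction: rows containing any nonzero pixels
    # mark the patient's extent in the image.
    rows = np.nonzero(image.sum(axis=1))[0]
    top, bottom = rows[0], rows[-1]
    height = bottom - top + 1

    # Upper edge at 0.85 x patient height (assumed measured from the feet),
    # lower edge 128 rows below it.
    upper = bottom - int(0.85 * height)
    lower = upper + size

    # Left and right edges 64 pixels from the image center.
    center = image.shape[1] // 2
    return image[upper:lower, center - size // 2:center + size // 2]
```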
For planar images, the image matrix was downsampled to a 256 × 256 resolution if necessary, and a 128 × 128 region was cropped from the center of the matrix.
Finally, all nonzero pixel intensities in both the cropped whole-body and planar images were log-transformed to reduce the intensity of possible hot spots (e.g., the injection site, or uptake in bone due to injury, a small fracture, or a metastatic lesion), which could negatively affect the classification results. The data were anonymized during preprocessing, so that the data used in further analyses included only the cropped image matrix, the patient's age, and the Perugini grade.
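The log transform itself is a one-line operation over the nonzero pixels; a minimal sketch, assuming integer count data so that the logarithm of nonzero counts is non-negative:

```python
import numpy as np

def log_transform(image: np.ndarray) -> np.ndarray:
    """Log-transform nonzero pixels to compress hot spots (e.g., injection site)."""
    out = image.astype(np.float32).copy()
    mask = out > 0
    out[mask] = np.log(out[mask])  # zero pixels are left untouched
    return out
```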
CNN models
We developed two CNN models, referred to as Linear and Residual. The models were implemented using Python 3.8.3 with TensorFlow 2.2.0 [15] and Keras 2.3.0 [16]. The Linear model included one convolutional layer followed by four convolution blocks, each with six consecutive convolutional layers (3 × 3 kernel, stride 1, ReLU activation function) and one average pooling layer (2 × 2 kernel, stride 1). After the convolution blocks, there was a flattening layer, a dropout layer (dropout rate 0.2), and the final fully connected layer (softmax activation function, dimension equal to the number of classes).
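A minimal Keras sketch of the Linear architecture follows; the filter counts per block are not specified in the text and are illustrative assumptions.

```python
from tensorflow.keras import layers, models

def build_linear_model(n_classes: int = 4) -> models.Model:
    inputs = layers.Input(shape=(128, 128, 1))
    # Initial convolutional layer.
    x = layers.Conv2D(16, 3, strides=1, padding="same", activation="relu")(inputs)
    # Four convolution blocks: six 3x3 convolutions plus one average pooling each.
    for filters in (16, 32, 64, 128):  # assumed block widths
        for _ in range(6):
            x = layers.Conv2D(filters, 3, strides=1, padding="same",
                              activation="relu")(x)
        x = layers.AveragePooling2D(pool_size=2, strides=1)(x)  # 2x2 kernel, stride 1
    x = layers.Flatten()(x)
    x = layers.Dropout(0.2)(x)
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    return models.Model(inputs, outputs)
```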
The Residual model was otherwise identical to the Linear model but included skip connections between every other convolutional layer, i.e., the output of layer n−2 was added to the output of layer n, and the activation function was applied to this sum (see details in [17]). The architectures of both models are shown in Fig. 3.
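The skip connection can be sketched as a two-layer residual step, assuming the input already has the target number of channels so that the addition is shape-compatible:

```python
from tensorflow.keras import layers

def residual_pair(x, filters: int):
    """Two convolutions with a skip connection: the output of layer n-2 is added
    to the pre-activation output of layer n, and ReLU is applied to the sum."""
    shortcut = x
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)  # activation deferred
    y = layers.Add()([shortcut, y])                   # skip connection
    return layers.Activation("relu")(y)
```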
For comparison, we classified the data with four state-of-the-art models implemented in the Keras library: VGG16 [18], ResNet50 [17], InceptionV3 [19] and MobileNet [20]. For all these models, we used the pre-trained versions with ImageNet weights as initial weights. The output layer of each state-of-the-art model was omitted and replaced with the same flattening, dropout, and fully connected layers as in the Linear and Residual models.
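The head replacement can be sketched with VGG16 as an example; how the single-channel scintigraphy images were fed to a three-channel ImageNet backbone (e.g., by channel replication) is not detailed in the text and is an assumption here.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

def build_pretrained_model(n_classes: int = 4) -> models.Model:
    # Pre-trained backbone with ImageNet weights, original top removed.
    base = VGG16(weights="imagenet", include_top=False, input_shape=(128, 128, 3))
    x = layers.Flatten()(base.output)
    x = layers.Dropout(0.2)(x)
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    return models.Model(base.input, outputs)
```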
CNN training and validation
We first classified the images using the visually determined Perugini grades (0, 1, 2, and 3) as the ground truth labels (four-class classification). We also studied the accuracy of the CNNs in detecting positive (grade ≥ 2) versus negative (grade < 2) cardiac uptake for ATTR and in differentiating high-grade (grade 3) cardiac uptake from all other patients.
Similar training and validation procedures were used for all CNN models. We quantified classification accuracy with fivefold cross-validation, in which 80% of the data were used for training and 20% for testing the CNNs. We used stratified cross-validation, i.e., the class proportions were the same in each cross-validation fold.
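A sketch of the fold generation with Scikit-Learn's StratifiedKFold; `images` and `grades` are placeholder arrays standing in for the preprocessed data and labels.

```python
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(images, grades):
    x_train, x_test = images[train_idx], images[test_idx]  # 80% / 20% per fold
    y_train, y_test = grades[train_idx], grades[test_idx]  # class proportions preserved
    # ...train and evaluate a CNN on this fold...
```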
As the size of the dataset was limited, we increased the number of training images by data augmentation. In each cross-validation fold, 5330 augmented images were generated by randomly shifting (range ± 10% in both x- and y-directions), rotating (range ± 20°), and scaling (range ± 20%) the original training images using the ImageDataGenerator function implemented in Keras. Finally, both the training and testing images were z-score normalized pixel-wise with respect to the training data, i.e., the average of the training images was subtracted from each image, which was then divided by the standard deviation of the training images.
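The stated augmentation ranges map directly onto ImageDataGenerator arguments, and the normalization step can be written with NumPy; the small epsilon guarding against zero-variance pixels is an added assumption.

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    width_shift_range=0.1,   # +/-10% shift in x
    height_shift_range=0.1,  # +/-10% shift in y
    rotation_range=20,       # +/-20 degrees rotation
    zoom_range=0.2,          # +/-20% scaling
)

# Pixel-wise z-score normalization with respect to the training data.
mean = x_train.mean(axis=0)
std = x_train.std(axis=0) + 1e-8  # epsilon avoids division by zero (assumption)
x_train_norm = (x_train - mean) / std
x_test_norm = (x_test - mean) / std
```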
The model training process was carried out with 50 epochs and a batch size of 128. As the dataset was imbalanced, i.e., the number of patients with grades 0 and 1 was considerably higher than the number of patients with grades 2 and 3, we used class weights in the classification. The penalty for misclassifying the minority classes (grades 2 and 3) was set higher than that for the majority classes (grades 0 and 1) to compensate for the uneven class distribution. For model optimization, we used a sparse categorical cross-entropy loss function and an Adam optimizer with an initial learning rate of 1e−4. Ten percent of the training data were used for validation during the training process. The validation data guided the training, so that the learning rate was reduced whenever the validation loss did not decrease over two consecutive epochs; the minimum learning rate was set to 1e−7. Classification performance was evaluated by receiver operating characteristic (ROC) analysis. For each CNN model, we calculated the area under the curve (AUC) using the roc_auc_score function implemented in Scikit-Learn [21], as well as total accuracy and class-specific precision (the number of true positives divided by the sum of true positives and false positives) and recall (the number of true positives divided by the sum of true positives and false negatives). The above training and testing process took about 1 h on an NVIDIA Quadro P5000 graphics processing unit.
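A sketch of this training configuration in Keras; the specific class-weight values are illustrative, as the text states only that the minority classes were weighted more heavily.

```python
import tensorflow as tf
from sklearn.metrics import roc_auc_score

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Reduce the learning rate when the validation loss stalls for two epochs.
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss",
                                                 patience=2, min_lr=1e-7)

model.fit(x_train_norm, y_train,
          epochs=50, batch_size=128,
          validation_split=0.1,  # 10% of training data used for validation
          class_weight={0: 1.0, 1: 1.0, 2: 5.0, 3: 5.0},  # assumed weights
          callbacks=[reduce_lr])

# One-vs-rest multi-class AUC from the predicted class probabilities.
probs = model.predict(x_test_norm)
auc = roc_auc_score(y_test, probs, multi_class="ovr")
```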
After classifying all patients, we investigated whether bone metastases affected the CNN classification results. For simplicity, we selected only the Residual CNN for this analysis. As we had the final CNN classification for each patient, we divided the patients into those with and without bone metastases, and AUC and total accuracy were then calculated separately for these two groups.
CNN layer visualization
Besides detecting and classifying cardiac uptake on bone scintigraphy, we studied which parts of the image contribute most to the CNN output, i.e., "what the CNN is looking for". We visualized the maximum activation maps of layers 2, 10, 17, and 24 of the Linear model and layers 2, 12, 22, and 32 of the Residual model, corresponding to the first convolutional layer of each convolution block.
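A sketch of extracting such intermediate activations, assuming a Keras functional model; taking the maximum over the feature maps of each layer yields a single 2-D activation map per layer.

```python
import numpy as np
from tensorflow.keras import models

layer_ids = [2, 10, 17, 24]  # first convolutional layer of each block (Linear model)
extractor = models.Model(inputs=model.input,
                         outputs=[model.layers[i].output for i in layer_ids])

# Add batch and channel dimensions to a single 2-D image before prediction.
activations = extractor.predict(image[np.newaxis, ..., np.newaxis])
max_maps = [act.max(axis=-1)[0] for act in activations]  # max over channels
```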