A three-stage, deep learning, ensemble approach for prognosis in patients with Parkinson’s disease

Background Diagnosis of Parkinson’s disease (PD) is informed by the presence of progressive motor and non-motor symptoms and by imaging of the dopamine transporter with [123I]ioflupane (DaTscan). Deep learning and ensemble methods have recently shown promise in medical image analysis. Therefore, this study aimed to develop a three-stage, deep learning, ensemble approach for prognosis in patients with PD.
Methods Retrospective data of 198 patients with PD were retrieved from the Parkinson’s Progression Markers Initiative database and randomly partitioned into training, validation, and test sets with 118, 40, and 40 patients, respectively. The first and second stages of the approach extracted features from DaTscan imaging and clinical measures of motor symptoms, respectively. The third stage trained an ensemble of deep neural networks on different subsets of the extracted features to predict patient outcome 4 years after initial baseline screening. The approach was evaluated by assessing the mean absolute percentage error (MAPE), mean absolute error (MAE), Pearson's correlation coefficient, and bias between the predicted and observed motor outcome scores. The approach was compared to individual networks given different data subsets as inputs.
Results The ensemble approach yielded a MAPE of 18.36%, an MAE of 4.70, and a Pearson's correlation coefficient of 0.84, with no significant bias, indicating accurate outcome prediction. The approach outperformed individual networks that were not given DaTscan imaging or clinical measures of motor symptoms as inputs.
Conclusion The approach showed promise for longitudinal prognostication in PD and demonstrated the synergy of imaging and non-imaging information for the prediction task.
Supplementary Information The online version contains supplementary material available at 10.1186/s13550-021-00795-6.


Data processing
The DaTscan images were preprocessed by selecting a continuous segment of 22 axial image slices of each image volume where the central slice had the highest relative mean uptake intensity. This was done to capture the structure of the striatum and to remove image slices of relatively lower intensity and higher noise. The resulting DaTscan images had a cubic voxel size of 2 mm, were zero-padded to yield an image size of 128 × 128 × 22, and normalized to values from 0 to 1.
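A minimal sketch of this preprocessing step, assuming the volume arrives as a NumPy array of axial slices with at least 22 slices; the function name and the symmetric padding strategy are illustrative, not taken from the paper:

```python
import numpy as np

def preprocess_datscan(volume, n_slices=22):
    """Select a contiguous block of axial slices centred on the slice with
    the highest mean uptake intensity, zero-pad each slice to 128 x 128,
    and scale intensities to [0, 1]. `volume` is (slices, height, width)."""
    # Find the axial slice with the highest relative mean uptake intensity.
    center = int(np.argmax(volume.mean(axis=(1, 2))))
    half = n_slices // 2
    start = min(max(center - half, 0), volume.shape[0] - n_slices)
    segment = volume[start:start + n_slices].astype(np.float32)

    # Zero-pad each slice symmetrically to an in-plane size of 128 x 128.
    pad_h = (128 - segment.shape[1]) // 2
    pad_w = (128 - segment.shape[2]) // 2
    segment = np.pad(segment, ((0, 0),
                               (pad_h, 128 - segment.shape[1] - pad_h),
                               (pad_w, 128 - segment.shape[2] - pad_w)))

    # Min-max normalise intensities to values from 0 to 1.
    segment -= segment.min()
    if segment.max() > 0:
        segment /= segment.max()
    return segment
```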
A time series of measured MDS-UPDRS-III subscores relating to motor signs of Parkinson's disease (PD) was extracted at the time points of screening, baseline, and 3, 6, 9, 12, 42, 48, and 54 months. Those subscores reflected the motor signs of PD, including speech, facial expression, rigidity, finger tapping, hand movements, pronation-supination movements of the hands, toe tapping, leg agility, arising from a chair, gait, freezing of gait, postural stability, posture, body bradykinesia, postural and kinetic tremor of the hands, rest tremor amplitude, and constancy of rest tremor. Information about whether the patient was receiving medication for treating symptoms of PD and the clinical state of patients receiving medication (good or poor response) at each time point was also extracted [1]. MDS-UPDRS-III scores missing at any time point were set equal to the value at the previous time point, following the last-observation-carried-forward (LOCF) imputation procedure. The observed MDS-UPDRS-III subscores, overall MDS-UPDRS-III score, and treatment information at Years 0 to 1 (screening, baseline, and 3, 6, 9, and 12 months) were used as inputs to the approach. This resulted in an input time sequence consisting of six time points (from screening to 12 months) with thirty-six features, referred to as the input MDS-UPDRS-III information.
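The last-observation-carried-forward imputation can be expressed concisely with pandas; the two subscore columns and their values below are invented purely for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical fragment of MDS-UPDRS-III subscores at the six input
# time points, with missing measurements marked as NaN.
timepoints = ["screening", "baseline", "3m", "6m", "9m", "12m"]
scores = pd.DataFrame(
    {"rigidity": [2.0, 2.0, np.nan, 3.0, np.nan, 4.0],
     "gait":     [1.0, np.nan, 1.0, np.nan, 2.0, 2.0]},
    index=timepoints)

# Last observation carried forward: each missing value takes the value
# observed at the previous time point.
imputed = scores.ffill()
```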
The MDS-UPDRS-III subscores at 42, 48, and 54 months were each summed to yield overall MDS-UPDRS-III scores, which were then averaged to give the Year 4 score used as the outcome. The outcome prediction task was formulated as a regression task since the overall MDS-UPDRS-III score at Year 4 is a continuous value.

Image feature extraction with a convolutional LSTM-based network architecture
The DaTscan images at Years 0 and 1 were input as a time sequence into a convolutional LSTM-based network architecture for feature extraction (Fig 1a). The convolutional LSTM is a type of recurrent neural network similar to an LSTM, except that both the input and recurrent transformations are convolutional. Convolutional LSTM-based networks can therefore better capture spatiotemporal correlations when the input data are spatiotemporal sequences [2].
The DaTscan image volumes at Years 0 and 1, each consisting of 22 axial slices containing the complete structure of the striatum, formed a two-time-point input sequence to the convolutional LSTM layer. The output of the convolutional LSTM layer was then passed through a batch normalization layer followed by a three-dimensional (3D) convolutional layer and a 3D global average pooling layer. Batch normalization has been shown to stabilize learning and accelerate training by normalizing each batch of inputs into subsequent layers of the network [3]. The output of the global average pooling layer was an N-dimensional extracted feature vector containing information about the original input DaTscan images from Years 0 and 1. Here, the dimensionality of the extracted feature vector was N = 64.
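The image branch described above might be sketched in Keras as follows. The filter counts and kernel sizes are assumptions, since the text specifies only the layer types, the input geometry, and the 64-dimensional output:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_convlstm_extractor(n_features=64):
    """Sketch of the convolutional LSTM image branch: a sequence of two
    DaTscan volumes (128 x 128 x 22, single channel) is reduced to an
    n_features-dimensional feature vector."""
    # Two time points (Years 0 and 1) of 128 x 128 x 22 volumes.
    inputs = layers.Input(shape=(2, 128, 128, 22, 1))
    # Convolutional LSTM over the two-step spatiotemporal sequence.
    x = layers.ConvLSTM3D(filters=8, kernel_size=3, padding="same")(inputs)
    # Batch normalization, then a 3D convolution and 3D global average
    # pooling, as described in the text.
    x = layers.BatchNormalization()(x)
    x = layers.Conv3D(n_features, kernel_size=3, padding="same")(x)
    x = layers.GlobalAveragePooling3D()(x)
    return tf.keras.Model(inputs, x)
```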

Image feature extraction with pre-trained CNNs
Deep learning methods typically require very large training data sizes, on the order of thousands, to adequately train deep neural networks on various image analysis tasks [3]. Due to our limited dataset consisting of only 198 patients, we extracted features from DaTscan images at Years 0 and 1 with four commonly used CNN architectures that were pre-trained on the ImageNet dataset [4], including VGG16 [5], ResNet50 [6], DenseNet121 [7], and InceptionV3 [8].
The ImageNet dataset consists of millions of natural images across 1,000 different class label categories [4]. We hypothesized that these CNNs that were pre-trained on the natural image classification task with the ImageNet dataset should be able to extract generalized spatial features from DaTscan images.
The maximum intensity projection (MIP) was first performed in the longitudinal direction of the DaTscan image slices (Fig 1); since these pre-trained CNNs take only 2D images as inputs, the MIPs were used in place of the full 3D volumes. The MIPs obtained from the DaTscan images from Years 0 and 1 were input to the pre-trained VGG16, ResNet50, DenseNet121, and InceptionV3 CNN-based architectures, and the resulting features were combined into one feature vector with a dimensionality of N = 5632 (Fig 1), referred to as the "All ImageNet" feature vector. The All ImageNet feature vectors from Years 0 and 1 were treated as a time sequence and used as input to the LSTM-based network (Fig 1).
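One way to sketch the MIP-plus-pretrained-CNN step, using VGG16 as a representative backbone. Here `weights=None` avoids a download in this sketch, whereas the paper used ImageNet weights; concatenating the pooled outputs of all four backbones (512 from VGG16, 2048 from ResNet50, 1024 from DenseNet121, and 2048 from InceptionV3) gives the stated 5632 dimensions:

```python
import numpy as np
import tensorflow as tf

def mip_features(volume, extractor):
    """Maximum-intensity projection along the slice axis, replicated to
    three channels, then passed through a pre-trained CNN feature
    extractor. `volume` is (slices, 128, 128) with values in [0, 1]."""
    mip = volume.max(axis=0)                  # (128, 128) projection
    rgb = np.stack([mip] * 3, axis=-1)[None]  # (1, 128, 128, 3) batch
    return extractor.predict(rgb, verbose=0)[0]

# One of the four backbones; include_top=False with global average
# pooling yields a fixed-length feature vector per image.
backbone = tf.keras.applications.VGG16(
    include_top=False, weights=None, pooling="avg",
    input_shape=(128, 128, 3))
```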
In summary, the relevant spatial features present in the DaTscan images were first extracted using the pre-trained CNN-based architectures. Those spatial features extracted from DaTscan imaging were then used as input to an LSTM network, which extracted the relevant temporal features [2]. This differs from the previous method, in which the relevant spatiotemporal features were extracted directly from the original DaTscan images in one step using a convolutional LSTM-based architecture.

Image feature extraction using semi-quantitative imaging measures
The semi-quantitative imaging measures of the striatal binding ratio of the left caudate, right caudate, left putamen, and right putamen were also used as predictors for the prediction task. The striatal binding ratios, defined as the ratio of specific uptake in the striatum to non-specific uptake in the background, were extracted from the Parkinson's Progression Markers Initiative (PPMI) database.
Semi-quantitative imaging measures were input as a time sequence that consisted of two timepoints at Years 0 and 1 to an LSTM network which extracted N-dimensional feature vectors corresponding to the relevant temporal features for the prediction task (Fig 1). Here, the dimensionality of the extracted feature vector was N=64.
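This branch reduces to an LSTM over a two-step sequence of four striatal binding ratios; a minimal Keras sketch under those assumptions (the variable names are ours):

```python
import tensorflow as tf
from tensorflow.keras import layers

# Two time points (Years 0 and 1), four striatal binding ratios each
# (left caudate, right caudate, left putamen, right putamen).
inputs = layers.Input(shape=(2, 4))
# The LSTM extracts a 64-dimensional temporal feature vector (N = 64).
features = layers.LSTM(64)(inputs)
sbr_branch = tf.keras.Model(inputs, features)
```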

Training and hyperparameter optimization
The approach was trained by optimizing a mean absolute error loss function that quantified the error between the measured and predicted MDS-UPDRS-III scores in Year 4. The network was optimized via a first-order gradient-based optimization algorithm, Adam [9].
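A minimal Keras sketch of this training setup, with a small stand-in regression head in place of the full ensemble member; the layer sizes and learning rate are assumptions not stated in the text:

```python
import numpy as np
import tensorflow as tf

# Stand-in regression head: a 64-dimensional extracted feature vector
# in, the predicted Year-4 MDS-UPDRS-III score out.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(64,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1),
])

# Mean absolute error loss optimized with Adam, as described; the
# learning rate here is illustrative.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss="mean_absolute_error")
```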

Evaluation metrics
The approach was evaluated on the test set of 40 patients. The accuracy of the approach was quantified by evaluating several standard evaluation metrics, including mean absolute percentage error (MAPE), mean absolute error (MAE), mean squared error (MSE), and Pearson's correlation coefficient (r) [10][11][12].
The evaluation metrics of MAPE, MAE, and MSE quantify the error between the predicted and observed MDS-UPDRS-III scores in Year 4 for the regression task and are defined in equations 1, 2, and 3, respectively:

$$\mathrm{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \qquad (1)$$

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right| \qquad (2)$$

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 \qquad (3)$$

The term $\hat{y}_i$ is defined as the predicted MDS-UPDRS-III score, $y_i$ as the observed MDS-UPDRS-III score for the $i$th sample, and $n$ as the sample size. Pearson's correlation coefficient quantifies the strength of the linear relationship between the predicted and observed scores [12]. It is defined in equation 4, where $\mathrm{cov}$ is the covariance and $\sigma$ is the standard deviation [12]:

$$r = \frac{\mathrm{cov}(y, \hat{y})}{\sigma_{y} \, \sigma_{\hat{y}}} \qquad (4)$$
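The four metrics of equations 1 to 4 can be computed directly with NumPy; the helper name below is ours:

```python
import numpy as np

def regression_metrics(y, y_hat):
    """MAPE (%), MAE, MSE, and Pearson's r between observed scores y
    and predicted scores y_hat, following equations 1 to 4."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    mape = 100.0 * np.mean(np.abs(y - y_hat) / y)
    mae = np.mean(np.abs(y - y_hat))
    mse = np.mean((y - y_hat) ** 2)
    # Sample covariance divided by the product of sample standard
    # deviations (equation 4).
    r = np.cov(y, y_hat)[0, 1] / (y.std(ddof=1) * y_hat.std(ddof=1))
    return mape, mae, mse, r
```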
To further evaluate the performance of the proposed approach, an ordinary least squares linear regression [13] was performed between the predicted and observed MDS-UPDRS-III scores in Year 4. The ordinary least squares regression fit a linear model, solving for the intercept ($\beta_1$) and slope ($\beta_2$) in equation 5 that best fit the relationship between the predicted ($\hat{y}$) and observed ($y$) MDS-UPDRS-III scores:

$$y = \beta_1 + \beta_2 \hat{y} \qquad (5)$$
The coefficient of determination, or R² value, which indicates the goodness-of-fit of the regression [14], was reported as an evaluation metric for the approach. The coefficient of determination indicates the proportion of the total variance in the data that is explained by the fitted linear model. Values of R² range from 0 to 1, where higher values indicate a more accurate prediction of the MDS-UPDRS-III score in Year 4; an R² value greater than 0.7 generally indicates a strong relationship between the observed data and the fitted values.

The approach was compared to cases where a single network was given different subsets of the clinical data as inputs. The difference of squared errors given by equation 6 was used to compare the performance of the approach to that of those networks.
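The ordinary least squares fit of equation 5 and the R² computation can be sketched with NumPy; the helper name is ours, and equation 6's difference of squared errors is not reproduced here:

```python
import numpy as np

def ols_fit(y_pred, y_obs):
    """Fit y_obs = b1 + b2 * y_pred by ordinary least squares (equation 5)
    and return the intercept, slope, and coefficient of determination R^2."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_obs = np.asarray(y_obs, dtype=float)
    b2, b1 = np.polyfit(y_pred, y_obs, deg=1)  # slope, intercept
    fitted = b1 + b2 * y_pred
    # R^2: share of total variance explained by the fitted line.
    ss_res = np.sum((y_obs - fitted) ** 2)
    ss_tot = np.sum((y_obs - y_obs.mean()) ** 2)
    return b1, b2, 1.0 - ss_res / ss_tot
```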