Statistical evaluation of test-retest studies in PET brain imaging
© The Author(s). 2018
Received: 6 November 2017
Accepted: 30 January 2018
Published: 12 February 2018
Positron emission tomography (PET) is a molecular imaging technology that enables in vivo quantification of metabolic activity or receptor density, among other applications. Examples of applications of PET imaging in neuroscience include studies of neuroreceptor/neurotransmitter levels in neuropsychiatric diseases (e.g., measuring receptor expression in schizophrenia) and of misfolded protein levels in neurodegenerative diseases (e.g., beta amyloid and tau deposits in Alzheimer’s disease). Assessment of a PET tracer’s test-retest properties is an important component of tracer validation, and it is usually carried out using data from a small number of subjects.
Here, we investigate advantages and limitations of test-retest metrics that are commonly used for PET brain imaging, including percent test-retest difference and intraclass correlation coefficient (ICC). In addition, we show how random effects analysis of variance, which forms the basis for ICC, can be used to derive additional test-retest metrics, which are generally not reported in the PET brain imaging test-retest literature, such as within-subject coefficient of variation and repeatability coefficient. We reevaluate data from five published clinical PET imaging test-retest studies to illustrate the relative merits and utility of the various test-retest metrics. We provide recommendations on evaluation of test-retest in brain PET imaging and show how the random effects ANOVA based metrics can be used to supplement the commonly used metrics such as percent test-retest.
Random effects ANOVA is a useful model for PET brain imaging test-retest studies. The metrics that ensue from this model are recommended to be reported along with the percent test-retest metric as they capture various sources of variability in the PET test-retest experiments in a succinct way.
Positron emission tomography (PET) is a molecular imaging technology used for in vivo measurement of metabolism and neurochemistry, including measurement of cerebral blood flow, glucose metabolism, oxygen utilization, and density of neuroreceptors or other molecular targets [1, 2]. As an integral component of the validation of novel PET tracers, a test-retest experiment is usually first conducted to measure repeatability of the measurements.
The main purpose of a test-retest experiment is to inform about within-subject variability, i.e., how close the measurements are when they are obtained repeatedly on the same subject under identical conditions. These measures of repeatability are then commonly compared, particularly across multiple methods of processing and/or modeling PET data. Often, standardized measures of repeatability are used as general metrics to help judge the general utility of a tracer, although it is not obvious that it is appropriate to compare these measures across tracers or across molecular targets.
The test-retest experiment is most naturally relevant for evaluating a tracer's utility for use in a study involving multiple measurements on the same subject, e.g., an occupancy study or a study measuring the effect of some intervention. As we will summarize here, most of the indices used to summarize the results of test-retest experiments measure quantities that are important for such experiments. Note, however, that these indices by themselves do not provide all the useful information when considering other types of PET studies, e.g., a cross-sectional study of two groups of subjects.
Still, the test-retest repeatability of a tracer is an important criterion to help select a tracer for a particular target among multiple available tracers, although of course several other criteria (e.g., robust radiochemistry, large specific-to-nonspecific signal, and absence of off-target binding) are also important. Going beyond tracer evaluation, test-retest studies also provide useful data for determining the optimal approach among various quantification techniques (e.g., modeling strategies or outcome measures) for a given tracer. Test-retest studies are also useful for understanding the relative variability among multiple regions of interest (ROIs).
In general, test-retest repeatability refers to the variability observed when repeated measurements are acquired on the same experimental unit under identical (or nearly identical) conditions. Various metrics have been proposed in the statistical and PET literature to evaluate test-retest experiments, such as percent test-retest (PTRT), intraclass correlation coefficient (ICC), within-subject coefficient of variation (WSCV), and repeatability coefficient (RC), and we describe these in some detail in the next section. Briefly, these metrics can be classified as either scaled or unscaled indices of agreement. Unscaled indices of agreement summarize the test-retest repeatability based on differences of the original measurements and are therefore expressed in the original unit of measurement; the RC is an example. In contrast, scaled indices of agreement are normalized with respect to some given quantity and are therefore unitless, relative measures. A common example of a scaled measure is PTRT, which is commonly reported in PET studies.
A very recent article by Lodge assesses the repeatability of very common PET-based measurements in oncology applications, focusing on only one tracer (18F-FDG) and one summary measure (standardized uptake value (SUV)). In that paper, Lodge reviews multiple relevant test-retest studies that report results in inconsistent ways, relying on several different repeatability measures, so that synthesis of these studies is quite challenging. This illustrates the need to critically evaluate the various measures that are reported in the PET imaging literature. Our objective here is to provide a comprehensive assessment of test-retest evaluations in PET brain imaging, in particular with respect to the assumptions of the random effects ANOVA model that underlies the ICC statistic. Similar critical reviews of repeatability experiments have recently been conducted for other modalities (e.g., echocardiographic data). To illustrate the utility of the different test-retest metrics, we reevaluated data from five published brain PET test-retest studies in humans. Finally, we provide a discussion of the merits and applicability of the test-retest metrics for future PET brain imaging studies.
Description of the data sets
Summary table of the considered clinical brain PET test-retest data sets
Reference | Tracer | Target
Milak et al., J Nucl Med. 2010; 51(12): 1892–900 | [11C]CUMI-101 | Serotonin 1A receptor
Ogden et al., J Cereb Blood Flow Metab. 2007; 27(1): 205–17 | [11C]DASB | Serotonin transporter
DeLorenzo et al., J Cereb Blood Flow Metab. 2009; 29(7): 1332–45 | [11C]PE2I | Dopamine transporter
Parsey et al., J Cereb Blood Flow Metab. 2000; 20(7): 1111–33 | [11C]WAY-100635 | Serotonin 1A receptor
DeLorenzo et al., J Cereb Blood Flow Metab. 2011; 31(11): 2169–80 | [11C]ABP688 | Glutamate receptor subtype 5
Statistical model for test-retest
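Test-retest data are described by the one-way random effects ANOVA model
\[ y_{ij} = \mu + s_i + e_{ij}, \qquad s_i \sim N(0, \sigma_s^2), \quad e_{ij} \sim N(0, \sigma_e^2), \tag{1} \]
where $y_{ij}$ is the PET outcome measure of the $i$-th subject in the $j$-th scan, $\mu$ is the population mean, $s_i$ is the random subject effect, $e_{ij}$ is the measurement error, and $\sigma_s$ and $\sigma_e$ are the between- and within-subject standard deviations, respectively.
Estimation of the parameters $\mu$, $\sigma_e$, and $\sigma_s$ in model (1) is described in the Appendix for completeness. The computation was implemented using the R package “agRee”. Two scaled indices and one unscaled index of agreement that naturally ensue from model (1) have been proposed for the characterization of a test-retest experiment:
1) the WSCV, defined as
\[ \mathrm{WSCV} = \frac{\sigma_e}{\mu}, \]
2) the ICC, defined as
\[ \mathrm{ICC} = \frac{\sigma_s^2}{\sigma_s^2 + \sigma_e^2}, \]
3) the RC, defined as
\[ \mathrm{RC} = z_{1-\alpha/2}\sqrt{2\sigma_e^2}, \]
where $z_{1-\alpha/2}$ is the $1-\alpha/2$ quantile of the standard normal distribution. The RC can also be interpreted as the smallest detectable difference (SDD) between a test and a retest measurement for a given subject. It corresponds to the $100(1-\alpha/2)\%$ quantile of the distribution of test-retest differences, so that $\pm\mathrm{RC}$ delimits a range containing a large proportion of those differences (e.g., 95% with $\alpha = 0.05$ and $z_{1-\alpha/2} = 1.96$).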
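As an illustration of how the three ANOVA-based indices can be obtained from data, the following is a minimal Python sketch. It assumes the one-way random effects model (a subject effect plus measurement error) with exactly two scans per subject and uses simple moment estimates from the between- and within-subject mean sums of squares. The function name `anova_agreement` and the example values are hypothetical, and this is not the estimation routine of the agRee package.

```python
import math
from statistics import NormalDist


def anova_agreement(pairs, alpha=0.05):
    """Moment estimates of ICC, WSCV and RC from (test, retest) pairs.

    Assumes a one-way random effects model with two scans per subject.
    """
    n = len(pairs)
    k = 2  # scans per subject
    subject_means = [(y1 + y2) / k for y1, y2 in pairs]
    grand_mean = sum(subject_means) / n

    # Between-subject mean sum of squares (n - 1 degrees of freedom)
    bsmss = k * sum((m - grand_mean) ** 2 for m in subject_means) / (n - 1)
    # Within-subject mean sum of squares (n * (k - 1) degrees of freedom)
    wsmss = sum(
        (y - m) ** 2
        for (y1, y2), m in zip(pairs, subject_means)
        for y in (y1, y2)
    ) / (n * (k - 1))

    sigma_e2 = wsmss
    # Truncate at zero: the moment estimate can be negative in small samples
    sigma_s2 = max((bsmss - wsmss) / k, 0.0)

    z = NormalDist().inv_cdf(1 - alpha / 2)
    return {
        "mu": grand_mean,
        "icc": sigma_s2 / (sigma_s2 + sigma_e2),
        "wscv": math.sqrt(sigma_e2) / grand_mean,
        "rc": z * math.sqrt(2 * sigma_e2),
    }


# Hypothetical outcome values for three subjects (test, retest)
example = [(10.0, 12.0), (20.0, 18.0), (30.0, 33.0)]
print(anova_agreement(example))
```

With only a handful of subjects, as is typical for PET test-retest studies, such point estimates carry wide uncertainty, so confidence intervals would normally be reported alongside them.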
As described in the “Introduction” section, percent test-retest (PTRT) is a ubiquitous measure in PET brain imaging, although it is not often used in other related fields. In early PET test-retest papers, signed (or raw) mean normalized test-retest differences were considered [16, 17], but later authors generally used the absolute values of the normalized differences instead. Following this latter definition, PTRT is calculated as follows:
\[ \mathrm{PTRT} = \frac{100\%}{n} \sum_{i=1}^{n} \frac{|y_{i1} - y_{i2}|}{(y_{i1} + y_{i2})/2}, \]
where $n$ is the number of subjects in the test-retest study and $y_{i1}$ and $y_{i2}$ are the estimated PET outcome measures obtained for the $i$-th subject in a given region in the test and in the retest scan, respectively.
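The PTRT computation can be sketched in a few lines of Python. The sketch assumes the common definition in which each subject's test-retest difference is divided by that subject's test-retest mean and the result is expressed in percent. The function name and the `absolute` switch (which toggles between the absolute and the earlier signed variant) are hypothetical choices made for illustration.

```python
def percent_test_retest(pairs, absolute=True):
    """Mean normalized test-retest difference in percent.

    pairs    -- iterable of (test, retest) outcome values, one per subject
    absolute -- use |difference| (later convention) or the signed difference
    """
    ratios = []
    for y1, y2 in pairs:
        diff = y1 - y2
        if absolute:
            diff = abs(diff)
        # Normalize by the subject's own test-retest mean ("local" scaling)
        ratios.append(diff / ((y1 + y2) / 2))
    return 100 * sum(ratios) / len(ratios)


# Hypothetical outcome values for three subjects (test, retest)
example = [(10.0, 12.0), (20.0, 18.0), (30.0, 33.0)]
print(percent_test_retest(example))
print(percent_test_retest(example, absolute=False))
```

The signed variant can be close to zero even when individual differences are large, because positive and negative differences cancel, which is one reason the absolute version is more informative about variability.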
Bland-Altman plots show the mean vs. the difference of the test and retest observations for each subject involved in the study and therefore provide a comprehensive visual assessment of the data.
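The quantities displayed in a Bland-Altman plot can be computed as in the following Python sketch. It assumes the conventional construction, in which the mean difference (bias) and the limits of agreement at bias ± 1.96 standard deviations of the differences are drawn as horizontal lines; the function name is hypothetical, and the plotting itself is left to any graphics library.

```python
from statistics import mean, stdev


def bland_altman(pairs):
    """Return per-subject means and differences, the bias, and the
    95% limits of agreement for (test, retest) pairs."""
    means = [(y1 + y2) / 2 for y1, y2 in pairs]  # x-axis of the plot
    diffs = [y1 - y2 for y1, y2 in pairs]        # y-axis of the plot
    bias = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    limits = (bias - 1.96 * sd, bias + 1.96 * sd)
    return means, diffs, bias, limits


# Hypothetical outcome values for three subjects (test, retest)
means, diffs, bias, limits = bland_altman(
    [(10.0, 12.0), (20.0, 18.0), (30.0, 33.0)]
)
print(bias, limits)
```

A trend of the differences against the means suggests that variability depends on signal strength, and a bias far from zero suggests a systematic shift between test and retest scans.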
PET test-retest data
The total volume of distribution (V_T) was considered as the PET outcome measure and was calculated using three different quantification strategies: one- (1TC) and two-tissue compartment (2TC) models and a graphical approach, the likelihood estimation in graphical analysis (LEGA). It should be noted that the purpose of considering three different quantification approaches here is not to revisit the question of determining the “best” modeling approach for each tracer. This question has been adequately addressed in the original manuscripts for the respective tracers. Rather, multiple quantification approaches provide additional datasets to illustrate how the different test-retest metrics can be applied and what attributes of the data and quantification method can be measured. Ten ROIs were considered in common across all five data sets: anterior cingulate, amygdala, dorsal caudate, dorsolateral prefrontal cortex, gray matter cerebellum, hippocampus, insula, midbrain, parietal lobe, and ventral striatum. In the case of [11C]WAY-100635, an additional ROI, the white matter cerebellum, was considered, but not included in this analysis to maintain the same ROIs across all tracers. The test-retest variability is a result of noise in the ROI and in the arterial input function and is impacted by the size of the ROI. The analysis in this paper does not consider the ROI size as a factor, since the ROI size is the same for different tracers binding to the same target.
A key utility of the test-retest metrics is selecting a tracer among many for a particular target. For example, [11C]WAY-100635 and [11C]CUMI-101 are both tracers for the serotonin 1A receptor. The ICC, PTRT, and WSCV show lower test-retest variability for [11C]CUMI-101 compared to [11C]WAY-100635 (Figs. 1 and 2), indicating that, considering only the test-retest repeatability aspect, [11C]CUMI-101 would be the preferred tracer for the serotonin 1A receptor.
Agreement indices for amygdala in the [11C]CUMI-101 dataset for the three considered quantification approaches
Our main goal was to investigate current approaches to the evaluation of test-retest experiments in PET brain imaging from a statistical point of view and to provide insights and guidance for using indices of agreement in addition to the typically reported PTRT metric. In this evaluation, the random effects ANOVA model underpins the rationale for most metrics, and we found it to be a useful model for brain PET imaging, as it describes and quantifies test-retest PET experiments succinctly while capturing the various random variations present in the data.
The three metrics obtained from the random effects ANOVA model (ICC, RC, and WSCV) each reveal a different aspect of the data. The ICC provides information about the distinguishability of the subjects. As the ICC is the ratio of between-subject variance to total variance, it quantifies the agreement of the test-retest readings (given by the within-subject standard deviation (WSSD)) relative to the spread of the subjects (characterized by the between-subject standard deviation). The higher the between-subject variability, the better the distinguishability. Because the ICC depends on the between-subject variability, it has been pointed out that care must be taken when comparing the ICC across groups for which the between-subject variability may differ. The WSCV provides information about the agreement between test-retest readings relative to the overall signal (estimated as the population mean from the random effects ANOVA model). The RC is an unscaled index of agreement that reflects the agreement between the test-retest readings and is proportional to the WSSD (which is estimated as the square root of the within-subject mean sum of squares, or WSMSS).
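The dependence of the ICC on the between-subject spread can be seen in a small numerical sketch. It uses the definition of the ICC as the ratio of between-subject variance to total variance; the function name and the standard deviation values are hypothetical and chosen only to make the arithmetic transparent.

```python
def icc_from_sds(sigma_s, sigma_e):
    """ICC as between-subject variance over total variance."""
    return sigma_s ** 2 / (sigma_s ** 2 + sigma_e ** 2)


# Same within-subject SD (same measurement precision) in both groups,
# but a different between-subject SD
heterogeneous_group = icc_from_sds(sigma_s=3.0, sigma_e=1.0)
homogeneous_group = icc_from_sds(sigma_s=1.0, sigma_e=1.0)
print(heterogeneous_group, homogeneous_group)
```

In this example the measurement error is identical in both groups, yet the ICC drops from about 0.9 to 0.5 in the more homogeneous group, which is why ICC values from populations with different spreads are not directly comparable.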
In the PET imaging literature, several test-retest outcome metrics are commonly reported, but there has been no general consensus as to which should be used. We found it useful to classify the metrics by the underlying statistical model, i.e., those based on random effects ANOVA vs. other metrics. The most popular metrics based on random effects ANOVA are ICC and WSMSS [8–12, 24, 25]. WSMSS is directly related to the RC, since the square root of WSMSS is an estimate of the WSSD. WSCV, which also ensues from the random effects ANOVA model, is only rarely reported in test-retest studies in PET brain imaging. In PET test-retest studies, ICC is usually calculated assuming a one-way ANOVA (4). However, in some cases, a two-way mixed effects model has also been applied. Since typical test-retest studies consist of two images per subject, we generally recommend calculating ICC according to the one-way model.
The most commonly used test-retest metric in PET imaging is PTRT (reported in virtually all PET imaging studies with a test-retest experiment). PTRT is obtained from mean normalized differences of test-retest samples. With respect to the random effects ANOVA model, PTRT does not estimate any parameter or function of parameters of the model. Using a first-order Taylor expansion (see also the Appendix), it can be shown that taking mean normalized differences is akin to taking a log transform of the data. Therefore, it is expected that PTRT will not be as sensitive to outliers, as these will be scaled “locally” by the corresponding test-retest mean. Also, due to local scaling, the spread of PTRT is small compared to ICC, where the scaling is global. This may significantly underestimate the test-retest repeatability measured with PTRT, as seen in the analysis of the [11C]CUMI-101 dataset (Table 2).
Both PTRT and WSCV provide an intuition for the tracer's limit on detecting differences (e.g., a difference smaller than PTRT and WSCV is unlikely to be detected). The overall rank ordering of regions in terms of test-retest reliability is similar between PTRT and WSCV. Due to the inherently small sample size in PET reliability experiments, confidence intervals for the test-retest metrics will generally be fairly wide. Thus, small differences in these measures may not be meaningful.
As a general recommendation, the random effects ANOVA model is a useful model for PET test-retest studies, and therefore the measures ensuing from it should be reported together with PTRT in the case of two repeated measures (one test and one retest). Although more than two repetitions are not typical in PET imaging, it is worth noting that PTRT does not generalize straightforwardly to more than two test-retest periods, whereas the ANOVA indices can be applied naturally regardless of the number of repeated observations.
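The link between mean-normalized differences and the log transform mentioned in the PTRT discussion can be made explicit with a short expansion. The symbols $m_i$ (test-retest mean) and $d_i$ (test-retest difference) are introduced here only for illustration, and the sketch assumes that both measurements are positive; it is not a reproduction of the Appendix.

```latex
\begin{aligned}
m_i &= \tfrac{1}{2}\,(y_{i1} + y_{i2}), \qquad d_i = y_{i1} - y_{i2}, \\
\log y_{i1} - \log y_{i2}
  &= \log\!\left(1 + \frac{d_i}{2 m_i}\right) - \log\!\left(1 - \frac{d_i}{2 m_i}\right) \\
  &= \frac{d_i}{m_i} + \frac{1}{12}\left(\frac{d_i}{m_i}\right)^{3} + \dots
\end{aligned}
```

To first order, the normalized difference $d_i / m_i$ therefore equals the difference of the log-transformed measurements. The cubic correction is below 1% of the leading term whenever $|d_i / m_i| < 0.34$, i.e., for normalized differences of up to roughly 34%.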
Test-retest metrics that are directly derived from the random effects ANOVA model (WSCV, ICC, and RC) can also be used for sample size calculation when planning a study that involves multiple PET scans per subject. A method for sample size calculation for ICC has been suggested that determines the sample size necessary to achieve a pre-specified precision of the ICC, expressed as the width of the corresponding confidence interval. This approach can also be applied in a straightforward way to the WSCV and RC indices, but not to PTRT. For example, for a pre-post study design, the within-subject standard deviation obtained from a test-retest experiment may be used for sample size calculation given an assumed effect size (mean difference between the pre- and post-intervention periods). We emphasize that while these summaries are quite valuable for planning studies that involve multiple PET scans per subject, they are not directly relevant for planning cross-sectional studies.
Bland-Altman plots represent a mainstay in the graphical display of test-retest data. However, they are rarely used in PET brain imaging. Bland-Altman plots should be used as a first step in the analysis, as they may be helpful in better understanding the dependence of variability on the signal strength as well as potential bias between test and retest measurements.
When characterizing the test-retest properties of a particular tracer, one may aim at an overall measure across several ROIs or at a region-specific measure of reliability in a priori regions with hypothesized or confirmed biological relevance to the population and/or application at hand. In our investigation, we found that some ROIs may exhibit better performance than others, so ROI-wise comparisons are worth considering. In addition, various ROIs may show different uptake characteristics that influence their noise properties (e.g., high-binding vs. low-binding ROIs), and in that case, test-retest properties could be investigated region by region; however, pooling all ROIs into an aggregate test-retest metric may also be carried out if there is an application-specific requirement. ROI size influences the noise in the region, which in turn drives the test-retest repeatability metrics. However, ROI size will not affect the conclusions drawn from test-retest repeatability metrics if the image processing is performed in a uniform fashion across studies, which was the case in the datasets chosen for this paper.
All the scaled metrics are useful for comparing the repeatability of the same ROIs across different tracers as well as of different ROIs for the same tracer. As seen for [11C]CUMI-101 and [11C]WAY-100635, both tracers for the serotonin 1A receptor, these repeatability metrics can, all other things being equal, help choose the tracer for a given target.
Random effects ANOVA is a useful model for PET brain imaging test-retest studies. The metrics that ensue from this model, such as ICC, RC, and WSCV, are recommended to be reported along with the percent test-retest metric, as they capture various sources of variability in the PET test-retest experiments in a succinct way.
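The pre-post sample size calculation mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the procedure of any cited reference: a two-sided test, a normal approximation, and a within-subject difference whose variance is twice the within-subject variance (the subject effect cancels in the difference). The function name, default significance level, and power are hypothetical choices.

```python
import math
from statistics import NormalDist


def prepost_sample_size(sigma_e, delta, alpha=0.05, power=0.80):
    """Approximate number of subjects for a pre-post (paired) design.

    sigma_e -- within-subject SD, e.g. estimated from a test-retest study
    delta   -- assumed mean difference between post and pre scans
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    # Var(post - pre) = 2 * sigma_e**2 under the random effects model
    var_diff = 2 * sigma_e ** 2
    return math.ceil((z_alpha + z_beta) ** 2 * var_diff / delta ** 2)


# Hypothetical planning values in the units of the outcome measure
print(prepost_sample_size(sigma_e=1.0, delta=1.0))
print(prepost_sample_size(sigma_e=0.5, delta=1.0))
```

Because the normal approximation is optimistic for very small studies, a t-distribution-based correction would usually add a few subjects to these numbers.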
This work was not supported by any grants or other funding sources.
RB, AJ, and TO conceived the study. RB, DF, TO, FZ, and AJ developed statistical analysis plan and drafted the manuscript. RB and DF performed the statistical analysis. RB and FZ coordinated the effort. All authors read and approved the final manuscript.
Ethics approval and consent to participate
This article does not contain any studies with human participants or animals performed by any of the authors.
Richard Baumgartner and Dai Feng are employees of Merck and Co., Inc. and own stock of Merck and Co., Inc. Aniket Joshi is an employee of Novartis. Francesca Zanderigo and Todd Ogden declare that they have no conflict of interest.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
1. Dierckx RAJO, de Vries EFJ, van Waarde A, den Boer JA. PET and SPECT in psychiatry. Berlin Heidelberg: Springer-Verlag; 2014.
2. Jones T, Rabiner EA. The development, past achievements, and future directions of brain PET. J Cereb Blood Flow Metab. 2012;32:1426–54.
3. Kuwabara H, Chamroonrat W, Mathews W, Waterhouse R, Brasic JR, Guevara MR, Kumar A, Hamill T, Mozley PD, Wong DF. Evaluation of 11C-ABP688 and 18FFPEB for imaging mGluR5 receptors in the human brain. J Nucl Med. 2011;52:390.
4. Raunig DL, McShane L, Pennello G, et al. Quantitative imaging biomarkers: a review of statistical methods for technical performance assessment. Stat Methods Med Res. 2014;24(1):27–67.
5. Barnhart HX, Haber MJ, Lin LI. An overview on assessing agreement with continuous measurements. J Biopharm Stat. 2007;17:529–69.
6. Lodge MA. Repeatability of SUV in oncologic 18F-FDG PET. J Nucl Med. 2017;58:523–32.
7. Crowley AL, Yow E, Barnhart HX, Daubert MA, Bigelow R, Sullivan DC, Pencina M, Douglas PS. Critical review of current approaches for echocardiographic reproducibility and reliability assessment in clinical research. J Am Soc Echocardiogr. 2016;29:1144–54. e1147.
8. Milak MS, DeLorenzo C, Zanderigo F, Prabhakaran J, Kumar JS, Majo VJ, Mann JJ, Parsey RV. In vivo quantification of human serotonin 1A receptor using 11C-CUMI-101, an agonist PET radiotracer. J Nucl Med. 2010;51:1892–900.
9. Ogden RT, Ojha A, Erlandsson K, Oquendo MA, Mann JJ, Parsey RV. In vivo quantification of serotonin transporters using [(11)C]DASB and positron emission tomography in humans: modeling considerations. J Cereb Blood Flow Metab. 2007;27:205–17.
10. DeLorenzo C, Kumar JS, Zanderigo F, Mann JJ, Parsey RV. Modeling considerations for in vivo quantification of the dopamine transporter using [(11)C]PE2I and positron emission tomography. J Cereb Blood Flow Metab. 2009;29:1332–45.
11. Parsey RV, Slifstein M, Hwang DR, Abi-Dargham A, Simpson N, Mawlawi O, Guo NN, Van Heertum R, Mann JJ, Laruelle M. Validation and reproducibility of measurement of 5-HT1A receptor parameters with [carbonyl-11C]WAY-100635 in humans: comparison of arterial and reference tissue input functions. J Cereb Blood Flow Metab. 2000;20:1111–33.
12. DeLorenzo C, Kumar JS, Mann JJ, Parsey RV. In vivo variation in metabotropic glutamate receptor subtype 5 binding using positron emission tomography and [11C]ABP688. J Cereb Blood Flow Metab. 2011;31:2169–80.
13. Feng D. agRee: various methods for measuring agreement. Available at http://cran.r-project.org/web/packages/agRee.
14. Quan H, Shih WJ. Assessing reproducibility by the within-subject coefficient of variation with random effects models. Biometrics. 1996;52(4):1194–203.
15. Barnhart HX, Barboriak DP. Applications of the repeatability of quantitative imaging biomarkers: a review of statistical analysis of repeat data sets. Transl Oncol. 2009;2:231–5.
16. Holcomb HH, Cascella NG, Medoff DR, Gastineau EA, Loats H, Thaker GK, Conley RR, Dannals RF, Wagner HN Jr, Tamminga CA. PET-FDG test-retest reliability during a visual discrimination task in schizophrenia. J Comput Assist Tomogr. 1993;17:704–9.
17. Seibyl JP, Laruelle M, van Dyck CH, Wallace E, Baldwin RM, Zoghbi S, Zea-Ponce Y, Neumeyer JL, Charney DS, Hoffer PB, Innis RB. Reproducibility of iodine-123-beta-CIT SPECT brain measurement of dopamine transporters. J Nucl Med. 1996;37:222–8.
18. Lopresti BJ, Klunk WE, Mathis CA, Hoge JA, Ziolko SK, Lu X, Meltzer CC, Schimmel K, Tsopelas ND, DeKosky ST, Price JC. Simplified quantification of Pittsburgh compound B amyloid imaging PET studies: a comparative analysis. J Nucl Med. 2005;46:1959–72.
19. Bland JM, Altman DG. Measuring agreement in method comparison studies. Stat Methods Med Res. 1999;8:135–60.
20. Innis RB, Cunningham VJ, Delforge J, Fujita M, Gjedde A, Gunn RN, Holden J, Houle S, Huang SC, Ichise M, Iida H, Ito H, Kimura Y, Koeppe RA, Knudsen GM, Knuuti J, Lammertsma AA, Laruelle M, Logan J, Maguire RP, Mintun MA, Morris ED, Parsey R, Price JC, Slifstein M, Sossi V, Suhara T, Votaw JR, Wong DF, Carson RE. Consensus nomenclature for in vivo imaging of reversibly binding radioligands. J Cereb Blood Flow Metab. 2007;27:1533–9.
21. Gunn RN, Gunn SR, Cunningham VJ. Positron emission tomography compartmental models. J Cereb Blood Flow Metab. 2001;21:635–52.
22. Ogden RT. Estimation of kinetic parameters in graphical analysis of PET imaging data. Stat Med. 2003;22:3557–68.
23. Carrasco JL, Caceres A, Escaramis G, Jover L. Distinguishability and agreement with continuous data. Stat Med. 2014;33:117–28.
24. Kodaka F, Ito H, Kimura Y, Fujie S, Takano H, Fujiwara H, Sasaki T, Nakayama K, Halldin C, Farde L, Suhara T. Test-retest reproducibility of dopamine D2/3 receptor binding in human brain measured by PET with [11C]MNPA and [11C]raclopride. Eur J Nucl Med Mol Imaging. 2013;40:574–9.
25. Collste K, Forsberg A, Varrone A, Amini N, Aeinehband S, Yakushev I, Halldin C, Farde L, Cervenka S. Test-retest reproducibility of [(11)C]PBR28 binding to TSPO in healthy control subjects. Eur J Nucl Med Mol Imaging. 2016;43:173–83.
26. Ettrup A, Svarer C, McMahon B, da Cunha-Bang S, Lehel S, Moller K, Dyssegaard A, Ganz M, Beliveau V, Jorgensen LM, Gillings N, Knudsen GM. Serotonin 2A receptor agonist binding in the human brain with [(11)C]Cimbi-36: test-retest reproducibility and head-to-head comparison with the antagonist [(18)F]altanserin. NeuroImage. 2016;130:167–74.
27. Zou GY. Sample size formulas for estimating intraclass correlation coefficients with precision and assurance. Stat Med. 2012;31:3972–81.
28. Julious S. Tutorial in biostatistics. Sample sizes for clinical trials with normal data. Stat Med. 2004;23:1921–86.
29. Normandin MD, Zheng MQ, Lin KS, Mason NS, Lin SF, Ropchan J, Labaree D, Henry S, Williams WA, Carson RE, Neumeister A, Huang Y. Imaging the cannabinoid CB1 receptor in humans with [11C]OMAR: assessment of kinetic analysis methods, test-retest reproducibility, and gender differences. J Cereb Blood Flow Metab. 2015;35:1313–22.
30. Shoukri MM, Elkum N, Walter SD. Interval estimation and optimal design for the within subject coefficient of variation for continuous and binary variables. BMC Med Res Methodol. 2006;6:24.