For most psychologists, the concept of reliability is the main way they think about measurement error. At least it used to be that way for me, and a recent conference I attended didn’t disabuse me of this impression.
At the conference, many researchers presented work in which they aimed to explain variation in a parameter of a multilevel model, or used such a parameter to predict something. Basically, personality researchers are moving on from individual differences in means. Example parameters include random slopes (e.g., the varying effect of some daily event on daily affect) and residual variability (e.g., varying within-person negative affect variability), but there are many others (e.g., within-subject autocorrelation, “inertia”).
Now, these higher-level parameters have to be estimated of course. Most researchers are aware that these quantities are not error-free.1 Still, in most talks I saw, people either didn’t address this, or they computed a reliability in some roundabout way (e.g., split-half).2
When we recently became interested in the reliability of within-subject variability in a censored model, we did not find an existing approach fit for our needs. Our solution happens to be easy and fully general: it works for random intercepts, slopes, residual variances, etc.
In Bayesian models, we already have an estimate of the error of our multilevel parameters, namely the variance across MCMC draws. Let’s simulate an N=300 within-person diary study. People take part for, on average, 14 days.
library(tidyverse)   # tibble, dplyr, ggplot2, and the pipe
library(tidybayes)   # gather_draws() further below
library(brms)

set.seed(20191005)
N <- 300                               # number of participants
n_days <- 14                           # average number of diary days
n_days_per_person <- rpois(N, n_days)  # actual number of days per person

people <- tibble(
  id = 1:N,
  x_mean = rnorm(N),               # person's mean of x
  y_mean = rnorm(N),               # person's intercept of y
  x_slope = rnorm(N, 0.5, 0.2),    # person-specific effect of x on y
  y_var = rnorm(N, 0.3, 0.2),      # person-specific log residual SD of y
  days = n_days_per_person
)

days <- people %>%
  # one row per person-day
  full_join(tibble(
    id = rep(1:N, times = n_days_per_person)
  ), by = "id", multiple = "all") %>%
  mutate(
    x = x_mean + rnorm(n()),       # x varies within person around x_mean
    y = rnorm(n(),
              mean = y_mean + x_slope * x,
              sd = exp(y_var))     # person-specific residual SD
  )
rmarkdown::paged_table(people)
rmarkdown::paged_table(days)
Let’s not make it too simple. Our outcome \(y\) is a function of \(x\), which has stable between-person variance, but also varies within-subject. But the effect of \(x\) varies between persons. Also, the residual variance of \(y\) varies between persons. Let me specify a brms model to recover our data-generating model.
m1 <- brm(bf(y ~ 1 + x + (1 + x | id),   # varying intercepts and slopes of x
             sigma ~ (1 | id)),          # person-varying residual SD (log scale)
          data = days,
          backend = "cmdstanr",
          cores = 4,
          file = "example_model",
          file_refit = "on_change")
Now, for each person’s random intercept, slope or residual variance, we can look at the distribution of the MCMC draws for their estimate.3
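For instance, here is a quick way to eyeball those distributions. This is a sketch rather than code from the original analysis; it assumes the packages loaded above and brms’s default r_id[id,parameter] naming. We pull the draws of the random slope of x for a few people and histogram them.
# Posterior draws of the random slope of x for the first four people
m1 %>%
  gather_draws(r_id[id, parameter]) %>%
  filter(parameter == "x", id %in% 1:4) %>%
  ggplot(aes(x = .value)) +
  geom_histogram(bins = 40) +
  facet_wrap(~ id)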
We get our estimate of the variance of the true scores by taking the sd hyperparameters for the random intercept, slope, and residual variance. We square the SDs to get the variance and then average across MCMC draws.4
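In code, that step could look something like the following sketch (it assumes brms’s default sd_<group>__<coefficient> hyperparameter naming and produces the true_score_var object used further below):
true_score_var <- m1 %>%
  # grab the sd hyperparameters of the varying effects
  gather_draws(`sd_.+`, regex = TRUE) %>%
  ungroup() %>%
  # drop the "sd_" prefix so names line up with the per-person estimates
  mutate(.variable = stringr::str_sub(.variable, 4)) %>%
  group_by(.variable) %>%
  # square the SDs to get variances, then average across draws
  summarise(latent_var = mean(.value^2))
Next, we also need the error variance of each person’s shrunk estimates, which I call \({SEE}^2\) below; we get it from the variance of each person’s estimates across MCMC draws.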
SEE2 <- m1 %>%
  # grab all estimates of varying/random effects
  gather_draws(`r_.+__.+`[id, parameter], `r_[^_]+_?[^_]+`[id, parameter],
               regex = TRUE) %>%
  ungroup() %>%
  tidyr::unite(.variable, .variable, parameter, sep = "__") %>%
  mutate(.variable = stringr::str_sub(.variable, 3)) %>%
  mutate(.variable = stringr::str_replace(.variable,
                                          "_sigma__", "_sigma_")) %>%
  group_by(.variable, id) %>%
  # compute variance across draws
  summarise(SEE2 = var(.value)) %>%
  left_join(people %>% select(id, days)) %>%
  # give more weight to error var estimated from people
  # who contributed more data
  summarise(
    # weighted SEE^2 reflects the amount of error you get if you take
    # along the uncertainty in your model
    weighted_SEE2 = sum(SEE2 * days) / sum(days),
    # unweighted SEE^2 reflects the amount of error you get if you
    # extract the manifest random intercepts/slopes/variances
    SEE2 = mean(SEE2)
  )
Now, in addition to the hyperparameter sd, the model also estimated a random intercept, slope, and residual variance for each person. The sd hyperparameter affected how much the observed intercepts/slopes/variances were shrunk towards the mean. Some random intercepts are estimated with more, some with less error, depending (in our example) on the number of days the person participated in the diary. The uncertainty of a random intercept is reflected by the variance across MCMC draws, which I call \({SEE}^2\) here. So, we compute the variance across draws for each person and then average across people.5
Finally, we can get an estimate of each parameter’s reliability by dividing the average \({SEE}^2\) by the latent between-person (true score) variance and subtracting this fraction from 1.
true_score_var %>%
  left_join(SEE2, by = ".variable") %>%
  mutate(reliability = 1 - SEE2 / latent_var,
         weighted_reliability = 1 - weighted_SEE2 / latent_var) %>%
  knitr::kable()
| .variable           | latent_var | weighted_SEE2 | SEE2 | reliability | weighted_reliability |
|---------------------|-----------:|--------------:|-----:|------------:|---------------------:|
| id__Intercept       |       1.00 |          0.15 | 0.16 |        0.84 |                 0.85 |
| id__sigma_Intercept |       0.05 |          0.02 | 0.02 |        0.52 |                 0.55 |
| id__x               |       0.03 |          0.02 | 0.02 |        0.19 |                 0.20 |
To get here, it helped us to work our way back from how a confidence interval is calculated under the regression hypothesis in classical test theory using Kelley’s formula (1923, p. 214).6 For me, the CI under the regression hypothesis was my first encounter with shrinkage. In my teaching, the connection to the Bayesian concepts was not made, though apparently Novick (1971) made the connection to Kelley’s formula.7
Namely, the standard error of the shrunk estimate is computed as follows, where \(SEE\) stands for “standard error of the estimate” and \(s_X\) is the observed SD:
\[SEE = s_X \cdot \sqrt{Rel \cdot (1-Rel)}\] We can square this to get the error variance of the shrunk estimates, \({SEE}^2 = se_{shrunk}^2\). Here, \(s_X^2\) is our observed variance.
\[{SEE}^2 = s_X^2 \cdot Rel \cdot (1-Rel)\] Rearranging the classical test theory definition of reliability, we can express the observed variance (\(s_X^2\)) as the latent true score variance (\(V_T\)) divided by the reliability (\(Rel\)).
\[ Rel=\frac{V_T}{s_X^2} \] \[ s_X^2=\frac{V_T}{Rel} \]
Substitute
\[{SEE}^2 = \frac{V_T}{Rel} \cdot Rel \cdot (1-Rel)\] Simplify
\[{SEE}^2 = V_T \cdot (1-Rel)\] Solve for \(Rel\)
\[Rel = 1 - \frac{{SEE}^2}{V_T}\] The standard errors (SEs) of our random slopes are the SEs of the shrunk estimates. As such, they can be analogised (in our vanilla, uninformative-prior model) to the SE computed under the regression hypothesis (\(SEE\)), where we first shrink the observed value towards the population mean and then compute a standard error around that estimate.
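As a quick sanity check, plugging the (rounded) intercept row from the table above into the last formula reproduces the tabled reliability:
# rounded values from the id__Intercept row above
SEE2_intercept <- 0.16   # average error variance of the shrunk intercepts
V_T_intercept  <- 1.00   # estimated latent (true score) variance
1 - SEE2_intercept / V_T_intercept
#> [1] 0.84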
I thought this was neat. I’d appreciate hearing if it is wrong in some way and/or is discussed in more detail elsewhere.8 I am also still looking for a way to compute the right uncertainty around the reliability estimates.
Thanks especially to Stefan Schmukle (who has a related paper draft he really needs to publish), but also Niclas Kuper, Julia Rohrer, and Taym Alsalti for discovering basic mistakes in my thinking and fun exchanges on this topic.
I saw uses of split-half reliability, as well as people getting parameters for multiple items and then computing the reliability of the item indices using Cronbach’s alpha. Methodologists have published various solutions for specific cases, like random slopes, though I haven’t seen them used in the primary literature (e.g., Neubauer, Voelkle, Voss, & Mertens, 2020; Schneider & Junghaenel, 2022). All of these approaches also require estimating additional models; that is, you can’t just use the model that is already being used to study, e.g., a moderator that could explain random slope variation. The latter two approaches also require us to know the residual variance. Most importantly for us, though, they weren’t fully general (e.g., they didn’t work to estimate the reliability of residual variability).
I guess I won’t be needing this formula/snippet very often, as I prefer full luxury Bayes. I usually won’t extract model parameters to use in two-step modelling, but would rather predict the latent variable in the same model. I am also moving away from interpreting metrics that are relativised to the in-sample standard deviation for various reasons. But in a recent preprint by Julia Rohrer et al., we argued that the low reliability of the random slopes we were studying made it fairly pointless to study them further with the amount of data we had. That seems like a suitable use case and will hopefully motivate someone to collect more data. I think the reliability coefficient can also help break down the results of complex models for other researchers and ease the transition as personality psychologists study these complex parameters more.
This blog post is subject to my bug bounty policy, which means that I’d appreciate hearing about errors in this post and will pay you (or a charity) a small bounty.