The reliability of multilevel parameters in Bayesian regressions

Categories: brms, measurement, reliability, modelling
Ruben C. Arslan https://rubenarslan.github.io
2024-02-26

For most psychologists, the concept of reliability is the main way they think about measurement error. At least, it used to be that way for me, and a recent conference I attended didn’t disabuse me of this impression.

At the conference, many researchers presented work in which they aimed to explain variation in a parameter of a multilevel model, or used such a parameter to predict something. Basically, in personality psychology, people are moving on from individual differences in means. Some example parameters include random slopes (e.g., the varying effects of some daily event on daily affect) and residual variability (e.g., varying within-person negative affect variability), but there are many others (e.g., within-subject autocorrelation, “inertia”).

Now, these higher-level parameters have to be estimated, of course. Most researchers are aware that these quantities are not error-free. Still, in most talks I saw, people either didn’t address this or computed a reliability in some roundabout way (e.g., split-half; see the footnote on other approaches below).

When we were recently interested in the reliability of within-subject variability in a censored model, we did not find an existing approach fit for our needs. Our solution happens to be easy and fully general: it works for random intercepts, slopes, residual variances, and so on.

Simulation, of course

In Bayesian models, we already have an estimate of the error of our multilevel parameters, namely the variance across MCMC draws. Let’s simulate an N = 300 within-person diary study in which people take part for, on average, 14 days.

library(tidyverse)

set.seed(20191005)
N <- 300
n_days <- 14
# each person contributes a Poisson-distributed number of days
n_days_per_person <- rpois(N, n_days)
people <- tibble(
  id = 1:N,
  x_mean = rnorm(N),            # stable between-person differences in x
  y_mean = rnorm(N),            # true random intercepts for y
  x_slope = rnorm(N, 0.5, 0.2), # true random slopes of x on y
  y_var = rnorm(N, 0.3, 0.2),   # true person-level log residual SD of y
  days = n_days_per_person
  )

days <- people %>% 
  full_join(tibble(
              id = rep(1:N, times = n_days_per_person)
            ), by = "id", multiple = "all") %>% 
            mutate(
              x = x_mean + rnorm(n()),
              y = rnorm(n(), 
                        mean = y_mean + x_slope*x,
                        sd = exp(y_var))
            )
rmarkdown::paged_table(people)
rmarkdown::paged_table(days)

Let’s not make it too simple. Our outcome y is a function of x, which has stable between-person variance but also varies within subjects. The effect of x on y varies between persons, as does the residual variance of y. Let me specify a brms model to recover our data-generating model.

library(brms)

m1 <- brm(bf(y ~ 1 + x + (1 + x | id),
             # person-varying residual SD, modelled on the log scale
             sigma ~ (1 | id)),
          data = days,
          backend = "cmdstanr",
          cores = 4,
          file = "example_model",
          file_refit = "on_change")

Now, for each person’s random intercept, slope, or residual variance, we can look at the distribution of the MCMC draws for their estimate.

library(tidybayes)

true_score_var <- m1 %>%
    # grab the SD hyperparameters of the varying/random effects
    gather_draws( `sd_.+`, regex = TRUE) %>% 
    # drop the "sd_" prefix
    mutate(.variable = stringr::str_sub(.variable, 4)) %>% 
    # square the SDs to get variances
    mutate(latent_var = .value^2) %>% 
    group_by(.variable) %>%
    # average across MCMC draws
    summarise(latent_var = mean(latent_var))

We get our estimate of the variance of the true scores by taking the sd hyperparameters for the random intercept, slope, and residual variance. We square the SDs to get variances and then average across MCMC draws.
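Since we simulated the data, we can compare these estimates against the truth. As a quick sanity check (added here for illustration, not part of the reliability pipeline): the true-score variances in the simulation were \(1\) for the intercepts and \(0.2^2 = 0.04\) for the slopes and log residual SDs.

# variances of the true person parameters; compare with latent_var above
people %>% summarise(across(c(y_mean, x_slope, y_var), var))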

SEE2 <- m1 %>%
      # grab the per-person varying/random effects
      gather_draws( `r_.+__.+`[id,parameter], `r_[^_]+_?[^_]+`[id,parameter], 
                    regex = TRUE) %>%
      ungroup() %>% 
      tidyr::unite(.variable, .variable, parameter, sep = "__") %>% 
      # drop the "r_" prefix and harmonise the sigma names
      mutate(.variable = stringr::str_sub(.variable, 3)) %>% 
      mutate(.variable = stringr::str_replace(.variable, 
                                              "_sigma__", "_sigma_")) %>% 
      group_by(.variable, id) %>%
      # compute the variance across draws for each person
      summarise(SEE2 = var(.value)) %>%
      left_join(people %>% select(id, days), by = "id") %>% 
      # give more weight to error var estimated from people 
      # who contributed more data
      summarise(
        # weighted SEE^2 reflects the amount of error you get if you take 
        # along the uncertainty in your model
        weighted_SEE2 = sum(SEE2*days)/sum(days),
        # unweighted SEE^2 reflects the amount of error you get if you 
        # extract the manifest random intercepts/slopes/variances
        SEE2 = mean(SEE2)
        )

Now, in addition to the sd hyperparameters, the model also estimated a random intercept, slope, and residual variance for each person. The sd hyperparameter determined how much the observed intercepts/slopes/variances were shrunk towards the mean. Some random intercepts are estimated with more, some with less error, depending (in our example) on the number of days the person participated in the diary. The uncertainty of each random intercept is reflected in the variance across MCMC draws, which I call \({SEE}^2\) here. So, we compute the variance across draws for each person and then average across people.
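We can see the dependence on data quantity directly. Here is a small check I’m adding for illustration (it uses brms’s ranef() and assumes its rows line up with people, which holds for our integer ids): people who contributed more days should have smaller posterior SDs for their random slope.

# posterior SD (Est.Error) of each person's random slope
slope_sd <- ranef(m1)$id[, "Est.Error", "x"]
# more days should mean less error, i.e., a negative correlation
cor(people$days, slope_sd)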

Finally, we can get an estimate of each parameter’s reliability by dividing the average \({SEE}^2\) by the average between-person variance and subtracting this fraction from 1.

true_score_var %>% 
  left_join(SEE2, by = ".variable") %>% 
  mutate(reliability = 1 - SEE2/latent_var,
         weighted_reliability = 1 - weighted_SEE2/latent_var) %>% 
  knitr::kable()
.variable             latent_var   weighted_SEE2   SEE2   reliability   weighted_reliability
--------------------  -----------  --------------  -----  ------------  ---------------------
id__Intercept         1.00         0.15            0.16   0.84          0.85
id__sigma_Intercept   0.05         0.02            0.02   0.52          0.55
id__x                 0.03         0.02            0.02   0.19          0.20
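
Because we simulated the true person parameters, we can also roughly validate these reliabilities. The following check is illustrative (again assuming coef()’s rows line up with people): the squared correlation between the shrunk per-person estimates and the simulated true values should land in the same ballpark as the reliability column.

# shrunk per-person estimates (population-level + varying effects)
est <- coef(m1)$id
cor(est[, "Estimate", "Intercept"], people$y_mean)^2        # cf. 0.84
cor(est[, "Estimate", "sigma_Intercept"], people$y_var)^2   # cf. 0.52
cor(est[, "Estimate", "x"], people$x_slope)^2               # cf. 0.19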

Why does this work?

To get here, it helped us to work our way back from how a confidence interval is calculated under the regression hypothesis in classical test theory, using Kelley’s formula (1923, p. 214). For me, the CI under the regression hypothesis was my first encounter with shrinkage. In my own education, the connection to Bayesian concepts was never made, though apparently Novick (1971) made the connection to Kelley’s formula.

Namely, the standard error of the shrunk estimate is computed as follows, where \(SEE\) stands for “standard error of the estimate”, \(Rel\) is the reliability, and \(s_X\) is the observed SD:

\[SEE = s_X \cdot \sqrt{Rel \cdot (1-Rel)}\]

We can square this to get the error variance of the shrunk estimates, \({SEE}^2\), keeping in mind that \(s_X^2\) is our observed variance.

\[{SEE}^2 = s_X^2 \cdot Rel \cdot (1-Rel)\]

Rearranging the classical test theory definition of reliability, we can express the observed variance (\(s_X^2\)) as the latent true score variance (\(V_T\)) divided by the reliability (\(Rel\)).

\[Rel = \frac{V_T}{s_X^2}\]

\[s_X^2 = \frac{V_T}{Rel}\]

Substitute

\[{SEE}^2 = \frac{V_T}{Rel} \cdot Rel \cdot (1-Rel)\]

Simplify

\[{SEE}^2 = V_T \cdot (1-Rel)\]

Solve for \(Rel\)

\[Rel = 1 - \frac{{SEE}^2}{V_T}\]

The standard errors (SEs) of our random slopes are the SEs of the shrunk estimates. As such, they can be analogised (in our vanilla, uninformative-prior model) to the SE computed under the regression hypothesis (\(SEE\)), where we first shrink the observed value towards the population mean and then compute a standard error around that estimate.
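As a quick plausibility check, we can plug the random-intercept row of the table above (\({SEE}^2 = 0.16\), \(V_T = 1.00\)) into this formula:

\[Rel = 1 - \frac{0.16}{1.00} = 0.84\]

which matches the reliability column.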

I thought this was neat. I’d appreciate hearing whether it is wrong in some way and/or discussed in more depth elsewhere. I am also still looking for a way to compute the right uncertainty around the reliability estimates.

Thanks especially to Stefan Schmukle (who has a related paper draft he really needs to publish), but also to Niclas Kuper, Julia Rohrer, and Taym Alsalti, for discovering basic mistakes in my thinking and for fun exchanges on this topic.

Footnote on other approaches

I saw uses of split-half reliability, as well as people obtaining parameters for multiple items and then computing the reliability of the resulting item indices using Cronbach’s alpha. Methodologists have published various solutions for specific cases, like random slopes, though I haven’t seen them used in the primary literature (e.g., Neubauer, Voelkle, Voss, & Mertens, 2020; Schneider & Junghaenel, 2022). All of these approaches also require estimating additional models; that is, you can’t just use the model that is already being used to study, e.g., the moderator that could explain random slope variation. The latter two approaches also require us to know the residual variance. Most importantly for us, though, they weren’t fully general (e.g., they didn’t work for estimating the reliability of the residual variability).

Boring footnote to ingratiate myself with fellow full luxury Bayesians

I guess I won’t be needing this formula/snippet very often, as I prefer full luxury Bayes. I usually won’t extract model parameters to use in two-step modelling, but would rather predict the latent variable in the same model. I am also moving away from interpreting metrics that are relativised to the in-sample standard deviation for various reasons. But in a recent preprint by Julia Rohrer et al., we argued that the low reliability of the random slopes we were studying made it fairly pointless to study them further with the amount of data we had. That seems like a suitable use case and will hopefully motivate someone to collect more data. I think the reliability coefficient can also help break down the results of complex models for other researchers and ease the transition as personality psychologists study these complex parameters more.

Enough with the footnotes

This blog post is subject to my bug bounty policy, which means that I’d appreciate hearing about errors in this post and will pay you (or a charity) a small bounty.

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/rubenarslan/rubenarslan.github.io, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Arslan (2024, Feb. 26). One lives only to make blunders: The reliability of multilevel parameters in Bayesian regressions. Retrieved from https://rubenarslan.github.io/posts/2024-02-15-the-reliability-of-multilevel-parameters-in-bayesian-regressions/

BibTeX citation

@misc{arslan2024the,
  author = {Arslan, Ruben C.},
  title = {One lives only to make blunders: The reliability of multilevel parameters in Bayesian regressions},
  url = {https://rubenarslan.github.io/posts/2024-02-15-the-reliability-of-multilevel-parameters-in-bayesian-regressions/},
  year = {2024}
}