Multilevel censored location-scale models

Simulating several models and their ability to recover within-subject and between-subject effects on scale

Ruben C. Arslan https://rubenarslan.github.io
2023-03-13

Recently, I got mad at a ceiling.

Show code
library(tidyverse)
library(brms)
library(tidybayes)
options(mc.cores = parallel::detectCores(),
        brms.backend = "cmdstanr",
        brms.file_refit = "on_change")

theme_set(theme_bw())
Show code
library(tweetrmd)
tweet_embed("https://twitter.com/rubenarslan/status/1632170569448685569")

Let me explain. Kahneman & Deaton (2010) reported that happiness plateaued at an income of $75,000 or higher. Their measure of happiness was a binary question that 450,000 people answered in daily, random phone calls to ~1,000 people. Killingsworth 2021 used his experience sampling data from more than 35,000 individuals responding ~50 times to refute this. The Killingsworth study had a better happiness measure, a sliding scale from very unhappy to very happy. He found a loglinear relationship across the entire income range1. He attributed Kahneman & Deaton’s (2010) plateau finding to a ceiling effect in their measure.

Now, in a recent paper, Killingsworth, Kahneman, and Meller (2023) re-analyzed Killingsworth’s (2021) experience sampling data to determine the relationship between income and experienced well-being, happiness. In the reanalysis, KKM wanted to find out why the results differed in an “adversarial collaboration”.

To do so, they aggregated daily happiness in the Killingsworth data to derive the subject’s observed mean. Then they used a form of distributional regression, quantile regression, to predict not only the mean of happiness but also various quantiles by log(income). Specifically, they investigated whether the relationship to the mean plateaued above a yearly household income of $100,000. This was not the case, the relationship between income and mean happiness is loglinear above and below this threshold with near-identical slopes. It’s also a quite small relationship, log(income) explains less than 1% of the variance in happiness.

However, they find that the relationship between log(income) and the 15% quantile does indeed flatten after $100,000. In my view, they inappropriately interpret this as evidence that there is an unhappy group of people for which money cannot alleviate suffering (above 100k).

A better way to put it would have been: at higher log(income) the variability in average happiness is higher. That is, some people have a lot of money and spend it on holidays, household help, and time with their loved ones. Others buy a social media platform and make themselves miserable. But even if this statistical relationship is real2 and reflects a causal relationship3, it does not imply that there is one group of people whose unhappiness cannot be alleviated with money.

Now, I’m not particularly well-acquainted with quantile regression. However, I’ve worked with multilevel location scale models a good deal in the past, a different form of distributional regression. In so doing, I’ve learned not to trust my intuition, or at least to calibrate with fake data simulation. So, I simulated to check a hunch.

Two questionable assumptions in the quantile regression

My conclusion is that KKM make at least two4 assumptions that I think are unlikely to hold and which could bias their estimate:

  1. that the variance of happiness within-subjects is homogeneous
  2. that their aggregated means of happiness are free of sampling error

I was fairly sure that #1 is false, because, you know, I’ve seen happiness data, and because I am a human being who knows other human beings. I did not see their happiness data, mind you, which they only shared in aggregated and rounded form upon publication of KKM 2023, so we cannot check within-subject variance homogeneity.5

I was less sure whether #2 has much impact with 51 days per subject on average.

Simulation results

Show code
knitr::include_graphics("multilevel-censored-location-scale-models_files/figure-html5/BS_0_8_WS_0__means-22-1.png")
Compare the scatter of y around the regression line at the top left and the bottom left to see how sampling error creates a slight bias when estimating the effect on sd(y) on the between-subject level. From the second simulation below.

Figure 1: Compare the scatter of y around the regression line at the top left and the bottom left to see how sampling error creates a slight bias when estimating the effect on sd(y) on the between-subject level. From the second simulation below.

Show code
knitr::include_graphics("multilevel-censored-location-scale-models_files/figure-html5/BS_0_8_WS_0__bias-24-1.png")
A between-subject effect on the log(sd) of .8 reduces to .65 because of sampling variance in aggregated means.

Figure 2: A between-subject effect on the log(sd) of .8 reduces to .65 because of sampling variance in aggregated means.

Show code
knitr::include_graphics("multilevel-censored-location-scale-models_files/figure-html5/BS_0_WS_0_8__bias-1.png")
A within-subject effect on the log(sd) of .8 shows up as a .2 effect on the between-subject level.

Figure 3: A within-subject effect on the log(sd) of .8 shows up as a .2 effect on the between-subject level.

Show code
knitr::include_graphics("multilevel-censored-location-scale-models_files/figure-html5/BS_0_8_WS_0_2_bias-24-1.png")
The true between-subject effect on the log(sd) is underestimated more (~.55 with censoring vs. .65 without censoring vis a vis a true value of .8), even though just 10% of values are right-censored.

Figure 4: The true between-subject effect on the log(sd) is underestimated more (~.55 with censoring vs. .65 without censoring vis a vis a true value of .8), even though just 10% of values are right-censored.

Show code
knitr::include_graphics("multilevel-censored-location-scale-models_files/figure-html5/BS_0_WS_0_8_2_bias-1.png")
The between-subject effect on the log(sd) is overestimated  more (~.3 censored vs. .16 non-censored vis a vis a true value of 0) in the censored within-subject model, even though just 12% of values are censored.

Figure 5: The between-subject effect on the log(sd) is overestimated more (~.3 censored vs. .16 non-censored vis a vis a true value of 0) in the censored within-subject model, even though just 12% of values are censored.

Technical note: There are Mplus (McNeish, 2020) and R-Stan (Martin, 2022) models to estimate two-part mixed effects location-scale models that allow simultaneous estimation of scale effects at both the within- and between-subject level. However, they do not implement censored distributions. My favourite package brms does, but does not natively permit scale effects on the between-subject level to be estimated. However, it allows us to estimate different sd(id) for subgroups. I formed septiles on x and used these as subgroups. Doing so allowed me to get a sort of poor man’s version of a censored two-part mixed effects location-scale model. A better man would have coded it up in raw Stan. As far as I can tell, this model both gets me closer to the between-subject effect than the aggregated model and gets me an unbiased estimate of the within-subject effect.

Show code
knitr::include_graphics("multilevel-censored-location-scale-models_files/figure-html5/BS_0_8_WS_0_2_betweensds-25-1.png")
Regressing x on the estimated sd(y) for the quantiles in my two-part model recovers the true between-subject effect even in the censored model.

Figure 6: Regressing x on the estimated sd(y) for the quantiles in my two-part model recovers the true between-subject effect even in the censored model.

You can find my simulations (well, one round) below.

Lessons learned

Killingsworth 2021 was a wonderful paper. I think this quantile analysis and adversarial collaboration actually muddled the matter. In German, we of course have a word for this: Verschlimmbesserung.6 How come?

I have an idea. Adversarial collaborations don’t just sound like an oxymoron. Especially with large status differences between adversaries, the interesting stuff happens behind the scenes. Without transparency, power, resources, and stubbornness can win over superior arguments.7 I think such disputes should be carried out in the open. And if you want to share data with someone you consider an adversary, may I suggest the following as insurance:

Show code
library(tweetrmd)
tweet_embed("https://twitter.com/rubenarslan/status/1631789472407859200")

Simulations

Location-scale within-subject

Here, mu(y) is a function of x, as is sigma(y) (only at the within-subject level).

Show code
fit_models <- function(b_mean,  b_sd_bs, b_sd_ws, y_ceiling) {
  name = paste("BS", b_sd_bs, "WS", b_sd_ws, if_else(y_ceiling<Inf,as.character(y_ceiling),""))
  rmdpartials::partial("fit_models.Rmd", 
                       name = name,
                       b_mean = b_mean,  
                       b_sd_bs = b_sd_bs, 
                       b_sd_ws = b_sd_ws, 
                       y_ceiling = y_ceiling)
}
Show code
fit_models(
  b_mean = 1,  b_sd_bs = 0, b_sd_ws = 0.8, y_ceiling = Inf)
Show code
N <- 250
n_days = 51
set.seed(20191005)
people <- tibble(
  id = 1:N,
  x = rnorm(N))
n_days_per_person = rpois(N, n_days)
Show code
people <- people %>% 
  mutate(
    mean_log_sd_y = -1 + b_sd_bs * x,
    log_sd_y = 0 + b_sd_ws * x,
    mean_y = rnorm(N, sd = exp(mean_log_sd_y)) + b_mean * x,
    xQ = ntile(x, 6)
    )
days <- people %>% 
  full_join(tibble(
              id = rep(1:N, times = n_days_per_person)
            ), by = "id", multiple = "all") %>% 
            mutate(
              latent_y = rnorm(n(), 
                        mean = mean_y,
                        sd = exp(log_sd_y)),
              y = case_when(
                latent_y >= y_ceiling ~ y_ceiling,
                # latent_y <= -1.5 ~ -1.5,
                TRUE ~ latent_y
              ),
              ycens = case_when(
                latent_y >= y_ceiling ~ "right",
                # latent_y <= -1.5 ~ "left",
                TRUE ~ "none"
              )
            )

Percentage censored: 0.00

Show code
ggplot(days, aes(x, y)) +
  geom_point(alpha = 0.3)
Raw data

Figure 7: Raw data

Show code
sel_ids <- c(43, 36, 8, 40, 88, 29, 11, 49, 84, 41, 98)

ggplot(days, aes(x, y)) +
  geom_smooth(method = 'lm', color = "gray", se = F) + 
  geom_pointrange(stat = "summary", 
                  fun = mean, 
                  fun.min = function(x) { mean(x)-sd(x) },
                  fun.max = function(x) { mean(x)+sd(x) }, data = days %>% filter(id %in% sel_ids) ) +
  geom_point(alpha = 0.3, data = days %>% filter(id %in% sel_ids) )
Selected individuals with their means, standard deviations, and the regression line

Figure 8: Selected individuals with their means, standard deviations, and the regression line

Show code
library(cowplot)
p1 <- days %>% 
  group_by(id) %>% 
  summarise(x = mean(x, na.rm = T),
            mean_y = mean(y, na.rm = T)) %>% 
  ggplot(., aes(x, mean_y)) +
  geom_smooth(method = 'lm', color = "gray", se = F) + 
  geom_point(alpha = 0.4)

p2 <- days %>% 
  group_by(id) %>% 
  summarise(x = mean(x, na.rm = T),
            sd_y = sd(y, na.rm = T)) %>% 
  ggplot(., aes(x, sd_y)) +
  geom_smooth(method = 'lm', color = "gray", se = F) + 
  geom_point(alpha = 0.4) + 
  scale_y_continuous(trans = "log", breaks = c(0.1, 0.25, 0.5, 1, 2, 4))

p3 <- people %>% 
  ggplot(., aes(x, mean_y)) +
  geom_smooth(method = 'lm', color = "gray", se = F) + 
  geom_point(alpha = 0.4) + 
  ylab("latent mean y")
plot_grid(p1,p2,p3,ncol = 2)
The mean and the intraindividual SD as a function of X

Figure 9: The mean and the intraindividual SD as a function of X

Bias?

Show code
m_mixed <- brm(bf(y | cens(ycens) ~ x + (1|id),
             sigma ~ x), data = days, silent = 2, refresh = 0)
m_between <- brm(bf(y | cens(ycens) ~ x,
             sigma ~ x), data = days %>% group_by(x, id) %>% 
               summarise(y = mean(y)) %>% 
               mutate(ycens = case_when(
                y >= y_ceiling ~ "right",
                TRUE ~ "none"
              )), silent = 2, refresh = 0)
m_mixed_2part <- brm(bf(y | cens(ycens) ~ x + (1|gr(id, by = xQ)),
             sigma ~ x), data = days, silent = 2, refresh = 0)
draws <- m_mixed_2part %>% gather_draws(`sd_.*x.*`, regex = T)
betweenhdis <- draws %>% mean_hdci(.width = .95) %>% 
  mutate(xQ = as.numeric(str_match(.variable, "xQ(\\d)")[,2])) %>% 
  left_join(people %>% group_by(xQ) %>% summarise(x = mean(x)))
m_2part_level2 <- brm(log(.value) | se(se, sigma = FALSE) ~ x, data = betweenhdis %>% mutate(se = (log(.value)-log(.lower))/2))
Show code
draws <- bind_rows(
  m_mixed = m_mixed %>% gather_draws(`b_(sigma_)?x`, regex = T),
  m_mixed_2part = m_mixed_2part %>% gather_draws(`b_(sigma_)?x`, regex = T),
  m_2part_between = m_2part_level2 %>% gather_draws(`b_(sigma_)?x`, regex = T) %>% mutate(.variable = "b_sigma_x"),
  m_between = m_between %>% gather_draws(`b_(sigma_)?x`, regex = T),
 .id = "model") %>% 
  mutate(model = fct_inorder(factor(model)))
draws <- draws %>% group_by(model, .variable) %>% 
  mean_hdci(.width = c(.95, .99)) %>% 
  ungroup()

ggplot(draws, aes(y = .variable, x = .value, xmin = .lower, xmax = .upper)) +
  geom_pointinterval(position = position_dodge(width = .4)) +
  ggrepel::geom_text_repel(aes(label = if_else(.width == .95, sprintf("%.2f", .value), NA_character_)), nudge_y = .1) +
  geom_vline(aes(xintercept = true_value), linetype = 'dashed', 
             data = tibble(model = c("m_between", "m_mixed", "m_mixed_2part", "m_between", "m_mixed", "m_mixed_2part", "m_2part_between"), .variable = c("b_x", "b_x", "b_x", "b_sigma_x", "b_sigma_x", "b_sigma_x", "b_sigma_x"), true_value = c(b_mean, b_mean, b_mean, b_sd_bs, b_sd_ws, b_sd_ws, b_sd_bs))) +  scale_color_discrete(breaks = rev(levels(draws$model))) +
  facet_grid(model ~ .variable) +
  theme_bw() +
  theme(legend.position = c(0.99,0.99),
        legend.justification = c(1,1))
Estimated coefficients and the true values (dashed line)

Figure 10: Estimated coefficients and the true values (dashed line)

Show code
ggplot(betweenhdis, aes(x, .value, 
                 ymin = .lower, ymax = .upper)) +
  geom_pointrange() + 
  geom_line() + 
  ylab("sd(id)") +
  scale_y_continuous(trans = "log", breaks = c(0.1, 0.25, 0.5, 1, 2, 4)) +
  theme_bw()
Relationship between x and sd(id) in the two-part model

Figure 11: Relationship between x and sd(id) in the two-part model

Location-scale between-subject

Here, mu(y) is a function of x, as is sigma(y) at the between-subject level.

Show code
fit_models(
  b_mean = 1,  b_sd_bs = 0.8, b_sd_ws = 0, y_ceiling = Inf)
Show code
N <- 250
n_days = 51
set.seed(20191005)
people <- tibble(
  id = 1:N,
  x = rnorm(N))
n_days_per_person = rpois(N, n_days)
Show code
people <- people %>% 
  mutate(
    mean_log_sd_y = -1 + b_sd_bs * x,
    log_sd_y = 0 + b_sd_ws * x,
    mean_y = rnorm(N, sd = exp(mean_log_sd_y)) + b_mean * x,
    xQ = ntile(x, 6)
    )
days <- people %>% 
  full_join(tibble(
              id = rep(1:N, times = n_days_per_person)
            ), by = "id", multiple = "all") %>% 
            mutate(
              latent_y = rnorm(n(), 
                        mean = mean_y,
                        sd = exp(log_sd_y)),
              y = case_when(
                latent_y >= y_ceiling ~ y_ceiling,
                # latent_y <= -1.5 ~ -1.5,
                TRUE ~ latent_y
              ),
              ycens = case_when(
                latent_y >= y_ceiling ~ "right",
                # latent_y <= -1.5 ~ "left",
                TRUE ~ "none"
              )
            )

Percentage censored: 0.00

Show code
ggplot(days, aes(x, y)) +
  geom_point(alpha = 0.3)
Raw data

Figure 12: Raw data

Show code
sel_ids <- c(43, 36, 8, 40, 88, 29, 11, 49, 84, 41, 98)

ggplot(days, aes(x, y)) +
  geom_smooth(method = 'lm', color = "gray", se = F) + 
  geom_pointrange(stat = "summary", 
                  fun = mean, 
                  fun.min = function(x) { mean(x)-sd(x) },
                  fun.max = function(x) { mean(x)+sd(x) }, data = days %>% filter(id %in% sel_ids) ) +
  geom_point(alpha = 0.3, data = days %>% filter(id %in% sel_ids) )
Selected individuals with their means, standard deviations, and the regression line

Figure 13: Selected individuals with their means, standard deviations, and the regression line

Show code
library(cowplot)
p1 <- days %>% 
  group_by(id) %>% 
  summarise(x = mean(x, na.rm = T),
            mean_y = mean(y, na.rm = T)) %>% 
  ggplot(., aes(x, mean_y)) +
  geom_smooth(method = 'lm', color = "gray", se = F) + 
  geom_point(alpha = 0.4)

p2 <- days %>% 
  group_by(id) %>% 
  summarise(x = mean(x, na.rm = T),
            sd_y = sd(y, na.rm = T)) %>% 
  ggplot(., aes(x, sd_y)) +
  geom_smooth(method = 'lm', color = "gray", se = F) + 
  geom_point(alpha = 0.4) + 
  scale_y_continuous(trans = "log", breaks = c(0.1, 0.25, 0.5, 1, 2, 4))

p3 <- people %>% 
  ggplot(., aes(x, mean_y)) +
  geom_smooth(method = 'lm', color = "gray", se = F) + 
  geom_point(alpha = 0.4) + 
  ylab("latent mean y")
plot_grid(p1,p2,p3,ncol = 2)
The mean and the intraindividual SD as a function of X

Figure 14: The mean and the intraindividual SD as a function of X

Bias?

Show code
m_mixed <- brm(bf(y | cens(ycens) ~ x + (1|id),
             sigma ~ x), data = days, silent = 2, refresh = 0)
m_between <- brm(bf(y | cens(ycens) ~ x,
             sigma ~ x), data = days %>% group_by(x, id) %>% 
               summarise(y = mean(y)) %>% 
               mutate(ycens = case_when(
                y >= y_ceiling ~ "right",
                TRUE ~ "none"
              )), silent = 2, refresh = 0)
m_mixed_2part <- brm(bf(y | cens(ycens) ~ x + (1|gr(id, by = xQ)),
             sigma ~ x), data = days, silent = 2, refresh = 0)
draws <- m_mixed_2part %>% gather_draws(`sd_.*x.*`, regex = T)
betweenhdis <- draws %>% mean_hdci(.width = .95) %>% 
  mutate(xQ = as.numeric(str_match(.variable, "xQ(\\d)")[,2])) %>% 
  left_join(people %>% group_by(xQ) %>% summarise(x = mean(x)))
m_2part_level2 <- brm(log(.value) | se(se, sigma = FALSE) ~ x, data = betweenhdis %>% mutate(se = (log(.value)-log(.lower))/2))
Show code
draws <- bind_rows(
  m_mixed = m_mixed %>% gather_draws(`b_(sigma_)?x`, regex = T),
  m_mixed_2part = m_mixed_2part %>% gather_draws(`b_(sigma_)?x`, regex = T),
  m_2part_between = m_2part_level2 %>% gather_draws(`b_(sigma_)?x`, regex = T) %>% mutate(.variable = "b_sigma_x"),
  m_between = m_between %>% gather_draws(`b_(sigma_)?x`, regex = T),
 .id = "model") %>% 
  mutate(model = fct_inorder(factor(model)))
draws <- draws %>% group_by(model, .variable) %>% 
  mean_hdci(.width = c(.95, .99)) %>% 
  ungroup()

ggplot(draws, aes(y = .variable, x = .value, xmin = .lower, xmax = .upper)) +
  geom_pointinterval(position = position_dodge(width = .4)) +
  ggrepel::geom_text_repel(aes(label = if_else(.width == .95, sprintf("%.2f", .value), NA_character_)), nudge_y = .1) +
  geom_vline(aes(xintercept = true_value), linetype = 'dashed', 
             data = tibble(model = c("m_between", "m_mixed", "m_mixed_2part", "m_between", "m_mixed", "m_mixed_2part", "m_2part_between"), .variable = c("b_x", "b_x", "b_x", "b_sigma_x", "b_sigma_x", "b_sigma_x", "b_sigma_x"), true_value = c(b_mean, b_mean, b_mean, b_sd_bs, b_sd_ws, b_sd_ws, b_sd_bs))) +  scale_color_discrete(breaks = rev(levels(draws$model))) +
  facet_grid(model ~ .variable) +
  theme_bw() +
  theme(legend.position = c(0.99,0.99),
        legend.justification = c(1,1))
Estimated coefficients and the true values (dashed line)

Figure 15: Estimated coefficients and the true values (dashed line)

Show code
ggplot(betweenhdis, aes(x, .value, 
                 ymin = .lower, ymax = .upper)) +
  geom_pointrange() + 
  geom_line() + 
  ylab("sd(id)") +
  scale_y_continuous(trans = "log", breaks = c(0.1, 0.25, 0.5, 1, 2, 4)) +
  theme_bw()
Relationship between x and sd(id) in the two-part model

Figure 16: Relationship between x and sd(id) in the two-part model

Show code
# remotes::install_github("stephensrmmartin/LMMELSM")
# library(LMMELSM)
# m_melsm_2part <- lmmelsm(list(observed ~ y,
#                     location ~ x,
#                     scale ~ x,
#                     between ~ x),
#                id, days)
# summary(m_melsm_2part)

Censored location-scale within-subject

Here, mu(y) is a function of x, as is sigma(y) (only at the within-subject level).

Show code
fit_models(b_mean = 1,  b_sd_bs = 0, b_sd_ws = 0.8, y_ceiling = 2)
Show code
N <- 250
n_days = 51
set.seed(20191005)
people <- tibble(
  id = 1:N,
  x = rnorm(N))
n_days_per_person = rpois(N, n_days)
Show code
people <- people %>% 
  mutate(
    mean_log_sd_y = -1 + b_sd_bs * x,
    log_sd_y = 0 + b_sd_ws * x,
    mean_y = rnorm(N, sd = exp(mean_log_sd_y)) + b_mean * x,
    xQ = ntile(x, 6)
    )
days <- people %>% 
  full_join(tibble(
              id = rep(1:N, times = n_days_per_person)
            ), by = "id", multiple = "all") %>% 
            mutate(
              latent_y = rnorm(n(), 
                        mean = mean_y,
                        sd = exp(log_sd_y)),
              y = case_when(
                latent_y >= y_ceiling ~ y_ceiling,
                # latent_y <= -1.5 ~ -1.5,
                TRUE ~ latent_y
              ),
              ycens = case_when(
                latent_y >= y_ceiling ~ "right",
                # latent_y <= -1.5 ~ "left",
                TRUE ~ "none"
              )
            )

Percentage censored: 0.12

Show code
ggplot(days, aes(x, y)) +
  geom_point(alpha = 0.3)
Raw data

Figure 7: Raw data

Show code
sel_ids <- c(43, 36, 8, 40, 88, 29, 11, 49, 84, 41, 98)

ggplot(days, aes(x, y)) +
  geom_smooth(method = 'lm', color = "gray", se = F) + 
  geom_pointrange(stat = "summary", 
                  fun = mean, 
                  fun.min = function(x) { mean(x)-sd(x) },
                  fun.max = function(x) { mean(x)+sd(x) }, data = days %>% filter(id %in% sel_ids) ) +
  geom_point(alpha = 0.3, data = days %>% filter(id %in% sel_ids) )
Selected individuals with their means, standard deviations, and the regression line

Figure 8: Selected individuals with their means, standard deviations, and the regression line

Show code
library(cowplot)
p1 <- days %>% 
  group_by(id) %>% 
  summarise(x = mean(x, na.rm = T),
            mean_y = mean(y, na.rm = T)) %>% 
  ggplot(., aes(x, mean_y)) +
  geom_smooth(method = 'lm', color = "gray", se = F) + 
  geom_point(alpha = 0.4)

p2 <- days %>% 
  group_by(id) %>% 
  summarise(x = mean(x, na.rm = T),
            sd_y = sd(y, na.rm = T)) %>% 
  ggplot(., aes(x, sd_y)) +
  geom_smooth(method = 'lm', color = "gray", se = F) + 
  geom_point(alpha = 0.4) + 
  scale_y_continuous(trans = "log", breaks = c(0.1, 0.25, 0.5, 1, 2, 4))

p3 <- people %>% 
  ggplot(., aes(x, mean_y)) +
  geom_smooth(method = 'lm', color = "gray", se = F) + 
  geom_point(alpha = 0.4) + 
  ylab("latent mean y")
plot_grid(p1,p2,p3,ncol = 2)
The mean and the intraindividual SD as a function of X

Figure 9: The mean and the intraindividual SD as a function of X

Bias?

Show code
m_mixed <- brm(bf(y | cens(ycens) ~ x + (1|id),
             sigma ~ x), data = days, silent = 2, refresh = 0)
m_between <- brm(bf(y | cens(ycens) ~ x,
             sigma ~ x), data = days %>% group_by(x, id) %>% 
               summarise(y = mean(y)) %>% 
               mutate(ycens = case_when(
                y >= y_ceiling ~ "right",
                TRUE ~ "none"
              )), silent = 2, refresh = 0)
m_mixed_2part <- brm(bf(y | cens(ycens) ~ x + (1|gr(id, by = xQ)),
             sigma ~ x), data = days, silent = 2, refresh = 0)
draws <- m_mixed_2part %>% gather_draws(`sd_.*x.*`, regex = T)
betweenhdis <- draws %>% mean_hdci(.width = .95) %>% 
  mutate(xQ = as.numeric(str_match(.variable, "xQ(\\d)")[,2])) %>% 
  left_join(people %>% group_by(xQ) %>% summarise(x = mean(x)))
m_2part_level2 <- brm(log(.value) | se(se, sigma = FALSE) ~ x, data = betweenhdis %>% mutate(se = (log(.value)-log(.lower))/2))
Show code
draws <- bind_rows(
  m_mixed = m_mixed %>% gather_draws(`b_(sigma_)?x`, regex = T),
  m_mixed_2part = m_mixed_2part %>% gather_draws(`b_(sigma_)?x`, regex = T),
  m_2part_between = m_2part_level2 %>% gather_draws(`b_(sigma_)?x`, regex = T) %>% mutate(.variable = "b_sigma_x"),
  m_between = m_between %>% gather_draws(`b_(sigma_)?x`, regex = T),
 .id = "model") %>% 
  mutate(model = fct_inorder(factor(model)))
draws <- draws %>% group_by(model, .variable) %>% 
  mean_hdci(.width = c(.95, .99)) %>% 
  ungroup()

ggplot(draws, aes(y = .variable, x = .value, xmin = .lower, xmax = .upper)) +
  geom_pointinterval(position = position_dodge(width = .4)) +
  ggrepel::geom_text_repel(aes(label = if_else(.width == .95, sprintf("%.2f", .value), NA_character_)), nudge_y = .1) +
  geom_vline(aes(xintercept = true_value), linetype = 'dashed', 
             data = tibble(model = c("m_between", "m_mixed", "m_mixed_2part", "m_between", "m_mixed", "m_mixed_2part", "m_2part_between"), .variable = c("b_x", "b_x", "b_x", "b_sigma_x", "b_sigma_x", "b_sigma_x", "b_sigma_x"), true_value = c(b_mean, b_mean, b_mean, b_sd_bs, b_sd_ws, b_sd_ws, b_sd_bs))) +  scale_color_discrete(breaks = rev(levels(draws$model))) +
  facet_grid(model ~ .variable) +
  theme_bw() +
  theme(legend.position = c(0.99,0.99),
        legend.justification = c(1,1))
Estimated coefficients and the true values (dashed line)

Figure 10: Estimated coefficients and the true values (dashed line)

Show code
ggplot(betweenhdis, aes(x, .value, 
                 ymin = .lower, ymax = .upper)) +
  geom_pointrange() + 
  geom_line() + 
  ylab("sd(id)") +
  scale_y_continuous(trans = "log", breaks = c(0.1, 0.25, 0.5, 1, 2, 4)) +
  theme_bw()
Relationship between x and sd(id) in the two-part model

Figure 11: Relationship between x and sd(id) in the two-part model

Censored location-scale between-subject

Here, mu(y) is a function of x, as is sigma(y) only at the between-subject level.

Show code
fit_models(b_mean = 1,  b_sd_bs = 0.8, b_sd_ws = 0, y_ceiling = 2)
Show code
N <- 250
n_days = 51
set.seed(20191005)
people <- tibble(
  id = 1:N,
  x = rnorm(N))
n_days_per_person = rpois(N, n_days)
Show code
people <- people %>% 
  mutate(
    mean_log_sd_y = -1 + b_sd_bs * x,
    log_sd_y = 0 + b_sd_ws * x,
    mean_y = rnorm(N, sd = exp(mean_log_sd_y)) + b_mean * x,
    xQ = ntile(x, 6)
    )
days <- people %>% 
  full_join(tibble(
              id = rep(1:N, times = n_days_per_person)
            ), by = "id", multiple = "all") %>% 
            mutate(
              latent_y = rnorm(n(), 
                        mean = mean_y,
                        sd = exp(log_sd_y)),
              y = case_when(
                latent_y >= y_ceiling ~ y_ceiling,
                # latent_y <= -1.5 ~ -1.5,
                TRUE ~ latent_y
              ),
              ycens = case_when(
                latent_y >= y_ceiling ~ "right",
                # latent_y <= -1.5 ~ "left",
                TRUE ~ "none"
              )
            )

Percentage censored: 0.10

Show code
ggplot(days, aes(x, y)) +
  geom_point(alpha = 0.3)
Raw data

Figure 12: Raw data

Show code
sel_ids <- c(43, 36, 8, 40, 88, 29, 11, 49, 84, 41, 98)

ggplot(days, aes(x, y)) +
  geom_smooth(method = 'lm', color = "gray", se = F) + 
  geom_pointrange(stat = "summary", 
                  fun = mean, 
                  fun.min = function(x) { mean(x)-sd(x) },
                  fun.max = function(x) { mean(x)+sd(x) }, data = days %>% filter(id %in% sel_ids) ) +
  geom_point(alpha = 0.3, data = days %>% filter(id %in% sel_ids) )
Selected individuals with their means, standard deviations, and the regression line

Figure 13: Selected individuals with their means, standard deviations, and the regression line

Show code
library(cowplot)
p1 <- days %>% 
  group_by(id) %>% 
  summarise(x = mean(x, na.rm = T),
            mean_y = mean(y, na.rm = T)) %>% 
  ggplot(., aes(x, mean_y)) +
  geom_smooth(method = 'lm', color = "gray", se = F) + 
  geom_point(alpha = 0.4)

p2 <- days %>% 
  group_by(id) %>% 
  summarise(x = mean(x, na.rm = T),
            sd_y = sd(y, na.rm = T)) %>% 
  ggplot(., aes(x, sd_y)) +
  geom_smooth(method = 'lm', color = "gray", se = F) + 
  geom_point(alpha = 0.4) + 
  scale_y_continuous(trans = "log", breaks = c(0.1, 0.25, 0.5, 1, 2, 4))

p3 <- people %>% 
  ggplot(., aes(x, mean_y)) +
  geom_smooth(method = 'lm', color = "gray", se = F) + 
  geom_point(alpha = 0.4) + 
  ylab("latent mean y")
plot_grid(p1,p2,p3,ncol = 2)
The mean and the intraindividual SD as a function of X

Figure 14: The mean and the intraindividual SD as a function of X

Bias?

Show code
m_mixed <- brm(bf(y | cens(ycens) ~ x + (1|id),
             sigma ~ x), data = days, silent = 2, refresh = 0)
m_between <- brm(bf(y | cens(ycens) ~ x,
             sigma ~ x), data = days %>% group_by(x, id) %>% 
               summarise(y = mean(y)) %>% 
               mutate(ycens = case_when(
                y >= y_ceiling ~ "right",
                TRUE ~ "none"
              )), silent = 2, refresh = 0)
m_mixed_2part <- brm(bf(y | cens(ycens) ~ x + (1|gr(id, by = xQ)),
             sigma ~ x), data = days, silent = 2, refresh = 0)
draws <- m_mixed_2part %>% gather_draws(`sd_.*x.*`, regex = T)
betweenhdis <- draws %>% mean_hdci(.width = .95) %>% 
  mutate(xQ = as.numeric(str_match(.variable, "xQ(\\d)")[,2])) %>% 
  left_join(people %>% group_by(xQ) %>% summarise(x = mean(x)))
m_2part_level2 <- brm(log(.value) | se(se, sigma = FALSE) ~ x, data = betweenhdis %>% mutate(se = (log(.value)-log(.lower))/2))
Show code
draws <- bind_rows(
  m_mixed = m_mixed %>% gather_draws(`b_(sigma_)?x`, regex = T),
  m_mixed_2part = m_mixed_2part %>% gather_draws(`b_(sigma_)?x`, regex = T),
  m_2part_between = m_2part_level2 %>% gather_draws(`b_(sigma_)?x`, regex = T) %>% mutate(.variable = "b_sigma_x"),
  m_between = m_between %>% gather_draws(`b_(sigma_)?x`, regex = T),
 .id = "model") %>% 
  mutate(model = fct_inorder(factor(model)))
draws <- draws %>% group_by(model, .variable) %>% 
  mean_hdci(.width = c(.95, .99)) %>% 
  ungroup()

ggplot(draws, aes(y = .variable, x = .value, xmin = .lower, xmax = .upper)) +
  geom_pointinterval(position = position_dodge(width = .4)) +
  ggrepel::geom_text_repel(aes(label = if_else(.width == .95, sprintf("%.2f", .value), NA_character_)), nudge_y = .1) +
  geom_vline(aes(xintercept = true_value), linetype = 'dashed', 
             data = tibble(model = c("m_between", "m_mixed", "m_mixed_2part", "m_between", "m_mixed", "m_mixed_2part", "m_2part_between"), .variable = c("b_x", "b_x", "b_x", "b_sigma_x", "b_sigma_x", "b_sigma_x", "b_sigma_x"), true_value = c(b_mean, b_mean, b_mean, b_sd_bs, b_sd_ws, b_sd_ws, b_sd_bs))) +  scale_color_discrete(breaks = rev(levels(draws$model))) +
  facet_grid(model ~ .variable) +
  theme_bw() +
  theme(legend.position = c(0.99,0.99),
        legend.justification = c(1,1))
Estimated coefficients and the true values (dashed line)

Figure 15: Estimated coefficients and the true values (dashed line)

Show code
ggplot(betweenhdis, aes(x, .value, 
                 ymin = .lower, ymax = .upper)) +
  geom_pointrange() + 
  geom_line() + 
  ylab("sd(id)") +
  scale_y_continuous(trans = "log", breaks = c(0.1, 0.25, 0.5, 1, 2, 4)) +
  theme_bw()
Relationship between x and sd(id) in the two-part model

Figure 16: Relationship between x and sd(id) in the two-part model


  1. so, a doubling of income is associated with same happiness increment throughout the income range↩︎

  2. Big if!↩︎

  3. Even bigger if!↩︎

  4. two assumptions that I’ll touch on here. They do not make their assumptions regarding causal inference explicit, but still very much imply causality. But, like, what else is new?↩︎

  5. I requested the raw data on March 4 and received an answer that ping-level data could not be shared owing to promises made to participants on March 15.↩︎

  6. My dictionary says “disimprovement”. Is that actually in use?↩︎

  7. Am I projecting based on my own attempt at adversarial collaboration? Absolutely! Thanks for asking, I’m okay!↩︎

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/rubenarslan/rubenarslan.github.io, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Arslan (2023, March 13). One lives only to make blunders: Multilevel censored location-scale models. Retrieved from https://rubenarslan.github.io/posts/2023-03-05-multilevel-censored-location-scale-models/

BibTeX citation

@misc{arslan2023multilevel,
  author = {Arslan, Ruben C.},
  title = {One lives only to make blunders: Multilevel censored location-scale models},
  url = {https://rubenarslan.github.io/posts/2023-03-05-multilevel-censored-location-scale-models/},
  year = {2023}
}