One lives only to make blunders: Multilevel censored location-scale models

Ruben C. Arslan

Recently, I got mad at a ceiling.

Show code

library(tidyverse)
library(brms)
library(tidybayes)
options(mc.cores = parallel::detectCores(),
        brms.backend = "cmdstanr",
        brms.file_refit = "on_change")

theme_set(theme_bw())

Show code

library(tweetrmd)
tweet_embed("https://twitter.com/rubenarslan/status/1632170569448685569")

My new personality? I'm going to lean into being mad about low ceilings in happiness measures and elsewhere. pic.twitter.com/6Bu19pn2rm
— Ruben C. Arslan (@rubenarslan) March 5, 2023

Let me explain. Kahneman & Deaton (2010) reported that happiness plateaued at an income of $75,000 or higher. Their measure of happiness was a binary question that 450,000 people answered in daily, random phone calls to ~1,000 people. Killingsworth 2021 used his experience sampling data from more than 35,000 individuals responding ~50 times to refute this. The Killingsworth study had a better happiness measure, a sliding scale from very unhappy to very happy. He found a loglinear relationship across the entire income range¹. He attributed Kahneman & Deaton’s (2010) plateau finding to a ceiling effect in their measure.

Now, in a recent paper, Killingsworth, Kahneman, and Meller (2023) re-analyzed Killingsworth’s (2021) experience sampling data to determine the relationship between income and experienced well-being, happiness. In the reanalysis, KKM wanted to find out why the results differed in an “adversarial collaboration”.

To do so, they aggregated daily happiness in the Killingsworth data to derive the subject’s observed mean. Then they used a form of distributional regression, quantile regression, to predict not only the mean of happiness but also various quantiles by log(income). Specifically, they investigated whether the relationship to the mean plateaued above a yearly household income of $100,000. This was not the case, the relationship between income and mean happiness is loglinear above and below this threshold with near-identical slopes. It’s also a quite small relationship, log(income) explains less than 1% of the variance in happiness.

However, they find that the relationship between log(income) and the 15% quantile does indeed flatten after $100,000. In my view, they inappropriately interpret this as evidence that there is an unhappy group of people for which money cannot alleviate suffering (above 100k).

A better way to put it would have been: at higher log(income) the variability in average happiness is higher. That is, some people have a lot of money and spend it on holidays, household help, and time with their loved ones. Others buy a social media platform and make themselves miserable. But even if this statistical relationship is real² and reflects a causal relationship³, it does not imply that there is one group of people whose unhappiness cannot be alleviated with money.

Now, I’m not particularly well-acquainted with quantile regression. However, I’ve worked with multilevel location scale models a good deal in the past, a different form of distributional regression. In so doing, I’ve learned not to trust my intuition, or at least to calibrate with fake data simulation. So, I simulated to check a hunch.

Two questionable assumptions in the quantile regression

My conclusion is that KKM make at least two⁴ assumptions that I think are unlikely to hold and which could bias their estimate:

that the variance of happiness within-subjects is homogeneous
that their aggregated means of happiness are free of sampling error

I was fairly sure that #1 is false, because, you know, I’ve seen happiness data, and because I am a human being who knows other human beings. I did not see their happiness data, mind you, which they only shared in aggregated and rounded form upon publication of KKM 2023, so we cannot check within-subject variance homogeneity.⁵

I was less sure whether #2 has much impact with 51 days per subject on average.

Simulation results

Even with on average 51 days per subject and homogeneous within-subject variance, sampling error in the mean attenuates the relationship between x and sd(y) on the between-subject level. Especially at low levels of x, sampling error can still dominate the true sd(id) in y.

Show code

knitr::include_graphics("multilevel-censored-location-scale-models_files/figure-html5/BS_0_8_WS_0__means-22-1.png")

Compare the scatter of y around the regression line at the top left and the bottom left to see how sampling error creates a slight bias when estimating the effect on sd(y) on the between-subject level. From the second simulation below.

Figure 1: Compare the scatter of y around the regression line at the top left and the bottom left to see how sampling error creates a slight bias when estimating the effect on sd(y) on the between-subject level. From the second simulation below.

Show code

knitr::include_graphics("multilevel-censored-location-scale-models_files/figure-html5/BS_0_8_WS_0__bias-24-1.png")

Figure 2: A between-subject effect on the log(sd) of .8 reduces to .65 because of sampling variance in aggregated means.

Heterogeneous within-subject variance can bias the estimate of the between-subject variance (i.e., what is really a strong tendency for an increased within-subject variance at higher x is mistaken for a weak tendency for an increase in between-subject variance at higher x).

Show code

knitr::include_graphics("multilevel-censored-location-scale-models_files/figure-html5/BS_0_WS_0_8__bias-1.png")

Figure 3: A within-subject effect on the log(sd) of .8 shows up as a .2 effect on the between-subject level.

Censoring increases bias. When values above or below a certain cutoff (e.g., the maximum value of the scale is 100) get clipped, the bias in the aggregated between-subject model is exacerbated. Why? The aggregation is done without regard to censoring. So, a person with c(94, 95, 96, 95) gets 95, just like a person with c(90, 100, 100, 90). But if we model censoring at the within-subject level, we would estimate a higher true value for the second person (my 2-part model correctly recovers the true value). Killingsworth 2021 reports only 5.5% of responses at ceiling and the relationship with the mean is quite weak, so this probably did not hit hard. Still, aggregating loses relevant information (after aggregation “less than 0.5% of people in any income group had experienced well-being equal to the response ceiling, on average”).

Show code

knitr::include_graphics("multilevel-censored-location-scale-models_files/figure-html5/BS_0_8_WS_0_2_bias-24-1.png")

The true between-subject effect on the log(sd) is underestimated more (~.55 with censoring vs. .65 without censoring vis a vis a true value of .8), even though just 10% of values are right-censored.

Figure 4: The true between-subject effect on the log(sd) is underestimated more (~.55 with censoring vs. .65 without censoring vis a vis a true value of .8), even though just 10% of values are right-censored.

Show code

knitr::include_graphics("multilevel-censored-location-scale-models_files/figure-html5/BS_0_WS_0_8_2_bias-1.png")

The between-subject effect on the log(sd) is overestimated more (~.3 censored vs. .16 non-censored vis a vis a true value of 0) in the censored within-subject model, even though just 12% of values are censored.

Figure 5: The between-subject effect on the log(sd) is overestimated more (~.3 censored vs. .16 non-censored vis a vis a true value of 0) in the censored within-subject model, even though just 12% of values are censored.

Technical note: There are Mplus (McNeish, 2020) and R-Stan (Martin, 2022) models to estimate two-part mixed effects location-scale models that allow simultaneous estimation of scale effects at both the within- and between-subject level. However, they do not implement censored distributions. My favourite package brms does, but does not natively permit scale effects on the between-subject level to be estimated. However, it allows us to estimate different sd(id) for subgroups. I formed septiles on x and used these as subgroups. Doing so allowed me to get a sort of poor man’s version of a censored two-part mixed effects location-scale model. A better man would have coded it up in raw Stan. As far as I can tell, this model both gets me closer to the between-subject effect than the aggregated model and gets me an unbiased estimate of the within-subject effect.

Show code

knitr::include_graphics("multilevel-censored-location-scale-models_files/figure-html5/BS_0_8_WS_0_2_betweensds-25-1.png")

Regressing x on the estimated sd(y) for the quantiles in my two-part model recovers the true between-subject effect even in the censored model.

Figure 6: Regressing x on the estimated sd(y) for the quantiles in my two-part model recovers the true between-subject effect even in the censored model.

You can find my simulations (well, one round) below.

Lessons learned

Killingsworth 2021 was a wonderful paper. I think this quantile analysis and adversarial collaboration actually muddled the matter. In German, we of course have a word for this: Verschlimmbesserung.⁶ How come?

I have an idea. Adversarial collaborations don’t just sound like an oxymoron. Especially with large status differences between adversaries, the interesting stuff happens behind the scenes. Without transparency, power, resources, and stubbornness can win over superior arguments.⁷ I think such disputes should be carried out in the open. And if you want to share data with someone you consider an adversary, may I suggest the following as insurance:

Show code

library(tweetrmd)
tweet_embed("https://twitter.com/rubenarslan/status/1631789472407859200")

If you truly treat the other researcher as an adversary, perhaps you should try medieval hostage giving. Each sends the other a random half of their data.
The holdout data and mutuality may help ensure good faith.
— Ruben C. Arslan (@rubenarslan) March 3, 2023

Simulations

Location-scale within-subject

Here, mu(y) is a function of x, as is sigma(y) (only at the within-subject level).

Show code

fit_models <- function(b_mean,  b_sd_bs, b_sd_ws, y_ceiling) {
  name = paste("BS", b_sd_bs, "WS", b_sd_ws, if_else(y_ceiling<Inf,as.character(y_ceiling),""))
  rmdpartials::partial("fit_models.Rmd", 
                       name = name,
                       b_mean = b_mean,  
                       b_sd_bs = b_sd_bs, 
                       b_sd_ws = b_sd_ws, 
                       y_ceiling = y_ceiling)
}

Show code

fit_models(
  b_mean = 1,  b_sd_bs = 0, b_sd_ws = 0.8, y_ceiling = Inf)

Show code

N <- 250
n_days = 51
set.seed(20191005)
people <- tibble(
  id = 1:N,
  x = rnorm(N))
n_days_per_person = rpois(N, n_days)

Show code

people <- people %>% 
  mutate(
    mean_log_sd_y = -1 + b_sd_bs * x,
    log_sd_y = 0 + b_sd_ws * x,
    mean_y = rnorm(N, sd = exp(mean_log_sd_y)) + b_mean * x,
    xQ = ntile(x, 6)
    )
days <- people %>% 
  full_join(tibble(
              id = rep(1:N, times = n_days_per_person)
            ), by = "id", multiple = "all") %>% 
            mutate(
              latent_y = rnorm(n(), 
                        mean = mean_y,
                        sd = exp(log_sd_y)),
              y = case_when(
                latent_y >= y_ceiling ~ y_ceiling,
                # latent_y <= -1.5 ~ -1.5,
                TRUE ~ latent_y
              ),
              ycens = case_when(
                latent_y >= y_ceiling ~ "right",
                # latent_y <= -1.5 ~ "left",
                TRUE ~ "none"
              )
            )

Percentage censored: 0.00

Show code

ggplot(days, aes(x, y)) +
  geom_point(alpha = 0.3)

Figure 7: Raw data

Show code

sel_ids <- c(43, 36, 8, 40, 88, 29, 11, 49, 84, 41, 98)

ggplot(days, aes(x, y)) +
  geom_smooth(method = 'lm', color = "gray", se = F) + 
  geom_pointrange(stat = "summary", 
                  fun = mean, 
                  fun.min = function(x) { mean(x)-sd(x) },
                  fun.max = function(x) { mean(x)+sd(x) }, data = days %>% filter(id %in% sel_ids) ) +
  geom_point(alpha = 0.3, data = days %>% filter(id %in% sel_ids) )

Figure 8: Selected individuals with their means, standard deviations, and the regression line

Show code

library(cowplot)
p1 <- days %>% 
  group_by(id) %>% 
  summarise(x = mean(x, na.rm = T),
            mean_y = mean(y, na.rm = T)) %>% 
  ggplot(., aes(x, mean_y)) +
  geom_smooth(method = 'lm', color = "gray", se = F) + 
  geom_point(alpha = 0.4)

p2 <- days %>% 
  group_by(id) %>% 
  summarise(x = mean(x, na.rm = T),
            sd_y = sd(y, na.rm = T)) %>% 
  ggplot(., aes(x, sd_y)) +
  geom_smooth(method = 'lm', color = "gray", se = F) + 
  geom_point(alpha = 0.4) + 
  scale_y_continuous(trans = "log", breaks = c(0.1, 0.25, 0.5, 1, 2, 4))

p3 <- people %>% 
  ggplot(., aes(x, mean_y)) +
  geom_smooth(method = 'lm', color = "gray", se = F) + 
  geom_point(alpha = 0.4) + 
  ylab("latent mean y")
plot_grid(p1,p2,p3,ncol = 2)

Figure 9: The mean and the intraindividual SD as a function of X

Bias?

Show code

m_mixed <- brm(bf(y | cens(ycens) ~ x + (1|id),
             sigma ~ x), data = days, silent = 2, refresh = 0)
m_between <- brm(bf(y | cens(ycens) ~ x,
             sigma ~ x), data = days %>% group_by(x, id) %>% 
               summarise(y = mean(y)) %>% 
               mutate(ycens = case_when(
                y >= y_ceiling ~ "right",
                TRUE ~ "none"
              )), silent = 2, refresh = 0)
m_mixed_2part <- brm(bf(y | cens(ycens) ~ x + (1|gr(id, by = xQ)),
             sigma ~ x), data = days, silent = 2, refresh = 0)
draws <- m_mixed_2part %>% gather_draws(`sd_.*x.*`, regex = T)
betweenhdis <- draws %>% mean_hdci(.width = .95) %>% 
  mutate(xQ = as.numeric(str_match(.variable, "xQ(\\d)")[,2])) %>% 
  left_join(people %>% group_by(xQ) %>% summarise(x = mean(x)))
m_2part_level2 <- brm(log(.value) | se(se, sigma = FALSE) ~ x, data = betweenhdis %>% mutate(se = (log(.value)-log(.lower))/2))

Show code

draws <- bind_rows(
  m_mixed = m_mixed %>% gather_draws(`b_(sigma_)?x`, regex = T),
  m_mixed_2part = m_mixed_2part %>% gather_draws(`b_(sigma_)?x`, regex = T),
  m_2part_between = m_2part_level2 %>% gather_draws(`b_(sigma_)?x`, regex = T) %>% mutate(.variable = "b_sigma_x"),
  m_between = m_between %>% gather_draws(`b_(sigma_)?x`, regex = T),
 .id = "model") %>% 
  mutate(model = fct_inorder(factor(model)))
draws <- draws %>% group_by(model, .variable) %>% 
  mean_hdci(.width = c(.95, .99)) %>% 
  ungroup()

ggplot(draws, aes(y = .variable, x = .value, xmin = .lower, xmax = .upper)) +
  geom_pointinterval(position = position_dodge(width = .4)) +
  ggrepel::geom_text_repel(aes(label = if_else(.width == .95, sprintf("%.2f", .value), NA_character_)), nudge_y = .1) +
  geom_vline(aes(xintercept = true_value), linetype = 'dashed', 
             data = tibble(model = c("m_between", "m_mixed", "m_mixed_2part", "m_between", "m_mixed", "m_mixed_2part", "m_2part_between"), .variable = c("b_x", "b_x", "b_x", "b_sigma_x", "b_sigma_x", "b_sigma_x", "b_sigma_x"), true_value = c(b_mean, b_mean, b_mean, b_sd_bs, b_sd_ws, b_sd_ws, b_sd_bs))) +  scale_color_discrete(breaks = rev(levels(draws$model))) +
  facet_grid(model ~ .variable) +
  theme_bw() +
  theme(legend.position = c(0.99,0.99),
        legend.justification = c(1,1))

Figure 10: Estimated coefficients and the true values (dashed line)

Show code

ggplot(betweenhdis, aes(x, .value, 
                 ymin = .lower, ymax = .upper)) +
  geom_pointrange() + 
  geom_line() + 
  ylab("sd(id)") +
  scale_y_continuous(trans = "log", breaks = c(0.1, 0.25, 0.5, 1, 2, 4)) +
  theme_bw()

Figure 11: Relationship between x and sd(id) in the two-part model

Location-scale between-subject

Here, mu(y) is a function of x, as is sigma(y) at the between-subject level.

Show code

fit_models(
  b_mean = 1,  b_sd_bs = 0.8, b_sd_ws = 0, y_ceiling = Inf)

Show code

N <- 250
n_days = 51
set.seed(20191005)
people <- tibble(
  id = 1:N,
  x = rnorm(N))
n_days_per_person = rpois(N, n_days)

Show code

people <- people %>% 
  mutate(
    mean_log_sd_y = -1 + b_sd_bs * x,
    log_sd_y = 0 + b_sd_ws * x,
    mean_y = rnorm(N, sd = exp(mean_log_sd_y)) + b_mean * x,
    xQ = ntile(x, 6)
    )
days <- people %>% 
  full_join(tibble(
              id = rep(1:N, times = n_days_per_person)
            ), by = "id", multiple = "all") %>% 
            mutate(
              latent_y = rnorm(n(), 
                        mean = mean_y,
                        sd = exp(log_sd_y)),
              y = case_when(
                latent_y >= y_ceiling ~ y_ceiling,
                # latent_y <= -1.5 ~ -1.5,
                TRUE ~ latent_y
              ),
              ycens = case_when(
                latent_y >= y_ceiling ~ "right",
                # latent_y <= -1.5 ~ "left",
                TRUE ~ "none"
              )
            )

Percentage censored: 0.00

Show code

ggplot(days, aes(x, y)) +
  geom_point(alpha = 0.3)

Figure 12: Raw data

Show code

sel_ids <- c(43, 36, 8, 40, 88, 29, 11, 49, 84, 41, 98)

ggplot(days, aes(x, y)) +
  geom_smooth(method = 'lm', color = "gray", se = F) + 
  geom_pointrange(stat = "summary", 
                  fun = mean, 
                  fun.min = function(x) { mean(x)-sd(x) },
                  fun.max = function(x) { mean(x)+sd(x) }, data = days %>% filter(id %in% sel_ids) ) +
  geom_point(alpha = 0.3, data = days %>% filter(id %in% sel_ids) )

Figure 13: Selected individuals with their means, standard deviations, and the regression line

Show code

library(cowplot)
p1 <- days %>% 
  group_by(id) %>% 
  summarise(x = mean(x, na.rm = T),
            mean_y = mean(y, na.rm = T)) %>% 
  ggplot(., aes(x, mean_y)) +
  geom_smooth(method = 'lm', color = "gray", se = F) + 
  geom_point(alpha = 0.4)

p2 <- days %>% 
  group_by(id) %>% 
  summarise(x = mean(x, na.rm = T),
            sd_y = sd(y, na.rm = T)) %>% 
  ggplot(., aes(x, sd_y)) +
  geom_smooth(method = 'lm', color = "gray", se = F) + 
  geom_point(alpha = 0.4) + 
  scale_y_continuous(trans = "log", breaks = c(0.1, 0.25, 0.5, 1, 2, 4))

p3 <- people %>% 
  ggplot(., aes(x, mean_y)) +
  geom_smooth(method = 'lm', color = "gray", se = F) + 
  geom_point(alpha = 0.4) + 
  ylab("latent mean y")
plot_grid(p1,p2,p3,ncol = 2)

Figure 14: The mean and the intraindividual SD as a function of X

Bias?

Show code

m_mixed <- brm(bf(y | cens(ycens) ~ x + (1|id),
             sigma ~ x), data = days, silent = 2, refresh = 0)
m_between <- brm(bf(y | cens(ycens) ~ x,
             sigma ~ x), data = days %>% group_by(x, id) %>% 
               summarise(y = mean(y)) %>% 
               mutate(ycens = case_when(
                y >= y_ceiling ~ "right",
                TRUE ~ "none"
              )), silent = 2, refresh = 0)
m_mixed_2part <- brm(bf(y | cens(ycens) ~ x + (1|gr(id, by = xQ)),
             sigma ~ x), data = days, silent = 2, refresh = 0)
draws <- m_mixed_2part %>% gather_draws(`sd_.*x.*`, regex = T)
betweenhdis <- draws %>% mean_hdci(.width = .95) %>% 
  mutate(xQ = as.numeric(str_match(.variable, "xQ(\\d)")[,2])) %>% 
  left_join(people %>% group_by(xQ) %>% summarise(x = mean(x)))
m_2part_level2 <- brm(log(.value) | se(se, sigma = FALSE) ~ x, data = betweenhdis %>% mutate(se = (log(.value)-log(.lower))/2))

Show code

draws <- bind_rows(
  m_mixed = m_mixed %>% gather_draws(`b_(sigma_)?x`, regex = T),
  m_mixed_2part = m_mixed_2part %>% gather_draws(`b_(sigma_)?x`, regex = T),
  m_2part_between = m_2part_level2 %>% gather_draws(`b_(sigma_)?x`, regex = T) %>% mutate(.variable = "b_sigma_x"),
  m_between = m_between %>% gather_draws(`b_(sigma_)?x`, regex = T),
 .id = "model") %>% 
  mutate(model = fct_inorder(factor(model)))
draws <- draws %>% group_by(model, .variable) %>% 
  mean_hdci(.width = c(.95, .99)) %>% 
  ungroup()

ggplot(draws, aes(y = .variable, x = .value, xmin = .lower, xmax = .upper)) +
  geom_pointinterval(position = position_dodge(width = .4)) +
  ggrepel::geom_text_repel(aes(label = if_else(.width == .95, sprintf("%.2f", .value), NA_character_)), nudge_y = .1) +
  geom_vline(aes(xintercept = true_value), linetype = 'dashed', 
             data = tibble(model = c("m_between", "m_mixed", "m_mixed_2part", "m_between", "m_mixed", "m_mixed_2part", "m_2part_between"), .variable = c("b_x", "b_x", "b_x", "b_sigma_x", "b_sigma_x", "b_sigma_x", "b_sigma_x"), true_value = c(b_mean, b_mean, b_mean, b_sd_bs, b_sd_ws, b_sd_ws, b_sd_bs))) +  scale_color_discrete(breaks = rev(levels(draws$model))) +
  facet_grid(model ~ .variable) +
  theme_bw() +
  theme(legend.position = c(0.99,0.99),
        legend.justification = c(1,1))

Figure 15: Estimated coefficients and the true values (dashed line)

Show code

ggplot(betweenhdis, aes(x, .value, 
                 ymin = .lower, ymax = .upper)) +
  geom_pointrange() + 
  geom_line() + 
  ylab("sd(id)") +
  scale_y_continuous(trans = "log", breaks = c(0.1, 0.25, 0.5, 1, 2, 4)) +
  theme_bw()

Figure 16: Relationship between x and sd(id) in the two-part model

Show code

# remotes::install_github("stephensrmmartin/LMMELSM")
# library(LMMELSM)
# m_melsm_2part <- lmmelsm(list(observed ~ y,
#                     location ~ x,
#                     scale ~ x,
#                     between ~ x),
#                id, days)
# summary(m_melsm_2part)

Censored location-scale within-subject

Here, mu(y) is a function of x, as is sigma(y) (only at the within-subject level).

Show code

fit_models(b_mean = 1,  b_sd_bs = 0, b_sd_ws = 0.8, y_ceiling = 2)

Show code

N <- 250
n_days = 51
set.seed(20191005)
people <- tibble(
  id = 1:N,
  x = rnorm(N))
n_days_per_person = rpois(N, n_days)

Show code

people <- people %>% 
  mutate(
    mean_log_sd_y = -1 + b_sd_bs * x,
    log_sd_y = 0 + b_sd_ws * x,
    mean_y = rnorm(N, sd = exp(mean_log_sd_y)) + b_mean * x,
    xQ = ntile(x, 6)
    )
days <- people %>% 
  full_join(tibble(
              id = rep(1:N, times = n_days_per_person)
            ), by = "id", multiple = "all") %>% 
            mutate(
              latent_y = rnorm(n(), 
                        mean = mean_y,
                        sd = exp(log_sd_y)),
              y = case_when(
                latent_y >= y_ceiling ~ y_ceiling,
                # latent_y <= -1.5 ~ -1.5,
                TRUE ~ latent_y
              ),
              ycens = case_when(
                latent_y >= y_ceiling ~ "right",
                # latent_y <= -1.5 ~ "left",
                TRUE ~ "none"
              )
            )

Percentage censored: 0.12

Show code

ggplot(days, aes(x, y)) +
  geom_point(alpha = 0.3)

Figure 7: Raw data

Show code

sel_ids <- c(43, 36, 8, 40, 88, 29, 11, 49, 84, 41, 98)

ggplot(days, aes(x, y)) +
  geom_smooth(method = 'lm', color = "gray", se = F) + 
  geom_pointrange(stat = "summary", 
                  fun = mean, 
                  fun.min = function(x) { mean(x)-sd(x) },
                  fun.max = function(x) { mean(x)+sd(x) }, data = days %>% filter(id %in% sel_ids) ) +
  geom_point(alpha = 0.3, data = days %>% filter(id %in% sel_ids) )

Figure 8: Selected individuals with their means, standard deviations, and the regression line

Show code

library(cowplot)
p1 <- days %>% 
  group_by(id) %>% 
  summarise(x = mean(x, na.rm = T),
            mean_y = mean(y, na.rm = T)) %>% 
  ggplot(., aes(x, mean_y)) +
  geom_smooth(method = 'lm', color = "gray", se = F) + 
  geom_point(alpha = 0.4)

p2 <- days %>% 
  group_by(id) %>% 
  summarise(x = mean(x, na.rm = T),
            sd_y = sd(y, na.rm = T)) %>% 
  ggplot(., aes(x, sd_y)) +
  geom_smooth(method = 'lm', color = "gray", se = F) + 
  geom_point(alpha = 0.4) + 
  scale_y_continuous(trans = "log", breaks = c(0.1, 0.25, 0.5, 1, 2, 4))

p3 <- people %>% 
  ggplot(., aes(x, mean_y)) +
  geom_smooth(method = 'lm', color = "gray", se = F) + 
  geom_point(alpha = 0.4) + 
  ylab("latent mean y")
plot_grid(p1,p2,p3,ncol = 2)

Figure 9: The mean and the intraindividual SD as a function of X

Bias?

Show code

m_mixed <- brm(bf(y | cens(ycens) ~ x + (1|id),
             sigma ~ x), data = days, silent = 2, refresh = 0)
m_between <- brm(bf(y | cens(ycens) ~ x,
             sigma ~ x), data = days %>% group_by(x, id) %>% 
               summarise(y = mean(y)) %>% 
               mutate(ycens = case_when(
                y >= y_ceiling ~ "right",
                TRUE ~ "none"
              )), silent = 2, refresh = 0)
m_mixed_2part <- brm(bf(y | cens(ycens) ~ x + (1|gr(id, by = xQ)),
             sigma ~ x), data = days, silent = 2, refresh = 0)
draws <- m_mixed_2part %>% gather_draws(`sd_.*x.*`, regex = T)
betweenhdis <- draws %>% mean_hdci(.width = .95) %>% 
  mutate(xQ = as.numeric(str_match(.variable, "xQ(\\d)")[,2])) %>% 
  left_join(people %>% group_by(xQ) %>% summarise(x = mean(x)))
m_2part_level2 <- brm(log(.value) | se(se, sigma = FALSE) ~ x, data = betweenhdis %>% mutate(se = (log(.value)-log(.lower))/2))

Show code

draws <- bind_rows(
  m_mixed = m_mixed %>% gather_draws(`b_(sigma_)?x`, regex = T),
  m_mixed_2part = m_mixed_2part %>% gather_draws(`b_(sigma_)?x`, regex = T),
  m_2part_between = m_2part_level2 %>% gather_draws(`b_(sigma_)?x`, regex = T) %>% mutate(.variable = "b_sigma_x"),
  m_between = m_between %>% gather_draws(`b_(sigma_)?x`, regex = T),
 .id = "model") %>% 
  mutate(model = fct_inorder(factor(model)))
draws <- draws %>% group_by(model, .variable) %>% 
  mean_hdci(.width = c(.95, .99)) %>% 
  ungroup()

ggplot(draws, aes(y = .variable, x = .value, xmin = .lower, xmax = .upper)) +
  geom_pointinterval(position = position_dodge(width = .4)) +
  ggrepel::geom_text_repel(aes(label = if_else(.width == .95, sprintf("%.2f", .value), NA_character_)), nudge_y = .1) +
  geom_vline(aes(xintercept = true_value), linetype = 'dashed', 
             data = tibble(model = c("m_between", "m_mixed", "m_mixed_2part", "m_between", "m_mixed", "m_mixed_2part", "m_2part_between"), .variable = c("b_x", "b_x", "b_x", "b_sigma_x", "b_sigma_x", "b_sigma_x", "b_sigma_x"), true_value = c(b_mean, b_mean, b_mean, b_sd_bs, b_sd_ws, b_sd_ws, b_sd_bs))) +  scale_color_discrete(breaks = rev(levels(draws$model))) +
  facet_grid(model ~ .variable) +
  theme_bw() +
  theme(legend.position = c(0.99,0.99),
        legend.justification = c(1,1))

Figure 10: Estimated coefficients and the true values (dashed line)

Show code

ggplot(betweenhdis, aes(x, .value, 
                 ymin = .lower, ymax = .upper)) +
  geom_pointrange() + 
  geom_line() + 
  ylab("sd(id)") +
  scale_y_continuous(trans = "log", breaks = c(0.1, 0.25, 0.5, 1, 2, 4)) +
  theme_bw()

Figure 11: Relationship between x and sd(id) in the two-part model

Censored location-scale between-subject

Here, mu(y) is a function of x, as is sigma(y) only at the between-subject level.

Show code

fit_models(b_mean = 1,  b_sd_bs = 0.8, b_sd_ws = 0, y_ceiling = 2)

Show code

N <- 250
n_days = 51
set.seed(20191005)
people <- tibble(
  id = 1:N,
  x = rnorm(N))
n_days_per_person = rpois(N, n_days)

Show code

people <- people %>% 
  mutate(
    mean_log_sd_y = -1 + b_sd_bs * x,
    log_sd_y = 0 + b_sd_ws * x,
    mean_y = rnorm(N, sd = exp(mean_log_sd_y)) + b_mean * x,
    xQ = ntile(x, 6)
    )
days <- people %>% 
  full_join(tibble(
              id = rep(1:N, times = n_days_per_person)
            ), by = "id", multiple = "all") %>% 
            mutate(
              latent_y = rnorm(n(), 
                        mean = mean_y,
                        sd = exp(log_sd_y)),
              y = case_when(
                latent_y >= y_ceiling ~ y_ceiling,
                # latent_y <= -1.5 ~ -1.5,
                TRUE ~ latent_y
              ),
              ycens = case_when(
                latent_y >= y_ceiling ~ "right",
                # latent_y <= -1.5 ~ "left",
                TRUE ~ "none"
              )
            )

Percentage censored: 0.10

Show code

ggplot(days, aes(x, y)) +
  geom_point(alpha = 0.3)

Figure 12: Raw data

Show code

sel_ids <- c(43, 36, 8, 40, 88, 29, 11, 49, 84, 41, 98)

ggplot(days, aes(x, y)) +
  geom_smooth(method = 'lm', color = "gray", se = F) + 
  geom_pointrange(stat = "summary", 
                  fun = mean, 
                  fun.min = function(x) { mean(x)-sd(x) },
                  fun.max = function(x) { mean(x)+sd(x) }, data = days %>% filter(id %in% sel_ids) ) +
  geom_point(alpha = 0.3, data = days %>% filter(id %in% sel_ids) )

Figure 13: Selected individuals with their means, standard deviations, and the regression line

Show code

library(cowplot)
p1 <- days %>% 
  group_by(id) %>% 
  summarise(x = mean(x, na.rm = T),
            mean_y = mean(y, na.rm = T)) %>% 
  ggplot(., aes(x, mean_y)) +
  geom_smooth(method = 'lm', color = "gray", se = F) + 
  geom_point(alpha = 0.4)

p2 <- days %>% 
  group_by(id) %>% 
  summarise(x = mean(x, na.rm = T),
            sd_y = sd(y, na.rm = T)) %>% 
  ggplot(., aes(x, sd_y)) +
  geom_smooth(method = 'lm', color = "gray", se = F) + 
  geom_point(alpha = 0.4) + 
  scale_y_continuous(trans = "log", breaks = c(0.1, 0.25, 0.5, 1, 2, 4))

p3 <- people %>% 
  ggplot(., aes(x, mean_y)) +
  geom_smooth(method = 'lm', color = "gray", se = F) + 
  geom_point(alpha = 0.4) + 
  ylab("latent mean y")
plot_grid(p1,p2,p3,ncol = 2)

Figure 14: The mean and the intraindividual SD as a function of X

Bias?

Show code

m_mixed <- brm(bf(y | cens(ycens) ~ x + (1|id),
             sigma ~ x), data = days, silent = 2, refresh = 0)
m_between <- brm(bf(y | cens(ycens) ~ x,
             sigma ~ x), data = days %>% group_by(x, id) %>% 
               summarise(y = mean(y)) %>% 
               mutate(ycens = case_when(
                y >= y_ceiling ~ "right",
                TRUE ~ "none"
              )), silent = 2, refresh = 0)
m_mixed_2part <- brm(bf(y | cens(ycens) ~ x + (1|gr(id, by = xQ)),
             sigma ~ x), data = days, silent = 2, refresh = 0)
draws <- m_mixed_2part %>% gather_draws(`sd_.*x.*`, regex = T)
betweenhdis <- draws %>% mean_hdci(.width = .95) %>% 
  mutate(xQ = as.numeric(str_match(.variable, "xQ(\\d)")[,2])) %>% 
  left_join(people %>% group_by(xQ) %>% summarise(x = mean(x)))
m_2part_level2 <- brm(log(.value) | se(se, sigma = FALSE) ~ x, data = betweenhdis %>% mutate(se = (log(.value)-log(.lower))/2))

Show code

draws <- bind_rows(
  m_mixed = m_mixed %>% gather_draws(`b_(sigma_)?x`, regex = T),
  m_mixed_2part = m_mixed_2part %>% gather_draws(`b_(sigma_)?x`, regex = T),
  m_2part_between = m_2part_level2 %>% gather_draws(`b_(sigma_)?x`, regex = T) %>% mutate(.variable = "b_sigma_x"),
  m_between = m_between %>% gather_draws(`b_(sigma_)?x`, regex = T),
 .id = "model") %>% 
  mutate(model = fct_inorder(factor(model)))
draws <- draws %>% group_by(model, .variable) %>% 
  mean_hdci(.width = c(.95, .99)) %>% 
  ungroup()

ggplot(draws, aes(y = .variable, x = .value, xmin = .lower, xmax = .upper)) +
  geom_pointinterval(position = position_dodge(width = .4)) +
  ggrepel::geom_text_repel(aes(label = if_else(.width == .95, sprintf("%.2f", .value), NA_character_)), nudge_y = .1) +
  geom_vline(aes(xintercept = true_value), linetype = 'dashed', 
             data = tibble(model = c("m_between", "m_mixed", "m_mixed_2part", "m_between", "m_mixed", "m_mixed_2part", "m_2part_between"), .variable = c("b_x", "b_x", "b_x", "b_sigma_x", "b_sigma_x", "b_sigma_x", "b_sigma_x"), true_value = c(b_mean, b_mean, b_mean, b_sd_bs, b_sd_ws, b_sd_ws, b_sd_bs))) +  scale_color_discrete(breaks = rev(levels(draws$model))) +
  facet_grid(model ~ .variable) +
  theme_bw() +
  theme(legend.position = c(0.99,0.99),
        legend.justification = c(1,1))

Figure 15: Estimated coefficients and the true values (dashed line)

Show code

ggplot(betweenhdis, aes(x, .value, 
                 ymin = .lower, ymax = .upper)) +
  geom_pointrange() + 
  geom_line() + 
  ylab("sd(id)") +
  scale_y_continuous(trans = "log", breaks = c(0.1, 0.25, 0.5, 1, 2, 4)) +
  theme_bw()

Figure 16: Relationship between x and sd(id) in the two-part model

so, a doubling of income is associated with same happiness increment throughout the income range↩︎
Big if!↩︎
Even bigger if!↩︎
two assumptions that I’ll touch on here. They do not make their assumptions regarding causal inference explicit, but still very much imply causality. But, like, what else is new?↩︎
I requested the raw data on March 4 and received an answer that ping-level data could not be shared owing to promises made to participants on March 15.↩︎
My dictionary says “disimprovement”. Is that actually in use?↩︎
Am I projecting based on my own attempt at adversarial collaboration? Absolutely! Thanks for asking, I’m okay!↩︎

Comment on this article Share:

Multilevel censored location-scale models

Two questionable assumptions in the quantile regression

Simulation results

Lessons learned

Simulations

Location-scale within-subject

Bias?

Location-scale between-subject

Bias?

Censored location-scale within-subject

Bias?

Censored location-scale between-subject

Bias?

Corrections

Reuse

Citation