Correction to Arslan et al. (2019).

We (me, Katharina Schilling, Tanja M. Gerlach, & Lars Penke) recently published a diary study on ovulatory changes in the Journal of Personality and Social Psychology.

Unfortunately, we made a few mistakes in reporting the study. The correction appeared today in JPSP. Because corrections have to be quite short, we will use this blog post to give a little more detail.

According to our assessment, the mistakes, although annoying and preventable, changed nothing substantive. I have taken to adding automated testing to my data cleaning code and instituted a bug bounty policy to reduce the odds of such errors in my future work.¹

After expanding on the correction, we will also respond to some criticisms that we do not think are errors in our work, but differences in interpretation.

Several bugs in our code. From the [Internet Archive Book Images](https://www.flickr.com/photos/internetarchivebookimages/14760162636)

Figure 1: Several bugs in our code. From the Internet Archive Book Images

Correction

Figure 1 and case numbers

We regret the following errors and inconsistencies in our published paper. Between our initial submission and our revision, we had made a small adjustment to the code for our exclusion criteria and neglected to update Figure 1 and Table 3 (because we did not notice that we had a few more participants and days). This led us to report an incorrect, lower number of total participants (1043 instead of 1054) for the robustness checks. The number of days were also off by a few hundred, as well as various sample means. The substantive results (model coefficients etc.) were reported correctly and with correct case numbers (in the online supplement).

The preregistered work is unaffected by this error. A corrected Figure 1 also shows two exclusion criteria (hypothesis guessing and long interruptions of the diary) that were mentioned on the supplementary website, but missing from Figure 1. A corrected figure can be found here and in the updated article.

`effsize` package bug

We reported inflated effect sizes for the Hedges’ g differences between hormonal contraceptive users and non-users in Table 1. After re-analysing data for the correction, we suddenly got different effect sizes. It turned out there was a bug in the effsize package for Hedges’ g computation that had been fixed in a newer version.

In all, we reported larger effect size differences between our naturally cycling group and our hormonal contraception quasi-control group; they were more comparable than Table 1 made them seem.

Programming error for a moderator variable

I made a programming error when aggregating the variable “partner’s attractiveness relative to self”. Specifically, I accidentally sorted values because of a typo in the data.table syntax. This led to nonsense values (women’s values were jumbled). A reader who re-analysed our data found it, for which are grateful. Fixing this error led to the following changes:

In the preregistered analyses, the moderation of fertile window effects on extra-pair desire and behaviour was no longer non-significant in the opposite direction of the prediction, but non-significant in the predicted direction (p = 0.23).
In the robustness analyses, the predicted interaction was significant for extra-pair desire and behaviour (p = 0.00565) and partner mate retention (p = 0.0014).

Our preregistered tests, following the literature at the time, had not permitted slopes for menstruation and the fertile window to vary by woman, even though fitting a cross-level moderation essentially stipulates that varying slopes must exist (an internal conceptual inconsistency).

Models with varying slopes indeed fit better for all outcomes. We reported robustness checks with varying slopes for all main effects, but we had not done so for our moderators tests, because we found no evidence of moderation and the check would have only made the test more conservative. Given that correcting the error led to a nominally significant result, we also tested a model, allowing for slopes to vary. In this model, the predicted interaction was non-significant for extra-pair desire (p = 0.085). The predicted interaction for partner mate retention in the robustness check would have been significant (p = 0.0072) according to our threshold of .01 for the preregistered tests, but still potentially consistent with sampling error given that 24 moderator effects had been tested (four moderators, three outcomes, two subsamples) were tested for essentially one hypothesis.

This programming error, though severe, did not affect the preregistered results. In our robustness checks, the error led to some changes in nominal significance, but the overall pattern still cannot be seen as evidence for the predicted moderation pattern.

Other post-publication feedback (not part of the correction)

Figure 5

Dan Engber helpfully pointed out that the caption for Figure 5 could have been clearer. The figure was intended to show differences in patterns across the cycle. To this end, we standardised differences within variables and hormonal contraceptive status (“within-subject change” in the figure caption). This focuses the eye on the differences in changes for HC users and non-users. In Figure 3, we also showed the mean differences. An alternative version of Figure 5, including mean differences between HC users and non-users, can be found online.

Following the preregistration

We were criticised for not following our preregistration to the letter. It was our intention to be faithful to the preregistration as much as possible and transparent about the deviations that we considered reasonable and necessary. We think we succeeded in doing so and that problems raised by the critic are mainly problems of explicitness and interpretation.

It was our first preregistration (in 2014), we had no models for how to preregister correlational work with many simultaenous (but not all related) hypothesis tests. It was also our first menstrual cycle study and my first repeated measures study. For this reason, we relied on expert opinion to design, for example, our exclusion criteria.

This process led to a few suboptimalities (still an incomplete list, I am sure):

Our exclusion criteria were overly strict and would have led to excluding most of the women for no good reason (according to our effect size estimates, excluded women were not more likely to be anovulatory).
We preregistered the use of windowed fertility predictors, which throw away most of the informative variation in fertility and reduce the number of usable days.
We preregistered no strategy to deal with multiple testing, although we had multiple outcomes (some of which were highly correlated).
We preregistered several moderators that were all designed to test the same hypothesis, instead of the strongest possible specification.
We did not preregister how we would aggregate some of the more complex items in the data.
We preregistered a scale optimisation algorithm based on Cronbach’s alpha, which is not the best way to estimate reliability for multilevel data like ours

We think we transparently reported how we chose to deal with these problems. We did not make any decisions to arrive at foregone conclusions; instead, we think we had good reasons for non-arbitrary decisions.

Operationalisation of hypothesis 2.2

The reader alerted us that our hypothesis _H.2.2. Moderation or shift hypotheses: The ovulatory increase in women’s extra-pair desires and reported male mate retention behavior is strongest (and the in-pair desire increase is weakest) for women who perceive their partners as low in sexual attractiveness relative to long-term partner attractiveness. could also be interpreted to mean a different statistical model than the one we fitted.

We interpreted it as meaning that women who have a partner who is high in long-term attractiveness but low in short-term attractiveness would show ovulatory increases in extra-pair desire, whereas all other women would not. Basically, women who have a partner who is a “provider” but does not have “good genes” would be interested in extra-pair men; other women would not be.

We saw this in contrast to the simpler model, which we also fit, with only short-term attractiveness as the moderator. The reader interpreted it as meaning that we should adjust for long-term attractiveness to remove a “positivity bias” and test only the interaction between the fertile window and short-term attractiveness. Previous work had sometimes tested such a model and sometimes a difference score.

Although we reported them, we recommend not interpreting difference scores such as this (or the relative attractiveness variable above) in isolation, because they assume that women with partners who are attractive for both long- and short-term relationships behave the same way as women with partners who are not attractive for either long- or short-term relationships. We think this is not what the verbally specified theory predicts, but of course verbal specifications can be debated because they often leave some room for ambiguity.

In our preregistered analyses, none of these alternative specifications would have yielded a significant effect, except one significant result in the opposite direction for in-pair desire. However, in our robustness checks, the interaction for this alternative specification would have been significant (p = 0.006). Again, allowing for slopes to vary rendered this interaction nonsignificant at .01 (p = 0.045).

Overall, as we had already stressed in our discussion, it would be premature to conclude an absence of moderation: confidence intervals were too wide to rule out potentially relevant effect sizes and patterns were often in the predicted form for extra-pair desire (but not for in-pair desire). But neither should these models, which were suggested after seeing the results for other models, be seen as evidence for moderation, given the number of tests performed. If a prediction from the literature is supported in preregistered tests, checks like ours can show robustness to relaxing or tightening assumptions. The evidence for the predicted moderators is clearly not robust in our data. More data is needed to reach adequate power for more informative tests of moderation patterns, and is indeed forthcoming. Maybe more importantly, theories need to be clearer, so that they can specify severe tests. We found this difficult to do at the time of planning the study.

Operationalisation of preregistration regarding hormonal contraceptive users

Lastly, we did not preregister that we would use hormonal contraception (HC) users as a quasi-control group for the naturally cycling group. Consistent with this, our preregistered tests compared fertile window changes with zero, not with the baseline change for HC users. However, we reported the latter comparison as well, in the preregistered analysis section. We mainly did this to show that despite the fact that we only had an ad-hoc strategy to deal with multiple testing, we never found an ovulatory change among hormonal contraceptive users (for whom ovulation is suppressed). We thought reporting the quasi-control group was one way to show that our ad-hoc strategy was effective.

We hope these additional tests, which were in fact always consistent with our preregistered tests, did not lead to confusion regarding our preregistration. The choice of additionally presenting these analyses did not affect our conclusions and was not made conditional on the results.

Conclusion

We are glad the paper led to animated post-publication discussion and are grateful to all who pointed out errors or ways the paper could be improved. We will implement the suggestions and lessons when publishing the results from the second, larger cycle study we conducted after this one.

In all, the conclusions of our paper remain the same, although quite a few numbers changed a little. The replicability of the ovulatory change literature still seems decidedly mixed. Our work was not (and was never meant to be) the last word on moderators of cycle changes.

I also stopped using data.table in favour of dplyr which has a more explicit syntax.↩