Swedish contemporary data Paternal age -> Fitness

Description of data

The Swedish Multi-Generation Register

The Swedish Multi-Generation Register was established in 1960 and includes index persons born since 1932 that were alive in 1960. The register also includes the parents of the index persons. The parents are included in the register even if they were not born after 1932 or alive in 1960. The coverage of parents of individuals born in Sweden is approximately 98 % for fathers and 100 % for mothers around 1960 (see graph below).

Left truncation

Left truncation refers to data that is retrospectively limited resulting in exclusion of study subjects. The truncation is dependent on birth strata. Still, matching cases and controls on birth year does not automatically take care of problems due to left truncation. In these studies, the data might be truncated regarding parental ages. Individuals born before the initiation of the Multi-Generation Register in 1932 are excluded due to left truncation of parental age data. However, the Multi-Generation Register includes parental information of individuals born since 1932 and ages of their parents are not extensively subjected to left truncation. In order to collate a sample consisting of 3 generations, parents have to be born after 1932. As a result, older parents are more likely to be excluded than younger parents. These problems may be addressed with several approaches in sensitivity analyses

Left censoring

When an event occurs before the individual comes under observation, this is called left censoring. Studies on paternal age and number of offspring might be subjected to left censoring regarding number of children because children born before 1932 are censored. Different sensitivity analyses can be used to examine potential problems with left censoring.

Difference between left truncation and left censoring

If a man has all his children before 1932 he will be excluded due to left truncation.
If a man has some children before 1932 and others after 1932 this man will be included as well as children born after 1932 who were alive in 1960. However, the children born before 1932 will be excluded due to left censoring.

Right censoring

When an event occurs after the observation time end-point, this is called right censoring. Studies on paternal age in association to fertility might be subjected to left censoring regarding number of children because all children born after December 31 st 2009 are censored. However, problems with censoring can be adjusted for by using Cox regression or conditional logistic regression.

Other issues

Lopnrs are unique for each individual. However, an individual can occur several times, either as fathers, grandfathers or children.
Some observations consist of two generations. Others have complete data for three generations.
Educational level classification: 1. Elementary school,< 9 y 2. Compulsory school, 9 y 3. Upper secondary, < 3 y 4. Upper secondary, 3 y 5. Postsecondary, < 3 y 6. University graduate, ≥ 3 y 7. University postgraduate (PhD)
Lopnr=id number (mor=mother, far=father, mormor=maternal grandmother, morfar=maternal grandfather, farmor= paternal grandmother, farfar= paternal grandfather )
Kon = sex

Wrangling

source("0__helpers.R")
library(zoo)
# bsub -q fat -W 48:00 -n 1 Rscript -e "setwd('/usr/users/rarslan/updated_data/'); filebase = '0_swed_massage'; knitr::knit(input = paste0(filebase,'.Rmd'), output = paste0(filebase,'.md'));cat(readLines(paste0(filebase,'.md')), sep = '\n')"
opts_chunk$set(cache=F,cache.lazy=F,tidy=FALSE,autodep=TRUE,dev=c('png','pdf'),fig.width=12,fig.height=7.5,out.width='1440px',out.height='900px')

Calculating mating success

Info about marriage is called CIVIL. You can be listed as G (married), OG (never married), S (divorced) or Ä (widow/widower). There is also a variable called CIVILANTAR which describes how many years you have had the same CIVIL. This info is updated annually and we have information from 1990-2009.

I have created a file where I just extracted all data about individual marriage status (including all women and men). We actually have info dating back to 1960 for another data source called FOB (folkbokföringsregistret). This data was collected in 1960, 1970, 1975, 1980, 1985 and 1990. FOB does not include CIVILANTAR (number of years with the same CIVIL status).

# system.time({
# swed_civil60_85 = read_sas("~/Downloads/civil_1960_1985.sas7bdat")
# swed_civil60_85 = data.table(swed_civil60_85)
# })
# system.time({
# swed_civil90_09 = read_sas("~/Downloads/civil_1990_2009.sas7bdat")
# swed_civil90_09 = data.table(swed_civil90_09)
# })
# setkey(swed_civil60_85, LOPNR, year)
# setkey(swed_civil90_09, LOPNR, year)
# 
# # swed_civil60_85 = head(swed_civil60_85, 1e5)
# # swed_civil90_09 = head(swed_civil90_09, 1e5)
# 
# swed_civil = rbind(swed_civil60_85,swed_civil90_09)
# setkey(swed_civil, LOPNR, year) # sort by person, then year
# save(swed_civil,file="data/swed_civil.rdata")
# rm(swed_civil60_85, swed_civil90_09) # free memory

load("data/swed_civil.rdata")

# tail(swed_civil, 100)
# swed_civil[LOPNR == 9862, ]
# swed_civil[LOPNR == 7, ]
# swed_civil[LOPNR==8602, ]

crosstabs(swed_civil$kod)

## swed_civil$kod
##       0G        Ä       EP        G       OG       RP        S       SP 
##        1 13243746     1177 81992711 69182406    38685 19884023     7090

# buggy patterns: some people go back to being "never married". This is logically impossible and and seems to be due to some data error where unmarried people are catalogued as divorced in the early years
swed_civil[, kod_fixed := ifelse(kod=="OG","OG",NA_character_)] # we set up a variable with only the OG codes, everything else is missing
swed_civil[, kod_fixed := ave(kod_fixed, LOPNR, FUN = function(kod_fixed) { na.locf(kod_fixed, na.rm=F, fromLast = T) })] # we carry the OGs backward, so that everybody is always unmarried until the most recent time they were counted as unmarried (implicitly assuming that recent data is more reliable)
swed_civil[is.na(kod_fixed), kod_fixed := kod] # for all years that weren't followed by a "never married status", we take the old codes


swed_civil[, status_end := NA] # find the points in the time series where the civil status changes
swed_civil[, status_end := lead(kod_fixed) != kod_fixed] # whenever the code changes, obviously (divorces, spousal death, marriages)
swed_civil[, status_end := ifelse(c(!is.na(diff(CIVILANTAR)) & diff(CIVILANTAR) < diff(year), F), TRUE, status_end)] # whenever the number of years with the same status changes (could be divorced and remarried within less than the year difference). Should theoretically encompass all the above cases, but we don't trust this.
swed_civil[, status_end := ifelse(lead(LOPNR) == LOPNR, status_end, T)] # whenever the person changes, that is a status end too, it's basically their "last seen as"
swed_civil[nrow(swed_civil), status_end := TRUE] # last row


# now we try to count marriages with all available information
swed_civil[, marriage_count := ifelse(kod_fixed == "G", 1, 0)] # if you marry, you gain one, OG = constant (for the sake of comparability and because legislation related to this is rather recent, we also treat registered partnerships as 0s)
#####################
### special cases ###
#####################
# first row for each person: if someone starts out as widow(er)/divorcee, then count as 1
swed_civil[!duplicated(LOPNR) & kod_fixed %in% c("S","Ä"), marriage_count := 1]
# if someone goes from being OG straight to being divorced or widowed (esp in the early data, when assessments is five-yearly)
swed_civil[, marriage_count := ifelse(
    ave(kod_fixed, LOPNR, FUN = function(kod_fixed) { !is.na(lag(kod_fixed)) & lag(kod_fixed) =="OG" & kod_fixed %in% c("S","Ä") }),
    1, # is S/Ä right after OG? count as 1, not -1
    marriage_count)] # other wise keep value

# now we count "cumulative" spouses (number of spouses had in whole life in given year)
swed_civil[, cumul_spouses := NA_real_]
swed_civil[status_end == TRUE, cumul_spouses := ave(marriage_count,LOPNR, FUN = cumsum)] ## we only use the times when the status ends
crosstabs(swed_civil[lead(LOPNR) != LOPNR, cumul_spouses]) # personal maxima

## swed_civil[lead(LOPNR) != LOPNR, cumul_spouses]
##       0       1       2       3       4       5       6       7       8 
## 4462254 6944642 1107224  108534    7255     486      58       6       4 
##       9 
##       1

swed_civil_s = swed_civil %>%
rename(idIndividu = LOPNR) %>% # now make one row per person
    group_by(idIndividu) %>% 
    summarise(
        marriage_codes = paste(sort(unique(kod_fixed)), collapse=","), 
        ever_married_narrow = ifelse(any(kod_fixed == 'G'), 1, 0), # this is the simplest (fewest errors?) way to determine if someone was ever married
        ever_married = ifelse(any(kod_fixed %in% c('G','S','Ä')), 1, 0), # this is the inclusive way to determine if someone was ever married, includes people for whom we only know they were divorced. only slight discrepancy with the above after error correction
        years_married = sum(CIVILANTAR[ kod_fixed == "G" & status_end == T]), # for years married we take a more conservative approach and don't put so much effort into reconstructing unobserved marriages
        spouses = max(cumul_spouses,na.rm=T), # for each person we take lifetime number of spouses
        times_divorced = sum(kod_fixed == "S" & status_end == T), # count S
        year_of_first_marriage = ifelse(any(marriage_count == 1), min(year[marriage_count == 1], na.rm=T), NA_real_), # won't work for everyone, we're not using years_married to count back
        times_widowed = sum(kod_fixed == "Ä" & status_end == T), # count Ä
        years_unmarried = sum(CIVILANTAR[ kod_fixed == "OG" & status_end == T])
) %>% data.table()

swed_civil_s$ever_divorced = ifelse(swed_civil_s$times_divorced > 1, 1,0)
# qplot(years_married, data=swed_civil_s)
# qplot(years_married, data=swed_civil_s) + xlim(1,NA)
# qplot(ever_married, data=swed_civil_s)
# qplot(years_married,years_unmarried, geom = 'jitter', data=swed_civil_s)
rm(swed_civil) # free memory

crosstabs(~ years_married + ever_married, data=swed_civil_s)

##              ever_married
## years_married       0       1
##          0    4044284  931033
##          1          0  124615
##          2          0  126998
##          3          0  124607
##          4          0  119152
##          5          0  113206
##          6          0  103153
##          7          0   98574
##          8          0   92754
##          9          0   95272
##          10         0   85548
##          11         0   78329
##          12         0   76730
##          13         0   76345
##          14         0   74645
##          15         0   74236
##          16         0   73153
##          17         0   75402
##          18         0   72494
##          19         0   75613
##          20         0  146811
##          21         0   71191
##          22         0   66565
##          23         0   62447
##          24         0   62338
##          25         0   60659
##          26         0   58970
##          27         0   58562
##          28         0   58223
##          29         0   59036
##          30         0   58535
##          31         0   57761
##          32         0   59905
##          33         0   64158
##          34         0   66150
##          35         0   67932
##          36         0   62780
##          37         0   63975
##          38         0   66159
##          39         0   71019
##          40         0   76885
##          41         0   81136
##          42         0   86306
##          43         0   89134
##          44         0   89034
##          45         0   87608
##          46         0   83593
##          47         0   84392
##          48         0   83845
##          49         0   82298
##          50         0   80882
##          51         0   80379
##          52         0   78960
##          53         0   75088
##          54         0   73158
##          55         0   68787
##          56         0   63761
##          57         0   57827
##          58         0   52369
##          59         0   45741
##          60         0   39196
##          61         0   33060
##          62         0   27362
##          63         0   21284
##          64         0   16323
##          65         0   11383
##          66         0    7973
##          67         0    5362
##          68         0    3403
##          69         0    2165
##          70         0    1262
##          71         0     772
##          72         0     425
##          73         0     242
##          74         0     156
##          75         0      76
##          76         0      72
##          77         0      51
##          78         0      25
##          79         0      32
##          80         0      15
##          81         0       8
##          82         0      18
##          83         0       7
##          84         0      11
##          85         0      11
##          86         0       8
##          87         0       7
##          88         0       5
##          89         0       2
##          90         0       6
##          91         0       2
##          92         0       7
##          93         0       3
##          94         0       3
##          95         0       5
##          96         0       4
##          97         0       6
##          98         0       3
##          99         0       4
##          100        0       3
##          101        0       1
##          102        0       1
##          103        0       2
##          104        0       4
##          105        0       4
##          106        0       3
##          107        0       8
##          108        0       1
##          109        0       2
##          112        0       3
##          113        0       1
##          114        0       2
##          115        0       1
##          116        0       2
##          117        0       1
##          119        0       3
##          120        0       2
##          121        0       2
##          122        0       2
##          123        0       2
##          128        0       1
##          <NA>       0 2859153

crosstabs(~ years_married + spouses, data=swed_civil_s)

##              spouses
## years_married       0       1       2       3       4       5       6
##          0    4462254  511697    1328      37       1       0       0
##          1          0  121773    2737      99       6       0       0
##          2          0  122576    4238     169      14       1       0
##          3          0  118532    5775     279      18       2       0
##          4          0  111640    7059     406      44       1       2
##          5          0  104733    7934     493      42       4       0
##          6          0   94095    8400     604      50       2       2
##          7          0   89120    8698     674      71      10       1
##          8          0   83234    8734     702      73       9       2
##          9          0   85814    8618     760      71       8       0
##          10         0   75543    9146     774      75       8       2
##          11         0   69042    8466     731      79      10       1
##          12         0   67362    8523     775      60       9       1
##          13         0   67446    8163     669      56       8       2
##          14         0   65780    8162     641      56       4       2
##          15         0   65717    7860     604      47       8       0
##          16         0   64911    7597     584      51       6       4
##          17         0   67471    7367     530      31       2       1
##          18         0   64749    7193     514      36       2       0
##          19         0   67908    7243     422      38       1       1
##          20         0  138316    8044     416      34       0       1
##          21         0   63546    7230     395      17       3       0
##          22         0   58721    7450     371      22       0       1
##          23         0   54541    7483     406      16       1       0
##          24         0   53448    8462     402      24       2       0
##          25         0   51262    8936     435      22       3       0
##          26         0   49134    9369     450      17       0       0
##          27         0   48666    9417     451      26       2       0
##          28         0   47830    9837     530      26       0       0
##          29         0   47737   10717     559      22       0       1
##          30         0   46902   11047     565      21       0       0
##          31         0   46771   10352     612      26       0       0
##          32         0   49413    9811     654      27       0       0
##          33         0   54023    9413     696      25       1       0
##          34         0   55026   10375     722      26       1       0
##          35         0   56866   10291     749      23       3       0
##          36         0   52225    9658     877      20       0       0
##          37         0   53569    9527     860      19       0       0
##          38         0   56069    9205     860      23       2       0
##          39         0   28793   41485     726      15       0       0
##          40         0   26328   49888     661       8       0       0
##          41         0   28443   52180     505       8       0       0
##          42         0   31047   54821     427      11       0       0
##          43         0   33427   55365     339       3       0       0
##          44         0   36188   52603     243       0       0       0
##          45         0   38940   48504     163       1       0       0
##          46         0   40889   42613      88       2       1       0
##          47         0   44364   39980      48       0       0       0
##          48         0   47262   36576       7       0       0       0
##          49         0   76196    6096       6       0       0       0
##          50         0   80656     221       5       0       0       0
##          51         0   80207     168       4       0       0       0
##          52         0   78799     160       0       1       0       0
##          53         0   74956     128       3       1       0       0
##          54         0   73038     118       2       0       0       0
##          55         0   68679     105       3       0       0       0
##          56         0   63694      63       4       0       0       0
##          57         0   57757      68       2       0       0       0
##          58         0   52296      70       3       0       0       0
##          59         0   45689      50       2       0       0       0
##          60         0   39150      44       2       0       0       0
##          61         0   33022      38       0       0       0       0
##          62         0   27336      26       0       0       0       0
##          63         0   21250      31       3       0       0       0
##          64         0   16292      29       2       0       0       0
##          65         0   11368      14       1       0       0       0
##          66         0    7958      10       5       0       0       0
##          67         0    5346      15       1       0       0       0
##          68         0    3389      12       2       0       0       0
##          69         0    2149      15       1       0       0       0
##          70         0    1249      10       3       0       0       0
##          71         0     762       9       1       0       0       0
##          72         0     415      10       0       0       0       0
##          73         0     236       3       3       0       0       0
##          74         0     144      10       2       0       0       0
##          75         0      68       8       0       0       0       0
##          76         0      65       7       0       0       0       0
##          77         0      42       7       2       0       0       0
##          78         0      20       4       1       0       0       0
##          79         0      26       4       2       0       0       0
##          80         0      12       2       1       0       0       0
##          81         0       5       2       1       0       0       0
##          82         0       6      11       1       0       0       0
##          83         0       6       1       0       0       0       0
##          84         0       3       8       0       0       0       0
##          85         0       2       7       2       0       0       0
##          86         0       4       4       0       0       0       0
##          87         0       1       6       0       0       0       0
##          88         0       0       2       3       0       0       0
##          89         0       2       0       0       0       0       0
##          90         0       3       3       0       0       0       0
##          91         0       0       2       0       0       0       0
##          92         0       2       5       0       0       0       0
##          93         0       0       3       0       0       0       0
##          94         0       0       3       0       0       0       0
##          95         0       0       5       0       0       0       0
##          96         0       0       4       0       0       0       0
##          97         0       0       6       0       0       0       0
##          98         0       0       3       0       0       0       0
##          99         0       0       4       0       0       0       0
##          100        0       0       3       0       0       0       0
##          101        0       0       1       0       0       0       0
##          102        0       0       1       0       0       0       0
##          103        0       0       2       0       0       0       0
##          104        0       0       4       0       0       0       0
##          105        0       0       4       0       0       0       0
##          106        0       0       3       0       0       0       0
##          107        0       0       8       0       0       0       0
##          108        0       0       1       0       0       0       0
##          109        0       0       2       0       0       0       0
##          112        0       0       3       0       0       0       0
##          113        0       0       1       0       0       0       0
##          114        0       0       2       0       0       0       0
##          115        0       0       1       0       0       0       0
##          116        0       0       2       0       0       0       0
##          117        0       0       1       0       0       0       0
##          119        0       0       2       0       0       1       0
##          120        0       0       2       0       0       0       0
##          121        0       0       2       0       0       0       0
##          122        0       0       2       0       0       0       0
##          123        0       0       2       0       0       0       0
##          128        0       0       1       0       0       0       0
##          <NA>       0 2463454  305660   83777    5850     371      34
##              spouses
## years_married       7       8       9
##          0          0       0       0
##          1          0       0       0
##          2          0       0       0
##          3          1       0       0
##          4          0       0       0
##          5          0       0       0
##          6          0       0       0
##          7          0       0       0
##          8          0       0       0
##          9          0       1       0
##          10         0       0       0
##          11         0       0       0
##          12         0       0       0
##          13         0       1       0
##          14         0       0       0
##          15         0       0       0
##          16         0       0       0
##          17         0       0       0
##          18         0       0       0
##          19         0       0       0
##          20         0       0       0
##          21         0       0       0
##          22         0       0       0
##          23         0       0       0
##          24         0       0       0
##          25         1       0       0
##          26         0       0       0
##          27         0       0       0
##          28         0       0       0
##          29         0       0       0
##          30         0       0       0
##          31         0       0       0
##          32         0       0       0
##          33         0       0       0
##          34         0       0       0
##          35         0       0       0
##          36         0       0       0
##          37         0       0       0
##          38         0       0       0
##          39         0       0       0
##          40         0       0       0
##          41         0       0       0
##          42         0       0       0
##          43         0       0       0
##          44         0       0       0
##          45         0       0       0
##          46         0       0       0
##          47         0       0       0
##          48         0       0       0
##          49         0       0       0
##          50         0       0       0
##          51         0       0       0
##          52         0       0       0
##          53         0       0       0
##          54         0       0       0
##          55         0       0       0
##          56         0       0       0
##          57         0       0       0
##          58         0       0       0
##          59         0       0       0
##          60         0       0       0
##          61         0       0       0
##          62         0       0       0
##          63         0       0       0
##          64         0       0       0
##          65         0       0       0
##          66         0       0       0
##          67         0       0       0
##          68         0       0       0
##          69         0       0       0
##          70         0       0       0
##          71         0       0       0
##          72         0       0       0
##          73         0       0       0
##          74         0       0       0
##          75         0       0       0
##          76         0       0       0
##          77         0       0       0
##          78         0       0       0
##          79         0       0       0
##          80         0       0       0
##          81         0       0       0
##          82         0       0       0
##          83         0       0       0
##          84         0       0       0
##          85         0       0       0
##          86         0       0       0
##          87         0       0       0
##          88         0       0       0
##          89         0       0       0
##          90         0       0       0
##          91         0       0       0
##          92         0       0       0
##          93         0       0       0
##          94         0       0       0
##          95         0       0       0
##          96         0       0       0
##          97         0       0       0
##          98         0       0       0
##          99         0       0       0
##          100        0       0       0
##          101        0       0       0
##          102        0       0       0
##          103        0       0       0
##          104        0       0       0
##          105        0       0       0
##          106        0       0       0
##          107        0       0       0
##          108        0       0       0
##          109        0       0       0
##          112        0       0       0
##          113        0       0       0
##          114        0       0       0
##          115        0       0       0
##          116        0       0       0
##          117        0       0       0
##          119        0       0       0
##          120        0       0       0
##          121        0       0       0
##          122        0       0       0
##          123        0       0       0
##          128        0       0       0
##          <NA>       4       2       1

crosstabs(~ spouses + ever_married, data=swed_civil_s)

##        ever_married
## spouses       0       1
##       0 4044284  417970
##       1       0 6944643
##       2       0 1107224
##       3       0  108534
##       4       0    7255
##       5       0     486
##       6       0      58
##       7       0       6
##       8       0       4
##       9       0       1

props(~ ever_married, swed_civil_s)

## ever_married
##      0      1 
## 0.3202 0.6798

Transforming data

# swed = read.csv("/Volumes/Elements/swed.csv")
# save(swed, file = "data/swed2.rdata")
load(file="data/swed2.rdata")

# load(file="data/swed.rdata")
# char_vars = sapply(swed,is.character) | sapply(swed, is.factor)
# swed[,char_vars] = plyr::colwise(function(x) { 
#                   type.convert(as.character(x),as.is=TRUE)
#               })(swed[,char_vars])
names(swed) = c("idIndividu", "idPere", "idMere", "idPaternalGrandfather", "idPaternalGrandmother", "idMaternalGrandfather", "idMaternalGrandmother", "sex", "education", "education_Father", "education_Mother", "education.MaternalGrandmother", "education.MaternalGrandfather", "education.PaternalGrandfather", "education.PaternalGrandmother", "byear", "byear.Father", "byear.Mother", "byear.PaternalGrandfather", "byear.MaternalGrandfather", "byear.PaternalGrandmother", "byear.MaternalGrandmother", "dyear", "dyear.Father", "dyear.Mother", "dyear.PaternalGrandfather", "dyear.MaternalGrandfather", "dyear.PaternalGrandmother", "dyear.MaternalGrandmother")
swed = as.data.table(swed)
# swed = head(swed, 1e6)
dupes = swed$idIndividu[duplicated(swed$idIndividu)]
# swed[LOPNR %in% dupes, ]
swed = swed[!duplicated(swed$idIndividu),]

swed = merge(swed, swed_civil_s, by = "idIndividu", all.x = T)
rm(swed_civil_s)

swed[, age_at_first_marriage := year_of_first_marriage - byear]
# qplot( year_of_first_marriage-byear, years_unmarried,data=swed, geom = 'jitter', alpha = I(0.3)) + facet_wrap(~ byear)
# qplot(byear, times_divorced, data = swed, stat = 'summary', fun.data = 'mean_sdl')
# qplot(byear, times_widowed, data = swed, stat = 'summary', fun.data = 'mean_sdl')
# qplot(byear, times_divorced, data = swed, stat = 'summary', fun.data = 'mean_sdl')

sort(props(~ swed$marriage_codes))

## swed$marriage_codes
## Ä,EP,G,OG,RP,S    Ä,EP,G,RP,S      Ä,EP,OG,S       Ä,G,RP,S        Ä,OG,RP 
##      1.219e-07      1.219e-07      1.219e-07      1.219e-07      1.219e-07 
##        EP,G,OG           EP,S        OG,S,SP   Ä,G,OG,RP,SP   EP,G,OG,RP,S 
##      1.219e-07      1.219e-07      1.219e-07      2.438e-07      2.438e-07 
##  EP,G,OG,RP,SP      EP,G,OG,S    EP,OG,RP,SP             RP     Ä,EP,OG,RP 
##      2.438e-07      2.438e-07      2.438e-07      2.438e-07      3.658e-07 
##      Ä,G,OG,RP   Ä,EP,G,OG,RP      EP,G,RP,S           RP,S        Ä,EP,OG 
##      4.877e-07      7.315e-07      8.535e-07      8.535e-07      9.754e-07 
##     OG,RP,S,SP      G,OG,S,SP        G,OG,SP          EP,OG      G,RP,S,SP 
##      1.097e-06      1.219e-06      1.707e-06      1.829e-06      2.317e-06 
##          OG,SP        OG,RP,S     EP,G,OG,RP       EP,OG,RP         G,RP,S 
##      2.560e-06      2.682e-06      5.121e-06      5.243e-06      9.998e-06 
##         Ä,OG,S   G,OG,RP,S,SP              Ä     G,OG,RP,SP      G,OG,RP,S 
##      1.146e-05      1.402e-05      2.499e-05      3.572e-05      4.218e-05 
##       OG,RP,SP        G,OG,RP            Ä,S          OG,RP           Ä,OG 
##      7.425e-05      1.196e-04      1.264e-04      3.138e-04      4.022e-04 
##       Ä,G,OG,S              S         Ä,G,OG           OG,S            Ä,G 
##      5.116e-04      4.196e-03      4.804e-03      5.114e-03      5.699e-03 
##          Ä,G,S              G         G,OG,S            G,S           G,OG 
##      8.315e-03      1.695e-02      7.141e-02      8.610e-02      2.035e-01 
##           <NA>             OG 
##      2.176e-01      3.746e-01

sort(props(~ swed$ever_married))

## swed$ever_married
##   <NA>      0      1 
## 0.2176 0.3750 0.4074

sort(props(~ swed[swed$byear < 1960,]$ever_married))

## swed[swed$byear < 1960, ]$ever_married
##     <NA>        0        1 
## 0.009808 0.180113 0.810079

swed[is.na(ever_married), ever_married := 0] # have verified that those missing from the civil file did not get far


miss_frac(swed)

## [1] 0

# make male var.
crosstabs(swed$sex)

## swed$sex
##       1       2 
## 4206572 3995396

swed[, male := Recode(sex,"'1'=1;'2'=0;else=NA")]
swed$sex = NULL

## Warning in alloc.col(x): Attempt to reduce allocation from 105 to 104
## ignored. Can only increase allocation via shallow copy.

# crosstabs(swed$dyear)
swed[, age := dyear-byear ]
# qplot(swed$age)
# library(mgcv)
# qplot(data=swed,dyear,ifelse(age < 1, 0, 1)) + geom_smooth()
# qplot(data=swed,dyear,age) + geom_smooth()

# we believe that those who don't have a death date by 2009 and who were born after 1962, were alive by 2009. this probably incorrectly treats some out-migrants.
swed[byear>=1962 & is.na(dyear) & byear < 2009, age_at_least := as.integer(2009)-byear]
swed[!is.na(age), age_at_least := age]
swed[age < 0, age := NA]

swed[, survive1y := ifelse(age_at_least <= 1, 0, 1) ] # we are conservative in our survival estimation and would estimate 1;11 years as dead in the first year in the worst case (because we only have birth years, not months) 
swed[, survive5y := ifelse(age_at_least <= 5, 0, 1) ]
swed[,surviveR := ifelse(age_at_least <= 15, 0, 1)]

crosstabs(~ survive1y + byear, data = swed)

##          byear
## survive1y   1932   1933   1934   1935   1936   1937   1938   1939   1940
##      0         0      0      0      0      0      0      0      0      0
##      1      7741  13121  14622  14447  14020  13313  12895  12809  11585
##      <NA>  19311  36165  44551  49456  54761  57923  62556  66790  68187
##          byear
## survive1y   1941   1942   1943   1944   1945   1946   1947   1948   1949
##      0         0      0      0      0      0      1      4      4      3
##      1     11739  12449  13055  12974  11955  11066  10013   9076   8061
##      <NA>  73866  87560  97828 106179 108817 110600 109889 109842 106781
##          byear
## survive1y   1950   1951   1952   1953   1954   1955   1956   1957   1958
##      0         1      2      1      6      3      2      3      5      2
##      1      7394   6539   6050   5656   5185   5067   4755   4360   4011
##      <NA> 102753  99036 100723 101876  98548 101314 102850 102763 101746
##          byear
## survive1y   1959   1960   1961   1962   1963   1964   1965   1966   1967
##      0         4    158   1399   1372   1385   1431   1324   1263   1268
##      1      3662   3546   3515 110197 116267 126024 126374 126463 125090
##      <NA> 102396 100194 103991      0      0      0      0      0      0
##          byear
## survive1y   1968   1969   1970   1971   1972   1973   1974   1975   1976
##      0      1219    980    984   1004    929    826    856    700    666
##      1    117406 111587 113380 117419 116096 114157 114944 108898 103430
##      <NA>      0      0      0      0      0      0      0      0      0
##          byear
## survive1y   1977   1978   1979   1980   1981   1982   1983   1984   1985
##      0       610    578    594    569    538    523    575    568    577
##      1    101411  98826 102194 103602 100940 100500  99680 102345 106941
##      <NA>      0      0      0      0      0      0      0      0      0
##          byear
## survive1y   1986   1987   1988   1989   1990   1991   1992   1993   1994
##      0       525    572    565    585    649    627    560    483    434
##      1    110470 113027 120897 124457 132234 131360 129139 123121 118274
##      <NA>      0      0      0      0      0      0      0      0      0
##          byear
## survive1y   1995   1996   1997   1998   1999   2000   2001   2002   2003
##      0       377    337    296    271    265    283    306    275    296
##      1    109307 101234  96507  95279  94383  96533  96688 100907 103633
##      <NA>      0      0      0      0      0      0      0      0      0
##          byear
## survive1y   2004   2005   2006   2007   2008   2009
##      0       284    253    272    243 109526    256
##      1    105393 105367 109071 109358     23      0
##      <NA>      0      0      0      0      0 105725

# bugs probably due to re-assigned personnummer
nrow(swed[!is.na(education) & survive1y == 0,])

## [1] 37

swed[!is.na(education) & survive1y == 0, survive1y := NA]
swed[!is.na(education) & survive1y == 0, age_at_least := NA]
swed[!is.na(education) & survive5y == 0, survive5y := NA]
swed[!is.na(education) & survive5y == 0, age_at_least := NA]
nrow(swed[education > 2 & surviveR == 0,])

## [1] 35

swed[education > 2 & surviveR == 0, surviveR := NA]
swed[education > 2 & surviveR == 0, age_at_least := NA]

swed[, paternalage := byear - byear.Father]

swed[, idParents := str_c(idMere,"_",idPere)]

swed[, maternalage := byear - byear.Mother]

# qplot(maternalage,paternalage,data=swed,geom="jitter",alpha=I(0.1))
cor.test(swed$maternalage,swed$paternalage)

## 
##  Pearson's product-moment correlation
## 
## data:  swed$maternalage and swed$paternalage
## t = 3100, df = 8100000, p-value <2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7365 0.7372
## sample estimates:
##    cor 
## 0.7368

swed$born = 1

Calculating reproductive success

swed = swed[order(idParents,byear), ]
swed[ , first_child_with_this_partner := ifelse( !duplicated(idParents), 1, 0)  ]
swed = count_and_merge(swed, 'fertile_unions', wt_var = "first_child_with_this_partner")
crosstabs(~ fertile_unions + spouses, data = swed)

##               spouses
## fertile_unions       0       1       2       3       4       5       6
##             0  2209839  273143   51614    5576     355      20       3
##             1   805108 1921584  654070   53067    2582     108      11
##             2   102010  169292   94192   25748    2136     161       8
##             3    10435   15535   10292    4039     583      62      12
##             4     1141    1643    1305     606     125      18       4
##             5      151     241     177     104      28       4       2
##             6       26      38      54      20       4       1       1
##             7       10       4      13       5       2       1       1
##             8        1       2       3       1       0       0       0
##             9        0       1       0       0       0       0       0
##             10       0       0       1       0       0       0       0
##             11       0       1       0       0       0       0       0
##               spouses
## fertile_unions       7       8       9    <NA>
##             0        0       0       0 1784037
##             1        2       1       0     584
##             2        1       0       0      12
##             3        1       2       0       0
##             4        1       0       0       1
##             5        1       0       1       0
##             6        0       0       0       0
##             7        0       0       0       0
##             8        0       0       0       0
##             9        0       0       0       0
##             10       0       0       0       0
##             11       0       0       0       0

swed = count_and_merge(swed, 'children', wt_var = "born")
swed$children.per.spouse = swed$children/swed$spouses
swed$children.per.spouse[which(swed$spouses==0)] = NA
# qplot(swed$children.per.spouse)
# qplot(swed$children)
# qplot(swed$spouses)

changeNAto1 = function(x) { colwise(function(x) { ifelse(is.na(x), 1, x)})(x) }

# bugs probably owing to re-assigment of personnummer
nrow(swed[children>0 & surviveR == 0,])

## [1] 25

swed[children>0 & survive1y == 0, survive1y := NA]
swed[children>0 & surviveR == 0, surviveR := NA]
swed[children>0 & surviveR == 0, age_at_least := NA]
nrow(swed[spouses > 0 & surviveR == 0,])

## [1] 3690

swed[spouses > 0 & survive1y == 0, survive1y := NA]
swed[spouses > 0 & surviveR == 0, survive5y := NA]
swed[spouses > 0 & surviveR == 0, surviveR := NA]
swed[spouses > 0 & surviveR == 0, age_at_least := NA]

swed = count_and_merge(swed, 'children.surviving1y', wt_var = 'survive1y')
swed = count_and_merge(swed, 'children.surviving5y', wt_var = 'survive5y')
swed = count_and_merge(swed, 'children.survivingR', wt_var = 'surviveR')

# swed = count_and_merge(swed, 'children.spouses', wt_var = 'spouses')
# swed = count_and_merge(swed, 'grandchildren.per.spouse', wt_var = 'children.per.spouse')

swed = count_and_merge(swed, 'grandchildren',wt_var='children')
swed = count_and_merge(swed, 'grandchildren.surviving1y', wt_var = 'children.surviving1y')
swed = count_and_merge(swed, 'grandchildren.survivingR', wt_var = 'children.survivingR')

swed[, dead1y := ifelse(age_at_least >= 1, 0, 1)]
swed = count_and_merge(swed, 'children.dead1y', wt_var = 'dead1y')

swed[, children.wddate := children.dead1y + children.surviving1y] 

xtabs(~ (grandchildren>0) +(children>0) + surviveR,data=swed,exclude=NULL, na.action= na.pass)

## , , surviveR = 0
## 
##                  children > 0
## grandchildren > 0   FALSE    TRUE
##             FALSE 1594483       0
##             TRUE        0       0
## 
## , , surviveR = 1
## 
##                  children > 0
## grandchildren > 0   FALSE    TRUE
##             FALSE 2160586 1563637
##             TRUE        0  184535
## 
## , , surviveR = NA
## 
##                  children > 0
## grandchildren > 0   FALSE    TRUE
##             FALSE  568287  828315
##             TRUE        0 1302125

# qplot(swed$children)
# qplot(swed[which(swed$age > 15),]$children )
# qplot(swed$grandchildren)
# qplot(swed[which(swed$age > 15),]$grandchildren )
# qplot(swed$spouses)

pre-calculate some predictors

swed = swed[order(idParents,byear), ]
swed <- transform(swed, birthorder = ave(rep(NA, nrow(swed)), swed$idParents, FUN = seq_along)) # old trick to get birth order, don't know what this does to those with missings for father though
# qplot(swed$birthorder,binwidth=1)

swed <- transform(swed, min15.birthorder = ave(surviveR, idPere, FUN =function(x) { x[is.na(x)] = 0
cumsum(x)
} ))
xtabs(data=swed, ~is.na(birthorder) + is.na(min15.birthorder))

##                  is.na(min15.birthorder)
## is.na(birthorder)   FALSE
##             FALSE 8141033
##             TRUE    60935

table(swed$min15.birthorder,exclude=NULL)

## 
##       0       1       2       3       4       5       6       7       8 
## 3709277 2463803 1370459  474328  128488   36357   11935    4291    1748 
##       9      10      11      12      13      14      15      16      17 
##     692     319     145      59      33      18      10       4       1 
##      18    <NA> 
##       1       0

swed[, nr.siblings := ave(born,idParents,FUN= function(x) { sum(x,na.rm=T) } ) - 1] 
# qplot(swed$nr.siblings,binwidth=1)

count dependent sibs

swed = swed %>%
    group_by(idParents) %>%
    mutate(
        dependent_sibs_f5y = dependent_sibs_f5y(survive1y=survive1y, byear=byear, dyear=dyear)
    ) %>% data.table()

swed[, birth.cohort := year_bins(byear)]
crosstabs(swed$birth.cohort)

## swed$birth.cohort
## 1930-1935 1935-1940 1940-1945 1945-1950 1950-1955 1955-1960 1960-1965 
##    135511    358970    495422    596112    533773    532940    569479 
## 1965-1970 1970-1975 1975-1980 1980-1985 1985-1990 1990-1995 1995-2000 
##    612974    580595    517907    509840    578616    636881    498256 
## 2000-2005 2005-2010 
##    504598    540094

swed %>% 
    tbl_df()  %>% 
mutate(
                maternal_loss_age = dyear.Mother - byear
                 ,maternal_loss_age = as.numeric(ifelse(maternal_loss_age >= -1 & maternal_loss_age < 0, 0, maternal_loss_age))
                 ,maternal_loss  = as.character(cut(maternal_loss_age, breaks = c(0,1,5,10,15,20,25,30,35,40,45), include.lowest = T ))
                 ,maternal_loss = ifelse( maternal_loss_age >= 45, "later", maternal_loss)
                 ,maternal_loss = ifelse(is.na(maternal_loss_age) | is.na(maternal_loss), "unclear", maternal_loss)
                 ,maternal_loss = factor(maternal_loss, levels = c("later","[0,1]", "(1,5]", "(5,10]", "(10,15]", "(15,20]", "(20,25]", "(25,30]", "(30,35]",  "(35,40]", "(40,45]", "unclear"))
                 
                 ,paternal_loss_age = dyear.Father - byear
                 ,paternal_loss_age = as.numeric(ifelse(paternal_loss_age >= -1 & paternal_loss_age < 0, 0, paternal_loss_age))
                 ,paternal_loss  = as.character(cut(paternal_loss_age, breaks = c(0,1,5,10,15,20,25,30,35,40,45), include.lowest = T ))
                 ,paternal_loss = ifelse( paternal_loss_age >= 45, "later", paternal_loss)
                 ,paternal_loss = ifelse(is.na(paternal_loss_age) | is.na(paternal_loss), "unclear", paternal_loss)
                 ,paternal_loss = factor(paternal_loss, levels = c("later","[0,1]", "(1,5]",  "(5,10]", "(10,15]", "(15,20]", "(20,25]", "(25,30]", "(30,35]",  "(35,40]", "(40,45]", "unclear"))
                 ) %>%
    data.table() ->
    swed
crosstabs(swed$maternal_loss)

## swed$maternal_loss
##   later   [0,1]   (1,5]  (5,10] (10,15] (15,20] (20,25] (25,30] (30,35] 
## 1274984    2151    7869   16189   27191   45547   71480  106952  149433 
## (35,40] (40,45] unclear 
##  204279  208744 6087149

crosstabs(swed$paternal_loss)

## swed$paternal_loss
##   later   [0,1]   (1,5]  (5,10] (10,15] (15,20] (20,25] (25,30] (30,35] 
## 1096763    7248   19961   38395   63816  108260  170261  251155  328328 
## (35,40] (40,45] unclear 
##  394178  339366 5384237

swed$older_siblings = factor(ifelse((swed$birthorder - 1) > 4,"5+", swed$birthorder - 1))
swed$last_born = ifelse(swed$birthorder == swed$nr.siblings, 1, 0)

swed = swed[order(swed$idParents,swed$byear), ]
swed <- transform(swed, siblings = ave(rep(NA, nrow(swed)), swed$idParents, FUN = length)-1) # sibling count
swed <- transform(swed, birthorder = ave(rep(NA, nrow(swed)), swed$idParents, FUN = seq_along)) # old trick to get birth order, don't know what this does to those with missings for father though
qplot(swed$birthorder)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 60935 rows containing non-finite values (stat_bin).

plot of chunk unnamed-chunk-4

swed$younger_siblings = swed$siblings + 1 - swed$birthorder

swed = swed %>%
    group_by(idParents) %>%
    mutate(
        younger_sibs_ad_5y = younger_sibs_alive_and_dependent(survive5y=survive5y, byear=byear, dyear=dyear) ,
        older_sibs_ad_5y = older_sibs_alive_and_dependent(survive5y=survive5y, byear=byear, dyear=dyear),
        dependent_sibs_f5y = dependent_sibs_f5y(survive1y=survive1y, byear=byear, dyear=dyear)
    ) %>% data.table()

swed[, paternalage := paternalage/10]
swed[, maternalage := maternalage/10]
swed = recenter.pat(recenter.pat(swed), what = "maternalage")

min_na = function(x) { ifelse(all(is.na(x)), NA, min(x,na.rm=T) ) }
max_na = function(x) { ifelse(all(is.na(x)), NA, max(x,na.rm=T) ) }
swed = swed[order(idPere),]
swed[, paternalage_at_1st_sib := ave(paternalage, idPere, FUN = min_na)]
swed[, paternalage_at_last_sib := ave(paternalage, idPere, FUN = max_na)]
swed = swed[order(idMere),]
swed[, maternalage_at_1st_sib := ave(maternalage, idMere, FUN = min_na)]
swed[, maternalage_at_last_sib := ave(maternalage, idMere, FUN = max_na)]
fathers = swed[!duplicated(idPere), list(idPere, paternalage_at_1st_sib, paternalage_at_last_sib)]
names(fathers) = c("idIndividu","age_at_1st_child", "age_at_last_child")
mothers = swed[!duplicated(idMere), list(idMere, maternalage_at_1st_sib, maternalage_at_last_sib)]
names(mothers) = c("idIndividu","age_at_1st_child", "age_at_last_child")
parents = rbind(fathers, mothers)
swed = merge(swed, parents, by = "idIndividu", all.x = T)
rm(parents,fathers,mothers)

swed[, nr.siblings := ave(born,idParents,FUN= function(x) { sum(x,na.rm=T) } ) -1] # dont count self

swed[, maternalage.factor := cut((maternalage*10), breaks = c(14, 20, 35, 61))]
swed[, maternalage.factor := relevel(maternalage.factor, ref = "(20,35]")]
swed[, ever_married := ifelse(spouses>0, 1,0)]
swed[, any_surviving_children := ifelse(children.survivingR > 0, 1, 0)]

swed$birth_cohort = factor(swed$birth.cohort)
swed$male = factor(swed$male)
swed$last_born = factor(swed$last_born)

swed[ , first_sib_byear := ave( byear, idParents, FUN = function(x)min(x,na.rm=T))  ]
swed[ , last_sib_byear := ave( byear, idParents, FUN = function(x)max(x,na.rm=T))  ]
swed[, birth_span := last_sib_byear - first_sib_byear]
hist(swed$birth_span,breaks = 0:35)

plot of chunk unnamed-chunk-6

swed.mp = swed[!is.na(paternalage), ]
swed.1 = swed.mp[byear >= 1962-15  & byear <= 2009-50, ]

swed.2 = swed.mp[byear >= 1969  & byear < 2000, ]

set.seed(seed = 346834964)
swed.2 %>% filter(complete.cases(survive1y, birth_cohort, male, maternalage.factor, paternalage.mean, paternalage, paternal_loss, maternal_loss, older_siblings, nr.siblings, last_born, idParents)) %>% select(idParents) %>% distinct() -> complete_fams
selected_fams = sample(complete_fams$idParents, size = 200000)
swed_subset_survival.1 = swed.2 %>% filter(idParents %in% selected_fams)

swed.1 %>% filter(complete.cases(children, birth_cohort, male, maternalage.factor, paternalage.mean, paternalage, paternal_loss, maternal_loss, older_siblings, nr.siblings, last_born, idParents)) %>% select(idParents) %>% distinct() -> complete_fams
selected_fams = sample(complete_fams$idParents, size = 80000)
swed_subset_children.1 = swed.1 %>% filter(idParents %in% selected_fams)

save(swed,file="swed.rdata")
save(swed.1, file="swed1.rdata")
save(swed.2, file="swed2.rdata")
save(swed_subset_children.1, file="swed_subset_children.rdata")
save(swed_subset_survival.1, file="swed_subset_survival.rdata")