The Swedish Multi-Generation Register was established in 1960 and includes index persons born since 1932 that were alive in 1960. The register also includes the parents of the index persons. The parents are included in the register even if they were not born after 1932 or alive in 1960. The coverage of parents of individuals born in Sweden is approximately 98 % for fathers and 100 % for mothers around 1960 (see graph below).
Left truncation refers to data that is retrospectively limited resulting in exclusion of study subjects. The truncation is dependent on birth strata. Still, matching cases and controls on birth year does not automatically take care of problems due to left truncation. In these studies, the data might be truncated regarding parental ages. Individuals born before the initiation of the Multi-Generation Register in 1932 are excluded due to left truncation of parental age data. However, the Multi-Generation Register includes parental information of individuals born since 1932 and ages of their parents are not extensively subjected to left truncation. In order to collate a sample consisting of 3 generations, parents have to be born after 1932. As a result, older parents are more likely to be excluded than younger parents. These problems may be addressed with several approaches in sensitivity analyses
When an event occurs before the individual comes under observation, this is called left censoring. Studies on paternal age and number of offspring might be subjected to left censoring regarding number of children because children born before 1932 are censored. Different sensitivity analyses can be used to examine potential problems with left censoring.
When an event occurs after the observation time end-point, this is called right censoring. Studies on paternal age in association to fertility might be subjected to left censoring regarding number of children because all children born after December 31 st 2009 are censored. However, problems with censoring can be adjusted for by using Cox regression or conditional logistic regression.
source("0__helpers.R")
library(zoo)
# bsub -q fat -W 48:00 -n 1 Rscript -e "setwd('/usr/users/rarslan/updated_data/'); filebase = '0_swed_massage'; knitr::knit(input = paste0(filebase,'.Rmd'), output = paste0(filebase,'.md'));cat(readLines(paste0(filebase,'.md')), sep = '\n')"
opts_chunk$set(cache=F,cache.lazy=F,tidy=FALSE,autodep=TRUE,dev=c('png','pdf'),fig.width=12,fig.height=7.5,out.width='1440px',out.height='900px')
Info about marriage is called CIVIL. You can be listed as G (married), OG (never married), S (divorced) or Ä (widow/widower). There is also a variable called CIVILANTAR which describes how many years you have had the same CIVIL. This info is updated annually and we have information from 1990-2009.
I have created a file where I just extracted all data about individual marriage status (including all women and men). We actually have info dating back to 1960 for another data source called FOB (folkbokföringsregistret). This data was collected in 1960, 1970, 1975, 1980, 1985 and 1990. FOB does not include CIVILANTAR (number of years with the same CIVIL status).
# system.time({
# swed_civil60_85 = read_sas("~/Downloads/civil_1960_1985.sas7bdat")
# swed_civil60_85 = data.table(swed_civil60_85)
# })
# system.time({
# swed_civil90_09 = read_sas("~/Downloads/civil_1990_2009.sas7bdat")
# swed_civil90_09 = data.table(swed_civil90_09)
# })
# setkey(swed_civil60_85, LOPNR, year)
# setkey(swed_civil90_09, LOPNR, year)
#
# # swed_civil60_85 = head(swed_civil60_85, 1e5)
# # swed_civil90_09 = head(swed_civil90_09, 1e5)
#
# swed_civil = rbind(swed_civil60_85,swed_civil90_09)
# setkey(swed_civil, LOPNR, year) # sort by person, then year
# save(swed_civil,file="data/swed_civil.rdata")
# rm(swed_civil60_85, swed_civil90_09) # free memory
load("data/swed_civil.rdata")
# tail(swed_civil, 100)
# swed_civil[LOPNR == 9862, ]
# swed_civil[LOPNR == 7, ]
# swed_civil[LOPNR==8602, ]
crosstabs(swed_civil$kod)
## swed_civil$kod
## 0G Ä EP G OG RP S SP
## 1 13243746 1177 81992711 69182406 38685 19884023 7090
# buggy patterns: some people go back to being "never married". This is logically impossible and and seems to be due to some data error where unmarried people are catalogued as divorced in the early years
swed_civil[, kod_fixed := ifelse(kod=="OG","OG",NA_character_)] # we set up a variable with only the OG codes, everything else is missing
swed_civil[, kod_fixed := ave(kod_fixed, LOPNR, FUN = function(kod_fixed) { na.locf(kod_fixed, na.rm=F, fromLast = T) })] # we carry the OGs backward, so that everybody is always unmarried until the most recent time they were counted as unmarried (implicitly assuming that recent data is more reliable)
swed_civil[is.na(kod_fixed), kod_fixed := kod] # for all years that weren't followed by a "never married status", we take the old codes
swed_civil[, status_end := NA] # find the points in the time series where the civil status changes
swed_civil[, status_end := lead(kod_fixed) != kod_fixed] # whenever the code changes, obviously (divorces, spousal death, marriages)
swed_civil[, status_end := ifelse(c(!is.na(diff(CIVILANTAR)) & diff(CIVILANTAR) < diff(year), F), TRUE, status_end)] # whenever the number of years with the same status changes (could be divorced and remarried within less than the year difference). Should theoretically encompass all the above cases, but we don't trust this.
swed_civil[, status_end := ifelse(lead(LOPNR) == LOPNR, status_end, T)] # whenever the person changes, that is a status end too, it's basically their "last seen as"
swed_civil[nrow(swed_civil), status_end := TRUE] # last row
# now we try to count marriages with all available information
swed_civil[, marriage_count := ifelse(kod_fixed == "G", 1, 0)] # if you marry, you gain one, OG = constant (for the sake of comparability and because legislation related to this is rather recent, we also treat registered partnerships as 0s)
#####################
### special cases ###
#####################
# first row for each person: if someone starts out as widow(er)/divorcee, then count as 1
swed_civil[!duplicated(LOPNR) & kod_fixed %in% c("S","Ä"), marriage_count := 1]
# if someone goes from being OG straight to being divorced or widowed (esp in the early data, when assessments is five-yearly)
swed_civil[, marriage_count := ifelse(
ave(kod_fixed, LOPNR, FUN = function(kod_fixed) { !is.na(lag(kod_fixed)) & lag(kod_fixed) =="OG" & kod_fixed %in% c("S","Ä") }),
1, # is S/Ä right after OG? count as 1, not -1
marriage_count)] # other wise keep value
# now we count "cumulative" spouses (number of spouses had in whole life in given year)
swed_civil[, cumul_spouses := NA_real_]
swed_civil[status_end == TRUE, cumul_spouses := ave(marriage_count,LOPNR, FUN = cumsum)] ## we only use the times when the status ends
crosstabs(swed_civil[lead(LOPNR) != LOPNR, cumul_spouses]) # personal maxima
## swed_civil[lead(LOPNR) != LOPNR, cumul_spouses]
## 0 1 2 3 4 5 6 7 8
## 4462254 6944642 1107224 108534 7255 486 58 6 4
## 9
## 1
swed_civil_s = swed_civil %>%
rename(idIndividu = LOPNR) %>% # now make one row per person
group_by(idIndividu) %>%
summarise(
marriage_codes = paste(sort(unique(kod_fixed)), collapse=","),
ever_married_narrow = ifelse(any(kod_fixed == 'G'), 1, 0), # this is the simplest (fewest errors?) way to determine if someone was ever married
ever_married = ifelse(any(kod_fixed %in% c('G','S','Ä')), 1, 0), # this is the inclusive way to determine if someone was ever married, includes people for whom we only know they were divorced. only slight discrepancy with the above after error correction
years_married = sum(CIVILANTAR[ kod_fixed == "G" & status_end == T]), # for years married we take a more conservative approach and don't put so much effort into reconstructing unobserved marriages
spouses = max(cumul_spouses,na.rm=T), # for each person we take lifetime number of spouses
times_divorced = sum(kod_fixed == "S" & status_end == T), # count S
year_of_first_marriage = ifelse(any(marriage_count == 1), min(year[marriage_count == 1], na.rm=T), NA_real_), # won't work for everyone, we're not using years_married to count back
times_widowed = sum(kod_fixed == "Ä" & status_end == T), # count Ä
years_unmarried = sum(CIVILANTAR[ kod_fixed == "OG" & status_end == T])
) %>% data.table()
swed_civil_s$ever_divorced = ifelse(swed_civil_s$times_divorced > 1, 1,0)
# qplot(years_married, data=swed_civil_s)
# qplot(years_married, data=swed_civil_s) + xlim(1,NA)
# qplot(ever_married, data=swed_civil_s)
# qplot(years_married,years_unmarried, geom = 'jitter', data=swed_civil_s)
rm(swed_civil) # free memory
crosstabs(~ years_married + ever_married, data=swed_civil_s)
## ever_married
## years_married 0 1
## 0 4044284 931033
## 1 0 124615
## 2 0 126998
## 3 0 124607
## 4 0 119152
## 5 0 113206
## 6 0 103153
## 7 0 98574
## 8 0 92754
## 9 0 95272
## 10 0 85548
## 11 0 78329
## 12 0 76730
## 13 0 76345
## 14 0 74645
## 15 0 74236
## 16 0 73153
## 17 0 75402
## 18 0 72494
## 19 0 75613
## 20 0 146811
## 21 0 71191
## 22 0 66565
## 23 0 62447
## 24 0 62338
## 25 0 60659
## 26 0 58970
## 27 0 58562
## 28 0 58223
## 29 0 59036
## 30 0 58535
## 31 0 57761
## 32 0 59905
## 33 0 64158
## 34 0 66150
## 35 0 67932
## 36 0 62780
## 37 0 63975
## 38 0 66159
## 39 0 71019
## 40 0 76885
## 41 0 81136
## 42 0 86306
## 43 0 89134
## 44 0 89034
## 45 0 87608
## 46 0 83593
## 47 0 84392
## 48 0 83845
## 49 0 82298
## 50 0 80882
## 51 0 80379
## 52 0 78960
## 53 0 75088
## 54 0 73158
## 55 0 68787
## 56 0 63761
## 57 0 57827
## 58 0 52369
## 59 0 45741
## 60 0 39196
## 61 0 33060
## 62 0 27362
## 63 0 21284
## 64 0 16323
## 65 0 11383
## 66 0 7973
## 67 0 5362
## 68 0 3403
## 69 0 2165
## 70 0 1262
## 71 0 772
## 72 0 425
## 73 0 242
## 74 0 156
## 75 0 76
## 76 0 72
## 77 0 51
## 78 0 25
## 79 0 32
## 80 0 15
## 81 0 8
## 82 0 18
## 83 0 7
## 84 0 11
## 85 0 11
## 86 0 8
## 87 0 7
## 88 0 5
## 89 0 2
## 90 0 6
## 91 0 2
## 92 0 7
## 93 0 3
## 94 0 3
## 95 0 5
## 96 0 4
## 97 0 6
## 98 0 3
## 99 0 4
## 100 0 3
## 101 0 1
## 102 0 1
## 103 0 2
## 104 0 4
## 105 0 4
## 106 0 3
## 107 0 8
## 108 0 1
## 109 0 2
## 112 0 3
## 113 0 1
## 114 0 2
## 115 0 1
## 116 0 2
## 117 0 1
## 119 0 3
## 120 0 2
## 121 0 2
## 122 0 2
## 123 0 2
## 128 0 1
## <NA> 0 2859153
crosstabs(~ years_married + spouses, data=swed_civil_s)
## spouses
## years_married 0 1 2 3 4 5 6
## 0 4462254 511697 1328 37 1 0 0
## 1 0 121773 2737 99 6 0 0
## 2 0 122576 4238 169 14 1 0
## 3 0 118532 5775 279 18 2 0
## 4 0 111640 7059 406 44 1 2
## 5 0 104733 7934 493 42 4 0
## 6 0 94095 8400 604 50 2 2
## 7 0 89120 8698 674 71 10 1
## 8 0 83234 8734 702 73 9 2
## 9 0 85814 8618 760 71 8 0
## 10 0 75543 9146 774 75 8 2
## 11 0 69042 8466 731 79 10 1
## 12 0 67362 8523 775 60 9 1
## 13 0 67446 8163 669 56 8 2
## 14 0 65780 8162 641 56 4 2
## 15 0 65717 7860 604 47 8 0
## 16 0 64911 7597 584 51 6 4
## 17 0 67471 7367 530 31 2 1
## 18 0 64749 7193 514 36 2 0
## 19 0 67908 7243 422 38 1 1
## 20 0 138316 8044 416 34 0 1
## 21 0 63546 7230 395 17 3 0
## 22 0 58721 7450 371 22 0 1
## 23 0 54541 7483 406 16 1 0
## 24 0 53448 8462 402 24 2 0
## 25 0 51262 8936 435 22 3 0
## 26 0 49134 9369 450 17 0 0
## 27 0 48666 9417 451 26 2 0
## 28 0 47830 9837 530 26 0 0
## 29 0 47737 10717 559 22 0 1
## 30 0 46902 11047 565 21 0 0
## 31 0 46771 10352 612 26 0 0
## 32 0 49413 9811 654 27 0 0
## 33 0 54023 9413 696 25 1 0
## 34 0 55026 10375 722 26 1 0
## 35 0 56866 10291 749 23 3 0
## 36 0 52225 9658 877 20 0 0
## 37 0 53569 9527 860 19 0 0
## 38 0 56069 9205 860 23 2 0
## 39 0 28793 41485 726 15 0 0
## 40 0 26328 49888 661 8 0 0
## 41 0 28443 52180 505 8 0 0
## 42 0 31047 54821 427 11 0 0
## 43 0 33427 55365 339 3 0 0
## 44 0 36188 52603 243 0 0 0
## 45 0 38940 48504 163 1 0 0
## 46 0 40889 42613 88 2 1 0
## 47 0 44364 39980 48 0 0 0
## 48 0 47262 36576 7 0 0 0
## 49 0 76196 6096 6 0 0 0
## 50 0 80656 221 5 0 0 0
## 51 0 80207 168 4 0 0 0
## 52 0 78799 160 0 1 0 0
## 53 0 74956 128 3 1 0 0
## 54 0 73038 118 2 0 0 0
## 55 0 68679 105 3 0 0 0
## 56 0 63694 63 4 0 0 0
## 57 0 57757 68 2 0 0 0
## 58 0 52296 70 3 0 0 0
## 59 0 45689 50 2 0 0 0
## 60 0 39150 44 2 0 0 0
## 61 0 33022 38 0 0 0 0
## 62 0 27336 26 0 0 0 0
## 63 0 21250 31 3 0 0 0
## 64 0 16292 29 2 0 0 0
## 65 0 11368 14 1 0 0 0
## 66 0 7958 10 5 0 0 0
## 67 0 5346 15 1 0 0 0
## 68 0 3389 12 2 0 0 0
## 69 0 2149 15 1 0 0 0
## 70 0 1249 10 3 0 0 0
## 71 0 762 9 1 0 0 0
## 72 0 415 10 0 0 0 0
## 73 0 236 3 3 0 0 0
## 74 0 144 10 2 0 0 0
## 75 0 68 8 0 0 0 0
## 76 0 65 7 0 0 0 0
## 77 0 42 7 2 0 0 0
## 78 0 20 4 1 0 0 0
## 79 0 26 4 2 0 0 0
## 80 0 12 2 1 0 0 0
## 81 0 5 2 1 0 0 0
## 82 0 6 11 1 0 0 0
## 83 0 6 1 0 0 0 0
## 84 0 3 8 0 0 0 0
## 85 0 2 7 2 0 0 0
## 86 0 4 4 0 0 0 0
## 87 0 1 6 0 0 0 0
## 88 0 0 2 3 0 0 0
## 89 0 2 0 0 0 0 0
## 90 0 3 3 0 0 0 0
## 91 0 0 2 0 0 0 0
## 92 0 2 5 0 0 0 0
## 93 0 0 3 0 0 0 0
## 94 0 0 3 0 0 0 0
## 95 0 0 5 0 0 0 0
## 96 0 0 4 0 0 0 0
## 97 0 0 6 0 0 0 0
## 98 0 0 3 0 0 0 0
## 99 0 0 4 0 0 0 0
## 100 0 0 3 0 0 0 0
## 101 0 0 1 0 0 0 0
## 102 0 0 1 0 0 0 0
## 103 0 0 2 0 0 0 0
## 104 0 0 4 0 0 0 0
## 105 0 0 4 0 0 0 0
## 106 0 0 3 0 0 0 0
## 107 0 0 8 0 0 0 0
## 108 0 0 1 0 0 0 0
## 109 0 0 2 0 0 0 0
## 112 0 0 3 0 0 0 0
## 113 0 0 1 0 0 0 0
## 114 0 0 2 0 0 0 0
## 115 0 0 1 0 0 0 0
## 116 0 0 2 0 0 0 0
## 117 0 0 1 0 0 0 0
## 119 0 0 2 0 0 1 0
## 120 0 0 2 0 0 0 0
## 121 0 0 2 0 0 0 0
## 122 0 0 2 0 0 0 0
## 123 0 0 2 0 0 0 0
## 128 0 0 1 0 0 0 0
## <NA> 0 2463454 305660 83777 5850 371 34
## spouses
## years_married 7 8 9
## 0 0 0 0
## 1 0 0 0
## 2 0 0 0
## 3 1 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## 7 0 0 0
## 8 0 0 0
## 9 0 1 0
## 10 0 0 0
## 11 0 0 0
## 12 0 0 0
## 13 0 1 0
## 14 0 0 0
## 15 0 0 0
## 16 0 0 0
## 17 0 0 0
## 18 0 0 0
## 19 0 0 0
## 20 0 0 0
## 21 0 0 0
## 22 0 0 0
## 23 0 0 0
## 24 0 0 0
## 25 1 0 0
## 26 0 0 0
## 27 0 0 0
## 28 0 0 0
## 29 0 0 0
## 30 0 0 0
## 31 0 0 0
## 32 0 0 0
## 33 0 0 0
## 34 0 0 0
## 35 0 0 0
## 36 0 0 0
## 37 0 0 0
## 38 0 0 0
## 39 0 0 0
## 40 0 0 0
## 41 0 0 0
## 42 0 0 0
## 43 0 0 0
## 44 0 0 0
## 45 0 0 0
## 46 0 0 0
## 47 0 0 0
## 48 0 0 0
## 49 0 0 0
## 50 0 0 0
## 51 0 0 0
## 52 0 0 0
## 53 0 0 0
## 54 0 0 0
## 55 0 0 0
## 56 0 0 0
## 57 0 0 0
## 58 0 0 0
## 59 0 0 0
## 60 0 0 0
## 61 0 0 0
## 62 0 0 0
## 63 0 0 0
## 64 0 0 0
## 65 0 0 0
## 66 0 0 0
## 67 0 0 0
## 68 0 0 0
## 69 0 0 0
## 70 0 0 0
## 71 0 0 0
## 72 0 0 0
## 73 0 0 0
## 74 0 0 0
## 75 0 0 0
## 76 0 0 0
## 77 0 0 0
## 78 0 0 0
## 79 0 0 0
## 80 0 0 0
## 81 0 0 0
## 82 0 0 0
## 83 0 0 0
## 84 0 0 0
## 85 0 0 0
## 86 0 0 0
## 87 0 0 0
## 88 0 0 0
## 89 0 0 0
## 90 0 0 0
## 91 0 0 0
## 92 0 0 0
## 93 0 0 0
## 94 0 0 0
## 95 0 0 0
## 96 0 0 0
## 97 0 0 0
## 98 0 0 0
## 99 0 0 0
## 100 0 0 0
## 101 0 0 0
## 102 0 0 0
## 103 0 0 0
## 104 0 0 0
## 105 0 0 0
## 106 0 0 0
## 107 0 0 0
## 108 0 0 0
## 109 0 0 0
## 112 0 0 0
## 113 0 0 0
## 114 0 0 0
## 115 0 0 0
## 116 0 0 0
## 117 0 0 0
## 119 0 0 0
## 120 0 0 0
## 121 0 0 0
## 122 0 0 0
## 123 0 0 0
## 128 0 0 0
## <NA> 4 2 1
crosstabs(~ spouses + ever_married, data=swed_civil_s)
## ever_married
## spouses 0 1
## 0 4044284 417970
## 1 0 6944643
## 2 0 1107224
## 3 0 108534
## 4 0 7255
## 5 0 486
## 6 0 58
## 7 0 6
## 8 0 4
## 9 0 1
props(~ ever_married, swed_civil_s)
## ever_married
## 0 1
## 0.3202 0.6798
# swed = read.csv("/Volumes/Elements/swed.csv")
# save(swed, file = "data/swed2.rdata")
load(file="data/swed2.rdata")
# load(file="data/swed.rdata")
# char_vars = sapply(swed,is.character) | sapply(swed, is.factor)
# swed[,char_vars] = plyr::colwise(function(x) {
# type.convert(as.character(x),as.is=TRUE)
# })(swed[,char_vars])
names(swed) = c("idIndividu", "idPere", "idMere", "idPaternalGrandfather", "idPaternalGrandmother", "idMaternalGrandfather", "idMaternalGrandmother", "sex", "education", "education_Father", "education_Mother", "education.MaternalGrandmother", "education.MaternalGrandfather", "education.PaternalGrandfather", "education.PaternalGrandmother", "byear", "byear.Father", "byear.Mother", "byear.PaternalGrandfather", "byear.MaternalGrandfather", "byear.PaternalGrandmother", "byear.MaternalGrandmother", "dyear", "dyear.Father", "dyear.Mother", "dyear.PaternalGrandfather", "dyear.MaternalGrandfather", "dyear.PaternalGrandmother", "dyear.MaternalGrandmother")
swed = as.data.table(swed)
# swed = head(swed, 1e6)
dupes = swed$idIndividu[duplicated(swed$idIndividu)]
# swed[LOPNR %in% dupes, ]
swed = swed[!duplicated(swed$idIndividu),]
swed = merge(swed, swed_civil_s, by = "idIndividu", all.x = T)
rm(swed_civil_s)
swed[, age_at_first_marriage := year_of_first_marriage - byear]
# qplot( year_of_first_marriage-byear, years_unmarried,data=swed, geom = 'jitter', alpha = I(0.3)) + facet_wrap(~ byear)
# qplot(byear, times_divorced, data = swed, stat = 'summary', fun.data = 'mean_sdl')
# qplot(byear, times_widowed, data = swed, stat = 'summary', fun.data = 'mean_sdl')
# qplot(byear, times_divorced, data = swed, stat = 'summary', fun.data = 'mean_sdl')
sort(props(~ swed$marriage_codes))
## swed$marriage_codes
## Ä,EP,G,OG,RP,S Ä,EP,G,RP,S Ä,EP,OG,S Ä,G,RP,S Ä,OG,RP
## 1.219e-07 1.219e-07 1.219e-07 1.219e-07 1.219e-07
## EP,G,OG EP,S OG,S,SP Ä,G,OG,RP,SP EP,G,OG,RP,S
## 1.219e-07 1.219e-07 1.219e-07 2.438e-07 2.438e-07
## EP,G,OG,RP,SP EP,G,OG,S EP,OG,RP,SP RP Ä,EP,OG,RP
## 2.438e-07 2.438e-07 2.438e-07 2.438e-07 3.658e-07
## Ä,G,OG,RP Ä,EP,G,OG,RP EP,G,RP,S RP,S Ä,EP,OG
## 4.877e-07 7.315e-07 8.535e-07 8.535e-07 9.754e-07
## OG,RP,S,SP G,OG,S,SP G,OG,SP EP,OG G,RP,S,SP
## 1.097e-06 1.219e-06 1.707e-06 1.829e-06 2.317e-06
## OG,SP OG,RP,S EP,G,OG,RP EP,OG,RP G,RP,S
## 2.560e-06 2.682e-06 5.121e-06 5.243e-06 9.998e-06
## Ä,OG,S G,OG,RP,S,SP Ä G,OG,RP,SP G,OG,RP,S
## 1.146e-05 1.402e-05 2.499e-05 3.572e-05 4.218e-05
## OG,RP,SP G,OG,RP Ä,S OG,RP Ä,OG
## 7.425e-05 1.196e-04 1.264e-04 3.138e-04 4.022e-04
## Ä,G,OG,S S Ä,G,OG OG,S Ä,G
## 5.116e-04 4.196e-03 4.804e-03 5.114e-03 5.699e-03
## Ä,G,S G G,OG,S G,S G,OG
## 8.315e-03 1.695e-02 7.141e-02 8.610e-02 2.035e-01
## <NA> OG
## 2.176e-01 3.746e-01
sort(props(~ swed$ever_married))
## swed$ever_married
## <NA> 0 1
## 0.2176 0.3750 0.4074
sort(props(~ swed[swed$byear < 1960,]$ever_married))
## swed[swed$byear < 1960, ]$ever_married
## <NA> 0 1
## 0.009808 0.180113 0.810079
swed[is.na(ever_married), ever_married := 0] # have verified that those missing from the civil file did not get far
miss_frac(swed)
## [1] 0
# make male var.
crosstabs(swed$sex)
## swed$sex
## 1 2
## 4206572 3995396
swed[, male := Recode(sex,"'1'=1;'2'=0;else=NA")]
swed$sex = NULL
## Warning in alloc.col(x): Attempt to reduce allocation from 105 to 104
## ignored. Can only increase allocation via shallow copy.
# crosstabs(swed$dyear)
swed[, age := dyear-byear ]
# qplot(swed$age)
# library(mgcv)
# qplot(data=swed,dyear,ifelse(age < 1, 0, 1)) + geom_smooth()
# qplot(data=swed,dyear,age) + geom_smooth()
# we believe that those who don't have a death date by 2009 and who were born after 1962, were alive by 2009. this probably incorrectly treats some out-migrants.
swed[byear>=1962 & is.na(dyear) & byear < 2009, age_at_least := as.integer(2009)-byear]
swed[!is.na(age), age_at_least := age]
swed[age < 0, age := NA]
swed[, survive1y := ifelse(age_at_least <= 1, 0, 1) ] # we are conservative in our survival estimation and would estimate 1;11 years as dead in the first year in the worst case (because we only have birth years, not months)
swed[, survive5y := ifelse(age_at_least <= 5, 0, 1) ]
swed[,surviveR := ifelse(age_at_least <= 15, 0, 1)]
crosstabs(~ survive1y + byear, data = swed)
## byear
## survive1y 1932 1933 1934 1935 1936 1937 1938 1939 1940
## 0 0 0 0 0 0 0 0 0 0
## 1 7741 13121 14622 14447 14020 13313 12895 12809 11585
## <NA> 19311 36165 44551 49456 54761 57923 62556 66790 68187
## byear
## survive1y 1941 1942 1943 1944 1945 1946 1947 1948 1949
## 0 0 0 0 0 0 1 4 4 3
## 1 11739 12449 13055 12974 11955 11066 10013 9076 8061
## <NA> 73866 87560 97828 106179 108817 110600 109889 109842 106781
## byear
## survive1y 1950 1951 1952 1953 1954 1955 1956 1957 1958
## 0 1 2 1 6 3 2 3 5 2
## 1 7394 6539 6050 5656 5185 5067 4755 4360 4011
## <NA> 102753 99036 100723 101876 98548 101314 102850 102763 101746
## byear
## survive1y 1959 1960 1961 1962 1963 1964 1965 1966 1967
## 0 4 158 1399 1372 1385 1431 1324 1263 1268
## 1 3662 3546 3515 110197 116267 126024 126374 126463 125090
## <NA> 102396 100194 103991 0 0 0 0 0 0
## byear
## survive1y 1968 1969 1970 1971 1972 1973 1974 1975 1976
## 0 1219 980 984 1004 929 826 856 700 666
## 1 117406 111587 113380 117419 116096 114157 114944 108898 103430
## <NA> 0 0 0 0 0 0 0 0 0
## byear
## survive1y 1977 1978 1979 1980 1981 1982 1983 1984 1985
## 0 610 578 594 569 538 523 575 568 577
## 1 101411 98826 102194 103602 100940 100500 99680 102345 106941
## <NA> 0 0 0 0 0 0 0 0 0
## byear
## survive1y 1986 1987 1988 1989 1990 1991 1992 1993 1994
## 0 525 572 565 585 649 627 560 483 434
## 1 110470 113027 120897 124457 132234 131360 129139 123121 118274
## <NA> 0 0 0 0 0 0 0 0 0
## byear
## survive1y 1995 1996 1997 1998 1999 2000 2001 2002 2003
## 0 377 337 296 271 265 283 306 275 296
## 1 109307 101234 96507 95279 94383 96533 96688 100907 103633
## <NA> 0 0 0 0 0 0 0 0 0
## byear
## survive1y 2004 2005 2006 2007 2008 2009
## 0 284 253 272 243 109526 256
## 1 105393 105367 109071 109358 23 0
## <NA> 0 0 0 0 0 105725
# bugs probably due to re-assigned personnummer
nrow(swed[!is.na(education) & survive1y == 0,])
## [1] 37
swed[!is.na(education) & survive1y == 0, survive1y := NA]
swed[!is.na(education) & survive1y == 0, age_at_least := NA]
swed[!is.na(education) & survive5y == 0, survive5y := NA]
swed[!is.na(education) & survive5y == 0, age_at_least := NA]
nrow(swed[education > 2 & surviveR == 0,])
## [1] 35
swed[education > 2 & surviveR == 0, surviveR := NA]
swed[education > 2 & surviveR == 0, age_at_least := NA]
swed[, paternalage := byear - byear.Father]
swed[, idParents := str_c(idMere,"_",idPere)]
swed[, maternalage := byear - byear.Mother]
# qplot(maternalage,paternalage,data=swed,geom="jitter",alpha=I(0.1))
cor.test(swed$maternalage,swed$paternalage)
##
## Pearson's product-moment correlation
##
## data: swed$maternalage and swed$paternalage
## t = 3100, df = 8100000, p-value <2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7365 0.7372
## sample estimates:
## cor
## 0.7368
swed$born = 1
swed = swed[order(idParents,byear), ]
swed[ , first_child_with_this_partner := ifelse( !duplicated(idParents), 1, 0) ]
swed = count_and_merge(swed, 'fertile_unions', wt_var = "first_child_with_this_partner")
crosstabs(~ fertile_unions + spouses, data = swed)
## spouses
## fertile_unions 0 1 2 3 4 5 6
## 0 2209839 273143 51614 5576 355 20 3
## 1 805108 1921584 654070 53067 2582 108 11
## 2 102010 169292 94192 25748 2136 161 8
## 3 10435 15535 10292 4039 583 62 12
## 4 1141 1643 1305 606 125 18 4
## 5 151 241 177 104 28 4 2
## 6 26 38 54 20 4 1 1
## 7 10 4 13 5 2 1 1
## 8 1 2 3 1 0 0 0
## 9 0 1 0 0 0 0 0
## 10 0 0 1 0 0 0 0
## 11 0 1 0 0 0 0 0
## spouses
## fertile_unions 7 8 9 <NA>
## 0 0 0 0 1784037
## 1 2 1 0 584
## 2 1 0 0 12
## 3 1 2 0 0
## 4 1 0 0 1
## 5 1 0 1 0
## 6 0 0 0 0
## 7 0 0 0 0
## 8 0 0 0 0
## 9 0 0 0 0
## 10 0 0 0 0
## 11 0 0 0 0
swed = count_and_merge(swed, 'children', wt_var = "born")
swed$children.per.spouse = swed$children/swed$spouses
swed$children.per.spouse[which(swed$spouses==0)] = NA
# qplot(swed$children.per.spouse)
# qplot(swed$children)
# qplot(swed$spouses)
changeNAto1 = function(x) { colwise(function(x) { ifelse(is.na(x), 1, x)})(x) }
# bugs probably owing to re-assigment of personnummer
nrow(swed[children>0 & surviveR == 0,])
## [1] 25
swed[children>0 & survive1y == 0, survive1y := NA]
swed[children>0 & surviveR == 0, surviveR := NA]
swed[children>0 & surviveR == 0, age_at_least := NA]
nrow(swed[spouses > 0 & surviveR == 0,])
## [1] 3690
swed[spouses > 0 & survive1y == 0, survive1y := NA]
swed[spouses > 0 & surviveR == 0, survive5y := NA]
swed[spouses > 0 & surviveR == 0, surviveR := NA]
swed[spouses > 0 & surviveR == 0, age_at_least := NA]
swed = count_and_merge(swed, 'children.surviving1y', wt_var = 'survive1y')
swed = count_and_merge(swed, 'children.surviving5y', wt_var = 'survive5y')
swed = count_and_merge(swed, 'children.survivingR', wt_var = 'surviveR')
# swed = count_and_merge(swed, 'children.spouses', wt_var = 'spouses')
# swed = count_and_merge(swed, 'grandchildren.per.spouse', wt_var = 'children.per.spouse')
swed = count_and_merge(swed, 'grandchildren',wt_var='children')
swed = count_and_merge(swed, 'grandchildren.surviving1y', wt_var = 'children.surviving1y')
swed = count_and_merge(swed, 'grandchildren.survivingR', wt_var = 'children.survivingR')
swed[, dead1y := ifelse(age_at_least >= 1, 0, 1)]
swed = count_and_merge(swed, 'children.dead1y', wt_var = 'dead1y')
swed[, children.wddate := children.dead1y + children.surviving1y]
xtabs(~ (grandchildren>0) +(children>0) + surviveR,data=swed,exclude=NULL, na.action= na.pass)
## , , surviveR = 0
##
## children > 0
## grandchildren > 0 FALSE TRUE
## FALSE 1594483 0
## TRUE 0 0
##
## , , surviveR = 1
##
## children > 0
## grandchildren > 0 FALSE TRUE
## FALSE 2160586 1563637
## TRUE 0 184535
##
## , , surviveR = NA
##
## children > 0
## grandchildren > 0 FALSE TRUE
## FALSE 568287 828315
## TRUE 0 1302125
# qplot(swed$children)
# qplot(swed[which(swed$age > 15),]$children )
# qplot(swed$grandchildren)
# qplot(swed[which(swed$age > 15),]$grandchildren )
# qplot(swed$spouses)
pre-calculate some predictors
swed = swed[order(idParents,byear), ]
swed <- transform(swed, birthorder = ave(rep(NA, nrow(swed)), swed$idParents, FUN = seq_along)) # old trick to get birth order, don't know what this does to those with missings for father though
# qplot(swed$birthorder,binwidth=1)
swed <- transform(swed, min15.birthorder = ave(surviveR, idPere, FUN =function(x) { x[is.na(x)] = 0
cumsum(x)
} ))
xtabs(data=swed, ~is.na(birthorder) + is.na(min15.birthorder))
## is.na(min15.birthorder)
## is.na(birthorder) FALSE
## FALSE 8141033
## TRUE 60935
table(swed$min15.birthorder,exclude=NULL)
##
## 0 1 2 3 4 5 6 7 8
## 3709277 2463803 1370459 474328 128488 36357 11935 4291 1748
## 9 10 11 12 13 14 15 16 17
## 692 319 145 59 33 18 10 4 1
## 18 <NA>
## 1 0
swed[, nr.siblings := ave(born,idParents,FUN= function(x) { sum(x,na.rm=T) } ) - 1]
# qplot(swed$nr.siblings,binwidth=1)
count dependent sibs
swed = swed %>%
group_by(idParents) %>%
mutate(
dependent_sibs_f5y = dependent_sibs_f5y(survive1y=survive1y, byear=byear, dyear=dyear)
) %>% data.table()
swed[, birth.cohort := year_bins(byear)]
crosstabs(swed$birth.cohort)
## swed$birth.cohort
## 1930-1935 1935-1940 1940-1945 1945-1950 1950-1955 1955-1960 1960-1965
## 135511 358970 495422 596112 533773 532940 569479
## 1965-1970 1970-1975 1975-1980 1980-1985 1985-1990 1990-1995 1995-2000
## 612974 580595 517907 509840 578616 636881 498256
## 2000-2005 2005-2010
## 504598 540094
swed %>%
tbl_df() %>%
mutate(
maternal_loss_age = dyear.Mother - byear
,maternal_loss_age = as.numeric(ifelse(maternal_loss_age >= -1 & maternal_loss_age < 0, 0, maternal_loss_age))
,maternal_loss = as.character(cut(maternal_loss_age, breaks = c(0,1,5,10,15,20,25,30,35,40,45), include.lowest = T ))
,maternal_loss = ifelse( maternal_loss_age >= 45, "later", maternal_loss)
,maternal_loss = ifelse(is.na(maternal_loss_age) | is.na(maternal_loss), "unclear", maternal_loss)
,maternal_loss = factor(maternal_loss, levels = c("later","[0,1]", "(1,5]", "(5,10]", "(10,15]", "(15,20]", "(20,25]", "(25,30]", "(30,35]", "(35,40]", "(40,45]", "unclear"))
,paternal_loss_age = dyear.Father - byear
,paternal_loss_age = as.numeric(ifelse(paternal_loss_age >= -1 & paternal_loss_age < 0, 0, paternal_loss_age))
,paternal_loss = as.character(cut(paternal_loss_age, breaks = c(0,1,5,10,15,20,25,30,35,40,45), include.lowest = T ))
,paternal_loss = ifelse( paternal_loss_age >= 45, "later", paternal_loss)
,paternal_loss = ifelse(is.na(paternal_loss_age) | is.na(paternal_loss), "unclear", paternal_loss)
,paternal_loss = factor(paternal_loss, levels = c("later","[0,1]", "(1,5]", "(5,10]", "(10,15]", "(15,20]", "(20,25]", "(25,30]", "(30,35]", "(35,40]", "(40,45]", "unclear"))
) %>%
data.table() ->
swed
crosstabs(swed$maternal_loss)
## swed$maternal_loss
## later [0,1] (1,5] (5,10] (10,15] (15,20] (20,25] (25,30] (30,35]
## 1274984 2151 7869 16189 27191 45547 71480 106952 149433
## (35,40] (40,45] unclear
## 204279 208744 6087149
crosstabs(swed$paternal_loss)
## swed$paternal_loss
## later [0,1] (1,5] (5,10] (10,15] (15,20] (20,25] (25,30] (30,35]
## 1096763 7248 19961 38395 63816 108260 170261 251155 328328
## (35,40] (40,45] unclear
## 394178 339366 5384237
swed$older_siblings = factor(ifelse((swed$birthorder - 1) > 4,"5+", swed$birthorder - 1))
swed$last_born = ifelse(swed$birthorder == swed$nr.siblings, 1, 0)
swed = swed[order(swed$idParents,swed$byear), ]
swed <- transform(swed, siblings = ave(rep(NA, nrow(swed)), swed$idParents, FUN = length)-1) # sibling count
swed <- transform(swed, birthorder = ave(rep(NA, nrow(swed)), swed$idParents, FUN = seq_along)) # old trick to get birth order, don't know what this does to those with missings for father though
qplot(swed$birthorder)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 60935 rows containing non-finite values (stat_bin).
swed$younger_siblings = swed$siblings + 1 - swed$birthorder
swed = swed %>%
group_by(idParents) %>%
mutate(
younger_sibs_ad_5y = younger_sibs_alive_and_dependent(survive5y=survive5y, byear=byear, dyear=dyear) ,
older_sibs_ad_5y = older_sibs_alive_and_dependent(survive5y=survive5y, byear=byear, dyear=dyear),
dependent_sibs_f5y = dependent_sibs_f5y(survive1y=survive1y, byear=byear, dyear=dyear)
) %>% data.table()
swed[, paternalage := paternalage/10]
swed[, maternalage := maternalage/10]
swed = recenter.pat(recenter.pat(swed), what = "maternalage")
min_na = function(x) { ifelse(all(is.na(x)), NA, min(x,na.rm=T) ) }
max_na = function(x) { ifelse(all(is.na(x)), NA, max(x,na.rm=T) ) }
swed = swed[order(idPere),]
swed[, paternalage_at_1st_sib := ave(paternalage, idPere, FUN = min_na)]
swed[, paternalage_at_last_sib := ave(paternalage, idPere, FUN = max_na)]
swed = swed[order(idMere),]
swed[, maternalage_at_1st_sib := ave(maternalage, idMere, FUN = min_na)]
swed[, maternalage_at_last_sib := ave(maternalage, idMere, FUN = max_na)]
fathers = swed[!duplicated(idPere), list(idPere, paternalage_at_1st_sib, paternalage_at_last_sib)]
names(fathers) = c("idIndividu","age_at_1st_child", "age_at_last_child")
mothers = swed[!duplicated(idMere), list(idMere, maternalage_at_1st_sib, maternalage_at_last_sib)]
names(mothers) = c("idIndividu","age_at_1st_child", "age_at_last_child")
parents = rbind(fathers, mothers)
swed = merge(swed, parents, by = "idIndividu", all.x = T)
rm(parents,fathers,mothers)
swed[, nr.siblings := ave(born,idParents,FUN= function(x) { sum(x,na.rm=T) } ) -1] # dont count self
swed[, maternalage.factor := cut((maternalage*10), breaks = c(14, 20, 35, 61))]
swed[, maternalage.factor := relevel(maternalage.factor, ref = "(20,35]")]
swed[, ever_married := ifelse(spouses>0, 1,0)]
swed[, any_surviving_children := ifelse(children.survivingR > 0, 1, 0)]
swed$birth_cohort = factor(swed$birth.cohort)
swed$male = factor(swed$male)
swed$last_born = factor(swed$last_born)
swed[ , first_sib_byear := ave( byear, idParents, FUN = function(x)min(x,na.rm=T)) ]
swed[ , last_sib_byear := ave( byear, idParents, FUN = function(x)max(x,na.rm=T)) ]
swed[, birth_span := last_sib_byear - first_sib_byear]
hist(swed$birth_span,breaks = 0:35)
swed.mp = swed[!is.na(paternalage), ]
swed.1 = swed.mp[byear >= 1962-15 & byear <= 2009-50, ]
swed.2 = swed.mp[byear >= 1969 & byear < 2000, ]
set.seed(seed = 346834964)
swed.2 %>% filter(complete.cases(survive1y, birth_cohort, male, maternalage.factor, paternalage.mean, paternalage, paternal_loss, maternal_loss, older_siblings, nr.siblings, last_born, idParents)) %>% select(idParents) %>% distinct() -> complete_fams
selected_fams = sample(complete_fams$idParents, size = 200000)
swed_subset_survival.1 = swed.2 %>% filter(idParents %in% selected_fams)
swed.1 %>% filter(complete.cases(children, birth_cohort, male, maternalage.factor, paternalage.mean, paternalage, paternal_loss, maternal_loss, older_siblings, nr.siblings, last_born, idParents)) %>% select(idParents) %>% distinct() -> complete_fams
selected_fams = sample(complete_fams$idParents, size = 80000)
swed_subset_children.1 = swed.1 %>% filter(idParents %in% selected_fams)
save(swed,file="swed.rdata")
save(swed.1, file="swed1.rdata")
save(swed.2, file="swed2.rdata")
save(swed_subset_children.1, file="swed_subset_children.rdata")
save(swed_subset_survival.1, file="swed_subset_survival.rdata")