Multivariate bounded data • mglm4twin

Abstract

The mglm4twin package implements multivariate generalized linear models for twin and family data as proposed by Bonat and Hjelmborg (2020). The core fit function mglm4twin is employed for fitting a set of models. In this introductory vignette we restrict ourselves to model multivariate bounded outcomes. We present a set of special specifications of the standard ACE for multivariate bounded traits in the context of twin data. This vignette reproduces the data analysis presented in the Section 5.6 Dataset 6: Multivariate bounded data of Bonat and Hjelmborg (2020).

Install and loading the mglm4twin package

To install the stable version of mglm4twin, use devtools::install_git(). For more information, visit mglm4twin/README.

library(devtools)
install_git("bonatwagner/mglm4twin")

library(mglm4twin)
packageVersion("mglm4twin")
#> [1] '0.5.0'

Data set

In this example, we simulate a dataset as similar as possible to a real dataset concernings three bounded outcomes reflecting mental conditions. The data set consists of 828 (524 DZ and 274 MZ) twin pairs. The outcomes are obtained based on questionaries and includes the Mini Mental State Examination (MMSE) taking values on the interval $[0\ldots, 30]$ . The second outcome corresponds to depression symptomatology which was evaluated based on an adaptation of the depression section of the Cambridge Mental Disorders of the Elderly Examination (CAMDEX). The depression scale used here is a composite of responses to 17 depression items and taking integer values in the interval $[0, \ldots, 30]$ and is highly left-skewed in distribution in any population-based sample. The third outcome is a Social Activity scale based on six items that assess the frequency with which the individual is engaged with others and mental pursuits. We opted to simulate such a data set because we can show how to simulate the data and make it available for the users. For simulating the data set, we are going to use the SimCorMultRes package.

## Loading extra packages
require(Matrix)
#> Loading required package: Matrix
require(SimCorMultRes)
#> Loading required package: SimCorMultRes
require(mglm4twin)

## Setting model parameters
E <- c(0.75, 0.7, 0.65, -0.3, 0.25, -0.4)
A <- c(0.25,0.3, 0.35, -0.15, 0.20, -0.2)
C <- rep(0, 6)
tau = c(E, A, C)

## Twin structure
DZ = mt_twin(N_DZ = 1, N_MZ = 0, n_resp = 3, model = "ACE")
MZ = mt_twin(N_DZ = 0, N_MZ = 1, n_resp = 3, model = "ACE")
Omega_DZ <- as.matrix(mt_matrix_linear_predictor(tau = tau, Z = DZ))
Omega_MZ <- as.matrix(mt_matrix_linear_predictor(tau = tau, Z = MZ))

## Regression structures
sex <- c(rep("Male", 220/2), rep("Female", 334/2), 
         rep("Male", 116/2), rep("Female", 158/2))
zyg <- c(rep("DZ", 554/2), rep("MZ", 274/2))
set.seed(123)
Age <- rbeta(414, shape1 = 0.3*2, shape2 = 0.7*2)*20 + 70
age_std <- (Age - mean(Age))/var(Age)
X <- model.matrix(~ sex + age_std)
beta1 <- c(1.6956, 0.0584, -0.2576)
mu1 <- exp(X%*%beta1)/(1+exp(X%*%beta1))
beta2 <- c(-1.7930, 0.0875, 0.2382)
mu2 <- exp(X%*%beta2)/(1+exp(X%*%beta2))
beta3 <- c(-0.5363, -0.05138, -0.1528)
mu3 <- exp(X%*%beta3)/(1+exp(X%*%beta3))

## Marginal distributions
dist <- c("qbeta","qbeta","qbeta")
invcdfnames <- rep(dist, each = 2)

## Simulating data set
Y1_MZ <- list()
Y2_MZ <- list()
Y3_MZ <- list()

Y1_DZ <- list()
Y2_DZ <- list()
Y3_DZ <- list()
set.seed(181185)
phi = 5
for(i in 1:137) {
  paramslists <- list(
    m1 = list(shape1 = mu1[i]*phi, shape2 = (1-mu1[i])*phi),
    m1 = list(shape1 = mu1[i]*phi, shape2 = (1-mu1[i])*phi),
    m2 = list(shape1 = mu2[i]*phi, shape2 = (1-mu2[i])*phi),
    m2 = list(shape1 = mu2[i]*phi, shape2 = (1-mu2[i])*phi),
    m3 = list(shape1 = mu3[i]*phi, shape2 = (1-mu3[i])*phi),
    m3 = list(shape1 = mu3[i]*phi, shape2 = (1-mu3[i])*phi))
  Y_MZ <- rnorta(R = 1, cor.matrix = Omega_MZ,
                 distr = invcdfnames, qparameters = paramslists)
  Y1_MZ[[i]] <- Y_MZ[1:2]
  Y2_MZ[[i]] <- Y_MZ[3:4]
  Y3_MZ[[i]] <- Y_MZ[5:6]
}
set.seed(181185)
for(i in 138:414) {
  paramslists <- list(
    m1 = list(shape1 = mu1[i]*phi, shape2 = (1-mu1[i])*phi),
    m1 = list(shape1 = mu1[i]*phi, shape2 = (1-mu1[i])*phi),
    m2 = list(shape1 = mu2[i]*phi, shape2 = (1-mu2[i])*phi),
    m2 = list(shape1 = mu2[i]*phi, shape2 = (1-mu2[i])*phi),
    m3 = list(shape1 = mu3[i]*phi, shape2 = (1-mu3[i])*phi),
    m3 = list(shape1 = mu3[i]*phi, shape2 = (1-mu3[i])*phi))
  Y_DZ <- rnorta(R = 1, cor.matrix = Omega_DZ,
                 distr = invcdfnames, qparameters = paramslists)
  Y1_DZ[[i]] <- Y_DZ[1:2]
  Y2_DZ[[i]] <- Y_DZ[3:4]
  Y3_DZ[[i]] <- Y_DZ[5:6]
}

Y1 <- c(do.call(c, Y1_DZ), do.call(c, Y1_MZ))
Y2 <- c(do.call(c, Y2_DZ), do.call(c, Y2_MZ))
Y3 <- c(do.call(c, Y3_DZ), do.call(c, Y3_MZ))

data <- data.frame("Y1" = Y1, "Y2" = Y2, "Y3" = Y3, "twin_id" = rep(1:2, 414),
                   "zyg" = rep(zyg, each = 2), "sex" = rep(sex, each = 2), 
                   "age_std" = rep(age_std, each = 2))

Model specification

The model specification is done in three steps:

Specifying the linear predictor (mean model).
Specifying the matrix linear predictor (twin dependence struture).
Specifying link and variance function suitable to be response variable type.

Specyfing the linear predictor

In the example we have the covariates sex and age for composing the linear predictor. Thus, the linear predictor is given by

form_Y1 <- c(Y1 ~ sex + age_std)
form_Y2 <- c(Y2 ~ sex + age_std)
form_Y3 <- c(Y3 ~ sex + age_std)

Specyfing the matrix linear predictor

The matrix linear predictor describes the twin dependence structure. The mglm4twin package provides a set of pre-specified covariance structures. However, the users can easily specify thein own covariance structure through a linear combination of known matrices. In this example, we are going to specify a set of special cases of the standard ACE models for twin data.

ACE = mt_twin(N_DZ = 554/2, N_MZ = 274/2, n_resp = 3, model = "ACE")
AE = mt_twin(N_DZ = 554/2, N_MZ = 274/2, n_resp = 3, model = "AE")
CE = mt_twin(N_DZ = 554/2, N_MZ = 274/2, n_resp = 3, model = "CE")
E = mt_twin(N_DZ = 554/2, N_MZ = 274/2, n_resp = 3, model = "E")

The function mt_twin() has four main arguments: N_DZ and N_MZ the number of dizygotic and monozygotic twin, respectively. n_resp number of response variables, in this case just $1$ and the model a string given the model’s name. It is extreme important to note that the data set should be ordered in a way that all DZ twin appear first and then all MZ twin and the twin pair should appear together, i.e. twin 2 after twin 1.

Specyfing link and variance functions

The specificantion of the link and variance functions depends on the response variable type. In the case of the bounded data the mglm4twinpackage offers the logit, probit, cauchit, cloglog and loglog link functions. For the variance function the binomialP represents the standard binomial variance function, i.e. $\mu(1-\mu)$ which is suitable for bounded outcomes. In this example, we follow Bonat and Hjelmborg (2020) and use the logit link function and the binomialP variance function.

link = rep("logit", 3)
variance = rep("binomialP", 3)

Model fitting

The main fitting function in the mglm4twin package is the mglm4twin() function. it was designed to be similar to standard model fitting functions in R such as lm() and glm().

fit_ACE <- mglm4twin(linear_pred = c(form_Y1, form_Y2, form_Y3), 
                     matrix_pred = ACE, link = link,
                     variance = variance, 
                     data = data)
#> Automatic initial values selected.
fit_AE <- mglm4twin(linear_pred = c(form_Y1, form_Y2, form_Y3), 
                    matrix_pred = AE, link = link,
                    variance = variance, 
                    data = data)
#> Automatic initial values selected.
fit_CE <- mglm4twin(linear_pred = c(form_Y1, form_Y2, form_Y3), 
                    matrix_pred = CE, link = link,
                    variance = variance, 
                    data = data)
#> Automatic initial values selected.
fit_E <- mglm4twin(linear_pred = c(form_Y1, form_Y2, form_Y3), 
                    matrix_pred = E, link = link,
                    variance = variance, 
                    data = data)
#> Automatic initial values selected.

Reproducing Table 8 of Bonat and Hjelmborg (2020)

rbind(gof(fit_E), gof(fit_AE), gof(fit_CE), gof(fit_ACE))[,c(1,2,3,5)]
#>   plogLik Df     pAIC      pBIC
#> 1 1294.32 15 -2558.64 -2471.376
#> 2 1311.04 21 -2580.08 -2457.910
#> 3 1307.94 21 -2573.88 -2451.710
#> 4 1311.69 27 -2569.38 -2412.304

Reproducing Table 9 of Bonat and Hjelmborg (2020)

Table 9 is done by collecting the results from the model’s summary.

summary(fit_AE, model = "AE", biometric = TRUE)
#> Call: Y1 ~ sex + age_std
#> 
#> Link function: logit
#> Variance function: binomialP
#> Regression:
#>               Estimates  std.error    z value      Pr(>|z|)
#> (Intercept)  1.74420183 0.05291741 32.9608287 2.959761e-238
#> sexMale     -0.06859125 0.08198530 -0.8366286  4.028013e-01
#> age_std     -0.24957571 0.21050178 -1.1856228  2.357713e-01
#> 
#> Call: Y2 ~ sex + age_std
#> 
#> Link function: logit
#> Variance function: binomialP
#> Regression:
#>               Estimates  std.error     z value      Pr(>|z|)
#> (Intercept) -1.81642618 0.05590683 -32.4902391 1.464717e-231
#> sexMale      0.06108591 0.08667232   0.7047914  4.809401e-01
#> age_std      0.45342969 0.22054316   2.0559681  3.978559e-02
#> 
#> Call: Y3 ~ sex + age_std
#> 
#> Link function: logit
#> Variance function: binomialP
#> Regression:
#>               Estimates  std.error    z value     Pr(>|z|)
#> (Intercept) -0.54371923 0.03970218 -13.694945 1.088421e-42
#> sexMale     -0.06694236 0.06272402  -1.067252 2.858579e-01
#> age_std     -0.24058640 0.16243509  -1.481123 1.385737e-01
#> 
#> Dispersion:
#> Genetic component:
#>        Estimates   std.error    z value    Pr(>|z|)
#> A1   0.027929924 0.011012562  2.5361876 0.011206667
#> A2   0.041948185 0.011088036  3.7831934 0.000154829
#> A3   0.029001292 0.011134358  2.6046668 0.009196365
#> A12 -0.015019172 0.008190355 -1.8337632 0.066689151
#> A13  0.007991468 0.008219853  0.9722154 0.330943424
#> A23 -0.013718879 0.008734612 -1.5706340 0.116267687
#> 
#> Environment component:
#>       Estimates   std.error   z value     Pr(>|z|)
#> E1   0.12812751 0.014343170  8.932998 4.146175e-19
#> E2   0.11544285 0.012164565  9.490093 2.308273e-21
#> E3   0.13215173 0.011462188 11.529363 9.383491e-31
#> E12 -0.04475663 0.008457948 -5.291666 1.212074e-07
#> E13  0.04644685 0.008850462  5.247958 1.537946e-07
#> E23 -0.06735482 0.009130621 -7.376806 1.621321e-13
#> 
#> Measures associated with genetic structure:
#> Genetic correlation:
#>        Estimates std.error   z value   Pr(>|z|)
#> rho12 -0.4387878 0.1877430 -2.337172 0.01943023
#> rho13  0.2807909 0.2475721  1.134178 0.25671972
#> rho23 -0.3933266 0.1795943 -2.190084 0.02851811
#> 
#> Heritability:
#>    Estimates  std.error  z value     Pr(>|z|)
#> h1 0.1789721 0.06973874 2.566322 1.027833e-02
#> h2 0.2665221 0.06729049 3.960769 7.470886e-05
#> h3 0.1799612 0.06729608 2.674171 7.491434e-03
#> 
#> Bivariate heritability:
#>     Estimates std.error   z value   Pr(>|z|)
#> h12 0.2512584 0.1313564 1.9127994 0.05577373
#> h13 0.1467986 0.1479034 0.9925303 0.32093892
#> h23 0.1692149 0.1046257 1.6173365 0.10580566
#> 
#> Measures associated with environment structure:
#> Environment correlation:
#>        Estimates  std.error    z value     Pr(>|z|)
#> rho12 -0.3680041 0.05815352  -6.328149 2.481196e-10
#> rho13  0.3569428 0.05709776   6.251433 4.067047e-10
#> rho23 -0.5453166 0.04741133 -11.501822 1.291602e-30
#> 
#> Environmentality:
#>    Estimates  std.error  z value     Pr(>|z|)
#> e1 0.8210279 0.06973874 11.77291 5.383495e-32
#> e2 0.7334779 0.06729049 10.90017 1.150392e-27
#> e3 0.8200388 0.06729608 12.18553 3.712093e-34
#> 
#> Bivariate environmentality:
#>     Estimates std.error  z value     Pr(>|z|)
#> e12 0.7487416 0.1313564 5.700079 1.197522e-08
#> e13 0.8532014 0.1479034 5.768640 7.991358e-09
#> e23 0.8307851 0.1046257 7.940548 2.012905e-15
#> 
#> Algorithm: chaser
#> Correction: FALSE
#> Number iterations: 6