Figure 1 shows a path diagram of the DoC model, including all possible paths between two traits that can be estimated using cross-sectional twin data. In a bivariate ACE model, each trait is influenced by latent additive genetic (A), shared environmental (C), and unique environmental (E) variance components which may covary between the traits (parameters rA, rC, and rE). Causal effects between X and Y are represented by the direct regression paths between the phenotypes (parameters b1 and b2). Due to identification constraints, only three of the five potential sources of covariance (b12, b21, rA, rC, and rE) can be estimated in any model (Heath et al., 1993; Neale & Cardon, 1992). Further, the model makes the standard assumptions of non-random mating, the absence of non-additive genetic variation, and gene-environment independence (Neale & Cardon, 1992). Differences in the patterns of the cross-twin, cross-trait covariances enable testing causal hypotheses.
The likelihood ratio test is typically used to compare the model fit of the DoC and other variance component models. Unidirectional DoC models are nested within the Cholesky ACE model, so likelihood ratio testing is appropriate when certain regularity conditions are met (Steiger et al., 1985; Verhulst et al., 2019). When models are not nested, such as the bidirectional DoC model with additive genetic correlation (rA estimated, rC and rE fixed) versus one with unshared environment correlation (rE estimated and rA and rC fixed), a parsimony-based fit index such as Akaike Information Criterion (AIC) can be used. The AIC, calculated as twice the negative log-likelihood minus twice the degrees of freedom, balances model complexity and model fit to achieve a parsimony-related statistic. The model with the lowest AIC is the model of choice, in principle. Although the difference in log-likelihood between models with different numbers of mixture components may not meet the Steiger et al. (1985) regularity conditions, and hence may not be distributed as chi-squared, it is still useful as a guide to relative model fit.
Integrating Mixture Modeling with the Direction of Causation Model
Given that the CTD offers the means to address the direction of causation, it seems plausible that a mixture distribution model may be able to detect population heterogeneity in this respect. Finite mixture modeling of twin data has been considered before. However, this was to conduct genetic covariance structure analysis in the absence of zygosity information (Benyamin et al., 2006; Neale, 2003), or to investigate (latent) group differences in genetic and environmental variance components (Gillespie & Neale, 2006). The present aim is to investigate the feasibility of mixture modeling to detect causal heterogeneity. Accordingly, we conducted a simulation study to: i) ensure that the mixture model accurately recovers parameter values used to simulate the data; ii) compare the model fit statistics between mixture and non-mixture DoC models; and iii) evaluate the posterior probabilities that each individual belongs to each class (i.e., the X-causes-Y class or the Y-causes-X class). Thus, we aim to determine the feasibility of finite mixture modeling of heterogeneity of causal relations using twin data, estimate the required sample sizes, and identify patterns of parameters that aid in model fitting.
Structural Equation Mixture Modeling (SEMM)
Here we formally define the finite mixture distribution model. Let y denote a column vector of continuous observations, and C denote the number of mixture classes. The prior probability of belonging to latent class i, ωi, is equal to the proportion of the population belonging to that latent class. In a mixture model, the probability density function with C latent classes can be expressed as:
$$f\left(\varvec{y}\right)=\sum _{i=1}^{C}{\omega }_{i}{\varPhi }_{i}(y;{\theta }_{i})$$ (1)
where \(\varPhi\)i is the i-th density, \({\theta }_{i}\) is the vector of the parameters of the i-th density, and, as, above, ωi is the corresponding mixture proportions, where \(\sum {\omega }_{i}=1\). In the DoC mixture model, the class-specific densities \(\varPhi\)i \((y;{\theta }_{i})\) come from the same parametric family (multivariate normal) with class-specific mean and covariance matrices, which are a function of the \({\theta }_{i}\) parameter estimates. Combining structural equation modeling (SEM) with finite mixture modeling, known as Structural Equation Mixture Modeling (SEMM), involves specifying an SEM structure within each class (Dolan & van der Maas, 1998; Lubke & Muthén, 2005; Vermunt & Magidson, 2014; Yung, 1997). The SEM structures must differ in some way to ensure identification of the mixture proportions. This approach allows for the estimation of: i) the number of classes, ii) the parameters of the densities and the mixture proportions, and iii) each individual’s conditional class membership (posterior) probabilities.
Direction of Causation Twin Mixture Model (mixDoC)
The model developed here is a mixture distribution DoC twin model (mixDoC), which includes two classes that differ in the direction of the causal relationship between the two traits, X and Y. The two classes correspond to subpopulations characterized by opposite causality directions. In one subpopulation, X causes Y, while in the other, Y causes X. Since the DoC model uses pairs of twins as the sampling unit and each twin could be drawn from either causal distribution, there are four possible combinations for each family: two combinations where the twins are concordant for causal direction and two combinations where they are discordant. Specifically: 1) X causes Y in both twins; 2) Y causes X in both twins; 3) X causes Y in twin 1 and Y causes X in twin 2; or 4) Y causes X in twin 1 and X causes Y in twin 2. Since the ordering of twins within a twin pair is usually random or unsystematic, the expected parameter estimates and mixing proportions of the discordant classes are expected to be equal. Figure 2 shows a schematic of the model. The model consists of two groups (MZ and DZ twin pairs), with four mixture component classes within each group.
The ability to disaggregate mixtures of opposing causal processes stems, in large part, from differences in the observed means that arise from each causal process. To identify the DoC model within each concordant class, means for each trait (X and Y) are equated across twin 1 and twin 2 and across zygosities. However, the causal process may differ between the classes, thus altering the expected mean for each variable. The means in the class where X causes Y may differ from the means in the class where Y causes X. This distinction results in two means for each trait, where the phenotypic mean differences across classes are associated with the direction of causation. For model identification, at least two of the parameters, bx−>y, by−>x, rA, rC, or rE, must be fixed to a specific value, which is zero in this case. Constraining all three parameters to equal zero specifies that there is no latent confounding due to genetic, shared environmental, or unique environmental sources. In that case, the only source of phenotypic covariance is the causal relationship. To limit the set of data-generating models, we considered two models: one in which all background confounding was set to zero (i.e., rA=rC=rE=0) for all classes and unidirectional causation, and an extended model where the mixture still focused on unidirectional causation but also allowed for background genetic confounding (rA was freely estimated but equated across class).
Simulation Design Parameters
Table 1
Simulation Designs: Table outlines the various parameters used in the simulation and their corresponding explanations.
|
Scenario
|
Parameter
|
Explanation
|
|
Degree of Heterogeneity
|
Proportion of twins within each class
|
To determine if the level of heterogeneity affects the classification accuracy.
|
|
Presence of Twin Pairs Exhibiting Bidirectional Causation
|
Proportion of twins exhibiting Bidirectional Causation
|
To assess the impact of bidirectional causation on model performance.
|
|
Phenotypic Mean Difference
|
Difference in trait means across groups
|
To evaluate how differences in phenotypic means between groups influence classification accuracy.
|
|
Causal Effect Size
|
Magnitude of causal relationship
|
To examine the effect of varying causal effect sizes on model accuracy.
|
|
Trait Heritability
|
Proportion of variance due to genetics (A)
|
To understand how heritability affects the model's ability to classify individuals.
|
|
Genetic Confounding
|
Presence of genetic confounding (ra)
|
To investigate the impact of genetic confounding on model classification accuracy.
|
All analyses were performed using R version 4.2.1 (R Core Team, 2021) with models fitted using OpenMx v2.20.6 (Neale et al., 2016). Code is available in a Github repository (https://github.com/Pvinh147/mixDoC). All data were simulated under a bivariate Gaussian mixture model, in which the data are generated from a finite mixture of bivariate Gaussian distributions.
For each simulation, eight separate datasets (four classes for each zygosity group) were generated and merged into MZ and DZ twin datasets. An overview of the simulation designs are provided in Table 1. To model different degrees of heterogeneity, simulations were completed with varying proportions of concordant and discordant twin pairs at fixed parameter values. These values were chosen to explore the feasibility of this mixture model rather than to exhaustively explore the multidimensional space.
A measure of entropy was used to evaluate the accuracy with which we can probabilistically assign individuals to classes based on the results of fitting the mixDoC model. The relative entropy index (Ramaswamy et al., 1993) is expressed as:
$$Ent=1-\frac{\sum _{i=1}^{N}\sum _{i=C}^{C}-{p}_{iC}ln{p}_{iC}}{NlnC}$$
2
where C is the number of classes, pic is the estimated posterior probability for individual i in class C, N is the number of observations. Entropy values range from zero to one, where values closer to one indicate better classification, meaning that the posterior probabilities approach zero or one. To evaluate the entropy of the mixture model, we used parameter estimates averaged from 1,000 replications. Parameters that we varied include: i) strength of the causal effects; ii) class means; iii) modes of inheritance; and iv) presence of genetic confounding.
We fitted the following models to each dataset: i) the novel 4-class-per-zygosity mixture distribution (4-class); ii) the concordant-pairs-only reduced mixture model (2-class); iii) non-mixture X causes Y DoC; iv) non-mixture Y causes X DoC; v) non-mixture bidirectional DoC; and vi) non-mixture bivariate ACE model with rA, rC, and rE estimated, i.e., trait covariance is entirely due to the sharing of A, C, and E factors.