Sample
Data were obtained from sibling pairs (including twins) aged 9–11 years participating in the
Adolescent Brain Cognitive Development (ABCD) study conducted by the National Institute on Drug Abuse. Spanning 22 sites across the United States, the ABCD study is an ongoing longitudinal investigation of child health and development utilizing a nationally representative cohort. A comprehensive description of the range of measures administered as part of the study can be found elsewhere (Volkow et al., 2018). The ABCD cohort comprises N = 11,666 individuals. This analysis focused on a subset of European twin and sibling pairs due to the lack of GWAS on brain morphometry in other ancestries. This subset amounted to N = 1,874 individuals.
Genotyping
Subjects provided saliva and/or blood samples for DNA extraction during their study visit, either by spitting into small tubes or through a blood draw. DNA extraction and genotyping were conducted by the former Rutgers University Cell and DNA Repository (now known as SAMPLED). Genotyping was performed using the Affymetrix Axiom Smokescreen Array, comprising 646,247 markers, including genome-wide association markers, tag SNPs, exonic markers in addiction-related gene regions, fine-mapping markers in loci related to nicotine metabolism, smoking behavior, and comorbidity markers (Baurley et al., 2016). The array design encompassed nearly all variants in the 1000 Genomes Project Phase 1, NHLBI GO Exome Sequencing Project, and HapMap databases pertinent to smoking behavior and nicotine metabolism. Additionally, the array encompassed genome-wide common variation (65.67%, 82.37%, and 90.72% in African (YRI), East Asian (ASN), and European (EUR) populations respectively) (Baurley et al., 2016).
The Affymetrix Power Tools and the Affymetrix Best Practice Workflow were employed for calling and aligning genotypic data using human genome build hg19 (Baurley et al., 2016; Fan et al., 2023). Batch-level quality control of blood and saliva samples, also performed using these tools, resulted in approximately 590K variants per batch. Samples with more than 20% missing rates on genotype calls or variants with more than 10% missingness were excluded from downstream analyses (Anderson et al., 2010; Fan et al., 2023). Additionally, samples with excessive relatedness or implausible numbers of third-degree relatives were excluded. SNPs were phased and imputed using SHAPEIT2 and IMPUTE2. Following the ABCD study's quality control pipeline (Anderson et al., 2010; Fan et al., 2023), the standard step of sex-genotype concordance checking was intentionally not performed, as the study investigators determined that comparing genotyped sex with reported sex/gender information could mischaracterize participants' identities and raise complex interpretational issues regarding biological sex versus gender identity. After quality control procedures, N = 11,389 unique participants and 516,598 genetic variants remained for analysis (Fan et al., 2023). Genetic principal components were calculated using the R-based software PC-AiR, resulting in 158,103 SNPs after pruning. PC-AiR was then run on this pruned set of SNPs, projecting N = 8,005 unrelated individuals and N = 3,384 related individuals. Additionally, N = 2,504 individuals from the 1000 Genomes Project were projected onto the same PC space as the ABCD sample to provide reference points for different ancestral distributions (AFR, African; AMR, Native American; EAS, East Asian; EUR, European; SAS, South Asian). Genetic relatedness independent of any ancestral relatedness was assessed using PC-Relate, employing the same set of pruned SNPs and the first two genetic principal components.
Genetic Ancestry and Principal Component Analysis
Ancestry assignment was performed using the method described by Peterson et al., (2017) on N = 11,666 individuals. Ancestry outliers, defined as individuals greater than three standard deviations from from the center of their assigned superpopulation in PCA space, were excluded. The numbers assigned to each 1000 Genomes super population was as follows: N = 2,170 to AFR, 2,909 to AMR, 188 to EAS, 5,815 to EUR, and 372 to SAS. Within-ancestry principal component analysis was conducted for EUR individuals using PC-AiR, following the same method used for trans-ancestry PCA in ABCD Release 5.0. Only individuals with
Europeanderived ancestry were selected for the current analyses due to the absence of brain structure GWAS summary statistics from other ancestral groups.
Post-Imputation QC and SNP Processing
Imputation was performed for individuals with European ancestry using the 1000 Genomes reference panel. Standard GWAS QC was performed with variants with minor allele frequency
(MAF) < 0.001 and variants failing the Hardy-Weinberg test at a significance threshold of 1 x 1010 being excluded. After QC, a total of N = 13,549,177 SNPs remained for analysis.
Before using PRS-CS, we screened the input GWAS summary statistics files for duplicate SNPs. Duplicates were only found in the summary statistics from the thalamus volume GWAS (N = 17 duplicate SNPs), which were removed to ensure each SNP was represented only once in the PRS calculation.
Genome-Wide Association Study (GWAS) Summary Statistics
Polygenic risk scores (PRS) were estimated for each subject, including a CUD PRS and a SU/SUD PRS. For the SU/SUD PRS, summary statistics from Hatoum et al.’s (2023) multivariate meta-analytic GWAS of CUD, problematic alcohol use, problematic tobacco use, and opioid use disorder were utilized. The CUD GWAS summary statistics used in Johnson et al.'s (2020) multivariate meta-analytic GWAS were also employed to generate a separate CUD PRS (Johnson et al., 2020). Full descriptions of the SNP quality control, alignment, ancestry assignment, and imputation procedures for each discovery GWAS are detailed elsewhere (Fan et al., 2023). Cortical and subcortical GWAS summary statistics were utilized to estimate cortical and subcortical PRSs. For the hippocampus, GWAS summary statistics were derived from a
meta-analysis of mean bilateral hippocampal volume from the ENIGMA and CHARGE
consortium. Subcortical GWAS summary statistics were obtained from various studies within the CHARGE consortium, ENIGMA consortium, and UK Biobank (Bycroft et al., 2018; Psaty et al., 2009; Thompson et al., 2014).
Polygenic Risk Score (PRS) Calculation
PRSs were estimated using Polygenic Risk Score-Continuous Shrinkage (PRS-CS) [https://github.com/getian107/PRScs] method. This method utilizes a Bayesian framework to infer posterior effect sizes of SNPs using genome-wide association summary statistics and an external linkage disequilibrium (LD) panel (Ge et al., 2019). The method then applies a continuous shrinkage (CS) prior to SNP effect sizes in order to shrink many small effect sizes toward zero and allow a small subset of variants with stronger signals to retain larger effect sizes, thereby reducing noise and avoiding overfitting. We wanted to model all common variation of general addiction liability, cannabis use disorder, and brain structures, so utilizing PRS-CS was the most appropriate. The 1000 Genomes European (EUR) Phase 3 reference panel, which was provided with the PRS-CS software, was used as the reference panel. Standard GWAS QC was previously performed on the PRS-CS reference panel which overlapped with 1,062,342 SNPs for CUD, 1,060,634 SNPs for cortical surface area, 1,061,323 SNPs for cortical thickness, 1,042,513 SNPs for brainstem, 1,056,811 for hippocampus, 1,042,259 for nucleus accumbens, 1,039,048 for putamen, 1,038,846 for caudate, 1,041,942 for thalamus. PRSs were estimated for: a general addiction factor (Hatoum et al., 2023); cannabis use disorder (Johnson et al., 2020); cortical surface area and cortical thickness (Grasby et al., 2020); hippocampus, nucleus accumbens, caudate nucleus, putamen and thalamus (Satizabal et al., 2019). Each set of summary statistics was checked for duplicates, and 17 duplicate SNPs were found in the thalamus summary statistics. Uncorrected associations between the eight brain PRSs and their corresponding brain regions are shown in Table 1. Prior to model fitting, all PRSs were standardized to a mean of zero and a standard deviation of one.
Exclusion Criteria
Parents reported their child’s DSM-V-based symptoms of alcohol and drug use disorders using a self-administered computerized version of the Kiddie Schedule of Affective Disorders and Schizophrenia Scale for School-Aged Children (KSADS-COMP), which is used to assess psychiatric disorders in children and adolescents, including drug use (Kaufman et al., 2021). Subjects with any reported substance use (N = 63) based on the KSADS-COMP parental report were excluded. This report did not address individuals' use of caffeine.
Imaging
All structural MRI images were acquired on 3T (Siemens, Phillips, and GE) scanners with 1 mm isotropic T1-weighted structural images, using either a 32-channel head or 64-channel headandneck coil. Total intracranial volume and the volumes of six subcortical structures were extracted: thalamus, caudate, putamen, hippocampus, amygdala, and nucleus accumbens. As this was a pediatric population prone to excessive head motion in the scanner, prospective motion correction on the GE and Volumetric Navigators on the Siemens platforms were done to mitigate this concern by applying real-time motion detection and correction. The Multi-Modal Processing Stream software package, which includes FreeSurfer 5.3 (Fischl et al., 2004; Ségonne et al., 2004, 2007), was used to process MRI data. After the MRI images are corrected for distortions and head motion and cross-modality associations are performed, the cortical surface is reconstructed, and subcortical and white matter regions of the brain are segmented. Following MRI quality control procedures (Hagler et al., 2019), subcortical volumes (hippocampus, nucleus accumbens, putamen, thalamus, caudate, amygdala) were averaged across the left and right hemispheres. Total intracranial volume and the volumes of six subcortical structures were extracted: thalamus, caudate, putamen, hippocampus, amygdala, and nucleus accumbens. For each ROI, the effects of age, sex, total intracranial volume, and the first ten genetic principal components were removed using the modelr package in R (R version 4.3.2, http://www.rproject.org). Across ROI’s, variation in the principal components collectively explained up to 0.6% of the variance in each ROI (cortical surface area: 0.002, cortical thickness: 0.006, hippocampus: 0.003, thalamus: 0.006, caudate: 0.004, nucleus accumbens: 0.004, amygdala:
0.004, putamen: 0.004).
Volumes for subcortical structures, including the hippocampus, nucleus accumbens, putamen, thalamus, caudate, amygdala, were extracted from FreeSurfer's automatic segmentation process. The FreeSurfer pipeline automatically segments subcortical structures using a probabilistic atlas and applies deformable templates to delineate structures. The volume of each structure is calculated as the sum of the voxel-wise segmentations multiplied by voxel volume. FreeSurfer outputs were reviewed for errors, and manual corrections were applied as necessary. An overview of the quality control procedures for subcortical volumes is provided in the ABCD 5.0 Release Notes (https://wiki.abcdstudy.org/release-notes/).
Cortical surface area measures were derived from FreeSurfer's automated reconstruction pipeline. This pipeline constructs a surface mesh for each hemisphere of the brain, with the boundary between gray matter and white matter defined as the pial surface and the boundary between gray matter and cerebrospinal fluid defined as the gray-white matter boundary. Surface area was calculated as the sum of the areas of all triangles in the surface mesh. Quality control procedures for cortical surface area are outlined in the ABCD 5.0 Release Notes.
Statistical analyses
The OpenMx 2.21.11 software package (M. C. Neale et al., 2016) in R 4.3.2 (R Development
Core Team, 2018) with the Constrained Sequential Optimal Linear and Nonlinear Programming (CSOLNP) optimizer was used to test basic assumptions of mean and variance homogeneity, calculate phenotypic and twin pair correlations and their 95% confidence intervals, and to fit univariate and multivariate genetically informative twin models(M. Neale & Cardon, 1992).
Tests of mean and variance homogeneity
Mean and variance homogeneity within twin and sibling pairs, as well as across zygosity and sex, was tested (M. Neale & Cardon, 1992). Under the null hypothesis, when randomly sampling twins from the population, one can expect equal means and variances across and within MZ and DZ twins, assuming there are no sex differences in the variable of interest (Jinks & Fulker, 1970). Prior to estimating phenotypic and twin pair correlations and model fitting, we tested basic assumptions concerning mean and variance homogeneity by constraining mean and variance estimates to be equal within twin pairs, across zygosity and regular sib-pair groups, and across sex. Accepting the null hypotheses of neither mean nor variance differences across the MZ, DZ, and regular sib-pair groups indicates that the distributions of the data are consistent with several assumptions of the classical twin study. In particular, there is no evidence of sibling interaction, which would generate different total variances of MZ and DZ twins when some genetic variation is present. Departures from multivariate normality would also increase the likelihood that these equality tests would fail.
Phenotypic twin pair correlations
Phenotypic and twin pair correlations were computed using full maximum likelihood in OpenMx (version 2.21.1)(Neale et al., 2016). Regarding twin pair correlations, if familial aggregation is entirely attributable to shared family environments, then monozygotic (MZ) and dizygotic (DZ) twin pair correlations will be statistically equal. In contrast, if familial aggregation is entirely attributable to shared additive (or non-additive) genetic factors, then DZ twin pair correlations will be ½ (or less) the size of their MZ twin pair counterparts.
Univariate Analyses
In univariate analyses, the total variation in each ROI was estimated as the sum of additive genetic (A), shared or common environmental (C), and non-shared or unique (E) environmental variance components using an ‘ACE’ variance component model (Verhulst & Neale, see Fig. 1). Note that the non-shared environmental effects (E) are, by definition, uncorrelated and include measurement error. The association of variance with these sources exploits the expected genetic and environmental correlations between MZ and DZ twin pairs. MZ twin pairs are genetically almost identical, whereas DZ twin pairs share, on average, half of their genes. Under the equal environments assumption, the model assumes that shared environmental effects (C) are equal in MZ and DZ twin pairs, and that the magnitude of genetic and environmental effects is the same for both twins in a pair. Therefore, the MZ and DZ twin pair correlations (rA) for additive genetic effects are fixed to 1.0 and 0.5, respectively, as opposed to the shared environmental twin pair correlation (rc), which is fixed to 1.0 regardless of zygosity. Given that non-twin sibling pairs also share, on average, half of their genes, we included non-twin sibling pairs as a separate group in addition to the DZ group after establishing mean and variance homogeneity. Note that since our analyses relied on a non-clinically ascertained community dwelling sample, we herein refer to all A, C, and E variance components as genetic and environmental ‘influences.’These variance components assume any observed variation in normal development comprises both risk and protective factors.
This figure illustrates a classical twin design univariate analysis that decomposes variation in a brain region of interest (ROI) into additive genetic (A1), shared environmental (C1), and unique environmental (E1) components. Single-headed arrows represent regressions, while double headed arrows represent correlations. The shared environmental correlation (rC) is fixed to 1.0 for both MZ and DZ twins, reflecting the equal environments assumption. Cortical SA = Total cortical surface area (example ROI), rMZ = Expected correlation for monozygotic twins (fixed to 1.0). rDZ = Expected correlation for dizygotic twins (fixed to 0.5). A1 = Additive genetic factors for Twin 1 and Twin 2, C1 = Shared environmental factors for Twin 1 and Twin 2. E1 = unique environmental factors for Twin 1 and Twin 2. a11, c11, e11 represent the variance components for the latent factors representing the influence of A1, C1, and E1 on the ROI.
Multivariate analyses
The univariate model was extended to the multivariate case to estimate the size and significance of the impact of ROI-specific and drug-related PRSs to variance in each of the 8 ROIs. As illustrated in Fig. 2, the contributions of the SU/SUD and ROI-specific PRSs to the latent variance at each ROI are denoted by β31 and β21 pathway coefficients, respectively. To allow for background confounding, PRSs were also allowed to correlate via the a32 double-headed arrow. Variance in the latent ROI factor not captured by the two PRSs is explained by the residual additive genetic (A), common (C), and non-shared (E) environmental influences denoted by the a11, c11, and e11 variance components respectively. Finally, to scale the relative contribution of the PRSs, we estimated their standard deviations (δ11 & δ22) while constraining the latent PRS variances to one (Dolan et al., 2021). An equivalent approach would be to standardize the PRS data prior to analysis.
Multivariate twin model incorporating polygenic risk scores (PRSs) for brain structure and substance use. This figure illustrates a multivariate twin model that incorporates PRSs to examine the genetic and environmental influences on brain structure, using cortical surface area as an example. The model includes additive genetic (A), shared environmental (C), and unique environmental (E) influences on cortical surface area, as well as the effects of PRSs for cortical surface area (β21) and general liability to SU and SUD (SU/SUD; β31). The model allows for a correlation between the two PRSs (double-headed arrow between PRS variables). Each multivariate analysis followed a two-step approach: First, the best-fitting univariate model (ACE, AE, CE, or E) for each region of interest was determined; next, the significance of both the ROIspecific and the CUD/SUD PRSs was tested. This approach enabled us to test competing nested models representing alternative explanations for the sources of variance in brain structure, including models excluding either the SU/SUD PRS or the CUD PRS.
Model Fit
In the univariate case, we determined the most likely sources of variance by fitting three additional sub-models in which the (i) C, (ii) A, and (iii) C and A influences were fixed to zero, i.e., we compared the likelihoods of the AE, CE, and E models, respectively. In the multivariate case, once the best fitting model was chosen, i.e., i) ACE, ii) AE, iii) CE, or iv) E, we then estimated the association of each ROI with the ROI PRS and the SU/SUD PRS, as illustrated by the β21 and β31 pathways coefficients, respectively (See Fig. 2). Models were compared using the change in the minus two Log-Likelihood (Δ-2LL) and Akaike’s Information Criterion (AIC) which balances model fit against model complexity (Akaike, 1987).
The multivariate analysis was then re-run, substituting the CUD PRS for the SU/SUD PRS. The significances of the A, C, and E parameters, in the univariate and multivariate analyses and the β31 and β21 in the multivariate case were determined using the change in the minus two LogLikelihood (Δ-2LL). Under certain regularity conditions, this change is asymptotically distributed as a chi-square with the degrees of freedom equal to the difference in the number of free parameters in the two models (Steiger et al., 1985). The selection of the most suitable model used AIC (Akaike, 1987). In addition to examining the AIC, we looked at the difference in -2 loglikelihood (-2LL) between models. Model identification was confirmed for each model with the OpenMx utility mxCheckIdentification (Hunter et al., 2021). The significance of the β21 and β31 parameters in Fig. 2 was determined by examining the change in -2 loglikelihood (-2LL) by dropping β21 and β31.
For each ROI, we began by determining the most likely sources of variance by modeling all five sources of variation (A + C + E + 2 PRSs), followed by three nested sub-models in which the i) C, ii) A, and iii) C and A influences were successively fixed to zero. Based on the best-fitting ACE, AE, or CE model, we then dropped the β31 and β21 pathway coefficients to determine the significance of the impact of the PRSs on each ROI. This model-fitting strategy was applied to all eight ROIs separately, beginning with the CUD PRS. We repeated this model fitting strategy after substituting the CUD PRS for the SU/SUD PRS.