Introduction

Canonical correlation analysis (CCA) is a method for assessing the correlation between two sets of variables. It is appropriate in the same situations where multiple regression would be, but where there are many intercorrelated outcome variables.

The following example is a replication of the paper “Examination of the relationships between environmental exposures to volatile organic compounds (VOCs) and biochemical liver tests: Application of canonical correlation analysis” (Liu 2009).

Rather than only repeating that examination, we further investigate the correlation in the subgroup of people who drink more than 12 times per year.

The typical purposes of CCA are:

1. Data reduction: explain covariation between two sets of variables using a small number of linear combinations.
2. Data interpretation: find features (canonical variates) that are important for explaining covariation between sets of variables.

Note: Canonical correlation terminology makes an important distinction between the words variables and variates. The term variables is reserved for the original variables being analyzed. The term variates refers to variables that are constructed as weighted sums (linear combinations) of the original variables. Thus, a set of Y variates is constructed from the original Y variables.

Derivation

If we have two vectors \(X = (X_1, X_2, \dots, X_n)^T\) and \(Y = (Y_1, Y_2, \dots, Y_m)^T\) of random variables, and there are correlations among the variables, then canonical correlation analysis seeks the vectors \(a \in \mathbb{R}^n\) and \(b \in \mathbb{R}^m\) such that the linear combinations \(a^TX\) and \(b^TY\) maximize the correlation \(\rho = corr(a^TX,b^TY)\). In short, it can be expressed as:

\[(a,b)=arg\max_{a,b}corr(a^TX,b^TY)\]

Let \(\Sigma_{XX}=cov(X,X)\), \(\Sigma_{YY}=cov(Y,Y)\) and \(\Sigma_{XY}=cov(X,Y)=\Sigma_{YX}^T\). The quantity to maximize is \(\rho={{a^T\Sigma_{XY}b}\over{\sqrt{a^T \Sigma_{XX}a}\sqrt{b^T \Sigma_{YY}b}}}\).

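To define a change of basis, we set:

\[c=\Sigma_{XX}^{\frac{1}{2}}a\]

\[d=\Sigma_{YY}^{\frac{1}{2}}b\]

Substituting \(a=\Sigma_{XX}^{-\frac{1}{2}}c\) and \(b=\Sigma_{YY}^{-\frac{1}{2}}d\), so that \(a^T\Sigma_{XX}a=c^Tc\) and \(b^T\Sigma_{YY}b=d^Td\), we have: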

\[\rho={{c^T \Sigma_{XX}^{-\frac{1}{2}}\Sigma_{XY}\Sigma_{YY}^{-\frac{1}{2}}d}\over {\sqrt{c^Tc}\sqrt{d^Td}}}\]

By the Cauchy-Schwarz inequality, we have:

\[(c^T \Sigma_{XX} ^{-\frac{1}{2}} \Sigma_{XY} \Sigma_{YY} ^{-\frac{1}{2}}) d \leq (c^T \Sigma_{XX} ^{-\frac{1}{2}} \Sigma_{XY} \Sigma_{YY} ^{-1} \Sigma_{YX} \Sigma_{XX} ^{-\frac{1}{2}} c)^{\frac{1}{2}} (d^T d)^{\frac{1}{2}}\]

Dividing both sides by \(\sqrt{c^Tc}\sqrt{d^Td}\) cancels the term \((d^T d)^{\frac{1}{2}}\), and we have:

\[\rho \leq {{(c^T\Sigma_{XX}^{-\frac{1}{2}}\Sigma_{XY}\Sigma_{YY}^{-1}\Sigma_{YX}\Sigma_{XX}^{-\frac{1}{2}}c)^{\frac{1}{2}}}\over{\sqrt{c^Tc}}}\]

There is equality if the vectors \(d\) and \((c^T \Sigma_{XX} ^{-\frac{1}{2}} \Sigma_{XY} \Sigma_{YY} ^{-\frac{1}{2}})^T\) are collinear. In addition, the maximum correlation is attained if \(c\) is the eigenvector with the largest eigenvalue of the matrix \(\Sigma_{XX} ^{-\frac{1}{2}} \Sigma_{XY} \Sigma_{YY} ^{-1} \Sigma_{YX} \Sigma_{XX} ^{-\frac{1}{2}}\); that eigenvalue is the square of the first canonical correlation.

The subsequent pairs are found by using eigenvalues of decreasing magnitudes. Orthogonality is guaranteed by the symmetry of the correlation matrices.
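The derivation can be checked numerically. The following is a minimal sketch in base R on simulated data (the data, the helper `inv_sqrt` and all object names are assumptions for illustration, not part of the original analysis). It builds the matrix from the eigenvalue problem above and compares the result with base R's `cancor()`.

```r
# Simulated data (hypothetical; not the NHANES data used in this tutorial)
set.seed(1)
n <- 500
X <- matrix(rnorm(n * 3), n, 3)
Y <- cbind(X[, 1] + rnorm(n), X[, 2] - X[, 3] + rnorm(n))

Sxx <- cov(X)
Syy <- cov(Y)
Sxy <- cov(X, Y)

# Inverse symmetric square root of a positive-definite matrix
inv_sqrt <- function(S) {
  e <- eigen(S, symmetric = TRUE)
  e$vectors %*% diag(1 / sqrt(e$values)) %*% t(e$vectors)
}

# Matrix whose eigenvalues are the squared canonical correlations
Sxx_ih <- inv_sqrt(Sxx)
M <- Sxx_ih %*% Sxy %*% solve(Syy) %*% t(Sxy) %*% Sxx_ih
eig <- eigen(M, symmetric = TRUE)

rho <- sqrt(eig$values[1:2])          # min(3, 2) = 2 canonical correlations
a1 <- Sxx_ih %*% eig$vectors[, 1]     # a = Sxx^{-1/2} c
b1 <- solve(Syy) %*% t(Sxy) %*% a1    # b is proportional to Syy^{-1} Syx a

rho_check <- as.numeric(cor(X %*% a1, Y %*% b1))  # correlation of the first variate pair
rho_cancor <- cancor(X, Y)$cor                    # reference values from base R
```

Under these assumptions `rho`, `rho_cancor` and `rho_check` should agree up to numerical error, which is a quick way to confirm the direction of the change of basis.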

Data description

The VOC Project of personal exposures to air toxics was conducted among a subsample of NHANES 1999-2000 participants between the ages of 20 and 59 years. This tutorial includes 565 observations with 10 VOCs: benzene, chloroform, ethylbenzene, tetrachloroethene, toluene, trichloroethene, o-xylene, m,p-xylene, 1,4-dichlorobenzene, and methyl tert-butyl ether (MTBE). Liver condition serves as the set of outcome variables and was measured by albumin, total bilirubin (TB), alanine aminotransferase (ALT), aspartate aminotransferase (AST), lactate dehydrogenase (LDH), alkaline phosphatase (ALP) and γ-glutamyltransferase (GGT). Descriptive statistics for the subgroup of people who drink more than 12 times per year are also listed below.

Descriptive statistics of personal exposure to 10 VOCs and biochemical liver tests

Figure 1. Personal exposures to 10 VOCs.
Variable               n     min    mean  median  90th percentile  95th percentile
1,4-Dichlorobenzene  556    0.62   46.96    2.08           104.93           257.94
Benzene              557    1.25    5.36    2.88            11.16            17.22
o-Xylene             556    0.31    6.17    2.11            12.13            21.59
m,p-Xylene           556    0.34   17.76    5.97            33.99            61.17
Ethylbenzene         552    0.20    7.69    2.33            11.53            21.94
MTBE                 555    0.60    6.13    0.60            13.69            23.47
Toluene              548    2.69   34.54   16.02            54.64            92.24
Chloroform           561    0.32    2.69    1.14             6.16            10.36
Tetrachloroethene    553    0.18    4.59    0.78             6.24            16.15
Trichloroethene      556    0.27    4.09    0.27             1.19             6.75

Figure 2. Biochemical liver tests.
Variable               n     min    mean  median  90th percentile  95th percentile
Albumin              555    33.0   44.37    45.0             49.0             50.0
ALT                  555     7.0   26.46    20.0             45.6             57.0
ALP                  555    28.0   81.47    78.0            112.6            129.0
AST                  555     9.0   23.99    21.0             34.0             40.3
GGT                  555     5.0   30.13    20.0             51.0             74.0
LDH                  555    63.0  147.38   144.0            182.0            195.3
TB                   555     1.7    9.21     8.6             13.7             17.1

Descriptive statistics of personal exposure to 10 VOCs and biochemical liver tests(subgroup)

Figure 3. Personal exposures to 10 VOCs (subgroup).
Variable               n     min    mean  median  90th percentile  95th percentile
1,4-Dichlorobenzene  356    0.62   45.29    1.83            62.80           223.88
Benzene              359    1.25    5.48    2.88            10.95            17.66
o-Xylene             358    0.31    6.91    2.23            12.48            24.55
m,p-Xylene           358    0.34   20.78    6.34            37.13            80.74
Ethylbenzene         356    0.20    9.57    2.46            12.96            27.76
MTBE                 356    0.60    5.96    0.60            13.23            22.58
Toluene              351    2.69   38.15   16.13            54.61            92.12
Chloroform           361    0.32    2.44    1.11             5.10             8.77
Tetrachloroethene    356    0.18    4.98    0.82             5.38            14.76
Trichloroethene      360    0.27    2.93    0.27             1.17             7.53

Figure 4. Biochemical liver tests (subgroup).
Variable               n     min    mean  median  90th percentile  95th percentile
Albumin              357    33.0   44.79    45.0             49.0            50.00
ALT                  357     8.0   28.67    22.0             51.4            65.00
ALP                  357    28.0   79.68    76.0            109.0           122.40
AST                  357     9.0   25.21    22.0             35.0            45.20
GGT                  357     5.0   34.41    21.0             62.4            93.00
LDH                  357    63.0  147.42   143.0            187.0           199.00
TB                   357     3.4    9.55     8.6             15.4            17.44

Note: To avoid potential confounding, we excluded participants who had liver conditions, heart disease, stroke, cancer or diabetes. Those who tested seropositive for hepatitis C virus (HCV) were also excluded. One observation with extreme values of ALT (1163 U/L) and AST (827 U/L) was also excluded.

Comparison of Blom normal transformation

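library(car)  # provides scatterplotMatrix()
# The two xylene variables used as the example
part <- c('o-Xylene','m,p-Xylene')
# Scatterplot matrix for each data set (combine1 and combine2 are data.tables)
scatterplotMatrix(combine1[,..part])

scatterplotMatrix(combine2[,..part])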

To satisfy the assumptions of CCA, we also transformed the VOC and liver function test variables from their ranks to Blom normal scores (\(s = \Phi^{-1}(\frac{r-3/8} {n +1/4})\), where \(r\) is the rank of an observation, \(n\) is the number of observations and \(\Phi^{-1}\) is the standard normal quantile function) so that multivariate normality is not grossly violated. Here we only use o-Xylene and m,p-Xylene as an example, which shows that after transformation their distributions become approximately normal.
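The Blom transformation is short enough to write out. The sketch below is a base R illustration, assuming a hypothetical helper named `blom` and simulated right-skewed values in place of the real exposure data.

```r
# Blom normal scores: s = qnorm((r - 3/8) / (n + 1/4)), with r the rank of each value
blom <- function(x) {
  r <- rank(x, na.last = "keep")   # ties get average ranks; NA stays NA
  n <- sum(!is.na(x))              # number of non-missing observations
  qnorm((r - 3/8) / (n + 1/4))
}

set.seed(2)
skewed <- rexp(200)      # hypothetical right-skewed measurements
scores <- blom(skewed)   # rank-based normal scores, same ordering as the input
```

Because the scores depend only on ranks, the transformation preserves the ordering of the observations while discarding the original units.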

Software

CCA output can be fairly complex. Quantities of interest include raw coefficients, structural correlations (or loadings) and standardized coefficients. The first two are the primary focus of this project. If you want to know more, please read the discussion section and related references below.

R

This tutorial uses the following packages.

library(yacca)
library(data.table)
library(dplyr)

Canonical structures of the first pair of canonical variates and F-test (full group)

#Display raw canonical coefficients
cca.fit$xcoef
cca.fit$ycoef
Figure 5. Canonical coefficients of personal exposures to VOCs.
CV 1 CV 2 CV 3 CV 4 CV 5 CV 6 CV 7
1,4-Dichlorobenzene 0.08 0.83 0.08 0.37 0.11 0.09 -0.25
Benzene 0.81 0.08 -0.59 -0.30 0.45 0.55 0.18
o-Xylene 0.62 0.69 0.95 0.34 -0.64 -0.88 -1.04
m,p-Xylene -0.24 -0.63 -0.56 -0.20 -0.13 1.46 0.30
Ethylbenzene -0.03 -0.29 0.08 0.82 0.60 -0.39 0.42
MTBE 0.32 -0.02 0.05 -0.66 -0.11 -0.54 -0.64
Toluene -0.18 -0.22 0.16 -0.21 0.01 -0.98 0.55
Chloroform -0.35 -0.29 -0.04 0.07 0.81 -0.07 -0.46
Tetrachloroethene -0.07 0.21 0.82 -0.39 0.21 0.49 0.47
Trichloroethene -0.03 0.51 -0.74 -0.06 0.08 -0.49 0.39
Figure 6. Canonical coefficients of biochemical liver tests.
CV 1 CV 2 CV 3 CV 4 CV 5 CV 6 CV 7
Albumin 0.97 -0.24 0.64 0.20 -0.09 0.06 -0.07
ALT 0.09 -0.63 -0.69 0.27 0.17 -0.90 1.16
ALP 0.52 0.61 -0.51 0.15 -0.39 -0.22 -0.28
AST -0.14 0.02 0.44 -1.19 -0.71 -0.03 -0.88
GGT 0.04 0.11 -0.22 -0.32 0.60 1.18 0.05
LDH -0.09 0.65 0.56 0.13 0.65 -0.27 0.37
TB -0.63 0.40 -0.01 0.34 -0.66 0.37 0.51

The raw canonical coefficients are interpreted in a manner analogous to regression coefficients. For example, a one unit increase in (transformed) Benzene is associated with a 0.81 increase in the first canonical variate of the VOC set when all the other variables are held constant.

Next, we compute the loadings of the variables on the canonical dimensions (variates). Canonical loadings are the correlations between the observed variables and the canonical variates. These canonical variates are a type of latent variable.
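To make the definition of a loading concrete, here is a minimal sketch on simulated data using base R's `cancor()` rather than the yacca fit used in this tutorial (the data and object names are assumptions). The variates are formed from the centered variables and the raw coefficients, and the loadings are then plain correlations.

```r
set.seed(3)
n <- 300
X <- matrix(rnorm(n * 3), n, 3, dimnames = list(NULL, c("x1", "x2", "x3")))
Y <- cbind(y1 = X[, 1] + rnorm(n), y2 = X[, 2] + rnorm(n))

fit <- cancor(X, Y)   # base R canonical correlation analysis

# Canonical variates: centered variables times raw coefficients
U <- scale(X, center = fit$xcenter, scale = FALSE) %*% fit$xcoef
V <- scale(Y, center = fit$ycenter, scale = FALSE) %*% fit$ycoef

# Loadings: correlations between original variables and their own variates
x_loadings <- cor(X, U)
y_loadings <- cor(Y, V)

first_pair_cor <- abs(cor(U[, 1], V[, 1]))   # should match fit$cor[1]
```

The rows of `x_loadings` correspond to the original variables and the columns to the canonical variates, which is the same layout as the loading tables shown in this section.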

cca.fit$corr
cca.fit$xstructcorr #Loadings
cca.fit$ystructcorr
F.test.cca(cca.fit)
##       CV 1       CV 2       CV 3       CV 4       CV 5       CV 6 
## 0.31838580 0.23136020 0.19211314 0.11436996 0.10021086 0.06559313 
##       CV 7 
## 0.02455458
Figure 7. Loadings of variables (VOCs).
CV 1 CV 2 CV 3 CV 4 CV 5 CV 6 CV 7
1,4-Dichlorobenzene 0.09 0.73 0.11 0.35 0.26 -0.03 -0.19
Benzene 0.87 -0.10 -0.12 -0.01 0.39 0.08 0.13
o-Xylene 0.70 -0.12 0.38 0.38 0.10 -0.19 0.01
m,p-Xylene 0.67 -0.21 0.29 0.37 0.16 -0.08 0.12
Ethylbenzene 0.58 -0.22 0.26 0.49 0.26 -0.21 0.24
MTBE 0.43 0.01 0.14 -0.55 0.11 -0.35 -0.32
Toluene 0.45 -0.18 0.29 0.16 0.25 -0.51 0.33
Chloroform -0.21 -0.11 0.08 0.00 0.84 -0.09 -0.38
Tetrachloroethene 0.07 0.27 0.63 -0.39 0.37 0.13 0.39
Trichloroethene 0.02 0.48 -0.29 -0.13 0.22 -0.31 0.38
Figure 8. Loadings of variables (biochemical liver tests).
CV 1 CV 2 CV 3 CV 4 CV 5 CV 6 CV 7
Albumin 0.68 -0.13 0.48 0.00 -0.32 0.22 0.38
ALT 0.28 -0.06 -0.19 -0.56 -0.11 -0.16 0.73
ALP 0.51 0.66 -0.49 -0.11 -0.19 -0.10 0.00
AST 0.13 0.14 0.14 -0.87 -0.29 -0.09 0.32
GGT 0.33 0.16 -0.23 -0.55 0.16 0.54 0.45
LDH 0.06 0.64 0.38 -0.33 0.32 -0.26 0.40
TB -0.11 0.24 0.20 0.03 -0.65 0.33 0.60
## 
##  F Test for Canonical Correlations (Rao's F Approximation)
## 
##           Corr         F    Num df Den df    Pr(>F)    
## CV 1  0.318386  1.821520 70.000000 3202.2 4.284e-05 ***
## CV 2  0.231360  1.244772 54.000000 2804.0    0.1096    
## CV 3  0.192113  0.912402 40.000000 2400.2    0.6283    
## CV 4  0.114370  0.557007 28.000000 1988.1    0.9709    
## CV 5  0.100211  0.461607 18.000000 1561.8    0.9733    
## CV 6  0.065593  0.272125 10.000000 1106.0    0.9871    
## CV 7  0.024555  0.083556  4.000000  554.0    0.9875    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

A cutoff value of 0.35 (in absolute value) is chosen to select important loadings.

The first canonical correlation coefficient was 0.318 and was statistically significant (F = 1.82, p < 0.001), indicating that the two sets of variables were correlated. From the loadings on the first pair of canonical variates, we found that personal exposure to Benzene, o-Xylene, m,p-Xylene, Ethylbenzene, MTBE and Toluene as a group might affect the serum levels of Albumin and ALP.
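The selection by cutoff can be reproduced directly from the CV 1 columns of Figures 7 and 8. The snippet below is a small illustration in base R; the vectors are typed in by hand from those tables and the object names are assumptions.

```r
# CV 1 loadings copied from Figure 7 (VOCs) and Figure 8 (liver tests)
voc_cv1 <- c("1,4-Dichlorobenzene" = 0.09, Benzene = 0.87, "o-Xylene" = 0.70,
             "m,p-Xylene" = 0.67, Ethylbenzene = 0.58, MTBE = 0.43,
             Toluene = 0.45, Chloroform = -0.21, Tetrachloroethene = 0.07,
             Trichloroethene = 0.02)
liver_cv1 <- c(Albumin = 0.68, ALT = 0.28, ALP = 0.51, AST = 0.13,
               GGT = 0.33, LDH = 0.06, TB = -0.11)

cutoff <- 0.35
important_vocs  <- names(voc_cv1)[abs(voc_cv1) >= cutoff]
important_liver <- names(liver_cv1)[abs(liver_cv1) >= cutoff]
```

With these inputs the selected names are the six VOCs and the two liver tests named in the interpretation above.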

Canonical structures of the first pair of canonical variates and F-test (subgroup)

#Display raw canonical coefficients (subgroup)
cca.fit1$xcoef
cca.fit1$ycoef
Figure 9. Canonical coefficients of personal exposures to VOCs.
CV 1 CV 2 CV 3 CV 4 CV 5 CV 6 CV 7
1,4-Dichlorobenzene -0.09 0.79 -0.35 0.44 0.19 0.05 0.30
Benzene 0.95 -0.13 -0.51 -0.20 0.31 -0.16 0.28
o-Xylene 0.39 1.14 0.80 -0.70 0.03 -0.53 0.17
m,p-Xylene -0.08 -0.62 -0.23 1.10 -0.26 1.21 -1.89
Ethylbenzene -0.20 -0.19 0.13 0.24 0.77 -0.79 1.00
MTBE 0.39 0.00 -0.24 -0.32 -0.19 -0.13 -0.02
Toluene -0.28 -0.13 0.47 -0.26 -0.97 -0.32 0.60
Chloroform -0.38 -0.22 0.12 -0.26 0.51 -0.63 -0.29
Tetrachloroethene -0.16 0.47 0.22 -0.69 0.28 0.48 0.06
Trichloroethene -0.12 0.18 -0.61 0.16 -0.71 -0.61 -0.55
Figure 10. Canonical coefficients of biochemical liver tests.
CV 1 CV 2 CV 3 CV 4 CV 5 CV 6 CV 7
Albumin 0.85 0.23 0.81 0.05 -0.01 0.13 -0.05
ALT 0.00 -0.71 -0.06 0.39 -1.51 -0.43 0.71
ALP 0.30 0.32 -0.39 0.67 -0.19 -0.10 -0.63
AST -0.01 -0.35 0.12 -0.84 0.60 0.27 -1.37
GGT 0.42 0.01 -0.68 -0.28 1.02 0.23 0.62
LDH -0.07 0.90 -0.09 -0.47 -0.41 0.06 0.43
TB -0.69 -0.01 -0.17 0.41 -0.11 0.91 0.11

Compared to the full group, Benzene has a larger coefficient in the subgroup: a one unit increase in (transformed) Benzene is associated with a 0.95 increase in the first canonical variate of the VOC set when all the other variables are held constant.

cca.fit1$corr
cca.fit1$xstructcorr
cca.fit$xstructcorr
cca.fit1$ystructcorr
cca.fit$ystructcorr
F.test.cca(cca.fit1)
##       CV 1       CV 2       CV 3       CV 4       CV 5       CV 6 
## 0.34584676 0.27173286 0.20181075 0.14155267 0.08342351 0.07952745 
##       CV 7 
## 0.03187424
Figure 11. Loadings of variables (VOCs).
CV 1 CV 2 CV 3 CV 4 CV 5 CV 6 CV 7
1,4-Dichlorobenzene -0.08 0.74 -0.20 0.33 0.14 -0.15 0.27
Benzene 0.82 0.11 0.09 -0.06 0.09 -0.32 0.05
o-Xylene 0.47 0.38 0.65 0.19 -0.01 -0.29 -0.27
m,p-Xylene 0.46 0.28 0.60 0.27 -0.01 -0.25 -0.33
Ethylbenzene 0.32 0.21 0.51 0.30 0.12 -0.44 -0.02
MTBE 0.41 0.10 -0.16 -0.42 -0.10 -0.21 -0.10
Toluene 0.23 0.21 0.58 -0.01 -0.42 -0.39 0.22
Chloroform -0.27 -0.02 0.05 -0.32 0.42 -0.59 -0.26
Tetrachloroethene -0.08 0.51 0.08 -0.65 0.07 0.11 -0.13
Trichloroethene -0.15 0.31 -0.43 -0.10 -0.38 -0.44 -0.39
Figure 12. Loadings of variables (biochemical liver tests).
CV 1 CV 2 CV 3 CV 4 CV 5 CV 6 CV 7
Albumin 0.69 0.00 0.44 0.05 -0.14 0.56 0.06
ALT 0.43 -0.44 -0.36 -0.27 -0.58 0.30 0.05
ALP 0.43 0.26 -0.60 0.36 -0.24 0.08 -0.43
AST 0.29 -0.28 -0.28 -0.57 -0.31 0.43 -0.40
GGT 0.58 -0.22 -0.59 -0.20 0.07 0.40 0.24
LDH 0.11 0.57 -0.28 -0.57 -0.46 0.23 0.02
TB -0.09 -0.08 -0.05 0.17 -0.18 0.96 0.03
## 
##  F Test for Canonical Correlations (Rao's F Approximation)
## 
##           Corr         F    Num df  Den df Pr(>F)  
## CV 1  0.345847  1.428028 70.000000 2030.16 0.0124 *
## CV 2  0.271733  1.002686 54.000000 1779.05 0.4701  
## CV 3  0.201811  0.671618 40.000000 1524.05 0.9428  
## CV 4  0.141553  0.434890 28.000000 1263.37 0.9957  
## CV 5  0.083424  0.280575 18.000000  993.26 0.9987  
## CV 6  0.079527  0.259591 10.000000  704.00 0.9892  
## CV 7  0.031874  0.089750  4.000000  353.00 0.9856  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The first canonical correlation coefficient was 0.346. The pooled sum of squares of all canonical correlation coefficients was 0.269, of which 44.5% was contributed by the first canonical correlation. Compared to the full group, the subgroup analysis narrowed down the relationship between VOC exposure and liver function to fewer VOCs but more liver function tests. The first canonical correlation indicated that Benzene, o-Xylene, m,p-Xylene and MTBE as a group might affect the serum levels of albumin, ALP, ALT and GGT.
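The pooled sum of squares and the share of the first canonical correlation follow from the subgroup correlations printed above. A short base R check, with the values copied from that output and hypothetical object names:

```r
# Subgroup canonical correlations, copied from the cca.fit1$corr output
r_sub <- c(0.34584676, 0.27173286, 0.20181075, 0.14155267,
           0.08342351, 0.07952745, 0.03187424)

pooled <- sum(r_sub^2)                  # pooled sum of squared correlations
share_first <- r_sub[1]^2 / pooled      # fraction contributed by the first pair

pooled_rounded <- round(pooled, 3)
share_percent <- round(100 * share_first, 1)
```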

Stata

We first load the data into Stata using import delimited.

. import delimited data.csv
(18 vars, 565 obs)

To view the first three rows of data, we use the list command.

. list in 1/3

     +-----------------------------------------------------------------------+
  1. | v1  |        v2  |   benzene  |   oxylene  |  mpxylene  |  ethylbe~e  |
     |  1  |  1.037667  | -1.060735  |  2.163308  |  2.163308  |   2.007744  |
     |-----------------------------------------------------------------------|
     |      mtbe |   toluene | chlorof~m | tetrac~e  | trichl~e  |  albumin  |
     |    1.0924 |   1.88951 |  .8161429 | 1.250906  | .5039024  | .0732355  |
     |-----------+-----------+-----------+-----------+-----------+-----------|
     |       alt |       alp |       ast |       ggt |       ldh |        tb |
     | -.6783221 | -.1066309 | -.4419047 | -.6262044 |  1.088382 | -.4615485 |
     +-----------------------------------------------------------------------+

     +-----------------------------------------------------------------------+
  2. | v1  |        v2  |   benzene  |   oxylene  |  mpxylene  |  ethylbe~e  |
     |  2  | -.0643453  |  .2031609  | -.5292404  | -.3295934  |   -.493858  |
     |-----------------------------------------------------------------------|
     |      mtbe |   toluene | chlorof~m | tetrac~e  | trichl~e  |  albumin  |
     | -.5626988 | -.5652983 |     .7528 | -.160328  | -.542037  | 1.007743  |
     |-----------+-----------+-----------+-----------+-----------+-----------|
     |       alt |       alp |       ast |       ggt |       ldh |        tb |
     | -1.033876 |  -.599451 | -.6154503 | -.4838632 | -.0754589 |  .4615485 |
     +-----------------------------------------------------------------------+

     +-----------------------------------------------------------------------+
  3. | v1  |        v2  |   benzene  |   oxylene  |  mpxylene  |  ethylbe~e  |
     |  3  | -.2326791  | -.3672921  | -2.211818  | -1.199255  |  -.4370206  |
     |-----------------------------------------------------------------------|
     |      mtbe |   toluene | chlorof~m | tetrac~e  | trichl~e  |  albumin  |
     |  .5241461 | -1.475034 | -.2993069 | .6506712  | 1.605612  | .3696651  |
     |-----------+-----------+-----------+-----------+-----------+-----------|
     |       alt |       alp |       ast |       ggt |       ldh |        tb |
     |  2.042195 |  .6506712 |  1.876456 |  1.792977 |  -.013304 |  1.100489 |
     +-----------------------------------------------------------------------+

We then run the canonical correlation analysis using the canon command, specifying the exposure variables (volatile organic compounds) as the first set of variables and the outcome variables (biochemical liver tests) as the second set. The variable v2 corresponds to 1,4-dichlorobenzene, as its coefficients match those in the R output. From the output, we can see the coefficients, also called canonical weights, for the two variable sets and the canonical correlations.

The canonical weights can be used to generate canonical variates. The number of possible canonical variate pairs is equal to the number of variables in the smaller set. This leads to seven possible canonical variate pairs and seven canonical correlations in the output.

. canon (v2 benzene oxylene mpxylene ethylbenzene mtbe toluene chloroform tetra
> chloroethene trichloroethene) (albumin alt alp ast ggt ldh tb)

Canonical correlation analysis                      Number of obs =        565

Raw coefficients for the first variable set

                 |        1         2         3         4         5         6         7 
    -------------+----------------------------------------------------------------------
              v2 |   0.0756    0.8255    0.0848    0.3657    0.1051   -0.0897    0.2513 
         benzene |   0.8105    0.0800   -0.5946   -0.2967    0.4546   -0.5500   -0.1799 
         oxylene |   0.6197    0.6911    0.9462    0.3385   -0.6435    0.8811    1.0375    
        mpxylene |  -0.2399   -0.6321   -0.5617   -0.1971   -0.1289   -1.4618   -0.3028
    ethylbenzene |  -0.0325   -0.2859    0.0801    0.8217    0.6042    0.3922   -0.4197 
            mtbe |   0.3226   -0.0154    0.0538   -0.6598   -0.1084    0.5373    0.6359
         toluene |  -0.1840   -0.2222    0.1589   -0.2089    0.0055    0.9820   -0.5544 
      chloroform |  -0.3459   -0.2872   -0.0361    0.0722    0.8131    0.0680    0.4626
    tetrachlor~e |  -0.0666    0.2111    0.8209   -0.3917    0.2137   -0.4856   -0.4713 
    trichloroe~e |  -0.0323    0.5114   -0.7368   -0.0611    0.0812    0.4931   -0.3929 
    ------------------------------------------------------------------------------------

Raw coefficients for the second variable set

                 |        1         2         3         4         5         6         7 
    -------------+----------------------------------------------------------------------
         albumin |   0.9655   -0.2429    0.6400    0.2047   -0.0927   -0.0558    0.0704 
             alt |   0.0933   -0.6251   -0.6853    0.2720    0.1690    0.8961   -1.1633 
             alp |   0.5209    0.6121   -0.5120    0.1461   -0.3885    0.2201    0.2831 
             ast |  -0.1377    0.0208    0.4377   -1.1944   -0.7119    0.0288    0.8795 
             ggt |   0.0386    0.1082   -0.2176   -0.3201    0.5975   -1.1838   -0.0459  
             ldh |  -0.0874    0.6455    0.5559    0.1323    0.6481    0.2671   -0.3746  
              tb |  -0.6301    0.3990   -0.0146    0.3359   -0.6610   -0.3703   -0.5104  
    ------------------------------------------------------------------------------------
    
----------------------------------------------------------------------------
Canonical correlations:
  0.3184  0.2314  0.1921  0.1144  0.1002  0.0656  0.0246

----------------------------------------------------------------------------
Tests of significance of all canonical correlations

                         Statistic      df1      df2            F     Prob>F
         Wilks' lambda     .796381       70  3202.18       1.8215     0.0000 a
        Pillai's trace     .219833       70     3878       1.7962     0.0001 a
Lawley-Hotelling trace     .236003       70     3824       1.8418     0.0000 a
    Roy's largest root     .112804       10      554       6.2494     0.0000 u
----------------------------------------------------------------------------
                            e = exact, a = approximate, u = upper bound on F

To find out how many canonical correlations are statistically significant, we can use the test option of the canon command as shown below. From the output, we find that only the first canonical correlation is statistically significant (F = 1.82, P < 0.0001), indicating that the two sets of variables are correlated; the test of correlations 2-7 is not significant at the 0.05 level (F = 1.24, P = 0.1096). The first canonical correlation is 0.3184 and the second is 0.2314.

. canon, test(1 2 3 4 5 6 7)


Canonical correlation analysis                      Number of obs =        565

Raw coefficients for the first variable set

                 |        1         2         3         4         5         6         7 
    -------------+----------------------------------------------------------------------
              v2 |   0.0756    0.8255    0.0848    0.3657    0.1051   -0.0897    0.2513 
         benzene |   0.8105    0.0800   -0.5946   -0.2967    0.4546   -0.5500   -0.1799 
         oxylene |   0.6197    0.6911    0.9462    0.3385   -0.6435    0.8811    1.0375 
        mpxylene |  -0.2399   -0.6321   -0.5617   -0.1971   -0.1289   -1.4618   -0.3028  
    ethylbenzene |  -0.0325   -0.2859    0.0801    0.8217    0.6042    0.3922   -0.4197  
            mtbe |   0.3226   -0.0154    0.0538   -0.6598   -0.1084    0.5373    0.6359 
         toluene |  -0.1840   -0.2222    0.1589   -0.2089    0.0055    0.9820   -0.5544  
      chloroform |  -0.3459   -0.2872   -0.0361    0.0722    0.8131    0.0680    0.4626 
    tetrachlor~e |  -0.0666    0.2111    0.8209   -0.3917    0.2137   -0.4856   -0.4713 
    trichloroe~e |  -0.0323    0.5114   -0.7368   -0.0611    0.0812    0.4931   -0.3929  
    ------------------------------------------------------------------------------------

Raw coefficients for the second variable set

                 |        1         2         3         4         5         6         7 
    -------------+----------------------------------------------------------------------
         albumin |   0.9655   -0.2429    0.6400    0.2047   -0.0927   -0.0558    0.0704 
             alt |   0.0933   -0.6251   -0.6853    0.2720    0.1690    0.8961   -1.1633  
             alp |   0.5209    0.6121   -0.5120    0.1461   -0.3885    0.2201    0.2831 
             ast |  -0.1377    0.0208    0.4377   -1.1944   -0.7119    0.0288    0.8795 
             ggt |   0.0386    0.1082   -0.2176   -0.3201    0.5975   -1.1838   -0.0459  
             ldh |  -0.0874    0.6455    0.5559    0.1323    0.6481    0.2671   -0.3746  
              tb |  -0.6301    0.3990   -0.0146    0.3359   -0.6610   -0.3703   -0.5104  
    ------------------------------------------------------------------------------------

----------------------------------------------------------------------------
Canonical correlations:
  0.3184  0.2314  0.1921  0.1144  0.1002  0.0656  0.0246

----------------------------------------------------------------------------
Tests of significance of all canonical correlations

                         Statistic      df1      df2            F     Prob>F
         Wilks' lambda     .796381       70  3202.18       1.8215     0.0000 a
        Pillai's trace     .219833       70     3878       1.7962     0.0001 a
Lawley-Hotelling trace     .236003       70     3824       1.8418     0.0000 a
    Roy's largest root     .112804       10      554       6.2494     0.0000 u
----------------------------------------------------------------------------
Test of significance of canonical correlations 1-7

                         Statistic      df1      df2            F     Prob>F
         Wilks' lambda     .796381       70  3202.18       1.8215     0.0000 a
----------------------------------------------------------------------------
Test of significance of canonical correlations 2-7

                         Statistic      df1      df2            F     Prob>F
         Wilks' lambda     .886217       54  2803.96       1.2448     0.1096 a
----------------------------------------------------------------------------
Test of significance of canonical correlations 3-7

                         Statistic      df1      df2            F     Prob>F
         Wilks' lambda     .936336       40  2400.19       0.9124     0.6283 a
----------------------------------------------------------------------------
Test of significance of canonical correlations 4-7

                         Statistic      df1      df2            F     Prob>F
         Wilks' lambda     .972219       28  1988.08       0.5570     0.9709 a
----------------------------------------------------------------------------
Test of significance of canonical correlations 5-7

                         Statistic      df1      df2            F     Prob>F
         Wilks' lambda     .985104       18  1561.78       0.4616     0.9733 a
----------------------------------------------------------------------------
Test of significance of canonical correlations 6-7

                         Statistic      df1      df2            F     Prob>F
         Wilks' lambda     .995097       10     1106       0.2721     0.9871 e
----------------------------------------------------------------------------
Test of significance of canonical correlation 7

                         Statistic      df1      df2            F     Prob>F
         Wilks' lambda     .999397        4      554       0.0836     0.9875 e
----------------------------------------------------------------------------
                            e = exact, a = approximate, u = upper bound on F
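The multivariate test statistics in this output are simple functions of the canonical correlations. The sketch below recomputes them in base R from the correlations reported in the R section, assuming the standard formulas (Wilks' lambda as a product of 1 - r^2, Pillai's trace as the sum of r^2, and so on); the object names are illustrative.

```r
# Full-group canonical correlations, copied from the cca.fit$corr output
r <- c(0.31838580, 0.23136020, 0.19211314, 0.11436996,
       0.10021086, 0.06559313, 0.02455458)

# Wilks' lambda for correlations k..7 is the product of (1 - r_i^2) over i >= k
wilks_k_to_7 <- rev(cumprod(rev(1 - r^2)))

pillai <- sum(r^2)                        # Pillai's trace
lawley_hotelling <- sum(r^2 / (1 - r^2))  # Lawley-Hotelling trace
roy <- r[1]^2 / (1 - r[1]^2)              # Roy's largest root as Stata reports it

round(wilks_k_to_7, 6)
```

Element k of `wilks_k_to_7` corresponds to the "Test of significance of canonical correlations k-7" line in the Stata output.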

Now we focus on the first two sets of canonical weights, and we might be interested in which coefficients in each set are significant (P < 0.1). With the stderr option, canon reports standard errors and significance tests for the coefficients, and first(2) restricts the display to the first two pairs.

The first set of canonical weights of exposure variables mainly represent Benzene, Chloroform, and MTBE and the first set of canonical weights of outcome variables mainly represent Albumin, ALP, and TB.

The second set of canonical weights of exposure variables mainly represent 1,4-Dichlorobenzene, and Trichloroethene and the second set of canonical weights of outcome variables mainly represent ALT, ALP, LDH and TB.

These results help narrow down the relationship between VOC exposure and liver function test outcomes to fewer VOCs and liver function tests. This implies that exposure to a cluster of certain VOCs might be associated with certain biochemical liver tests as a group.

. canon (v2 benzene oxylene mpxylene ethylbenzene mtbe toluene chloroform tetra
> chloroethene trichloroethene) (albumin alt alp ast ggt ldh tb), first(2) stde
> rr

Linear combinations for canonical correlations      Number of obs =        565
------------------------------------------------------------------------------
             |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
u1           |
          v2 |   .0756267   .1390955     0.54   0.587    -.1975818    .3488351
     benzene |   .8105217   .1831084     4.43   0.000     .4508641    1.170179
     oxylene |   .6196655   .3964418     1.56   0.119    -.1590173    1.398348
    mpxylene |  -.2399039   .4662774    -0.51   0.607    -1.155756    .6759485
ethylbenzene |  -.0325145   .2653383    -0.12   0.903    -.5536864    .4886574
        mtbe |   .3225807   .1650388     1.95   0.051     -.001585    .6467463
     toluene |  -.1839661   .1851906    -0.99   0.321    -.5477136    .1797815
  chloroform |  -.3458607   .1372016    -2.52   0.012    -.6153493   -.0763722
tetrachlor~e |  -.0666163   .1499327    -0.44   0.657     -.361111    .2278783
trichloroe~e |  -.0323482   .1683552    -0.19   0.848    -.3630279    .2983315
-------------+----------------------------------------------------------------
v1           |
     albumin |   .9655298   .1524076     6.34   0.000      .666174    1.264886
         alt |   .0932795   .2231235     0.42   0.676     -.344975    .5315341
         alp |   .5209347   .1387179     3.76   0.000     .2484679    .7934014
         ast |  -.1377092   .2155219    -0.64   0.523    -.5610329    .2856144
         ggt |   .0385658   .1749642     0.22   0.826    -.3050952    .3822267
         ldh |  -.0874316   .1483353    -0.59   0.556    -.3787886    .2039255
          tb |  -.6300689   .1546978    -4.07   0.000     -.933923   -.3262148
-------------+----------------------------------------------------------------
u2           |
          v2 |   .8255232   .1964452     4.20   0.000     .4396696    1.211377
     benzene |   .0800258   .2586049     0.31   0.757    -.4279204    .5879721
     oxylene |   .6910712   .5598968     1.23   0.218    -.4086663    1.790809
    mpxylene |  -.6320947   .6585259    -0.96   0.338    -1.925557    .6613681
ethylbenzene |  -.2859168   .3747385    -0.76   0.446     -1.02197    .4501368
        mtbe |  -.0153545   .2330851    -0.07   0.948    -.4731753    .4424663
     toluene |  -.2222011   .2615456    -0.85   0.396    -.7359235    .2915213
  chloroform |  -.2872412   .1937705    -1.48   0.139    -.6678412    .0933588
tetrachlor~e |   .2110777   .2117507     1.00   0.319    -.2048386    .6269939
trichloroe~e |   .5113532   .2377688     2.15   0.032     .0443327    .9783738
-------------+----------------------------------------------------------------
v2           |
     albumin |  -.2429123    .215246    -1.13   0.260     -.665694    .1798693
         alt |  -.6251191   .3151185    -1.98   0.048    -1.244068     -.00617
         alp |   .6120708    .195912     3.12   0.002     .2272646     .996877
         ast |   .0208182   .3043827     0.07   0.945    -.5770439    .6186803
         ggt |   .1082178   .2471027     0.44   0.662    -.3771362    .5935719
         ldh |   .6454798   .2094947     3.08   0.002     .2339948    1.056965
          tb |   .3989909   .2184804     1.83   0.068    -.0301437    .8281256
------------------------------------------------------------------------------
                                     (Standard errors estimated conditionally)
Canonical correlations:
  0.3184  0.2314  0.1921  0.1144  0.1002  0.0656  0.0246

----------------------------------------------------------------------------
Tests of significance of all canonical correlations

                         Statistic      df1      df2            F     Prob>F
         Wilks' lambda     .796381       70  3202.18       1.8215     0.0000 a
        Pillai's trace     .219833       70     3878       1.7962     0.0001 a
Lawley-Hotelling trace     .236003       70     3824       1.8418     0.0000 a
    Roy's largest root     .112804       10      554       6.2494     0.0000 u
----------------------------------------------------------------------------
                            e = exact, a = approximate, u = upper bound on F
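The screening at P < 0.1 can be mimicked from the reported coefficients and standard errors. The following base R sketch uses the u1 block of the output above; it assumes a normal approximation for the p-values, which may differ slightly from the reference distribution Stata uses.

```r
# u1 coefficients and standard errors, copied from the canon ..., stderr output
u1 <- data.frame(
  variable = c("v2", "benzene", "oxylene", "mpxylene", "ethylbenzene", "mtbe",
               "toluene", "chloroform", "tetrachloroethene", "trichloroethene"),
  coef = c(0.0756267, 0.8105217, 0.6196655, -0.2399039, -0.0325145, 0.3225807,
           -0.1839661, -0.3458607, -0.0666163, -0.0323482),
  se = c(0.1390955, 0.1831084, 0.3964418, 0.4662774, 0.2653383, 0.1650388,
         0.1851906, 0.1372016, 0.1499327, 0.1683552),
  stringsAsFactors = FALSE
)

u1$t <- u1$coef / u1$se            # test statistic, as in the t column
u1$p <- 2 * pnorm(-abs(u1$t))      # two-sided p-value, normal approximation (assumption)

significant_u1 <- u1$variable[u1$p < 0.1]
```

Under this approximation the variables retained for u1 are benzene, mtbe and chloroform, in line with the interpretation given for the first set of exposure weights.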

Finally, we use the estat loadings command to display the structure correlation coefficients, also called canonical loadings. These loadings are correlations between variables and the canonical variates, used to interpret the importance of each original variable in the canonical variates.

. estat loadings

Canonical loadings for variable list 1

                 |        1         2 
    -------------+--------------------
              v2 |   0.0873    0.7347 
         benzene |   0.8667   -0.1024 
         oxylene |   0.6973   -0.1190 
        mpxylene |   0.6694   -0.2100 
    ethylbenzene |   0.5779   -0.2223 
            mtbe |   0.4336    0.0067 
         toluene |   0.4493   -0.1767 
      chloroform |  -0.2142   -0.1082 
    tetrachlor~e |   0.0706    0.2696 
    trichloroe~e |   0.0185    0.4834 
    ----------------------------------

Canonical loadings for variable list 2

                 |        1         2 
    -------------+--------------------
         albumin |   0.6800   -0.1265 
             alt |   0.2808   -0.0569 
             alp |   0.5064    0.6637 
             ast |   0.1298    0.1422 
             ggt |   0.3257    0.1612 
             ldh |   0.0558    0.6438 
              tb |  -0.1148    0.2439 
    ----------------------------------

Correlation between variable list 1 and canonical variates from list 2

                 |        1         2 
    -------------+--------------------
              v2 |   0.0278    0.1700 
         benzene |   0.2759   -0.0237 
         oxylene |   0.2220   -0.0275 
        mpxylene |   0.2131   -0.0486 
    ethylbenzene |   0.1840   -0.0514 
            mtbe |   0.1381    0.0015 
         toluene |   0.1430   -0.0409 
      chloroform |  -0.0682   -0.0250 
    tetrachlor~e |   0.0225    0.0624 
    trichloroe~e |   0.0059    0.1118 
    ----------------------------------

Correlation between variable list 2 and canonical variates from list 1

                 |        1         2 
    -------------+--------------------
         albumin |   0.2165   -0.0293 
             alt |   0.0894   -0.0132 
             alp |   0.1612    0.1536 
             ast |   0.0413    0.0329 
             ggt |   0.1037    0.0373 
             ldh |   0.0178    0.1489 
              tb |  -0.0366    0.0564 
    ----------------------------------
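
The last two tables are tied to the first two by a simple identity: the correlation between a variable and the j-th canonical variate of the opposite set equals its loading on its own j-th variate multiplied by the j-th canonical correlation. The sketch below checks this on a few rows copied from the output above. It is only an illustration in Python; the row selection and names are ours.

```python
# First two canonical correlations from the full-group output above.
canonical_correlations = (0.3184, 0.2314)

# (variable, loadings on own variates 1-2, reported correlations with opposite variates 1-2)
rows = [
    ("benzene", (0.8667, -0.1024), (0.2759, -0.0237)),
    ("trichloroethene", (0.0185, 0.4834), (0.0059, 0.1118)),
    ("albumin", (0.6800, -0.1265), (0.2165, -0.0293)),
    ("ldh", (0.0558, 0.6438), (0.0178, 0.1489)),
]

max_gap = 0.0
for name, loadings, reported in rows:
    # Cross-loading = own loading * canonical correlation of the same dimension.
    predicted = [l * r for l, r in zip(loadings, canonical_correlations)]
    gap = max(abs(p - q) for p, q in zip(predicted, reported))
    max_gap = max(max_gap, gap)
    print(name, [round(p, 4) for p in predicted], list(reported))

print("largest discrepancy:", round(max_gap, 4))
```

Any small discrepancy comes from the four-decimal rounding of the printed values.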

We now repeat the above steps using the subgroup data. This time we report only the output related to CCA.

. canon (v2 benzene oxylene mpxylene ethylbenzene mtbe toluene chloroform tetra
> chloroethene trichloroethene) (albumin alt alp ast ggt ldh tb)

Canonical correlation analysis                      Number of obs =        364

Raw coefficients for the first variable set

                 |        1         2         3         4         5         6         7 
    -------------+----------------------------------------------------------------------
              v2 |  -0.0864    0.7927   -0.3509   -0.4380    0.1915   -0.0461   -0.2965  
         benzene |   0.9536   -0.1311   -0.5125    0.2035    0.3055    0.1649   -0.2804  
         oxylene |   0.3931    1.1373    0.7959    0.6987    0.0267    0.5340   -0.1730  
        mpxylene |  -0.0786   -0.6241   -0.2293   -1.0983   -0.2600   -1.2125    1.8878 
    ethylbenzene |  -0.1973   -0.1926    0.1325   -0.2430    0.7675    0.7911   -1.0040  
            mtbe |   0.3857    0.0035   -0.2383    0.3222   -0.1886    0.1266    0.0232 
         toluene |  -0.2762   -0.1281    0.4657    0.2602   -0.9704    0.3166   -0.6035  
      chloroform |  -0.3778   -0.2199    0.1192    0.2600    0.5099    0.6297    0.2853 
    tetrachlor~e |  -0.1556    0.4707    0.2210    0.6919    0.2814   -0.4807   -0.0608  
    trichloroe~e |  -0.1163    0.1813   -0.6071   -0.1560   -0.7052    0.6053    0.5468 
    ------------------------------------------------------------------------------------

Raw coefficients for the second variable set

                 |        1         2         3         4         5         6         7 
    -------------+----------------------------------------------------------------------
         albumin |   0.8539    0.2318    0.8054   -0.0546   -0.0056   -0.1301    0.0476 
             alt |  -0.0039   -0.7102   -0.0580   -0.3882   -1.5081    0.4344   -0.7116  
             alp |   0.2985    0.3195   -0.3945   -0.6661   -0.1914    0.0953    0.6314 
             ast |  -0.0061   -0.3476    0.1249    0.8405    0.5994   -0.2740    1.3703 
             ggt |   0.4186    0.0072   -0.6775    0.2759    1.0163   -0.2260   -0.6180  
             ldh |  -0.0719    0.8987   -0.0876    0.4705   -0.4063   -0.0612   -0.4309  
              tb |  -0.6927   -0.0054   -0.1744   -0.4061   -0.1110   -0.9081   -0.1127  
    ------------------------------------------------------------------------------------

----------------------------------------------------------------------------
Canonical correlations:
  0.3458  0.2717  0.2018  0.1416  0.0834  0.0795  0.0319

----------------------------------------------------------------------------
Tests of significance of all canonical correlations

                         Statistic      df1      df2            F     Prob>F
         Wilks' lambda     .755585       70  2030.16       1.4280     0.0124 a
        Pillai's trace     .268514       70     2471       1.4081     0.0153 a
Lawley-Hotelling trace      .29288       70     2417       1.4447     0.0099 a
    Roy's largest root      .13586       10      353       4.7959     0.0000 u
----------------------------------------------------------------------------
                            e = exact, a = approximate, u = upper bound on F

To find out how many canonical correlations are statistically significant, we can use the test() option of the canon command, as shown below. From the output, only the first canonical correlation is statistically significant: the Wilks' lambda test of correlations 1-7 gives F = 1.43, P = 0.0124, whereas the test of correlations 2-7 gives P = 0.4701 and the remaining tests give even larger P values. This indicates that the two sets of variables are correlated, but along a single dimension. The first canonical correlation is 0.3458.

. canon, test(1 2 3 4 5 6 7)


Canonical correlation analysis                      Number of obs =        364

Raw coefficients for the first variable set

                 |        1         2         3         4         5         6         7 
    -------------+----------------------------------------------------------------------
              v2 |  -0.0864    0.7927   -0.3509   -0.4380    0.1915   -0.0461   -0.2965  
         benzene |   0.9536   -0.1311   -0.5125    0.2035    0.3055    0.1649   -0.2804  
         oxylene |   0.3931    1.1373    0.7959    0.6987    0.0267    0.5340   -0.1730  
        mpxylene |  -0.0786   -0.6241   -0.2293   -1.0983   -0.2600   -1.2125    1.8878 
    ethylbenzene |  -0.1973   -0.1926    0.1325   -0.2430    0.7675    0.7911   -1.0040  
            mtbe |   0.3857    0.0035   -0.2383    0.3222   -0.1886    0.1266    0.0232 
         toluene |  -0.2762   -0.1281    0.4657    0.2602   -0.9704    0.3166   -0.6035  
      chloroform |  -0.3778   -0.2199    0.1192    0.2600    0.5099    0.6297    0.2853 
    tetrachlor~e |  -0.1556    0.4707    0.2210    0.6919    0.2814   -0.4807   -0.0608  
    trichloroe~e |  -0.1163    0.1813   -0.6071   -0.1560   -0.7052    0.6053    0.5468 
    ------------------------------------------------------------------------------------

Raw coefficients for the second variable set

                 |        1         2         3         4         5         6         7 
    -------------+----------------------------------------------------------------------
         albumin |   0.8539    0.2318    0.8054   -0.0546   -0.0056   -0.1301    0.0476 
             alt |  -0.0039   -0.7102   -0.0580   -0.3882   -1.5081    0.4344   -0.7116  
             alp |   0.2985    0.3195   -0.3945   -0.6661   -0.1914    0.0953    0.6314 
             ast |  -0.0061   -0.3476    0.1249    0.8405    0.5994   -0.2740    1.3703 
             ggt |   0.4186    0.0072   -0.6775    0.2759    1.0163   -0.2260   -0.6180  
             ldh |  -0.0719    0.8987   -0.0876    0.4705   -0.4063   -0.0612   -0.4309  
              tb |  -0.6927   -0.0054   -0.1744   -0.4061   -0.1110   -0.9081   -0.1127  
    ------------------------------------------------------------------------------------

----------------------------------------------------------------------------
Canonical correlations:
  0.3458  0.2717  0.2018  0.1416  0.0834  0.0795  0.0319

----------------------------------------------------------------------------
Tests of significance of all canonical correlations

                         Statistic      df1      df2            F     Prob>F
         Wilks' lambda     .755585       70  2030.16       1.4280     0.0124 a
        Pillai's trace     .268514       70     2471       1.4081     0.0153 a
Lawley-Hotelling trace      .29288       70     2417       1.4447     0.0099 a
    Roy's largest root      .13586       10      353       4.7959     0.0000 u
----------------------------------------------------------------------------
Test of significance of canonical correlations 1-7

                         Statistic      df1      df2            F     Prob>F
         Wilks' lambda     .755585       70  2030.16       1.4280     0.0124 a
----------------------------------------------------------------------------
Test of significance of canonical correlations 2-7

                         Statistic      df1      df2            F     Prob>F
         Wilks' lambda     .858239       54  1779.05       1.0027     0.4701 a
----------------------------------------------------------------------------
Test of significance of canonical correlations 3-7

                         Statistic      df1      df2            F     Prob>F
         Wilks' lambda     .926663       40  1524.05       0.6716     0.9428 a
----------------------------------------------------------------------------
Test of significance of canonical correlations 4-7

                         Statistic      df1      df2            F     Prob>F
         Wilks' lambda     .966006       28  1263.37       0.4349     0.9957 a
----------------------------------------------------------------------------
Test of significance of canonical correlations 5-7

                         Statistic      df1      df2            F     Prob>F
         Wilks' lambda     .985757       18  993.263       0.2806     0.9987 a
----------------------------------------------------------------------------
Test of significance of canonical correlations 6-7

                         Statistic      df1      df2            F     Prob>F
         Wilks' lambda     .992666       10      704       0.2596     0.9892 e
----------------------------------------------------------------------------
Test of significance of canonical correlation 7

                         Statistic      df1      df2            F     Prob>F
         Wilks' lambda     .998984        4      353       0.0898     0.9856 e
----------------------------------------------------------------------------
                            e = exact, a = approximate, u = upper bound on F
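
The sequential tests above can be reproduced by hand. The Wilks' lambda for correlations k through 7 is the product of (1 - r^2) over the remaining correlations, and the F value follows from Rao's F approximation. The Python sketch below assumes this is the approximation behind the output (the degrees of freedom it yields match the table), and the function name `wilks_test` is our own.

```python
import math

# Inputs copied from the subgroup output above.
n_obs = 364
p, q = 10, 7  # number of exposure variables and outcome variables
canonical_correlations = [0.3458, 0.2717, 0.2018, 0.1416, 0.0834, 0.0795, 0.0319]


def wilks_test(k):
    """Wilks' lambda and Rao's F approximation for canonical correlations k..min(p, q)."""
    lam = math.prod(1 - r * r for r in canonical_correlations[k - 1:])
    pk, qk = p - k + 1, q - k + 1        # dimensions left after removing k-1 pairs
    df1 = pk * qk
    w = n_obs - 1 - (p + q + 1) / 2
    denom = pk ** 2 + qk ** 2 - 5
    # When the denominator is not positive the scaling exponent is taken as 1.
    t = math.sqrt((df1 ** 2 - 4) / denom) if denom > 0 else 1.0
    df2 = w * t - df1 / 2 + 1
    root = lam ** (1 / t)
    f_stat = (1 - root) / root * df2 / df1
    return lam, df1, df2, f_stat


results = [wilks_test(k) for k in range(1, len(canonical_correlations) + 1)]
for k, (lam, df1, df2, f_stat) in enumerate(results, start=1):
    print(f"{k}-7: lambda={lam:.4f} df1={df1} df2={df2:.2f} F={f_stat:.4f}")
```

Since the correlations are rounded, the computed lambda and F values match the output only approximately. Computing the P values would additionally require the F distribution, which is left out to keep the sketch dependency-free.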

Now we focus on the first pair of canonical variates and ask which coefficients in each set are significant (P < 0.1). The stderr option reports the standard errors and significance tests for the coefficients.

The first canonical variate of the exposure variables mainly represents benzene, MTBE, and chloroform, and the first canonical variate of the outcome variables mainly represents albumin, ALP, GGT, and TB.

. canon (v2 benzene oxylene mpxylene ethylbenzene mtbe toluene chloroform tetra
> chloroethene trichloroethene) (albumin alt alp ast ggt ldh tb), first(1) stde
> rr

Linear combinations for canonical correlations      Number of obs =        364
------------------------------------------------------------------------------
             |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
u1           |
          v2 |  -.0863798    .161699    -0.53   0.594    -.4043642    .2316047
     benzene |   .9535735   .2029701     4.70   0.000     .5544285    1.352718
     oxylene |    .393107   .4568681     0.86   0.390    -.5053334    1.291547
    mpxylene |  -.0786449   .5300306    -0.15   0.882    -1.120961    .9636712
ethylbenzene |  -.1973275   .2819226    -0.70   0.484    -.7517342    .3570792
        mtbe |   .3857021   .1862609     2.07   0.039     .0194162    .7519879
     toluene |  -.2761614   .2033818    -1.36   0.175    -.6761159    .1237931
  chloroform |  -.3778441   .1561604    -2.42   0.016    -.6849368   -.0707515
tetrachlor~e |  -.1556496   .1689163    -0.92   0.357    -.4878271    .1765279
trichloroe~e |   -.116344   .1957082    -0.59   0.553    -.5012083    .2685202
-------------+----------------------------------------------------------------
v1           |
     albumin |   .8538873   .1733685     4.93   0.000     .5129546     1.19482
         alt |  -.0038843   .2738774    -0.01   0.989    -.5424699    .5347012
         alp |   .2984847   .1597807     1.87   0.063    -.0157273    .6126968
         ast |  -.0060705   .2554019    -0.02   0.981    -.5083237    .4961826
         ggt |   .4185558   .2121205     1.97   0.049     .0014165    .8356951
         ldh |  -.0719023   .1699071    -0.42   0.672    -.4060281    .2622235
          tb |  -.6927423   .1775667    -3.90   0.000    -1.041931   -.3435538
------------------------------------------------------------------------------
                                     (Standard errors estimated conditionally)
Canonical correlations:
  0.3458  0.2717  0.2018  0.1416  0.0834  0.0795  0.0319

----------------------------------------------------------------------------
Tests of significance of all canonical correlations

                         Statistic      df1      df2            F     Prob>F
         Wilks' lambda     .755585       70  2030.16       1.4280     0.0124 a
        Pillai's trace     .268514       70     2471       1.4081     0.0153 a
Lawley-Hotelling trace      .29288       70     2417       1.4447     0.0099 a
    Roy's largest root      .13586       10      353       4.7959     0.0000 u
----------------------------------------------------------------------------
                            e = exact, a = approximate, u = upper bound on F
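
To make the selection rule explicit, the sketch below filters the coefficients of the first pair of canonical variates at P < 0.1 and recomputes each t statistic as the coefficient divided by its standard error. The numbers are copied from the output above; the Python names are ours.

```python
# (variable, coefficient, standard error, reported P>|t|) from the output above.
exposures = [
    ("v2", -0.0863798, 0.161699, 0.594),
    ("benzene", 0.9535735, 0.2029701, 0.000),
    ("oxylene", 0.393107, 0.4568681, 0.390),
    ("mpxylene", -0.0786449, 0.5300306, 0.882),
    ("ethylbenzene", -0.1973275, 0.2819226, 0.484),
    ("mtbe", 0.3857021, 0.1862609, 0.039),
    ("toluene", -0.2761614, 0.2033818, 0.175),
    ("chloroform", -0.3778441, 0.1561604, 0.016),
    ("tetrachloroethene", -0.1556496, 0.1689163, 0.357),
    ("trichloroethene", -0.116344, 0.1957082, 0.553),
]
outcomes = [
    ("albumin", 0.8538873, 0.1733685, 0.000),
    ("alt", -0.0038843, 0.2738774, 0.989),
    ("alp", 0.2984847, 0.1597807, 0.063),
    ("ast", -0.0060705, 0.2554019, 0.981),
    ("ggt", 0.4185558, 0.2121205, 0.049),
    ("ldh", -0.0719023, 0.1699071, 0.672),
    ("tb", -0.6927423, 0.1775667, 0.000),
]


def significant(rows, alpha=0.10):
    """Return (variable, t statistic) for coefficients with reported P below alpha."""
    return [(name, coef / se) for name, coef, se, pval in rows if pval < alpha]


significant_exposures = significant(exposures)
significant_outcomes = significant(outcomes)
for name, t_stat in significant_exposures + significant_outcomes:
    print(f"{name:12s} t = {t_stat:6.2f}")
```

The variables that pass the filter are the ones named in the interpretation above.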

Finally, we use the estat loadings command to display the structure correlation coefficients, also called canonical loadings.

. estat loadings

Canonical loadings for variable list 1

                 |        1 
    -------------+----------
              v2 |  -0.0787 
         benzene |   0.8187 
         oxylene |   0.4725 
        mpxylene |   0.4565 
    ethylbenzene |   0.3241 
            mtbe |   0.4090 
         toluene |   0.2334 
      chloroform |  -0.2706 
    tetrachlor~e |  -0.0831 
    trichloroe~e |  -0.1468 
    ------------------------

Canonical loadings for variable list 2

                 |        1 
    -------------+----------
         albumin |   0.6854 
             alt |   0.4265 
             alp |   0.4337 
             ast |   0.2890 
             ggt |   0.5831 
             ldh |   0.1124 
              tb |  -0.0906 
    ------------------------

Correlation between variable list 1 and canonical variates from list 2

                 |        1 
    -------------+----------
              v2 |  -0.0272 
         benzene |   0.2831 
         oxylene |   0.1634 
        mpxylene |   0.1579 
    ethylbenzene |   0.1121 
            mtbe |   0.1415 
         toluene |   0.0807 
      chloroform |  -0.0936 
    tetrachlor~e |  -0.0287 
    trichloroe~e |  -0.0508 
    ------------------------

Correlation between variable list 2 and canonical variates from list 1

                 |        1 
    -------------+----------
         albumin |   0.2371 
             alt |   0.1475 
             alp |   0.1500 
             ast |   0.1000 
             ggt |   0.2017 
             ldh |   0.0389 
              tb |  -0.0313 
    ------------------------

SAS

Canonical structures of the first pair of canonical variates and F-test (full group)

proc cancorr data=project.Combine2;
var Benzene o_Xylene m_p_Xylene Ethylbenzene MTBE _1_4_Dichlorobenzene Toluene Chloroform Tetrachloroethene Trichloroethene;
with Albumin ALT ALP AST GGT LDH TB; 
run;

The output below gives the canonical correlations and the multivariate tests of the dimensions, and also includes the multivariate criteria and the F approximations.

Note: The F statistics vary depending on the criterion used.

Next, the raw canonical coefficients are shown below.

The raw canonical coefficients are interpreted in a manner analogous to regression coefficients. For the variable o-Xylene, a one-unit increase in o-Xylene leads to a 0.6197 increase in the first canonical variate of set 1 when all of the other variables are held constant.

The raw coefficients are followed by the standardized canonical coefficients shown below. After standardizing, the coefficients are easier to compare because their values don’t depend on the units of the variables. However, the raw coefficients are more interpretable.

Below are the correlations between the observed variables and the canonical variates, known as the canonical loadings, which SAS labels as the canonical structure.

Through the graphs above, we can narrow down the relationship between VOC exposure and liver function to a smaller number of VOCs and liver function tests in the full group. In addition, the results suggest that exposure to a cluster of certain VOCs might be associated with certain biochemical liver tests as a group.

Canonical structures of the first pair of canonical variates and F-test (subgroup)

proc cancorr data=project.moredrink;
var Benzene o_Xylene m_p_Xylene Ethylbenzene MTBE _1_4_Dichlorobenzene Toluene Chloroform Tetrachloroethene Trichloroethene;
with Albumin ALT ALP AST GGT LDH TB; 
run;

The output below gives the canonical correlations and the multivariate tests of the dimensions, and also includes the multivariate criteria and the F approximations.

Next, the raw canonical coefficients are shown below.

The raw canonical coefficients are interpreted in a manner analogous to regression coefficients. For the variable o-Xylene, a one-unit increase in o-Xylene leads to a 0.3931 increase in the first canonical variate of set 1 when all of the other variables are held constant.
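
The interpretation above follows from the fact that a canonical variate is a linear combination of the original variables. The sketch below uses the first column of raw coefficients from the subgroup Stata output and a purely hypothetical observation (all values set to 1.0) to show that raising o-Xylene by one unit changes the first variate by exactly its coefficient. Centering of the variables is ignored because it cancels in the difference.

```python
# Raw coefficients of the first canonical variate (subgroup, first variable set).
raw_coefficients = {
    "v2": -0.0864, "benzene": 0.9536, "oxylene": 0.3931, "mpxylene": -0.0786,
    "ethylbenzene": -0.1973, "mtbe": 0.3857, "toluene": -0.2762,
    "chloroform": -0.3778, "tetrachloroethene": -0.1556, "trichloroethene": -0.1163,
}


def canonical_variate(observation):
    """Weighted sum of the original variables using the raw coefficients."""
    return sum(raw_coefficients[name] * observation[name] for name in raw_coefficients)


# Hypothetical observation: every exposure set to 1.0 (not real data).
baseline = {name: 1.0 for name in raw_coefficients}
# Same observation with o-Xylene increased by one unit, all else held constant.
shifted = dict(baseline, oxylene=baseline["oxylene"] + 1.0)

change = canonical_variate(shifted) - canonical_variate(baseline)
print(f"change in first canonical variate: {change:.4f}")
```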

The raw coefficients are followed by the standardized canonical coefficients shown below.

Below are the correlations between the observed variables and the canonical variates, known as the canonical loadings, which SAS labels as the canonical structure.

We can compare these graphs with the graphs for the full group to see whether the change of group influences the correlation between VOC exposure and the liver function tests.

Discussion

The advantages of CCA in this case:

  • The liver damage caused by isolated VOCs may be even worse under exposure to a cluster of VOCs.

  • Liver injuries would be better captured by a combination of liver function tests.

  • Compared with the full group, the subgroup analysis suggests that liver injuries may be caused by a narrower cluster of VOCs.

Take-home notes

  • Canonical correlation terminology makes an important distinction between the words variables and variates. The term variables is reserved for referring to the original variables being analyzed. The term variates is used to refer to variables that are constructed as weighted averages of the original variables.

  • CCA output can be fairly complex. Quantities of particular interest include the correlations between the original variables in each set and their respective canonical variates (structural correlations or loadings). The canonical correlations provide the concordance between the transformed variables, while the loadings reveal the extent to which each canonical variate is associated with particular variables in each set.

  • In general, the number of canonical dimensions is equal to the number of variables in the smaller set; however, the number of significant dimensions may be even smaller. Canonical dimensions, also known as canonical variates, are latent variables that are analogous to factors in factor analysis.

Alternative Methods

  • Multivariate multiple regression
  • Separate OLS Regressions

Citation

*Burch, J. B., Everson, T. M., Seth, R. K., Wirth, M. D., & Chatterjee, S. (2015). Trihalomethane exposure and biomonitoring for the liver injury indicator, alanine aminotransferase, in the United States population (NHANES 1999-2006). Science of The Total Environment, 521-522, 226-234. doi:10.1016/j.scitotenv.2015.03.050*

*Liu, J., Drane, W., Liu, X., & Wu, T. (2009). Examination of the relationships between environmental exposures to volatile organic compounds and biochemical liver tests: Application of canonical correlation analysis. Environmental Research, 109(2), 193-199. doi:10.1016/j.envres.2008.11.002*

*Jang, E. S., Jeong, S., Hwang, S. H., Kim, H. Y., Ahn, S. Y., Lee, J.,… Lee, D. H. (2012). Effects of coffee, smoking, and alcohol on liver function tests: A comprehensive cross-sectional study. BMC Gastroenterology, 12(1). doi:10.1186/1471-230x-12-145*

Canonical Correlation by Wikipedia

The Algorithm of CCA(Chinese version)

NCSS Statistical Software Chapter 400 Canonical Correlation

Canonical Correlation Analysis by University of Minnesota

CCA Package