Introduction

Factor analysis is a method for modeling observed variables and is commonly used for data reduction. Its key idea is that some variables show similar patterns because they are all related to factors (latent variables) that cannot be measured directly. The factors are typically viewed as broad concepts or ideas that may describe an observed phenomenon.

Mathematical Models

Suppose we have a set of p observable random variables, \(X_1, X_2, ..., X_p\), and they are linearly related to m latent variables (factors), \(F_1, F_2, ..., F_m\), where \(m<p\). They are expressed below \[X_1 = l_{11}F_1+l_{12}F_2+ \cdots +l_{1m}F_m + \epsilon_1\] \[X_2 = l_{21}F_1+l_{22}F_2+ \cdots +l_{2m}F_m + \epsilon_2\] \[\vdots\]

\[X_p = l_{p1}F_1+l_{p2}F_2+ \cdots +l_{pm}F_m + \epsilon_p\]

Here \(l_{ij}\) denotes the factor loading of the \(i^{th}\) variable on the \(j^{th}\) factor, and \(\epsilon_i\) denotes the unique error of the \(i^{th}\) variable, the part of its variance that cannot be explained by the extracted factors.

Other assumptions are

  • \(Cov(F, \epsilon) = \mathbf{0}\).

  • \(Cov(\epsilon_i, \epsilon_j) = 0\) for \(i \neq j\), where \(i,j = 1,2, ..., p\).

  • \(\mathbf{E}(F) = \mathbf{0}\).

  • \(Cov(F) = I\).
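
Taken together, these assumptions imply a compact matrix form of the model and of the covariance structure that factor analysis tries to reproduce (assuming the observed variables are centered):

\[\mathbf{X} = \mathbf{L}\mathbf{F} + \boldsymbol{\epsilon}, \qquad Cov(\mathbf{X}) = \mathbf{L}\mathbf{L}^{\top} + \boldsymbol{\Psi},\]

where \(\mathbf{L} = (l_{ij})\) is the \(p \times m\) matrix of factor loadings and \(\boldsymbol{\Psi} = Cov(\boldsymbol{\epsilon})\) is the diagonal matrix of unique variances.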

Tasks in Factor Analysis

  • Find the hidden factors behind the observed variables: the hidden factors cannot be measured directly, but they can be interpreted.
  • Estimate the factor loadings.
  • Evaluate the factors we extract.

Factor Analysis

Data

The dataset we work on is bfi from the psych R package. It consists of 2800 observations on 28 variables: 25 self-report personality items plus three additional demographic variables (gender, education, and age). In this project, we aim to identify the putative factors underlying the first 25 items in the dataset.

library(psych)
library(tidyverse)
library(corrplot)
mydata = bfi[1:25]                      # keep only the 25 personality items
desp = describe(mydata) %>%             # summary statistics from psych::describe
  select(vars, n, mean, sd, median, min, max, range)
desp
##    vars    n mean   sd median min max range
## A1    1 2784 2.41 1.41      2   1   6     5
## A2    2 2773 4.80 1.17      5   1   6     5
## A3    3 2774 4.60 1.30      5   1   6     5
## A4    4 2781 4.70 1.48      5   1   6     5
## A5    5 2784 4.56 1.26      5   1   6     5
## C1    6 2779 4.50 1.24      5   1   6     5
## C2    7 2776 4.37 1.32      5   1   6     5
## C3    8 2780 4.30 1.29      5   1   6     5
## C4    9 2774 2.55 1.38      2   1   6     5
## C5   10 2784 3.30 1.63      3   1   6     5
## E1   11 2777 2.97 1.63      3   1   6     5
## E2   12 2784 3.14 1.61      3   1   6     5
## E3   13 2775 4.00 1.35      4   1   6     5
## E4   14 2791 4.42 1.46      5   1   6     5
## E5   15 2779 4.42 1.33      5   1   6     5
## N1   16 2778 2.93 1.57      3   1   6     5
## N2   17 2779 3.51 1.53      4   1   6     5
## N3   18 2789 3.22 1.60      3   1   6     5
## N4   19 2764 3.19 1.57      3   1   6     5
## N5   20 2771 2.97 1.62      3   1   6     5
## O1   21 2778 4.82 1.13      5   1   6     5
## O2   22 2800 2.71 1.57      2   1   6     5
## O3   23 2772 4.44 1.22      5   1   6     5
## O4   24 2786 4.89 1.22      5   1   6     5
## O5   25 2780 2.49 1.33      2   1   6     5
sum(complete.cases(bfi[1:25]))
## [1] 2436

Each item corresponds to one variable whose values are integers from 1 to 6. After removing the cases with missing values, 2436 complete cases remain, which means about 13% of the original observations are dropped. According to a common rule of thumb for sample size adequacy, a sample size of 50 is very poor, 100 poor, 200 fair, 300 good, 500 very good, and more than 1,000 excellent. Therefore, our sample size after removal is adequate for our purposes.
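
As a quick arithmetic check (a small sketch using the same bfi data loaded above), the share of complete cases can be computed directly:

sum(complete.cases(bfi[1:25])) / nrow(bfi)   # 2436 / 2800 = 0.87, i.e., about 13% of cases are dropped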

mydata = bfi[1:25][complete.cases(bfi[1:25]), ]   # keep complete cases only
corrplot(cor(mydata), type = "upper", tl.col = "darkblue", tl.srt = 45)   # visualize the item correlations

Based on the correlation matrix plot above, we observe several correlations that are fairly large in magnitude (positive or negative), which means some of the items are moderately or even highly correlated. In this case, we use factor analysis to explore the structure of the relationships between the variables and to determine whether these relationships can be explained by a smaller number of latent dimensions.

R

Description of R-packages

  • psych: A package for personality, psychometric, and psychological research; it provides the fa, fa.diagram, and fa.parallel functions used here for factor analysis.
  • tidyverse: A set of packages for data processing.
library(psych)
library(tidyverse)

Adequacy of Factor Analysis

There are several methods to check the adequacy of factor analysis. In this project, we adopt the two described below.

1. Kaiser-Meyer-Olkin (KMO)

The Kaiser-Meyer-Olkin (KMO) statistic is a measure of sampling adequacy that varies between 0 and 1. Values close to 1.0 usually indicate that factor analysis is likely to be useful for the data. If the value is less than 0.50, the results of the factor analysis probably won't be very useful.
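
For reference, the overall KMO statistic compares the observed correlations \(r_{ij}\) with the partial correlations \(q_{ij}\) over all pairs of variables,

\[KMO = \frac{\sum_{i \neq j} r_{ij}^{2}}{\sum_{i \neq j} r_{ij}^{2} + \sum_{i \neq j} q_{ij}^{2}},\]

so the statistic approaches 1 when the partial correlations are close to zero.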

mydata = bfi[1:25][complete.cases(bfi[1:25]),]
KMO(mydata)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = mydata)
## Overall MSA =  0.85
## MSA for each item = 
##   A1   A2   A3   A4   A5   C1   C2   C3   C4   C5   E1   E2   E3   E4   E5 
## 0.75 0.84 0.87 0.88 0.90 0.84 0.80 0.85 0.83 0.86 0.84 0.88 0.90 0.88 0.89 
##   N1   N2   N3   N4   N5   O1   O2   O3   O4   O5 
## 0.78 0.78 0.86 0.89 0.86 0.86 0.78 0.84 0.77 0.76

Based on the KMO results above, the MSA for each item is at least 0.75 and the overall MSA is 0.85, which indicates that factor analysis is appropriate for these data.

2. Bartlett’s Test of Sphericity

Bartlett’s Test of Sphericity tests the null hypothesis that the correlation matrix is an identity matrix, i.e., that all off-diagonal correlations are zero. A significant p-value indicates that this null hypothesis should be rejected.
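
Concretely, the test statistic is based on the determinant of the sample correlation matrix \(R\); with \(n\) observations and \(p\) variables it is approximately chi-squared distributed,

\[\chi^2 = -\left(n - 1 - \frac{2p+5}{6}\right)\ln|R|, \qquad df = \frac{p(p-1)}{2},\]

which for \(p = 25\) gives the \(df = 300\) reported below.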

cortest.bartlett(mydata)
## $chisq
## [1] 18146.07
## 
## $p.value
## [1] 0
## 
## $df
## [1] 300

As we can see, the test is highly significant, which means at least some of the off-diagonal correlations are nonzero. This further supports going ahead with factor analysis.

Number of Factors

There are several ways to determine the number of factors; here we use the two methods below.

1. Scree plot

In a scree plot, the eigenvalues of the correlation matrix represent the variance explained by each factor (sometimes sums of squared factor loadings are used instead). An adequate number of factors is the number retained before the sudden downward inflection of the plot. Since the eigenvalues sum to the number of items, a factor with an eigenvalue greater than 1 accounts for more variance than a single average item.

scree(mydata, fa = TRUE, pc = FALSE, main = "scree plot")

This scree plot shows that the eigenvalues for the first three factors are all strictly greater than 1, while the fourth eigenvalue is 0.97, which is almost equal to 1. The first four factors thus account for most of the total variability in the data (as measured by the eigenvalues). The remaining factors account for a very small proportion of the variability and are likely unimportant.

2. Parallel analysis

Parallel analysis compares the eigenvalues of the actual data to those of simulated (random) data. The question is how many factors obtained from the actual data have an eigenvalue greater than the corresponding eigenvalue from the simulated data: factors are retained only as long as the actual-data eigenvalue exceeds the simulated one, and this determines the number of factors.

fa.parallel(mydata, fa = 'fa', error.bars = TRUE)

## Parallel analysis suggests that the number of factors =  6  and the number of components =  NA

The parallel analysis scree plot supports a six-factor structure, because the first six factors of the actual data have higher eigenvalues than the corresponding factors of the simulated data.

Combining the results of the two methods above, we should compare the 4-, 5-, and 6-factor solutions and determine which one is the most interpretable.

Conducting Factor Analysis

The function fa.diagram shows only the factor loadings greater than 0.3 (the cut argument), which makes it easy to judge which solution is the most interpretable. Let's start with 4 factors using the varimax rotation method.

fa4 = fa(mydata, nfactors = 4, fm="pa", max.iter = 100, rotate = "varimax")
fa.diagram(fa4, main = "# factors = 4", cut = .3)

As we can see, A1 and O4 do not load strongly (above 0.3) on any of these four factors, which motivated us to try more factors.

fa6 = fa(mydata, nfactors = 6, fm="pa", max.iter = 100, rotate = "varimax")
fa.diagram(fa6, main = "# factors = 6", cut = .3)

Here is the factor analysis with 6 factors. No items have a strong relationship with the sixth factor. Therefore, we finally try the analysis with 5 factors.

fa5 = fa(mydata, nfactors = 5, fm="pa", max.iter = 100, rotate = "varimax")
fa.diagram(fa5, main = "# factors = 5", cut = .3)

With 5 factors, each factor corresponds to a distinct set of five items, which indicates that this solution is the most interpretable.

Conclusions

Factor Loadings

Below are the factor loadings for the 5-factor case. In this output we also display loadings smaller than 0.3.

print(fa5$loadings, cutoff=0, digits=3)
## 
## Loadings:
##    PA2    PA1    PA3    PA5    PA4   
## A1  0.111  0.040  0.023 -0.428 -0.078
## A2  0.030  0.214  0.139  0.627  0.062
## A3  0.009  0.318  0.109  0.651  0.056
## A4 -0.066  0.205  0.231  0.436 -0.113
## A5 -0.122  0.393  0.088  0.537  0.067
## C1  0.010  0.070  0.546  0.039  0.210
## C2  0.090  0.033  0.649  0.103  0.115
## C3 -0.031  0.024  0.557  0.112 -0.005
## C4  0.240 -0.065 -0.634 -0.038 -0.108
## C5  0.290 -0.176 -0.562 -0.048  0.037
## E1  0.043 -0.575  0.033 -0.105 -0.059
## E2  0.245 -0.679 -0.102 -0.112 -0.042
## E3  0.024  0.537  0.083  0.258  0.281
## E4 -0.116  0.647  0.102  0.306 -0.074
## E5  0.036  0.504  0.313  0.090  0.214
## N1  0.786  0.079 -0.046 -0.216 -0.085
## N2  0.754  0.027 -0.031 -0.194 -0.010
## N3  0.732 -0.061 -0.067 -0.028 -0.004
## N4  0.591 -0.345 -0.179  0.006  0.075
## N5  0.538 -0.161 -0.037  0.101 -0.150
## O1 -0.002  0.213  0.115  0.062  0.505
## O2  0.176  0.005 -0.100  0.082 -0.469
## O3  0.027  0.311  0.077  0.127  0.596
## O4  0.221 -0.191 -0.022  0.156  0.369
## O5  0.085 -0.005 -0.063 -0.010 -0.534
## 
##                  PA2   PA1   PA3   PA5   PA4
## SS loadings    2.710 2.473 2.041 1.844 1.522
## Proportion Var 0.108 0.099 0.082 0.074 0.061
## Cumulative Var 0.108 0.207 0.289 0.363 0.424

Communalities

The communality of the \(i^{th}\) variable is computed by summing the squared loadings for that variable: \[h_i^2 = \sum \limits_{j = 1}^m \hat{l}_{ij}^2\] Communalities indicate the amount of variance in each variable that is accounted for by the retained factors.

print(fa5$communality, cutoff=0, digits=3)
##    A1    A2    A3    A4    A5    C1    C2    C3    C4    C5    E1    E2 
## 0.204 0.463 0.539 0.302 0.470 0.348 0.454 0.324 0.477 0.435 0.348 0.545 
##    E3    E4    E5    N1    N2    N3    N4    N5    O1    O2    O3    O4 
## 0.441 0.541 0.407 0.681 0.608 0.545 0.506 0.349 0.317 0.268 0.474 0.246 
##    O5 
## 0.296
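
As a sanity check on this formula (a minimal sketch; fa5 is the five-factor object fitted above), the communality of a single item such as A1 can be reproduced by summing its squared loadings:

sum(fa5$loadings["A1", ]^2)   # about 0.204, matching fa5$communality["A1"]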

Generally, communalities between 0.0 and 0.4 are considered low. Here, almost half of the communalities above are less than 0.4, so we should check whether the communalities become much larger as the number of factors increases.

com_matrix = c()
# fit factor models with 1 to 11 factors and collect the communalities
for(i in 1:11){
  result <- fa(mydata, nfactors = i, fm="pa", max.iter = 100, rotate = "varimax")
  com_matrix = rbind(com_matrix, result$communality)
}

# plot each item's communality against the number of factors
as_tibble(com_matrix) %>%  
  gather(items, loadings) %>%
  group_by(items) %>%
  mutate(n = 1:11) %>%
  ggplot(aes(x = n, y = loadings)) +
  geom_line(aes(colour = factor(items)), size = 0.5) +
  geom_vline(xintercept = 5, linetype="dotted", color = "blue", size=1) +
  geom_vline(xintercept = 6, linetype="dotted", color = "red", size=1) +
  xlab("Number of Factors") + ylab("Communalities")

As we can see, when the number of factors is larger than 5 or 6, there is no obvious increase in the communalities. This further suggests that five factors are sufficient for this analysis.

Percentage of Variance Accounted For

We can use the eigenvalues to calculate the percentage of variance accounted for by each of the five retained factors. Note that these percentages are computed relative to the variance captured by the five factors, so they sum to 100%.

nfactors = 5
fa5 = fa(mydata, nfactors = nfactors, fm="pa", max.iter = 100, rotate = "varimax")
# percentage of the explained variance attributable to each retained factor
var_per = 100*fa5$values[1:nfactors]/sum(fa5$values[1:nfactors])
cum_var = cumsum(var_per)
tb = as.data.frame(cbind(Var = var_per, Cum.var = cum_var))
tb %>%
  transmute(Factor = paste0('factor', 1:nfactors), 
            Var = sprintf('%4.2f%%', Var),
            Cum.var = sprintf('%4.2f%%', Cum.var))

The factor analysis has thus identified 5 putative factors that affect the 25 self-report personality items. They can be categorized as follows:

No.  Factor              Main items           Proportion of explained variability
1    Agreeableness       A1, A2, A3, A4, A5   43.43%
2    Conscientiousness   C1, C2, C3, C4, C5   21.42%
3    Extraversion        E1, E2, E3, E4, E5   14.62%
4    Neuroticism         N1, N2, N3, N4, N5   11.50%
5    Openness            O1, O2, O3, O4, O5    9.02%

Python

Description of Packages

In this Python example, we use the pandas and numpy packages for matrix manipulation, matplotlib for plotting, and factor_analyzer for the factor analysis.

Data Cleansing

We first load the dataset and keep the 25 personality items on which we plan to perform factor analysis; the demographic columns are dropped and rows containing missing values are removed. The first 15 rows are printed below.

import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
df = pd.read_csv('bfi.csv')
# drop the demographic columns and the row-index column
del df['age']
del df['gender']
del df['education']
del df['Unnamed: 0']
df = df.dropna()          # keep complete cases only
print(df.head(15))
##     Unnamed: 0  A1  A2  A3  A4  A5  C1  C2 ...  N3  N4  N5  O1  O2  O3  O4  O5
## 0        61617   2   4   3   4   4   2   3 ...   2   2   3   3   6   3   4   3
## 1        61618   2   4   5   2   5   5   4 ...   3   5   5   4   2   4   3   3
## 2        61620   5   4   5   4   4   4   5 ...   4   2   3   4   2   5   5   2
## 3        61621   4   4   6   5   5   4   4 ...   2   4   1   3   3   4   3   5
## 4        61622   2   3   3   4   5   4   4 ...   4   4   3   3   3   4   3   3
## 5        61623   6   6   5   6   5   6   6 ...   2   2   3   4   3   5   6   1
## 6        61624   2   5   5   3   5   5   4 ...   2   1   1   5   2   5   6   1
## 7        61629   4   3   1   5   1   3   2 ...   2   6   4   3   2   4   5   3
## 8        61633   2   5   6   6   5   6   5 ...   5   2   4   5   1   5   5   2
## 9        61634   4   4   5   6   5   4   3 ...   4   2   3   5   3   5   6   3
## 10       61637   5   5   5   6   4   5   4 ...   2   2   2   4   2   4   5   2
## 11       61639   5   5   5   6   6   4   4 ...   1   2   1   5   3   4   4   4
## 12       61640   4   5   2   2   1   5   5 ...   2   2   3   5   2   5   5   5
## 13       61643   4   3   6   6   3   5   5 ...   4   5   5   6   6   6   3   2
## 14       61650   4   6   6   2   5   4   4 ...   4   4   5   5   1   5   6   3
## 
## [15 rows x 26 columns]

Number of Factors

Before implementing factor analysis, the number of factors should be determined. Here we make a scree plot and, as a rule of thumb, choose the number of factors to be the number of eigenvalues larger than 1. R has a function scree() that makes this plot directly, so we take the source code of that function and translate it into Python to compute the eigenvalues corresponding to the factors.

cor = np.mat(df.corr())                          # correlation matrix of the 25 items
w, v = np.linalg.eig(cor)                        # eigenvalues and eigenvectors
nvar = 25
tot = np.sum(w)
w = (nvar/tot)*w                                 # rescale eigenvalues to sum to the number of items
cor = np.dot(np.dot(v, np.diag(w)), v.transpose())
np.fill_diagonal(cor, 1 - 1/cor.I.diagonal())    # replace the diagonal with squared multiple correlations
w, _ = np.linalg.eig(cor)                        # eigenvalues of the reduced correlation matrix

Then we make the plot

fig = plt.figure(figsize=(8,5))
plt.plot(range(1,len(w)+1), w, 'ro-', linewidth=2)
plt.axhline(y=1, color='b', linestyle='-')
plt.title('Scree Plot')
plt.xlabel('Number of factors')
plt.ylabel('Eigenvalues of factors')
leg = plt.legend(['Eigenvalues'], loc='best', borderpad=0.3, 
                 shadow=False, prop=matplotlib.font_manager.FontProperties(size='small'),
                 markerscale=0.4)
leg.get_frame().set_alpha(0.4)
leg.set_draggable(True)
plt.show()

There are some differences between the results in R and Python: the eigenvalues we get in Python are a little larger. This is because the R function scree() makes some adjustments to the correlation matrix before calculating the eigenvalues, and it is hard to see exactly what it does from the source code, since scree() calls other functions which in turn call further functions. Differences in numerical precision between the two languages may also contribute. In addition, there is no readily available Python function for the parallel test, and implementing it ourselves would take considerable time, so we skip it here. From this plot we would choose 4 factors, but the eigenvalue corresponding to factor 5 is very close to 1, so we perform factor analysis for both cases, with 4 factors and with 5 factors, and compare the results. Note that since the eigenvalues we get in R are a little smaller, we chose 4, 5, and 6 factors in the R part.

Conducting Factor Analysis

from factor_analyzer import FactorAnalyzer
fa = FactorAnalyzer()
fa.analyze(df, 4, rotation="varimax")
print(fa.loadings)
##      Factor1   Factor2   Factor3   Factor4
## A1 -0.226687  0.093998 -0.007377 -0.013420
## A2  0.542422  0.030603  0.154972  0.011014
## A3  0.640846  0.009598  0.123128  0.013398
## A4  0.438320 -0.064614  0.235476 -0.138937
## A5  0.645121 -0.121608  0.090791  0.039497
## C1  0.085806  0.007177  0.543701  0.215154
## C2  0.101918  0.089465  0.651983  0.111184
## C3  0.097541 -0.032617  0.560232 -0.011784
## C4 -0.076720  0.248365 -0.632005 -0.116322
## C5 -0.172626  0.301729 -0.546415  0.013301
## E1 -0.499413  0.062931  0.053258 -0.111204
## E2 -0.576909  0.263473 -0.073725 -0.108642
## E3  0.584843  0.012900  0.062361  0.306781
## E4  0.695974 -0.133051  0.078326 -0.033375
## E5  0.447418  0.013521  0.280720  0.258260
## N1 -0.070888  0.748992 -0.062360 -0.036899
## N2 -0.095046  0.734729 -0.040713  0.023141
## N3 -0.055609  0.744888 -0.059397 -0.005511
## N4 -0.258909  0.601013 -0.152009  0.027601
## N5 -0.052317  0.549468 -0.019533 -0.179215
## O1  0.202969 -0.006323  0.104372  0.517115
## O2  0.064475  0.178084 -0.096007 -0.477556
## O3  0.320167  0.022416  0.064080  0.611863
## O4 -0.046972  0.229054  0.001080  0.295590
## O5 -0.003589  0.083179 -0.064602 -0.522385
from factor_analyzer import FactorAnalyzer
fa = FactorAnalyzer()
fa.analyze(df, 5, rotation="varimax")
print(fa.loadings)
##      Factor1   Factor2   Factor3   Factor4   Factor5
## A1  0.040465  0.111126  0.022798 -0.077931 -0.428166
## A2  0.213716  0.029588  0.139037  0.062139  0.626946
## A3  0.317848  0.009357  0.109331  0.056196  0.650743
## A4  0.204566 -0.066476  0.230584 -0.112700  0.435624
## A5  0.393034 -0.122113  0.087869  0.066708  0.537087
## C1  0.070184  0.010416  0.545824  0.209584  0.038878
## C2  0.033270  0.089574  0.648731  0.115434  0.102782
## C3  0.023907 -0.030855  0.557036 -0.005183  0.111578
## C4 -0.064984  0.240410 -0.633806 -0.107535 -0.037498
## C5 -0.176395  0.290318 -0.562467  0.036822 -0.047525
## E1 -0.574835  0.042819  0.033144 -0.058795 -0.104813
## E2 -0.678731  0.244743 -0.102483 -0.042010 -0.112517
## E3  0.536816  0.024180  0.083010  0.280877  0.257906
## E4  0.646833 -0.115614  0.102023 -0.073422  0.306101
## E5  0.504069  0.036145  0.312899  0.213739  0.090354
## N1  0.078923  0.786807 -0.045997 -0.084704 -0.216363
## N2  0.027301  0.754109 -0.030568 -0.010304 -0.193744
## N3 -0.061430  0.731721 -0.067084 -0.004217 -0.027712
## N4 -0.345388  0.590602 -0.178902  0.075225  0.005886
## N5 -0.161291  0.537858 -0.037309 -0.149769  0.100931
## O1  0.213005 -0.002224  0.115080  0.504907  0.061550
## O2  0.004560  0.175788 -0.099729 -0.468925  0.081809
## O3  0.310956  0.026736  0.076873  0.596007  0.126889
## O4 -0.191196  0.220582 -0.021906  0.369012  0.155475
## O5 -0.005347  0.085401 -0.062730 -0.533778 -0.010384

Discussion

From the factor loadings, we can see that the result of factor analysis with 5 factors is good. When the number of factors is 5, Factor 1 is mainly composed of E1-E5, Factor 2 of N1-N5, Factor 3 of C1-C5, Factor 4 of O1-O5, and Factor 5 of A1-A5. This fits well with the nature of the raw columns, because columns sharing the same initial letter are answers to questions from the same category. If we choose 4 factors instead, Factor 1 is composed of both E1-E5 and A1-A5, meaning that it lumps those two groups of columns together and accounts for them with a single factor; this shows that the E columns and the A columns are related in some respects. However, we notice that the absolute values of the factor loadings for A1 are all smaller than 0.25, meaning that column A1 is not well explained by any of the factors. If we instead take 5 factors, the largest factor loading for almost every column exceeds 0.4 in absolute value (O4 is the only exception, at about 0.37), so a substantial share of each column is explained by some factor and the columns are well categorized. Therefore, choosing 5 factors is more reasonable.

Conclusion

In conclusion, five is the most reasonable number of factors. With five factors, the variation of columns that are originally correlated (those with the same initial letter) is mostly accounted for by the same factor, meaning that factor analysis successfully recovers the structure of the original data, and we identify five independent latent variables. Later we may look more deeply into the source code of the scree() function in R to see exactly what adjustments it makes to the correlation matrix, so that we can obtain the same results from Python and R.

Stata

Load and summarize the data

import delimited bfi.csv,clear

Then we summarize the data to get a brief description. We are only interested in the 25 personality self-report items, so although the three additional demographic variables (gender, education, and age) exist, we do not include them in the analysis. The total number of observations should be 2800, but some values are missing, as we can see from the table below.

summarize A1-A5 C1-C5 E1-E5 N1-N5 O1-O5 gender education age
                     Table 1: Summary statistics of all 28 variables

Principal components extraction

We use the principal components extraction method (PCA) to do the factor analysis for the bfi data. PCA uses an orthogonal transformation to convert observations of possibly correlated variables into values of linearly uncorrelated variables called principal components. After this transformation, the first principal component has the largest possible variance, and each succeeding component in turn has the highest possible variance subject to being orthogonal to the preceding components. These principal components are the directions along which the data are most spread out.
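
In symbols (restating the description above), if \(R\) is the correlation matrix of the standardized items with eigenvalue-eigenvector pairs \((\lambda_1, \mathbf{a}_1), \ldots, (\lambda_p, \mathbf{a}_p)\) ordered so that \(\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p\), then the \(j^{th}\) principal component is

\[Z_j = \mathbf{a}_j^{\top}\mathbf{X}, \qquad Var(Z_j) = \lambda_j,\]

and the components are mutually uncorrelated.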

factor A1-A5 C1-C5 E1-E5 N1-N5 O1-O5, pcf
        Table 2: Principal-component factor extraction

This table shows the eigenvalues, differences, proportions, and cumulative variance.

  • Eigenvalue: the variance of the factor. The first factor accounts for the most variance, the second for the next highest amount, and so on.

  • Difference: the difference between the current eigenvalue and the following one.

  • Proportion: the proportion of variance accounted for by the factor.

  • Cumulative: the cumulative proportion of variance accounted for up to and including this factor.

Checking Adequacy

KMO

The Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy is a widely used measure of factorability. It tests whether the partial correlations within the data are close enough to zero to suggest that there is at least one latent factor underlying the variables.

estat kmo
               Table 3: The KMO value of the 25 variables 

The overall KMO value of 0.8486 is well above 0.5, the minimum acceptable value.

Number of factors

Scree Plot

A scree plot shows the fraction of total variance in the data represented by each principal component. We have 6 eigenvalues greater than 1, which suggests 6 factors. Another way to evaluate the scree plot is with a parallel analysis.

scree
                  Figure 1: Scree plot of eigenvalues after PCA

Parallel Analysis

search fapara
fapara, pca reps(100)
                   Figure 2: Parallel analysis

This suggests that the number of factors is 5, as 5 factors lie above the simulated (dashed) data line. Thus, we decide to use 5 factors instead.

Conducting and Conclusions

To interpret the extracted factors, we use a varimax rotation, which attempts to maximize the variance of the squared loadings within each column (factor), and the “blanks” option displays only the factor loadings whose absolute value exceeds a specified threshold (here 0.45).
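
For reference, the (raw) varimax criterion maximized by the rotation is the sum, over factors, of the variance of the squared loadings,

\[V = \sum_{j=1}^{m}\left[\frac{1}{p}\sum_{i=1}^{p} l_{ij}^{4} - \left(\frac{1}{p}\sum_{i=1}^{p} l_{ij}^{2}\right)^{2}\right],\]

which tends to push each item toward a large loading on one factor and near-zero loadings on the others (implementations often apply Kaiser normalization to the loadings before rotating).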

factor A1-A5 C1-C5 E1-E5 N1-N5 O1-O5, pcf factors(5)
rotate, orthogonal varimax blanks(0.45)

After limiting the number of factors to 5 and rotating the factors to facilitate interpretation, we obtain the final results: Factor 1 comprises N1-N5, Factor 2 comprises E1-E5, Factor 3 comprises C1-C5, Factor 4 comprises A1-A5, and Factor 5 comprises O1-O5.

      Table 4: The final rotated factor loadings and uniquenesses

Things to consider

  • Based on the final factor loadings obtained by the three approaches, the results are nearly the same, but the variables seem to load more heavily on the factors in Stata than in R and Python. Later we may investigate why this happens.
  • We use almost the same methods to determine the number of factors in all three languages, yet the suggested numbers differ. Fortunately, our final choice of the number of factors is included in all three sets of results. We hope to examine further where the differences come from, especially the difference in the scree plots between R and Python.

  • We conduct factor analysis in all three languages with the “varimax” rotation method. What if we use other kinds of rotation, such as “quartimax”, “bentlerT”, or “equamax”? Would they change the results dramatically?
  • In this case, the type of factor analysis we use is exploratory factor analysis (EFA). There is another type, confirmatory factor analysis (CFA), a more complex approach that tests the hypothesis that the items are associated with specific factors. What would happen if we ran CFA on this example dataset?

Reference