Foreword
We want to represent the distances among the elements in a parsimonious way. In plain English, we want to reduce several independent variables to a 2D chart.
Classical MDS uses the fact that the coordinate matrix can be derived by eigenvalue decomposition (or singular value decomposition) of the double-centered distance matrix.
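As a sketch of that idea (shown on the built-in eurodist inter-city distances, since the Olympic data is only loaded below), we can recover the coordinates that cmdscale returns by eigendecomposition of the double-centered squared distance matrix:

```r
# Classical MDS by hand: double-center the squared distances,
# then scale the eigenvectors by the square roots of the eigenvalues.
D <- as.matrix(eurodist)                  # built-in inter-city distances
n <- nrow(D)
J <- diag(n) - matrix(1 / n, n, n)        # centering matrix
B <- -0.5 * J %*% D^2 %*% J               # double-centered squared distances
e <- eigen(B, symmetric = TRUE)
coords <- e$vectors[, 1:2] %*% diag(sqrt(e$values[1:2]))
# cmdscale() yields the same coordinates, up to a sign flip per axis
fit <- cmdscale(eurodist, k = 2)
max(abs(abs(coords) - abs(fit)))          # effectively zero
```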
We have a dataset, run_record, with Olympic records by country in 8 track & field events (100m, 200m, 400m, 800m, 1500m, 5000m, 10000m, Marathon). These Olympic standards are the country’s best times. The data is standardized.
## X100m X200m X400m X800m X1500m
## Argentina 0.0594137 -0.3126106 0.24391299 0.03530558 0.1757051
## Australia -1.2962228 -0.8777403 -1.00718688 -0.53664481 -0.8126362
## Austria -0.3020894 -0.1667707 -0.02020809 0.03530558 -0.4831891
## Belgium -0.3472772 -0.6407504 -0.56235137 -0.72729495 -0.5490785
## Bermuda 0.2401652 -0.4402205 -0.39553805 0.41660584 0.3074840
## Brazil -0.9799076 -1.1876501 -1.06974187 -1.29924534 -0.5490785
## X5000m X10000m marathon
## Argentina -0.3779943 -0.5271604 -0.4366164
## Australia -0.9037300 -0.5986248 -0.6667369
## Austria -0.4699980 -0.4854728 -0.1405878
## Belgium -1.0351640 -0.9916792 -0.7013666
## Bermuda 1.3437903 1.1641643 1.4400937
## Brazil -0.1808434 -0.2413027 -0.8298319
## [1] 54 8
We have 54 rows (observations) by 8 columns (variables): the input matrix. Each row is identified by a unique row name. This is a small matrix. Nonetheless, the matrix holds a lot of information.
We want to reduce the multidimensional matrix to two dimensions and be able to represent the results on a 2D plot.
For that, we use an optimization procedure that relies on a least-squares loss function. Similar to a regression, classical MDS minimizes a residual sum of squares (a loss known as ‘Strain’; the ‘Stress’ function belongs to non-metric MDS, covered later). Classical MDS is also known as metric MDS.
Implementing a classical MDS
First, we compute the Euclidean distance matrix (as we do in hierarchical clustering) and its eigendecomposition. Then, we reduce the number of dimensions. We can choose any number less than the original number of dimensions (p = 8). We choose two dimensions (k = 2).
# Euclidean distances
run_record_d <- dist(run_record_sc)
# k = 2 dimensions
run_record_fit <- cmdscale(run_record_d, eig = TRUE, k = 2)
# coordinates
head(run_record_fit$points)
##                [,1]       [,2]
## Argentina -0.4163326 -0.3945394
## Australia -2.3525022 0.5502192
## Austria -0.7306318 -0.1805723
## Belgium -1.9797765 -0.3770560
## Bermuda 1.4861338 1.6421881
## Brazil -2.2082526 0.7916572
# eigenvalues
head(run_record_fit$eig)
## [1] 355.274367  33.835736  12.058798  10.910007   5.171604   3.746459
The eigenvalues show the strength of each component. The first two are large; beyond 2, the values plummet. This indicates that 2 dimensions are sufficient: these are the principal components.
Now, we can plot the two dimensions.
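A minimal base-R plotting sketch (shown on the built-in eurodist fit for reproducibility; the same calls apply to run_record_fit$points):

```r
fit <- cmdscale(eurodist, k = 2)
# empty frame, then labels at their (x, y) coordinates
plot(fit[, 1], fit[, 2], type = "n",
     xlab = "Coordinate 1", ylab = "Coordinate 2", main = "Classical MDS")
text(fit[, 1], fit[, 2], labels = rownames(fit), cex = 0.7)
```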
From 8 variables, we boiled down the observations to 2 ‘latent variables’.
What kind of readings can we make?
These are Olympic records in athletics. We must read the axes as race times (below average, average, or above average). The center (centrality) is the average, as always.
In the left corner, we find highly competitive countries (times lower, i.e. better, than the average). These countries have the best standards (times): we could call this reference group the ‘elite group’ (proximity).
The US is set apart in the upper-left corner. It is a track and field powerhouse in speed events.
Thailand is to the right of the elite group. Thailand’s times are worse than the elite group’s.
Component 1 (x-axis): overall standards (times). Left: below the average. Right: above the average.
On the other hand, Kenya is on the left side, but further down. Kenya is not known for winning podiums in speed events, but it manages to place runners in the finals of all events. For that reason, Kenya sits on the negative side of the x-axis with the elite group. However, we know Kenya mainly dominates long distances. That sets it apart from the elite group on the y-axis.
It is exactly the opposite (polarity) for the US. We know they dominate speed races, but they seem to underperform in endurance events.
Component 2 (y-axis): endurance vs. speed events. Top: speed. Bottom: endurance.
Can we draw other conclusions?
What about Samoa and the Cook Islands? Turkey or North Korea?
We have to interpret the results with the concepts of centrality, proximity, and polarity in mind.
Improve the visualization – Circles
Using the plotrix package, let us add circles.
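plotrix::draw.circle adds one circle at a time; the base symbols() function achieves the same effect without an extra package. A sketch on the eurodist fit (the circle radii are illustrative, chosen for this coordinate scale):

```r
fit <- cmdscale(eurodist, k = 2)
plot(fit, pch = 19, asp = 1, xlab = "Coordinate 1", ylab = "Coordinate 2")
# concentric circles around the origin help judge centrality
symbols(rep(0, 3), rep(0, 3), circles = c(500, 1000, 1500),
        add = TRUE, inches = FALSE, fg = "grey")
```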
Now we have an overall view in relation to the origin; centrality.
We zoom in on the central observations.
Improve the visualization – Colours
We could discriminate the observations with factors:
We choose to differentiate the observations in geographical zones.
We enrich the dataset with geographical zones (continent).
## X100m X200m X400m X800m X1500m X5000m X10000m marathon continent
## Argentina 10.23 20.37 46.18 106.2 220.8 799.8 1659.0 7774.2 AM
## Australia 9.93 20.06 44.38 104.4 211.8 775.8 1651.8 7650.6 PA
## Austria 10.15 20.45 45.80 106.2 214.8 795.6 1663.2 7933.2 EU
## Belgium 10.14 20.19 45.02 103.8 214.2 769.8 1612.2 7632.0 EU
## Bermuda 10.27 20.30 45.26 107.4 222.0 878.4 1829.4 8782.2 AM
## Brazil 10.00 19.89 44.29 102.0 214.2 808.8 1687.8 7563.0 AM
AF-ME: Africa and Middle East, AM: Americas, AS: Asia, EU: Europe, PA: Pacific.

We plot the colored chart.
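A hedged sketch of the colouring pattern (the continent factor is replaced by a made-up grouping of the eurodist cities; the real chart maps run_record_geo$continent to colours the same way):

```r
fit <- cmdscale(eurodist, k = 2)
# stand-in grouping factor; in the article this is the continent column
grp <- factor(rep(c("North", "Central", "South"), length.out = nrow(fit)))
plot(fit, col = as.integer(grp), pch = 19,
     xlab = "Coordinate 1", ylab = "Coordinate 2")
legend("topright", legend = levels(grp),
       col = seq_along(levels(grp)), pch = 19)
```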
3D plotting
Can 3D help?
First, we rerun the analysis to generate 3 vectors (k=3). Then we plot the new results.
# Euclidean distances
run_record_3d <- dist(run_record_sc)
# k = 3 dimensions
run_record_3dfit <- cmdscale(run_record_3d, eig = TRUE, k = 3)
head(run_record_3dfit$points)
##                [,1]       [,2]        [,3]
## Argentina -0.4163326 -0.3945394 0.40914141
## Australia -2.3525022 0.5502192 0.29073599
## Austria -0.7306318 -0.1805723 0.26293174
## Belgium -1.9797765 -0.3770560 0.04183601
## Bermuda 1.4861338 1.6421881 -0.72124975
## Brazil -2.2082526 0.7916572 -0.52070899
# Plot
x <- run_record_3dfit$points[,1]
y <- run_record_3dfit$points[,2]
z <- run_record_3dfit$points[,3]
# Convert the continents
run_record_geo$colour <- NA
run_record_geo$colour <- ifelse(run_record_geo$continent == 'AF-ME', 1,
ifelse(run_record_geo$continent == 'AM', 2,
ifelse(run_record_geo$continent == 'AS', 3,
ifelse(run_record_geo$continent == 'EU', 4,
ifelse(run_record_geo$continent == 'PA', 5, NA)))))

We can change the parameters and reprint the plot. However, it would be more effective if we could work with an interactive 3D plot. The rgl package offers this possibility.
Normally, we turn and flip with the 3D plot to examine the results. This is a .gif file:
The third dimension does not add much value here, since our goal was to reduce the number of dimensions to make the data more interpretable. However, there are situations where going from 4+ dimensions to 3 rather than 2 might be helpful.
When the original dimensions are clearly defined, we can better interpret the reduction. If the 8 dimensions were economic indicators, a 3D plot would reveal latent variables, clusters, and affinities among countries: prosperity, policies, integration, etc.
For example, the reduction would certainly reveal a cluster showing an economic block or a free-trade zone, because metrics such as GDP, inflation, trade, and the central bank prime rate all point in the same direction. It is easy to follow these metrics when we compare two countries, but not when we want to focus on several countries. This is where dimension reduction helps: we can automatically find trends, patterns, and clusters in a list of 100 countries.
With classical (metric) MDS, the algorithm minimizes a least-squares loss (a residual sum of squares, sometimes called ‘Strain’) on the Euclidean distances between observations.
Non-metric MDS (isoMDS from the MASS package) does not work from the raw distances: it looks for a monotonic transformation of the dissimilarities, minimizing the ‘Stress’ function.
It can sometimes improve the results. Let us find out. We plot the metric and non-metric MDS side-by-side.
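A side-by-side sketch with MASS::isoMDS (on the built-in eurodist distances for reproducibility; isoMDS expects a dist object and reports its final Stress):

```r
library(MASS)

metric_fit <- cmdscale(eurodist, k = 2)
nonmetric_fit <- isoMDS(eurodist, k = 2)   # Kruskal's non-metric MDS
op <- par(mfrow = c(1, 2))
plot(metric_fit, main = "Metric MDS")
plot(nonmetric_fit$points, main = "Non-metric MDS")
par(op)
nonmetric_fit$stress                       # final Stress, in percent
```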
Sometimes, metric MDS can yield significantly different results from non-metric MDS.
A crucial decision is how many dimensions are optimal. In our worked example above, we chose k=2.
The nFactors package offers a suite of functions that give us more information.
The scree plots confirm that 2 dimensions are sufficient. Beyond 2 components, the scree plot does not show much improvement. In fact, the ‘acceleration factor’ marks the elbow: most of the variability is picked up by 2 components.
Component analysis applies a mathematical procedure for transforming a number of (possibly) correlated variables into a (smaller) number of uncorrelated variables called principal components.
The first principal component accounts for as much of the variability in the data as possible. Each succeeding component accounts for as much of the remaining variability as possible. It is possible to view all the components. However, we can only plot 2 or 3 components.
Component analysis is useful when there is data on a large number of variables, and (possibly) there is some redundancy in those variables. In this case, redundancy means that some of the variables are correlated with one another. And because of this redundancy, component analysis can be used to reduce the observed variables into a smaller number of components that will account for most of the variance in the observed variables.
Component analysis is recommended as an exploratory tool to uncover unknown trends in the data. The technique has found application in fields such as face recognition, image compression, and is a common technique for finding patterns in data of high dimension.
This technique is abundantly used in social sciences and marketing (as illustrated with the schema about the Beer Market).
Component analysis has three declinations:
The three approaches are similar. The main difference comes from the dataset they are employed with. In the last section (Wrap-up) we will examine the three approaches with three distinct examples.
The dataset
For our introduction to component analysis, we use a dataset about the 50 US states. crime_data lists crime information and the percentage of urban population. Source: Edureka.
The data is standardized.
## Murder Assault UrbanPop Rape
## Alabama 1.24256408 0.7828393 -0.5209066 -0.003416473
## Alaska 0.50786248 1.1068225 -1.2117642 2.484202941
## Arizona 0.07163341 1.4788032 0.9989801 1.042878388
## Arkansas 0.23234938 0.2308680 -1.0735927 -0.184916602
## California 0.27826823 1.2628144 1.7589234 2.067820292
## Colorado 0.02571456 0.3988593 0.8608085 1.864967207
Entering raw data and extracting the components
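The crime data matches the built-in USArrests dataset, so the extraction can be sketched reproducibly (cor = TRUE works from the correlation matrix, i.e. on standardized variables), matching the output that follows:

```r
# PCA on the correlation matrix (equivalent to standardizing first)
crime_pca <- princomp(USArrests, cor = TRUE)
summary(crime_pca)   # standard deviations and variance proportions
```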
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4
## Standard deviation 1.5748783 0.9948694 0.5971291 0.41644938
## Proportion of Variance 0.6200604 0.2474413 0.0891408 0.04335752
## Cumulative Proportion 0.6200604 0.8675017 0.9566425 1.00000000
The first and second rows are the most important: Standard Deviation and Proportion of Variance. The first component accounts for about 62 % of the variance.
There is a close kinship with regression. With a regression, we choose explanatory variables to maximize \(R^2\); with component analysis, we end up with latent variables, and the principal components account for most of the variability.
We can see with the following plot how each component contributes to the overall variability.
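A scree-style sketch using the base screeplot() function (the crime_pca object is refitted here from the built-in USArrests):

```r
crime_pca <- princomp(USArrests, cor = TRUE)
screeplot(crime_pca, type = "lines", main = "Variance by component")
```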
We could also base the components on the covariance matrix (cor = FALSE).
We plot the principal components (1 and 2) on a 2D chart (there is a function for it).
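The function alluded to is most likely biplot(); a self-contained sketch on the built-in USArrests:

```r
crime_pca <- princomp(USArrests, cor = TRUE)
biplot(crime_pca)   # states as labels, variables as arrows
```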
The ‘groups’ and ‘clusters’ we can perceive are illustrative of affinities. For example, Mississippi and North & South Carolina have ‘something’ in common. It is not simply geographical.
‘Assault’ and ‘Murder’ both point in the same direction because they are positively correlated. When two arrows point in opposite directions, they are negatively correlated. When they are perpendicular (‘Murder’ and ‘UrbanPop’), there is no correlation.
Moreover, the arrow length indicates the strength of the relation.
Arrows also ‘lean’ in the direction of the states they are the most correlated with. For example, ‘Murder’ leans on the side of states with high murder rates. We can compare the biplot with the choropleth maps below.
Both visualizations are coherent with the component analysis.
Determining the number of factors to extract
Contrary to cmdscale for MDS, we do not choose the number of dimensions or components with the princomp function. It is done automatically.
Again, we want to challenge our findings with the nFactors package.
The scree plots confirm that 2 eigenvalues are sufficient. Beyond 2 components, the scree plot does not show much improvement.
Factor (or Factorial) Analysis is the task where we intend to reduce the dataset dimension by analyzing and understanding the impact of its features on a model.
Such analysis allows us to select a subset of the original features, reducing the dimensions. During a subset selection, we try to identify and remove as much of the irrelevant and redundant information as possible.
Let us apply the technique on the standardized Olympic dataset.
run_record_fit3 <- factanal(run_record_sc, 3, rotation = "varimax")
print(run_record_fit3, digits = 2, cutoff = 0.3, sort = TRUE)
##
## Call:
## factanal(x = run_record_sc, factors = 3, rotation = "varimax")
##
## Uniquenesses:
## X100m X200m X400m X800m X1500m X5000m X10000m marathon
## 0.08 0.07 0.23 0.00 0.11 0.02 0.01 0.09
##
## Loadings:
## Factor1 Factor2 Factor3
## X1500m 0.67 0.49 0.44
## X5000m 0.84 0.43 0.31
## X10000m 0.87 0.40
## marathon 0.84 0.38
## X100m 0.37 0.87
## X200m 0.37 0.83 0.32
## X400m 0.47 0.68 0.30
## X800m 0.54 0.44 0.71
##
## Factor1 Factor2 Factor3
## SS loadings 3.40 2.82 1.18
## Proportion Var 0.43 0.35 0.15
## Cumulative Var 0.43 0.78 0.92
##
## Test of the hypothesis that 3 factors are sufficient.
## The chi square statistic is 9.44 on 7 degrees of freedom.
## The p-value is 0.223
Factor1 and Factor2 are two latent variables that account for most of the variability (78 %).
Let us transfer these results on a 2D chart.
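The chart can be built directly from the loadings matrix. A sketch on the built-in ability.cov example (the same pattern applies to run_record_fit3$loadings):

```r
# two-factor model on the built-in ability covariance matrix
fa <- factanal(factors = 2, covmat = ability.cov)
L <- fa$loadings
plot(L[, 1], L[, 2], type = "n",
     xlab = "Factor 1", ylab = "Factor 2", main = "Factor loadings")
text(L[, 1], L[, 2], labels = rownames(L))
abline(h = 0, v = 0, lty = 3)
```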
Factor1 distributes events by distance (left to right: speed events, endurance events).
Factor2 could be anything: how close countries’ standards are to one another, how some events are highly competitive while others are dominated by a few countries, etc. We would need to dig in.
From entropy, we emerge with two latent variables. One is explainable (distance); the other remains to be understood. Nonetheless, factor analysis fulfilled its role of distilling raw data into something finer.
Determining the number of factors to extract
Again, we challenge our findings with the nFactors package.
The scree plots confirm that 2 factors are optimal.
Dimension reduction methods can address issues in diverse fields: social sciences, business, natural sciences, health sciences, etc.
The FactoMineR package
So far, we have used basic functions for MDS (dist and cmdscale for metric MDS; dist and isoMDS for non-metric MDS), princomp for CA, and factanal for FA. We also touched on the FactoMineR package when drawing scree plots.
The FactoMineR package can perform more:
The package offers a large number of additional functions for exploratory analysis. The website mentions additional packages:
- missMDA: handles missing values in exploratory multivariate analysis such as principal component analysis (PCA), multiple correspondence analysis (MCA), and multiple factor analysis (MFA).
- FAMT: performs simultaneous tests under dependence in high-dimensional data.
- SensoMineR: provides numerous graphical outputs that are easy to interpret, as well as syntheses of results issuing from various analysis-of-variance models or factor analysis methods, accompanied with confidence ellipses.
- Not to forget the ade4 package, designed for analyzing ecological data and applying Euclidean methods in environmental sciences.
The dataset
We use a dataset from the Heritage Foundation about Economic Freedom. Each year, the Foundation along with the Wall Street Journal releases an Index of Economic Freedom where countries are measured with 12 metrics (12 ‘freedoms’). Consult the website for definitions, the scope of the survey, assumptions, rankings, graphs, interactive maps, and more.
The dataset comes with microeconomic (tax rates for example) and macroeconomic indicators (GDP for example).
In addition, we enrich the dataset with the 2014 Human Development Index.
We remove some countries (those with NA metrics and those with less than 1M population).
Here is the subset we are going to work with.
## 'data.frame': 146 obs. of 18 variables:
## $ Region : chr "Asia-Pacific" "Europe" "Middle East / North Africa" "Sub-Saharan Africa" ...
## $ World.Rank : num 163 65 172 165 156 33 5 30 68 44 ...
## $ Region.Rank : num 40 30 14 41 26 19 4 17 15 4 ...
## $ X2017.Score : num 49.1 64.4 46.5 48.5 50.4 ...
## $ Property.Rights : num 12.6 54 38.2 36.4 32.4 ...
## $ Judical.Effectiveness: num 28.4 28.5 29.6 19.8 39.6 ...
## $ Government.Integrity : num 27.5 39.7 31.7 12.8 38.2 ...
## $ Tax.Burden : num 91.6 86.9 81.1 87.7 62.6 ...
## $ Gov.t.Spending : num 79.9 72.5 51 58.6 54.6 ...
## $ Fiscal.Health : num 97.3 51.5 19.8 70.7 56.4 ...
## $ Business.Freedom : num 54.2 79.3 62.1 58.5 57.3 78.5 89.3 76.9 71.5 69.4 ...
## $ Labor.Freedom : num 59.9 50.7 49.5 40.4 46.1 ...
## $ Monetary.Freedom : num 69.3 81.4 67 70.6 50.9 ...
## $ Trade.Freedom : num 66 87.7 63.3 56.7 66.7 ...
## $ Investment.Freedom. : num 1 70 35 30 50 80 80 90 55 75 ...
## $ Financial.Freedom : num 1 70 30 40 50 70 90 70 50 80 ...
## $ HDI_2014 : num 46.5 73.3 73.6 53.2 83.6 73.3 93.5 88.5 75.1 82.4 ...
## $ Country_HDI : chr " Afghanistan" " Albania" " Algeria" " Angola" ...
The Freedom Index (X2017.Score) and each Freedom metrics are percentages; 100 being the highest value. The same goes for the HDI_2014.
The 12 Freedom metrics are:
Country names may differ (row names vs. Country_HDI). HDI = Human Development Index.
We also have extra subsets: a microeconomic, a macroeconomic and the HDI subsets.
## 'data.frame': 146 obs. of 4 variables:
## $ Tariff.Rate.... : num NA 1.1 8.4 11.7 6.6 2.4 1.9 1.5 5.3 3.6 ...
## $ Income.Tax.Rate.... : num 20 23 35 17 35 26 45 50 25 0 ...
## $ Corporate.Tax.Rate....: num 20 15 23 30 35 20 30 25 20 0 ...
## $ Tax.Burden...of.GDP : num 6.5 23.6 11.7 6.5 35.9 23.5 27.5 43 14.2 3.8 ...
## 'data.frame': 146 obs. of 11 variables:
## $ Gov.t.Expenditure...of.GDP.: num 27.1 30 44.4 28.9 43.9 ...
## $ Country : chr "Afghanistan" "Albania" "Algeria" "Angola" ...
## $ Population..Millions. : num 32 2.8 39.5 25.1 42.4 3.3 23.9 8.6 9.5 1.2 ...
## $ GDP..Billions..PPP. : num 62.3 32.7 578.7 184.4 972 ...
## $ GDP.Growth.Rate.... : num 1.5 2.6 3.7 3 1.2 3 2.5 0.9 1.1 3.2 ...
## $ X5.Year.GDP.Growth.Rate....: num 5.4 1.9 3.3 4.7 2.7 4.3 2.7 1 2.4 3.7 ...
## $ GDP.per.Capita..PPP. : num 1947 11301 14504 7344 22554 ...
## $ Unemployment.... : num 9.6 17.3 10.5 7.6 6.7 16.3 6.3 5.7 4.7 1.2 ...
## $ Inflation.... : num -1.5 1.9 4.8 10.3 26.5 3.7 1.5 0.8 4 1.8 ...
## $ FDI.Inflow..Millions. : num 58 1003 -587 8681 11655 ...
## $ Public.Debt....of.GDP. : num 6.8 71.9 8.7 62.3 56.5 46.6 36.8 86.2 36.1 63.3 ...
## 'data.frame': 146 obs. of 2 variables:
## $ Country_HDI: chr " Afghanistan" " Albania" " Algeria" " Angola" ...
## $ HDI_2014 : num 46.5 73.3 73.6 53.2 83.6 73.3 93.5 88.5 75.1 82.4 ...
HDI = Human Development Index.
The analysis
Compute the components.
library(FactoMineR)
liberty_pca <- PCA(liberty_base[ ,1:17], quanti.sup = c(2,3,4,17), quali.sup = c(1), graph = FALSE)

We can print the Principal Component Analysis (PCA) results (the liberty_pca object). There are a lot of results we can leave aside.
We can print many more results with summary(liberty_pca)… Again, we get a lot of output. Let us focus on the essential.
We can also estimate the best number of dimensions (estim_npc) to use (find the minimum or the first local minimum) and the mean error for each dimension tested.
Originally, we used:
liberty_pca <- PCA(liberty_base[ ,1:17], quanti.sup = c(2,3,4,17), quali.sup = c(1), graph = FALSE)
We have to remove the quanti.sup and quali.sup parameters.
# from
# str(liberty_base[ ,1:17])
# to
liberty_ncp <- estim_ncp(liberty_base[ ,5:16], scale = TRUE)
liberty_ncp$ncp
## [1] 5
It suggests 5 components as an optimum.
We print the mean error and show the first local minimum.
We plot the eigenvalues. We saw the use of eigenvalues in MDS above. We can rank the components with the eigenvalues.
Each component is what we call a ‘latent variable’. Component 1 (Comp 1) is always the most important latent variable; here it represents over 50 % of the total variability. The first 5 components total 85 % of the variability.
The correspondence map
We plot the 2D chart (the first two components). There are lots of parameters. Here are three examples before we jump to the analysis.
Countries with labels.
Countries without labels.
Regions and labels; countries in the background.
These regions are the qualitative supplementary variables quali.sup in:
PCA(liberty_base[ ,1:17], quanti.sup = c(2,3,4,17), quali.sup = c(1), graph = FALSE)
Dim 1 and Dim 2 stand for comp 1 and comp 2 and account for 2/3 of the variability.
We read the results with three aspects in mind:
Digging in the results
We print the ‘Variables-Dimensions’ matrix.
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## Property.Rights 14.4 0.0 0.7 0.6 0.3
## Judical.Effectiveness 12.2 0.0 3.8 2.0 3.2
## Government.Integrity 13.4 0.5 1.1 2.7 0.7
## Tax.Burden 1.0 41.6 3.4 2.9 24.9
## Gov.t.Spending 3.3 30.3 4.3 4.1 11.5
## Fiscal.Health 0.5 12.3 16.1 64.9 0.0
## Business.Freedom 11.0 1.4 10.3 0.5 0.8
## Labor.Freedom 4.7 12.6 15.2 1.0 31.5
## Monetary.Freedom 7.4 0.3 27.3 4.1 0.6
## Trade.Freedom 10.4 0.8 0.0 0.1 23.2
## Investment.Freedom. 9.8 0.1 14.7 10.8 1.1
## Financial.Freedom 11.9 0.3 3.1 6.2 2.2
We can see how each variable contributes to each component.
We filter out the principal variables in the first component.
We could call this first ‘latent variable’ ‘Free Market’. On the x-axis of the map (above), we can split the observations between:
We filter out the principal variables in the second component.
We could call this second ‘latent variable’ ‘State Size’. Higher taxes and larger government expenditures mean a larger State. On the y-axis of the map (above), we can split the observations between:
We print the head of the ‘Countries-Dimensions’ matrix.
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## Afghanistan 1.7 0.7 0.5 1.9 0.1
## Albania 0.1 0.2 0.1 1.2 2.2
## Algeria 0.6 0.5 1.2 0.4 0.0
## Angola 1.1 0.0 0.0 0.1 0.4
## Argentina 0.4 1.3 0.5 0.2 0.1
## Armenia 0.1 0.9 0.0 0.1 0.0
We print the contributions of Sweden and Poland (row names 'Sweden' and 'Poland') to the first two components:
liberty_pca$ind$contrib['Sweden', c(1,2)]
liberty_pca$ind$contrib['Poland', c(1,2)]
##     Dim.1     Dim.2
## 2.556548 2.439401
## Dim.1 Dim.2
## 0.46178004 0.01302201
Sweden has much more impact on the first two components.
We print Sweden and Poland’s (x,y) coordinates.
liberty_pca$ind$coord['Sweden', c(1,2)]
liberty_pca$ind$coord['Poland', c(1,2)]
##     Dim.1     Dim.2
## 4.781568 -2.359848
## Dim.1 Dim.2
## 2.0321745 -0.1724176
We can display the coordinates on the plot.
Indeed, on the correspondence map, Sweden is further away from the origin than Poland.
Above all, we read the graphic with these ‘labels’ in mind:
Sweden is more interventionist and open than Poland (polarity), but both states stand in the same quadrant (proximity). Sweden stands apart from the averages (centrality).
The correlation circle
We plot the circle.
On the radar, the 12 Freedom metrics appear in red and the supplementary variables appear in blue. The latter are the quantitative supplementary variables quanti.sup in:
PCA(liberty_base[ ,1:17], quanti.sup = c(2,3,4,17), quali.sup = c(1), graph = FALSE)
Supplementary variables, qualitative or quantitative, do not influence the correspondence analysis. They simply enrich the maps and circles.
Let us interpret the results with these findings in mind:
Arrows point or lean in the direction of the dimension they are positively correlated with.
When two arrows point in the same direction, they are positively correlated. When two arrows point in the opposite direction, they are negatively correlated. When the arrows are perpendicular, they are uncorrelated.
The longer the arrow, the stronger the correlation with the component.
We print the results related to the above plot; for the first component (x-axis).
The high correlation coefficients are consistent with the long arrows.
Print the results related to the above plot; for the second component (y-axis).
## correlation p.value
## Tax.Burden 0.81 0
## Gov.t.Spending 0.69 0
## Labor.Freedom 0.44 0
## Fiscal.Health 0.44 0
## X2017.Score 0.37 0
## World.Rank -0.32 0
## Region.Rank -0.43 0
Squared cosines are the squared correlations, akin to the coefficient of determination (\(R^2\)) in a regression. They let us read the variable-component correlations.
We print a table of squared cosines, adding a column total and a row average.
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 TOTAL
## Property.Rights 0.9 0.0 0.0 0.0 0.0 0.9
## Judical.Effectiveness 0.7 0.0 0.0 0.0 0.0 0.8
## Government.Integrity 0.8 0.0 0.0 0.0 0.0 0.9
## Tax.Burden 0.1 0.6 0.0 0.0 0.2 0.9
## Gov.t.Spending 0.2 0.5 0.0 0.0 0.1 0.8
## Fiscal.Health 0.0 0.2 0.2 0.6 0.0 1.0
## Business.Freedom 0.7 0.0 0.1 0.0 0.0 0.8
## Labor.Freedom 0.3 0.2 0.1 0.0 0.2 0.8
## Monetary.Freedom 0.5 0.0 0.3 0.0 0.0 0.8
## Trade.Freedom 0.6 0.0 0.0 0.0 0.1 0.8
## Investment.Freedom. 0.6 0.0 0.1 0.1 0.0 0.8
## Financial.Freedom 0.7 0.0 0.0 0.1 0.0 0.8
## AVERAGE 0.5 0.1 0.1 0.1 0.1 0.8
We can plot partial results.
We read the results with these ‘labels’ in mind:
We run a regression between the first two components and the supplementary variable HDI_2014. HDI = Human Development Index.
lm_pca <- lm(liberty_base$HDI_2014 ~ liberty_pca$ind$coord[, 1] + liberty_pca$ind$coord[, 2])
summary(lm_pca)
##
## Call:
## lm(formula = liberty_base$HDI_2014 ~ liberty_pca$ind$coord[,
## 1] + liberty_pca$ind$coord[, 2])
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.272 -5.623 1.044 5.477 38.827
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 69.1274 0.8817 78.403 <2e-16 ***
## liberty_pca$ind$coord[, 1] 5.0033 0.3562 14.045 <2e-16 ***
## liberty_pca$ind$coord[, 2] -0.8711 0.7051 -1.235 0.219
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.65 on 143 degrees of freedom
## Multiple R-squared: 0.5816, Adjusted R-squared: 0.5757
## F-statistic: 99.39 on 2 and 143 DF, p-value: < 2.2e-16
The first coefficient is positive: there is a positive relationship between ‘Free Market’ and the Human Development Index (HDI).
The second coefficient shows a negative relationship between ‘State Size’ and HDI. In other words, ‘interventionist states’ are related to a high HDI. Note, however, that this coefficient is not statistically significant (p = 0.22).
The \(R^2\) is close to 60 %: 60 % of the HDI variability is explainable with the two latent variables.
lm_predict <- predict(lm_pca)  # fitted values on the principal coordinates

Mmmmh… What about the top ranks?
Data points under the line are optimistic predictions vis-à-vis the actuals. Sweden, for example, comes out better than its actual HDI. Nonetheless, the data points are compact around the line, because 60 % of the HDI variability is explained by the principal components.
And the bottom ranks?
This time, we are going to work with contingency tables, also named cross tables. Along 2 dimensions, we tally unique observations. Tables can be symmetrical (n x n) or rectangular (n x p).
Simple correspondence analysis is useful for political surveys, consumer studies or population census (on humans or animals). However, we are going to work with a simple example.
Hair color and eye color
The data is a classic 4x5 contingency table analyzed by Ronald Aylmer Fisher, an English statistician and biologist (the F-test and more). It contains data on the hair color and eye color of a sample of 5387 schoolchildren from Scotland.
## fair.hair red.hair medium.hair dark.hair black.hair
## blue eyes 326 38 241 110 3
## light eyes 688 116 584 188 4
## medium eyes 343 84 909 412 26
## dark eyes 98 48 403 681 85
These are categories including:
We compute the components with all the variables.
colors_ca <- CA(colors, graph = FALSE)

We print the Correspondence Analysis (CA) results (the colors_ca object).
## **Results of the Correspondence Analysis (CA)**
## The row variable has 4 categories; the column variable has 5 categories
## The chi square of independence between the two variables is equal to 1240.039 (p-value = 4.123993e-258 ).
## *The results are available in the following objects:
##
## name description
## 1 "$eig" "eigenvalues"
## 2 "$col" "results for the columns"
## 3 "$col$coord" "coord. for the columns"
## 4 "$col$cos2" "cos2 for the columns"
## 5 "$col$contrib" "contributions of the columns"
## 6 "$row" "results for the rows"
## 7 "$row$coord" "coord. for the rows"
## 8 "$row$cos2" "cos2 for the rows"
## 9 "$row$contrib" "contributions of the rows"
## 10 "$call" "summary called parameters"
## 11 "$call$marge.col" "weights of the columns"
## 12 "$call$marge.row" "weights of the rows"
In correspondence analysis, it is important to be able to visualize distances between different row profiles or between different column profiles.
We use the \(\chi^2\) metric as a measure of distance. The \(\chi^2\) is 1240.039 (p-value = 4.123993e-258); we can reject the hypothesis of independence of row and column.
In other words, the rows and columns are linked together; there are relationships (clusters, patterns, trends, etc.).
We can code the \(\chi^2\) independence test.
We transform the data frame into a matrix.
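The conversion itself is as.matrix(colors); for a self-contained check, the table can also be rebuilt from the values printed below:

```r
# rebuild the contingency table as a matrix
colors2 <- matrix(c(326,  38, 241, 110,  3,
                    688, 116, 584, 188,  4,
                    343,  84, 909, 412, 26,
                     98,  48, 403, 681, 85),
                  nrow = 4, byrow = TRUE,
                  dimnames = list(
                    c("blue eyes", "light eyes", "medium eyes", "dark eyes"),
                    c("fair.hair", "red.hair", "medium.hair",
                      "dark.hair", "black.hair")))
sum(colors2)   # 5387 schoolchildren
```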
## fair.hair red.hair medium.hair dark.hair black.hair
## blue eyes 326 38 241 110 3
## light eyes 688 116 584 188 4
## medium eyes 343 84 909 412 26
## dark eyes 98 48 403 681 85
We perform the \(\chi^2\) test between the two variables. In a contingency table, between the row and column.
chisq.test(colors2)
##
## Pearson's Chi-squared test
##
## data: colors2
## X-squared = 1240, df = 12, p-value < 2.2e-16
The result is almost the same as above where the \(\chi^2\) is 1240.039 (p-value = 4.123993e-258); we can reject the hypothesis of independence of row and column.
We print and plot the eigenvalues.
## eigenvalue percentage of variance cumulative percentage of variance
## dim 1 0.2 86.6 86.6
## dim 2 0.0 13.1 99.6
## dim 3 0.0 0.4 100.0
Almost 100 % of the variability comes from the first two dimensions. Component 1 is, as always, the most important ‘latent variable’; here it represents over 86 % of the total variability.
The correspondence map
We plot the 2D chart.
We read the results with three aspects in mind:
In addition:
Conclusion
At first view, the map exhibits a U-shaped hair-color/eye-color pattern. There are no clear averages or extremes.
The first dimension, horizontal, displays a gradation along a fair-red-medium-dark hair scale and the light-blue-medium-dark eyes scale.
The second dimension, vertical, displays the difference between medium-color hair and eyes and the other hair and eyes colors.
There are three clusters:
Additional results
We can extract more statistics.
## Dim 1 Dim 2 Dim 3
## fair.hair -0.54399533 0.17384449 -0.012522082
## red.hair -0.23326097 0.04827895 0.118054940
## medium.hair -0.04202412 -0.20830421 -0.003236468
## dark.hair 0.58870853 0.10395044 -0.010116315
## black.hair 1.09438828 0.28643670 0.046135954
## Dim 1 Dim 2 Dim 3
## blue eyes -0.40029985 0.16541100 -0.064157519
## light eyes -0.44070764 0.08846303 0.031773257
## medium eyes 0.03361434 -0.24500190 -0.005552885
## dark eyes 0.70273880 0.13391383 0.004345377
Visually.
## $`Dim 1`
## $`Dim 1`$row
## coord
## light eyes -0.44070764
## blue eyes -0.40029985
## medium eyes 0.03361434
## dark eyes 0.70273880
##
## $`Dim 1`$col
## coord
## fair.hair -0.54399533
## red.hair -0.23326097
## medium.hair -0.04202412
## dark.hair 0.58870853
## black.hair 1.09438828
##
##
## $`Dim 2`
## $`Dim 2`$row
## coord
## medium eyes -0.24500190
## light eyes 0.08846303
## dark eyes 0.13391383
## blue eyes 0.16541100
##
## $`Dim 2`$col
## coord
## medium.hair -0.20830421
## red.hair 0.04827895
## dark.hair 0.10395044
## fair.hair 0.17384449
## black.hair 0.28643670
More statistics
We plot the hair colors, and the hair colors by eye color.
We can compute the overall proportion of each entry in the matrix.
We want to run an exact binomial test. The test measures whether a sample proportion is consistent with a hypothesized population proportion.
Beforehand, we need to reshape the dataset from ‘wide’ to ‘long’ to compute some of the necessary statistics.
library(reshape2)
colors3 <- cbind(eyes = row.names(colors), colors)
# number of medium eyes-medium hair
colors3 <- melt(colors3)
colors3
## eyes variable value
## 1 blue eyes fair.hair 326
## 2 light eyes fair.hair 688
## 3 medium eyes fair.hair 343
## 4 dark eyes fair.hair 98
## 5 blue eyes red.hair 38
## 6 light eyes red.hair 116
## 7 medium eyes red.hair 84
## 8 dark eyes red.hair 48
## 9 blue eyes medium.hair 241
## 10 light eyes medium.hair 584
## 11 medium eyes medium.hair 909
## 12 dark eyes medium.hair 403
## 13 blue eyes dark.hair 110
## 14 light eyes dark.hair 188
## 15 medium eyes dark.hair 412
## 16 dark eyes dark.hair 681
## 17 blue eyes black.hair 3
## 18 light eyes black.hair 4
## 19 medium eyes black.hair 26
## 20 dark eyes black.hair 85
# count of medium eyes-medium hair
colors3[11,3]
## [1] 909
# total
sum(colors3[,3])
## [1] 5387
# proportion of medium eyes-medium hair
pr <- colors3[11,3] / sum(colors3[,3])
round(pr * 100, 2)
## [1] 16.87
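As an aside, the wide-to-long reshaping does not require reshape2; for a contingency matrix, base R's as.data.frame(as.table(...)) produces the same long format. A self-contained sketch on a subset of the table (values copied from above):

```r
# Subset of the hair/eye table, reshaped to long format with base R only.
m <- matrix(c(326,  38, 241,
              688, 116, 584),
            nrow = 2, byrow = TRUE,
            dimnames = list(eyes = c("blue eyes", "light eyes"),
                            hair = c("fair.hair", "red.hair", "medium.hair")))
long <- as.data.frame(as.table(m))
long  # columns: eyes, hair, Freq
```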
There are 909 medium eyes-medium hair individuals in the sample (n = 5387).
The sample thus contains 16.87 % of medium eyes-medium hair individuals.
We want to test whether medium eyes-medium hair accounts for more than 16 % of the population.
nber <- round(pr * sum(colors3[,3]), 0)
# One-sided test (greater or less)
binom.test(nber, n = 5387, p = 0.16, alternative = 'greater', conf.level = 0.95)
##
## Exact binomial test
##
## data: nber and 5387
## number of successes = 909, number of trials = 5387, p-value =
## 0.04245
## alternative hypothesis: true probability of success is greater than 0.16
## 95 percent confidence interval:
## 0.1603875 1.0000000
## sample estimates:
## probability of success
## 0.1687396
# Two-sided test
binom.test(nber, n = 5387, p = 0.16, alternative = 'two.sided', conf.level = 0.95)
##
## Exact binomial test
##
## data: nber and 5387
## number of successes = 909, number of trials = 5387, p-value =
## 0.08069
## alternative hypothesis: true probability of success is not equal to 0.16
## 95 percent confidence interval:
## 0.1588270 0.1790123
## sample estimates:
## probability of success
## 0.1687396
The one-sided test rejects the null hypothesis at the 5 % level in favor of the alternative: medium eyes-medium hair represents more than 16 % of the population. The two-sided test is not as significant (significant only at the 10 % level), but points to the same conclusion. However, this test is purely illustrative. The question is whether medium eyes-medium hair constitutes more than 16 % of the population, not whether it makes up exactly 16 % of the population. We could have tested left-handedness or other traits to see if the sample is faithful to the population.
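The one-sided p-value can be verified directly: for the exact binomial test with alternative 'greater', it is the probability of observing 909 or more successes out of 5387 when the true proportion is 0.16. A quick check:

```r
# P(X >= 909) for X ~ Binomial(5387, 0.16)
p_one_sided <- pbinom(908, size = 5387, prob = 0.16, lower.tail = FALSE)
round(p_one_sided, 5)
```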
We compute, for each row, the proportions across columns (each row sums to 100 %).
## fair.hair red.hair medium.hair dark.hair black.hair Total
## blue eyes 45.4 5.3 33.6 15.3 0.4 100.0
## light eyes 43.5 7.3 37.0 11.9 0.3 100.0
## medium eyes 19.3 4.7 51.2 23.2 1.5 99.9
## dark eyes 7.5 3.7 30.6 51.8 6.5 100.1
We compute, for each column, the proportions across rows (each column sums to 100 %).
## fair.hair red.hair medium.hair dark.hair black.hair
## blue eyes 22.4 13.3 11.3 7.9 2.5
## light eyes 47.3 40.6 27.3 13.5 3.4
## medium eyes 23.6 29.4 42.5 29.6 22.0
## dark eyes 6.7 16.8 18.9 49.0 72.0
## Total 100.0 100.1 100.0 100.0 99.9
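Both proportion tables come from the same base-R idiom, prop.table, with margin = 1 for row proportions and margin = 2 for column proportions. A sketch reproducing the first row of the row-proportion table (the matrix is rebuilt here so the sketch is self-contained):

```r
colors2 <- matrix(c(326,  38, 241, 110,  3,
                    688, 116, 584, 188,  4,
                    343,  84, 909, 412, 26,
                     98,  48, 403, 681, 85),
                  nrow = 4, byrow = TRUE,
                  dimnames = list(c("blue eyes", "light eyes", "medium eyes", "dark eyes"),
                                  c("fair.hair", "red.hair", "medium.hair", "dark.hair", "black.hair")))
row_pct <- round(100 * prop.table(colors2, margin = 1), 1)  # rows sum to ~100
col_pct <- round(100 * prop.table(colors2, margin = 2), 1)  # columns sum to ~100
row_pct["blue eyes", ]
```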
We want to run a proportion test. This time, the test measures whether the proportions in several groups are similar. In other words, we test whether the proportion of each hair color is the same for every eye color. We test each row (each eye color, from 1 to 4).
prop.test(colors2[1,], n = colors_tc, p = NULL, alternative = 'two.sided', conf.level = 0.95, correct = TRUE)
##
## 5-sample test for equality of proportions without continuity
## correction
##
## data: colors2[1, ] out of colors_tc
## X-squared = 158.82, df = 4, p-value < 2.2e-16
## alternative hypothesis: two.sided
## sample estimates:
## prop 1 prop 2 prop 3 prop 4 prop 5
## 0.22405498 0.13286713 0.11277492 0.07907980 0.02542373
prop.test(colors2[2,], n = colors_tc, p = NULL, alternative = 'two.sided', conf.level = 0.95, correct = TRUE)
##
## 5-sample test for equality of proportions without continuity
## correction
##
## data: colors2[2, ] out of colors_tc
## X-squared = 453.99, df = 4, p-value < 2.2e-16
## alternative hypothesis: two.sided
## sample estimates:
## prop 1 prop 2 prop 3 prop 4 prop 5
## 0.47285223 0.40559441 0.27328030 0.13515457 0.03389831
prop.test(colors2[3,], n = colors_tc, p = NULL, alternative = 'two.sided', conf.level = 0.95, correct = TRUE)
##
## 5-sample test for equality of proportions without continuity
## correction
##
## data: colors2[3, ] out of colors_tc
## X-squared = 161.84, df = 4, p-value < 2.2e-16
## alternative hypothesis: two.sided
## sample estimates:
## prop 1 prop 2 prop 3 prop 4 prop 5
## 0.2357388 0.2937063 0.4253627 0.2961898 0.2203390
prop.test(colors2[4,], n = colors_tc, p = NULL, alternative = 'two.sided', conf.level = 0.95, correct = TRUE)
##
## 5-sample test for equality of proportions without continuity
## correction
##
## data: colors2[4, ] out of colors_tc
## X-squared = 890.35, df = 4, p-value < 2.2e-16
## alternative hypothesis: two.sided
## sample estimates:
## prop 1 prop 2 prop 3 prop 4 prop 5
## 0.06735395 0.16783217 0.18858212 0.48957584 0.72033898
Each row differs significantly from the total row; that is, the hair-color proportions within each eye color differ significantly from the overall hair-color proportions.
Not only are comparison tests used in research, but also in business, marketing, and web analytics. They are known as ‘A/B tests’.
Depending on the situation, A/B tests can be:
A and B can be two web pages or B can be an alternative version of web page A. Other situations:
Two-way contingency tables consist of square tables where individuals, or other kinds of observations, are naturally paired:
We can measure evolution, mobility, progression, etc.
The data is a classic 14x14 contingency table that was analyzed by Karl Pearson, an English mathematician and biostatistician (of the Pearson correlation, among other contributions). It is a sample of 775 males in 1904 England. We want to study mobility by occupational category from one generation to the next (father to son):
## a b c d e f g h i j k l m n
## A 28 0 4 0 0 0 1 3 3 0 3 1 5 2
## B 2 51 1 1 2 0 0 1 2 0 0 0 1 1
## C 6 5 7 0 9 1 3 6 4 2 1 1 2 7
## D 0 12 0 6 5 0 0 1 7 1 2 0 0 10
## E 5 5 2 1 54 0 0 6 9 4 12 3 1 13
## F 0 2 3 0 3 0 0 1 4 1 4 2 1 5
## G 17 1 4 0 14 0 6 11 4 1 3 3 17 7
## H 3 5 6 0 6 0 2 18 13 1 1 1 8 5
## I 0 1 1 0 4 0 0 1 4 0 2 1 1 4
## J 12 16 4 1 15 0 0 5 13 11 6 1 7 15
## K 0 4 2 0 1 0 0 0 3 0 20 0 5 6
## L 1 3 1 0 0 0 1 0 1 1 1 6 2 1
## M 5 0 2 0 3 0 1 8 1 2 2 3 23 1
## N 5 3 0 2 6 0 1 3 1 0 0 1 1 9
We add totals.
## a b c d e f g h i j k l m n Total
## A 28 0 4 0 0 0 1 3 3 0 3 1 5 2 50
## B 2 51 1 1 2 0 0 1 2 0 0 0 1 1 62
## C 6 5 7 0 9 1 3 6 4 2 1 1 2 7 54
## D 0 12 0 6 5 0 0 1 7 1 2 0 0 10 44
## E 5 5 2 1 54 0 0 6 9 4 12 3 1 13 115
## F 0 2 3 0 3 0 0 1 4 1 4 2 1 5 26
## G 17 1 4 0 14 0 6 11 4 1 3 3 17 7 88
## H 3 5 6 0 6 0 2 18 13 1 1 1 8 5 69
## I 0 1 1 0 4 0 0 1 4 0 2 1 1 4 19
## J 12 16 4 1 15 0 0 5 13 11 6 1 7 15 106
## K 0 4 2 0 1 0 0 0 3 0 20 0 5 6 41
## L 1 3 1 0 0 0 1 0 1 1 1 6 2 1 18
## M 5 0 2 0 3 0 1 8 1 2 2 3 23 1 51
## N 5 3 0 2 6 0 1 3 1 0 0 1 1 9 32
## Total 84 108 37 11 122 1 15 64 69 24 57 23 74 86 775
Occupations:
We compute the components with all the variables and print the results.
fathers_ca <- CA(fathers, graph = FALSE)
## **Results of the Correspondence Analysis (CA)**
## The row variable has 14 categories; the column variable has 14 categories
## The chi square of independence between the two variables is equal to 1005.454 (p-value = 5.633858e-119 ).
## *The results are available in the following objects:
##
## name description
## 1 "$eig" "eigenvalues"
## 2 "$col" "results for the columns"
## 3 "$col$coord" "coord. for the columns"
## 4 "$col$cos2" "cos2 for the columns"
## 5 "$col$contrib" "contributions of the columns"
## 6 "$row" "results for the rows"
## 7 "$row$coord" "coord. for the rows"
## 8 "$row$cos2" "cos2 for the rows"
## 9 "$row$contrib" "contributions of the rows"
## 10 "$call" "summary called parameters"
## 11 "$call$marge.col" "weights of the columns"
## 12 "$call$marge.row" "weights of the rows"
The \(\chi^2\) is 1005.454 (p-value = 5.633858e-119); we can reject the hypothesis of independence of row and column. In other words, the rows and columns are linked together; there are relationships (clusters, patterns, trends, etc.).
We print and plot the eigenvalues.
## eigenvalue percentage of variance cumulative percentage of variance
## dim 1 0.4 31.6 31.6
## dim 2 0.3 19.3 51.0
## dim 3 0.2 12.6 63.6
## dim 4 0.1 10.4 74.0
## dim 5 0.1 7.0 81.0
## dim 6 0.1 6.5 87.5
## dim 7 0.1 5.3 92.8
## dim 8 0.0 3.3 96.1
## dim 9 0.0 2.1 98.3
## dim 10 0.0 0.8 99.1
## dim 11 0.0 0.6 99.6
## dim 12 0.0 0.3 99.9
## dim 13 0.0 0.1 100.0
Component 1 is always the most important; here it represents 32 % of total variability. The cumulative variance of the first two components is 51 %. In this case, lower components still contain significant information, but we will not explore them here.
The correspondence map
Plot the 2D chart.
We read the results with three aspects in mind:
Conclusion
On the right of the map, there is a steady progression in occupation from A to E (sweeping the upper-right and lower-right quadrants).
Two occupations, B and D, representing arts and crafts, stand out from the rest.
Points that are close together on the map reflect the fact that there is a lot of movement from father to son (sons opt for different occupations).
High mobility: sons opt for different occupations than their father’s (C).
On the other hand, points that are far apart from each other indicate relatively little movement.
Low mobility: sons opt for their father’s occupation (B).
B-b really stands apart in terms of conservatism: 82 % of sons chose their father’s occupation.
Overall.
Percentage of sons who pursue their father’s occupation by occupation.
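The persistence figure quoted for B-b can be checked from the table: it is the diagonal cell over the row total. A sketch for row B (values copied from the table above):

```r
# Row B of the mobility table: sons' occupations for fathers in occupation B.
sons_of_B <- c(a = 2, b = 51, c = 1, d = 1, e = 2, f = 0, g = 0,
               h = 1, i = 2, j = 0, k = 0, l = 0, m = 1, n = 1)
persistence_B <- sons_of_B["b"] / sum(sons_of_B)  # share of sons staying in b
round(100 * persistence_B)
```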
Clusters suggest that occupational mobility from father to son is confined to movements within the various clusters only (and not between clusters):
This time, the contingency table comprises qualitative variables: categorical data, nominal or ordinal factors.
MCA is useful to map out the results of interviews or Likert-scale results.
Consumption credit
The data is about 66 bank clients who contracted consumption credit. There are 11 explanatory variables.
## 'data.frame': 66 obs. of 11 variables:
## $ Market : Factor w/ 5 levels "Car","Furniture",..: 4 4 1 4 5 4 4 4 2 4 ...
## $ Cash : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 1 2 2 2 2 ...
## $ Unpaid : Factor w/ 3 levels "Unp_0","Unp_1_or_2",..: 1 1 1 1 3 1 1 1 1 1 ...
## $ Insurance : Factor w/ 4 levels "A","A_Benefits",..: 1 3 1 4 1 4 4 3 3 4 ...
## $ Debt : Factor w/ 4 levels "Debt1","Debt2",..: 1 2 3 2 4 3 2 2 1 2 ...
## $ Status : Factor w/ 5 levels "Bachelor","Cohab",..: 2 4 4 5 2 4 5 5 2 4 ...
## $ Dependents: Factor w/ 5 levels "Child_0","Child_1",..: 1 1 2 1 1 1 1 1 1 1 ...
## $ Housing : Factor w/ 5 levels "Employer_housed",..: 2 4 3 4 1 4 4 4 1 4 ...
## $ Occupation: Factor w/ 6 levels "Executive","Mid_management",..: 6 5 5 4 2 4 4 4 6 4 ...
## $ Title : Factor w/ 3 levels "Miss","Mr","Ms": 2 2 2 2 2 2 3 3 2 2 ...
## $ Age : Factor w/ 5 levels "20","30","40",..: 4 3 2 5 1 5 5 5 3 5 ...
We display the statistics.
For the labels too small to read on the chart, we list the factor levels.
levels(credit$Market)
## [1] "Car" "Furniture" "Moto" "Renovation" "Scooter"
levels(credit$Status)
## [1] "Bachelor" "Cohab" "Divorced" "Married" "Widowed"
levels(credit$Housing)
## [1] "Employer_housed" "Family_housed" "New_owner" "Owner"
## [5] "Tenant"
levels(credit$Occupation)
## [1] "Executive" "Mid_management" "Mid_mangement"
## [4] "Retired" "Skilled_worker" "Unskilled_worker"
We compute the components with the first 5 variables.
With level.ventil=0.05, for example, levels accounting for 5 % or less of the total would automatically be reattributed to the other levels (ventilated according to the other levels’ weights).
credit_mca <- MCA(credit, quali.sup = 6:11, level.ventil = 0, graph = FALSE)
We plot the results.
barplot(credit_mca$eig[1:5,2], names.arg = rownames(credit_mca$eig[1:5,]))
The first two components represent only 30 % of total variability – or total inertia as it is often called.
In the case of an MCA, such a low value is typical, though.
The correspondence maps
A map of all individuals.
A map without numbers, with variables.
Adding more colors.
By Age (the 11th variable).
The complete map.
On the horizontal axis, we have younger profiles on the right (keywords: scooter, new owner, tenant, housed in family, high debt) and older profiles on the left (keywords: senior, retired, widowed, owner, low debt).
On the vertical axis, we have people with financial difficulties at the bottom. These people are heavily in debt and unemployed.
We can isolate some clusters:
Let us have a look at the contribution, credit_mca$var$contrib, of each variable to the first two components. The individual results can be extracted with credit_mca$ind.
## Dim 1 Dim 2
## Car 6.2978074 2.300173e+00
## Furniture 1.7343275 9.518153e+00
## Moto 0.7442569 8.329618e-01
## Renovation 12.0394243 8.135005e+00
## Scooter 10.4133134 3.419319e+00
## No 6.4918889 3.404219e+00
## Yes 5.7499587 3.015165e+00
## Unp_0 1.1943032 1.522427e+00
## Unp_1_or_2 2.3318481 9.803501e+00
## Unp_3+ 0.4772412 1.775006e+01
## A 7.3305180 1.877022e+00
## A_Benefits 0.7203020 1.970557e+01
## None 3.4769365 3.451857e-04
## Senior 13.6314575 7.126168e+00
## Debt1 0.4896683 4.747586e+00
## Debt2 14.1983793 1.029576e-02
## Debt3 1.2789896 6.648669e+00
## Debt4 11.3993792 1.833538e-01
Let us plot the variable-component correlations.
plot(credit_mca, choix = 'var')
Market has the most influence on the first component; Insurance and Unpaid debts, on the second.
We can build other maps with components 3 and 4.
And so on…
Here, we come back to Factor Analysis. We touch base with our Olympic dataset and the factanal function.
## 'data.frame': 54 obs. of 8 variables:
## $ X100m : num 0.0594 -1.2962 -0.3021 -0.3473 0.2402 ...
## $ X200m : num -0.313 -0.878 -0.167 -0.641 -0.44 ...
## $ X400m : num 0.2439 -1.0072 -0.0202 -0.5624 -0.3955 ...
## $ X800m : num 0.0353 -0.5366 0.0353 -0.7273 0.4166 ...
## $ X1500m : num 0.176 -0.813 -0.483 -0.549 0.307 ...
## $ X5000m : num -0.378 -0.904 -0.47 -1.035 1.344 ...
## $ X10000m : num -0.527 -0.599 -0.485 -0.992 1.164 ...
## $ marathon: num -0.437 -0.667 -0.141 -0.701 1.44 ...
The data are numerical. With both numerical and categorical variables, we can perform a Multiple Factor Analysis (MFA) with the FactoMineR package. There are other possibilities we do not address:
In addition, we can perform Hierarchical Clustering on Principal Components (HCPC).
Multiple Factor Analysis
This first example uses a dataset included in the package. From the dataset, we build an object as we did with PCA and CA.
data(wine)
wine_mfa <- MFA(wine, group = c(2,5,3,10,9,2), type = c('n', rep('s', 5)), ncp = 5, name.group = c('orig', 'olf', 'vis', 'olfag', 'gust', 'ens'), num.group.sup = c(1, 6))
We plot the eigenvalues…
… and confidence ellipses around the categories; they are available in PCA, CA, and MFA.
We can also plot interactive graphs…
LDA is closely related to ANOVA and regression analysis with a qualitative dependent variable. We want to explain (and predict) a qualitative dependent variable, Y, with quantitative and qualitative variables, X, or explanatory variables.
The first use of LDA is classification, pattern recognition, and machine learning. Y may have two or more levels. We want to compute, for each observation, the probability of belonging to each group. Behind the scenes, the decision rule is based on Bayes’ theorem.
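To make the Bayes step concrete, here is a univariate sketch with made-up numbers (a single predictor, normal class densities with means loosely echoing the age group means below, and an assumed common standard deviation of 10): the posterior for each class is prior times likelihood, renormalized.

```r
# Hypothetical one-variable LDA setting: age only, common sd = 10 (assumed).
prior <- c(N = 0.65, Y = 0.35)
mu    <- c(N = 50, Y = 56)
x     <- 60  # a new individual aged 60
lik   <- dnorm(x, mean = mu, sd = 10)    # class-conditional densities
post  <- prior * lik / sum(prior * lik)  # Bayes' theorem
round(post, 3)
```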
The other use of LDA is dimension reduction. As with PCA, we create latent variables called ‘discriminant canonical variables’ that are linear combinations of features that characterize or separate two or more classes of objects or events.
A first example
We illustrate the method with a dataset about snoring (whether or not the individual snores). The explanatory variables are age, weight in kg, height in cm, alcohol consumption (portions per day), sex, and smoke (whether the individual is a smoker).
## 'data.frame': 100 obs. of 7 variables:
## $ age : num 47 56 46 70 51 46 40 46 49 39 ...
## $ weight : num 71 58 116 96 91 98 112 77 76 119 ...
## $ height : num 158 164 208 186 195 188 193 165 164 196 ...
## $ alcohol: num 0 7 3 3 2 0 5 0 0 3 ...
## $ sex : Factor w/ 2 levels "F","M": 2 2 2 2 2 1 2 1 2 2 ...
## $ snore : Factor w/ 2 levels "N","Y": 1 2 1 1 2 1 1 1 1 1 ...
## $ smoke : Factor w/ 2 levels "N","Y": 2 1 2 2 2 1 2 1 1 1 ...
## age weight height alcohol sex
## Min. :23.00 Min. : 42.00 Min. :158.0 Min. : 0.00 F:25
## 1st Qu.:43.00 1st Qu.: 77.00 1st Qu.:166.0 1st Qu.: 0.00 M:75
## Median :52.00 Median : 95.00 Median :186.0 Median : 2.00
## Mean :52.27 Mean : 90.41 Mean :181.1 Mean : 2.95
## 3rd Qu.:62.25 3rd Qu.:107.00 3rd Qu.:194.0 3rd Qu.: 4.25
## Max. :74.00 Max. :120.00 Max. :208.0 Max. :15.00
## snore smoke
## N:65 N:36
## Y:35 Y:64
##
##
##
##
We build a first model.
library(MASS)
# prior
# age, weight, height, alcohol, sex, smoke
snore_lda_model <- lda(snore ~ ., data = snore)
snore_lda_model
## Call:
## lda(snore ~ ., data = snore)
##
## Prior probabilities of groups:
## N Y
## 0.65 0.35
##
## Group means:
## age weight height alcohol sexM smokeY
## N 50.26154 90.47692 180.9538 2.369231 0.6923077 0.6769231
## Y 56.00000 90.28571 181.3714 4.028571 0.8571429 0.5714286
##
## Coefficients of linear discriminants:
## LD1
## age 0.05973655
## weight -0.01620579
## height 0.01590170
## alcohol 0.24058822
## sexM 0.55413371
## smokeY -1.14621434
We print the discriminant functions based on centered (not standardized) variables.
No significance tests are produced.
In the results (Prior probabilities of groups), 35 % of the sample snores (Y for ‘yes’).
Next, we read the means of each subgroup. For example, the average age of snorers is 56, their average weight just over 90.2 kg, etc.
Finally, we have the discriminant canonical variable LD1 and its coefficients. These are similar to regression coefficients. For example, weight is negatively correlated with snoring, and smoke=Y has the most impact. We read that sex=M increases the discriminant score by 0.5541; each additional year of age increases it by 0.05974.
Since the weight and height coefficients are nearly zero, we remove these variables and run another model.
snore_lda_model1 <- lda(snore ~ age + alcohol + sex + smoke, data = snore)
snore_lda_model1
## Call:
## lda(snore ~ age + alcohol + sex + smoke, data = snore)
##
## Prior probabilities of groups:
## N Y
## 0.65 0.35
##
## Group means:
## age alcohol sexM smokeY
## N 50.26154 2.369231 0.6923077 0.6769231
## Y 56.00000 4.028571 0.8571429 0.5714286
##
## Coefficients of linear discriminants:
## LD1
## age 0.06062024
## alcohol 0.24060629
## sexM 0.54047702
## smokeY -1.13039555
It barely affects the coefficients.
We can set the prior probabilities with prior. Instead of ‘35 % of the sample snores’, we set equal proportions (50/50).
snore_lda_model2 <- lda(snore ~ ., data = snore, prior = c(0.50, 0.50))
snore_lda_model3 <- lda(snore ~ age + alcohol + sex + smoke, data = snore, prior = c(0.50, 0.50))
snore_lda_model2
## Call:
## lda(snore ~ ., data = snore, prior = c(0.5, 0.5))
##
## Prior probabilities of groups:
## N Y
## 0.5 0.5
##
## Group means:
## age weight height alcohol sexM smokeY
## N 50.26154 90.47692 180.9538 2.369231 0.6923077 0.6769231
## Y 56.00000 90.28571 181.3714 4.028571 0.8571429 0.5714286
##
## Coefficients of linear discriminants:
## LD1
## age 0.05973655
## weight -0.01620579
## height 0.01590170
## alcohol 0.24058822
## sexM 0.55413371
## smokeY -1.14621434
The coefficients did not change. However, the choice impacts how we assign new individuals to each group (from 65/35 to 50/50).
As with any classification model, we can compute the confusion matrix and the error rate.
Re-substitution (using the same data to derive the functions and evaluate their prediction accuracy) is the default method unless CV=TRUE is specified. Re-substitution will be overly optimistic.
We turn the CV parameter on (cross validation). CV=TRUE generates jackknifed (i.e., leave one out) predictions.
# snore_lda_model
snore_lda_pred <- lda(snore ~ ., data = snore, CV = TRUE)$class
# Confusion matrix
table(snore_lda_pred, snore$snore)
##
## snore_lda_pred N Y
## N 53 22
## Y 12 13
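The jackknifed predictions that CV = TRUE returns can be reproduced by explicitly refitting the model without each observation in turn. Since the snore data is not reproduced here, this is a sketch on the built-in iris data, with priors fixed so both routes use the same rule (predictions may still differ on a few borderline cases, hence the lenient check):

```r
library(MASS)

# Built-in jackknife (leave-one-out) predictions.
auto <- lda(Species ~ ., data = iris, prior = rep(1/3, 3), CV = TRUE)$class

# The same thing by hand: refit without observation i, predict observation i.
manual <- factor(rep(NA, nrow(iris)), levels = levels(iris$Species))
for (i in seq_len(nrow(iris))) {
  fit <- lda(Species ~ ., data = iris[-i, ], prior = rep(1/3, 3))
  manual[i] <- predict(fit, newdata = iris[i, ])$class
}
# Agreement between the built-in jackknife and the explicit loop.
mean(as.character(auto) == as.character(manual))
```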
We compute the error rate by comparing the predictions (snore_lda_pred) with the actuals (snore$snore).
## Prediction Actual Check
## 1 N N
## 2 Y Y
## 3 N N
## 4 Y N error
## 5 N Y error
## 6 N N
## 7 N N
## 8 N N
## 9 N N
## 10 N N
## 11 N N
## 12 N N
## 13 N N
## 14 N N
## 15 Y N error
## 16 N N
## 17 Y N error
## 18 Y N error
## 19 N N
## 20 N N
## 21 N N
## 22 Y N error
## 23 N N
## 24 N N
## 25 N N
## 26 N N
## 27 N N
## 28 N Y error
## 29 N N
## 30 N N
## 31 N N
## 32 N N
## 33 Y N error
## 34 N N
## 35 N N
## 36 N N
## 37 N N
## 38 Y N error
## 39 Y N error
## 40 N N
## 41 Y N error
## 42 N N
## 43 N N
## 44 Y N error
## 45 N N
## 46 N N
## 47 N N
## 48 N N
## 49 N N
## 50 N N
## 51 N N
## 52 N N
## 53 Y N error
## 54 N N
## 55 N N
## 56 N N
## 57 N N
## 58 N N
## 59 N N
## 60 N N
## 61 N N
## 62 N N
## 63 N Y error
## 64 Y Y
## 65 N Y error
## 66 N Y error
## 67 Y Y
## 68 N Y error
## 69 N Y error
## 70 N Y error
## 71 N Y error
## 72 Y Y
## 73 Y Y
## 74 N Y error
## 75 N Y error
## 76 N Y error
## 77 Y Y
## 78 N Y error
## 79 Y Y
## 80 Y Y
## 81 Y Y
## 82 Y Y
## 83 N Y error
## 84 Y Y
## 85 N Y error
## 86 Y Y
## 87 N Y error
## 88 N N
## 89 N Y error
## 90 N N
## 91 N Y error
## 92 Y N error
## 93 N N
## 94 N N
## 95 Y Y
## 96 N Y error
## 97 N Y error
## 98 N N
## 99 N Y error
## 100 N Y error
## [1] 0.34
A 34% error rate.
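The 34 % figure is just the off-diagonal share of the confusion matrix above; a quick check:

```r
# Confusion matrix from above: rows = predicted, columns = actual.
conf <- matrix(c(53, 22,
                 12, 13),
               nrow = 2, byrow = TRUE,
               dimnames = list(pred = c("N", "Y"), actual = c("N", "Y")))
error_rate <- 1 - sum(diag(conf)) / sum(conf)  # misclassified share
error_rate
```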
And for the other models…
snore_lda_model1
##
## snore_lda_pred1 N Y
## N 52 23
## Y 13 12
## [1] 0.36
snore_lda_model2
##
## snore_lda_pred2 N Y
## N 41 10
## Y 24 25
## [1] 0.34
snore_lda_model3
## [1] 0.32
The last model is the best (lowest error rate).
Now, we can use the classifying model to make predictions by supplying new data.
## age weight height alcohol sex smoke
## 1 42 55 169 0 F N
## 2 58 94 185 4 M Y
## 3 35 70 180 6 M Y
## 4 67 63 166 3 F N
Let us compute the results with the first model of all.
snore_lda_predict <- predict(snore_lda_model, newdata = n_snore)
First, we get the predictions.
## [1] N N N Y
## Levels: N Y
A comparison: actuals vs. predictions.
## age weight height alcohol sex smoke pred_smoke
## 1 42 55 169 0 F N N
## 2 58 94 185 4 M Y N
## 3 35 70 180 6 M Y N
## 4 67 63 166 3 F N Y
Second, we get the classification probabilities.
## N Y
## 1 0.7957279 0.2042721
## 2 0.6095636 0.3904364
## 3 0.7325932 0.2674068
## 4 0.3532721 0.6467279
Third, we get the score by individual.
## LD1
## 1 -0.6238163
## 2 0.3246422
## 3 -0.2586916
## 4 1.4140108
And we know from above that changing the prior probabilities will not affect the coefficients, though it does affect how new individuals are assigned to groups.
We can also remove missing values with na.action.
library(MASS)
snore_lda_model4 <- lda(snore ~ ., data = snore, na.action = 'na.omit', CV = TRUE)
We then perform an LDA using listwise deletion of missing data.
The quadratic discriminant function does not assume homogeneity of variance-covariance matrices.
library(MASS)
snore_qda_model <- qda(snore ~ ., data = snore, na.action = 'na.omit', CV = TRUE)
Let us compare the LDA and QDA results.
## Actual LDA Check Actual.1 QDA Check.1
## 1 N N N N
## 2 Y Y Y N error
## 3 N N N N
## 4 N Y error N Y error
## 5 Y N error Y N error
## 6 N N N N
## 7 N N N N
## 8 N N N N
## 9 N N N N
## 10 N N N N
## 11 N N N N
## 12 N N N N
## 13 N N N N
## 14 N N N N
## 15 N Y error N Y error
## 16 N N N N
## 17 N Y error N Y error
## 18 N Y error N Y error
## 19 N N N N
## 20 N N N N
## 21 N N N N
## 22 N Y error N Y error
## 23 N N N N
## 24 N N N N
## 25 N N N N
## 26 N N N N
## 27 N N N N
## 28 Y N error Y Y
## 29 N N N N
## 30 N N N N
## 31 N N N N
## 32 N N N N
## 33 N Y error N Y error
## 34 N N N N
## 35 N N N N
## 36 N N N N
## 37 N N N N
## 38 N Y error N Y error
## 39 N Y error N Y error
## 40 N N N Y error
## 41 N Y error N Y error
## 42 N N N N
## 43 N N N N
## 44 N Y error N Y error
## 45 N N N N
## 46 N N N Y error
## 47 N N N N
## 48 N N N N
## 49 N N N N
## 50 N N N N
## 51 N N N N
## 52 N N N N
## 53 N Y error N Y error
## 54 N N N N
## 55 N N N N
## 56 N N N N
## 57 N N N N
## 58 N N N N
## 59 N N N N
## 60 N N N Y error
## 61 N N N N
## 62 N N N N
## 63 Y N error Y Y
## 64 Y Y Y N error
## 65 Y N error Y N error
## 66 Y N error Y Y
## 67 Y Y Y N error
## 68 Y N error Y N error
## 69 Y N error Y Y
## 70 Y N error Y N error
## 71 Y N error Y N error
## 72 Y Y Y N error
## 73 Y Y Y Y
## 74 Y N error Y N error
## 75 Y N error Y N error
## 76 Y N error Y N error
## 77 Y Y Y Y
## 78 Y N error Y Y
## 79 Y Y Y Y
## 80 Y Y Y N error
## 81 Y Y Y N error
## 82 Y Y Y N error
## 83 Y N error Y N error
## 84 Y Y Y N error
## 85 Y N error Y N error
## 86 Y Y Y N error
## 87 Y N error Y N error
## 88 N N N N
## 89 Y N error Y N error
## 90 N N N N
## 91 Y N error Y Y
## 92 N Y error N Y error
## 93 N N N N
## 94 N N N N
## 95 Y Y Y Y
## 96 Y N error Y N error
## 97 Y N error Y N error
## 98 N N N N
## 99 Y N error Y N error
## 100 Y N error Y N error
Let us validate with the confusion matrix.
# Alternative code
conf <- table(snore$snore, snore_qda_model$class)
diag(prop.table(conf, 1))
## N Y
## 0.7692308 0.2857143
# Error rate
1 - sum(diag(prop.table(conf)))
## [1] 0.4
The QDA is not an improvement.
The klaR package can display the results of an LDA or QDA, two variables at a time.
LDA
QDA
When we abide by Occam’s razor, the law of parsimony, we want the simplest model that still performs well. Selecting the right features – the right variables – has a huge impact on training times and prediction accuracy.
Data can contain attributes that are highly correlated with each other. Many methods perform better if highly correlated attributes are removed.
The caret package provides tools to automatically report on the relevance and importance of attributes in the data and select the most important features.
We use the Pima Indians Diabetes dataset.
## 'data.frame': 768 obs. of 9 variables:
## $ pregnant: num 6 1 8 1 0 5 3 10 2 8 ...
## $ glucose : num 148 85 183 89 137 116 78 115 197 125 ...
## $ pressure: num 72 66 64 66 40 74 50 0 70 96 ...
## $ triceps : num 35 29 0 23 35 0 32 0 45 0 ...
## $ insulin : num 0 0 0 94 168 0 88 0 543 0 ...
## $ mass : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
## $ pedigree: num 0.627 0.351 0.672 0.167 2.288 ...
## $ age : num 50 31 32 21 33 30 26 29 53 54 ...
## $ diabetes: Factor w/ 2 levels "neg","pos": 2 1 2 1 2 1 2 1 2 2 ...
Removing redundant features
We build the correlation matrix.
correlationMatrix <- cor(PimaIndiansDiabetes[,1:8])
correlationMatrix
## pregnant glucose pressure triceps insulin
## pregnant 1.00000000 0.12945867 0.14128198 -0.08167177 -0.07353461
## glucose 0.12945867 1.00000000 0.15258959 0.05732789 0.33135711
## pressure 0.14128198 0.15258959 1.00000000 0.20737054 0.08893338
## triceps -0.08167177 0.05732789 0.20737054 1.00000000 0.43678257
## insulin -0.07353461 0.33135711 0.08893338 0.43678257 1.00000000
## mass 0.01768309 0.22107107 0.28180529 0.39257320 0.19785906
## pedigree -0.03352267 0.13733730 0.04126495 0.18392757 0.18507093
## age 0.54434123 0.26351432 0.23952795 -0.11397026 -0.04216295
## mass pedigree age
## pregnant 0.01768309 -0.03352267 0.54434123
## glucose 0.22107107 0.13733730 0.26351432
## pressure 0.28180529 0.04126495 0.23952795
## triceps 0.39257320 0.18392757 -0.11397026
## insulin 0.19785906 0.18507093 -0.04216295
## mass 1.00000000 0.14064695 0.03624187
## pedigree 0.14064695 1.00000000 0.03356131
## age 0.03624187 0.03356131 1.00000000
We look for attributes that are highly correlated (rule of thumb: absolute correlation above 0.75; here we use a stricter cutoff of 0.50). Keep in mind this rule of thumb: 0.75.
highlyCorrelated <- findCorrelation(correlationMatrix, cutoff = 0.5)
highlyCorrelated
## [1] 8
row.names(correlationMatrix)[highlyCorrelated]
## [1] "age"
The 8th attribute is age.
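The idea behind findCorrelation can be sketched in base R on synthetic data (simplified: this flags every variable involved in a high-correlation pair, whereas caret removes only one member of each pair):

```r
set.seed(1)
n <- 100
x <- data.frame(a = rnorm(n))
x$b <- x$a + rnorm(n, sd = 0.1)  # nearly a copy of a
x$c <- rnorm(n)                  # unrelated noise
cm <- abs(cor(x))
diag(cm) <- 0                    # ignore self-correlations
# Flag variables whose strongest correlation exceeds the rule of thumb.
flagged <- names(which(apply(cm, 2, max) > 0.75))
flagged
```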
Rank features by importance
Some methods, such as decision trees, have a built-in mechanism to report on variable importance. For other algorithms, the importance can be estimated using a confusion matrix and a ROC curve analysis conducted on each attribute.
The example below constructs a Learning Vector Quantization (LVQ) model.
set.seed(7)
# Prepare the training
control <- trainControl(method = 'repeatedcv', number = 10, repeats = 3)
# Train the model
model <- train(diabetes ~ ., data = PimaIndiansDiabetes, method = 'lvq', preProcess = 'scale', trControl = control)
We then estimate the variable importance.
importance <- varImp(model, scale = FALSE)
importance
## ROC curve variable importance
##
## Importance
## glucose 0.7881
## mass 0.6876
## age 0.6869
## pregnant 0.6195
## pedigree 0.6062
## pressure 0.5865
## triceps 0.5536
## insulin 0.5379
And plot it.
The results show that the glucose, mass and age attributes are the top 3 most important attributes in the dataset and the insulin attribute is the least important.
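For a two-class outcome, the ROC-based filter importance reported above is, conceptually, the area under each predictor's own ROC curve. The AUC equals the normalized Mann-Whitney U statistic, so it can be computed from ranks alone; a sketch with toy data (variable names merely echo the dataset):

```r
# AUC of a single numeric predictor against a binary outcome, via ranks.
auc <- function(x, y) {  # y: logical, TRUE = positive class
  r  <- rank(x)
  n1 <- sum(y)
  n0 <- sum(!y)
  (sum(r[y]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
glucose <- c(1, 2, 3, 4, 5, 6, 7, 8)  # toy values
diab    <- c(FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE)
auc(glucose, diab)  # close to 1: the predictor separates the classes well
```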
Feature selection
Automatic feature selection methods can be used to build many models with different subsets of a dataset. We then identify attributes that are required to build an accurate model.
A popular automatic method is the Recursive Feature Elimination (RFE). A Random Forest algorithm is used on each iteration to evaluate the model. The algorithm is configured to explore all possible subsets of the attributes.
library(randomForest)
set.seed(7)
# Define the control using a random forest selection function
control <- rfeControl(functions = rfFuncs, method = 'cv', number = 10)
# Run the RFE algorithm (it takes some time)
results <- rfe(PimaIndiansDiabetes[,1:8], PimaIndiansDiabetes[,9], sizes = c(1:8), rfeControl = control)
results
##
## Recursive feature selection
##
## Outer resampling method: Cross-Validated (10 fold)
##
## Resampling performance over subset size:
##
## Variables Accuracy Kappa AccuracySD KappaSD Selected
## 1 0.6873 0.2583 0.05704 0.11679
## 2 0.7265 0.3748 0.04812 0.09870
## 3 0.7356 0.4086 0.04534 0.10020
## 4 0.7565 0.4547 0.04107 0.09508
## 5 0.7539 0.4429 0.05277 0.12191
## 6 0.7473 0.4324 0.03894 0.08530
## 7 0.7499 0.4329 0.04013 0.09785
## 8 0.7616 0.4584 0.04917 0.11000 *
##
## The top 5 variables (out of 8):
## glucose, mass, age, pregnant, pedigree
The asterisk marks the selected subset size: all 8 variables. The accuracy is over 0.75; a reminder of the rule of thumb we underlined above. The results also list the top five variables.
What is this 8th variable?
## VarName Variables Accuracy Kappa AccuracySD KappaSD
## 1 pregnant 1 0.6873377 0.2582596 0.05704175 0.11678728
## 2 glucose 2 0.7264696 0.3748388 0.04812330 0.09869825
## 3 pressure 3 0.7356288 0.4086387 0.04534209 0.10020218
## 4 triceps 4 0.7565277 0.4547458 0.04106932 0.09507553
## 5 insulin 5 0.7539303 0.4429472 0.05277373 0.12190532
## 6 mass 6 0.7473172 0.4324254 0.03894122 0.08530466
## 7 pedigree 7 0.7499316 0.4329196 0.04012769 0.09784581
## 8 age 8 0.7615858 0.4583660 0.04916710 0.11000058
age, again. Let us plot the results.
All 8 attributes are selected in this example. glucose, mass, age, pregnant, and pedigree correspond to columns 2, 6, 8, 1, and 7 of the dataset. The choice has to do with accuracy and other factors. There is no need for variables that would introduce multicollinearity into the model.
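The elimination loop at the heart of RFE can be sketched without caret. This simplified version uses an assumed criterion – absolute correlation with the outcome instead of random-forest importance – and drops the weakest predictor at each step:

```r
set.seed(7)
n <- 500
x <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
y <- 2 * x$x1 + x$x2 + rnorm(n)  # x3 is pure noise

vars    <- names(x)
dropped <- character(0)
while (length(vars) > 1) {
  # Score each remaining predictor; simplest possible criterion here.
  scores  <- sapply(vars, function(v) abs(cor(x[[v]], y)))
  weakest <- names(which.min(scores))
  dropped <- c(dropped, weakest)   # record elimination order
  vars    <- setdiff(vars, weakest)
}
dropped  # weakest first
vars     # last survivor
```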
Conclusion
The variable importance (from the caret package) and the RFE results (based on the randomForest package) are somewhat contradictory.
There is likely no best set of features just like there is no best model.
We must make choices given these results. The results will not tell us which variable to pick.