Foreword
We want to represent the distances among the elements in a parsimonious way. In plain English, we want to reduce several independent variables to a 2D chart.
Classical MDS uses the fact that the coordinate matrix can be derived by eigenvalue decomposition (or singular value decomposition) of the double-centered distance matrix.
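As a sketch of that idea (shown on the built-in eurodist inter-city distances, since the Olympic data is only loaded below), we can recover the coordinates that cmdscale returns by eigendecomposition of the double-centered squared distance matrix:

```r
# Classical MDS by hand: double-center the squared distances,
# then scale the eigenvectors by the square roots of the eigenvalues.
D <- as.matrix(eurodist)                  # built-in inter-city distances
n <- nrow(D)
J <- diag(n) - matrix(1 / n, n, n)        # centering matrix
B <- -0.5 * J %*% D^2 %*% J               # double-centered squared distances
e <- eigen(B, symmetric = TRUE)
coords <- e$vectors[, 1:2] %*% diag(sqrt(e$values[1:2]))
# cmdscale() yields the same coordinates, up to a sign flip per axis
fit <- cmdscale(eurodist, k = 2)
max(abs(abs(coords) - abs(fit)))          # effectively zero
```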
We have a dataset, run_record, with Olympic records by country in 8 track & field events (100m, 200m, 400m, 800m, 1500m, 5000m, 10000m, Marathon). These Olympic standards are the country’s best times. The data is standardized.
## X100m X200m X400m X800m X1500m
## Argentina 0.0594137 -0.3126106 0.24391299 0.03530558 0.1757051
## Australia -1.2962228 -0.8777403 -1.00718688 -0.53664481 -0.8126362
## Austria -0.3020894 -0.1667707 -0.02020809 0.03530558 -0.4831891
## Belgium -0.3472772 -0.6407504 -0.56235137 -0.72729495 -0.5490785
## Bermuda 0.2401652 -0.4402205 -0.39553805 0.41660584 0.3074840
## Brazil -0.9799076 -1.1876501 -1.06974187 -1.29924534 -0.5490785
## X5000m X10000m marathon
## Argentina -0.3779943 -0.5271604 -0.4366164
## Australia -0.9037300 -0.5986248 -0.6667369
## Austria -0.4699980 -0.4854728 -0.1405878
## Belgium -1.0351640 -0.9916792 -0.7013666
## Bermuda 1.3437903 1.1641643 1.4400937
## Brazil -0.1808434 -0.2413027 -0.8298319
## [1] 54 8
We have 54 rows (observations) by 8 columns (variables): the input matrix. Each row is identified by a unique row name. This is a small matrix. Nonetheless, the matrix holds a lot of information.
We want to reduce the multidimensional matrix to two dimensions and be able to represent the results on a 2D plot.
For that, we use an optimization procedure that relies on a least-squares loss function. Similar to a regression, classical MDS minimizes a residual sum of squares (a loss known as ‘Strain’; the ‘Stress’ function belongs to non-metric MDS, covered later). Classical MDS is also known as metric MDS.
Implementing a classical MDS
First, we compute the Euclidean distance matrix (as we do in hierarchical clustering) and its eigendecomposition. Then, we reduce the number of dimensions. We can choose any number less than the original number of dimensions (p = 8). We choose two dimensions (k = 2).
# Euclidean distances
run_record_d <- dist(run_record_sc)
# k = 2 dimensions
run_record_fit <- cmdscale(run_record_d, eig = TRUE, k = 2)
# coordinates
head(run_record_fit$points)
##                [,1]       [,2]
## Argentina -0.4163326 -0.3945394
## Australia -2.3525022 0.5502192
## Austria -0.7306318 -0.1805723
## Belgium -1.9797765 -0.3770560
## Bermuda 1.4861338 1.6421881
## Brazil -2.2082526 0.7916572
# eigenvalues
head(run_record_fit$eig)
## [1] 355.274367  33.835736  12.058798  10.910007   5.171604   3.746459
The eigenvalues show the strength of each component. The first two are large; beyond 2, the values plummet. This indicates that 2 dimensions are sufficient: these are the principal components.
Now, we can plot the two dimensions.
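A minimal base-R plotting sketch (shown on the built-in eurodist fit for reproducibility; the same calls apply to run_record_fit$points):

```r
fit <- cmdscale(eurodist, k = 2)
# empty frame, then labels at their (x, y) coordinates
plot(fit[, 1], fit[, 2], type = "n",
     xlab = "Coordinate 1", ylab = "Coordinate 2", main = "Classical MDS")
text(fit[, 1], fit[, 2], labels = rownames(fit), cex = 0.7)
```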
From 8 variables, we boiled down the observations to 2 ‘latent variables’.
What kind of readings can we make?
These are Olympic records in athletics. We must read the axes as race times (below average, average, or above average). The center (centrality) is the average, as always.
In the left corner, we find highly competitive countries (times lower, i.e. better, than the average). These countries have the best standards (times): we could call this reference group the ‘elite group’ (proximity).
The US is set apart in the upper-left corner. It is a track and field powerhouse in speed events.
Thailand is to the right of the elite group. Thailand’s times are worse than the elite group’s.
Component 1 (x-axis): overall standards (times). Left: below the average. Right: above the average.
On the other hand, Kenya is on the left side, but further down. Kenya is not known for winning podiums in speed events, but it manages to place runners in the finals of all events. For that reason, Kenya sits on the negative side of the x-axis with the elite group. However, we know Kenya mainly dominates long distances. That sets it apart from the elite group on the y-axis.
It is exactly the opposite (polarity) for the US. We know they dominate speed races, but they seem to underperform in endurance events.
Component 2 (y-axis): endurance vs. speed events. Top: speed. Bottom: endurance.
Can we draw other conclusions?
What about Samoa and the Cook Islands? Turkey or North Korea?
We have to interpret the results with the concepts of centrality, proximity, and polarity in mind.
Improve the visualization – Circles
Using the plotrix package, let us add circles.
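plotrix::draw.circle adds one circle at a time; the base symbols() function achieves the same effect without an extra package. A sketch on the eurodist fit (the circle radii are illustrative, chosen for this coordinate scale):

```r
fit <- cmdscale(eurodist, k = 2)
plot(fit, pch = 19, asp = 1, xlab = "Coordinate 1", ylab = "Coordinate 2")
# concentric circles around the origin help judge centrality
symbols(rep(0, 3), rep(0, 3), circles = c(500, 1000, 1500),
        add = TRUE, inches = FALSE, fg = "grey")
```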
Now we have an overall view in relation to the origin; centrality.
We zoom in on the central observations.
Improve the visualization – Colours
We could discriminate the observations with factors:
We choose to differentiate the observations in geographical zones.
We enrich the dataset with geographical zones (continent).
## X100m X200m X400m X800m X1500m X5000m X10000m marathon continent
## Argentina 10.23 20.37 46.18 106.2 220.8 799.8 1659.0 7774.2 AM
## Australia 9.93 20.06 44.38 104.4 211.8 775.8 1651.8 7650.6 PA
## Austria 10.15 20.45 45.80 106.2 214.8 795.6 1663.2 7933.2 EU
## Belgium 10.14 20.19 45.02 103.8 214.2 769.8 1612.2 7632.0 EU
## Bermuda 10.27 20.30 45.26 107.4 222.0 878.4 1829.4 8782.2 AM
## Brazil 10.00 19.89 44.29 102.0 214.2 808.8 1687.8 7563.0 AM
AF-ME: Africa and Middle East, AM: Americas, AS: Asia, EU: Europe, PA: Pacific.

We plot the colored chart.
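A hedged sketch of the colouring pattern (the continent factor is replaced by a made-up grouping of the eurodist cities; the real chart maps run_record_geo$continent to colours the same way):

```r
fit <- cmdscale(eurodist, k = 2)
# stand-in grouping factor; in the article this is the continent column
grp <- factor(rep(c("North", "Central", "South"), length.out = nrow(fit)))
plot(fit, col = as.integer(grp), pch = 19,
     xlab = "Coordinate 1", ylab = "Coordinate 2")
legend("topright", legend = levels(grp),
       col = seq_along(levels(grp)), pch = 19)
```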
3D plotting
Can 3D help?
First, we rerun the analysis to generate 3 vectors (k=3). Then we plot the new results.
# Euclidean distances
run_record_3d <- dist(run_record_sc)
# k = 3 dimensions
run_record_3dfit <- cmdscale(run_record_3d, eig = TRUE, k = 3)
head(run_record_3dfit$points)
##                [,1]       [,2]        [,3]
## Argentina -0.4163326 -0.3945394 0.40914141
## Australia -2.3525022 0.5502192 0.29073599
## Austria -0.7306318 -0.1805723 0.26293174
## Belgium -1.9797765 -0.3770560 0.04183601
## Bermuda 1.4861338 1.6421881 -0.72124975
## Brazil -2.2082526 0.7916572 -0.52070899
# Plot
x <- run_record_3dfit$points[,1]
y <- run_record_3dfit$points[,2]
z <- run_record_3dfit$points[,3]
# Convert the continents
run_record_geo$colour <- NA
run_record_geo$colour <- ifelse(run_record_geo$continent == 'AF-ME', 1,
ifelse(run_record_geo$continent == 'AM', 2,
ifelse(run_record_geo$continent == 'AS', 3,
ifelse(run_record_geo$continent == 'EU', 4,
ifelse(run_record_geo$continent == 'PA', 5, NA)))))

We can change the parameters and reprint the plot. However, it would be more effective if we could work with an interactive 3D plot. The rgl package offers this possibility.
Normally, we turn and flip with the 3D plot to examine the results. This is a .gif file:
The third dimension does not add much value here, since our goal was to reduce the number of dimensions to make the data more interpretable. However, there are situations where going from 4+ dimensions to 3 rather than 2 might be helpful.
When the original dimensions are clearly defined, we can better interpret the reduction. If the 8 dimensions were economic indicators, a 3D plot would reveal latent variables, clusters, and affinities among countries: prosperity, policies, integration, etc.
For example, the reduction would certainly reveal a cluster showing an economic block or a free-trade zone, because metrics such as GDP, inflation, trade, and the central bank prime rate all point in the same direction. It is easy to follow these metrics when we compare two countries, but not when we want to focus on several countries. This is where dimension reduction helps: we can automatically find trends, patterns, and clusters in a list of 100 countries.
With classical (metric) MDS, the algorithm minimizes a least-squares loss (a residual sum of squares, sometimes called ‘Strain’) on the Euclidean distances between observations.
Non-metric MDS (isoMDS from the MASS package) does not work from the raw distances: it looks for a monotonic transformation of the dissimilarities, minimizing the ‘Stress’ function.
It can sometimes improve the results. Let us find out. We plot the metric and non-metric MDS side-by-side.
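A side-by-side sketch with MASS::isoMDS (on the built-in eurodist distances for reproducibility; isoMDS expects a dist object and reports its final Stress):

```r
library(MASS)

metric_fit <- cmdscale(eurodist, k = 2)
nonmetric_fit <- isoMDS(eurodist, k = 2)   # Kruskal's non-metric MDS
op <- par(mfrow = c(1, 2))
plot(metric_fit, main = "Metric MDS")
plot(nonmetric_fit$points, main = "Non-metric MDS")
par(op)
nonmetric_fit$stress                       # final Stress, in percent
```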
Sometimes, metric MDS can yield significantly different results from non-metric MDS.
A crucial decision is how many dimensions are optimal. In our worked example above, we chose k=2.
The nFactors package offers a suite of functions that give us more information.
The scree plots confirm that 2 dimensions are sufficient. Beyond 2 components, the scree plot does not show much improvement. In fact, the ‘acceleration factor’ marks the elbow: most of the variability is picked up by 2 components.
Component analysis applies a mathematical procedure for transforming a number of (possibly) correlated variables into a (smaller) number of uncorrelated variables called principal components.
The first principal component accounts for as much of the variability in the data as possible. Each succeeding component accounts for as much of the remaining variability as possible. It is possible to view all the components. However, we can only plot 2 or 3 components.
Component analysis is useful when there is data on a large number of variables, and (possibly) there is some redundancy in those variables. In this case, redundancy means that some of the variables are correlated with one another. And because of this redundancy, component analysis can be used to reduce the observed variables into a smaller number of components that will account for most of the variance in the observed variables.
Component analysis is recommended as an exploratory tool to uncover unknown trends in the data. The technique has found application in fields such as face recognition, image compression, and is a common technique for finding patterns in data of high dimension.
This technique is abundantly used in social sciences and marketing (as illustrated with the schema about the Beer Market).
Component analysis has three declinations:
The three approaches are similar. The main difference comes from the dataset they are employed with. In the last section (Wrap-up) we will examine the three approaches with three distinct examples.
The dataset
For our introduction to component analysis, we use a dataset about the 50 US states. crime_data lists crime information and the percentage of urban population. Source: Edureka.
The data is standardized.
## Murder Assault UrbanPop Rape
## Alabama 1.24256408 0.7828393 -0.5209066 -0.003416473
## Alaska 0.50786248 1.1068225 -1.2117642 2.484202941
## Arizona 0.07163341 1.4788032 0.9989801 1.042878388
## Arkansas 0.23234938 0.2308680 -1.0735927 -0.184916602
## California 0.27826823 1.2628144 1.7589234 2.067820292
## Colorado 0.02571456 0.3988593 0.8608085 1.864967207
Entering raw data and extracting the components
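The crime data matches the built-in USArrests dataset, so the extraction can be sketched reproducibly (cor = TRUE works from the correlation matrix, i.e. on standardized variables), matching the output that follows:

```r
# PCA on the correlation matrix (equivalent to standardizing first)
crime_pca <- princomp(USArrests, cor = TRUE)
summary(crime_pca)   # standard deviations and variance proportions
```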
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4
## Standard deviation 1.5748783 0.9948694 0.5971291 0.41644938
## Proportion of Variance 0.6200604 0.2474413 0.0891408 0.04335752
## Cumulative Proportion 0.6200604 0.8675017 0.9566425 1.00000000
The first and second rows are the most important: Standard Deviation and Proportion of Variance. The first component accounts for about 62 % of the variance.
There is a close kinship with regression. With a regression, we choose explanatory variables to maximize \(R^2\); with component analysis, we end up with latent variables, and the principal components account for most of the variability.
We can see with the following plot how each component contributes to the overall variability.
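A scree-style sketch using the base screeplot() function (the crime_pca object is refitted here from the built-in USArrests):

```r
crime_pca <- princomp(USArrests, cor = TRUE)
screeplot(crime_pca, type = "lines", main = "Variance by component")
```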
We could also base the components on the covariance matrix (cor = FALSE).
We plot the principal components (1 and 2) on a 2D chart (there is a function for it).
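The function alluded to is most likely biplot(); a self-contained sketch on the built-in USArrests:

```r
crime_pca <- princomp(USArrests, cor = TRUE)
biplot(crime_pca)   # states as labels, variables as arrows
```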
The ‘groups’ and ‘clusters’ we can perceive are illustrative of affinities. For example, Mississippi and North & South Carolina have ‘something’ in common. It is not simply geographical.
‘Assault’ and ‘Murder’ both point in the same direction because they are positively correlated. When two arrows point in opposite directions, they are negatively correlated. When they are perpendicular (‘Murder’ and ‘UrbanPop’), there is no correlation.
Moreover, the arrow length indicates the strength of the relation.
Arrows also ‘lean’ in the direction of the states they are the most correlated with. For example, ‘Murder’ leans on the side of states with high murder rates. We can compare the biplot with the choropleth maps below.
Both visualizations are coherent with the component analysis.
Determining the number of factors to extract
Contrary to cmdscale for MDS, we do not choose the number of dimensions or components with the princomp function. It is done automatically.
Again, we want to challenge our findings with the nFactors package.
The scree plots confirm that 2 eigenvalues are sufficient. Beyond 2 components, the scree plot does not show much improvement.
Factor (or Factorial) Analysis is the task where we intend to reduce the dataset dimension by analyzing and understanding the impact of its features on a model.
Such analysis allows us to select a subset of the original features, reducing the dimensions. During a subset selection, we try to identify and remove as much of the irrelevant and redundant information as possible.
Let us apply the technique on the standardized Olympic dataset.
run_record_fit3 <- factanal(run_record_sc, 3, rotation = "varimax")
print(run_record_fit3, digits = 2, cutoff = 0.3, sort = TRUE)
##
## Call:
## factanal(x = run_record_sc, factors = 3, rotation = "varimax")
##
## Uniquenesses:
## X100m X200m X400m X800m X1500m X5000m X10000m marathon
## 0.08 0.07 0.23 0.00 0.11 0.02 0.01 0.09
##
## Loadings:
## Factor1 Factor2 Factor3
## X1500m 0.67 0.49 0.44
## X5000m 0.84 0.43 0.31
## X10000m 0.87 0.40
## marathon 0.84 0.38
## X100m 0.37 0.87
## X200m 0.37 0.83 0.32
## X400m 0.47 0.68 0.30
## X800m 0.54 0.44 0.71
##
## Factor1 Factor2 Factor3
## SS loadings 3.40 2.82 1.18
## Proportion Var 0.43 0.35 0.15
## Cumulative Var 0.43 0.78 0.92
##
## Test of the hypothesis that 3 factors are sufficient.
## The chi square statistic is 9.44 on 7 degrees of freedom.
## The p-value is 0.223
Factor1 and Factor2 are two latent variables that account for most of the variability (78 %).
Let us transfer these results on a 2D chart.
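The chart can be built directly from the loadings matrix. A sketch on the built-in ability.cov example (the same pattern applies to run_record_fit3$loadings):

```r
# two-factor model on the built-in ability covariance matrix
fa <- factanal(factors = 2, covmat = ability.cov)
L <- fa$loadings
plot(L[, 1], L[, 2], type = "n",
     xlab = "Factor 1", ylab = "Factor 2", main = "Factor loadings")
text(L[, 1], L[, 2], labels = rownames(L))
abline(h = 0, v = 0, lty = 3)
```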
Factor1 distributes events by distance (left to right: speed events, endurance events).
Factor2 could be anything: how close countries’ standards are to one another, how some events are highly competitive while others are dominated by a few countries, etc. We would need to dig in.
From entropy, we emerge with two latent variables. One is explainable (distance); the other remains to be understood. Nonetheless, factor analysis fulfilled its role of distilling raw data into something finer.
Determining the number of factors to extract
Again, we challenge our findings with the nFactors package.
The scree plots confirm that 2 factors are optimal.
Dimension reduction methods can address issues in diverse fields: social sciences, business, natural sciences, health sciences, etc.
The FactoMineR package
So far, we have used basic functions for MDS (dist and cmdscale for metric MDS; dist and isoMDS for non-metric MDS), princomp for CA, and factanal for FA. We also touched on the FactoMineR package when drawing scree plots.
The FactoMineR package can perform more:
The package offers a large number of additional functions for exploratory analysis. The website mentions additional packages:
- missMDA: handles missing values in exploratory multivariate analysis such as principal component analysis (PCA), multiple correspondence analysis (MCA), and multiple factor analysis (MFA).
- FAMT: performs simultaneous tests under dependence in high-dimensional data.
- SensoMineR: provides numerous graphical outputs that are easy to interpret, as well as syntheses of results issuing from various analysis-of-variance models or factor analysis methods, accompanied with confidence ellipses.
- Not to forget the ade4 package, designed for analyzing ecological data and applying Euclidean methods in environmental sciences.
The dataset
We use a dataset from the Heritage Foundation about Economic Freedom. Each year, the Foundation along with the Wall Street Journal releases an Index of Economic Freedom where countries are measured with 12 metrics (12 ‘freedoms’). Consult the website for definitions, the scope of the survey, assumptions, rankings, graphs, interactive maps, and more.
The dataset comes with microeconomic (tax rates for example) and macroeconomic indicators (GDP for example).
In addition, we enrich the dataset with the 2014 Human Development Index.
We remove some countries (those with NA metrics and those with less than 1M population).
Here is the subset we are going to work with.
## 'data.frame': 146 obs. of 18 variables:
## $ Region : chr "Asia-Pacific" "Europe" "Middle East / North Africa" "Sub-Saharan Africa" ...
## $ World.Rank : num 163 65 172 165 156 33 5 30 68 44 ...
## $ Region.Rank : num 40 30 14 41 26 19 4 17 15 4 ...
## $ X2017.Score : num 49.1 64.4 46.5 48.5 50.4 ...
## $ Property.Rights : num 12.6 54 38.2 36.4 32.4 ...
## $ Judical.Effectiveness: num 28.4 28.5 29.6 19.8 39.6 ...
## $ Government.Integrity : num 27.5 39.7 31.7 12.8 38.2 ...
## $ Tax.Burden : num 91.6 86.9 81.1 87.7 62.6 ...
## $ Gov.t.Spending : num 79.9 72.5 51 58.6 54.6 ...
## $ Fiscal.Health : num 97.3 51.5 19.8 70.7 56.4 ...
## $ Business.Freedom : num 54.2 79.3 62.1 58.5 57.3 78.5 89.3 76.9 71.5 69.4 ...
## $ Labor.Freedom : num 59.9 50.7 49.5 40.4 46.1 ...
## $ Monetary.Freedom : num 69.3 81.4 67 70.6 50.9 ...
## $ Trade.Freedom : num 66 87.7 63.3 56.7 66.7 ...
## $ Investment.Freedom. : num 1 70 35 30 50 80 80 90 55 75 ...
## $ Financial.Freedom : num 1 70 30 40 50 70 90 70 50 80 ...
## $ HDI_2014 : num 46.5 73.3 73.6 53.2 83.6 73.3 93.5 88.5 75.1 82.4 ...
## $ Country_HDI : chr " Afghanistan" " Albania" " Algeria" " Angola" ...
The Freedom Index (X2017.Score) and each Freedom metrics are percentages; 100 being the highest value. The same goes for the HDI_2014.
The 12 Freedom metrics are:
Country names may differ (row names vs. Country_HDI). HDI = Human Development Index.
We also have extra subsets: a microeconomic, a macroeconomic and the HDI subsets.
## 'data.frame': 146 obs. of 4 variables:
## $ Tariff.Rate.... : num NA 1.1 8.4 11.7 6.6 2.4 1.9 1.5 5.3 3.6 ...
## $ Income.Tax.Rate.... : num 20 23 35 17 35 26 45 50 25 0 ...
## $ Corporate.Tax.Rate....: num 20 15 23 30 35 20 30 25 20 0 ...
## $ Tax.Burden...of.GDP : num 6.5 23.6 11.7 6.5 35.9 23.5 27.5 43 14.2 3.8 ...
## 'data.frame': 146 obs. of 11 variables:
## $ Gov.t.Expenditure...of.GDP.: num 27.1 30 44.4 28.9 43.9 ...
## $ Country : chr "Afghanistan" "Albania" "Algeria" "Angola" ...
## $ Population..Millions. : num 32 2.8 39.5 25.1 42.4 3.3 23.9 8.6 9.5 1.2 ...
## $ GDP..Billions..PPP. : num 62.3 32.7 578.7 184.4 972 ...
## $ GDP.Growth.Rate.... : num 1.5 2.6 3.7 3 1.2 3 2.5 0.9 1.1 3.2 ...
## $ X5.Year.GDP.Growth.Rate....: num 5.4 1.9 3.3 4.7 2.7 4.3 2.7 1 2.4 3.7 ...
## $ GDP.per.Capita..PPP. : num 1947 11301 14504 7344 22554 ...
## $ Unemployment.... : num 9.6 17.3 10.5 7.6 6.7 16.3 6.3 5.7 4.7 1.2 ...
## $ Inflation.... : num -1.5 1.9 4.8 10.3 26.5 3.7 1.5 0.8 4 1.8 ...
## $ FDI.Inflow..Millions. : num 58 1003 -587 8681 11655 ...
## $ Public.Debt....of.GDP. : num 6.8 71.9 8.7 62.3 56.5 46.6 36.8 86.2 36.1 63.3 ...
## 'data.frame': 146 obs. of 2 variables:
## $ Country_HDI: chr " Afghanistan" " Albania" " Algeria" " Angola" ...
## $ HDI_2014 : num 46.5 73.3 73.6 53.2 83.6 73.3 93.5 88.5 75.1 82.4 ...
HDI = Human Development Index.
The analysis
Compute the components.
library(FactoMineR)
liberty_pca <- PCA(liberty_base[ ,1:17], quanti.sup = c(2,3,4,17), quali.sup = c(1), graph = FALSE)

We can print the Principal Component Analysis (PCA) results (the liberty_pca object). There are a lot of results we can leave aside.
We can print many more results with summary(liberty_pca)… Again, we get a lot of output. Let us focus on the essential.
We can also estimate the best number of dimensions (estim_npc) to use (find the minimum or the first local minimum) and the mean error for each dimension tested.
Originally, we used:
liberty_pca <- PCA(liberty_base[ ,1:17], quanti.sup = c(2,3,4,17), quali.sup = c(1), graph = FALSE)
We have to remove the quanti.sup and quali.sup parameters.
# from
# str(liberty_base[ ,1:17])
# to
liberty_ncp <- estim_ncp(liberty_base[ ,5:16], scale = TRUE)
liberty_ncp$ncp
## [1] 5
It suggests 5 components as an optimum.
We print the mean error and show the first local minimum.
We plot the eigenvalues. We saw the use of eigenvalues in MDS above. We can rank the components with the eigenvalues.
Each component is what we call a ‘latent variable’. Component 1 (Comp 1) is always the most important latent variable; here it represents over 50 % of the total variability. The first 5 components total 85 % of the variability.
The correspondence map
We plot the 2D chart (the first two components). There are lots of parameters. Here are three examples before we jump to the analysis.
Countries with labels.
Countries without labels.
Regions and labels; countries in the background.
These regions are the qualitative supplementary variables quali.sup in:
PCA(liberty_base[ ,1:17], quanti.sup = c(2,3,4,17), quali.sup = c(1), graph = FALSE)
Dim 1 and Dim 2 stand for comp 1 and comp 2 and account for 2/3 of the variability.
We read the results with three aspects in mind:
Digging in the results
We print the ‘Variables-Dimensions’ matrix.
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## Property.Rights 14.4 0.0 0.7 0.6 0.3
## Judical.Effectiveness 12.2 0.0 3.8 2.0 3.2
## Government.Integrity 13.4 0.5 1.1 2.7 0.7
## Tax.Burden 1.0 41.6 3.4 2.9 24.9
## Gov.t.Spending 3.3 30.3 4.3 4.1 11.5
## Fiscal.Health 0.5 12.3 16.1 64.9 0.0
## Business.Freedom 11.0 1.4 10.3 0.5 0.8
## Labor.Freedom 4.7 12.6 15.2 1.0 31.5
## Monetary.Freedom 7.4 0.3 27.3 4.1 0.6
## Trade.Freedom 10.4 0.8 0.0 0.1 23.2
## Investment.Freedom. 9.8 0.1 14.7 10.8 1.1
## Financial.Freedom 11.9 0.3 3.1 6.2 2.2
We can see how each variable contributes to each component.
We filter out the principal variables in the first component.
We could call this first ‘latent variable’ ‘Free Market’. On the x-axis of the map (above), we can split the observations between:
We filter out the principal variables in the second component.
We could call this second ‘latent variable’ ‘State Size’. Higher taxes and larger government expenditures mean a larger State. On the y-axis of the map (above), we can split the observations between:
We print the head of the ‘Countries-Dimensions’ matrix.
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## Afghanistan 1.7 0.7 0.5 1.9 0.1
## Albania 0.1 0.2 0.1 1.2 2.2
## Algeria 0.6 0.5 1.2 0.4 0.0
## Angola 1.1 0.0 0.0 0.1 0.4
## Argentina 0.4 1.3 0.5 0.2 0.1
## Armenia 0.1 0.9 0.0 0.1 0.0
We print the contributions of Sweden and Poland (row names 'Sweden' and 'Poland') to the first two components:
liberty_pca$ind$contrib['Sweden', c(1,2)]
liberty_pca$ind$contrib['Poland', c(1,2)]
##     Dim.1     Dim.2
## 2.556548 2.439401
## Dim.1 Dim.2
## 0.46178004 0.01302201
Sweden has much more impact on the first two components.
We print Sweden and Poland’s (x,y) coordinates.
liberty_pca$ind$coord['Sweden', c(1,2)]
liberty_pca$ind$coord['Poland', c(1,2)]
##     Dim.1     Dim.2
## 4.781568 -2.359848
## Dim.1 Dim.2
## 2.0321745 -0.1724176
We can display the coordinates on the plot.
Indeed, on the correspondence map, Sweden is further away from the origin than Poland.
Above all, we read the graphic with these ‘labels’ in mind:
Sweden is more interventionist and open than Poland (polarity), but both states stand in the same quadrant (proximity). Sweden stands apart from the averages (centrality).
The correlation circle
We plot the circle.
On the radar, the 12 Freedom metrics appear in red and the supplementary variables appear in blue. The latter are the quantitative supplementary variables quanti.sup in:
PCA(liberty_base[ ,1:17], quanti.sup = c(2,3,4,17), quali.sup = c(1), graph = FALSE)
Supplementary variables, qualitative or quantitative, do not influence the correspondence analysis. They simply enrich the maps and circles.
Let us interpret the results with these findings in mind:
Arrows point or lean in the direction of the dimension they are positively correlated with.
When two arrows point in the same direction, they are positively correlated. When two arrows point in the opposite direction, they are negatively correlated. When the arrows are perpendicular, they are uncorrelated.
The longer the arrow, the stronger the correlation with the component.
We print the results related to the above plot; for the first component (x-axis).
The high correlation coefficients are consistent with the long arrows.
Print the results related to the above plot; for the second component (y-axis).
## correlation p.value
## Tax.Burden 0.81 0
## Gov.t.Spending 0.69 0
## Labor.Freedom 0.44 0
## Fiscal.Health 0.44 0
## X2017.Score 0.37 0
## World.Rank -0.32 0
## Region.Rank -0.43 0
Squared cosines are the squared correlations, akin to the coefficient of determination (\(R^2\)) in a regression. They let us read the variable-component correlations.
We print a table of squared cosines, adding a column total and a row average.
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 TOTAL
## Property.Rights 0.9 0.0 0.0 0.0 0.0 0.9
## Judical.Effectiveness 0.7 0.0 0.0 0.0 0.0 0.8
## Government.Integrity 0.8 0.0 0.0 0.0 0.0 0.9
## Tax.Burden 0.1 0.6 0.0 0.0 0.2 0.9
## Gov.t.Spending 0.2 0.5 0.0 0.0 0.1 0.8
## Fiscal.Health 0.0 0.2 0.2 0.6 0.0 1.0
## Business.Freedom 0.7 0.0 0.1 0.0 0.0 0.8
## Labor.Freedom 0.3 0.2 0.1 0.0 0.2 0.8
## Monetary.Freedom 0.5 0.0 0.3 0.0 0.0 0.8
## Trade.Freedom 0.6 0.0 0.0 0.0 0.1 0.8
## Investment.Freedom. 0.6 0.0 0.1 0.1 0.0 0.8
## Financial.Freedom 0.7 0.0 0.0 0.1 0.0 0.8
## AVERAGE 0.5 0.1 0.1 0.1 0.1 0.8
We can plot partial results.
We read the results with these ‘labels’ in mind:
We run a regression between the first two components and the supplementary variable HDI_2014. HDI = Human Development Index.
lm_pca <- lm(liberty_base$HDI_2014 ~ liberty_pca$ind$coord[, 1] + liberty_pca$ind$coord[, 2])
summary(lm_pca)
##
## Call:
## lm(formula = liberty_base$HDI_2014 ~ liberty_pca$ind$coord[,
## 1] + liberty_pca$ind$coord[, 2])
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.272 -5.623 1.044 5.477 38.827
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 69.1274 0.8817 78.403 <2e-16 ***
## liberty_pca$ind$coord[, 1] 5.0033 0.3562 14.045 <2e-16 ***
## liberty_pca$ind$coord[, 2] -0.8711 0.7051 -1.235 0.219
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.65 on 143 degrees of freedom
## Multiple R-squared: 0.5816, Adjusted R-squared: 0.5757
## F-statistic: 99.39 on 2 and 143 DF, p-value: < 2.2e-16
The first coefficient is positive: there is a positive relationship between ‘Free Market’ and the Human Development Index (HDI).
The second coefficient shows a negative relationship between ‘State Size’ and HDI. In other words, ‘interventionist states’ are related to a high HDI. Note, however, that this coefficient is not statistically significant (p = 0.22).
The \(R^2\) is close to 60 %: 60 % of the HDI variability is explainable with the two latent variables.
lm_predict <- predict(lm_pca)  # fitted values on the principal coordinates

Mmmmh… What about the top ranks?
Data points under the line are optimistic predictions vis-à-vis the actuals. Sweden, for example, comes out better than its actual HDI. Nonetheless, the data points are compact around the line, because 60 % of the HDI variability is explained by the principal components.
And the bottom ranks?
This time, we are going to work with contingency tables, also named cross tables. Along 2 dimensions, we tally unique observations. Tables can be symmetrical (n x n) or rectangular (n x p).
Simple correspondence analysis is useful for political surveys, consumer studies or population census (on humans or animals). However, we are going to work with a simple example.
Hair color and eye color
The data is a classic 4x5 contingency table analyzed by Ronald Aylmer Fisher, an English statistician and biologist (the F-test and more). It contains data on the hair color and eye color of a sample of 5387 schoolchildren from Scotland.
## fair.hair red.hair medium.hair dark.hair black.hair
## blue eyes 326 38 241 110 3
## light eyes 688 116 584 188 4
## medium eyes 343 84 909 412 26
## dark eyes 98 48 403 681 85
These are categories including:
We compute the components with all the variables.
colors_ca <- CA(colors, graph = FALSE)

We print the Correspondence Analysis (CA) results (the colors_ca object).
## **Results of the Correspondence Analysis (CA)**
## The row variable has 4 categories; the column variable has 5 categories
## The chi square of independence between the two variables is equal to 1240.039 (p-value = 4.123993e-258 ).
## *The results are available in the following objects:
##
## name description
## 1 "$eig" "eigenvalues"
## 2 "$col" "results for the columns"
## 3 "$col$coord" "coord. for the columns"
## 4 "$col$cos2" "cos2 for the columns"
## 5 "$col$contrib" "contributions of the columns"
## 6 "$row" "results for the rows"
## 7 "$row$coord" "coord. for the rows"
## 8 "$row$cos2" "cos2 for the rows"
## 9 "$row$contrib" "contributions of the rows"
## 10 "$call" "summary called parameters"
## 11 "$call$marge.col" "weights of the columns"
## 12 "$call$marge.row" "weights of the rows"
In correspondence analysis, it is important to be able to visualize distances between different row profiles or between different column profiles.
We use the \(\chi^2\) metric as a measure of distance. The \(\chi^2\) is 1240.039 (p-value = 4.123993e-258); we can reject the hypothesis of independence of row and column.
In other words, the rows and columns are linked together; there are relationships (clusters, patterns, trends, etc.).
We can code the \(\chi^2\) independence test.
We transform the data frame into a matrix.
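The conversion itself is as.matrix(colors); for a self-contained check, the table can also be rebuilt from the values printed below:

```r
# rebuild the contingency table as a matrix
colors2 <- matrix(c(326,  38, 241, 110,  3,
                    688, 116, 584, 188,  4,
                    343,  84, 909, 412, 26,
                     98,  48, 403, 681, 85),
                  nrow = 4, byrow = TRUE,
                  dimnames = list(
                    c("blue eyes", "light eyes", "medium eyes", "dark eyes"),
                    c("fair.hair", "red.hair", "medium.hair",
                      "dark.hair", "black.hair")))
sum(colors2)   # 5387 schoolchildren
```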
## fair.hair red.hair medium.hair dark.hair black.hair
## blue eyes 326 38 241 110 3
## light eyes 688 116 584 188 4
## medium eyes 343 84 909 412 26
## dark eyes 98 48 403 681 85
We perform the \(\chi^2\) test between the two variables. In a contingency table, between the row and column.
chisq.test(colors2)
##
## Pearson's Chi-squared test
##
## data: colors2
## X-squared = 1240, df = 12, p-value < 2.2e-16
The result is almost the same as above where the \(\chi^2\) is 1240.039 (p-value = 4.123993e-258); we can reject the hypothesis of independence of row and column.
We print and plot the eigenvalues.
## eigenvalue percentage of variance cumulative percentage of variance
## dim 1 0.2 86.6 86.6
## dim 2 0.0 13.1 99.6
## dim 3 0.0 0.4 100.0
Almost 100 % of the variability comes from the first two dimensions. Component 1 is, as always, the most important ‘latent variable’; here it represents over 86 % of the total variability.
The correspondence map
We plot the 2D chart.
We read the results with three aspects in mind:
In addition:
Conclusion
At first view, the map exhibits a U-shaped hair-color/eye-color pattern. There are no clear averages or extremes.
The first dimension, horizontal, displays a gradation along a fair-red-medium-dark hair scale and the light-blue-medium-dark eyes scale.
The second dimension, vertical, displays the difference between medium-color hair and eyes and the other hair and eyes colors.
There are three clusters:
Additional results
We can extract more statistics.
## Dim 1 Dim 2 Dim 3
## fair.hair -0.54399533 0.17384449 -0.012522082
## red.hair -0.23326097 0.04827895 0.118054940
## medium.hair -0.04202412 -0.20830421 -0.003236468
## dark.hair 0.58870853 0.10395044 -0.010116315
## black.hair 1.09438828 0.28643670 0.046135954
## Dim 1 Dim 2 Dim 3
## blue eyes -0.40029985 0.16541100 -0.064157519
## light eyes -0.44070764 0.08846303 0.031773257
## medium eyes 0.03361434 -0.24500190 -0.005552885
## dark eyes 0.70273880 0.13391383 0.004345377
Visually.
## $`Dim 1`
## $`Dim 1`$row
## coord
## light eyes -0.44070764
## blue eyes -0.40029985
## medium eyes 0.03361434
## dark eyes 0.70273880
##
## $`Dim 1`$col
## coord
## fair.hair -0.54399533
## red.hair -0.23326097
## medium.hair -0.04202412
## dark.hair 0.58870853
## black.hair 1.09438828
##
##
## $`Dim 2`
## $`Dim 2`$row
## coord
## medium eyes -0.24500190
## light eyes 0.08846303
## dark eyes 0.13391383
## blue eyes 0.16541100
##
## $`Dim 2`$col
## coord
## medium.hair -0.20830421
## red.hair 0.04827895
## dark.hair 0.10395044
## fair.hair 0.17384449
## black.hair 0.28643670
More statistics
We plot the hair colors, and the hair colors by eye color.
We can compute the overall proportion of each entry in the matrix.
We want to run an exact binomial test. The test measures whether a sample proportion is consistent with a hypothesized population proportion.
Beforehand, we need to reshape the dataset from ‘wide’ to ‘long’ to compute some of the necessary statistics.
library(reshape2)
colors3 <- cbind(eyes = row.names(colors), colors)
# number of medium eyes-medium hair
colors3 <- melt(colors3)
colors3
## eyes variable value
## 1 blue eyes fair.hair 326
## 2 light eyes fair.hair 688
## 3 medium eyes fair.hair 343
## 4 dark eyes fair.hair 98
## 5 blue eyes red.hair 38
## 6 light eyes red.hair 116
## 7 medium eyes red.hair 84
## 8 dark eyes red.hair 48
## 9 blue eyes medium.hair 241
## 10 light eyes medium.hair 584
## 11 medium eyes medium.hair 909
## 12 dark eyes medium.hair 403
## 13 blue eyes dark.hair 110
## 14 light eyes dark.hair 188
## 15 medium eyes dark.hair 412
## 16 dark eyes dark.hair 681
## 17 blue eyes black.hair 3
## 18 light eyes black.hair 4
## 19 medium eyes black.hair 26
## 20 dark eyes black.hair 85
# count of medium eyes-medium hair
colors3[11,3]
## [1] 909
# total
sum(colors3[,3])
## [1] 5387
# proportion of medium eyes-medium hair
pr <- colors3[11,3] / sum(colors3[,3])
round(pr * 100, 2)
## [1] 16.87
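As an aside, the wide-to-long reshaping does not require reshape2; for a contingency matrix, base R's as.data.frame(as.table(...)) produces the same long format. A self-contained sketch on a subset of the table (values copied from above):

```r
# Subset of the hair/eye table, reshaped to long format with base R only.
m <- matrix(c(326,  38, 241,
              688, 116, 584),
            nrow = 2, byrow = TRUE,
            dimnames = list(eyes = c("blue eyes", "light eyes"),
                            hair = c("fair.hair", "red.hair", "medium.hair")))
long <- as.data.frame(as.table(m))
long  # columns: eyes, hair, Freq
```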
There are 909 medium eyes-medium hair individuals in the sample (n = 5387).
The sample thus contains 16.87 % of medium eyes-medium hair individuals.
We want to test whether medium eyes-medium hair accounts for more than 16 % of the population.
nber <- round(pr * sum(colors3[,3]), 0)
# One-sided test (greater or less)
binom.test(nber, n = 5387, p = 0.16, alternative = 'greater', conf.level = 0.95)
##
## Exact binomial test
##
## data: nber and 5387
## number of successes = 909, number of trials = 5387, p-value =
## 0.04245
## alternative hypothesis: true probability of success is greater than 0.16
## 95 percent confidence interval:
## 0.1603875 1.0000000
## sample estimates:
## probability of success
## 0.1687396
# Two-sided test
binom.test(nber, n = 5387, p = 0.16, alternative = 'two.sided', conf.level = 0.95)
##
## Exact binomial test
##
## data: nber and 5387
## number of successes = 909, number of trials = 5387, p-value =
## 0.08069
## alternative hypothesis: true probability of success is not equal to 0.16
## 95 percent confidence interval:
## 0.1588270 0.1790123
## sample estimates:
## probability of success
## 0.1687396
The one-sided test rejects the null hypothesis at the 5 % level in favor of the alternative: medium eyes-medium hair represents more than 16 % of the population. The two-sided test is not as significant (significant only at the 10 % level), but points to the same conclusion. However, this test is purely illustrative. The question is whether medium eyes-medium hair constitutes more than 16 % of the population, not whether it makes up exactly 16 % of the population. We could have tested left-handedness or other traits to see if the sample is faithful to the population.
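The one-sided p-value can be verified directly: for the exact binomial test with alternative 'greater', it is the probability of observing 909 or more successes out of 5387 when the true proportion is 0.16. A quick check:

```r
# P(X >= 909) for X ~ Binomial(5387, 0.16)
p_one_sided <- pbinom(908, size = 5387, prob = 0.16, lower.tail = FALSE)
round(p_one_sided, 5)
```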
We compute, for each row, the proportions across columns (each row sums to 100 %).
## fair.hair red.hair medium.hair dark.hair black.hair Total
## blue eyes 45.4 5.3 33.6 15.3 0.4 100.0
## light eyes 43.5 7.3 37.0 11.9 0.3 100.0
## medium eyes 19.3 4.7 51.2 23.2 1.5 99.9
## dark eyes 7.5 3.7 30.6 51.8 6.5 100.1
We compute, for each column, the proportions across rows (each column sums to 100 %).
## fair.hair red.hair medium.hair dark.hair black.hair
## blue eyes 22.4 13.3 11.3 7.9 2.5
## light eyes 47.3 40.6 27.3 13.5 3.4
## medium eyes 23.6 29.4 42.5 29.6 22.0
## dark eyes 6.7 16.8 18.9 49.0 72.0
## Total 100.0 100.1 100.0 100.0 99.9
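Both proportion tables come from the same base-R idiom, prop.table, with margin = 1 for row proportions and margin = 2 for column proportions. A sketch reproducing the first row of the row-proportion table (the matrix is rebuilt here so the sketch is self-contained):

```r
colors2 <- matrix(c(326,  38, 241, 110,  3,
                    688, 116, 584, 188,  4,
                    343,  84, 909, 412, 26,
                     98,  48, 403, 681, 85),
                  nrow = 4, byrow = TRUE,
                  dimnames = list(c("blue eyes", "light eyes", "medium eyes", "dark eyes"),
                                  c("fair.hair", "red.hair", "medium.hair", "dark.hair", "black.hair")))
row_pct <- round(100 * prop.table(colors2, margin = 1), 1)  # rows sum to ~100
col_pct <- round(100 * prop.table(colors2, margin = 2), 1)  # columns sum to ~100
row_pct["blue eyes", ]
```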
We want to run a proportion test. This time, the test measures whether the proportions in several groups are similar. In other words, we test whether the proportion of each hair color is the same for every eye color. We test each row (each eye color, from 1 to 4).
prop.test(colors2[1,], n = colors_tc, p = NULL, alternative = 'two.sided', conf.level = 0.95, correct = TRUE)
##
## 5-sample test for equality of proportions without continuity
## correction
##
## data: colors2[1, ] out of colors_tc
## X-squared = 158.82, df = 4, p-value < 2.2e-16
## alternative hypothesis: two.sided
## sample estimates:
## prop 1 prop 2 prop 3 prop 4 prop 5
## 0.22405498 0.13286713 0.11277492 0.07907980 0.02542373
prop.test(colors2[2,], n = colors_tc, p = NULL, alternative = 'two.sided', conf.level = 0.95, correct = TRUE)
##
## 5-sample test for equality of proportions without continuity
## correction
##
## data: colors2[2, ] out of colors_tc
## X-squared = 453.99, df = 4, p-value < 2.2e-16
## alternative hypothesis: two.sided
## sample estimates:
## prop 1 prop 2 prop 3 prop 4 prop 5
## 0.47285223 0.40559441 0.27328030 0.13515457 0.03389831
prop.test(colors2[3,], n = colors_tc, p = NULL, alternative = 'two.sided', conf.level = 0.95, correct = TRUE)
##
## 5-sample test for equality of proportions without continuity
## correction
##
## data: colors2[3, ] out of colors_tc
## X-squared = 161.84, df = 4, p-value < 2.2e-16
## alternative hypothesis: two.sided
## sample estimates:
## prop 1 prop 2 prop 3 prop 4 prop 5
## 0.2357388 0.2937063 0.4253627 0.2961898 0.2203390
prop.test(colors2[4,], n = colors_tc, p = NULL, alternative = 'two.sided', conf.level = 0.95, correct = TRUE)
##
## 5-sample test for equality of proportions without continuity
## correction
##
## data: colors2[4, ] out of colors_tc
## X-squared = 890.35, df = 4, p-value < 2.2e-16
## alternative hypothesis: two.sided
## sample estimates:
## prop 1 prop 2 prop 3 prop 4 prop 5
## 0.06735395 0.16783217 0.18858212 0.48957584 0.72033898
Each row differs significantly from the total row; that is, the hair-color proportions within each eye color differ significantly from the overall hair-color proportions.
Not only are comparison tests used in research, but also in business, marketing, and web analytics. They are known as ‘A/B tests’.
Depending on the situation, A/B tests can be:
A and B can be two web pages or B can be an alternative version of web page A. Other situations:
Two-way contingency tables consist of square tables where individuals, or other kinds of observations, are naturally paired:
We can measure evolution, mobility, progression, etc.
The data is a classic 14x14 contingency table that was analyzed by Karl Pearson, an English mathematician and biostatistician (of the Pearson correlation, among other contributions). It is a sample of 775 males in 1904 England. We want to study mobility by occupational category from one generation to the next (father to son):
## a b c d e f g h i j k l m n
## A 28 0 4 0 0 0 1 3 3 0 3 1 5 2
## B 2 51 1 1 2 0 0 1 2 0 0 0 1 1
## C 6 5 7 0 9 1 3 6 4 2 1 1 2 7
## D 0 12 0 6 5 0 0 1 7 1 2 0 0 10
## E 5 5 2 1 54 0 0 6 9 4 12 3 1 13
## F 0 2 3 0 3 0 0 1 4 1 4 2 1 5
## G 17 1 4 0 14 0 6 11 4 1 3 3 17 7
## H 3 5 6 0 6 0 2 18 13 1 1 1 8 5
## I 0 1 1 0 4 0 0 1 4 0 2 1 1 4
## J 12 16 4 1 15 0 0 5 13 11 6 1 7 15
## K 0 4 2 0 1 0 0 0 3 0 20 0 5 6
## L 1 3 1 0 0 0 1 0 1 1 1 6 2 1
## M 5 0 2 0 3 0 1 8 1 2 2 3 23 1
## N 5 3 0 2 6 0 1 3 1 0 0 1 1 9
We add totals.
## a b c d e f g h i j k l m n Total
## A 28 0 4 0 0 0 1 3 3 0 3 1 5 2 50
## B 2 51 1 1 2 0 0 1 2 0 0 0 1 1 62
## C 6 5 7 0 9 1 3 6 4 2 1 1 2 7 54
## D 0 12 0 6 5 0 0 1 7 1 2 0 0 10 44
## E 5 5 2 1 54 0 0 6 9 4 12 3 1 13 115
## F 0 2 3 0 3 0 0 1 4 1 4 2 1 5 26
## G 17 1 4 0 14 0 6 11 4 1 3 3 17 7 88
## H 3 5 6 0 6 0 2 18 13 1 1 1 8 5 69
## I 0 1 1 0 4 0 0 1 4 0 2 1 1 4 19
## J 12 16 4 1 15 0 0 5 13 11 6 1 7 15 106
## K 0 4 2 0 1 0 0 0 3 0 20 0 5 6 41
## L 1 3 1 0 0 0 1 0 1 1 1 6 2 1 18
## M 5 0 2 0 3 0 1 8 1 2 2 3 23 1 51
## N 5 3 0 2 6 0 1 3 1 0 0 1 1 9 32
## Total 84 108 37 11 122 1 15 64 69 24 57 23 74 86 775
Occupations:
We compute the components with all the variables and print the results.
fathers_ca <- CA(fathers, graph = FALSE)
## **Results of the Correspondence Analysis (CA)**
## The row variable has 14 categories; the column variable has 14 categories
## The chi square of independence between the two variables is equal to 1005.454 (p-value = 5.633858e-119 ).
## *The results are available in the following objects:
##
## name description
## 1 "$eig" "eigenvalues"
## 2 "$col" "results for the columns"
## 3 "$col$coord" "coord. for the columns"
## 4 "$col$cos2" "cos2 for the columns"
## 5 "$col$contrib" "contributions of the columns"
## 6 "$row" "results for the rows"
## 7 "$row$coord" "coord. for the rows"
## 8 "$row$cos2" "cos2 for the rows"
## 9 "$row$contrib" "contributions of the rows"
## 10 "$call" "summary called parameters"
## 11 "$call$marge.col" "weights of the columns"
## 12 "$call$marge.row" "weights of the rows"
The \(\chi^2\) is 1005.454 (p-value = 5.633858e-119); we can reject the hypothesis of independence of row and column. In other words, the rows and columns are linked together; there are relationships (clusters, patterns, trends, etc.).
We print and plot the eigenvalues.
## eigenvalue percentage of variance cumulative percentage of variance
## dim 1 0.4 31.6 31.6
## dim 2 0.3 19.3 51.0
## dim 3 0.2 12.6 63.6
## dim 4 0.1 10.4 74.0
## dim 5 0.1 7.0 81.0
## dim 6 0.1 6.5 87.5
## dim 7 0.1 5.3 92.8
## dim 8 0.0 3.3 96.1
## dim 9 0.0 2.1 98.3
## dim 10 0.0 0.8 99.1
## dim 11 0.0 0.6 99.6
## dim 12 0.0 0.3 99.9
## dim 13 0.0 0.1 100.0
Component 1 is always the most important; here it represents 32 % of total variability. The cumulative variance of the first two components is 51 %. In this case, lower components still contain significant information, but we will not explore them here.
The correspondence map
Plot the 2D chart.
We read the results with three aspects in mind:
Conclusion
On the right of the map, there is a steady progression in occupation from A to E (sweeping the upper-right and lower-right quadrants).
Two occupations, B and D, representing arts and crafts, stand out from the rest.
Points that are close together on the map reflect the fact that there is a lot of movement from father to son (sons opt for different occupations).
High mobility: sons opt for different occupations than their father’s (C).
On the other hand, points that are far apart from each other indicate relatively little movement.
Low mobility: sons opt for their father’s occupation (B).
B-b really stands apart in terms of conservatism: 82 % of sons chose their father’s occupation.
Overall.
Percentage of sons who pursue their father’s occupation by occupation.
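The persistence figure quoted for B-b can be checked from the table: it is the diagonal cell over the row total. A sketch for row B (values copied from the table above):

```r
# Row B of the mobility table: sons' occupations for fathers in occupation B.
sons_of_B <- c(a = 2, b = 51, c = 1, d = 1, e = 2, f = 0, g = 0,
               h = 1, i = 2, j = 0, k = 0, l = 0, m = 1, n = 1)
persistence_B <- sons_of_B["b"] / sum(sons_of_B)  # share of sons staying in b
round(100 * persistence_B)
```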
Clusters suggest that occupational mobility from father to son is confined to movements within the various clusters only (and not between clusters):
This time, the contingency table comprises qualitative variables: categorical data, nominal or ordinal factors.
MCA is useful to map out the results of interviews or Likert-scale results.
Consumption credit
The data is about 66 bank clients who contracted consumption credit. There are 11 explanatory variables.
## 'data.frame': 66 obs. of 11 variables:
## $ Market : Factor w/ 5 levels "Car","Furniture",..: 4 4 1 4 5 4 4 4 2 4 ...
## $ Cash : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 1 2 2 2 2 ...
## $ Unpaid : Factor w/ 3 levels "Unp_0","Unp_1_or_2",..: 1 1 1 1 3 1 1 1 1 1 ...
## $ Insurance : Factor w/ 4 levels "A","A_Benefits",..: 1 3 1 4 1 4 4 3 3 4 ...
## $ Debt : Factor w/ 4 levels "Debt1","Debt2",..: 1 2 3 2 4 3 2 2 1 2 ...
## $ Status : Factor w/ 5 levels "Bachelor","Cohab",..: 2 4 4 5 2 4 5 5 2 4 ...
## $ Dependents: Factor w/ 5 levels "Child_0","Child_1",..: 1 1 2 1 1 1 1 1 1 1 ...
## $ Housing : Factor w/ 5 levels "Employer_housed",..: 2 4 3 4 1 4 4 4 1 4 ...
## $ Occupation: Factor w/ 6 levels "Executive","Mid_management",..: 6 5 5 4 2 4 4 4 6 4 ...
## $ Title : Factor w/ 3 levels "Miss","Mr","Ms": 2 2 2 2 2 2 3 3 2 2 ...
## $ Age : Factor w/ 5 levels "20","30","40",..: 4 3 2 5 1 5 5 5 3 5 ...
We display the statistics.
For the labels too small to read on the chart, we list the factor levels.
levels(credit$Market)
## [1] "Car" "Furniture" "Moto" "Renovation" "Scooter"
levels(credit$Status)
## [1] "Bachelor" "Cohab" "Divorced" "Married" "Widowed"
levels(credit$Housing)
## [1] "Employer_housed" "Family_housed" "New_owner" "Owner"
## [5] "Tenant"
levels(credit$Occupation)
## [1] "Executive" "Mid_management" "Mid_mangement"
## [4] "Retired" "Skilled_worker" "Unskilled_worker"
We compute the components with the first 5 variables.
With level.ventil=0.05, for example, levels accounting for 5 % or less of the total would automatically be reattributed to the other levels (ventilated according to the other levels’ weights).
credit_mca <- MCA(credit, quali.sup = 6:11, level.ventil = 0, graph = FALSE)
We plot the results.
barplot(credit_mca$eig[1:5,2], names.arg = rownames(credit_mca$eig[1:5,]))
The first two components represent only 30 % of total variability – or total inertia as it is often called.
In the case of an MCA, such a low value is typical, though.
The correspondence maps
A map of all individuals.
A map without numbers, with variables.
Adding more colors.
By Age (the 11th variable).
The complete map.
On the horizontal axis, we have younger profiles on the right (keywords: scooter, new owner, tenant, housed in family, high debt) and older profiles on the left (keywords: senior, retired, widowed, owner, low debt).
On the vertical axis, we have people with financial difficulties at the bottom. These people are heavily in debt and unemployed.
We can isolate some clusters:
Let us have a look at the contribution, credit_mca$var$contrib, of each variable to the first two components. The individual results can be extracted with credit_mca$ind.
## Dim 1 Dim 2
## Car 6.2978074 2.300173e+00
## Furniture 1.7343275 9.518153e+00
## Moto 0.7442569 8.329618e-01
## Renovation 12.0394243 8.135005e+00
## Scooter 10.4133134 3.419319e+00
## No 6.4918889 3.404219e+00
## Yes 5.7499587 3.015165e+00
## Unp_0 1.1943032 1.522427e+00
## Unp_1_or_2 2.3318481 9.803501e+00
## Unp_3+ 0.4772412 1.775006e+01
## A 7.3305180 1.877022e+00
## A_Benefits 0.7203020 1.970557e+01
## None 3.4769365 3.451857e-04
## Senior 13.6314575 7.126168e+00
## Debt1 0.4896683 4.747586e+00
## Debt2 14.1983793 1.029576e-02
## Debt3 1.2789896 6.648669e+00
## Debt4 11.3993792 1.833538e-01
Let us plot the variable-component correlations.
plot(credit_mca, choix = 'var')
Market has the most influence on the first component; Insurance and Unpaid debts, on the second.
We can build other maps with components 3 and 4.
And so on…
Here, we come back to Factor Analysis. We touch base with our Olympic dataset and the factanal function.
## 'data.frame': 54 obs. of 8 variables:
## $ X100m : num 0.0594 -1.2962 -0.3021 -0.3473 0.2402 ...
## $ X200m : num -0.313 -0.878 -0.167 -0.641 -0.44 ...
## $ X400m : num 0.2439 -1.0072 -0.0202 -0.5624 -0.3955 ...
## $ X800m : num 0.0353 -0.5366 0.0353 -0.7273 0.4166 ...
## $ X1500m : num 0.176 -0.813 -0.483 -0.549 0.307 ...
## $ X5000m : num -0.378 -0.904 -0.47 -1.035 1.344 ...
## $ X10000m : num -0.527 -0.599 -0.485 -0.992 1.164 ...
## $ marathon: num -0.437 -0.667 -0.141 -0.701 1.44 ...
The data are numerical. With both numerical and categorical variables, we can perform a Multiple Factor Analysis (MFA) with the FactoMineR package. There are other possibilities we do not address:
In addition, we can perform Hierarchical Clustering on Principal Components (HCPC).
Multiple Factor Analysis
This first example uses a dataset included in the package. From the dataset, we build an object as we did with PCA and CA.
data(wine)
wine_mfa <- MFA(wine, group = c(2,5,3,10,9,2), type = c('n', rep('s', 5)), ncp = 5, name.group = c('orig', 'olf', 'vis', 'olfag', 'gust', 'ens'), num.group.sup = c(1, 6))
We plot the eigenvalues…
… and confidence ellipses around the categories; they are available in PCA, CA, and MFA.
We can also plot interactive graphs…
LDA is closely related to ANOVA and regression analysis with a qualitative dependent variable. We want to explain (and predict) a qualitative dependent variable, Y, with quantitative and qualitative variables, X, or explanatory variables.
The first use of LDA is classification, pattern recognition, and machine learning. Y may have two or more levels. We want to compute, for each observation, the probability of belonging to each group. Behind the scenes, the decision rule is based on Bayes’ theorem.
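To make the Bayes step concrete, here is a univariate sketch with made-up numbers (a single predictor, normal class densities with means loosely echoing the age group means below, and an assumed common standard deviation of 10): the posterior for each class is prior times likelihood, renormalized.

```r
# Hypothetical one-variable LDA setting: age only, common sd = 10 (assumed).
prior <- c(N = 0.65, Y = 0.35)
mu    <- c(N = 50, Y = 56)
x     <- 60  # a new individual aged 60
lik   <- dnorm(x, mean = mu, sd = 10)    # class-conditional densities
post  <- prior * lik / sum(prior * lik)  # Bayes' theorem
round(post, 3)
```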
The other use of LDA is dimension reduction. As with PCA, we create latent variables called ‘discriminant canonical variables’ that are linear combinations of features that characterize or separate two or more classes of objects or events.
A first example
We illustrate the method with a dataset about snoring (whether or not the individual snores). The explanatory variables are age, weight in kg, height in cm, alcohol consumption (portions per day), sex, and smoke (whether the individual is a smoker).
## 'data.frame': 100 obs. of 7 variables:
## $ age : num 47 56 46 70 51 46 40 46 49 39 ...
## $ weight : num 71 58 116 96 91 98 112 77 76 119 ...
## $ height : num 158 164 208 186 195 188 193 165 164 196 ...
## $ alcohol: num 0 7 3 3 2 0 5 0 0 3 ...
## $ sex : Factor w/ 2 levels "F","M": 2 2 2 2 2 1 2 1 2 2 ...
## $ snore : Factor w/ 2 levels "N","Y": 1 2 1 1 2 1 1 1 1 1 ...
## $ smoke : Factor w/ 2 levels "N","Y": 2 1 2 2 2 1 2 1 1 1 ...
## age weight height alcohol sex
## Min. :23.00 Min. : 42.00 Min. :158.0 Min. : 0.00 F:25
## 1st Qu.:43.00 1st Qu.: 77.00 1st Qu.:166.0 1st Qu.: 0.00 M:75
## Median :52.00 Median : 95.00 Median :186.0 Median : 2.00
## Mean :52.27 Mean : 90.41 Mean :181.1 Mean : 2.95
## 3rd Qu.:62.25 3rd Qu.:107.00 3rd Qu.:194.0 3rd Qu.: 4.25
## Max. :74.00 Max. :120.00 Max. :208.0 Max. :15.00
## snore smoke
## N:65 N:36
## Y:35 Y:64
##
##
##
##
We build a first model.
library(MASS)
# prior
# age, weight, height, alcohol, sex, smoke
snore_lda_model <- lda(snore ~ ., data = snore)
snore_lda_model
## Call:
## lda(snore ~ ., data = snore)
##
## Prior probabilities of groups:
## N Y
## 0.65 0.35
##
## Group means:
## age weight height alcohol sexM smokeY
## N 50.26154 90.47692 180.9538 2.369231 0.6923077 0.6769231
## Y 56.00000 90.28571 181.3714 4.028571 0.8571429 0.5714286
##
## Coefficients of linear discriminants:
## LD1
## age 0.05973655
## weight -0.01620579
## height 0.01590170
## alcohol 0.24058822
## sexM 0.55413371
## smokeY -1.14621434
We print the discriminant functions based on centered (not standardized) variables.
No significance tests are produced.
In the results (Prior probabilities of groups), 35 % of the sample snores (Y for ‘yes’).
Next, we read the means of each subgroup. For example, the average age of snorers is 56, their average weight just over 90.2 kg, etc.
Finally, we have the discriminant canonical variable LD1 and its coefficients. These are similar to regression coefficients. For example, weight is negatively correlated with snoring, and smoke=Y has the most impact. We read that sex=M increases the discriminant score by 0.5541; each additional year of age increases it by 0.05974.
Since the weight and height coefficients are nearly zero, we remove these variables and run another model.
snore_lda_model1 <- lda(snore ~ age + alcohol + sex + smoke, data = snore)
snore_lda_model1
## Call:
## lda(snore ~ age + alcohol + sex + smoke, data = snore)
##
## Prior probabilities of groups:
## N Y
## 0.65 0.35
##
## Group means:
## age alcohol sexM smokeY
## N 50.26154 2.369231 0.6923077 0.6769231
## Y 56.00000 4.028571 0.8571429 0.5714286
##
## Coefficients of linear discriminants:
## LD1
## age 0.06062024
## alcohol 0.24060629
## sexM 0.54047702
## smokeY -1.13039555
It barely affects the coefficients.
We can set the prior probabilities with prior. Instead of ‘35 % of the sample snores’, we set equal proportions (50/50).
snore_lda_model2 <- lda(snore ~ ., data = snore, prior = c(0.50, 0.50))
snore_lda_model3 <- lda(snore ~ age + alcohol + sex + smoke, data = snore, prior = c(0.50, 0.50))
snore_lda_model2
## Call:
## lda(snore ~ ., data = snore, prior = c(0.5, 0.5))
##
## Prior probabilities of groups:
## N Y
## 0.5 0.5
##
## Group means:
## age weight height alcohol sexM smokeY
## N 50.26154 90.47692 180.9538 2.369231 0.6923077 0.6769231
## Y 56.00000 90.28571 181.3714 4.028571 0.8571429 0.5714286
##
## Coefficients of linear discriminants:
## LD1
## age 0.05973655
## weight -0.01620579
## height 0.01590170
## alcohol 0.24058822
## sexM 0.55413371
## smokeY -1.14621434
The coefficients did not change. However, the choice impacts how we assign new individuals to each group (from 65/35 to 50/50).
As with any classification model, we can compute the confusion matrix and the error rate.
Re-substitution (using the same data to derive the functions and evaluate their prediction accuracy) is the default method unless CV=TRUE is specified. Re-substitution will be overly optimistic.
We turn the CV parameter on (cross validation). CV=TRUE generates jackknifed (i.e., leave one out) predictions.
# snore_lda_model
snore_lda_pred <- lda(snore ~ ., data = snore, CV = TRUE)$class
# Confusion matrix
table(snore_lda_pred, snore$snore)
##
## snore_lda_pred N Y
## N 53 22
## Y 12 13
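The jackknifed predictions that CV = TRUE returns can be reproduced by explicitly refitting the model without each observation in turn. Since the snore data is not reproduced here, this is a sketch on the built-in iris data, with priors fixed so both routes use the same rule (predictions may still differ on a few borderline cases, hence the lenient check):

```r
library(MASS)

# Built-in jackknife (leave-one-out) predictions.
auto <- lda(Species ~ ., data = iris, prior = rep(1/3, 3), CV = TRUE)$class

# The same thing by hand: refit without observation i, predict observation i.
manual <- factor(rep(NA, nrow(iris)), levels = levels(iris$Species))
for (i in seq_len(nrow(iris))) {
  fit <- lda(Species ~ ., data = iris[-i, ], prior = rep(1/3, 3))
  manual[i] <- predict(fit, newdata = iris[i, ])$class
}
# Agreement between the built-in jackknife and the explicit loop.
mean(as.character(auto) == as.character(manual))
```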
We compute the error rate by comparing the predictions (snore_lda_pred) with the actuals (snore$snore).
## Prediction Actual Check
## 1 N N
## 2 Y Y
## 3 N N
## 4 Y N error
## 5 N Y error
## 6 N N
## 7 N N
## 8 N N
## 9 N N
## 10 N N
## 11 N N
## 12 N N
## 13 N N
## 14 N N
## 15 Y N error
## 16 N N
## 17 Y N error
## 18 Y N error
## 19 N N
## 20 N N
## 21 N N
## 22 Y N error
## 23 N N
## 24 N N
## 25 N N
## 26 N N
## 27 N N
## 28 N Y error
## 29 N N
## 30 N N
## 31 N N
## 32 N N
## 33 Y N error
## 34 N N
## 35 N N
## 36 N N
## 37 N N
## 38 Y N error
## 39 Y N error
## 40 N N
## 41 Y N error
## 42 N N
## 43 N N
## 44 Y N error
## 45 N N
## 46 N N
## 47 N N
## 48 N N
## 49 N N
## 50 N N
## 51 N N
## 52 N N
## 53 Y N error
## 54 N N
## 55 N N
## 56 N N
## 57 N N
## 58 N N
## 59 N N
## 60 N N
## 61 N N
## 62 N N
## 63 N Y error
## 64 Y Y
## 65 N Y error
## 66 N Y error
## 67 Y Y
## 68 N Y error
## 69 N Y error
## 70 N Y error
## 71 N Y error
## 72 Y Y
## 73 Y Y
## 74 N Y error
## 75 N Y error
## 76 N Y error
## 77 Y Y
## 78 N Y error
## 79 Y Y
## 80 Y Y
## 81 Y Y
## 82 Y Y
## 83 N Y error
## 84 Y Y
## 85 N Y error
## 86 Y Y
## 87 N Y error
## 88 N N
## 89 N Y error
## 90 N N
## 91 N Y error
## 92 Y N error
## 93 N N
## 94 N N
## 95 Y Y
## 96 N Y error
## 97 N Y error
## 98 N N
## 99 N Y error
## 100 N Y error
## [1] 0.34
A 34% error rate.
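The 34 % figure is just the off-diagonal share of the confusion matrix above; a quick check:

```r
# Confusion matrix from above: rows = predicted, columns = actual.
conf <- matrix(c(53, 22,
                 12, 13),
               nrow = 2, byrow = TRUE,
               dimnames = list(pred = c("N", "Y"), actual = c("N", "Y")))
error_rate <- 1 - sum(diag(conf)) / sum(conf)  # misclassified share
error_rate
```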
And for the other models…
snore_lda_model1
##
## snore_lda_pred1 N Y
## N 52 23
## Y 13 12
## [1] 0.36
snore_lda_model2
##
## snore_lda_pred2 N Y
## N 41 10
## Y 24 25
## [1] 0.34
snore_lda_model3
## [1] 0.32
The last model is the best (lowest error rate).
Now, we can use the classifying model to make predictions by supplying new data.
## age weight height alcohol sex smoke
## 1 42 55 169 0 F N
## 2 58 94 185 4 M Y
## 3 35 70 180 6 M Y
## 4 67 63 166 3 F N
Let us compute the results with the first model of all.
snore_lda_predict <- predict(snore_lda_model, newdata = n_snore)
First, we get the predictions.
## [1] N N N Y
## Levels: N Y
A comparison: actuals vs. predictions.
## age weight height alcohol sex smoke pred_smoke
## 1 42 55 169 0 F N N
## 2 58 94 185 4 M Y N
## 3 35 70 180 6 M Y N
## 4 67 63 166 3 F N Y
Second, we get the classification probabilities.
## N Y
## 1 0.7957279 0.2042721
## 2 0.6095636 0.3904364
## 3 0.7325932 0.2674068
## 4 0.3532721 0.6467279
Third, we get the score by individual.
## LD1
## 1 -0.6238163
## 2 0.3246422
## 3 -0.2586916
## 4 1.4140108
And we know from above that changing the prior probabilities will not affect the coefficients, though it does affect how new individuals are assigned to groups.
We can also remove missing values with na.action.
library(MASS)
snore_lda_model4 <- lda(snore ~ ., data = snore, na.action = 'na.omit', CV = TRUE)
We then perform an LDA using listwise deletion of missing data.
The quadratic discriminant function does not assume homogeneity of variance-covariance matrices.
library(MASS)
snore_qda_model <- qda(snore ~ ., data = snore, na.action = 'na.omit', CV = TRUE)
Let us compare the LDA and QDA results.
## Actual LDA Check Actual.1 QDA Check.1
## 1 N N N N
## 2 Y Y Y N error
## 3 N N N N
## 4 N Y error N Y error
## 5 Y N error Y N error
## 6 N N N N
## 7 N N N N
## 8 N N N N
## 9 N N N N
## 10 N N N N
## 11 N N N N
## 12 N N N N
## 13 N N N N
## 14 N N N N
## 15 N Y error N Y error
## 16 N N N N
## 17 N Y error N Y error
## 18 N Y error N Y error
## 19 N N N N
## 20 N N N N
## 21 N N N N
## 22 N Y error N Y error
## 23 N N N N
## 24 N N N N
## 25 N N N N
## 26 N N N N
## 27 N N N N
## 28 Y N error Y Y
## 29 N N N N
## 30 N N N N
## 31 N N N N
## 32 N N N N
## 33 N Y error N Y error
## 34 N N N N
## 35 N N N N
## 36 N N N N
## 37 N N N N
## 38 N Y error N Y error
## 39 N Y error N Y error
## 40 N N N Y error
## 41 N Y error N Y error
## 42 N N N N
## 43 N N N N
## 44 N Y error N Y error
## 45 N N N N
## 46 N N N Y error
## 47 N N N N
## 48 N N N N
## 49 N N N N
## 50 N N N N
## 51 N N N N
## 52 N N N N
## 53 N Y error N Y error
## 54 N N N N
## 55 N N N N
## 56 N N N N
## 57 N N N N
## 58 N N N N
## 59 N N N N
## 60 N N N Y error
## 61 N N N N
## 62 N N N N
## 63 Y N error Y Y
## 64 Y Y Y N error
## 65 Y N error Y N error
## 66 Y N error Y Y
## 67 Y Y Y N error
## 68 Y N error Y N error
## 69 Y N error Y Y
## 70 Y N error Y N error
## 71 Y N error Y N error
## 72 Y Y Y N error
## 73 Y Y Y Y
## 74 Y N error Y N error
## 75 Y N error Y N error
## 76 Y N error Y N error
## 77 Y Y Y Y
## 78 Y N error Y Y
## 79 Y Y Y Y
## 80 Y Y Y N error
## 81 Y Y Y N error
## 82 Y Y Y N error
## 83 Y N error Y N error
## 84 Y Y Y N error
## 85 Y N error Y N error
## 86 Y Y Y N error
## 87 Y N error Y N error
## 88 N N N N
## 89 Y N error Y N error
## 90 N N N N
## 91 Y N error Y Y
## 92 N Y error N Y error
## 93 N N N N
## 94 N N N N
## 95 Y Y Y Y
## 96 Y N error Y N error
## 97 Y N error Y N error
## 98 N N N N
## 99 Y N error Y N error
## 100 Y N error Y N error
Let us validate with the confusion matrix.
# Alternative code
conf <- table(snore$snore, snore_qda_model$class)
diag(prop.table(conf, 1))
## N Y
## 0.7692308 0.2857143
# Error rate
1 - sum(diag(prop.table(conf)))
## [1] 0.4
The QDA is not an improvement.
The klaR package can display the results of an LDA or QDA, two variables at a time.
LDA
QDA
When we abide by Occam’s razor, the law of parsimony, we want the simplest model that still performs well. Selecting the right features – the right variables – has a huge impact on training times and prediction accuracy.
Data can contain attributes that are highly correlated with each other. Many methods perform better if highly correlated attributes are removed.
The caret package provides tools to automatically report on the relevance and importance of attributes in the data and select the most important features.
We use the Pima Indians Diabetes dataset.
## 'data.frame': 768 obs. of 9 variables:
## $ pregnant: num 6 1 8 1 0 5 3 10 2 8 ...
## $ glucose : num 148 85 183 89 137 116 78 115 197 125 ...
## $ pressure: num 72 66 64 66 40 74 50 0 70 96 ...
## $ triceps : num 35 29 0 23 35 0 32 0 45 0 ...
## $ insulin : num 0 0 0 94 168 0 88 0 543 0 ...
## $ mass : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
## $ pedigree: num 0.627 0.351 0.672 0.167 2.288 ...
## $ age : num 50 31 32 21 33 30 26 29 53 54 ...
## $ diabetes: Factor w/ 2 levels "neg","pos": 2 1 2 1 2 1 2 1 2 2 ...
Removing redundant features
We build the correlation matrix.
correlationMatrix <- cor(PimaIndiansDiabetes[,1:8])
correlationMatrix
## pregnant glucose pressure triceps insulin
## pregnant 1.00000000 0.12945867 0.14128198 -0.08167177 -0.07353461
## glucose 0.12945867 1.00000000 0.15258959 0.05732789 0.33135711
## pressure 0.14128198 0.15258959 1.00000000 0.20737054 0.08893338
## triceps -0.08167177 0.05732789 0.20737054 1.00000000 0.43678257
## insulin -0.07353461 0.33135711 0.08893338 0.43678257 1.00000000
## mass 0.01768309 0.22107107 0.28180529 0.39257320 0.19785906
## pedigree -0.03352267 0.13733730 0.04126495 0.18392757 0.18507093
## age 0.54434123 0.26351432 0.23952795 -0.11397026 -0.04216295
## mass pedigree age
## pregnant 0.01768309 -0.03352267 0.54434123
## glucose 0.22107107 0.13733730 0.26351432
## pressure 0.28180529 0.04126495 0.23952795
## triceps 0.39257320 0.18392757 -0.11397026
## insulin 0.19785906 0.18507093 -0.04216295
## mass 1.00000000 0.14064695 0.03624187
## pedigree 0.14064695 1.00000000 0.03356131
## age 0.03624187 0.03356131 1.00000000
We look for attributes that are highly correlated (rule of thumb: absolute correlation above 0.75; here we use a stricter cutoff of 0.50). Keep in mind this rule of thumb: 0.75.
highlyCorrelated <- findCorrelation(correlationMatrix, cutoff = 0.5)
highlyCorrelated
## [1] 8
row.names(correlationMatrix)[highlyCorrelated]
## [1] "age"
The 8th attribute is age.
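The idea behind findCorrelation can be sketched in base R on synthetic data (simplified: this flags every variable involved in a high-correlation pair, whereas caret removes only one member of each pair):

```r
set.seed(1)
n <- 100
x <- data.frame(a = rnorm(n))
x$b <- x$a + rnorm(n, sd = 0.1)  # nearly a copy of a
x$c <- rnorm(n)                  # unrelated noise
cm <- abs(cor(x))
diag(cm) <- 0                    # ignore self-correlations
# Flag variables whose strongest correlation exceeds the rule of thumb.
flagged <- names(which(apply(cm, 2, max) > 0.75))
flagged
```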
Rank features by importance
Some methods, such as decision trees, have a built-in mechanism to report on variable importance. For other algorithms, the importance can be estimated using a confusion matrix and a ROC curve analysis conducted on each attribute.
The example below constructs a Learning Vector Quantization (LVQ) model.
set.seed(7)
# Prepare the training
control <- trainControl(method = 'repeatedcv', number = 10, repeats = 3)
# Train the model
model <- train(diabetes ~ ., data = PimaIndiansDiabetes, method = 'lvq', preProcess = 'scale', trControl = control)
We then estimate the variable importance.
importance <- varImp(model, scale = FALSE)
importance
## ROC curve variable importance
##
## Importance
## glucose 0.7881
## mass 0.6876
## age 0.6869
## pregnant 0.6195
## pedigree 0.6062
## pressure 0.5865
## triceps 0.5536
## insulin 0.5379
And plot it.
The results show that the glucose, mass and age attributes are the top 3 most important attributes in the dataset and the insulin attribute is the least important.
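For a two-class outcome, the ROC-based filter importance reported above is, conceptually, the area under each predictor's own ROC curve. The AUC equals the normalized Mann-Whitney U statistic, so it can be computed from ranks alone; a sketch with toy data (variable names merely echo the dataset):

```r
# AUC of a single numeric predictor against a binary outcome, via ranks.
auc <- function(x, y) {  # y: logical, TRUE = positive class
  r  <- rank(x)
  n1 <- sum(y)
  n0 <- sum(!y)
  (sum(r[y]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
glucose <- c(1, 2, 3, 4, 5, 6, 7, 8)  # toy values
diab    <- c(FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE)
auc(glucose, diab)  # close to 1: the predictor separates the classes well
```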
Feature selection
Automatic feature selection methods can be used to build many models with different subsets of a dataset. We then identify attributes that are required to build an accurate model.
A popular automatic method is the Recursive Feature Elimination (RFE). A Random Forest algorithm is used on each iteration to evaluate the model. The algorithm is configured to explore all possible subsets of the attributes.
library(randomForest)
set.seed(7)
# Define the control using a random forest selection function
control <- rfeControl(functions = rfFuncs, method = 'cv', number = 10)
# Run the RFE algorithm (it takes some time)
results <- rfe(PimaIndiansDiabetes[,1:8], PimaIndiansDiabetes[,9], sizes = c(1:8), rfeControl = control)
results
##
## Recursive feature selection
##
## Outer resampling method: Cross-Validated (10 fold)
##
## Resampling performance over subset size:
##
## Variables Accuracy Kappa AccuracySD KappaSD Selected
## 1 0.6873 0.2583 0.05704 0.11679
## 2 0.7265 0.3748 0.04812 0.09870
## 3 0.7356 0.4086 0.04534 0.10020
## 4 0.7565 0.4547 0.04107 0.09508
## 5 0.7539 0.4429 0.05277 0.12191
## 6 0.7473 0.4324 0.03894 0.08530
## 7 0.7499 0.4329 0.04013 0.09785
## 8 0.7616 0.4584 0.04917 0.11000 *
##
## The top 5 variables (out of 8):
## glucose, mass, age, pregnant, pedigree
The asterisk marks the selected subset size: all 8 variables. The accuracy is over 0.75; a reminder of the rule of thumb we underlined above. The results also list the top five variables.
What is this 8th variable?
## VarName Variables Accuracy Kappa AccuracySD KappaSD
## 1 pregnant 1 0.6873377 0.2582596 0.05704175 0.11678728
## 2 glucose 2 0.7264696 0.3748388 0.04812330 0.09869825
## 3 pressure 3 0.7356288 0.4086387 0.04534209 0.10020218
## 4 triceps 4 0.7565277 0.4547458 0.04106932 0.09507553
## 5 insulin 5 0.7539303 0.4429472 0.05277373 0.12190532
## 6 mass 6 0.7473172 0.4324254 0.03894122 0.08530466
## 7 pedigree 7 0.7499316 0.4329196 0.04012769 0.09784581
## 8 age 8 0.7615858 0.4583660 0.04916710 0.11000058
age, again. Let us plot the results.
All 8 attributes are selected in this example. glucose, mass, age, pregnant, and pedigree correspond to columns 2, 6, 8, 1, and 7 of the dataset. The choice has to do with accuracy and other factors. There is no need for variables that would introduce multicollinearity into the model.
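The elimination loop at the heart of RFE can be sketched without caret. This simplified version uses an assumed criterion – absolute correlation with the outcome instead of random-forest importance – and drops the weakest predictor at each step:

```r
set.seed(7)
n <- 500
x <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
y <- 2 * x$x1 + x$x2 + rnorm(n)  # x3 is pure noise

vars    <- names(x)
dropped <- character(0)
while (length(vars) > 1) {
  # Score each remaining predictor; simplest possible criterion here.
  scores  <- sapply(vars, function(v) abs(cor(x[[v]], y)))
  weakest <- names(which.min(scores))
  dropped <- c(dropped, weakest)   # record elimination order
  vars    <- setdiff(vars, weakest)
}
dropped  # weakest first
vars     # last survivor
```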
Conclusion
The variable importance (from the caret package) and the RFE results (based on the randomForest package) are somewhat contradictory.
There is likely no best set of features just like there is no best model.
We must make choices given these results. The results will not tell us which variable to pick.