Foreword

  • Output options: ‘tango’ syntax, the ‘readable’ theme.
  • Snippets and results.
  • Source: US Census Bureau.


Descriptive Statistics

Inputting

Dataset

The dataset comes from the US Census Bureau.



Loading the data

On their website, open the Excel file ‘NST-EST2011-02’ about the annual estimates of the resident population. In the spreadsheet, starting from cell B11, copy and paste the data to the clipboard (from B11 to B61).

There are several ways to input data into R. With the clipboard, the safest way is to create a new worksheet and paste the data as values only, without any formatting or formulas. Make sure the data is clean, then copy the raw data and run:

USstatePops <- read.table(file = 'clipboard', sep = '\t', header = FALSE)
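
Note that file = 'clipboard' is Windows-specific; on macOS the clipboard is usually read with read.table(pipe('pbpaste'), ...). For a reproducible illustration, the text argument of read.table can stand in for the clipboard (the two rows below are sample values, not the full dataset):

```r
# simulate clipboard content: tab-separated, no header
clip <- "Alabama\t4779735\nAlaska\t710231"

# read.table can parse a character string directly via its 'text' argument
demo <- read.table(text = clip, sep = '\t', header = FALSE)
demo
```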

The data becomes an R object: a data frame. Print it out. Add column names.

head(USstatePops,5)
##           V1       V2
## 1    Alabama  4779735
## 2     Alaska   710231
## 3    Arizona  6392013
## 4   Arkansas  2915921
## 5 California 37253956
colnames(USstatePops) <- c('State', 'Pop')

head(USstatePops, 5)
##        State      Pop
## 1    Alabama  4779735
## 2     Alaska   710231
## 3    Arizona  6392013
## 4   Arkansas  2915921
## 5 California 37253956

Check out the data frame.

str(USstatePops)
## 'data.frame':    51 obs. of  2 variables:
##  $ State: Factor w/ 51 levels "Alabama","Alaska",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Pop  : int  4779735 710231 6392013 2915921 37253956 5029196 3574097 897934 601723 18801311 ...

There are 51 rows (the 50 states plus the District of Columbia). The states should be character strings, not factors. Extracting strings (or numbers) from a factor structure without any loss requires care:

# make a copy for safety
USstatePops2 <- USstatePops

# convert the factor to character (lossless for strings)
USstatePops2$State <- as.character(USstatePops2$State)
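
When the factor levels are numbers stored as text, the pitfall is sharper: as.numeric() applied to the factor returns the internal level codes, not the values. Indexing the levels by the factor itself is the lossless trick. A generic illustration, not tied to this dataset:

```r
# a factor of numbers stored as text
f <- factor(c('10', '2', '30'))

as.numeric(f)              # 1 2 3: the internal level codes, not the values
as.numeric(levels(f))[f]   # 10 2 30: the actual values, losslessly recovered
```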

Check out the new data frame.

str(USstatePops2)
## 'data.frame':    51 obs. of  2 variables:
##  $ State: chr  "Alabama" "Alaska" "Arizona" "Arkansas" ...
##  $ Pop  : int  4779735 710231 6392013 2915921 37253956 5029196 3574097 897934 601723 18801311 ...
head(USstatePops2, 5)
##        State      Pop
## 1    Alabama  4779735
## 2     Alaska   710231
## 3    Arizona  6392013
## 4   Arkansas  2915921
## 5 California 37253956

Better: the states are now strings (characters).

We now have a data frame to work with.

Exploration and visual inspection

Compute the mode, mean, and median. Beware: in R, mode() returns the storage mode of an object (here 'numeric'), not the statistical mode.

mode(USstatePops2$Pop)
## [1] "numeric"
mean(USstatePops2$Pop)
## [1] 6053834
median(USstatePops2$Pop)
## [1] 4339362
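
Since mode() reports the storage mode, the statistical mode needs a small helper; a sketch via table() (stat_mode is a name chosen here, not a base R function):

```r
# statistical mode: the most frequent value(s) in a vector
stat_mode <- function(x) {
  tab <- table(x)
  as.numeric(names(tab)[tab == max(tab)])
}

stat_mode(c(1, 2, 2, 3))   # 2
```

For USstatePops2$Pop every value is unique, so all 51 observations tie for the mode; the helper is mainly useful for discrete data.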

Draw a basic histogram.

hist(USstatePops2$Pop/1000000, xlab = 'Pop (\'000 000)', main = '', xlim = c(0, 50))

Control the number of bins.

hist(USstatePops2$Pop/1000000, breaks = 20, xlab = 'Pop (\'000 000)', main = '', xlim = c(0, 50))

Notice that 15 states have a population under 2M. Subset the dataset by removing these states.

USstatePops3 <- USstatePops2$Pop[USstatePops2$Pop >= 2000000]

How does it look now?

length(USstatePops3)
## [1] 36
str(USstatePops3)
##  int [1:36] 4779735 6392013 2915921 37253956 5029196 3574097 18801311 9687660 12830632 6483800 ...

We have 36 states left.

Print summaries: ‘Before’ vs. ‘After’ removing (in millions).

# before
summary(USstatePops2$Pop/1000000)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.5636  1.6970  4.3394  6.0538  6.6361 37.2540
# after
summary(USstatePops3/1000000)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.059   3.811   5.881   8.127   9.574  37.254

Mean confidence interval

First, compute the mean and standard deviation.

mean(USstatePops2$Pop)
## [1] 6053834
sd(USstatePops2$Pop)
## [1] 6823984

Second, display the distribution (or density).

hist(USstatePops2$Pop/1000000, breaks = 20, xlab = 'Pop (\'000 000)', main = '', xlim = c(0, 50))

Third, build the confidence interval; the easy way.

t.test(USstatePops2$Pop, conf.level = 0.95)$conf.int
## [1] 4134558 7973111
## attr(,"conf.level")
## [1] 0.95

Building the confidence interval; the hard way.

Assess the sample size.

dim(USstatePops2)[1]
## [1] 51

This is our sample size n = 51; the degrees of freedom are df = n - 1 = 50.

Compute the lower and upper limits of the 95 % confidence interval (with 2.5 % in each tail; the 97.5 % quantile of the t distribution with df = 50).

mean(USstatePops2$Pop) - qt(0.975, df = 50) * sd(USstatePops2$Pop) / sqrt(51)
## [1] 4134558
mean(USstatePops2$Pop) + qt(0.975, df = 50) * sd(USstatePops2$Pop) / sqrt(51)
## [1] 7973111
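
The hard way generalizes into a small helper; a sketch (t_ci is a hypothetical name) that reproduces the interval t.test computes:

```r
# mean confidence interval via the t distribution, df = n - 1
t_ci <- function(x, level = 0.95) {
  n <- length(x)
  q <- qt(1 - (1 - level) / 2, df = n - 1)
  mean(x) + c(-1, 1) * q * sd(x) / sqrt(n)
}

x <- c(4, 8, 15, 16, 23, 42)
t_ci(x)
t.test(x)$conf.int   # the same two numbers
```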

Dipping into inferential statistics

Simulate the state population with a normal distribution using the rnorm(n, mean, sd) function:

hist(rnorm(51, mean(USstatePops2$Pop/1000000), sd(USstatePops2$Pop/1000000)), main = '', xlab = 'Pop (\'000 000)')

Concluding remark: if we rule out the negative draws (a normal distribution allows them; populations cannot be negative), the simulation resembles the actual data. Loosely invoking the Pareto principle, we could say the simulation captures about 80 % of the actual pattern.



Inferential Statistics

Sampling basics

The average US state population is:

mean(USstatePops2$Pop)
## [1] 6053834
Notes

- Sample: a small part of the entire dataset.
- Population: the entire dataset.
- Observation: one item in a sample or in the population.

It can be confusing, since here the statistical 'population' is also the US state population.

In statistics, the 'population' could just as well be the daily output of unit A, all the predators living on a given territory, or the total inventory of books.

Sample the US state population (the ‘population’). Draw 16 states (or ‘observations’).

samp1 <- sample(USstatePops2$Pop, size = 16, replace = TRUE)
# in millions
samp1/1000000
##  [1]  8.001030 12.830632  9.883635  0.710231 25.145561 11.536502  4.533372
##  [8]  0.563626 11.536502  8.791894  2.915921  4.339362  0.897934 25.145561
## [15]  1.567582  6.483800
Note

Replace, replacement: a state can be drawn more than once during the sampling.
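
A quick illustration of the difference, on toy numbers rather than the census data:

```r
set.seed(1)

# without replacement: each value appears at most once,
# and size cannot exceed the pool
sample(1:5, size = 5, replace = FALSE)

# with replacement: duplicates are allowed,
# and size may exceed the pool
sample(1:5, size = 8, replace = TRUE)
```

Asking for more draws than the pool without replacement is an error, which is why the larger samplings below all set replace = TRUE.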

Compute statistics.

mean(samp1/1000000)
## [1] 8.430197
summary(samp1/1000000)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.5636  2.5788  7.2424  8.4302 11.5365 25.1456

Increasing the replications

Draw 4 samples of 16 observations, with replacement, and compute the ‘mean of means’.

samp2 <- replicate(4, mean(sample(USstatePops2$Pop, size = 16, replace = TRUE)), simplify = TRUE)

summary(samp2/1000000)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.765   5.005   6.351   6.392   7.739   8.102
mean(samp2/1000000)
## [1] 6.392239

Replacement also matters across replications: \(4~samples~\times~16~obs = 64\) draws in total and \(64 > 51~states\), so some states must be drawn more than once overall.

Compute 400 samples of 16 observations, with replacement.

samp3 <- replicate(400, mean(sample(USstatePops2$Pop, size = 16, replace = TRUE)), simplify = TRUE)

summary(samp3/1000000)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.667   4.576   5.501   5.856   6.728  11.136
mean(samp3/1000000)
## [1] 5.856276

Compute 4000 samples of 16 observations, with replacement.

samp4 <- replicate(4000, mean(sample(USstatePops2$Pop, size = 16, replace = TRUE)), simplify = TRUE)

summary(samp4/1000000)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.843   4.799   5.884   6.067   7.133  13.549
mean(samp4/1000000)
## [1] 6.067052

Compare the samples with the population.

c(mean(samp1), mean(samp2), mean(samp3), mean(samp4))/1000000
## [1] 8.430197 6.392239 5.856276 6.067052
# vs
mean(USstatePops2$Pop/1000000)
## [1] 6.053834

As the number of samples increases, the 'mean of means' gets closer to the US state population mean.

Draw the histograms.

par(mfrow = c(2,2))

hist(samp1, main = '1 sample')
hist(samp2, main = '4 samples')
hist(samp3, main = '400 samples')
hist(samp4, main = '4000 samples')

par(mfrow = c(1,1))

Concluding remark: the distribution of the sample means gets closer to a normal distribution as we increase the number of samples or replications.
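
This is the Central Limit Theorem at work: the means of repeated samples of size n spread out with a standard deviation close to sd/sqrt(n), whatever the shape of the source distribution. A sketch on synthetic skewed data (an assumption for illustration; any skewed 'population' would do):

```r
set.seed(42)
pop <- rexp(1000)   # a skewed synthetic 'population'

# 4000 sample means, 16 observations each, drawn with replacement
means <- replicate(4000, mean(sample(pop, size = 16, replace = TRUE)))

# the spread of the sample means is close to sd(pop)/sqrt(16)
c(empirical = sd(means), theoretical = sd(pop) / sqrt(16))
```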

Increasing the sample size

Draw among the 51 states, with replacement, using a sample size of 51. Start with 100 samples or replications and compute the 'mean of means'.

samp5 <- replicate(100, mean(sample(USstatePops2$Pop, size = 51, replace = TRUE)), simplify = TRUE)

mean(samp5/1000000)
## [1] 5.960696

Draw among 120 states. This is beyond the number of states, so replacement is indispensable. Start with 100 samples and compute the 'mean of means'.

samp6 <- replicate(100, mean(sample(USstatePops2$Pop, size = 120, replace = TRUE)), simplify = TRUE)

mean(samp6/1000000)
## [1] 5.988411

Compare the samples with the population.

c(mean(samp5), mean(samp6))/1000000
## [1] 5.960696 5.988411
# vs
mean(USstatePops2$Pop/1000000)
## [1] 6.053834

Concluding remark: the distribution of the sample means also gets closer to normal as the number of observations per sample increases.

Exploring a robust sample (or not?)

Run a thorough sampling and explore the results (size, average, summary, and quantiles).

sampleMeans <- replicate(10000, mean(sample(USstatePops2$Pop, size = 5, replace = TRUE)), simplify = TRUE)

length(sampleMeans)
## [1] 10000
mean(sampleMeans)/1000000
## [1] 6.040959
summary(sampleMeans)/1000000
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.7226  3.8545  5.3393  6.0410  7.5887 22.9986 
quantile(sampleMeans/1000000, probs = c(0.25, 0.50, 0.75))
##      25%      50%      75% 
## 3.854458 5.339335 7.588745
mean(USstatePops2$Pop)/1000000
## [1] 6.053834
summary(USstatePops2$Pop)/1000000
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.5636  1.6970  4.3394  6.0538  6.6361 37.2540 
quantile(USstatePops2$Pop/1000000, probs = c(0.25, 0.50, 0.75))
##      25%      50%      75% 
## 1.696962 4.339362 6.636084

The mean of the sample means (6.040959) is almost equal to the mean of the population (6.053834).

Concluding remark: this is a demonstration of the Law of Large Numbers.
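
The Law of Large Numbers can also be seen directly by tracking a running mean as observations accumulate; a sketch on synthetic uniform draws (an illustrative assumption, not the census data):

```r
set.seed(7)
x <- runif(10000)   # draws with expected value 0.5

# running mean after 1, 2, ..., 10000 observations
running_mean <- cumsum(x) / seq_along(x)

# the running mean settles near the expected value
tail(running_mean, 1)
```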

The rest of the statistics (min, max, quartiles) diverge from the population.

What about ‘rare events’ or ‘extreme values’?

quantile(sampleMeans/1000000, probs = c(0.025, 0.975))
##      2.5%     97.5% 
##  2.015484 13.576499 
sd(sampleMeans)/1000000
## [1] 3.019792
quantile(USstatePops2$Pop/1000000, probs = c(0.025, 0.975))
##       2.5%      97.5% 
##  0.6077275 23.7036967 
sd(USstatePops2$Pop)/1000000
## [1] 6.823984

Concluding remark: sampleMeans is not perfect, but it is still a good sample.

Keep in mind the differences between sampleMeans and the true values. We will compare true values with this new sample.

MysterySample <- c(3706690, 159358, 106405, 55519, 53883)
mean(MysterySample)/1000000
## [1] 0.816371
summary(MysterySample)/1000000
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.05388 0.05552 0.10640 0.81637 0.15936 3.70669 
quantile(MysterySample/1000000, probs = c(0.025, 0.975))
##      2.5%     97.5% 
## 0.0540466 3.3519568 
sd(MysterySample)/1000000
## [1] 1.616319
mean(USstatePops2$Pop)/1000000
## [1] 6.053834
summary(USstatePops2$Pop)/1000000
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.5636  1.6970  4.3394  6.0538  6.6361 37.2540 
quantile(USstatePops2$Pop/1000000, probs = c(0.025, 0.975))
##       2.5%      97.5% 
##  0.6077275 23.7036967 
sd(USstatePops2$Pop)/1000000
## [1] 6.823984

Compare sampleMeans, the previous sample, and the true values. The sample mean is lower (0.816371 < 6.040959 \(\approx\) 6.053834), the extremes are way off ([0.0540466, 3.3519568] vs. [2.015484, 13.576499] vs. [0.6077275, 23.7036967]), and the standard deviation is too small (1.616319 < 3.019792 < 6.823984).

If sampleMeans already differed somewhat from the true values (apart from the mean), this new sample is completely 'out of bounds'.

Concluding remark: MysterySample is a poor sample.

Back to sampleMeans. Compute a cut point for the population.

# population
sd(USstatePops2$Pop)/1000000
## [1] 6.823984
# standard error from 5 observations
stdError <- sd(USstatePops2$Pop)/sqrt(5)
stdError/1000000
## [1] 3.051779
# population cut point
CutPoint97_5 <- mean(USstatePops2$Pop) + (2 * stdError)
CutPoint97_5/1000000
## [1] 12.15739

Compute a cut point for the sample.

# sample cut point
quantile(sampleMeans/1000000, probs = c(0.975))
##   97.5% 
## 13.5765

The population cut point (12.15739) is close to the sample cut point (13.5765).

Concluding remark: sampleMeans is a good sample; the differences from the actual population come from randomness.

Testing (normality; mean & variance equality)

The data

Above, we compared two samples to the population:

c(mean(samp5), mean(samp6))/1000000
## [1] 5.960696 5.988411
# vs
mean(USstatePops2$Pop/1000000)
## [1] 6.053834

Plot the distribution of each sample (samp5 and samp6), keeping the axes identical across plots.

two_sample <- data.frame(First = samp5, Second = samp6)

# boxplot
#boxplot(two_sample$First/1000000, ylab = 'Density', main = 'First', ylim = c(3,9), outcol = 'red3')
#boxplot(two_sample$Second/1000000, ylab = 'Density', main = 'Second', ylim = c(3,9), outcol = 'red3')

boxplot(two_sample$First/1000000, two_sample$Second/1000000, ylab = 'Density', main = '', ylim = c(3,9), outcol = 'red3', col = topo.colors(2), names = c('First', 'Second'))

#axis(1, at = 1:2, labels = c('First', 'Second'))

par(mfrow = c(1,2))

# Q-Q plot
qqnorm(two_sample$First, main = 'First', ylim = c(3500000, 9000000))
qqline(two_sample$First)

qqnorm(two_sample$Second, main = 'Second', ylim = c(3500000, 9000000))
qqline(two_sample$Second)

par(mfrow = c(1,1))
par(mfrow = c(2,1))

x <- two_sample$First

h <- hist(x, breaks = 10, main = 'First', xlab = '', xlim = c(3500000, 9000000))

xfit <- seq(min(x), max(x),length = 40)
yfit <- dnorm(xfit, mean = mean(x), sd = sd(x))
yfit <- yfit*diff(h$mids[1:2])*length(x)

lines(xfit, yfit, col = 'blue', lwd = 2)

x <- two_sample$Second

h <- hist(x, breaks = 10, main = 'Second', xlab = '', xlim = c(3500000, 9000000))

xfit <- seq(min(x), max(x),length = 40)
yfit <- dnorm(xfit, mean = mean(x), sd = sd(x))
yfit <- yfit*diff(h$mids[1:2])*length(x)

lines(xfit, yfit, col = 'blue', lwd = 2)

plot(density(two_sample$First), main = 'First', xlim = c(3500000, 9000000))
polygon(density(two_sample$First), col = 'red3', border = 'blue3')
plot(density(two_sample$Second), main = 'Second', xlim = c(3500000, 9000000))
polygon(density(two_sample$Second), col = 'red3', border = 'blue3')

par(mfrow = c(1,1))

Both samples are the result of bootstraps: First comes from (51 states x 100 samples =) 5100 draws and Second from (120 states x 100 samples =) 12000 draws.

Compute the basic statistics: mean, standard deviation, and quantiles.

sapply(two_sample, mean, na.rm = TRUE)
##   First  Second 
## 5960696 5988411
sapply(two_sample, sd, na.rm = TRUE)
##    First   Second 
## 985418.0 562646.7
sapply(two_sample, quantile, na.rm = TRUE)
##        First  Second
## 0%   3843361 4812996
## 25%  5302991 5662777
## 50%  5841323 5943015
## 75%  6422946 6295298
## 100% 8785660 7770049

The visual inspection and the simple statistics show how the two samples differ. Nonetheless, these statistics remain descriptive; we cannot be 100 % sure.

Test normality

# Parametric
sapply(two_sample, shapiro.test)
##           First                         Second                       
## statistic 0.9738167                     0.983518                     
## p.value   0.04361168                    0.247159                     
## method    "Shapiro-Wilk normality test" "Shapiro-Wilk normality test"
## data.name "X[[i]]"                      "X[[i]]"
  • First: we reject the hypothesis of normality at the 5 % level (p-value ≈ 0.044); the visual inspection also showed departures from the normal shape in the tails.
  • Second: we fail to reject normality (p-value ≈ 0.25), the benefit of the larger sample size behind each mean.
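
The behaviour of shapiro.test on known distributions can be checked on synthetic draws (illustrative data, not the bootstrap samples):

```r
set.seed(2)

# clearly non-normal (skewed) data: the test rejects decisively
p_skewed <- shapiro.test(rexp(100))$p.value

# normal data: typically a large p-value, no rejection
p_normal <- shapiro.test(rnorm(100))$p.value

c(skewed = p_skewed, normal = p_normal)
```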

Test variance equality

# Parametric
var.test(two_sample$First, two_sample$Second, conf.level = 0.95)
## 
##  F test to compare two variances
## 
## data:  two_sample$First and two_sample$Second
## F = 3.0674, num df = 99, denom df = 99, p-value = 5.756e-08
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  2.063870 4.558863
## sample estimates:
## ratio of variances 
##           3.067393

The variances are not equal (the p-value is far below 0.05).

When variances are equal and distributions are normal, we can proceed with an ANOVA. If variances are unequal but distributions are normal, we can use oneway.test, which tests whether two or more samples have the same mean (setting the var.equal parameter accordingly).
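
A sketch of oneway.test on synthetic data with unequal variances (the data here is made up for illustration; with var.equal = FALSE the function performs the Welch-type test):

```r
set.seed(123)

# two groups with the same true mean but different variances
long <- data.frame(
  value = c(rnorm(100, mean = 6, sd = 1), rnorm(100, mean = 6, sd = 0.5)),
  group = factor(rep(c('First', 'Second'), each = 100))
)

# Welch-type one-way test of equal means, not assuming equal variances
res <- oneway.test(value ~ group, data = long, var.equal = FALSE)
res
```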

Other variance tests:

  • The bartlett.test.
  • The fligner.test for a rank-based (nonparametric) k-sample test for homogeneity of variances.
  • The ansari.test and the mood.test, two rank-based two-sample tests for a difference in scale.
ansari.test(two_sample$First, two_sample$Second)
## 
##  Ansari-Bradley test
## 
## data:  two_sample$First and two_sample$Second
## AB = 4293, p-value = 0.0002161
## alternative hypothesis: true ratio of scales is not equal to 1
mood.test(two_sample$First, two_sample$Second)
## 
##  Mood two-sample test of scale
## 
## data:  two_sample$First and two_sample$Second
## Z = 4.2914, p-value = 1.776e-05
## alternative hypothesis: two.sided

Test equal means

# Parametric
t.test(two_sample$First, two_sample$Second, alternative = 'two.sided', conf.level = 0.95, var.equal = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  two_sample$First and two_sample$Second
## t = -0.24424, df = 157.35, p-value = 0.8074
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -251842.3  196412.6
## sample estimates:
## mean of x mean of y 
##   5960696   5988411

The means are not significantly different: with a p-value of 0.81, we cannot reject the hypothesis of equal means.

# Non parametric
wilcox.test(two_sample$First, two_sample$Second, alternative = 'two.sided', paired = TRUE, conf.int = TRUE, conf.level = 0.95)
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  two_sample$First and two_sample$Second
## V = 2358, p-value = 0.567
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
##  -276301.1  158870.9
## sample estimates:
## (pseudo)median 
##      -62175.12

We confirm the t.test.

# Non parametric
kruskal.test(two_sample$First, two_sample$Second)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  two_sample$First and two_sample$Second
## Kruskal-Wallis chi-squared = 99, df = 99, p-value = 0.4811
# Or
kruskal.test(First ~ Second, data = two_sample)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  First by Second
## Kruskal-Wallis chi-squared = 99, df = 99, p-value = 0.4811

We confirm the t.test: again, we cannot reject equality.
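
Note, however, that kruskal.test treats its second argument as a grouping factor; passing the continuous Second vector makes every observation its own group (hence df = 99 above). For a proper two-sample comparison, the list form is safer; a sketch on synthetic data:

```r
set.seed(1)
a <- rnorm(100, mean = 6, sd = 1)
b <- rnorm(100, mean = 6, sd = 0.5)

# two groups => df = k - 1 = 1
res <- kruskal.test(list(a, b))
res
```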