Foreword
Dataset
The dataset comes from the US Census Bureau.
Loading the data
On their website, open the Excel file ‘NST-EST2011-02’ with the annual estimates of the resident population. In the spreadsheet, select the data from cell B11 to B61 and copy it to the clipboard.
There are several ways to input data into R. With the clipboard, the safest approach is to create a new worksheet and paste the raw data into it, without any formatting or formulas (values only). Make sure the data is clean, then copy it and run:
USstatePops <- read.table(file = 'clipboard', sep = '\t', header = FALSE)
The data becomes an R object: a data frame. Print it out, then add column names.
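Note that file = 'clipboard' is Windows-specific; on macOS the usual equivalent is read.table(pipe('pbpaste'), ...). For a reproducible sketch without a clipboard, the same call can read tab-separated text from a text connection (the two rows below are taken from the dataset):

```r
# Reproducible stand-in for the clipboard: tab-separated text in a string
txt <- "Alabama\t4779735\nAlaska\t710231"
USstatePops <- read.table(textConnection(txt), sep = '\t', header = FALSE)
str(USstatePops)
```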
head(USstatePops,5)## V1 V2
## 1 Alabama 4779735
## 2 Alaska 710231
## 3 Arizona 6392013
## 4 Arkansas 2915921
## 5 California 37253956
colnames(USstatePops) <- c('State', 'Pop')
head(USstatePops, 5)## State Pop
## 1 Alabama 4779735
## 2 Alaska 710231
## 3 Arizona 6392013
## 4 Arkansas 2915921
## 5 California 37253956
Check out the data frame.
str(USstatePops)## 'data.frame': 51 obs. of 2 variables:
## $ State: Factor w/ 51 levels "Alabama","Alaska",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Pop : int 4779735 710231 6392013 2915921 37253956 5029196 3574097 897934 601723 18801311 ...
There are 51 states (50 states and the District of Columbia). The states should be character strings, not factors. Applying as.character() directly to the factor column recovers the labels without any loss:
# make a copy for safety
USstatePops2 <- USstatePops
# convert the factor to characters
# (note: as.character(levels(f)) alone only works when the rows happen
# to be sorted in level order, as they are here)
USstatePops2$State <- as.character(USstatePops2$State)
Check out the new data frame.
str(USstatePops2)## 'data.frame': 51 obs. of 2 variables:
## $ State: chr "Alabama" "Alaska" "Arizona" "Arkansas" ...
## $ Pop : int 4779735 710231 6392013 2915921 37253956 5029196 3574097 897934 601723 18801311 ...
head(USstatePops2, 5)## State Pop
## 1 Alabama 4779735
## 2 Alaska 710231
## 3 Arizona 6392013
## 4 Arkansas 2915921
## 5 California 37253956
Better: the states are now character strings.
We now have a data frame to work with.
Compute the mode, mean, and median.
mode(USstatePops2$Pop)## [1] "numeric"
mean(USstatePops2$Pop)## [1] 6053834
median(USstatePops2$Pop)## [1] 4339362
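In R, mode() reports the storage mode of an object ("numeric" above), not the statistical mode, for which there is no built-in function. A minimal helper (the name stat_mode is ours):

```r
# Most frequent value of a vector (first one wins on ties)
stat_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
stat_mode(c(1, 2, 2, 3))  # 2
```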
Draw a basic histogram.
hist(USstatePops2$Pop/1000000, xlab = 'Pop (\'000 000)', main = '', xlim = c(0, 50))
Control the number of bins.
hist(USstatePops2$Pop/1000000, breaks = 20, xlab = 'Pop (\'000 000)', main = '', xlim = c(0, 50))
Notice that 15 states have a population under 2M. Subset the dataset by removing these states.
USstatePops3 <- USstatePops2$Pop[USstatePops2$Pop >= 2000000]
How does it look now?
length(USstatePops3)## [1] 36
str(USstatePops3)## int [1:36] 4779735 6392013 2915921 37253956 5029196 3574097 18801311 9687660 12830632 6483800 ...
We have 36 states left.
Print summaries: ‘Before’ vs. ‘After’ removing (in millions).
# before
summary(USstatePops2$Pop/1000000)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.5636 1.6970 4.3394 6.0538 6.6361 37.2540
# after
summary(USstatePops3/1000000)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.059 3.811 5.881 8.127 9.574 37.254
First, compute the mean and standard deviation.
mean(USstatePops2$Pop)## [1] 6053834
sd(USstatePops2$Pop)## [1] 6823984
Second, display the distribution (or density).
hist(USstatePops2$Pop/1000000, breaks = 20, xlab = 'Pop (\'000 000)', main = '', xlim = c(0, 50))
Third, build the confidence interval; the easy way.
t.test(USstatePops2$Pop, conf.level = 0.95)$conf.int## [1] 4134558 7973111
## attr(,"conf.level")
## [1] 0.95
Building the confidence interval; the hard way.
Assess the sample size.
dim(USstatePops2)[1]## [1] 51
This is our sample size n; the degrees of freedom of the t distribution are df = n − 1 = 50.
Compute the lower and upper limits of the 95 % confidence interval (2.5 % in each tail, i.e. the 97.5 % quantile). Note that the code below uses df = 51 rather than n − 1 = 50, which is why its limits differ slightly from the t.test() output above.
mean(USstatePops2$Pop) - qt(0.975, df = 51) * sd(USstatePops2$Pop) / sqrt(51)## [1] 4135490
mean(USstatePops2$Pop) + qt(0.975, df = 51) * sd(USstatePops2$Pop) / sqrt(51)## [1] 7972179
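As a check, using df = n − 1 reproduces the t.test() interval exactly. A small sketch on simulated stand-in data (the rnorm parameters are arbitrary, chosen to mimic the state populations):

```r
set.seed(1)
x <- rnorm(51, mean = 6e6, sd = 6.8e6)  # stand-in for USstatePops2$Pop
n <- length(x)
# manual 95 % CI with the correct degrees of freedom
manual <- mean(x) + c(-1, 1) * qt(0.975, df = n - 1) * sd(x) / sqrt(n)
auto <- t.test(x, conf.level = 0.95)$conf.int
all.equal(manual, as.numeric(auto))  # TRUE
```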
Simulate the state populations with a normal distribution, using the rnorm(n, mean, sd) function:
hist(rnorm(51, mean(USstatePops2$Pop/1000000), sd(USstatePops2$Pop/1000000)), main = '', xlab = 'Pop (\'000 000)')
Conclusive remark: if we rule out the negative numbers (which a normal model allows but a real population cannot take), the simulated distribution roughly resembles the actual data; loosely speaking, in the spirit of the Pareto principle, it captures most of the actual values.
The average US state population is:
## [1] 6053834
Notes
- Sample: a small part of the entire dataset.
- Population: the entire dataset.
- Observation: one item in the population sample.
It can be confusing, since here the 'population' is also the US state population.
In statistics, the 'population' could just as well be the daily output of unit A, all predators living in a given territory, or the total inventory of books.
Sample the US state population (the ‘population’). Draw 16 states (or ‘observations’).
samp1 <- sample(USstatePops2$Pop, size = 16, replace = TRUE)
# in millions
samp1/1000000## [1] 8.001030 12.830632 9.883635 0.710231 25.145561 11.536502 4.533372
## [8] 0.563626 11.536502 8.791894 2.915921 4.339362 0.897934 25.145561
## [15] 1.567582 6.483800
Note
Replace, replacement: a state can be drawn more than once during the sampling.
Compute statistics.
mean(samp1/1000000)## [1] 8.430197
summary(samp1/1000000)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.5636 2.5788 7.2424 8.4302 11.5365 25.1456
Draw 4 samples of 16 observations, with replacement, and compute the ‘mean of means’.
samp2 <- replicate(4, mean(sample(USstatePops2$Pop, size = 16, replace = TRUE)), simplify = TRUE)
summary(samp2/1000000)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.765 5.005 6.351 6.392 7.739 8.102
mean(samp2/1000000)## [1] 6.392239
Replacement is also necessary since \(4~samples~\times~16~obs = 64\) and \(64 > 51~states\). Some states must be drawn more than once.
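The point is easy to verify: sample() refuses to draw more items than the vector holds unless replace = TRUE.

```r
x <- 1:51                                   # stand-in for the 51 states
s <- sample(x, size = 120, replace = TRUE)  # fine with replacement
# without replacement, the same call raises an error
failed <- tryCatch({ sample(x, size = 120, replace = FALSE); FALSE },
                   error = function(e) TRUE)
failed  # TRUE: cannot take a sample larger than the population
```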
Compute 100 samples of 16 observations, with replacement.
samp3 <- replicate(100, mean(sample(USstatePops2$Pop, size = 16, replace = TRUE)), simplify = TRUE)
summary(samp3/1000000)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.667 4.576 5.501 5.856 6.728 11.136
mean(samp3/1000000)## [1] 5.856276
Compute 4000 samples of 16 observations, with replacement.
samp4 <- replicate(4000, mean(sample(USstatePops2$Pop, size = 16, replace = TRUE)), simplify = TRUE)
summary(samp4/1000000)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.843 4.799 5.884 6.067 7.133 13.549
mean(samp4/1000000)## [1] 6.067052
Compare the samples with the population.
c(mean(samp1), mean(samp2), mean(samp3), mean(samp4))/1000000## [1] 8.430197 6.392239 5.856276 6.067052
# vs
mean(USstatePops2$Pop/1000000)## [1] 6.053834
As the number of samples increases, the ‘means of means’ are getting closer to the US state population mean.
Draw the histograms.
par(mfrow = c(2,2))
hist(samp1, main = '1 sample')
hist(samp2, main = '4 samples')
hist(samp3, main = '100 samples')
hist(samp4, main = '4000 samples')
par(mfrow = c(1,1))
Conclusive remark: the distribution of the sample means gets closer to a normal distribution as we increase the number of samples or replications.
Draw samples of 51 states, with replacement. Start with 100 samples (replications) and compute the ‘mean of means’.
samp5 <- replicate(100, mean(sample(USstatePops2$Pop, size = 51, replace = TRUE)), simplify = TRUE)
mean(samp5/1000000)## [1] 5.960696
Draw samples of 120 states. That is more than the number of states, so ‘replacement’ is absolutely required. Start with 100 samples and compute the ‘mean of means’.
samp6 <- replicate(100, mean(sample(USstatePops2$Pop, size = 120, replace = TRUE)), simplify = TRUE)
mean(samp6/1000000)## [1] 5.988411
Compare the samples with the population.
c(mean(samp5), mean(samp6))/1000000## [1] 5.960696 5.988411
# vs
mean(USstatePops2$Pop/1000000)## [1] 6.053834
Conclusive remark: the ‘means of means’ also get closer to the population mean as the number of observations per sample increases.
Run a larger sampling (10,000 samples of 5 observations each) and explore the results (size, average, summary, and quantiles).
sampleMeans <- replicate(10000, mean(sample(USstatePops2$Pop, size = 5, replace = TRUE)), simplify = TRUE)
length(sampleMeans)## [1] 10000
mean(sampleMeans)/1000000
summary(sampleMeans)/1000000
quantile(sampleMeans/1000000, probs = c(0.25, 0.50, 0.75))## [1] 6.040959
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.7226 3.8545 5.3393 6.0410 7.5887 22.9986
## 25% 50% 75%
## 3.854458 5.339335 7.588745
mean(USstatePops2$Pop)/1000000
summary(USstatePops2$Pop)/1000000
quantile(USstatePops2$Pop/1000000, probs = c(0.25, 0.50, 0.75))## [1] 6.053834
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.5636 1.6970 4.3394 6.0538 6.6361 37.2540
## 25% 50% 75%
## 1.696962 4.339362 6.636084
The mean of the sample means (6.040959) is almost equal to the mean of the population (6.053834).
Conclusive remark: this is a demonstration of the Law of Large Numbers.
The rest of the statistics (min, max, quartiles) diverge from the population.
What about ‘rare events’ or ‘extreme values’?
quantile(sampleMeans/1000000, probs = c(0.025, 0.975))
sd(sampleMeans)/1000000## 2.5% 97.5%
## 2.015484 13.576499
## [1] 3.019792
quantile(USstatePops2$Pop/1000000, probs = c(0.025, 0.975))
sd(USstatePops2$Pop)/1000000## 2.5% 97.5%
## 0.6077275 23.7036967
## [1] 6.823984
Conclusive remark: sampleMeans is not perfect, but it is still a good sample.
Keep in mind the differences between sampleMeans and the true values. We will now compare the true values with a new sample.
MysterySample <- c(3706690, 159358, 106405, 55519, 53883)
mean(MysterySample)/1000000
summary(MysterySample)/1000000
quantile(MysterySample/1000000, probs = c(0.025, 0.975))
sd(MysterySample)/1000000## [1] 0.816371
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.05388 0.05552 0.10640 0.81637 0.15936 3.70669
## 2.5% 97.5%
## 0.0540466 3.3519568
## [1] 1.616319
mean(USstatePops2$Pop)/1000000
summary(USstatePops2$Pop)/1000000
quantile(USstatePops2$Pop/1000000, probs = c(0.025, 0.975))
sd(USstatePops2$Pop)/1000000## [1] 6.053834
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.5636 1.6970 4.3394 6.0538 6.6361 37.2540
## 2.5% 97.5%
## 0.6077275 23.7036967
## [1] 6.823984
Compare MysterySample with sampleMeans and the true values. The sample mean is far lower (0.816371 vs. 6.040959 \(\approx\) 6.053834), the extremes are way off ([0.0540466, 3.3519568] vs. [2.015484, 13.576499] vs. [0.6077275, 23.7036967]), and the standard deviation is too small (1.616319 < 3.019792 < 6.823984).
Where sampleMeans differed from the true values in everything but the mean, this new sample is totally ‘out of bounds’.
Conclusive remark: MysterySample is a poor sample.
Back to sampleMeans. Compute a cut point for the population.
# population
sd(USstatePops2$Pop)/1000000## [1] 6.823984
# standard error from 5 observations
stdError <- sd(USstatePops2$Pop)/sqrt(5)
stdError/1000000## [1] 3.051779
# population cut point
CutPoint97_5 <- mean(USstatePops2$Pop) + (2 * stdError)
CutPoint97_5/1000000## [1] 12.15739
Compute a cut point for the sample.
# sample cut point
quantile(sampleMeans/1000000, probs = c(0.975))## 97.5%
## 13.5765
The population cut point (12.15739) is close to the sample cut point (13.5765).
Conclusive remark: sampleMeans is a good sample, and the difference from the actual population comes from randomness.
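The standard-error reasoning above rests on the formula sd/sqrt(n). A quick simulation with a skewed stand-in population (rexp is our arbitrary choice) confirms that the spread of many sample means matches the formula:

```r
set.seed(5)
pop <- rexp(51, rate = 1/6e6)   # skewed stand-in for the state populations
# means of 10,000 bootstrap samples of 5 observations
m <- replicate(10000, mean(sample(pop, size = 5, replace = TRUE)))
c(simulated = sd(m), formula = sd(pop) / sqrt(5))  # nearly identical
```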
The data
Above, we compared two samples to the population:
c(mean(samp5), mean(samp6))/1000000## [1] 5.960696 5.988411
# vs
mean(USstatePops2$Pop/1000000)## [1] 6.053834
Plot the distribution of each sample (samp5 and samp6), keeping both plots on the same axis scale.
two_sample <- data.frame(First = samp5, Second = samp6)
# boxplot
boxplot(two_sample$First/1000000, two_sample$Second/1000000, ylab = 'Pop (\'000 000)', main = '', ylim = c(3,9), outcol = 'red3', col = topo.colors(2), names = c('First', 'Second'))
par(mfrow = c(1,2))
# Q-Q plot
qqnorm(two_sample$First, main = 'First', ylim = c(3500000, 9000000))
qqline(two_sample$First)
qqnorm(two_sample$Second, main = 'Second', ylim = c(3500000, 9000000))
qqline(two_sample$Second)
par(mfrow = c(1,1))
par(mfrow = c(2,1))
x <- two_sample$First
h <- hist(x, breaks = 10, main = 'First', xlab = '', xlim = c(3500000, 9000000))
xfit <- seq(min(x), max(x),length = 40)
yfit <- dnorm(xfit, mean = mean(x), sd = sd(x))
yfit <- yfit*diff(h$mids[1:2])*length(x)
lines(xfit, yfit, col = 'blue', lwd = 2)
x <- two_sample$Second
h <- hist(x, breaks = 10, main = 'Second', xlab = '', xlim = c(3500000, 9000000))
xfit <- seq(min(x), max(x),length = 40)
yfit <- dnorm(xfit, mean = mean(x), sd = sd(x))
yfit <- yfit*diff(h$mids[1:2])*length(x)
lines(xfit, yfit, col = 'blue', lwd = 2)
plot(density(two_sample$First), main = 'First', xlim = c(3500000, 9000000))
polygon(density(two_sample$First), col = 'red3', border = 'blue3')
plot(density(two_sample$Second), main = 'Second', xlim = c(3500000, 9000000))
polygon(density(two_sample$Second), col = 'red3', border = 'blue3')
par(mfrow = c(1,1))
Both samples are the result of bootstraps: First is built from (51 states × 100 samples =) 5,100 observations and Second from (120 states × 100 samples =) 12,000 observations.
Compute the basic statistics: mean, standard deviation, and quantiles.
sapply(two_sample, mean, na.rm = TRUE)## First Second
## 5960696 5988411
sapply(two_sample, sd, na.rm = TRUE)## First Second
## 985418.0 562646.7
sapply(two_sample, quantile, na.rm = TRUE)## First Second
## 0% 3843361 4812996
## 25% 5302991 5662777
## 50% 5841323 5943015
## 75% 6422946 6295298
## 100% 8785660 7770049
The visual inspection and simple statistics show how the two samples differ. Nonetheless, these statistics remain descriptive; we need formal tests to draw firm conclusions.
Test normality
# Parametric
sapply(two_sample, shapiro.test)## First Second
## statistic 0.9738167 0.983518
## p.value 0.04361168 0.247159
## method "Shapiro-Wilk normality test" "Shapiro-Wilk normality test"
## data.name "X[[i]]" "X[[i]]"
First: we reject the hypothesis of normality (p = 0.044; the visualizations show a platykurtic distribution with fat tails).
Second: we fail to reject the hypothesis of normality (p = 0.247; the result of a larger sampling).
Test variance equality
# Parametric
var.test(two_sample$First, two_sample$Second, conf.level = 0.95)##
## F test to compare two variances
##
## data: two_sample$First and two_sample$Second
## F = 3.0674, num df = 99, denom df = 99, p-value = 5.756e-08
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 2.063870 4.558863
## sample estimates:
## ratio of variances
## 3.067393
Variances are not equal.
When variances are equal and distributions are normal, we can proceed with an ANOVA. When variances are unequal but distributions are normal, we can use oneway.test, which tests whether two or more samples have the same mean (setting the parameter var.equal accordingly).
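A sketch of oneway.test on stand-in data (two normal groups with unequal spread; all numbers here are arbitrary). With two groups and var.equal = FALSE, it agrees with the Welch t-test:

```r
set.seed(2)
# two groups with the same mean but different standard deviations
d <- data.frame(value = c(rnorm(100, mean = 6, sd = 1.0),
                          rnorm(100, mean = 6, sd = 0.6)),
                group = rep(c('First', 'Second'), each = 100))
oneway.test(value ~ group, data = d, var.equal = FALSE)  # Welch one-way test
```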
Other variance tests:
- bartlett.test for a normal-theory k-sample test for homogeneity of variances.
- fligner.test for a rank-based (nonparametric) k-sample test for homogeneity of variances.
- ansari.test and mood.test for two rank-based two-sample tests for difference in scale.
ansari.test(two_sample$First, two_sample$Second)##
## Ansari-Bradley test
##
## data: two_sample$First and two_sample$Second
## AB = 4293, p-value = 0.0002161
## alternative hypothesis: true ratio of scales is not equal to 1
mood.test(two_sample$First, two_sample$Second)##
## Mood two-sample test of scale
##
## data: two_sample$First and two_sample$Second
## Z = 4.2914, p-value = 1.776e-05
## alternative hypothesis: two.sided
Test equal means
# Parametric
t.test(two_sample$First, two_sample$Second, alternative = 'two.sided', conf.level = 0.95, var.equal = FALSE)##
## Welch Two Sample t-test
##
## data: two_sample$First and two_sample$Second
## t = -0.24424, df = 157.35, p-value = 0.8074
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -251842.3 196412.6
## sample estimates:
## mean of x mean of y
## 5960696 5988411
With a p-value of 0.8074, we fail to reject the null hypothesis: there is no evidence that the means differ.
# Non parametric
wilcox.test(two_sample$First, two_sample$Second, alternative = 'two.sided', paired = TRUE, conf.int = TRUE, conf.level = 0.95)##
## Wilcoxon signed rank test with continuity correction
##
## data: two_sample$First and two_sample$Second
## V = 2358, p-value = 0.567
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
## -276301.1 158870.9
## sample estimates:
## (pseudo)median
## -62175.12
We confirm the t.test result. (Note that paired = TRUE assumes matched observations; since samp5 and samp6 are independent samples, paired = FALSE would be more appropriate here.)
# Non parametric
kruskal.test(two_sample$First, two_sample$Second)##
## Kruskal-Wallis rank sum test
##
## data: two_sample$First and two_sample$Second
## Kruskal-Wallis chi-squared = 99, df = 99, p-value = 0.4811
# Or
kruskal.test(First ~ Second, data = two_sample)##
## Kruskal-Wallis rank sum test
##
## data: First by Second
## Kruskal-Wallis chi-squared = 99, df = 99, p-value = 0.4811
We again find no significant difference. Note, however, that kruskal.test(x, g) treats its second argument as a grouping factor; with 100 distinct values in Second, each ‘group’ contains a single observation, so neither call above is a meaningful two-sample test.
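For reference, kruskal.test expects a grouping factor as its second argument; a correct two-sample call stacks the values, as in this sketch on stand-in data (all numbers arbitrary):

```r
set.seed(3)
# stack the two samples and label them with an explicit group factor
stacked <- data.frame(value = c(rnorm(100, mean = 6), rnorm(100, mean = 6)),
                      group = factor(rep(c('First', 'Second'), each = 100)))
kruskal.test(value ~ group, data = stacked)  # df = 1 for two groups
```

With two groups this is equivalent to the Wilcoxon rank-sum (Mann–Whitney) test.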