Foreword
This project is based on the OpenIntro book.
The data consists of Excel and RData files.
In addition, the project defines four custom functions:
plot_ci, for plotting confidence intervals (section 5).
inference, for running bootstraps and plotting results (section 6).
plot_ss, for plotting the sum of squares (section 8).
multiLine, for plotting regressions with dummy variables (section 9).
Welcome!
Load a dataset into R
The present dataset refers to the number of male and female births in the United States. It contains the data for all of the years from 1940 to 2002.
# This will print the dataset in the console
head(present)
## year boys girls
## 1 1940 1211684 1148715
## 2 1941 1289734 1223693
## 3 1942 1444365 1364631
## 4 1943 1508959 1427901
## 5 1944 1435301 1359499
## 6 1945 1404587 1330869
Examining the dataset
# Print the names of the variables of the data frame
names(present)
## [1] "year" "boys" "girls"
# Print the number of rows and variables with the 'dim' function
dim(present)
## [1] 63 3
Some more exploration
# Find the number of boys born each year, and assign the answer to 'num_boys'
num_boys <- present$boys
# Find the number of girls born each year, and assign the answer to 'num_girls'
num_girls <- present$girls
# This will print the results
num_boys
## [1] 1211684 1289734 1444365 1508959 1435301 1404587 1691220 1899876
## [9] 1813852 1826352 1823555 1923020 1971262 2001798 2059068 2073719
## [17] 2133588 2179960 2152546 2173638 2179708 2186274 2132466 2101632
## [25] 2060162 1927054 1845862 1803388 1796326 1846572 1915378 1822910
## [33] 1669927 1608326 1622114 1613135 1624436 1705916 1709394 1791267
## [41] 1852616 1860272 1885676 1865553 1879490 1927983 1924868 1951153
## [49] 2002424 2069490 2129495 2101518 2082097 2048861 2022589 1996355
## [57] 1990480 1985596 2016205 2026854 2076969 2057922 2057979
num_girls
## [1] 1148715 1223693 1364631 1427901 1359499 1330869 1597452 1800064
## [9] 1721216 1733177 1730594 1827830 1875724 1900322 1958294 1973576
## [17] 2029502 2074824 2051266 2071158 2078142 2082052 2034896 1996388
## [25] 1967328 1833304 1760412 1717571 1705238 1753634 1816008 1733060
## [33] 1588484 1528639 1537844 1531063 1543352 1620716 1623885 1703131
## [41] 1759642 1768966 1794861 1773380 1789651 1832578 1831679 1858241
## [49] 1907086 1971468 2028717 2009389 1982917 1951379 1930178 1903234
## [57] 1901014 1895298 1925348 1932563 1981845 1968011 1963747
Putting it in a graph
# Type here the code to create the plot
plot(present$year, present$girls)
Connecting the dots
# Create the plot here
plot(present$year, present$girls, type = 'l')
There is initially an increase in the number of girls born, which peaks around 1960. After 1960 there is a decrease in the number of girls born, but the number begins to increase again in the early 1970s. Overall the trend is an increase in the number of girls born in the US since the 1940s.
R, the big calculator
# The vector babies
babies <- present$boys + present$girls
present2 <- cbind(present, babies)
head(present2)
## year boys girls babies
## 1 1940 1211684 1148715 2360399
## 2 1941 1289734 1223693 2513427
## 3 1942 1444365 1364631 2808996
## 4 1943 1508959 1427901 2936860
## 5 1944 1435301 1359499 2794800
## 6 1945 1404587 1330869 2735456
# The plot
plot(present$year, babies, type = 'l')
# The max point
subset(c(present2$year, present2$babies), present2$babies == max(present2$babies))
## [1] 1961 4268326
There is a birth peak in 1961.
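An arguably clearer way to find the peak year is `which.max()`, which returns the index of the (first) maximum. This is a sketch using a toy `present2` built from a few of the real rows, not the full data frame loaded above:

```r
# Toy stand-in for 'present2' (values taken from the output above)
present2 <- data.frame(year   = c(1940, 1941, 1942, 1961),
                       babies = c(2360399, 2513427, 2808996, 4268326))
# which.max() gives the row index of the maximum, so we can index directly
present2[which.max(present2$babies), c('year', 'babies')]
```

This avoids recycling a logical condition over a concatenated vector, as the `subset(c(...))` trick above does.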
Comparing the dataset
# Check when boys outnumber girls
present$boys > present$girls
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [15] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [29] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [43] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [57] TRUE TRUE TRUE TRUE TRUE TRUE TRUE
All TRUE: every year there are more boys born than girls.
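The 63 year-by-year comparisons can be collapsed into a single check with `all()`. A minimal sketch, using the first three years of the data as toy vectors:

```r
# Toy stand-ins for present$boys and present$girls (first three years)
boys  <- c(1211684, 1289734, 1444365)
girls <- c(1148715, 1223693, 1364631)
# all() is TRUE only if every element of the comparison is TRUE
all(boys > girls)
## [1] TRUE
```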
Challenge
# Plot the boy-to-girl ratio for every year
plot(present$year, present$boys / present$girls)
There is initially a decrease in the boy-to-girl ratio, then an increase between 1960 and 1970, followed by another decrease.
# absolute difference (a difference, not a ratio, so name it accordingly)
present2 <- data.frame(year = present$year, diff = present$boys - present$girls)
# find the year
subset(present2$year, present2$diff == max(present2$diff))
## [1] 1963
The biggest absolute difference between the number of boys and the number of girls born: 1963.
Welcome!
The Behavioral Risk Factor Surveillance System (BRFSS) is an annual telephone survey of 350,000 people in the United States. As its name implies, the BRFSS is designed to identify risk factors in the adult population and report emerging health trends. For example, respondents are asked about their diet and weekly physical activity, their HIV/AIDS status, possible tobacco use, and even their level of healthcare coverage.
We will focus on a random sample of 20,000 people from the BRFSS survey conducted in 2000.
Which variables are you working with?
# Print the names of the variables
names(cdc)
## [1] "genhlth" "exerany" "hlthplan" "smoke100" "height" "weight"
## [7] "wtdesire" "age" "gender"
nrow(cdc)
## [1] 20000
str(cdc)
## 'data.frame': 20000 obs. of 9 variables:
## $ genhlth : Factor w/ 5 levels "excellent","very good",..: 3 3 3 3 2 2 2 2 3 3 ...
## $ exerany : num 0 0 1 1 0 1 1 0 0 1 ...
## $ hlthplan: num 1 1 1 1 1 1 1 1 1 1 ...
## $ smoke100: num 0 1 1 0 0 0 0 0 1 0 ...
## $ height : num 70 64 60 66 61 64 71 67 65 70 ...
## $ weight : int 175 125 105 132 150 114 194 170 150 180 ...
## $ wtdesire: int 175 115 105 124 130 114 185 160 130 170 ...
## $ age : int 77 33 49 42 55 55 31 45 27 44 ...
## $ gender : Factor w/ 2 levels "m","f": 1 2 2 2 2 2 1 1 2 1 ...
20,000 cases; 9 variables
Taking a peek at the data
# Print the head and tails of the data frame:
head(cdc)
## genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 1 good 0 1 0 70 175 175 77 m
## 2 good 0 1 1 64 125 115 33 f
## 3 good 1 1 1 60 105 105 49 f
## 4 good 1 1 0 66 132 124 42 f
## 5 very good 0 1 0 61 150 130 55 f
## 6 very good 1 1 0 64 114 114 55 f
tail(cdc)
## genhlth exerany hlthplan smoke100 height weight wtdesire age
## 19995 good 0 1 1 69 224 224 73
## 19996 good 1 1 0 66 215 140 23
## 19997 excellent 0 1 0 73 200 185 35
## 19998 poor 0 1 0 65 216 150 57
## 19999 good 1 1 0 67 165 165 81
## 20000 good 1 1 1 69 170 165 83
## gender
## 19995 m
## 19996 f
## 19997 m
## 19998 f
## 19999 f
## 20000 m
Type of data: a data frame with both numerical variables (e.g. height, weight, age) and categorical variables stored as factors (e.g. genhlth, gender).
Let’s refresh
# View the head or tail of both the height and the genhlth variables
head(cdc$height)
## [1] 70 64 60 66 61 64
tail(cdc$height)
## [1] 69 66 73 65 67 69
head(cdc$genhlth)
## [1] good good good good very good very good
## Levels: excellent very good good fair poor
tail(cdc$genhlth)
## [1] good good excellent poor good good
## Levels: excellent very good good fair poor
Turning info into knowledge – Numerical data
The BRFSS questionnaire is a massive trove of information. A good first step in any analysis is to distill all of that information into a few summary statistics and graphics.
Continuous data.
mean(cdc$weight)
## [1] 169.683
var(cdc$weight)
## [1] 1606.484
median(cdc$weight)
## [1] 165
summary(cdc$weight)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 68.0 140.0 165.0 169.7 190.0 500.0
Turning info into knowledge – Categorical data
Categorical (discrete) data.
# Create the frequency table here
table(cdc$genhlth)
##
## excellent very good good fair poor
## 4657 6972 5675 2019 677
# Create the relative frequency table here
table(cdc$genhlth) / nrow(cdc)
##
## excellent very good good fair poor
## 0.23285 0.34860 0.28375 0.10095 0.03385
Creating a barplot
Non-smokers vs smokers.
# Draw the barplot
barplot(table(cdc$smoke100))
Males vs Females.
# Draw another barplot
barplot(table(cdc$gender))
# Number of males
sum(cdc$gender == 'm')
## [1] 9569
Health.
# Proportion of 'excellent' health
sum(cdc$genhlth == 'excellent') / nrow(cdc)
## [1] 0.23285
Even prettier: the mosaic plot
Two categories in one plot.
gender_smokers <- table(cdc$gender, cdc$smoke100)
gender_smokers
##
## 0 1
## m 4547 5022
## f 6012 4419
# Plot the mosaicplot
mosaicplot(gender_smokers)
Males are more likely to smoke than females.
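Row-wise proportions quantify what the mosaic plot shows. A sketch, rebuilding the table from the counts printed above rather than from `cdc` itself:

```r
# Rebuild the gender-by-smoke100 table from the counts shown above
gender_smokers <- as.table(rbind(m = c(4547, 5022), f = c(6012, 4419)))
colnames(gender_smokers) <- c('0', '1')
# prop.table with margin = 1 makes each row sum to 1
round(prop.table(gender_smokers, 1), 3)
```

About 52% of the males have smoked 100 cigarettes, versus about 42% of the females.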
Interlude: how R thinks about data (1)
Reduce, extract, subset to find answers.
# Create the subsets
height_1337 <- cdc[1337, 5]
weight_111 <- cdc[111, 6]
# Print the results
height_1337
## [1] 70
weight_111
## [1] 210
Interlude (2)
# Create the subsets
first8 <- cdc[1:8, 3:5]
wt_gen_10_20 <- cdc[10:20, 6:9]
# Print the subsets
first8
## hlthplan smoke100 height
## 1 1 0 70
## 2 1 1 64
## 3 1 1 60
## 4 1 0 66
## 5 1 0 61
## 6 1 0 64
## 7 1 0 71
## 8 1 0 67
wt_gen_10_20
## weight wtdesire age gender
## 10 180 170 44 m
## 11 186 175 46 m
## 12 168 148 62 m
## 13 185 220 21 m
## 14 170 170 69 m
## 15 170 170 23 m
## 16 185 175 79 m
## 17 156 150 47 m
## 18 185 185 76 m
## 19 200 190 43 m
## 20 125 120 33 f
Interlude (3)
# Create the subsets
resp205 <- cdc[205, ]
ht_wt <- cdc[ ,5:6]
# Print the subsets
resp205
## genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 205 fair 0 0 1 61 200 125 49 f
head(ht_wt)
## height weight
## 1 70 175
## 2 64 125
## 3 60 105
## 4 66 132
## 5 61 150
## 6 64 114
Interlude (4)
# Create the subsets
resp1000_smk <- cdc$smoke100[1000]
first30_ht <- cdc$height[1:30]
# Print the subsets
resp1000_smk
## [1] 1
first30_ht
## [1] 70 64 60 66 61 64 71 67 65 70 69 69 66 70 69 73 67 71 75 67 69 65 73
## [24] 67 64 68 67 69 61 74
A little more on subsetting
# Create the subsets
very_good <- subset(cdc, cdc$genhlth == 'very good')
age_gt50 <- subset(cdc, cdc$age > 50)
# Print the subsets
head(very_good)
## genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 5 very good 0 1 0 61 150 130 55 f
## 6 very good 1 1 0 64 114 114 55 f
## 7 very good 1 1 0 71 194 185 31 m
## 8 very good 0 1 0 67 170 160 45 m
## 20 very good 1 1 0 67 125 120 33 f
## 21 very good 1 1 0 69 200 150 48 f
head(age_gt50)
## genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 1 good 0 1 0 70 175 175 77 m
## 5 very good 0 1 0 61 150 130 55 f
## 6 very good 1 1 0 64 114 114 55 f
## 12 fair 1 1 1 69 168 148 62 m
## 14 excellent 1 1 1 70 170 170 69 m
## 16 good 1 1 1 73 185 175 79 m
Subset – one last time
# Create the subset
under23_and_smoke <- subset(cdc, cdc$age < 23 & cdc$smoke100 == 1)
# Print the top six rows of the subset
head(under23_and_smoke)
## genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 13 excellent 1 0 1 66 185 220 21 m
## 37 very good 1 0 1 70 160 140 18 f
## 96 excellent 1 1 1 74 175 200 22 m
## 180 good 1 1 1 64 190 140 20 f
## 182 very good 1 1 1 62 92 92 21 f
## 240 very good 1 0 1 64 125 115 22 f
# Number of observations
nrow(under23_and_smoke)
## [1] 620
Visualizing with box plots
# Draw the box plot of the respondents heights
boxplot(cdc$height)
# Print the summary
summary(cdc$height)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 48.00 64.00 67.00 67.18 70.00 93.00
For continuous data.
More on box plots
By category (mixing continuous and categorical data).
# Draw the box plot of the weights versus smoking
# y ~ x
boxplot(cdc$weight ~ cdc$smoke100)
One last box plot
Box plot of the Body Mass Index (BMI) vs health condition.
\[BMI = \frac{\mathrm{weight(lb)}}{\mathrm{height(in)}^{2}} \times 703\]
# Calculate the BMI
bmi <- (cdc$weight / (cdc$height ^ 2)) * 703
# Draw the box plot
# y ~ x
boxplot(bmi ~ cdc$genhlth)
The distributions of BMI within each health status group are right skewed.
Histograms
Histograms are generally a very good way to see the shape of a single distribution, but that shape can change depending on how the data is split between the different bins.
# Draw a histogram of bmi
hist(bmi)
# And one with breaks set to 50
hist(bmi, breaks = 50)
# And one with breaks set to 100
hist(bmi, breaks = 100)
Weight vs. Desired Weight
Correlation with a scatter plot.
# Draw a scatter plot
plot(cdc$wtdesire ~ cdc$weight)
Moderately strong positive linear association.
However, the data is bunched up. Let’s improve the visualization.
library(hexbin)
# Draw a hexbinplot
hexbinplot(cdc$wtdesire ~ cdc$weight, aspect = 0.5)
# Zero in
hexbinplot(cdc$wtdesire ~ cdc$weight, ylim = c(0, 400), aspect = 0.5)
Getting started
Kobe Bryant of the Los Angeles Lakers.
His performance against the Orlando Magic in the 2009 NBA finals earned him the title “Most Valuable Player” and many spectators commented on how he appeared to show a hot hand.
When a basketball player makes several baskets in succession, they are described as having a ‘hot hand’. Fans and players have long believed in this phenomenon, which contradicts the assumption that each shot is independent of the next. However, a 1985 paper by Gilovich, Vallone, and Tversky found no evidence for it and concluded that successive shots are independent events.
# Inspect the data
head(kobe)
## vs game quarter time
## 1 ORL 1 1 9:47
## 2 ORL 1 1 9:07
## 3 ORL 1 1 8:11
## 4 ORL 1 1 7:41
## 5 ORL 1 1 7:03
## 6 ORL 1 1 6:01
## description basket
## 1 Kobe Bryant makes 4-foot two point shot H
## 2 Kobe Bryant misses jumper M
## 3 Kobe Bryant misses 7-foot jumper M
## 4 Kobe Bryant makes 16-foot jumper (Derek Fisher assists) H
## 5 Kobe Bryant makes driving layup H
## 6 Kobe Bryant misses jumper M
tail(kobe)
## vs game quarter time description basket
## 128 ORL 3 4 3:57 Bryant Jump Shot: Made (28 PTS) H
## 129 ORL 3 4 3:33 Bryant Layup Shot: Missed M
## 130 ORL 3 4 2:02 Bryant 3pt Shot: Missed M
## 131 ORL 3 4 00:23.9 Bryant 3pt Shot: Missed M
## 132 ORL 3 4 00:06.9 Bryant 3pt Shot: Missed M
## 133 ORL 3 4 00:00.5 Bryant Layup Shot: Made (31 PTS) H
Kobe’s track record
Just looking at the string of hits and misses, it can be difficult to gauge whether or not it seems like Kobe was shooting with a hot hand.
One way we can approach this is by considering the belief that hot hand shooters tend to go on shooting streaks.
# Print the variable names of the data frame
names(kobe)
## [1] "vs" "game" "quarter" "time" "description"
## [6] "basket"
# Print the first 9 values of the 'basket' variable
kobe$basket[1:9]
## [1] "H" "M" "M" "H" "H" "M" "M" "M" "M"
A first analysis
The distribution of the shooting streaks.
# Assign Kobe's streak lengths:
kobe_streak <- calc_streak(kobe$basket)
# Draw a barplot of the result:
barplot(table(kobe_streak))
The distribution of Kobe’s streak lengths is unimodal and right skewed: most streaks are short, with a long tail of rarer long streaks.
We’ve shown that Kobe had some long shooting streaks, but are they long enough to support the belief that he had hot hands? What can we compare them to?
To answer these questions, let’s return to the idea of independence.
Two processes are independent if the outcome of one process doesn’t affect the outcome of the second. If each shot that a player takes is an independent process, having made or missed your first shot will not affect the probability that you will make or miss your second shot.
A shooter with a hot hand will have shots that are not independent of one another. Specifically, if the shooter makes his first shot, the hot hand model says he will have a higher probability of making his second shot. Let’s suppose for a moment that the hot hand model is valid for Kobe. During his career, the percentage of time Kobe makes a basket (i.e. his shooting percentage) is about 45%, or in probability notation:
\[P(\mathrm{shot}_1=H) = 0.45.\]
If Kobe has a hot hand (not independent shots), then the probability that he makes his second shot would go up given that he made the first shot:
\[P(\mathrm{shot}_2=H~|~\mathrm{shot}_1=H) > 0.45.\]
Put concretely: if Kobe has a hot hand, the probability that he makes his second shot would be higher, say 0.60. As a result of these increased probabilities, you’d expect Kobe to have longer streaks.
Compare this to the skeptical perspective where Kobe does not have a hot hand, where each shot is independent of the next. If he hits his first shot, the probability that he makes the second is still 0.45.
In other words, making the first shot did nothing to affect the probability that he’d make his second shot. If Kobe’s shots are independent, then he’d have the same probability of hitting every shot regardless of his past shots: 45%.
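Independence can be probed empirically. As a hypothetical check (not part of the lab), we can simulate a long sequence of independent shots and estimate the hit rate immediately after a hit, which should stay near the overall 45%:

```r
# Simulate 10,000 independent shots with a 45% hit rate
set.seed(42)
shots <- sample(c('H', 'M'), size = 10000, replace = TRUE, prob = c(0.45, 0.55))
# Shots 2..n, restricted to positions where the previous shot was a hit
after_hit <- shots[-1][shots[-length(shots)] == 'H']
# Estimated P(shot = H | previous shot = H); close to 0.45 for an independent shooter
mean(after_hit == 'H')
```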
Now that we’ve phrased the situation in terms of independent shots, let’s return to the question: how do we tell if Kobe’s shooting streaks are long enough to indicate that he has hot hands? We can compare his streak lengths to someone without hot hands: an independent shooter. Starting with the next exercise, you’ll learn how to simulate such an independent shooter in R.
If Kobe’s shooting streaks are significantly longer than an independent shooter’s, that would support the claim that he has a hot hand.
Simulations in R
So the key is to compare Kobe’s data with that of a shooter who we know to have independent shots. While we don’t have any such data, that is very easy to simulate.
In a simulation, we set the ground rules of a random process and then the computer uses random numbers to generate an outcome that adheres to those rules.
# Try some simulations!
outcomes <- c('heads', 'tails')
sample(outcomes, size = 1, replace = TRUE)
## [1] "tails"
sample(outcomes, size = 1, replace = TRUE)
## [1] "heads"
sample(outcomes, size = 1, replace = TRUE)
## [1] "heads"
sample(outcomes, size = 1, replace = TRUE)
## [1] "heads"
sample(outcomes, size = 1, replace = TRUE)
## [1] "tails"
sample(outcomes, size = 5, replace = TRUE)
## [1] "tails" "tails" "heads" "tails" "tails"
Flipping 100 coins
If we wanted to simulate flipping a fair coin 100 times, we could either run the function 100 times or, more simply, adjust the size argument, which governs how many samples to draw. Since there are only two elements in outcomes, the probability that we ‘flip’ a coin and it lands heads is 0.5.
Suppose again that you have a hat with two slips of paper in it: one slip says ‘heads’ and the other says ‘tails’. At each draw, the probability of drawing the slip that says ‘heads’ is 50%, and ‘tails’ is 50%. The replace = TRUE argument indicates that we put the slip back in the hat before drawing again, always keeping the 50/50 odds.
# Run the simulation
outcomes <- c('heads', 'tails')
sim_fair_coin <- sample(outcomes, size = 100, replace = TRUE)
# Print the object
sim_fair_coin
## [1] "heads" "tails" "heads" "tails" "heads" "heads" "tails" "heads"
## [9] "heads" "heads" "tails" "heads" "heads" "heads" "tails" "tails"
## [17] "tails" "heads" "tails" "heads" "heads" "heads" "tails" "heads"
## [25] "heads" "tails" "tails" "tails" "heads" "heads" "tails" "heads"
## [33] "heads" "heads" "tails" "tails" "tails" "heads" "heads" "tails"
## [41] "heads" "tails" "tails" "heads" "heads" "heads" "heads" "tails"
## [49] "heads" "tails" "heads" "tails" "heads" "heads" "tails" "heads"
## [57] "tails" "heads" "tails" "heads" "tails" "heads" "tails" "heads"
## [65] "tails" "tails" "tails" "tails" "tails" "heads" "tails" "heads"
## [73] "heads" "tails" "tails" "heads" "tails" "heads" "tails" "tails"
## [81] "tails" "tails" "tails" "tails" "heads" "heads" "heads" "heads"
## [89] "tails" "heads" "tails" "heads" "tails" "tails" "heads" "heads"
## [97] "heads" "tails" "tails" "heads"
# Compute the counts of heads and tails:
table(sim_fair_coin)
## sim_fair_coin
## heads tails
## 52 48
Flipping an unfair coin
Until now we’ve been running simulations where each outcome had an equal probability.
Change that to a 20/80 ratio.
# Run the simulation
outcomes <- c('heads', 'tails')
sim_unfair_coin <- sample(outcomes, size = 100, replace = TRUE, prob = c(0.20, 0.80))
# Print the object
sim_unfair_coin
## [1] "tails" "tails" "tails" "tails" "tails" "tails" "tails" "tails"
## [9] "tails" "tails" "tails" "heads" "tails" "tails" "tails" "tails"
## [17] "tails" "tails" "heads" "tails" "tails" "heads" "tails" "tails"
## [25] "tails" "tails" "tails" "tails" "tails" "tails" "tails" "tails"
## [33] "tails" "heads" "heads" "heads" "tails" "heads" "tails" "tails"
## [41] "tails" "tails" "tails" "heads" "tails" "heads" "heads" "tails"
## [49] "tails" "heads" "tails" "heads" "tails" "heads" "tails" "tails"
## [57] "heads" "tails" "tails" "tails" "tails" "tails" "heads" "heads"
## [65] "tails" "tails" "tails" "tails" "tails" "tails" "tails" "tails"
## [73] "tails" "tails" "heads" "tails" "tails" "tails" "tails" "tails"
## [81] "tails" "tails" "tails" "tails" "tails" "tails" "tails" "tails"
## [89] "tails" "tails" "tails" "tails" "tails" "tails" "tails" "tails"
## [97] "tails" "heads" "tails" "tails"
# Compute the counts of heads and tails
table(sim_unfair_coin)
## sim_unfair_coin
## heads tails
## 18 82
Simulating the independent shooter
Simulating a basketball player who has independent shots uses the same mechanism that we used to simulate a coin flip:
# Run the simulation and assign the result to 'sim_basket'
outcomes <- c('H', 'M')
sim_basket <- sample(outcomes, size = 133, replace = TRUE, prob = c(0.45, 0.55))
table(sim_basket)
## sim_basket
## H M
## 57 76
Kobe vs. the independent shooter
Both datasets represent the results of 133 shot attempts, each with the same shooting percentage of 45%.
We know that our simulated data is from a shooter that has independent shots. That is, we know the simulated shooter does not have a ‘hot hand’.
# Calculate streak lengths
kobe_streak <- calc_streak(kobe$basket)
sim_streak <- calc_streak(sim_basket)
# Compute summaries
summary(kobe_streak)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.7632 1.0000 4.0000
summary(sim_streak)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.7403 1.0000 5.0000
# Make bar plots
kobe_table <- table(kobe_streak)
sim_table <- table(sim_streak)
par(mfrow = c(1, 2))
barplot(kobe_table, ylim = c(0, 50))
barplot(sim_table, ylim = c(0, 50))
par(mfrow = c(1, 1))
Somewhat similar distributions!
Since his streak distribution looks much like the independent shooter’s, we can conclude that Kobe Bryant likely did not have a ‘hot hand’.
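One way to sharpen this comparison, sketched here as an extension beyond the lab, is to simulate many independent 133-shot performances and ask how often the longest hit streak reaches Kobe’s maximum of 4 (from the summary above):

```r
# Simulate 1000 independent shooters, each taking 133 shots at 45%,
# and record each shooter's longest run of hits
set.seed(1)
max_streaks <- replicate(1000, {
  shots <- sample(c('H', 'M'), size = 133, replace = TRUE, prob = c(0.45, 0.55))
  runs <- rle(shots)                        # run-length encoding of H/M runs
  max(runs$lengths[runs$values == 'H'])     # longest run of hits
})
# Share of independent shooters whose longest streak is 4 or more
mean(max_streaks >= 4)
```

A large share of purely independent shooters reach a streak of 4, so Kobe’s longest streak is unremarkable under independence.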
Sampling distributions
We consider real estate data from the city of Ames, Iowa. The details of every real estate transaction in Ames are recorded by the City Assessor’s office. Our particular focus for this lab will be all residential home sales in Ames between 2006 and 2010.
This collection represents our population of interest (which is rare to have access to), but we will also work with smaller samples from this population.
# Make some preliminary inspections:
head(ames, 2)
## Order PID MS.SubClass MS.Zoning Lot.Frontage Lot.Area Street Alley
## 1 1 526301100 20 RL 141 31770 Pave <NA>
## 2 2 526350040 20 RH 80 11622 Pave <NA>
## Lot.Shape Land.Contour Utilities Lot.Config Land.Slope Neighborhood
## 1 IR1 Lvl AllPub Corner Gtl NAmes
## 2 Reg Lvl AllPub Inside Gtl NAmes
## Condition.1 Condition.2 Bldg.Type House.Style Overall.Qual Overall.Cond
## 1 Norm Norm 1Fam 1Story 6 5
## 2 Feedr Norm 1Fam 1Story 5 6
## Year.Built Year.Remod.Add Roof.Style Roof.Matl Exterior.1st Exterior.2nd
## 1 1960 1960 Hip CompShg BrkFace Plywood
## 2 1961 1961 Gable CompShg VinylSd VinylSd
## Mas.Vnr.Type Mas.Vnr.Area Exter.Qual Exter.Cond Foundation Bsmt.Qual
## 1 Stone 112 TA TA CBlock TA
## 2 None 0 TA TA CBlock TA
## Bsmt.Cond Bsmt.Exposure BsmtFin.Type.1 BsmtFin.SF.1 BsmtFin.Type.2
## 1 Gd Gd BLQ 639 Unf
## 2 TA No Rec 468 LwQ
## BsmtFin.SF.2 Bsmt.Unf.SF Total.Bsmt.SF Heating Heating.QC Central.Air
## 1 0 441 1080 GasA Fa Y
## 2 144 270 882 GasA TA Y
## Electrical X1st.Flr.SF X2nd.Flr.SF Low.Qual.Fin.SF Gr.Liv.Area
## 1 SBrkr 1656 0 0 1656
## 2 SBrkr 896 0 0 896
## Bsmt.Full.Bath Bsmt.Half.Bath Full.Bath Half.Bath Bedroom.AbvGr
## 1 1 0 1 0 3
## 2 0 0 1 0 2
## Kitchen.AbvGr Kitchen.Qual TotRms.AbvGrd Functional Fireplaces
## 1 1 TA 7 Typ 2
## 2 1 TA 5 Typ 0
## Fireplace.Qu Garage.Type Garage.Yr.Blt Garage.Finish Garage.Cars
## 1 Gd Attchd 1960 Fin 2
## 2 <NA> Attchd 1961 Unf 1
## Garage.Area Garage.Qual Garage.Cond Paved.Drive Wood.Deck.SF
## 1 528 TA TA P 210
## 2 730 TA TA Y 140
## Open.Porch.SF Enclosed.Porch X3Ssn.Porch Screen.Porch Pool.Area Pool.QC
## 1 62 0 0 0 0 <NA>
## 2 0 0 0 120 0 <NA>
## Fence Misc.Feature Misc.Val Mo.Sold Yr.Sold Sale.Type Sale.Condition
## 1 <NA> <NA> 0 5 2010 WD Normal
## 2 MnPrv <NA> 0 6 2010 WD Normal
## SalePrice
## 1 215000
## 2 105000
str(ames)
## 'data.frame': 2930 obs. of 82 variables:
## $ Order : int 1 2 3 4 5 6 7 8 9 10 ...
## $ PID : int 526301100 526350040 526351010 526353030 527105010 527105030 527127150 527145080 527146030 527162130 ...
## $ MS.SubClass : int 20 20 20 20 60 60 120 120 120 60 ...
## $ MS.Zoning : Factor w/ 7 levels "A (agr)","C (all)",..: 6 5 6 6 6 6 6 6 6 6 ...
## $ Lot.Frontage : int 141 80 81 93 74 78 41 43 39 60 ...
## $ Lot.Area : int 31770 11622 14267 11160 13830 9978 4920 5005 5389 7500 ...
## $ Street : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
## $ Alley : Factor w/ 2 levels "Grvl","Pave": NA NA NA NA NA NA NA NA NA NA ...
## $ Lot.Shape : Factor w/ 4 levels "IR1","IR2","IR3",..: 1 4 1 4 1 1 4 1 1 4 ...
## $ Land.Contour : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 4 2 4 4 ...
## $ Utilities : Factor w/ 3 levels "AllPub","NoSeWa",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Lot.Config : Factor w/ 5 levels "Corner","CulDSac",..: 1 5 1 1 5 5 5 5 5 5 ...
## $ Land.Slope : Factor w/ 3 levels "Gtl","Mod","Sev": 1 1 1 1 1 1 1 1 1 1 ...
## $ Neighborhood : Factor w/ 28 levels "Blmngtn","Blueste",..: 16 16 16 16 9 9 25 25 25 9 ...
## $ Condition.1 : Factor w/ 9 levels "Artery","Feedr",..: 3 2 3 3 3 3 3 3 3 3 ...
## $ Condition.2 : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Bldg.Type : Factor w/ 5 levels "1Fam","2fmCon",..: 1 1 1 1 1 1 5 5 5 1 ...
## $ House.Style : Factor w/ 8 levels "1.5Fin","1.5Unf",..: 3 3 3 3 6 6 3 3 3 6 ...
## $ Overall.Qual : int 6 5 6 7 5 6 8 8 8 7 ...
## $ Overall.Cond : int 5 6 6 5 5 6 5 5 5 5 ...
## $ Year.Built : int 1960 1961 1958 1968 1997 1998 2001 1992 1995 1999 ...
## $ Year.Remod.Add : int 1960 1961 1958 1968 1998 1998 2001 1992 1996 1999 ...
## $ Roof.Style : Factor w/ 6 levels "Flat","Gable",..: 4 2 4 4 2 2 2 2 2 2 ...
## $ Roof.Matl : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Exterior.1st : Factor w/ 16 levels "AsbShng","AsphShn",..: 4 14 15 4 14 14 6 7 6 14 ...
## $ Exterior.2nd : Factor w/ 17 levels "AsbShng","AsphShn",..: 11 15 16 4 15 15 6 7 6 15 ...
## $ Mas.Vnr.Type : Factor w/ 6 levels "","BrkCmn","BrkFace",..: 6 5 3 5 5 3 5 5 5 5 ...
## $ Mas.Vnr.Area : int 112 0 108 0 0 20 0 0 0 0 ...
## $ Exter.Qual : Factor w/ 4 levels "Ex","Fa","Gd",..: 4 4 4 3 4 4 3 3 3 4 ...
## $ Exter.Cond : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ Foundation : Factor w/ 6 levels "BrkTil","CBlock",..: 2 2 2 2 3 3 3 3 3 3 ...
## $ Bsmt.Qual : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 6 6 4 6 4 4 4 6 ...
## $ Bsmt.Cond : Factor w/ 6 levels "","Ex","Fa","Gd",..: 4 6 6 6 6 6 6 6 6 6 ...
## $ Bsmt.Exposure : Factor w/ 5 levels "","Av","Gd","Mn",..: 3 5 5 5 5 5 4 5 5 5 ...
## $ BsmtFin.Type.1 : Factor w/ 7 levels "","ALQ","BLQ",..: 3 6 2 2 4 4 4 2 4 7 ...
## $ BsmtFin.SF.1 : int 639 468 923 1065 791 602 616 263 1180 0 ...
## $ BsmtFin.Type.2 : Factor w/ 7 levels "","ALQ","BLQ",..: 7 5 7 7 7 7 7 7 7 7 ...
## $ BsmtFin.SF.2 : int 0 144 0 0 0 0 0 0 0 0 ...
## $ Bsmt.Unf.SF : int 441 270 406 1045 137 324 722 1017 415 994 ...
## $ Total.Bsmt.SF : int 1080 882 1329 2110 928 926 1338 1280 1595 994 ...
## $ Heating : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Heating.QC : Factor w/ 5 levels "Ex","Fa","Gd",..: 2 5 5 1 3 1 1 1 1 3 ...
## $ Central.Air : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
## $ Electrical : Factor w/ 6 levels "","FuseA","FuseF",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ X1st.Flr.SF : int 1656 896 1329 2110 928 926 1338 1280 1616 1028 ...
## $ X2nd.Flr.SF : int 0 0 0 0 701 678 0 0 0 776 ...
## $ Low.Qual.Fin.SF: int 0 0 0 0 0 0 0 0 0 0 ...
## $ Gr.Liv.Area : int 1656 896 1329 2110 1629 1604 1338 1280 1616 1804 ...
## $ Bsmt.Full.Bath : int 1 0 0 1 0 0 1 0 1 0 ...
## $ Bsmt.Half.Bath : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Full.Bath : int 1 1 1 2 2 2 2 2 2 2 ...
## $ Half.Bath : int 0 0 1 1 1 1 0 0 0 1 ...
## $ Bedroom.AbvGr : int 3 2 3 3 3 3 2 2 2 3 ...
## $ Kitchen.AbvGr : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Kitchen.Qual : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 3 1 5 3 3 3 3 3 ...
## $ TotRms.AbvGrd : int 7 5 6 8 6 7 6 5 5 7 ...
## $ Functional : Factor w/ 8 levels "Maj1","Maj2",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ Fireplaces : int 2 0 0 2 1 1 0 0 1 1 ...
## $ Fireplace.Qu : Factor w/ 5 levels "Ex","Fa","Gd",..: 3 NA NA 5 5 3 NA NA 5 5 ...
## $ Garage.Type : Factor w/ 6 levels "2Types","Attchd",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Garage.Yr.Blt : int 1960 1961 1958 1968 1997 1998 2001 1992 1995 1999 ...
## $ Garage.Finish : Factor w/ 4 levels "","Fin","RFn",..: 2 4 4 2 2 2 2 3 3 2 ...
## $ Garage.Cars : int 2 1 1 2 2 2 2 2 2 2 ...
## $ Garage.Area : int 528 730 312 522 482 470 582 506 608 442 ...
## $ Garage.Qual : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ Garage.Cond : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ Paved.Drive : Factor w/ 3 levels "N","P","Y": 2 3 3 3 3 3 3 3 3 3 ...
## $ Wood.Deck.SF : int 210 140 393 0 212 360 0 0 237 140 ...
## $ Open.Porch.SF : int 62 0 36 0 34 36 0 82 152 60 ...
## $ Enclosed.Porch : int 0 0 0 0 0 0 170 0 0 0 ...
## $ X3Ssn.Porch : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Screen.Porch : int 0 120 0 0 0 0 0 144 0 0 ...
## $ Pool.Area : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Pool.QC : Factor w/ 4 levels "Ex","Fa","Gd",..: NA NA NA NA NA NA NA NA NA NA ...
## $ Fence : Factor w/ 4 levels "GdPrv","GdWo",..: NA 3 NA NA 3 NA NA NA NA NA ...
## $ Misc.Feature : Factor w/ 5 levels "Elev","Gar2",..: NA NA 2 NA NA NA NA NA NA NA ...
## $ Misc.Val : int 0 0 12500 0 0 0 0 0 0 0 ...
## $ Mo.Sold : int 5 6 6 4 3 6 4 1 3 6 ...
## $ Yr.Sold : int 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
## $ Sale.Type : Factor w/ 10 levels "COD","Con","ConLD",..: 10 10 10 10 10 10 10 10 10 10 ...
## $ Sale.Condition : Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ SalePrice : int 215000 105000 172000 244000 189900 195500 213500 191500 236500 189000 ...
A first distribution analysis
Restrict the attention to two of the variables.
# Assign the variables
area <- ames$Gr.Liv.Area
price <- ames$SalePrice
# Calculate the summary and draw a histogram of 'area'
summary(area)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 334 1126 1442 1500 1743 5642
# Plot it
hist(area)
Half of the houses in Ames are smaller than 1,442 square feet (the median); the mean area is about 1,500 square feet.
Sampling from the population
Gathering information on an entire population is often extremely costly or impossible. Because of this, we often take a sample of the population and use that to understand the properties of the population.
Estimate the mean living area in Ames based on a sample; collect a simple random sample of size 50.
# The 'ames' data frame and 'area' and 'price' objects are already loaded into the workspace
# Create the samples
samp0 <- sample(area, 50)
samp1 <- sample(area, 50)
# Draw the histograms
par(mfrow = c(1,2))
hist(samp0)
hist(samp1)
par(mfrow = c(1,1))
Depending on which 50 homes we selected, our estimate could be a bit above or a bit below the true population mean of approximately 1,500 square feet.
In general, though, the sample mean turns out to be a pretty good estimate of the average living area, and we were able to get it by sampling less than 3% of the population.
However, larger samples provide more accurate estimates of the population mean.
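A quick sketch of this sample-size effect, using a normally distributed stand-in population rather than the real `area` data: means of larger samples vary less around the truth.

```r
# Stand-in population of 2930 values centered at 1500 (hypothetical, for
# illustration only; the real 'area' data is not normal)
set.seed(123)
population <- rnorm(2930, mean = 1500, sd = 500)
# Repeated sample means at two sample sizes
means_50  <- replicate(2000, mean(sample(population, 50)))
means_200 <- replicate(2000, mean(sample(population, 200)))
# The spread of the sample means shrinks as the sample size grows
c(sd(means_50), sd(means_200))
```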
The sampling distribution
The distribution of sample means, called the sampling distribution, can help us understand this variability.
Generate 5000 samples and compute the sample mean of each.
# Set up an empty vector of 5000 NAs to store sample means
sample_means50 <- rep(NA, 5000)
# Take 5000 samples of size 50 of 'area' and store all of them in 'sample_means50'
for (i in 1:5000) {
samp <- sample(area, 50)
sample_means50[i] <- mean(samp)
}
# View the result. If you want, you can increase the bin width to show more detail by changing the 'breaks' argument
hist(sample_means50, breaks = 13)
Interlude: The for loop
The idea behind the for loop is iteration: it allows us to execute code many times without having to type out every iteration. The code to generate the 5000 sample means:
# Set up an empty vector of 5000 NAs to store sample means
sample_means50 <- rep(NA, 5000)
# Take 5000 samples of size 50 of 'area' and store all of them in 'sample_means50'
# Also print 'i' (the index counter):
for (i in 1:5000) {
samp <- sample(area, 50)
sample_means50[i] <- mean(samp)
print(i)
}
Interlude: breaking it down
In the first line we initialize a vector. In this case, we created a vector of 5000 NAs called sample_means50. This vector will store values generated within the for loop. NA means "not available", and here the NAs are used as placeholders until we fill in the values with actual sample means. NA is also often used for missing data in R.
The second line calls the for loop itself. The syntax can be loosely read as, “for every element i from 1 to 5000, run the following lines of code”. You can think of i as the counter that keeps track of which loop you’re on. Therefore, more precisely, the loop will run once when i=1, then once when i=2, and so on up to i=5000.
The body of the for loop is the part inside the curly braces, and this set of code is run for each value of i. Here, on every loop, we take a random sample of size 50 from area, take its mean, and store it as the ith element of sample_means50. In order to display that this is really happening, we asked R to print i at each iteration. This line of code is optional and is only used for displaying what’s going on while the for loop is running.
The for loop allows us to not just run the code 5000 times, but to neatly package the results, element by element, into the empty vector that we initialized at the outset. The code:
# The vector 'sample_means50' is initialized with NA values
sample_means50 <- rep(NA, 5000)
# The for loop runs 5000 times, with 'i' taking values 1 up to 5000
for (i in 1:5000) {
# Take a random sample of size 50
samp <- sample(area, 50)
# Store the mean of the sample in the 'sample_means50' vector on the ith place
sample_means50[i] <- mean(samp)
# Print the counter 'i'
print(i)
}
The result:
# Print the first few random means
head(sample_means50)## [1] 1440.38 1425.78 1500.44 1501.66 1418.44 1559.00
A first for loop
# Initialize the vector to store the means in
sample_means_small <- rep(NA, 100)
# Run your for loop
for (i in 1:100) {
samp <- sample(area,50)
sample_means_small[i] <- mean(samp)
}
# Print the result
sample_means_small## [1] 1539.88 1380.38 1570.22 1555.70 1527.24 1450.86 1585.26 1363.08
## [9] 1458.06 1490.30 1492.78 1473.94 1519.52 1463.06 1512.52 1549.02
## [17] 1602.76 1483.68 1505.58 1459.90 1464.42 1507.26 1365.04 1455.08
## [25] 1660.06 1467.30 1669.16 1435.54 1562.04 1460.86 1443.30 1410.28
## [33] 1453.38 1537.40 1466.74 1559.56 1543.74 1584.44 1531.44 1475.30
## [41] 1582.66 1532.10 1387.60 1523.92 1518.54 1517.92 1506.02 1432.80
## [49] 1417.70 1308.72 1482.80 1487.58 1490.12 1712.62 1490.88 1545.90
## [57] 1439.48 1594.44 1541.50 1519.30 1547.66 1517.20 1455.94 1472.08
## [65] 1597.16 1496.82 1566.50 1570.78 1564.44 1613.98 1521.22 1483.30
## [73] 1431.84 1420.40 1494.74 1443.18 1486.84 1459.86 1560.78 1529.48
## [81] 1543.60 1521.64 1509.62 1579.10 1474.94 1369.88 1490.14 1537.20
## [89] 1338.10 1346.68 1465.66 1554.30 1390.44 1411.20 1457.92 1461.42
## [97] 1486.52 1441.36 1548.58 1438.02
100 results, 100 elements. Each element represents a mean square footage from a simple random sample of 50 houses.
More on sampling
The sampling distribution that we computed tells us much about estimating the average living area in homes in Ames. Because the sample mean is an unbiased estimator, the sampling distribution is centered at the true average living area of the population, and the spread of the distribution indicates how much variability is induced by sampling only 50 home sales.
# Initialize the sample distributions:
sample_means10 <- rep(NA, 5000)
sample_means100 <- rep(NA, 5000)
# Run the for loop:
for (i in 1:5000) {
samp <- sample(area, 10)
sample_means10[i] <- mean(samp)
samp <- sample(area, 100)
sample_means100[i] <- mean(samp)
}
# Take a look at the results
head(sample_means10)## [1] 1452.9 1437.2 1482.6 1399.5 1452.3 1605.8
head(sample_means50) # was already loaded## [1] 1440.38 1425.78 1500.44 1501.66 1418.44 1559.00
head(sample_means100)## [1] 1448.41 1472.83 1527.48 1538.21 1536.08 1453.46
Influence of sample size
To see the effect that different sample sizes have on the sampling distribution, let’s plot the three distributions on top of one another.
# Divide the plot window in 3 rows
par(mfrow = c(3, 1))
# Define the limits for the x-axis
xlimits <- range(sample_means10)
# Draw the histograms
hist(sample_means10)
hist(sample_means50)
hist(sample_means100)
# With breaks:
hist(sample_means10, breaks = 20, xlim = xlimits)
hist(sample_means50, breaks = 20, xlim = xlimits)
hist(sample_means100, breaks = 20, xlim = xlimits)
par(mfrow = c(1, 1))
As the sample size increases, the variability of the sampling distribution decreases.
Now: prices!
So far, we have only focused on estimating the mean living area in homes in Ames.
Now let’s take off our training wheels and do the same analysis for the sales prices.
# Take a sample of size 50 from 'price'
sample_50 <- sample(price, 50)
# Print the mean
mean(sample_50)## [1] 174484.6
Sampling distribution of prices
Since we have access to the population, we can simulate the sampling distribution of \(\overline{x}\) by taking 5000 samples of size 50 from the population and computing 5000 sample means.
sample_means50 <- rep(NA, 5000)
for (i in 1:5000) {
samp <- sample(price,50)
sample_means50[i] <- mean(samp)
}
head(sample_means50)## [1] 214974.5 190683.2 192283.1 163343.8 179480.0 199587.8
More on sampling distribution of prices
Means of samples of size 150.
sample_means150 <- rep(NA, 5000)
for (i in 1:5000) {
samp <- sample(price,150)
sample_means150[i] <- mean(samp)
}
par(mfrow = c(2, 1))
hist(sample_means50)
hist(sample_means150)
par(mfrow = c(1, 1))
The variability of the sampling distribution with the larger sample size (sample_means150) is smaller than the variability of the sampling distribution with the smaller sample size (sample_means50).
One sample from Ames, Iowa
We'll translate the concept of reliability of mean estimates into something more tangible: confidence intervals.
If you have access to data on an entire population, say the size of every house in Ames, Iowa, it's straightforward to answer questions like, "How big is the typical house in Ames?" and "How much variation is there in sizes of houses?". If you have access to only a sample of the population, as is often the case, the task becomes more complicated.
What is your best guess for the typical size if you only know the sizes of several dozen houses? This sort of situation requires that you use your sample to make inference on what your population looks like.
We'll start with a simple random sample of size 60 from the population. Note that the dataset has information on many housing variables, but for the first portion of the lab we'll focus on the size of the house, represented by the variable Gr.Liv.Area.
# Take a sample of size 60 of the population:
population <- ames$Gr.Liv.Area
samp <- sample(population, 60)
# Calculate the mean
sample_mean <- mean(samp)
# Draw a histogram
hist(samp)
The distribution should be similar to the distributions of others who also collect random samples from this population, but it is likely not exactly the same since it's a random sample.
Confidence intervals
Based only on this single sample, the best estimate of the average living area of houses sold in Ames would be the sample mean.
But how uncertain are we of that estimate? This uncertainty can be captured with a confidence interval.
We can calculate a 95% confidence interval for a sample mean by adding and subtracting 1.96 standard errors to the point estimate.
# Calculate the standard error
se <- sd(samp)/sqrt(60)
# Calculate the lower and upper bounds of your confidence interval and print them
# 95% with z = 1.96
lower <- sample_mean - se * 1.96
upper <- sample_mean + se * 1.96
# to print several variables...
c(lower, sample_mean, upper)## [1] 1306.779 1436.067 1565.354
# Plot
stripchart(c(lower, sample_mean, upper))
For the confidence interval to be valid, the sample mean must be normally distributed and have standard error:
\[\frac{s}{\sqrt{n}}\]
The sample must be random and the sample size, 60, is less than 10% of all houses.
A 95% confidence interval is expected to capture the true population mean 95% of the time: 95% of random samples of size 60 will yield confidence intervals that contain the true average area of houses in Ames, Iowa.
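This coverage property can be checked by simulation. A sketch with a simulated stand-in population (hypothetical values, not the ames data):

```r
# Simulated stand-in population (hypothetical; not the 'ames' data)
set.seed(123)
fake_population <- rnorm(3000, mean = 1500, sd = 500)
mu <- mean(fake_population)

# Build 1000 intervals from samples of size 60 and check each one
covered <- replicate(1000, {
  samp <- sample(fake_population, 60)
  se <- sd(samp) / sqrt(60)
  (mean(samp) - 1.96 * se) <= mu & mu <= (mean(samp) + 1.96 * se)
})
# The fraction of intervals that capture the true mean should be near 0.95
mean(covered)
```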
Challenge (1)
We're going to recreate many samples using a for loop. Here is the rough outline:
# Initialize 'samp_mean', 'samp_sd' and 'n'
samp_mean <- rep(NA, 50)
samp_sd <- rep(NA, 50)
n <- 60
Challenge (2)
Now, we can use a for loop to calculate the mean and standard deviation of 50 random samples.
# Initialize 'samp_mean', 'samp_sd' and 'n'
samp_mean <- rep(NA, 50)
samp_sd <- rep(NA, 50)
n <- 60
# For loop goes here
for (i in 1:50) {
samp <- sample(population, n)
samp_mean[i] <- mean(samp)
samp_sd[i] <- sd(samp)
}
Challenge (3)
Lastly, we can construct the confidence intervals and plot them. In this exercise, you will have to calculate the 95% confidence intervals.
The lower and upper bound respectively:
\[\overline{x} - 1.96 \times \frac{s}{\sqrt{n}}\]
\[\overline{x} + 1.96 \times \frac{s}{\sqrt{n}}\]
# Initialize 'samp_mean', 'samp_sd' and 'n'
samp_mean <- rep(NA, 50)
samp_sd <- rep(NA, 50)
n <- 60
# For loop goes here
for (i in 1:50) {
samp <- sample(population, n)
samp_mean[i] <- mean(samp)
samp_sd[i] <- sd(samp)
}
# Calculate the interval bounds here
lower <- samp_mean - 1.96 * (samp_sd/sqrt(n))
upper <- samp_mean + 1.96 * (samp_sd/sqrt(n))
Plot confidence interval – the function
plot_ci <- function(lo, hi, m){
par(mar=c(2, 1, 1, 1), mgp=c(2.7, 0.7, 0))
k <- 50
ci.max <- max(rowSums(matrix(c(-1*lo,hi),ncol=2)))
xR <- m + ci.max*c(-1, 1)
yR <- c(0, 41*k/40)
plot(xR, yR, type='n', xlab='', ylab='', axes=FALSE)
abline(v=m, lty=2, col='#00000088')
axis(1, at=m, paste("mu = ",round(m,4)), cex.axis=1.15)
#axis(2)
for(i in 1:k){
x <- mean(c(hi[i],lo[i]))
ci <- c(lo[i],hi[i])
# Check whether the interval [lo[i], hi[i]] contains the true mean 'm'
if (!(lo[i] <= m & m <= hi[i])) {
col <- "#F05133"
points(x, i, cex=1.4, col=col)
# points(x, i, pch=20, cex=1.2, col=col)
lines(ci, rep(i, 2), col=col, lwd=5)
}
col <- 1
points(x, i, pch=20, cex=1.2, col=col)
lines(ci, rep(i, 2), col=col)
}
}
Plot confidence interval – the plot
# Plotting the confidence intervals
pop_mean <- mean(population)
plot_ci(lower, upper, pop_mean)
The 99%
Critical value for a 99% confidence level: 2.58.
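Both critical values can be recovered from the quantile function of the standard normal distribution:

```r
# Critical values are quantiles of the standard normal distribution:
# for a confidence level C, use qnorm(1 - (1 - C)/2)
qnorm(0.975)  # ~1.96, for a 95% confidence level
qnorm(0.995)  # ~2.58, for a 99% confidence level
```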
# Initialize 'samp_mean', 'samp_sd' and 'n'
samp_mean <- rep(NA, 50)
samp_sd <- rep(NA, 50)
n <- 60
# For loop goes here
for (i in 1:50) {
samp <- sample(population, n)
samp_mean[i] <- mean(samp)
samp_sd[i] <- sd(samp)
}
# Calculate the interval bounds here
lower <- samp_mean - 2.58 * (samp_sd/sqrt(n))
upper <- samp_mean + 2.58 * (samp_sd/sqrt(n))
# Plotting the confidence intervals
pop_mean <- mean(population)
plot_ci(lower, upper, pop_mean)
We would expect 99% of the intervals to contain the true population mean.
North Carolina births
In 2004, the state of North Carolina released a large dataset containing information on births recorded in their state. This dataset is useful to researchers studying the relation between habits and practices of expectant mothers and the birth of their children. We will work with a random sample of observations from this dataset.
# List the variables in the dataset
names(nc)## [1] "fage" "mage" "mature" "weeks"
## [5] "premie" "visits" "marital" "gained"
## [9] "weight" "lowbirthweight" "gender" "habit"
## [13] "whitemom"
The nc data frame contains observations on 13 different variables, some categorical and some numerical. The meaning of each variable is as follows:
fage: father's age in years.
mage: mother's age in years.
mature: maturity status of mother.
weeks: length of pregnancy in weeks.
premie: whether the birth was classified as premature (premie) or full-term.
visits: number of hospital visits during pregnancy.
marital: whether mother is married or not married at birth.
gained: weight gained by mother during pregnancy in pounds.
weight: weight of the baby at birth in pounds.
lowbirthweight: whether baby was classified as low birthweight (low) or not (not low).
gender: gender of the baby, female or male.
habit: status of the mother as a nonsmoker or a smoker.
whitemom: whether mom is white or not white.
A first analysis
# Compute summaries of the data
summary(nc)## fage mage mature weeks
## Min. :14.00 Min. :13 mature mom :133 Min. :20.00
## 1st Qu.:25.00 1st Qu.:22 younger mom:867 1st Qu.:37.00
## Median :30.00 Median :27 Median :39.00
## Mean :30.26 Mean :27 Mean :38.33
## 3rd Qu.:35.00 3rd Qu.:32 3rd Qu.:40.00
## Max. :55.00 Max. :50 Max. :45.00
## NA's :171 NA's :2
## premie visits marital gained
## full term:846 Min. : 0.0 married :386 Min. : 0.00
## premie :152 1st Qu.:10.0 not married:613 1st Qu.:20.00
## NA's : 2 Median :12.0 NA's : 1 Median :30.00
## Mean :12.1 Mean :30.33
## 3rd Qu.:15.0 3rd Qu.:38.00
## Max. :30.0 Max. :85.00
## NA's :9 NA's :27
## weight lowbirthweight gender habit
## Min. : 1.000 low :111 female:503 nonsmoker:873
## 1st Qu.: 6.380 not low:889 male :497 smoker :126
## Median : 7.310 NA's : 1
## Mean : 7.101
## 3rd Qu.: 8.060
## Max. :11.750
##
## whitemom
## not white:284
## white :714
## NA's : 2
##
##
##
##
# Use visualization and summary statistics to view the data for 'gained'
summary(nc$gained)## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 20.00 30.00 30.33 38.00 85.00 27
par(mfrow = c(2, 1))
boxplot(nc$gained, horizontal = TRUE, ylim = c(0, 90))
hist(nc$gained, xlim = c(0, 90), main = '')
par(mfrow = c(1, 1))
There are 27 NA values for weight gained.
Cleaning your data
Since we have some missing data on weight gain, we’ll first create a cleaned-up version of the weight gain variable, and use this variable in the next few steps of the analysis.
# Create a clean version of your dataset
gained_clean <- na.omit(nc$gained)
# Set 'n' to the length of 'gained_clean'
n <- length(gained_clean)
The bootstrap
Using this sample we would like to construct a bootstrap confidence interval for the average weight gained by all mothers during pregnancy. A quick reminder of how bootstrapping works: we repeatedly draw samples with replacement from the original sample, each of the same size as the original, and compute the statistic of interest for each of these bootstrap samples.
Let’s first take 100 bootstrap samples (i.e. with replacement), and record their means in a new object called boot_means.
# Initialize the 'boot_means' object
boot_means <- rep(NA, 100)
# Insert your for loop:
for (i in 1:100) {
boot_sample <- sample(gained_clean, n, replace = TRUE)
boot_means[i] <- mean(boot_sample)
}
# Make a histogram of 'boot_means'
hist(boot_means)
Sampling distribution vs bootstrap distribution
The sampling distribution is calculated by resampling from the population, while the bootstrap distribution is calculated by resampling from the sample.
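The contrast can be made concrete with a sketch on simulated data (hypothetical values standing in for the nc weight-gain variable):

```r
# Simulated stand-in data (hypothetical; not the 'nc' data)
set.seed(1)
fake_population <- rnorm(10000, mean = 30, sd = 14)
samp <- sample(fake_population, 973)

# Sampling distribution: means of fresh samples from the population
sampling_means <- replicate(1000, mean(sample(fake_population, 973)))
# Bootstrap distribution: means of resamples (with replacement) from 'samp'
boot_means <- replicate(1000, mean(sample(samp, 973, replace = TRUE)))

# Similar spread, but the bootstrap distribution is centered at the
# sample mean rather than at the population mean
c(mean(sampling_means), mean(boot_means), mean(samp))
```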
The inference function
To construct a 90% bootstrap confidence interval using the percentile method, we take the 5th and 95th percentiles of the bootstrap distribution as the interval bounds.
This function automates the process. By default the inference function takes 10,000 bootstrap samples (instead of the 100 you’ve taken above), creates a bootstrap distribution, and calculates the confidence interval, using the percentile method.
The code for the custom function is not shown.
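For comparison, the percentile method itself takes only a few lines. A sketch on simulated data (hypothetical values standing in for gained_clean):

```r
# Simulated stand-in sample (hypothetical; not 'gained_clean')
set.seed(7)
samp <- rnorm(973, mean = 30, sd = 14)

# Bootstrap: resample with replacement and record each mean
boot_means <- replicate(10000, mean(sample(samp, length(samp), replace = TRUE)))

# For a 90% interval, take the 5th and 95th percentiles of the bootstrap means
quantile(boot_means, c(0.05, 0.95))
```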
# Run the inference function (and plot the results)
boot_awg_90 <- inference(nc$gained,
type = 'ci',
method = 'simulation',
conflevel = 0.9,
est = 'mean',
boot_method = 'perc')## Single mean
## Summary statistics:
## mean = 30.3258 ; sd = 14.2413 ; n = 973
## Bootstrap method: Percentile
## 90 % Bootstrap interval = ( 29.5827 , 31.0771 )
boot_awg_90## $mfrow
## [1] 1 2
##
## $mar
## [1] 4.0 4.0 0.5 0.5
Setting the confidence level
We can easily change the confidence level to 95% by changing the argument conflevel of the inference function.
# Run the inference function (and plot the results)
boot_awg_95 <- inference(nc$gained,
type = 'ci',
method = 'simulation',
conflevel = 0.95,
est = 'mean',
boot_method = 'perc')## Single mean
## Summary statistics:
## mean = 30.3258 ; sd = 14.2413 ; n = 973
## Bootstrap method: Percentile
## 95 % Bootstrap interval = ( 29.4481 , 31.2127 )
boot_awg_95## $mfrow
## [1] 1 2
##
## $mar
## [1] 4.0 4.0 0.5 0.5
Choosing a bootstrap method
We can also use the standard error method. Change the boot_method argument to ‘se’ and run the code.
# Adapt the inference function
inference(nc$gained,
type = 'ci',
method = 'simulation',
conflevel = 0.95,
est = 'mean',
boot_method = 'se')## Single mean
## Summary statistics:
## mean = 30.3258 ; sd = 14.2413 ; n = 973
## Bootstrap method: Standard error; Boot. SE = 0.4532
## 95 % Bootstrap interval = ( 29.4376 , 31.214 )
Setting the parameter of interest
To create an interval for the median instead of the mean, you can change the est argument.
# Adapt the inference function
inference(nc$gained,
type = 'ci',
method = 'simulation',
conflevel = 0.95,
est = 'median',
boot_method = 'se')## Single median
## Summary statistics:
## median = 30 ; n = 973
## Bootstrap method: Standard error; Boot. SE = 0.2467
## 95 % Bootstrap interval = ( 29.5164 , 30.4836 )
The bootstrap distribution of the median weight gain is not a smooth, unimodal, symmetric distribution; it does not yield a reliable confidence interval estimate.
Father’s age
Create a 95% bootstrap confidence interval for the mean age of fathers at the birth of their child using the standard error method.
# Adapt the inference function to create a 95% bootstrap interval for the mean age of fathers
inference(nc$fage,
type = 'ci',
method = 'simulation',
conflevel = 0.95,
est = 'mean',
boot_method = 'se')## Single mean
## Summary statistics:
## mean = 30.2557 ; sd = 6.7638 ; n = 829
## Bootstrap method: Standard error; Boot. SE = 0.2354
## 95 % Bootstrap interval = ( 29.7944 , 30.717 )
Relationships between two variables
When the response variable is numerical and the explanatory variable is categorical, we can evaluate the relationship between the two variables by comparing means (or medians, or other measures) of the numerical response variable across the levels of the explanatory categorical variable.
# Draw your plot here
boxplot(nc$weight ~ nc$habit, horizontal = TRUE)
Both distributions are slightly left skewed: in each group the mean is below the median.
The by function
We can also compare the means of the distributions using the by function to split the weight variable into the habit groups.
# Use the 'by' function here
by(nc$weight, nc$habit, mean)## nc$habit: nonsmoker
## [1] 7.144273
## --------------------------------------------------------
## nc$habit: smoker
## [1] 6.82873
by(nc$weight, nc$habit, median)## nc$habit: nonsmoker
## [1] 7.31
## --------------------------------------------------------
## nc$habit: smoker
## [1] 7.06
by(nc$weight, nc$habit, summary)## nc$habit: nonsmoker
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 6.440 7.310 7.144 8.060 11.750
## --------------------------------------------------------
## nc$habit: smoker
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.690 6.077 7.060 6.829 7.735 9.190
Conditions for inference
Before you can start hypothesis testing, you need to think about the conditions that are necessary for inference.
Obtain sample sizes to check the conditions.
# Use the 'by' function here
by(nc$weight, nc$habit, length)## nc$habit: nonsmoker
## [1] 873
## --------------------------------------------------------
## nc$habit: smoker
## [1] 126
More inference
Use the inference function to evaluate whether there is a difference between the average birth weights of babies born to smoker and non-smoker mothers.
Arguments:
y, which is the response variable that we are interested in: nc$weight.
x, which is the explanatory variable - the grouping variable across the levels of which we're comparing the average value for the response variable, smokers and non-smokers: nc$habit.
est, the parameter we're interested in: "mean" (other options are "median" or "proportion").
type of inference we want: a hypothesis test ("ht") or a confidence interval ("ci").
null value, which in this case is 0, since the null hypothesis sets the two population means equal to each other.
alternative hypothesis, which can be "less", "greater", or "twosided".
method: "theoretical" or "simulation" based.
# The code
inference(y = nc$weight,
x = nc$habit,
est = 'mean',
type = 'ht',
null = 0,
alternative = 'twosided',
method = 'theoretical')## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_nonsmoker = 873, mean_nonsmoker = 7.1443, sd_nonsmoker = 1.5187
## n_smoker = 126, mean_smoker = 6.8287, sd_smoker = 1.3862
## Observed difference between means (nonsmoker-smoker) = 0.3155
## H0: mu_nonsmoker - mu_smoker = 0
## HA: mu_nonsmoker - mu_smoker != 0
## Standard error = 0.134
## Test statistic: Z = 2.359
## p-value = 0.0184
Changing the order
By default the inference() function sets the parameter of interest to be \(\mu_{\mathrm{nonsmoker}}-\mu_{\mathrm{smoker}}\). We can easily change this order by using the order argument.
To set the order to \(\mu_{\mathrm{first}} - \mu_{\mathrm{second}}\), use: order = c("first", "second").
# Levels: nonsmoker smoker
# Add the 'order' argument to the 'inference' function
inference(y = nc$weight,
x = nc$habit,
est = 'mean',
type = 'ht',
null = 0,
alternative = 'twosided',
method = 'theoretical',
order = c('smoker', 'nonsmoker'))## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_smoker = 126, mean_smoker = 6.8287, sd_smoker = 1.3862
## n_nonsmoker = 873, mean_nonsmoker = 7.1443, sd_nonsmoker = 1.5187
## Observed difference between means (smoker-nonsmoker) = -0.3155
## H0: mu_smoker - mu_nonsmoker = 0
## HA: mu_smoker - mu_nonsmoker != 0
## Standard error = 0.134
## Test statistic: Z = -2.359
## p-value = 0.0184
inference(y = nc$weight,
x = nc$habit,
est = 'mean',
type = 'ci',
null = 0,
alternative = 'twosided',
method = 'theoretical',
order = c('smoker', 'nonsmoker'))## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_smoker = 126, mean_smoker = 6.8287, sd_smoker = 1.3862
## n_nonsmoker = 873, mean_nonsmoker = 7.1443, sd_nonsmoker = 1.5187
## Observed difference between means (smoker-nonsmoker) = -0.3155
## Standard error = 0.1338
## 95 % Confidence interval = ( -0.5777 , -0.0534 )
We are 95% confident that babies born to nonsmoker mothers are on average 0.05 to 0.58 pounds heavier at birth than babies born to smoker mothers.
by(nc$mage, nc$mature, summary)## nc$mature: mature mom
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 35.00 35.00 37.00 37.18 38.00 50.00
## --------------------------------------------------------
## nc$mature: younger mom
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13.00 21.00 25.00 25.44 30.00 34.00
The maximum age of younger moms is 34 and minimum age of mature moms is 35.
The General Social Survey
The General Social Survey (GSS) is a sociological survey used to collect data on demographic characteristics and attitudes of residents of the United States. The survey asks many questions, two of which we will focus on for the next few exercises: wordsum (vocabulary test scores) and class (self-reported social class).
wordsum ranges between 0 and 10, and is calculated as the number of vocabulary questions (out of 10) that the respondent answered correctly.
head(gss$wordsum)## [1] 6 9 6 5 6 6
head(gss$class)## [1] MIDDLE WORKING WORKING WORKING WORKING WORKING
## Levels: LOWER MIDDLE UPPER WORKING
Analyze the variables
Create numerical and visual summaries of the two variables individually to better understand their distribution. Also compute summary statistics and visualizations that display the relationship between the two variables.
# Numerical and visual summaries of 'wordsum' and 'class'
summary(gss$wordsum)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 5.000 6.000 6.145 7.000 10.000
summary(gss$class)## LOWER MIDDLE UPPER WORKING
## 41 331 16 407
par(mfrow = c(2,1))
hist(gss$wordsum)
boxplot(gss$wordsum, horizontal = TRUE)
par(mfrow = c(1,1))
# Numerical and visual summaries of their relations
by(gss$wordsum, gss$class, mean)## gss$class: LOWER
## [1] 5.073171
## --------------------------------------------------------
## gss$class: MIDDLE
## [1] 6.761329
## --------------------------------------------------------
## gss$class: UPPER
## [1] 6.1875
## --------------------------------------------------------
## gss$class: WORKING
## [1] 5.749386
boxplot(gss$wordsum ~ gss$class)
ANOVA test
The ANOVA method can test for a difference between the average vocabulary test scores among the various classes.
# The test
inference(y = gss$wordsum,
x = gss$class,
est = 'mean',
method = 'theoretical',
type = 'ht',
alternative = 'greater')## Response variable: numerical, Explanatory variable: categorical
## ANOVA
## Summary statistics:
## n_LOWER = 41, mean_LOWER = 5.0732, sd_LOWER = 2.2404
## n_MIDDLE = 331, mean_MIDDLE = 6.7613, sd_MIDDLE = 1.8863
## n_UPPER = 16, mean_UPPER = 6.1875, sd_UPPER = 2.3443
## n_WORKING = 407, mean_WORKING = 5.7494, sd_WORKING = 1.8652
## H_0: All means are equal.
## H_A: At least one mean is different.
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## x 3 236.56 78.855 21.735 1.561e-13
## Residuals 791 2869.80 3.628
##
## Pairwise tests: t tests with pooled SD
## LOWER MIDDLE UPPER
## MIDDLE 0.0000 NA NA
## UPPER 0.0475 0.2396 NA
## WORKING 0.0306 0.0000 0.3671
Let’s examine the output:
Before we get to the ANOVA output we are presented with a series of summary statistics and the associated hypotheses.
Modified \(\alpha\) (\(\alpha^{*}\)):
\[\alpha^{*} = \frac{\alpha}{6} = 0.0083\]
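The divisor 6 is the number of pairwise comparisons among the four classes:

```r
# Bonferroni correction: with k = 4 classes there are choose(4, 2) = 6
# pairwise comparisons, so each test uses alpha / 6
alpha <- 0.05
k <- 4
alpha_star <- alpha / choose(k, 2)
round(alpha_star, 4)  # 0.0083
```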
Of the pairwise tests in the ANOVA output, only the MIDDLE-LOWER and MIDDLE-WORKING comparisons have p-values below \(\alpha^{*}\), so only for those pairs do we reject H_0 at the modified significance level. The remaining p-values do not allow us to reject the null hypothesis.
In August of 2012, news outlets ranging from the Washington Post to the Huffington Post ran a story about the rise of atheism in America.
The source for the story was a poll that asked people, “Irrespective of whether you attend a place of worship or not, would you say you are a religious person, not a religious person or a convinced atheist?”
This type of question, which asks people to classify themselves in one way or another, is common in polling and generates categorical data.
The data
summary(atheism)## nationality response year
## Pakistan : 5409 atheist : 5498 Min. :2005
## France : 3359 non-atheist:82534 1st Qu.:2005
## Korea, Rep (South): 3047 Median :2012
## Ghana : 2995 Mean :2009
## Macedonia : 2418 3rd Qu.:2012
## Peru : 2414 Max. :2012
## (Other) :68390
str(atheism)## 'data.frame': 88032 obs. of 3 variables:
## $ nationality: Factor w/ 57 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ response : Factor w/ 2 levels "atheist","non-atheist": 2 2 2 2 2 2 2 2 2 2 ...
## $ year : int 2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
head(atheism)## nationality response year
## 1 Afghanistan non-atheist 2012
## 2 Afghanistan non-atheist 2012
## 3 Afghanistan non-atheist 2012
## 4 Afghanistan non-atheist 2012
## 5 Afghanistan non-atheist 2012
## 6 Afghanistan non-atheist 2012
A poll conducted by WIN-Gallup International surveyed 51,927 people (a sample) from 57 countries and gathered information face to face, by telephone, and through the Internet: 'Global Index of Religiosity and Atheism', WIN-Gallup, 2012.
From pp. 15 and 16 of the report, Table 6 shows the sample and response percentages on individual persons from 57 countries.
Atheists in the US
Take a look at the estimated proportion of atheists in the United States (in 2012).
# Create the 'us12' subset
us12 <- subset(atheism, atheism$nationality == 'United States' & atheism$year == 2012)
# Calculate the proportion of atheist responses:
proportion <- nrow(subset(us12, us12$response == 'atheist'))/nrow(us12)
# Print the proportion
proportion## [1] 0.0499002
round(proportion, 2)## [1] 0.05
Table 6 of the report provides statistics, that is, calculations made from the sample of 51,927 people.
The answer to the question "What proportion of people in your sample reported being atheists?" is a statistic, while the question "What proportion of people on earth would report being atheists?" is answered with an estimate of the parameter.
The inferential tools for estimating population proportion are analogous to those used for means in the previous lab: the confidence interval and the hypothesis test.
Although formal confidence intervals and hypothesis tests don't show up in the report, suggestions of inference appear at the bottom of page 7: "In general, the error margin for surveys of this kind is ±3-5% at 95% confidence."
inference(us12$response,
est = "proportion",
type = "ci",
method = "theoretical",
success = "atheist")## Single proportion -- success: atheist
## Summary statistics:
## p_hat = 0.0499 ; n = 1002
## Check conditions: number of successes = 50 ; number of failures = 952
## Standard error = 0.0069
## 95 % Confidence interval = ( 0.0364 , 0.0634 )
Standard error = 0.0069.
Margin of error (95%) = 0.0069*1.96 = 0.013524.
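Both numbers follow from plugging the US sample values into the standard formulas for the standard error and margin of error of a proportion:

```r
# Plug the US sample values into SE = sqrt(p(1-p)/n) and ME = 1.96 * SE
p_hat <- 0.0499
n <- 1002
se <- sqrt(p_hat * (1 - p_hat) / n)
me <- 1.96 * se
round(se, 4)  # 0.0069
round(me, 4)  # 0.0135
```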
What about India?
# The subset for India for 2012
india <- subset(atheism, atheism$nationality == 'India' & atheism$year == 2012)
# The analysis using the 'inference' function
inference(india$response,
est = 'proportion',
type = 'ci',
method = 'theoretical',
success = 'atheist')## Single proportion -- success: atheist
## Summary statistics:
## p_hat = 0.0302 ; n = 1092
## Check conditions: number of successes = 33 ; number of failures = 1059
## Standard error = 0.0052
## 95 % Confidence interval = ( 0.0201 , 0.0404 )
Confidence intervals for the proportion of atheists in India (in 2012).
And China?
# The subset for China for 2012
china <- subset(atheism, atheism$nationality == 'China' & atheism$year == 2012)
# The analysis using the 'inference' function
inference(china$response,
est = 'proportion',
type = 'ci',
method = 'theoretical',
success = 'atheist')## Single proportion -- success: atheist
## Summary statistics:
## p_hat = 0.47 ; n = 500
## Check conditions: number of successes = 235 ; number of failures = 265
## Standard error = 0.0223
## 95 % Confidence interval = ( 0.4263 , 0.5137 )
Confidence intervals for the proportion of atheists in China (in 2012).
The margin of error
The formula for the standard error:
\[SE = \sqrt{p(1-p)/n}\]
The formula for the margin of error for a 95% confidence interval:
\[ME = 1.96 \times SE = 1.96 \times \sqrt {p(1-p)/n}\]
Since the population proportion p appears in this ME formula, it should make sense that the margin of error is in some way dependent on the population proportion.
We can visualize this relationship by creating a plot of me vs p.
# The first step is to make a vector p that is a sequence from 0 to 1 with each number separated by 0.01
n <- 1000
p <- seq(0, 1, 0.01)
# We then create a vector of the margin of error (me) associated with each of these values of p using the familiar approximate formula (ME = 2 X SE)
me <- 2 * sqrt(p * (1 - p)/n)
# 1.96
# Finally, plot the two vectors against each other to reveal their relationship
plot(me ~ p)
me reaches a minimum at p == 0 and p == 1.
me is maximized when p == 0.5.
Atheism in Spain
Table 4 on page 13 of the report summarizes survey results from 2005 and 2012 for 39 countries.
# Take the 'spain' subset
spain <- subset(atheism, atheism$nationality == 'Spain')
# Note: no year filter; we keep both 2005 and 2012
# Calculate the proportion of atheists from the 'spain' subset
proportion <- nrow(subset(spain, spain$response == 'atheist'))/nrow(spain)
# Use the inference function
inference(spain$response,
spain$year,
est = 'proportion',
type = 'ci',
method = 'theoretical',
success = 'atheist')## Response variable: categorical, Explanatory variable: categorical
## Two categorical variables
## Difference between two proportions -- success: atheist
## Summary statistics:
## x
## y 2005 2012 Sum
## atheist 115 103 218
## non-atheist 1031 1042 2073
## Sum 1146 1145 2291
## Observed difference between proportions (2005-2012) = 0.0104
##
## Check conditions:
## 2005 : number of successes = 115 ; number of failures = 1031
## 2012 : number of successes = 103 ; number of failures = 1042
## Standard error = 0.0123
## 95 % Confidence interval = ( -0.0136 , 0.0344 )
There is no convincing evidence that Spain has seen a change in its atheism index between 2005 and 2012.
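This interval can also be computed by hand from the counts in the table above, using the two-proportion standard error \(SE = \sqrt{p_1(1-p_1)/n_1 + p_2(1-p_2)/n_2}\):

```r
# Spain: atheist counts and sample sizes from the summary table above
p1 <- 115/1146; n1 <- 1146   # 2005
p2 <- 103/1145; n2 <- 1145   # 2012
d  <- p1 - p2                # observed difference (2005 - 2012)
se <- sqrt(p1 * (1 - p1)/n1 + p2 * (1 - p2)/n2)
round(c(d - 1.96 * se, d + 1.96 * se), 4)  # matches ( -0.0136 , 0.0344 )
```

Because the interval straddles zero, the data do not provide convincing evidence of a change.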
Rising in the US?
Will this situation be the same in the United States?
# Take the 'us' subset
us <- subset(atheism, atheism$nationality == 'United States')
# Note: no year filter; we keep both 2005 and 2012
# Calculate the proportion of atheists from the 'us' subset
proportion <- nrow(subset(us, us$response == 'atheist'))/nrow(us)
# Use the inference function
inference(us$response,
us$year,
est = 'proportion',
type = 'ci',
method = 'theoretical',
success = 'atheist')## Response variable: categorical, Explanatory variable: categorical
## Two categorical variables
## Difference between two proportions -- success: atheist
## Summary statistics:
## x
## y 2005 2012 Sum
## atheist 10 50 60
## non-atheist 992 952 1944
## Sum 1002 1002 2004
## Observed difference between proportions (2005-2012) = -0.0399
##
## Check conditions:
## 2005 : number of successes = 10 ; number of failures = 992
## 2012 : number of successes = 50 ; number of failures = 952
## Standard error = 0.0076
## 95 % Confidence interval = ( -0.0547 , -0.0251 )
There is convincing evidence that the United States has seen a change in its atheism index between 2005 and 2012.
In Table 4 of the report, in how many of those (39) countries would you expect to detect a change (at a significance level of 0.05) simply by chance?
39 x 0.05 = 1.95, so we would expect to detect a change in about two countries simply by chance.
Moneyball
The movie Moneyball focuses on the “quest for the secret of success in baseball”. It follows a low-budget team, the Oakland Athletics, who believed that underused statistics, such as a player’s ability to get on base, better predict the ability to score runs than typical statistics like home runs, RBIs (runs batted in), and batting average. Obtaining players who excelled in these underused statistics turned out to be much more affordable for the team.
The data come from all 30 Major League Baseball teams; we examine the linear relationship between runs scored in a season and a number of other player statistics. Our aim is to summarize these relationships both graphically and numerically in order to find which variable, if any, best predicts a team's runs scored in a season.
head(mlb11)## team runs at_bats hits homeruns bat_avg strikeouts
## 1 Texas Rangers 855 5659 1599 210 0.283 930
## 2 Boston Red Sox 875 5710 1600 203 0.280 1108
## 3 Detroit Tigers 787 5563 1540 169 0.277 1143
## 4 Kansas City Royals 730 5672 1560 129 0.275 1006
## 5 St. Louis Cardinals 762 5532 1513 162 0.273 978
## 6 New York Mets 718 5600 1477 108 0.264 1085
## stolen_bases wins new_onbase new_slug new_obs
## 1 143 96 0.340 0.460 0.800
## 2 102 90 0.349 0.461 0.810
## 3 49 95 0.340 0.434 0.773
## 4 153 71 0.329 0.415 0.744
## 5 57 90 0.341 0.425 0.766
## 6 130 77 0.335 0.391 0.725
In addition to runs scored, there are seven traditionally used variables in the mlb11 dataset: at_bats, hits, homeruns, bat_avg (batting average), strikeouts, stolen_bases, and wins.
There are also three newer variables: new_onbase (on base percentage), new_slug (slugging percentage), and new_obs (on-base plus slugging).
For the first portion of the analysis, we’ll consider the seven traditional variables.
Quantifying the strength of a linear relationship
# Calculate the correlation between runs and at_bats
correlation <- cor(mlb11$runs, mlb11$at_bats)
# Print the result
correlation## [1] 0.610627
# Plot it
plot(mlb11$runs ~ mlb11$at_bats)The relationship is positive, linear, and moderately strong. One of the potential outliers is a team with approximately 5520 at bats.
Plot sum of squares – the function
plot_ss <- function(x, y, x1, y1, x2, y2, showSquares = FALSE, leastSquares = FALSE){
  plot(y ~ x, asp = 1)
  if(leastSquares){
    m1 <- lm(y ~ x)
    y.hat <- m1$fit
  } else{
    # Draw the two user-supplied points and fit a line through them
    points(x1, y1, pch = 21, col = 'red3', bg = 'red3', cex = 1.8)
    points(x2, y2, pch = 21, col = 'red3', bg = 'red3', cex = 1.8)
    pts <- data.frame("x" = c(x1, x2), "y" = c(y1, y2))
    m1 <- lm(y ~ x, data = pts)
    y.hat <- predict(m1, newdata = data.frame(x))
  }
  r <- y - y.hat
  abline(m1)
  oSide <- x - r
  LLim <- par()$usr[1]
  RLim <- par()$usr[2]
  # Move squares to the other side of the line to avoid the margins
  oSide[oSide < LLim | oSide > RLim] <- c(x + r)[oSide < LLim | oSide > RLim]
  n <- length(y.hat)
  for(i in 1:n){
    # Dashed blue segment for each residual
    lines(rep(x[i], 2), c(y[i], y.hat[i]), lty = 2, col = "blue")
    if(showSquares){
      lines(rep(oSide[i], 2), c(y[i], y.hat[i]), lty = 3, col = "orange")
      lines(c(oSide[i], x[i]), rep(y.hat[i], 2), lty = 3, col = "orange")
      lines(c(oSide[i], x[i]), rep(y[i], 2), lty = 3, col = "orange")
    }
  }
  SS <- round(sum(r^2), 3)
  cat("\r ")
  print(m1)
  cat("Sum of Squares: ", SS)
}
Estimating the best fit
This function will first draw a scatterplot of the first two arguments x and y. Then it draws two points (x1, y1) and (x2, y2) that are shown as red circles. These points are used to draw the line that represents the regression estimate. The line you specified is shown in black and the residuals in blue. Note that there are 30 residuals, one for each of the 30 observations. Recall that the residuals are the difference between the observed values and the values predicted by the line:
\[e_i = y_i - \hat{y_i}\]
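This definition can be verified directly in R: the residuals stored in a fitted lm object are exactly the observed values minus the fitted values (toy data here, standing in for mlb11):

```r
# Residuals from lm() are exactly y - fitted values
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
m <- lm(y ~ x)
e <- y - fitted(m)                       # residuals computed by the formula
all.equal(unname(e), unname(resid(m)))   # TRUE
```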
# Use the 'plot_ss' function to draw an estimated regression line. Two points are given to get you started
x1 <- 5400
y1 <- 750
x2 <- 5700
y2 <- 650
# Plot sum of squares
plot_ss(x = mlb11$at_bats, y = mlb11$runs, x1, y1, x2, y2, showSquares = FALSE)##
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept) x
## 2550.0000 -0.3333
##
## Sum of Squares: 302572.3
The most common way to do linear regression is to select the line that minimizes the sum of squared residuals. To visualize the squared residuals, you can rerun the plot_ss command and add the argument showSquares = TRUE.
# Use the 'plot_ss' function to draw an estimated regression line. Two points are given to get you started
x1 <- 5400
y1 <- 750
x2 <- 5700
y2 <- 650
# Plot sum of squares
plot_ss(x = mlb11$at_bats, y = mlb11$runs, x1, y1, x2, y2, showSquares = TRUE)##
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept) x
## 2550.0000 -0.3333
##
## Sum of Squares: 302572.3
The best fit
Removing the coordinates for estimating…
# Adapt the function to plot the best fitting line
plot_ss(x = mlb11$at_bats, y = mlb11$runs, leastSquares = TRUE)##
## Call:
## lm(formula = y ~ x)
##
## Coefficients:
## (Intercept) x
## -2789.2429 0.6305
##
## Sum of Squares: 123721.9
The linear model
It is rather cumbersome to try to get the correct least squares line, i.e. the line that minimizes the sum of squared residuals, through trial and error. Instead we can use the lm function.
# Use the 'lm' function to make the linear model
m1 <- lm(runs ~ at_bats, data = mlb11)
# Print the model
m1##
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
##
## Coefficients:
## (Intercept) at_bats
## -2789.2429 0.6305
Understanding the output of lm
Access more information (coefficients).
# Use the 'lm' function to make the linear model
m1 <- lm(runs ~ at_bats, data = mlb11)
# The summary of 'm1'
summary(m1)##
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -125.58 -47.05 -16.59 54.40 176.87
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2789.2429 853.6957 -3.267 0.002871 **
## at_bats 0.6305 0.1545 4.080 0.000339 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 66.47 on 28 degrees of freedom
## Multiple R-squared: 0.3729, Adjusted R-squared: 0.3505
## F-statistic: 16.65 on 1 and 28 DF, p-value: 0.0003388
Breaking it down
# Use the 'lm' function to make the linear model, then print the summary
m2 <- lm(runs ~ homeruns, data = mlb11)
summary(m2)##
## Call:
## lm(formula = runs ~ homeruns, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -91.615 -33.410 3.231 24.292 104.631
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 415.2389 41.6779 9.963 1.04e-10 ***
## homeruns 1.8345 0.2677 6.854 1.90e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 51.29 on 28 degrees of freedom
## Multiple R-squared: 0.6266, Adjusted R-squared: 0.6132
## F-statistic: 46.98 on 1 and 28 DF, p-value: 1.9e-07
For each additional home run, the model predicts 1.83 more runs, on average.
Prediction and prediction errors
abline takes the slope and intercept as arguments. Here, we use a shortcut by providing the model itself, which contains both parameter estimates: abline(m1).
# Create a scatterplot
# x, y
plot(mlb11$at_bats, mlb11$runs)
# The linear model:
# y ~ x
m1 <- lm(runs ~ at_bats, data = mlb11)
# Plot the least squares line
abline(m1, lty = 3, lwd = 2, col = 'red3')
How many runs would the model predict for a team with 5,579 at-bats?
summary(m1)[[4]][1]## [1] -2789.243
summary(m1)[[4]][2]## [1] 0.63055
est_runs <- summary(m1)[[4]][1] + summary(m1)[[4]][2] * 5579
true_runs <- subset(mlb11$runs, mlb11$at_bats == 5579)
est_runs## [1] 728.5955
true_runs## [1] 713
true_runs - est_runs## [1] -15.59552
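Extracting coefficients with summary(m1)[[4]] works, but predict() is the idiomatic route; a self-contained sketch with toy data standing in for mlb11:

```r
# Manual prediction from the coefficients vs. predict()
x <- c(5400, 5500, 5600, 5700)
y <- c(700, 720, 740, 780)
m <- lm(y ~ x)
manual <- coef(m)[1] + coef(m)[2] * 5579
pred <- predict(m, newdata = data.frame(x = 5579))
all.equal(unname(manual), unname(pred))  # TRUE
```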
To assess whether the linear model is reliable, you need to check for three things: (1) linearity, (2) nearly normal residuals, and (3) constant variability.
plot(m1$residuals ~ mlb11$at_bats)
abline(h = 0, lty = 3, lwd = 2, col = 'red3')
The residuals show a curved pattern. Overall, the constant variability condition (3) is met.
To check ‘nearly normal residuals’, plot a normal probability chart of the residuals.
qqnorm(m1$residuals)
qqline(m1$residuals)
We can also look at a histogram.
hist(m1$residuals)
The residuals are fairly symmetric, with only a slightly longer tail on the right, so it is appropriate to deem the nearly normal residuals condition met (conditions 1 and 2).
In conclusion, among the traditional variables, batting average is the best predictor of runs.
attach(mlb11)
par(mfrow = c(1, 2))
# at_bats
plot(runs ~ at_bats)
abline(lm(runs ~ at_bats, data = mlb11), lty = 3, lwd = 2, col = 'red3')
# hits
plot(runs ~ hits)
abline(lm(runs ~ hits, data = mlb11), lty = 3, lwd = 2, col = 'red3')
# homeruns
plot(runs ~ homeruns)
abline(lm(runs ~ homeruns, data = mlb11), lty = 3, lwd = 2, col = 'red3')
# bat_avg (batting average)
plot(runs ~ bat_avg)
abline(lm(runs ~ bat_avg, data = mlb11), lty = 3, lwd = 2, col = 'red3')
# strikeouts
plot(runs ~ strikeouts)
abline(lm(runs ~ strikeouts, data = mlb11), lty = 3, lwd = 2, col = 'red3')
# stolen_bases
plot(runs ~ stolen_bases)
abline(lm(runs ~ stolen_bases, data = mlb11), lty = 3, lwd = 2, col = 'red3')
# wins
plot(runs ~ wins)
abline(lm(runs ~ wins, data = mlb11), lty = 3, lwd = 2, col = 'red3')
# new_onbase (on base percentage)
plot(runs ~ new_onbase)
abline(lm(runs ~ new_onbase, data = mlb11), lty = 3, lwd = 2, col = 'red3')
# new_slug (slugging percentage)
plot(runs ~ new_slug)
abline(lm(runs ~ new_slug, data = mlb11), lty = 3, lwd = 2, col = 'red3')
# new_obs (on-base plus slugging)
plot(runs ~ new_obs)
abline(lm(runs ~ new_obs, data = mlb11), lty = 3, lwd = 2, col = 'red3')
par(mfrow = c(1, 1))
detach(mlb11)
The data
Many college courses conclude by giving students the opportunity to evaluate the course and the instructor anonymously. However, the use of these student evaluations as an indicator of course quality and teaching effectiveness is often criticized because these measures may reflect the influence of non-teaching related characteristics, such as the physical appearance of the instructor.
The article titled, “Beauty in the classroom: instructors’ pulchritude and putative pedagogical productivity” (Hamermesh and Parker, 2005) found that instructors who are viewed to be better looking receive higher instructional ratings.
The data were gathered from end of semester student evaluations for a large sample of professors from the University of Texas at Austin. In addition, six students rated the professors’ physical appearance.
This is a slightly modified version of the original dataset that was released as part of the replication data for Data Analysis Using Regression and Multilevel/Hierarchical Models (Gelman and Hill, 2007).
Different variables in the data frame, and their meaning:
score: average professor evaluation score: (1) very unsatisfactory - (5) excellent.
rank: rank of professor: teaching, tenure track, tenured.
ethnicity: ethnicity of professor: not minority, minority.
gender: gender of professor: female, male.
language: language of school where professor received education: english or non-english.
age: age of professor.
cls_perc_eval: percent of students in class who completed evaluation.
cls_did_eval: number of students in class who completed evaluation.
cls_students: total number of students in class.
cls_level: class level: lower, upper.
cls_profs: number of professors teaching sections in course in sample: single, multiple.
cls_credits: number of credits of class: one credit (lab, PE, etc.), multi credit.
bty_f1lower: beauty rating of professor from lower level female: (1) lowest - (10) highest.
bty_f1upper: beauty rating of professor from upper level female: (1) lowest - (10) highest.
bty_f2upper: beauty rating of professor from second upper level female: (1) lowest - (10) highest.
bty_m1lower: beauty rating of professor from lower level male: (1) lowest - (10) highest.
bty_m1upper: beauty rating of professor from upper level male: (1) lowest - (10) highest.
bty_m2upper: beauty rating of professor from second upper level male: (1) lowest - (10) highest.
bty_avg: average beauty rating of professor.
pic_outfit: outfit of professor in picture: not formal, formal. (not used in this analysis)
pic_color: color of professor’s picture: color, black & white. (not used in this analysis)
The original research question posed in the paper is whether beauty leads directly to differences in course evaluations. It should be revised to “Is there an association between beauty and course evaluations?”.
summary(evals$score)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.300 3.800 4.300 4.175 4.600 5.000
par(mfrow = c(2, 1))
hist(evals$score, main = '')
boxplot(evals$score, horizontal = TRUE)
par(mfrow = c(1, 1))
The left skew of the data suggests that students are more likely to rate professors highly.
Visualizing relationships
More analysis.
# Create a scatterplot for 'age' vs 'bty_avg'
plot(evals$age, evals$bty_avg)
# Create a boxplot for 'age' and 'gender'
boxplot(evals$age ~ evals$gender)
# Create a mosaic plot for 'rank' and 'gender'
mosaicplot(evals$rank ~ evals$gender)
Simple linear regression
The fundamental phenomenon suggested by the study is that better looking teachers are evaluated more favorably.
# Create a scatterplot for 'score' and 'bty_avg'
plot(evals$bty_avg, evals$score)
Replot the scatterplot, but this time use the jitter function.
The jitter function
# Apply 'jitter' on the 'bty_avg' or 'score' variable of your initial plot
plot(jitter(evals$bty_avg), evals$score)
The hexbin plot
library(hexbin)
# Draw a hexbinplot
hexbinplot(evals$score ~ evals$bty_avg, aspect = 0.5)
More than natural variation?
Fit a linear model called m_bty to predict average professor score by average beauty rating.
# Your initial plot
plot(evals$score ~ jitter(evals$bty_avg))
# Construct the linear model
m_bty <- lm(score ~ bty_avg, data = evals)
# Plot your linear model on your plot
abline(m_bty, lty = 3, lwd = 2, col = 'red3')
Average beauty score is a statistically significant predictor, but the slope is small enough that it may not be a practically significant one.
plot(m_bty$residuals ~ evals$bty_avg)
abline(h = 0, lty = 3, lwd = 2, col = 'red3')
qqnorm(m_bty$residuals)
qqline(m_bty$residuals)
hist(m_bty$residuals)
Multiple linear regression
The dataset contains several variables on the beauty score of the professor: individual ratings from each of the six students who were asked to score the physical appearance of the professors and the average of these six scores.
# Your scatterplot
plot(evals$bty_f1lower, evals$bty_avg)
# The correlation
cor(evals$bty_f1lower, evals$bty_avg)## [1] 0.8439112
The relationship between all beauty variables
The relationship is quite strong; after all, the average score is calculated using the individual scores. We can actually take a look at the relationships between all beauty variables (columns 13 through 19).
# The relationships between all beauty variables
plot(evals[, 13:19])
Taking into account gender
In order to see if beauty is still a significant predictor of professor score after we account for the gender of the professor, we can add the gender term to the model.
# Your new linear model
m_bty_gen <- lm(score ~ bty_avg + gender, data = evals)
# Study the outcome
summary(m_bty_gen)##
## Call:
## lm(formula = score ~ bty_avg + gender, data = evals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8305 -0.3625 0.1055 0.4213 0.9314
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.74734 0.08466 44.266 < 2e-16 ***
## bty_avg 0.07416 0.01625 4.563 6.48e-06 ***
## gendermale 0.17239 0.05022 3.433 0.000652 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5287 on 460 degrees of freedom
## Multiple R-squared: 0.05912, Adjusted R-squared: 0.05503
## F-statistic: 14.45 on 2 and 460 DF, p-value: 8.177e-07
p-values and parameter estimates should only be trusted if the conditions for the regression are reasonable. Checking the diagnostic plots against a linear regression's principal assumptions, we can conclude that the conditions for this model are reasonable.
Custom function to draw multiLines
The code for the custom function is not shown.
Gendermale
Note that the estimate for gender is called gendermale in your summary output.
Whenever we introduce a categorical variable, we recode gender from having the values of female and male to being an indicator variable called gendermale that takes a value of 0 for females and a value of 1 for males. (Such variables are often referred to as ‘dummy’ variables.)
As a result, for females, the parameter estimate is multiplied by zero, leaving the intercept and slope form familiar from simple regression.
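This recoding happens automatically via the model matrix; model.matrix() shows the 0/1 indicator column that lm() actually uses (here on a small hypothetical factor, not the evals data):

```r
# How R encodes a two-level factor as a dummy variable
gender <- factor(c('female', 'male', 'male', 'female'))
model.matrix(~ gender)   # column 'gendermale' is 1 for males, 0 for females
```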
# Your plot
multiLines(m_bty_gen)
For two professors (one male and one female) who received the same beauty rating, the male professor is predicted to have a higher course evaluation score than the female.
Switching rank and gender
Create a new model called m_bty_rank with gender removed and rank added in.
# Your linear model with rank and average beauty
m_bty_rank <- lm(score ~ bty_avg + rank, data = evals)
# View the regression output
summary(m_bty_rank)##
## Call:
## lm(formula = score ~ bty_avg + rank, data = evals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8713 -0.3642 0.1489 0.4103 0.9525
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.98155 0.09078 43.860 < 2e-16 ***
## bty_avg 0.06783 0.01655 4.098 4.92e-05 ***
## ranktenure track -0.16070 0.07395 -2.173 0.0303 *
## ranktenured -0.12623 0.06266 -2.014 0.0445 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5328 on 459 degrees of freedom
## Multiple R-squared: 0.04652, Adjusted R-squared: 0.04029
## F-statistic: 7.465 on 3 and 459 DF, p-value: 6.88e-05
Correct order of the three levels of rank if we were to order them from lowest predicted course evaluation score to highest: tenure track, tenured, teaching.
The search for the best model
We will start with a full model that predicts professor score based on rank, ethnicity, gender, language of the university where they got their degree, age, proportion of students that filled out evaluations, class size, course level, number of professors, number of credits, and average beauty rating.
Note that we do not include the pic_outfit or pic_color variables in the full model, because the original study states that these variables were used in a different analysis, evaluating whether they relate to how the six students involved in the study scored the professors' beauty (not to how students evaluate their professors in class).
# The full model
m_full <- lm(score ~ rank + ethnicity + gender + language + age + cls_perc_eval + cls_students + cls_level + cls_profs + cls_credits + bty_avg, data = evals)
# View the regression output
summary(m_full)##
## Call:
## lm(formula = score ~ rank + ethnicity + gender + language + age +
## cls_perc_eval + cls_students + cls_level + cls_profs + cls_credits +
## bty_avg, data = evals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.84482 -0.31367 0.08559 0.35732 1.10105
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.5305036 0.2408200 14.660 < 2e-16 ***
## ranktenure track -0.1070121 0.0820250 -1.305 0.192687
## ranktenured -0.0450371 0.0652185 -0.691 0.490199
## ethnicitynot minority 0.1869649 0.0775329 2.411 0.016290 *
## gendermale 0.1786166 0.0515346 3.466 0.000579 ***
## languagenon-english -0.1268254 0.1080358 -1.174 0.241048
## age -0.0066498 0.0030830 -2.157 0.031542 *
## cls_perc_eval 0.0056996 0.0015514 3.674 0.000268 ***
## cls_students 0.0004455 0.0003585 1.243 0.214596
## cls_levelupper 0.0187105 0.0555833 0.337 0.736560
## cls_profssingle -0.0085751 0.0513527 -0.167 0.867458
## cls_creditsone credit 0.5087427 0.1170130 4.348 1.7e-05 ***
## bty_avg 0.0612651 0.0166755 3.674 0.000268 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.504 on 450 degrees of freedom
## Multiple R-squared: 0.1635, Adjusted R-squared: 0.1412
## F-statistic: 7.331 on 12 and 450 DF, p-value: 2.406e-12
Non-minority professors (identified in the results as ethnicitynot minority) are expected on average to score 0.19 points higher than minority professors, all else held constant.
Eliminating variables from the model – p-value selection
Drop the variable with the highest p-value in the m_full model.
# The full model
m_full <- lm(score ~ rank + ethnicity + gender + language + age + cls_perc_eval + cls_students + cls_level + cls_profs + cls_credits + bty_avg, data = evals)
# Your new model (w/o cls_profs, the variable with the highest p-value)
m_new <- lm(score ~ rank + ethnicity + gender + language + age + cls_perc_eval + cls_students + cls_level + cls_credits + bty_avg, data = evals)
# View the regression output
summary(m_new)##
## Call:
## lm(formula = score ~ rank + ethnicity + gender + language + age +
## cls_perc_eval + cls_students + cls_level + cls_credits +
## bty_avg, data = evals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.85048 -0.31394 0.08052 0.35956 1.10356
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.5286297 0.2402990 14.684 < 2e-16 ***
## ranktenure track -0.1073638 0.0819096 -1.311 0.190606
## ranktenured -0.0453744 0.0651169 -0.697 0.486278
## ethnicitynot minority 0.1893718 0.0760992 2.488 0.013189 *
## gendermale 0.1780270 0.0513581 3.466 0.000578 ***
## languagenon-english -0.1265737 0.1079088 -1.173 0.241427
## age -0.0066619 0.0030788 -2.164 0.031006 *
## cls_perc_eval 0.0056790 0.0015448 3.676 0.000265 ***
## cls_students 0.0004493 0.0003573 1.257 0.209319
## cls_levelupper 0.0183743 0.0554870 0.331 0.740687
## cls_creditsone credit 0.5109162 0.1161614 4.398 1.36e-05 ***
## bty_avg 0.0611497 0.0166432 3.674 0.000267 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5035 on 451 degrees of freedom
## Multiple R-squared: 0.1635, Adjusted R-squared: 0.1431
## F-statistic: 8.012 on 11 and 451 DF, p-value: 8.303e-13
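This drop-the-largest-p-value procedure can be automated. A rough sketch, using the built-in mtcars data in place of evals; the helper backward_p is our own, not part of the lab, and it treats each coefficient as a term (numeric predictors only; factor terms would need drop1() instead):

```r
# Backward elimination by p-value (numeric predictors)
backward_p <- function(formula, data, alpha = 0.05) {
  m <- lm(formula, data = data)
  repeat {
    ct <- summary(m)$coefficients
    p  <- ct[-1, 4]                     # p-values, intercept excluded
    names(p) <- rownames(ct)[-1]
    if (length(p) == 0 || max(p) <= alpha) break
    # Refit without the least significant predictor
    m <- update(m, as.formula(paste(". ~ . -", names(which.max(p)))))
  }
  m
}
m_final <- backward_p(mpg ~ wt + hp + qsec + drat, data = mtcars)
summary(m_final)$coefficients
```

By construction, every predictor remaining in the final model is significant at the chosen alpha.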
Eliminating variables from the model – adjusted R-squared selection
Now create a new model by dropping the variable whose removal yields the highest improvement in the adjusted R-squared.
# The full model
m_full <- lm(score ~ rank + ethnicity + gender + language + age + cls_perc_eval + cls_students + cls_level + cls_profs + cls_credits + bty_avg, data = evals)
summary(m_full)$adj.r.squared## [1] 0.1412172
# Remove rank
m1 <- lm(score ~ ethnicity + gender + language + age + cls_perc_eval + cls_students + cls_level + cls_profs + cls_credits + bty_avg, data = evals)
summary(m1)$adj.r.squared## [1] 0.1417823
# Remove ethnicity
m2 <- lm(score ~ rank + gender + language + age + cls_perc_eval + cls_students + cls_level + cls_profs + cls_credits + bty_avg, data = evals)
summary(m2)$adj.r.squared## [1] 0.1320486
# Remove gender
m3 <- lm(score ~ rank + ethnicity + language + age + cls_perc_eval + cls_students + cls_level + cls_profs + cls_credits + bty_avg, data = evals)
summary(m3)$adj.r.squared## [1] 0.1202468
# Remove language
m4 <- lm(score ~ rank + ethnicity + gender + age + cls_perc_eval + cls_students + cls_level + cls_profs + cls_credits + bty_avg, data = evals)
summary(m4)$adj.r.squared## [1] 0.1404972
# Remove age
m5 <- lm(score ~ rank + ethnicity + gender + language + cls_perc_eval + cls_students + cls_level + cls_profs + cls_credits + bty_avg, data = evals)
summary(m5)$adj.r.squared## [1] 0.1342627
# Remove cls_perc_eval
m6 <- lm(score ~ rank + ethnicity + gender + language + age + cls_students + cls_level + cls_profs + cls_credits + bty_avg, data = evals)
summary(m6)$adj.r.squared## [1] 0.1174213
# Remove cls_students
m7 <- lm(score ~ rank + ethnicity + gender + language + age + cls_perc_eval + cls_level + cls_profs + cls_credits + bty_avg, data = evals)
summary(m7)$adj.r.squared## [1] 0.1401804
# Remove cls_level
m8 <- lm(score ~ rank + ethnicity + gender + language + age + cls_perc_eval + cls_students + cls_profs + cls_credits + bty_avg, data = evals)
summary(m8)$adj.r.squared## [1] 0.1429056
# Remove cls_profs
m9 <- lm(score ~ rank + ethnicity + gender + language + age + cls_perc_eval + cls_students + cls_level + cls_credits + bty_avg, data = evals)
summary(m9)$adj.r.squared## [1] 0.1430683
# Remove cls_credits
m10 <- lm(score ~ rank + ethnicity + gender + language + age + cls_perc_eval + cls_students + cls_level + cls_profs + bty_avg, data = evals)
summary(m10)$adj.r.squared## [1] 0.107127
# Remove bty_avg
m11 <- lm(score ~ rank + ethnicity + gender + language + age + cls_perc_eval + cls_students + cls_level + cls_profs + cls_credits, data = evals)
summary(m11)$adj.r.squared## [1] 0.1174189
results <- c(summary(m_full)$adj.r.squared,
summary(m1)$adj.r.squared,
summary(m2)$adj.r.squared,
summary(m3)$adj.r.squared,
summary(m4)$adj.r.squared,
summary(m5)$adj.r.squared,
summary(m6)$adj.r.squared,
summary(m7)$adj.r.squared,
summary(m8)$adj.r.squared,
summary(m9)$adj.r.squared,
summary(m10)$adj.r.squared,
summary(m11)$adj.r.squared)
names(results) <- c('FULL MODEL',
'Remove rank',
'Remove ethnicity',
'Remove gender',
'Remove language',
'Remove age',
'Remove cls_perc_eval',
'Remove cls_students',
'Remove cls_level',
'Remove cls_profs',
'Remove cls_credits',
'Remove bty_avg')
dotchart(sort(results), ylim = c(0.10, 0.16), main = 'Models & Adjusted R-squared')
Removal of cls_profs yields the highest adjusted R-squared.
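The eleven leave-one-out fits above all follow one pattern, so the search can be written as a loop. A sketch using the built-in mtcars data in place of evals:

```r
# Adjusted R-squared of the model refit without each variable in turn
full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
vars <- attr(terms(full), "term.labels")
adj_r2 <- sapply(vars, function(v) {
  summary(update(full, as.formula(paste(". ~ . -", v))))$adj.r.squared
})
# The variable whose removal gives the highest value is the next to drop
sort(adj_r2, decreasing = TRUE)
```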