Foreword

  • Output options: the ‘tango’ syntax-highlighting scheme and the ‘readable’ theme.
  • Snippets and results.
  • Source: ‘Data Analysis and Statistical Inference’ from DataCamp.


Documentation & Data

The book behind this project is OpenIntro.

The data comes as Excel and RData files.

In addition, the project has four custom functions:

  • plot_ci, for plotting confidence intervals (section 5).
  • inference, for running bootstraps and plotting results (section 6).
  • plot_ss, for plotting the sum of squares (section 8).
  • multiLine, for plotting regressions with dummy variables (section 9).


Introduction to R

Welcome!

Load a dataset into R

The present dataset refers to the number of male and female births in the United States. It contains the data for all of the years from 1940 to 2002.

# Print the first six rows of the dataset in the console
head(present)
##   year    boys   girls
## 1 1940 1211684 1148715
## 2 1941 1289734 1223693
## 3 1942 1444365 1364631
## 4 1943 1508959 1427901
## 5 1944 1435301 1359499
## 6 1945 1404587 1330869

Examining the dataset

# Print the names of the variables of the data frame
names(present)
## [1] "year"  "boys"  "girls"
# Print the number of rows and columns with the 'dim' function
dim(present)
## [1] 63  3

Some more exploration

# Find the number of boys born each year, and assign the answer to num_boys
num_boys <- present$boys

# Find the number of girls born each year, and assign the answer to num_girls
num_girls <- present$girls

# This will print the results
num_boys
##  [1] 1211684 1289734 1444365 1508959 1435301 1404587 1691220 1899876
##  [9] 1813852 1826352 1823555 1923020 1971262 2001798 2059068 2073719
## [17] 2133588 2179960 2152546 2173638 2179708 2186274 2132466 2101632
## [25] 2060162 1927054 1845862 1803388 1796326 1846572 1915378 1822910
## [33] 1669927 1608326 1622114 1613135 1624436 1705916 1709394 1791267
## [41] 1852616 1860272 1885676 1865553 1879490 1927983 1924868 1951153
## [49] 2002424 2069490 2129495 2101518 2082097 2048861 2022589 1996355
## [57] 1990480 1985596 2016205 2026854 2076969 2057922 2057979
num_girls
##  [1] 1148715 1223693 1364631 1427901 1359499 1330869 1597452 1800064
##  [9] 1721216 1733177 1730594 1827830 1875724 1900322 1958294 1973576
## [17] 2029502 2074824 2051266 2071158 2078142 2082052 2034896 1996388
## [25] 1967328 1833304 1760412 1717571 1705238 1753634 1816008 1733060
## [33] 1588484 1528639 1537844 1531063 1543352 1620716 1623885 1703131
## [41] 1759642 1768966 1794861 1773380 1789651 1832578 1831679 1858241
## [49] 1907086 1971468 2028717 2009389 1982917 1951379 1930178 1903234
## [57] 1901014 1895298 1925348 1932563 1981845 1968011 1963747

Putting it in a graph

# Type here the code to create the plot
plot(present$year, present$girls)

Connecting the dots

# Create the plot here
plot(present$year, present$girls, type = 'l')

There is initially an increase in the number of girls born, which peaks around 1960. After 1960 there is a decrease in the number of girls born, but the number begins to increase again in the early 1970s. Overall the trend is an increase in the number of girls born in the US since the 1940s.

R, the big calculator

# The vector babies
babies <- present$boys + present$girls

present2 <- cbind(present, babies)
head(present2)
##   year    boys   girls  babies
## 1 1940 1211684 1148715 2360399
## 2 1941 1289734 1223693 2513427
## 3 1942 1444365 1364631 2808996
## 4 1943 1508959 1427901 2936860
## 5 1944 1435301 1359499 2794800
## 6 1945 1404587 1330869 2735456
# The plot
plot(present$year, babies, type = 'l')

# The max point
subset(c(present2$year, present2$babies), present2$babies == max(present2$babies))
## [1]    1961 4268326

There is a birth peak in 1961.
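An idiomatic way to locate the peak is which.max, which returns the index of the (first) maximum. A sketch on a toy data frame — only the 1961 total comes from the output above; the other values are invented for illustration:

```r
# Toy stand-in for present2 (only the 1961 count is real)
totals <- data.frame(year   = c(1959, 1960, 1961, 1962),
                     babies = c(4200000, 4240000, 4268326, 4170000))

# Index of the row with the largest total
peak <- which.max(totals$babies)
totals$year[peak]
## [1] 1961
```

On the real data, present2$year[which.max(present2$babies)] gives the same answer without building the combined vector passed to subset().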

Comparing the dataset

# Check when boys outnumber girls
present$boys > present$girls
##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [15] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [29] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [43] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [57] TRUE TRUE TRUE TRUE TRUE TRUE TRUE

All TRUE: every year there are more boys born than girls.
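Rather than scanning 63 TRUEs by eye, all() collapses the comparison to a single value. A sketch using the counts from the first three rows above:

```r
boys  <- c(1211684, 1289734, 1444365)
girls <- c(1148715, 1223693, 1364631)

# TRUE only if the comparison holds for every element
all(boys > girls)
## [1] TRUE
```

all(present$boys > present$girls) performs the same check on the full dataset.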

Challenge

# Plot the boy-to-girl ratio for every year
plot(present$year, present$boys / present$girls)

There is initially a decrease in the boy-to-girl ratio, and then an increase between 1960 and 1970, followed by a decrease.

# absolute difference
present2 <- data.frame(year = present$year, diff = present$boys - present$girls)

# find the year
subset(present2$year, present2$diff == max(present2$diff))
## [1] 1963

The biggest absolute difference between the number of boys and the number of girls born occurred in 1963.



Introduction to Data

Welcome!

The Behavioral Risk Factor Surveillance System (BRFSS) is an annual telephone survey of 350,000 people in the United States. As its name implies, the BRFSS is designed to identify risk factors in the adult population and report emerging health trends. For example, respondents are asked about their diet and weekly physical activity, their HIV/AIDS status, possible tobacco use, and even their level of healthcare coverage.

We will focus on a random sample of 20,000 people from the BRFSS survey conducted in 2000.

Which variables are you working with?

# Print the names of the variables
names(cdc)
## [1] "genhlth"  "exerany"  "hlthplan" "smoke100" "height"   "weight"  
## [7] "wtdesire" "age"      "gender"
nrow(cdc)
## [1] 20000
str(cdc)
## 'data.frame':    20000 obs. of  9 variables:
##  $ genhlth : Factor w/ 5 levels "excellent","very good",..: 3 3 3 3 2 2 2 2 3 3 ...
##  $ exerany : num  0 0 1 1 0 1 1 0 0 1 ...
##  $ hlthplan: num  1 1 1 1 1 1 1 1 1 1 ...
##  $ smoke100: num  0 1 1 0 0 0 0 0 1 0 ...
##  $ height  : num  70 64 60 66 61 64 71 67 65 70 ...
##  $ weight  : int  175 125 105 132 150 114 194 170 150 180 ...
##  $ wtdesire: int  175 115 105 124 130 114 185 160 130 170 ...
##  $ age     : int  77 33 49 42 55 55 31 45 27 44 ...
##  $ gender  : Factor w/ 2 levels "m","f": 1 2 2 2 2 2 1 1 2 1 ...

20,000 cases; 9 variables

Taking a peek at the data

# Print the head and tail of the data frame:
head(cdc)
##     genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 1      good       0        1        0     70    175      175  77      m
## 2      good       0        1        1     64    125      115  33      f
## 3      good       1        1        1     60    105      105  49      f
## 4      good       1        1        0     66    132      124  42      f
## 5 very good       0        1        0     61    150      130  55      f
## 6 very good       1        1        0     64    114      114  55      f
tail(cdc)
##         genhlth exerany hlthplan smoke100 height weight wtdesire age
## 19995      good       0        1        1     69    224      224  73
## 19996      good       1        1        0     66    215      140  23
## 19997 excellent       0        1        0     73    200      185  35
## 19998      poor       0        1        0     65    216      150  57
## 19999      good       1        1        0     67    165      165  81
## 20000      good       1        1        1     69    170      165  83
##       gender
## 19995      m
## 19996      f
## 19997      m
## 19998      f
## 19999      f
## 20000      m

Type of data:

  • character (strings) vs. numerical.
  • numerical: continuous vs. discrete.
  • discrete: categorical (factor) or ordinal (ordered factor).

Let’s refresh

# View the head or tail of both the height and the genhlth variables
head(cdc$height)
## [1] 70 64 60 66 61 64
tail(cdc$height)
## [1] 69 66 73 65 67 69
head(cdc$genhlth)
## [1] good      good      good      good      very good very good
## Levels: excellent very good good fair poor
tail(cdc$genhlth)
## [1] good      good      excellent poor      good      good     
## Levels: excellent very good good fair poor

Turning info into knowledge – Numerical data

The BRFSS questionnaire is a massive trove of information. A good first step in any analysis is to distill all of that information into a few summary statistics and graphics.

Continuous data.

mean(cdc$weight)
## [1] 169.683
var(cdc$weight)
## [1] 1606.484
median(cdc$weight)
## [1] 165
summary(cdc$weight)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    68.0   140.0   165.0   169.7   190.0   500.0
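Two related summaries worth knowing: the standard deviation is the square root of the variance, and IQR() measures the spread between the quartiles. A sketch on toy weights (invented values, not the survey data):

```r
x <- c(125, 140, 150, 165, 180, 190, 210)   # toy weights

# sd() is the square root of var()
all.equal(sd(x), sqrt(var(x)))
## [1] TRUE

# Third quartile minus first quartile
IQR(x)
```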

Turning info into knowledge – Categorical data

Categorical (discrete) data.

# Create the frequency table here
table(cdc$genhlth)
## 
## excellent very good      good      fair      poor 
##      4657      6972      5675      2019       677
# Create the relative frequency table here
table(cdc$genhlth) / nrow(cdc)
## 
## excellent very good      good      fair      poor 
##   0.23285   0.34860   0.28375   0.10095   0.03385
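prop.table() performs the same division for you. A minimal sketch on a toy vector of hypothetical responses (not the survey data):

```r
responses <- c("good", "good", "very good", "fair", "excellent")
counts <- table(responses)

# Identical to counts / sum(counts)
prop.table(counts)
```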

Creating a barplot

Non-smokers vs smokers.

# Draw the barplot
barplot(table(cdc$smoke100))

Males vs Females.

# Draw another barplot
barplot(table(cdc$gender))

# Number of males
sum(cdc$gender == 'm')
## [1] 9569

Health.

# Proportion of 'excellent' health
sum(cdc$genhlth == 'excellent') / nrow(cdc)
## [1] 0.23285

Even prettier: the mosaic plot

Two categories in one plot.

gender_smokers <- table(cdc$gender, cdc$smoke100)
gender_smokers
##    
##        0    1
##   m 4547 5022
##   f 6012 4419
# Plot the mosaicplot
mosaicplot(gender_smokers)

Males are more likely to smoke than females.
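The mosaic plot compares proportions, which you can also compute directly with prop.table() and its margin argument. A sketch rebuilding the gender_smokers counts from the table above:

```r
# Same counts as the gender_smokers table above
gender_smokers <- matrix(c(4547, 6012, 5022, 4419), nrow = 2,
                         dimnames = list(c("m", "f"), c("0", "1")))

# margin = 1: proportions within each row (gender)
prop.table(gender_smokers, margin = 1)
```

About 52% of males have smoked 100 cigarettes, versus about 42% of females, consistent with the mosaic plot.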

Interlude: how R thinks about data (1)

Reduce, extract, subset to find answers.

# Create the subsets
height_1337 <- cdc[1337, 5]
weight_111 <- cdc[111, 6]

# Print the results
height_1337
## [1] 70
weight_111
## [1] 210

Interlude (2)

# Create the subsets
first8 <- cdc[1:8, 3:5]
wt_gen_10_20 <- cdc[10:20, 6:9]

# Print the subsets
first8
##   hlthplan smoke100 height
## 1        1        0     70
## 2        1        1     64
## 3        1        1     60
## 4        1        0     66
## 5        1        0     61
## 6        1        0     64
## 7        1        0     71
## 8        1        0     67
wt_gen_10_20
##    weight wtdesire age gender
## 10    180      170  44      m
## 11    186      175  46      m
## 12    168      148  62      m
## 13    185      220  21      m
## 14    170      170  69      m
## 15    170      170  23      m
## 16    185      175  79      m
## 17    156      150  47      m
## 18    185      185  76      m
## 19    200      190  43      m
## 20    125      120  33      f

Interlude (3)

# Create the subsets
resp205 <- cdc[205, ]
ht_wt <- cdc[ ,5:6]

# Print the subsets
resp205
##     genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 205    fair       0        0        1     61    200      125  49      f
head(ht_wt)
##   height weight
## 1     70    175
## 2     64    125
## 3     60    105
## 4     66    132
## 5     61    150
## 6     64    114

Interlude (4)

# Create the subsets
resp1000_smk <- cdc$smoke100[1000]
first30_ht <- cdc$height[1:30]

# Print the subsets
resp1000_smk
## [1] 1
first30_ht
##  [1] 70 64 60 66 61 64 71 67 65 70 69 69 66 70 69 73 67 71 75 67 69 65 73
## [24] 67 64 68 67 69 61 74

A little more on subsetting

# Create the subsets
very_good <- subset(cdc, cdc$genhlth == 'very good')
age_gt50 <-  subset(cdc, cdc$age > 50)

# Print the subsets
head(very_good)
##      genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 5  very good       0        1        0     61    150      130  55      f
## 6  very good       1        1        0     64    114      114  55      f
## 7  very good       1        1        0     71    194      185  31      m
## 8  very good       0        1        0     67    170      160  45      m
## 20 very good       1        1        0     67    125      120  33      f
## 21 very good       1        1        0     69    200      150  48      f
head(age_gt50)
##      genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 1       good       0        1        0     70    175      175  77      m
## 5  very good       0        1        0     61    150      130  55      f
## 6  very good       1        1        0     64    114      114  55      f
## 12      fair       1        1        1     69    168      148  62      m
## 14 excellent       1        1        1     70    170      170  69      m
## 16      good       1        1        1     73    185      175  79      m

Subset – one last time

# Create the subset
under23_and_smoke <- subset(cdc, cdc$age < 23 & cdc$smoke100 == 1)

# Print the top six rows of the subset
head(under23_and_smoke)
##       genhlth exerany hlthplan smoke100 height weight wtdesire age gender
## 13  excellent       1        0        1     66    185      220  21      m
## 37  very good       1        0        1     70    160      140  18      f
## 96  excellent       1        1        1     74    175      200  22      m
## 180      good       1        1        1     64    190      140  20      f
## 182 very good       1        1        1     62     92       92  21      f
## 240 very good       1        0        1     64    125      115  22      f
# Number of observations
nrow(under23_and_smoke)
## [1] 620
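For reference, subset() is shorthand for logical bracket indexing. A sketch on a toy data frame (invented values):

```r
df <- data.frame(age = c(21, 30, 19), smoke100 = c(1, 0, 1))

a <- subset(df, age < 23 & smoke100 == 1)
b <- df[df$age < 23 & df$smoke100 == 1, ]

# Both forms select the same rows
identical(a, b)
## [1] TRUE
```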

Visualizing with box plots

# Draw the box plot of the respondents' heights
boxplot(cdc$height)

# Print the summary
summary(cdc$height)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   48.00   64.00   67.00   67.18   70.00   93.00

For continuous data.

More on box plots

By category (mixing continuous and categorical data).

# Draw the box plot of the weights versus smoking
# y ~ x
boxplot(cdc$weight ~ cdc$smoke100)

One last box plot

Box plot of the Body Mass Index (BMI) vs health condition.

\[BMI = \frac{\mathrm{weight(lb)}}{\mathrm{height(in)}^{2}} \times 703\]

# Calculate the BMI
bmi <- (cdc$weight / cdc$height^2) * 703

# Draw the box plot
# y ~ x
boxplot(bmi ~ cdc$genhlth)

The distribution of BMI within each health status group is right skewed, with a long tail of high values.

Histograms

Histograms are generally a very good way to see the shape of a single distribution, but that shape can change depending on how the data is split between the different bins.

# Draw a histogram of bmi
hist(bmi)

# And one with breaks set to 50
hist(bmi, breaks = 50)

# And one with breaks set to 100
hist(bmi, breaks = 100)

Weight vs. Desired Weight

Correlation with a scatter plot.

# Draw a scatter plot
plot(cdc$wtdesire ~ cdc$weight)

Moderately strong positive linear association.

However, the data is bunched up. Let’s improve the visualization.

library(hexbin)

# Draw a hexbinplot
hexbinplot(cdc$wtdesire ~ cdc$weight, aspect = 0.5)

# Zero in
hexbinplot(cdc$wtdesire ~ cdc$weight, ylim = c(0, 400), aspect = 0.5)



Probability

Getting started

Kobe Bryant of the Los Angeles Lakers.

His performance against the Orlando Magic in the 2009 NBA finals earned him the title “Most Valuable Player” and many spectators commented on how he appeared to show a hot hand.

When a basketball player makes several baskets in succession, they are described as having a ‘hot hand’. Fans and players have long believed in this phenomenon, which contradicts the assumption that each shot is independent of the next. However, a 1985 paper by Gilovich, Vallone, and Tversky found evidence against this belief and showed that successive shots are independent events.

# Inspect the data
head(kobe)
##    vs game quarter time
## 1 ORL    1       1 9:47
## 2 ORL    1       1 9:07
## 3 ORL    1       1 8:11
## 4 ORL    1       1 7:41
## 5 ORL    1       1 7:03
## 6 ORL    1       1 6:01
##                                               description basket
## 1                 Kobe Bryant makes 4-foot two point shot      H
## 2                               Kobe Bryant misses jumper      M
## 3                        Kobe Bryant misses 7-foot jumper      M
## 4 Kobe Bryant makes 16-foot jumper (Derek Fisher assists)      H
## 5                         Kobe Bryant makes driving layup      H
## 6                               Kobe Bryant misses jumper      M
tail(kobe)
##      vs game quarter    time                        description basket
## 128 ORL    3       4    3:57  Bryant Jump Shot: Made (28 PTS)        H
## 129 ORL    3       4    3:33        Bryant Layup Shot: Missed        M
## 130 ORL    3       4    2:02          Bryant 3pt Shot: Missed        M
## 131 ORL    3       4 00:23.9          Bryant 3pt Shot: Missed        M
## 132 ORL    3       4 00:06.9          Bryant 3pt Shot: Missed        M
## 133 ORL    3       4 00:00.5 Bryant Layup Shot: Made (31 PTS)        H

Kobe’s track record

Just looking at the string of hits and misses, it can be difficult to gauge whether or not it seems like Kobe was shooting with a hot hand.

One way we can approach this is by considering the belief that hot hand shooters tend to go on shooting streaks.

# Print the variable names of the data frame
names(kobe)
## [1] "vs"          "game"        "quarter"     "time"        "description"
## [6] "basket"
# Print the first 9 values of the 'basket' variable
kobe$basket[1:9]
## [1] "H" "M" "M" "H" "H" "M" "M" "M" "M"

A first analysis

The distribution of the shooting streaks.

# Assign Kobe's streak lengths:
kobe_streak <- calc_streak(kobe$basket)

# Draw a barplot of the result:
barplot(table(kobe_streak))

The distribution of Kobe’s streaks is unimodal and right skewed.
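calc_streak is one of the lab's custom helpers. A minimal sketch of the idea (an illustrative reimplementation, not necessarily the lab's code): encode hits as 1s, pad with zeros so every streak is bounded, and read streak lengths off the gaps between zeros.

```r
calc_streak_sketch <- function(x) {
  y <- c(0, as.numeric(x == "H"), 0)  # pad so every streak is closed by a 0
  zeros <- which(y == 0)
  diff(zeros) - 1                     # gap between consecutive zeros = streak length
}

# One hit, a bare miss, two hits, then a final miss
calc_streak_sketch(c("H", "M", "M", "H", "H", "M"))
## [1] 1 0 2 0
```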

We’ve shown that Kobe had some long shooting streaks, but are they long enough to support the belief that he had hot hands? What can we compare them to?

To answer these questions, let’s return to the idea of independence.

Two processes are independent if the outcome of one process doesn’t affect the outcome of the second. If each shot that a player takes is an independent process, having made or missed your first shot will not affect the probability that you will make or miss your second shot.

A shooter with a hot hand will have shots that are not independent of one another. Specifically, if the shooter makes his first shot, the hot hand model says he will have a higher probability of making his second shot. Let’s suppose for a moment that the hot hand model is valid for Kobe. During his career, the percentage of time Kobe makes a basket (i.e. his shooting percentage) is about 45%, or in probability notation:

\[P(\mathrm{shot}_1=H) = 0.45.\]

If Kobe has a hot hand (not independent shots), then the probability that he makes his second shot would go up given that he made the first shot:

\[P(\mathrm{shot}_2=H~|~\mathrm{shot}_1=H) > 0.45.\]

Indeed, if Kobe has a hot hand, the probability that he makes his second shot would be higher, for example 0.60. As a result of these increased probabilities, you’d expect Kobe to have longer streaks.

Compare this to the skeptical perspective where Kobe does not have a hot hand, where each shot is independent of the next. If he hits his first shot, the probability that he makes the second is still 0.45.

In other words, making the first shot did nothing to affect the probability that he’d make his second shot. If Kobe’s shots are independent, then he’d have the same probability of hitting every shot regardless of his past shots: 45%.

Now that we’ve phrased the situation in terms of independent shots, let’s return to the question: how do we tell if Kobe’s shooting streaks are long enough to indicate that he has hot hands? We can compare his streak lengths to someone without hot hands: an independent shooter. Starting with the next exercise, you’ll learn how to simulate such an independent shooter in R.

If Kobe’s shooting streaks diverge significantly from an independent shooter’s streaks, we can conclude that Kobe likely has a hot hand.

Simulations in R

So the key is to compare Kobe’s data with that of a shooter who we know to have independent shots. While we don’t have any such data, that is very easy to simulate.

In a simulation, we set the ground rules of a random process and then the computer uses random numbers to generate an outcome that adheres to those rules.

# Try some simulations!
outcomes <- c('heads', 'tails')
sample(outcomes, size = 1, replace = TRUE)
## [1] "tails"
sample(outcomes, size = 1, replace = TRUE)
## [1] "heads"
sample(outcomes, size = 1, replace = TRUE)
## [1] "heads"
sample(outcomes, size = 1, replace = TRUE)
## [1] "heads"
sample(outcomes, size = 1, replace = TRUE)
## [1] "tails"
sample(outcomes, size = 5, replace = TRUE)
## [1] "tails" "tails" "heads" "tails" "tails"

Flipping 100 coins

If you wanted to simulate flipping a fair coin 100 times, you could either run the function 100 times or, more simply, adjust the size argument, which governs how many samples to draw. Since there are only two elements in outcomes, the probability that we ‘flip’ a coin and it lands heads is 0.5.

Suppose again that you have a hat with two slips of paper in it: one slip says ‘heads’ and the other says ‘tails’. At each draw, the probability of drawing the ‘heads’ slip is 50%, and the ‘tails’ slip is 50%. The replace = TRUE argument indicates that we put the slip back in the hat before drawing again, keeping the 50/50 odds.
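One practical note: simulations are random, so your counts will differ from run to run. set.seed() fixes the random number generator's state, making a simulation reproducible. A short sketch (the seed value is arbitrary):

```r
set.seed(42)                          # any integer works; 42 is arbitrary
first <- sample(c("heads", "tails"), size = 5, replace = TRUE)

set.seed(42)                          # same seed, same draws
second <- sample(c("heads", "tails"), size = 5, replace = TRUE)

identical(first, second)
## [1] TRUE
```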

# Run the simulation
outcomes <- c('heads', 'tails')
sim_fair_coin <- sample(outcomes, size = 100, replace = TRUE)

# Print the object
sim_fair_coin
##   [1] "heads" "tails" "heads" "tails" "heads" "heads" "tails" "heads"
##   [9] "heads" "heads" "tails" "heads" "heads" "heads" "tails" "tails"
##  [17] "tails" "heads" "tails" "heads" "heads" "heads" "tails" "heads"
##  [25] "heads" "tails" "tails" "tails" "heads" "heads" "tails" "heads"
##  [33] "heads" "heads" "tails" "tails" "tails" "heads" "heads" "tails"
##  [41] "heads" "tails" "tails" "heads" "heads" "heads" "heads" "tails"
##  [49] "heads" "tails" "heads" "tails" "heads" "heads" "tails" "heads"
##  [57] "tails" "heads" "tails" "heads" "tails" "heads" "tails" "heads"
##  [65] "tails" "tails" "tails" "tails" "tails" "heads" "tails" "heads"
##  [73] "heads" "tails" "tails" "heads" "tails" "heads" "tails" "tails"
##  [81] "tails" "tails" "tails" "tails" "heads" "heads" "heads" "heads"
##  [89] "tails" "heads" "tails" "heads" "tails" "tails" "heads" "heads"
##  [97] "heads" "tails" "tails" "heads"
# Compute the counts of heads and tails:
table(sim_fair_coin)
## sim_fair_coin
## heads tails 
##    52    48

Flipping an unfair coin

Until now we’ve been running simulations where each outcome had an equal probability.

Change that to a 20/80 ratio.

# Run the simulation
outcomes <- c('heads', 'tails')
sim_unfair_coin <- sample(outcomes, size = 100, replace = TRUE, prob = c(0.20, 0.80))

# Print the object
sim_unfair_coin
##   [1] "tails" "tails" "tails" "tails" "tails" "tails" "tails" "tails"
##   [9] "tails" "tails" "tails" "heads" "tails" "tails" "tails" "tails"
##  [17] "tails" "tails" "heads" "tails" "tails" "heads" "tails" "tails"
##  [25] "tails" "tails" "tails" "tails" "tails" "tails" "tails" "tails"
##  [33] "tails" "heads" "heads" "heads" "tails" "heads" "tails" "tails"
##  [41] "tails" "tails" "tails" "heads" "tails" "heads" "heads" "tails"
##  [49] "tails" "heads" "tails" "heads" "tails" "heads" "tails" "tails"
##  [57] "heads" "tails" "tails" "tails" "tails" "tails" "heads" "heads"
##  [65] "tails" "tails" "tails" "tails" "tails" "tails" "tails" "tails"
##  [73] "tails" "tails" "heads" "tails" "tails" "tails" "tails" "tails"
##  [81] "tails" "tails" "tails" "tails" "tails" "tails" "tails" "tails"
##  [89] "tails" "tails" "tails" "tails" "tails" "tails" "tails" "tails"
##  [97] "tails" "heads" "tails" "tails"
# Compute the counts of heads and tails
table(sim_unfair_coin)
## sim_unfair_coin
## heads tails 
##    18    82
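If you only need the number of heads rather than the full sequence, a single binomial draw is distributionally equivalent to tallying a biased sample(). A sketch (the seed is arbitrary):

```r
set.seed(1)

# One draw: number of heads in 100 flips with P(heads) = 0.20
n_heads <- rbinom(1, size = 100, prob = 0.20)
n_heads  # a single count, typically near 20
```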

Simulating the independent shooter

Simulating a basketball player who has independent shots uses the same mechanism that we used to simulate a coin flip:

  • Set the number of shots (size).
  • Set the probability (prob).
  • Set the independence criterion (replace = TRUE).

# Run the simulation and assign the result to 'sim_basket'
outcomes <- c('H', 'M')

sim_basket <- sample(outcomes, size = 133, replace = TRUE, prob = c(0.45, 0.55))

table(sim_basket)
## sim_basket
##  H  M 
## 57 76

Kobe vs. the independent shooter

Both datasets represent the results of 133 shot attempts, each with the same shooting percentage of 45%.

We know that our simulated data is from a shooter that has independent shots. That is, we know the simulated shooter does not have a ‘hot hand’.

# Calculate streak lengths
kobe_streak <- calc_streak(kobe$basket)

#sim_streak <- calc_streak(sample(c('H','M'),size=133,replace=TRUE,prob=c(0.45,0.55)))
sim_streak <- calc_streak(sim_basket)

# Compute summaries
summary(kobe_streak)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.7632  1.0000  4.0000
summary(sim_streak)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.7403  1.0000  5.0000
# Make bar plots
kobe_table <- table(kobe_streak)
sim_table <- table(sim_streak)

par(mfrow = c(1, 2))

barplot(kobe_table, ylim = c(0, 50))
barplot(sim_table, ylim = c(0, 50))

par(mfrow = c(1, 1))

Somewhat similar distributions!

We can conclude that Kobe Bryant likely does not have a ‘hot hand’.
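A single simulated shooter is only one draw from a random process. To see how much the longest streak varies by chance alone, you can replicate the simulation many times; a sketch using rle() to measure the longest run of hits (assuming Kobe's longest streak of 4, per the summary above):

```r
set.seed(2009)

longest <- replicate(1000, {
  shots <- sample(c("H", "M"), size = 133, replace = TRUE, prob = c(0.45, 0.55))
  runs <- rle(shots)                        # lengths of consecutive identical values
  max(c(0, runs$lengths[runs$values == "H"]))
})

# Share of independent shooters whose longest streak is at least 4
mean(longest >= 4)
```

If a large share of independent shooters reach streaks of 4 or more, Kobe's streaks are unremarkable under the independence model.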



Foundations for Inference: Sampling Distributions

Sampling distributions

We consider real estate data from the city of Ames, Iowa. The details of every real estate transaction in Ames are recorded by the City Assessor’s office. Our particular focus for this lab will be all residential home sales in Ames between 2006 and 2010.

This collection represents our population of interest (which is rare to have access to), but we will also work with smaller samples from this population.

# Make some preliminary inspections:
head(ames, 2)
##   Order       PID MS.SubClass MS.Zoning Lot.Frontage Lot.Area Street Alley
## 1     1 526301100          20        RL          141    31770   Pave  <NA>
## 2     2 526350040          20        RH           80    11622   Pave  <NA>
##   Lot.Shape Land.Contour Utilities Lot.Config Land.Slope Neighborhood
## 1       IR1          Lvl    AllPub     Corner        Gtl        NAmes
## 2       Reg          Lvl    AllPub     Inside        Gtl        NAmes
##   Condition.1 Condition.2 Bldg.Type House.Style Overall.Qual Overall.Cond
## 1        Norm        Norm      1Fam      1Story            6            5
## 2       Feedr        Norm      1Fam      1Story            5            6
##   Year.Built Year.Remod.Add Roof.Style Roof.Matl Exterior.1st Exterior.2nd
## 1       1960           1960        Hip   CompShg      BrkFace      Plywood
## 2       1961           1961      Gable   CompShg      VinylSd      VinylSd
##   Mas.Vnr.Type Mas.Vnr.Area Exter.Qual Exter.Cond Foundation Bsmt.Qual
## 1        Stone          112         TA         TA     CBlock        TA
## 2         None            0         TA         TA     CBlock        TA
##   Bsmt.Cond Bsmt.Exposure BsmtFin.Type.1 BsmtFin.SF.1 BsmtFin.Type.2
## 1        Gd            Gd            BLQ          639            Unf
## 2        TA            No            Rec          468            LwQ
##   BsmtFin.SF.2 Bsmt.Unf.SF Total.Bsmt.SF Heating Heating.QC Central.Air
## 1            0         441          1080    GasA         Fa           Y
## 2          144         270           882    GasA         TA           Y
##   Electrical X1st.Flr.SF X2nd.Flr.SF Low.Qual.Fin.SF Gr.Liv.Area
## 1      SBrkr        1656           0               0        1656
## 2      SBrkr         896           0               0         896
##   Bsmt.Full.Bath Bsmt.Half.Bath Full.Bath Half.Bath Bedroom.AbvGr
## 1              1              0         1         0             3
## 2              0              0         1         0             2
##   Kitchen.AbvGr Kitchen.Qual TotRms.AbvGrd Functional Fireplaces
## 1             1           TA             7        Typ          2
## 2             1           TA             5        Typ          0
##   Fireplace.Qu Garage.Type Garage.Yr.Blt Garage.Finish Garage.Cars
## 1           Gd      Attchd          1960           Fin           2
## 2         <NA>      Attchd          1961           Unf           1
##   Garage.Area Garage.Qual Garage.Cond Paved.Drive Wood.Deck.SF
## 1         528          TA          TA           P          210
## 2         730          TA          TA           Y          140
##   Open.Porch.SF Enclosed.Porch X3Ssn.Porch Screen.Porch Pool.Area Pool.QC
## 1            62              0           0            0         0    <NA>
## 2             0              0           0          120         0    <NA>
##   Fence Misc.Feature Misc.Val Mo.Sold Yr.Sold Sale.Type Sale.Condition
## 1  <NA>         <NA>        0       5    2010       WD          Normal
## 2 MnPrv         <NA>        0       6    2010       WD          Normal
##   SalePrice
## 1    215000
## 2    105000
str(ames)
## 'data.frame':    2930 obs. of  82 variables:
##  $ Order          : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ PID            : int  526301100 526350040 526351010 526353030 527105010 527105030 527127150 527145080 527146030 527162130 ...
##  $ MS.SubClass    : int  20 20 20 20 60 60 120 120 120 60 ...
##  $ MS.Zoning      : Factor w/ 7 levels "A (agr)","C (all)",..: 6 5 6 6 6 6 6 6 6 6 ...
##  $ Lot.Frontage   : int  141 80 81 93 74 78 41 43 39 60 ...
##  $ Lot.Area       : int  31770 11622 14267 11160 13830 9978 4920 5005 5389 7500 ...
##  $ Street         : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Alley          : Factor w/ 2 levels "Grvl","Pave": NA NA NA NA NA NA NA NA NA NA ...
##  $ Lot.Shape      : Factor w/ 4 levels "IR1","IR2","IR3",..: 1 4 1 4 1 1 4 1 1 4 ...
##  $ Land.Contour   : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 4 2 4 4 ...
##  $ Utilities      : Factor w/ 3 levels "AllPub","NoSeWa",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Lot.Config     : Factor w/ 5 levels "Corner","CulDSac",..: 1 5 1 1 5 5 5 5 5 5 ...
##  $ Land.Slope     : Factor w/ 3 levels "Gtl","Mod","Sev": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Neighborhood   : Factor w/ 28 levels "Blmngtn","Blueste",..: 16 16 16 16 9 9 25 25 25 9 ...
##  $ Condition.1    : Factor w/ 9 levels "Artery","Feedr",..: 3 2 3 3 3 3 3 3 3 3 ...
##  $ Condition.2    : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Bldg.Type      : Factor w/ 5 levels "1Fam","2fmCon",..: 1 1 1 1 1 1 5 5 5 1 ...
##  $ House.Style    : Factor w/ 8 levels "1.5Fin","1.5Unf",..: 3 3 3 3 6 6 3 3 3 6 ...
##  $ Overall.Qual   : int  6 5 6 7 5 6 8 8 8 7 ...
##  $ Overall.Cond   : int  5 6 6 5 5 6 5 5 5 5 ...
##  $ Year.Built     : int  1960 1961 1958 1968 1997 1998 2001 1992 1995 1999 ...
##  $ Year.Remod.Add : int  1960 1961 1958 1968 1998 1998 2001 1992 1996 1999 ...
##  $ Roof.Style     : Factor w/ 6 levels "Flat","Gable",..: 4 2 4 4 2 2 2 2 2 2 ...
##  $ Roof.Matl      : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Exterior.1st   : Factor w/ 16 levels "AsbShng","AsphShn",..: 4 14 15 4 14 14 6 7 6 14 ...
##  $ Exterior.2nd   : Factor w/ 17 levels "AsbShng","AsphShn",..: 11 15 16 4 15 15 6 7 6 15 ...
##  $ Mas.Vnr.Type   : Factor w/ 6 levels "","BrkCmn","BrkFace",..: 6 5 3 5 5 3 5 5 5 5 ...
##  $ Mas.Vnr.Area   : int  112 0 108 0 0 20 0 0 0 0 ...
##  $ Exter.Qual     : Factor w/ 4 levels "Ex","Fa","Gd",..: 4 4 4 3 4 4 3 3 3 4 ...
##  $ Exter.Cond     : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ Foundation     : Factor w/ 6 levels "BrkTil","CBlock",..: 2 2 2 2 3 3 3 3 3 3 ...
##  $ Bsmt.Qual      : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 6 6 4 6 4 4 4 6 ...
##  $ Bsmt.Cond      : Factor w/ 6 levels "","Ex","Fa","Gd",..: 4 6 6 6 6 6 6 6 6 6 ...
##  $ Bsmt.Exposure  : Factor w/ 5 levels "","Av","Gd","Mn",..: 3 5 5 5 5 5 4 5 5 5 ...
##  $ BsmtFin.Type.1 : Factor w/ 7 levels "","ALQ","BLQ",..: 3 6 2 2 4 4 4 2 4 7 ...
##  $ BsmtFin.SF.1   : int  639 468 923 1065 791 602 616 263 1180 0 ...
##  $ BsmtFin.Type.2 : Factor w/ 7 levels "","ALQ","BLQ",..: 7 5 7 7 7 7 7 7 7 7 ...
##  $ BsmtFin.SF.2   : int  0 144 0 0 0 0 0 0 0 0 ...
##  $ Bsmt.Unf.SF    : int  441 270 406 1045 137 324 722 1017 415 994 ...
##  $ Total.Bsmt.SF  : int  1080 882 1329 2110 928 926 1338 1280 1595 994 ...
##  $ Heating        : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Heating.QC     : Factor w/ 5 levels "Ex","Fa","Gd",..: 2 5 5 1 3 1 1 1 1 3 ...
##  $ Central.Air    : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Electrical     : Factor w/ 6 levels "","FuseA","FuseF",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ X1st.Flr.SF    : int  1656 896 1329 2110 928 926 1338 1280 1616 1028 ...
##  $ X2nd.Flr.SF    : int  0 0 0 0 701 678 0 0 0 776 ...
##  $ Low.Qual.Fin.SF: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Gr.Liv.Area    : int  1656 896 1329 2110 1629 1604 1338 1280 1616 1804 ...
##  $ Bsmt.Full.Bath : int  1 0 0 1 0 0 1 0 1 0 ...
##  $ Bsmt.Half.Bath : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Full.Bath      : int  1 1 1 2 2 2 2 2 2 2 ...
##  $ Half.Bath      : int  0 0 1 1 1 1 0 0 0 1 ...
##  $ Bedroom.AbvGr  : int  3 2 3 3 3 3 2 2 2 3 ...
##  $ Kitchen.AbvGr  : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Kitchen.Qual   : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 3 1 5 3 3 3 3 3 ...
##  $ TotRms.AbvGrd  : int  7 5 6 8 6 7 6 5 5 7 ...
##  $ Functional     : Factor w/ 8 levels "Maj1","Maj2",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ Fireplaces     : int  2 0 0 2 1 1 0 0 1 1 ...
##  $ Fireplace.Qu   : Factor w/ 5 levels "Ex","Fa","Gd",..: 3 NA NA 5 5 3 NA NA 5 5 ...
##  $ Garage.Type    : Factor w/ 6 levels "2Types","Attchd",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Garage.Yr.Blt  : int  1960 1961 1958 1968 1997 1998 2001 1992 1995 1999 ...
##  $ Garage.Finish  : Factor w/ 4 levels "","Fin","RFn",..: 2 4 4 2 2 2 2 3 3 2 ...
##  $ Garage.Cars    : int  2 1 1 2 2 2 2 2 2 2 ...
##  $ Garage.Area    : int  528 730 312 522 482 470 582 506 608 442 ...
##  $ Garage.Qual    : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ Garage.Cond    : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ Paved.Drive    : Factor w/ 3 levels "N","P","Y": 2 3 3 3 3 3 3 3 3 3 ...
##  $ Wood.Deck.SF   : int  210 140 393 0 212 360 0 0 237 140 ...
##  $ Open.Porch.SF  : int  62 0 36 0 34 36 0 82 152 60 ...
##  $ Enclosed.Porch : int  0 0 0 0 0 0 170 0 0 0 ...
##  $ X3Ssn.Porch    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Screen.Porch   : int  0 120 0 0 0 0 0 144 0 0 ...
##  $ Pool.Area      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Pool.QC        : Factor w/ 4 levels "Ex","Fa","Gd",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ Fence          : Factor w/ 4 levels "GdPrv","GdWo",..: NA 3 NA NA 3 NA NA NA NA NA ...
##  $ Misc.Feature   : Factor w/ 5 levels "Elev","Gar2",..: NA NA 2 NA NA NA NA NA NA NA ...
##  $ Misc.Val       : int  0 0 12500 0 0 0 0 0 0 0 ...
##  $ Mo.Sold        : int  5 6 6 4 3 6 4 1 3 6 ...
##  $ Yr.Sold        : int  2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
##  $ Sale.Type      : Factor w/ 10 levels "COD","Con","ConLD",..: 10 10 10 10 10 10 10 10 10 10 ...
##  $ Sale.Condition : Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ SalePrice      : int  215000 105000 172000 244000 189900 195500 213500 191500 236500 189000 ...

A first distribution analysis

We restrict attention to two of the variables.

# Assign the variables
area <- ames$Gr.Liv.Area
price <- ames$SalePrice

# Calculate the summary and draw a histogram of 'area'
summary(area)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334    1126    1442    1500    1743    5642
# Plot it
hist(area)

Half of the houses in Ames are smaller than the median of 1,442 square feet; the mean living area is about 1,500 square feet.
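
This "half below the median" reading can be checked directly with a logical comparison. A minimal sketch on simulated stand-in data (the real vector is ames$Gr.Liv.Area, which is not regenerated here):

```r
# By definition of the median, about half of the observations fall at or
# below it. Simulated stand-in data, not the actual Ames measurements.
set.seed(42)
area_sim <- round(rnorm(2930, mean = 1500, sd = 505))
mean(area_sim <= median(area_sim))   # close to 0.5
```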

Sampling from the population

Gathering information on an entire population is often extremely costly or impossible. Because of this, we often take a sample of the population and use that to understand the properties of the population.

To estimate the mean living area in Ames based on a sample, we collect two simple random samples of size 50.

# The 'ames' data frame and 'area' and 'price' objects are already loaded into the workspace
# Create the samples
samp0 <- sample(area, 50)
samp1 <- sample(area, 50)

# Draw the histograms
par(mfrow = c(1,2))

hist(samp0)
hist(samp1)

par(mfrow = c(1,1))

Depending on which 50 homes were selected, the estimate could be a bit above or a bit below the true population mean of approximately 1,500 square feet.

In general, though, the sample mean turns out to be a pretty good estimate of the average living area, and we were able to get it by sampling less than 3% of the population.

However, larger samples provide more accurate estimates of the population mean.

The sampling distribution

The distribution of sample means, called the sampling distribution, can help us understand this variability.

Generate 5000 samples and compute the sample mean of each.

# Set up an empty vector of 5000 NAs to store sample means
sample_means50 <- rep(NA, 5000)

# Take 5000 samples of size 50 of 'area' and store all of them in 'sample_means50'
for (i in 1:5000) {
    samp <- sample(area, 50)
    sample_means50[i] <- mean(samp)
}

# View the result. If you want, you can increase the bin width to show more detail by changing the 'breaks' argument
hist(sample_means50, breaks = 13)
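
As a sanity check, this distribution should be centered near the population mean, with spread close to the population standard deviation divided by the square root of 50. A sketch on simulated stand-in data (the real population vector is area):

```r
# Simulated stand-in population; not the actual Ames data
set.seed(42)
pop <- rnorm(2930, mean = 1500, sd = 505)
means50 <- replicate(5000, mean(sample(pop, 50)))

# Center of the sampling distribution vs. the population mean
c(pop_mean = mean(pop), mean_of_sample_means = mean(means50))
# Spread vs. the theoretical standard error sigma / sqrt(n)
c(theoretical_se = sd(pop) / sqrt(50), observed_sd = sd(means50))
```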

Interlude: The for loop

The idea behind the for loop is iteration: it lets us execute code many times without having to type out every repetition. The code below generates 5000 sample means:

# Set up an empty vector of 5000 NAs to store sample means
sample_means50 <- rep(NA, 5000)

# Take 5000 samples of size 50 of 'area' and store all of them in 'sample_means50'

# Also print 'i' (the index counter):
for (i in 1:5000) {
    samp <- sample(area, 50)
    sample_means50[i] <- mean(samp)
    print(i)
}

Interlude: breaking it down

In the first line we initialize a vector. In this case, we created a vector of 5000 NAs called sample_means50. This vector will store values generated within the for loop. NA means not available, and here the NAs serve as placeholders until we fill in the values with actual sample means. NA is also often used for missing data in R.

The second line calls the for loop itself. The syntax can be loosely read as, “for every element i from 1 to 5000, run the following lines of code”. You can think of i as the counter that keeps track of which loop you’re on. Therefore, more precisely, the loop will run once when i=1, then once when i=2, and so on up to i=5000.

The body of the for loop is the part inside the curly braces, and this set of code is run for each value of i. Here, on every loop, we take a random sample of size 50 from area, take its mean, and store it as the ith element of sample_means50. In order to display that this is really happening, we asked R to print i at each iteration. This line of code is optional and is only used for displaying what’s going on while the for loop is running.

The for loop allows us to not just run the code 5000 times, but to neatly package the results, element by element, into the empty vector that we initialized at the outset. The code:

# The vector 'sample_means50' is initialized with NA values
sample_means50 <- rep(NA, 5000)

# The for loop runs 5000 times, with 'i' taking values 1 up to 5000
for (i in 1:5000) {
    # Take a random sample of size 50
    samp <- sample(area, 50)
    # Store the mean of the sample in the 'sample_means50' vector on the ith place
    sample_means50[i] <- mean(samp)
    # Print the counter 'i'
    print(i)
}

The result:

# Print the first few random means
head(sample_means50)
## [1] 1440.38 1425.78 1500.44 1501.66 1418.44 1559.00
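
The same simulation can be written in one line with replicate(), which is idiomatic base R for repeating an expression; a sketch using stand-in data in place of area:

```r
set.seed(42)
pop <- rnorm(2930, mean = 1500, sd = 505)   # stand-in for 'area'

# One call replaces the vector initialization and the for loop
sample_means50 <- replicate(5000, mean(sample(pop, 50)))
length(sample_means50)   # → 5000
```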

A first for loop

# Initialize the vector to store the means in
sample_means_small <- rep(NA, 100)

# Run your for loop
for (i in 1:100) {
    samp <- sample(area,50)
    sample_means_small[i] <- mean(samp)
}

# Print the result
sample_means_small
##   [1] 1539.88 1380.38 1570.22 1555.70 1527.24 1450.86 1585.26 1363.08
##   [9] 1458.06 1490.30 1492.78 1473.94 1519.52 1463.06 1512.52 1549.02
##  [17] 1602.76 1483.68 1505.58 1459.90 1464.42 1507.26 1365.04 1455.08
##  [25] 1660.06 1467.30 1669.16 1435.54 1562.04 1460.86 1443.30 1410.28
##  [33] 1453.38 1537.40 1466.74 1559.56 1543.74 1584.44 1531.44 1475.30
##  [41] 1582.66 1532.10 1387.60 1523.92 1518.54 1517.92 1506.02 1432.80
##  [49] 1417.70 1308.72 1482.80 1487.58 1490.12 1712.62 1490.88 1545.90
##  [57] 1439.48 1594.44 1541.50 1519.30 1547.66 1517.20 1455.94 1472.08
##  [65] 1597.16 1496.82 1566.50 1570.78 1564.44 1613.98 1521.22 1483.30
##  [73] 1431.84 1420.40 1494.74 1443.18 1486.84 1459.86 1560.78 1529.48
##  [81] 1543.60 1521.64 1509.62 1579.10 1474.94 1369.88 1490.14 1537.20
##  [89] 1338.10 1346.68 1465.66 1554.30 1390.44 1411.20 1457.92 1461.42
##  [97] 1486.52 1441.36 1548.58 1438.02

The loop ran 100 times, so the vector holds 100 elements. Each element is the mean square footage from a simple random sample of 50 houses.

More on sampling

The sampling distribution that we computed tells us much about estimating the average living area in homes in Ames. Because the sample mean is an unbiased estimator, the sampling distribution is centered at the true average living area of the population, and the spread of the distribution indicates how much variability is induced by sampling only 50 home sales.

# Initialize the sample distributions:
sample_means10 <- rep(NA, 5000)
sample_means100 <- rep(NA, 5000)

# Run the for loop:
for (i in 1:5000) {
    samp <- sample(area, 10)
    sample_means10[i] <- mean(samp)
    samp <- sample(area, 100)
    sample_means100[i] <- mean(samp)
}

# Take a look at the results
head(sample_means10)
## [1] 1452.9 1437.2 1482.6 1399.5 1452.3 1605.8
head(sample_means50) # was already loaded
## [1] 1440.38 1425.78 1500.44 1501.66 1418.44 1559.00
head(sample_means100)
## [1] 1448.41 1472.83 1527.48 1538.21 1536.08 1453.46

Influence of sample size

To see the effect that different sample sizes have on the sampling distribution, let’s plot the three distributions on top of one another.

# Divide the plot window in 3 rows
par(mfrow = c(3, 1))

# Define the limits for the x-axis
xlimits <- range(sample_means10)

# Draw the histograms
hist(sample_means10)
hist(sample_means50)
hist(sample_means100)

# with breaks
hist(sample_means10, breaks = 20, xlim = xlimits)
hist(sample_means50, breaks = 20, xlim = xlimits)
hist(sample_means100, breaks = 20, xlim = xlimits)

par(mfrow = c(1, 1))

As the sample size increases, the variability of the sampling distribution decreases.
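
That visual impression can be quantified: the standard deviation of each sampling distribution shrinks as the sample size grows. A sketch on stand-in data:

```r
set.seed(42)
pop <- rnorm(2930, mean = 1500, sd = 505)   # stand-in population

# Standard deviation of the sampling distribution for n = 10, 50, 100
sds <- sapply(c(10, 50, 100), function(n) {
  sd(replicate(2000, mean(sample(pop, n))))
})
names(sds) <- paste0("n=", c(10, 50, 100))
round(sds, 1)   # decreases as n grows
```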

Now: prices!

So far, we have only focused on estimating the mean living area in homes in Ames.

Now let’s take off our training wheels and do the same analysis for the sales prices.

# Take a sample of size 50 from 'price'
sample_50 <- sample(price, 50)

# Print the mean
mean(sample_50)
## [1] 174484.6

Sampling distribution of prices

Since we have access to the population, we can simulate the sampling distribution for \(\overline{x}\) by taking 5000 samples of size 50 from the population and computing 5000 sample means.

sample_means50 <- rep(NA, 5000)
for (i in 1:5000) {
    samp <- sample(price,50)
    sample_means50[i] <- mean(samp)
}

head(sample_means50)
## [1] 214974.5 190683.2 192283.1 163343.8 179480.0 199587.8

More on sampling distribution of prices

Means of samples of size 150.

sample_means150 <- rep(NA, 5000)
for (i in 1:5000) {
    samp <- sample(price,150)
    sample_means150[i] <- mean(samp)
}

par(mfrow = c(2, 1))

hist(sample_means50)
hist(sample_means150)

par(mfrow = c(1, 1))

The variability of the sampling distribution with the larger sample size (sample_means150) is smaller than the variability of the sampling distribution with the smaller sample size (sample_means50).



Foundations for Inference: Confidence Intervals

One sample from Ames, Iowa

We'll translate the concept of reliability of mean estimates into something more tangible: confidence intervals.

If you have access to data on an entire population, say the size of every house in Ames, Iowa, it's straightforward to answer questions like, "How big is the typical house in Ames?" and "How much variation is there in the sizes of houses?". If you have access to only a sample of the population, as is often the case, the task becomes more complicated.

What is your best guess for the typical size if you only know the sizes of several dozen houses? This sort of situation requires that you use your sample to make inference on what your population looks like.

We'll start with a simple random sample of size 60 from the population. Note that the dataset has information on many housing variables, but for this portion of the lab we'll focus on the size of the house, represented by the variable Gr.Liv.Area.

# Take a sample of size 60 of the population:
population <- ames$Gr.Liv.Area
samp <- sample(population, 60)

# Calculate the mean
sample_mean <- mean(samp)

# Draw a histogram
hist(samp)

The distribution should be similar to the distributions obtained by others who also collect random samples from this population, but it is likely not exactly the same since it's a random sample.

Confidence intervals

Based only on this single sample, the best estimate of the average living area of houses sold in Ames would be the sample mean.

But how uncertain are we of that estimate? This uncertainty can be captured with a confidence interval.

We can calculate a 95% confidence interval for a sample mean by adding and subtracting 1.96 standard errors to the point estimate.

# Calculate the standard error
se <- sd(samp)/sqrt(60)

# Calculate the lower and upper bounds of your confidence interval and print them
# 95% with z = 1.96
lower <- sample_mean - se * 1.96 
upper <- sample_mean + se * 1.96

# to print several variables...
c(lower, sample_mean, upper)
## [1] 1306.779 1436.067 1565.354
# Plot
stripchart(c(lower, sample_mean, upper))

For the confidence interval to be valid, the sample mean must be normally distributed and have standard error:

\[\frac{s}{\sqrt{n}}\]

In addition, the sample must be random, and the sample size (60) must be less than 10% of all houses in Ames.
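
The 10% condition is easy to verify; a minimal sketch, assuming the ames data frame has 2,930 rows (an assumption, since the row count is not shown above):

```r
n <- 60
N <- 2930          # assumed number of houses in the ames data frame
n / N              # about 0.02
n / N < 0.10       # → TRUE
```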

We would expect 95% of such confidence intervals to capture the true population mean: 95% of random samples of size 60 will yield confidence intervals that contain the true average area of houses in Ames, Iowa.

Challenge (1)

We're going to recreate many samples using a for loop. Here is the rough outline:

  • Obtain a random sample.
  • Calculate the sample’s mean and standard deviation.
  • Use these statistics to calculate a confidence interval.
  • Repeat steps (1)-(3) 50 times.

# Initialize 'samp_mean', 'samp_sd' and 'n'
samp_mean <- rep(NA, 50)
samp_sd <- rep(NA, 50)
n <- 60

Challenge (2)

Now, we can use a for loop to calculate the mean and standard deviation of 50 random samples.

# Initialize 'samp_mean', 'samp_sd' and 'n'
samp_mean <- rep(NA, 50)
samp_sd <- rep(NA, 50)
n <- 60

# For loop goes here
for (i in 1:50) {
    samp <- sample(population, n)
    samp_mean[i] <- mean(samp)
    samp_sd[i] <- sd(samp)
}

Challenge (3)

Lastly, we can construct the confidence intervals and plot them. In this exercise, you will have to calculate the 95% confidence intervals.

The lower and upper bounds, respectively:

\[\overline{x} - 1.96 \times \frac{s}{\sqrt{n}}\]

\[\overline{x} + 1.96 \times \frac{s}{\sqrt{n}}\]

# Initialize 'samp_mean', 'samp_sd' and 'n'
samp_mean <- rep(NA, 50)
samp_sd <- rep(NA, 50)
n <- 60

# For loop goes here
for (i in 1:50) {
  samp <- sample(population, n) 
  samp_mean[i] <- mean(samp)
  samp_sd[i] <- sd(samp)
}

# Calculate the interval bounds here
lower <- samp_mean - 1.96 * (samp_sd/sqrt(n))
upper <- samp_mean + 1.96 * (samp_sd/sqrt(n))

Plot confidence interval – the function

plot_ci <- function(lo, hi, m){
  # 'contains' is not defined in the original listing; this helper checks
  # whether the interval [lo, hi] contains m
  contains <- function(lo, hi, m) lo <= m & m <= hi
  par(mar=c(2, 1, 1, 1), mgp=c(2.7, 0.7, 0))
  k <- 50
  ci.max <- max(rowSums(matrix(c(-1*lo,hi),ncol=2)))

  xR <- m + ci.max*c(-1, 1)
  yR <- c(0, 41*k/40)

  plot(xR, yR, type='n', xlab='', ylab='', axes=FALSE)
  abline(v=m, lty=2, col='#00000088')
  axis(1, at=m, paste("mu = ",round(m,4)), cex.axis=1.15)
  #axis(2)
  for(i in 1:k){
    x <- mean(c(hi[i],lo[i]))
      ci <- c(lo[i],hi[i])
      if(contains(lo[i],hi[i],m)==FALSE){
          col <- "#F05133"
          points(x, i, cex=1.4, col=col)
#         points(x, i, pch=20, cex=1.2, col=col)
          lines(ci, rep(i, 2), col=col, lwd=5)
      }
      col <- 1
    points(x, i, pch=20, cex=1.2, col=col)
      lines(ci, rep(i, 2), col=col)
  }
}

Plot confidence interval – the plot

# Plotting the confidence intervals
pop_mean <- mean(population)
plot_ci(lower, upper, pop_mean)

The 99%

Critical value for a 99% confidence level: 2.58.

# Initialize 'samp_mean', 'samp_sd' and 'n'
samp_mean <- rep(NA, 50)
samp_sd <- rep(NA, 50)
n <- 60

# For loop goes here
for (i in 1:50) {
  samp <- sample(population, n) 
  samp_mean[i] <- mean(samp)
  samp_sd[i] <- sd(samp)
}

# Calculate the interval bounds here
lower <- samp_mean - 2.58 * (samp_sd/sqrt(n))
upper <- samp_mean + 2.58 * (samp_sd/sqrt(n))

# Plotting the confidence intervals
pop_mean <- mean(population)
plot_ci(lower, upper, pop_mean)

We would expect 99% of the intervals to contain the true population mean.
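
That expectation can be checked by counting how many of the 50 intervals capture the population mean. A sketch on a simulated stand-in population:

```r
set.seed(42)
population <- rnorm(2930, mean = 1500, sd = 505)   # stand-in data
n <- 60

samp_mean <- rep(NA, 50)
samp_sd <- rep(NA, 50)
for (i in 1:50) {
  samp <- sample(population, n)
  samp_mean[i] <- mean(samp)
  samp_sd[i] <- sd(samp)
}

lower <- samp_mean - 2.58 * samp_sd / sqrt(n)
upper <- samp_mean + 2.58 * samp_sd / sqrt(n)

# Number of the 50 intervals that contain the population mean;
# with 99% intervals this is usually 49 or 50
sum(lower < mean(population) & mean(population) < upper)
```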



Inference for Numerical Data

North Carolina births

In 2004, the state of North Carolina released a large dataset containing information on births recorded in their state. This dataset is useful to researchers studying the relation between habits and practices of expectant mothers and the birth of their children. We will work with a random sample of observations from this dataset.

# List the variables in the dataset
names(nc)
##  [1] "fage"           "mage"           "mature"         "weeks"         
##  [5] "premie"         "visits"         "marital"        "gained"        
##  [9] "weight"         "lowbirthweight" "gender"         "habit"         
## [13] "whitemom"

The nc data frame contains observations on 13 different variables, some categorical and some numerical. The meaning of each variable is as follows:

  • fage: father’s age in years.
  • mage: mother’s age in years.
  • mature: maturity status of mother.
  • weeks: length of pregnancy in weeks.
  • premie: whether the birth was classified as premature (premie) or full-term.
  • visits: number of hospital visits during pregnancy.
  • marital: whether mother is married or not married at birth.
  • gained: weight gained by mother during pregnancy in pounds.
  • weight: weight of the baby at birth in pounds.
  • lowbirthweight: whether baby was classified as low birthweight (low) or not (not low).
  • gender: gender of the baby, female or male.
  • habit: status of the mother as a nonsmoker or a smoker.
  • whitemom: whether mom is white or not white.

A first analysis

# Compute summaries of the data
summary(nc)
##       fage            mage            mature        weeks      
##  Min.   :14.00   Min.   :13   mature mom :133   Min.   :20.00  
##  1st Qu.:25.00   1st Qu.:22   younger mom:867   1st Qu.:37.00  
##  Median :30.00   Median :27                     Median :39.00  
##  Mean   :30.26   Mean   :27                     Mean   :38.33  
##  3rd Qu.:35.00   3rd Qu.:32                     3rd Qu.:40.00  
##  Max.   :55.00   Max.   :50                     Max.   :45.00  
##  NA's   :171                                    NA's   :2      
##        premie        visits            marital        gained     
##  full term:846   Min.   : 0.0   married    :386   Min.   : 0.00  
##  premie   :152   1st Qu.:10.0   not married:613   1st Qu.:20.00  
##  NA's     :  2   Median :12.0   NA's       :  1   Median :30.00  
##                  Mean   :12.1                     Mean   :30.33  
##                  3rd Qu.:15.0                     3rd Qu.:38.00  
##                  Max.   :30.0                     Max.   :85.00  
##                  NA's   :9                        NA's   :27     
##      weight       lowbirthweight    gender          habit    
##  Min.   : 1.000   low    :111    female:503   nonsmoker:873  
##  1st Qu.: 6.380   not low:889    male  :497   smoker   :126  
##  Median : 7.310                               NA's     :  1  
##  Mean   : 7.101                                              
##  3rd Qu.: 8.060                                              
##  Max.   :11.750                                              
##                                                              
##       whitemom  
##  not white:284  
##  white    :714  
##  NA's     :  2  
##                 
##                 
##                 
## 
# Use visualization and summary statistics to view the data for 'gained'
summary(nc$gained)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   20.00   30.00   30.33   38.00   85.00      27
par(mfrow = c(2, 1))

boxplot(nc$gained, horizontal = TRUE, ylim = c(0, 90))
hist(nc$gained, xlim = c(0, 90), main = '')

par(mfrow = c(1, 1))

There are 27 missing (NA) values for weight gain.
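
The count of missing values comes straight from sum(is.na(nc$gained)); a toy illustration of the idiom:

```r
# is.na() flags missing values; summing the logical vector counts them
x <- c(20, NA, 35, NA, 18)
sum(is.na(x))   # → 2
```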

Cleaning your data

Since we have some missing data on weight gain, we’ll first create a cleaned-up version of the weight gain variable, and use this variable in the next few steps of the analysis.

# Create a clean version of the dataset
gained_clean <- na.omit(nc$gained)

# Set 'n' to the length of 'gained_clean'
n <- length(gained_clean)

The bootstrap

Using this sample we would like to construct a bootstrap confidence interval for the average weight gained by all mothers during pregnancy. A quick reminder of how bootstrapping works:

  • Take a bootstrap sample (a random sample with replacement of size equal to the original sample size) from the original sample.
  • Record the mean of this bootstrap sample.
  • Repeat steps (1) and (2) many times to build a bootstrap distribution.
  • Calculate the XX% interval using the percentile or the standard error method.

Let’s first take 100 bootstrap samples (i.e. with replacement), and record their means in a new object called boot_means.

# Initialize the 'boot_means' object
boot_means <- rep(NA, 100)

# Insert your for loop:
for (i in 1:100) {
    boot_sample <- sample(gained_clean, n, replace = TRUE)
    boot_means[i] <- mean(boot_sample)
}

# Make a histogram of 'boot_means'
hist(boot_means)
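
From this bootstrap distribution, a percentile-method 95% interval is just the 2.5th and 97.5th percentiles, computed with quantile(). A sketch on simulated stand-in data (the real vector is the cleaned nc$gained):

```r
set.seed(42)
gained_sim <- pmax(0, rnorm(973, mean = 30, sd = 14))   # stand-in sample
n <- length(gained_sim)

boot_means <- replicate(10000, mean(sample(gained_sim, n, replace = TRUE)))

# Percentile-method 95% bootstrap interval
quantile(boot_means, probs = c(0.025, 0.975))
```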

Sampling distribution vs bootstrap distribution

The sampling distribution is calculated by resampling from the population, whereas the bootstrap distribution is calculated by resampling from the sample.

The inference function

To construct the 95% bootstrap confidence interval using the percentile method, we take the 2.5th and 97.5th percentiles of the bootstrap distribution as the interval bounds.

The inference function automates the process. By default it takes 10,000 bootstrap samples (instead of the 100 taken above), creates a bootstrap distribution, and calculates the confidence interval using the percentile method.

The code for the custom function is not shown.

# Run the inference function (and plot the results)
boot_awg_90 <- inference(nc$gained,
          type = 'ci',
          method = 'simulation',
          conflevel = 0.9,
          est = 'mean',
          boot_method = 'perc')
## Single mean 
## Summary statistics:
## mean = 30.3258 ;  sd = 14.2413 ;  n = 973

## Bootstrap method: Percentile
## 90 % Bootstrap interval = ( 29.5827 , 31.0771 )
boot_awg_90
## $mfrow
## [1] 1 2
## 
## $mar
## [1] 4.0 4.0 0.5 0.5

Setting the confidence level

We can easily change the confidence level to 95% by changing the argument conflevel of the inference function.

# Run the inference function (and plot the results)
boot_awg_95 <- inference(nc$gained,
          type = 'ci',
          method = 'simulation',
          conflevel = 0.95,
          est = 'mean',
          boot_method = 'perc')
## Single mean 
## Summary statistics:
## mean = 30.3258 ;  sd = 14.2413 ;  n = 973

## Bootstrap method: Percentile
## 95 % Bootstrap interval = ( 29.4481 , 31.2127 )
boot_awg_95
## $mfrow
## [1] 1 2
## 
## $mar
## [1] 4.0 4.0 0.5 0.5

Choosing a bootstrap method

We can also use the standard error method. Change the boot_method argument to ‘se’ and run the code.

# Adapt the inference function
inference(nc$gained, 
          type = 'ci', 
          method = 'simulation', 
          conflevel = 0.95, 
          est = 'mean', 
          boot_method = 'se')
## Single mean 
## Summary statistics:
## mean = 30.3258 ;  sd = 14.2413 ;  n = 973

## Bootstrap method: Standard error; Boot. SE = 0.4532 
## 95 % Bootstrap interval = ( 29.4376 , 31.214 )

Setting the parameter of interest

To create an interval for the median instead of the mean, you can change the est argument.

# Adapt the inference function
inference(nc$gained,
          type = 'ci', 
          method = 'simulation', 
          conflevel = 0.95, 
          est = 'median', 
          boot_method = 'se')
## Single median 
## Summary statistics:
## median = 30 ;  n = 973

## Bootstrap method: Standard error; Boot. SE = 0.2467 
## 95 % Bootstrap interval = ( 29.5164 , 30.4836 )

The bootstrap distribution of the median weight gain is not a smooth, unimodal, symmetric distribution; it does not yield a reliable confidence interval estimate.

Father’s age

Create a 95% bootstrap confidence interval for the mean age of fathers at the birth of their child using the standard error method.

# Adapt the inference function to create a 95% bootstrap interval for the mean age of fathers
inference(nc$fage, 
          type = 'ci', 
          method = 'simulation', 
          conflevel = 0.95, 
          est = 'mean', 
          boot_method = 'se')
## Single mean 
## Summary statistics:
## mean = 30.2557 ;  sd = 6.7638 ;  n = 829

## Bootstrap method: Standard error; Boot. SE = 0.2354 
## 95 % Bootstrap interval = ( 29.7944 , 30.717 )

Relationships between two variables

When the response variable is numerical and the explanatory variable is categorical, we can evaluate the relationship between the two variables by comparing means (or medians, or other measures) of the numerical response variable across the levels of the explanatory categorical variable.

# Draw your plot here
boxplot(nc$weight ~ nc$habit, horizontal = TRUE)

Both distributions are slightly left skewed: in each group the mean is below the median.

The by function

We can also compare the means of the distributions using the by function to split the weight variable into the habit groups.

# Use the 'by' function here
by(nc$weight, nc$habit, mean)
## nc$habit: nonsmoker
## [1] 7.144273
## -------------------------------------------------------- 
## nc$habit: smoker
## [1] 6.82873
by(nc$weight, nc$habit, median)
## nc$habit: nonsmoker
## [1] 7.31
## -------------------------------------------------------- 
## nc$habit: smoker
## [1] 7.06
by(nc$weight, nc$habit, summary)
## nc$habit: nonsmoker
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   6.440   7.310   7.144   8.060  11.750 
## -------------------------------------------------------- 
## nc$habit: smoker
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.690   6.077   7.060   6.829   7.735   9.190

Conditions for inference

Before you can start hypothesis testing, you need to think about the conditions that are necessary for inference.

Obtain sample sizes to check the conditions.

# Use the 'by' function here
by(nc$weight, nc$habit, length)
## nc$habit: nonsmoker
## [1] 873
## -------------------------------------------------------- 
## nc$habit: smoker
## [1] 126

More inference

We use the inference function to evaluate whether there is a difference between the average birth weights of babies born to smoker and nonsmoker mothers.

Arguments:

  • The first argument is y, which is the response variable that we are interested in: nc$weight.
  • The second argument is the grouping variable, x, the explanatory variable across whose levels (smoker and nonsmoker) we compare the average value of the response: nc$habit.
  • The third argument, est, is the parameter we’re interested in: "mean" (other options are "median" and "proportion").
  • Next we decide on the type of inference we want: a hypothesis test ("ht") or a confidence interval ("ci").
  • When performing a hypothesis test, we also need to supply the null value, which in this case is 0, since the null hypothesis sets the two population means equal to each other.
  • The alternative hypothesis can be "less", "greater", or "twosided".
  • Lastly, the method of inference can be "theoretical" or "simulation" based.

# The code
inference(y = nc$weight,
          x = nc$habit, 
          est = 'mean', 
          type = 'ht', 
          null = 0, 
          alternative = 'twosided', 
          method = 'theoretical')
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_nonsmoker = 873, mean_nonsmoker = 7.1443, sd_nonsmoker = 1.5187
## n_smoker = 126, mean_smoker = 6.8287, sd_smoker = 1.3862
## Observed difference between means (nonsmoker-smoker) = 0.3155
## H0: mu_nonsmoker - mu_smoker = 0 
## HA: mu_nonsmoker - mu_smoker != 0 
## Standard error = 0.134 
## Test statistic: Z =  2.359 
## p-value =  0.0184
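
The Z statistic above can be reproduced from the printed summary statistics: the standard error of a difference of two independent means is sqrt(s1^2/n1 + s2^2/n2). Using the rounded values from the output:

```r
# Values copied from the summary statistics printed above
n1 <- 873; xbar1 <- 7.1443; s1 <- 1.5187   # nonsmokers
n2 <- 126; xbar2 <- 6.8287; s2 <- 1.3862   # smokers

se <- sqrt(s1^2 / n1 + s2^2 / n2)
z  <- (xbar1 - xbar2) / se
p  <- 2 * pnorm(-abs(z))
round(c(se = se, z = z, p = p), 4)   # matches the output up to rounding
```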

Changing the order

By default the inference() function sets the parameter of interest to be \(\mu_{\mathrm{nonsmoker}}-\mu_{\mathrm{smoker}}\). We can easily change this order by using the order argument.

To set the order to \(\mu_{\mathrm{first}} - \mu_{\mathrm{second}}\) use: order = c("first", "second").

# Levels: nonsmoker smoker

# Add the 'order' argument to the 'inference' function
inference(y = nc$weight,
          x = nc$habit,
          est = 'mean',
          type = 'ht',
          null = 0,
          alternative = 'twosided',
          method = 'theoretical',
          order = c('smoker', 'nonsmoker'))
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_smoker = 126, mean_smoker = 6.8287, sd_smoker = 1.3862
## n_nonsmoker = 873, mean_nonsmoker = 7.1443, sd_nonsmoker = 1.5187
## Observed difference between means (smoker-nonsmoker) = -0.3155
## H0: mu_smoker - mu_nonsmoker = 0 
## HA: mu_smoker - mu_nonsmoker != 0 
## Standard error = 0.134 
## Test statistic: Z =  -2.359 
## p-value =  0.0184

inference(y = nc$weight,
          x = nc$habit,
          est = 'mean',
          type = 'ci',
          null = 0,
          alternative = 'twosided',
          method = 'theoretical',
          order = c('smoker', 'nonsmoker'))
## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_smoker = 126, mean_smoker = 6.8287, sd_smoker = 1.3862
## n_nonsmoker = 873, mean_nonsmoker = 7.1443, sd_nonsmoker = 1.5187

## Observed difference between means (smoker-nonsmoker) = -0.3155
## Standard error = 0.1338 
## 95 % Confidence interval = ( -0.5777 , -0.0534 )

  • Observed difference between means (smoker-nonsmoker) = -0.3155
  • Standard error = 0.1338.
  • Confidence interval = ( -0.5777 , -0.0534 ).

We are 95% confident that babies born to nonsmoker mothers are on average 0.05 to 0.58 pounds heavier at birth than babies born to smoker mothers.
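
The interval itself can be recovered from the printed difference and standard error, up to rounding:

```r
# Values copied (rounded) from the output above
diff_means <- -0.3155
se <- 0.1338

round(c(diff_means - 1.96 * se, diff_means + 1.96 * se), 4)
# close to the reported ( -0.5777 , -0.0534 )
```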

We can also determine the age cutoff between younger and mature mothers:

by(nc$mage, nc$mature, summary)
## nc$mature: mature mom
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   35.00   35.00   37.00   37.18   38.00   50.00 
## -------------------------------------------------------- 
## nc$mature: younger mom
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13.00   21.00   25.00   25.44   30.00   34.00

The maximum age of younger moms is 34 and the minimum age of mature moms is 35, so the cutoff between the two groups is at age 35.

The General Social Survey

The General Social Survey (GSS) is a sociological survey used to collect data on demographic characteristics and attitudes of residents of the United States. The survey asks many questions, two of which we will focus on for the next few exercises: wordsum (vocabulary test scores) and class (self-reported social class).

wordsum ranges between 0 and 10, and is calculated as the number of vocabulary questions (out of 10) that the respondent answered correctly.

head(gss$wordsum)
## [1] 6 9 6 5 6 6
head(gss$class)
## [1] MIDDLE  WORKING WORKING WORKING WORKING WORKING
## Levels: LOWER MIDDLE UPPER WORKING

Analyze the variables

Create numerical and visual summaries of the two variables individually to better understand their distribution. Also compute summary statistics and visualizations that display the relationship between the two variables.

# Numerical and visual summaries of 'wordsum' and 'class'
summary(gss$wordsum)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   5.000   6.000   6.145   7.000  10.000
summary(gss$class)
##   LOWER  MIDDLE   UPPER WORKING 
##      41     331      16     407
par(mfrow = c(2,1))

hist(gss$wordsum)
boxplot(gss$wordsum, horizontal = TRUE)

par(mfrow = c(1,1))

# Numerical and visual summaries of their relations
by(gss$wordsum, gss$class, mean)
## gss$class: LOWER
## [1] 5.073171
## -------------------------------------------------------- 
## gss$class: MIDDLE
## [1] 6.761329
## -------------------------------------------------------- 
## gss$class: UPPER
## [1] 6.1875
## -------------------------------------------------------- 
## gss$class: WORKING
## [1] 5.749386
boxplot(gss$wordsum ~ gss$class)

ANOVA test

The ANOVA method can test for a difference between the average vocabulary test scores among the various classes.

# The test
inference(y = gss$wordsum, 
          x = gss$class, 
          est = 'mean', 
          method = 'theoretical', 
          type = 'ht', 
          alternative = 'greater')
## Response variable: numerical, Explanatory variable: categorical
## ANOVA
## Summary statistics:
## n_LOWER = 41, mean_LOWER = 5.0732, sd_LOWER = 2.2404
## n_MIDDLE = 331, mean_MIDDLE = 6.7613, sd_MIDDLE = 1.8863
## n_UPPER = 16, mean_UPPER = 6.1875, sd_UPPER = 2.3443
## n_WORKING = 407, mean_WORKING = 5.7494, sd_WORKING = 1.8652
## H_0: All means are equal.
## H_A: At least one mean is different.
## Analysis of Variance Table
## 
## Response: y
##            Df  Sum Sq Mean Sq F value    Pr(>F)
## x           3  236.56  78.855  21.735 1.561e-13
## Residuals 791 2869.80   3.628                  
## 
## Pairwise tests: t tests with pooled SD 
##          LOWER MIDDLE  UPPER
## MIDDLE  0.0000     NA     NA
## UPPER   0.0475 0.2396     NA
## WORKING 0.0306 0.0000 0.3671

Let’s examine the output:

  • First, we have a numerical response variable (score on vocabulary test) and a categorical explanatory variable (class).
  • Since class has four levels, comparing average scores across the levels of the class variable requires ANOVA.

Before we get to the ANOVA output we are presented with a series of summary statistics and the associated hypotheses.

Modified \(\alpha\) (\(\alpha^{*}\)):

\[\alpha^{*} = \frac{\alpha}{6} = 0.0083\]

The pairwise p-values for the middle–lower and middle–working comparisons (both 0.0000 in the ANOVA output) fall below \(\alpha^{*}\), so those means are concluded to be different (H_0 rejected) at the modified significance level. For all the other pairs, we cannot reject the null hypothesis.
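The modified significance level comes from dividing \(\alpha\) by the number of pairwise comparisons; a quick sketch of the arithmetic, assuming \(\alpha = 0.05\) and four groups:

```r
# With K = 4 classes there are K * (K - 1) / 2 = 6 pairwise comparisons,
# so the Bonferroni-corrected significance level is alpha / 6
K <- 4
n_comparisons <- K * (K - 1) / 2
alpha_star <- 0.05 / n_comparisons
round(alpha_star, 4)
```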



Inference for Categorical Data

In August of 2012, news outlets ranging from the Washington Post to the Huffington Post ran a story about the rise of atheism in America.

The source for the story was a poll that asked people, “Irrespective of whether you attend a place of worship or not, would you say you are a religious person, not a religious person or a convinced atheist?”

This type of question, which asks people to classify themselves in one way or another, is common in polling and generates categorical data.

The data

summary(atheism)
##              nationality           response          year     
##  Pakistan          : 5409   atheist    : 5498   Min.   :2005  
##  France            : 3359   non-atheist:82534   1st Qu.:2005  
##  Korea, Rep (South): 3047                       Median :2012  
##  Ghana             : 2995                       Mean   :2009  
##  Macedonia         : 2418                       3rd Qu.:2012  
##  Peru              : 2414                       Max.   :2012  
##  (Other)           :68390
str(atheism)
## 'data.frame':    88032 obs. of  3 variables:
##  $ nationality: Factor w/ 57 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ response   : Factor w/ 2 levels "atheist","non-atheist": 2 2 2 2 2 2 2 2 2 2 ...
##  $ year       : int  2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
head(atheism)
##   nationality    response year
## 1 Afghanistan non-atheist 2012
## 2 Afghanistan non-atheist 2012
## 3 Afghanistan non-atheist 2012
## 4 Afghanistan non-atheist 2012
## 5 Afghanistan non-atheist 2012
## 6 Afghanistan non-atheist 2012

A poll conducted by WIN-Gallup International surveyed 51,927 people (a sample) from 57 countries and gathered information face to face, by telephone, and through the Internet: ‘Global Index of Religiosity and Atheism’, WIN-Gallup, 2012.

Table 6 (pp. 15 and 16 of the report) shows the sample sizes and response percentages for each of the 57 countries.

Atheists in the US

Take a look at the estimated proportion of atheists in the United States (in 2012).

# Create the 'us12' subset 
us12 <- subset(atheism, atheism$nationality == 'United States' & atheism$year == '2012')

# Calculate the proportion of atheist responses:
proportion <- nrow(subset(us12, us12$response == 'atheist'))/nrow(us12)

# Print the proportion
proportion
## [1] 0.0499002
round(proportion, 2)
## [1] 0.05

Table 6 of the report provides statistics, that is, calculations made from the sample of 51,927 people.

The answer to the question “What proportion of people in your sample reported being atheists?” is a statistic, while the question “What proportion of people on earth would report being atheists?” is answered with an estimate of a parameter.

The inferential tools for estimating population proportion are analogous to those used for means in the previous lab: the confidence interval and the hypothesis test.

Although formal confidence intervals and hypothesis tests don’t show up in the report, suggestions of inference appear at the bottom of page 7: “In general, the error margin for surveys of this kind is ±3-5% at 95% confidence.”

inference(us12$response, 
          est = "proportion", 
          type = "ci", 
          method = "theoretical", 
          success = "atheist")
## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.0499 ;  n = 1002 
## Check conditions: number of successes = 50 ; number of failures = 952 
## Standard error = 0.0069 
## 95 % Confidence interval = ( 0.0364 , 0.0634 )

Standard error = 0.0069.

Margin of error (95%) = 0.0069*1.96 = 0.013524.
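As a check, the interval above can be reproduced by hand from the reported summary statistics (a sketch using the rounded p_hat and n from the output):

```r
# Recompute the US 2012 interval from the summary statistics above
p_hat <- 0.0499
n <- 1002
se <- sqrt(p_hat * (1 - p_hat) / n)   # standard error of the sample proportion
ci <- p_hat + c(-1, 1) * 1.96 * se    # 95% confidence interval
round(ci, 4)
## [1] 0.0364 0.0634
```

This matches the interval reported by the inference function.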

What about India?

# The subset for India for 2012
india <- subset(atheism, atheism$nationality == 'India' & atheism$year == 2012)

# The analysis using the 'inference' function
inference(india$response, 
          est = 'proportion', 
          type = 'ci', 
          method = 'theoretical', 
          success = 'atheist')
## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.0302 ;  n = 1092 
## Check conditions: number of successes = 33 ; number of failures = 1059 
## Standard error = 0.0052 
## 95 % Confidence interval = ( 0.0201 , 0.0404 )

Confidence intervals for the proportion of atheists in India (in 2012).

And China?

# The subset for China for 2012
china <- subset(atheism, atheism$nationality == 'China' & atheism$year == 2012)

# The analysis using the 'inference' function
inference(china$response, 
          est = 'proportion', 
          type = 'ci', 
          method = 'theoretical', 
          success = 'atheist')
## Single proportion -- success: atheist 
## Summary statistics:

## p_hat = 0.47 ;  n = 500 
## Check conditions: number of successes = 235 ; number of failures = 265 
## Standard error = 0.0223 
## 95 % Confidence interval = ( 0.4263 , 0.5137 )

Confidence intervals for the proportion of atheists in China (in 2012).

The margin of error

The formula for the standard error:

\[SE = \sqrt{p(1-p)/n}\]

The formula for the margin of error for a 95% confidence interval:

\[ME = 1.96 \times SE = 1.96 \times \sqrt {p(1-p)/n}\]

Since the population proportion p appears in this ME formula, the margin of error depends on the population proportion.

We can visualize this relationship by creating a plot of me vs p.

# The first step is to make a vector p that is a sequence from 0 to 1 with each number separated by 0.01
n <- 1000
p <- seq(0, 1, 0.01)

# Then create a vector of the margin of error (ME) associated with each of these
# values of p, using the familiar approximate formula ME = 2 * SE
# (2 is used in place of the more precise 1.96)
me <- 2 * sqrt(p * (1 - p)/n)

# Finally, plot the two vectors against each other to reveal their relationship
plot(me ~ p)

  • The ME reaches its minimum at p = 0 and p = 1.
  • The ME is maximized when p = 0.5.
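Because p(1 - p) peaks at p = 0.5, the maximum can also be located numerically:

```r
# ME is largest where p * (1 - p) is largest, i.e. at p = 0.5
n <- 1000
p <- seq(0, 1, 0.01)
me <- 2 * sqrt(p * (1 - p) / n)
p[which.max(me)]
## [1] 0.5
```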

Atheism in Spain

Table 4 on page 13 of the report summarizes survey results from 2005 and 2012 for 39 countries.

# Take the 'spain' subset
spain <- subset(atheism, atheism$nationality == 'Spain')

# Calculate the proportion of atheists from the 'spain' subset
proportion <- nrow(subset(spain, spain$response == 'atheist'))/nrow(spain)

# Use the inference function
inference(spain$response, 
          spain$year, 
          est = 'proportion', 
          type = 'ci', 
          method = 'theoretical', 
          success = 'atheist')
## Response variable: categorical, Explanatory variable: categorical
## Two categorical variables
## Difference between two proportions -- success: atheist
## Summary statistics:
##              x
## y             2005 2012  Sum
##   atheist      115  103  218
##   non-atheist 1031 1042 2073
##   Sum         1146 1145 2291

## Observed difference between proportions (2005-2012) = 0.0104
## 
## Check conditions:
##    2005 : number of successes = 115 ; number of failures = 1031 
##    2012 : number of successes = 103 ; number of failures = 1042 
## Standard error = 0.0123 
## 95 % Confidence interval = ( -0.0136 , 0.0344 )

There is no convincing evidence that Spain has seen a change in its atheism index between 2005 and 2012.

Rising in the US?

Will this situation be the same in the United States?

# Take the 'us' subset
us <- subset(atheism, atheism$nationality == 'United States')

# Calculate the proportion of atheists from the 'us' subset
proportion <- nrow(subset(us, us$response == 'atheist'))/nrow(us)

# Use the inference function
inference(us$response, 
          us$year, 
          est = 'proportion', 
          type = 'ci', 
          method = 'theoretical', 
          success = 'atheist')
## Response variable: categorical, Explanatory variable: categorical
## Two categorical variables
## Difference between two proportions -- success: atheist
## Summary statistics:
##              x
## y             2005 2012  Sum
##   atheist       10   50   60
##   non-atheist  992  952 1944
##   Sum         1002 1002 2004

## Observed difference between proportions (2005-2012) = -0.0399
## 
## Check conditions:
##    2005 : number of successes = 10 ; number of failures = 992 
##    2012 : number of successes = 50 ; number of failures = 952 
## Standard error = 0.0076 
## 95 % Confidence interval = ( -0.0547 , -0.0251 )

There is convincing evidence that the United States has seen a change in its atheism index between 2005 and 2012.

In Table 4 of the report, in how many of those (39) countries would you expect to detect a change (at a significance level of 0.05) simply by chance?

39 × 0.05 = 1.95, so we would expect to detect a change in about 2 of the 39 countries simply by chance.



Introduction to Linear Regression

Moneyball

The movie Moneyball focuses on the “quest for the secret of success in baseball”. It follows a low-budget team, the Oakland Athletics, who believed that underused statistics, such as a player’s ability to get on base, better predict the ability to score runs than typical statistics like home runs, RBIs (runs batted in), and batting average. Obtaining players who excelled in these underused statistics turned out to be much more affordable for the team.

The data are from all 30 Major League Baseball teams; we examine the linear relationship between runs scored in a season and a number of other player statistics. Our aim is to summarize these relationships both graphically and numerically in order to find which variable, if any, best predicts a team’s runs scored in a season.

head(mlb11)
##                  team runs at_bats hits homeruns bat_avg strikeouts
## 1       Texas Rangers  855    5659 1599      210   0.283        930
## 2      Boston Red Sox  875    5710 1600      203   0.280       1108
## 3      Detroit Tigers  787    5563 1540      169   0.277       1143
## 4  Kansas City Royals  730    5672 1560      129   0.275       1006
## 5 St. Louis Cardinals  762    5532 1513      162   0.273        978
## 6       New York Mets  718    5600 1477      108   0.264       1085
##   stolen_bases wins new_onbase new_slug new_obs
## 1          143   96      0.340    0.460   0.800
## 2          102   90      0.349    0.461   0.810
## 3           49   95      0.340    0.434   0.773
## 4          153   71      0.329    0.415   0.744
## 5           57   90      0.341    0.425   0.766
## 6          130   77      0.335    0.391   0.725

In addition to runs scored, there are seven traditionally used variables in the mlb11 dataset: at_bats, hits, homeruns, bat_avg (batting average), strikeouts, stolen_bases, and wins.

There are also three newer variables: new_onbase (on base percentage), new_slug (slugging percentage), and new_obs (on-base plus slugging).

For the first portion of the analysis, we’ll consider the seven traditional variables.

Quantifying the strength of a linear relationship

# Calculate the correlation between runs and at_bats
correlation <- cor(mlb11$runs, mlb11$at_bats)

# Print the result
correlation
## [1] 0.610627
# Plot it
plot(mlb11$runs ~ mlb11$at_bats)

The relationship is positive, linear, and moderately strong. One of the potential outliers is a team with approximately 5520 at bats.

Plot sum of squares – the function

plot_ss <- function(x, y, x1, y1, x2, y2, showSquares = FALSE, leastSquares = FALSE){
  plot(y~x, asp = 1)# xlab = paste(substitute(x)), ylab = paste(substitute(y)))
  
  if(leastSquares){
    m1 <- lm(y~x)
    y.hat <- m1$fit
  } else{
    #cat("Click two points to make a line.")
    #pt1 <- locator(1)
    points(x1, y1, pch = 21, col='red3', bg='red3', cex=1.8)
    #pt2 <- locator(1)
    points(x2, y2, pch = 21, col='red3', bg='red3', cex=1.8)
    pts <- data.frame("x" = c(x1, x2),"y" = c(y1, y2))
    m1 <- lm(y ~ x, data = pts)
    y.hat <- predict(m1, newdata = data.frame(x))
  }
  r <- y - y.hat
  abline(m1)
  
  oSide <- x - r
  LLim <- par()$usr[1]
  RLim <- par()$usr[2]
  oSide[oSide < LLim | oSide > RLim] <- c(x + r)[oSide < LLim | oSide > RLim] # move boxes to avoid margins
  
  n <- length(y.hat)
  for(i in 1:n){
    lines(rep(x[i], 2), c(y[i], y.hat[i]), lty = 2, col = "blue")
    if(showSquares){
      lines(rep(oSide[i], 2), c(y[i], y.hat[i]), lty = 3, col = "orange")
      lines(c(oSide[i], x[i]), rep(y.hat[i],2), lty = 3, col = "orange")
      lines(c(oSide[i], x[i]), rep(y[i],2), lty = 3, col = "orange")
    }
  }
  
  SS <- round(sum(r^2), 3)
  cat("\r                                ")
  print(m1)
  cat("Sum of Squares: ", SS)
}

Estimating the best fit

This function will first draw a scatterplot of the first two arguments x and y. Then it draws two points (x1, y1) and (x2, y2) that are shown as red circles. These points are used to draw the line that represents the regression estimate. The line you specified is shown in black and the residuals in blue. Note that there are 30 residuals, one for each of the 30 observations. Recall that the residuals are the difference between the observed values and the values predicted by the line:

\[e_i = y_i - \hat{y_i}\]
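On a small synthetic fit (toy data, not the mlb11 dataset), the same quantity can be computed by hand and compared against what lm stores:

```r
# Residuals are the observed values minus the fitted values
set.seed(42)
x <- 1:10
y <- 3 + 0.5 * x + rnorm(10, sd = 0.2)
m <- lm(y ~ x)
e <- y - fitted(m)          # e_i = y_i - y_hat_i, computed by hand
all.equal(unname(e), unname(resid(m)))
## [1] TRUE
```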

# Use the 'plot_ss' function to draw an estimated regression line. Two points are given to get you started
x1 <- 5400
y1 <- 750
x2 <- 5700
y2 <- 650

# Plot sum of squares
plot_ss(x = mlb11$at_bats, y = mlb11$runs, x1, y1, x2, y2, showSquares = FALSE)

## 
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##   2550.0000      -0.3333  
## 
## Sum of Squares:  302572.3

The most common way to do linear regression is to select the line that minimizes the sum of squared residuals. To visualize the squared residuals, you can rerun the plot_ss command and add the argument showSquares = TRUE.

# Use the 'plot_ss' function to draw an estimated regression line. Two points are given to get you started
x1 <- 5400
y1 <- 750
x2 <- 5700
y2 <- 650

# Plot sum of squares
plot_ss(x = mlb11$at_bats, y = mlb11$runs, x1, y1, x2, y2, showSquares = TRUE)

## 
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##   2550.0000      -0.3333  
## 
## Sum of Squares:  302572.3

The best fit

Removing the coordinates used for the manual estimate and setting leastSquares = TRUE lets the function compute and plot the best fitting (least squares) line.

# Adapt the function to plot the best fitting line
plot_ss(x = mlb11$at_bats, y = mlb11$runs, leastSquares = TRUE)

## 
                                
## Call:
## lm(formula = y ~ x)
## 
## Coefficients:
## (Intercept)            x  
##  -2789.2429       0.6305  
## 
## Sum of Squares:  123721.9

The linear model

It is rather cumbersome to try to get the correct least squares line, i.e. the line that minimizes the sum of squared residuals, through trial and error. Instead we can use the lm function.

# Use the 'lm' function to make the linear model
m1 <- lm(runs ~ at_bats, data = mlb11)

# Print the model
m1
## 
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
## 
## Coefficients:
## (Intercept)      at_bats  
##  -2789.2429       0.6305

Understanding the output of lm

Access more information (coefficients).

# Use the 'lm' function to make the linear model
m1 <- lm(runs ~ at_bats, data = mlb11)

# The summary of 'm1'
summary(m1)
## 
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -125.58  -47.05  -16.59   54.40  176.87 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2789.2429   853.6957  -3.267 0.002871 ** 
## at_bats         0.6305     0.1545   4.080 0.000339 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 66.47 on 28 degrees of freedom
## Multiple R-squared:  0.3729, Adjusted R-squared:  0.3505 
## F-statistic: 16.65 on 1 and 28 DF,  p-value: 0.0003388

Breaking it down

# Use the 'lm' function to make the linear model Print the summary
m2 <- lm(runs ~ homeruns, data = mlb11)

summary(m2)
## 
## Call:
## lm(formula = runs ~ homeruns, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -91.615 -33.410   3.231  24.292 104.631 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 415.2389    41.6779   9.963 1.04e-10 ***
## homeruns      1.8345     0.2677   6.854 1.90e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 51.29 on 28 degrees of freedom
## Multiple R-squared:  0.6266, Adjusted R-squared:  0.6132 
## F-statistic: 46.98 on 1 and 28 DF,  p-value: 1.9e-07
  • First, the formula used to describe the model is shown at the top.
  • After the formula you find the five-number summary of the residuals.
  • The “Coefficients” table shown next is key; its first column displays the linear model’s y-intercept and the coefficient of homeruns.
  • One last piece of information we will discuss from the summary output is the Multiple R-squared.

For each additional home run, the model predicts 1.83 more runs, on average.

Prediction and prediction errors

abline takes the intercept and slope as arguments. Here, we use a shortcut by providing the model itself, which contains both parameter estimates: abline(m1).

# Create a scatterplot
# x, y
plot(mlb11$at_bats, mlb11$runs)

# The linear model:
# y ~ x
m1 <- lm(runs ~ at_bats, data = mlb11)

# Plot the least squares line
abline(m1, lty = 3, lwd = 2, col = 'red3')

How many runs would the model predict for a team with 5,579 at-bats?

summary(m1)[[4]][1]
## [1] -2789.243
summary(m1)[[4]][2]
## [1] 0.63055
est_runs <- summary(m1)[[4]][1] + summary(m1)[[4]][2] * 5579

true_runs <- subset(mlb11$runs, mlb11$at_bats == 5579)

est_runs
## [1] 728.5955
true_runs
## [1] 713
true_runs - est_runs
## [1] -15.59552
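Indexing summary(m1)[[4]] works, but coef() and predict() are the idiomatic accessors; a self-contained sketch on toy data (mlb11 is only available inside the lab, so a small synthetic frame stands in):

```r
# coef()/predict() give the same answer as the manual slope-intercept arithmetic
d <- data.frame(x = c(1, 2, 3, 4), y = c(2.1, 3.9, 6.2, 7.8))
m <- lm(y ~ x, data = d)
manual <- coef(m)[1] + coef(m)[2] * 2.5        # intercept + slope * x
via_predict <- predict(m, newdata = data.frame(x = 2.5))
all.equal(unname(manual), unname(via_predict))
## [1] TRUE
```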

To assess whether the linear model is reliable, you need to check for three things:

    1. linearity
    2. nearly normal residuals
    3. constant variability
plot(m1$residuals ~ mlb11$at_bats)
abline(h = 0, lty = 3, lwd = 2, col = 'red3')

The residuals show a slightly curved pattern. Overall, the constant variability condition (3) is met.

To check for ‘nearly normal residuals’, draw a normal probability plot of the residuals.

qqnorm(m1$residuals)
qqline(m1$residuals)

We can also look at a histogram.

hist(m1$residuals)

The residuals are fairly symmetric, with only a slightly longer tail on the right, hence it would be appropriate to deem the nearly normal residuals condition met (conditions 1 and 2).

In conclusion, among the traditional variables, batting average is the best predictor of runs.

attach(mlb11)

par(mfrow = c(1, 2))

# at_bats
plot(runs ~ at_bats)
abline(lm(runs ~ at_bats, data = mlb11), lty = 3, lwd = 2, col = 'red3')

# hits
plot(runs ~ hits)
abline(lm(runs ~ hits, data = mlb11), lty = 3, lwd = 2, col = 'red3')

# homeruns
plot(runs ~ homeruns)
abline(lm(runs ~ homeruns, data = mlb11), lty = 3, lwd = 2, col = 'red3')

# bat_avg (batting average)
plot(runs ~ bat_avg)
abline(lm(runs ~ bat_avg, data = mlb11), lty = 3, lwd = 2, col = 'red3')

# strikeouts
plot(runs ~ strikeouts)
abline(lm(runs ~ strikeouts, data = mlb11), lty = 3, lwd = 2, col = 'red3')

# stolen_bases
plot(runs ~ stolen_bases)
abline(lm(runs ~ stolen_bases, data = mlb11), lty = 3, lwd = 2, col = 'red3')

# wins
plot(runs ~ wins)
abline(lm(runs ~ wins, data = mlb11), lty = 3, lwd = 2, col = 'red3')

# new_onbase (on base percentage)
plot(runs ~ new_onbase)
abline(lm(runs ~ new_onbase, data = mlb11), lty = 3, lwd = 2, col = 'red3')

# new_slug (slugging percentage)
plot(runs ~ new_slug)
abline(lm(runs ~ new_slug, data = mlb11), lty = 3, lwd = 2, col = 'red3')

# new_obs (on-base plus slugging)
plot(runs ~ new_obs)
abline(lm(runs ~ new_obs, data = mlb11), lty = 3, lwd = 2, col = 'red3')

par(mfrow = c(1, 1))

detach(mlb11)


Multiple Linear Regression

The data

Many college courses conclude by giving students the opportunity to evaluate the course and the instructor anonymously. However, the use of these student evaluations as an indicator of course quality and teaching effectiveness is often criticized because these measures may reflect the influence of non-teaching related characteristics, such as the physical appearance of the instructor.

The article titled, “Beauty in the classroom: instructors’ pulchritude and putative pedagogical productivity” (Hamermesh and Parker, 2005) found that instructors who are viewed to be better looking receive higher instructional ratings.

The data were gathered from end of semester student evaluations for a large sample of professors from the University of Texas at Austin. In addition, six students rated the professors’ physical appearance.

This is a slightly modified version of the original dataset that was released as part of the replication data for Data Analysis Using Regression and Multilevel/Hierarchical Models (Gelman and Hill, 2007).

Different variables in the data frame, and their meaning:

  • score: average professor evaluation score: (1) very unsatisfactory - (5) excellent.
  • rank: rank of professor: teaching, tenure track, tenured.
  • ethnicity: ethnicity of professor: not minority, minority.
  • gender: gender of professor: female, male.
  • language: language of school where professor received education: english or non-english.
  • age: age of professor.
  • cls_perc_eval: percent of students in class who completed evaluation.
  • cls_did_eval: number of students in class who completed evaluation.
  • cls_students: total number of students in class.
  • cls_level: class level: lower, upper.
  • cls_profs: number of professors teaching sections in course in sample: single, multiple.
  • cls_credits: number of credits of class: one credit (lab, PE, etc.), multi credit.
  • bty_f1lower: beauty rating of professor from lower level female: (1) lowest - (10) highest.
  • bty_f1upper: beauty rating of professor from upper level female: (1) lowest - (10) highest.
  • bty_f2upper: beauty rating of professor from second upper level female: (1) lowest - (10) highest.
  • bty_m1lower: beauty rating of professor from lower level male: (1) lowest - (10) highest.
  • bty_m1upper: beauty rating of professor from upper level male: (1) lowest - (10) highest.
  • bty_m2upper: beauty rating of professor from second upper level male: (1) lowest - (10) highest.
  • bty_avg: average beauty rating of professor.
  • pic_outfit: outfit of professor in picture: not formal, formal. (not used in this analysis)
  • pic_color: color of professor’s picture: color, black & white. (not used in this analysis)

The original research question posed in the paper is whether beauty leads directly to differences in course evaluations. It should be revised to “Is there an association between beauty and course evaluations?”.

summary(evals$score)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.300   3.800   4.300   4.175   4.600   5.000
par(mfrow = c(2, 1))

hist(evals$score, main = '')
boxplot(evals$score, horizontal = TRUE)

par(mfrow = c(1, 1))

The left skewness of the data suggests that the students are more likely to rate the professors highly.

Visualizing relationships

More analysis.

# Create a scatterplot for 'age' vs 'bty_avg'
plot(evals$age, evals$bty_avg)

# Create a boxplot for 'age' and 'gender'
boxplot(evals$age ~ evals$gender)

# Create a mosaic plot for 'rank' and 'gender'
mosaicplot(evals$rank ~ evals$gender)

Simple linear regression

The fundamental phenomenon suggested by the study is that better looking teachers are evaluated more favorably.

# Create a scatterplot for 'score' and 'bty_avg'
plot(evals$bty_avg, evals$score)

Replot the scatterplot, but this time use the jitter function.

The jitter function

# Apply 'jitter' on the 'bty_avg' or 'score' variable of your initial plot
plot(jitter(evals$bty_avg), evals$score)

The hexbin plot

library(hexbin)

# Draw a hexbinplot
hexbinplot(evals$score ~ evals$bty_avg, aspect = 0.5)

More than natural variation?

Fit a linear model called m_bty to predict average professor score by average beauty rating.

# Your initial plot
plot(evals$score ~ jitter(evals$bty_avg))

# Construct the linear model
m_bty <- lm(score ~ bty_avg, data = evals)

# Plot your linear model on your plot
abline(m_bty, lty = 3, lwd = 2, col = 'red3')

Average beauty score is a statistically significant predictor of evaluation score, but the slope is small, so it is arguably not a practically significant one.

plot(m_bty$residuals ~ evals$bty_avg)
abline(h = 0, lty = 3, lwd = 2, col = 'red3')

qqnorm(m_bty$residuals)
qqline(m_bty$residuals)

hist(m_bty$residuals)

  • Linear association: The residuals plot shows a random scatter.
  • Constant variance of residuals: no fan shape in residuals plot.
  • Independent observations: Classes sampled randomly, no order effect.

Multiple linear regression

The dataset contains several variables on the beauty score of the professor: individual ratings from each of the six students who were asked to score the physical appearance of the professors and the average of these six scores.

# Your scatterplot
plot(evals$bty_f1lower, evals$bty_avg)

# The correlation
cor(evals$bty_f1lower, evals$bty_avg)
## [1] 0.8439112

The relationship between all beauty variables

The relationship is quite strong; after all, the average score is calculated using the individual scores. We can actually take a look at the relationships between all beauty variables (columns 13 through 19).

# The relationships between all beauty variables
plot(evals[, 13:19])

Taking into account gender

In order to see if beauty is still a significant predictor of professor score after we’ve accounted for the gender of the professor, we can add the gender term into the model.

# Your new linear model
m_bty_gen <- lm(score ~ bty_avg + gender, data = evals)

# Study the outcome
summary(m_bty_gen)
## 
## Call:
## lm(formula = score ~ bty_avg + gender, data = evals)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8305 -0.3625  0.1055  0.4213  0.9314 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.74734    0.08466  44.266  < 2e-16 ***
## bty_avg      0.07416    0.01625   4.563 6.48e-06 ***
## gendermale   0.17239    0.05022   3.433 0.000652 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5287 on 460 degrees of freedom
## Multiple R-squared:  0.05912,    Adjusted R-squared:  0.05503 
## F-statistic: 14.45 on 2 and 460 DF,  p-value: 8.177e-07

p-values and parameter estimates should only be trusted if the conditions for the regression are reasonable. Using diagnostic plots to check a linear regression’s principal assumptions, we can conclude that the conditions for this model are reasonable.

Custom function to draw multiLines

The code for the custom function is not shown.

Gendermale

Note that the estimate for gender is called gendermale in your summary output.

Whenever we introduce a categorical variable, we recode gender from having the values of female and male to being an indicator variable called gendermale that takes a value of 0 for females and a value of 1 for males. (Such variables are often referred to as ‘dummy’ variables.)

As a result, for females, the parameter estimate is multiplied by zero, leaving the intercept and slope form familiar from simple regression.
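The 0/1 recoding can be seen directly with model.matrix() on a small standalone factor (a sketch independent of the evals data):

```r
# R turns a two-level factor into an indicator column named gendermale
# (levels are ordered alphabetically, so "female" is the reference level)
gender <- factor(c("female", "male", "male", "female"))
model.matrix(~ gender)[, "gendermale"]
## [1] 0 1 1 0
```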

# Your plot
multiLines(m_bty_gen)

For two professors (one male and one female) who received the same beauty rating, the male professor is predicted to have a higher course evaluation score than the female professor.

Switching rank and gender

Create a new model called m_bty_rank with gender removed and rank added in.

# Your linear model with rank and average beauty
m_bty_rank <- lm(score ~ bty_avg + rank, data = evals)

# View the regression output
summary(m_bty_rank)
## 
## Call:
## lm(formula = score ~ bty_avg + rank, data = evals)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8713 -0.3642  0.1489  0.4103  0.9525 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       3.98155    0.09078  43.860  < 2e-16 ***
## bty_avg           0.06783    0.01655   4.098 4.92e-05 ***
## ranktenure track -0.16070    0.07395  -2.173   0.0303 *  
## ranktenured      -0.12623    0.06266  -2.014   0.0445 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5328 on 459 degrees of freedom
## Multiple R-squared:  0.04652,    Adjusted R-squared:  0.04029 
## F-statistic: 7.465 on 3 and 459 DF,  p-value: 6.88e-05

Correct order of the three levels of rank if we were to order them from lowest predicted course evaluation score to highest predicted course evaluation score:

  1. Tenure Track (-0.16).
  2. Tenured (-0.13).
  3. Teaching (0, the reference level).
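Plugging the estimates from the summary above into the fitted equation makes the ordering concrete. Here the predictions are computed by hand at a beauty rating of 5; teaching is the reference level, so its rank term is 0:

```r
# Predicted score at bty_avg = 5 for each rank, using the estimates above
b0    <- 3.98155   # intercept
b_bty <- 0.06783   # beauty slope
b_tt  <- -0.16070  # ranktenure track offset vs. teaching
b_ten <- -0.12623  # ranktenured offset vs. teaching

preds <- c(tenure_track = b0 + b_bty * 5 + b_tt,
           tenured      = b0 + b_bty * 5 + b_ten,
           teaching     = b0 + b_bty * 5)

sort(preds)  # lowest to highest: tenure track, tenured, teaching
```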

The search for the best model

We will start with a full model that predicts professor score based on rank, ethnicity, gender, language of the university where they got their degree, age, proportion of students that filled out evaluations, class size, course level, number of professors, number of credits, and average beauty rating.

Note that we do not include the pic_outfit or pic_color variables in the full model, because the original study used them in a separate analysis evaluating whether they are related to how the six students involved in the study scored the professors' beauty, not to how students evaluate their professors in class.

# The full model
m_full <- lm(score ~ rank + ethnicity + gender + language + age + cls_perc_eval + cls_students + cls_level + cls_profs + cls_credits + bty_avg, data = evals)

# View the regression output
summary(m_full)
## 
## Call:
## lm(formula = score ~ rank + ethnicity + gender + language + age + 
##     cls_perc_eval + cls_students + cls_level + cls_profs + cls_credits + 
##     bty_avg, data = evals)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.84482 -0.31367  0.08559  0.35732  1.10105 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            3.5305036  0.2408200  14.660  < 2e-16 ***
## ranktenure track      -0.1070121  0.0820250  -1.305 0.192687    
## ranktenured           -0.0450371  0.0652185  -0.691 0.490199    
## ethnicitynot minority  0.1869649  0.0775329   2.411 0.016290 *  
## gendermale             0.1786166  0.0515346   3.466 0.000579 ***
## languagenon-english   -0.1268254  0.1080358  -1.174 0.241048    
## age                   -0.0066498  0.0030830  -2.157 0.031542 *  
## cls_perc_eval          0.0056996  0.0015514   3.674 0.000268 ***
## cls_students           0.0004455  0.0003585   1.243 0.214596    
## cls_levelupper         0.0187105  0.0555833   0.337 0.736560    
## cls_profssingle       -0.0085751  0.0513527  -0.167 0.867458    
## cls_creditsone credit  0.5087427  0.1170130   4.348  1.7e-05 ***
## bty_avg                0.0612651  0.0166755   3.674 0.000268 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.504 on 450 degrees of freedom
## Multiple R-squared:  0.1635, Adjusted R-squared:  0.1412 
## F-statistic: 7.331 on 12 and 450 DF,  p-value: 2.406e-12

Non-minority professors (identified in the results as ethnicitynot minority) are expected on average to score 0.19 points higher than minority professors, all else held constant.

Eliminating variables from the model – p-value selection

Drop the variable with the highest p-value in the m_full model.

# The full model
m_full <- lm(score ~ rank + ethnicity + gender + language + age + cls_perc_eval + cls_students + cls_level + cls_profs + cls_credits + bty_avg, data = evals)

# Your new model (w/o cls_profs, the variable with the highest p-value)
m_new <- lm(score ~ rank + ethnicity + gender + language + age + cls_perc_eval + cls_students + cls_level + cls_credits + bty_avg, data = evals)

# View the regression output
summary(m_new)
## 
## Call:
## lm(formula = score ~ rank + ethnicity + gender + language + age + 
##     cls_perc_eval + cls_students + cls_level + cls_credits + 
##     bty_avg, data = evals)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.85048 -0.31394  0.08052  0.35956  1.10356 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            3.5286297  0.2402990  14.684  < 2e-16 ***
## ranktenure track      -0.1073638  0.0819096  -1.311 0.190606    
## ranktenured           -0.0453744  0.0651169  -0.697 0.486278    
## ethnicitynot minority  0.1893718  0.0760992   2.488 0.013189 *  
## gendermale             0.1780270  0.0513581   3.466 0.000578 ***
## languagenon-english   -0.1265737  0.1079088  -1.173 0.241427    
## age                   -0.0066619  0.0030788  -2.164 0.031006 *  
## cls_perc_eval          0.0056790  0.0015448   3.676 0.000265 ***
## cls_students           0.0004493  0.0003573   1.257 0.209319    
## cls_levelupper         0.0183743  0.0554870   0.331 0.740687    
## cls_creditsone credit  0.5109162  0.1161614   4.398 1.36e-05 ***
## bty_avg                0.0611497  0.0166432   3.674 0.000267 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5035 on 451 degrees of freedom
## Multiple R-squared:  0.1635, Adjusted R-squared:  0.1431 
## F-statistic: 8.012 on 11 and 451 DF,  p-value: 8.303e-13
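Rather than scanning the summary by hand, drop1() reports one p-value per term (treating a multi-level factor such as rank as a single term), which is exactly what p-value backward elimination needs. A minimal self-contained sketch on synthetic data, not the evals data:

```r
# Sketch of per-term p-values with drop1() (synthetic data)
set.seed(42)
d <- data.frame(y  = rnorm(100),
                x1 = rnorm(100),
                x2 = factor(sample(c("a", "b", "c"), 100, replace = TRUE)))

m <- lm(y ~ x1 + x2, data = d)

# One F-test (and one p-value) per term, not per coefficient
drop1(m, test = "F")
```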

Eliminating variables from the model – adjusted R-squared selection

Now you will create a new model, dropping the variable whose removal yields the highest adjusted R-squared.

# The full model
m_full <- lm(score ~ rank + ethnicity + gender + language + age + cls_perc_eval + cls_students + cls_level + cls_profs + cls_credits + bty_avg, data = evals)

summary(m_full)$adj.r.squared
## [1] 0.1412172
# Remove rank
m1 <- lm(score ~ ethnicity + gender + language + age + cls_perc_eval + cls_students + cls_level + cls_profs + cls_credits + bty_avg, data = evals)

summary(m1)$adj.r.squared
## [1] 0.1417823
# Remove ethnicity
m2 <- lm(score ~ rank + gender + language + age + cls_perc_eval + cls_students + cls_level + cls_profs + cls_credits + bty_avg, data = evals)

summary(m2)$adj.r.squared
## [1] 0.1320486
# Remove gender
m3 <- lm(score ~ rank + ethnicity + language + age + cls_perc_eval + cls_students + cls_level + cls_profs + cls_credits + bty_avg, data = evals)

summary(m3)$adj.r.squared
## [1] 0.1202468
# Remove language
m4 <- lm(score ~ rank + ethnicity + gender + age + cls_perc_eval + cls_students + cls_level + cls_profs + cls_credits + bty_avg, data = evals)

summary(m4)$adj.r.squared
## [1] 0.1404972
# Remove age
m5 <- lm(score ~ rank + ethnicity + gender + language + cls_perc_eval + cls_students + cls_level + cls_profs + cls_credits + bty_avg, data = evals)

summary(m5)$adj.r.squared
## [1] 0.1342627
# Remove cls_perc_eval
m6 <- lm(score ~ rank + ethnicity + gender + language + age + cls_students + cls_level + cls_profs + cls_credits + bty_avg, data = evals)

summary(m6)$adj.r.squared
## [1] 0.1174213
# Remove cls_students
m7 <- lm(score ~ rank + ethnicity + gender + language + age + cls_perc_eval + cls_level + cls_profs + cls_credits + bty_avg, data = evals)

summary(m7)$adj.r.squared
## [1] 0.1401804
# Remove cls_level
m8 <- lm(score ~ rank + ethnicity + gender + language + age + cls_perc_eval + cls_students + cls_profs + cls_credits + bty_avg, data = evals)

summary(m8)$adj.r.squared
## [1] 0.1429056
# Remove cls_profs
m9 <- lm(score ~ rank + ethnicity + gender + language + age + cls_perc_eval + cls_students + cls_level + cls_credits + bty_avg, data = evals)

summary(m9)$adj.r.squared
## [1] 0.1430683
# Remove cls_credits
m10 <- lm(score ~ rank + ethnicity + gender + language + age + cls_perc_eval + cls_students + cls_level + cls_profs + bty_avg, data = evals)

summary(m10)$adj.r.squared
## [1] 0.107127
# Remove bty_avg
m11 <- lm(score ~ rank + ethnicity + gender + language + age + cls_perc_eval + cls_students + cls_level + cls_profs + cls_credits, data = evals)

summary(m11)$adj.r.squared
## [1] 0.1174189
results <- c(summary(m_full)$adj.r.squared,
             summary(m1)$adj.r.squared,
             summary(m2)$adj.r.squared,
             summary(m3)$adj.r.squared,
             summary(m4)$adj.r.squared,
             summary(m5)$adj.r.squared,
             summary(m6)$adj.r.squared,
             summary(m7)$adj.r.squared,
             summary(m8)$adj.r.squared,
             summary(m9)$adj.r.squared,
             summary(m10)$adj.r.squared,
             summary(m11)$adj.r.squared)

names(results) <- c('FULL MODEL',
                    'Remove rank',
                    'Remove ethnicity',
                    'Remove gender',
                    'Remove language',
                    'Remove age',
                    'Remove cls_perc_eval',
                    'Remove cls_students',
                    'Remove cls_level',
                    'Remove cls_profs',
                    'Remove cls_credits',
                    'Remove bty_avg')

dotchart(sort(results), xlim = c(0.10, 0.16), main = 'Models & Adjusted R-squared')

Removal of cls_profs yields the highest adjusted R-squared.
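The twelve refits above can be automated by updating the full model once per term; here is a sketch on synthetic data (the same pattern applies directly to m_full):

```r
# Sketch: drop each term in turn and collect the adjusted R-squared
set.seed(1)
d <- data.frame(y = rnorm(50), a = rnorm(50), b = rnorm(50), c = rnorm(50))
full <- lm(y ~ a + b + c, data = d)

drops <- attr(terms(full), "term.labels")  # one entry per model term
adj <- sapply(drops, function(t)
  summary(update(full, as.formula(paste(". ~ . -", t))))$adj.r.squared)

# The highest value identifies the variable whose removal helps most
sort(adj, decreasing = TRUE)
```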