Foreword

  • Snippets and results.
  • Source: 'Exploring Pitch Data with R' from DataCamp fitted into Jupyter/IPython using the IRkernel.


Clean the data

The data contain information about every pitch thrown by Zack Greinke during the 2015 regular season.

These data include over 3,200 individual pitches and 29 variables associated with the game date, inning, location, velocity, movement, pitch type, pitch results, at bat results, and more.

There are some missing data in the start_speed variable that will require removal. You can easily do this with the subset() function, which takes a dataset or vector you wish to subset and a conditional statement that tells R how it is you wish to subset. Once you have the data prepped and understand some of the basic characteristics of the variables, we can move onto some interesting analyses of Greinke's 2015 season statistics.

First, load the data and take it a look

In [30]:
greinke <- read.table(file = 'Baseball.csv', header = TRUE, sep = ';')
In [31]:
# Print the first 6 rows of the data
head(greinke, 6)
tail(greinke,6)
p_namepitcher_idbatter_standpitch_typepitch_resultatbat_resultstart_speedz0x0pfx_x...outsgame_dateinninginning_topbotbatted_ball_typebatted_ball_velocityhc_xhc_ypitch_iddistance_feet
Zack Greinke 425844 R FF Ball Walk 94.2 5.997 -0.675 -4.457 ... 2 10/3/2015 4 top NA 0.00 0.00 160 NA
Zack Greinke 425844 R FF SwingingStrikeSingle 92.4 6.281 -0.760 -1.590 ... 0 10/3/2015 3 top 104 123.56 97.26 95 0
Zack Greinke 425844 R FF CalledStrike Home Run 92.7 6.168 -0.958 -1.884 ... 1 10/3/2015 5 top 103 50.88 31.17 218 425
Zack Greinke 425844 R SL SwingingStrikeStrikeout 86.9 6.077 -0.939 3.594 ... 0 10/3/2015 6 top NA 0.00 0.00 265 NA
Zack Greinke 425844 R FF SwingingStrikeStrikeout 92.8 6.107 -0.524 -0.558 ... 0 10/3/2015 8 top NA 0.00 0.00 374 NA
Zack Greinke 425844 R SL SwingingStrikeStrikeout 87.8 6.321 -0.948 4.313 ... 1 10/3/2015 1 top NA 0.00 0.00 14 NA
p_namepitcher_idbatter_standpitch_typepitch_resultatbat_resultstart_speedz0x0pfx_x...outsgame_dateinninginning_topbotbatted_ball_typebatted_ball_velocityhc_xhc_ypitch_iddistance_feet
3231Zack Greinke 425844 L CH Foul Field Error 86.0 6.078 -1.437 -6.671 ... 2 4/7/2015 2 top 99 114.81 151.43 80 0
3232Zack Greinke 425844 L CH Inplay,noout Field Error 88.7 6.022 -1.339 -5.558 ... 2 4/7/2015 2 top GB 99 114.81 151.43 84 0
3233Zack Greinke 425844 L FF CalledStrike Field Error 90.6 6.020 -1.177 -0.055 ... 2 4/7/2015 2 top 99 114.81 151.43 79 0
3234Zack Greinke 425844 L FT Foul Field Error 91.3 6.006 -1.117 -6.464 ... 2 4/7/2015 2 top 99 114.81 151.43 83 0
3235Zack Greinke 425844 L CH Inplay,out(s)Groundout 87.1 6.083 -1.143 -3.877 ... 0 4/7/2015 2 top GB 65 144.06 162.93 69 0
3236Zack Greinke 425844 L CH Ball Field Error 86.9 6.002 -1.556 -5.261 ... 2 4/7/2015 2 top 99 114.81 151.43 82 0
In [32]:
# Print the number of rows in the data frame
nrow(greinke)
3236
In [33]:
# Check out the dataset
str(greinke)
'data.frame':	3236 obs. of  29 variables:
 $ p_name              : Factor w/ 1 level "Zack Greinke": 1 1 1 1 1 1 1 1 1 1 ...
 $ pitcher_id          : int  425844 425844 425844 425844 425844 425844 425844 425844 425844 425844 ...
 $ batter_stand        : Factor w/ 2 levels "L","R": 2 2 2 2 2 2 2 2 2 2 ...
 $ pitch_type          : Factor w/ 7 levels "CH","CU","EP",..: 4 4 4 7 4 7 1 4 7 7 ...
 $ pitch_result        : Factor w/ 15 levels "Ball","BallInDirt",..: 1 14 3 14 14 14 15 3 4 14 ...
 $ atbat_result        : Factor w/ 24 levels "Bunt Groundout ",..: 24 20 13 22 22 22 22 22 11 24 ...
 $ start_speed         : num  94.2 92.4 92.7 86.9 92.8 87.8 90.3 92.7 85.5 87.3 ...
 $ z0                  : num  6 6.28 6.17 6.08 6.11 ...
 $ x0                  : num  -0.675 -0.76 -0.958 -0.939 -0.524 ...
 $ pfx_x               : num  -4.457 -1.59 -1.884 3.594 -0.558 ...
 $ pfx_z               : num  9.76 11.4 9.245 0.762 11.134 ...
 $ px                  : num  1.714 0.589 0.399 0.764 1.517 ...
 $ pz                  : num  1.92 3.27 2.92 1.31 2.19 ...
 $ break_angle         : num  24.8 10.1 9.2 -11.4 -0.4 -13.6 22.5 25.1 -8.4 -11.3 ...
 $ break_length        : num  3.5 2.7 3.5 8 2.8 7.8 7.4 3.8 7.5 7.4 ...
 $ spin_rate           : num  2189 2312 1890 694 2243 ...
 $ spin_dir            : num  204 188 191 103 183 ...
 $ balls               : int  2 1 0 1 1 2 1 0 0 0 ...
 $ strikes             : int  2 1 0 2 2 2 2 2 0 1 ...
 $ outs                : int  2 0 1 0 0 1 1 2 2 2 ...
 $ game_date           : Factor w/ 32 levels "10/3/2015","4/12/2015",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ inning              : int  4 3 5 6 8 1 6 5 8 4 ...
 $ inning_topbot       : Factor w/ 2 levels "bot","top": 2 2 2 2 2 2 2 2 2 2 ...
 $ batted_ball_type    : Factor w/ 5 levels "","FB","GB","LD",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ batted_ball_velocity: int  NA 104 103 NA NA NA NA NA NA NA ...
 $ hc_x                : num  0 123.6 50.9 0 0 ...
 $ hc_y                : num  0 97.3 31.2 0 0 ...
 $ pitch_id            : int  160 95 218 265 374 14 279 231 386 156 ...
 $ distance_feet       : int  NA 0 425 NA NA NA NA NA NA NA ...
In [34]:
# Convert some data
greinke$p_name <- as.character(greinke$p_name)
In [35]:
# Check out the dataset again
str(greinke$p_name)
 chr [1:3236] "Zack Greinke" "Zack Greinke" "Zack Greinke" ...
In [36]:
str(greinke$pitch_id)
 int [1:3236] 160 95 218 265 374 14 279 231 386 156 ...
In [37]:
# Summarize the start_speed variable
summary(greinke$start_speed)
hist(greinke$start_speed)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  52.20   87.30   89.80   88.44   91.80   95.40 
In [39]:
# Get rid of data without start_speed
greinke <- subset(greinke, is.na(start_speed) == FALSE)
In [40]:
# Print the number of complete entries
nrow(greinke)
3236

Check dates

It turns out that the game_date variable in the greinke dataset has been stored as a character string instead of a date. The as.Date() function is helpful here.

In [41]:
# Check if dates are formatted as dates
class(greinke$game_date)
'factor'
In [42]:
# Change them to dates
greinke$game_date <- as.Date(greinke$game_date, format = "%m/%d/%Y", origin = "01/01/1900")

# Check that the variable is now formatted as a date
class(greinke$game_date)
'Date'

Delimit dates

Split up the dates so that you have a month, day, and year variable in the data to group the data for later analyses.

Leverage the separate() function from the tidyr package. This function allows to easily split an existing column ('col') in a data frame ('data') into new columns, which are specified with the into argument (e.g. into = c("year", "month")). Lastly, the remove argument tells R whether to remove the original column from the data and the sep argument tells R what to split on.

The ifelse() function will help to make the new variable july for later computations.

In [43]:
age <- c(3, 12, 32, 40, 17)
ifelse(age < 19, 'child', 'adult')
  1. 'child'
  2. 'child'
  3. 'adult'
  4. 'adult'
  5. 'child'

This ifelse() statement returns 'child' or 'adult' depending on whether the conditional statement age < 19 is TRUE or FALSE.

In [44]:
# load tidyr package
# install.packages('tidyr') in R
library(tidyr)
In [45]:
# Separate game_date into "year", "month", and "day"
greinke <- separate(data = greinke, col = game_date, 
                    into = c("year", "month", "day"), 
                    sep = "-", remove = FALSE)
str(greinke)
'data.frame':	3236 obs. of  32 variables:
 $ p_name              : chr  "Zack Greinke" "Zack Greinke" "Zack Greinke" "Zack Greinke" ...
 $ pitcher_id          : int  425844 425844 425844 425844 425844 425844 425844 425844 425844 425844 ...
 $ batter_stand        : Factor w/ 2 levels "L","R": 2 2 2 2 2 2 2 2 2 2 ...
 $ pitch_type          : Factor w/ 7 levels "CH","CU","EP",..: 4 4 4 7 4 7 1 4 7 7 ...
 $ pitch_result        : Factor w/ 15 levels "Ball","BallInDirt",..: 1 14 3 14 14 14 15 3 4 14 ...
 $ atbat_result        : Factor w/ 24 levels "Bunt Groundout ",..: 24 20 13 22 22 22 22 22 11 24 ...
 $ start_speed         : num  94.2 92.4 92.7 86.9 92.8 87.8 90.3 92.7 85.5 87.3 ...
 $ z0                  : num  6 6.28 6.17 6.08 6.11 ...
 $ x0                  : num  -0.675 -0.76 -0.958 -0.939 -0.524 ...
 $ pfx_x               : num  -4.457 -1.59 -1.884 3.594 -0.558 ...
 $ pfx_z               : num  9.76 11.4 9.245 0.762 11.134 ...
 $ px                  : num  1.714 0.589 0.399 0.764 1.517 ...
 $ pz                  : num  1.92 3.27 2.92 1.31 2.19 ...
 $ break_angle         : num  24.8 10.1 9.2 -11.4 -0.4 -13.6 22.5 25.1 -8.4 -11.3 ...
 $ break_length        : num  3.5 2.7 3.5 8 2.8 7.8 7.4 3.8 7.5 7.4 ...
 $ spin_rate           : num  2189 2312 1890 694 2243 ...
 $ spin_dir            : num  204 188 191 103 183 ...
 $ balls               : int  2 1 0 1 1 2 1 0 0 0 ...
 $ strikes             : int  2 1 0 2 2 2 2 2 0 1 ...
 $ outs                : int  2 0 1 0 0 1 1 2 2 2 ...
 $ game_date           : Date, format: "2015-10-03" "2015-10-03" ...
 $ year                : chr  "2015" "2015" "2015" "2015" ...
 $ month               : chr  "10" "10" "10" "10" ...
 $ day                 : chr  "03" "03" "03" "03" ...
 $ inning              : int  4 3 5 6 8 1 6 5 8 4 ...
 $ inning_topbot       : Factor w/ 2 levels "bot","top": 2 2 2 2 2 2 2 2 2 2 ...
 $ batted_ball_type    : Factor w/ 5 levels "","FB","GB","LD",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ batted_ball_velocity: int  NA 104 103 NA NA NA NA NA NA NA ...
 $ hc_x                : num  0 123.6 50.9 0 0 ...
 $ hc_y                : num  0 97.3 31.2 0 0 ...
 $ pitch_id            : int  160 95 218 265 374 14 279 231 386 156 ...
 $ distance_feet       : int  NA 0 425 NA NA NA NA NA NA NA ...
In [46]:
# Convert month to numeric
greinke$month <- as.numeric(greinke$month)

# Create the july variable
greinke$july <- ifelse(greinke$month == 7, "july", "other")
str(greinke)
'data.frame':	3236 obs. of  33 variables:
 $ p_name              : chr  "Zack Greinke" "Zack Greinke" "Zack Greinke" "Zack Greinke" ...
 $ pitcher_id          : int  425844 425844 425844 425844 425844 425844 425844 425844 425844 425844 ...
 $ batter_stand        : Factor w/ 2 levels "L","R": 2 2 2 2 2 2 2 2 2 2 ...
 $ pitch_type          : Factor w/ 7 levels "CH","CU","EP",..: 4 4 4 7 4 7 1 4 7 7 ...
 $ pitch_result        : Factor w/ 15 levels "Ball","BallInDirt",..: 1 14 3 14 14 14 15 3 4 14 ...
 $ atbat_result        : Factor w/ 24 levels "Bunt Groundout ",..: 24 20 13 22 22 22 22 22 11 24 ...
 $ start_speed         : num  94.2 92.4 92.7 86.9 92.8 87.8 90.3 92.7 85.5 87.3 ...
 $ z0                  : num  6 6.28 6.17 6.08 6.11 ...
 $ x0                  : num  -0.675 -0.76 -0.958 -0.939 -0.524 ...
 $ pfx_x               : num  -4.457 -1.59 -1.884 3.594 -0.558 ...
 $ pfx_z               : num  9.76 11.4 9.245 0.762 11.134 ...
 $ px                  : num  1.714 0.589 0.399 0.764 1.517 ...
 $ pz                  : num  1.92 3.27 2.92 1.31 2.19 ...
 $ break_angle         : num  24.8 10.1 9.2 -11.4 -0.4 -13.6 22.5 25.1 -8.4 -11.3 ...
 $ break_length        : num  3.5 2.7 3.5 8 2.8 7.8 7.4 3.8 7.5 7.4 ...
 $ spin_rate           : num  2189 2312 1890 694 2243 ...
 $ spin_dir            : num  204 188 191 103 183 ...
 $ balls               : int  2 1 0 1 1 2 1 0 0 0 ...
 $ strikes             : int  2 1 0 2 2 2 2 2 0 1 ...
 $ outs                : int  2 0 1 0 0 1 1 2 2 2 ...
 $ game_date           : Date, format: "2015-10-03" "2015-10-03" ...
 $ year                : chr  "2015" "2015" "2015" "2015" ...
 $ month               : num  10 10 10 10 10 10 10 10 10 10 ...
 $ day                 : chr  "03" "03" "03" "03" ...
 $ inning              : int  4 3 5 6 8 1 6 5 8 4 ...
 $ inning_topbot       : Factor w/ 2 levels "bot","top": 2 2 2 2 2 2 2 2 2 2 ...
 $ batted_ball_type    : Factor w/ 5 levels "","FB","GB","LD",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ batted_ball_velocity: int  NA 104 103 NA NA NA NA NA NA NA ...
 $ hc_x                : num  0 123.6 50.9 0 0 ...
 $ hc_y                : num  0 97.3 31.2 0 0 ...
 $ pitch_id            : int  160 95 218 265 374 14 279 231 386 156 ...
 $ distance_feet       : int  NA 0 425 NA NA NA NA NA NA NA ...
 $ july                : chr  "other" "other" "other" "other" ...
In [47]:
# View the head() of greinke
head(greinke)
p_namepitcher_idbatter_standpitch_typepitch_resultatbat_resultstart_speedz0x0pfx_x...dayinninginning_topbotbatted_ball_typebatted_ball_velocityhc_xhc_ypitch_iddistance_feetjuly
Zack Greinke 425844 R FF Ball Walk 94.2 5.997 -0.675 -4.457 ... 03 4 top NA 0.00 0.00 160 NA other
Zack Greinke 425844 R FF SwingingStrikeSingle 92.4 6.281 -0.760 -1.590 ... 03 3 top 104 123.56 97.26 95 0 other
Zack Greinke 425844 R FF CalledStrike Home Run 92.7 6.168 -0.958 -1.884 ... 03 5 top 103 50.88 31.17 218 425 other
Zack Greinke 425844 R SL SwingingStrikeStrikeout 86.9 6.077 -0.939 3.594 ... 03 6 top NA 0.00 0.00 265 NA other
Zack Greinke 425844 R FF SwingingStrikeStrikeout 92.8 6.107 -0.524 -0.558 ... 03 8 top NA 0.00 0.00 374 NA other
Zack Greinke 425844 R SL SwingingStrikeStrikeout 87.8 6.321 -0.948 4.313 ... 03 1 top NA 0.00 0.00 14 NA other
In [48]:
# Print a summary of the july variable
summary(factor(greinke$july))
july
524
other
2712

Velocity distribution

Now that the data are ready to go, it can be useful to look at the distribution of start_speed (velocity in miles per hour) of Greinke's pitches; across pitch_type and or across month to assess distributional differences across these groups.

Narrow down the interest to just four-seam fastballs. Subset the data for July and other months and make histograms for all pitches. Group by the july variable to show side-by-side histograms for comparison.

In [49]:
# Make a histogram of Greinke's start speed
hist(greinke$start_speed)
In [50]:
# Create greinke_july
greinke_july <- subset(greinke, greinke$july == "july")

# Create greinke_other
greinke_other <- subset(greinke, greinke$july != "july")

# Use par to format your plot layout
par(mfrow = c(1, 2))

# Plot start_speed histogram from july
hist(greinke_july$start_speed)

# Plot start_speed histogram for other months
hist(greinke_other$start_speed)

Fastball velocity distribution

Notice that the distribution of pitches was strongly skewed. This is because it includes both offspeed pitches and fastballs. Offspeed pitches tend to have lower velocities, so the distribution has a long tail.

This time, look only at four-seam fastballs (denoted as "FF" under the pitch_type variable) in order to evaluate the distribution of the hardest pitch thrown by Greinke. Compare July relative to other months.

In [51]:
# Create july_ff (fastballs)
july_ff <- subset(greinke_july, pitch_type == "FF")

# Create other_ff
other_ff <- subset(greinke_other, pitch_type == "FF")

# Formatting code, don't change this
par(mfrow = c(1, 2))

# Plot histogram of July fastball speeds
hist(july_ff$start_speed)

# Plot histogram of other month fastball speeds
hist(other_ff$start_speed)

The distributions look more symmetrical now, but plotted side-by-side, it's hard to tell whether Greinke's four-seam fastballs were on average slower or faster in July compared to those from all other months.

Distribution comparisons with color

When plotting histograms on different panels, it can sometimes be difficult to evaluate whether one is shifted to the left or right, relative to the other. This is especially the case when there are only small differences.

You can often get more insight from two histograms when they are plotted overlapping one another. It's even more helpful to incorporate transparent colors. Using the hex color scale, where col = "#rrggbbaa", rr defines the amount of red, gg the amount of green, bb the amount of blue, and aa the level of transparency. All four are specified on a scale from 00 to 99.

In addition to the histogram, it can also be instructive to plot the mean of each group for more direct comparison.

In [52]:
# Make a fastball speed histogram for other months
hist(other_ff$start_speed, 
     col = "#00009950", freq = FALSE, 
     ylim = c(0, 0.40), xlab = "Velocity (mph)", 
     main = "Greinke 4-Seam Fastball Velocity")

# Add a histogram for July
hist(july_ff$start_speed,
     col = "#99000050", freq = FALSE,
     add = TRUE)

# Draw vertical line at the mean of other_ff
abline(v = mean(other_ff$start_speed), col = '#00009950', lwd = 2)

# Draw vertical line at the mean of july_ff
abline(v = mean(july_ff$start_speed), col = '#99000050', lwd = 2)

tapply() for velocity changes

There are differences in start_speed between July and other months for all pitch types; then for four-seam fastballs only.

tapply() allows to apply other functions (e.g. mean()) to a continuous variable for each group of a factor (categorical) variable using only a single line of code.

In [53]:
# Summarize velocity in July and other months
tapply(greinke$start_speed, greinke$july, mean)
july
88.8648854961832
other
88.3560103244838
In [54]:
# Create greinke_ff
greinke_ff <- subset(greinke, pitch_type == "FF")

# Calculate mean fastball velocities: ff_velo_month
ff_velo_month <- tapply(greinke_ff$start_speed, greinke_ff$july, mean)

# Print ff_velo_month
ff_velo_month
july
92.4207729468599
other
91.6647355163728

Game-by-game velocity changes

Now, look at velocity by game_date using a new dataset, which is a subset of four-seam fastballs only.

Use tapply() in combination with the data.frame() functionThis requires some renaming of columns and rows.

In [55]:
# Create ff_dt
ff_dt <- data.frame(tapply(greinke_ff$start_speed, greinke_ff$game_date, mean))

# Print the first 6 rows of ff_dt
head(ff_dt, 6)
tapply.greinke_ff.start_speed..greinke_ff.game_date..mean.
2015-04-0790.82632
2015-04-1290.51622
2015-04-1890.28654
2015-04-2490.51277
2015-04-2990.40732
2015-05-0590.33043

Tidying the data frame

Game dates are now the row names of the new dataset ff_dt. Include the row names as a variable in ff_dt, formatted as dates.

In [56]:
# Create game_date in ff_dt
ff_dt$game_date <- as.Date(row.names(ff_dt), "%Y-%m-%d")

# Rename the first column
colnames(ff_dt) <- c('start_speed', 'game_date')

# Remove row names
row.names(ff_dt) <- NULL

# View head of ff_dt
head(ff_dt)
start_speedgame_date
90.82632 2015-04-07
90.51622 2015-04-12
90.28654 2015-04-18
90.51277 2015-04-24
90.40732 2015-04-29
90.33043 2015-05-05

A game-by-game line plot

The printed dataset is already ordered by the game_date variable. The tapply() function does this automatically when it calculates the group-level statistics.

Plot Zack Greinke's fastball velocity on a game-by-game basis using a line plot.

In [57]:
# Plot game-by-game 4-seam fastballs
plot(ff_dt$start_speed ~ ff_dt$game_date,
     lwd = 4, type = "l", ylim = c(88, 95),
     main = "Greinke 4-Seam Fastball Velocity",
     xlab = "Date", ylab = "Velocity (mph)")

Adding jittered points

It can sometimes be helpful to plot individual pitches as points() along with the game average line. The points() function takes three arguments: a formula that specifies which values should be added as points against which variable, pch that specifies what kind of points should be added to the plot, and col that specifies the color of the points to be added. Because the x-axis is discrete, use the jitter() function within the points() call to ensure the points don't overlap too much.

Pairing jitter() with transparent coloring allows for better visualization of the density of each velocity for each game.

In [58]:
# Code from last exercise, don't change this
plot(ff_dt$start_speed ~ ff_dt$game_date, 
     lwd = 4, type = "l", ylim = c(88, 95),
     main = "Greinke 4-Seam Fastball Velocity", 
     xlab = "Date", ylab = "Velocity (mph)")

# Add jittered points to the plot
points(jitter(ff_dt$start_speed) ~ jitter(as.numeric(ff_dt$game_date)),
    pch = 16, col = "#99004450")


Pitch mix tables

So far, the focus was mainly on velocity of pitches.

There are other characteristics that make a pitcher successful: pitch mix and location. For example, perhaps the increased velocity led to Greinke taking advantage of his fastball a bit more. Or maybe it made his off-speed pitches more effective, ultimately leading to greater usage of these pitch types.

Use the table() and prop.table() functions to explore changes in pitch mix. We switch from tapply() to the table() functions here due to the fact that the variable of interest, pitch_type, is categorical. This allows us to see how many of each type of pitch were thrown in July compared to other months; proportions are easier to interpret.

Note that there are some pitches we're not very interested in, so you will first subset the data to remove Eephus pitches and intentional balls. Eephus pitches were most likely misclassified by the automated pitch classifier.

In [59]:
# Subset the data to remove pitch types "IN" and "EP"
greinke <- subset(greinke, pitch_type != "IN" & pitch_type != "EP")

# Drop the levels from pitch_type
greinke$pitch_type <- droplevels(greinke$pitch_type)

# Create type_tab
type_tab <- table(greinke$pitch_type, greinke$july)

# Print type_tab
type_tab
    
     july other
  CH  112   487
  CU   51   242
  FF  207  1191
  FT   66   255
  SL   86   535

Pitches

Pitch mix table using prop.table() with the data frame that displays the number of pitches broken down by pitch type:

> type_tab

     july other  legend
  CH  112   487  CHangeup
  CU   51   242  CUrveball
  FF  207  1191  Four-seam Fastball
  FT   66   255  Two-seam Fastball
  SL   86   535  SLider

To learn more about pitch types and abbreviations.

It is more informative and interpretable when the information is presented as proportions using the prop.table() function. This function takes two arguments: a table (e.g. type_tab) and the margin over which to compute proportions.

Use margin = 2, which tells R to compute proportions within each column. That is, the proportions in each column of the result will sum to one.

If the margin argument is not specified, the proportions are computed relative to the entire table (i.e. the entire table sums to one).

In [74]:
# Create type_prop table
type_prop <- round(prop.table(type_tab, margin = 2), 3)

# Print type_prop
type_prop
    
      july other
  CH 0.215 0.180
  CU 0.098 0.089
  FF 0.397 0.439
  FT 0.126 0.094
  SL 0.165 0.197

Notice that each column in type_prop sums to one because prop.table() with argument margin = 2 computes proportions within each column.

Pitch mix tables - July vs. other

Next, compare the velocity to fastball usage overall. In doing so, gauge whether a quicker fastball increases the use of that pitch, maybe due to increased effectiveness.

Alternatively, if the proportion of non-fastball pitches increases alongside fastball velocity, this could indicate that Greinke sees the non-fastball pitches as being more effective when there's a larger difference in velocity between them and his fastball.

In [75]:
# Create ff_prop
ff_prop <- type_prop[3,]

# Print ff_prop
ff_prop

# Print ff_velo_month
ff_velo_month
july
0.397
other
0.439
july
92.4207729468599
other
91.6647355163728

Pitch mix tables - changes in pitch type rates

Now, calculate the difference in pitch type rates for July and the other months.

To do so, simply subtract one column from the other in the type_prop data frame.

However, it might be most useful to understand the percentage change (rather than the absolute change in proportion), so make this small transformation in your code.

In [76]:
# Create the Difference column
# type_prop$Difference <- (type_prop$july - type_prop$other) / type_prop$other
type_prop$Difference <- (type_prop[,1] - type_prop[,2]) / type_prop[,2]
Warning message in type_prop$Difference <- (type_prop[, 1] - type_prop[, 2])/type_prop[, :
"Conversion automatique de LHS en liste"
In [78]:
# Print type_prop
type_prop$Difference
CH
0.194444444444444
CU
0.101123595505618
FF
-0.0956719817767653
FT
0.340425531914894
SL
-0.16243654822335
In [79]:
# Plot a barplot
barplot(type_prop$Difference, names.arg = type_prop$Pitch, 
        main = "Pitch Usage in July vs. Other Months", 
        ylab = "Percentage Change in July", 
        ylim = c(-0.3, 0.3))

Greinke decreased his slider (SL) much more in July than any other pitch.

Ball-strike count frequency

While pitch types are interesting in their own right, it might be more useful to think about what types of pitches Greinke uses in different ball-strike counts.

Before getting to that, first make a table to examine the rate at which Greinke throws pitches in each of the ball-strike counts.

In [80]:
# Create bs_table
bs_table <- table(greinke$balls, greinke$strikes)

# Create bs_prop_table
bs_prop_table <- round(prop.table(bs_table), 3)

# Print bs_prop_table
bs_prop_table
   
        0     1     2
  0 0.261 0.135 0.062
  1 0.095 0.115 0.096
  2 0.026 0.053 0.093
  3 0.006 0.015 0.043
In [81]:
# Print row sums
rowSums(bs_prop_table)
0
0.458
1
0.306
2
0.172
3
0.064
In [82]:
# Print column sums
colSums(bs_prop_table)
0
0.388
1
0.318
2
0.294

Notice that the proportion of pitches thrown decreases as balls or strikes increases. This makes sense, since the lower order counts have to be passed through in order to get to more balls or more strikes.

Make a new variable

Create a new variable called bs_count that combines the balls and strikes variables into a single ball-strike count.

In [83]:
# Create bs_count
greinke$bs_count <- paste(greinke$balls, greinke$strikes, sep = "-")

# Print the first 6 rows of greinke
head(greinke)
p_namepitcher_idbatter_standpitch_typepitch_resultatbat_resultstart_speedz0x0pfx_x...inninginning_topbotbatted_ball_typebatted_ball_velocityhc_xhc_ypitch_iddistance_feetjulybs_count
Zack Greinke 425844 R FF Ball Walk 94.2 5.997 -0.675 -4.457 ... 4 top NA 0.00 0.00 160 NA other 2-2
Zack Greinke 425844 R FF SwingingStrikeSingle 92.4 6.281 -0.760 -1.590 ... 3 top 104 123.56 97.26 95 0 other 1-1
Zack Greinke 425844 R FF CalledStrike Home Run 92.7 6.168 -0.958 -1.884 ... 5 top 103 50.88 31.17 218 425 other 0-0
Zack Greinke 425844 R SL SwingingStrikeStrikeout 86.9 6.077 -0.939 3.594 ... 6 top NA 0.00 0.00 265 NA other 1-2
Zack Greinke 425844 R FF SwingingStrikeStrikeout 92.8 6.107 -0.524 -0.558 ... 8 top NA 0.00 0.00 374 NA other 1-2
Zack Greinke 425844 R SL SwingingStrikeStrikeout 87.8 6.321 -0.948 4.313 ... 1 top NA 0.00 0.00 14 NA other 2-2

Ball-strike count in July vs. other months

Identify the percentage change in the rate at which Greinke put himself in each of the ball-strike counts.

In [84]:
# Create bs_count_tab
bs_count_tab <- table(greinke$bs_count, greinke$july)

# Create bs_month
bs_month <- round(prop.table(bs_count_tab, margin = 2), 3)

# Print bs_month
bs_month
     
       july other
  0-0 0.261 0.262
  0-1 0.134 0.135
  0-2 0.056 0.063
  1-0 0.105 0.093
  1-1 0.123 0.113
  1-2 0.092 0.097
  2-0 0.029 0.025
  2-1 0.052 0.053
  2-2 0.086 0.094
  3-0 0.006 0.006
  3-1 0.015 0.015
  3-2 0.042 0.043

This is interesting. It looks like Greinke got into more 1-0 and 2-0 counts, but less two strike counts. That's a bit unexpected, since he had so much July success.

Visualizing ball-strike count in July vs. other months

Create a bar plot to visualize how common each ball-strike count was in July vs. other months. Let's get started.

In [85]:
# Create diff_bs
diff_bs <- round(((bs_month[,1] - bs_month[,2]) / bs_month[,2]),3)

# Print diff_bs
diff_bs
0-0
-0.004
0-1
-0.007
0-2
-0.111
1-0
0.129
1-1
0.088
1-2
-0.052
2-0
0.16
2-1
-0.019
2-2
-0.085
3-0
0
3-1
0
3-2
-0.023
In [86]:
# Create a bar plot of the changes
barplot(diff_bs, main = "Ball-Strike Count Rate in July vs. Other Months", 
        ylab = "Percentage Change in July", ylim = c(-0.15, 0.15), las = 2)

Cross-tabulate pitch use in ball-strike counts

Interestingly, Greinke was in more hitter friendly counts in July than the other months in which he pitched. That's a somewhat unexpected result.

It could be that he was more willing to use off-speed pitches earlier in the count, and if those are thrown for strikes less often, then this might be driving this result. You will look at these outcomes a bit more in the last chapter of this course.

Take a look to see if Greinke used certain pitches more or less often in specific counts overall. In particular, tabulate the proportion of times he throws each pitch for each count.

In [87]:
# Create type_bs
type_bs <- table(greinke$pitch_type, greinke$bs_count)

# Print type_bs
type_bs
    
     0-0 0-1 0-2 1-0 1-1 1-2 2-0 2-1 2-2 3-0 3-1 3-2
  CH  92  93  36  70  79  62  27  46  52   0  18  24
  CU 124  49  10  34  38   9   4  12   9   0   0   4
  FF 482 167  61 136 136  89  37  71 109  17  24  69
  FT  54  55  19  32  50  31  11  18  34   2   3  12
  SL  93  71  75  35  68 119   5  24  96   0   5  30
In [51]:
# Create type_bs_prop
type_bs_prop <- round(prop.table(type_bs, margin = 2), 3)

# Print type_bs_prop
type_bs_prop
    
       0-0   0-1   0-2   1-0   1-1   1-2   2-0   2-1   2-2   3-0   3-1   3-2
  CH 0.109 0.214 0.179 0.228 0.213 0.200 0.321 0.269 0.173 0.000 0.360 0.173
  CU 0.147 0.113 0.050 0.111 0.102 0.029 0.048 0.070 0.030 0.000 0.000 0.029
  FF 0.570 0.384 0.303 0.443 0.367 0.287 0.440 0.415 0.363 0.895 0.480 0.496
  FT 0.064 0.126 0.095 0.104 0.135 0.100 0.131 0.105 0.113 0.105 0.060 0.086
  SL 0.110 0.163 0.373 0.114 0.183 0.384 0.060 0.140 0.320 0.000 0.100 0.216
  • He used his 4-seam fastball more than any other pitch in 3-0 counts.
  • He used his slider most when he had 2 strikes on a batter.
  • He never threw his curveball in a 3-0 or a 3-1 count.

Pitch mix late in games

There is often talk about pitchers having more trouble late in games. There are a number of reasons for this. They could be getting tired and losing velocity, or batters may have already seen pitches they throw. Given this, we'll try to see if Greinke resorts more to his off-speed pitches later in games.

First, create a variable indicating that a pitch was thrown late in a game, defined as any pitch past the 5th inning. Then, make a table of pitch selection for late-game pitches.

In [88]:
# Create the late_in_game column
greinke$late_in_game <- ifelse(greinke$inning > 5, 1, 0)

# Convert late_in_game
greinke$late_in_game <- factor(greinke$late_in_game)

# Create type_late
type_late <- table(greinke$pitch_type, greinke$late_in_game)

# Create type_late_prop
type_late_prop <- round(prop.table(type_late, margin = 2), 3)

# Print type_late_prop
type_late_prop
    
         0     1
  CH 0.178 0.204
  CU 0.086 0.102
  FF 0.444 0.403
  FT 0.107 0.080
  SL 0.185 0.211

Late game pitch mix - grouped barplots

Assess whether there are changes in pitch selection for specific pitches early vs. late in the game.

Use the barplot() function, which also allows for creation of grouped barplots, paired with the transpose function, t(), and the parameter beside = TRUE.

In [89]:
# Create t_type_late
t_type_late <- t(type_late)

# Print dimensions of t_type_late
dim(t_type_late)
  1. 2
  2. 5
In [90]:
# Print dimensions of type_late
dim(type_late)
  1. 5
  2. 2
In [101]:
# Change row names
rownames(t_type_late) <- c("Early", "Late")
t_type_late
       
          CH   CU   FF   FT   SL
  Early  416  201 1036  249  431
  Late   183   92  362   72  190
In [100]:
# Make barplot using t_type_late
barplot(t_type_late, beside = TRUE, col = c("red", "blue"), 
        main = "Early vs. Late In Game Pitch Selection", 
        ylab = "Pitch Selection Proportion", 
        legend = rownames(t_type_late))

Conclusions:

  • He uses his fastballs less often late in games.
  • He uses his FF in all-ball-strike counts.
  • The more balls, the more FF. In 3-0 counts, FF 90% of the time.
  • With 2 strikes on the batter, more SL.
  • Pitch Mix: doesn't pitch FF and FT (4-seam and 2 seam fastballs) late in the game. Tired?
  • Educated guesses: tired, cannot through as fast or cheat the batters?


Locational changes

Let's start using the horizontal and vertical location variables: px and pz, respectively.

Calculate the average pitch height pz for Greinke in July relative to other months. Note that it's multiplied by 12 so that the answer is in inches, while the variable is recorded in feet.

Find the average horizontal location to left-handed batters (LHB) and right-handed batters (RHB), respectively. Remember that a positive px value is outside against righties and inside against lefties.

In [102]:
# Calculate average pitch height in inches in July vs. other months
tapply(greinke$pz, greinke$july, mean) * 12

# Create greinke_lhb
greinke_lhb <- subset(greinke, batter_stand == "L")

# Create greinke_rhb
greinke_rhb <- subset(greinke, batter_stand == "R")

# Compute average px location for LHB
tapply(greinke_lhb$px, greinke_lhb$july, mean) * 12
july
26.2600229885057
other
26.3990435424354
july
-4.62735483870968
other
-6.32014393305439
In [103]:
str(greinke)
'data.frame':	3232 obs. of  35 variables:
 $ p_name              : chr  "Zack Greinke" "Zack Greinke" "Zack Greinke" "Zack Greinke" ...
 $ pitcher_id          : int  425844 425844 425844 425844 425844 425844 425844 425844 425844 425844 ...
 $ batter_stand        : Factor w/ 2 levels "L","R": 2 2 2 2 2 2 2 2 2 2 ...
 $ pitch_type          : Factor w/ 5 levels "CH","CU","FF",..: 3 3 3 5 3 5 1 3 5 5 ...
 $ pitch_result        : Factor w/ 15 levels "Ball","BallInDirt",..: 1 14 3 14 14 14 15 3 4 14 ...
 $ atbat_result        : Factor w/ 24 levels "Bunt Groundout ",..: 24 20 13 22 22 22 22 22 11 24 ...
 $ start_speed         : num  94.2 92.4 92.7 86.9 92.8 87.8 90.3 92.7 85.5 87.3 ...
 $ z0                  : num  6 6.28 6.17 6.08 6.11 ...
 $ x0                  : num  -0.675 -0.76 -0.958 -0.939 -0.524 ...
 $ pfx_x               : num  -4.457 -1.59 -1.884 3.594 -0.558 ...
 $ pfx_z               : num  9.76 11.4 9.245 0.762 11.134 ...
 $ px                  : num  1.714 0.589 0.399 0.764 1.517 ...
 $ pz                  : num  1.92 3.27 2.92 1.31 2.19 ...
 $ break_angle         : num  24.8 10.1 9.2 -11.4 -0.4 -13.6 22.5 25.1 -8.4 -11.3 ...
 $ break_length        : num  3.5 2.7 3.5 8 2.8 7.8 7.4 3.8 7.5 7.4 ...
 $ spin_rate           : num  2189 2312 1890 694 2243 ...
 $ spin_dir            : num  204 188 191 103 183 ...
 $ balls               : int  2 1 0 1 1 2 1 0 0 0 ...
 $ strikes             : int  2 1 0 2 2 2 2 2 0 1 ...
 $ outs                : int  2 0 1 0 0 1 1 2 2 2 ...
 $ game_date           : Date, format: "2015-10-03" "2015-10-03" ...
 $ year                : chr  "2015" "2015" "2015" "2015" ...
 $ month               : num  10 10 10 10 10 10 10 10 10 10 ...
 $ day                 : chr  "03" "03" "03" "03" ...
 $ inning              : int  4 3 5 6 8 1 6 5 8 4 ...
 $ inning_topbot       : Factor w/ 2 levels "bot","top": 2 2 2 2 2 2 2 2 2 2 ...
 $ batted_ball_type    : Factor w/ 5 levels "","FB","GB","LD",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ batted_ball_velocity: int  NA 104 103 NA NA NA NA NA NA NA ...
 $ hc_x                : num  0 123.6 50.9 0 0 ...
 $ hc_y                : num  0 97.3 31.2 0 0 ...
 $ pitch_id            : int  160 95 218 265 374 14 279 231 386 156 ...
 $ distance_feet       : int  NA 0 425 NA NA NA NA NA NA NA ...
 $ july                : chr  "other" "other" "other" "other" ...
 $ bs_count            : chr  "2-2" "1-1" "0-0" "1-2" ...
 $ late_in_game        : Factor w/ 2 levels "0","1": 1 1 1 2 2 1 2 1 2 1 ...
In [104]:
# Compute average px location for RHB
tapply(greinke_rhb$px, greinke_rhb$july, mean) * 12
july
4.03222641509434
other
5.7318495049505

Describe the locations

The results from above: pitch_heights, pitch_width_lhb and pitch_width_rhb.

Noting that negative px refers to outside half pitches to LHB, while positive px refers to outside half pitches to RHB, what do you notice about how far away from the center of the plate that Greinke threw in July relative to other months?

In [105]:
# Plot location of all pitches
plot(greinke$pz ~ greinke$px,
     col = factor(greinke$july),
     xlim = c(-3, 3))

In July (red), Greinke threw his pitches slightly lower overall, and less outside to both LHB and RHB.

Locational changes - visualization

Greinke was pitching much closer to the center of the plate to both left- and right-handed batters in July. But it's often more helpful to visualize the pitch location, rather than guess based on averages of horizontal location numbers.

Adjust the plots to help elucidate differences in pitch location using different plotting strategies.

In [106]:
# Formatting code, don't change this
par(mfrow = c(1, 2))

# Plot the pitch locations for July
plot(greinke$pz ~ greinke$px, data = greinke_july,
     col = 'red', pch = 16,
     xlim = c(-3, 3), ylim = c(-1, 6),
     main = "July")

# Plot the pitch locations for other months
plot(greinke$pz ~ greinke$px, data = greinke_other,
     col = 'black', pch = 16,
     xlim = c(-3, 3), ylim = c(-1, 6),
     main = "Other months")

Locational changes - plotting a grid

Plotting each group on the different panels didn't seem to help much. One way to get around the lack of useful interpretation from a scatter plot is to bin the data.

Binning data into groups and plotting it as a grid is a way of summarizing the location of pitches. There are direct relationships between visualizing locational density and the simple binning.

In [107]:
# Create greinke_sub
greinke_sub <- subset(greinke,
    greinke$px > -2 & 
    greinke$px <  2 &
    greinke$pz >  0 &
    greinke$pz <  5)

# Plot pitch location window
plot(x = c(-2, 2), y = c(0, 5), type = "n",
     main = "Greinke Locational Zone Proportions",
     xlab = "Horizontal Location (ft.; Catcher's View)",
     ylab = "Vertical Location (ft.)")

# Add the grid lines
grid(lty = "solid", col = "black")

Binning locational data

Note that a new variable called zone has been added to the greinke data frame.

There are 20 possibilities for the zone variable, numbered 1 through 20. Each classification tells us about the location of the given pitch, binned as a grid across the strike zone and just outside the strike zone. There is also a zone_px and zone_pz variable that identify the middle of each of these locational bins. Conveniently, these directly relate to the grid.

In [178]:
greinke_sub <- read.table(file = 'Baseball_sub.csv', header = TRUE, sep = ';')
str(greinke_sub)
'data.frame':	3627 obs. of  38 variables:
 $ p_name              : Factor w/ 2 levels "","Zack Greinke": 2 2 2 2 2 2 2 2 2 2 ...
 $ pitcher_id          : int  425844 425844 425844 425844 425844 425844 425844 425844 425844 425844 ...
 $ batter_stand        : Factor w/ 3 levels "","L","R": 3 3 3 3 3 3 3 3 3 3 ...
 $ pitch_type          : Factor w/ 6 levels "","CH","CU","FF",..: 4 4 4 6 4 6 2 4 6 6 ...
 $ pitch_result        : Factor w/ 15 levels "","Ball","BallInDirt",..: 2 14 4 14 14 14 15 4 5 14 ...
 $ atbat_result        : Factor w/ 32 levels "","Bunt  Groundout ",..: 32 28 18 30 30 30 30 30 15 32 ...
 $ start_speed         : num  94.2 92.4 92.7 86.9 92.8 87.8 90.3 92.7 85.5 87.3 ...
 $ z0                  : num  6 6.28 6.17 6.08 6.11 ...
 $ x0                  : num  -0.675 -0.76 -0.958 -0.939 -0.524 ...
 $ pfx_x               : num  -4.457 -1.59 -1.884 3.594 -0.558 ...
 $ pfx_z               : num  9.76 11.4 9.245 0.762 11.134 ...
 $ px                  : num  1.714 0.589 0.399 0.764 1.517 ...
 $ pz                  : num  1.92 3.27 2.92 1.31 2.19 ...
 $ break_angle         : num  24.8 10.1 9.2 -11.4 -0.4 -13.6 22.5 25.1 -8.4 -11.3 ...
 $ break_length        : num  3.5 2.7 3.5 8 2.8 7.8 7.4 3.8 7.5 7.4 ...
 $ spin_rate           : num  2189 2312 1890 694 2243 ...
 $ spin_dir            : num  204 188 191 103 183 ...
 $ balls               : int  2 1 0 1 1 2 1 0 0 0 ...
 $ strikes             : int  2 1 0 2 2 2 2 2 0 1 ...
 $ outs                : int  2 0 1 0 0 1 1 2 2 2 ...
 $ game_date           : Factor w/ 33 levels "","2015-04-07",..: 33 33 33 33 33 33 33 33 33 33 ...
 $ inning              : int  4 3 5 6 8 1 6 5 8 4 ...
 $ inning_topbot       : Factor w/ 3 levels "","bot","top": 3 3 3 3 3 3 3 3 3 3 ...
 $ batted_ball_type    : Factor w/ 5 levels "","FB","GB","LD",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ batted_ball_velocity: int  NA 104 103 NA NA NA NA NA NA NA ...
 $ hc_x                : num  0 123.6 50.9 0 0 ...
 $ hc_y                : num  0 97.3 31.2 0 0 ...
 $ pitch_id            : int  160 95 218 265 374 14 279 231 386 156 ...
 $ distance_feet       : int  NA 0 425 NA NA NA NA NA NA NA ...
 $ year                : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
 $ month               : int  10 10 10 10 10 10 10 10 10 10 ...
 $ day                 : int  3 3 3 3 3 3 3 3 3 3 ...
 $ july                : Factor w/ 3 levels "","july","other": 3 3 3 3 3 3 3 3 3 3 ...
 $ bs_count            : Factor w/ 13 levels "","0-0","0-1",..: 10 6 2 7 7 10 7 4 2 3 ...
 $ late_in_game        : int  0 0 0 1 1 0 1 0 1 0 ...
 $ zone                : int  16 7 11 15 12 7 18 11 11 16 ...
 $ zone_px             : num  1.5 0.5 0.5 0.5 1.5 0.5 -0.5 0.5 0.5 1.5 ...
 $ zone_pz             : num  1.5 3.5 2.5 1.5 2.5 3.5 0.5 2.5 2.5 1.5 ...
In [157]:
# Create greinke_table
greinke_table <- table(greinke_sub$zone)

# Create zone_prop
zone_prop <- round(prop.table(greinke_table), 3)

# Plot strike zone grid, don't change this
plot(x = c(-2, 2), y = c(0, 5), type = "n",
     main = "Greinke Locational Zone Proportions",
     xlab = "Horizontal Location (ft.; Catcher's View)",
     ylab = "Vertical Location (ft.)")
grid(lty = "solid", col = "black")

# Add text from zone_prop[1]
text(x = -1.5, y = 4.5, zone_prop[1], cex = 1.5)

Based on the figure shown, in 2015, Greinke threw to the top left grid panel approximately 0.7% of the time.

For loops and plotting locational grid proportions

Use a for loop to plot the proportions for each zone in the grid (all zones (1 through 20).

Note that in the data, each zone is associated with a given zone_px and zone_pz coordinate for plotting the text. Additionally, each zone proportion in the zone_prop table is associated with a given zone number.

In [160]:
# Plot grid
plot(x = c(-2, 2), y = c(0, 5), type = "n",
     main = "Greinke Locational Zone Proportions",
     xlab = "Horizontal Location (ft.; Catcher's View)",
     ylab = "Vertical Location (ft.)")
grid(lty = "solid", col = "black")

# Plot text using for loop
for(i in 1:20) {
    text(mean(greinke_sub$zone_px[greinke_sub$zone == i]),
    mean(greinke_sub$zone_pz[greinke_sub$zone == i]),
    zone_prop[i], cex = 1.5)
}

text(x = -1.5, y = 4.5, zone_prop[1], cex = 1.5)
text(x = -1.5, y = 3.5, zone_prop[5], cex = 1.5)
text(x = -1.5, y = 2.5, zone_prop[9], cex = 1.5)
text(x = -1.5, y = 1.5, zone_prop[13], cex = 1.5)
text(x = -1.5, y = 0.5, zone_prop[17], cex = 1.5)
text(x = -0.5, y = 4.5, zone_prop[2], cex = 1.5)
text(x = -0.5, y = 3.5, zone_prop[6], cex = 1.5)
text(x = -0.5, y = 2.5, zone_prop[10], cex = 1.5)
text(x = -0.5, y = 1.5, zone_prop[14], cex = 1.5)
text(x = -0.5, y = 0.5, zone_prop[18], cex = 1.5)
text(x = 0.5, y = 4.5, zone_prop[3], cex = 1.5)
text(x = 0.5, y = 3.5, zone_prop[7], cex = 1.5)
text(x = 0.5, y = 2.5, zone_prop[11], cex = 1.5)
text(x = 0.5, y = 1.5, zone_prop[15], cex = 1.5)
text(x = 0.5, y = 0.5, zone_prop[19], cex = 1.5)
text(x = 1.5, y = 4.5, zone_prop[4], cex = 1.5)
text(x = 1.5, y = 3.5, zone_prop[8], cex = 1.5)
text(x = 1.5, y = 2.5, zone_prop[12], cex = 1.5)
text(x = 1.5, y = 1.5, zone_prop[16], cex = 1.5)
text(x = 1.5, y = 0.5, zone_prop[20], cex = 1.5)

It looks like things make sense, as the plot seems to say that Greinke is throwing his pitches near the middle, where the strike zone is.

Binned locational differences

Because our inquiry for this chapter has been Greinke's pitching trends in July relative to other months, let's now look at his zone location proportion differences from that perspective.

Create tables, but this time selecting separately for july versus other months.

Notice that Greinke did not throw any pitches to zone 4 in July. For the code to work correctly, we need to add to the zone_prop_july vector to ensure it has the same length as the zone_prop_other vector.

In [161]:
# zone_prop_july
# zone_prop_other
# Create zone_prop_july
zone_prop_july <- round(
  table(greinke_sub$zone[greinke_sub$july == "july"]) /
    nrow(subset(greinke_sub, july == "july")), 3)

# Create zone_prop_other
zone_prop_other <- round(
  table(greinke_sub$zone[greinke_sub$july == "other"]) /
    nrow(subset(greinke_sub, july == "other")), 3)

# Print zone_prop_july (#4 misses)
zone_prop_july

# Print zone_prop_other
zone_prop_other

# Fix zone_prop_july vector, don't change this (add #4)
zone_prop_july2 <- c(zone_prop_july[1:3], 0.00, zone_prop_july[4:19])
names(zone_prop_july2) <- c(1:20)

# Create zone_prop_diff
zone_prop_diff <- zone_prop_july2 - zone_prop_other

# Print zone_prop_diff
zone_prop_diff
    1     2     3     5     6     7     8     9    10    11    12    13    14 
0.004 0.002 0.006 0.036 0.058 0.060 0.020 0.090 0.126 0.160 0.030 0.040 0.128 
   15    16    17    18    19    20 
0.110 0.050 0.002 0.036 0.028 0.016 
    1     2     3     4     5     6     7     8     9    10    11    12    13 
0.008 0.008 0.010 0.002 0.028 0.044 0.070 0.018 0.058 0.128 0.163 0.049 0.039 
   14    15    16    17    18    19    20 
0.114 0.123 0.056 0.006 0.025 0.032 0.019 
     1      2      3      4      5      6      7      8      9     10     11 
-0.004 -0.006 -0.004 -0.002  0.008  0.014 -0.010  0.002  0.032 -0.002 -0.003 
    12     13     14     15     16     17     18     19     20 
-0.019  0.001  0.014 -0.013 -0.006 -0.004  0.011 -0.004 -0.003 
In [162]:
zone_prop
zone_prop_diff
    1     2     3     4     5     6     7     8     9    10    11    12    13 
0.007 0.007 0.009 0.002 0.029 0.046 0.068 0.018 0.063 0.128 0.163 0.046 0.039 
   14    15    16    17    18    19    20 
0.116 0.121 0.055 0.006 0.027 0.032 0.019 
     1      2      3      4      5      6      7      8      9     10     11 
-0.004 -0.006 -0.004 -0.002  0.008  0.014 -0.010  0.002  0.032 -0.002 -0.003 
    12     13     14     15     16     17     18     19     20 
-0.019  0.001  0.014 -0.013 -0.006 -0.004  0.011 -0.004 -0.003 

Plotting zone proportion differences

Add to each zone the corresponding proportion difference you found above. This will give you a visualization of the change in propensity for Greinke to throw to certain areas of the zone in July relative to other months. You will do this by calling the text() function within the for loop. The text() function takes an argument label, which specifies the text to be written. It also takes arguments x and y, which provide the x- and y-coordinates where label is to be placed. Lastly, the argument cex specifies font size.

For example,

text(1, 3, "Hello world!", cex = 1.5)

would print "Hello world!" in the (1, 3) position in the grid. Notice that the x and y arguments are specified before the label argument.

[NOTE: Since zone_px and zone_py are the same for all pitches within a single zone, the mean() of these values will simply return the appropriate coordinates for that zone. That is, the mean of 1.5, 1.5, and 1.5 is, well... 1.5! We could just as easily have taken the first or last value and gotten the same results, but the mean is simple enough.]

In [163]:
plot(x = c(-2, 2), y = c(0, 5), type = "n",
     main = "Greinke Locational Zone Proportions",
     xlab = "Horizontal Location (ft.; Catcher's View)",
     ylab = "Vertical Location (ft.)")
grid(lty = "solid", col = "black")

text(x = -1.5, y = 4.5, zone_prop_diff[1], cex = 1.5)
text(x = -1.5, y = 3.5, zone_prop_diff[5], cex = 1.5)
text(x = -1.5, y = 2.5, zone_prop_diff[9], cex = 1.5)
text(x = -1.5, y = 1.5, zone_prop_diff[13], cex = 1.5)
text(x = -1.5, y = 0.5, zone_prop_diff[17], cex = 1.5)
text(x = -0.5, y = 4.5, zone_prop_diff[2], cex = 1.5)
text(x = -0.5, y = 3.5, zone_prop_diff[6], cex = 1.5)
text(x = -0.5, y = 2.5, zone_prop_diff[10], cex = 1.5)
text(x = -0.5, y = 1.5, zone_prop_diff[14], cex = 1.5)
text(x = -0.5, y = 0.5, zone_prop_diff[18], cex = 1.5)
text(x = 0.5, y = 4.5, zone_prop_diff[3], cex = 1.5)
text(x = 0.5, y = 3.5, zone_prop_diff[7], cex = 1.5)
text(x = 0.5, y = 2.5, zone_prop_diff[11], cex = 1.5)
text(x = 0.5, y = 1.5, zone_prop_diff[15], cex = 1.5)
text(x = 0.5, y = 0.5, zone_prop_diff[19], cex = 1.5)
text(x = 1.5, y = 4.5, zone_prop_diff[4], cex = 1.5)
text(x = 1.5, y = 3.5, zone_prop_diff[8], cex = 1.5)
text(x = 1.5, y = 2.5, zone_prop_diff[12], cex = 1.5)
text(x = 1.5, y = 1.5, zone_prop_diff[16], cex = 1.5)
text(x = 1.5, y = 0.5, zone_prop_diff[20], cex = 1.5)

Looks like our assumption was safe. Note that a positive number indicates pitches were thrown to that part of the grid more often in July than other months, while a negative number indicates pitches were thrown to that portion of the zone less often. Zone 9 has the most dramatic increase.

Descriptions Of Various Baseball Pitches

Now that you've looked at pictures of different baseball pitches, let's take a closer look at some of the more common ones - what they do and how they're used to get batters out.

Pitches

  • Four-seam fastball - Maximum velocity and should have best command. This is the most important pitch because everything else works off of it.
  • Two-seam fastball (a.k.a. sinker) - This fastball does just that, it sinks. A very good pitch for inducing ground balls.
  • Cut-fastball - Holding the ball slightly off center, it will run away from the arm side. Usually a few mph slower than a four-seam fastball. Good for jamming hitters.
  • Split-finger fastball - Strictly an out pitch. Dives down hard at home plate, many times getting missed swings.
  • Change-up - Slower than a fastball, but thrown with the same arm action. The arm speed is very important in getting the maximum effectiveness. This pitch helps control bat speed.
  • Curveball - Most often a strikeout pitch. Dives down as it gets to home plate. Many times the velocity is as effective as the movement, because it's usually much slower than a fastball.
  • Slider - In between a fastball and a curveball. It's harder than a curveball with less downward action. The slider has a smaller break with a tighter spin. Many times you can see a small dot in the baseball as it's coming toward you.
  • Knuckleball - A pitch that has very little or no spin. It's very difficult to control and catch. No one knows what it will do usually, which makes it also hard to hit. A very hard pitch to throw.
  • Forkball - Thrown hard while held between the index and middle fingers at varying depths. Usually tumbles and drops violently, often diagonally. Known as an out pitch, but also can be hard on the arm.

Location and ball-strike count

Now, let's evaluate the difference in Greinke's propensity to throw to each zone location depending on the count (i.e. number of balls and strikes).

In [164]:
# Create greinke_zone_tab
greinke_zone_tab <- table(greinke_sub$zone, greinke_sub$bs_count)

# Create zone_count_prop
zone_count_prop <- round(prop.table(greinke_zone_tab, margin = 2), 3)

# Print zone_count_prop
zone_count_prop
    
        0-0   0-1   0-2   1-0   1-1   1-2   2-0   2-1   2-2   3-0   3-1   3-2
  1   0.007 0.002 0.006 0.010 0.005 0.025 0.000 0.000 0.000 0.000 0.012 0.009
  2   0.005 0.007 0.018 0.007 0.000 0.014 0.000 0.012 0.007 0.000 0.012 0.019
  3   0.007 0.012 0.012 0.003 0.008 0.025 0.000 0.006 0.010 0.000 0.000 0.000
  4   0.001 0.009 0.006 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
  5   0.039 0.035 0.012 0.026 0.027 0.018 0.036 0.065 0.007 0.053 0.037 0.000
  6   0.065 0.024 0.006 0.056 0.030 0.021 0.107 0.053 0.041 0.000 0.086 0.074
  7   0.072 0.078 0.054 0.056 0.077 0.050 0.024 0.059 0.092 0.105 0.037 0.074
  8   0.018 0.033 0.048 0.010 0.019 0.028 0.000 0.000 0.003 0.000 0.000 0.000
  9   0.065 0.094 0.030 0.066 0.066 0.050 0.083 0.041 0.054 0.158 0.049 0.037
  10  0.131 0.073 0.054 0.171 0.148 0.050 0.167 0.183 0.126 0.105 0.259 0.241
  11  0.191 0.130 0.078 0.197 0.156 0.121 0.179 0.189 0.153 0.263 0.074 0.259
  12  0.049 0.071 0.078 0.036 0.060 0.032 0.012 0.018 0.044 0.000 0.000 0.009
  13  0.026 0.068 0.072 0.026 0.052 0.074 0.024 0.018 0.010 0.053 0.025 0.009
  14  0.096 0.111 0.114 0.138 0.101 0.138 0.167 0.142 0.129 0.105 0.160 0.083
  15  0.132 0.083 0.120 0.115 0.101 0.113 0.131 0.154 0.139 0.105 0.198 0.111
  16  0.058 0.068 0.096 0.023 0.082 0.057 0.000 0.018 0.061 0.053 0.025 0.009
  17  0.001 0.009 0.018 0.003 0.008 0.014 0.000 0.000 0.007 0.000 0.000 0.000
  18  0.012 0.035 0.048 0.016 0.030 0.060 0.000 0.018 0.041 0.000 0.012 0.009
  19  0.012 0.033 0.072 0.033 0.019 0.060 0.060 0.018 0.054 0.000 0.012 0.037
  20  0.011 0.024 0.054 0.007 0.011 0.050 0.012 0.006 0.020 0.000 0.000 0.019

This table is a bit unwieldy, so let's try to break things down a bit in the following exercises.

0-2 vs. 3-0 locational changes

Let's create a table of differences for just the 0-2 and 3-0 counts.

In [174]:
# Create zone_count_diff
zone_count_diff <- zone_count_prop[,3] - zone_count_prop[,10]

# Print the table
zone_count_diff2 <- round(zone_count_diff, 3)
zone_count_diff2
1
0.002
2
0
3
0.002
4
0.009
5
0.028
6
-0.017
7
-0.014
8
0.03
9
0.04
10
-0.053
11
-0.023
12
0.027
13
0.058
14
-0.018
15
-0.056
16
0.007
17
0.002
18
-0.006
19
-0.021
20
0.004

Plotting count-based locational differences

Now that you have made a table of differences, it's time to plot this on the same zone grid that you used before.

In [175]:
# Add text to the figure for location differences
#for(i in 1:20) {
#  text(mean(greinke_sub$zone_px[greinke_sub$zone == i]),
#       mean(greinke_sub$zone_pz[greinke_sub$zone == i]),
#       zone_count_diff[i, ], cex = 1.5)
#}

# zone_count_diff is a d.f

plot(x = c(-2, 2), y = c(0, 5), type = "n",
     main = "Greinke Locational Zone Proportions",
     xlab = "Horizontal Location (ft.; Catcher's View)",
     ylab = "Vertical Location (ft.)")
grid(lty = "solid", col = "black")

text(x = -1.5, y = 4.5, zone_count_diff2[1], cex = 1.5)
text(x = -1.5, y = 3.5, zone_count_diff2[5], cex = 1.5)
text(x = -1.5, y = 2.5, zone_count_diff2[9], cex = 1.5)
text(x = -1.5, y = 1.5, zone_count_diff2[13], cex = 1.5)
text(x = -1.5, y = 0.5, zone_count_diff2[17], cex = 1.5)
text(x = -0.5, y = 4.5, zone_count_diff2[2], cex = 1.5)
text(x = -0.5, y = 3.5, zone_count_diff2[6], cex = 1.5)
text(x = -0.5, y = 2.5, zone_count_diff2[10], cex = 1.5)
text(x = -0.5, y = 1.5, zone_count_diff2[14], cex = 1.5)
text(x = -0.5, y = 0.5, zone_count_diff2[18], cex = 1.5)
text(x = 0.5, y = 4.5, zone_count_diff2[3], cex = 1.5)
text(x = 0.5, y = 3.5, zone_count_diff2[7], cex = 1.5)
text(x = 0.5, y = 2.5, zone_count_diff2[11], cex = 1.5)
text(x = 0.5, y = 1.5, zone_count_diff2[15], cex = 1.5)
text(x = 0.5, y = 0.5, zone_count_diff2[19], cex = 1.5)
text(x = 1.5, y = 4.5, zone_count_diff2[4], cex = 1.5)
text(x = 1.5, y = 3.5, zone_count_diff2[8], cex = 1.5)
text(x = 1.5, y = 2.5, zone_count_diff2[12], cex = 1.5)
text(x = 1.5, y = 1.5, zone_count_diff2[16], cex = 1.5)
text(x = 1.5, y = 0.5, zone_count_diff2[20], cex = 1.5)

It definitely looks like Greinke is throwing pitches to the middle of the strike zone less often in 0-2 counts here. That makes sense.

Pitches in difference situations. Bin the data to organized the important amount of information. In July, Greinke thow off-center and outside the batter. Uses its 2-seam fastball in July, a 2-seam is more off-center. Pitches depend on the ball-strike count. In 0-2, pitches are low; in 3-0, pitches are close to the center with a 4-seam fastball.



Velocity impact on contact rate

It might be worthwhile to know if increased velocity is even associated with better outcomes for Greinke. After all, this was the implication when you compared July's velocity to other months.

Analyze the impact of velocity on the likelihood that a pitch is missed by the batter. You will begin with the greinke_ff dataset, now including all the new variables we've created. Rather than compare across months, you will simply look at the impact of start_speed on the rate that the batter makes contact.

In [179]:
greinke_ff <- read.table(file = 'Baseball_ff.csv', header = TRUE, sep = ';')
str(greinke_ff)
'data.frame':	1371 obs. of  39 variables:
 $ p_name              : Factor w/ 1 level "Zack Greinke": 1 1 1 1 1 1 1 1 1 1 ...
 $ pitcher_id          : int  425844 425844 425844 425844 425844 425844 425844 425844 425844 425844 ...
 $ batter_stand        : Factor w/ 2 levels "L","R": 2 2 2 2 2 2 2 2 2 2 ...
 $ pitch_type          : Factor w/ 1 level "FF": 1 1 1 1 1 1 1 1 1 1 ...
 $ pitch_result        : Factor w/ 13 levels "Ball","BallInDirt",..: 1 13 3 13 3 11 10 13 1 1 ...
 $ atbat_result        : Factor w/ 21 levels "BuntGroundout",..: 21 17 11 18 18 11 15 18 17 18 ...
 $ start_speed         : num  94.2 92.4 92.7 92.8 92.7 92.3 90 93 91.8 93.8 ...
 $ z0                  : num  6 6.28 6.17 6.11 6.1 ...
 $ x0                  : num  -0.675 -0.76 -0.958 -0.524 -0.752 ...
 $ pfx_x               : num  -4.457 -1.59 -1.884 -0.558 -4.795 ...
 $ pfx_z               : num  9.76 11.4 9.24 11.13 10.16 ...
 $ px                  : num  1.714 0.589 0.399 1.517 0.994 ...
 $ pz                  : num  1.92 3.27 2.92 2.19 2.24 ...
 $ break_angle         : Factor w/ 1068 levels "-0.1","-0.2",..: 444 246 1061 4 446 292 424 273 728 305 ...
 $ break_length        : num  3.5 2.7 3.5 2.8 3.8 3.5 3.6 3.1 3.5 2.8 ...
 $ spin_rate           : num  2189 2312 1890 2243 2208 ...
 $ spin_dir            : num  204 188 191 183 205 ...
 $ balls               : int  2 1 0 1 0 0 0 3 0 1 ...
 $ strikes             : int  2 1 0 2 2 1 0 2 0 2 ...
 $ outs                : int  2 0 1 0 2 1 1 0 0 0 ...
 $ game_date           : Factor w/ 32 levels "2015-04-07","2015-04-12",..: 32 32 32 32 32 32 32 32 32 32 ...
 $ inning              : Factor w/ 88 levels "0      1","0      2",..: 36 27 45 72 45 45 27 36 27 36 ...
 $ inning_topbot       : Factor w/ 2 levels "bot","top": 2 2 2 2 2 2 2 2 2 2 ...
 $ batted_ball_type    : Factor w/ 5 levels "","FB","GB","LD",..: 1 1 1 1 1 2 3 1 1 1 ...
 $ batted_ball_velocity: int  NA 104 103 NA NA 103 NA NA 104 NA ...
 $ hc_x                : num  0 123.6 50.9 0 0 ...
 $ hc_y                : num  0 97.3 31.2 0 0 ...
 $ pitch_id            : Factor w/ 1111 levels "0        4","0        5",..: 202 1108 324 528 338 325 165 194 1106 192 ...
 $ distance_feet       : int  NA 0 425 NA NA 425 NA NA 0 NA ...
 $ year                : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
 $ month               : int  10 10 10 10 10 10 10 10 10 10 ...
 $ day                 : int  3 3 3 3 3 3 3 3 3 3 ...
 $ july                : Factor w/ 2 levels "july","other": 2 2 2 2 2 2 2 2 2 2 ...
 $ bs_count            : Factor w/ 12 levels "0-0","0-1","0-2",..: 9 5 1 6 3 2 1 12 1 6 ...
 $ late_in_game        : int  0 0 0 1 0 0 0 0 0 0 ...
 $ zone                : int  16 7 11 12 11 6 10 11 12 8 ...
 $ zone_px             : num  1.5 0.5 0.5 1.5 0.5 -0.5 -0.5 0.5 1.5 1.5 ...
 $ zone_pz             : num  1.5 3.5 2.5 2.5 2.5 3.5 2.5 2.5 2.5 3.5 ...
 $ batter_swing        : int  0 1 0 1 0 1 1 1 0 0 ...
In [189]:
# Create batter_swing
no_swing <- c("Ball", "CalledStrike", "BallinDirt", "HitByPitch")
greinke_ff$batter_swing <- ifelse(greinke_ff$pitch_result %in% no_swing, 0, 1)

# Create swing_ff
swing_ff <- subset(greinke_ff, greinke_ff$batter_swing == 1)

# Create the contact variable
no_contact <- c("SwingingStrike", "MissedBunt")
swing_ff$contact <- ifelse(swing_ff$pitch_result %in% no_contact, 0, 1)

# Create velo_bin: add one line for "Fast"
swing_ff$velo_bin <- ifelse(swing_ff$start_speed < 90.5, "Slow", NA)

swing_ff$velo_bin <- ifelse(swing_ff$start_speed >= 90.5 & swing_ff$start_speed < 92.5, "Medium", swing_ff$velo_bin)

swing_ff$velo_bin <- ifelse(swing_ff$start_speed >= 92.5, "Fast", swing_ff$velo_bin)

# Aggregate contact rate by velocity bin
tapply(swing_ff$contact, swing_ff$velo_bin, mean)
Fast
0.793859649122807
Medium
0.832807570977918
Slow
0.843373493975904

Pitch type impact on contact rate

Now that you have seen the relationship between fastball start_speed and contact rate, let's explore the relationship between pitch_type and contact rate.

This time, because average start_speed varies by pitch_type, you will need to reconfigure your velo_bin variable. Specifically, structure this into 3 groups for each pitch, with each group the within-pitch start_speed.

The swings dataset, which includes only pitches at which a batter has swung, has been created for you. Also note that we have written a new function called bin_pitch_speed() for use in calculating quantiles.

In [201]:
swings <- read.table(file = 'Baseball_swings.csv', header = TRUE, sep = ';')
str(swings)
'data.frame':	1600 obs. of  40 variables:
 $ p_name              : Factor w/ 1 level "Zack Greinke": 1 1 1 1 1 1 1 1 1 1 ...
 $ pitcher_id          : int  425844 425844 425844 425844 425844 425844 425844 425844 425844 425844 ...
 $ batter_stand        : Factor w/ 2 levels "L","R": 2 2 2 2 2 2 2 2 2 2 ...
 $ pitch_type          : Factor w/ 5 levels "CH","CU","FF",..: 3 5 3 5 1 5 5 5 3 5 ...
 $ pitch_result        : Factor w/ 11 levels "BallIn Dirt",..: 11 11 11 11 10 2 11 11 8 7 ...
 $ atbat_result        : Factor w/ 23 levels "BuntGroundout",..: 19 20 20 20 20 11 23 20 13 15 ...
 $ start_speed         : num  92.4 86.9 92.8 87.8 90.3 85.5 87.3 87.4 92.3 87.1 ...
 $ z0                  : num  6.28 6.08 6.11 6.32 5.87 ...
 $ x0                  : num  -0.76 -0.939 -0.524 -0.948 -1.252 ...
 $ pfx_x               : num  -1.59 3.594 -0.558 4.313 -7.525 ...
 $ pfx_z               : num  11.4 0.762 11.134 0.132 1.745 ...
 $ px                  : num  0.589 0.764 1.517 0.695 -0.131 ...
 $ pz                  : num  3.271 1.306 2.193 3.431 0.922 ...
 $ break_angle         : num  10.1 -11.4 -0.4 -13.6 22.5 -8.4 -11.3 -6.5 15.8 -4.8 ...
 $ break_length        : num  2.7 8 2.8 7.8 7.4 7.5 7.4 6.7 3.5 7.8 ...
 $ spin_rate           : num  2312 694 2243 829 1521 ...
 $ spin_dir            : num  187.9 102.6 182.9 92.3 256.6 ...
 $ balls               : int  1 1 1 2 1 0 0 1 0 0 ...
 $ strikes             : int  1 2 2 2 2 0 1 0 1 2 ...
 $ outs                : int  0 0 0 1 1 2 2 1 1 1 ...
 $ game_date           : Factor w/ 32 levels "2015-04-07","2015-04-12",..: 32 32 32 32 32 32 32 32 32 32 ...
 $ inning              : int  3 6 8 1 6 8 4 1 5 7 ...
 $ inning_topbot       : Factor w/ 2 levels "bot","top": 2 2 2 2 2 2 2 2 2 2 ...
 $ batted_ball_type    : Factor w/ 5 levels "","FB","GB","LD",..: 1 1 1 1 1 1 1 1 2 5 ...
 $ batted_ball_velocity: int  104 NA NA NA NA NA NA NA 103 73 ...
 $ hc_x                : num  124 0 0 0 0 ...
 $ hc_y                : num  97.3 0 0 0 0 ...
 $ pitch_id            : int  95 265 374 14 279 386 156 10 219 320 ...
 $ distance_feet       : int  0 NA NA NA NA NA NA NA 425 0 ...
 $ year                : int  2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
 $ month               : int  10 10 10 10 10 10 10 10 10 10 ...
 $ day                 : int  3 3 3 3 3 3 3 3 3 3 ...
 $ july                : Factor w/ 2 levels "july","other": 2 2 2 2 2 2 2 2 2 2 ...
 $ bs_count            : Factor w/ 12 levels "0-0","0-1","0-2",..: 5 8 8 7 8 1 2 10 2 3 ...
 $ late_in_game        : int  0 1 1 0 1 1 0 0 0 1 ...
 $ zone                : int  7 15 12 7 18 11 16 12 6 15 ...
 $ zone_px             : num  0.5 0.5 1.5 0.5 -0.5 0.5 1.5 1.5 -0.5 0.5 ...
 $ zone_pz             : num  3.5 1.5 2.5 3.5 0.5 2.5 1.5 2.5 3.5 1.5 ...
 $ batter_swing        : int  1 1 1 1 1 1 1 1 1 1 ...
 $ contact             : int  0 0 0 0 1 1 0 0 1 1 ...
In [202]:
# Create the subsets for each pitch type
swing_ff <- subset(swings, pitch_type == "FF")
swing_ch <- subset(swings, pitch_type == "CH")
swing_cu <- subset(swings, pitch_type == "CU")
swing_ft <- subset(swings, pitch_type == "FT")
swing_sl <- subset(swings, pitch_type == "SL")
In [207]:
# create a function
bin_pitch_speed <- function(start_speed)
{
    as.integer(cut(start_speed, quantile(start_speed, probs = 0:3 / 3), include.lowest = TRUE))
}
In [208]:
# Make velo_bin_pitch variable for each subset
swing_ff$velo_bin <- bin_pitch_speed(swing_ff$start_speed)
swing_ch$velo_bin <- bin_pitch_speed(swing_ch$start_speed)
swing_cu$velo_bin <- bin_pitch_speed(swing_cu$start_speed)
swing_ft$velo_bin <- bin_pitch_speed(swing_ft$start_speed)
swing_sl$velo_bin <- bin_pitch_speed(swing_sl$start_speed)

# Print quantile levels for each pitch
thirds <- c(0, 1/3, 2/3, 1)
quantile(swing_ff$start_speed, probs = thirds)
quantile(swing_ch$start_speed, probs = thirds)
quantile(swing_cu$start_speed, probs = thirds)
quantile(swing_ft$start_speed, probs = thirds)
quantile(swing_sl$start_speed, probs = thirds)
0%
88.2
33.33333%
91.3
66.66667%
92.5
100%
94.9
0%
80.4
33.33333%
87.8
66.66667%
88.9666666666667
100%
91.5
0%
63.7
33.33333%
73.3
66.66667%
75.3
100%
79.4
0%
87.9
33.33333%
90.5
66.66667%
91.8666666666667
100%
95.4
0%
79.8
33.33333%
86.5
66.66667%
87.6
100%
91.4

Velocity impact on contact by pitch type

Now that you've created your new velo_bin variable, in this exercise you will evaluate whether there is any relationship between start_speed and contact rate by pitch type.

In [240]:
# Calculate contact rate by velocity for swing_ff
tapply(swing_ff$contact, swing_ff$velo_bin, mean)

# Calculate contact rate by velocity for swing_ft
tapply(swing_ft$contact, swing_ft$velo_bin, mean)

# Calculate contact rate by velocity for swing_ch
tapply(swing_ch$contact, swing_ch$velo_bin, mean)

# Calculate contact rate by velocity for swing_cu
tapply(swing_cu$contact, swing_cu$velo_bin, mean)

# Calculate contact rate by velocity for swing_sl
tapply(swing_sl$contact, swing_sl$velo_bin, mean)
1
0.836448598130841
2
0.841346153846154
3
0.781553398058252
1
0.810344827586207
2
0.857142857142857
3
0.87037037037037
1
0.760330578512397
2
0.675925925925926
3
0.765217391304348
1
0.870967741935484
2
0.833333333333333
3
0.888888888888889
1
0.714285714285714
2
0.723076923076923
3
0.680327868852459

It's interesting to see that the relationship between contact rate and velocity for other pitches is less clear than it was for four-seam fastballs.

Greinke's out pitch?

While there wasn't much happening with velocity within pitch type in the previous exercise, it can often be useful to think about situational characteristics of pitching. Take a look at Greinke's propensity to use certain pitches in 2-strike counts.

Specifically, it might be interesting to know if Greinke uses certain pitches in 2-strike counts more often than others. And, if so, are they more successful in getting a swing and a miss? Or are certain pitches getting more outs when they're put in play?

The dataset called swings has been loaded, now with the velo_bin variable, which stacks the 5 pitch type subsets you made in the previous exercises. You will use this new dataset here.

In [241]:
# Create swings_str2
swings_str2 <- subset(swings, swings$strikes == 2)

# Print number of observations
nrow(swings_str2)

# Print a table of pitch use
pitch_tab <- table(swings_str2$pitch_type)
pitch_tab

# Calculate contact rate by pitch type
contact_rate <- round(tapply(swings_str2$contact, swings_str2$pitch_type, mean), 3)
contact_rate
560
 CH  CU  FF  FT  SL 
115  15 182  50 198 
CH
0.774
CU
0.867
FF
0.846
FT
0.8
SL
0.798

Describe 2-strike pitch usage

The tables you created in the last exercise have been loaded into your workspace as pitch_tab and contact_rate.

Which pitch did Greinke use most in 2-strike counts in 2015 and which was most successful in getting swings and misses?

pitch_tab

 CH  CU  FF  FT  SL 
115  15 182  50 198 

contact_rate
   CH    CU    FF    FT    SL 
0.774 0.867 0.846 0.800 0.798

He used his slider (SL) most often, but his curveball (CU) was most successful.

Impact of pitch location on contact rate

Now, instead of velocity, visualize contact rates based on location. Here, only look at the zone areas that are near the strike zone (in this case, zones 6, 7, 10, 11, 14, and 15).

Begin by calculating the contact rate for each zone separately for right- and left-handed batters. Then, you'll use what you learned from previous exercises to plot() them on a figure.

In [242]:
# Create subset of swings: swings_rhb
swings_lhb <- subset(swings, swings$batter_stand == "L")

# Create subset of swings: swings_lhb
swings_rhb <- subset(swings, swings$batter_stand == "R")

# Create zone_contact_r
zone_contact_l <- round(tapply(swings_lhb$contact, swings_lhb$zone, mean), 3)

# Create zone_contact_l
zone_contact_r <- round(tapply(swings_rhb$contact, swings_rhb$zone, mean), 3)

# Plot figure grid for RHB
par(mfrow = c(1, 2))
plot(x = c(-1, 1), y = c(1, 4), type = "n", 
     main = "Contact Rate by Location (RHB)", 
     xlab = "Horizontal Location (ft.; Catcher's View)", 
     ylab = "Vertical Location (ft.)")
abline(v = 0)
abline(h = 2)
abline(h = 3)

# Add text for RHB contact rate
for(i in unique(c(6, 7, 10, 11, 14, 15))) {
  text(mean(swings_rhb$zone_px[swings_rhb$zone == i]), 
       mean(swings_rhb$zone_pz[swings_rhb$zone == i]), 
       zone_contact_r[rownames(zone_contact_r) == i], cex = 1.5)
}

# Add LHB plot
plot(x = c(-1, 1), y = c(1, 4), type = "n", 
     main = "Contact Rate by Location (LHB)", 
     xlab = "Horizontal Location (ft.; Catcher's View)", 
     ylab = "Vertical Location (ft.)")
abline(v = 0)
abline(h = 2)
abline(h = 3)

# Add text for LHB contact rate
for(i in unique(c(6, 7, 10, 11, 14, 15))) {
  text(mean(swings_lhb$zone_px[swings_lhb$zone == i]), 
       mean(swings_lhb$zone_pz[swings_lhb$zone == i]), 
       zone_contact_l[rownames(zone_contact_l) == i], cex = 1.5)
}

It looks like low pitches are much more difficult for batters to make contact with when they swing. Maybe Greinke should keep his pitches down, especially against lefties!

Rethinking the use of for loops

Use the seq() and rep() functions to create vectors of coordinates for each zone. Then, put these together in a new data frame. This data frame will be added to and used for plotting.

In [243]:
# Create vector px
px <- rep(seq(from = -1.5, to = 1.5, by = 1), times = 5)

# Create vector pz
pz <- rep(seq(from = 4.5, to = 0.5, by = -1), each = 4)

# Create vector of zone numbers
zone <- seq(from = 1, to = 20, by = 1)

# Create locgrid
locgrid <- data.frame(zone, px, pz)

# Print locgrid
locgrid
zonepxpz
1 -1.54.5
2 -0.54.5
3 0.54.5
4 1.54.5
5 -1.53.5
6 -0.53.5
7 0.53.5
8 1.53.5
9 -1.52.5
10 -0.52.5
11 0.52.5
12 1.52.5
13 -1.51.5
14 -0.51.5
15 0.51.5
16 1.51.5
17 -1.50.5
18 -0.50.5
19 0.50.5
20 1.50.5

Contact rate with ggplot2

The zone_contact_r and zone_contact_l tables are now stored as data frames with two columns: the contact rate and the zone number. Note that not all 20 zones have data for both left- and right-handed batters.

Now, let's put these data frames to work with the help of locgrid and ggplot2 to make some plots without using any for loops!

In [238]:
# load the gridExtra package
library(gridExtra)
In [221]:
locgrid <- read.table(file = 'locgrid.csv', header = TRUE, sep = ';')
str(locgrid)
'data.frame':	20 obs. of  5 variables:
 $ zone          : int  1 2 3 4 5 6 7 8 9 10 ...
 $ px            : num  -1.5 -0.5 0.5 1.5 -1.5 -0.5 0.5 1.5 -1.5 -0.5 ...
 $ pz            : num  4.5 4.5 4.5 4.5 3.5 3.5 3.5 3.5 2.5 2.5 ...
 $ contact_rate_r: num  0.5 NA 0.667 NA 0.833 0.729 0.701 0.917 1 0.88 ...
 $ contact_rate_l: num  NA 1 NA NA 0.5 0.926 0.809 1 0.767 0.939 ...
In [248]:
zone_contact_r <- read.table(file = 'zone_contact_r.csv', header = TRUE, sep = ';')
zone_contact_l <- read.table(file = 'zone_contact_l.csv', header = TRUE, sep = ';')
str(zone_contact_r)
str(zone_contact_l)
'data.frame':	17 obs. of  2 variables:
 $ contact_rate_r: num  0.5 0.667 0.833 0.729 0.701 0.917 1 0.88 0.81 0.75 ...
 $ zone          : int  1 3 5 6 7 8 9 10 11 12 ...
'data.frame':	16 obs. of  2 variables:
 $ contact_rate_l: num  1 0.5 0.926 0.809 1 0.767 0.939 0.945 0.857 0.625 ...
 $ zone          : int  2 5 6 7 8 9 10 11 12 13 ...
In [249]:
# Examine new contact data
zone_contact_r
zone_contact_l
contact_rate_rzone
0.500 1
0.667 3
0.833 5
0.729 6
0.701 7
0.917 8
1.000 9
0.88010
0.81011
0.75012
0.60013
0.71214
0.68915
0.35316
0.72718
0.82119
0.84020
contact_rate_lzone
1.000 2
0.500 5
0.926 6
0.809 7
1.000 8
0.767 9
0.93910
0.94511
0.85712
0.62513
0.73614
0.65115
0.00016
1.00017
0.75018
0.78919
In [250]:
# Merge locgrid with zone_contact_r
# Specify two additional arguments: by = "zone", which tells R which column to match on, and all.x = TRUE, which will make sure you don't lose any zones (1-20)
locgrid <- merge(locgrid, zone_contact_r, by = "zone", all.x = TRUE)

# Merge locgrid with zone_contact_l
locgrid <- merge(locgrid, zone_contact_l, by = "zone", all.x = TRUE)
In [251]:
# Print locgrid to the console
locgrid
zonepxpzcontact_rate_rcontact_rate_l
1 -1.5 4.5 0.500 NA
2 -0.5 4.5 NA1.000
3 0.5 4.5 0.667 NA
4 1.5 4.5 NA NA
5 -1.5 3.5 0.8330.500
6 -0.5 3.5 0.7290.926
7 0.5 3.5 0.7010.809
8 1.5 3.5 0.9171.000
9 -1.5 2.5 1.0000.767
10 -0.5 2.5 0.8800.939
11 0.5 2.5 0.8100.945
12 1.5 2.5 0.7500.857
13 -1.5 1.5 0.6000.625
14 -0.5 1.5 0.7120.736
15 0.5 1.5 0.6890.651
16 1.5 1.5 0.3530.000
17 -1.5 0.5 NA1.000
18 -0.5 0.5 0.7270.750
19 0.5 0.5 0.8210.789
20 1.5 0.5 0.840 NA

Adding titles and axes to ggplot2 figure

It's time to add a title and axis labels to your plots. This is done using the ggtitle() and labs() functions in the ggplot2 package.

The default title text size is a bit small sometimes: make use of the theme() function in the ggplot2 package.

In [264]:
library(ggplot2)
# add this package for more colors
library(RColorBrewer)
In [259]:
# Make base grid with ggplot()
plot_base_grid <- ggplot(locgrid, aes(x = px, y = pz))

# Arrange the plots side-by-side
# from the gridExtra package
grid.arrange(plot_base_grid, ncol = 2, plot_base_grid)
In [260]:
# Adding text for contact rate values
# Make RHB plot
plot_titles_rhb <- plot_base_grid + 
  ggtitle("RHB Contact Rates") + 
  labs(x = "Horizontal Location(ft.; Catcher's View)", 
       y = "Vertical Location (ft.)") + 
  theme(plot.title = element_text(size = 15))

# Make LHB plot
plot_titles_lhb <- plot_base_grid + 
  ggtitle("LHB Contact Rates") + 
  labs(x = "Horizontal Location(ft.; Catcher's View)", 
       y = "Vertical Location (ft.)") + 
  theme(plot.title = element_text(size = 15))

# Display both side-by-side
grid.arrange(plot_titles_lhb, plot_titles_rhb, ncol = 2)

Making a heat map - visualizing hot and cold zones

Visualizing probabilities on a grid can often be improved by using color. In this case, the figure could be referred to as a heat map. There are several ways to build heat maps in R, but in this exercise, you'll do it with ggplot2.

The color of each zone in your heat map will correspond to the probability of making contact. Scaling colors in this way gives an additional visual edge so that the locational impact on contact rate can be inferred and compared more quickly.

The geom_tile() function does this automatically, but you will customize the color scheme with the scale_fill_gradientn() function in combination with brewer.pal() from the RColorBrewer package. (RColorBrewer provides some nice color schemes that would be time-consuming to reproduce manually. You will use mostly red colors here, but you can type display.brewer.all() in the R console to see the available color palattes.)

Darker and lighter shades of red will indicate higher and lower contact rates, respectively.

In [263]:
# Make RHB plot
plot_colors_rhb <- plot_titles_rhb + 
  geom_tile(aes(fill = contact_rate_r)) + 
  scale_fill_gradientn(name = "Contact Rate", 
                       limits = c(0.5, 1), 
                       breaks = seq(from = 0.5, to = 1, by = 0.1), 
                       colors = c(brewer.pal(n = 7, name = "Reds")))

# Make LHB plot
plot_colors_lhb <- plot_titles_lhb + 
  geom_tile(aes(fill = contact_rate_l)) + 
  scale_fill_gradientn(name = "Contact Rate", 
                       limits = c(0.5, 1), 
                       breaks = seq(from = 0.5, to = 1, by = 0.1), 
                       colors = c(brewer.pal(n = 7, name = "Blues")))

# Display plots side-by-side
grid.arrange(plot_colors_lhb, plot_colors_rhb, ncol = 2)

Adding text for contact rate values

Now, show the actual contact_rate value within each zone box. You already did this before with your for loops. However, ggplot2() allows this to be done a bit more easily with the locgrid data frame and a new function called annotate().

Here, you will simply be adding the additional parameters to the plot_colors_rhb and plot_colors_lhb objects.

In [268]:
# Make RHB plot
plot_contact_rhb <- plot_colors_rhb + 
  annotate("text", x = locgrid$px, y = locgrid$pz, 
           label = locgrid$contact_rate_r, size = 3)

# Make LHB plot
plot_contact_lhb <- plot_colors_lhb + 
  annotate("text", x = locgrid$px, y = locgrid$pz, 
           label = locgrid$contact_rate_l, size = 3)

# Plot them side-by-side
grid.arrange(plot_contact_lhb, plot_contact_rhb, ncol = 2)
Warning message:
"Removed 4 rows containing missing values (geom_text)."Warning message:
"Removed 3 rows containing missing values (geom_text)."

Contact and exit speed

While contact rate can be a useful measure for knowing how successful a pitcher is, there are other ways to measure outcomes. Begin to work with the batted_ball_velocity variable. This measures how hard the ball was hit by the batter.

Unfortunately, there is also a large amount of missing data in the batted_ball_velocity variable. For our purposes here, we'll assume that there's no pattern to which data are missing, so you'll simply remove any observations that are missing a value for this variable.

In [269]:
# Create pcontact
pcontact <- subset(swings, swings$contact == 1 & !is.na(swings$batted_ball_velocity))

# Create pcontact_r
pcontact_r <- subset(pcontact, pcontact$batter_stand == "R")

# Create pcontact_l
pcontact_l <- subset(pcontact, pcontact$batter_stand == "L")

Location and exit speed

Get the average batted_ball_velocity by zone using tapply() and plot it as text on your zone grid. The pcontact_r and pcontact_l data, in addition to the locgrid data, are available in the workspace.

In [270]:
# Create exit_speed_r
exit_speed_r <- data.frame(tapply(pcontact_r$batted_ball_velocity, 
                                  pcontact_r$zone, mean))
exit_speed_r <- round(exit_speed_r, 1)
colnames(exit_speed_r) <- "exit_speed_rhb"
exit_speed_r$zone <- row.names(exit_speed_r)

# Create exit_speed_l
exit_speed_l <- data.frame(tapply(pcontact_l$batted_ball_velocity, 
                                  pcontact_l$zone, mean))
exit_speed_l <- round(exit_speed_l, 1)
colnames(exit_speed_l) <- "exit_speed_lhb"
exit_speed_l$zone <- row.names(exit_speed_l)
  
# Merge with locgrid
locgrid <- merge(locgrid, exit_speed_l, by = "zone", all.x = T)
locgrid <- merge(locgrid, exit_speed_r, by = "zone", all.x = T)

# Print locgrid
locgrid
zonepxpzcontact_rate_rcontact_rate_lexit_speed_lhbexit_speed_rhb
1 -1.5 4.5 0.500 NA NA NA
2 -0.5 4.5 NA1.00072.0 NA
3 0.5 4.5 0.667 NA NA 92.0
4 1.5 4.5 NA NA NA NA
5 -1.5 3.5 0.8330.50091.0 60.5
6 -0.5 3.5 0.7290.92691.0 88.0
7 0.5 3.5 0.7010.80981.0 89.1
8 1.5 3.5 0.9171.000 NA 78.8
9 -1.5 2.5 1.0000.76783.9 86.4
10 -0.5 2.5 0.8800.93991.0 89.9
11 0.5 2.5 0.8100.94586.1 89.5
12 1.5 2.5 0.7500.85773.0 83.4
13 -1.5 1.5 0.6000.62577.1 85.0
14 -0.5 1.5 0.7120.73689.9 88.7
15 0.5 1.5 0.6890.65192.8 89.8
16 1.5 1.5 0.3530.000 NA 84.2
17 -1.5 0.5 NA1.00083.5 NA
18 -0.5 0.5 0.7270.75092.0 89.0
19 0.5 0.5 0.8210.78984.6 84.6
20 1.5 0.5 0.840 NA NA 91.1

Plotting exit speed as a heat map

Use the swings dataset to plot exit speed by zone location with ggplot2.

As with contact_rate, use the exit_speed information from locgrid to create a heat map of exit speed by location.

In [274]:
# Create LHB exit speed plotting object
plot_exit_lhb <- plot_base_grid + 
  geom_tile(data = locgrid, aes(fill = exit_speed_lhb)) + 
  scale_fill_gradientn(name = "Exit Speed (mph)", 
                       limits = c(60, 95), 
                       breaks = seq(from = 60, to = 95, by = 5), 
                       colors = c(brewer.pal(n = 7, name = "Blues"))) + 
  annotate("text", x = locgrid$px, y = locgrid$pz, 
           label = locgrid$exit_speed_rhb, size = 3) + 
  ggtitle("LHB Exit Velocity (mph)") + 
  labs(x = "Horizontal Location(ft.; Catcher's View)", 
       y = "Vertical Location (ft.)") + 
  theme(plot.title = element_text(size = 15))

# Create RHB exit speed plotting object
plot_exit_rhb <- plot_base_grid + 
  geom_tile(data = locgrid, aes(fill = exit_speed_rhb)) + 
  scale_fill_gradientn(name = "Exit Speed (mph)", 
                       limits = c(60, 95), 
                       breaks = seq(from = 60, to = 95, by = 5), 
                       colors = c(brewer.pal(n = 7, name = "Reds"))) + 
  annotate("text", x = locgrid$px, y = locgrid$pz, 
           label = locgrid$exit_speed_rhb, size = 3) + 
  ggtitle("RHB Exit Velocity (mph)") + 
  labs(x = "Horizontal Location(ft.; Catcher's View)", 
       y = "Vertical Location (ft.)") + 
  theme(plot.title = element_text(size = 15))

# Plot each side-by-side
grid.arrange(plot_exit_lhb, plot_exit_rhb, ncol = 2)
Warning message:
"Removed 4 rows containing missing values (geom_text)."Warning message:
"Removed 4 rows containing missing values (geom_text)."

Using tidy data and facets in ggplot2

Creating side-by-side plots for with the grid_arrange() function misses an important aspect of plotting with ggplot2: facets. Facets are used to create separate windows for plotting information about different groups; in this case, right- and left-handed batters. They can also automatically scale the heat map colors.

Facets require the restructuration of the locgrid dataset. This has been done for you, and is preloaded as a new dataset called exit_tidy. We have removed the contact rate information, so exit_tidy only includes information on exit_speed.

In [276]:
exit_tidy <- read.table(file = 'exit_tidy.csv', header = TRUE, sep = ';')
str(exit_tidy)
'data.frame':	40 obs. of  5 variables:
 $ zone        : int  1 2 3 4 5 6 7 8 9 10 ...
 $ px          : num  -1.5 -0.5 0.5 1.5 -1.5 -0.5 0.5 1.5 -1.5 -0.5 ...
 $ pz          : num  4.5 4.5 4.5 4.5 3.5 3.5 3.5 3.5 2.5 2.5 ...
 $ batter_stand: Factor w/ 2 levels "LHB","RHB": 2 2 2 2 2 2 2 2 2 2 ...
 $ exit_speed  : num  NA NA 92 NA 60.5 88 89.1 78.8 86.4 89.9 ...
In [278]:
# Examine head() and tail() of exit_tidy
head(exit_tidy)
tail(exit_tidy)
zonepxpzbatter_standexit_speed
1 -1.54.5 RHB NA
2 -0.54.5 RHB NA
3 0.54.5 RHB 92.0
4 1.54.5 RHB NA
5 -1.53.5 RHB 60.5
6 -0.53.5 RHB 88.0
zonepxpzbatter_standexit_speed
3515 0.51.5 LHB 92.8
3616 1.51.5 LHB NA
3717 -1.50.5 LHB 83.5
3818 -0.50.5 LHB 92.0
3919 0.50.5 LHB 84.6
4020 1.50.5 LHB NA
In [279]:
# Create plot_exit
plot_exit <- plot_base_grid + 
  geom_tile(data = exit_tidy, aes(fill = exit_speed)) + 
  scale_fill_gradientn(name = "Exit Speed (mph)", 
                       colors = c(brewer.pal(n = 7, name = "Reds"))) + 
  ggtitle("Exit Speed (mph)") + 
  labs(x = "Horizontal Location(ft.; Catcher's View)", 
       y = "Vertical Location (ft.)") + 
  theme(plot.title = element_text(size = 15)) +
  facet_grid(. ~ batter_stand)

# Display plot_exit
plot_exit

Some randomness in baseball outcomes

Despite trends and correlations, everything might not have a cause... Good luck in July or other reasons, Greinke overall outcomes were good.

In 2015, Greinke was in the top-10 best pitcher (7th).


More

Baseball data is publicky available. We can use it to analyze pitching, batting, catching, umpireing, and fielding.