Foreword

  • Output options: the ‘tango’ syntax and the ‘readable’ theme.
  • Snippets and results.
  • htmlTable


Why?

Any company might want to know how many units the competition can produce.

Using inferential statistics, we can estimate the total number of units.

This is similar to a survey of 2,000 voters (the sample) used to assess a population's sentiment about a political party, except that a survey estimates a percentage. Whether we estimate a number of units or a percentage, the result carries a margin of error and therefore a lower and an upper bound.

When dealing with manufactured goods, the technique consists of using serial numbers or sequences (i.e., 1, 2, 3, …, n). For example, with serial numbers gathered through online discussions, the technique was used to estimate the number of iPhones sold: Apple was estimated to have sold around 9.1 million phones by the end of September 2008.

Problem

During WWII, the Western Allies tried to determine the extent of German production. The Allies used conventional intelligence gathering in conjunction with statistical estimation.

According to conventional intelligence estimates, German production was around 1,400 tanks a month between June 1940 and September 1942, or:

  Period       Intelligence estimate
1 June 1940    1000
2 June 1941    1550
3 August 1942  1550

To some, these numbers seemed inflated. This is where inferential statistics comes in.

How

The Allies used the serial numbers on captured or destroyed tanks.

  • The principal numbers used were gearbox numbers.
  • Chassis and engine numbers were also used, though their use was more complicated.
  • Various other components were used to cross-check the analysis. Similar analyses were done on tires, which were observed to be sequentially numbered.
  • The analysis of tank wheels yielded an estimate for the number of wheel molds that were in use.

Theory

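Serial numbers form a sequence. Drawing one item at random from a sequence (i.e., 1, 2, 3, …, n) follows a discrete uniform distribution: every value is equally likely. It is like throwing a die: you have a 1/6, or 16.7%, chance of rolling any given face, whether a 1 or a 4. The sequence on a die runs from 1 to 6, with 6 being the maximum.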

Therefore, we want to estimate the ‘maximum’ of a discrete uniform distribution from sampling. Since each serial number is unique, the sample should be ‘without replacement’.

In small populations and often in large ones, such sampling is typically done ‘without replacement’.
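
To make the sampling scheme concrete, here is a minimal sketch in R. The values of n_tanks and k, the variable names, and the seed are assumptions chosen for illustration; they do not come from the historical data.

```r
set.seed(42) # assumption: fixed seed so the draw is reproducible

n_tanks <- 100 # hypothetical true number of tanks
k <- 4         # sample size

# sample() draws without replacement by default (replace = FALSE),
# so no serial number can be captured twice
captured <- sample(1:n_tanks, k)

captured
anyDuplicated(captured) == 0 # TRUE when every serial number is unique
```

Each draw removes a serial number from the pool, which matches the fact that a given tank can only be captured once.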

Suppose k = 4 tanks (the sample size) with serial numbers 19, 40, 42 and 60 are captured. The maximum observed serial number is m = 60. The unknown total number of tanks is called N.

The formula for estimating the total number of tanks is:

\[ N = m + \frac{m}{k} - 1 \]

N would be 74.
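
As a check on the arithmetic, substituting m = 60 and k = 4 into the formula gives:

\[ N = 60 + \frac{60}{4} - 1 = 60 + 15 - 1 = 74 \]

One way to read the formula: the 60 - 4 = 56 serial numbers below the maximum that were not observed fall into k = 4 gaps, so the average gap is (m - k)/k = m/k - 1 = 14. The estimate is the sample maximum plus one average gap, 60 + 14 = 74.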

Now, the larger the sample, the better.
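
The effect of sample size can be illustrated with a small simulation. This is a sketch under stated assumptions: the true total true_n = 1000, the sample sizes, the number of repetitions and the seed are all arbitrary choices, not historical values.

```r
set.seed(1) # assumption: fixed seed for reproducibility

# Estimator from the formula above: m + m/k - 1
german_tank <- function(tank_sample) {
    max(tank_sample) + max(tank_sample)/length(tank_sample) - 1
}

true_n <- 1000 # assumed true number of tanks, for illustration only
sample_sizes <- c(5, 20, 100)

# For each sample size, repeat the capture 2000 times
# and measure how much the estimates vary
spread <- sapply(sample_sizes, function(k) {
    estimates <- replicate(2000, german_tank(sample.int(true_n, k)))
    sd(estimates)
})
names(spread) <- sample_sizes

round(spread) # standard deviation of the estimate, by sample size
```

The standard deviation of the estimates should shrink as the sample grows, which is what "the larger the sample, the better" means in practice.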

Conclusion

Applying the formula to the serial numbers of captured tanks, the Allies calculated production to be 246 tanks a month.

Conventional intelligence estimates: around 1,400 a month.

After the war, captured German production figures from the ministry of Albert Speer showed the actual number to be 245.

  Period       Intelligence estimate  German records
1 June 1940    1000                   122
2 June 1941    1550                   271
3 August 1942  1550                   343

Estimating production was not the only use of this serial-number analysis. It was also used to understand German production more generally, including the number of factories, the relative importance of factories, the length of the supply chain (based on the lag between production and use), changes in production, and use of resources such as rubber.

A Monte Carlo test

Starting from the formula:

\[N = m + \frac{m}{k} - 1 \]

Code the equation as a function, german_tank, which estimates \(N\) from \(m\), the maximum value in tank_sample, and \(k\), the number of values in tank_sample.

german_tank <- function(tank_sample) {
    max(tank_sample) + max(tank_sample)/length(tank_sample) - 1
}
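
As a quick check, the function can be applied to the four-tank sample from the Theory section. This is a minimal sketch: the variable name captured is an illustrative assumption, and the function definition is repeated only so the snippet runs on its own.

```r
# Estimator from the text: m + m/k - 1
german_tank <- function(tank_sample) {
    max(tank_sample) + max(tank_sample)/length(tank_sample) - 1
}

# Hypothetical variable name; the serial numbers are those used in the Theory section
captured <- c(19, 40, 42, 60)

german_tank(captured) # 60 + 60/4 - 1 = 74
```

The result agrees with the value of 74 worked out by hand.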

Run the Monte Carlo simulation on the equation. The code draws 100 true maxima between 10^2 and 10^6 and, for each one, 50 samples of size 20 without replacement.

# A blank log-log plot to get started
plot(100, 100,
     xlim = c(100, 10^7), 
     ylim = c(100,10^7), 
     log = 'xy', # both axes
     pch = '.', 
     col = 'white',
     frame.plot = FALSE, 
     xlab = 'True values',
     ylab = 'Predicted values',
     main = 'MC Simulation') 


# Track residuals
trueTops = c()
resids = c()
sampleTops = c()

x = runif(100, 2, 6)
for (i in x) {
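    trueTop = floor(10^i) # Serial numbers are integers, so round the true maximum down
    for(j in 1:50) {
        observeds = sample.int(trueTop, 20) # 20 serial numbers from 1..trueTop, no replacement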
        guess = german_tank(observeds)
 
        # Plot the true value vs the predicted one
        points(trueTop, guess, pch=".", col = "blue", cex = 2) 
 
        trueTops = c(trueTops, trueTop)
        resids = c(resids, trueTop - guess)
        sampleTops = c(sampleTops, max(observeds))
    }
} 

# Platonic line of perfectly placed predictions
lines(c(100, 10^6), 
      c(100, 10^6),
      lty = "dashed", 
      col = "gray", 
      lwd = 1)

Plot residuals too.

plot(trueTops, resids,
     log = "x", # true values span several orders of magnitude
     pch = 20, 
     col = "blue", 
     xlab = "True value", 
     ylab = "Residual", 
     main = "Residuals plot")
abline(h = 0)
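
The residual plot can be complemented with a numerical check. The sketch below is self-contained and rests on assumptions: the true maximum is fixed at an arbitrary true_n = 1000, and the names true_n and n_runs, the number of repetitions and the seed are illustrative choices. If the estimator is unbiased, the average estimate should land close to the true value.

```r
set.seed(123) # assumption: fixed seed for reproducibility

# Same estimator as above, repeated so the snippet runs on its own
german_tank <- function(tank_sample) {
    max(tank_sample) + max(tank_sample)/length(tank_sample) - 1
}

true_n <- 1000  # assumed true maximum, for illustration only
n_runs <- 10000 # number of simulated samples

# Each run captures 20 serial numbers without replacement and estimates N
estimates <- replicate(n_runs, german_tank(sample.int(true_n, 20)))
resids <- true_n - estimates # same sign convention as the simulation above

mean(estimates)                   # close to true_n if the estimator is unbiased
mean(resids)                      # average residual, close to 0
quantile(resids, c(0.025, 0.975)) # range covering 95% of the residuals
```

Individual estimates can still miss by a fair margin, so the quantiles give a sense of the error to expect from a single sample of 20.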