11 Foundations of Frequentist Statistics
In this chapter, we embark on a journey into the heart of frequentist statistical inference—a framework that dominates modern empirical research. At its core, frequentist statistics is about making observations from a sample and then drawing inferences about the broader population from which that sample was drawn. The fundamental question we seek to answer is: How confident can we be that the patterns we observe in our limited sample reflect true patterns in the population?
By the end of this chapter, you will understand the foundational concepts that underpin frequentist inference, including the philosophy of repeated sampling, the nature of estimators, and the mathematical criteria we use to distinguish good estimators from poor ones.
11.1 The Nature of Inferential Statistics
This distinction—between what we observe in our sample and what we can confidently claim about the population—lies at the heart of all inferential statistics. But what exactly are we making claims about when we talk about populations?
Population Parameters vs. Sample Statistics
When we make claims about a population, we are not making claims about individual observations; after all, a population is conceptually infinite in size, so we could never describe every member. Instead, we make claims about specific parameters of the population’s distribution. The two parameters we encounter most frequently are:
The population mean (\(\mu\)): This is by far the most common parameter we test hypotheses about in applied statistics.
The population variance (\(\sigma^2\)): This parameter is crucial because tests for the population mean often depend on our ability to estimate the population variance.
Because we never truly know the values of \(\mu\) or \(\sigma^2\), we must estimate them using sample data. The corresponding quantities we calculate from our sample are:
- Sample mean (\(\bar{y}\)): The analog to the population mean
- Sample variance (\(s^2\)): The analog to the population variance
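To make these concrete, here is a minimal Python sketch that computes both sample statistics from a small, made-up set of observations (the values are arbitrary, and the variance uses the conventional \(n - 1\) denominator, whose justification we take up in the next chapter):

```python
# A small, made-up sample of observations.
y = [4.2, 5.1, 3.8, 6.0, 4.9]
n = len(y)

# Sample mean: the average of the observed values.
y_bar = sum(y) / n

# Sample variance: average squared deviation from the sample mean,
# with the conventional n - 1 denominator.
s2 = sum((yi - y_bar) ** 2 for yi in y) / (n - 1)

print(y_bar, s2)  # about 4.8 and 0.725
```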
11.2 Transformations of Random Variables
Before we dive into the philosophy of estimation, we need to develop some mathematical machinery. In statistics, we routinely transform data—we take numbers, apply formulas to them, and generate new numbers. Understanding how these transformations affect the mean and variance of our data is essential.
Affine Transformations
Consider a simple but powerful type of transformation called an affine transformation. If we have a variable \(X\), for instance a set of data values with mean \(\bar{x}\) and variance \(s^2\), we might create a new variable:
\[Y = mX + c\]
where \(m\) is a multiplicative constant (the slope) and \(c\) is an additive constant (the intercept). This is exactly the form of a linear equation you’ve seen since high school algebra.
The question is: if we know the mean and variance of \(X\), what are the mean and variance of \(Y\)?
We can decompose this affine transformation into two simpler operations:
- Translation: \(X \rightarrow X + c\) (adding a constant)
- Linear transformation: \(X \rightarrow mX\) (multiplying by a constant)
Properties of Translation
When you add a constant \(c\) to every value in your dataset, creating \(Y = X + c\):
\[ \begin{aligned} \text{Mean of } Y &= \bar{x} + c \\ \text{Variance of } Y &= s^2 \end{aligned} \]
The mean shifts by exactly \(c\), but the variance remains unchanged. Why? Because variance measures the spread of data around the mean, and when you shift all values by the same amount, their relative positions don’t change.
Properties of Linear Transformation
When you multiply every value by a constant \(m\), creating \(Y = mX\):
\[ \begin{aligned} \text{Mean of } Y &= m\bar{x} \\ \text{Variance of } Y &= m^2 s^2 \\ \text{Standard deviation of } Y &= |m| s \end{aligned} \]
Notice that the variance is multiplied by \(m^2\), not \(m\). This occurs because variance involves squared deviations, so a multiplicative constant gets squared in the process.
Combining Both Transformations
For the full affine transformation \(Y = mX + c\):
\[ \begin{aligned} \text{Mean of } Y &= m\bar{x} + c \\ \text{Variance of } Y &= m^2 s^2 \\ \text{Standard deviation of } Y &= |m| s \end{aligned} \]
These formulas will prove invaluable as we develop more sophisticated statistical techniques.
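As a numerical sketch of these rules (the data values and the constants \(m = 3\) and \(c = 10\) below are arbitrary choices for illustration), we can apply \(Y = mX + c\) to a few numbers and compare the resulting mean and variance with what the formulas predict:

```python
import statistics

# Arbitrary data and constants, chosen only for illustration.
x = [2.0, 4.0, 6.0, 8.0]
m, c = 3.0, 10.0

# Apply the affine transformation Y = mX + c to every value.
y = [m * xi + c for xi in x]

x_bar, s2_x = statistics.mean(x), statistics.variance(x)
y_bar, s2_y = statistics.mean(y), statistics.variance(y)

# The mean shifts and scales; the variance picks up a factor of m^2.
print(y_bar, m * x_bar + c)   # both 25.0 (up to rounding)
print(s2_y, m ** 2 * s2_x)    # both 60.0 (up to rounding)
```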
11.3 The Frequentist Philosophy: Repeated Sampling
We now arrive at the conceptual heart of frequentist statistics. The entire edifice of frequentist inference rests on an imaginary exercise: repeated sampling.
Here’s the crucial point: in practice, we only sample once. But theoretically, we imagine what would happen if we could sample infinitely many times. The behavior of our estimator across these hypothetical repeated samples tells us whether it’s a good estimator or not.
The Concept of an Estimator
An estimator is simply a formula that we apply to sample data to estimate a population parameter. Importantly, there are infinitely many possible estimators for any given parameter.
For example, suppose we want to estimate the population mean \(\mu\). Here are just a few of the infinitely many estimators we could choose:
- The first observation: \(\hat{\mu}_1 = y_1\)
- The sum of the first two observations: \(\hat{\mu}_2 = y_1 + y_2\)
- The cube of the first observation: \(\hat{\mu}_3 = y_1^3\)
- The fourth power of the seventh observation times the sine of the second: \(\hat{\mu}_4 = y_7^4 \times \sin(y_2)\)
- The sample mean: \(\hat{\mu}_5 = \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i\)
Most of these are obviously terrible estimators. But the point is that we can construct any formula we want, and each formula defines a different estimator. The set of all possible estimators is infinite.
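In code, an estimator is nothing more than a function that maps a sample to a number. A minimal sketch of a few of the candidates above (the function names and the sample values are ours, chosen only for illustration):

```python
# Each estimator is simply a rule that turns a sample into a single number.
def first_observation(y):
    return y[0]

def sum_of_first_two(y):
    return y[0] + y[1]

def cube_of_first(y):
    return y[0] ** 3

def sample_mean(y):
    return sum(y) / len(y)

sample = [2.0, 1.0, 3.0, 2.0, 1.0, 3.0, 2.0]
for estimator in (first_observation, sum_of_first_two, cube_of_first, sample_mean):
    print(estimator.__name__, estimator(sample))
```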
So how do we choose among them? How do we determine which estimators are “good” and which are “bad”?
The answer lies in examining the sampling distribution of each estimator.
Sampling Distributions
For any estimator, we can imagine the repeated sampling process:
- Draw a sample of size \(n\)
- Apply the estimator to get an estimate
- Record that estimate
- Repeat infinitely many times
The distribution of all these estimates is called the sampling distribution of the estimator. Each different estimator has its own sampling distribution.
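We cannot literally repeat the process infinitely many times, but we can approximate it. Here is a minimal simulation sketch (a standard normal population and the particular values of \(n\) and the number of repetitions are arbitrary choices for illustration):

```python
import random
import statistics

random.seed(1)

def sample_mean(y):
    return sum(y) / len(y)

n = 10          # sample size
reps = 10_000   # number of repeated samples, standing in for "infinitely many"

# Repeatedly draw a sample from a standard normal population (mu = 0, sigma = 1),
# apply the estimator, and record the estimate.
estimates = []
for _ in range(reps):
    sample = [random.gauss(0, 1) for _ in range(n)]
    estimates.append(sample_mean(sample))

# The recorded estimates approximate the sampling distribution of the estimator.
print(statistics.mean(estimates))      # close to the true mean, 0
print(statistics.variance(estimates))  # noticeably smaller than the population variance of 1
```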
11.4 A Concrete Example: Estimating from a Simple Population
To make these abstract ideas concrete, let’s work through a simple example. Consider a population with only three values: \(\{1, 2, 3\}\). Since there’s one of each value, each has probability \(1/3\) of being selected if we draw randomly from this population.
The True Population Parameters
This is a discrete uniform distribution, and we can easily calculate the true population mean and variance:
\[ \mu = E[Y] = 1 \cdot \frac{1}{3} + 2 \cdot \frac{1}{3} + 3 \cdot \frac{1}{3} = 2 \]
For the variance, we first calculate the expected value of \(Y^2\):
\[ E[Y^2] = 1^2 \cdot \frac{1}{3} + 2^2 \cdot \frac{1}{3} + 3^2 \cdot \frac{1}{3} = \frac{14}{3} \]
Then, using the formula \(\text{Var}(Y) = E[Y^2] - (E[Y])^2\):
\[ \sigma^2 = \frac{14}{3} - 2^2 = \frac{14}{3} - 4 = \frac{2}{3} \]
We can also verify this directly by calculating the squared deviations:
| Value \((y)\) | Deviation \((y - \mu)\) | Squared Deviation \((y - \mu)^2\) |
|---|---|---|
| 1 | -1 | 1 |
| 2 | 0 | 0 |
| 3 | 1 | 1 |
\[ \sigma^2 = \frac{1 + 0 + 1}{3} = \frac{2}{3} \]
So we know that \(\mu = 2\) and \(\sigma^2 = 2/3\).
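A quick sketch verifying these two values numerically, using exact fractions:

```python
from fractions import Fraction

values = [1, 2, 3]
p = Fraction(1, 3)  # each value is equally likely

# Population mean: E[Y] is the probability-weighted sum of the values.
mu = sum(y * p for y in values)

# Population variance: Var(Y) = E[Y^2] - (E[Y])^2.
e_y2 = sum(y ** 2 * p for y in values)
sigma2 = e_y2 - mu ** 2

print(mu)      # 2
print(sigma2)  # 2/3
```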
The Frequentist Approach: Pretending We Don’t Know
Now comes the frequentist thought experiment. Imagine we don’t know these population parameters. Our goal is to estimate \(\mu\) using only sample data.
We’ll draw samples of size \(n = 2\) with replacement. Sampling with replacement means:
- Draw an observation and record its value
- Return it to the population
- Draw a second observation
This ensures that our two observations are independent—the value of the second observation doesn’t depend on the first.
Enumerating All Possible Samples
With a population of size 3 and sample size 2, how many possible samples can we draw (with replacement, where order matters)?
- First observation: 3 possibilities
- Second observation: 3 possibilities
- Total: \(3 \times 3 = 9\) possible samples
Here are all nine possible samples and their means:
| Sample | Values | Sample Mean \(\bar{y}\) |
|---|---|---|
| 1 | (1, 1) | 1.0 |
| 2 | (1, 2) | 1.5 |
| 3 | (1, 3) | 2.0 |
| 4 | (2, 1) | 1.5 |
| 5 | (2, 2) | 2.0 |
| 6 | (2, 3) | 2.5 |
| 7 | (3, 1) | 2.0 |
| 8 | (3, 2) | 2.5 |
| 9 | (3, 3) | 3.0 |
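The enumeration is easy to reproduce in code; a minimal sketch using the standard library:

```python
from itertools import product

population = [1, 2, 3]

# All ordered samples of size 2 drawn with replacement: 3 x 3 = 9 of them.
samples = list(product(population, repeat=2))

for sample in samples:
    y_bar = sum(sample) / len(sample)
    print(sample, y_bar)
```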
The Sampling Distribution of the Sample Mean
Now we can construct the sampling distribution of \(\bar{y}\). We have nine equally likely samples, so each sample mean has probability \(1/9\):
| Sample Mean \(\bar{y}\) | Frequency | Probability |
|---|---|---|
| 1.0 | 1 | 1/9 |
| 1.5 | 2 | 2/9 |
| 2.0 | 3 | 3/9 |
| 2.5 | 2 | 2/9 |
| 3.0 | 1 | 1/9 |
The expected value of this sampling distribution is:
\[ E[\bar{y}] = 1.0 \cdot \frac{1}{9} + 1.5 \cdot \frac{2}{9} + 2.0 \cdot \frac{3}{9} + 2.5 \cdot \frac{2}{9} + 3.0 \cdot \frac{1}{9} = 2 \]
This equals the true population mean! This is our first glimpse of an important property: unbiasedness.
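Continuing the sketch in code, we can tally the nine sample means into the sampling distribution and check its expected value:

```python
from collections import Counter
from fractions import Fraction
from itertools import product

population = [1, 2, 3]
samples = list(product(population, repeat=2))  # the 9 equally likely samples

# Tally the sample means; each sample has probability 1/9.
counts = Counter(Fraction(sum(s), len(s)) for s in samples)
distribution = {mean: Fraction(count, len(samples)) for mean, count in counts.items()}

for mean, prob in sorted(distribution.items()):
    print(mean, prob)

# Expected value of the sampling distribution of the sample mean.
expected = sum(mean * prob for mean, prob in distribution.items())
print(expected)  # 2, the true population mean
```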
11.5 Properties of Good Estimators
We began with infinitely many possible estimators. To choose among them, we need criteria for what makes an estimator “good.” Statisticians have identified many desirable properties, but we’ll focus on three fundamental ones:
- Unbiasedness
- Efficiency
- Consistency
Property 1: Unbiasedness
An estimator is unbiased if the expected value of its sampling distribution equals the true population parameter being estimated.
For the sample mean as an estimator of the population mean:
\[E[\bar{y}] = \mu\]
We can prove this generally. For a sample \(y_1, y_2, \ldots, y_n\) drawn from a population with mean \(\mu\):
\[ \begin{aligned} E[\bar{y}] &= E\left[\frac{1}{n}\sum_{i=1}^{n} y_i\right] \\ &= \frac{1}{n}\sum_{i=1}^{n} E[y_i] \\ &= \frac{1}{n}\sum_{i=1}^{n} \mu \\ &= \frac{1}{n} \cdot n\mu \\ &= \mu \end{aligned} \]
The sample mean is an unbiased estimator of the population mean. It gives us the correct answer “on average” across repeated samples.
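A simulation sketch makes the idea tangible. Below we average three of the candidate estimators from earlier over many repeated samples; the normal population with \(\mu = 5\), the sample size, and the number of repetitions are arbitrary choices for illustration:

```python
import random
import statistics

random.seed(2)

mu, sigma, n, reps = 5.0, 2.0, 10, 20_000

estimators = {
    "sample mean":       lambda y: sum(y) / len(y),  # unbiased
    "first observation": lambda y: y[0],             # also unbiased: E[y_1] = mu
    "sum of first two":  lambda y: y[0] + y[1],      # biased: its expectation is 2 * mu
}

# Average each estimator across many repeated samples and compare with mu = 5.
for name, estimator in estimators.items():
    estimates = [
        estimator([random.gauss(mu, sigma) for _ in range(n)])
        for _ in range(reps)
    ]
    print(name, round(statistics.mean(estimates), 2))  # roughly 5, 5, and 10
```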
But here’s the challenge: infinitely many estimators are unbiased. The first observation alone, \(\hat{\mu}_1 = y_1\), is unbiased because \(E[y_1] = \mu\), and so is any weighted average of the observations whose weights sum to one. Applying the filter of unbiasedness therefore still leaves us with infinitely many candidates. We need additional criteria.
Property 2: Efficiency
Among all unbiased estimators, we prefer the one with the smallest variance in its sampling distribution.
Why do we care about efficiency? If two estimators are both unbiased (correct on average), but one has a tighter sampling distribution, we have more confidence in estimates from the tighter distribution. Each individual estimate is more likely to be close to the true parameter value.
Imagine two unbiased estimators, A and B:
- Estimator A: \(E[\hat{\mu}_A] = \mu\) and \(\text{Var}(\hat{\mu}_A) = 0.5\)
- Estimator B: \(E[\hat{\mu}_B] = \mu\) and \(\text{Var}(\hat{\mu}_B) = 2.0\)
Both are unbiased, but A is more efficient. Estimates from A will cluster more tightly around \(\mu\) than estimates from B.
The remarkable result: Among all unbiased estimators of the population mean that are weighted averages of the observations, the sample mean has the smallest variance; it is the most efficient such estimator. (For normally distributed data, it is in fact the minimum-variance unbiased estimator of any form.)
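To see efficiency in a simulation sketch: the first observation and the sample mean are both unbiased for \(\mu\), but their sampling distributions have very different spreads (the normal population and the settings below are again arbitrary choices for illustration):

```python
import random
import statistics

random.seed(3)

mu, sigma, n, reps = 5.0, 2.0, 10, 20_000

def sampling_variance(estimator):
    # Variance of the estimator's estimates across repeated samples of size n.
    estimates = [
        estimator([random.gauss(mu, sigma) for _ in range(n)])
        for _ in range(reps)
    ]
    return statistics.variance(estimates)

print(sampling_variance(lambda y: y[0]))             # roughly sigma^2 = 4.0
print(sampling_variance(lambda y: sum(y) / len(y)))  # roughly 0.4, an order of magnitude smaller
```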
Variance of the Sample Mean
For a sample of size \(n\) drawn from a population with variance \(\sigma^2\), the variance of the sampling distribution of \(\bar{y}\) is:
\[ \text{Var}(\bar{y}) = \frac{\sigma^2}{n} \]
We can derive this using our transformation rules:
\[ \begin{aligned} \text{Var}(\bar{y}) &= \text{Var}\left(\frac{1}{n}\sum_{i=1}^{n} y_i\right) \\ &= \frac{1}{n^2} \text{Var}\left(\sum_{i=1}^{n} y_i\right) \\ &= \frac{1}{n^2} \sum_{i=1}^{n} \text{Var}(y_i) \quad \text{(assuming independence)} \\ &= \frac{1}{n^2} \cdot n\sigma^2 \\ &= \frac{\sigma^2}{n} \end{aligned} \]
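For the \(\{1, 2, 3\}\) population from earlier, we can check this formula exactly: \(\sigma^2 / n = (2/3)/2 = 1/3\), and the variance of the nine equally likely sample means works out to exactly \(1/3\) as well. A short sketch:

```python
from fractions import Fraction
from itertools import product

population = [1, 2, 3]
samples = list(product(population, repeat=2))  # the 9 equally likely samples of size 2

means = [Fraction(sum(s), len(s)) for s in samples]
mu_bar = sum(means) / len(means)  # E[y-bar] = 2

# Variance of the sampling distribution of the sample mean.
var_ybar = sum((m - mu_bar) ** 2 for m in means) / len(means)
print(var_ybar)  # 1/3, which equals sigma^2 / n = (2/3) / 2
```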
This formula will become crucial in our next property.
Property 3: Consistency
Both unbiasedness and efficiency are properties that hold for a fixed sample size \(n\). Our third property asks: what happens as \(n\) increases?
This property is formalized by the Law of Large Numbers, which states that as the sample size grows, the sample mean converges to the population mean.
Looking at our variance formula:
\[ \text{Var}(\bar{y}) = \frac{\sigma^2}{n} \]
As \(n\) increases, the variance decreases. As \(n \rightarrow \infty\):
\[ \lim_{n \rightarrow \infty} \text{Var}(\bar{y}) = \lim_{n \rightarrow \infty} \frac{\sigma^2}{n} = 0 \]
The sampling distribution collapses onto a single point: as \(n\) grows without bound, the sample mean is all but guaranteed to fall arbitrarily close to the population mean. An estimator with this behavior is called consistent, so the sample mean is a consistent estimator.
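A final simulation sketch shows consistency in action for the \(\{1, 2, 3\}\) population: as \(n\) grows, sample means cluster ever more tightly around \(\mu = 2\) (the particular sample sizes and number of repetitions are arbitrary choices):

```python
import random
import statistics

random.seed(4)

population = [1, 2, 3]
reps = 5_000

for n in (2, 10, 100, 1000):
    # Variance of the sample mean across repeated samples of size n,
    # drawn with replacement from the population.
    means = [sum(random.choices(population, k=n)) / n for _ in range(reps)]
    print(n, round(statistics.variance(means), 5))  # shrinks roughly like (2/3) / n
```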
11.6 Summary: Why the Sample Mean?
We can now answer definitively why we use the sample mean to estimate the population mean:
- Unbiasedness: The sample mean is correct on average across repeated samples
- Efficiency: Among all unbiased estimators, it has the smallest variance
- Consistency: As sample size increases, the sample mean converges to the population mean
These three properties—proven mathematically, not assumed—make the sample mean the optimal choice for estimating population means in the frequentist framework.
11.7 Interpretive Questions
11.8 Looking Ahead
In this chapter, we’ve established the fundamental philosophy of frequentist statistics and proven that the sample mean is an optimal estimator of the population mean. In the next chapter, we’ll turn our attention to estimating the population variance and develop the tools necessary for hypothesis testing—the framework that allows us to make formal claims about statistical significance.
The journey from these theoretical foundations to practical hypothesis testing may seem long, but every step is essential. Statistical inference is not a collection of arbitrary formulas—it is a coherent mathematical framework built on rigorous principles. Understanding these principles will make you a more thoughtful and capable analyst.