
11  Foundations of Frequentist Statistics

In this chapter, we embark on a journey into the heart of frequentist statistical inference—a framework that dominates modern empirical research. At its core, frequentist statistics is about making observations from a sample and then drawing inferences about the broader population from which that sample was drawn. The fundamental question we seek to answer is: How confident can we be that the patterns we observe in our limited sample reflect true patterns in the population?

By the end of this chapter, you will understand the foundational concepts that underpin frequentist inference, including the philosophy of repeated sampling, the nature of estimators, and the mathematical criteria we use to distinguish good estimators from poor ones.

11.1 The Nature of Inferential Statistics

Note: What Are We Really Doing?

Inferential statistics is fundamentally about making observations in sample data and then attempting to extrapolate the connections or patterns we find to the population. When we can extrapolate those patterns with confidence, we say our results are statistically significant. When we cannot, we say our results are not statistically significant.

This distinction—between what we observe in our sample and what we can confidently claim about the population—lies at the heart of all inferential statistics. But what exactly are we making claims about when we talk about populations?

Population Parameters vs. Sample Statistics

When we make claims about a population, we are not making claims about individual observations. After all, populations are conceptually infinite in size. Instead, we make claims about specific parameters of the population’s distribution. The two parameters we encounter most frequently are:

  1. The population mean (\(\mu\)): This is by far the most common parameter we test hypotheses about in applied statistics.

  2. The population variance (\(\sigma^2\)): This parameter is crucial because tests for the population mean often depend on our ability to estimate the population variance.

Because we never truly know the values of \(\mu\) or \(\sigma^2\), we must estimate them using sample data. The corresponding quantities we calculate from our sample are:

  • Sample mean (\(\bar{y}\)): The analog to the population mean
  • Sample variance (\(s^2\)): The analog to the population variance
Important: A Critical Distinction

When we call the sample mean and sample variance “analogs” or “counterparts” to their population equivalents, we mean only that they correspond conceptually. We are not claiming they are equal or even necessarily good estimates. Establishing which sample statistics make good estimators of population parameters is precisely what this chapter is about.

11.2 Transformations of Random Variables

Before we dive into the philosophy of estimation, we need to develop some mathematical machinery. In statistics, we routinely transform data—we take numbers, apply formulas to them, and generate new numbers. Understanding how these transformations affect the mean and variance of our data is essential.

Affine Transformations

Consider a simple but powerful type of transformation called an affine transformation. If we have a random variable \(X\) with mean \(\bar{x}\) and variance \(s^2\), we might create a new variable:

\[Y = mX + c\]

where \(m\) is a multiplicative constant (the slope) and \(c\) is an additive constant (the intercept). This is exactly the form of a linear equation you’ve seen since high school algebra.

The question is: if we know the mean and variance of \(X\), what are the mean and variance of \(Y\)?

We can decompose this affine transformation into two simpler operations:

  1. Translation: \(X \rightarrow X + c\) (adding a constant)
  2. Linear transformation: \(X \rightarrow mX\) (multiplying by a constant)

Properties of Translation

When you add a constant \(c\) to every value in your dataset, creating \(Y = X + c\):

\[ \begin{aligned} \text{Mean of } Y &= \bar{x} + c \\ \text{Variance of } Y &= s^2 \end{aligned} \]

The mean shifts by exactly \(c\), but the variance remains unchanged. Why? Because variance measures the spread of data around the mean, and when you shift all values by the same amount, their relative positions don’t change.

Tip: Connecting to Earlier Concepts

You’ve already encountered this idea when we discussed the \(z\)-transformation. When we subtract the mean from a variable, we’re performing a translation that shifts the entire distribution to have mean zero. The shape and spread of the distribution remain the same.

Properties of Linear Transformation

When you multiply every value by a constant \(m\), creating \(Y = mX\):

\[ \begin{aligned} \text{Mean of } Y &= m\bar{x} \\ \text{Variance of } Y &= m^2 s^2 \\ \text{Standard deviation of } Y &= |m| s \end{aligned} \]

Notice that the variance is multiplied by \(m^2\), not \(m\). This occurs because variance involves squared deviations, so a multiplicative constant gets squared in the process.

Combining Both Transformations

For the full affine transformation \(Y = mX + c\):

\[ \begin{aligned} \text{Mean of } Y &= m\bar{x} + c \\ \text{Variance of } Y &= m^2 s^2 \\ \text{Standard deviation of } Y &= |m| s \end{aligned} \]

These formulas will prove invaluable as we develop more sophisticated statistical techniques.
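
As a sanity check on these rules, here is a minimal sketch in Python (using NumPy, with a slope, intercept, and simulated dataset chosen purely for illustration) that applies an affine transformation and compares the empirical mean, variance, and standard deviation of \(Y\) with the values the formulas predict.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Simulated data with an arbitrary mean and spread.
x = rng.normal(loc=10.0, scale=3.0, size=100_000)

m, c = 2.5, -4.0      # slope and intercept of the affine transformation
y = m * x + c         # Y = mX + c

# Empirical moments of Y versus the values the rules predict.
print("mean of Y:", y.mean(), " predicted:", m * x.mean() + c)
print("var  of Y:", y.var(),  " predicted:", m**2 * x.var())
print("sd   of Y:", y.std(),  " predicted:", abs(m) * x.std())
```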

11.3 The Frequentist Philosophy: Repeated Sampling

We now arrive at the conceptual heart of frequentist statistics. The entire edifice of frequentist inference rests on an imaginary exercise: repeated sampling.

Note: The Thought Experiment

Imagine that we could:

  1. Draw a random sample from the population
  2. Calculate some statistic from that sample
  3. Return the sample to the population
  4. Draw another random sample
  5. Calculate the statistic again
  6. Repeat this process infinitely many times

This thought experiment—sampling repeatedly from the same population—forms the foundation for how we evaluate estimators in frequentist statistics.

Here’s the crucial point: in practice, we only sample once. But theoretically, we imagine what would happen if we could sample infinitely many times. The behavior of our estimator across these hypothetical repeated samples tells us whether it’s a good estimator or not.

The Concept of an Estimator

An estimator is simply a formula that we apply to sample data to estimate a population parameter. Importantly, there are infinitely many possible estimators for any given parameter.

For example, suppose we want to estimate the population mean \(\mu\). Here are just a few of the infinitely many estimators we could choose:

  • The first observation: \(\hat{\mu}_1 = y_1\)
  • The sum of the first two observations: \(\hat{\mu}_2 = y_1 + y_2\)
  • The cube of the first observation: \(\hat{\mu}_3 = y_1^3\)
  • The fourth power of the seventh observation times the sine of the second: \(\hat{\mu}_4 = y_7^4 \times \sin(y_2)\)
  • The sample mean: \(\hat{\mu}_5 = \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i\)

Most of these are obviously terrible estimators. But the point is that we can construct any formula we want, and each formula defines a different estimator. The set of all possible estimators is infinite.
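
To underline that an estimator is nothing more than a formula applied to sample data, the short sketch below (our own illustration, not part of the text's development) draws a single simulated sample and evaluates a few of the candidate formulas above. Each one produces some number; deciding which formulas are trustworthy is exactly what the repeated-sampling machinery that follows is for.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
y = rng.normal(loc=2.0, scale=1.0, size=10)   # one sample of size n = 10

# Each formula below is a different estimator of the population mean.
estimates = {
    "first observation, y1":      y[0],
    "sum of first two, y1 + y2":  y[0] + y[1],
    "cube of first, y1**3":       y[0] ** 3,
    "sample mean, ybar":          y.mean(),
}
for name, value in estimates.items():
    print(f"{name:28s} -> {value:7.3f}")
```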

So how do we choose among them? How do we determine which estimators are “good” and which are “bad”?

The answer lies in examining the sampling distribution of each estimator.

Sampling Distributions

For any estimator, we can imagine the repeated sampling process:

  1. Draw a sample of size \(n\)
  2. Apply the estimator to get an estimate
  3. Record that estimate
  4. Repeat infinitely many times

The distribution of all these estimates is called the sampling distribution of the estimator. Each different estimator has its own sampling distribution.

Important: Key Insight

The sampling distribution is a theoretical construct. We never actually observe it because we only sample once in practice. But by imagining what it would look like, we can develop mathematical criteria for judging the quality of different estimators.

11.4 A Concrete Example: Estimating from a Simple Population

To make these abstract ideas concrete, let’s work through a simple example. Consider a population with only three values: \(\{1, 2, 3\}\). Since there’s one of each value, each has probability \(1/3\) of being selected if we draw randomly from this population.

The True Population Parameters

This is a discrete uniform distribution, and we can easily calculate the true population mean and variance:

\[ \mu = E[Y] = 1 \cdot \frac{1}{3} + 2 \cdot \frac{1}{3} + 3 \cdot \frac{1}{3} = 2 \]

For the variance, we first calculate the expected value of \(Y^2\):

\[ E[Y^2] = 1^2 \cdot \frac{1}{3} + 2^2 \cdot \frac{1}{3} + 3^2 \cdot \frac{1}{3} = \frac{14}{3} \]

Then, using the formula \(\text{Var}(Y) = E[Y^2] - (E[Y])^2\):

\[ \sigma^2 = \frac{14}{3} - 2^2 = \frac{14}{3} - 4 = \frac{2}{3} \]

We can also verify this directly by calculating the squared deviations:

| Value \((y)\) | Deviation \((y - \mu)\) | Squared Deviation \((y - \mu)^2\) |
|---|---|---|
| 1 | -1 | 1 |
| 2 | 0 | 0 |
| 3 | 1 | 1 |

\[ \sigma^2 = \frac{1 + 0 + 1}{3} = \frac{2}{3} \]

So we know that \(\mu = 2\) and \(\sigma^2 = 2/3\).
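
The same arithmetic takes only a few lines of code. This small script (ours, with no dependencies beyond plain Python) reproduces the calculations above for the population \(\{1, 2, 3\}\).

```python
population = [1, 2, 3]
p = 1 / len(population)                     # each value is equally likely

mu = sum(y * p for y in population)         # E[Y] = 2
e_y2 = sum(y**2 * p for y in population)    # E[Y^2] = 14/3
sigma2 = e_y2 - mu**2                       # Var(Y) = 14/3 - 4 = 2/3

print(mu, sigma2)                           # 2.0 0.666...
```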

The Frequentist Approach: Pretending We Don’t Know

Now comes the frequentist thought experiment. Imagine we don’t know these population parameters. Our goal is to estimate \(\mu\) using only sample data.

We’ll draw samples of size \(n = 2\) with replacement. Sampling with replacement means:

  1. Draw an observation and record its value
  2. Return it to the population
  3. Draw a second observation

This ensures that our two observations are independent—the value of the second observation doesn’t depend on the first.

Note: Why Sample With Replacement?

In real research, we don’t literally sample with replacement. But we make the assumption that our samples are small enough relative to the population that each observation is effectively independent of the others. Since populations are conceptually infinite, even a sample of 5,000 is negligible compared to the population size, justifying the independence assumption.

Enumerating All Possible Samples

With a population of size 3 and sample size 2, how many possible samples can we draw (with replacement, where order matters)?

  • First observation: 3 possibilities
  • Second observation: 3 possibilities
  • Total: \(3 \times 3 = 9\) possible samples

Here are all nine possible samples and their means:

| Sample | Values | Sample Mean \(\bar{y}\) |
|---|---|---|
| 1 | (1, 1) | 1.0 |
| 2 | (1, 2) | 1.5 |
| 3 | (1, 3) | 2.0 |
| 4 | (2, 1) | 1.5 |
| 5 | (2, 2) | 2.0 |
| 6 | (2, 3) | 2.5 |
| 7 | (3, 1) | 2.0 |
| 8 | (3, 2) | 2.5 |
| 9 | (3, 3) | 3.0 |

The Sampling Distribution of the Sample Mean

Now we can construct the sampling distribution of \(\bar{y}\). We have nine equally likely samples, so each sample mean has probability \(1/9\):

| Sample Mean \(\bar{y}\) | Frequency | Probability |
|---|---|---|
| 1.0 | 1 | 1/9 |
| 1.5 | 2 | 2/9 |
| 2.0 | 3 | 3/9 |
| 2.5 | 2 | 2/9 |
| 3.0 | 1 | 1/9 |

The expected value of this sampling distribution is:

\[ E[\bar{y}] = 1.0 \cdot \frac{1}{9} + 1.5 \cdot \frac{2}{9} + 2.0 \cdot \frac{3}{9} + 2.5 \cdot \frac{2}{9} + 3.0 \cdot \frac{1}{9} = 2 \]

This equals the true population mean! This is our first glimpse of an important property: unbiasedness.
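
Because this population is tiny, we can enumerate the sampling distribution exactly rather than merely imagine it. The sketch below (our own code, using only the Python standard library) generates all nine ordered samples, tabulates the sample means, and confirms that their expected value is 2.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

population = [1, 2, 3]

# All ordered samples of size 2, drawn with replacement: 3 x 3 = 9 of them.
samples = list(product(population, repeat=2))
means = [Fraction(a + b, 2) for a, b in samples]

# Sampling distribution of the sample mean: each value and its probability.
dist = {m: Fraction(k, len(samples)) for m, k in Counter(means).items()}
for m in sorted(dist):
    print(f"ybar = {float(m):.1f}   probability = {dist[m]}")

# Its expected value equals the population mean.
print("E[ybar] =", sum(m * p for m, p in dist.items()))   # E[ybar] = 2
```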

11.5 Properties of Good Estimators

We began with infinitely many possible estimators. To choose among them, we need criteria for what makes an estimator “good.” Statisticians have identified many desirable properties, but we’ll focus on three fundamental ones:

  1. Unbiasedness
  2. Efficiency
  3. Consistency

Property 1: Unbiasedness

An estimator is unbiased if the expected value of its sampling distribution equals the true population parameter being estimated.

Note: Definition: Unbiased Estimator

An estimator \(\hat{\theta}\) of a population parameter \(\theta\) is unbiased if:

\[E[\hat{\theta}] = \theta\]

In words: if we could sample repeatedly and average all our estimates, we would get the correct answer.

For the sample mean as an estimator of the population mean:

\[E[\bar{y}] = \mu\]

We can prove this generally. For a sample \(y_1, y_2, \ldots, y_n\) drawn from a population with mean \(\mu\):

\[ \begin{aligned} E[\bar{y}] &= E\left[\frac{1}{n}\sum_{i=1}^{n} y_i\right] \\ &= \frac{1}{n}\sum_{i=1}^{n} E[y_i] \\ &= \frac{1}{n}\sum_{i=1}^{n} \mu \\ &= \frac{1}{n} \cdot n\mu \\ &= \mu \end{aligned} \]

The sample mean is an unbiased estimator of the population mean. It gives us the correct answer “on average” across repeated samples.
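
A Monte Carlo illustration of the same fact, under assumptions chosen only for demonstration (an exponential population with mean 2, samples of size 10): averaging the sample means across many simulated samples recovers the population mean, even though the population is skewed.

```python
import numpy as np

rng = np.random.default_rng(seed=7)
mu, n, reps = 2.0, 10, 200_000

# Many independent samples of size n from a skewed population with mean mu.
samples = rng.exponential(scale=mu, size=(reps, n))
sample_means = samples.mean(axis=1)

# Averaging the estimates across repeated samples recovers mu (approximately).
print(sample_means.mean())   # close to 2.0
```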

Important: A Crucial Point

We can prove unbiasedness without knowing the actual value of \(\mu\). We only need to know that the population has a well-defined mean. This is the power of mathematical proof—we can establish properties of estimators without knowing the specific parameters we’re estimating.

But here’s the challenge: infinitely many estimators are unbiased. Applying the filter of unbiasedness still leaves us with infinitely many candidates. We need additional criteria.

Property 2: Efficiency

Among all unbiased estimators, we prefer the one with the smallest variance in its sampling distribution.

Note: Definition: Efficiency

Among unbiased estimators, the most efficient estimator is the one with the smallest variance in its sampling distribution. An estimator with smaller variance produces estimates in a narrower band around the true parameter value.

Why do we care about efficiency? If two estimators are both unbiased (correct on average), but one has a tighter sampling distribution, we have more confidence in estimates from the tighter distribution. Each individual estimate is more likely to be close to the true parameter value.

Imagine two unbiased estimators, A and B:

  • Estimator A: \(E[\hat{\mu}_A] = \mu\) and \(\text{Var}(\hat{\mu}_A) = 0.5\)
  • Estimator B: \(E[\hat{\mu}_B] = \mu\) and \(\text{Var}(\hat{\mu}_B) = 2.0\)

Both are unbiased, but A is more efficient. Estimates from A will cluster more tightly around \(\mu\) than estimates from B.

The remarkable result: among unbiased estimators of the population mean that are built as linear combinations of the observations, the sample mean has the smallest variance. In this sense, it is the most efficient unbiased estimator.
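
A small simulation (our own construction, with an arbitrary normal population) makes the contrast vivid. The first observation alone is also an unbiased estimator of \(\mu\), but across repeated samples its estimates scatter with variance \(\sigma^2\), while the sample mean's estimates cluster with variance close to \(\sigma^2/n\).

```python
import numpy as np

rng = np.random.default_rng(seed=3)
mu, sigma, n, reps = 5.0, 2.0, 25, 100_000

samples = rng.normal(loc=mu, scale=sigma, size=(reps, n))

first_obs   = samples[:, 0]         # estimator A: the first observation only
sample_mean = samples.mean(axis=1)  # estimator B: the sample mean

# Both are centered on mu, but the sample mean varies far less across samples.
print("first obs:   mean", round(first_obs.mean(), 3),
      "variance", round(first_obs.var(), 3))      # roughly 5 and 4
print("sample mean: mean", round(sample_mean.mean(), 3),
      "variance", round(sample_mean.var(), 3))    # roughly 5 and 0.16
```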

Tip: Why We Use Sample Means

This is why we routinely use sample averages to estimate population averages. It’s not arbitrary—it’s mathematically optimal. The sample mean is both unbiased and most efficient among unbiased estimators.

Variance of the Sample Mean

For a sample of size \(n\) drawn from a population with variance \(\sigma^2\), the variance of the sampling distribution of \(\bar{y}\) is:

\[ \text{Var}(\bar{y}) = \frac{\sigma^2}{n} \]

We can derive this using our transformation rules:

\[ \begin{aligned} \text{Var}(\bar{y}) &= \text{Var}\left(\frac{1}{n}\sum_{i=1}^{n} y_i\right) \\ &= \frac{1}{n^2} \text{Var}\left(\sum_{i=1}^{n} y_i\right) \\ &= \frac{1}{n^2} \sum_{i=1}^{n} \text{Var}(y_i) \quad \text{(assuming independence)} \\ &= \frac{1}{n^2} \cdot n\sigma^2 \\ &= \frac{\sigma^2}{n} \end{aligned} \]

This formula will become crucial in our next property.
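
We can check this formula exactly against the \(\{1, 2, 3\}\) population from earlier: with \(\sigma^2 = 2/3\) and \(n = 2\), it predicts \(\text{Var}(\bar{y}) = 1/3\). The short script below (ours, standard library only) computes the variance of the nine enumerated sample means and finds the same value.

```python
from fractions import Fraction
from itertools import product

population = [1, 2, 3]
n = 2

# Every ordered sample of size n (drawn with replacement) is equally likely.
means = [Fraction(sum(s), n) for s in product(population, repeat=n)]
p = Fraction(1, len(means))

e_mean   = sum(m * p for m in means)                  # E[ybar] = 2
var_mean = sum((m - e_mean) ** 2 * p for m in means)  # variance of the sampling distribution

print(var_mean)             # 1/3
print(Fraction(2, 3) / n)   # sigma^2 / n = 1/3, matching the formula
```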

Property 3: Consistency

Both unbiasedness and efficiency are properties that hold for a fixed sample size \(n\). Our third property asks: what happens as \(n\) increases?

Note: Definition: Consistency

An estimator is consistent if its sampling distribution becomes increasingly concentrated around the true parameter value as the sample size increases. In the limit as \(n \rightarrow \infty\), the distribution collapses to a point at the true parameter.

This property is formalized by the Law of Large Numbers, which states that as the sample size grows, the sample mean converges to the population mean.

Looking at our variance formula:

\[ \text{Var}(\bar{y}) = \frac{\sigma^2}{n} \]

As \(n\) increases, the variance decreases. As \(n \rightarrow \infty\):

\[ \lim_{n \rightarrow \infty} \text{Var}(\bar{y}) = \lim_{n \rightarrow \infty} \frac{\sigma^2}{n} = 0 \]

The distribution collapses to a single point. With an arbitrarily large sample, the sample mean is all but guaranteed to lie arbitrarily close to the population mean. The sample mean is a consistent estimator.
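
A quick numerical illustration of consistency (a sketch with sample sizes and a population chosen only for demonstration): as \(n\) grows, the simulated variance of the sample mean tracks \(\sigma^2/n\) downward toward zero.

```python
import numpy as np

rng = np.random.default_rng(seed=11)
sigma, reps = 1.0, 10_000

# As n grows, the variance of the sample mean shrinks like sigma^2 / n.
for n in [4, 16, 64, 256]:
    sample_means = rng.normal(loc=0.0, scale=sigma, size=(reps, n)).mean(axis=1)
    print(f"n = {n:3d}   Var(ybar) ~ {sample_means.var():.5f}   sigma^2/n = {sigma**2 / n:.5f}")
```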

Important: Interpreting Consistency

Consistency gives us confidence that more data leads to better estimates (assuming the data are properly collected). This is the mathematical justification for why researchers pursue larger sample sizes—not just for statistical power, but because larger samples yield more precise estimates.

11.6 Summary: Why the Sample Mean?

We can now answer definitively why we use the sample mean to estimate the population mean:

  1. Unbiasedness: The sample mean is correct on average across repeated samples
  2. Efficiency: Among all unbiased estimators, it has the smallest variance
  3. Consistency: As sample size increases, the sample mean converges to the population mean

These three properties—proven mathematically, not assumed—make the sample mean the optimal choice for estimating population means in the frequentist framework.

11.7 Interpretive Questions

Question 1

Explain in your own words why we can establish that the sample mean is unbiased without knowing the actual value of \(\mu\). What does this tell us about the power of mathematical reasoning in statistics?

Question 2

Consider two researchers studying the same population. Researcher A collects a sample of size 50, while Researcher B collects a sample of size 200. Both use the sample mean as their estimator. How do their sampling distributions differ? Which researcher should have more confidence in their estimate, and why?

Question 3

The frequentist approach relies on imagining repeated sampling, even though we only sample once in practice. Does this make the approach purely theoretical, or does it provide practical value for real-world inference? Defend your answer.

11.8 Looking Ahead

In this chapter, we’ve established the fundamental philosophy of frequentist statistics and proven that the sample mean is an optimal estimator of the population mean. In the next chapter, we’ll turn our attention to estimating the population variance and develop the tools necessary for hypothesis testing—the framework that allows us to make formal claims about statistical significance.

The journey from these theoretical foundations to practical hypothesis testing may seem long, but every step is essential. Statistical inference is not a collection of arbitrary formulas—it is a coherent mathematical framework built on rigorous principles. Understanding these principles will make you a more thoughtful and capable analyst.