Data Analytics for Critical Thinkers

12  Estimating the population mean

In this chapter, we’ll explore three fundamental properties of statistical estimators that form the backbone of statistical inference. We’ll prove that the sample mean is an unbiased estimator of the population mean, demonstrate that it has the smallest variance among unbiased estimators built as weighted averages of the observations, and examine why the sample variance requires a correction factor. These proofs are not merely mathematical exercises—they reveal deep truths about how we can reliably learn about populations from samples.

By the end of this chapter, you will understand:

  • Why the sample mean is an unbiased estimator of the population mean
  • Why the sample mean has the smallest variance among unbiased weighted averages of the observations
  • Why the sample variance with \(n\) in the denominator is biased, and how dividing by \(n-1\) corrects the bias

12.1 The Unbiasedness of the Sample Mean

Let’s begin with a fundamental question that underlies all of statistical inference.

Question

How do we know that the sample mean is a reliable estimator of the population mean? Could there be systematic error in our estimates?

The sample mean is unbiased, meaning that if we could take infinitely many samples and calculate the mean for each one, the average of all those sample means would exactly equal the population mean. This property holds regardless of sample size or the shape of the population distribution—it’s assumption-free except for requiring that the population mean is finite.

Understanding Unbiasedness

An estimator is unbiased if its expected value equals the parameter it’s trying to estimate. For the sample mean \(\bar{Y}\) estimating the population mean \(\mu\), we want to show:

\[ E[\bar{Y}] = \mu \]

This is a powerful property because it guarantees that our estimator has no systematic tendency to overestimate or underestimate the true parameter. Some samples will give us values above \(\mu\), others below, but on average—across infinitely many samples—we hit the target exactly.

The Proof

The proof is remarkably elegant. Let’s work through it step by step.

We start with the definition of the sample mean:

\[ \bar{Y} = \frac{Y_1 + Y_2 + \cdots + Y_n}{n} \]

Taking the expected value of both sides:

\[ E[\bar{Y}] = E\left[\frac{Y_1 + Y_2 + \cdots + Y_n}{n}\right] \]

Since \(n\) is a constant (our fixed sample size), we can factor it out using the linearity of expectation:

\[ E[\bar{Y}] = \frac{1}{n} E[Y_1 + Y_2 + \cdots + Y_n] \]

Using the linearity property again, the expectation of a sum equals the sum of expectations:

\[ E[\bar{Y}] = \frac{1}{n} \left(E[Y_1] + E[Y_2] + \cdots + E[Y_n]\right) \]

Now comes the key insight. Each observation \(Y_i\) is drawn from the same population, so each has the same expected value—the population mean \(\mu\). Think about it this way: if you could observe infinitely many “first observations” from different samples, their distribution would look exactly like the population distribution, and their mean would be \(\mu\). The same holds for the second observation, the third, and all others.

Therefore:

\[ E[\bar{Y}] = \frac{1}{n}(\mu + \mu + \cdots + \mu) = \frac{1}{n}(n\mu) = \mu \]

The proof is complete. ∎

Important: Assumption-Free Result

This proof makes no assumptions about:

  • The sample size \(n\) (it can be as small as 2)
  • The distribution of the population (it can be skewed, multimodal, or anything)
  • The variance of the population (it doesn’t even need to exist)

The only requirements are that the observations are drawn from the population of interest, so that \(E[Y_i] = \mu\) for every observation, and that \(\mu\) is finite. There are some exotic distributions (such as the Cauchy distribution) whose mean is undefined, but for all practical purposes, if the population has a finite mean, the sample mean is an unbiased estimator of it.
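
A quick simulation makes this tangible. The sketch below is a minimal illustration, assuming NumPy is available; the exponential population, the sample size of 5, and the 100,000 replications are arbitrary choices rather than anything prescribed here. It averages the sample means from many small samples drawn from a strongly skewed population and shows that this average lands on \(\mu\).

```python
import numpy as np

rng = np.random.default_rng(42)

mu = 2.0        # population mean of an Exponential distribution with scale 2
n = 5           # deliberately tiny sample size
reps = 100_000  # number of simulated samples

# Draw many samples of size n from a heavily skewed population
samples = rng.exponential(scale=mu, size=(reps, n))
sample_means = samples.mean(axis=1)

# Individual sample means scatter widely, but their average sits on mu
print(f"population mean:         {mu:.4f}")
print(f"average of sample means: {sample_means.mean():.4f}")
```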

12.2 The Efficiency of the Sample Mean

Proving unbiasedness is just the first step. After all, there are infinitely many unbiased estimators of the population mean. We need another criterion to choose among them.

Question

If there are infinitely many unbiased estimators, how do we decide which one to use? What makes the sample mean special?

Among unbiased estimators that combine the observations with fixed weights, the sample mean has the smallest variance—a property called efficiency. This means it gives us the most precise estimates on average. While other unbiased estimators might occasionally give better results in particular samples, the sample mean is the most reliable in the long run.

The Problem of Too Many Estimators

Consider this: the first observation \(Y_1\) by itself is an unbiased estimator of \(\mu\) since \(E[Y_1] = \mu\). So is the second observation. So is any weighted average of your observations, as long as the weights sum to one. For example:

\[ T = \frac{1}{4}Y_1 + \frac{3}{4}Y_2 \]

This is unbiased (you can verify that \(E[T] = \mu\)). But it completely ignores observations 3 through \(n\) if you have more data! Intuitively, this seems wasteful. We need a way to filter these infinitely many unbiased estimators.
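
Before introducing that filter, it is worth seeing the problem in numbers. Here is a minimal sketch (assuming NumPy; the normal population and all parameters are illustrative) confirming that \(T\) is centered on \(\mu\) yet fluctuates far more from sample to sample than the full sample mean does.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, reps = 10.0, 3.0, 8, 100_000

samples = rng.normal(loc=mu, scale=sigma, size=(reps, n))

# T uses only the first two observations, with weights 1/4 and 3/4
T = 0.25 * samples[:, 0] + 0.75 * samples[:, 1]
ybar = samples.mean(axis=1)

print(f"mean of T:    {T.mean():.3f}   (both track mu = {mu})")
print(f"mean of Ybar: {ybar.mean():.3f}")
print(f"std of T:     {T.std():.3f}   (much noisier)")
print(f"std of Ybar:  {ybar.std():.3f}")
```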

The Efficiency Criterion

We use efficiency as our second filter. An efficient estimator is one that has the minimum variance among all unbiased estimators. Why variance? Because variance measures precision—how much our estimates vary from sample to sample. Lower variance means more consistent, reliable estimates.

Setting Up the Proof

Let’s define a general unbiased estimator as a weighted combination of our observations:

\[ T = \sum_{i=1}^{n} A_i Y_i \]

where the \(A_i\) are arbitrary weights (constants). Since we’re restricting attention to unbiased estimators, we require:

\[ E[T] = \mu \]

Let’s see what this constraint implies. Taking expectations:

\[ E[T] = E\left[\sum_{i=1}^{n} A_i Y_i\right] = \sum_{i=1}^{n} A_i E[Y_i] = \sum_{i=1}^{n} A_i \mu = \mu \sum_{i=1}^{n} A_i \]

For this to equal \(\mu\), we need:

\[ \sum_{i=1}^{n} A_i = 1 \]

This is our first constraint: the weights must sum to one. The sample mean satisfies this with \(A_i = 1/n\) for all \(i\).

Finding the Variance

Now let’s calculate the variance of our general estimator \(T\):

\[ \text{Var}(T) = \text{Var}\left(\sum_{i=1}^{n} A_i Y_i\right) \]

Since the \(A_i\) are constants and the observations are independent:

\[ \text{Var}(T) = \sum_{i=1}^{n} A_i^2 \text{Var}(Y_i) = \sum_{i=1}^{n} A_i^2 \sigma^2 = \sigma^2 \sum_{i=1}^{n} A_i^2 \]

where \(\sigma^2\) is the population variance (assumed to be the same for all observations, since they’re all drawn from the same population).

The Key Inequality

Now we employ a clever algebraic trick. Notice that:

\[ \sum_{i=1}^{n} A_i^2 = \sum_{i=1}^{n} \left(A_i - \frac{1}{n}\right)^2 + \frac{1}{n} \]

This might seem to come out of nowhere, but let’s verify it by expanding the squared term:

\[\begin{align}
\sum_{i=1}^{n} \left(A_i - \frac{1}{n}\right)^2
  &= \sum_{i=1}^{n} \left(A_i^2 - 2A_i \cdot \frac{1}{n} + \frac{1}{n^2}\right)\\
  &= \sum_{i=1}^{n} A_i^2 - \frac{2}{n}\sum_{i=1}^{n} A_i + \sum_{i=1}^{n}\frac{1}{n^2}\\
  &= \sum_{i=1}^{n} A_i^2 - \frac{2}{n}(1) + \frac{n}{n^2}\\
  &= \sum_{i=1}^{n} A_i^2 - \frac{2}{n} + \frac{1}{n}\\
  &= \sum_{i=1}^{n} A_i^2 - \frac{1}{n}
\end{align}\]

where we used the constraint that \(\sum A_i = 1\). Rearranging gives us the identity we claimed.
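
The identity is easy to sanity-check numerically. The sketch below (NumPy again; the weights are just random numbers rescaled to sum to one) evaluates both sides for an arbitrary weight vector.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6

# Any weights that sum to one will do
A = rng.random(n)
A /= A.sum()

lhs = np.sum(A**2)
rhs = np.sum((A - 1/n)**2) + 1/n

print(f"sum of A_i^2:               {lhs:.6f}")
print(f"sum of (A_i - 1/n)^2 + 1/n: {rhs:.6f}")   # identical up to rounding
```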

Completing the Proof

Since squares are always non-negative, we have:

\[ \sum_{i=1}^{n} \left(A_i - \frac{1}{n}\right)^2 \geq 0 \]

with equality if and only if \(A_i = 1/n\) for all \(i\). Therefore:

\[ \sum_{i=1}^{n} A_i^2 \geq \frac{1}{n} \]

Multiplying both sides by \(\sigma^2\):

\[ \text{Var}(T) = \sigma^2 \sum_{i=1}^{n} A_i^2 \geq \frac{\sigma^2}{n} = \text{Var}(\bar{Y}) \]

where the final equality holds because setting \(A_i = 1/n\) in the variance formula above gives \(\text{Var}(\bar{Y}) = \sigma^2 \sum_{i=1}^{n} (1/n)^2 = \sigma^2/n\). The minimum variance is therefore achieved when \(A_i = 1/n\) for all \(i\)—which is precisely the sample mean! ∎
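
You can also watch the inequality at work. The following sketch (NumPy; the three weighting schemes and the parameters are illustrative choices) simulates several unbiased weightings and compares their observed variances with the theoretical value \(\sigma^2 \sum_i A_i^2\); equal weights come out smallest.

```python
import numpy as np

rng = np.random.default_rng(7)
mu, sigma, n, reps = 0.0, 2.0, 5, 200_000

samples = rng.normal(loc=mu, scale=sigma, size=(reps, n))

# Three sets of weights, each summing to one (all yield unbiased estimators)
schemes = {
    "equal (sample mean)": np.full(n, 1 / n),
    "front-loaded":        np.array([0.4, 0.3, 0.15, 0.1, 0.05]),
    "first two only":      np.array([0.25, 0.75, 0.0, 0.0, 0.0]),
}

for name, A in schemes.items():
    estimates = samples @ A              # T = sum_i A_i * Y_i for every sample
    theory = sigma**2 * np.sum(A**2)     # Var(T) = sigma^2 * sum A_i^2
    print(f"{name:22s} simulated var = {estimates.var():.4f}   theory = {theory:.4f}")
```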

The Meaning of Efficiency

The sample mean is efficient because it makes optimal use of all available information. Giving equal weight to each observation minimizes the variance of our estimate. Any other weighting scheme—whether it’s emphasizing early observations, ignoring some data, or using unequal weights—will produce a less precise estimator.

This result has a beautiful interpretation: democracy in data is optimal. No observation deserves more weight than any other when they’re all drawn from the same population.

12.3 The Bias of the Sample Variance

Having established the virtues of the sample mean, we now turn to a more subtle problem: estimating the population variance \(\sigma^2\).

NoteQuestion

The natural estimator of population variance would seem to be the average squared deviation from the sample mean: \(\frac{1}{n}\sum_{i=1}^{n}(Y_i - \bar{Y})^2\). Why do we use \(n-1\) instead of \(n\) in the denominator? Is this just a convention, or is there a deeper reason?

The formula with \(n\) in the denominator is actually biased—it systematically underestimates the true population variance. Using \(n-1\) corrects this bias, giving us an unbiased estimator. This isn’t arbitrary; it follows from a careful mathematical analysis of how sample statistics relate to population parameters.

Defining the Sample Variance

Let’s define what we’ll call the “uncorrected” sample variance:

\[ S^2 = \frac{1}{n}\sum_{i=1}^{n}(Y_i - \bar{Y})^2 \]

This is the natural definition—it’s the average squared deviation from the sample mean. But is it unbiased? To answer this, we need to calculate \(E[S^2]\) and see if it equals \(\sigma^2\).

A Clever Algebraic Manipulation

The key to this proof is recognizing that we can rewrite each deviation \((Y_i - \bar{Y})\) in terms of deviations from the true population mean \(\mu\):

\[ Y_i - \bar{Y} = (Y_i - \mu) - (\bar{Y} - \mu) \]

This is just adding and subtracting \(\mu\). Now let’s square both sides:

\[ (Y_i - \bar{Y})^2 = [(Y_i - \mu) - (\bar{Y} - \mu)]^2 \]

Expanding the square:

\[ (Y_i - \bar{Y})^2 = (Y_i - \mu)^2 - 2(Y_i - \mu)(\bar{Y} - \mu) + (\bar{Y} - \mu)^2 \]

Summing Over All Observations

Now sum both sides from \(i=1\) to \(n\):

\[ \sum_{i=1}^{n}(Y_i - \bar{Y})^2 = \sum_{i=1}^{n}(Y_i - \mu)^2 - 2(\bar{Y} - \mu)\sum_{i=1}^{n}(Y_i - \mu) + \sum_{i=1}^{n}(\bar{Y} - \mu)^2 \]

Let’s examine each term carefully. For the middle term, notice that:

\[ \sum_{i=1}^{n}(Y_i - \mu) = \sum_{i=1}^{n}Y_i - n\mu = n\bar{Y} - n\mu = n(\bar{Y} - \mu) \]

So the middle term becomes:

\[ -2(\bar{Y} - \mu) \cdot n(\bar{Y} - \mu) = -2n(\bar{Y} - \mu)^2 \]

For the last term, \((\bar{Y} - \mu)^2\) doesn’t depend on \(i\), so:

\[ \sum_{i=1}^{n}(\bar{Y} - \mu)^2 = n(\bar{Y} - \mu)^2 \]

Putting it all together:

\[ \sum_{i=1}^{n}(Y_i - \bar{Y})^2 = \sum_{i=1}^{n}(Y_i - \mu)^2 - 2n(\bar{Y} - \mu)^2 + n(\bar{Y} - \mu)^2 \]

\[ = \sum_{i=1}^{n}(Y_i - \mu)^2 - n(\bar{Y} - \mu)^2 \]

Therefore, dividing both sides by \(n\):

\[ S^2 = \frac{1}{n}\sum_{i=1}^{n}(Y_i - \mu)^2 - (\bar{Y} - \mu)^2 \]
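
Because this is an algebraic identity, it holds for any data set, and you can check it directly. A minimal sketch (NumPy; the simulated data and the value treated as \(\mu\) are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(3)
mu, n = 5.0, 10                      # treat mu as known for the check
y = rng.normal(loc=mu, scale=2.0, size=n)
ybar = y.mean()

lhs = np.sum((y - ybar)**2) / n                   # S^2 computed directly
rhs = np.sum((y - mu)**2) / n - (ybar - mu)**2    # decomposed form

print(f"S^2 directly:      {lhs:.6f}")
print(f"via decomposition: {rhs:.6f}")   # the two agree up to rounding
```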

Taking Expectations

Now we take the expected value of both sides:

\[ E[S^2] = E\left[\frac{1}{n}\sum_{i=1}^{n}(Y_i - \mu)^2\right] - E[(\bar{Y} - \mu)^2] \]

Let’s evaluate each term. For the first term:

\[ E\left[\frac{1}{n}\sum_{i=1}^{n}(Y_i - \mu)^2\right] = \frac{1}{n}\sum_{i=1}^{n}E[(Y_i - \mu)^2] \]

But \(E[(Y_i - \mu)^2]\) is precisely the definition of the population variance \(\sigma^2\) (the expected squared deviation from the population mean). So:

\[ \frac{1}{n}\sum_{i=1}^{n}E[(Y_i - \mu)^2] = \frac{1}{n} \cdot n\sigma^2 = \sigma^2 \]

For the second term, \(E[(\bar{Y} - \mu)^2]\) is the expected squared deviation of the sample mean from the population mean—which is exactly the variance of the sample mean:

\[ E[(\bar{Y} - \mu)^2] = \text{Var}(\bar{Y}) = \frac{\sigma^2}{n} \]

We proved this earlier when showing that the sample mean is efficient.
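
If you would rather see this value emerge from data than from algebra, the following sketch (NumPy; illustrative parameters) estimates \(E[(\bar{Y} - \mu)^2]\) by brute force and compares it with \(\sigma^2/n\).

```python
import numpy as np

rng = np.random.default_rng(11)
mu, sigma, n, reps = 0.0, 3.0, 6, 200_000

# Sample mean from each of many simulated samples of size n
sample_means = rng.normal(loc=mu, scale=sigma, size=(reps, n)).mean(axis=1)

print(f"simulated E[(Ybar - mu)^2]: {np.mean((sample_means - mu)**2):.4f}")
print(f"sigma^2 / n:                {sigma**2 / n:.4f}")
```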

The Final Result

Combining these results:

\[ E[S^2] = \sigma^2 - \frac{\sigma^2}{n} = \sigma^2\left(1 - \frac{1}{n}\right) = \sigma^2 \cdot \frac{n-1}{n} \]

This is the crucial finding: the expected value of \(S^2\) is not \(\sigma^2\), but rather \(\sigma^2 \cdot \frac{n-1}{n}\). Since \((n-1)/n < 1\) for all \(n > 1\), the uncorrected sample variance systematically underestimates the true population variance. It is biased downward.
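
A simulation shows the bias is exactly the predicted size. The sketch below (NumPy; the normal population and the small sample size of 4 are illustrative) averages the uncorrected \(S^2\) over many samples; note that NumPy's ddof=0 option divides by \(n\).

```python
import numpy as np

rng = np.random.default_rng(21)
mu, sigma, n, reps = 0.0, 2.0, 4, 200_000

samples = rng.normal(loc=mu, scale=sigma, size=(reps, n))

# Uncorrected sample variance: divide by n (ddof=0)
S2 = samples.var(axis=1, ddof=0)

print(f"true sigma^2:       {sigma**2:.4f}")
print(f"average of S^2:     {S2.mean():.4f}")
print(f"(n-1)/n * sigma^2:  {(n - 1) / n * sigma**2:.4f}")   # matches the average
```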

Why the Bias Occurs

The bias arises because we use \(\bar{Y}\) instead of \(\mu\) in our formula. The sample mean is calculated from the same data we’re using to measure variation, so it’s “closer” to the data points than the true mean would be. The deviations \((Y_i - \bar{Y})\) are systematically smaller than the deviations \((Y_i - \mu)\) would be, leading to underestimation.

Using \(n-1\) in the denominator exactly corrects for this bias. The unbiased sample variance is:

\[ s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(Y_i - \bar{Y})^2 \]

since \(E[s^2] = \frac{n}{n-1} \cdot E[S^2] = \frac{n}{n-1} \cdot \sigma^2 \cdot \frac{n-1}{n} = \sigma^2\).
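
Rerunning the same experiment with the \(n-1\) divisor (NumPy's ddof=1 option) removes the bias; as before, this is just an illustrative sketch.

```python
import numpy as np

rng = np.random.default_rng(21)
mu, sigma, n, reps = 0.0, 2.0, 4, 200_000

samples = rng.normal(loc=mu, scale=sigma, size=(reps, n))

s2 = samples.var(axis=1, ddof=1)    # corrected sample variance: divide by n - 1

print(f"true sigma^2:   {sigma**2:.4f}")
print(f"average of s^2: {s2.mean():.4f}")   # no longer biased downward
```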

12.4 A Challenge for You

Now that we’ve proven the sample variance with \(n\) in the denominator is biased, here’s a thought question:

Challenge Question

We know that \(S^2 = \frac{1}{n}\sum_{i=1}^{n}(Y_i - \bar{Y})^2\) is a biased estimator of \(\sigma^2\), with \(E[S^2] = \frac{n-1}{n}\sigma^2\).

Can you propose an unbiased estimator of the population variance? Before looking back at the corrected formula above, use what we’ve learned in this chapter to construct one yourself.

Hint: If you know the expected value of \(S^2\), what simple transformation would make it unbiased?

12.5 Synthesis and Reflection

Let’s step back and consider what these three proofs reveal about the nature of statistical estimation.

The sample mean emerged as the gold standard estimator not by accident, but because it possesses two fundamental virtues: it’s unbiased (correct on average) and efficient (most precise). These aren’t just mathematical curiosities—they have practical implications. When you calculate a sample mean, you can trust that you’re using the best possible estimator given your data.

The sample variance case is more subtle. The natural estimator \(S^2\) turns out to be biased, but the bias is systematic and predictable, allowing us to correct it. This illustrates an important principle: not all biases are equal. A systematic, known bias that we can correct is far less problematic than an unknown or random error.

Moreover, these proofs showcase the power of mathematical statistics. We’re not guessing or using intuition—we’re proving with logical certainty that our estimators have desirable properties. This rigor is what allows statistics to be a reliable tool for scientific inference.

Key Takeaways
  1. The sample mean is unbiased: \(E[\bar{Y}] = \mu\), regardless of sample size or population distribution (as long as \(\mu\) is finite).

  2. The sample mean is efficient: Among unbiased estimators that weight the observations with fixed weights, \(\bar{Y}\) has the minimum variance. Equal weighting is optimal.

  3. The sample variance needs correction: The natural estimator \(\frac{1}{n}\sum(Y_i - \bar{Y})^2\) underestimates \(\sigma^2\) by a factor of \((n-1)/n\). Dividing by \(n-1\) instead of \(n\) corrects this bias.

  4. Degrees of freedom matter: The \(n-1\) denominator reflects that we “lose” one degree of freedom by using \(\bar{Y}\) instead of \(\mu\).

These results form the foundation for much of what follows in statistical inference. Understanding why they’re true—not just memorizing the formulas—will serve you well as we build toward more sophisticated techniques.