Data Analytics for Critical Thinkers

21  Dichotomous Choice Modeling

21.1 Introduction: A Problem of Transportation Planning

In the mid-1960s, traffic congestion in the Bay Area had reached a critical point. The California Highway Commission faced a fundamental decision: should it continue investing in freeway expansion, or could a new mass transit system offer a better path forward? The Commission proposed an ambitious solution—a network of buses and rail that would connect the region, fundamentally reshaping how people commuted.

But there was a problem. Before committing billions in public resources, the Commission needed to answer a deceptively simple question: How many people would actually use this system?

In 1969, with the first BART station under construction, the Commission had to evaluate the project's pilot phase. They needed to estimate ridership—not from hunches or optimistic projections, but from actual data about people's choices. So they conducted an extensive survey of Bay Area residents, asking a seemingly straightforward question: Would you take the bus instead of driving?

Yes or no.

This binary question—a dichotomous choice—would unlock something far more significant than transit planning. It would lead to the discovery of a new statistical framework that would transform how economists, marketers, and policymakers understand decision-making itself.

21.2 The Birth of a Framework: Dan McFadden’s Insight

The Commission’s first instinct was to use standard regression—treating the yes/no responses as if they were continuous measurements. Using this linear probability model, they estimated that about 15% of Bay Area residents would use the new transit system.

But then they hired a young economist named Dan McFadden, recently arrived at UC Berkeley. McFadden looked at the problem differently. He recognized something fundamental: when people make discrete choices—yes or no, use transit or drive, buy or don't buy—the standard tools of regression analysis are mismatched to the problem.

McFadden developed a new approach using what he called latent variable models. The insight was elegant: behind every observed choice lies an unobserved psychological disposition. When someone decides whether to take the bus, they’re processing information about commute time, cost, convenience, and comfort—all of which feed into a latent evaluation of the option. When that latent evaluation exceeds some threshold, they choose to use transit.

Using this framework, McFadden predicted that only about 6.3% of residents would use BART. His colleagues dismissed this as too pessimistic. Yet when BART opened and ridership was measured, it came in at 6.2%—remarkably close to McFadden’s prediction.

This work on discrete choice modeling was so significant that in 2000—more than three decades later—McFadden was awarded the Nobel Prize in Economics. The Nobel citation recognized his contribution: “he showed how to statistically handle fundamental aspects of microdata, namely data on the most important decisions we make in life: the choice of education, occupation, place of residence, marital status, number of children, so called discrete choices.”

Today, the methods McFadden pioneered are used everywhere: predicting consumer behavior, understanding labor market decisions, analyzing election outcomes, and evaluating policy interventions.

21.3 Back to Boston: Understanding Commuting Choices

Let’s return to a more focused question, the kind that McFadden’s methods are perfect for answering. In the 1980s, researchers Ben-Akiva and Lerman interviewed commuters in Boston about their transportation choices. Their specific research question was simple but important: Does the difference in commute time between car and bus affect people’s mode choice?

To answer this, they surveyed 21 commuters, collecting data on:

  • Their actual commute times by car and by bus
  • Their actual commuting choice (0 = drove to work, 1 = took the bus)

The data tells a clear visual story. When the difference in commute time favors driving (negative values), people drive. When the difference favors the bus (positive values), people take transit. Yet there’s variation even within these patterns—some people take the bus despite longer commute times, and others drive even when the bus would be faster.

Our task is to model this relationship: to understand how the difference in commute time influences the probability that someone will choose transit.

21.4 The Binary Probability Function

To begin, let’s establish some basic foundations. When people make dichotomous (two-choice) decisions, we can describe the outcome using the Bernoulli distribution.

Important: The Bernoulli Distribution

For a dichotomous outcome \(Y\) that takes the value 1 with probability \(p\) and the value 0 with probability \((1-p)\), the probability function is:

\[f(y) = p^y(1-p)^{1-y}\]

The expected value of \(Y\) is simply:

\[E(Y) = (1-p) \times 0 + p \times 1 = p\]

In our commuting example, \(Y = 1\) represents choosing transit and \(Y = 0\) represents choosing a car. The probability \(p\) represents the probability that an individual will choose transit, given their specific circumstances.
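To make this concrete, here is a minimal Python sketch of the Bernoulli probability function and its mean; the value p = 0.3 is purely illustrative, not an estimate from the commuting data.

import numpy as np

# Bernoulli probability function: f(y) = p^y * (1 - p)^(1 - y)
def bernoulli_pmf(y, p):
    return p**y * (1 - p)**(1 - y)

p = 0.3                       # illustrative value, not an estimate
print(bernoulli_pmf(1, p))    # 0.3 = Pr(Y = 1)
print(bernoulli_pmf(0, p))    # 0.7 = Pr(Y = 0)

# E(Y) = p: the average of many simulated 0/1 outcomes approaches p
rng = np.random.default_rng(0)
print(rng.binomial(1, p, size=100_000).mean())   # roughly 0.3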

Following standard econometric practice, we decompose the observed outcome into a deterministic part (what we can predict) and a stochastic part (random variation):

\[Y_i = p_i + \epsilon_i\]

where \(p_i\) is the predicted probability for individual \(i\) and \(\epsilon_i\) is the error term.

The key question becomes: How does the difference in commute times relate to \(p\)?

21.5 The Linear Probability Model: A First Attempt

The most straightforward approach is the Linear Probability Model (LPM), which assumes a linear relationship between the commute time difference and the probability of choosing transit:

\[p_i = \beta_0 + \beta_1 \cdot \text{diff}_i\]

where \(\text{diff}_i = \text{car\_time}_i - \text{bus\_time}_i\) is the difference in commute times for individual \(i\).

Note: Question

What are the advantages of starting with a linear model?

21.6 Answer

The linear model is computationally simple and has a clear interpretation: \(\beta_1\) represents the change in probability per unit increase in the commute time difference. It’s easy to estimate using OLS (ordinary least squares) regression, and students already understand how to interpret linear coefficients.

Estimating the LPM

When we estimate this model on the Ben-Akiva commuting data, we get:

\[\widehat{p}_i = 0.515 + 0.007031 \cdot \text{diff}_i\]

The coefficient on diff is positive and statistically significant, confirming that a longer drive time relative to bus time increases the probability of choosing transit. The magnitude suggests that each additional minute of driving time (relative to bus time) increases the probability of choosing the bus by about 0.7 percentage points.

We can also calculate the threshold difference at which someone is indifferent between modes:

\[0.5 = 0.515 + 0.007031 \cdot \text{diff}\]

Solving for the threshold: \(\text{diff} = (0.5 - 0.515)/0.007031 \approx -2.13\) minutes. This means that when the bus takes about 2 minutes longer than the car, we'd predict a 50% probability of choosing transit.
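A quick Python check of these numbers, using the reported LPM estimates (the grid of diff values is purely illustrative). It also previews the first problem discussed in the next subsection: the fitted line escapes the unit interval for extreme differences.

import numpy as np

b0, b1 = 0.515, 0.007031      # reported LPM estimates

# Break-even difference: solve 0.5 = b0 + b1 * diff
print((0.5 - b0) / b1)        # about -2.13 minutes

# Fitted probabilities across a range of commute time differences
diff = np.array([-120.0, -60.0, 0.0, 60.0, 120.0])
print(b0 + b1 * diff)         # below 0 at -120 and above 1 at +120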

Problems with the Linear Probability Model

Despite its simplicity, the LPM has serious problems that make it unsuitable for modeling binary choices:

1. Unbounded predictions: The model can predict values less than 0 or greater than 1, which are nonsensical for probabilities. Looking at our scatter plot, we can see that the fitted line would predict negative probabilities for very negative differences in commute times.

2. Constant marginal effects: The model assumes that each additional minute of commute time difference has the same effect on the choice probability, regardless of the current situation. But this seems unrealistic. The effect of an additional minute likely matters more when someone is on the fence (near 50% probability) than when they’re already strongly committed to one mode.

3. Heteroskedasticity: Since \(Y\) is binary, the variance of the error term is \(\text{Var}(\epsilon_i) = p_i(1-p_i)\), which depends on \(p_i\). This violates the homoskedasticity assumption of OLS, making standard errors unreliable.

Important: The Core Problem

The fundamental issue is that we're using a tool designed for continuous outcomes to model a categorical choice. We need a different approach—one that respects the binary nature of the outcome.

21.7 Latent Variable Models: A Better Framework

McFadden’s insight was to introduce a latent variable—an unobserved psychological disposition that drives the observed choice. Think of it this way:

When a commuter considers transit versus driving, they’re mentally evaluating the overall utility (satisfaction) of each option. Let \(Y^*_i\) represent this latent evaluation of transit relative to driving:

  • If \(Y^*_i > 0\): The net benefit of transit exceeds that of driving → choose transit
  • If \(Y^*_i \leq 0\): Driving is better → choose driving

The observed choice is determined by an indicator function:

\[Y_i = \begin{cases} 1 & \text{if } Y^*_i > 0 \\ 0 & \text{if } Y^*_i \leq 0 \end{cases}\]

We assume the latent variable depends on observed commute times plus unobserved factors:

\[Y^*_i = \beta_0 + \beta_1 \cdot \text{diff}_i + \epsilon_i\]

The error term \(\epsilon_i\) captures all the unmeasured factors affecting choice—personal preferences, comfort sensitivity, environmental consciousness, and so on.

Deriving the Probability

Now we can derive the probability of observing \(Y_i = 1\):

\[\Pr(Y_i = 1) = \Pr(Y^*_i > 0) = \Pr(\epsilon_i > -(\beta_0 + \beta_1 \cdot \text{diff}_i))\]

Let \(F\) denote the cumulative distribution function (CDF) of \(\epsilon_i\). Assuming \(\epsilon_i\) has zero mean and a distribution symmetric around that mean (standard assumptions), we have \(\Pr(\epsilon_i > -z) = \Pr(\epsilon_i < z) = F(z)\) for any \(z\), so:

\[p_i = \Pr(Y_i = 1) = F(\beta_0 + \beta_1 \cdot \text{diff}_i)\]

This is a beautiful result. The probability of choosing transit is a nonlinear function of the commute time difference, determined by whatever distribution we assume for \(\epsilon_i\).

21.8 The Probit Model: Using the Normal Distribution

The probit model assumes \(\epsilon_i\) follows a standard normal distribution:

\[f(\epsilon_i) = \phi(\epsilon_i) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}\epsilon_i^2}\]

\[F(\epsilon_i) = \Phi(\epsilon_i) = \int_{-\infty}^{\epsilon_i} \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}x^2} dx\]

Therefore:

\[p_i = \Phi(\beta_0 + \beta_1 \cdot \text{diff}_i)\]

where \(\Phi\) is the standard normal CDF, familiar from introductory statistics.

Interpreting Probit Coefficients

The probit coefficients don’t have a direct probability interpretation. Instead, they tell us the effect on the latent variable \(Y^*_i\). The actual effect on the probability of choosing transit is the marginal effect:

\[\frac{dp_i}{d(\text{diff}_i)} = \phi(\beta_0 + \beta_1 \cdot \text{diff}_i) \times \beta_1\]

where \(\phi\) is the standard normal probability density function (PDF). Notice that this marginal effect varies depending on the value of the commute time difference—it’s largest when \(\beta_0 + \beta_1 \cdot \text{diff}_i \approx 0\) (near 50% probability) and smaller at the extremes.

21.9 The Logit Model: Using the Logistic Distribution

An alternative to probit is the logit model, which assumes \(\epsilon_i\) follows a logistic distribution:

\[f(\epsilon_i) = \lambda(\epsilon_i) = \frac{e^{\epsilon_i}}{(1+e^{\epsilon_i})^2}\]

\[F(\epsilon_i) = \Lambda(\epsilon_i) = \frac{e^{\epsilon_i}}{1+e^{\epsilon_i}}\]

Therefore:

\[p_i = \Lambda(\beta_0 + \beta_1 \cdot \text{diff}_i) = \frac{e^{\beta_0 + \beta_1 \cdot \text{diff}_i}}{1 + e^{\beta_0 + \beta_1 \cdot \text{diff}_i}}\]

This is the logistic function, which has a particularly nice property. If we take the ratio of the probability of choosing transit to the probability of choosing driving:

\[\frac{p_i}{1-p_i} = e^{\beta_0 + \beta_1 \cdot \text{diff}_i}\]

Taking natural logarithms:

\[\ln\left(\frac{p_i}{1-p_i}\right) = \beta_0 + \beta_1 \cdot \text{diff}_i\]

The left side is the log-odds. This means logit coefficients directly represent changes in the log-odds, which is one reason logit is popular in epidemiology and health research, where odds ratios are the standard effect measure.

Marginal Effects in Logit

The marginal effect in logit is:

\[\frac{dp_i}{d(\text{diff}_i)} = \Lambda(\beta_0 + \beta_1 \cdot \text{diff}_i) \times (1-\Lambda(\beta_0 + \beta_1 \cdot \text{diff}_i)) \times \beta_1\]

This can also be written as \(p_i(1-p_i) \times \beta_1\), showing how the effect of the commute difference depends on the current probability.

21.10 Estimation: Maximum Likelihood

How do we actually estimate the parameters \(\beta_0\) and \(\beta_1\) in probit and logit models? We can’t use OLS because the model isn’t linear. Instead, we use Maximum Likelihood Estimation (MLE).

The core idea is intuitive: choose the parameter values that make the observed data most likely. For binary choice data, the likelihood function is:

\[\mathcal{L} = \prod_{i: Y_i=1} p_i \times \prod_{i: Y_i=0} (1-p_i)\]

In plain language: the likelihood is the product of the predicted probabilities of choosing transit (for those who did) and predicted probabilities of choosing driving (for those who didn’t).
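In practice, software maximizes the logarithm of the likelihood, which turns the product into a sum and avoids numerical underflow:

\[\ln \mathcal{L} = \sum_{i=1}^{n} \left[ Y_i \ln p_i + (1 - Y_i) \ln(1 - p_i) \right]\]

Because the logarithm is a monotonic transformation, the parameters that maximize the log-likelihood also maximize the likelihood itself.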

Note: Question

Why is maximizing the likelihood the right approach for binary choice models?

21.11 Answer

Maximizing likelihood means finding the parameter values that make our observed data as probable as possible. If our model is correct, the parameters that generated the data should make that data more likely than alternative parameter values. This is a fundamental principle of statistical inference that works even when OLS assumptions are violated.

How MLE Works (Conceptually)

  1. Start with an initial guess about the parameter values
  2. For each observation, calculate the predicted probability of their observed choice
  3. Compute the likelihood as the product of these probabilities
  4. Adjust the parameters to increase the likelihood
  5. Repeat until the likelihood stops increasing (convergence)

In practice, software implements sophisticated optimization algorithms (like Newton-Raphson) to perform this search efficiently.
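Here is a minimal Python sketch of these steps for the logit model, using a general-purpose optimizer on the negative log-likelihood. The data are simulated, with true parameter values loosely inspired by the chapter's logit estimates, so this illustrates the mechanics rather than replicating the Ben-Akiva results.

import numpy as np
from scipy.optimize import minimize

# Simulated data: commute time differences and 0/1 transit choices
rng = np.random.default_rng(1)
diff = rng.normal(0, 30, size=200)
p_true = 1 / (1 + np.exp(-(0.24 + 0.05 * diff)))   # hypothetical truth
y = rng.binomial(1, p_true)

# Negative log-likelihood of the logit model
def neg_loglik(beta):
    p = 1 / (1 + np.exp(-(beta[0] + beta[1] * diff)))
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Start from a guess and adjust until convergence (steps 1-5 above)
result = minimize(neg_loglik, x0=[0.0, 0.0], method="BFGS")
print(result.x)   # estimates of (beta_0, beta_1), near 0.24 and 0.05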

21.12 Results: Comparing LPM, Probit, and Logit

When we estimate these three models on the Ben-Akiva commuting data, here’s what we find:

Linear Probability Model:

  • Coefficient on diff: 0.007031 (p < 0.001)
  • Intercept: 0.515

Probit Model:

  • Coefficient on diff: 0.030000 (p = 0.004)
  • Intercept: 0.064434
  • Pseudo-R²: 0.5758

Logit Model:

  • Coefficient on diff: 0.053110 (p = 0.010)
  • Intercept: 0.237575
  • Pseudo-R²: 0.5757

Note that the probit and logit coefficients are much larger than the LPM coefficient. This isn’t because the effects are actually larger—it’s because these coefficients represent changes in the latent variable, not changes in probability.
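For readers who want to reproduce a table like this, here is a sketch using Python's statsmodels. The file name commute.csv and the column names choice and diff are assumptions about how the data might be stored, not references to an actual distributed file.

import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("commute.csv")   # assumed file and column names
X = sm.add_constant(df["diff"])

lpm = sm.OLS(df["choice"], X).fit()        # linear probability model
probit = sm.Probit(df["choice"], X).fit()  # probit via maximum likelihood
logit = sm.Logit(df["choice"], X).fit()    # logit via maximum likelihood

print(lpm.params)
print(probit.params, probit.prsquared)     # coefficients, pseudo-R-squared
print(logit.params, logit.prsquared)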

21.13 Marginal Effects: The Real Story

To compare effects across models, we must look at marginal effects. Let’s evaluate each at the sample mean commute time difference (approximately 1.22 minutes):

LPM Marginal Effect: 0.007031 (constant for all)

Probit Marginal Effect at Mean: 0.0119

Logit Marginal Effect at Mean: 0.0130

These are much more comparable across models! At the average commute time difference, a one-minute increase in the bus time advantage raises the probability of choosing transit by about 1.2 to 1.3 percentage points, somewhat larger than the LPM's constant 0.7-percentage-point effect. That makes sense: the S-shaped curves are steepest near the middle of the data, where the sample mean sits.

The probit and logit marginal effects are similar but not identical. This is typical—they differ slightly because of the different distributional assumptions, but the differences usually matter less than we might think.
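As a check, here is a short Python computation plugging the reported coefficients into the marginal effect formulas from Sections 21.8 and 21.9 at the mean difference of 1.22 minutes:

import numpy as np
from scipy.stats import norm

mean_diff = 1.22                  # sample mean commute time difference

# Probit marginal effect: phi(b0 + b1 * diff) * b1
b0, b1 = 0.064434, 0.030000
print(norm.pdf(b0 + b1 * mean_diff) * b1)        # about 0.0119

# Logit marginal effect: p * (1 - p) * b1
g0, g1 = 0.237575, 0.053110
p = 1 / (1 + np.exp(-(g0 + g1 * mean_diff)))
print(p * (1 - p) * g1)                          # about 0.0130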

21.14 Probit or Logit? A Practical Perspective

Important: Choosing Between Probit and Logit

For most applications, the choice between probit and logit is a matter of taste. Both models typically produce similar results and marginal effects. The differences are usually small compared to specification issues (like omitted variables or functional form choices).

Exceptions and conventions:

  • Probit is preferred when there's reason to believe the errors are normally distributed, or in panel data settings where you want to control for unobserved heterogeneity
  • Logit is preferred in epidemiology because its coefficients have a natural interpretation as log-odds ratios
  • Multinomial logit is the standard choice for multi-choice problems (more than two alternatives)
  • Ordered probit is conventional for ordinal outcomes (like survey responses on a scale)

21.15 A Deeper Look: The Visualization

When we plot the fitted probabilities from our models against the commute time difference, we see a crucial difference from the LPM:

  • The probit and logit curves are S-shaped (sigmoid), respecting the bounds of probability (0 to 1)
  • They’re steepest in the middle where probability is around 50%, reflecting how information matters most when we’re uncertain
  • They flatten at the extremes, where strongly positive or negative commute differences make the choice obvious

The LPM line, by contrast, violates these bounds, predicting negative probabilities for large negative commute differences.
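A brief matplotlib sketch makes this comparison concrete, plotting all three fitted curves from the reported estimates over an illustrative range of differences:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

diff = np.linspace(-100, 100, 400)
lpm = 0.515 + 0.007031 * diff                             # straight line
probit = norm.cdf(0.064434 + 0.030000 * diff)             # S-shaped
logit = 1 / (1 + np.exp(-(0.237575 + 0.053110 * diff)))   # S-shaped

plt.plot(diff, lpm, label="LPM")
plt.plot(diff, probit, label="Probit")
plt.plot(diff, logit, label="Logit")
plt.axhline(0, linestyle=":", color="gray")
plt.axhline(1, linestyle=":", color="gray")
plt.xlabel("Commute time difference (car minus bus, minutes)")
plt.ylabel("Predicted probability of choosing transit")
plt.legend()
plt.show()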

21.16 Computing Marginal Effects in Practice

When estimating these models, most software packages can compute marginal effects automatically. In Stata, for example:

probit choice diff
margins, dydx(diff)

By default, this reports average marginal effects, averaged over the sample; adding the atmeans option evaluates the effect at the sample means instead. You can also evaluate marginal effects at specific values:

margins, dydx(diff) at(diff=30)

This would show the effect of the commute time difference when it equals 30 minutes.
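A rough Python equivalent, assuming the probit results object fitted in the statsmodels sketch from Section 21.12 (statsmodels' get_margeff method computes the same quantities):

from scipy.stats import norm

# Average marginal effects (the statsmodels and Stata defaults)
print(probit.get_margeff().summary())

# Marginal effects at the sample means (like Stata's atmeans option)
print(probit.get_margeff(at="mean").summary())

# Effect at a specific value, diff = 30, straight from the formula
b0, b1 = probit.params
print(norm.pdf(b0 + b1 * 30) * b1)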

21.17 Threshold Analysis: The Break-Even Point

Beyond average marginal effects, policy makers often want to know: At what commute time difference would someone be indifferent between modes?

In the probit model, this corresponds to where \(\beta_0 + \beta_1 \cdot \text{diff} = 0\) (where the latent utility equals zero, giving 50% predicted probability).

Solving: \(\text{diff}^* = -\beta_0 / \beta_1\)

With our estimates: \(\text{diff}^* = -0.064434 / 0.030000 = -2.15\) minutes

This means a commuter is indifferent when the bus takes about 2 minutes longer than driving; with equal commute times, the model predicts a slight lean toward transit. Starting from that break-even point, making the bus even 1 minute faster shifts the probability toward transit; making it 1 minute slower shifts it toward driving.

This kind of threshold analysis is tremendously useful for policy evaluation: How much faster does transit need to be to attract a meaningful share of riders?

21.18 Summary: From Boston to a General Framework

We started with a real problem: How many people would use a new transit system? A simple question about choice led to a revolution in econometric practice.

Dan McFadden recognized that modeling discrete choices required fundamentally different statistical tools. The latent variable framework—where observed choices reflect thresholds of underlying continuous utilities—proved to be exactly the right abstraction.

Today, when a company wants to predict product adoption, a city wants to forecast ridership, a researcher wants to understand voting behavior, or a policymaker wants to evaluate a program’s effects on people’s decisions, they use the tools McFadden developed.

The methods you’ve learned—probit and logit, maximum likelihood estimation, marginal effect calculation—are the workhorses of modern applied economics. They appear in policy analysis, business strategy, epidemiology, political science, and marketing. Understanding them deeply, as you now do, opens doors to analyzing and understanding human choice across virtually every domain.

21.19 Key Formulas Reference

Probit Model: \[p_i = \Phi(\beta_0 + \beta_1 X_i)\]

Logit Model: \[p_i = \frac{e^{\beta_0 + \beta_1 X_i}}{1 + e^{\beta_0 + \beta_1 X_i}}\]

Marginal Effect (Probit): \[\frac{dp_i}{dX_i} = \phi(\beta_0 + \beta_1 X_i) \cdot \beta_1\]

Marginal Effect (Logit): \[\frac{dp_i}{dX_i} = p_i(1-p_i) \cdot \beta_1\]

Log-Odds (Logit): \[\ln\left(\frac{p_i}{1-p_i}\right) = \beta_0 + \beta_1 X_i\]