The Philomath
Data Analytics for Critical Thinkers

1  The Purpose of Data Analytics

In this chapter, we’ll explore the fundamental purpose and scope of data analytics. By the end of this chapter, you will understand:

  • why data analytics is ultimately about cause and effect, not just correlation
  • how predictable patterns emerge from random individual events
  • the distinction between a population and a sample
  • the two goals of statistical analysis: description and inference
  • the frequentist and Bayesian approaches to inference, and the basic logic of hypothesis testing

1.1 What Is Data Analytics Really About?

Note: Key Question

What is the ultimate goal of data analytics?

Data analytics is fundamentally about understanding cause and effect relationships in the world. While it’s easy to observe that two variables move together—that they are correlated—establishing causation is far more challenging and far more valuable.

Consider a simple example: we might observe that ice cream sales and drowning incidents are correlated. They both increase during summer months. But does ice cream cause drowning? Of course not. Both are caused by a third factor: warm weather, which leads people to buy ice cream and also to swim more frequently.
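To see how a confounder can manufacture correlation without causation, here is a minimal simulation sketch in Python. All variable names and numbers are invented for illustration: warm weather drives both ice cream sales and drownings, and neither variable appears in the other’s data-generating equation, yet the two end up strongly correlated.

```python
# A hypothetical confounder (temperature) drives both variables, producing
# correlation without any causal link between them.
import numpy as np

rng = np.random.default_rng(42)
n = 365  # one simulated year of daily observations

temperature = rng.normal(loc=20, scale=8, size=n)           # the confounder
ice_cream_sales = 50 + 3 * temperature + rng.normal(0, 10, n)
drownings = 0.5 + 0.1 * temperature + rng.normal(0, 1, n)   # no ice cream term here

r = np.corrcoef(ice_cream_sales, drownings)[0, 1]
print(f"Correlation between ice cream sales and drownings: {r:.2f}")
# The correlation is strongly positive even though neither variable
# causes the other in this simulated world.
```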

Warning: Important Distinction

Correlation does not imply causation. Two variables can move together without one causing the other. Establishing causal relationships requires careful analysis and often experimental design.

The distinction between correlation and causation is not merely academic—it has profound implications for how we understand the world and make decisions. Consider the famous closing lines of Robert Frost’s poem “The Road Not Taken”:

Two roads diverged in a wood, and I—
I took the one less traveled by,
And that has made all the difference.

Frost claims that taking the road less traveled “made all the difference” to his life. But as statisticians, we must ask: how does he know? To establish causation, we would need a counterfactual—an alternative version of his life where he took the other road. Without observing this counterfactual, Frost cannot definitively claim that his choice caused the difference in his life’s trajectory. Perhaps his life would have turned out similarly regardless of which road he chose. Or perhaps taking the more traveled road would have led to even better outcomes.

This challenge—the impossibility of observing counterfactuals in our own lives—is precisely what makes causal inference so difficult and why rigorous statistical methods are essential.

In policy work—especially environmental policy and climate science—we need causal understanding. When we ask “how much warming will occur if we add X more tons of carbon dioxide to the atmosphere?”, we’re asking a causal question. The relationship between greenhouse gas concentrations and temperature change is extraordinarily complicated and stochastic. Yet climate scientists have developed good estimates of what is called the global warming potential of different greenhouse gases. These estimates are based on a causal understanding of physical processes, not mere correlation.

This is why data analytics matters: we want to establish cause and effect, not just observe patterns. We’re here to understand how the world works, not just to make pretty pictures or note correlations.

1.2 From Randomness to Pattern

One of the most remarkable features of statistical analysis is how patterns emerge from what initially appears to be pure randomness. When we look at individual observations, they often seem chaotic and unpredictable. But when we collect enough observations, macro-level patterns begin to reveal themselves.

Note: Conceptual Question

How can predictable patterns emerge from random individual events?

While individual events may be unpredictable, the aggregate behavior of many random events often follows predictable patterns. This is the fundamental insight of probability theory—that randomness at the micro level produces regularity at the macro level.

Consider the classic example of a Galton board (sometimes called a bean machine). When a single ball drops through the board, hitting pegs as it falls, its path is essentially random—at each peg, it bounces left or right unpredictably. We cannot predict where any individual ball will land.

However, when we drop hundreds or thousands of balls, a clear pattern emerges: they pile up in the shape of a bell curve, forming what statisticians call the normal distribution. The randomness of individual ball drops gives way to a predictable aggregate pattern.
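Here is a short Python sketch of a Galton board (the number of pegs and balls is an arbitrary choice). Each ball’s landing bin is simply the count of its rightward bounces, and the text histogram that results is bell-shaped even though every individual path is unpredictable.

```python
# Each ball makes `pegs` independent left/right bounces; its final bin is the
# number of rightward bounces. Individual paths are random, but many balls
# pile up in a bell-shaped histogram.
import numpy as np

rng = np.random.default_rng(0)
pegs, balls = 12, 10_000

# Each bounce is a Bernoulli(0.5) trial; summing bounces gives the landing bin.
bins = rng.binomial(n=pegs, p=0.5, size=balls)

counts = np.bincount(bins, minlength=pegs + 1)
for k, c in enumerate(counts):
    print(f"bin {k:2d}: {'#' * (c // 100)}")   # crude text histogram
```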

This emergence of order from randomness is not magic—it’s mathematics. And it’s the foundation of statistical inference.

1.3 Many Patterns, Not Just One

Warning: Beware of Normalitis

One common misconception in statistics is that every pattern follows the normal distribution (the familiar bell curve). This is simply not true. While the normal distribution is important and widely applicable, it is just one of dozens of probability distributions used in statistics.

I call the mistaken belief that everything is normally distributed normalitis—and it’s a condition to avoid.

Different real-world phenomena follow different distributions:

  • Bernoulli distribution: Events with only two possible outcomes (coin flip: heads or tails; ball at a peg: left or right)
  • Binomial distribution: The number of successes in a fixed number of independent Bernoulli trials (how many heads in 10 coin flips?)
  • Poisson distribution: Count data, i.e. the number of events occurring in a fixed interval of time or space (how many customers arrive per hour; how many buses pass your stop in a day). Waiting times between such events follow the related exponential distribution.
  • Normal distribution: Many continuous phenomena in nature and society (heights, test scores, measurement errors)

These distributions are often mathematically related. For instance, when you sum up many independent Bernoulli trials (each ball on the Galton board making left-right decisions), you get a binomial distribution. And when the number of trials becomes very large, that binomial distribution approximates the normal distribution.
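A quick numerical check of that last claim, sketched in Python with SciPy (the particular numbers are arbitrary): the exact binomial probability of a given count is nearly identical to the value given by the normal approximation with matching mean and variance.

```python
# Summing many Bernoulli trials gives a binomial count; for a large number of
# trials the binomial is well approximated by a normal distribution.
import numpy as np
from scipy import stats

n_trials, p = 1000, 0.5

# Exact binomial probability of exactly 520 successes in 1000 trials...
exact = stats.binom.pmf(520, n_trials, p)

# ...compared with the normal approximation (mean np, variance np(1-p)),
# using a continuity correction of +/- 0.5 around the integer value.
mu, sigma = n_trials * p, np.sqrt(n_trials * p * (1 - p))
approx = stats.norm.cdf(520.5, mu, sigma) - stats.norm.cdf(519.5, mu, sigma)

print(f"binomial pmf:         {exact:.5f}")
print(f"normal approximation: {approx:.5f}")
```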

Note: Conceptual Question

The word “Poisson” comes from French. What does it mean, and who was Poisson?

“Poisson” means “fish” in French (related to “Pisces,” the astrological sign). Siméon Denis Poisson was a French mathematician and physicist who discovered this particular distribution, which describes the probability of a given number of events occurring in a fixed interval of time or space.

Throughout this course, we’ll work with many different distributions. Each captures a different kind of pattern in data. The key is learning to recognize which pattern fits which situation—and to never assume that one pattern applies universally.

1.4 Population and Sample

In statistical analysis, we make a crucial distinction between two types of data:

Important: Definitions

Population: All possible data points that exist in the world for a given phenomenon. This includes data that has been collected, data that could be collected, and data that will exist in the future.

Sample: A subset of the population that we have actually collected and can analyze. The sample is always smaller than the population—often a vanishingly small fraction of it.

Consider studying human height. The population would include the heights of all humans who have ever lived, are living now, and will live in the future. That’s an enormous—indeed, infinite—amount of data. Your sample might be the heights of 1,000 people surveyed in a particular city during a particular year.

No matter how large your sample, it remains tiny compared to the population. Even if you collect data on millions of individuals, that’s still just a tiny fraction of the theoretical population. As a mathematical principle:

\[\lim_{n \to \infty} \text{Sample} = \text{Population}\]

As the sample size approaches infinity, it approaches the population. But in practice, our samples are always finite and small relative to the population.
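The idea can be sketched in a few lines of Python. Here a simulated set of one million heights stands in for the “population” (the numbers are invented for illustration), and the sample mean creeps toward the population mean as the sample grows.

```python
# Treat a simulated array of heights as the "population" and watch the
# sample mean approach the population mean as the sample size grows.
import numpy as np

rng = np.random.default_rng(1)
population = rng.normal(loc=170, scale=9, size=1_000_000)  # hypothetical heights in cm
print(f"population mean: {population.mean():.3f}")

for n in (10, 100, 1_000, 100_000):
    sample = rng.choice(population, size=n, replace=False)
    print(f"sample of {n:>7}: mean = {sample.mean():.3f}")
```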

1.5 Two Goals of Statistical Analysis

What do we do with sample data once we collect it? We pursue one or both of two fundamental goals:

1. Description

The first goal is to describe the data we have collected. This is called descriptive statistics. We might:

  • Calculate the average (mean) age in our sample
  • Determine the most common (mode) educational level
  • Find the middle value (median) of family incomes
  • Measure the spread (variance or standard deviation) of environmental commitment scores

Descriptive statistics summarize and organize data in meaningful ways. They help us understand what our sample looks like. When we describe sample data, we’re making statements only about that specific set of observations.
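As a small illustration, the descriptive measures named above can be computed with Python’s standard library alone. The income values below are invented for the example.

```python
# Descriptive statistics on a small, made-up sample of annual incomes.
import statistics

incomes = [32_000, 41_500, 38_200, 55_000, 41_500, 47_800, 39_900, 62_300]

print("mean:              ", statistics.mean(incomes))
print("median:            ", statistics.median(incomes))
print("mode:              ", statistics.mode(incomes))
print("standard deviation:", round(statistics.stdev(incomes), 1))
```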

2. Inference

The second, more ambitious goal is to infer patterns and relationships that extend beyond our sample to the broader population. This is called inferential statistics or statistical inference.

Suppose we collect sample data on 25 different variables for each person: age, education level, commitment to environmental causes, family income, transportation choices, and so on. We might discover relationships among these variables in our sample—for instance, that people with higher education levels tend to show stronger commitment to environmental causes.

The question then becomes: Can we extrapolate this relationship from our tiny sample to the entire population? Can we say with confidence that the relationship we found in this specific dataset also exists more broadly?

This is the central challenge of inferential statistics. We observe patterns in our sample and attempt to make general claims about the population. The entire machinery of statistical inference—hypothesis tests, confidence intervals, p-values, regression analysis—exists to help us make this logical leap from sample to population in a rigorous, quantifiable way.
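To preview what that machinery does, here is a simulation sketch in Python. The variable names, effect size, and sample size are all invented: we build a population in which education and environmental commitment really are related, draw a modest sample, and see how closely the sample relationship tracks the population one.

```python
# Simulate a population with a genuine education-commitment relationship,
# then estimate that relationship from a sample of 200 people.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
N = 1_000_000  # simulated population size

education = rng.normal(14, 3, N)                    # years of schooling
commitment = 0.3 * education + rng.normal(0, 2, N)  # environmental commitment score

pop_corr = np.corrcoef(education, commitment)[0, 1]

idx = rng.choice(N, size=200, replace=False)        # indices of the sampled people
r, p_value = stats.pearsonr(education[idx], commitment[idx])

print(f"population correlation: {pop_corr:.3f}")
print(f"sample correlation:     {r:.3f}  (p-value = {p_value:.4f})")
```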

Note: Reflective Question

Why is it more valuable to make inferences about the population than to simply describe our sample?

Describing our sample tells us only about the specific observations we happened to collect. But policy decisions, scientific theories, and practical applications require understanding that extends beyond our particular sample. We need to know whether the patterns we observe are likely to hold generally, not just in the specific cases we studied. This is what makes statistical inference so powerful and so essential for decision-making.

When we perform inference successfully—when we can say with justified confidence that our sample findings reflect population patterns—we achieve what statisticians call external validity. But before we can even attempt to generalize to the population, we must first ensure that our findings within the sample are sound. When our causal analysis within the sample is properly conducted and the relationships we identify are genuine (not artifacts of confounding variables or measurement error), we say our analysis has internal validity. Both forms of validity are essential for credible statistical work.

1.6 Two Philosophical Approaches to Inference

How many fundamentally different approaches exist for making statistical inferences? The answer is two: the frequentist approach and the Bayesian approach. These represent two distinct philosophical frameworks for reasoning about probability and uncertainty.

The Frequentist Approach

The frequentist approach, which has dominated statistical practice for much of the 20th century, interprets probability in terms of long-run frequencies. From this perspective, probability statements only make sense for events that can be repeated many times.

Consider flipping a coin. A frequentist interprets “the probability of heads is 0.5” to mean: if we flip this coin infinitely many times, heads will appear in 50% of the flips. Probability, in this view, is an objective property of the world—a statement about what would happen if we could repeat an experiment indefinitely.
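The long-run-frequency idea is easy to sketch in Python (the number of flips is arbitrary): the running proportion of heads settles toward 0.5 as the number of flips grows.

```python
# Simulate a fair coin and report the running proportion of heads.
import numpy as np

rng = np.random.default_rng(3)
flips = rng.integers(0, 2, size=1_000_000)  # 0 = tails, 1 = heads

for n in (10, 100, 10_000, 1_000_000):
    print(f"after {n:>9} flips: proportion of heads = {flips[:n].mean():.4f}")
```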

This philosophical stance has important implications. Imagine I flip a coin and catch it in my hand, concealing the result. I know how it landed, but you don’t. What is the probability that it landed heads?

Note: Thought Experiment

I’ve just flipped a coin and caught it in my closed hand. I can see the result, but you cannot. What is the probability that the coin shows heads?

A frequentist would say: the probability is either 0 or 1, depending on how it actually landed. If it landed heads, the probability is 1 (certainty). If it landed tails, the probability is 0 (impossibility). The coin has already landed—there’s nothing probabilistic about it anymore. The event has occurred, and its outcome is now a fact of the world, even if you don’t know what that fact is.

This reveals a key feature of frequentist thinking: probabilities apply to events that haven’t happened yet, not to events that have already occurred but whose outcomes we simply don’t know. From a frequentist perspective, once the coin has landed, talking about the “probability” of how it landed is meaningless. It landed some particular way. The uncertainty you feel is about your knowledge, not about the event itself.

The Bayesian Approach

The Bayesian approach takes a fundamentally different view. Bayesians interpret probability as a measure of our degree of belief or state of knowledge about an event. Probability, from this perspective, is subjective—it represents how confident we are, given the information we have.

Let’s return to the coin in my hand. A Bayesian would say: given that you don’t know how it landed and you have no reason to believe the coin is unfair, your probability assessment should be 0.5. This doesn’t mean the coin is somehow in a superposition of states. Rather, it means that given your current state of knowledge, you should be equally uncertain about whether it shows heads or tails.

If I were to give you a hint—say, “It’s not tails”—a Bayesian would immediately update your probability to 1 for heads. Your degree of belief changes as you gain new information, even though the physical state of the coin hasn’t changed at all.
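Here is a small Bayes’-rule sketch of that updating, with a hypothetical twist not in the story above: instead of a perfectly reliable hint, suppose an informant who is right 90% of the time reports “it’s heads.” Your degree of belief moves from 0.5 to 0.9, even though the coin itself never changes.

```python
# Bayesian updating of a degree of belief about the concealed coin,
# given an imperfect (hypothetical) report of "heads".
prior_heads = 0.5              # belief before any hint
p_report_if_heads = 0.9        # assumed informant accuracy
p_report_if_tails = 0.1        # chance of a mistaken "heads" report

# P(heads | report) = P(report | heads) * P(heads) / P(report)
numerator = p_report_if_heads * prior_heads
evidence = numerator + p_report_if_tails * (1 - prior_heads)
posterior_heads = numerator / evidence

print(f"belief in heads before the hint: {prior_heads:.2f}")
print(f"belief in heads after the hint:  {posterior_heads:.2f}")  # 0.90
```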

Important: Fundamental Philosophical Difference

Frequentist view: Probability is an objective property of repeatable events. It doesn’t make sense to assign probabilities to fixed but unknown quantities.

Bayesian view: Probability represents our degree of belief or state of knowledge. We can assign probabilities to any uncertain proposition, including fixed but unknown quantities.

This philosophical difference leads to very different statistical methodologies. Frequentists develop procedures that work well in the long run—if we used this test over and over, we’d make correct decisions most of the time. Bayesians explicitly incorporate prior knowledge and update their beliefs as new evidence arrives.

Most practicing statisticians today are implicitly Bayesian in their everyday reasoning about uncertainty, even if they use frequentist methods in their formal analyses. When we say “there’s a 70% chance it will rain tomorrow,” we’re thinking like Bayesians—probability as degree of belief. When we conduct a hypothesis test with a significance level of 0.05, we’re using frequentist methodology—probability as long-run frequency.

Which Approach Is “Right”?

Neither approach is universally correct or incorrect. They answer different questions and serve different purposes. Frequentist methods provide objective procedures with well-understood long-run properties, which makes them particularly valuable in fields like medical research where regulatory decisions require clear standards. Bayesian methods allow us to explicitly incorporate prior knowledge and provide direct probability statements about hypotheses, which makes them particularly valuable in fields where we have genuine prior information and want to update our beliefs.

Throughout this course, we’ll primarily use frequentist methods, as these remain the dominant framework in most applied fields and are what you’ll encounter in published research. However, we’ll also discuss Bayesian perspectives where they provide valuable insights or alternative ways of thinking about inference.

The key is to understand both philosophical frameworks and recognize that they represent different—but equally rigorous—ways of reasoning about uncertainty and evidence.

1.7 Understanding Hypothesis Testing Concepts

Before we can intelligently discuss either frequentist or Bayesian inference, we need to understand some fundamental concepts that appear throughout statistical testing. These ideas—particularly around errors in decision-making—form the conceptual foundation for statistical inference.

Types of Errors

When we conduct a statistical test, we’re making a decision: either reject a hypothesis or fail to reject it. Like any decision made under uncertainty, we can make mistakes. There are two types of mistakes we might make:

Important: Type I Error

A Type I error occurs when we reject a hypothesis that is actually correct. We declare that something is happening when, in fact, it is not.

In medical testing: declaring a healthy patient is sick (false positive)
In criminal justice: convicting an innocent person
In scientific research: claiming we’ve found an effect when none exists

Important: Type II Error

A Type II error occurs when we fail to reject a hypothesis that is actually false. We fail to detect something that is really happening.

In medical testing: declaring a sick patient is healthy (false negative)
In criminal justice: acquitting a guilty person
In scientific research: failing to detect an effect that actually exists

These two types of errors are in tension with each other. If we make it harder to commit a Type I error (by requiring very strong evidence before rejecting a hypothesis), we inevitably make it easier to commit a Type II error (we’ll fail to detect real effects more often). Conversely, if we’re very eager to detect effects (reducing Type II errors), we’ll end up making more Type I errors by seeing patterns that aren’t really there.
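This tension can be made visible with a short simulation sketch in Python (the effect size, sample size, and test choice—a one-sample t-test—are arbitrary assumptions for illustration): shrinking the rejection threshold cuts Type I errors but inflates Type II errors.

```python
# Estimate Type I and Type II error rates by simulation at several thresholds.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, reps, effect = 30, 5_000, 0.4

for alpha in (0.10, 0.05, 0.01):
    # Null true: the mean really is 0, so any rejection is a Type I error.
    type1 = np.mean([stats.ttest_1samp(rng.normal(0.0, 1, n), 0).pvalue < alpha
                     for _ in range(reps)])
    # Null false: the mean is `effect`, so failing to reject is a Type II error.
    type2 = np.mean([stats.ttest_1samp(rng.normal(effect, 1, n), 0).pvalue >= alpha
                     for _ in range(reps)])
    print(f"alpha = {alpha:.2f}: Type I rate ~ {type1:.3f}, Type II rate ~ {type2:.3f}")
```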

The P-Value

The p-value is closely tied to Type I error, but it is not literally the probability of committing one. Precisely, the p-value is the probability of observing data as extreme as (or more extreme than) what we actually observed, assuming the hypothesis we’re testing is true. A small p-value means our data would be very surprising under that hypothesis, which is why it leads us to reject it.

The p-value is calculated from your data using statistical procedures. It’s an output of your analysis, not an input. In the old days, p-values were looked up in printed tables at the back of statistics textbooks. Today, statistical software calculates them instantly.

Warning: Common Misconception

The p-value is not “the probability that our results are wrong” or “the probability that the hypothesis is true.” It is specifically the probability of observing our data (or more extreme data) if the hypothesis we’re testing is actually correct.

The Significance Level (α)

The significance level, denoted by the Greek letter α (alpha), is the threshold probability you choose before collecting data. It represents how much Type I error risk you’re willing to tolerate.

Commonly used significance levels include:

  • α = 0.05 (5%): The most common choice in many fields
  • α = 0.01 (1%): Used when Type I errors are particularly costly
  • α = 0.10 (10%): Used when Type I errors are less concerning or when sample sizes are small

Here’s the crucial point: you choose α before looking at your data. The significance level is an input to your analysis, while the p-value is an output. You then compare them:

  • If p-value < α: Reject the hypothesis (the evidence is strong enough)
  • If p-value ≥ α: Fail to reject the hypothesis (the evidence is not strong enough)
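A concrete sketch of this decision rule in Python, with made-up data (60 heads in 100 flips) and assuming a SciPy version that provides scipy.stats.binomtest:

```python
# Test whether a coin is fair after observing 60 heads in 100 flips.
from scipy import stats

alpha = 0.05                                # chosen before seeing the data
result = stats.binomtest(60, n=100, p=0.5)  # two-sided test by default
print(f"p-value = {result.pvalue:.4f}")

if result.pvalue < alpha:
    print("Reject the hypothesis that the coin is fair.")
else:
    print("Fail to reject the hypothesis that the coin is fair.")
```

With these particular numbers the p-value comes out just above 0.05, so at α = 0.05 we fail to reject fairness even though the sample leaned toward heads.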

Why We Never “Accept” Hypotheses

Notice the careful language: we “reject” or “fail to reject” hypotheses. We never “accept” a hypothesis. Why this asymmetry?

The reason is fundamental to the nature of scientific reasoning. Consider the history of physics. In the late seventeenth century, Isaac Newton developed his theory of gravity, which explained why objects fall to the ground. For over two centuries, Newton’s theory was supported by all available evidence. Scientists didn’t say “we accept Newton’s theory as correct”; they said “we fail to reject it; it’s the best explanation we have so far.”

Then, about 100 years ago, Albert Einstein developed general relativity, which showed that Newton’s theory, while extremely useful for everyday purposes, is actually incorrect in important ways. Einstein’s theory superseded Newton’s.

But does this mean Einstein’s theory is “correct”? Not necessarily. It’s the best explanation we have now, consistent with all currently available evidence. But tomorrow, someone might develop an even better theory that supersedes Einstein’s.

Tip: Scientific Humility

In science, we can demonstrate that theories are wrong or false (by finding contradictory evidence), but we can never prove that theories are correct or true (because future evidence might contradict them). This is why we never “accept” hypotheses—we only fail to reject them given current evidence.

This principle, articulated by philosopher Karl Popper, is called falsificationism. Scientific theories can be falsified but never verified with absolute certainty. This is why statistical hypothesis testing is framed around rejection rather than acceptance.

Statistical Power

There’s one more important concept related to errors: statistical power. Power is defined as the probability of not making a Type II error—that is, the probability of correctly rejecting a false hypothesis.

High statistical power is desirable: it means your test is good at detecting effects when they exist. Power depends on several factors:

  • Sample size (larger samples → higher power)
  • Effect size (larger effects → easier to detect → higher power)
  • Significance level (higher α → higher power, but also more Type I errors)
  • Variability in the data (less noise → higher power)

The probability of making a Type II error is conventionally denoted β (beta), just as the Type I error rate we are willing to tolerate is denoted α. Power is then 1 - β.
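Power can also be estimated by simulation, as in this Python sketch (the effect size, α, and the choice of a one-sample t-test are arbitrary assumptions): larger samples detect the same effect far more reliably.

```python
# Estimate power (the probability of correctly rejecting a false hypothesis)
# by simulation, at several sample sizes.
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
alpha, effect, reps = 0.05, 0.4, 5_000

for n in (10, 30, 100):
    rejections = sum(
        stats.ttest_1samp(rng.normal(effect, 1.0, n), 0).pvalue < alpha
        for _ in range(reps)
    )
    power = rejections / reps          # power = 1 - beta
    print(f"n = {n:>3}: estimated power = {power:.2f}")
```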

1.8 Looking Ahead

Throughout this course, we’ll develop both descriptive and inferential tools. We’ll learn to:

  • Visualize data through graphs and charts
  • Calculate summary statistics that capture essential features of datasets
  • Recognize different probability distributions and understand when each applies
  • Use sample data to make justified inferences about populations
  • Establish cause-and-effect relationships through careful analysis
  • Navigate the philosophical differences between frequentist and Bayesian approaches

Most importantly, we’ll engage in abstract thinking about data and probability. Statistics is not just a collection of computational procedures—it’s a coherent framework for reasoning about uncertainty, variability, and inference. Understanding this framework will serve you in any field where data and evidence matter.

The goal of this book is not merely to learn formulas and procedures, but to develop statistical intuition—to think clearly about randomness, patterns, causation, and inference. This kind of thinking is increasingly essential in environmental policy, climate science, economics, public health, and virtually every domain where evidence-based decision-making matters.

We’ll build this understanding gradually, starting with the foundations of probability and working our way up to sophisticated inferential methods. Along the way, we’ll grapple with deep questions: How do we know what we know? What does it mean for evidence to support a claim? How much uncertainty should we tolerate in our conclusions? These aren’t just technical questions—they’re fundamental questions about knowledge itself, approached through the lens of mathematical reasoning.