1 The Purpose of Data Analytics
In this chapter, we’ll explore the fundamental purpose and scope of data analytics. By the end of this chapter, you will understand:
- The distinction between correlation and causation
- How patterns emerge from randomness
- The difference between population and sample data
- The two primary goals of statistical analysis
- The philosophical divide between frequentist and Bayesian approaches
1.1 What Is Data Analytics Really About?
Data analytics is fundamentally about understanding cause and effect relationships in the world. While it’s easy to observe that two variables move together—that they are correlated—establishing causation is far more challenging and far more valuable.
Consider a simple example: we might observe that ice cream sales and drowning incidents are correlated. They both increase during summer months. But does ice cream cause drowning? Of course not. Both are caused by a third factor: warm weather, which leads people to buy ice cream and also to swim more frequently.
The distinction between correlation and causation is not merely academic—it has profound implications for how we understand the world and make decisions. Consider the famous closing lines of Robert Frost’s poem “The Road Not Taken”:
Two roads diverged in a wood, and I—
I took the one less traveled by,
And that has made all the difference.
Frost claims that taking the road less traveled “made all the difference” to his life. But as statisticians, we must ask: how does he know? To establish causation, we would need a counterfactual—an alternative version of his life where he took the other road. Without observing this counterfactual, Frost cannot definitively claim that his choice caused the difference in his life’s trajectory. Perhaps his life would have turned out similarly regardless of which road he chose. Or perhaps taking the more traveled road would have led to even better outcomes.
This challenge—the impossibility of observing counterfactuals in our own lives—is precisely what makes causal inference so difficult and why rigorous statistical methods are essential.
In policy work, especially environmental policy and climate science, we need causal understanding. When we ask "how much warming will occur if we add X more tons of carbon dioxide to the atmosphere?", we're asking a causal question. The relationship between greenhouse gas concentrations and temperature change is extraordinarily complex and stochastic. Yet climate scientists have developed reliable estimates of what is called the global warming potential of different greenhouse gases. These estimates rest on a causal understanding of physical processes, not mere correlation.
This is why data analytics matters: we want to establish cause and effect, not just observe patterns. We’re here to understand how the world works, not just to make pretty pictures or note correlations.
1.2 From Randomness to Pattern
One of the most remarkable features of statistical analysis is how patterns emerge from what initially appears to be pure randomness. When we look at individual observations, they often seem chaotic and unpredictable. But when we collect enough observations, macro-level patterns begin to reveal themselves.
Consider the classic example of a Galton board (sometimes called a bean machine). When a single ball drops through the board, hitting pegs as it falls, its path is essentially random—at each peg, it bounces left or right unpredictably. We cannot predict where any individual ball will land.
However, when we drop hundreds or thousands of balls, a clear pattern emerges: they pile up in the shape of a bell curve, forming what statisticians call the normal distribution. The randomness of individual ball drops gives way to a predictable aggregate pattern.
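To make this concrete, here is a minimal sketch in Python (assuming NumPy is available; the numbers of balls and pegs are arbitrary) that simulates a Galton board: each ball makes a random left-or-right bounce at every peg, and the bin counts pile up into a bell shape.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

n_balls = 10_000   # balls dropped through the board
n_pegs = 12        # rows of pegs; each peg bounces the ball left (0) or right (1)

# A ball's final bin is simply the number of rightward bounces it took.
bounces = rng.integers(0, 2, size=(n_balls, n_pegs))   # one Bernoulli trial per peg
final_bins = bounces.sum(axis=1)

# Tally the balls in each bin; the counts trace out the familiar bell shape.
bins, counts = np.unique(final_bins, return_counts=True)
for b, c in zip(bins, counts):
    print(f"bin {b:2d}: {'#' * (c // 50)}")
```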
This emergence of order from randomness is not magic—it’s mathematics. And it’s the foundation of statistical inference.
1.3 Many Patterns, Not Just One
Different real-world phenomena follow different distributions:
- Bernoulli distribution: Events with only two possible outcomes (coin flip: heads or tails; ball at a peg: left or right)
- Binomial distribution: The number of successes in a fixed number of independent Bernoulli trials (how many heads in 10 coin flips?)
- Poisson distribution: Counts of events in a fixed interval of time or space (how many customers arrive per hour; how many buses pass a stop in a day); the waiting times between such events follow the closely related exponential distribution
- Normal distribution: Many continuous phenomena in nature and society (heights, test scores, measurement errors)
These distributions are often mathematically related. For instance, when you sum up many independent Bernoulli trials (each ball on the Galton board making left-right decisions), you get a binomial distribution. And when the number of trials becomes very large, that binomial distribution approximates the normal distribution.
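A quick way to see this relationship is to compare the two distributions numerically. The sketch below (assuming SciPy is installed; the choice of n = 100 trials is arbitrary) checks how closely a normal curve with matching mean and standard deviation tracks the exact binomial probabilities near the center.

```python
from scipy import stats

n, p = 100, 0.5   # many independent left/right (Bernoulli) decisions
binom = stats.binom(n, p)
normal = stats.norm(loc=n * p, scale=(n * p * (1 - p)) ** 0.5)  # same mean and sd

# Near the center, the exact binomial probabilities and the normal density agree closely.
for k in (40, 45, 50, 55, 60):
    print(f"k = {k}: binomial = {binom.pmf(k):.4f}, normal approx = {normal.pdf(k):.4f}")
```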
Throughout this course, we’ll work with many different distributions. Each captures a different kind of pattern in data. The key is learning to recognize which pattern fits which situation—and to never assume that one pattern applies universally.
1.4 Population and Sample
In statistical analysis, we make a crucial distinction between two types of data:
- Population: The complete set of all observations we ultimately care about, every case the phenomenon could produce
- Sample: The subset of the population that we actually observe and measure
Consider studying human height. The population would include the heights of all humans who have ever lived, are living now, and will live in the future. That’s an enormous—indeed, infinite—amount of data. Your sample might be the heights of 1,000 people surveyed in a particular city during a particular year.
No matter how large your sample, it remains small relative to the population. Even if you collect data on millions of individuals, that is still only a tiny fraction of the theoretical population. As a mathematical principle:
\[\lim_{n \to \infty} \text{Sample} = \text{Population}\]
As the sample size approaches infinity, it approaches the population. But in practice, our samples are always finite and small relative to the population.
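A small simulation makes this limit tangible. In the sketch below (assuming NumPy; the population mean and standard deviation are invented for illustration), we draw ever-larger samples from a hypothetical height distribution and watch the sample mean settle toward the population mean.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical "population" of heights with a known mean, used only for illustration.
population_mean = 170.0   # cm
population_sd = 10.0      # cm

# Larger samples produce sample means closer to the population mean.
for n in (10, 100, 10_000, 1_000_000):
    sample = rng.normal(population_mean, population_sd, size=n)
    print(f"n = {n:>9,}: sample mean = {sample.mean():.3f}")
```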
1.5 Two Goals of Statistical Analysis
What do we do with sample data once we collect it? We pursue one or both of two fundamental goals:
1. Description
The first goal is to describe the data we have collected. This is called descriptive statistics. We might:
- Calculate the average (mean) age in our sample
- Determine the most common (mode) educational level
- Find the middle value (median) of family incomes
- Measure the spread (variance or standard deviation) of environmental commitment scores
Descriptive statistics summarize and organize data in meaningful ways. They help us understand what our sample looks like. When we describe sample data, we’re making statements only about that specific set of observations.
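Computing these summaries takes only a few lines. Here is a minimal sketch using Python's built-in statistics module on an invented sample of ages.

```python
import statistics

# A hypothetical sample of ages; the numbers are invented for illustration.
ages = [23, 31, 27, 45, 31, 38, 29, 31, 52, 40]

print("mean:    ", statistics.mean(ages))
print("median:  ", statistics.median(ages))
print("mode:    ", statistics.mode(ages))
print("variance:", statistics.variance(ages))  # sample variance
print("stdev:   ", statistics.stdev(ages))     # sample standard deviation
```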
2. Inference
The second, more ambitious goal is to infer patterns and relationships that extend beyond our sample to the broader population. This is called inferential statistics or statistical inference.
Suppose we collect sample data on 25 different variables for each person: age, education level, commitment to environmental causes, family income, transportation choices, and so on. We might discover relationships among these variables in our sample—for instance, that people with higher education levels tend to show stronger commitment to environmental causes.
The question then becomes: Can we extrapolate this relationship from our tiny sample to the entire population? Can we say with confidence that the relationship we found in this specific dataset also exists more broadly?
This is the central challenge of inferential statistics. We observe patterns in our sample and attempt to make general claims about the population. The entire machinery of statistical inference—hypothesis tests, confidence intervals, p-values, regression analysis—exists to help us make this logical leap from sample to population in a rigorous, quantifiable way.
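As a preview of how this leap from sample to population is made in practice, the sketch below (assuming NumPy and SciPy; the data are simulated, not real survey results) estimates a correlation in a sample and reports a p-value, a quantity we define properly in Section 1.7.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)

# Invented sample: years of education and an environmental-commitment score,
# generated so that a positive relationship exists by construction.
education = rng.normal(14, 2, size=200)
commitment = 0.3 * education + rng.normal(0, 1.5, size=200)

r, p_value = stats.pearsonr(education, commitment)
print(f"sample correlation r = {r:.2f}, p-value = {p_value:.4g}")
# The p-value (introduced formally in Section 1.7) asks: if there were no relationship
# in the population, how surprising would a sample correlation this strong be?
```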
When we perform inference successfully—when we can say with justified confidence that our sample findings reflect population patterns—we achieve what statisticians call external validity. But before we can even attempt to generalize to the population, we must first ensure that our findings within the sample are sound. When our causal analysis within the sample is properly conducted and the relationships we identify are genuine (not artifacts of confounding variables or measurement error), we say our analysis has internal validity. Both forms of validity are essential for credible statistical work.
1.6 Two Philosophical Approaches to Inference
How many fundamentally different approaches exist for making statistical inferences? The answer is two: the frequentist approach and the Bayesian approach. These represent two distinct philosophical frameworks for reasoning about probability and uncertainty.
The Frequentist Approach
The frequentist approach, which dominated statistical practice for much of the 20th century, interprets probability in terms of long-run frequencies. From this perspective, probability statements only make sense for events that can be repeated many times.
Consider flipping a coin. A frequentist interprets “the probability of heads is 0.5” to mean: if we flip this coin infinitely many times, heads will appear in 50% of the flips. Probability, in this view, is an objective property of the world—a statement about what would happen if we could repeat an experiment indefinitely.
This philosophical stance has important implications. Imagine I flip a coin and catch it in my hand, concealing the result. I know how it landed, but you don’t. What is the probability that it landed heads?
A frequentist would say: the probability is either 0 or 1, depending on how it actually landed. If it landed heads, the probability is 1 (certainty). If it landed tails, the probability is 0 (impossibility). The coin has already landed—there’s nothing probabilistic about it anymore. The event has occurred, and its outcome is now a fact of the world, even if you don’t know what that fact is.
This reveals a key feature of frequentist thinking: probabilities apply to events that haven’t happened yet, not to events that have already occurred but whose outcomes we simply don’t know. From a frequentist perspective, once the coin has landed, talking about the “probability” of how it landed is meaningless. It landed some particular way. The uncertainty you feel is about your knowledge, not about the event itself.
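The frequentist reading of "the probability of heads is 0.5" can be checked directly by simulation. A minimal sketch (assuming NumPy; the seed and flip counts are arbitrary) flips a fair virtual coin in ever-larger batches and reports the proportion of heads in each.

```python
import numpy as np

rng = np.random.default_rng(seed=7)

# Probability as long-run frequency: the share of heads settles toward 0.5
# only as the number of (repeatable) flips grows large.
for n_flips in (10, 100, 10_000, 1_000_000):
    heads = rng.integers(0, 2, size=n_flips).sum()
    print(f"{n_flips:>9,} flips: proportion of heads = {heads / n_flips:.4f}")
```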
The Bayesian Approach
The Bayesian approach takes a fundamentally different view. Bayesians interpret probability as a measure of our degree of belief or state of knowledge about an event. Probability, from this perspective, is subjective—it represents how confident we are, given the information we have.
Let’s return to the coin in my hand. A Bayesian would say: given that you don’t know how it landed and you have no reason to believe the coin is unfair, your probability assessment should be 0.5. This doesn’t mean the coin is somehow in a superposition of states. Rather, it means that given your current state of knowledge, you should be equally uncertain about whether it shows heads or tails.
If I were to give you a hint, say, "It's not tails," you would immediately update your probability of heads to 1. Your degree of belief changes as you gain new information, even though the physical state of the coin hasn't changed at all.
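This updating is exactly what Bayes' rule formalizes. The sketch below (a hypothetical helper written for illustration, not a library function) computes the posterior probability of heads given a hint, both for the perfectly reliable hint in the example and for a noisier one.

```python
def bayes_update(prior_heads, p_hint_if_heads, p_hint_if_tails):
    """Posterior probability of heads after hearing the hint (Bayes' rule)."""
    numerator = p_hint_if_heads * prior_heads
    evidence = numerator + p_hint_if_tails * (1 - prior_heads)
    return numerator / evidence

# A truthful "it's not tails" hint can only be uttered when the coin shows heads,
# so the belief jumps from 0.5 all the way to certainty.
print(bayes_update(prior_heads=0.5, p_hint_if_heads=1.0, p_hint_if_tails=0.0))  # 1.0

# A hint that is right only 80% of the time moves the belief, but not to certainty.
print(bayes_update(prior_heads=0.5, p_hint_if_heads=0.8, p_hint_if_tails=0.2))  # 0.8
```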
This philosophical difference leads to very different statistical methodologies. Frequentists develop procedures that work well in the long run—if we used this test over and over, we’d make correct decisions most of the time. Bayesians explicitly incorporate prior knowledge and update their beliefs as new evidence arrives.
Most practicing statisticians today are implicitly Bayesian in their everyday reasoning about uncertainty, even if they use frequentist methods in their formal analyses. When we say “there’s a 70% chance it will rain tomorrow,” we’re thinking like Bayesians—probability as degree of belief. When we conduct a hypothesis test with a significance level of 0.05, we’re using frequentist methodology—probability as long-run frequency.
Which Approach Is “Right”?
Neither approach is universally correct or incorrect. They answer different questions and serve different purposes. Frequentist methods provide objective procedures with well-understood long-run properties, which makes them particularly valuable in fields like medical research where regulatory decisions require clear standards. Bayesian methods allow us to explicitly incorporate prior knowledge and provide direct probability statements about hypotheses, which makes them particularly valuable in fields where we have genuine prior information and want to update our beliefs.
Throughout this course, we’ll primarily use frequentist methods, as these remain the dominant framework in most applied fields and are what you’ll encounter in published research. However, we’ll also discuss Bayesian perspectives where they provide valuable insights or alternative ways of thinking about inference.
The key is to understand both philosophical frameworks and recognize that they represent different—but equally rigorous—ways of reasoning about uncertainty and evidence.
1.7 Understanding Hypothesis Testing Concepts
Before we can intelligently discuss either frequentist or Bayesian inference, we need to understand some fundamental concepts that appear throughout statistical testing. These ideas—particularly around errors in decision-making—form the conceptual foundation for statistical inference.
Types of Errors
When we conduct a statistical test, we're making a decision: either reject a hypothesis or fail to reject it. Like any decision made under uncertainty, we can make mistakes. There are two types of mistakes we might make:
- Type I error: Rejecting a hypothesis that is actually true (a false positive: we "detect" an effect that isn't really there)
- Type II error: Failing to reject a hypothesis that is actually false (a false negative: we miss an effect that really exists)
These two types of errors are in tension with each other. If we make it harder to commit a Type I error (by requiring very strong evidence before rejecting a hypothesis), we inevitably make it easier to commit a Type II error (we’ll fail to detect real effects more often). Conversely, if we’re very eager to detect effects (reducing Type II errors), we’ll end up making more Type I errors by seeing patterns that aren’t really there.
The P-Value
The p-value is the probability of observing data as extreme as (or more extreme than) what we actually observed, assuming the hypothesis we're testing is true. It is often loosely described as the probability of making a Type I error, but strictly speaking that risk is governed by the significance level you choose (introduced below), not by the p-value itself.
The p-value is calculated from your data using statistical procedures. It’s an output of your analysis, not an input. In the old days, p-values were looked up in printed tables at the back of statistics textbooks. Today, statistical software calculates them instantly.
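In practice, the calculation is delegated to software. A minimal sketch (assuming SciPy; the 61-heads result is invented) computes the p-value for an exact binomial test of a fair coin.

```python
from scipy import stats

# Suppose we flip a coin 100 times and observe 61 heads (an invented result).
# Assuming the coin is fair, how surprising is an outcome at least this extreme?
# That tail probability is the p-value.
result = stats.binomtest(k=61, n=100, p=0.5, alternative="two-sided")
print(f"p-value = {result.pvalue:.4f}")
```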
The Significance Level (α)
The significance level, denoted by the Greek letter α (alpha), is the threshold probability you choose before collecting data. It represents how much Type I error risk you’re willing to tolerate.
Commonly used significance levels include:
- α = 0.05 (5%): The most common choice in many fields
- α = 0.01 (1%): Used when Type I errors are particularly costly
- α = 0.10 (10%): Used when Type I errors are less concerning or when sample sizes are small
Here’s the crucial point: you choose α before looking at your data. The significance level is an input to your analysis, while the p-value is an output. You then compare them:
- If p-value < α: Reject the hypothesis (the evidence is strong enough)
- If p-value ≥ α: Fail to reject the hypothesis (the evidence is not strong enough)
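The comparison itself is mechanical, which is worth seeing in code: α enters as a fixed input chosen in advance, and the p-value arrives as an output of the analysis (here just an illustrative number).

```python
alpha = 0.05      # chosen before looking at the data (input)
p_value = 0.035   # produced by the analysis (output); illustrative value

if p_value < alpha:
    print("Reject the hypothesis: the evidence is strong enough.")
else:
    print("Fail to reject the hypothesis: the evidence is not strong enough.")
```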
Why We Never “Accept” Hypotheses
Notice the careful language: we “reject” or “fail to reject” hypotheses. We never “accept” a hypothesis. Why this asymmetry?
The reason is fundamental to the nature of scientific reasoning. Consider the history of physics. A little over three centuries ago, Isaac Newton developed his theory of gravity, which explained why objects fall to the ground. For over two centuries, Newton's theory was supported by all available evidence. Scientists didn't say "we accept Newton's theory as correct"; they said "we fail to reject it; it's the best explanation we have so far."
Then, about 100 years ago, Albert Einstein developed general relativity, which showed that Newton’s theory, while extremely useful for everyday purposes, is actually incorrect in important ways. Einstein’s theory superseded Newton’s.
But does this mean Einstein’s theory is “correct”? Not necessarily. It’s the best explanation we have now, consistent with all currently available evidence. But tomorrow, someone might develop an even better theory that supersedes Einstein’s.
This principle, articulated by philosopher Karl Popper, is called falsificationism. Scientific theories can be falsified but never verified with absolute certainty. This is why statistical hypothesis testing is framed around rejection rather than acceptance.
Statistical Power
There’s one more important concept related to errors: statistical power. Power is defined as the probability of not making a Type II error—that is, the probability of correctly rejecting a false hypothesis.
High statistical power is desirable: it means your test is good at detecting effects when they exist. Power depends on several factors:
- Sample size (larger samples → higher power)
- Effect size (larger effects → easier to detect → higher power)
- Significance level (higher α → higher power, but also more Type I errors)
- Variability in the data (less noise → higher power)
The probability of making a Type II error has a conventional symbol of its own: it is denoted β (beta), just as the Type I error rate is denoted α. Power is then 1 - β.
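To see how these pieces fit together numerically, here is a minimal sketch (assuming SciPy; the effect size, sample size, and α are all invented for illustration) that computes the power of a simple two-sided z-test and the corresponding β.

```python
from scipy import stats

# A sketch of power for a two-sided one-sample z-test, computed analytically.
# Assumed values: a true effect of 0.3 standard deviations, n = 50, alpha = 0.05.
effect_size = 0.3
n = 50
alpha = 0.05

z_crit = stats.norm.ppf(1 - alpha / 2)   # critical value for the two-sided test
shift = effect_size * n ** 0.5           # how far the true effect shifts the test statistic
power = (1 - stats.norm.cdf(z_crit - shift)) + stats.norm.cdf(-z_crit - shift)
beta = 1 - power

print(f"power = {power:.2f}, beta = {beta:.2f}")
```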
1.8 Looking Ahead
Throughout this course, we’ll develop both descriptive and inferential tools. We’ll learn to:
- Visualize data through graphs and charts
- Calculate summary statistics that capture essential features of datasets
- Recognize different probability distributions and understand when each applies
- Use sample data to make justified inferences about populations
- Establish cause-and-effect relationships through careful analysis
- Navigate the philosophical differences between frequentist and Bayesian approaches
Most importantly, we’ll engage in abstract thinking about data and probability. Statistics is not just a collection of computational procedures—it’s a coherent framework for reasoning about uncertainty, variability, and inference. Understanding this framework will serve you in any field where data and evidence matter.
The goal of this book is not merely to learn formulas and procedures, but to develop statistical intuition—to think clearly about randomness, patterns, causation, and inference. This kind of thinking is increasingly essential in environmental policy, climate science, economics, public health, and virtually every domain where evidence-based decision-making matters.
We’ll build this understanding gradually, starting with the foundations of probability and working our way up to sophisticated inferential methods. Along the way, we’ll grapple with deep questions: How do we know what we know? What does it mean for evidence to support a claim? How much uncertainty should we tolerate in our conclusions? These aren’t just technical questions—they’re fundamental questions about knowledge itself, approached through the lens of mathematical reasoning.