How to Navigate Confidence Intervals With Confidence

Teaching statistics year after year prepared me for the most common misinterpretations of confidence intervals and confidence levels — confusion such as:

  • Incorrectly interpreting a 99% interval as having a “99% probability of containing the true population parameter”
  • Finding significance because “the sample mean is contained in the interval”
  • Applying a confidence interval to samples that do not meet specific assumptions

What are Confidence Intervals?

Confidence intervals are like fishing nets to an analyst looking to capture the actual measure of a population in a pond of uncertainty. The margin of error dictates the width of the “net”. But unlike fishing scenarios, whether or not the confidence interval actually captures the true population measure typically remains uncertain. Confidence intervals are not intuitive, yet they are logical once you understand where they start.

So what, EXACTLY, are we confident about? Is it the underlying data? Is it the result? Is it the sample? The confidence is actually in the procedures used to obtain the sample that was used to create the interval — and I’ll come back to this big idea at the end of the post. First, let’s paint the big picture in three parts: the data, the math, and the interpretation.

The Data

As I mentioned, a confidence interval captures a “true” (yet unknown) measure of a population using sample data. Therefore, you must be working with sample data to apply a confidence interval — you’re defeating the purpose if you’re already working with population data for which the metrics of interest are known.

Sampling Bias

It’s important to investigate how the sample was taken and determine if the sample represents the entire population. Sampling bias means a certain group has been under- or over-represented in a sample – in which case, the sample does not represent the entire population. A common misconception is that you can offset bias by increasing the sample size; however, once bias has been introduced by the sampling procedure, a larger sample collected the same way only reproduces that bias. The sample will still differ systematically from the population, which is NOT a representative sample.

Examples of sampling bias:

  • Excluding a group who cannot be reached or does not respond
  • Only sampling groups of people who can be conveniently reached
  • Changing sampling techniques during the sampling process
  • Contacting people not chosen for the sample

Statistic vs Parameter

A statistic describes a sample. A parameter describes a population. For example, if a sample of 50 adult female pandas weighs an average of 160 pounds, the sample mean of 160 is known as the statistic. Meanwhile, we don’t actually know the average weight of all adult female pandas. But if we did, that average (mean) of the population of all female pandas would be the parameter. Statistics are used to estimate parameters. Since we don’t typically know the details of an entire population, we rely heavily on statistics.

Mental Tip: Look at the first letters! A Statistic describes a Sample and a Parameter describes a Population

The Math

All confidence intervals take the form:

statistic ± margin of error

A common example here is polling reports — “The exit polls show John Cena has 46% of the vote, with a margin of error of 3 points.” Most people without a statistics background can draw the conclusion: “John Cena likely has between 43% and 49% of the vote.”


What if John Cena actually has 44% of the vote? Here, I’ve visualized 40 samples, 38 of which contain that 44%. Notice each confidence interval has two parts – the square in the middle represents the sample proportion and the horizontal line is the margin of error.
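If you’d like to see this for yourself, here’s a minimal simulation sketch in Python (the per-poll sample size of 1,000 voters is my assumption, not part of the original figure):

```python
# Simulate 40 exit polls from a population where 44% vote for the candidate,
# build a 95% confidence interval from each, and count how many capture 44%.
import numpy as np

rng = np.random.default_rng(7)
p_true, n, z_star = 0.44, 1000, 1.96   # n = 1000 voters per poll (assumed)

captured = 0
for _ in range(40):
    p_hat = rng.binomial(n, p_true) / n               # the square in the middle
    moe = z_star * np.sqrt(p_hat * (1 - p_hat) / n)   # the horizontal line
    if p_hat - moe <= p_true <= p_hat + moe:
        captured += 1

print(f"{captured} of 40 intervals captured the true 44%")
```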

The Statistic, AKA “The Point Estimate”

The “statistic” is merely our estimate of the true parameter.

The statistic in the voting example is the sample percent from exit polls — the 46%. The actual percent of the population voting for John Cena – the parameter – is unknown until the polls close, so forecasters rely on sample values.

A sample mean is another example of a statistic – like the mean weight of an adult female panda. Using this statistic helps researchers avoid the hassle of traveling the world weighing all adult female pandas.


The Margin of Error

With confidence intervals, there’s a trade-off between precision and accuracy: A wider interval may capture the true mean accurately, but it’s also less precise than a narrower interval.

The width of the interval is decided by the margin of error because, mathematically, it is the piece that is added to and subtracted from the statistic to build the entire interval.

How do we calculate the margin of error? You have two main components — a t- or z-value derived from the confidence level, and the standard error. Unless you have control over the data collection on the front end, the confidence level is the only component you’ll be able to determine and adjust on the back end.

Two common Margin of Error (MoE) calculations:

For a mean: MoE = t* × (s / √n)
For a proportion: MoE = z* × √(p̂(1 − p̂) / n)
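As a rough sketch of both calculations (the sample sizes and standard deviation below are hypothetical, chosen only for illustration):

```python
import numpy as np
from scipy import stats

# MoE for a mean: t* x s / sqrt(n)
n, s = 50, 25.0                          # hypothetical panda sample
t_star = stats.t.ppf(0.975, df=n - 1)    # 95% confidence, n - 1 degrees of freedom
moe_mean = t_star * s / np.sqrt(n)       # about 7.1 pounds

# MoE for a proportion: z* x sqrt(p_hat(1 - p_hat) / n)
n_poll, p_hat = 1000, 0.46               # hypothetical poll; 46% from the example
z_star = stats.norm.ppf(0.975)           # about 1.96
moe_prop = z_star * np.sqrt(p_hat * (1 - p_hat) / n_poll)  # about 0.031, i.e. 3 points

print(round(moe_mean, 1), round(moe_prop, 3))
```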

The confidence level

“Why can’t we just make it 100% confidence?” Great question! And one I’ve heard many times. Without going into the details of sampling distributions and normal curves, I’ll give you an example:

Assume the “average” adult female panda weighs “around 160 pounds.” To be 100% confident that we’ve created an interval that includes the TRUE mean weight, we’d have to use a range that includes all possible values of mean weights. This interval might be from, say, 100 to 400 pounds – maybe even 50 to 1000 pounds. Either way, that interval would have to be ridiculously large to be 100% confident you’ve estimated the true mean. And with a range that wide, have you actually delivered any insightful message?

Again, consider a confidence interval like a fishing net, with the width of the net determined by the margin of error – more specifically, by the confidence level (since that’s about all you have control over once a sample has been taken). This means a HIGHER confidence level produces a WIDER net and a LOWER confidence level produces a NARROWER net (everything else equal).

For example: A 99% confidence interval fishing net is wider than a 95% confidence interval fishing net. The wider net catches more fish in the process.

But if the purpose of the confidence interval is to narrow down our search for the population parameter, then we don’t necessarily want more values in our “net”. We must strike a balance between precision (meaning fewer possibilities) and confidence.

Once a confidence level is established, the corresponding t* or z* value — called an upper critical value — is used in the calculation for the margin of error. If you’re interested in how to calculate the z* upper critical value for a 95% z-interval for proportions, check out this short video using the Standard Normal Distribution.

The standard error

This is the part of the margin of error you most likely won’t get to control.

Keeping with the panda example, if we are interested in the true mean weight for the adult female panda then the standard error is the standard deviation of the sampling distribution of sample mean weights. Standard error, a measure of variability, is based on a theoretical distribution of all possible sample means. I won’t get into the specifics in this post but here is a great video explaining the basics of the Central Limit Theorem and the standard error of the mean. 

If you’re using proportions, such as in our John Cena election example, here is my favorite video explaining the sampling distribution of the sample proportion (p-hat).

As I mentioned, you will most likely NOT have much control over the standard error portion of the margin of error. But if you did, keep this PRO TIP in your pocket: a larger sample size (n) will reduce the width of the margin of error without sacrificing the level of confidence.
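Here’s a quick sketch of that tip (with a hypothetical standard deviation): each time you quadruple the sample size, the margin of error is roughly cut in half.

```python
import numpy as np
from scipy import stats

s = 25.0                                  # hypothetical sample standard deviation
for n in (20, 80, 320):
    moe = stats.t.ppf(0.975, df=n - 1) * s / np.sqrt(n)
    print(n, round(moe, 1))               # MoE shrinks as sqrt(n) grows
```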

The Interpretation

Back to the panda weights example here. Let’s assume we used a 95% confidence interval to estimate the true mean weight of all adult female pandas.
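For reference, here’s a sketch of how such an interval could be computed — the summary statistics below are hypothetical, chosen to land near the 150 and 165 you’ll see in a moment:

```python
import numpy as np
from scipy import stats

n, x_bar, s = 30, 157.5, 20.0                 # hypothetical panda sample
se = s / np.sqrt(n)                           # standard error of the mean
low, high = stats.t.interval(0.95, df=n - 1, loc=x_bar, scale=se)
print(round(low, 1), round(high, 1))          # roughly (150.0, 165.0)
```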

Interpreting the Interval

Typically the confidence interval is interpreted something like this: “We are 95% confident the true mean weight of an adult female panda is between 150 and 165 pounds.”

Notice I didn’t use the word probability. At all. Let’s look at WHY:

Interpreting the Level

The confidence level tells us: “If we took samples of this same size over and over again (think: in the long run) using this same method, we would expect to capture the true mean weight of an adult female panda 95% of the time.” Notice this IS a probability. A 95% probability of capturing the true mean exists BEFORE taking the sample. Which is why I did NOT reference the actual interval values. A different sample would produce a different interval. And as I said in the beginning of this post, we don’t actually know if the true mean is in the interval we calculated.

Well then, what IS the probability that my confidence interval – the one I calculated between the values of 150 and 165 pounds – contains the true mean weight of adult female pandas? Either 1 or 0. It’s either there, or it isn’t. Because — and here’s the tricky part — the sample was already collected before we did the math. NOTHING in the math can change the fact that we either did or didn’t collect a representative sample of the population. OUR CONFIDENCE IS IN THE DATA COLLECTION METHOD – not the math.

The numbers in the confidence interval would be different using a different sample.

Visualizing the Confidence Interval

Let’s assume the density curve below represents the weights of all adult female pandas. In this made-up example, the mean weight of all adult female pandas is 156.2 pounds with a population standard deviation of 13.6 pounds.

Beneath the population distribution are the simulation results of 300 samples of n = 20 pandas (sampled using an identical sampling method each time). Notice that roughly 95% of the intervals cover the true mean of 156.2 (the green intervals), while close to 5% of the intervals do NOT capture 156.2 (the red intervals).

Pay close attention to the points made by the visualization above:

  • Each horizontal line represents a confidence interval constructed from a different sample
  • The green lines “capture” or “cover” the true (unknown) mean while the red lines do NOT cover the mean.
  • If this was a real situation, you would NOT know if your interval contained the true mean (green) or did not contain the true mean (red).
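If you’d like to reproduce a simulation like this one, here’s a minimal sketch:

```python
# 300 samples of n = 20 pandas from a population with mean 156.2 and SD 13.6;
# count how many 95% t-intervals capture the true mean.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
mu, sigma, n, reps = 156.2, 13.6, 20, 300

covered = 0
for _ in range(reps):
    sample = rng.normal(mu, sigma, n)
    se = sample.std(ddof=1) / np.sqrt(n)
    low, high = stats.t.interval(0.95, df=n - 1, loc=sample.mean(), scale=se)
    covered += (low <= mu <= high)            # a "green" interval

print(f"{covered / reps:.1%} of intervals captured the true mean")  # close to 95%
```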

The logic of confidence intervals is based on long-run results — frequentist inference. Once the sample is drawn, the resulting interval either does or doesn’t contain the true population parameter — a probability of 1 or 0, respectively. Therefore, the confidence level does not imply the probability the parameter is contained in the interval. In the LONG run, after many samples, the resulting intervals will contain the mean C% of the time (where C is your confidence level).

So in what are we placing our confidence when we use confidence intervals? Our confidence is in the procedures used to find our sample. Any sampling bias will affect the results – which is why you don’t want to use confidence intervals with data that may not represent the population.

How to Interpret P-values: Webinar Reflection

Big thanks to Eva Murray and Andy Kriebel for inviting me onto their BrightTalk channel for a second time. If you missed it, Stats for Data Visualization Part 2 served up a refresher on p-values.


Since our primary audience tends to be those in data visualization, I used the regression output in Tableau to highlight the p-value in a test for regression towards the end. However, I spent the majority of the webinar discussing p-values in general because the logic of p-values applies broadly to all those tests you may or may not remember from school: t-tests, Chi-Square, z-tests, f-tests, Pearson, Spearman, ANOVA, MANOVA, MANCOVA, and so on.

I’m dedicating the remainder of this post to some “rules” about statistical tests. If you plan to publish your research, you’ll be required to give more information about your data for researchers to consider your p-value meaningful. In the webinar, I did not dive into the assumptions and conditions necessary for a test for linear regression, and it would be careless of me to leave them out of my blog. If you use p-values to drive decisions, please read on.

More about Tests for Linear Regression

Cautions always come with statistical tests – those cautions do not fall solely on the p-value “cut-off” debate.

Note: This cartoon uses sarcasm to poke fun at blindly following p-values.

Do you plan to publish your findings?

To publish your findings in a journal or use your research in a dissertation, the data must meet each condition/assumption before you move forward with the calculations and p-value interpretation; otherwise, the p-value is not meaningful.

Each statistical test comes with its own set of conditions and assumptions that justify the use of that test. Tests for Linear Regression have between 5 and 10 assumptions and conditions that must be met (depending on the type of regression and application).

Below I’ve listed a non-exhaustive list of common assumptions/conditions to check before running a test for linear regression (in no particular order).

  1. The independent and dependent variables are continuous variables.
  2. The two variables exhibit a linear relationship – check with a scatterplot.
  3. No significant outliers (AKA points with extremely large residuals) – check with a residual plot.
  4. Observations are independent of each other (as in, the existence of one data point does not influence another) – test with the Durbin-Watson statistic.
  5. The data shows homoscedasticity (meaning the variance of the residuals remains the same along the entire line of best fit) – check the residual plot, then test with Levene’s or the Brown-Forsythe test.
  6. Normality – residuals must be approximately normally distributed – check using a histogram or normal probability plot of the residuals. (In addition, a dissertation chair may require a Kolmogorov-Smirnov or Shapiro-Wilk test on the dependent and independent variables separately.) As sample size increases, this assumption becomes less critical thanks to the Central Limit Theorem.
  7. No multicollinearity – the independent variables (in multiple regression) should not be highly correlated with one another – check using a correlation matrix or the variance inflation factor (VIF).

In other words, these are tests that must be performed in order to perform the test you actually care about.
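To give a flavor of what a few of these checks look like in practice, here’s a hedged sketch in Python using statsmodels (the toy data is made up — swap in your own x and y):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from scipy import stats

# Toy data standing in for your own independent (x) and dependent (y) variables
rng = np.random.default_rng(0)
x = np.arange(21, dtype=float)
y = 15 + 1.25 * x + rng.normal(0, 1, 21)

model = sm.OLS(y, sm.add_constant(x)).fit()
resid = model.resid

print(durbin_watson(resid))    # near 2 suggests independent observations (check 4)
print(stats.shapiro(resid))    # tests normality of the residuals (check 6)
# For linearity (2) and homoscedasticity (5), plot resid against model.fittedvalues
# and look for curvature or a funnel shape.
```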

Note: Check with your publication and/or dissertation chair for complete list of assumptions and conditions for your specific situation.

Can I use p-values only as a sniff test?

Short answer: Yes. But I recommend learning how to interpret them and their limitations. Glancing over the list of assumptions above can give a good indication of how sensitive regression models are to outliers and outside variables. I’d also be hesitant to draw conclusions based on a p-value alone for small datasets.

I highly recommend looking at the residual plot (from webinar 1) to determine if your linear model is a good overall fit, keeping in mind the assumptions above. Here is a guide to creating a residual plot using Tableau.


The Analytics PAIN Part 3: How to Interpret P-Values with Linear Regression

You may find Part 1 and Part 2 interesting before heading into part 3, below.


Interpreting Statistical Output and P-Values

To recap, you own Pearson’s Pizza and you’ve hired your nephew Lloyd to run the establishment. And since Lloyd is not much of a math whiz, you’ve decided to help him learn some statistics based on pizza prices.

When we left off, you and Lloyd realized that, despite a strong correlation and high R-Squared value, the residual plot suggests that predicting pizza prices from toppings will become less and less accurate as the number of toppings increases:

A clear, distinct pattern in a residual plot is a red flag

Looking back at the original scatterplot and software output, Lloyd protests, “But the p-value is significant. It’s less than 0.0001.”


The original software output. What does it all mean?

 

Doesn’t a small p-value imply that our model is a go?


A crash course on hypothesis testing

In Pearson’s Pizza’s back room, you’ve got two pool tables and a couple of slot machines (which may or may not be legal). One day, a tall, serious man saunters in, slaps down 100 quid and challenges (in a British accent), “My name is Ronnie O’Sullivan. That’s right. THE. Ronnie O’Sullivan.”

You throw the name into the Google and find out the world’s best snooker player just challenged you to a game of pool.

Then something interesting happens. YOU win.

Suddenly, you wonder, is this guy really who he says he is?

Because the likelihood of you winning this pool challenge to THE Ronnie O’Sullivan is slim to none IF he is, indeed, THE Ronnie O’Sullivan (the world’s best snooker player).

Beating this guy is SIGNIFICANT in that it challenges the claim that he is who he claims he is:

You aren’t SUPPOSED to beat THE Ronnie O’Sullivan – you can’t even beat Lloyd.

But you did beat this guy, whoever he claims to be.

So, in the end, you decide this man was an impostor, NOT Ronnie O’Sullivan.

In this scenario:

The claim (or “null hypothesis”): “This man is Ronnie O’Sullivan.” You have no reason to question him – you’ve never even heard of snooker.

The alternative claim (or “alternative hypothesis”): “This man is NOT Ronnie O’Sullivan”

The p-value: The likelihood you beat the world’s best snooker player assuming he is, in fact, the real Ronnie O’Sullivan.

Therefore, the p-value is the likelihood an observed outcome (at least as extreme) would occur if the claim were true. A small p-value can cast doubt on the legitimacy of the claim – chances you could beat Ronnie O’Sullivan in a game of pool are slim to none so it is MORE likely he is not Ronnie O’Sullivan. Still puzzled? Here’s a clever video explanation.

Some mathy stuff to consider

The intention of this post is to tie the meaning of this p-value to your decision, in the simplest terms I can find. I am leaving out a great deal of theory behind the sampling distribution of the regression coefficients – but I would be happy to explain it offline. What you do need to understand, however, is your data set is just a sample, a subset, from an unknown population. The p-value output is based on your observed sample statistics in this one particular sample while the variation is tied to a distribution of all possible samples of the same size (a theoretical model). Another sample would indeed produce a different outcome, and therefore a different p-value.

The hypothesis our software is testing

The output below gives a great deal of insight into the regression model used on the pizza data. The statistic du jour for the linear model is the p-value you always see in a Tableau regression output: The p-value testing the SLOPE of the line.

Therefore, a significant or insignificant p-value is tied to the SLOPE of your model.

Recall, the slope of the line tells you how much the pizza price changes for every topping ordered. Slope is a reflection of the relationship of the variables you’re studying. Before you continue reading, you explain to Lloyd that a slope of 0 means, “There is no relationship/association between the number of toppings and the price of the pizza.”

Tableau Regression Output

Zoom into that last little portion – and look at the numbers below:

Panes: Row = Pizza Price, Column = Toppings
Line: p-value < 0.0001, DF = 19

Coefficients:
Term        Value   StdErr      t-value   p-value
Toppings    1.25    0.0993399   12.5831   < 0.0001
intercept   15      0.362738    41.3521   < 0.0001

In this scenario:

The claim: “There is no association between number of toppings ordered and the price of the pizza.” Or, the slope is zero (0).

The alternative claim: “There is an association between the number of toppings ordered and the price of the pizza.” In this case, the slope is not zero.* 

The p-value: Assuming there is no relationship between the number of toppings and the price of the pizza, the likelihood of obtaining a slope at least as extreme as $1.25 per topping is less than 0.01%.

The p-value is very small** — a slope at least as extreme as $1.25 would happen less than 0.01% of the time just by chance. This small p-value means you have evidence of a relationship between the number of toppings and the price of a pizza.
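You can sanity-check the output yourself: the t-value is just the slope divided by its standard error, and the p-value follows from the t-distribution with 19 degrees of freedom. A sketch, using the numbers from the table above:

```python
from scipy import stats

slope, std_err, df = 1.25, 0.0993399, 19
t_value = slope / std_err              # about 12.5831, matching the output
p_value = 2 * stats.t.sf(t_value, df)  # two-tailed p-value, far below 0.0001
print(round(t_value, 4), p_value)
```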

What the P-Value is NOT

  • The p-value is not the probability the null hypothesis is false – it is not the likelihood of a relationship between number of toppings and the price of pizza.
  • The p-value is not evidence we have a good linear model – remember, it’s only testing a relationship between the two variables (slope) based on one sample.
  • A high p-value does not necessarily mean there is no relationship between pizza price and number of toppings – when dealing in samples, chance variability (differences) and bias are present, leading to erroneous conclusions.
  • A statistically significant p-value does not necessarily mean the slope of the population data is not 0 – see the last bullet point. By chance, your sample data may be “off” from the full population data.

The P-Value is Not a Green Light


The p-value here gives evidence of a relationship between the number of toppings ordered and the price of the pizza – which was already determined in part 1. (If you want to get technical, the correlation coefficient R is used in the formula to calculate slope.)

Applying a regression line for prediction requires the examination of all parts of the model. The p-value given merely reflects a significant slope — recall there is additional error (residuals) to consider and outside variables acting on one or both of the variables.

Ultimately, Pearson’s Pizza CAN apply the linear model to predict pizza prices from number of toppings. But only within reason. You decide not to predict for pizza prices when more than 5 toppings are chosen because, based on the residual plot, the prediction error is too great and the variation may ultimately hurt long-term budget predictions.

In a real business use case, the p-value, R-Squared, and residual plots can only aid in logical decision-making. Lloyd now realizes, thanks to your expertise, that using data output just to say he’s “data-driven” without proper attention to detail and common sense is unwise.

Statistical methods can be powerful tools for uncovering significant conclusions; however, with great power comes great responsibility.


—Anna Foard is a Business Development Consultant at Velocity Group

 

*Note this is an automatic 2-tailed test. Technically it is testing for a slope of at least $1.25 OR at most -$1.25. For a 1-tailed test (looking only for greater than $1.25, for example) divide the p-value output by 2. For more information on the t-value, df, and standard error I’ve included additional notes and links at the bottom of this post.

**How small is small? Depends on the nature of what you are testing and your tolerance for “false negatives” or “false positives”. It is generally accepted practice in the social sciences to consider a p-value small if it is under 0.05, meaning an observation at least as extreme would occur less than 5% of the time by chance if the claim were true.

More information on t-value, df:

For those curious about the t-value, this statistic is also called the “test statistic” (it is compared against a critical value from the Student’s t-distribution). The t-value works like a z-score: it is a standardized value indicating how far a slope of “1.25” falls from the hypothesized slope of 0, taking into account sample size and variation (standard error).

In t-tests for regression, degrees of freedom (df) is calculated by subtracting the number of parameters being estimated from the sample size. In this example, there are 21 – 2 degrees of freedom because we started with 21 independent points, and there are two parameters to estimate, slope and y-intercept. 

Degrees of freedom (df) represents the amount of independent information available. For this example, n = 21 because we had 21 pieces of independent information. But since we used one piece of information to calculate slope and another to calculate the y-intercept, there are now n – 2 or 19 pieces of information left to calculate the variation in the model, and therefore the appropriate t-value.