From Data to a Story


To be fair, there are probably thousands of people more qualified to write about Data Storytelling than I. But since it’s a topic I love and a topic I teach, I will do my best to cover a few points for the curious in this post – especially the overlooked step of FINDING the data story.

What is Data?

You can Google the definition of data and find a slew of different answers, but one simplistic way to think about data is how my favorite Statistics textbook defines the term: “Data are usually numbers, but they are not ‘just numbers.’ Data are numbers with context.” (Yates, et al. The Practice of Statistics 3rd Edition.) So when we pull quantitative variables and qualitative (categorical) variables together, we are essentially giving meaning to what was otherwise a set of lonely numerical values.

I received some great answers to this tweet here.

What is Data Storytelling?

In my humble opinion, data storytelling takes the data (aka numbers in context) and not only translates it into consumable information, but also creates a connection between the audience and the insights to drive some action. That action might be a business decision or a “wow, I now appreciate this topic” response, depending on the context and audience.

A Short Guide to Data Storytelling

There are a few steps to telling a data story, and they could get complicated depending on your data type and analysis. And since your and your audience’s interests, background, and ability to draw conclusions play into your storytelling, this process of finding and telling a data story could easily detour and fall into rabbit holes. Here I will map out a few general steps for both exploratory and explanatory analysis to help you simplify the complex, both in process and in message.

1. Define Your Audience and Determine Their Objectives

This is important. And you’ll need to continue circling back to your audience throughout all of the steps below. If you know your audience’s goals, you can more easily cancel out the noise in your data and define the right questions and metrics along the way.

2. Find the Data Story: Exploratory Analysis

a) Make a Picture

To tell a data story, you have to find the data story. And that begins with the exploratory analysis of your dataset – which should always begin with exploring your data visually. When I taught AP Stats I told the students the same thing I tell you now: When you get your hands on a set of data, MAKE A PICTURE. There are so many things a chart or graph (or multiple charts and graphs) can tell you about the data that tables and summary statistics cannot (including errors). See Anscombe’s Quartet for a demonstration of WHY.
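If you’d like to see Anscombe’s point for yourself, here’s a minimal sketch using the published quartet values and numpy (assuming you have it handy): four datasets with nearly identical summary statistics that look wildly different once plotted.

```python
# Anscombe's Quartet: four datasets with (nearly) identical summary statistics
# that tell four very different stories once you MAKE A PICTURE.
import numpy as np

x_123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x_123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x_123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x_123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

for name, (x, y) in quartet.items():
    x, y = np.array(x), np.array(y)
    r = np.corrcoef(x, y)[0, 1]
    print(f"Set {name}: mean_x={x.mean():.2f}  mean_y={y.mean():.2f}  r={r:.3f}")

# Every set prints roughly mean_x=9.00, mean_y=7.50, r=0.816 --
# the table looks "the same" until you plot it.
```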

Here I have a dataset from Wikipedia – Ironman World Championship Medalists. I must give credit to Eva Murray and Andy Kriebel for putting this data into .csv form for a 2019 Makeover Monday challenge.

What story can we tell about this data from this format? It’s possible to draw some conclusions based on patterns we might be able to pull with our eyes; however, nothing exact and nothing conclusive. Instead, you might use data visualization tools to create charts and graphs — something like Tableau or Excel, or if you have time on your hands, a whiteboard and dry erase markers — to tease out a story.

b) Ask Questions. And Keep Asking Questions.

So where do you start? I always start with questions. Like, “Which countries have had the most medals in the Ironman Championship?”

We can also ask, “Which countries are the fastest, on average?” I’ve sorted low to high to reflect the fastest countries on top (faster = shorter duration):

Oh that is interesting – I did not expect to see the US at the bottom of the list given they have the most medals in the dataset. But since I looked at all medals and did not only consider GOLD medals, I might now want to compare countries with the most gold medals. It’s possible the US won mostly bronze, right?

Interesting! I did NOT expect to see the US at the top of this chart after my last analysis. Hmmmm…

c) Don’t Assume, Ask “Why?”

And as you continue to ask questions, you’ll pull out more interesting insights. Since the distribution of gold medals looks similar to the distribution of all medals, I’m still quite curious why the average times for the US are higher than those of other countries when they have an overwhelmingly large number of gold medals (and gold medals = fastest times). There MUST be some other variable confounding this comparison. So now I’ll slice the data by other variables — starting with gender, comparing the overall average finish times of males vs females:

So if males are, on average, faster than females, then maybe we should look at the breakdown of males vs females among countries. Is it possible the distribution of males and females differs by country?

Note: Since this is a stacked bar chart, it may appear that the graph is unsorted; however, when there are two genders present, the length of the bar represents the sum of both the overall average male finish times and overall average female finish times. The chart is still sorted by overall average time, as you see 2 charts above.

Hmmm, it looks like the top countries, by average time, are represented by male athletes only. Meanwhile, females, whose times are slightly slower than males, make up a large proportion of the remaining countries.

Since the above chart is comparing average race times, let’s bring in some counts to compare the number (and ultimately proportion) of male vs female athletes in each country.

Alright, so now it makes sense that the overall average time for countries with a large proportion of female medalists will appear slightly longer than those with only males. Next I’d like to compare male and female finish times for ONLY gold medalists:

Wow, interesting! Of the 9 countries with gold medals, 8 of them have female representation on the 1st place podium. And 4 of those 9 countries ONLY have female representation at gold. But this doesn’t explain why the US has taken home so many more gold medals than other countries, while the overall average finish times for US finishers (and gold medalists too) are slower! What’s going on?

Is the YEAR a factor? When did the US win gold medals?

In the above graph, each country is now represented by a line (or two lines, if both females and males won gold medals for that country). I’ve used red to highlight the US and shades of gray to push the other countries into the background. Looking at the graph, we can see the overall finish times have decreased over time. AND we can see the US, for both males AND females, hasn’t won gold medals in over a decade. So the two confounding variables behind our race time paradox, as we could call it, were gender AND year.

d) Focus on One Story

At this point I need to narrow down my context and define what questions I want to ask and which metrics will answer those questions. It’s hard to listen to someone’s story when it has a million tangents, right? So don’t tell those meandering stories with data, either. Pick a topic – and don’t forget to consider your audience. I’ve decided to step away from the specific countries, and compare only those GOLD MEDAL race times by gender over time. I’m curious to learn more about WHY those finish times fell over time!

e) Find the Right Analysis for Your Story

Once you determine which variables are involved in your analysis, you can choose the appropriate chart to dig deeper into the insights. Because I chose to focus on gold medal finish times over the years, a time series analysis is appropriate. Here is a great reference tool for matching your analysis to your questions:

Let’s bring in years and create a timeline:

And clearly there is more to investigate here – like why, if I stick with Tableau’s default aggregation of SUM, do I see a spike here? After some Googling – ah, that year TWO separate Ironman Championships were held. So even though I’m looking only at the male and female gold medal times, there’s one year that’s doubled because there are two gold medal times per gender. Easy fix: let’s change the aggregation to AVERAGE, which will only affect this one year. Now let’s tease out our big story by looking at our chart and asking more questions:

3. Explanatory Analysis

Once you’ve explored the data and you’ve asked and (tried to) answer questions arising from the analysis, you’re ready to pull the story together for your audience.

a) Design with the Audience in Mind

Without calling attention to it, I began this step above. As you can see, I changed up the colors in my charts when I brought in the variable of gender. I did this for you, my audience, so you could easily pick out the differences in the race times for males and females. This is called “leveraging pre-attentive attributes,” which basically means I’ve used color to help you see the differences without consciously thinking about it. In my final version I will need to make sure the difference in gender is clear and easy to compare (more on this later).

Color also needs to be chosen with the audience in mind. Not only do the colors need to make sense (here I chose the Ironman brand colors to distinguish gender), they also need to be accessible for all viewers. For example, the colors chosen need to have enough contrast for people with color vision deficiency. (For more in-depth information about the use of color, Lisa Charlotte Rost has an excellent resource on the Datawrapper blog.) Too much of a good thing is never a good thing, so I chose to keep all other colors in the chart neutral so other elements of the story do not compete for the audience’s attention.

Also, without telling you, I’ve stripped away some of the unnecessary “clutter” in my chart by dumping grid lines and axis lines. When in doubt, leave white space in your charts to maximize the “data-ink ratio” – save your ink for the data and skip the background noise when possible.

Finally, I’m sticking with my old standard font (Georgia) for the axes and, eventually in the final version, the title. A serif typeface, Georgia is easy to read on small screens or screens with low resolution.

b) Call Attention to the Story/Insights/Action

Here I’ve answered a couple of questions that arose from the data, and you, my audience, never had to switch tabs to go looking for answers. (Re-reading my words, I feel like I must sound like Grover in The Monster at the End of This Book.)

That increase in race time in 1981? A course change! Only a slight increase in time for males, but a big slow down for the females.

Also note I included an annotation for the two races in 1982.

It’s interesting to see the steady decline in finish times throughout the 1980s — what might have caused this decline? And remember, a decline in finish times means GOLD MEDALISTS GOT FASTER.

And finally, I called attention to the spike in 2004. Apparently the gold medal finisher was DQ’ed for doping…which makes me wonder about the sharp decline in the 1980s.

c) Leverage the Title

In the words of Kurt Vonnegut, Pity the Readers. Be clear and concise, and don’t assume they understand your big words and complicated jargon. Keep it simple! One thing I left out of my original title was the very specific use of ONLY gold medal data here. Without that information, the reader might think we were averaging all of the medalists’ times each year!

But is it clear what the red and black colors mean? And are there any additional insights I can throw into my title without being verbose?

Color legends take up dashboard space and I’m a bit keen on the white space I’ve managed to leave in my view. In my final version, I’ve colored the words MALE and FEMALE in the subtitle to match the colors used in the chart, serving as a simple color legend.

A subtitle can help guide the audience to a specific insight, in this case the overall decline in gold medal times by both males and females since the first Ironman World Championship in 1978.

Other Considerations

My final version is interactive, as you can see here. I’ve added notes to the tooltip to display the percent change in winning times by gender each year when the audience interacts with the chart:

If you’re using multiple charts for your data story, you’ll also want to consider layout and how the audience will likely consume that information. Tableau has an excellent article on eye-tracking studies to help data story designers create a flow with minimal cognitive load and maximum impact.

Other Data Storytelling Resources

Storytelling with Data – Cole Nussbaumer Knaflic

Info we Trust – RJ Andrews

Data Storytelling: The Ultimate Collection of Resources

Struggling with Uncertainty : The Role of Variability

“Uncertainty is the only certainty there is, and knowing how to live with insecurity is the only security.”
― John Allen Paulos

What is Uncertainty and Why Does it Matter?

At a high level, “uncertainty” is the unknown. You might believe it’s as abstract as chaos, but in fact a truth does exist – a “true” value, or parameter, is out there – we just don’t know it. In fact, uncertainty is a certainty when working with data. We use samples of data in time to make decisions on the current and future state of business. And to look for truths in data, we often estimate or use probability to attempt to capture that value, or ranges of values, based on known data/observations.

I teach people how to work with, understand, and garner insights from data. I’ve also noticed there are two kinds of clients I come across in the consulting world:

  1. Those who would like to learn HOW to visualize, measure, and understand uncertainty to help make better organizational decisions.
  2. Those who have never considered the impact of uncertainty on organizational decisions.

As I suggested above, working with uncertainty starts with working with sample sets of data – whatever we can get our hands on. From our observed data, we can make generalizations about a phenomenon or event that may impact our organization by estimating the probability they will occur. We do this with point estimates, intervals, or broad language. For example:

  • Mr. X has a 40% chance of becoming the new CEO.
  • The proportion of revenue expected to come from our small business division this quarter is approximately 54%, with a 3% margin of error.
  • The Pacific Region will likely merge with the North Pacific Region next year.

Building probability models to uncover these predictions is another conversation for another day; however, it is possible to take a baby step towards understanding uncertainty and working with probabilities. In my opinion, the first step on that journey is learning more about how we interpret variation/variability in data. Why? Because accuracy and precision in measuring a probability depend on how well we’ve measured and contained the variability in the data. How we interpret probabilities depends on how we understand the difference between significance and natural variation.

Variation

From Merriam-Webster:

Definition of variation

  1. the act or process of varying : the state or fact of being varied
  2. an instance of varying
  3. the extent to which or the range in which a thing varies

Ah. Isn’t that helpful?


Google, save me!

Variation:

  1. A change or difference in condition, amount, or level, typically with certain limits
  2. A different or distinct form or version of something

Variability:

lack of consistency or fixed pattern; liability to vary or change.

Why Does Variation Matter?

Here’s the thing — there’s no need to study/use data when everything is identical. It’s the differences in everything around us that creates this need to use, understand, and communicate data. Our minds like patterns, but a distinction between natural and meaningful variation is not intuitive – yet it is important.

Considering how often we default to summary statistics in reporting, it’s not surprising that distinguishing between significant insights and natural variation is difficult. Not only is it foreign, the game changes depending on your industry and context.

What’s So Complicated About Variation?

Let’s set the stage to digest the concept of variation by identifying why it’s not an innate concept.

At young ages, kids are taught to look for patterns, describe patterns, and predict the next value in a pattern. They are also asked, “which one of these is not like the other?”

But this type of thinking generally isn’t cultivated or expanded. Let’s look at some examples:

1) Our brains struggle with relative magnitudes of numbers.

large numbers

We have a group of 2 red and 2 blue blocks. Then we start adding more blue blocks. At what point do we say there are more blue blocks? Probably when there are 3 blue, 2 red, right?

Instead, what if we started with 100 blocks? Or 1000? 501 blue/499 red still seems about the same, right? Understanding how the size of the group modifies the response is learned – as sample/population size increases, variability ultimately decreases.

Something to ponder: When is $300 much different from $400? When is it very similar?

“For example, knowing that it takes only about eleven and a half days for a million seconds to tick away, whereas almost thirty-two years are required for a billion seconds to pass, gives one a better grasp of the relative magnitudes of these two common numbers.”
― John Allen Paulos, Innumeracy: Mathematical Illiteracy and Its Consequences

We understand that 100 times 10 is 1000. And mathematically, we understand that 1 Million times 1000 is 1 Billion. What our brains fail to recognize is the difference between 1000 and 100 is only 900, but the difference between 1 million and 1 billion is 999,000,000! We have trouble with these differences in magnitudes:
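If you want to convince yourself of the Paulos example, the arithmetic is quick to check. Here’s a back-of-the-envelope sketch (using 365-day years):

```python
# How long is a million seconds vs a billion seconds?
million_seconds_in_days = 1_000_000 / (60 * 60 * 24)              # ~11.6 days
billion_seconds_in_years = 1_000_000_000 / (60 * 60 * 24 * 365)   # ~31.7 years
print(round(million_seconds_in_days, 1), "days")
print(round(billion_seconds_in_years, 1), "years")

# Same multiplier (times 1,000), wildly different absolute gaps:
print(1_000 - 100)                    # 900
print(1_000_000_000 - 1_000_000)      # 999,000,000
```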

magnitudes

We kind of “glaze” over these concepts in U.S. math classes. But they are not intuitive!

2) We misapply the Law of Large Numbers

My favorite example of misapplying the Law of Large Numbers is The Gambler’s Fallacy, or Monte Carlo Fallacy. Here’s an easy example:

Suppose I flip a fair coin 9 times in a row and it comes up heads all 9 times. What is your prediction for the 10th coin flip? If you said tails because you think tails is more likely, you just fell for the Gambler’s Fallacy. In fact, the probability for each coin flip is exactly the same each time AND each flip is independent of the others. The fact that the coin came up heads 9 times in a row is not known to the coin, or gravity for that matter. It is natural variation in play.

The Law of Large Numbers does state that as the number of coin flips increases (n>100, 1000, 10000, etc.), the observed proportion of heads gets closer and closer to 50%. However, the Law of Large Numbers does NOT play out this way in the short run — and casinos cash in on this fallacy.

Oh, and if you thought the 10th coin flip would come up heads again because it had just come up heads 9 times before, you are charged with a similar fallacy, called the Hot Hand Fallacy.
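If you’d rather see this than take my word for it, here’s a rough coin-flip simulation (plain Python, seeded so you can rerun it): the 10th flip after a streak is still a 50/50 proposition, yet the overall proportion of heads settles toward 50% only as the number of flips grows.

```python
# Simulating the Gambler's Fallacy and the Law of Large Numbers with a fair coin.
import random

random.seed(1)
flip = lambda: random.choice("HT")

# Gambler's Fallacy check: after a genuine run of 9 heads, is tails "due"?
tenth_flips = []
while len(tenth_flips) < 1_000:
    if all(flip() == "H" for _ in range(9)):   # wait for an actual 9-heads streak
        tenth_flips.append(flip())             # record what the 10th flip does
print("P(heads | 9 heads in a row):",
      tenth_flips.count("H") / len(tenth_flips))   # still ~0.50

# Law of Large Numbers: the observed proportion creeps toward 0.5 as n grows.
for n in (10, 100, 1_000, 100_000):
    heads = sum(flip() == "H" for _ in range(n))
    print(f"n = {n:>6}: proportion of heads = {heads / n:.3f}")
```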

3) We rely too heavily on summary statistics

stats-cartoon

Once people start learning “summary statistics”, variation is usually only discussed as a function of range. Sadly, range only considers the minimum and maximum values and ignores any other variation within the data. Learning beyond “range” as a measure of variation/spread also helps home in on the differences between mean and median and when to use each.

Standard deviation and variance also measure variation; however, the calculation relies on the mean (average), and when there is a lack of normality in the data (e.g. the data is strongly skewed), standard deviation and variance can be inaccurate measures of the spread of the data.

In relying on summary statistics, we find ourselves looking for that one number – that ONE source of truth to describe the variation in the data. Yet, there is no ONE number that clearly describes variability – which is why you’ll see people using the 5-number summary and interquartile range. But the lack of clarity in all summary statistics makes the argument for visualizing the data.
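To make the range’s blind spot concrete, here’s a tiny sketch with two made-up datasets: their range is identical, but the five-number summaries and IQRs expose how differently the values actually spread out.

```python
# Same range, very different variability -- min and max alone hide the story.
import numpy as np

steady    = np.array([10, 48, 49, 50, 50, 51, 52, 90])
scattered = np.array([10, 20, 35, 50, 50, 65, 80, 90])

for name, data in (("steady", steady), ("scattered", scattered)):
    q1, med, q3 = np.percentile(data, [25, 50, 75])
    print(f"{name:>9}: range={data.max() - data.min()}  "
          f"std={data.std(ddof=1):.1f}  IQR={q3 - q1:.1f}  "
          f"5-number summary=({data.min()}, {q1:.1f}, {med:.1f}, {q3:.1f}, {data.max()})")
```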

When working with any kind of data, I always recommend visualizing the variable(s) of interest first. A simple histogram, dot plot, or box-and-whisker plot can be useful in visualizing and understanding the variation present in the data.

quartiles

Start Simple: Visualize Variation

Before calculating and visualizing uncertainty with probabilities, start with visualizing variation by looking at the data one variable at a time, at a granular or disaggregated level.

Box-and-whisker plots can not only show you outliers; they can also give a comparison of consistency within a variable:

variation box and whiskers

Simple control charts can capture natural variation for high-variability organizational decision-making, such as staffing an emergency room:

variation control chart

Think of a histogram as a bar graph for continuous metrics. Histograms show the distribution of the variable (here, diameters of tortillas) over a set of bins of the same width. The width of the bar is determined by the “bin size” – smaller ranges of tortilla diameters – and the height of the bar measures the frequency, or how many tortillas fall within that range. For example, the tallest bar indicates there are 26 tortillas measuring between (approximately) 6.08 and 6.10 cm.

I can’t stress enough the importance of changing the bin size to explore the variation further.

variation histograms

Notice the histogram with the wider bin size (below) can hide some of the variation you see above. In fact, the tortillas sampled for this process came from two separate production lines – which you can conclude from the top histogram but not the bottom one – emphasizing the importance of looking at variability at a more granular level.

variation histogram wide
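If you want to play with this effect yourself, here’s a hedged sketch using simulated diameters (not the real tortilla measurements) and matplotlib: the narrow bins reveal the two production lines, and the wide bins erase them.

```python
# Same simulated bimodal sample, two bin widths: wide bins hide the two "production lines".
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)
line_a = rng.normal(6.02, 0.02, 150)     # pretend production line A
line_b = rng.normal(6.09, 0.02, 150)     # pretend production line B
diameters = np.concatenate([line_a, line_b])

fig, (top, bottom) = plt.subplots(2, 1, figsize=(7, 5), sharex=True)
top.hist(diameters, bins=30)             # narrow bins: two peaks show up
top.set_title("Narrow bins: two production lines are visible")
bottom.hist(diameters, bins=6)           # wide bins: the bimodality disappears
bottom.set_title("Wide bins: the variation is hidden")
bottom.set_xlabel("Tortilla diameter (cm)")
plt.tight_layout()
plt.show()
```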

Resources

Recently, Storytelling with Data blogged about visualizing variability. I’m also a fan of Nathan Yau’s Flowing Data post about visualizing uncertainty.

Brittany Fong has a great post about disaggregated data, as well as Steve Wexler’s post on Jitter Plots.

I plan a follow-up post diving more into probabilities and uncertainty. For now, I’m going to leave you with this cartoon from XKCD, called “Certainty.”

certainty

How Laser Tag Helped Students Learn About Data

How do you get a group of 15 to 18-year-old students interested in data prep and analysis? Why, you take them to play laser tag, of course!

That’s right, on a cold January day I loaded up two buses of teens and piloted them to an adventure at our local Stars and Strikes. And this is no small feat — this particular trip developed out of months of planning, and after years of proclaiming that I would never ever ever ever EVER coordinate my own field trip for high school kids. I mean, you should SEE the stack of paperwork. And the level of responsibility itself made me anxious.

I’m a parent so I get it. And from a teacher’s point of view, many field trips aren’t worth the hassle.

So there I was, field trip money in one hand, clipboard in another: Imagine a caffeinated Tracy Flick. But thanks to the help of two parent chaperones and the AP Psychology teacher (Coach B), we ran the smoothest data-related field trip modern education has ever known.

What Does Laser Tag Have to do With Statistics?

Statistics textbooks are full of canned examples and squeaky clean data that often have no bearing on students’ interests. For example, there is an oh-so-relatable exercise computing standard error for D-glucose contained in a sample of cockroach hindguts. In my experience I’ve learned that when students can connect to the data, they are able to connect to the concept. We’re all like that, actually — producing/collecting our own data enables us to see what we otherwise would have missed.

(I can assure you confidence intervals constructed from D-glucose in cockroach hindguts did little for understanding standard error.)

The real world is made up of messy data. It’s full of unknowns, clerical errors, bias, unnecessary columns, confusing date formats, missing values; the list goes on. Laser tag was suggested to me as a way to collect a “large” amount of data in a relatively short amount of time. And because of the size of the dataset, it required the students to input their own data — creating their own version of messy data, complete with clerical errors. From there they’d have to make sense of the data, look for patterns, and form hypotheses.

The Project

  • Students entered their data into a Google doc — you can find the complete data here.
  • Each partner team developed two questions for the data: One involving 1-variable analysis, another requiring bivariate analysis.
  • The duos then had to explore, clean, and analyze all 47 rows and 48 columns. At this point in the school year, students had been exposed to data up to about 50 rows, but never had they experienced “wide” data.
  • Analyses and presentations required a visualization, either using Excel or Tableau.

Partner projects lend themselves to fantastic analyses, with half the grading

Playing the Games

Methodology: Each student was randomly assigned to a team using a random number generator. Teams of 5 played each other twice during the field trip. The teams were paired to play each other randomly. If, by chance, a team was chosen to play the same team twice, that choice would be ignored and another random selection would be made until a new team was chosen.
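For the curious, here’s roughly what that randomization could look like in code, sketched with a hypothetical roster of 40 students (the real headcount and names differ): teams of 5 are built at random, and any round that would repeat a matchup is re-drawn.

```python
# Random team assignment and game pairing, re-drawing repeat matchups.
import random

random.seed(2019)
students = [f"Student {i}" for i in range(1, 41)]            # hypothetical roster

random.shuffle(students)
teams = [students[i:i + 5] for i in range(0, len(students), 5)]   # 8 teams of 5

played = set()
for round_no in (1, 2):                                      # each team plays two games
    while True:
        order = random.sample(range(len(teams)), len(teams))
        pairs = [tuple(sorted(order[i:i + 2])) for i in range(0, len(order), 2)]
        if all(p not in played for p in pairs):              # repeat matchup? draw again
            break
    played.update(pairs)
    for a, b in pairs:
        print(f"Round {round_no}: Team {a + 1} vs Team {b + 1}")
```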

Before each game, I recorded which student wore which laser tag vest number. From the set-up room (see above picture), I could view which vest numbers were leading the fight and which team had the lead. It was entertaining. As the students (and Coach B — we needed one more player for even teams) finished their games, score cards were printed and I handed each student their own personal results. The words, “DON’T lose this” exited my lips often.

Upon our return to school (this only took a few hours, to the students’ dismay), results were already pouring into the Google doc I’d set up ahead of time.

Teaching Tableau and Excel Skills

The AP Statistics exam is held every year in May, hosted by The College Board. On the exam, students are expected to use a graphing calculator but have no access to a computer or Google. Exactly the opposite of the real world.

Throughout the course, I taught all analysis first by hand, or using the TI-83/84. As students became proficient, I added time in the computer lab to teach basic skills using Excel and Tableau (assignments aligned to the curriculum while teaching skills in data analysis). It was my goal for students to have a general understanding of how to use these “real world” analytics tools while learning and applying AP Statistics curriculum.

After the field trip, we spent three days in the computer lab – ample time to work in Tableau and Excel with teacher guidance. Students spent time exploring the 48-column field trip dataset with both Excel and Tableau. They didn’t realize it, but by deciding which chart type to use for different variables, they were actually reviewing content from earlier in the year.

When plotting bivariate quantitative data, a scatterplot is often the go-to chart

Most faculty members had never heard of Tableau. At lunch one day I sat down with Coach B to demonstrate Tableau’s interface with our field trip dataset.

“What question would you ask this set of data?” I asked.

“A back shot is a cheap shot. I wonder who is more likely to take a cheap shot, males or females?”

So I proceeded to pull up a comparison and used box-and-whisker plots to look for outliers. Within seconds, a large outlier was staring back at us within the pool of male students:

“Ha. I wonder who that was.” – Coach B

“That’s YOU.” – Me

From there, I created a tongue-in-cheek competitive analysis from the data:

Full color version found here.

Student Response

I’ve been teaching since 2004. Over the years, this was probably the most successful project I’ve seen come through my classroom. By “successful”, I mean the proportion of students who were able to walk outside of their comfort zone and into a challenging set of data, perform in-depth analyses, then communicate clear conclusions was much higher than in all previous years.

At the end of the year, after the AP Exam, after grades were all but inked on paper, students still talked excitedly about the project. I’d like to think it was the way I linked a fun activity to real-world analysis, though it most likely has to do with getting out of school for a few hours. Either way, they learned something valuable.

Univariate Analysis

One student, Abby, gave me permission to share her work adding, “This is the project that tied it all together. This was the moment I ‘got’ statistics.”

Interestingly, students were less inclined to suggest the female outlier of 2776 shots was a clerical mistake (which it was). I found there were two camps: Students who didn’t want to hurt feelings, and students who think outliers in the wild need no investigation. Hmmm.

Bivariate Analysis

For a group of kids new to communicating stats, I thought this was pretty good. We tweaked their wording (to be more contextual) as we dove into more advanced stats, but their analysis was well thought through.

What I Learned

When you teach, you learn.
Earlier I said the project was a success based on the students’ results. That’s only partially true; it was also a success because I grew as an educator. After years of playing by the rules I realized that sometimes you need to get outside your comfort zone. For me that was two-fold: 1) Sucking it up and planning a field trip and 2) Losing the old, tired TI-83 practice problems and teaching real-world analytics tools.

The Box-and-Whisker Plot For Grown-Ups: A How-to

Author’s note: This post is a follow-up to the webinar, Percentiles and How to Interpret a Box-and-Whisker Plot, which I created with Eva Murray and Andy Kriebel. You can read more on the topic of percentiles in my previous posts.

No, You Aren’t Crazy.

That box-and-whisker plot (or, boxplot) you learned to read/create in grade school probably IS different from the one you see presented in the adult world.

versions

The boxplot on the top originated as the Range Bar, published by Mary Spear in the 1950s, while the boxplot on the bottom is a modification created by John Tukey to account for outliers. Source: Hadley Wickham

As a former math and statistics teacher, I can tell you that (depending on your state/country curriculum and textbooks, of course) you most likely learned how to read and create the former boxplot (or, “range bar”) in school for simplicity. Unless you took an upper-level stats course in grade school or at University, you may have never encountered Tukey’s boxplot in your studies at all.

You see, teachers like to introduce concepts in small chunks. While this is usually a helpful strategy, students lose when the full concept is never developed. In this post I walk you through the range bar AND connect that concept to the boxplot, linking what you’ve learned in grade school to the topics of the present.

The Kid-Friendly Version: The Range Bar

In this example, I’m comparing the lifespans of a small, non-random set of animals. I chose this set of animals based solely on convenience of icons. Meaning, conclusions can only be drawn on animals for which Anna Foard has an icon. I note this important detail because, when dealing with this small, non-random sample, one cannot infer conclusions on the entire population of all animals.

1) Find the quartiles, starting with the median

Quartiles break the dataset into 4 quarters. Q1, median, Q3 are (approximately) located at the 25th, 50th, and 75th percentiles, respectively.

Finding the median requires finding the middle number when values are ordered from least to greatest. When there is an even number of data points, the two numbers in the middle are averaged.

median
Here the median is the average of the cat’s and dog’s longevity. NOTE: If, with an even number of values, the two middle values were different, the lower of the two would be at the 50th percentile and would not be the same measure as the median.

Once the median has been located, find the other quartiles in the same way: The middle value in the bottom set of values (Q1), then the middle value in the top set (Q3).

quartiles
Here we can easily see when quartiles don’t match up exactly with percentiles: Even though Q1 = 8.5, the duck (7) is in the 25th percentile while the pig is above the 25th percentile. And the sheep is in the 75th percentile despite the value of 17.5 at Q3.
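Here’s a minimal sketch of that “find the middle, then the middles of each half” process. The lifespans below are made-up stand-ins chosen so the quartiles land where they do in this post (Q1 = 8.5, Q3 = 17.5), not the actual animal data.

```python
# Five-number summary the by-hand way: median first, then the median of each half.
def median(values):
    values = sorted(values)
    n, mid = len(values), len(values) // 2
    return values[mid] if n % 2 else (values[mid - 1] + values[mid]) / 2

lifespans = sorted([3, 7, 8, 9, 10, 12, 14, 16, 17, 18, 22, 40])   # hypothetical years

med = median(lifespans)
lower_half = lifespans[: len(lifespans) // 2]     # values below the median
upper_half = lifespans[len(lifespans) // 2 :]     # values above the median
q1, q3 = median(lower_half), median(upper_half)

print(f"min={min(lifespans)}, Q1={q1}, median={med}, Q3={q3}, max={max(lifespans)}")
# -> min=3, Q1=8.5, median=13.0, Q3=17.5, max=40
```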

2) Use the Five Number Summary to create the Range Bar

The first and third quartiles build the “box”, with the median represented by a line inside the box. The “whiskers” extend to the minimum and maximum values in the dataset:

range bar

But without the points:

animal box no points

The Range Bar probably looks similar to the first box-and-whisker plot you created in grade school. If you have children, it is most likely the first version of the box-and-whisker plot that they will encounter.

school example
from elementary school Pinterest

Suggestion:

Since the kid’s version of the boxplot does not show outliers, I propose teachers call this version “The Range Bar,” as it was originally dubbed, so as not to confuse those reading the chart. After all, someone looking at this version of a boxplot may not realize it does not account for outliers and may draw the wrong conclusion.

The Adult Version: The Boxplot

The only difference between the range bar and the boxplot is the view of outliers. Since this version requires a basic understanding of the concept of outliers and a stronger mathematical literacy, it is generally introduced in a high school or college statistics course.

1) Calculate the IQR

The interquartile range is the difference, or spread, between the third and first quartile reflecting the middle 50% of the dataset. The IQR builds the “box” portion of the boxplot.

IQR 2

2) Multiply the IQR by 1.5

IQR

3) Determine a threshold for outliers – the “fences”

1.5*IQR is then subtracted from the lower quartile and added to the upper quartile to determine a boundary or “fences” between non-outliers and outliers.

IQR 1

4) Consider values beyond the fences outliers

outliers

Since no animal’s lifespan can be below -5 years, a low-value outlier is not possible in this particular set of data; however, one animal in this dataset lives beyond 31 years – an outlier on the high end.
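Putting steps 1 through 4 together with the quartiles from the range-bar example (Q1 = 8.5, Q3 = 17.5), and reusing the made-up stand-in lifespans from my earlier sketch, the fences land right at -5 and 31:

```python
# 1.5 * IQR fences: anything outside them gets flagged as an outlier.
q1, q3 = 8.5, 17.5
iqr = q3 - q1                       # 9.0
step = 1.5 * iqr                    # 13.5
lower_fence = q1 - step             # -5.0  (no lifespan can fall this low)
upper_fence = q3 + step             # 31.0  (anything above this is an outlier)

lifespans = [3, 7, 8, 9, 10, 12, 14, 16, 17, 18, 22, 40]   # hypothetical stand-ins
outliers = [v for v in lifespans if v < lower_fence or v > upper_fence]
print(f"IQR={iqr}, fences=({lower_fence}, {upper_fence}), outliers={outliers}")
# -> IQR=9.0, fences=(-5.0, 31.0), outliers=[40]
```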

5) Build the boxplot

Here we find the modification on the “range bar” – the whiskers only extend as far as non-outlier values. Outliers are denoted by a dot (or star).

boxplot 12
The adult version also allows us to apply technology, so I left the points in the view to give a better sense of the distribution.

Advantage: Boxplot

In an academic setting, I use boxplots a great deal. When teaching AP Statistics, I found them helpful for visualizing the data quickly by hand, as they only require summary statistics (and outliers). They also help students compare and visualize center, spread, and shape (to a degree).

When we get into the inference portion of AP Stats, students must verify assumptions for certain inference procedures — often those procedures require data symmetry and/or the absence of outliers in a sample. The boxplot is a quick way for a student to verify assumptions by hand, under time constraints. When I coach doctoral candidates through their dissertation stats, similar assumptions are verified by checking for outliers — using boxplots.

TI83plus
My portable visualization tool

Boxplot Advantages:

  • Summarizes variation in large datasets visually
  • Shows outliers
  • Compares multiple distributions
  • Indicates symmetry and skewness to a degree
  • Simple to sketch
  • Fun to say

laser tag competitive analysis
I took my students on a field trip to play laser tag. Here, boxplots help compare the distributions of tags by type AND compare how Coach B measures up to the students.


So What Could Go Wrong?

Unfortunately, boxplots have their share of disadvantages as well.

Consider:

A boxplot may show summary statistics well; however, clusters and multimodality are hidden.

In addition, a consumer of your boxplot who isn’t familiar with the measures required to construct one will have difficulty making heads or tails of it. This is especially true when your resulting boxplot looks like this:

example q2 and q3 equal
The median value is equal to the upper quartile. Would someone unfamiliar recognize this?

Or this:

example no upper
The upper quartile is the maximum non-outlier value in this set of data.

Or what about this?

example no whiskers
No whiskers?! Dataset values beyond the quartiles are all outliers.


Boxplot Disadvantages:

  • Hides the multimodality and other features of distributions
  • Confusing for some audiences
  • Mean often difficult to locate
  • Outlier calculation too rigid – “outliers” may be industry-based or case-by-case

Variations

Over the years, multiple boxplot variations have been created to display parts (or all) of a distribution’s shape and features.

no whisker
No Whisker Box Plot Source: Andy Kriebel

Going For It

Box-and-whisker plots may be helpful for your specific use case, though not intuitive for all audiences. It may be helpful to include a legend or annotations to help the consumer understand the boxplot.

boxplot overview

Check Yourself: Ticket out the Door

No cheating! Without looking back through this post, check your own understanding of boxplots. Answers can be found in the #MakeoverMonday webinar I recorded with Eva Murray a couple of weeks ago.

data quiz

Cartoon Source: xkcd

How to Build a Cumulative Frequency Distribution in Tableau

When my oldest son was born, I remember the pediatrician using a chart similar to the one below to let me know his height and weight percentile. That is, how he measured up relative to other babies his age. This is a type of cumulative relative frequency distribution. These charts help determine relative position of one data point to the rest of the dataset, showing an accumulating percent of observations for each value. In this case, the chart helps determine how a child is growing relative to other babies his age.

 

boys percentile
Source: CDC

I decided to figure out how to create one in Tableau. Based on the types of cumulative frequency distributions I was used to when I taught AP Stats, I first determined I wanted the value of interest on the horizontal axis and the percents on the vertical axis.

Make a histogram

Using a simple example – US President age at inauguration – I started with a histogram so I could look at the overall shape of the distribution:

histogram pills

age of presidents histogram

Adjust bin size appropriately

From here I realized I already had what I needed in my view – discrete ages on the x-axis and counts of ages on the y-axis. For a wider range of values I would want a wider bin size, but in this situation I needed to resize bins to 1, representing each individual age.

age bins

age of presidents histogram bin 1

Change the marks from bars to a line

age line

Create a table calculation

Click on the green pill on the rows (the COUNT) and add a table calculation.

table calc freq dist

Actually, TWO table calculations

First choose “Running Total”, then click on the box “add secondary calculation”:

table calcs 1

Next, choose “percent of total” as the secondary calculation:

table calcs

Polish it up

Add drop lines…

drop lines

…and CTRL drag the COUNT (age in years) green pill from the rows to labels. Click on “Label” on the marks card and change the marks to label from “all” to “selected”.

label

 

 

And there you have it.

full graph 2
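If you’d like to sanity-check what those two table calculations are doing, here’s the same logic sketched in Python with a small stand-in list of ages (not the full list of presidents): a running total of the counts, then each running total as a percent of the grand total.

```python
# Running total of counts, then percent of total -> cumulative relative frequency.
from collections import Counter

ages = [42, 46, 46, 47, 47, 48, 49, 50, 51, 51,       # hypothetical stand-in values
        52, 54, 55, 55, 56, 57, 61, 64, 65, 69]

counts = Counter(ages)
running = 0
for age in sorted(counts):
    running += counts[age]                             # table calculation 1: running total
    cumulative_pct = 100 * running / len(ages)         # table calculation 2: percent of total
    print(f"age {age}: {cumulative_pct:.0f}% were this age or younger")
```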

Interpreting percentiles

Percentiles describe the position of a data point relative to the rest of the dataset using a percent — that is, the percent of the dataset that falls below that particular data point. Using the baby weights example, the percentile is the percent of all babies of the same age and gender weighing less than your baby.

Back to the US president example.

Since I know Barack Obama was 47 when inaugurated, let’s look at his age relative to the other US presidents’ ages at inauguration:

obama point
13.3% of US presidents were younger than Barack Obama when inaugurated. Source: The Practice of Statistics, 5th Edition

And another way to look at this percentile: 87% of US presidents were older than Barack Obama when inaugurated.

Thank you for reading and have an amazing day!

-Anna

Mean, Median, and Mode: How Visualizations Help Find What’s “Typical”

I was a high school math and statistics teacher for 14 years. And my stats course always began by visualizing the distribution of a variable using a simple chart or graph. One variable at a time, we’d focus on creating, interpreting, and describing appropriate graphs. For quantitative variables, we’d use histograms or dot plots to discuss the distribution’s specific physical features. Why? Data visualization helps students draw conclusions about a population from sample data better than summary statistics alone can.

This post aims to review the basics of how measures of central tendency — mean, median, and mode — are used to measure what’s typical. Specifically, I’ll show you how to inspect distributions of variables visually and dissect how mean, median, and mode behave, in addition to common ways they are used. Ultimately it may be difficult, impossible, or misleading to describe a set of data using one number; however, I hope this journey of data exploration helps you understand how different types of data can affect how we describe what’s typical.

Remember Middle School?

Fair enough — I too try to forget the teased hair and track suit years. But I do recall learning to calculate mean, median, mode, and range for a set of numbers with no context and no end game. The math was simple, yet painfully boring. And I never fully realized we were playing a game of Which One of These is Not Like the Other.

worksheet
middle school worksheet, recreated…why range tho?

It wasn’t until my first college stats course that I realized descriptive statistics serve a purpose – to attempt to summarize important features of a variable or dataset. And mean, median, mode – the measures of central tendency – attempt to summarize the typical value of a variable. These measures of typical may help us draw conclusions about a specific group or compare different groups using one numerical value.

To check off that middle school homework, here’s what we were programmed to do:

Mean: Add the numbers up, divide by the total number of values in the set. Also known as the arithmetic mean and informally called the “average”.

Median: Put the numbers in order from least to greatest (ugh, the worst part) and find the middle number. Oh, there’s two middle numbers? Average them. Did you leave out a number? Start over.

Mode: The number(s) that appear the most.

Repeat until you finish the worksheet.

Because we arrive at mean, median, and mode using different calculations, they summarize typical in different ways. The types of variables measured, the shape of the distribution, the context, and even the size of the set of data can alter the interpretation of each measure of central tendency.

Visually Inspecting Measures of Typical

What do You Mean When You Say, “Mean”?

We’re programmed to think in terms of an arithmetic mean, often dubbed the average; however, the geometric and harmonic means are extremely useful and worth your time to learn. Furthermore, when you want to weigh certain values in a dataset more than others, you’ll calculate a weighted mean. But for simplicity of this post, I will only use the arithmetic mean when I refer to the “mean” of a set of values.
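(For the curious, here’s a quick look at how those three means differ on one tiny made-up set of growth multipliers before we set the geometric and harmonic versions aside.)

```python
# Arithmetic vs geometric vs harmonic mean on the same made-up values.
from statistics import mean, geometric_mean, harmonic_mean

growth = [1.10, 1.50, 0.80, 1.20]            # hypothetical year-over-year multipliers
print("arithmetic:", round(mean(growth), 3))            # 1.15
print("geometric :", round(geometric_mean(growth), 3))  # ~1.12, the right call for compounding
print("harmonic  :", round(harmonic_mean(growth), 3))   # ~1.09, often used for rates
```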

Think of the mean as the balancing point of a distribution. That is, imagine you have a solid histogram of values and you must balance it on one finger. Where would you hold it? For all symmetric distributions the balancing point – the mean – is directly in the center.

female heights
I completely made these up

The Median

Just like the median in the road (or, “neutral ground” if you’re from Louisiana), the median represents that middle value, cutting the set of values in half — 50% of the data values fall below and 50% lie above the median. No matter the shape of the distribution, the median is the measure of central tendency reflecting the middle position of the data values.

The Mode(s)

The mode describes the value or category in a set of data that appears the most often. The mode is specifically useful when asking questions about categorical (qualitative) variables. In fact, mode is the only appropriate measure of typical for categorical variables. For example: What is the most common college mascot? What type of food do college students typically eat? Where are most 4+ Year colleges and universities located?

most colleges labelled
Note: Bar charts don’t have a “shape”, though it is easy to confuse a bar chart with a histogram at first glance. Source: US Dept of Education

Modes are also used to describe features of a distribution. In large sets of quantitative data, values are binned to create histograms. The taller “peaks” of the histogram indicate where more common data values cluster, called modes. A cluster of tall bins is sometimes called a modal range. A histogram with one tall peak is called unimodal, while one with two peaks is referred to as bimodal. Multiple peaks = multimodal.

tuition modes
Example of a bimodal, possibly multimodal, distribution. Source: US Department of Education, 2013

You may notice multiple tall peaks of varying heights in one histogram — despite some bins (and clusters of bins) containing fewer values, they are often described as modes or modal ranges since they contain local maximums.

When the Mean and the Median are Similar

female heights labeled
The shape of this distribution of female’s heights is symmetric and unimodal. Often called bell-shaped, Gaussian, or approximately normal.

The histogram above shows a distribution of heights for a sample of college females. The mean, median, and mode of this distribution are equal at about 66.5 inches. When the shape of the distribution is symmetric and unimodal, the mean, median, and mode are equal.

Now I want to see what happens when I add male heights into the histogram:

all college heights
This distribution of heights of college students is symmetric and bimodal.

This histogram shows the distribution of heights of both male and female college students. It is symmetric, so the mean and median are equal at about 68.5 inches. But you’ll notice two peaks, indicating two modal ranges — one from 66 – 67 inches and another from 70 – 71 inches.

Do the mean and median represent the typical college student height when we are dealing with two distinctly different groups of students?

When the Mean and the Median Differ

In a skewed distribution, the median remains the center of the values; however, the mean is pulled away from the median by extreme values and outliers.

enrollment
The distribution of enrollment for all 4+ year U.S. colleges and universities is strongly skewed to the right. Source: US Dept of Education, 2013

For example, the histogram above shows the distribution of college enrollment numbers in the United States from 2013. The shape of the distribution is skewed to the right — that is, most colleges reported enrollment below 5,000 students. However, the “tail” of the distribution is created by a small number of larger universities reporting much higher enrollment. These extreme outlying values pull the mean enrollment to the right of the median enrollment. 

enrollment labelled
A skewed right distribution – the mean is pulled away from the median, to the right.

Reporting an average enrollment of 7,070 students for colleges in 2013 exaggerates the typical college enrollment since most US colleges and universities reported enrollment under 5,000 students.

The median, on the other hand, is resistant to outliers since it is based on position relative to the rest of the data. The median helps you conclude that half of all colleges enrolled fewer than 3,127 students and half of the colleges enrolled more than 3,127 students.
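A tiny numerical illustration (with fabricated enrollment-style numbers, not the actual Department of Education data) shows how a couple of huge values drag the mean while leaving the median put:

```python
# Skew in action: two giant schools pull the mean far above the "typical" school.
import numpy as np

enrollment = [800, 1200, 1900, 2500, 3100, 3300, 4200, 4800, 30000, 52000]
print("mean  :", np.mean(enrollment))     # 10380.0 -- dragged right by the two giants
print("median:", np.median(enrollment))   #  3200.0 -- stays with the typical school
```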

Depending on your end goal and context, median may provide a better measure of typical for skewed set of data. Medians are typically used to report salaries and housing prices since these distributions include mostly moderate values and fewer on the extremely high end. Take a look at the salaries of NFL players, for example:

nfl salaries with labels
The salary distribution of NFL players in 2018 is strongly skewed to the right.

Are we to only report medians for skewed distributions?

  • The median is not a good description of typical for a very small dataset (eg, n<10, depending on context).
  • The median is helpful when you want to ignore (or lessen effects of) outliers. Of course, as Daniel Zvinca* points out, your data could contain significant outliers that you don’t want to ignore.

average cartoon 2

In school, our grades are reported as means. However, students’ grade distributions can be symmetric or skewed. Let’s say you’re a student with three test grades, 65, 68, 70. Then you make a 100 on the fourth test. The distribution of those 4 grades is skewed to the right with a mean of 75.8 and median of 69. Despite the shape of the distribution, you may argue for the mean in this situation. On the other hand, if you scored a 30 on the fourth test instead of 100, you’d argue for the median. With only 4 data points, the median is not a good description of typical so here’s hoping you have a teacher who understands the effects of outliers and drops your lowest test score.

Inserting my opinion: As a former teacher, I recognize that when averaging all student grades from an assignment or test, the result is often misleading. In this case, I believe the median is a better description of the typical student’s performance because extreme values usually exist in a class set of grades (very high or very low) and will affect the calculation of the mean. After each test in AP statistics, I would post the mean, median, 5 number summary and standard deviation for each class. It didn’t take long for students to draw the same conclusion.

Ultimately, context can guide you in this decision of mean versus median but consider the existence of outliers and the distribution shape.

Using Modality to Find the Story

By investigating a distribution’s physical features, students are able to connect the numbers with a story in the data. In quantitative data, unusual features can include outliers, clusters, gaps and “peaks”. Specifically, identifying causes of the multimodality of a distribution can build context behind the metrics you report.

all tuition
This histogram of college tuition for all 4+ year colleges in 2013 has two distinct “peaks”. Although the peaks are not equal in height, they tell a story. Source: US Dept of Education

When I investigated the distribution of college tuition, I expected the shape to appear skewed. I did not expect to find the smaller peak in the middle. So I filtered the data by type of college (public or private) and found two almost symmetric distributions of tuition:

public tuition
Tuition for public colleges and universities in 2013

private tuition
Tuition for private colleges and universities in 2013
The existence of the modes in this data makes it difficult to find a typical US college tuition; however, they did point to the existence of two different types of colleges mixed into the same data.

tuition all schools labelled
Notice how different the means and medians of the data subsets (public schools and private schools, separated) are from the mean and median of the entire dataset!

all tuition by color
The shape of the distribution makes a bit more sense to me now

Now I’m not confident that one number would represent the typical college tuition in the U.S., though I can say, “The typical tuition for 4+ year colleges in the US for the 2013-14 school year was about $7,484 for public schools and $27,726 for private schools.”

Oh and did you notice the slight peaks on the right side of both private and public tuition distributions? Me too. Which prompted me to look deeper:

public tuition annotated
Did you know Penn State has 24 campuses? I didn’t!

private tuition annotated
Several Liberal Arts schools in the Northeast are competitively priced between $43K and $47K per year

Measuring what’s Typical

So here’s the thing: Summarizing a set of values for a variable with one numerical description of “center” can help simplify a reporting process and aid in comparisons of large sets of data. However, sometimes finding this measure proves difficult, impossible, or even misleading.

As I suggest to my students, visualizing the distribution of the variable, considering its context and exploring its physical features will add value to your overall analysis and possibly help you find an appropriate measure of typical.

80s
I have no pictures of myself in middle school, so please enjoy this re-creation of the 80s before a Bon Jovi concert.

 

*Special thank you to Daniel Zvinca for providing feedback for this post with his domain knowledge and extensive industry expertise.

How to Create a Residual Plot in Tableau

In this BrightTalk webinar with Eva Murray and Andy Kriebel, I discussed how to use residual plots to help determine the fit of your linear regression model. Since residuals show the remaining error after the line of best fit is calculated, plotting residuals gives you an overall picture of how well the model fits the data and, ultimately, its ability to predict.

residual 4runner
In the most common residual plots, residuals are plotted against the independent variable.

For simplicity, I hard-coded the residuals in the webinar by first calculating “predicted” values using Tableau’s least-squares regression model. Then, I created another calculated field for “residuals” by subtracting the predicted y-values from the observed y-values. Another option would be to use Tableau’s built-in residual exporter. But what if you need a dynamic residual plot without constantly exporting the residuals?

Note: “least-squares regression model” is merely a nerdy way of saying “line of best fit”.

How to create a dynamic residual plot in Tableau

In this post I’ll show you how to create a dynamic residual plot without hard-coding fields or exporting residuals.

Step 1: Always examine your scatterplot first, observing form, direction, strength and any unusual features.

scatterplot 4Runner

Step 2: Calculated field for slope

The formula for slope: [correlation] * ([std deviation of y] / [std deviation of x])

  • correlation doesn’t mind which order you enter the variables (x,y) or (y,x)
  • y over x in the calculation because “rise over run”
  • be sure to use the “sample standard deviation”

slope 4runner

Step 3: Calculated field for y-intercept

The formula for y-intercept: Avg[y variable] – [slope] * Avg[x variable]

y-intercept 4runner

Step 4: Calculated field for predicted dependent variable

The formula for predicted y-variable = {[slope]} * [odometer miles] + {[y-intercept]}

  • Here, we are using the linear equation, y = mx + b where
    • y is the predicted dependent variable (output: predicted price)
    • m is the slope
    • x is the observed independent variable (input: odometer miles)
    • b is the y-intercept
  • Since the slope and y-intercept will not change value for each odometer mile, but we need a new predicted output (y) for each odometer mile input (x), we use a level of detail calculation. Luckily the curly brackets tell Tableau to hold the slope and y-intercept values at their constant level for each odometer mile.

equation 4runner

Step 5: Create calculated field for residuals

The formula for residuals: observed y – predicted y

Residual calc

Step 6: Drag the independent variable to columns, residuals to rows

pills 4runber

Step 7: Inspect your residual plot.

Don’t forget to inspect your residual plot for clear patterns, large residuals (possible outliers) and obvious increases or decreases to variation around the center horizontal line. Decide if the model should be used for prediction purposes.

  • The horizontal line in the middle is the least-squares regression line, shown in relation to the observed points.
  • The residual plot makes it easier to see the amount of error in your model by “zooming in” on the linear model and the scatter of the points around/on it.
  • Any obvious pattern observed in the residual plot indicates the linear model is not the best model for the data.

In the plot below, the residuals increase moving left to right. This means the error in predicting 4Runner price gets larger as the number of miles on the odometer increases. And this makes sense because we know more variables affect the price of the vehicle, especially as mileage increases. Perhaps this model is not effective for predicting vehicle price above 60K miles on the odometer.

residual 4runner

To recap, here are the basic equations we used above:

equations

For more on residual plots, check out The Minitab Blog.

Webinar Reflection: Stats for Data Visualization Part 1

Thank you to Makeover Monday’s Eva Murray and Andy Kriebel for allowing me to grace their BrightTALK airwaves with my love language of statistics yesterday! If you missed it, check out the recording.

webinar 1

With 180 school days and 5 classes (plus seminar once/week), you can imagine a typical U.S. high school math teacher has the opportunity to instruct/lead between 780 and 930 lectures each year. After 14 years teaching students in Atlanta-area schools (plus those student-teaching hours, and my time as a TA at LSU), I’ve instructed somewhere in the ballpark of 12,000 to 13,500 lessons in my lifetime.

So let’s be honest. Yesterday I was nervous to lead my very first webinar. After all, I depend on my gift of crowd-reading to determine the pace (and the tone) of a presentation. Luckily, I’m an expert at laughing at my own jokes so after the first few slides (and figuring out the delay), I felt comfortable. So Andy and Eva, I am ready for the next webinar on December 20th — Audience, y’all can sign up here.

Fun Fact: In 6th grade I was in the same math class as Andy Kriebel’s sister-in-law. It was also the only year I ever served time in in-school suspension (but remember, correlation doesn’t imply causation).

Webinar Questions and Answers

I was unable to get to all the questions asked on the webinar but rest assured I will do my best to field those here.

  1. Q: Can you provide the dataset? A: Here’s a link to the 4Runner data I used for most of the webinar. Let me know if you’d like any others.
  2. Q: Do you have the data that produced the cartoon in the beginning slide? A: A surprising number of people reproduced the data and the curves from the cartoon within hours of its release. Here is one person’s reproduction in R from this blog post.
  3. Q: Do you have any videos on the basics of statistics? A: YES! My new favorite is Mr. Nystrom on YouTube, we use similar examples and he looks like he loves his job. For others, Google the specific topic along with the words “AP Statistics” for some of the best tutorials out there.
  4. Q: Could you explain example A with the r value of -0.17? It seems like 0. A: The picture when r = -0.17 is slightly negative, but only slightly. This one is very tricky because we tend to think r = 0 if the relationship isn’t linear. But remember, correlation falls on a continuous scale from weak to strong, which means r = -0.17 is still really, really weak. An r of exactly 0 is unlikely to show up in real data unless the data forms a perfect square or circle, for example.
  5. Q: Question for Anna, does she also use Python, R, or other stats tools? A: I am learning R! RStudio makes it easier. When I coach doctoral candidates on dissertation defense I use SPSS and Excel; one day I will learn Python. Of course, I am an expert on the TI-84. Stop laughing.
TI84

  6. Q: So with nonlinear regression, [is it] better to put the prediction on the y-axis? A: With linear and nonlinear regression, the variable you want to predict will always be your y-axis (dependent) variable. That variable is always depicted as a y with a caret on top (ŷ), and it’s called “y-hat.”

Other Helpful Links

If you haven’t had time to go through Andy’s Visual Vocabulary, take a look at the correlation section.

At the end of the webinar I recommended Bora Beran’s blog for fantastic explanations on Tableau’s modeling features. He has a statistics background and explains the technical in a clear, easy-to-understand format.

Don’t forget to learn about residual plots if you are using regression to predict.

ice cube quote


The Analytics PAIN Part 3: How to Interpret P-Values with Linear Regression

You may find Part 1 and Part 2 interesting before heading into part 3, below.

p-value-statistics-meme

Interpreting Statistical Output and P-Values

To recap, you own Pearson’s Pizza and you’ve hired your nephew Lloyd to run the establishment. And since Lloyd is not much of a math whiz, you’ve decided to help him learn some statistics based on pizza prices.

When we left off, you and Lloyd realized that, despite a strong correlation and high R-Squared value, the residual plot suggests that predicting pizza prices from toppings will become less and less accurate as the number of toppings increases:

Shiny Residual Plot 2
A clear, distinct pattern in a residual plot is a red flag

Looking back at the original scatterplot and software output, Lloyd protests, “But the p-value is significant. It’s less than 0.0001.”

hold my beer
watch this

 

pizza4
The original software output. What does it all mean?

 

Doesn’t a small p-value imply that our model is a go?

freddie

A crash course on hypothesis testing

In the back room of Pearson’s Pizza, you’ve got two pool tables and a couple of slot machines (which may or may not be legal). One day, a tall, serious man saunters in, slaps down 100 quid and challenges you (in a British accent), “My name is Ronnie O’Sullivan. That’s right. THE. Ronnie O’Sullivan.”

You throw the name into the Google and find out the world’s best snooker player just challenged you to a game of pool.

Then something interesting happens. YOU win.

Suddenly, you wonder, is this guy really who he says he is?

Because the likelihood of you winning this pool challenge against THE Ronnie O’Sullivan is slim to none IF he is, indeed, THE Ronnie O’Sullivan (the world’s best snooker player).

Beating this guy is SIGNIFICANT in that it challenges the claim that he is who he claims he is:

You aren’t SUPPOSED to beat THE Ronnie O’Sullivan – you can’t even beat Lloyd.

But you did beat this guy, whoever he claims to be.

So, in the end, you decide this man was an impostor, NOT Ronnie O’Sullivan.

In this scenario:

The claim (or “null hypothesis”): “This man is Ronnie O’Sullivan.” You have no reason to question him – you’ve never even heard of snooker.

The alternative claim (or “alternative hypothesis”): “This man is NOT Ronnie O’Sullivan”

The p-value: The likelihood you beat the world’s best snooker player assuming he is, in fact, the real Ronnie O’Sullivan.

Therefore, the p-value is the likelihood an outcome at least as extreme as the one observed would occur if the claim were true. A small p-value can cast doubt on the legitimacy of the claim – the chances you could beat Ronnie O’Sullivan in a game of pool are slim to none, so it is MORE likely he is not Ronnie O’Sullivan. Still puzzled? Here’s a clever video explanation.

Some mathy stuff to consider

The intention of this post is to tie the meaning of this p-value to your decision, in the simplest terms I can find. I am leaving out a great deal of theory behind the sampling distribution of the regression coefficients – but I would be happy to explain it offline. What you do need to understand, however, is that your data set is just a sample, a subset, from an unknown population. The p-value output is based on the observed statistics in this one particular sample, while the variation is tied to a distribution of all possible samples of the same size (a theoretical model). Another sample would indeed produce a different outcome, and therefore a different p-value.
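
To make the “another sample, another p-value” point concrete, here is a small simulation sketch (mine, not from the webinar or the pizza data): every made-up sample of 21 orders comes from the same pricing process, yet each one produces a different p-value.

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
for _ in range(3):
    toppings = rng.integers(0, 11, size=21)                    # 21 orders with 0-10 toppings each
    price = 15 + 1.25 * toppings + rng.normal(0, 1, size=21)   # same true relationship, new noise
    print(stats.linregress(toppings, price).pvalue)            # a different p-value every time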

The hypothesis our software is testing

The output below gives a great deal of insight into the regression model used on the pizza data. The statistic du jour for the linear model is the p-value you always see in a Tableau regression output: The p-value testing the SLOPE of the line.

Therefore, a significant or insignificant p-value is tied to the SLOPE of your model.

Recall, the slope of the line tells you how much the pizza price changes for every topping ordered. Slope is a reflection of the relationship of the variables you’re studying. Before you continue reading, you explain to Lloyd that a slope of 0 means, “There is no relationship/association between the number of toppings and the price of the pizza.”

Tableau Regression Output

Zoom in on that last little portion – the coefficient rows reproduced below:

Panes: Row = Pizza Price, Column = Toppings | line p-value < 0.0001 | DF = 19

Term        Value   StdErr      t-value    p-value
Toppings    1.25    0.0993399   12.5831    < 0.0001
intercept   15      0.362738    41.3521    < 0.0001

In this scenario:

The claim: “There is no association between number of toppings ordered and the price of the pizza.” Or, the slope is zero (0).

The alternative claim: “There is an association between the number of toppings ordered and the price of the pizza.” In this case, the slope is not zero.* 

The p-value: Assuming there is no relationship between number of toppings and price of pizza, the likelihood of obtaining a slope at least as extreme as $1.25 per topping is less than 0.01%.

The p-value is very small** – a slope that extreme would happen less than 0.01% of the time just by chance. This small p-value means you have evidence of a relationship between the number of toppings and the price of a pizza.
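
Outside Tableau, the same slope test is a single function call. A hedged sketch: the 21 “orders” below are built from the menu rules (each toppings count priced once with all regular and once with all premium toppings), not the actual sample behind the output above, so the standard error won’t match the screenshot exactly, but the slope and the tiny p-value behave the same way.

import numpy as np
from scipy import stats

toppings = np.array([0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10])
price = np.array([15.0, 16.0, 16.5, 17.0, 18.0, 18.0, 19.5, 19.0, 21.0, 20.0, 22.5,
                  21.0, 24.0, 22.0, 25.5, 23.0, 27.0, 24.0, 28.5, 25.0, 30.0])

result = stats.linregress(toppings, price)   # the p-value tests the claim "slope = 0" (two-sided)
print(result.slope)                          # $1.25 per topping for this made-up sample
print(result.pvalue)                         # far below 0.0001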

What the P-Value is NOT

  • The p-value is not the probability the null hypothesis is false – it is not the likelihood of a relationship between number of toppings and the price of pizza.
  • The p-value is not evidence we have a good linear model – remember, it’s only testing a relationship between the two variables (slope) based on one sample.
  • A high p-value does not necessarily mean there is no relationship between pizza price and number of toppings – when dealing in samples, chance variability (differences) and bias are present, which can lead to erroneous conclusions.
  • A statistically significant p-value does not necessarily mean the slope of the population data is not 0 – see the last bullet point. By chance, your sample data may be “off” from the full population data.

The P-Value is Not a Green Light

archer

The p-value here gives evidence of a relationship between the number of toppings ordered and the price of the pizza – which was already determined in part 1. (If you want to get technical, the correlation coefficient R is used in the formula to calculate slope.)

Applying a regression line for prediction requires the examination of all parts of the model. The p-value given merely reflects a significant slope — recall there is additional error (residuals) to consider and outside variables acting on one or both of the variables.

Ultimately, Pearson’s Pizza CAN apply the linear model to predict pizza prices from number of toppings. But only within reason. You decide not to predict for pizza prices when more than 5 toppings are chosen because, based on the residual plot, the prediction error is too great and the variation may ultimately hurt long-term budget predictions.

In a real business use case, the p-value, R-Squared, and residual plots can only aid in logical decision-making. Lloyd now realizes, thanks to your expertise, that using data output just to say he’s “data-driven” without proper attention to detail and common sense is unwise.

Statistical methods can be powerful tools for uncovering significant conclusions; however, with great power comes great responsibility.

p_values
Note: This cartoon uses sarcasm to poke fun at blindly following p-values.

—Anna Foard is a Business Development Consultant at Velocity Group

 

*Note this is an automatic 2-tailed test. Technically it is testing for a slope of at least $1.25 OR at most -$1.25 (at least as extreme in either direction). For a 1-tailed test (testing only for a positive slope, for example), divide the p-value output by 2. For more information on the t-value, df, and standard error I’ve included additional notes and links at the bottom of this post.

**How small is small? It depends on the nature of what you are testing and your tolerance for “false negatives” or “false positives”. It is generally accepted practice in the social sciences to consider a p-value small if it is under 0.05, meaning an observation at least as extreme would occur less than 5% of the time by chance if the claim were true.

More information on t-value, df:

For those curious about the t-value, this statistic is also called the “test statistic” (the “critical value” is the cutoff it gets compared against). This value is like a z-score, but relies on the Student’s t-distribution. In other words, the t-value is a standardized value indicating how far a slope of “1.25” falls from the hypothesized slope of 0, taking into account sample size and variation (standard error).

In t-tests for regression, degrees of freedom (df) is calculated by subtracting the number of parameters being estimated from the sample size. In this example, there are 21 – 2 degrees of freedom because we started with 21 independent points, and there are two parameters to estimate, slope and y-intercept. 

Degrees of freedom (df) represents the amount of independent information available. For this example, n = 21 because we had 21 pieces of independent information. But since we used one piece of information to calculate slope and another to calculate the y-intercept, there are now n – 2 or 19 pieces of information left to calculate the variation in the model, and therefore the appropriate t-value. 
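
To tie the pieces together numerically, here is a short sketch that rebuilds the t-value and the two-sided p-value from the slope, standard error, and df shown in the Tableau output above, with SciPy’s t-distribution standing in for a table lookup:

from scipy import stats

slope = 1.25            # estimated slope from the output
std_err = 0.0993399     # standard error of the slope
df = 19                 # 21 observations minus 2 estimated parameters

t_value = (slope - 0) / std_err               # how far the slope sits from the hypothesized 0
p_value = 2 * stats.t.sf(abs(t_value), df)    # two-sided p-value
print(t_value, p_value)                       # about 12.58, and far below 0.0001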

The Analytics PAIN Part 2: Least-Squares Regression Lines and Residuals

Haven’t read Part One of this series? Click here.

The Context

You are the sole proprietor of Pearson’s Pizza, a local pizza shop. Out of nepotism and despite his weak math skills, you’ve hired your nephew Lloyd to run the joint. And because you want your business to succeed, you decide this is a good time to strengthen your stats knowledge while you teach Lloyd – after all,

“In learning you will teach, and in teaching you will learn.”

– Latin Proverb and Phil Collins

Your pizzas are priced as follows:

Cheese pizza (no toppings): $15

Additional toppings: $1/each for regular, $1.50 for premium

When we left off, you and Lloyd were exploring the relationship between the number of toppings and the pizza price using a sample of possible scenarios.

pizza scatter
The scatterplot of Pizza Price vs. Number of Toppings: as the number of toppings increases, the price of the pizza increases.

The Purpose(s) of a “Regression” Line

When investigating data sets of two continuous, numerical variables, a scatterplot is the typical go-to graph of choice. (See Daniel Zvinca’s article for more on this, and other options.)

So. When do we throw in a “line of best fit”? The answer to that question may surprise you:

A “line of best fit”, or regression line, is used to: (1) assess the relationship between two continuous variables that may respond to or interact with each other, and (2) predict the value of y based on the value of x.

In other words, a regression line may not add value to Lloyd’s visualization if it won’t help him predict pizza prices from the number of toppings ordered. 

pizza4

The equation: pizza price = 1.25(toppings) + 15

Recall the slope of the line above says that for every additional topping ordered the price of the pizza will increase by $1.25.

In the last post you discussed some higher-order concepts with Lloyd, like the correlation coefficient (R) and R-Squared. Using the data above, you said, “89.3% of the variability (differences) in pizza prices can be explained by the number of toppings.” That also means 10.7% of the variability is explained by other variables – in this case, the two types of toppings.
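
For a simple linear regression like this one, R-Squared is literally the correlation coefficient squared. A quick sketch with made-up pizza-style numbers (not the sample behind the output above):

import numpy as np

toppings = np.array([1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6], dtype=float)   # made-up orders
price = np.array([16, 17, 18, 19, 20, 21, 16.5, 18, 19.5, 21, 22.5, 24])

r = np.corrcoef(toppings, price)[0, 1]
print(r ** 2)   # the proportion of variability in price explained by toppings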

Since there is a high R-Squared value, does Pearson’s Pizza have a solid model for prediction purposes? Before you answer, consider the logic behind “least-squares regression.”

Least-Squares Regression

You and Lloyd now understand that “trend lines”, “lines-of-best-fit”, and “regression lines” are all different ways of saying, “prediction lines.”

The least-squares regression line, the most common type of prediction line, is the line that minimizes the sum of the squared vertical distances from each observation (each point) to the line. These vertical distances, called residuals, are found simply by subtracting the predicted pizza price from the actual pizza price for each observed pizza purchase.

The magnitude of each residual indicates how much you’ve over- or under- predicted the pizza price, or prediction error.

Note the green lines in the plot below:

Residuals
The residuals, or vertical distances, are shown here in green. (Created in Shiny)
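
If it helps to see the “least squares” part literally, here is a hedged sketch with made-up pizza-style numbers (not the post’s dataset): the fitted line’s sum of squared residuals is smaller than that of any alternative line you try.

import numpy as np

toppings = np.array([1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6], dtype=float)   # made-up orders
price = np.array([16, 17, 18, 19, 20, 21, 16.5, 18, 19.5, 21, 22.5, 24])

slope, intercept = np.polyfit(toppings, price, 1)     # least-squares fit, degree 1

def sum_squared_residuals(m, b):
    return np.sum((price - (m * toppings + b)) ** 2)  # residual = observed - predicted

print(sum_squared_residuals(slope, intercept))        # the least-squares line...
print(sum_squared_residuals(slope + 0.2, intercept))  # ...beats this tweaked, steeper line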

Lloyd learns about residuals

Recall, the least-squares regression equation:

pizza price = 1.25(toppings) + 15

Lloyd says he can predict the price of a pizza with 12 toppings:

pizza price =1.25*12 + 15

pizza price = $30

Sure, it’s easy to take the model and run with it. But what if the customer ordered 12 PREMIUM toppings? Logic says that’s (1.50)*12 + 15 = $33.

You explain to Lloyd that the residual here is 33 – 30, or $3. When a customer orders a pizza with 12 premium toppings, the model UNDER predicts the price of the pizza by $3. 

lloyd christmas

How valuable is THIS model for prediction purposes? Answer: It depends how much error is acceptable to your business and to your customer. 

Why the Residuals Matter

To determine if a linear model is appropriate, protocol* says to create a residual plot and check the graph of residuals. That is, graph all x-values (# of toppings) against the residuals and look for any obvious patterns. Create a residual plot with your own data here.

Ideally, the graph will show a cloud of points with no pattern. Patterns in residual plots suggest a linear model may NOT be suitable for prediction purposes.

Shiny Residual Plot 2
I spy a pattern. Or, how I bubbled my answers in Sociology class.

You notice from the residual plot above, as the number of toppings increase, the residuals increase. You realize the prediction error increases as we predict for more toppings. For Pearson’s Pizza, the least-squares regression line may not be very helpful for predicting price from toppings as the number of toppings increases.

Is a residual plot necessary? Not always. The residual plot merely “zooms in” on the pattern surrounding the prediction line. Becoming more aware of residuals and the part they play in determining model fit helps you look for these patterns in the original plots. In larger data sets with more variability, however, patterns may be difficult to find.
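
If Tableau or the Shiny applet isn’t handy, a residual plot is only a few lines of Python. Again, the numbers below are made up to mimic the pizza pricing, not the post’s actual data:

import numpy as np
import matplotlib.pyplot as plt

toppings = np.array([1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6], dtype=float)   # made-up orders
price = np.array([16, 17, 18, 19, 20, 21, 16.5, 18, 19.5, 21, 22.5, 24])

slope, intercept = np.polyfit(toppings, price, 1)
residuals = price - (slope * toppings + intercept)    # observed minus predicted

plt.scatter(toppings, residuals)                      # x-values vs. residuals
plt.axhline(0, linestyle="--")                        # the regression line, flattened to zero error
plt.xlabel("Number of toppings")
plt.ylabel("Residual (observed - predicted)")
plt.show()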

Lloyd says, “But the p-value is significant. It’s < 0.0001. Why look at the visualization of the residual plot when the p-value is so low?”

Is Lloyd correct?! Find out in Part 3 of this series.

Summary

Today Lloyd learned a regression line adds little to no value to his visualization if it won’t help him predict pizza prices from the number of toppings ordered.

As the owner of a prestigious pizza joint, you realize the importance of visualizing both the scatterplot and the residual plot instead of flying blind with correlation, R-Squared, and p-values alone.

Understanding residuals is one key to determining the success of your regression model. When you decide to use a regression line, keep your ultimate business goals in mind – apply the model, check the residual plot, calculate specific residuals to judge prediction error. Use context to decide how much faith to place in the magical maths.

linear_regression cartoon

*Full list of assumptions to be checked for the use of linear regression and how to check them here.

Want to have your own least-squares fun? This Rossman-Chance applet provides hours of entertainment for your bivariate needs. 

—Anna Foard is a Business Development Consultant at Velocity Group