That box-and-whisker plot (or, boxplot) you learned to read/create in grade school probably IS different from the one you see presented in the adult world.
The boxplot on the top originated as the Range Bar, published by Mary Spear in the 1950s, while the boxplot on the bottom was a modification created by John Tukey to account for outliers. Source: Hadley Wickham
As a former math and statistics teacher, I can tell you that (depending on your state/country curriculum and textbooks, of course) you most likely learned how to read and create the former boxplot (or, “range bar”) in school for simplicity. Unless you took an upper-level stats course in grade school or at University, you may have never encountered Tukey’s boxplot in your studies at all.
You see, teachers like to introduce concepts in small chunks. While this is usually a helpful strategy, students lose when the full concept is never developed. In this post I walk you through the range bar AND connect that concept to the boxplot, linking what you’ve learned in grade school to the topics of the present.
In this example, I’m comparing the lifespans of a small, non-random set of animals. I chose this set of animals based solely on convenience of icons, meaning conclusions can only be drawn about animals for which Anna Foard has an icon. I note this important detail because, with such a small, non-random sample, one cannot infer conclusions about the entire population of all animals.
Quartiles break the dataset into 4 quarters. Q1, median, Q3 are (approximately) located at the 25th, 50th, and 75th percentiles, respectively.
Finding the median requires finding the middle number when values are ordered from least to greatest. When there is an even number of data points, the two numbers in the middle are averaged.
Once the median has been located, find the other quartiles in the same way: The middle value in the bottom set of values (Q1), then the middle value in the top set (Q3).
The first and third quartiles build the “box”, with the median represented by a line inside the box. The “whiskers” extend to the minimum and maximum values in the dataset:
But without the points:
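For readers who want to check the quartile steps in code, here is a minimal Python sketch of the "median of each half" method described above. The lifespans are hypothetical numbers chosen for illustration (they are not the animals from the chart), and the function name `five_number_summary` is my own.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    lower = ordered[: n // 2]            # bottom half of the ordered values
    upper = ordered[n // 2 + n % 2 :]    # top half (skips the middle value when n is odd)
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]

# Hypothetical lifespans in years, for illustration only
lifespans = [5, 8, 9, 12, 15, 17, 18, 40]
print(five_number_summary(lifespans))
```

The box of the range bar would span Q1 to Q3, with the whiskers running out to the minimum and maximum. Note that software packages use several slightly different quartile rules, so results from other tools may differ a little from this hand method.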
The Range Bar probably looks similar to the first box-and-whisker plot you created in grade school. If you have children, it is most likely the first version of the box-and-whisker plot that they will encounter.
Since the kid’s version of the boxplot does not show outliers, I propose teachers call this version, “The Range Bar” as it was originally dubbed, to not confuse those reading the chart. After all, someone looking at this version of a boxplot may not realize it does not account for outliers and may draw the wrong conclusion.
The only difference between the range bar and the boxplot is the view of outliers. Since this version requires a basic understanding of the concept of outliers and a stronger mathematical literacy, it is generally introduced in a high school or college statistics course.
The interquartile range is the difference, or spread, between the third and first quartile reflecting the middle 50% of the dataset. The IQR builds the “box” portion of the boxplot.
1.5*IQR is then subtracted from the lower quartile and added to the upper quartile to determine the boundaries, or “fences”, between non-outliers and outliers.
Since no animal’s lifespan falls below -5 years, a low-value outlier is not possible in this particular set of data; however, one animal in this dataset lives beyond 31 years – an outlier among the higher values.
Here we find the modification on the “range bar” – the whiskers only extend as far as non-outlier values. Outliers are denoted by a dot (or star).
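The fence rule can be sketched in a few lines of Python. The lifespans below are hypothetical; I picked them so the fences land at -5 and 31 years, matching the boundaries mentioned above, but they are not the actual data behind the chart. The function names are also my own.

```python
from statistics import median

def tukey_fences(values, k=1.5):
    """Return (lower fence, upper fence) = (Q1 - k*IQR, Q3 + k*IQR)."""
    ordered = sorted(values)
    n = len(ordered)
    q1 = median(ordered[: n // 2])
    q3 = median(ordered[n // 2 + n % 2 :])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def boxplot_parts(values):
    """Split values into whisker endpoints and outliers using the fences."""
    low, high = tukey_fences(values)
    inside = [v for v in values if low <= v <= high]
    outliers = [v for v in values if v < low or v > high]
    return {"fences": (low, high),
            "whiskers": (min(inside), max(inside)),
            "outliers": outliers}

# Hypothetical lifespans in years, for illustration only
lifespans = [5, 8, 9, 12, 15, 17, 18, 40]
print(boxplot_parts(lifespans))
```

With these numbers the whiskers stop at 5 and 18 years (the most extreme non-outlier values), and the 40-year lifespan is drawn as a separate dot.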
In an academic setting, I use boxplots a great deal. When teaching AP Statistics, they are helpful to visualize the data quickly by hand as they only require summary statistics (and outliers). They also help students compare and visualize center, spread, and shape (to a degree).
When we get into the inference portion of AP Stats, students must verify assumptions for certain inference procedures — often those procedures require data symmetry and/or absence of outliers in a sample. The boxplot is a quick way for a student to verify assumptions by hand, under time constraints. When coaching doctoral candidates through the dissertation stats, similar assumptions are verified to check for outliers — using boxplots.
Unfortunately, boxplots have their share of disadvantages as well.
Consider:
A boxplot may show summary statistics well; however, clusters and multimodality are hidden.
In addition, a consumer of your boxplot who isn’t familiar with the measures required to construct one will have difficulty making heads or tails of it. This is especially true when your resulting boxplot looks like this:
Or this:
Or what about this?
Over the course of the years, multiple boxplot variations have been created to display parts (or all) of the distribution’s shape and features.
Box-and-whisker plots may be helpful for your specific use case, though not intuitive for all audiences. It may be helpful to include a legend or annotations to help the consumer understand the boxplot.
No cheating! Without looking back through this post, check your own understanding of boxplots. Answer can be found on the #MakeoverMonday webinar I recorded with Eva Murray a couple weeks ago.
]]>
I decided to figure out how to create one in Tableau. Based on the types of cumulative frequency distributions I was used to when I taught AP Stats, I first determined I wanted the value of interest on the horizontal axis and the percents on the vertical axis.
Using a simple example – US President age at inauguration – I started with a histogram so I could look at the overall shape of the distribution:
From here I realized I already had what I needed in my view – discrete ages on the x-axis and counts of ages on the y-axis. For a wider range of values I would want a wider bin size, but in this situation I needed to resize bins to 1, representing each individual age.
Click on the green pill on the rows (the COUNT) and add a table calculation.
First choose “Running Total”, then click on the box “add secondary calculation”:
Next, choose “percent of total” as the secondary calculation:
Add drop lines…
…and CTRL drag the COUNT (age in years) green pill from the rows to labels. Click on “Label” on the marks card and change the marks to label from “all” to “selected”.
And there you have it.
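If you want to sanity-check what the two table calculations are doing, the same "running total, then percent of total" logic can be sketched in Python. The ages below are hypothetical placeholders, not the real inauguration data, and `cumulative_percent` is a name I made up for this sketch.

```python
from collections import Counter
from itertools import accumulate

def cumulative_percent(values):
    """Map each distinct value to the percent of all values at or below it."""
    counts = Counter(values)
    distinct = sorted(counts)
    running = list(accumulate(counts[v] for v in distinct))  # running total of counts
    total = running[-1]
    return {v: 100 * r / total for v, r in zip(distinct, running)}

# Hypothetical ages, for illustration only
ages = [42, 43, 46, 46, 47, 51, 51, 54, 55, 57]
for age, pct in cumulative_percent(ages).items():
    print(age, pct)
```

Each printed pair corresponds to one point on the cumulative frequency plot: the age on the horizontal axis and the cumulative percent on the vertical axis.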
Percentiles describe the position of a data point relative to the rest of the dataset using a percent. That’s the percent of the rest of the dataset that falls below the particular data point. Using the baby weights example, the percentile is the percent of all babies of the same age and gender weighing less than your baby.
Back to the US president example.
Since I know Barack Obama was 47 when inaugurated, let’s look at his age relative to the other US presidents’ ages at inauguration:
And another way to look at this percentile: 87% of US presidents were older than Barack Obama when inaugurated.
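Here is a small Python sketch of the two ways of reading the same point on the plot. It assumes the cumulative percent counts values at or below the age of interest (which is what a running total produces), so the remainder is the percent strictly older. The ages are hypothetical, not the real presidential data.

```python
def percent_at_or_below(values, x):
    """Percent of values less than or equal to x."""
    return 100 * sum(1 for v in values if v <= x) / len(values)

def percent_above(values, x):
    """Percent of values strictly greater than x."""
    return 100 * sum(1 for v in values if v > x) / len(values)

# Hypothetical ages, for illustration only
ages = [42, 43, 46, 46, 47, 51, 51, 54, 55, 57]
print(percent_at_or_below(ages, 47), percent_above(ages, 47))
```

The two results always add to 100. When several data points tie with the value of interest, "percent below" and "percent at or below" will differ, so it is worth stating which one you are reporting.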
Thank you for reading and have an amazing day!
-Anna
]]>My favorite of all the means. Sometimes called expected value, or the mean of a discrete random variable.
When computing a course grade or overall GPA, the weighted mean takes into account each possible outcome and how often that outcome occurs in a dataset. A weight is applied to each possible outcome — for example, each type of grade in a course — then added together to return the overall weighted mean. And since Econ was my favorite course in college…
If you have an exam average of 80, quiz/homework average of 65 and lab average of 78, what is your final grade? (Hint: Don’t forget to change percentages to decimals.)
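The calculation can be sketched in Python as below. Important assumption: the category weights (50% exams, 20% quizzes/homework, 30% labs) are hypothetical numbers I made up for this sketch, because the syllabus weights are not given in the text. Swap in your own course weights.

```python
def weighted_mean(values, weights):
    """Multiply each value by its weight (as a decimal) and add the products."""
    if abs(sum(weights) - 1) > 1e-9:
        raise ValueError("weights should add up to 1 (100%)")
    return sum(v * w for v, w in zip(values, weights))

averages = [80, 65, 78]        # exam, quiz/homework, lab averages from the example
weights = [0.50, 0.20, 0.30]   # HYPOTHETICAL weights, written as decimals

print(round(weighted_mean(averages, weights), 1))
```

With these made-up weights the final grade comes out to about 76.4. Notice there is no division at the end: the weights already add up to 1.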
Weighted means are also effective for assessing risk in insurance or gambling. Also known as the expected value, it considers all possible outcomes of an event and the probability of each possible outcome. Expected values reflect a long-term average. Meaning, over the long run, you would expect to win/lose this amount. A negative expected value indicates a house advantage and a positive expected value indicates the player’s advantage (and unless you have skills in the poker room, the advantage is never on the player’s side). An expected value of $0 indicates you’ll break even in the long-run.
I’ll admit my favorite casino game is American roulette:
As you can see, the “inside” of the roulette table contains numbers 1-36 (18 of which are red, the other 18 black). But WAIT! Here’s how they fool you — see the numbers “0” and “00”? 0 and 00 are neither red nor black, though they do count towards the 38 total outcomes on the roulette board. When the dealer spins the wheel, a ball bounces around and chooses from numbers 1 thru 36, 0 AND 00 — that’s 38 possible outcomes.
Let’s say you wager $1 on “black”. And if the winning number is, in fact, black, you get your original dollar AND win another (putting you “up” $1). Unsuspecting victims new to the roulette table think they have a 50/50 shot at black; however, the probability of “black” is actually 18/38 and the probability of “not black” is 20/38.
Here’s how it breaks down for you:
Just as in the grading example, each outcome (dollars made or lost) is first multiplied by its weight, where the weight here is the theoretical probability assigned to that outcome. After multiplying, add each product (outcome times probability) together. Note: Don’t divide at the end like you’d do for the arithmetic mean – it’s a common mistake, but easy to remedy if you check your work.
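As a quick check of the arithmetic, here is a minimal Python sketch of the expected value for the $1 bet on black, using exact fractions. The function name `expected_value` is my own; the outcomes and probabilities are the ones described above.

```python
from fractions import Fraction

def expected_value(outcomes):
    """Sum of (payoff * probability) over all outcomes. No dividing at the end."""
    return sum(payoff * prob for payoff, prob in outcomes)

bet_on_black = [
    (1, Fraction(18, 38)),    # ball lands on black: up $1
    (-1, Fraction(20, 38)),   # ball lands on red, 0, or 00: down $1
]

ev = expected_value(bet_on_black)
print(ev)          # -1/19
print(float(ev))   # roughly -0.053 dollars per $1 wagered
```

In other words, over the long run you would expect to lose a little more than 5 cents for every dollar wagered on black.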
Some Gambling Advice: The belief that casino games adhere to some “law of averages” in the short run is called the Gambler’s Fallacy. Just because the ball on the roulette wheel landed on 5 red numbers in a row doesn’t mean it’s time for a black number on the next spin! I watched a guy lose $300 on three spins of the wheel because, as he exclaimed, “Every number has been red! It’s black’s turn! It’s the law of averages!”
A Geometric mean is useful when you’re looking to average a factor (multiplier) applied over time – like investment growth or compound interest.
I enjoyed my finance classes in school, especially the part about how compound interest works. If you think about compound interest over time, you may recall the growth is exponential, not linear. And exponential growth indicates that in order to grow from one value to the next, a constant was multiplied (not added).
As a basic example, let’s say you invest $100,000 at the beginning of 4 years. For simplicity, let’s say the growth rate followed the pattern +40%, -40%, +40%, -40% over the 4 years. At the end of 4 years, you’ve got $70,560 left.
So you know your 4-year return on the investment is: (70,560 – 100,000)/100,000 = -.2944 or -29.44%. But if you averaged out the 4 growth rates using the arithmetic mean, you’d have 0%. Which is why the arithmetic mean doesn’t make sense here.
Instead, apply the geometric mean:
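For the investment example, the geometric mean is the 4th root of the product of the four yearly growth factors (1.4, 0.6, 1.4, 0.6). A minimal Python sketch, with function and variable names of my own choosing:

```python
from math import prod

def geometric_mean(factors):
    """n-th root of the product of n growth factors."""
    return prod(factors) ** (1 / len(factors))

growth_rates = [0.40, -0.40, 0.40, -0.40]
factors = [1 + r for r in growth_rates]       # 1.4, 0.6, 1.4, 0.6

start = 100_000
end = start * prod(factors)                   # about 70,560
avg_factor = geometric_mean(factors)          # about 0.9165

print(round(end, 2))
print(round((avg_factor - 1) * 100, 2))       # average yearly rate, in percent
```

The average factor of about 0.9165 means the investment shrank by roughly 8.35% per year on average. Applying that same factor four times to $100,000 lands back at about $70,560, which the arithmetic mean of 0% cannot do.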
You drive 60 mph to grandma’s house and 40 mph on the return trip. What was your average speed?
Let’s dust off that formula from physics class: speed = distance/time
Since the speed you drive plays into the time it takes to cover a certain distance, that formula may clue you in as to why you can’t just take an arithmetic mean of the two speeds. So before I introduce the formula for harmonic mean, I’ll combine those two trips using the formula for speed to determine the average speed.
The set-up: Distance doesn’t matter here so we’ll use 1 mile. Feel free to use a different distance to verify, but you’d be reducing fractions a good bit along the way and I’m all about efficiency. Use a distance of 1 mile for each leg of the journey and the two speeds of 40 mph and 60 mph.
First determine the time it takes to go 1 mile by reworking the speed formula:
To determine the average speed, we’ll combine the two legs of the trip using the speed formula (which will return the overall, or average, speed of the entire trip):
The formula for the harmonic mean looks like this:
Where n is the number of 1-mile trips, in this example, and the rates are 40 and 60 mph:
If you scroll up and check out that last step using the speed formula (above), you’ll see the harmonic mean formula was merely a clean shortcut.
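Both routes to the answer can be checked in Python. The sketch below first works through speed = distance/time for two 1-mile legs, then uses the harmonic mean, n divided by the sum of the reciprocals of the rates. The function name `average_speed` is my own; `harmonic_mean` comes from Python's standard `statistics` module.

```python
from fractions import Fraction
from statistics import harmonic_mean

def average_speed(speeds, leg_distance=1):
    """Total distance divided by total time, for legs of equal length."""
    total_distance = leg_distance * len(speeds)
    total_time = sum(Fraction(leg_distance) / Fraction(s) for s in speeds)  # time = distance / speed
    return total_distance / total_time

speeds = [60, 40]   # mph to grandma's house, mph on the return trip

print(average_speed(speeds))       # 48
print(harmonic_mean(speeds))       # same answer via the shortcut formula
print(average_speed(speeds, 25))   # a different leg distance gives the same result
```

The average speed is 48 mph, not the 50 mph the arithmetic mean would suggest, because more time is spent driving at the slower speed.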
If you want more information about measures of center, check out the previous blog post — Mean, Median, and Mode: How Visualizations Help Measure What’s Typical
If your organization is looking to expand its data strategy, fix its data architecture, implement data visualization, and/or optimize using machine learning, check out Velocity Group.
]]>This post aims to review the basics of how measures of central tendency – mean, median, and mode – are used to measure what’s typical. Specifically, I’ll show you how to inspect distributions of variables visually and dissect how mean, median, and mode behave, in addition to common ways they are used. Ultimately it may be difficult, impossible, or misleading to describe a set of data using one number; however, I hope this journey of data exploration helps you understand how different types of data can affect how we describe what’s typical.
Fair enough — I too try to forget the teased hair and track suit years. But I do recall learning to calculate mean, median, mode, and range for a set of numbers with no context and no end game. The math was simple, yet painfully boring. And I never fully realized we were playing a game of Which One of These is Not Like the Other.
It wasn’t until my first college stats course that I realized descriptive statistics serve a purpose – to attempt to summarize important features of a variable or dataset. And mean, median, mode – the measures of central tendency – attempt to summarize the typical value of a variable. These measures of typical may help us draw conclusions about a specific group or compare different groups using one numerical value.
To check off that middle school homework, here’s what we were programmed to do:
Mean: Add the numbers up, divide by the total number of values in the set. Also known as the arithmetic mean and informally called the “average”.
Median: Put the numbers in order from least to greatest (ugh, the worst part) and find the middle number. Oh, there’s two middle numbers? Average them. Did you leave out a number? Start over.
Mode: The number(s) that appear the most.
Repeat until you finish the worksheet.
Because we arrive at mean, median, and mode using different calculations, they summarize typical in different ways. The types of variables measured, the shape of the distribution, the context, and even the size of the set of data can alter the interpretation of each measure of central tendency.
We’re programmed to think in terms of an arithmetic mean, often dubbed the average; however, the geometric and harmonic means are extremely useful and worth your time to learn. Furthermore, when you want to weigh certain values in a dataset more than others, you’ll calculate a weighted mean. But for simplicity of this post, I will only use the arithmetic mean when I refer to the “mean” of a set of values.
Think of the mean as the balancing point of a distribution. That is, imagine you have a solid histogram of values and you must balance it on one finger. Where would you hold it? For all symmetric distributions the balancing point – the mean – is directly in the center.
Just like the median in the road (or, “neutral ground” if you’re from Louisiana), the median represents that middle value, cutting the set of values in half — 50% of the data values fall below and 50% lie above the median. No matter the shape of the distribution, the median is the measure of central tendency reflecting the middle position of the data values.
The mode describes the value or category in a set of data that appears the most often. The mode is specifically useful when asking questions about categorical (qualitative) variables. In fact, mode is the only appropriate measure of typical for categorical variables. For example: What is the most common college mascot? What type of food do college students typically eat? Where are most 4+ Year colleges and universities located?
Modes are also used to describe features of a distribution. In large sets of quantitative data, values are binned to create histograms. The taller “peaks” of the histogram indicate where more common data values cluster, called modes. A cluster of tall bins is sometimes called a modal range. A histogram having one tall peak is called unimodal while two peaks is referred to as bimodal. Multiple peaks = multimodal.
You may notice multiple tall peaks of varying heights in one histogram — despite some bins (and clusters of bins) containing fewer values, they are often described as modes or modal ranges since they contain local maximums.
The histogram above shows a distribution of heights for a sample of college females. The mean, median, and mode of this distribution are equal at about 66.5 inches. When the shape of the distribution is symmetric and unimodal, the mean, median, and mode are equal.
Now I want to see what happens when I add male heights into the histogram:
This histogram shows the distribution of heights of both male and female college students. It is symmetric, so the mean and median are equal at about 68.5 inches. But you’ll notice two peaks, indicating two modal ranges — one from 66 – 67 inches and another from 70 – 71 inches.
Do the mean and median represent the typical college student height when we are dealing with two distinctly different groups of students?
In a skewed distribution, the median remains the center of the values; however, the mean is pulled away from the median by extreme values and outliers.
For example, the histogram above shows the distribution of college enrollment numbers in the United States from 2013. The shape of the distribution is skewed to the right — that is, most colleges reported enrollment below 5,000 students. However, the “tail” of the distribution is created by a small number of larger universities reporting much higher enrollment. These extreme outlying values pull the mean enrollment to the right of the median enrollment.
Reporting an average enrollment of 7,070 students for colleges in 2013 exaggerates the typical college enrollment since most US colleges and universities reported enrollment under 5,000 students.
The median, on the other hand, is resistant to outliers since it is based on position relative to the rest of the data. The median helps you conclude that half of all colleges enrolled fewer than 3,127 students and half of the colleges enrolled more than 3,127 students.
Depending on your end goal and context, median may provide a better measure of typical for a skewed set of data. Medians are typically used to report salaries and housing prices since these distributions include mostly moderate values and fewer on the extremely high end. Take a look at the salaries of NFL players, for example:
Are we to only report medians for skewed distributions?
In school, our grades are reported as means. However, students’ grade distributions can be symmetric or skewed. Let’s say you’re a student with three test grades, 65, 68, 70. Then you make a 100 on the fourth test. The distribution of those 4 grades is skewed to the right with a mean of 75.8 and median of 69. Despite the shape of the distribution, you may argue for the mean in this situation. On the other hand, if you scored a 30 on the fourth test instead of 100, you’d argue for the median. With only 4 data points, the median is not a good description of typical so here’s hoping you have a teacher who understands the effects of outliers and drops your lowest test score.
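The test-grade example is easy to verify with Python's standard `statistics` module. This is just a sketch of the arithmetic in the paragraph above, using the same four grades:

```python
from statistics import mean, median

high_fourth_test = [65, 68, 70, 100]
low_fourth_test = [65, 68, 70, 30]

for grades in (high_fourth_test, low_fourth_test):
    print(grades, "mean:", mean(grades), "median:", median(grades))
```

With the 100, the mean (75.75, or about 75.8) sits above the median (69). With the 30, the mean drops to 58.25 while the median only moves to 66.5. One extreme grade drags the mean a long way and barely nudges the median.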
Inserting my opinion: As a former teacher, I recognize that when averaging all student grades from an assignment or test, the result is often misleading. In this case, I believe the median is a better description of the typical student’s performance because extreme values usually exist in a class set of grades (very high or very low) and will affect the calculation of the mean. After each test in AP statistics, I would post the mean, median, 5 number summary and standard deviation for each class. It didn’t take long for students to draw the same conclusion.
Ultimately, context can guide you in this decision of mean versus median but consider the existence of outliers and the distribution shape.
By investigating a distribution’s physical features, students are able to connect the numbers with a story in the data. In quantitative data, unusual features can include outliers, clusters, gaps and “peaks”. Specifically, identifying causes of the multimodality of a distribution can build context behind the metrics you report.
When I investigated the distribution of college tuition, I expected the shape to appear skewed. I did not expect to find the smaller peak in the middle. So I filtered the data by type of college (public or private) and found two almost symmetric distributions of tuition:
The existence of the modes in this data makes it difficult to find a typical US college tuition; however, they did point to the existence of two different types of colleges mixed into the same data.
Now I’m not confident that one number would represent the typical college tuition in the U.S., though I can say, “The typical tuition for 4+ year colleges in the US for the 2013-14 school year was about $7,484 for public schools and $27,726 for private schools.”
Oh and did you notice the slight peaks on the right side of both private and public tuition distributions? Me too. Which prompted me to look deeper:
So here’s the thing: Summarizing a set of values for a variable with one numerical description of “center” can help simplify a reporting process and aid in comparisons of large sets of data. However, sometimes finding this measure proves difficult, impossible, or even misleading.
As I suggest to my students, visualizing the distribution of the variable, considering its context and exploring its physical features will add value to your overall analysis and possibly help you find an appropriate measure of typical.
*Special thank you to Daniel Zvinca for providing feedback for this post with his domain knowledge and extensive industry expertise.
]]>Since our primary audience tends to be those in data visualization, I used the regression output in Tableau to highlight the p-value in a test for regression towards the end. However, I spent the majority of the webinar discussing p-values in general because the logic of p-values applies broadly to all those tests you may or may not remember from school: t-tests, Chi-Square, z-tests, f-tests, Pearson, Spearman, ANOVA, MANOVA, MANCOVA, etc etc.
I’m dedicating the remainder of this post to some “rules” about statistical tests. If you consider publishing your research, you’ll be required to give more information about your data for researchers to consider your p-value meaningful. In the webinar, I did not dive into the assumptions and conditions necessary for a test for linear regression and it would be careless of me to leave it out of my blog. If you use p-values to drive decisions, please read on.
Cautions always come with statistical tests – those cautions do not fall solely on the p-value “cut-off” debate.
To publish your findings in a journal or use your research in a dissertation, the data must meet each condition/assumption before moving forward with the calculations and p-value interpretation, else the p-value is not meaningful.
Each statistical test comes with its own set of conditions and assumptions that justify the use of that test. Tests for Linear Regression have between 5 and 10 assumptions and conditions that must be met (depending on the type of regression and application).
Below I’ve listed a non-exhaustive list of common assumptions/conditions to check before running a test for linear regression (in no particular order).
Note: Check with your publication and/or dissertation chair for complete list of assumptions and conditions for your specific situation.
Short answer: Yes. But I recommend learning how to interpret them and their limitations. Glancing over the list of assumptions above can give a good indication of how sensitive regression models are to outliers and outside variables. I’d also be hesitant to draw conclusions based on a p-value alone for small datasets.
I highly recommend looking at the residual plot (from webinar 1) to determine if your linear model is a good overall fit, keeping in mind the assumptions above. Here is a guide to creating a residual plot using Tableau.
]]>For simplicity, I hard-coded the residuals in the webinar by first calculating “predicted” values using Tableau’s least-squares regression model. Then, I created another calculated field for “residuals” by subtracting the predicted y-values from the observed y-values. Another option would use Tableau’s built-in residual exporter. But what if you need a dynamic residual plot without constantly exporting the residuals?
Note: “least-squares regression model” is merely a nerdy way of saying “line of best fit”.
In this post I’ll show you how to create a dynamic residual plot without hard-coding fields or exporting residuals.
The formula for slope: [correlation] * ([std deviation of y] / [std deviation of x])
The formula for y-intercept: Avg[y variable] – [slope] * Avg[x variable]
The formula for predicted y-variable: {[slope]} * [odometer miles] + {[y-intercept]}
The formula for residuals: observed y – predicted y
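Outside of Tableau, the same four formulas can be sketched in plain Python to check the logic. The odometer and price values below are hypothetical (they are not the 4Runner data from the post), and the function names are my own.

```python
from statistics import mean, stdev

def least_squares_line(xs, ys):
    """Return (slope, intercept) using slope = r * (s_y / s_x)."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    n = len(xs)
    # Pearson correlation coefficient r
    r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    slope = r * s_y / s_x
    intercept = y_bar - slope * x_bar
    return slope, intercept

def residuals(xs, ys):
    """Observed y minus predicted y for each point."""
    slope, intercept = least_squares_line(xs, ys)
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]

# HYPOTHETICAL data: odometer miles (thousands) and price (thousands of dollars)
miles = [10, 30, 50, 70, 90]
price = [38, 33, 30, 24, 20]

print(least_squares_line(miles, price))
print([round(e, 2) for e in residuals(miles, price)])
```

Plotting the residuals against the x-values gives the residual plot. For a least-squares line the residuals always add up to (essentially) zero, which makes a handy check on the calculation.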
Don’t forget to inspect your residual plot for clear patterns, large residuals (possible outliers) and obvious increases or decreases to variation around the center horizontal line. Decide if the model should be used for prediction purposes.
In the plot below, the residuals increase moving left to right. This means the error in predicting 4Runner price gets larger as the number of miles on the odometer increases. And this makes sense because we know more variables are affecting the price of the vehicle, especially as mileage increases. Perhaps this model is not effective in predicting vehicle price above 60K miles on the odometer.
To recap, here are the basic equations we used above:
For more on residual plots, check out The Minitab Blog.
]]>With 180 school days and 5 classes (plus seminar once/week), you can imagine a typical U.S. high school math teacher has the opportunity to instruct/lead between 780 and 930 lectures each year. After 14 years teaching students in Atlanta-area schools (plus those student-teaching hours, and my time as a TA at LSU), I’ve instructed somewhere in the ballpark of 12,000 to 13,500 lessons in my lifetime.
So let’s be honest. Yesterday I was nervous to lead my very first webinar. After all, I depend on my gift of crowd-reading to determine the pace (and the tone) of a presentation. Luckily, I’m an expert at laughing at my own jokes so after the first few slides (and figuring out the delay), I felt comfortable. So Andy and Eva, I am ready for the next webinar on December 20th — Audience, y’all can sign up here.
Fun Fact: In 6th grade I was in the same math class as Andy Kriebel’s sister-in-law. It was also the only year I ever served time in in-school suspension (but remember, correlation doesn’t imply causation).
I was unable to get to all the questions asked on the webinar but rest assured I will do my best to field those here.
6. Q: So with nonlinear regression [is it] better to put the prediction on the y-axis? A: With linear and nonlinear regression, the variable you want to predict will always be your y-axis (dependent) variable. That variable is always depicted with a y with a caret on top (ŷ), and it’s called “y-hat”.
If you haven’t had time to go through Andy’s Visual Vocabulary, take a look at the correlation section.
At the end of the webinar I recommended Bora Beran’s blog for fantastic explanations on Tableau’s modeling features. He has a statistics background and explains the technical in a clear, easy-to-understand format.
Don’t forget to learn about residual plots if you are using regression to predict.
]]>
As you’d imagine, I’ve been introspective. Am I living my best life? If I go out tomorrow, will I have regrets?
If you’ve asked my advice, I’ve recommended the adventure over the status-quo; the challenge over the straight path; the Soup Number 5 over the chicken fingers. I’ve told you challenges grow you and discomfort makes you stronger. I come from personal experience — I’ve found regrets only inside the comfortable and the “supposed tos”.
My good friend Ericka lost her brother Daniel 5 years ago yesterday in a base-jumping accident. To honor his memory, Ericka posted a video he’d “have others watch for inspiration.”
So watch the video. And afterwards, if you want, you can keep reading this short post. But I won’t be offended if you choose to act on the emotions of the question, “What if you just had one more day? What are you going to do today?”
Thursday, December 3rd 2015 started off with a strange, uncomfortable internal pain – which I chose to ignore. I was married to my schedule and my routine. So I went to the gym, got ready for work, dropped the kids off at their schools, and was at my school by 7:50am. A little voice told me, “Go to the hospital” as I walked into my classroom, but my duty was to my students, to my school, and to my responsibility to everyone else. “I’m sure it’s nothing,” I told my brain.
The pains would come and go every 5 minutes by that time. I taught (or, I tried to teach) my first period pre-calculus class. (I’ll never forget – it was a lesson on the Law of Sines ambiguous case. The “ASS” case. A tricky lesson involving logic, geometry, and effective cursing.) But I hit a point where the pain was so intense I’d have to stop the lesson and sit to take deep breaths. A student suggested I had appendicitis. And I remembered from natural childbirth the indication of real pain was the inability to walk or talk when it hit – and THIS is when I decided to step off that path of expectations of others and ask for help.
A coworker drove me to the hospital and a short while later, I was lying on a stretcher receiving an ultrasound on my abdomen. Stunned techs ran around me loudly relaying their confusion to each other. I’d had enough pain medicine to take the edge off the intensity, but at this point my stomach was beginning to protrude near my belly button and I understood the voices around me were screaming, “Emergency!”
When you think you are on your death bed, or when you’re given terrible news, or when you are in your last moments, I think the thoughts are the same: Have I said and done enough? Do the people I love know how I feel about them? Will my children remember me?
At 40 my regrets are now the words I didn’t say to the people I love.
But few are given a second chance to change how they live their life.
For the record, I had an intussusception – my small intestine telescoped into my large intestine and was sucked further and further until emergency surgery saved my life. It took a while for the doctors to solve my mystery because an intussusception is so rare in adults, especially in females. I was told at first it was most likely caused by colon cancer, though thankfully, the pathology report came back clear a few days later. After a full 6 days in the hospital and 26 staples down my abdomen, I was released into the arms of my loving family.
What if you just had one more day? What are you going to do today?
Rest in peace, Daniel Moore.
]]>For what it’s worth … it’s never too late, or in my case too early, to be whoever you want to be. There’s no time limit. Start whenever you want. You can change or stay the same. There are no rules to this thing. We can make the best or the worst of it. I hope you make the best of it. I hope you see things that startle you. I hope you feel things you never felt before. I hope you meet people who have a different point of view. I hope you live a life you’re proud of, and if you’re not, I hope you have the courage to start all over again.
– F. Scott Fitzgerald, The Curious Case of Benjamin Button
To recap, you own Pearson’s Pizza and you’ve hired your nephew Lloyd to run the establishment. And since Lloyd is not much of a math whiz, you’ve decided to help him learn some statistics based on pizza prices.
When we left off, you and Lloyd realized that, despite a strong correlation and high R-Squared value, the residual plot suggests that predicting pizza prices from toppings will become less and less accurate as the number of toppings increases:
Looking back at the original scatterplot and software output, Lloyd protests, “But the p-value is significant. It’s less than 0.0001.”
Doesn’t a small p-value imply that our model is a go?
In Pearson Pizza’s back room, you’ve got two pool tables and a couple of slot machines (which may or may not be legal). One day, a tall, serious man saunters in, slaps down 100 quid and challenges (in a British accent), “My name is Ronnie O’Sullivan. That’s right. THE. Ronnie O’Sullivan.”
You throw the name into the Google and find out the world’s best snooker player just challenged you to a game of pool.
Then something interesting happens. YOU win.
Suddenly, you wonder, is this guy really who he says he is?
Because the likelihood of you winning this pool challenge to THE Ronnie O’Sullivan is slim to none IF he is, indeed, THE Ronnie O’Sullivan (the world’s best snooker player).
Beating this guy is SIGNIFICANT in that it challenges the claim that he is who he claims he is:
You aren’t SUPPOSED to beat THE Ronnie O’Sullivan – you can’t even beat Lloyd.
But you did beat this guy, whoever he claims to be.
So, in the end, you decide this man was an impostor, NOT Ronnie O’Sullivan.
In this scenario:
The claim (or “null hypothesis”): “This man is Ronnie O’Sullivan.” You have no reason to question him; you’ve never even heard of snooker.
The alternative claim (or “alternative hypothesis”): “This man is NOT Ronnie O’Sullivan”
The p-value: The likelihood you beat the world’s best snooker player assuming he is, in fact, the real Ronnie O’Sullivan.
Therefore, the p-value is the likelihood an observed outcome (at least as extreme) would occur if the claim were true. A small p-value can cast doubt on the legitimacy of the claim – chances you could beat Ronnie O’Sullivan in a game of pool are slim to none so it is MORE likely he is not Ronnie O’Sullivan. Still puzzled? Here’s a clever video explanation.
The intention of this post is to tie the meaning of this p-value to your decision, in the simplest terms I can find. I am leaving out a great deal of theory behind the sampling distribution of the regression coefficients – but I would be happy to explain it offline. What you do need to understand, however, is your data set is just a sample, a subset, from an unknown population. The p-value output is based on your observed sample statistics in this one particular sample while the variation is tied to a distribution of all possible samples of the same size (a theoretical model). Another sample would indeed produce a different outcome, and therefore a different p-value.
The output below gives a great deal of insight into the regression model used on the pizza data. The statistic du jour for the linear model is the p-value you always see in a Tableau regression output: The p-value testing the SLOPE of the line.
Therefore, a significant or insignificant p-value is tied to the SLOPE of your model.
Recall, the slope of the line tells you how much the pizza price changes for every topping ordered. Slope is a reflection of the relationship of the variables you’re studying. Before you continue reading, you explain to Lloyd that a slope of 0 means, “There is no relationship/association between the number of toppings and the price of the pizza.”
Zoom into that last little portion – and look at the numbers in red below:
Panes | | Line | | Coefficients | | | | |
Row | Column | p-value | DF | Term | Value | StdErr | t-value | p-value |
Pizza Price | Toppings | < 0.0001 | 19 | Toppings | 1.25 | 0.0993399 | 12.5831 | < 0.0001 |
 | | | | intercept | 15 | 0.362738 | 41.3521 | < 0.0001 |
In this scenario:
The claim: “There is no association between number of toppings ordered and the price of the pizza.” Or, the slope is zero (0).
The alternative claim: “There is an association between the number of toppings ordered and the price of the pizza.” In this case, the slope is not zero.*
The p-value = Assuming there is no relationship between number of toppings and price of pizza, the likelihood of obtaining a slope of at least $1.25 per topping is less than 0.01%.
The p-value is very small**: a slope of at least $1.25 would happen less than 0.01% of the time just by chance. This small p-value means you have evidence of a relationship between number of toppings and the price of a pizza.
The p-value here gives evidence of a relationship between the number of toppings ordered and the price of the pizza – which was already determined in part 1. (If you want to get technical, the correlation coefficient R is used in the formula to calculate slope.)
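One way to see that the slope test and the correlation tell the same story: in simple linear regression, the slope's t-value can be recovered from R alone, using t = R·sqrt(n − 2) / sqrt(1 − R²). Here is a minimal Python sketch, assuming the rounded R-Squared of 0.893 quoted elsewhere in this series and the 21 observations mentioned in the notes at the end of this post. Because R-Squared is rounded, expect only an approximate match with the t-value of 12.5831 in the output above.

```python
import math

r_squared = 0.893  # rounded R-Squared quoted for the pizza data
n = 21             # number of observations in the sample

# t statistic for the slope, written purely in terms of the correlation
t_from_r = math.sqrt(r_squared * (n - 2) / (1 - r_squared))

print(round(t_from_r, 2))  # lands close to the 12.5831 in the regression output
```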
Applying a regression line for prediction requires the examination of all parts of the model. The p-value given merely reflects a significant slope — recall there is additional error (residuals) to consider and outside variables acting on one or both of the variables.
Ultimately, Pearson’s Pizza CAN apply the linear model to predict pizza prices from number of toppings. But only within reason. You decide not to predict for pizza prices when more than 5 toppings are chosen because, based on the residual plot, the prediction error is too great and the variation may ultimately hurt long-term budget predictions.
In a real business use case, the p-value, R-Squared, and residual plots can only aid in logical decision-making. Lloyd now realizes, thanks to your expertise, that using data output just to say he’s “data-driven” without proper attention to detail and common sense is unwise.
Statistical methods can be powerful tools for uncovering significant conclusions; however, with great power comes great responsibility.
—Anna Foard is a Business Development Consultant at Velocity Group
*Note this is an automatic 2-tailed test. Technically it is testing for a slope of at least $1.25 OR at most -$1.25. For a 1-tailed test (looking only for greater than $1.25, for example) divide the p-value output by 2. For more information on the t-value, df, and standard error I’ve included additional notes and links at the bottom of this post.
**How small is small? Depends on the nature of what you are testing and your tolerance for “false negatives” or “false positives”. It is generally accepted practice in social sciences to consider a p-value small if it is under 0.05, meaning an observation at least as extreme would occur less than 5% of the time by chance if the claim were true.
More information on t-value, df:
For those curious about the t-value, this statistic is also called the “test statistic”. (It is not the same thing as the “critical value”, which is the cutoff the test statistic is compared against.) This value is like a z-score, but relies on the Student’s t-distribution. In other words, the t-value is a standardized value indicating how many standard errors the observed slope of 1.25 falls from the hypothesized slope of 0, taking into account sample size and variation (standard error).
In t-tests for regression, degrees of freedom (df) is calculated by subtracting the number of parameters being estimated from the sample size. In this example, there are 21 – 2 degrees of freedom because we started with 21 independent points, and there are two parameters to estimate, slope and y-intercept.
Degrees of freedom (df) represents the amount of independent information available. For this example, n = 21 because we had 21 pieces of independent information. But since we used one piece of information to calculate slope and another to calculate the y-intercept, there are now n – 2 or 19 pieces of information left to calculate the variation in the model, and therefore the appropriate t-value.
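To make these notes concrete, here is a rough Python sketch that rebuilds the t-value, the degrees of freedom, and the two-tailed p-value from the numbers in the regression output. The helper names (`t_pdf`, `two_tailed_p`) and the Simpson's-rule integration are my own illustrative choices, not what Tableau does internally. A statistics library would normally hand you the p-value directly.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def two_tailed_p(t, df, upper=200.0, steps=20000):
    """Approximate P(|T| >= |t|) with Simpson's rule on [|t|, upper].

    Assumes the area beyond `upper` is negligible; `steps` must be even.
    """
    a = abs(t)
    if a >= upper:
        return 0.0
    h = (upper - a) / steps
    total = t_pdf(a, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(a + i * h, df)
    return 2 * total * h / 3  # double the upper tail for a 2-tailed test

# Values taken from the regression output
slope, slope_se = 1.25, 0.0993399
n, params = 21, 2             # 21 observations; slope and intercept estimated

df = n - params               # degrees of freedom
t_value = slope / slope_se    # distance of the slope from 0, in standard errors
p_two = two_tailed_p(t_value, df)
p_one = p_two / 2             # 1-tailed version: divide the 2-tailed p-value by 2

print(df, round(t_value, 4), p_two < 0.0001)
```

The same division works for the intercept row: 15 / 0.362738 gives its t-value of roughly 41.35.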
]]>You are the sole proprietor of Pearson’s Pizza, a local pizza shop. Out of nepotism and despite his weak math skills, you’ve hired your nephew Lloyd to run the joint. And because you want your business to succeed, you decide this is a good time to strengthen your stats knowledge while you teach Lloyd – after all,
“In learning you will teach, and in teaching you will learn.”
– Latin Proverb and Phil Collins
Your pizzas are priced as follows:
Cheese pizza (no toppings): $15
Additional toppings: $1/each for regular, $1.50 for premium
When we left off, you and Lloyd were exploring the relationship between the number of toppings and the pizza price using a sample of possible scenarios.
When investigating data sets of two continuous, numerical variables, a scatterplot is the typical go-to graph of choice. (See Daniel Zvinca’s article for more on this, and other options.)
So. When do we throw in a “line of best fit”? The answer to that question may surprise you:
A “line of best fit”, a regression line, is used to: (1) assess the relationship between two continuous variables that may respond or interact with each other, and (2) predict the value of y based on the value of x.
In other words, a regression line may not add value to Lloyd’s visualization if it won’t help him predict pizza prices from the number of toppings ordered.
The equation: pizza price = 1.25*Toppings +15
Recall the slope of the line above says that for every additional topping ordered the price of the pizza will increase by $1.25.
In the last post you discussed some higher-order concepts with Lloyd, like the correlation coefficient (R) and R-Squared. Using the data above, you said, “89.3% of the variability (differences) in pizza prices can be explained by the number of toppings.” This also means the remaining 10.7% of the variability is explained by other variables, in this case the two types of toppings.
Since there is a high R-Squared value, does Pearson’s Pizza have a solid model for prediction purposes? Before you answer, consider the logic behind “least-squares regression.”
You and Lloyd now understand that “trend lines”, “lines-of-best-fit”, and “regression lines” are all different ways of saying, “prediction lines.”
The least-squares regression line, the most common type of prediction line, uses regression to minimize the sum of the squared vertical distances from each observation (each point) to the regression line. These vertical distances, called residuals, are found simply by subtracting the predicted pizza price from the actual pizza price for each observed pizza purchase.
The magnitude of each residual indicates how much you’ve over- or under- predicted the pizza price, or prediction error.
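For readers who like to see the arithmetic, here is a small Python sketch of a least-squares fit. The `orders` list is a made-up sample that follows the pricing rules above; it is NOT the data set behind the plots in this post, so its slope and intercept will not match the post's equation exactly. The function names are hypothetical too.

```python
# Hypothetical (toppings, price) observations that obey the menu prices;
# NOT the sample used in this post.
orders = [(0, 15.0), (1, 16.0), (1, 16.5), (2, 17.0), (2, 18.0), (3, 18.5),
          (3, 19.5), (4, 19.0), (4, 21.0), (5, 20.0), (5, 22.5)]

def least_squares(points):
    """Return (slope, intercept) minimizing the sum of squared residuals."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def sum_squared_residuals(points, slope, intercept):
    """Sum of squared vertical distances from each point to the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in points)

slope, intercept = least_squares(orders)

# Residual = actual price - predicted price, one per observation
residuals = [y - (slope * x + intercept) for x, y in orders]

print(round(slope, 3), round(intercept, 3))
print(round(sum_squared_residuals(orders, slope, intercept), 3))
```

Nudging the slope or the intercept away from the fitted values can only make the sum of squared residuals larger, which is all "best fit" means here.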
Note the green lines in the plot below:
Recall, the least-squares regression equation:
pizza price = 1.25(toppings) + 15
Lloyd says he can predict the price of a pizza with 12 toppings:
pizza price = 1.25*12 + 15
pizza price = $30
Sure, it’s easy to take the model and run with it. But what if the customer ordered 12 PREMIUM toppings? Logic says that’s (1.50)*12 + 15 = $33.
You explain to Lloyd that the residual here is 33 – 30, or $3. When a customer orders a pizza with 12 premium toppings, the model UNDER predicts the price of the pizza by $3.
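The same calculation, sketched in Python. The function names are hypothetical; the prices and the regression equation come straight from this post.

```python
def predict_price(toppings, slope=1.25, intercept=15.0):
    """Price predicted by the regression line."""
    return slope * toppings + intercept

def actual_price(regular, premium, base=15.0, regular_price=1.0, premium_price=1.5):
    """Price the customer really pays under the menu rules."""
    return base + regular * regular_price + premium * premium_price

predicted = predict_price(12)                # 12 toppings, type unknown to the model
actual = actual_price(regular=0, premium=12) # all 12 toppings are premium
residual = actual - predicted                # positive residual = model under-predicts

print(predicted, actual, residual)  # 30.0 33.0 3.0
```

Swap in `actual_price(regular=12, premium=0)` and the residual flips to -3: the model over-predicts by the same amount.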
How valuable is THIS model for prediction purposes? Answer: It depends how much error is acceptable to your business and to your customer.
To determine if a linear model is appropriate, protocol* says to create a residual plot and check the graph of residuals. That is, graph all x-values (# of toppings) against the residuals and look for any obvious patterns. Create a residual plot with your own data here.
Ideally, the graph will show a cloud of points with no pattern. Patterns in residual plots suggest a linear model may NOT be suitable for prediction purposes.
You notice from the residual plot above that, as the number of toppings increases, the residuals increase. You realize the prediction error increases as we predict for more toppings. For Pearson’s Pizza, the least-squares regression line may not be very helpful for predicting price from toppings as the number of toppings increases.
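You can see why the spread widens without any plot at all. Under the menu prices, a pizza with k toppings costs anywhere from 15 + 1.00k (all regular) to 15 + 1.50k (all premium), while the line always predicts 15 + 1.25k. A quick Python sketch of the resulting residual range follows; the function name is my own.

```python
def residual_range(toppings):
    """Smallest and largest possible residual for a given number of toppings."""
    predicted = 1.25 * toppings + 15
    cheapest = 15 + 1.00 * toppings   # every topping is regular
    priciest = 15 + 1.50 * toppings   # every topping is premium
    return cheapest - predicted, priciest - predicted

for k in (0, 1, 5, 12):
    low, high = residual_range(k)
    print(k, low, high)
```

The worst-case error is 25 cents per topping in either direction, so it grows from nothing on a plain cheese pizza to $3 at 12 toppings.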
Is a residual plot necessary? Not always. The residual plot merely “zooms in” on the pattern surrounding the prediction line. Becoming more aware of residuals and the part they play in determining model fit helps you look for these patterns in the original plots. In larger data sets with more variability, however, patterns may be difficult to find.
Lloyd says, “But the p-value is significant. It’s < 0.0001. Why look at the visualization of the residual plot when the p-value is so low?”
Is Lloyd correct?! Find out in Part 3 of this series.
Today Lloyd learned a regression line adds little to no value to his visualization if it won’t help him predict pizza prices from the number of toppings ordered.
As the owner of a prestigious pizza joint, you realize the importance of visualizing both the scatterplot and the residual plot instead of flying blind with correlation, R-Squared, and p-values alone.
Understanding residuals is one key to determining the success of your regression model. When you decide to use a regression line, keep your ultimate business goals in mind – apply the model, check the residual plot, calculate specific residuals to judge prediction error. Use context to decide how much faith to place in the magical maths.
*Full list of assumptions to be checked for the use of linear regression and how to check them here.
Want to have your own least-squares fun? This Rossman-Chance applet provides hours of entertainment for your bivariate needs.
—Anna Foard is a Business Development Consultant at Velocity Group
]]>