How to be the Life of the Party Part 3: Permutations and Combinations

(and why your locker combination is actually a permutation)

Welcome to the third installment of my Cheat Sheet for Stats. Be sure to check out Part 1 and Part 2.

Permutations and combinations are useful to anyone who needs to count the total number of possible arrangements or selections from a set or group. This is especially helpful in probability when calculating a denominator and/or numerator.

The difference between a permutation and a combination is simple to understand – if you pay close attention to how the items/objects/people are chosen (and ignore semantics). In this post I’ll give you definitions, formulas, and examples of both permutations and combinations. But first, I’ll discuss the Fundamental Counting Principle and factorials.

The Fundamental Counting Principle

Also known as the multiplication counting rule, this principle says to multiply the number of possibilities for each event together to find the total number of outcomes.

A simple example starts with packing for a vacation. Say you pack 4 shirts, 3 pairs of pants, and 2 pairs of shoes. How many possible outfits can you make? (Assume they all match, or you are 5 years old and don’t give a flip.)

The fundamental counting principle says you now have:

4 * 3 * 2 = 24 possible outfits

Here’s another example. Let’s say your company requires a 5-value verification code consisting of 3 numerical values and 2 alphabetic values (in that order and case sensitive). How many possible verification codes can be produced?

Sometimes it helps to see what is going on by visualizing the five positions and the possible values in each: three digit positions followed by two letter positions. There are 10 total digits to consider (0 – 9) and 26 letters in the alphabet – 52 if case-sensitive. The trick is to multiply the number of options per position to find the total possible outcomes:

10 * 10 * 10 * 52 * 52 = 2,704,000 different verification codes

And what if the requirement changed to 3 numbers and 2 letters (same order), but no repeats? We’d have to reduce the number of options for each digit/letter as they are used:

10 * 9 * 8 * 52 * 51 = 1,909,440 different verification codes

There is a little more math involved if you can put these values in any order and I won’t cover that in this post.
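If you’d like to double-check the arithmetic, here’s a minimal Python sketch of both verification-code counts:

```python
# Fundamental Counting Principle: multiply the number of options per position.
# Verification code: 3 digits (0-9) followed by 2 case-sensitive letters (52 options).
digits, letters = 10, 52

# Repeats allowed
print(digits * digits * digits * letters * letters)  # 2,704,000

# No repeated digits or letters: options shrink as each position is filled
print(10 * 9 * 8 * 52 * 51)                          # 1,909,440
```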

Factorials

At first glance, a factorial looks like a very excited number. For instance, 5! might appear to be yelling, “FIVE!” (Silly teacher joke – works better in person.) The exclamation point is actually an operator telling us to multiply that number by every positive integer below it, down to 1. For example, 5! = 5 * 4 * 3 * 2 * 1 = 120.


Permutations

Permutations apply the Fundamental Counting Principle to determine the number of ways you can arrange members of a group. The permutation formula calculates the number of arrangements of n objects taken r at a time:

P(n, r) = n! / (n – r)!

For example, let’s say you and 29 other people are in the running for 3 distinct prizes. Your names are in a hat and prizes are only given to the first, second, and third names drawn (the best prize being first). The number of ways 30 people can take first, second, and third prize is called a permutation. In a permutation, the order in which the items or people are arranged “matters”. (And by matters, you could say the order is noted, or apparent.)


For the prize example, you can calculate this using the formula for permutations:

P(30, 3) = 30! / (30 – 3)! = 30! / 27!

And this goes back to the fundamental counting principle, since a portion of the numerator cancels with the denominator:

Simplifying the expression gives 30 * 29 * 28 = 24,360 ways 3 individuals can be awarded first, second, and third prize from a group of 30 in a random drawing. If we were merely drawing 3 names all at once with no difference in prizes, it would NOT be considered a permutation.

Luckily you really don’t need to know the formula to calculate a permutation. The Excel function for permutations is PERMUT: =PERMUT(30, 3) returns 24,360.

Note: There is another Excel function for permutations with repetitions – that one is PERMUTATIONA. For this example, you would use it if each name drawn was returned to the hat before the next draw, making it possible for the same person to win all 3 times: =PERMUTATIONA(30, 3) returns 30^3 = 27,000.

Combinations

Now suppose you and 29 other individuals are in the running for 3 prizes, all with the same value. Your names are in a hat and all three names are drawn at once. Because no order or arrangement is involved, this type of counting technique is called a combination. The combination formula also counts n objects taken r at a time:

C(n, r) = n! / (r! * (n – r)!)

notice the denominator is different – and since you’re dividing by a larger number you can see that a combination will produce fewer possible groups than will a permutation

For the newest version of our prize example, we are taking 3 names from the hat at one time and there is no difference between prizes. Here is that calculation:

C(30, 3) = 30! / (3! * 27!)

Once the 27! in the numerator and denominator cancel, we are left with 30 * 29 * 28 = 24,360 in the numerator, but we still divide by 3! (which is 3 * 2 * 1 = 6):

24,360 / 6 = 4,060 possible combinations.

The Excel function for combinations is COMBIN: =COMBIN(30, 3) returns 4,060.
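Outside of Excel, Python can do the same counting. Here’s a quick sketch of the prize example (math.perm and math.comb require Python 3.8+):

```python
from math import comb, factorial, perm

# 30 people, 3 distinct prizes: order matters, so it's a permutation
print(perm(30, 3))   # 24360 -- same as Excel's =PERMUT(30, 3)

# 30 people, 3 identical prizes drawn at once: order doesn't matter
print(comb(30, 3))   # 4060  -- same as Excel's =COMBIN(30, 3)

# The formulas spelled out with factorials, for comparison
print(factorial(30) // factorial(27))                   # 24360
print(factorial(30) // (factorial(3) * factorial(27)))  # 4060

# Names returned to the hat each draw (repeats allowed) -- =PERMUTATIONA(30, 3)
print(30 ** 3)       # 27000
```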

A Locker Combination is Actually a Permutation


Now consider locker combinations. Let’s assume a typical dial lock (right, left, right) with 39 numbers on the dial and a code of 3 numbers. Does order matter? Absolutely! If you try to open the lock using your 3-number code in a different order, the locker will not open. So how many possible codes does this locker have?

If numbers couldn’t repeat, we’d have P(39, 3) = 39 * 38 * 37 = 54,834 different codes (or what we call “combinations”). But if numbers could repeat, there are 39 * 39 * 39 = 59,319 possible codes – to include repeatable values, apply the PERMUTATIONA function in Excel.

You Try!

Based on what you just learned, can you spot the difference between a combination and permutation? Bonus points if you can calculate the result. (Answers at the end of the post.)

  1. A board of directors consists of 13 people. In how many ways can a chief executive officer, a director, and a treasurer be selected?
  2. How many ways can a jury of 12 people be selected from a group of 40 people?
  3. A GM from a restaurant chain must select 8 restaurants from 14 for a promotional program. How many different ways can this selection be done?
  4. At Waffle House, hash browns can be ordered 18 different ways. How many possible orders can be made by choosing only 3 of the 18?
  5. A locker can have a 4-digit code. How many different codes can we have if there are 25 different numbers and numbers cannot repeat in any given code?

Answers:

  1. Permutation. P(13,3) = 1716
  2. Combination. C(40,12) = 5,586,853,480 (order isn’t important here)
  3. Combination. C(14,8) = 3003
  4. Combination. C(18,3) = 816
  5. Permutation. P(25,4) = 303,600 (Repeating numbers within a code would give 390,625 different codes.)

How to be the Life of the Party Part 2: Measures of Position

Welcome to Part 2 of my Cheat Sheet for Stats series, sure to make you the life of the party! In case you missed it, Part 1 covered measures of center and spread. In Part 2 we’ll dive into measures of position and location within a dataset – specifically how to calculate, apply, and interpret them.

Measures of Position

Comparisons of month-to-month sales volume, or of current housing prices within your neighborhood, are easy to make without additional calculations. However, sometimes metrics aren’t exactly comparable in absolute terms – for example, sales volume from 1989 versus 2015, or current housing prices in Atlanta versus San Francisco. So we typically “normalize” these metrics by adjusting for inflation or cost of living so we can compare relative values.

Another way of comparing considers relative position. Percentiles, quantiles, and z-scores measure the location of a value in relation to other values in a dataset. Once you know the price of a house in Atlanta relative to the rest of the Atlanta housing market, and the price of a house in San Francisco relative to the rest of the San Francisco housing market, you can see how those two housing prices compare to each other.

Percentiles

Percentiles describe the position of a data point relative to the rest of the dataset using a percent. That’s the percent of the rest of the dataset that falls below the particular data point.

Finding an Unknown Percentile

Example: Pretend you’re the 2nd oldest out of a group of 20 people. To find your percentile, count the number of people YOUNGER than you (18) and divide by the total number of people:

18/20 = 0.9, or the 90th percentile

(Note: By calculating percentile using the “percent less than” formula above, the computation disallows a 100th percentile (which is not a valid percentile); however, it forces the lowest value to the 0th percentile (which is also invalid). Another popular variation computes the percent less than AND equal to: there are 19 people at your age or younger, so 19/20 puts you at the 95th percentile. This version fixes the “0th percentile” problem but allows the possibility of an invalid 100th percentile. Both versions of the percentile calculation are acceptable and there is no universally “correct” computation.)

Finding a Given Percentile

Example: You want to determine what height marks the 58th percentile of the 20 people. Multiply the 20 people by 58%: 20 * 0.58 = 11.6. This number is called the index. Round to the nearest whole number, 12 – therefore, the 12th height in your group of 20 people roughly falls at the 58th percentile.
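Here’s a small Python sketch of both directions – finding a value’s percentile and finding the value at a given percentile – using a hypothetical list of 20 ages:

```python
# Both percentile directions from this section, on a hypothetical list of 20 ages
ages = sorted([18, 21, 22, 24, 25, 27, 29, 30, 31, 33,
               34, 36, 38, 41, 43, 47, 52, 58, 61, 64])

# Percentile of a value: percent of the dataset strictly below it
you = 61                                        # the 2nd oldest of the 20
print(sum(a < you for a in ages) / len(ages))   # 0.9 -> 90th percentile

# Value at a given percentile: n times the percent gives the index
index = round(len(ages) * 0.58)                 # 20 * 0.58 = 11.6 -> 12
print(ages[index - 1])                          # the 12th value roughly marks the 58th percentile
```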

A cumulative relative frequency chart, used to locate percentiles

Quantiles

Quantiles break the dataset up into n equal pieces. They signal reference points (or positions) in the dataset to which individual data points can be compared. Specific examples of quantiles are deciles (slicing the data into 10 equal pieces), terciles (3 equal pieces), and quartiles (4 equal pieces). I’ll elaborate using quartiles, but all quantiles follow the same logic:

Quartiles

Quartiles break the dataset up into 4 equal pieces so data points can be compared within the dataset relative to the four quarters of data.

I love watching football, which is broken up into four quarters. If I turn the TV on and the game has already started, I know how far the game has progressed in time relative to the quarter displayed on the screen.

For data:

  • Lower Quartile (Q1) – Roughly the 25th percentile. 25% of the data falls below this point and 75% lies above.
  • Median (Q2) – Roughly the 50th percentile. Marks the middle value, where half the data falls below and half above.
  • Upper Quartile (Q3) – Roughly the 75th percentile. 75% of the data falls below this point and 25% lies above.

Quartiles are found using these steps:

  1. Arrange the data from least to greatest.
  2. Find the median, the middle number. (If there are two numbers in the middle, find the average of the two.)
  3. Using the median as the midpoint, the data is now split in half. Now find the middle value in the bottom half of values (this is Q1).
  4. Lastly, find the middle number of the top half of values (this is Q3).

Example: The following values represent the lifespan of a sample of animals. What values break these lifespans into quartiles?
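To make the method concrete, here’s a Python sketch of those steps. The twelve lifespans are hypothetical, chosen to reproduce the quartiles quoted later in this post:

```python
# Median-split quartile method from the steps above, on a hypothetical dozen
# lifespans matching this post's summary stats (Q1 = 8.5, median = 11, Q3 = 17.5)
lifespans = sorted([5, 6, 7, 10, 10, 10, 12, 14, 15, 20, 25, 35])

def middle(values):
    """Middle value of an ordered list, averaging the two middles when the count is even."""
    n = len(values)
    mid = n // 2
    return values[mid] if n % 2 else (values[mid - 1] + values[mid]) / 2

n = len(lifespans)
q2 = middle(lifespans)                 # the median: 11.0
q1 = middle(lifespans[:n // 2])        # middle of the bottom half: 8.5
q3 = middle(lifespans[(n + 1) // 2:])  # middle of the top half (skips the median if n is odd): 17.5
print(q1, q2, q3)
```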

Often quartiles are displayed visually with a box-and-whisker plot:

Full blog post on box-and-whisker plots HERE

Z-Scores

A z-score indicates how many standard deviations a value falls above or below the mean (average). A value with a positive z-score lies above the mean while a value with a negative z-score falls below the mean. Z-scores are a way of standardizing values in order to compare them using relative position.

To calculate a z-score, subtract the average of the population/dataset (mean) from the data point (observation), then divide by the standard deviation of the population/dataset:

z = (x – mean) / standard deviation

Example:

In 1927, Babe Ruth made history hitting 60 home runs in one Major League Baseball season. Only four people have been able to break Ruth’s record (though Mark McGwire and Sammy Sosa have broken that record 2 and 3 times, respectively). In 2001, Barry Bonds set the most recent record, hitting 73 home runs in a single season.

But just how does Barry’s home run performance compare to Babe’s? Many outside factors such as bat quality and pitcher performance could impact the number of home runs hit by an MLB player. So how did these athletes compare to their peers of the time?

The 1927 league home run average was 7.2 home runs with a standard deviation of 9.7, while the 2001 league average was an astounding 21.4 home runs with a standard deviation of 13.2.

To properly compare these heavy hitters, we need to determine how they performed relative to the peers of their era by standardizing their absolute HR numbers into z-scores:

z = (60 – 7.2) / 9.7 = 5.44: Babe Ruth’s performance of 60 HRs lies 5.44 standard deviations above the mean number of HRs hit in 1927.
z = (73 – 21.4) / 13.2 = 3.91: Barry Bonds’ performance of 73 HRs lies 3.91 standard deviations above the mean number of HRs hit in 2001.
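The same standardization takes only a few lines of Python:

```python
# z = (x - mean) / standard deviation
def z_score(x, mean, sd):
    return (x - mean) / sd

print(round(z_score(60, 7.2, 9.7), 2))    # 5.44 -- Babe Ruth, 1927
print(round(z_score(73, 21.4, 13.2), 2))  # 3.91 -- Barry Bonds, 2001
```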

While both athletes displayed phenomenal performances, Babe Ruth could still argue his status of home run champion when comparing in relative terms.

If you automatically think of a bell-shaped, normal curve when you hear “z-score”, you’re not alone. That’s a common connection because of the way z-scores are initially introduced in stats courses.

But z-scores can apply to ANY distribution because they are a way to compare data values using relative position. That is, z-scores “standardize” the data values from absolute to relative metrics.

The Altman Z-score

The Altman z-score is used to predict the likelihood that a company will go bankrupt. It applies a weighted calculation based on specific predictors of bankruptcy. This article gives an excellent overview for those interested in calculating and interpreting Altman z-scores.

Next up in Part 3: Counting principles, including permutations and combinations. Because the combination for your combination lock is actually a permutation.

How to be the Life of the Party Part 1: A Cheat Sheet to Summary Statistics

Statistics: What are they good for?

Statistics are values or calculations that attempt to summarize a sample of data. If you’re talking about a population of data, you’re usually dealing with parameters. (Easy to remember: Statistics and Samples start with the letter S, and Populations and Parameters start with the letter P.)

And if you know all there is to know about a population, then you wouldn’t be concerned with statistics. However, we almost never know about an entire population of anything, which is why we focus on studying samples and the statistics that describe those samples.

Statistics and summary statistics, as I mentioned, help us summarize a sample of data so that we can wrap our mind around the entire dataset. Some statistics are better than others, and which statistic you choose to summarize a dataset depends entirely upon the type of data you’re working with and your end goals.

Here, I will briefly discuss the different types, definitions, calculations, and uses of basic summary statistics known as “Measures of Central Tendency” and “Measures of Variation.”

Measures of Central Tendency

Measures of central tendency summarize your dataset into one “typical”, central value. It’s best to look at the shape of your data and your ultimate goals before choosing one measure of central tendency.

Mean

The mean, or average, is affected by extreme (outlying) values since, mathematically, it takes into account all values in the dataset.

Suppose you want to find the mean, or average, for the 5 runners on your running team:

The average function in Excel is AVERAGE.

Symbols:

x-bar is the mean of the sample
mu is the mean of the population

Median

Instead of taking the value of the numbers into account, the median considers which value(s) take the middle position. Outliers do NOT affect the value of the median.

The median function in Excel is MEDIAN.

Mode

The most common value, category, or quality is the mode. The mode measures “typical” for categorical data.

In quantitative data, we look for modal ranges to help us dig deeper and segment the dataset. I wrote a blog post recently that dives a little deeper into this concept, as well as visualizing measures of central tendency.

The mode function in Excel is MODE.
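Outside of Excel, Python’s statistics module covers all three. The running times here are hypothetical:

```python
from statistics import mean, median, mode

# Hypothetical 5K times (in minutes) for 5 runners
times = [22, 25, 25, 28, 35]

print(mean(times))    # 27 -- Excel's =AVERAGE(...)
print(median(times))  # 25 -- Excel's =MEDIAN(...)
print(mode(times))    # 25 -- Excel's =MODE(...)
```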

Weighted Mean

A weighted mean is helpful when all values in the dataset do not contribute to the average in the same way. For example, course grades are weighted based on the type of assignment (test, quiz, project, etc.). A test might count more toward your final grade than a homework assignment.

Weighted means are also applied when calculating the expected value of an outcome, such as in gambling and actuarial science. An expected value in gambling is what you’d expect to lose (because you’re gonna lose) over the long run (many, many trials). To calculate this, a probability is applied to each possible outcome and then multiplied by the value of the outcome — pretty much the same way you calculated your grade back in high school.

Note: Weighted Means and Expected Values will reappear in a later post discussing discrete probability distributions.

I wrote a more extensive blog post about “other” means, including Geometric and Harmonic means, in case you’re curious.

Measures of Variation

The middle number of the dataset is only one statistic used to summarize the dataset — the spread of the data is also extremely important to understanding what is happening within your data. Measures of variation tell us how the data varies from end to end and/or within the middle of the dataset, and because of this, some measures of spread help in identifying outliers.

For the following examples I took a non-random sample of animal lifespans and put them in order from shortest to longest lifespan. You’ll see the median is 11 years:

Range

The range only considers the variation from the smallest to largest value in the dataset. Here, the animals’ lifespans range 30 years (35 – 5 years). Unfortunately, range doesn’t give us much information about the variation within the dataset.

Interquartile Range

Interquartile range, or IQR, is the range of the middle 50% of the dataset. In this situation, it represents the middle 50% of animal lifespans.

To find the IQR, the values in the dataset must first be ordered least to greatest with the median identified (as I did above). Since the median cuts the dataset in half, we then look for the middle value of the bottom half of the dataset (that is, the middle lifespan between the kangaroo and the cat). That value represents the lower quartile, or Q1. Then look for the middle value of the top half of the dataset (that is, the middle value between the dog and the elephant). That value represents the upper quartile, or Q3.

The interquartile range is found by subtracting Q3 – Q1, or 17.5 – 8.5 = 9, which tells me that the middle 50% of these animals’ lifespans span 9 years. Unfortunately, IQR only gives information about the spread of the middle of the dataset.

You can calculate the IQR in Excel using the QUARTILE function to find the first quartile and the third quartile, then subtracting like in the above example.

Ever build a box-and-whisker plot (boxplot)? The “box” portion represents the IQR. Here’s a how-to with more information.

Outliers

One common method for calculating an outlier threshold in a dataset depends on the IQR. Once the IQR is calculated, it is multiplied by 1.5. Find the lower outlier threshold by subtracting IQR*1.5 from Q1, and the upper outlier threshold by adding IQR*1.5 to Q3. This is the method used to show outliers in box-and-whisker plots.

Standard Deviation

Standard deviation measures the typical departure, or distance, of each data point from the mean. (Recall, the “mean” is just the average.) So ultimately, the calculation for standard deviation relies on the value of the mean.

The formula below specifically calculates the standard deviation of the sample, denoted s (the population standard deviation is denoted sigma). I wouldn’t worry too much about memorizing this formula; however, understanding how standard deviation is calculated might help you understand what it calculates, so I’ll walk you through it:

s = √[ Σ(x – x̄)² / (n – 1) ]
  1. In the numerator within the parentheses, the mean is subtracted from each data point. Since the mean is the center of the data, some of the resulting differences are negative (or 0) and some are positive (or 0). (Adding these differences up will always result in 0.)
  2. Those differences are then squared (making all values positive).
  3. The Greek uppercase letter Sigma in front of the parentheses means, “sum” — so all the squared differences are added up.
  4. Divide that value by the sample size (n) MINUS 1. Your resulting answer is the variance of the dataset.
  5. Since the variance is a squared measure, take the square root at the end. Now you have the standard deviation.

An easier way, of course, is to use Excel’s standard deviation functions for samples: STDEV or STDEV.S
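If you’d like to see those five steps spelled out in code, here’s a Python sketch on a small hypothetical sample; it matches what Excel’s STDEV.S returns:

```python
from math import sqrt

data = [5, 7, 11, 12, 15]                # a small hypothetical sample
n = len(data)
x_bar = sum(data) / n                    # the sample mean: 10

squared_diffs = [(x - x_bar) ** 2 for x in data]  # steps 1-2: difference, then square
variance = sum(squared_diffs) / (n - 1)           # steps 3-4: sum, divide by n - 1
s = sqrt(variance)                                # step 5: square root

print(s)  # 4.0 -- matches =STDEV.S(5, 7, 11, 12, 15) in Excel
```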

Just like the mean, standard deviation is easily affected by outlying (extreme) values.

the MEDIAN is to INTERQUARTILE RANGE as MEAN is to STANDARD DEVIATION

– The Stats Ninja

Outliers

Several approaches are used to calculate the outlier threshold using standard deviations – which method is employed typically depends on the use case. Many software packages default to 3 standard deviations. If the outlier threshold is calculated using the IQR method (from above), roughly 2.7 standard deviations mark that boundary for normally distributed data.

Next Up: How to be the Life of the Party Part 2 breaks down measures of position, including percentiles and z-scores!

How Laser Tag Helped Students Learn About Data

How do you get a group of 15 to 18-year-old students interested in data prep and analysis? Why, you take them to play laser tag, of course!

That’s right, on a cold January day I loaded up two buses of teens and piloted them to an adventure at our local Stars and Strikes. And this is no small feat — this particular trip developed out of months of planning, after years of proclaiming that I would never ever ever ever EVER coordinate my own field trip for high school kids. I mean, you should SEE the stack of paperwork. And the level of responsibility itself made me anxious.

I’m a parent so I get it. And from a teacher’s point of view, many field trips aren’t worth the hassle.

So there I was, field trip money in one hand, clipboard in the other: Imagine a caffeinated Tracy Flick. But thanks to the help of two parent chaperones and the AP Psychology teacher (Coach B), we ran the smoothest data-related field trip modern education has ever known.

What Does Laser Tag Have to do With Statistics?

Statistics textbooks are full of canned examples and squeaky-clean data that often have no bearing on a student’s interests. For example, there is an oh-so-relatable exercise computing standard error for D-glucose contained in a sample of cockroach hindguts. In my experience, when students can connect to the data, they are able to connect to the concept. We’re all like that, actually — producing or collecting our own data enables us to see what we otherwise would have missed.

(I can assure you confidence intervals constructed from D-glucose in cockroach hindguts did little for understanding standard error.)

The real world is made up of messy data. It’s full of unknowns, clerical errors, bias, unnecessary columns, confusing date formats, missing values; the list goes on. Laser tag was suggested to me as a way to collect a “large” amount of data in a relatively short amount of time. And because of the size of the dataset, it required the students to input their own data — creating their own version of messy data, complete with clerical errors. From there they’d have to make sense of the data, look for patterns, and form hypotheses.

The Project

  • Students entered their data into a Google doc — you can find the complete data here.
  • Each partner team developed two questions for the data: One involving 1-variable analysis, another requiring bivariate analysis.
  • The duos then had to explore, clean, and analyze all 47 rows and 48 columns. At this point in the school year, students had been exposed to data up to about 50 rows, but never had they experienced “wide” data.
  • Analyses and presentations required a visualization, either using Excel or Tableau.

Partner projects lend themselves to fantastic analyses, with half the grading

Playing the Games

Methodology: Each student was randomly assigned to a team using a random number generator. Teams of 5 played each other twice during the field trip. The teams were paired to play each other randomly. If, by chance, a team was chosen to play the same team twice, that choice would be ignored and another random selection would be made until a new team was chosen.

Before each game, I recorded which student wore which laser tag vest number. From the set-up room, I could view which vest numbers were leading the fight and which team had the lead. It was entertaining. As the students (and Coach B — we needed one more player for even teams) finished their games, score cards were printed and I handed each student their own personal results. The words, “DON’T lose this” exited my lips often.

Upon our return to school (the trip only took a few hours, to the students’ dismay), results were already pouring into the Google doc I’d set up ahead of time.

Teaching Tableau and Excel Skills

The AP Statistics exam is held every year in May, hosted by The College Board. On the exam, students are expected to use a graphing calculator but have no access to a computer or Google. Exactly the opposite of the real world.

Throughout the course, I taught all analysis first by hand, or using the TI-83/84. As students became proficient, I added time in the computer lab to teach basic skills using Excel and Tableau (assignments aligned to the curriculum while teaching skills in data analysis). It was my goal for students to have a general understanding of how to use these “real world” analytics tools while learning and applying AP Statistics curriculum.

After the field trip, we spent three days in the computer lab – ample time to work in Tableau and Excel with teacher guidance. Students spent time exploring the 48-column field trip dataset with both Excel and Tableau. They didn’t realize it, but by deciding which chart type to use for different variables, they were actually reviewing content from earlier in the year.

When plotting bivariate quantitative data, a scatterplot is often the go-to chart

Most faculty members had never heard of Tableau. At lunch one day I sat down with Coach B to demonstrate Tableau’s interface with our field trip dataset.

“What question would you ask this set of data?” I asked.

“A back shot is a cheap shot. I wonder who is more likely to take a cheap shot, males or females?”

So I proceeded to pull up a comparison and used box-and-whisker plots to look for outliers. Within seconds, a large outlier was staring back at us within the pool of male students:

“Ha. I wonder who that was.” – Coach B

“That’s YOU.” – Me

From there, I created a tongue-in-cheek competitive analysis from the data:

Full color version found here.

Student Response

I’ve been teaching since 2004. Over the years, this was probably the most successful project I’ve seen come through my classroom. By “successful”, I mean the proportion of students who were able to walk outside their comfort zone and into a challenging set of data, perform in-depth analyses, then communicate clear conclusions was much higher than in all previous years.

At the end of the year, after the AP Exam, after grades were all but inked on paper, students still talked excitedly about the project. I’d like to think it was the way I linked a fun activity to real-world analysis, though it most likely has to do with getting out of school for a few hours. Either way, they learned something valuable.

Univariate Analysis

One student, Abby, gave me permission to share her work adding, “This is the project that tied it all together. This was the moment I ‘got’ statistics.”

Interestingly, students were less inclined to suggest the female outlier of 2776 shots was a clerical mistake (which it was). I found there were two camps: Students who didn’t want to hurt feelings, and students who think outliers in the wild need no investigation. Hmmm.

Bivariate Analysis

For a group of kids new to communicating stats, I thought this was pretty good. We tweaked their wording (to be more contextual) as we dove into more advanced stats, but their analysis was well thought through.

What I Learned

When you teach, you learn.
Earlier I said the project was a success based on the students’ results. That’s only partially true; it was also a success because I grew as an educator. After years of playing by the rules I realized that sometimes you need to get outside your comfort zone. For me that was two-fold: 1) Sucking it up and planning a field trip and 2) Losing the old, tired TI-83 practice problems and teaching real-world analytics tools.

The Box-and-Whisker Plot For Grown-Ups: A How-to

Author’s note: This post is a follow-up to the webinar, Percentiles and How to Interpret a Box-and-Whisker Plot, which I created with Eva Murray and Andy Kriebel. You can read more on the topic of percentiles in my previous posts.

No, You Aren’t Crazy.

That box-and-whisker plot (or, boxplot) you learned to read/create in grade school probably IS different from the one you see presented in the adult world.


The boxplot on the top originated as the Range Bar, published by Mary Spear in the 1950s, while the boxplot on the bottom was a modification created by John Tukey to account for outliers. Source: Hadley Wickham

As a former math and statistics teacher, I can tell you that (depending on your state/country curriculum and textbooks, of course) you most likely learned how to read and create the former boxplot (or, “range bar”) in school for simplicity. Unless you took an upper-level stats course in grade school or at University, you may have never encountered Tukey’s boxplot in your studies at all.

You see, teachers like to introduce concepts in small chunks. While this is usually a helpful strategy, students lose when the full concept is never developed. In this post I walk you through the range bar AND connect that concept to the boxplot, linking what you’ve learned in grade school to the topics of the present.

The Kid-Friendly Version: The Range Bar

In this example, I’m comparing the lifespans of a small, non-random set of animals. I chose this set of animals based solely on convenience of icons. Meaning, conclusions can only be drawn on animals for which Anna Foard has an icon. I note this important detail because, when dealing with this small, non-random sample, one cannot infer conclusions on the entire population of all animals.

1) Find the quartiles, starting with the median

Quartiles break the dataset into 4 quarters. Q1, median, Q3 are (approximately) located at the 25th, 50th, and 75th percentiles, respectively.

Finding the median requires finding the middle number when values are ordered from least to greatest. When there is an even number of data points, the two numbers in the middle are averaged.

Here the median is the average of the cat’s and the dog’s longevity. NOTE: With an even number of values, if the two middle values were different, the lower of the two would sit at the 50th percentile and would not be the same measure as the median.

Once the median has been located, find the other quartiles in the same way: The middle value in the bottom set of values (Q1), then the middle value in the top set (Q3).

Here we can easily see when quartiles don’t match up exactly with percentiles: Even though Q1 = 8.5, the duck (7) is in the 25th percentile while the pig is above the 25th percentile. And the sheep is in the 75th percentile despite the value of 17.5 at Q3.

2) Use the Five Number Summary to create the Range Bar

The first and third quartiles build the “box”, with the median represented by a line inside the box. The “whiskers” extend to the minimum and maximum values in the dataset:

The same range bar can be drawn with or without the individual data points shown.

The Range Bar probably looks similar to the first box-and-whisker plot you created in grade school. If you have children, it is most likely the first version of the box-and-whisker plot that they will encounter.

from elementary school Pinterest

Suggestion:

Since the kid’s version of the boxplot does not show outliers, I propose teachers call this version “the Range Bar,” as it was originally dubbed, so as not to confuse those reading the chart. After all, someone looking at this version of a boxplot may not realize it does not account for outliers and may draw the wrong conclusion.

The Adult Version: The Boxplot

The only difference between the range bar and the boxplot is the view of outliers. Since this version requires a basic understanding of the concept of outliers and a stronger mathematical literacy, it is generally introduced in a high school or college statistics course.

1) Calculate the IQR

The interquartile range is the difference, or spread, between the third and first quartile reflecting the middle 50% of the dataset. The IQR builds the “box” portion of the boxplot.

IQR = Q3 – Q1 = 17.5 – 8.5 = 9

2) Multiply the IQR by 1.5

1.5 * IQR = 1.5 * 9 = 13.5

3) Determine a threshold for outliers – the “fences”

1.5*IQR is then subtracted from the lower quartile and added to the upper quartile to determine a boundary or “fences” between non-outliers and outliers.

Lower fence: 8.5 – 13.5 = –5. Upper fence: 17.5 + 13.5 = 31.

4) Consider values beyond the fences outliers


Since no animal’s lifespan falls below –5 years, a low outlier is impossible in this particular dataset; however, one animal in this dataset lives beyond 31 years – a high outlier.
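The whole fence calculation fits in a few lines of Python, using the quartiles from this example:

```python
# Tukey's fences for the animal-lifespan example
q1, q3 = 8.5, 17.5
iqr = q3 - q1                    # 9
lower_fence = q1 - 1.5 * iqr     # -5
upper_fence = q3 + 1.5 * iqr     # 31
print(lower_fence, upper_fence)  # values outside (-5, 31) get flagged as outliers
```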

5) Build the boxplot

Here we find the modification on the “range bar” – the whiskers only extend as far as non-outlier values. Outliers are denoted by a dot (or star).

The adult version also allows us to apply technology, so I left the points in the view to better show the distribution.

Advantage: Boxplot

In an academic setting, I use boxplots a great deal. When teaching AP Statistics, they are helpful to visualize the data quickly by hand as they only require summary statistics (and outliers). They also help students compare and visualize center, spread, and shape (to a degree).

When we get into the inference portion of AP Stats, students must verify assumptions for certain inference procedures — often those procedures require data symmetry and/or absence of outliers in a sample. The boxplot is a quick way for a student to verify assumptions by hand, under time constraints. When I coach doctoral candidates through dissertation stats, similar assumptions are verified to check for outliers — using boxplots.

My portable visualization tool

Boxplot Advantages:

  • Summarizes variation in large datasets visually
  • Shows outliers
  • Compares multiple distributions
  • Indicates symmetry and skewness to a degree
  • Simple to sketch
  • Fun to say

I took my students on a field trip to play laser tag. Here, boxplots help compare the distributions of tags by type AND compare how Coach B measures up to the students.


So What Could Go Wrong?

Unfortunately, boxplots have their share of disadvantages as well.

Consider:

A boxplot may show summary statistics well; however, clusters and multimodality are hidden.

In addition, a consumer of your boxplot who isn’t familiar with the measures required to construct one will have difficulty making heads or tails of it. This is especially true when your resulting boxplot looks like this:

The median value is equal to the upper quartile. Would someone unfamiliar recognize this?

Or this:

The upper quartile is the maximum non-outlier value in this set of data.

Or what about this?

No whiskers?! Dataset values beyond the quartiles are all outliers.


Boxplot Disadvantages:

  • Hides the multimodality and other features of distributions
  • Confusing for some audiences
  • Mean often difficult to locate
  • Outlier calculation too rigid – “outliers” may be industry-based or case-by-case

Variations

Over the years, multiple boxplot variations have been created to display parts (or all) of the distribution’s shape and features.

No-whisker box plot. Source: Andy Kriebel

Going For It

Box-and-whisker plots may be helpful for your specific use case, though not intuitive for all audiences. It may be helpful to include a legend or annotations to help the consumer understand the boxplot.


Check Yourself: Ticket out the Door

No cheating! Without looking back through this post, check your own understanding of boxplots. Answer can be found on the #MakeoverMonday webinar I recorded with Eva Murray a couple weeks ago.


Cartoon Source: xkcd

How to Build a Cumulative Frequency Distribution in Tableau

When my oldest son was born, I remember the pediatrician using a chart similar to the one below to let me know his height and weight percentile. That is, how he measured up relative to other babies his age. This is a type of cumulative relative frequency distribution. These charts help determine relative position of one data point to the rest of the dataset, showing an accumulating percent of observations for each value. In this case, the chart helps determine how a child is growing relative to other babies his age.

 

Source: CDC

I decided to figure out how to create one in Tableau. Based on the types of cumulative frequency distributions I was used to when I taught AP Stats, I first determined I wanted the value of interest on the horizontal axis and the percents on the vertical axis.

Make a histogram

Using a simple example – US president age at inauguration – I started with a histogram so I could look at the overall shape of the distribution.



Adjust bin size appropriately

From here I realized I already had what I needed in my view – discrete ages on the x-axis and counts of ages on the y-axis. For a wider range of values I would want a wider bin size, but in this situation I needed to resize bins to 1, representing each individual age.



Change the marks from bars to a line


Create a table calculation

Click on the green pill on the rows (the COUNT) and add a table calculation.


Actually, TWO table calculations

First choose “Running Total”, then click on the box “add secondary calculation”:


Next, choose “percent of total” as the secondary calculation:


Polish it up

Add drop lines…


…and CTRL drag the COUNT (age in years) green pill from the rows to labels. Click on “Label” on the marks card and change the marks to label from “all” to “selected”.


 

 

And there you have it.

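If you’d like to verify the two table calculations outside Tableau, here’s a Python sketch of the same running-total-then-percent-of-total logic (the handful of ages is hypothetical):

```python
from collections import Counter

# A hypothetical handful of inauguration ages -- swap in the full dataset
ages = [57, 61, 57, 57, 58, 57, 61, 54, 68, 51]

counts = Counter(ages)
running = 0
for age in sorted(counts):
    running += counts[age]                           # running total of the counts...
    print(age, f"{100 * running / len(ages):.1f}%")  # ...as a percent of the grand total
```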

Interpreting percentiles

Percentiles describe the position of a data point relative to the rest of the dataset using a percent. That’s the percent of the rest of the dataset that falls below the particular data point. Using the baby weights example, the percentile is the percent of all babies of the same age and gender weighing less than your baby.

Back to the US president example.

Since I know Barack Obama was 47 when inaugurated, let’s look at his age relative to the other US presidents’ ages at inauguration:

13.3% of US presidents were younger than Barack Obama when inaugurated. Source: The Practice of Statistics, 5th Edition

And another way to look at this percentile: 87% of US presidents were older than Barack Obama when inaugurated.

Thank you for reading and have an amazing day!

-Anna

The Ways of Means

As a follow-up to last week’s webinar with Andy Kriebel and Eva Murray, I’ve put together just a few common examples of means other than the ubiquitous arithmetic mean. A great deal of work on each of these topics can be found throughout the interwebs if your Googling fingers get itchy.

The Weighted Mean

My favorite of all the means. Sometimes called expected value, or the mean of a discrete random variable.

Grades

When computing a course grade or overall GPA, the weighted mean takes into account each possible outcome and how often that outcome occurs in a dataset. A weight is applied to each possible outcome — for example, each type of grade in a course — and the weighted values are then added together to return the overall weighted mean. And since Econ was my favorite course in college…


If you have an exam average of 80, a quiz/homework average of 65, and a lab average of 78, what is your final grade? (Hint: Don’t forget to change percentages to decimals.)

If your professor’s software is forgiving, that’s a 76.
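Here’s that calculation as a Python sketch. Note the syllabus weights are my assumption – 50% exams, 25% quizzes/homework, 25% labs – but they land on the 76 shown above:

```python
# Weighted mean of course grades. The syllabus weights are assumed here:
# 50% exams, 25% quizzes/homework, 25% labs (they must sum to 1).
grades  = [80, 65, 78]
weights = [0.50, 0.25, 0.25]

final = sum(g * w for g, w in zip(grades, weights))
print(final)  # 75.75 -> a 76 if your professor's software rounds up
```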

Vegas

Weighted means are also effective for assessing risk in insurance or gambling. Also known as the expected value, it considers all possible outcomes of an event and the probability of each possible outcome. Expected values reflect a long-term average. Meaning, over the long run, you would expect to win/lose this amount. A negative expected value indicates a house advantage and a positive expected value indicates the player’s advantage (and unless you have skills in the poker room, the advantage is never on the player’s side). An expected value of $0 indicates you’ll break even in the long-run.

I’ll admit my favorite casino game is American roulette:


As you can see, the “inside” of the roulette table contains numbers 1-36 (18 of which are red, the other 18 black). But WAIT! Here’s how they fool you — see the numbers “0” and “00”? 0 and 00 are neither red nor black, though they do count towards the 38 total outcomes on the roulette board. When the dealer spins the wheel, a ball bounces around and chooses from numbers 1 thru 36, 0 AND 00 — that’s 38 possible outcomes.

Let’s say you wager $1 on “black”. And if the winning number is, in fact, black, you get your original dollar AND win another (putting you “up” $1). Unsuspecting victims new to the roulette table think they have a 50/50 shot at black; however, the probability of “black” is actually 18/38 and the probability of “not black” is 20/38.

Here’s how it breaks down for you:

E(X) = (+$1)(18/38) + (–$1)(20/38) = –$2/38 ≈ –$0.053

Just as in the grading example, each outcome (dollars made or lost) is first multiplied by its weight, where the weight here is the theoretical probability assigned to that outcome. After multiplying, add each product (outcome times probability) together. Note: Don’t divide at the end like you’d do for the arithmetic mean – it’s a common mistake, but easy to remedy if you check your work.
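Here’s the same expected-value calculation as a Python sketch:

```python
# Expected value of a $1 wager on black in American roulette
outcomes      = [+1, -1]         # win $1 on black, lose the $1 otherwise
probabilities = [18/38, 20/38]   # 18 black numbers out of 38 total

ev = sum(o * p for o, p in zip(outcomes, probabilities))
print(round(ev, 4))  # -0.0526 -> about 5.3 cents lost per $1 bet, long-run
```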

Some Gambling Advice: The belief that casino games adhere to some “law of averages” in the short run is called the Gambler’s Fallacy. Just because the ball on the roulette wheel landed on 5 red numbers in a row doesn’t mean it’s time for a black number on the next spin! I watched a guy lose $300 on three spins of the wheel because, as he exclaimed, “Every number has been red! It’s black’s turn! It’s the law of averages!”

The Geometric Mean

The geometric mean is useful when you’re looking to average a factor (multiplier) applied over time – like investment growth or compound interest.


I enjoyed my finance classes in school, especially the part about how compound interest works. If you think about compound interest over time, you may recall the growth is exponential, not linear. And exponential growth indicates that in order to grow from one value to the next, a constant was multiplied (not added).

As a basic example, let’s say you invest $100,000 at the beginning of 4 years. For simplicity, let’s say the growth rate followed the pattern +40%, -40%, +40%, -40% over the 4 years. At the end of 4 years, you’ve got $70,560 left.


So you know your 4-year return on the investment is (70,560 – 100,000)/100,000 = –0.2944, or –29.44%. But if you averaged the 4 growth rates using the arithmetic mean, you’d get 0%, which is why the arithmetic mean doesn’t make sense here.

Instead, apply the geometric mean:

geometric mean = (1.4 * 0.6 * 1.4 * 0.6)^(1/4) = 0.7056^(1/4) ≈ 0.9165, an average factor equivalent to losing about 8.35% per year
Note: Multiplying by .4 (or -.4) only returns the amount gained (or lost). Multiplying by 1.4 (or .6) returns the total amount, including what was gained (or lost).
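And the same calculation as a Python sketch:

```python
# Geometric mean of the four annual growth factors: +40%, -40%, +40%, -40%
factors = [1.4, 0.6, 1.4, 0.6]

product = 1.0
for f in factors:
    product *= f                      # 0.7056: the total 4-year factor

geo_mean = product ** (1 / len(factors))
print(geo_mean)                       # ~0.9165 -> about -8.35% per year
print(100_000 * geo_mean ** 4)        # ~70,560: recovers the ending balance
```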

The Harmonic Mean

You drive 60 mph to grandma’s house and 40 mph on the return trip. What was your average speed?


Let’s dust off that formula from physics class: speed = distance/time

Since the speed you drive plays into the time it takes to cover a certain distance, that formula may clue you in as to why you can’t just take an arithmetic mean of the two speeds. So before I introduce the formula for harmonic mean, I’ll combine those two trips using the formula for speed to determine the average speed.

The set-up: Distance doesn’t matter here, so we’ll use 1 mile. Feel free to use a different distance to verify, but you’d be reducing fractions a good bit along the way and I’m all about efficiency. Use a distance of 1 mile for each leg of the journey and the two speeds of 40 mph and 60 mph.

First determine the time it takes to go 1 mile by reworking the speed formula:

time = distance / speed, so the 40 mph mile takes 1/40 hour and the 60 mph mile takes 1/60 hour

To determine the average speed, we’ll combine the two legs of the trip using the speed formula (which will return the overall, or average, speed of the entire trip):

average speed = total distance / total time = 2 / (1/40 + 1/60) = 2 / (5/120) = 48 mph
If, instead of driving equal distances, you were looking for the average speed it took you to drive two equal amounts of time, the arithmetic mean WOULD be useful.

The formula for the harmonic mean looks like this:

H = n / (1/x₁ + 1/x₂ + … + 1/xₙ)

Where n is the number of 1-mile trips, in this example, and the rates are 40 and 60 mph:

H = 2 / (1/40 + 1/60) = 48 mph

If you scroll up and check out that last step using the speed formula (above), you’ll see the harmonic mean formula was merely a clean shortcut.
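Python’s statistics module even has this one built in:

```python
from statistics import harmonic_mean

speeds = [40, 60]
print(harmonic_mean(speeds))           # 48.0 mph

# The formula by hand: n divided by the sum of reciprocals
n = len(speeds)
print(n / sum(1 / s for s in speeds))  # 48.0
```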


If you want more information about measures of center, check out the previous blog post — Mean, Median, and Mode: How Visualizations Help Measure What’s Typical

If your organization is looking to expand its data strategy, fix its data architecture, implement data visualization, and/or optimize using machine learning, check out Velocity Group.