Sunday, November 29, 2015

Two methods for finding outliers


There are two methods for finding outliers, numbers that are far away from "the middle". If we count the middle as the median, we use the five number summary to find the thresholds for high and low outliers. If we count the middle as the average, we need to calculate the sample standard deviation, sx, and find the z-scores. With a calculator, this is very simple, but it can also be done by hand with small sets.

The five number summary

Here is a list of the number of wins for the Pac-12 teams as of November 29, 2015. Obviously, the length of the list is 12.

8, 7, 6, 4, 4, 0, 6, 6, 5, 4, 3, 1

We need to put the list in order, either top-to-bottom or bottom-to-top. Since the 8 is the first number on the list, let's go top-to-bottom.

8, 7, 6, 6, 6, 5, 4, 4, 4, 3, 1, 0

The five number summary consists of the high and low values, which are very easy to find, and the quartiles Q3, Q2 and Q1. We already know how to get Q2, because it is the median. Q3 is the median of the top half of the data and Q1 is the median of the bottom half of the data. Because there are 12 items on the list, it splits into the top six and the bottom six. Each half has an even number of items, so each median is the average of the two middle values of that half.

8, 7, 6, 6, 6, 5 || 4, 4, 4, 3, 1, 0

The median Q2 is (5+4)/2 = 4.5

Q3: For the top half, the median is between the first 6 and the second 6, so it is 6.

Q1: For the bottom half, the median is between the 4 and the 3, so the median is (4+3)/2 = 3.5

High = 8
Q3 = 6
Q2 = 4.5
Q1 = 3.5
Low = 0
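
For anyone who would rather let a computer do the sorting and splitting, here is a minimal Python sketch of the five number summary above. The function name five_number_summary is made up for this example. It also has to assume something this post doesn't cover: when the list has an odd length, it leaves the middle value out of both halves, which is one common convention but not the only one.

```python
from statistics import median

def five_number_summary(data):
    """Return (low, Q1, Q2, Q3, high) using the median-of-halves rule."""
    values = sorted(data)              # bottom-to-top order
    n = len(values)
    half = n // 2
    lower = values[:half]              # bottom half of the data
    upper = values[n - half:]          # top half (middle value skipped if n is odd)
    return values[0], median(lower), median(values), median(upper), values[-1]

wins = [8, 7, 6, 4, 4, 0, 6, 6, 5, 4, 3, 1]
low, q1, q2, q3, high = five_number_summary(wins)
print(low, q1, q2, q3, high)  # 0 3.5 4.5 6.0 8
```

Be aware that library quartile functions often use interpolation rules that can give slightly different values for Q1 and Q3 than the median-of-halves method used here.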

Next we get the IQR = Q3 - Q1, which in our instance is 6 - 3.5 = 2.5

The high threshold for outliers is Q3 + 1.5*IQR, or 6 + 1.5*2.5 = 6 + 3.75 = 9.75.  This threshold is above 8, so 8 is not an outlier.

The low threshold for outliers is Q1 - 1.5*IQR, or 3.5 - 1.5*2.5 = 3.5 - 3.75 = -0.25.  This threshold is just barely below 0, so 0 is not an outlier.
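
The two threshold calculations can be checked with a few lines of Python. This is only a sketch: the name iqr_thresholds is invented for this example, and the quartiles are typed in from the work above rather than computed.

```python
def iqr_thresholds(q1, q3, k=1.5):
    """Return (low threshold, high threshold) from the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

wins = [8, 7, 6, 4, 4, 0, 6, 6, 5, 4, 3, 1]
low_threshold, high_threshold = iqr_thresholds(3.5, 6)

# A value is an outlier only if it falls strictly outside the thresholds.
outliers = [x for x in wins if x < low_threshold or x > high_threshold]
print(low_threshold, high_threshold, outliers)  # -0.25 9.75 []
```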

The z-score method

We know how to take z-scores if we have the average and standard deviation, but here we have to compute the average and standard deviation ourselves instead of having them given to us. The average isn't hard to find by hand with smallish data sets, and if you have a calculator that is set up for statistics, both the standard deviation and the average are given to you as quickly as you can input the set. If you don't have a calculator, here is what we need to do.

1. Find the sum of the list, which we will call sum(x).

In our case, it's 8+7+6+6+6+5+4+4+4+3+1+0 = 54

2. Find the sum of the squares of the list, which we will call sum(x²) 

In our case, it's 64+49+36+36+36+25+16+16+16+9+1+0 = 304

3. Then we get  sum(x²) - [sum(x)]²/n

This is 304 - 54²/12 = 304 - 243 = 61

4. Divide the value from step 3 by n-1, then take the square root. The result is the sample standard deviation.

sqrt(61/11) ~= 2.35487881..., which we can round to 2.35.

The average is 54/12 = 4.5 
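
The four steps translate almost line for line into Python. This is a sketch, with the function name sample_mean_and_sd made up for the example. It follows the by-hand shortcut formula exactly, which is fine for small whole numbers like these but can lose accuracy on large or very close-together values, so the built-in statistics.stdev is the safer choice in general.

```python
from math import sqrt

def sample_mean_and_sd(data):
    """Return (average, sample standard deviation) via the shortcut formula."""
    n = len(data)
    sum_x = sum(data)                    # step 1: sum(x)
    sum_x2 = sum(x * x for x in data)    # step 2: sum(x²)
    ss = sum_x2 - sum_x ** 2 / n         # step 3: sum(x²) - [sum(x)]²/n
    sd = sqrt(ss / (n - 1))              # step 4: divide by n-1, then square root
    return sum_x / n, sd

wins = [8, 7, 6, 4, 4, 0, 6, 6, 5, 4, 3, 1]
average, sd = sample_mean_and_sd(wins)
print(average, round(sd, 2))  # 4.5 2.35
```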

To be a high outlier, we need a z-score over 2. To be a low outlier, we need a z-score under -2.


z(8) = (8-4.5)/2.35 ~= 1.48936..., which isn't above 2, so it's not an outlier.

z(0) = (0-4.5)/2.35 ~= -1.91489..., which isn't below -2, so it's not a low outlier, but it was close.
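
Here is a short Python sketch of the z-score check for the whole list at once. The function name z_score_outliers and the cutoff parameter are invented for this example. Note that it uses the unrounded standard deviation, so its z-scores differ slightly in the later decimal places from the hand calculations, which used 2.35.

```python
from statistics import mean, stdev

def z_score_outliers(data, cutoff=2):
    """Return (z-scores, outliers) using the sample standard deviation."""
    avg = mean(data)
    sd = stdev(data)                     # divides by n-1, like sx
    z = [(x - avg) / sd for x in data]
    outliers = [x for x, score in zip(data, z) if abs(score) > cutoff]
    return z, outliers

wins = [8, 7, 6, 4, 4, 0, 6, 6, 5, 4, 3, 1]
z, outliers = z_score_outliers(wins)
print(round(max(z), 2), round(min(z), 2), outliers)  # 1.49 -1.91 []
```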

With this particular set, our two methods agreed there were no outliers. The methods sometimes disagree. We can have sets with just high outliers, just low outliers, outliers in both directions or no outliers at all.
