Friday, July 27, 2012

TEACHING STATISTICS AND THE LONDON SUMMER OLYMPICS


Probably few if any popular events offer such an abundance of statistics teaching topics as the Olympics. The closest rival I can think of would be the upcoming presidential election, which will be covered in a separate post later this year. The only downside about the Olympics is that it will be history by the time the fall semester starts. :(

Here is my list – I do not claim completeness and will update it as more ideas cross my mind:

1)      There will certainly be an abundance of graphs related to the Olympics. Let the students find “good” and “bad” ones and discuss them.
2)      I know that in ski jumping the min and max of the judges’ scores are eliminated. I am sure there will be a discipline in the summer Olympics as well where that applies.
3)      The most successful nation will be the one with the most “total medals” (U.S. custom) or most “gold medals” (more European custom) – this is just another example of the mode.
4)      The numeric score that determines the final rankings comes in all types of scales . . . time and distance and other measures are obviously on the ratio scale. Basketball teams’ points (2 for a win, 1 for a loss) during the preliminary round are on the interval scale. Decathlon scoring is on the ordinal scale (there is an exponent involved that makes it not on the interval scale). I just hope there is no winner determined by something measured on the nominal scale. :)
I am sure among gymnastics, dressage, synchronized swimming, etc., experts can find additional interesting scales.
5)      Certain “mass” events like the marathon can be used to teach creating histograms and to show skewness. You can also use it to compute mean, median, percentiles, standard deviation and variance. It is also a good starting point to discuss the uselessness of modes in certain situations.
6)      Drug testing allows for all kinds of probability related questions – simple and conditional probability, independence, etc. come to mind. You are a drug-using cyclist. They will randomly test 5000 of the 12000 athletes. What is the chance they will catch you? There are 500 cyclists at the Olympics and they test 300 of them. What is the chance they will catch you now? You are getting tested 4 times and thanks to your superior drugs, the chance that a test returns a positive result is 25% each time you get tested. What is the chance they will never catch you?
7)      The hypergeometric distribution can also be used in the context of drug testing. 40 of the 290 weight lifters use drugs. If 20 of them are randomly chosen for drug tests that never fail to detect a drug user, what is the chance that exactly 7 drug users will get caught?
8)      Guessing the medalist in some sport where you have no prior knowledge allows for the coverage of permutations vs. combinations. What is the chance you will guess the medalists correctly in the men’s 400m free-style final if you don’t know anything about swimming? Or: The top three runners in the women’s 200 meter semi-final advance to the final. If you don’t know anything about track, what is the chance you will guess the three women that advance correctly?
9)      The expected traffic chaos can be used to apply the normal distribution: It takes you X minutes to get to the team handball arena and X ~ N(60, 15²). What is the chance you will be late for the game if you leave 78 minutes before the start?
10)   You can use the public transportation to frame questions about the uniform distribution. The bus from the athletes’ village to the stadium runs every 8 minutes, but not according to a printed schedule. What is the chance a random athlete needs to wait between 3 and 5 minutes?
11)   Points scored in many sports (e.g. basketball) can be assumed to be normally distributed: In basketball, points for the USA ~ N(98, 12²). What is the chance they will score more than 90 points? The same works for most track and field disciplines.
12)   Differences in normally distributed variables can be used to determine the winner. E.g. points for the USA are X ~ N(90, 10²) and for Argentina are Y ~ N(85, 20²) . . .  what is the chance that Argentina wins the basketball game?
13)   In some disciplines you should be able to find correlations between certain physical attributes of the athletes (height, weight, . . .) and their results. E.g. between the height of high jumpers and the height they jump. You can use that also to discuss cause-and-effect and thus simple regression analysis.
14)   You can find correlations between results in the heats and semifinals or semifinals and finals (swimming, running, . . .).
15)   Proportions can be found in basketball, skeet, etc… and allow for the binomial distribution: If the chance I hit the disk is 99% each time I shoot, what is the chance I hit 49 out of 50 disks?
16)   There is some research on predicting a country’s expected medal haul with multiple regression analysis, using population, GDP/capita and other variables. For more, see:

My list is nowhere near complete, but it shows that virtually every topic that is traditionally covered in business statistics can be applied to the Olympics . . . Enjoy. :)
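
For anyone who wants to check answers to the questions in items 6 to 12 and 15, here is a minimal Python sketch using only the standard library. A few things in it are my assumptions and not part of the questions as posed: eight starters in the swimming final and in the semi-final, independence of the four drug tests, independence of the two teams' scores, a waiting time that is uniform between 0 and 8 minutes, and reading item 15 as "exactly 49 hits". All variable names are made up for illustration.

```python
from math import comb, hypot, perm
from statistics import NormalDist

# Item 6: simple probabilities and independent repeated tests
p_tested_all = 5000 / 12000            # random test among all athletes
p_tested_cyclists = 300 / 500          # random test among the cyclists
p_never_caught = (1 - 0.25) ** 4       # assumes the 4 tests are independent

# Item 7: hypergeometric probability of exactly k drug users in the sample
def hypergeom_pmf(k, population, users, tested):
    return comb(users, k) * comb(population - users, tested - k) / comb(population, tested)

p_seven_caught = hypergeom_pmf(7, population=290, users=40, tested=20)

# Item 8: permutations vs. combinations
STARTERS = 8                           # assumption: eight athletes per race
p_medalists_in_order = 1 / perm(STARTERS, 3)   # gold, silver, bronze: order matters
p_three_advancers = 1 / comb(STARTERS, 3)      # who advances: order does not matter

# Item 9: X ~ N(60, 15^2); late if the trip takes more than 78 minutes
p_late = 1 - NormalDist(mu=60, sigma=15).cdf(78)

# Item 10: waiting time assumed uniform on [0, 8] minutes
p_wait_3_to_5 = (5 - 3) / 8

# Item 11: USA points ~ N(98, 12^2)
p_usa_over_90 = 1 - NormalDist(mu=98, sigma=12).cdf(90)

# Item 12: D = Y - X is normal with mean 85 - 90 and sd sqrt(10^2 + 20^2),
# assuming the two scores are independent; Argentina wins if D > 0
difference = NormalDist(mu=85 - 90, sigma=hypot(10, 20))
p_argentina_wins = 1 - difference.cdf(0)

# Item 15: binomial probability of exactly 49 hits in 50 shots
p_49_of_50 = comb(50, 49) * 0.99**49 * 0.01**1

for label, value in [
    ("6a tested (all athletes)", p_tested_all),
    ("6b tested (cyclists)", p_tested_cyclists),
    ("6c never caught", p_never_caught),
    ("7  exactly 7 caught", p_seven_caught),
    ("8a medalists in order", p_medalists_in_order),
    ("8b three advancers", p_three_advancers),
    ("9  late for the game", p_late),
    ("10 wait 3 to 5 minutes", p_wait_3_to_5),
    ("11 USA over 90 points", p_usa_over_90),
    ("12 Argentina wins", p_argentina_wins),
    ("15 hit 49 of 50", p_49_of_50),
]:
    print(f"{label:26s} {value:.4f}")
```

Changing STARTERS or dropping one of the independence assumptions changes the answers, which can itself make a useful discussion point in class.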

Wednesday, July 25, 2012

VISUALIZING THE CENTRAL LIMIT THEOREM



I will not go into the inner workings of the central limit theorem (CLT) here; instead I will show how I present it graphically to my students.

Every decent textbook has some graphs that show how the distribution of the sample means approaches a normal distribution as n increases and how the variance of the means decreases. However, I believe that students pay much more attention to this topic if they see it in action and potentially play with it themselves!

I created three Excel files that allow showing the CLT in class:
1) for a normal distribution
2) for a uniform distribution
3) for a wild and random distribution

In all three files the distribution, means, etc. are on the first sheet. In the next two sheets, I present the distribution of 10000 sample means for n = 1, 2, 5, 10, and 30 as well as the original distribution (also based on n = 9999 or 10000 since I had to make the continuous distributions discrete). Plotting the 10000 means obviously doesn't give us the true distribution, but rather an approximation. Nevertheless the students see that the distributions come from real numbers and real draws, which I believe makes the point of the CLT stick more clearly.
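
For readers without Excel, the same idea can be sketched in a few lines of Python. This is not the logic of my spreadsheets, just an analogous simulation under assumptions of my own: a uniform population between 0 and 10, a fixed random seed, and made-up names such as draw_sample_means. It draws 10000 sample means for each n, rounds them to one decimal place for counting, and reports how the spread of the means shrinks as n grows.

```python
import random
from collections import Counter
from statistics import fmean, pstdev

random.seed(2012)                    # fixed seed so the run is repeatable
SAMPLE_SIZES = (1, 2, 5, 10, 30)
NUM_MEANS = 10_000

def draw_sample_means(n, num_means=NUM_MEANS):
    """Return num_means sample means, each computed from n uniform(0, 10) draws."""
    return [sum(random.uniform(0, 10) for _ in range(n)) / n
            for _ in range(num_means)]

sample_means = {n: draw_sample_means(n) for n in SAMPLE_SIZES}

for n, means in sample_means.items():
    # Count the means in bins of width 0.1, as a histogram would
    bins = Counter(round(m, 1) for m in means)
    print(f"n={n:2d}  mean of means={fmean(means):.2f}  "
          f"sd of means={pstdev(means):.2f}  occupied bins={len(bins)}")
```

The standard deviation of the means should come out close to the population standard deviation divided by the square root of n, and the number of occupied bins drops as the distribution tightens around 5.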


In one sheet, I kept the vertical scales identical, which guarantees a visual equality of the areas under the curves. In the other, I allowed Excel to pick the maximum value for the vertical axis that best shows the shape. Both have their pros and cons. Which one to use probably depends on the students' understanding of continuous distributions.

I kept the sample randomizer active so you can see how the sample mean distributions change slightly each time. Also, if you want to change the population distribution in the wild distribution file, feel free to change the numbers in the yellow cells on the first sheet. In order to be able to count and display the means (rounded to one decimal place), the numbers need to be kept between 0 and 10!


You might notice that I "cheated" when showing the original uniform distribution and the means for n = 1. Because the distribution is generated in 1/1000s and the graph is in 1/10s, the vertical lines at the minimum and maximum would be only 1/2 the "correct" size: rounding to the nearest 1/10 gives each end point only half as wide a range of values as the points in between. Thus I doubled the observations for those two to keep the picture in line with the pure continuous distribution.

Finally, I created three handouts with the six graphs on each page (two pages per handout - one per type of axis). Here is the handout for the wild distribution. The links for all the files are below the graphs.




The handout links:

The Excel file links:




Thursday, July 19, 2012

(MIS)USING TWO VERTICAL AXES



If you show two variables (often time series) that are measured on different scales (like $ and volume), using two axes is required. If the variables are on the same scale but differ greatly in magnitude, you can do the same... but if the magnitudes are fairly similar, using two axes can be very misleading, as you see in the graph below...
 source: http://seekingalpha.com/article/118072-u-s-and-bric-world-market-share

The careless consumer of the graph will undoubtedly at first think the BRICs have overtaken the United States temporarily. Only at second glance does the true story unfold.

Here is an example with the same unit (seconds), but vastly different magnitudes - the problem is the scaling. If you want to show and compare progression, you need to use the same zero point on both axes. Even if the zero point is not shown, the two scales need to increase in proportion. Otherwise you can show anything.
Funnily, The Economist (the source for the graph) wanted to show that the 100 meter free-style swimming world record (WR) improved at a faster rate than the 100 meter running WR, but the graph only gives the impression of an improvement less than twice as fast, while checking the numbers reveals that the WR for swimming improved by 22.9% and for running by 8.8%. Thus the swimming WR improved 2.6 times faster - somewhat more than the graph suggests. Since 9 seconds (running axis) and 45 seconds (swimming axis) correspond, they should also have matched 10 seconds with 50, 11 with 55, etc... which would have (visually) suppressed the improvement of the running WR.

    Source: economist.com
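
The arithmetic behind the proportional axes can be sketched in Python. The two improvement percentages and the 9-to-45 correspondence are the figures quoted above; the tick values beyond 11 seconds and all names are assumptions for illustration only.

```python
# Improvement of the two world records, in percent, as quoted in the text
swim_improvement_pct = 22.9
run_improvement_pct = 8.8
improvement_ratio = swim_improvement_pct / run_improvement_pct
print(f"swimming improved {improvement_ratio:.1f} times faster")

# If 9 s on the running axis lines up with 45 s on the swimming axis,
# a proportional second axis multiplies every running tick by the same factor.
scale_factor = 45 / 9
running_ticks = [9, 10, 11, 12]              # illustrative tick values
swimming_ticks = [tick * scale_factor for tick in running_ticks]

for run_tick, swim_tick in zip(running_ticks, swimming_ticks):
    print(f"running {run_tick:2d} s  <->  swimming {swim_tick:.0f} s")
```

With both axes tied together by a single factor, equal percentage changes cover equal vertical distances, so the two lines can be compared fairly.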

Last, different scales can greatly suppress volatility. In the graph below you see the annual percentage changes of U.S. and German gross capital formation for 1971 to 2010. For the U.S. (Germany) the average annual growth was 6.56% (6.76%) with a standard deviation of 7.05% (13.60%), so the German numbers fluctuated a lot more. However, in the graph, thanks to using two axes, the fluctuations seem to be quite similar! The second graph shows the same data in a graph with only one scale, clearly a much more accurate representation of the data.
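
A rough way to see why the two-axis version hides the difference is the Python sketch below. It rests on an assumption of mine, not on how the graphs were actually drawn: each axis is taken to span the mean plus or minus three standard deviations of the series it displays. The means and standard deviations are the ones quoted above; everything else is illustrative.

```python
us_mean, us_sd = 6.56, 7.05      # U.S. annual growth, percent
de_mean, de_sd = 6.76, 13.60     # German annual growth, percent
HALF_SPAN_IN_SD = 3              # assumption: axis covers mean +/- 3 sd

# Two axes: each series is stretched to fill its own axis, so a swing of
# one standard deviation takes up the same share of the chart height.
us_share_own_axis = us_sd / (2 * HALF_SPAN_IN_SD * us_sd)
de_share_own_axis = de_sd / (2 * HALF_SPAN_IN_SD * de_sd)

# One shared axis: it has to cover the wider of the two ranges.
axis_low = min(us_mean - HALF_SPAN_IN_SD * us_sd, de_mean - HALF_SPAN_IN_SD * de_sd)
axis_high = max(us_mean + HALF_SPAN_IN_SD * us_sd, de_mean + HALF_SPAN_IN_SD * de_sd)
shared_span = axis_high - axis_low
us_share_shared_axis = us_sd / shared_span
de_share_shared_axis = de_sd / shared_span

print(f"own axes:    U.S. {us_share_own_axis:.3f}  Germany {de_share_own_axis:.3f}")
print(f"shared axis: U.S. {us_share_shared_axis:.3f}  Germany {de_share_shared_axis:.3f}")
print(f"sd ratio Germany/U.S.: {de_sd / us_sd:.2f}")
```

On separate axes the two shares are identical, while on the shared axis the German swings take up roughly twice the height of the U.S. swings, in line with the ratio of the standard deviations.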