Central tendency: Mean, Median, Mode
Before discussing measures of central tendency, a word of caution is necessary. Customers do not feel averages. They feel their specific experience. As a result, while central tendency is an important descriptive statistic, it is often misused. For example, a customer is told that the average delivery time is noon, but his actual delivery time turns out to be 3:00 pm. The customer, in this case, does not experience the average and may feel that he has been lied to.
The central tendency of a dataset is a measure of the predictable center of a distribution of data. Stated another way, it is the location of the bulk of the observations in a dataset. Knowing the central tendency of a process's outputs, in combination with its standard deviation, allows the prediction of the process's future performance. The common measures of central tendency are the mean, the median, and the mode.
Mean, Median, Mode
The mean (also called the average) of a dataset is one of the most used and abused statistical tools for determining central tendency. It is the most used because it is the easiest to apply. It is the most abused because of a lack of understanding of its limitations. The average is simple to calculate. It is the sum of the magnitudes of all observations divided by the number of observations.
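As a minimal sketch of the calculation just described, the following Python snippet computes the mean by hand and with the standard library's statistics.mean. The observations are hypothetical values invented for illustration; they do not come from any real process.

```python
from statistics import mean

# Hypothetical observations, invented for illustration only.
observations = [4, 8, 6, 5, 7]

# The mean: the sum of the magnitudes divided by the number of observations.
manual_mean = sum(observations) / len(observations)

# The standard library performs the same calculation.
library_mean = mean(observations)

print(manual_mean, library_mean)
```

Both approaches give the same value, which is why the mean is so easy to apply and, as discussed next, so easy to misapply.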
In a normally distributed dataset, the average is the statistical tool of choice for determining central tendency. We use averages every day to make comparisons of all kinds, such as batting averages, gas mileage, and school grades. One weakness of the mean is that it tells nothing about segmentation in the data. Consider the batting average of a professional baseball player. It might be said that he bats .300 (meaning a 30 percent success rate), but this does not mean that on a given night he will bat .300. In fact, this rarely happens. A closer evaluation reveals that he bats .200 against left-handed pitchers and .350 against right-handed pitchers. He also bats close to .400 at home and .250 on the road. What results is a family of distributions instead of a single distribution.
As a result, coaches use specific averages for specific situations. That way they can predict who will best support the team's offense, given a specific pitcher and game location. This is a common situation with datasets. Many processes produce data that represent families of distributions. Knowledge of these data characteristics can tell a lot about how a process behaves.
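The batting example can be sketched in a few lines of Python. The at-bat and hit counts below are assumptions, chosen only so that the segment averages match the .200 and .350 figures above; the text does not give the underlying counts. The sketch shows how a single overall average can hide two quite different segments.

```python
# Hypothetical counts (assumptions): picked so the segment averages
# reproduce the .200 and .350 figures quoted in the text.
segments = {
    "vs_left_handed": {"at_bats": 100, "hits": 20},
    "vs_right_handed": {"at_bats": 200, "hits": 70},
}


def batting_average(hits, at_bats):
    """Hits divided by at-bats."""
    return hits / at_bats


# One average per segment.
segment_averages = {
    name: batting_average(s["hits"], s["at_bats"])
    for name, s in segments.items()
}

# The overall average pools all hits and all at-bats.
total_hits = sum(s["hits"] for s in segments.values())
total_at_bats = sum(s["at_bats"] for s in segments.values())
overall_average = batting_average(total_hits, total_at_bats)

for name, avg in segment_averages.items():
    print(f"{name}: {avg:.3f}")
print(f"overall: {overall_average:.3f}")
```

With these assumed counts the pooled average works out to .300, yet neither segment actually performs at .300.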
Another weakness of the mean is that it does not give the true central tendency of skewed distributions. An example would be a call center's cycle time for handling calls. A histogram of this data from a call center would show that the mean is shifted to the right due to the skewness of the distribution. This happens because we calculate the mean from the magnitudes of the individual observations. Since the data points to the right have a higher magnitude, they bias the calculation, even though they have lower frequencies of occurrence.
What we need in this case is a method that establishes central tendency without "magnitude bias". There are two ways of doing this: the median and the mode.
The median is the middle value of the dataset when it is arranged in order from smallest to largest. If there are nine data points, for example, then the fifth value is the median. In the following set, the median is five: 1 2 3 4 5 6 7 8 9
The mode, on the other hand, is a measure of central tendency that represents the most frequently observed value or range of values. In the following dataset, the central tendency as described by the mode is three, because three is the only value that occurs more than once: 1 2 3 3 4 5 6 7 8 9
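The two small datasets above can be checked with a short Python sketch. The helper name middle_value is hypothetical, and its handling of an even number of points (averaging the two middle values) is a common convention assumed here rather than something stated in the text. The standard library's statistics.median and statistics.mode are used as a cross-check.

```python
from statistics import median, mode

# The two example datasets from the text.
nine_points = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_repeat = [1, 2, 3, 3, 4, 5, 6, 7, 8, 9]


def middle_value(values):
    """Median by hand: sort, then take the middle observation.

    For an even count, average the two middle values (an assumed,
    commonly used convention).
    """
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


median_by_hand = middle_value(nine_points)
median_by_library = median(nine_points)

# The mode is the most frequently observed value.
most_frequent = mode(with_repeat)

print(median_by_hand, median_by_library, most_frequent)
```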
The mode is most useful when the dataset has more than one segment, when it is badly skewed, or when it is necessary to eliminate the effect of extreme values. An example of a segmented dataset would be the observed height of all thirty-year-old people in a town. This dataset would have two peaks, because it is made up of two segments. The male and female data points would form two separate distributions, and as a result, the combined distribution would have two modes.
Let's suppose that this dataset shows that the mean height would be 5.5 feet. The median would be of similar magnitude, and both would be worthless in predicting the height of the next person to be measured. Knowing the gender of the next person, on the other hand, would allow for a better prediction of the next person's height. This is because there would be a mode for males and a mode for females. The mode in this case would be a good predictor.
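The height example can be illustrated with a small, made-up dataset. The ten heights below (in inches) are hypothetical and were chosen only so that the combined mean comes out to 5.5 feet, as in the text. The sketch uses statistics.multimode (available in Python 3.8 and later) to find both peaks, then computes a mode per segment.

```python
from statistics import mean, median, mode, multimode

# Hypothetical (gender, height in inches) records, invented for illustration.
people = [
    ("F", 59), ("F", 60), ("F", 60), ("F", 60), ("F", 61),
    ("M", 71), ("M", 72), ("M", 72), ("M", 72), ("M", 73),
]

heights = [h for _, h in people]

# Mean and median both land between the two peaks,
# at a height nobody in the sample actually has.
mean_feet = mean(heights) / 12
median_feet = median(heights) / 12

# multimode returns every value tied for most frequent: one per peak here.
peaks = multimode(heights)

# Segmenting by gender gives one mode per segment.
mode_by_gender = {
    g: mode([h for gender, h in people if gender == g])
    for g in ("F", "M")
}

print(mean_feet, median_feet, peaks, mode_by_gender)
```

In this made-up sample, no one is 5.5 feet tall, so the mean and median predict a height that never occurs, while the per-segment modes sit on the two peaks.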
In other words, the appropriate method of calculating central tendency is dependent upon the nature of the data. In a nonskewed distribution of data, the mean, median, and mode are equally suited to define central tendency. They are, in fact, right on top of each other. In a skewed distribution, like that of the call center mentioned earlier, the mean, median, and mode are all different. For prediction purposes, with a skewed distribution, the mean is of little value. The median and the mode would be better predictors, but each tells a different story. Which is best depends upon why the data is skewed and how the result will be used.
In a skewed dataset, the median may be the best indication of central tendency for hypothesis testing (nonparametric tests), but the mode may be a better predictor of the next observation. Only a thorough knowledge of the data will show what method to use.
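To make the call center case concrete, the sketch below computes all three measures for a small right-skewed sample. The call durations are hypothetical numbers invented for illustration: most calls are short, and two long calls form the right-hand tail.

```python
from statistics import mean, median, mode

# Hypothetical call handling times in minutes (invented for illustration).
# Most calls are short; two long calls form the right-hand tail.
call_minutes = [2, 3, 3, 3, 4, 4, 5, 6, 20, 40]

call_mean = mean(call_minutes)      # pulled to the right by the long calls
call_median = median(call_minutes)  # middle of the ordered observations
call_mode = mode(call_minutes)      # most frequently observed duration

print(call_mean, call_median, call_mode)

# How much of the total is contributed by the two longest calls?
tail_share = (20 + 40) / sum(call_minutes)
print(f"{tail_share:.0%} of total minutes come from 2 of 10 calls")
```

In this made-up sample the mean is more than twice the median, because two of the ten calls account for about two thirds of the total minutes. That is the "magnitude bias" described earlier.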
A shift in the process output can make an otherwise normal dataset seem skewed. In that case, the recent data is evidence of special cause variation. It means that the dataset may be on the way to becoming bimodal, not skewed. For example, consider measuring the height of all thirty-year-old people in a town, as above. If females are measured first, there will be a normally distributed dataset centered around 5 feet. As the men begin to be measured, the dataset will begin to take on a skewed look. Eventually, the dataset will become bimodal. This phenomenon can make statistical decision making difficult. The key is to understand the reason for the dataset's skewness.
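The progression described above can be sketched numerically. The heights are hypothetical values invented for illustration, and the function sample_skewness is a hypothetical helper that uses one common moment-based definition of skewness (third central moment divided by the second central moment raised to the power 1.5). Other definitions exist, so treat the specific numbers as indicative only.

```python
def sample_skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (one common definition)."""
    n = len(values)
    center = sum(values) / n
    m2 = sum((v - center) ** 2 for v in values) / n
    m3 = sum((v - center) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical heights in inches, invented for illustration.
females = [59, 60, 60, 60, 61]
males = [71, 72, 72, 72, 73]

# Stage 1: only females measured so far (symmetric, single peak).
skew_females_only = sample_skewness(females)

# Stage 2: the first two men have been measured; a right tail appears.
skew_partial = sample_skewness(females + males[:2])

# Stage 3: everyone measured; two equal peaks, symmetric again.
skew_complete = sample_skewness(females + males)

print(skew_females_only, skew_partial, skew_complete)
```

In this made-up sequence the skewness starts near zero, turns positive while only some of the men have been measured, and returns to near zero once the data is fully bimodal. A skewness statistic alone therefore cannot distinguish a truly skewed process from one that is partway through a shift.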