Several posts ago, Bonnie said we would address some difficult concepts for student understanding of statistics. I thought I would take a shot at one of the concepts she listed, degrees of freedom (*df*).

To help understand this concept, let us first think of *df* in a non-statistical way and say that *df* refers to the ability to make independent choices, or take independent actions, in a situation. Consider a situation similar to one suggested by Joseph Eisenhauer. Suppose you have three tasks you wish to accomplish: you want to go shopping, plan a vacation, and work out at the gym. Assume that each task will take about an hour and that you may do all of them on one day, or only one each day over the course of several days. I have created a situation with three degrees of freedom; you have three independent decisions to make. Suppose you decide you will go shopping today. Does this decision put any limitations on when you may do the other tasks? No, for you may still do the other tasks either today or in the course of the next few days. Suppose next you decide to plan a vacation and you will do that tomorrow. Does this decision place any limitation on when you may go to the gym? Again, no, because you still might go to the gym today, tomorrow, or on another day. Notice here that each choice of when to do an activity is independent of each of the other choices. Thus, you have 3 degrees of freedom in choosing when to do the tasks.

Now, set a different scenario where I place some limitation on the order in which you may do the tasks. You still have the same three tasks to do, except now you decide you will do only one a day and you want to have them all completed over a span of three days. This scenario has only 2 *df*, for there are only two independent decisions for you to make. After you have made a choice on two of the activities, the day for doing the third activity is “fixed” or decided by your other two choices. For example, suppose you decide to plan your vacation today. For this choice you have total freedom to make a decision for any of the three days. You next decide when to go to the gym. Notice for this decision, however, you have only two choices left, either tomorrow or the following day. A statistician would say you have two degrees of freedom when making this decision. You decide to go the day after tomorrow. Finally, you have to plan shopping, but now you have essentially no choices open to you; it must be tomorrow. For this decision, you have no degrees of freedom. Thus, in a sense, you have 2 *df* in this scenario. You are free to make two choices, but making any two choices automatically determines your third choice.

Of course, the obvious question a student may ask is “What does all this have to do with statistics?” Let’s see. Statistically, the *df* are the number of scores that are free to vary when calculating a statistic, or in other words, the number of pieces of independent information available when calculating a statistic. Suppose you are told that a student took three quizzes, each worth a total of 10 points. You are asked to guess what her scores were. In this scenario, you may guess any three numbers as long as they are in the range from 0 to 10. In this example, you have 3 *df*, for each score is free to vary. Each score is an independent piece of information. Choosing the score for one quiz has no effect on either of the other two scores that you may choose.

But now I give you some information about the student’s performance by telling you that the total of her scores was 27. I have now created a scenario with 2 *df*. Suppose you guess 10 for the first score. Does choosing this score place any limitation on what you might guess for a score on the second, given that the total of the scores must be 27? No, for your choice of a second score is still free to vary from 0 to 10. You guess 9 for a second score. What about your choice of a third score? What must it be? If the total of the three scores is 27, and the first score you chose was 10, and the second 9, then your third choice must be 8 for a total of 27 to be obtained. In this instance, the third score is not free to vary if you know the total of the scores and any two of the three scores. For this example then, there are 2 *df* in the choice of scores. If you know the total of the three scores, then only two provide independent information; the third score becomes dependent on the previous two scores. By giving you knowledge of the total of the scores, I have reduced the *df* in the number of choices you have.
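For readers who like to see the arithmetic spelled out, here is a minimal Python sketch of the quiz example. The function name `third_score` and the range check are my own additions for illustration; the check simply assumes each quiz is worth 0 to 10 points, as in the example.

```python
def third_score(total, first, second, max_score=10):
    """Return the score that is forced once the total and two scores are known.

    Hypothetical helper for illustration only: with the total fixed,
    the third score has no freedom left.
    """
    third = total - first - second
    # Assumption: every quiz score must lie between 0 and max_score.
    if not 0 <= third <= max_score:
        raise ValueError("these two guesses cannot produce the given total")
    return third


# Guessing 10 and then 9 leaves exactly one possible third score.
print(third_score(27, 10, 9))
```

Whatever two guesses you feed in, the function never has a choice to make about the third score, which is what losing a degree of freedom means here.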

Can we now relate these two examples to the calculation of statistics? Consider that you have a sample of 10 scores and you want to calculate the mean for these scores. In order to do so, you must know all 10 scores; if you know only 9, you cannot calculate the mean. Thus, if there are *n* scores in a sample, then for calculating the mean from this sample there are *n* *df*. Each score is free to vary and is an independent piece of information. You cannot calculate the mean unless you know all *n* scores. But suppose you know the mean for the scores and you want to calculate the standard deviation (*s*) for the scores. In this instance, there are 9 *df* for these scores, for if you know the mean, you need to know only 9 of the scores; the 10th score is in a sense “determined” for you by the value of the other 9 scores. So, for a set of *n* scores, there are *n* – 1 *df* when calculating the standard deviation.
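The same idea can be sketched in a few lines of Python. The ten scores below are made up for illustration, and the helper names `last_score` and `sample_sd` are hypothetical; the standard library's `statistics.stdev` is used only as a cross-check, since it also divides by *n* – 1.

```python
import math
import statistics

# Made-up sample of 10 quiz scores (an assumption for this sketch).
scores = [7, 9, 10, 6, 8, 9, 5, 10, 7, 9]


def last_score(known_scores, sample_mean, n):
    """Recover the one remaining score from the mean and the other n - 1 scores."""
    return n * sample_mean - sum(known_scores)


def sample_sd(values):
    """Sample standard deviation: sum of squared deviations divided by n - 1."""
    n = len(values)
    m = sum(values) / n
    sum_of_squares = sum((x - m) ** 2 for x in values)
    return math.sqrt(sum_of_squares / (n - 1))


m = statistics.mean(scores)
# Knowing the mean and the first 9 scores pins down the 10th.
print(last_score(scores[:9], m, len(scores)), scores[9])
# The hand-rolled s agrees with the library's n - 1 version.
print(sample_sd(scores), statistics.stdev(scores))
```

Once the mean is in hand, only 9 of the 10 deviations from it are free, which is why the divisor in `sample_sd` is *n* – 1.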

A question frequently arises when the idea of a fixed or determined score is discussed. Students may ask how someone’s score on a test, for example, can be “determined” or “fixed in value” by her other test scores. Students should be made to realize that during the actual data collection process all scores are free to vary and the concept of degrees of freedom does not apply. Degrees of freedom only come into play after the data have been collected and we are calculating statistics on those data.

These ideas can be expanded to the computation of other statistics. Consider analyzing data with a 2 × 2 *chi-square test of independence*. When we are collecting data for the contingency table, the concept of degrees of freedom is not applicable. After we have collected the scores, however, and each cell of the contingency table is filled, then we can use the cell totals to find the row and column marginal totals. Notice at this stage that if I were to tell you the row and column marginal totals, then I would need to give you only one cell total, and you would be able to determine the other three cell totals. In this instance, when knowing the row and column marginal totals, there is only 1 *df* for the cell totals. In a more general sense, if there are *r* rows and *c* columns in a contingency table, then once the row and column totals are known, the table possesses (*r* – 1)(*c* – 1) *df*.
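To make the 2 × 2 case concrete, here is a small Python sketch. The marginal totals and the single cell value are invented for illustration, and `fill_2x2` and `chi_square_df` are hypothetical helper names, not part of any statistics package.

```python
def fill_2x2(row_totals, col_totals, top_left):
    """Rebuild a 2 x 2 table from its marginal totals and one known cell."""
    r1, r2 = row_totals
    c1, c2 = col_totals
    if r1 + r2 != c1 + c2:
        raise ValueError("row and column totals must sum to the same N")
    a = top_left
    b = r1 - a   # rest of row 1
    c = c1 - a   # rest of column 1
    d = r2 - c   # the last cell is forced as well
    if min(a, b, c, d) < 0:
        raise ValueError("that cell value is impossible for these totals")
    return [[a, b], [c, d]]


def chi_square_df(r, c):
    """Degrees of freedom for an r x c contingency table: (r - 1)(c - 1)."""
    return (r - 1) * (c - 1)


# Invented example: rows sum to 30 and 20, columns to 25 and 25.
print(fill_2x2((30, 20), (25, 25), top_left=18))
print(chi_square_df(2, 2))
```

One cell was enough to fill in the other three, which matches the single degree of freedom that the formula gives for a 2 × 2 table.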

I think giving students this intuitive overview of *df* helps them to understand where such numbers come from when they are learning about various statistical tests. Perhaps it may help to make statistics a little less mysterious.

“Suppose you guess 10 for the first score. Does choosing this score place any limitation on what you might guess for a score on the second, given that the total of the scores must be 27? No, for your choice of a second score is still free to vary from 0 to 10.”

On: this doesn’t seem right. The choice for the second score isn’t totally free. It has to be at least 7 (10 + 7 + 10 = 27). So it does place a limitation on the second score.

Off: I like this blog a lot. Keep them coming! Maybe write a pop-science book about statistics?

Good point. Thanks for bringing it up. I think the concept still applies, but the example needs to be changed. Thanks for reading the blog.


But why, in the population SD calculation, do we use n instead of n-1? We would need mu to be known to calculate deviations, and the same logic should apply!? Assumption: the population is finite.

Hello!

The issue of why we divide by N-1 when using the variance or standard deviation as an inferential statistic (that is, when using information from the sample data to draw conclusions about the variability of the data in the population) is fairly complex. I think Hal Kiess and I do a fairly good job of explaining that in detail in Chapter 4 of Statistical Concepts for the Behavioral Sciences. In short … if you have a population, thus you have Mu, you have every single piece of data … and by definition, the standard deviation would be the square root of the variance (the sum of squared deviations divided by N) … however, a sample will, on average, be less varied than the population. Think of a dish of many types of beans or lentils. One dish could have three different types of beans or lentils, while another dish (the way my Mom likes to make it) could have 6 or more different beans. Now, if you dished yourself out a bowl of the bean dish (a random sample), is it possible that in that one serving you would have all six types of beans in the same proportion? Certainly, it’s possible, but what is more likely is that you won’t have every single type of bean in the dish. Your sample will be less varied than the population … thus, to fix this consistent problem with using the sample variance or the standard deviation to estimate the population value, you have to divide by N-1.
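A small simulation can make this visible. The Python sketch below is only an illustration under my own assumptions: the population is the whole numbers 1 to 100, samples of 5 are drawn at random with replacement, and the names `var_with_divisor` and `simulate` are hypothetical. It averages the sample variance over many samples, once dividing by N and once by N-1, and compares both averages with the population variance.

```python
import random


def var_with_divisor(values, divisor):
    """Sum of squared deviations from the values' own mean, over a chosen divisor."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values) / divisor


def simulate(trials=20000, n=5, seed=0):
    """Compare dividing by n and by n - 1 across many random samples."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    population = list(range(1, 101))   # assumed finite population
    # Population variance: every score is known, so divide by N.
    pop_var = var_with_divisor(population, len(population))
    total_n = 0.0
    total_n_minus_1 = 0.0
    for _ in range(trials):
        sample = rng.choices(population, k=n)  # sampling with replacement
        total_n += var_with_divisor(sample, n)
        total_n_minus_1 += var_with_divisor(sample, n - 1)
    return pop_var, total_n / trials, total_n_minus_1 / trials


pop_var, avg_div_n, avg_div_n_minus_1 = simulate()
print("population variance:      ", pop_var)
print("average when dividing by N:  ", avg_div_n)
print("average when dividing by N-1:", avg_div_n_minus_1)
```

In a run like this, the divide-by-N average should come out noticeably below the population variance, while the divide-by-N-1 average should land close to it.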

Your mode of presentation is lucid, but it came to a halt at the spot where you tried to make us understand how to apply this concept during calculation in statistics. You could have put in an example, say, a grid, as used in calculation through the Chi-Square Test. The total thing remained blurred.

Thank you, Dr. Sankar.

I have been thinking about the importance of adding more examples in my posts and will take your advice for future articles.

Take care,

Bonnie

Conceptually, these kinds of explanations (“free to vary” stories) have never made much sense to me. On the other hand, what is more transparent conceptually, is that because of the inherent uncertainty of a sample relative to a population (i.e. you’ll never really find the true population statistic using a sample, because you are blindly choosing some data and missing other data), mathematicians decided to impose a modest number of mathematical tweaks here-and-there (….using more tweaking, depending on the complexity of the interactions of variables) to move the statistical outcome closer to what would be expected for the population. It would be nice to read the history of how this all came about….and understand the original intention. Thanks for the blog :)