# Category Archives: Hypothesis Testing

## Difficult Concepts: Research Hypotheses vs. Statistical Hypotheses

I always cringe when I see a statement in a text or website such as “the research hypothesis, symbolized as H1, states a relationship between variables.” No! No! No! How can students not be confused about the difference between research and statistical hypotheses when instructors are? H1 is not the research hypothesis; it is the alternative to the null hypothesis in a statistical test.

Let’s be very clear: in most research settings, there are two very distinct types of hypotheses, the Research or Experimental Hypothesis and the Statistical Hypotheses. A research hypothesis is a statement of an expected or predicted relationship between two or more variables. It’s what the experimenter believes will happen in her research study. For example, a researcher may hypothesize that prolonged exposure to loud noise will increase systolic blood pressure. In this instance the researcher predicts that exposure to prolonged noise (the independent variable) will increase systolic blood pressure (the dependent variable). This hypothesis sets the stage to design a study to collect empirical data to test its truth or falsity. From this research hypothesis we can imagine the scientist will, in some fashion, manipulate the amount of noise a person is exposed to and then take a measure of blood pressure. The choice of statistical test will depend upon the research design used: a very simple design may require only a t test, a more complex factorial design may require an analysis of variance, and a correlational design may call for a correlation coefficient. Each of these statistical tests will possess different null and alternative hypotheses.

Regardless of the statistical test used, however, the test itself will not have a clue (if I am allowed to be anthropomorphic here) of where the measurement of the dependent variable came from or what it means. More years ago than I care to remember, C. Alan Boneau made this point very succinctly in an article in the American Psychologist (1961, 16, p.261): “The statistical test cares not whether a Social Desirability scale measures social desirability, or number of trials to extinction is an indicator of habit strength….Given unending piles of numbers from which to draw small samples, the t test and the F test will methodically decide for us whether the means of the piles are different.”

Rejecting a null hypothesis and accepting an alternative does not necessarily provide support for the research hypothesis that was tested. For example, a psychologist may predict an interaction of her variables and find that she rejects the null hypothesis for the interaction in an analysis of variance. But the alternative hypothesis for interaction in an ANOVA simply indicates that an interaction occurred, and there are many ways for such an interaction to occur. The observed interaction may not be the interaction that was predicted in the research hypothesis.
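To make the point about interactions concrete, here is a minimal sketch using hypothetical cell means for a 2 × 2 design (the numbers and the function name are illustrative assumptions, not data from any study). It computes the usual interaction contrast, the difference between the simple effects of one factor at the two levels of the other. Two very different patterns both give a nonzero contrast, so "an interaction occurred" does not say which one the researcher predicted.

```python
def interaction_contrast(cell_means):
    """Interaction contrast for a 2 x 2 table of cell means.

    cell_means is [[A1B1, A1B2], [A2B1, A2B2]].
    The contrast is (effect of B at A1) minus (effect of B at A2).
    """
    (a1b1, a1b2), (a2b1, a2b2) = cell_means
    return (a1b1 - a1b2) - (a2b1 - a2b2)


# Hypothetical patterns of cell means (illustrative values only).
crossover = [[10, 20], [20, 10]]       # effect of B reverses across levels of A
ordinal = [[10, 20], [10, 40]]         # effect of B is larger at A2 than at A1
no_interaction = [[10, 20], [15, 25]]  # effect of B is the same at both levels

for name, means in [("crossover", crossover),
                    ("ordinal", ordinal),
                    ("no interaction", no_interaction)]:
    print(name, interaction_contrast(means))
```

Both the crossover and the ordinal pattern would lead to the same statistical alternative hypothesis, yet only one of them might match what the research hypothesis predicted.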

So please, make life simpler and more understandable for your students. Don’t call a statistical alternative hypothesis a research hypothesis. It is not. Your students will appreciate you making the distinction.

## Difficult Concepts—Degrees of Freedom

Several posts ago, Bonnie said we would address some difficult concepts for student understanding of statistics. I thought I would take a shot at one of the concepts she listed, degrees of freedom (df).

To help understand this concept, let us first think of df in a non-statistical way and say that df refers to the ability to make independent choices, or take independent actions, in a situation. Consider the following situation. Suppose you have three tasks you wish to accomplish, for example that you want to go shopping, plan a vacation, and work out at the gym. Assume that each task will take about an hour and that you may do all on one day, or only one each day over the course of several days. I have created a situation with three degrees of freedom; you have three independent decisions to make. Suppose you decide you will go shopping today. Does this decision put any limitations on when you may do the other tasks? No, for you may still do the other tasks either today, or in the course of the next few days. Suppose next you decide to plan a vacation and you will do that tomorrow. Does this decision place any limitation on when you may go to the gym? Again, no, because you still might go to the gym today, tomorrow, or on another day. Notice here that each choice of when to do an activity is independent of each of the other choices. Thus, you have 3 degrees of freedom of choice in the order of doing the tasks.

Now, set a different scenario where I place some limitation on the order in which you may do the tasks. You still have the same three tasks to do, except now you decide you will do only one a day and you want to have them all completed over a span of three days. This scenario has only 2 df, for there are only two independent decisions for you to make. After you have made a choice on two of the activities, the day for doing the third activity is “fixed” or decided by your other two choices. For example, suppose you decide to plan your vacation today. For this choice you have total freedom to make a decision for any of the three days. You next decide to plan when to go to the gym. Notice for this decision, however, you have only two choices left, either tomorrow or the following day. A statistician would say you have two degrees of freedom when making this decision. You decide to go the day after tomorrow. Finally, you have to plan shopping, but now you have essentially no choices open to you; it must be tomorrow. For this decision, you have no degrees of freedom. Thus, in a sense, you have 2 df in this scenario. You are free to make two choices, but making any two choices automatically determines your third choice.

Of course, the obvious question a student may ask is “What does all this have to do with statistics?” Let’s see. Statistically, the df are the number of scores that are free to vary when calculating a statistic, or in other words, the number of pieces of independent information available when calculating a statistic. Suppose you are told that a student took three quizzes, each worth a total of 10 points. You are asked to guess what her scores were. In this scenario, you may guess any three numbers as long as they are in the range from 0 to 10. In this example, you have 3 df, for each score is free to vary. Each score is an independent piece of information. Choosing the score for one quiz has no effect on either of the other two scores that you may choose.

But now I give you some information about the student’s performance by telling you that the total of her scores was 27. I have now created a scenario with 2 df. Suppose you guess 10 for the first score. Does choosing this score place any limitation on what you might guess for a score on the second, given that the total of the scores must be 27? No, for your choice of a second score is still free to vary from 0 to 10. You guess 9 for a second score. What about your choice of a third score? What must it be? If the total of the three scores is 27, and the first score you chose was 10, and the second 9, then your third choice must be 8 for a total of 27 to be obtained. In this instance, the third score is not free to vary if you know the total of the scores and any two of the three scores. For this example then, there are 2 df in the choice of scores. If you know the total of the three scores, then only two provide independent information; the third score becomes dependent on the previous two scores. By giving you knowledge of the total of the scores I have reduced the df in the number of choices you have.
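The quiz example can be written out as a tiny sketch. The function name `remaining_score` is a made-up helper for illustration. The numbers are the ones from the example above.

```python
def remaining_score(total, known_scores):
    """Return the single score that is forced once the total
    and all of the other scores are known."""
    return total - sum(known_scores)


total = 27         # the total given in the example
guesses = [10, 9]  # two freely chosen scores
third = remaining_score(total, guesses)
print(third)       # the third score is no longer free to vary

n_scores = 3
df = n_scores - 1  # one constraint (the known total) removes one df
print(df)
```

Changing either of the two guesses changes the forced third score, but you never get to choose it independently.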

Can we now relate these two examples to the calculation of statistics? Consider that you have a sample of 10 scores and you want to calculate the mean for these scores. In order to do so, you must know all 10 scores; if you know only 9, you cannot calculate the mean. Thus if there are n scores in a sample, then for calculating the mean from this sample there are n df. Each score is free to vary and is an independent piece of information. You cannot calculate the mean unless you know all n scores. But suppose you know the mean for the scores and you want to calculate the standard deviation (s) for the scores. In this instance, there are 9 df for these scores, for if you know the mean, you need to know only 9 of the scores; the 10th score is in a sense “determined” for you by the mean and the values of the other 9 scores. So, for a set of n scores, there are n – 1 df when calculating the standard deviation.
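A short sketch of the same idea, using a hypothetical sample of 10 scores (the data are invented for illustration). It shows that the mean plus any 9 scores pins down the 10th, and that dividing the sum of squared deviations by n – 1 matches the sample standard deviation returned by Python's standard `statistics.stdev`.

```python
import statistics

scores = [7, 9, 6, 8, 10, 5, 9, 7, 8, 6]  # hypothetical sample, n = 10
n = len(scores)
mean = statistics.mean(scores)

# Knowing the mean and any 9 scores determines the 10th score.
first_nine = scores[:9]
tenth = n * mean - sum(first_nine)
print(tenth)

# Deviations from the mean always sum to zero, which is the constraint
# that costs one degree of freedom.
deviations = [x - mean for x in scores]
print(sum(deviations))

# Sample standard deviation: sum of squared deviations divided by n - 1.
ss = sum(d ** 2 for d in deviations)
s_manual = (ss / (n - 1)) ** 0.5
s_library = statistics.stdev(scores)  # also uses n - 1
print(s_manual, s_library)
```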

A question frequently arises when the idea of a fixed or determined score is discussed. Students may ask how can someone’s score on a test, for example, be “determined” or “fixed in value” by her other scores on tests? Students should be made to realize that during the actual data collection process all scores are free to vary and the concept of degrees of freedom does not apply. Degrees of freedom only come into play after the data have been collected and we are calculating statistics on those data.

These ideas can be expanded to the computation of other statistics. Consider analyzing data with a 2 × 2 chi-square test of independence. When we are collecting data for the contingency table, the concept of degrees of freedom is not applicable. After we have collected the scores, however, and each cell of the contingency table is filled, then we can use the cell totals to find the row and column marginal totals. Notice at this stage that if I were to tell you the row and column marginal totals, then I would need to give you only one cell total, and you would be able to determine the other three cell totals. In this instance, when knowing the row and column marginal totals, there is only 1 df for the cell totals. In a more general sense, if there are r rows and c columns in a contingency table, then once the row and column totals are known, the table possesses (r – 1)(c – 1) df.
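Here is a minimal sketch of the contingency table argument. The marginal totals and the one known cell are hypothetical values, and `fill_2x2` and `chi_square_df` are illustrative helper names rather than functions from any library.

```python
def fill_2x2(row_totals, col_totals, top_left):
    """Recover all four cells of a 2 x 2 table from its marginal
    totals and a single known cell (the top-left one)."""
    r1, r2 = row_totals
    c1, _c2 = col_totals
    a = top_left
    b = r1 - a  # rest of row 1
    c = c1 - a  # rest of column 1
    d = r2 - c  # rest of row 2
    return [[a, b], [c, d]]


def chi_square_df(n_rows, n_cols):
    """Degrees of freedom for an r x c test of independence."""
    return (n_rows - 1) * (n_cols - 1)


# Hypothetical marginal totals and one known cell.
table = fill_2x2(row_totals=(30, 20), col_totals=(25, 25), top_left=18)
print(table)
print(chi_square_df(2, 2))
print(chi_square_df(3, 4))
```

With the margins fixed, supplying one cell is enough to fill the whole 2 × 2 table, which is what 1 df means here.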

I think giving students this intuitive overview of df helps them to understand where such numbers come from when they are learning about various statistical tests. Perhaps it may help to make statistics a little less mysterious.

## How wonderful and I wish, I wish …

As I type this, I have one fifty minute class left to teach, and my time with my statistics class will be over. As with anything, each semester is varied. Some semesters I cover more information than other semesters. I liken this semester to driving through the city and hitting all green lights! As such, I believe my students were able to master additional information based on what is probably mostly good fortune.

So, here is my list of things I’m so thrilled I covered:

(1) Effect size statistics, like eta squared: Sure, effect size statistics are not used that much, and let’s face it, they are super easy to calculate, but my biggest reason for wanting to teach effect size statistics is that it helps students to understand what a t-test or F-test can tell us (is there a difference) and what it can’t tell us (how big is the effect). In fact, by spending about 20 minutes on the teaching of effect size statistics, students were better able to understand why the “p-value” for an observed t or F score provides us with no information about the size of the effect. All we need to know is, did we pass the threshold.

(2) We find the critical value BEFORE calculating the observed value: This discussion helps focus students on the logic of statistical hypothesis testing. Specifically, statistical hypothesis testing works because we assume that the null hypothesis is true, that there is no effect of the independent variable on the dependent variable. With this assumption, we are able to generate the sampling distribution that provides us with information on the standard error. Now, if our sample mean is too extreme, we reject our initial hypothesis, the null, and accept the alternative hypothesis, that is, the means are different. By finding the critical value prior to calculating the statistic, it helps focus students on that “line in the sand” to say … my observations are too extreme for me to stay with my current hypothesis. Students are far less likely to fall victim to equating p-value with the strength of the effect of the independent variable, or to conclude … the data is trending because I have a p-value of .07 or some other funky thing far too many people do with null hypothesis testing. By spending a bit more time on the steps involved in hypothesis testing, I think students are less likely to fall victim to the common misconceptions surrounding Statistical Null Hypothesis Testing.

(3) Though not a specific concept, I am pleased that for almost every concept I taught this semester I used new examples. Sure, I’m still a sage in training, no grey hair and all, but I was beginning to find myself using the same examples. As this is the third semester my supplement instructor, Amy, is taking notes in class, I felt I owed it to her, at least, to “keep it fresh.” I also found thinking about this blog helped spur my mind toward different examples. In doing so, I found some worked even better than my “old stand by” examples, but the great thing was, when the new example flopped, I just quickly switched to the example I knew helped students.
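Items (1) and (2) above can be sketched together for a one-factor between-subjects ANOVA. Everything here is an illustrative assumption: the three groups of scores are invented, `one_way_anova` is a made-up helper, and the critical value 3.89 is the tabled F for alpha = .05 with (2, 12) df as found in standard F tables. The critical value is written down first, then the observed F is compared with it, and eta squared is computed separately to describe the size of the effect.

```python
# Step 1: set the "line in the sand" BEFORE looking at the data.
# Tabled critical F for alpha = .05 with df = (2, 12).
F_CRITICAL = 3.89

# Hypothetical scores for three independent groups of five.
groups = [
    [4, 5, 6, 5, 5],
    [7, 8, 6, 7, 7],
    [9, 8, 10, 9, 9],
]


def one_way_anova(groups):
    """Return (observed F, eta squared) for a one-factor
    between-subjects design."""
    all_scores = [x for g in groups for x in g]
    grand_mean = sum(all_scores) / len(all_scores)
    group_means = [sum(g) / len(g) for g in groups]

    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)

    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)

    f_obs = (ss_between / df_between) / (ss_within / df_within)
    eta_sq = ss_between / (ss_between + ss_within)
    return f_obs, eta_sq


# Step 2: compute the observed statistic and make the decision.
f_obs, eta_sq = one_way_anova(groups)
reject_null = f_obs > F_CRITICAL
print(f_obs, reject_null)

# Step 3: the decision says a difference exists; eta squared says how
# much of the total variability the independent variable accounts for.
print(round(eta_sq, 3))
```

The decision in step 2 is a yes-or-no answer about the threshold. The size of the effect comes only from step 3.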

Now for my Wish List of things I always wished I could have covered, but didn’t.

(1) Though I do get to cover the concepts of the F-test, I teach a three-credit class and only have time to cover the one-factor between-subject ANOVA. If only I could cover a two-factor between-subject ANOVA and a one-factor within-subject ANOVA, I would feel my students would really understand the F-test (and as such, be less inclined to misuse or overuse it).

(2) I feel that if I could cover non-parametrics, students would better understand the role of the assumptions in parametric tests, and issues like power and random error could be even better understood. Plus they would get the benefit of learning about a really important class of statistics. Sadly, another semester has passed without me being able to cover this topic with the depth I think it deserves.

(3) I fear I don’t emphasize the weakness of statistics, and that they are only as good as the quality of the theories being tested in the design. They are also only as good as the quality of the sample and the quality of the measure. At least the latter two concepts get covered in classes that will follow the statistics class. But so few people speak of the topic of equifinality, that the same outcome can have multiple explanations. Again, though I touch on this, the idea of developing the alternative rival hypotheses that could explain the same empirical evidence is one I simply don’t have time to cover to the extent I would like. If you have a weak theory or haven’t taken into account the alternative rival hypotheses when designing your study, cool statistics will not improve the quality of your findings.

(4) Though I tell students the hypothesis drives everything, from the selection of the measure and research design to the specific statistic one would select, and though there are example problems in the textbook (Integrating Your Knowledge) that students have to complete, I really wish we could spend more time on this.

Maybe next semester, I can find a way to reach my wish list … maybe!

## Teaching hypothesis, design, analysis & inference as one thing

“So, the question I would like to pose for the sages and anyone else interested in commenting … for a first semester undergraduate applied statistics class … what are the most critical student learning outcomes that have to be mastered?”

First let me just comment on the blog Bonnie just posted today: I think her list of core concepts is excellent, and I agree that those concepts (all of which have to do with the ever-present error inherent in all our observations and measurements) should certainly be taught in the introductory statistics course. Nevertheless, let me introduce a different perspective.

When I first saw the question posed by Bonnie (reproduced above) I thought the answer would be an easy one to write. It turns out it is not quite so easy. My problem is that I see all the parts of the application of statistics as parts of an integrated whole. So my answer will appear to be a daunting one.

I hope that students can take away an appreciation (mastery would be too much to ask at this level) of how we use data to make inferences about the behavior under study. My typical homework problems were not, except for some initial ones, about calculations: finding means, t or F values, and p values. Rather an experimenter’s hypotheses would be stated along with how she collected the data to test those hypotheses, and (relatively simple) data would then be listed. The question posed was: what can you conclude from these data, and especially what can you conclude about the hypotheses? The appropriate way to answer such problems was to present the means and to interpret what the pattern tells us, with the statistical test of significance to guide us as to which differences we could assume due to the independent variable.

I understand that this is asking a lot of the students, but just getting statistics from data sets bores the heck out of me, and I don’t see why it would not be equally boring to the students. A few weeks into the semester we would be into the Analysis of Variance (Keppel’s book does a wonderful job facilitating early introduction of AOV), and the course especially emphasized factorial designs in which interpretation of patterns of means with the assistance of significance testing becomes, for me at least, most challenging and most interesting. The logic of the interplay of hypothesis, design, data, statistical analysis and inference is to me all one thing.

Such an integrated concept, satisfying to me, may or may not be an asset when applied to teaching the first undergraduate course in statistics.

## Core Statistical Concepts

I have been spending the week thinking about what I consider to be the “core concepts” that need to be covered in an applied statistics class, be it in psychology, health, business, or education. However, before I post my personal thoughts, I felt it necessary to see what other applied statisticians had to say. In my search, I found http://www.statlit.org/pdf/2004McKenzieASA.pdf . This work, Conveying the Core Concepts by John McKenzie (2004), is from the Proceedings of the ASA Section on Statistical Education, pages 2755-2757.

In reading what McKenzie and several other professors of applied statistics identified as the core concepts in statistics, I must say … I concur. Listed below are the core concepts in applied statistics … the information that, in my opinion, simply has to be covered regardless of illness, snow days, or anything else that could interrupt a professor’s teaching schedule.

Variability: Students cannot understand the purpose of statistics unless they get the concept of variability. Within this, we can further talk about variability due to chance and variability due to effect. Included in the discussion of variability should be the difference between systematic and random variability. I would have to say that not a class period goes by without me spending at least a little time on helping students to focus on issues of variability (especially variability due to the individual differences of the subjects who just happen to be in our sample).

Randomness: Though I would see randomness and variability as being part of the same large concept, McKenzie’s work identified the concept of randomness as not only separate from variability but also critical for students to master.

Sampling Distribution: Along with Hypothesis Testing, the sampling distribution is considered to be one of the most complicated concepts to teach. I would concur, which is why I spend an entire class period just on a single activity with M&M’s to demonstrate the concept of sampling distribution. (Please see a prior blog entry for details on this tactile activity).

Hypothesis Testing: The sages and I spent the month of October and much of November discussing whether Hypothesis Testing is critical and if so, how to best tackle the teaching of this complex topic. Not surprisingly, McKenzie identified the teaching of hypothesis testing as being one of the two most difficult concepts to teach in applied statistics (the other being sampling distribution). Though there may be several published articles on hypothesis testing no longer being a critical concept to teach, the individuals who were surveyed for McKenzie’s work certainly consider it to be a critical concept.

Data Collection Methods: Though I have said to my students more times than I can count, “the quality of our statistics is limited by the quality of our sample,” I must admit to being a bit surprised that this was considered critical by others, especially since when I look at many undergraduate statistics textbooks, data collection methods are barely mentioned. Kiess and Green’s (2010) Statistical Concepts for the Behavioral Sciences, 4/e, certainly tackles the issue of data collection methods.

Association vs. Causality: This core concept makes me smile, as often when I meet someone for the first time, and they ask me what I do … my response is often met with one of two comments … “Oh, I hated statistics” or “Correlation does not mean causation.” It’s kind of like me recalling how to greet a person in German, a class that I had for three years, and yet recall so little. We, as applied statisticians, certainly engrave this concept into the minds of our students, but I’m sure most of you are like me, hoping students get more than a “pat phrase” out of our classes.

Significance (Statistical vs. Practical): This is a critical concept in applied statistics and one that is probably not mentioned in theoretical statistics classes. Sure, we delineate a mark in which we have to say … these results are too extreme for us to attribute them to “chance” … but just because we found a statistically significant difference doesn’t mean it’s a difference that truly matters. In applied statistics, it’s not enough to understand how statistical significance works; we must also be able to interpret the results to determine practical difference. I must admit to not covering this core concept to the same extent I cover the others.
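The last concept in the list, statistical versus practical significance, lends itself to a small numerical sketch. The assumptions are all mine: a two-sample z test with a known population standard deviation, invented numbers (a half-point difference on a scale with a standard deviation of 15, with 10,000 people per group), and the made-up helper name `two_sample_z`. Only Python's standard library is used.

```python
import math
from statistics import NormalDist

ALPHA = 0.05
# Two-tailed critical z, fixed before looking at any data.
z_critical = NormalDist().inv_cdf(1 - ALPHA / 2)


def two_sample_z(mean_diff, sigma, n_per_group):
    """Return (z, two-tailed p, standardized effect size d) for two
    equal-sized groups, assuming a known population sigma."""
    standard_error = sigma * math.sqrt(2 / n_per_group)
    z = mean_diff / standard_error
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    d = mean_diff / sigma  # difference in standard deviation units
    return z, p, d


# Hypothetical study: a tiny difference but a very large sample.
z, p, d = two_sample_z(mean_diff=0.5, sigma=15, n_per_group=10_000)

print(abs(z) > z_critical)  # statistically significant?
print(round(d, 3))          # how big is the difference, really?
```

With these numbers the difference passes the threshold, yet it is only about three hundredths of a standard deviation, far below the 0.2 that Cohen's convention labels a small effect.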

As I think of other “critical concepts” they tend to be a bit more specific and fall under the larger concepts listed above (e.g., understanding what a standard deviation can tell us clearly falls under the concept of variability). I invite all of you to comment on what concepts, if any, are missing from this list.