# Category Archives: Statistical Hypothesis Testing

## Difficult Concepts: Research Hypotheses vs. Statistical Hypotheses

I always cringe when I see a statement in a text or website such as “the research hypothesis, symbolized as H1, states a relationship between variables.” No! No! No! How can students not be confused about the difference between research and statistical hypotheses when instructors are? H1 is not the research hypothesis; it is the alternative to the null hypothesis in a statistical test.

Let’s be very clear: in most research settings, there are two very distinct types of hypotheses, the Research or Experimental Hypothesis and the Statistical Hypotheses. A research hypothesis is a statement of an expected or predicted relationship between two or more variables. It’s what the experimenter believes will happen in her research study. For example, a researcher may hypothesize that prolonged exposure to loud noise will increase systolic blood pressure. In this instance the researcher predicts that exposure to prolonged noise (the independent variable) will increase systolic blood pressure (the dependent variable). This hypothesis sets the stage to design a study to collect empirical data to test its truth or falsity. From this research hypothesis we can imagine the scientist will, in some fashion, manipulate the amount of noise a person is exposed to and then take a measure of blood pressure. The choice of statistical test will depend upon the research design used: a very simple design may require only a t test, a more complex factorial design may require an analysis of variance, and a correlational design may call for a correlation coefficient. Each of these statistical tests will possess different null and alternative hypotheses.

Regardless of the statistical test used, however, the test itself will not have a clue (if I am allowed to be anthropomorphic here) of where the measurement of the dependent variable came from or what it means. More years ago than I care to remember, C. Alan Boneau made this point very succinctly in an article in the American Psychologist (1961, 16, p.261): “The statistical test cares not whether a Social Desirability scale measures social desirability, or number of trials to extinction is an indicator of habit strength….Given unending piles of numbers from which to draw small samples, the t test and the F test will methodically decide for us whether the means of the piles are different.”

Rejecting a null hypothesis and accepting an alternative does not necessarily provide support for the research hypothesis that was tested. For example, a psychologist may predict an interaction of her variables and find that she rejects the null hypothesis for the interaction in an analysis of variance. But the alternative hypothesis for interaction in an ANOVA simply indicates that an interaction occurred, and there are many ways for such an interaction to occur. The observed interaction may not be the interaction that was predicted in the research hypothesis.

So please, make life simpler and more understandable for your students. Don’t call a statistical alternative hypothesis a research hypothesis. It is not. Your students will appreciate you making the distinction.

## Difficult Concepts—Degrees of Freedom

Several posts ago, Bonnie said we would address some difficult concepts for student understanding of statistics. I thought I would take a shot at one of the concepts she listed, degrees of freedom (df).

To help understand this concept, let us first think of df in a non-statistical way and say that df refers to the ability to make independent choices, or take independent actions, in a situation. Consider a situation similar to one suggested by Suppose you have three tasks you wish to accomplish, for example that you want to go shopping, plan a vacation, and work out at the gym. Assume that each task will take about an hour and that you may do all on one day, or only one each day over the course of several days. I have created a situation with three degrees of freedom: you have three independent decisions to make. Suppose you decide you will go shopping today. Does this decision put any limitations on when you may do the other tasks? No, for you may still do the other tasks either today or in the course of the next few days. Suppose next you decide to plan a vacation and you will do that tomorrow. Does this decision place any limitation on when you may go to the gym? Again, no, because you still might go to the gym today, tomorrow, or on another day. Notice here that each choice of when to do an activity is independent of each of the other choices. Thus, you have 3 degrees of freedom of choice in the order of doing the tasks.

Now, set a different scenario where I plan some limitation on the order in which you may do the tasks. You still have the same three tasks to do, except now you decide you will do only one a day and you want to have them all completed over a span of three days. This scenario has only 2 df, for there are only two independent decisions for you to make. After you have made a choice on two of the activities, the day for doing the third activity is “fixed” or decided by your other two choices. For example, suppose you decide to plan your vacation today. For this choice you have total freedom to make a decision for any of the three days. You next decide to plan when to go to the gym. Notice for this decision, however, you have only two choices left, either tomorrow or the following day. A statistician would say you have two degrees of freedom when making this decision. You decide to go the day after tomorrow. Finally, you have to plan shopping, but now you have essentially no choices open to you, it must be tomorrow. For this decision, you have no degrees of freedom. Thus, in a sense, you have 2 df in this scenario. You are free to make two choices, but making any two choices automatically determines your third choice.

Of course, the obvious question a student may ask is “What does all this have to do with statistics?” Let’s see. Statistically, the df are the number of scores that are free to vary when calculating a statistic, or in other words, the number of pieces of independent information available when calculating a statistic. Suppose you are told that a student took three quizzes, each worth a total of 10 points. You are asked to guess what her scores were. In this scenario, you may guess any three numbers as long as they are in the range from 0 to 10. In this example, you have 3 df, for each score is free to vary. Each score is an independent piece of information. Choosing the score for one quiz has no effect on either of the other two scores that you may choose.

But now I give you some information about the student’s performance by telling you that the total of her scores was 27. I have now created a scenario with 2 df. Suppose you guess 10 for the first score. Does choosing this score fix what you must guess for the second, given that the total of the scores must be 27? No. The 10-point ceiling on each quiz narrows the range somewhat, but no single value is forced on you, so the second score is still free to vary. You guess 9 for a second score. What about your choice of a third score? What must it be? If the total of the three scores is 27, and the first score you chose was 10, and the second 9, then your third choice must be 8 for a total of 27 to be obtained. In this instance, the third score is not free to vary if you know the total of the scores and any two of the three scores. For this example then, there are 2 df in the choice of scores. If you know the total of the three scores, then only two provide independent information; the third score becomes dependent on the previous two scores. By giving you knowledge of the total of the scores I have reduced the df in the number of choices you have.
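The arithmetic of the quiz example can be sketched in a few lines of Python. This is only an illustration of the paragraph above; the function name `third_score` is a hypothetical helper invented for this sketch.

```python
def third_score(total, first, second):
    """Return the only third score consistent with a known total.

    Hypothetical helper: once the total and two scores are known,
    the third score is no longer free to vary.
    """
    return total - first - second


total = 27            # the total given in the example
first_guess = 10      # free choice
second_guess = 9      # free choice
forced = third_score(total, first_guess, second_guess)
print(forced)         # 8: determined by the total and the two guesses
```

Two of the three values were chosen freely and the last was computed, which is what "2 df" means in this scenario.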

Can we now relate these two examples to the calculation of statistics? Consider that you have a sample of 10 scores and you want to calculate the mean for these scores. In order to do so, you must know all 10 scores; if you know only 9, you cannot calculate the mean. Thus if there are n scores in a sample, then for calculating the mean from this sample there are n df. Each score is free to vary, and is an independent piece of information. You cannot calculate the mean unless you know all n scores. But suppose you know the mean for the scores and you want to calculate the standard deviation (s) for the scores. In this instance, there are 9 df for these scores, for if you know the mean, you need to know only 9 of the scores; the 10th score is in a sense “determined” for you by the value of the other 9 scores. So, for a set of n scores, there are n – 1 df when calculating the standard deviation.
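A short Python sketch may make the n – 1 idea concrete. The ten scores below are made up for illustration; the point is only that, once the mean is known, the tenth score can be recovered from the other nine, and that the sample standard deviation divides the sum of squared deviations by n – 1.

```python
import statistics

# Hypothetical sample of 10 quiz scores (invented for this sketch).
scores = [7, 9, 10, 6, 8, 9, 5, 10, 8, 8]
n = len(scores)
mean = sum(scores) / n

# Knowing the mean and any 9 scores pins down the 10th.
known_nine = scores[:-1]
recovered_tenth = n * mean - sum(known_nine)

# Definitional formula: sum of squared deviations divided by n - 1.
ss = sum((x - mean) ** 2 for x in scores)
sample_sd = (ss / (n - 1)) ** 0.5

print(recovered_tenth == scores[-1])   # the "determined" score matches
print(sample_sd, statistics.stdev(scores))
```

The last line compares the hand calculation with the standard library's `statistics.stdev`, which also uses n – 1 in the denominator.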

A question frequently arises when the idea of a fixed or determined score is discussed. Students may ask how can someone’s score on a test, for example, be “determined” or “fixed in value” by her other scores on tests? Students should be made to realize that during the actual data collection process all scores are free to vary and the concept of degrees of freedom does not apply. Degrees of freedom only come into play after the data have been collected and we are calculating statistics on those data.

These ideas can be expanded to the computation of other statistics. Consider analyzing data with a 2 × 2 chi-square test of independence. When we are collecting data for the contingency table, the concept of degrees of freedom is not applicable. After we have collected the scores, however, and each cell of the contingency table is filled, then we can use the cell totals to find the row and column marginal totals. Notice at this stage that if I were to tell you the row and column marginal totals, then I would need to give you only one cell total, and you would be able to determine the other three cell totals. In this instance, when knowing the row and column marginal totals, there is only 1 df for the cell totals. In a more general sense, if there are r rows and c columns in a contingency table, then once the row and column totals are known, the table possesses (r – 1)(c – 1) df.
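The 2 × 2 case can be sketched as follows. The counts are hypothetical and the function names (`fill_2x2`, `chi_square_df`) are invented for this illustration; the sketch simply shows that the marginal totals plus one cell determine the remaining three cells.

```python
def fill_2x2(row_totals, col_totals, top_left):
    """Rebuild a 2 x 2 table from its marginal totals and one cell."""
    row1, row2 = row_totals
    col1, _col2 = col_totals
    top_right = row1 - top_left        # forced by the first row total
    bottom_left = col1 - top_left      # forced by the first column total
    bottom_right = row2 - bottom_left  # forced by the second row total
    return [[top_left, top_right], [bottom_left, bottom_right]]


def chi_square_df(n_rows, n_cols):
    """Degrees of freedom for an r x c test of independence."""
    return (n_rows - 1) * (n_cols - 1)


# Hypothetical marginals: rows sum to 30 and 20, columns to 25 and 25.
table = fill_2x2((30, 20), (25, 25), top_left=18)
print(table)                 # [[18, 12], [7, 13]]
print(chi_square_df(2, 2))   # 1
```

Only `top_left` was supplied freely; every other cell followed by subtraction, matching the single degree of freedom.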

I think giving students this intuitive overview of df helps them to understand where such numbers come from when they are learning about various statistical tests. Perhaps it may help to make statistics a little less mysterious.

## Exposing students to Diversity while teaching the t-test

There are a lot of ways we can approach diversity in the statistics classroom, as even the term “diversity” can be operationally defined in so many ways. One method is to use research on diversity as a basis of an example when teaching statistics in context.

Often the complexity of the statistics used in a published journal article is beyond what would be taught in an introductory course in applied statistics. What I often do, however, is take the research hypothesis and design and simplify them a bit. Yes, this is an example of scaffolding. So, I structure the study to fit the concept I am teaching (e.g., making a multivariate research study univariate, or making a two-way factorial a one-way). Keeping the general structure of the study intact, I often shorten the task so it will take less than 10 minutes to run through the mock study, collect the data, and then provide students with critical conceptual background information. This still gives me enough time (in a 50 minute class) to have students work through the problem, while I model it, and go from question to answer through the use of hypothesis testing.

In this example, the concept I will be teaching is the independent t-test, a form of null hypothesis testing. The study I am using is Apfelbaum, Pauker, and Sommers’ (2010) study of 4th and 5th grade students, which examined the effects of color-blind thinking and value-diversity thinking on bias.

In short, color-blind thinking is simply ignoring race as a variable worth attending to, on the premise that doing so will minimize issues of bias. (For my social scientist readers … this is a very etic way of approaching the potential of racial bias.)

Value-diversity thinking (emic) actively recognizes differences within each racial and ethnic group.

As we are comparing two different conditions, this study can easily be adapted to an independent t-test.

So, during class, we could quickly and randomly give each student one of two sheets of paper.

Borrowing phrases directly from the published study, the students in the color-blind condition would see phrases like:

• We need to focus on how we are similar to our neighbors rather than how we are different.
• We want to show everyone that race is not important and that we’re all the same.

Meanwhile, the students in the value diversity condition would see phrases like:

• We need to recognize how we are different from our neighbors and appreciate those differences.
• We want to show everyone that race is important because our racial differences make each of us special.

In the actual study, Apfelbaum, Pauker, and Sommers (2010) looked at both implied and explicit racial biases. For the in-class activity, as this is being conducted with college students instead of 4th and 5th graders, we could just use the implied bias. Read to students the following scenario (slightly modified from the article): “Most of Brady’s classmates got invitations to his birthday party, but Terry was one of the kids who did not. Brady decided not to invite him because he knew that Terry would not be able to buy him any of the presents on his ‘wish list.’”

Then ask students to write down an answer on a scale from 1 to 10, 1 being completely inappropriate and 10 being completely appropriate. Typically, my class size is too large to collect data from all of the students, so I would randomly select 5 students from each condition and write their responses on the chalk board. Now, we can model how to answer the question: is encouraging people to ignore race a way to increase bias or decrease it, compared to encouraging people to factor race into evaluating situations?

Of course, one of the problems in using “real data” in a study with so few subjects is that you will never be certain if the test statistic will support the same conclusion as the research article. There are two ways to deal with this problem: acknowledge the potential for low power right from the start, or have the students complete the activity but use data that you selected to model how to answer this question with the use of an independent t-test. The latter might be best for individuals new to teaching, as you can come to class with your calculations prepared.
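For instructors who want the calculations prepared in advance, here is a minimal sketch of the pooled-variance independent t-test in Python. The ratings are invented purely to illustrate the arithmetic and are not the published study's data or results; the function name `independent_t` is hypothetical. The critical value of 2.306 is the two-tailed value for alpha = .05 with 8 df, taken from a standard t table.

```python
import math


def independent_t(group_a, group_b):
    """Pooled-variance independent t statistic and its df."""
    n_a, n_b = len(group_a), len(group_b)
    mean_a = sum(group_a) / n_a
    mean_b = sum(group_b) / n_b
    ss_a = sum((x - mean_a) ** 2 for x in group_a)
    ss_b = sum((x - mean_b) ** 2 for x in group_b)
    df = n_a + n_b - 2
    pooled_variance = (ss_a + ss_b) / df
    se_difference = math.sqrt(pooled_variance * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / se_difference, df


# Hypothetical 1-10 ratings from 5 students per condition (made up).
color_blind = [6, 5, 7, 4, 6]
value_diversity = [3, 4, 2, 5, 3]

# Set the critical value first, then compute the observed t.
CRITICAL_T = 2.306  # two-tailed, alpha = .05, df = 8 (standard t table)
t_observed, df = independent_t(color_blind, value_diversity)
reject_null = abs(t_observed) > CRITICAL_T
print(round(t_observed, 2), df, reject_null)
```

With these made-up numbers the observed t is about 3.05 on 8 df, which falls beyond the critical value. With real classroom data from only 10 students, the outcome could easily go the other way, which is the low-power caveat raised above.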

In closing, it is easy to bring diversity into a classroom, even if you are that “scoop of vanilla ice cream.” One of the best ways is to make use of published research studies on cultural or racial diversity as a way of modeling critical concepts in statistics.

Apfelbaum, E. P., Pauker, K., & Sommers, S. R. (2010) In blind pursuit of racial equality? Psychological Science, 21, 1587-1592. http://pss.sagepub.com/content/21/11/1587.full

## Before the semester starts … I’m playing with pictures!

I am sure I’m not alone in wanting to use the time between semesters to make adjustments to what I am teaching or how I am teaching it. By now, you probably recognize that I am a fan of learning about new pedagogical techniques. I am dedicated to helping students truly understand the concepts of statistics. Often, having visuals when you teach is useful for students.

I use chalk and a board (OK, more like 8 boards that move). I draw a lot of pictures. However, a mathematics professor (who is both a great colleague and friend) has been bugging me about using Mathematica in addition to chalk (a delivery system she also loves).

With Mathematica, it is my hope that I will not only be able to present my students with a visual image of certain concepts during class time (like how a normal distribution changes when the size of the standard deviation gets larger or smaller) but by making these demonstrations available electronically to students for them to explore these concepts on their own, I am hoping students will gain a greater conceptual understanding of critical statistical concepts.

Mathematica is a software package, that among other things, provides demonstrations of statistical concepts. Each demonstration was designed by an instructor. For it to be published, it is my understanding that it goes through a rigorous peer-review process. As such, if it’s printed for use, you know it will work. The down side is that your university would have to pay for a subscription to Mathematica for the demonstrations to be useful. http://www.wolfram.com/solutions/education/higher-education/uses-for-education.html

As I stated last week, in my list of resolutions, my goal is to find five different demonstrations this semester. Why five? It seemed like a reasonable number … not too challenging.

This was really easier than I anticipated. I started by identifying the concepts that would most benefit from being able to visualize and manipulate variables. Then I visited the Mathematica web site and searched the topics. Each search yielded anywhere from 5 to 25 demonstrations; some were appropriate, others weren’t. I looked through the demonstrations and selected the ones I liked.

Here are the concepts and the demonstrations I identified as being potentially useful this semester.

(1) The Normal Distribution, where students get to input mu and sigma, would make a nice visual demonstration.

http://demonstrations.wolfram.com/TheNormalDistribution/

This Normal Distribution also shows the area under the curve (i.e., you can manipulate the z-score)

http://demonstrations.wolfram.com/AreaOfANormalDistribution/

(2) Another good demonstration would be the Sampling Distribution of the Means, where students can see the impact of changing mu, sigma, or sample size on its shape.

http://demonstrations.wolfram.com/SamplingDistributionOfTheSampleMean/

I’m also going to throw in a demonstration on the Central Limit Theorem, as how can we talk about the Sampling Distribution of the Means without mentioning the Central Limit Theorem?

http://demonstrations.wolfram.com/TheCentralLimitTheorem/

(3) Of course, what changes in the Sampling Distribution of the mean is the standard error, thus showing how a standard error changes due to changes in the sample size and/or variability makes a great deal of sense. I was really hoping that a demonstration on the standard error would already be available; unfortunately, it doesn’t seem to be. A similar concept is the confidence interval, though the writer of the Mathematica code for this demonstration did not include how variability (i.e., standard deviation) impacts the size of the “margin of error.” However, it still could be a useful demonstration.

http://demonstrations.wolfram.com/ConfidenceIntervalsConfidenceLevelSampleSizeAndMarginOfError/

Though not as clean looking as the one above, this demonstration also includes the size of the standard deviation. http://demonstrations.wolfram.com/ConfidenceIntervalExploration/

I would expect that the two demonstrations would be necessary for students to get a richer understanding of confidence intervals.

That having been said, I believe that two new Mathematica Demonstrations are in order … one dealing with the size of the standard error based on changes in sample size and variability, and possibly a new CI demonstration that merges the best of these two demonstrations.

(4) The effects of the sample size and population variance on hypothesis testing with the t-test seems like a great visual demonstration.

(5) How changes in the variables impact correlations (depending on how they are calculated) should be useful for my students.

http://demonstrations.wolfram.com/CorrelationAndRegressionExplorer/

(6) Those of you who know me, are probably not surprised that I can’t just stop at 5 examples for this first semester … so here is a great demonstration on Power. Though I can get students to define power, and identify threats to power, I am never fully certain that they truly get the beauty (and hassle) of power. This demonstration may help.

http://demonstrations.wolfram.com/ThePowerOfATestConcerningTheMeanOfANormalPopulation/
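Returning to item (3): until a standard error demonstration exists, the relationship can at least be sketched numerically. The following Python is not one of the Mathematica Demonstrations, just a rough illustration assuming a known population standard deviation (so a z-based margin of error applies); the function names and the example values of sigma and n are hypothetical.

```python
import math
from statistics import NormalDist


def standard_error(sigma, n):
    """Standard error of the mean for a known population SD."""
    return sigma / math.sqrt(n)


def margin_of_error(sigma, n, confidence=0.95):
    """z-based margin of error (assumes sigma is known)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * standard_error(sigma, n)


# Hypothetical values: quadrupling n halves the standard error,
# while doubling sigma doubles it.
print(standard_error(15, 25))    # 3.0
print(standard_error(15, 100))   # 1.5
print(standard_error(30, 25))    # 6.0
print(round(margin_of_error(15, 25), 2))
```

Students can change `sigma` and `n` themselves to see both influences on the margin of error, which is the piece the first confidence interval demonstration leaves out.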

Of course, without proper instruction during class time and an accompanying explanation following class instruction, these demonstrations may end up being little more than pretty pictures to students.

In a few weeks, especially after I actually try these demonstrations with my students, I will provide for you the information I attached with the demonstrations as well as feedback as to what worked and what didn’t. After all … anyone who has taught long enough knows, even the best planned lessons and demonstrations sometimes flop.

Though not specifically having to do with teaching statistics … I found a nice article at the Chronicle of Higher Education on iPhones, BlackBerrys, etc … and apps that could help professors. The attendance and learning students’ names apps look promising. http://chronicle.com/article/College-20-6-Top-Smartphone/125764/

I look forward to hearing from any of you who have used Mathematica Demonstrations (or others) during class and for homework.

## How wonderful and I wish, I wish …

As I type this, I have one fifty minute class left to teach, and my time with my statistics class will be over. As with anything, each semester is varied. Some semesters I cover more information than other semesters. I liken this semester to driving through the city and hitting all green lights! As such, I believe my students were able to master additional information based on what is probably mostly good fortune.

So, here is my list of things I’m so thrilled I covered:

(1) Effect size statistics, like eta squared: Sure, effect size statistics are not used that much, and let’s face it, they are super easy to calculate, but my biggest reason for wanting to teach them is that they help students understand what a t-test or F-test can tell us (is there a difference?) and what it can’t (how big is the effect?). In fact, by spending about 20 minutes on effect size statistics, students were better able to understand why the “p-value” for an observed t or F score provides us with no information about the size of an effect. All we need to know from the test is: did we pass the threshold?

(2) We find the critical value BEFORE calculating the observed value: This discussion helps focus students on the logic of statistical hypothesis testing. Specifically, statistical hypothesis testing works because we assume that the null hypothesis is true, that is, that there is no effect of the independent variable on the dependent variable. With this assumption, we are able to generate the sampling distribution that provides us with information on the standard error. Now, if our sample mean is too extreme, we reject our initial hypothesis, the null, and accept the alternative hypothesis, that is, that the means are different. Finding the critical value prior to calculating the statistic focuses students on that “line in the sand,” the point at which they say … my observations are too extreme for me to stay with my current hypothesis. Students are far less likely to fall victim to equating the p-value with the strength of the effect of the independent variable, or to conclude … the data is trending because I have a p-value of .07, or some other funky thing far too many people do with null hypothesis testing. By spending a bit more time on the steps involved in hypothesis testing, I think students are less likely to fall victim to the common misconceptions surrounding Statistical Null Hypothesis Testing.

(3) Though not a specific concept, I am pleased that for almost every concept I taught this semester I used new examples. Sure, I’m still a sage in training, no grey hair and all, but I was beginning to find myself using the same examples. As this is the third semester my supplement instructor, Amy, is taking notes in class, I felt I owed it to her, at least, to “keep it fresh.” I also found thinking about this blog helped spur my mind toward different examples. In doing so, I found some worked even better than my “old stand by” examples, but the great thing was, when the new example flopped, I just quickly switched to the example I knew helped students.
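To illustrate item (1) above, here is a small Python sketch of why passing the threshold says nothing about the size of an effect. It uses the standard conversion from a t statistic to eta squared, t² / (t² + df); the t values and degrees of freedom are hypothetical, and the function name is invented for this sketch.

```python
def eta_squared_from_t(t, df):
    """Proportion of variance accounted for, from a t statistic."""
    return t ** 2 / (t ** 2 + df)


# Two hypothetical studies with the SAME observed t of 2.5.
# Both exceed their two-tailed .05 critical values
# (2.306 for df = 8, roughly 1.98 for df = 98).
small_study = eta_squared_from_t(2.5, 8)
large_study = eta_squared_from_t(2.5, 98)
print(round(small_study, 2))   # about 0.44
print(round(large_study, 2))   # about 0.06
```

Both studies reject the null, yet the effect accounts for roughly 44% of the variance in one and about 6% in the other.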

Now for my Wish List of things I always wished I could have covered, but didn’t.

(1) Though I do get to cover the concepts of the F-test, I teach a three credit class and only have time to cover the one-factor between subject ANOVA. If only I could cover a two-factor between subject ANOVA and a one-factor within subject ANOVA, I would feel my students would really understand the F-test (and as such, be less inclined to misuse or overuse it).

(2) Yet, I feel if I could cover non-parametrics, students would better understand the role of the assumptions in parametric tests, and issues like Power and random error could be even better understood. Plus they would get the benefit of learning about a really important class of statistics. Sadly, another semester has passed without me being able to cover this topic with the depth I think it deserves.

(3) I fear I don’t emphasize the weakness of statistics: they are only as good as the quality of the theories being tested in the design. They are also only as good as the quality of the sample and the quality of the measure. At least the latter two concepts get covered in classes that will follow the statistics class. But so few people speak of the topic of equifinality, that the same outcome can have multiple explanations. Again, though I touch on this, the idea of developing the alternative rival hypotheses that could explain the same empirical evidence is one I simply don’t have time to cover to the extent I would like. If you have a weak theory or haven’t taken into account the alternative rival hypotheses when designing your study, cool statistics will not improve the quality of your findings.

(4) Though I tell students the hypotheses drive everything, from the selection of the measure and research design to the specific statistic one would select, and though there are example problems in the textbook (Integrating Your Knowledge) that students have to complete, I really wish we could spend more time on this.

Maybe next semester, I can find a way to reach my wish list … maybe!

## Teaching hypothesis, design, analysis & inference as one thing

“So, the question I would like to pose for the sages and anyone else interested in commenting … for a first semester undergraduate applied statistics class … what are the most critical student learning outcomes that have to be mastered?”

First let me just comment on the blog post Bonnie just posted today: I think her list of core concepts is excellent, and I agree that those concepts (all of which have to do with the ever-present error inherent in all our observations and measurements) should certainly be taught in the introductory statistics course. Nevertheless, let me introduce a different perspective.

When I first saw the question posed by Bonnie (reproduced above) I thought the answer would be an easy one to write. It turns out it is not quite so easy. My problem is that I see all the parts of the application of statistics as parts of an integrated whole. So my answer will appear to be a daunting one.

I hope that students can take away an appreciation (mastery would be too much to ask at this level) of how we use data to make inferences about the behavior under study. My typical homework problems were not, except for some initial ones, about calculations: finding means, t or F values, and p values. Rather an experimenter’s hypotheses would be stated along with how she collected the data to test those hypotheses, and (relatively simple) data would then be listed. The question posed was: what can you conclude from these data, and especially what can you conclude about the hypotheses? The appropriate way to answer such problems was to present the means and to interpret what the pattern tells us, with the statistical test of significance to guide us as to which differences we could assume due to the independent variable.

I understand that this is asking a lot of the students, but just getting statistics from data sets bores the heck out of me, and I don’t see why it would not be equally boring to the students. A few weeks into the semester we would be into the Analysis of Variance (Keppel’s book does a wonderful job facilitating early introduction of AOV), and the course especially emphasized factorial designs in which interpretation of patterns of means with the assistance of significance testing becomes, for me at least, most challenging and most interesting. The logic of the interplay of hypothesis, design, data, statistical analysis and inference is to me all one thing.

Such an integrated concept, satisfying to me, may or may not be an asset when applied to teaching the first undergraduate course in statistics.

## Avoiding Misunderstandings in the teaching of Null Hypothesis Testing

There are so many places where issues can arise for students when they are learning about Null Hypothesis testing. I believe that the best professors rely heavily upon the technique of scaffolding (see a prior post for more detail). Briefly, scaffolding is a Vygotskian concept where the professor constrains the situation for the students so they can learn the component parts of a larger, more complex concept. Certainly, as Null Hypothesis testing is complex, scaffolding is in order.

Many of the statistics classes I teach have student learning outcomes that expect students to be able to calculate and interpret statistics like the z-test, t-test, F-test, and correlation coefficients (i.e., Null Hypothesis testing). Here are the component pieces, in my opinion, that often deserve a full class period (at least 50 mins.) and homework that requires students to master the pieces before putting it all together. I find in breaking apart the teaching of these concepts that students not only end up in the same place as when a professor doesn’t break down these pieces and just goes full steam ahead, but that students have a far greater understanding of the underlying concepts, thus minimizing the likelihood of them carrying misconceptions with them. So, though it may seem like it takes more time to teach this way, my experience has been that it doesn’t, while resulting in greater student understanding.

(1) Though my focus is concepts, not mathematics/calculations, I find that students will never fully understand statistics without having to complete repeated hand calculations of small data sets using definitional formulas. Thus, it is critical that students learn how to calculate the Sum of the Squared Deviations (SS). They can then learn how to use the SS for calculating the variance and standard deviation. (See prior post for details on how I use a kinesthetic activity for the teaching of SS which maximizes student comprehension of the Sum of Squares).

(2) I actively teach concepts on the Normal Distribution and z-score, which typically take more than one class period.

(3) I feel it is critical that students fully understand sampling error, standard error, and how to estimate standard error. Again, please see a prior post for a tactile activity I use in teaching the concepts of sampling error/standard error.

(4) Understanding that we begin by assuming the null hypothesis is true, then establish the point at which we will reject that hypothesis as wrong (a line in the sand), and what the consequences are if you hold onto a hypothesis that isn’t true or reject one that is true. This is such a critical component of this entire process, and it helps lay out students’ understanding of the assumptions underlying NHT and what Alpha and Beta are (along with their corresponding errors), and even helps lay the groundwork for understanding when to avoid parametric statistics in favor of non-parametric statistics.

(5) Students need to understand the purposes, strengths, limitations, and assumptions required for each NHT statistic.

(6) By this point, if all is spelled out, especially if students can calculate the means, SS, and standard error, learning how to calculate and interpret the z-test, t-test, F-test, or correlation coefficient becomes easy. The calculation and interpretation become students’ favorite part of the class, as it all makes sense to them.

(7) However, even though we’ve discussed this previously, we cover yet again the detailed issues of Type I and Type II error, the fact that NHT does not work absent a theory that predicts a specific outcome, and that though we have estimated sampling error, that estimate still contains sampling error, measurement error, and experimenter error.

(8) We calculate effect size statistics and confidence intervals. The former is so students begin to get an idea of the size of an effect; the latter is to aid in general understanding of what the point estimate of the sample mean is really telling us. Confidence Intervals are truly easier for students to “get.”
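The chain of hand calculations described in steps (1) through (8) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the five scores and the null mean of 5 are made-up values, the function name `sum_of_squares` is my own, and the critical value is the usual t-table entry for a two-tailed test at alpha = .05 with df = 4.

```python
import math

def sum_of_squares(scores):
    """Sum of squared deviations from the mean (definitional formula)."""
    mean = sum(scores) / len(scores)
    return sum((x - mean) ** 2 for x in scores)

# Hypothetical data: five scores tested against a hypothetical null mean of 5.
scores = [4, 6, 7, 8, 10]
mu_0 = 5

n = len(scores)
mean = sum(scores) / n                        # 35 / 5 = 7.0
ss = sum_of_squares(scores)                   # 9 + 1 + 0 + 1 + 9 = 20
variance = ss / (n - 1)                       # 20 / 4 = 5.0
sd = math.sqrt(variance)                      # about 2.236
standard_error = sd / math.sqrt(n)            # estimated standard error of the mean
t_observed = (mean - mu_0) / standard_error   # one-sample t statistic
cohens_d = (mean - mu_0) / sd                 # effect size in standard deviation units

# Critical value looked up in a t table (two-tailed, alpha = .05, df = 4),
# chosen before looking at the data.
T_CRITICAL = 2.776
reject_null = abs(t_observed) > T_CRITICAL

print(f"SS = {ss}, variance = {variance}, SE = {standard_error:.3f}")
print(f"t = {t_observed:.3f}, d = {cohens_d:.3f}, reject null: {reject_null}")
```

Every later quantity is built from the SS and the mean, which is why mastering those pieces first makes the test itself feel easy.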

Students don’t leave my class, or at least I hope they don’t, thinking that if their observed t falls in the rejection region, there is proof that their independent variable caused the change in the dependent variable; they don’t leave thinking the p-value and effect size are one and the same; they don’t leave believing that any research design (in general) or any experiment (in particular) is equally helped by using a specific statistic … but they do leave recognizing that this one test is providing evidence, and that to be sure, more needs to be done.

I liken the types of conceptual mistakes individuals make about how and when to use NHT, and about what it can tell us, to when my children were young and they thought … to get money, you just had to go to the cash machine. Yes, I get money from the cash machine, but obviously not without first putting it in. And yes, a significant statistical test can tell us something, but not outside of the context of us first understanding all that went into that study for the statistic to come out, and just like the amount of money available to me from the cash machine … there are significant limitations of which we must always be aware, lest we look like fools.

I believe that, if taught well, students “get” this.

## What Does a Statistical Hypothesis Test Test?

In his October 31st post, Marty stated “The statistical significance test simply assesses the likelihood of the rival hypothesis of ‘chance.’” I would like to elaborate a little on this statement because it makes a very important point about statistical hypothesis testing. As both Bonnie and Marty have indicated, there will always be error in any data that we collect: sampling error, measurement error, and experimenter-procedural error. Unfortunately, humans are not well prepared to assess the extent of this error in data from mere observation. Too often we are wont to see relationships in nature where none exist. Statistical hypothesis testing offers a relatively simple (although students often don’t initially perceive it to be simple) solution for this problem.

A statistical hypothesis test is a dispassionate method of deciding whether “chance” most reasonably explains the relationship observed in the data or whether there is something else we should search for in explanation. It is important to remember that the statistical test simply tests a null hypothesis, assuming that certain conditions apply in the data being tested. It is up to the experimenter to ensure that those conditions are met by his or her data. And, if the hypothesis test indicates that the hypothesis of chance is an unlikely explanation of the relationship observed, it does not provide any evidence that the research hypothesis is a plausible explanation for the results. An experiment can be confounded, or a third variable may be responsible for an observed correlation. The statistical test cannot assess the likelihood of such occurrences in the data; only a careful analysis of the design of the research can provide that assessment. This important point is sometimes misrepresented by text authors with statements similar to “the alternative hypothesis states that the independent variable does affect the dependent variable.” But the alternative hypothesis of a statistical test states no such relationship. For a parametric test, it simply indicates that the sample means were drawn from different populations, but not the reason why those populations may differ. And this alternative hypothesis remains essentially the same regardless of the experimenter’s research hypothesis. On the other hand, if the hypothesis test indicates that chance is the most plausible explanation for the results obtained, then again it cannot indicate whether the result was from a poor design or inadequate measurement of the variables in the research.

Hypothesis testing thus simply provides an objective way of deciding whether, given the data we have obtained, chance is a plausible explanation. But hypothesis testing is simply the start of the explanatory process, not the end of that process.
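To see what “testing the hypothesis of chance” looks like in practice, here is a small simulation sketch. Everything in it is an assumption made for illustration: both groups are drawn from the same normal population (mean 100, standard deviation 15), each group has 10 scores, and the critical value 2.101 is the t-table entry for a two-tailed test at alpha = .05 with df = 18. The function names are my own.

```python
import math
import random

N_PER_GROUP = 10
T_CRITICAL = 2.101  # t table: two-tailed, alpha = .05, df = 10 + 10 - 2 = 18

def pooled_t(sample_a, sample_b):
    """Independent-samples t statistic using a pooled variance estimate."""
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a = sum(sample_a) / n_a
    mean_b = sum(sample_b) / n_b
    ss_a = sum((x - mean_a) ** 2 for x in sample_a)
    ss_b = sum((x - mean_b) ** 2 for x in sample_b)
    pooled_variance = (ss_a + ss_b) / (n_a + n_b - 2)
    standard_error = math.sqrt(pooled_variance * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / standard_error

def rejection_rate_under_chance(n_experiments=2000, seed=1):
    """Proportion of experiments declared 'significant' when the null is true."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_experiments):
        # Both groups come from the SAME population, so only chance separates them.
        group_a = [rng.gauss(100, 15) for _ in range(N_PER_GROUP)]
        group_b = [rng.gauss(100, 15) for _ in range(N_PER_GROUP)]
        if abs(pooled_t(group_a, group_b)) > T_CRITICAL:
            rejections += 1
    return rejections / n_experiments

rate = rejection_rate_under_chance()
print(f"Proportion rejected when only chance is operating: {rate:.3f}")
```

The proportion should land near the chosen alpha. Notice that the code never knows what the numbers measure or why the groups might differ, which is exactly the point: the test speaks only to chance.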

Filed under Statistical Hypothesis Testing

## Data without error isn’t

Just a few thoughts after reading Schmidt’s Detecting and Correcting the Lies That Data Tell (2010) (see Bonnie’s October 14 post for link).  In it Schmidt argues, presenting clarifying examples, that accurate interpretation of collected data suffers from “ … researchers’ continued reliance on the use of statistical significance testing in data analysis and interpretation and the failure to correct for the distorting effects of sampling error, measurement error, and other artifacts.” [from the Abstract]  Schmidt suggests the use of meta-analysis but improved by the estimation of and the elimination of the “distorting effects.”

Schmidt’s applications to meta-analysis are elegant, with a beauty (to me) similar to that of structural equation modeling; in both, distinctions are made between, and independent estimations are made of, the constructs of interest and the errors necessarily attached to our measurements.  This is valuable work providing a powerful tool for theory testing.  But it also makes me uneasy.  As we all know, sampling and measurement errors are always present in collected data.  So when we get an estimate of an effect size after stripping away the intrinsic error, what does it mean?  Schmidt presents as “the truth” what the results would look like if the data were different from what the data really are.  I am reminded of what an editor once wrote to my co-author and me, criticizing an analysis we had done on transformed scores.  He said he wanted to see what the subjects did, not what the experimenters did.  We thought he had a point.

Schmidt also argues against the use of statistical significance testing, citing a number of ways it has led to misinterpretations.  I agree with him about the misinterpretations, and I agree with Bonnie (see her recent blog here) about what to do about those misguided uses of a significance test: don’t do that!  But I do not agree that therefore significance testing should be abandoned.  Meta-analysis is not necessarily appropriate for all research questions and studies.  For a stand-alone study in which a researcher claims her independent variable has shown an effect, it is not unreasonable to ask for some evidence that the obtained difference is unlikely to have resulted by chance (i.e., from the effects of those pesky sampling and measurement errors).  Good experimental design attempts to establish a cause-and-effect conclusion by eliminating all other “rival hypotheses.”  The statistical significance test simply assesses the likelihood of the rival hypothesis of “chance.”


## A cult of empiricism?

“… there is a strong cult of naive and overconfident empiricism in psychology and the social sciences with an excessive faith in data as the direct source of scientific truth and an inadequate appreciation of how misleading data can be.”

These are powerful words by Frank Schmidt in his May 2010 Perspectives on Psychological Science article. (See prior post for link.)

I remember when I first heard an individual utter “empiricist” as if it were some horrific way of thinking … almost evil. Yet, even though I openly admit to being a rationalistic empiricist, I have to say, Frank is, at least in part, correct. We, as researchers, risk following a false path due to our arrogance and full acceptance of data and statistical results. We can be confident and self-assured, but we must never, ever forget that everything we do comes with at least three forms of error … and that’s assuming we have done everything perfectly.

Maybe, if we all help our students to, as Donald T. Campbell used to say, “Wallow in the inadequacies of our design,” then maybe this “cult” could stop its blind following and return us to a skeptical science.

What are these three errors? Sampling Error, Measurement Error, and Experimenter Error.

We never avoid all three of these errors, but (1) we can minimize them, (2) for sampling and measurement error we can estimate them, and (3) we must never forget that even the most carefully spelled out study is going to have all three present.

In truth, when teaching about statistical hypothesis testing, it is the sampling error that is the most pressing. After all, that is the very reason most Null Hypothesis Tests, like the t-test and F-test, work. The denominator is an estimate of sampling error, that is, error due to individual differences among the subjects who just happen to be in our study. I often point out to students that there is sampling error in our estimator of sampling error; there is also measurement error and experimenter error. Thus, we must always be mindful that we can never fully trust a single statistic and, like all sciences, must build our case on replication, replication, and replication of our study, lest we end up, as Schmidt might say, lying to ourselves.
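The idea that our estimate of sampling error itself carries sampling error is easy to show with a quick simulation. This is a sketch under assumed values: a normal population with mean 100 and standard deviation 15, samples of 10, and 1,000 repeated samples. None of these numbers come from a real study.

```python
import math
import random
import statistics

# Hypothetical population and sample size, chosen only for illustration.
POP_MEAN, POP_SD, N = 100, 15, 10
true_se = POP_SD / math.sqrt(N)  # the standard error we would use if we knew sigma

rng = random.Random(2)
estimated_ses = []
for _ in range(1000):
    sample = [rng.gauss(POP_MEAN, POP_SD) for _ in range(N)]
    # statistics.stdev uses n - 1 in the denominator, like SS / (n - 1) by hand.
    estimated_ses.append(statistics.stdev(sample) / math.sqrt(N))

print(f"True standard error:          {true_se:.2f}")
print(f"Smallest estimate in 1000:    {min(estimated_ses):.2f}")
print(f"Largest estimate in 1000:     {max(estimated_ses):.2f}")
```

Each of those 1,000 estimates would have been “the” denominator of somebody’s t-test, and they do not agree with one another. That spread is one more reason to lean on replication rather than on a single statistic.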

Yet, Schmidt further argues for the weakness of statistical hypothesis testing by pointing out people’s misunderstandings. I paraphrase these common misunderstandings, identified by other researchers and summarized by Schmidt, not to trash statistical hypothesis testing, but to forewarn all of you who teach it about what students are most likely to misunderstand.

Misunderstanding #1: When we find a statistically significant difference, with alpha = .05, that means the probability of replicating this finding is 95%. Reality: Our alpha level is merely the point we identify as “extreme.”

Misunderstanding #2: The p value is an index of the size of the relationship … the smaller the p value, the stronger the relationship.  Reality: I tell my students, don’t even calculate the observed p-value, as it tempts you into misinterpreting the results. The critical value, not the p-value, is what we should be focused upon, and of course, the critical value is determined by what we have identified as extreme (our alpha level), and this is done PRIOR to collecting a single piece of data.

Misunderstanding #3: If you fail to find a statistically significant difference, the only real explanation for this is that any observed relationship is strictly spurious.  Reality: Of course, this misconception forgets the possibility that you made a Type II error (for example, because of low power).

Misunderstanding #4: Statistical hypothesis testing is the only objective way to further science. Reality: Statistical hypothesis testing, like all statistics, is merely one tool in a set of tools to help us further our science. As we train our next generation of scientists, we have to acknowledge this and make sure students know that statistical hypothesis tests like the t-test and F-test are merely some of the many tools at our disposal.
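Misunderstanding #2 in particular lends itself to a short worked example. For a one-sample t-test, the observed t equals the effect size d multiplied by the square root of n, so the very same effect can land on either side of the critical value depending on sample size. The sketch below assumes a fixed, hypothetical effect size of d = 0.2 and uses t-table critical values (two-tailed, alpha = .05) for df = 24 and df = 399; the function name is my own.

```python
import math

def one_sample_t(effect_size_d, n):
    """Observed t for a one-sample test: (mean - mu_0) / (sd / sqrt(n)) = d * sqrt(n)."""
    return effect_size_d * math.sqrt(n)

d = 0.2  # the SAME hypothetical effect size in both studies

# Critical values from a t table (two-tailed, alpha = .05, df = n - 1).
critical_values = {25: 2.064, 400: 1.966}

results = {}
for n, t_critical in critical_values.items():
    t_observed = one_sample_t(d, n)
    results[n] = (t_observed, abs(t_observed) > t_critical)
    print(f"n = {n:3d}: d = {d}, t = {t_observed:.2f}, "
          f"critical = {t_critical}, significant: {results[n][1]}")
```

The relationship is no stronger in the larger study; only the sample size changed. That is why the test statistic (and the p-value behind it) cannot stand in for an effect size.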

Of course, at some point, though I may agree with Schmidt, I have to part ways … and it is at about this point in his article that we part.

Schmidt’s “cure all” for the “ills” of statistical hypothesis testing is the Confidence Interval. Yet, I ask … if there is a problem with estimating the standard error in the t-test, how is it OK to use the exact same formula to estimate the standard error in a confidence interval? If Schmidt’s argument isn’t simply about people’s understanding of how statistical hypothesis testing (and thus estimating standard error) works, then how can another statistic, requiring the same assumptions and using the same method of calculation, be OK?
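The shared machinery is easy to show for the simplest case, a one-sample t-test and the matching 95% confidence interval for the mean. The sketch below uses made-up scores and a t-table critical value (two-tailed, alpha = .05, df = 4); it is an illustration of the arithmetic, not a claim about any particular study. Both procedures use the one estimated standard error, so the test rejects a null value exactly when that value falls outside the interval.

```python
import math

# Hypothetical data, chosen only to keep the arithmetic simple.
scores = [4, 6, 7, 8, 10]
T_CRITICAL = 2.776  # t table: two-tailed, alpha = .05, df = 4

n = len(scores)
mean = sum(scores) / n
ss = sum((x - mean) ** 2 for x in scores)
# One estimated standard error, used by BOTH the test and the interval.
standard_error = math.sqrt(ss / (n - 1)) / math.sqrt(n)

ci_lower = mean - T_CRITICAL * standard_error
ci_upper = mean + T_CRITICAL * standard_error

def t_test_rejects(mu_0):
    """True if the two-tailed one-sample t-test rejects the null value mu_0."""
    return abs((mean - mu_0) / standard_error) > T_CRITICAL

def outside_interval(mu_0):
    """True if mu_0 falls outside the 95% confidence interval."""
    return not (ci_lower <= mu_0 <= ci_upper)

candidate_nulls = [0, 3, 4, 5, 7, 9, 10, 12]
agreement = all(t_test_rejects(m) == outside_interval(m) for m in candidate_nulls)

print(f"95% CI: ({ci_lower:.3f}, {ci_upper:.3f})")
print(f"Test and interval agree for every candidate null: {agreement}")
```

Whatever one thinks of the estimate of standard error, it is the same estimate in both places.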