“… there is a strong cult of naive and overconfident empiricism in psychology and the social sciences with an excessive faith in data as the direct source of scientific truth and an inadequate appreciation of how misleading data can be.”

These are powerful words by Frank Schmidt in his May 2010 *Perspectives on Psychological Science* article. (See prior post for link.)

I remember the first time I heard someone utter “empiricist” as if it were some horrific, almost evil way of thinking. Yet even though I openly admit to being a rationalistic empiricist, I have to say Frank is, at least in part, correct. We, as researchers, risk following a false path through our arrogance and uncritical acceptance of data and statistical results. We can be confident, even self-assured, but we must never, ever forget that everything we do comes with at least three forms of error … and that’s assuming we have done everything perfectly.

Maybe, if we all help our students to, as Donald T. Campbell used to say, “wallow in the inadequacies of our design,” this “cult” could stop its blind following and return us to a skeptical science.

What are these three errors? Sampling error, measurement error, and experimenter error.

We can never avoid all three of these errors, but (1) we can minimize them, (2) we can estimate sampling error and measurement error, and (3) we must never forget that even the best-designed study will have all three present.

In truth, when teaching statistical hypothesis testing, it is sampling error that is the most pressing. After all, it is the very reason most null hypothesis tests, like the *t*-test and *F*-test, work: the denominator is an estimate of sampling error, the error due to individual differences among the subjects who just happen to be in our study. I often point out to students that our estimate of sampling error itself contains sampling error, along with measurement error and experimenter error. Thus we must always be mindful that we can never fully trust a single statistic and, like all sciences, must build our case on replication, replication, and replication of our study … lest we end up, as Schmidt might say, lying to ourselves.
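A quick simulation makes the denominator's job concrete. The sketch below (pure Python, with hypothetical population values) compares the standard error estimated from a single sample, which is what the *t*-test's denominator uses, against the spread actually observed across many repeated samples:

```python
import random
import statistics

random.seed(1)

# Hypothetical population: mean 100, SD 15 (illustration values only).
def draw_sample(n=25, mu=100, sigma=15):
    return [random.gauss(mu, sigma) for _ in range(n)]

# The standard error of the mean -- the t-test's denominator -- estimates
# how much sample means would vary from sample to sample (sampling error).
sample = draw_sample()
se_estimate = statistics.stdev(sample) / len(sample) ** 0.5

# Compare that single-sample estimate to the spread of many actual sample means.
means = [statistics.mean(draw_sample()) for _ in range(2000)]
empirical_se = statistics.stdev(means)

print(f"SE estimated from one sample of 25: {se_estimate:.2f}")
print(f"SE observed across 2000 sample means: {empirical_se:.2f}")
```

Both numbers hover near the theoretical value of 3 (15 / √25), but note that the single-sample estimate itself bounces around from sample to sample: sampling error in our estimate of sampling error.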

Yet Schmidt further documents the weaknesses of statistical hypothesis testing by pointing out people’s misunderstandings of it. I paraphrase these common misunderstandings, identified by other researchers and summarized by Schmidt, not to trash statistical hypothesis testing, but to forewarn all of you who teach it about what students are most likely to misunderstand.

Misunderstanding #1: When we find a statistically significant difference with alpha = .05, the probability of replicating this finding is 95%. Reality: Our alpha level is merely the point we identify as “extreme”; it says nothing about the probability of replication.
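A small simulation shows how far off this intuition can be. The sketch below uses a two-group test with known sigma, so 1.96 is the two-tailed .05 critical value; the effect size and *n* are hypothetical choices for illustration. It asks: of the studies that reject the null, how many would an independent replication also reject?

```python
import random

random.seed(2)

# Two-group test with known sigma = 1, so |z| > 1.96 rejects at alpha = .05.
# delta (true effect) and n are hypothetical illustration values.
def significant(n=20, delta=0.5):
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(delta, 1) for _ in range(n)]
    se = (1 / n + 1 / n) ** 0.5
    z = (sum(b) / n - sum(a) / n) / se
    return abs(z) > 1.96

# Pair each "original" study with an independent "replication."
originals = [significant() for _ in range(3000)]
replications = [significant() for _ in range(3000)]
hits = [r for o, r in zip(originals, replications) if o]
rate = sum(hits) / len(hits)

print(f"Replication rate among significant results: {rate:.0%}")
```

With this modest effect and small *n*, the replication rate lands around a third, nowhere near 95%. It is governed by the study's power, not by alpha.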

Misunderstanding #2: The *p* value is an index of the size of the relationship … the smaller the *p* value, the stronger the relationship. Reality: I tell my students not even to calculate the observed *p* value, as it tempts you into misinterpreting the results. The critical value, not the *p* value, is what we should focus on, and of course the critical value is determined by what we have identified as extreme (our alpha level), which is set PRIOR to collecting a single piece of data.
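In code, that ordering is explicit: the critical value follows from alpha alone, before any data exist. The sketch below uses Python’s stdlib `statistics.NormalDist`, with the normal quantile standing in for the *t* critical value (a reasonable stand-in for large samples):

```python
from statistics import NormalDist

# Decide in advance what counts as "extreme": alpha fixes the critical
# value before a single piece of data is collected. For large samples the
# t distribution is close to normal, so the normal quantile is used here.
alpha = 0.05
critical = NormalDist().inv_cdf(1 - alpha / 2)

print(f"Two-tailed critical value at alpha = {alpha}: {critical:.2f}")
```

The decision rule, reject when the observed statistic exceeds 1.96 in absolute value, is fully specified at design time; no observed *p* value is ever needed.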

Misunderstanding #3: If you fail to find a statistically significant difference, the only real explanation is that any observed relationship is spurious. Reality: This misconception forgets the possibility of a Type II error, i.e., low power.
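A sketch of this, with a hypothetical effect size and known sigma (so 1.96 is again the two-tailed .05 cutoff): the same real effect is detected rarely at small *n* and routinely at large *n*, so a nonsignificant result at small *n* is most often a Type II error, not evidence of a spurious relationship.

```python
import random

random.seed(3)

# A real but modest effect (delta, in SD units -- hypothetical value),
# tested with known sigma = 1, so |z| > 1.96 rejects at alpha = .05.
def rejects(n, delta=0.4):
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(delta, 1) for _ in range(n)]
    z = (sum(b) / n - sum(a) / n) / (2 / n) ** 0.5
    return abs(z) > 1.96

# Power: the chance of detecting the effect at each sample size.
power = {n: sum(rejects(n) for _ in range(3000)) / 3000 for n in (10, 100)}

for n, p in power.items():
    print(f"n = {n:>3}: power = {p:.0%}")
```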

Misunderstanding #4: Statistical hypothesis testing is the only objective way to further science. Reality: Statistical hypothesis testing, like all statistics, is merely one tool in a set of tools that help us further our science. As we train our next generation of scientists, we have to acknowledge this and make sure students know that tests like the *t*-test and *F*-test are only some of the many tools at our disposal.

Of course, though I agree with Schmidt up to a point, I have to part ways with him … and it is at about this point in his article that we part.

Schmidt’s “cure-all” for the “ills” of statistical hypothesis testing is the confidence interval. Yet I ask: if there is a problem with estimating the standard error in the *t*-test, how is it OK to use the exact same formula to estimate the standard error in a confidence interval? If Schmidt’s argument is not simply an issue of people’s misunderstanding of how statistical hypothesis testing (and thus estimating the standard error) works, then how can another statistic, requiring the same assumptions and calculated by the same method, be OK?
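The shared machinery is easy to demonstrate. In the one-sample case, the test statistic and the confidence interval are built from the identical estimated standard error, so rejecting H0 and the CI excluding zero are one and the same decision. A sketch with made-up data (the normal critical value again standing in for the *t* value):

```python
import random
from statistics import NormalDist, mean, stdev

random.seed(4)

# Made-up one-sample data: the t statistic and the confidence interval
# are built from the SAME estimated standard error.
x = [random.gauss(0.3, 1) for _ in range(50)]
se = stdev(x) / len(x) ** 0.5
crit = NormalDist().inv_cdf(0.975)

t_stat = mean(x) / se                                # test of H0: mu = 0
ci = (mean(x) - crit * se, mean(x) + crit * se)      # 95% interval

print(f"t = {t_stat:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
# Rejecting H0 and the CI excluding 0 are equivalent decisions.
print((abs(t_stat) > crit) == (not ci[0] <= 0 <= ci[1]))
```

Whatever worries apply to the standard error in the *t*-test apply, term for term, to the interval.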

I look forward to comments.

Until then, remember that the best professors know where students are most likely to have misunderstandings and, through their teaching, actively attempt to minimize them. Schmidt’s (2010) article provides us with four very common misconceptions.

Next week, I’ll talk about how I go about making sure my students don’t fall prey to these misconceptions.

Until next week … happy thinking!

I fully agree with Bonnie in recommending that students not try to interpret *p* values calculated by most statistics programs, but rather determine a critical value and rejection region for the test statistic as part of setting up their research design. *p* values too often lead to misinterpretation.

In this vein, the article by Siegfried has a curious figure at its beginning. What is presumably illustrated is the sampling distribution of some statistic such as *t*. If so, then the values on the x axis represent observed values of the test statistic and the shaded area the rejection region. Yet the figure labels a point on the x axis as the “observed size of effect,” not the observed size of the test statistic. The implication is that the smaller the *p* value, the larger the effect. But effect size measures are not necessarily related to *p* values. Most effect size measures indicate how much variance the independent variable accounts for, that is, how much knowing which level of the independent variable a subject was exposed to reduces the error in predicting that person’s score on the dependent variable.

Moral of the story: *p* values are not effect sizes.
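The moral can be shown directly: hold the true effect size fixed and vary only *n*, and the *p* value shrinks anyway. A sketch with a hypothetical 0.3 SD effect and known sigma:

```python
import random
from statistics import NormalDist, mean

random.seed(5)

# Same true effect size (0.3 SD, a hypothetical value) at two sample sizes:
# the p value shrinks as n grows even though the effect is unchanged.
norm = NormalDist()
p_values = {}
for n in (25, 2500):
    x = [random.gauss(0.3, 1) for _ in range(n)]
    z = mean(x) / (1 / n ** 0.5)             # known sigma = 1
    p_values[n] = 2 * (1 - norm.cdf(abs(z)))
    print(f"n = {n:>4}: z = {z:6.2f}, two-tailed p = {p_values[n]:.4f}")
```

The effect is identical in both runs; only the sample size, and hence the *p* value, changes.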