Category Archives: Methods of Data Collection

Assuring Data Integrity

The quality of our data-driven decisions is limited by the quality of our data. It really is that simple. If your data have errors, your decisions will be error prone as well. This is the 4th in a series of posts on what needs to be taught to administrators making data-driven decisions. Sure, administrators often hire people to handle data entry and analysis, but how can you tell if you have the right staffing, especially during times of financial limitations? Before making decisions with data, administrators have to know how to look for signs of errors in their data. Even if an administrator has been formally trained in statistics or assessment, an examination of many popular books in these areas reveals that not many people are talking about the importance of verifying data integrity, that is, the accuracy of your data.

Obviously, one of the best ways to assure data integrity is to follow an appropriate data management plan. This file from MIT on Data Management provides a great deal of useful information.

Some data management professionals believe that it’s OK for data to contain errors, and even count this among best practices: “Don’t waste money assuring all data are accurate.” Now, if you are working with an organization gathering data from tens of thousands of people, and there is no reason to believe that the errors in the data are systematic (that is, always wrong in the same way), then I can see why such people make such statements. The statistics will treat this small, random error like sampling error, thus minimizing the issue of error in your data while getting you to the information you really need to make the right decisions. However, most universities and smaller companies work with data sets far smaller than what would be necessary to just let the errors be and let the statistics take care of it. Thus even very few data errors can truly mask what is going on. Moreover, I have found that most data errors in higher education aren’t random, which makes any error a problem.

Take, for example, data on enrollment. I have examined a lot of enrollment data, and any time there is an error it is always in the same direction: underestimating the number of students enrolled in a particular program or who are part of a particular ethnic/racial group. When you are already working with smaller data sets, even a single error in a single observation could be enough to keep an administrator from making the best possible decision. And if the error is systematic, as most of the errors I have seen in data sets have been, then you are assured of making a faulty decision with faulty data. Thus, identifying and fixing the errors in your data is an important task for any administrator interested in making high-quality data-driven decisions.
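
The difference between random and systematic error can be sketched with a small simulation. This is only an illustration under assumptions of my own choosing: the error sizes (up to 5 students per record), the sample sizes, and the function names are all hypothetical, not drawn from any real enrollment file.

```python
import random


def mean_error(n_records, error_fn, seed=0):
    """Average recording error across n_records simulated observations."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return sum(error_fn(rng) for _ in range(n_records)) / n_records


def random_error(rng):
    # Assumption: equally likely to over- or under-count, by up to 5 students.
    return rng.uniform(-5, 5)


def systematic_error(rng):
    # Assumption: always an undercount, by up to 5 students.
    return -rng.uniform(0, 5)


if __name__ == "__main__":
    for n in (20, 50_000):
        print(
            n,
            round(mean_error(n, random_error), 2),
            round(mean_error(n, systematic_error), 2),
        )
```

With tens of thousands of records the random errors largely cancel and their average sits close to zero, while with only 20 records the average can wander noticeably. The systematic undercount averages around -2.5 no matter how many records are collected, which is why a larger sample does not rescue it.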

When training administrators, it often helps if you can use examples of data integrity errors from their own organization. Listed below are the most common types of data integrity errors that I have observed.

Types of Error:

  1. Miscoding of Individuals leaves them out of important counts.
    • Example: Enrollment or demographic counts can be miscoded. I recall getting a set of raw data in which some students were listed as being in the College Business, others as CoB, and still others as CBM, yet each of these students was in the College of Business Management. Though a human can look at each of these labels and see they mean the same thing, a computer can’t unless programmed to do so. It would be far easier to specify in your data management plan that the College of Business Management is always coded as BUS. Then accurate counts become possible.
  2. Misplaced Decimals or Place Value Errors.
    • Example: Proportions of a whole can create challenges, and decimals always seem to add to the challenges in data accuracy. At most universities we base payment for a position on a “whole” position. Thus if the full-time position is 12 credits, someone teaching 6 credits would have a .5 position. Of course, sometimes positions become more complicated. For example, during one analysis a .05 position was treated as a .50 position, so the position was recorded at 10 times what it was supposed to be because of a decimal error. In another example, while copying counts from one table to another, a person dropped a zero from a number. The table said 12, but the actual number of majors was 120.
  3. Inverse Coding. Results that have to be coded in a particular direction are coded in reverse.
    • Example: Let’s say we have a survey with the responses Strongly Agree, Agree, Disagree, or Strongly Disagree, where 4 is Strongly Agree and 1 is Strongly Disagree. I have seen people code them with 1 as Strongly Agree and 4 as Strongly Disagree. Of course, it would be fine if such coding were done consistently and noted, but that’s not what I typically have experienced.
  4. Treating a True Score of Zero as a Non-Response.
    • Example: When I code data, I typically code females as zero and males as 1. A non-response, which for this type of question actually holds some interesting information, does not get assigned a value. A graduate student once deleted all of the zeros in the table … every single one, and now values that should have been zeros were treated as if the person didn’t respond.
  5. Unexpected Errors that only someone new to data management or who is truly arrogant could create.
    • Example: They are, after all, unexpected and as such cannot be predicted or classified. Just know inexperience and arrogance or just arrogance are a bad mix in everything, including assuring data integrity. I’m not saying a more seasoned person isn’t apt to do something stupid. The difference is, given they are humbled enough through years of such embarrassing episodes, they not only know how to look for such errors, they triple check for such errors to avoid public humiliation!
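
Several of the error types above lend themselves to small automated checks. What follows is a minimal sketch, not a description of any particular institution’s system: the alias table, the function names, the 12-credit full-time load, and the use of `None` to mark a non-response are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical alias table: every label seen in the raw data maps to the
# single code chosen in the data management plan (BUS in the example above).
COLLEGE_ALIASES = {
    "college business": "BUS",
    "cob": "BUS",
    "cbm": "BUS",
    "college of business management": "BUS",
    "bus": "BUS",
}


def normalize_college(raw):
    """Map a raw college label to its canonical code, or fail loudly."""
    key = " ".join(raw.strip().lower().split())
    if key not in COLLEGE_ALIASES:
        raise ValueError(f"Unrecognized college label: {raw!r}")
    return COLLEGE_ALIASES[key]


def count_by_college(labels):
    """Count students per canonical college code."""
    return Counter(normalize_college(label) for label in labels)


def fte_matches(credits, recorded_fte, full_time_credits=12, tol=1e-6):
    """Check a recorded position fraction against credits taught."""
    return abs(credits / full_time_credits - recorded_fte) <= tol


def reverse_code(score, low=1, high=4):
    """Flip a survey score so that, on a 1-4 scale, 1 <-> 4 and 2 <-> 3."""
    if not low <= score <= high:
        raise ValueError(f"Score {score} is outside {low}-{high}")
    return low + high - score


def mean_ignoring_missing(values):
    """Average the values, skipping None (non-response) but keeping true zeros."""
    answered = [v for v in values if v is not None]
    return sum(answered) / len(answered) if answered else None
```

The design choice worth noting is that an unknown label raises an error instead of being silently dropped, so a miscoded student shows up as a problem to fix and not as a missing count. Likewise, a zero stays a zero; only `None` is treated as a non-response.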

It’s not enough simply to know about the errors; we have to know how to recognize them. Listed here are some ways to show administrators how to assure the integrity of their data.

  • LOOK at the Data!
    • Always look at a chart or graph of the data you are about to base a decision on. Does it make sense? Is it what you expected? Is anything missing? Is anything extreme? Though it is true you won’t be able to find all errors this way, you will be able to spot quite a few. For example, if there are no females in your graduating class, then you know you have a problem, or if not a single student in 10 years graduated from the largest major on your campus in 4 years, you know you have a problem.
    • If at all possible, chart or graph multi-year trends. This will help you see mistakes clearly, like the mathematics department going from three years of 100-plus majors to a year of 12 majors. And once you have determined there is no problem with the data, you can see trends more clearly.
  • Trust your Gut.
    • Even applied statisticians can’t walk around with all of the data easily accessible in their minds. However, we tend to get a sense when something is off. Interestingly, research on infants and toddlers has demonstrated that from an early age they are implicitly, that is, not consciously, picking up patterns in their observations. They are kind of like little statisticians. We never lose that implicit ability. However, as it is not consciously available, it often comes to us as a sense of … hmmm, that doesn’t feel right.
    • If you are looking at data and it doesn’t feel right, look at it more closely for errors. I actually had this occur when I was looking at means that seemed just a bit off … all of them. That was when I found someone had miscoded one of the responses. This coding error would have resulted in a decision that wasn’t supported by the data.
  • When critical, triangulate the data.
    • First, look at the raw data for discrepancies. If none are found, then triangulate the data. Triangulating data literally means finding data from three different sources (e.g., the department chair, enrollment services, and institutional research) and comparing them. If there is a difference, you have to find the source of the problem. Often you may be able to find two sources of data and not three. You can compare two sources of data as well; just make sure they are independent sources (e.g., enrollment services data vs. chair data).
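
Two of these checks, scanning a multi-year trend for a sudden jump and triangulating counts across sources, can be roughed out in code. This is a sketch under stated assumptions: the function names, the threshold of a twofold change, the years, and the source labels are all hypothetical, and a flag only means “look more closely,” not “this is an error.”

```python
from collections import Counter


def flag_sudden_changes(counts_by_year, max_ratio=2.0):
    """Return (year, previous, current) for year-over-year changes larger
    than max_ratio in either direction."""
    flagged = []
    years = sorted(counts_by_year)
    for prev_year, year in zip(years, years[1:]):
        prev, cur = counts_by_year[prev_year], counts_by_year[year]
        low, high = min(prev, cur), max(prev, cur)
        if high == 0:
            continue  # zero to zero is not a change
        if low == 0 or high / low > max_ratio:
            flagged.append((year, prev, cur))
    return flagged


def find_discrepancies(counts_by_source):
    """Return the sources whose count differs from the majority value.

    If no value is reported by more than half of the sources, every
    source is returned, because none of them can be trusted by default.
    """
    if not counts_by_source:
        return {}
    tally = Counter(counts_by_source.values())
    value, votes = tally.most_common(1)[0]
    if votes * 2 <= len(counts_by_source):
        return dict(counts_by_source)
    return {s: c for s, c in counts_by_source.items() if c != value}


if __name__ == "__main__":
    # Hypothetical years; the counts echo the 100-plus to 12 majors example.
    majors = {2007: 104, 2008: 110, 2009: 120, 2010: 12}
    print(flag_sudden_changes(majors))
    print(find_discrepancies(
        {"chair": 120, "enrollment_services": 120, "institutional_research": 12}
    ))
```

With three sources, two agreeing values point to the odd one out. With only two sources that disagree, the function returns both, since there is no way to tell from the numbers alone which one is wrong.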

As any administrator knows, it is far easier to not have errors than to have them, find them, and fix them. Here are some examples of how these problems can be minimized or at least detected early enough in the process that they can be fixed before creating challenges for decision makers.

  • Most of these problems will be minimized if a well thought out and articulated Data Management Plan is crafted and implemented.
  • When errors are found, examine to see if the problem is with the implementation of the Data Management Plan or with the Plan, itself.
  • Make revisions to the Data Management Plan as appropriate.
  • Within your data management plan, devise a periodic verification of the accuracy of the data. This should happen twice a year, and truthfully shouldn’t take very long. Three data point checks for each of a half dozen to a dozen different types of files should do the trick.
  • An extremely limited number of people should have access to raw data. When you start to code or otherwise prepare the data for analysis, that should NOT be taking place at the level of the raw data. It should be copied and then worked on in a separate file.
  • And please, high-quality data analysis and data integrity require skilled and appropriately paid staffing. Given that the cost of having errors in your data is so high, this is not a place you want to scrimp.
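
The periodic verification described above could be as simple as a three-record spot check per file. Here is one possible sketch, assuming each file can be loaded as a mapping from a record ID to its value; the function name, the record IDs, and the idea of comparing a working copy against the untouched raw data are illustrative assumptions.

```python
import random


def spot_check(working, source, k=3, seed=None):
    """Compare k randomly chosen records in `working` against `source`.

    Both arguments map a record ID to its value. Returns a dict of
    {record_id: (working_value, source_value)} for each mismatch found.
    """
    rng = random.Random(seed)
    ids = sorted(working)
    sample = rng.sample(ids, min(k, len(ids)))
    return {
        rid: (working[rid], source.get(rid))
        for rid in sample
        if working[rid] != source.get(rid)
    }


if __name__ == "__main__":
    # Hypothetical records: the raw file is never edited, only read.
    raw = {"s1": "BUS", "s2": "BUS", "s3": "BUS"}
    working_copy = {"s1": "BUS", "s2": "BUS", "s3": "CBM"}
    print(spot_check(working_copy, raw))
```

A spot check this small will not catch every error, but any mismatch it does find is a signal to examine whether the problem lies in the plan or in how the plan is being carried out.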


Most Americans favor…

While I was watching a well-known TV news channel the other day, a report came on about student reactions to the enhanced security procedures at U.S. airports. The correspondent indicated that students are split right down the middle, 50-50, on the use of full-body airport scanners. Fifty percent favor, fifty percent oppose. I was curious what data there were to support this contention. The correspondent stated that half the students he talked to were in favor, half opposed. He then proceeded to present the results of his sample, which appeared to be N = 2: one student in favor, one student opposed. But, he went on, according to another poll, most Americans favor the full-body scanners. Now “most” is a most ambiguous word in my mind. By definition, most simply means the majority, but if asked to attach a number to the word “most,” I tend to think about 80%. So I concluded that about 80% of Americans favor the new scanning procedure. But just to be sure, I looked up the poll that was being cited here.

The poll indicated it was based on a random sample of 514 adults who were called on either their landlines or their cell phones. Now a random sample is one in which each member of a population has an equal chance of being included in the sample. But I, for one, have no chance of being included in this sample; I never respond to telephone polls, and sometimes decline quite ungraciously. So the question arises, how many calls were made to obtain 514 responses? And how would those who declined to participate in this poll have responded?

The poll results indicate that if we consider only those people who “strongly support” the new body scanners, then 37% of the sample responded favorably. Twenty-seven percent were somewhat in favor, for a total of 64% responding that they support the scanning, either strongly or somewhat. But only 48% responded in a favorable way to the enhanced hand searches. And it also seems reasonable to expect that those who fly infrequently or not at all (53% of the respondents in the survey) may differ in their beliefs from those who fly frequently. We might also expect differences related to age of respondents, but we can’t tell from the results of the poll.
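
To see why the number of calls matters, consider how far the 64% figure could move depending on what the people who declined would have said. The sketch below computes the two extremes. The poll did not report how many calls were made, so the figure of 2,000 people contacted is purely hypothetical, as is the function name.

```python
def support_bounds(n_respondents, share_in_favor, n_contacted):
    """Range of possible support among everyone contacted.

    The lower bound assumes every non-respondent opposes; the upper
    bound assumes every non-respondent is in favor.
    """
    if n_contacted < n_respondents:
        raise ValueError("n_contacted cannot be smaller than n_respondents")
    in_favor = share_in_favor * n_respondents
    n_missing = n_contacted - n_respondents
    lower = in_favor / n_contacted
    upper = (in_favor + n_missing) / n_contacted
    return lower, upper


if __name__ == "__main__":
    # 514 respondents and 64% support come from the poll; 2,000 contacted
    # is a made-up number used only to show the calculation.
    low, high = support_bounds(514, 0.64, 2000)
    print(f"{low:.1%} to {high:.1%}")
```

Under that made-up figure, true support among everyone contacted could lie anywhere from roughly 16% to roughly 91%. Neither extreme is plausible, but the width of the range shows how much the result leans on the assumption that those who declined resemble those who answered.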

In her recent blog post, Bonnie included data collection methods as one of the core concerns for an introductory statistics course. To quote:

“Though I have said to my students more times that I can count, ‘the quality of our statistics is limited by the quality of our sample,’ I must admit to being a bit surprised that this was considered critical by others, especially since when I look at many undergraduate statistics textbooks, data collection methods are barely mentioned.”

The two examples given above provide excellent support for Bonnie’s contention that students should be taught to carefully evaluate the quality of the data on which statistics are based. Who were the subjects and how were they selected? What questions were asked and what responses were allowed? And what inferences can be made from the results? It might prove to be an interesting class exercise to have students find media reports of current polls and then actually access the poll to see who the respondents were, what questions were asked, and the results obtained.

