Great Resource for the Teaching of Applied Statistics

Hello All,

The Society for the Teaching of Psychology has an office dedicated to great, peer-reviewed resources for teaching called the Office of Teaching Resources in Psychology.

One such (free) resource for those of us teaching applied statistics is the on-line book Teaching Statistics and Research Methods: Tips from TOPS. http://teachpsych.org/ebooks/stats2012/index.php

Another such resource is Statistical Literacy in Psychology: Resources, Activities, and Assessment Methods. http://teachpsych.org/Resources/Documents/otrp/resources/statistics/STP_Statistical%20Literacy_Psychology%20Major%20Learning%20Goals_4-2014.pdf

The web site housing these two resources is filled with great ideas, all of which have been peer-reviewed. You can find teaching resources, including example syllabi, as well as articles on how to maximize your students’ learning. Even if you are teaching applied statistics in an area outside of psychology, I encourage you to make use of this valuable set of tools. ( http://teachpsych.org/ )

Happy Teaching!

Bonnie

Filed under Applied Statistics, Curriculum, Engaging students, Pedagogy, Preparing to Teach, Professional Development

Starting a new semester … engaging students

Bonnie:

To aid with the start of the semester…

Originally posted on Statistical Sage Blog:

School has started, which is always exciting.  After all, the new school year is filled with GREAT possibilities.

Yet, let’s face it … most of the students in our classes this fall aren’t going to have that level of excitement, especially about having to take statistics. I often wonder: why aren’t students excited about the possibilities of learning statistics? Though I certainly don’t have all of the answers, and look forward to hearing from the sages, I do have a few hypotheses, each of which I will be discussing over the course of the next few weeks.

(1) People don’t trust statistics, and the students have heard these comments, possibly for years. Take this link to a list of quotes on statistics, http://www.quotegarden.com/statistics.html, and see how many of them basically say … you can use statistics to lie. Of course, it probably doesn’t help that Labor Day marks the start of the big push for…

View original 348 more words

Filed under Uncategorized

First Day of Class — starting off right

Bonnie:

Ah, the first day of the semester is upon us …

Originally posted on Statistical Sage Blog:

How do students come to us on the first day of class? Yes, I can just about hear you mumbling …

(1) They wonder why they even have to take this class … after all, they are a [non-quantitative] major. Why does [psychology, sociology, business, education, etc.] need statistics?

(2) They may have had really bad math experiences in the past, leading to (a) math anxiety, (b) poor math attitudes, including low self-efficacy, and/or (c) weak math skills.

(3) They have heard lots and lots of stories as to how hard or useless or manipulative statistics can be. We have all heard the quote … and so have they … “There are lies, … , and statistics!”

But the first thing I want to let you know is … instructors of applied statistics may be overestimating the negative thinking of their students. Mills (2004) http://findarticles.com/p/articles/mi_m0FCR/is_3_38/ai_n6249218/?tag=content;col1 found that, in general…

View original 806 more words

Filed under Uncategorized

The Importance of Questions in Data Analysis

“The sexy job in the next ten years will be statisticians” said Google’s Chief Economist back in 2009.  The ability to understand data and pull out valuable insight will become increasingly in demand in business, government, and journalism to name but a few fields.

And one of the most important first steps when analysing data is the questions you ask.

Let’s take journalism as an example.  In years gone by, a researcher would surround himself with the national and regional papers and scour them for hours, searching for one line – a line that begged for more questions to be asked.  He’d return to the newsroom from this activity, present the line to a journalist, and say “follow this up.”

A brainstorming session would follow.  But these were different from what you and I might think of as a brainstorming session.  It wasn’t sitting around staring at a blank flipchart or whiteboard.  Instead, they arrived with the idea.  The goal of the brainstorming session was for numerous people to fire off as many questions as they could think of in 10 minutes.  Any more than 10 minutes, and they had probably started navel-gazing.  This was a quick hit to explore as many angles as they could think of, and then filter them down to the juiciest ones, guided by every journalist’s greatest asset – a nose for a story.

Let’s take a current example.  The European Court of Justice ruled in March 2011 that “taking the gender of the insured individual into account as a risk factor in insurance contracts constitutes discrimination”.  The ruling will come into effect in December 2012.  Insurers have had the time in between to adjust their pricing models.

This ruling poses a number of questions for insurers and the general public.  It also challenges the application of statistics in this context.

Faced with this news, what questions can your class generate?  What if they take on different roles?  What questions might the insurer ask?  How might they adjust their policies to account for the new ruling?  What about a journalist?  What questions might they ask in the public’s interest? What about the perspective of the judge in the ruling?  What might the opposing lawyers have argued?  If you take this ruling further, what implications could there be?  What challenges could you make to the ruling?

The average premium for women in the UK is £425 pa compared with £536 pa for men.  However, what challenges can be made to the use of averages in this instance?  Given the opportunity, how would you dive into the data to gain greater insight?

Some commentary and coverage around this ruling could add contextual information and offer a jumping-off point for further questions and discussion:

“Currently millions of insurance policies take gender into account. The court ruled that practice as inappropriate since there are myriad other factors that could be considered. Gender, however, is typically easy to check and can point to sound statistical conclusions, the industry says.”  NY Times

Speaking of the case’s advocate general, Juliane Kokott, the Wall Street Journal writes:

“Life-insurance discrimination might be permissible under the law, she allows, if women live longer because they are women, if there is something innate and biological about the female sex that causes longevity.”

But, she argues, important causes of longevity are behavioral—eating habits, smoking and drinking, sports, work environments, drug use. That women have, on average, behaved differently than men doesn’t necessarily mean any one woman’s femaleness is the reason why.

Differences in longevity “merely come to light statistically,” Ms. Kokott writes, and sex is thus just shorthand for whatever is causing those differences. And, she says, “the use of a person’s sex as a kind of substitute criterion for other distinguishing features is incompatible with the equal treatment of men and women.””

One suggestion is that insurers might encourage more people to sign up for black box insurance.

“Black box insurance – also known as ‘telematics’ or ‘pay as you go car insurance’ – aims to offer drivers a cheaper alternative by delivering driver-centred premiums based upon actual driving style rather than statistics.”

Similar in concept to the black boxes in aeroplanes (though presumably not indestructible), these devices track when and where you are driving and measure your speed, acceleration and braking.  Instead of using statistics based on your demographics, they would give a more direct impression of how safe a driver you are.  However, this doesn’t remove statistics completely.  The roads you drive on and the time of night you drive affect how much you have to pay, which presumably is based on the probability of having an accident.

The Guardian discusses other ways insurers might respond to the new ruling:

“It has been suggested some insurers may try to get round the rules by re-classifying the cars typically bought by young men into a higher insurance category, which would in turn push their premiums up. The ABI research paper mentioned an unnamed insurer which said women accounted for 70% of its Mini drivers, but only 30% of its BMW drivers. Alternatively, car insurers may start paying more attention to people’s occupations.”

As an example exercise, you could divide students into small discussion groups, and assign them roles (e.g. one group could be journalists for national press, another could be journalists of an insurance industry publication, and a third group could be senior managers of an insurance company).

Give them each ten minutes to brainstorm within the group as many different questions as they can.  Then get them to filter down the questions to the most important, and discuss how they might go about answering these questions, and the potential implications of their findings.  Then get each group to report back to the larger group, and invite further questions from the class.

You could then review the whole session and how asking more questions early on has an impact in how you approach statistical analysis, and other contexts in which you could apply this approach.

All of this should hopefully stimulate engaging and lively debate based on a real-world example of applied statistics.

Filed under Uncategorized

Populations vs. Samples in Statistics

Hello All,

So, my first love is teaching applied statistics. It has been and probably always will be my favorite thing to do at work. However, for the past two years I served as an interim administrator. So for my next few posts, I am discussing what information needs to be understood by administrators, as many of you very well may be called upon to lead professional development workshops or work, one on one, with an administrator.

If you look down the list I posted previously of what should be included in a training session with administrators, you will see that the first four posts were more broadly based. They deal with issues technically outside of applied statistics that either provide the foundations for good decision making with data or help administrators understand why we need to use statistics to make good decisions.

  1. Epistemology, Decision Making, and Statistics
  2. Cognitive Biases; How Statistics can be used to get to the Truth
  3. Detecting Data Integrity Issues
  4. Data Management Protocol
  5. Populations vs. Samples
  6. Observational Errors: Measurement, Experimental, and Sampling
  7. Quality Decisions are Limited by the Quality of Measures
  8. Sampling and Quality Decisions
  9. Statistics and Sampling Error
  10. Parameters and Mathematical Modeling vs. Inferential Statistics (Introduction)
  11. Mathematical Modeling, Parameters, and Assumptions
  12. Statistical Decision Errors: Type I and Type II

However, now we are up to #5, which is simply two terms that can be found at the beginning of every statistics book: Population vs. Sample.

A population is an entire group of something sharing a common characteristic or characteristics. For example, in a memo that I recently received, the GPA for all students from our institution was listed. This number was compared to the GPA for all student athletes from our institution. All students from an institution is one example of a population. All student athletes is another example of a different population. In each of these cases, the GPA is NOT a STATISTIC.

A statistic is a number that captures what is going on with a sample, that is, a subset of a population.

A parameter is a number that captures what is going on with a population, that is, the entire set of something or someone sharing a common characteristic, like being student athletes at a particular institution.

Now before I can explain why this matters, we will have to go through a few more posts for background information. Until then, let’s just say … we don’t use statistics to understand populations, we use parameters.

And we can’t treat a sample like a population. It is a subset, which means it is probably going to be less varied than the total population, and there very well may be differences between the people in the sample and the overall population. Take, for example, a professor who holds an optional study session for the exam. Think about the students who would show up for such a session. How might they differ from the students in the class who didn’t show up?

  • Maybe the students who didn’t show up have to work or take care of a family member, so they didn’t have the time.
  • Maybe the students who showed up are really motivated to master the material while the ones who didn’t show up are satisfied to just do well enough.
  • The theories of Carol Dweck would predict that students with Incremental Views of Intelligence (the belief that with effort they can get smarter) would be more likely to participate than those with Entity Views of Intelligence (the belief that we are either born smart or not).

There are any number of reasons why students in these two groups may differ, but if the goal is to estimate how well the entire class is prepared by using the sample of those who show up for the study group, there is going to be error that will keep you from seeing everything.

Population parameters don’t have that kind of error, because everyone is in the group.
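
To make the study-session example concrete, here is a minimal simulation sketch. Everything in it is an assumption made up for illustration: the class size of 200, the "motivation" and "readiness" scores, and the rule that more motivated students are more likely to attend. It compares the class-wide parameter with the statistic from the self-selected attendees and with a simple random sample of the same size.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical class of 200 students: each has a made-up "motivation" level
# and an exam-readiness score that rises with motivation.
population = []
for _ in range(200):
    motivation = random.random()  # 0 = low, 1 = high
    readiness = 50 + 40 * motivation + random.gauss(0, 5)
    population.append((motivation, readiness))

# Parameter: computed from everyone in the class, so no sampling error.
parameter_mean = statistics.mean(r for _, r in population)

# Self-selected sample: more motivated students are more likely to attend.
attendees = [(m, r) for m, r in population if random.random() < m]
statistic_mean = statistics.mean(r for _, r in attendees)

# Simple random sample of the same size, for comparison.
random_sample = random.sample(population, len(attendees))
random_mean = statistics.mean(r for _, r in random_sample)

print(f"Class parameter (all {len(population)} students): {parameter_mean:.1f}")
print(f"Study-session statistic (n={len(attendees)}): {statistic_mean:.1f}")
print(f"Random-sample statistic (n={len(random_sample)}): {random_mean:.1f}")
```

Under these assumptions the study-session mean runs higher than the class parameter, because the attendees are not a representative subset. The random sample also differs from the parameter, but only by chance rather than in a predictable direction.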

Understanding these two terms is requisite for understanding what is to follow. Certainly professors teaching applied statistics know and understand this, but often administrators do not, so if you are called to train them or work with them in areas of assessment or the use of data to make decisions, make sure they understand this distinction.

Till next time ….

Bonnie

Filed under Uncategorized

Data Management Protocol

Hello All!

Given that the applied statistician on a college campus is so often asked to conduct professional development training, I am in the midst of a series of postings on what should be covered during such training for administrators responsible for data driven decisions. (I introduced this topic in this post: http://statisticalsage.wordpress.com/2014/07/24/data-driven-decisions-doing-it-right/)

Central to effective data driven decisions is high quality data, which requires an articulated Data Management Protocol. In short, the goal of the Data Management Protocol should be to effectively communicate to everyone in the organization how data are being collected, organized, coded, and stored in a manner that maximizes integrity  and increases the usefulness of the data set. (Note: I addressed issues of data integrity in a prior post http://statisticalsage.wordpress.com/2014/08/07/assuring-data-integrity/ ) 

Let’s face it, it is not possible for an organization’s leader to be well versed in all areas. As a result, the Data Management Protocol is often left up to someone with that specialty. However, that does not relieve the administrator of the responsibility for assuring the Data Management Protocol is actually effective, leading to increased data usefulness and integrity. Thus, when training such administrators, the focus on the Data Management Protocol has to be on how to tell if it is working or not. Critical to this is “human capital” … do you have the right people in the right positions to manage your data?

Here are some tips for evaluating your “human capital”:

  • The Right People Matter: Place people in charge of data management who have the right training and experience. This is not the place to save on resources by hiring a person with no experience or formal training. Bad data means bad decisions.
  • Make use of Data Issues: Even with the best plan and best people, data issues will still arise. Seize on such situations as a way to evaluate the effectiveness of your team. An effective team will look at what went wrong to cause the problem. They will identify the root of the problem, so the fix takes care of both the immediate problem and future occurrences of it. Many times such issues require an improvement in the data management protocol. Let’s say you are working on workload evaluation for a faculty member who ended up getting underpaid. The quickest fix would be to go to HR and adjust the pay. The proper fix would be to determine what was in the data file that generated this error in the first place. In one such case, I discovered that a new class had been entered into our data system incorrectly. It was coded as a 1 credit lecture class (amounting to 1 credit of workload hours) instead of a 1 credit science laboratory (which amounts to 3 credits of workload hours). If all that happened was that the person’s pay was hand adjusted, then the next time a faculty member taught the same lab, the same error would have been made. The identification of this error also precipitated an evaluation of the coding of all science lab classes, identifying and fixing other problems at the root. This situation should have also yielded changes to the protocol for how courses were entered (and by whom). Be cautious of people managing your data who seek quick fixes over permanent fixes.
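
The "fix it at the root" idea from the lab-coding example can be sketched as a small audit that scans every course record instead of patching one person's pay. This is only an illustration: the record layout, the field names (`catalog_type`, `coded_type`), the course identifiers, and the assumption that an authoritative catalog type is available are all hypothetical. The workload rates of 1 and 3 per credit come from the example above.

```python
# Workload hours per credit by course type (rates from the example above;
# the type labels themselves are hypothetical).
WORKLOAD_PER_CREDIT = {"lecture": 1, "science_lab": 3}

# Hypothetical course records: "catalog_type" is what the course really is,
# "coded_type" is what was entered into the data system.
courses = [
    {"course": "BIO 101", "catalog_type": "lecture", "coded_type": "lecture", "credits": 3},
    {"course": "BIO 101L", "catalog_type": "science_lab", "coded_type": "lecture", "credits": 1},
    {"course": "CHM 110L", "catalog_type": "science_lab", "coded_type": "science_lab", "credits": 1},
]


def find_miscoded(records):
    """Return records whose coded type disagrees with the catalog type."""
    problems = []
    for rec in records:
        if rec["coded_type"] != rec["catalog_type"]:
            credited = WORKLOAD_PER_CREDIT[rec["coded_type"]] * rec["credits"]
            owed = WORKLOAD_PER_CREDIT[rec["catalog_type"]] * rec["credits"]
            problems.append({"course": rec["course"], "credited": credited, "owed": owed})
    return problems


miscoded = find_miscoded(courses)
for p in miscoded:
    print(f"{p['course']}: credited {p['credited']} workload hour(s), should be {p['owed']}")
```

Running a check like this across all lab courses is what turns one underpaid faculty member into a corrected coding rule, rather than a hand adjustment that has to be repeated next semester.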

The right team will automatically create a data management protocol that includes the following:

  • Specification for Data Organization including a protocol for how data files are named, the structure of the files (that eases data analysis of varying types), and file naming and storing for ease of retrieval.
  • Security Mechanisms to protect the data including limiting access to the base data files.
  • Back up Protocol for data storage failures.
  • Document Data Details. Identify the method for assuring that the details associated with the data (Who, What, Where, When, How, Why) are stored with the data. Data can come from lots of different sources (on campus/off campus; enrollment services/academic affairs). Knowing the source of data is critical in making decisions, as is knowing if the data are self-report data, which carry with them a great deal of systematic error, or direct measures.
  • Consistent Coding of data protocol (e.g., CBM for College of Business Management; CAS for College of Arts and Sciences)
  • Metadata Collection identification (when appropriate for web site and other electronic presence). Knowing who is visiting your web site and what they are looking at is a rich and often underutilized source of information at many college campuses and other organizations.

Now, if you are working with an organization that doesn’t have a Data Management Protocol, you are not alone. In one study, it was found that 22% of major organizations who needed to be making data driven decisions had no Data Management Protocol, and of those who did, only 40% were actually enforcing them. In other words, the plan was just collecting dust on the shelf! (Sorry, I read this the other day, and can’t locate the source.)

Creating a Data Management Protocol is not as daunting as it seems.

  • As with all major changes to an organization, the creation of a Data Management Protocol is most likely to be successful with overt support from Organizational Leaders.
  • Conduct an assessment of where the organization currently is. The use of surveys and focus groups makes the most sense.
    • What data do people need, and for what reasons?
    • Who needs access to the data and why?
    • Who currently “owns” that data? Who is currently responsible?
    • When is the data needed (e.g., for external reports or internal decisions)?
    • What is currently in place? What is working; what is not? What is truly effective?
  • Once you know where the organization currently is with regard to data needs and management, seek out examples of Data Management Protocols from other similar institutions (remember, I’m teaching at a state sponsored university where such sharing of information among our sister schools is not only possible but expected. In the for profit world, this is less likely to be possible).
  • Data Management Protocols are truly best created by experts and not in large committees. Given that Data Management Plans help facilitate the interface between data collection, storage, and use by humans, please make sure to include your Quantitative Psychologist as one of your experts. He or she will not only understand the uses of data and how formatting can ease analyses, but will also understand how to aid the humans who will be using such data in the future.
  • Once created, pilot the plan before fully implementing it. Expect that revisions will be needed.
  • Don’t forget to have an assessment plan put into place for future data driven improvements.

The quality of decisions is limited by the quality of the data. A well implemented Data Management Protocol will increase the quality of the data, decrease the man-hours for obtaining data, and will facilitate the use of data to make great decisions. It is worth the investment!

I am including a few links to short articles that go into more detail regarding Data Management.

http://searchdatamanagement.techtarget.com/tip/Four-steps-to-formulate-your-customer-data-management-strategy

http://searchdatamanagement.techtarget.com/answer/Integrate-data-quality-into-your-master-data-management-strategy

http://searchdatamanagement.techtarget.com/news/2240226947/Data-Management-Maturity-Model-takes-things-up-a-level

Filed under Uncategorized

Assuring Data Integrity

The quality of our data driven decisions is limited by the quality of our data. It really is that simple. If your data have errors, your decisions will be error prone as well. This is the 4th in a series of posts on what needs to be taught to administrators making data driven decisions. Sure, administrators often hire people to handle data entry and analysis, but how can you tell if you have the right staffing, especially during times of financial limitations? Before making decisions with data, administrators have to know how to look for signs of errors in their data. Even if an administrator has been formally trained in statistics or assessment, an examination of many popular books in these areas reveals that not many people are talking about the importance of verifying data integrity, that is, the accuracy of your data.

Obviously, one of the best ways to assure data integrity is to follow appropriate data management plans. This file from MIT on Data Management provides a great deal of useful information http://libraries.mit.edu/data-management/.

There are some professionals in data management who believe that it’s OK for there to be data errors http://www.iso.com/Research-and-Analyses/ISO-Review/Data-Management-and-Data-Quality-Best-Practices.html and consider this among best practices … “Don’t waste money assuring all data are accurate.” Now, if you are working with an organization gathering data from tens of thousands of people, and there is no reason to believe that the errors present in the data will be systematic (that is, always wrong in the same way), then I can see why such people are making such statements. The statistics will treat this small and random error like sampling error, thus minimizing the issue of error in your data while getting you to the information you really need to know to make the right decisions.

However, most universities and smaller companies are working with data sets far smaller than what would be necessary to just let the errors be and let the statistics take care of it. Thus even very few data errors can truly mask what is going on. Moreover, I have found that most data errors in higher education aren’t random, making any error a problem. Take, for example, data on enrollment. I have examined a lot of data on enrollment, and anytime there is an error it is always in the same direction, underestimating the number of students enrolled in a particular program or who are part of a particular ethnic/racial group. When you are already working with smaller data sets, even a single error with a single observation could be enough to keep an administrator from making the best possible decision. And if the error is systematic, as most of the error I have seen in data sets has been, then you are assured of making a faulty decision with faulty data. Thus, identifying and fixing the errors in your data is an important part of the job for any administrator interested in making high quality data driven decisions.
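
The difference between random and systematic error can be shown with a small simulation. This is a sketch with made-up numbers: 30 hypothetical programs, true enrollments between 40 and 200, random errors of up to 5 students in either direction, and systematic undercounts of up to 10 students. None of these figures come from real enrollment data.

```python
import random
import statistics

random.seed(7)  # fixed seed so the sketch is repeatable

# Hypothetical true enrollment for 30 programs.
true_counts = [random.randint(40, 200) for _ in range(30)]

# Random error: each count is off by a few students in either direction.
random_err = [c + random.randint(-5, 5) for c in true_counts]

# Systematic error: each count misses students, always in the same direction.
systematic_err = [c - random.randint(0, 10) for c in true_counts]


def mean_bias(observed, truth):
    """Average of (observed - true); near zero when errors cancel out."""
    return statistics.mean(o - t for o, t in zip(observed, truth))


print(f"Mean bias with random error:     {mean_bias(random_err, true_counts):+.2f}")
print(f"Mean bias with systematic error: {mean_bias(systematic_err, true_counts):+.2f}")
```

With random error the overcounts and undercounts largely cancel, so the average bias stays close to zero. With systematic undercounting nothing cancels, and every total built from those counts is too low.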

When training administrators, it often helps if you can use examples of data integrity errors from their own organization. Listed below are the most common types of data integrity errors that I have observed.

Types of Error:

  1.  Miscoding of Individuals leaves them out of important counts.
    • Example: Enrollment or demographic counts can be miscoded. I recall getting a set of raw data and seeing that some students were listed as being in the College Business, other students as CoB, and still others as CBM, yet each of these students was in the College of Business Management. And though a human can look at each of these ways of listing the college and see they are the same, a computer can’t unless programmed to do so. It would be far easier to put into your data management plan that the College of Business Management is always coded as BUS. Then accurate counts become possible.
  2. Misplaced decimals or Place Value Errors.
    • Example: Proportions of a whole can create challenges, and decimals always seem to add to challenges in data accuracy. At most universities we base payment for a position on a “whole” position. Thus if the full time position is 12 credits, someone teaching 6 credits would have a .5 position. Of course, sometimes positions become more complicated. For example, during an analysis a .05 position was treated as a .50 position, so the position was valued at 10 times what it was supposed to be because of a decimal error. In another example, while translating counts from one table to another, a person dropped a zero from the number. The table said 12, but the actual number of majors was 120.
  3. Inverse in Coding. Responses that are coded in the reverse of the intended direction.
    • Example: Let’s say we have a survey with the responses Strongly Agree, Agree, Disagree, or Strongly Disagree, where 4 is Strongly Agree and 1 is Strongly Disagree. I have seen people code them with 1 as Strongly Agree and 4 as Strongly Disagree. Of course, it would be fine if such coding was done consistently and noted, but that’s not what I typically have experienced.
  4. Treating a True Score of Zero as a Non-Response.
    • Example: When I code data, I typically code females as zero and males as 1. A non-response, which for this type of question actually holds some interesting information, does not get assigned a value. A graduate student once deleted all of the zeros in the table … every single one, and values that should have been zeros were then treated as if the person didn’t respond.
  5. Unexpected Errors that only someone new to data management or who is truly arrogant could create.
    • Example: They are, after all, unexpected and as such cannot be predicted or classified. Just know that inexperience combined with arrogance, or arrogance alone, is a bad mix in everything, including assuring data integrity. I’m not saying a more seasoned person isn’t apt to do something stupid. The difference is that, having been humbled through years of such embarrassing episodes, they not only know how to look for such errors, they triple check for them to avoid public humiliation!
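
Three of the error types above (miscoded labels, reverse coding, and zeros confused with non-responses) lend themselves to simple programmatic guards. The sketch below is illustrative only: the lookup table, the function names, and the sample values are hypothetical, and BUS is used as the canonical code only because that is the code suggested in the first example.

```python
from collections import Counter

# Hypothetical lookup from the variants seen in raw data to one canonical code.
COLLEGE_CODES = {
    "college business": "BUS",
    "college of business management": "BUS",
    "cob": "BUS",
    "cbm": "BUS",
    "bus": "BUS",
}


def canonical_college(raw):
    """Map a raw college label to its canonical code, or flag it as unknown."""
    return COLLEGE_CODES.get(raw.strip().lower(), "UNKNOWN")


raw_colleges = ["College Business", "CoB", "CBM", "BUS", "cob "]
raw_counts = Counter(raw_colleges)  # five different labels
clean_counts = Counter(canonical_college(c) for c in raw_colleges)  # one college
print(raw_counts)
print(clean_counts)


def reverse_code(score, low=1, high=4):
    """Flip a Likert score so the scale runs in the documented direction."""
    return low + high - score


# Zero vs. non-response: store None for "no answer" so a real 0 survives.
responses = [0, 1, None, 0, 1, 1]  # hypothetical 0/1 coded answers
answered = [r for r in responses if r is not None]
print(f"answered={len(answered)}, zeros={answered.count(0)}, missing={responses.count(None)}")
```

The design choice worth noting is that unknown labels are flagged as UNKNOWN rather than silently dropped, and missing answers are kept as an explicit marker, so neither kind of record quietly disappears from a count.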

It’s not enough simply to know about the errors; we have to know how to recognize them. Listed here are some ways to show administrators how to assure the integrity of their data.

  • LOOK at the Data!
    • Always look at a chart or graph of the data you are about to base a decision upon. Does it make sense? Is it what you expected? Is anything missing? Is anything extreme? Though it is true you won’t be able to find all errors this way, you will be able to spot quite a few. For example, if there are no females in your graduating class, then you know you have a problem, or if not a single student in 10 years graduated from the largest major on your campus in 4 years, you know you have a problem.
    • If at all possible, chart or graph multi-year trends. This will help you to see mistakes clearly, like the mathematics department going from three years of 100 plus majors to a year of 12 majors. And once you have determined there is no problem with the data, you can see trends more clearly.
  • Trust your Gut.
    • Even applied statisticians can’t walk around with all of the data easily accessible in their minds. However, we also tend to get a sense when something is off. Interestingly, research on infants and toddlers has demonstrated that from an early age, they are implicitly (that is, not consciously) picking up patterns in their observations. They are kind of like little statisticians. We never lose that implicit ability. However, as it is not consciously available, it often comes to us as a sense of … hmmm, that doesn’t feel right.
    • If you are looking at data and it doesn’t feel right, look at it more closely for errors. I actually had this occur when I was looking at means that seemed just a bit off … all of them. That was when I found someone had miscoded one of the responses. This coding error would have resulted in a decision that wasn’t supported by the data.
  • When critical, triangulate the data.
    • First, look at the raw data for discrepancies. If none are found, then triangulate the data. Triangulating data literally means finding data from three different sources (e.g., the department chair, enrollment services, and institutional research) and comparing them. If there is a difference, you have to find the source of the problem. Often you may be able to find two sources of data and not three. You can compare two sources of data as well; just make sure they are independent sources (e.g., enrollment services data vs. chair data).
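
Two of the checks above, looking at multi-year trends and triangulating sources, can be roughed out in a few lines. This is a sketch under stated assumptions: the years, counts, source names, and the "more than double or less than half" threshold are all hypothetical choices, and a flag only means "look closer", not "this is an error".

```python
# Hypothetical counts of majors by year for one department.
majors_by_year = {2010: 112, 2011: 118, 2012: 121, 2013: 12}


def flag_sudden_changes(series, max_ratio=2.0):
    """Flag years whose value differs from the prior year by more than max_ratio."""
    flagged = []
    years = sorted(series)
    for prev, curr in zip(years, years[1:]):
        a, b = series[prev], series[curr]
        if a == b:
            continue
        if min(a, b) == 0 or max(a, b) / min(a, b) > max_ratio:
            flagged.append(curr)
    return flagged


print(flag_sudden_changes(majors_by_year))

# Triangulation: the same figure as reported by three independent sources.
sources = {
    "department_chair": 120,
    "enrollment_services": 120,
    "institutional_research": 12,
}


def disagreeing_sources(reports):
    """Return sources whose value differs from the most commonly reported value.

    If every source reports a different value there is no consensus, and a
    person has to go back to the raw data to resolve it.
    """
    values = list(reports.values())
    consensus = max(set(values), key=values.count)
    return [name for name, value in reports.items() if value != consensus]


print(disagreeing_sources(sources))
```

A flagged year or a disagreeing source tells you where to start digging in the raw data; it does not tell you which number is right.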

Any administrator knows, it is far easier to not have errors than to have them, find them, and fix them. Here are some examples of how these problems can be minimized or at least detected early enough in the process that they can be fixed before creating challenges for decision makers.

  • Most of these problems will be minimized if a well thought out and articulated Data Management Plan is crafted and implemented.
  • When errors are found, examine to see if the problem is with the implementation of the Data Management Plan or with the Plan, itself.
  • Make revisions to the Data Management Plan as appropriate.
  • Within your data management plan, devise a periodic verification of the accuracy of the data. This should be done twice a year, and truthfully shouldn’t take very long. Three data point checks for a half a dozen to a dozen different types of files should do the trick.
  • An extremely limited number of people should have access to raw data. When you start to code or otherwise prepare the data for analysis, that should NOT be taking place at the level of the raw data. It should be copied and then worked on in a separate file.
  • And please, high quality data analysis and data integrity require skilled and appropriately paid staffing. Given that the consequences of having errors in your data are so high, this is not a place you want to scrimp.

Filed under Applied Statistics, Methods of Data Collection, Professional Development