Revisiting Limitations, Reliability and Validity of Large Database Research

I am often asked for a copy of the remarks I made as part of a presidential panel at the Association for the Study of Higher Education (ASHE) in 2011.  Here is an edited version of that talk.

I know many of you from the applied side of educational research: institutional research.  I don’t think I know anyone here from another hat I have worn in the past, that of a cognitive psychologist, which is where I first learned the literature on the complications of trying to investigate any human phenomenon.  Given the complications, it’s a wonder any of us try at all.

Yet we do.  I cannot imagine what an association for the study of higher education might do if it were not filled with people who, despite the complications, tried to gather comprehensive and systematic data from the many different types of institutions of higher education that we have, so that we could assess, and attempt to improve, the experience for all students who aspire to hang a college diploma or certificate on their wall.

So, I welcome debate on this topic.  Everything we do should be reliable and valid, or at least as reliable and valid as we can make it without being paralyzed by doubt.  I wonder whether, if someone back in 1965 had said to Sandy Astin, “you know, you better like this CIRP thing, because it is going to dominate your life for the next 40 years…” he would have had second thoughts about his grand plan to inform higher education on the impact that college has on students.

Yet we cannot become like Congress: so divided and so certain of ourselves that we not only accomplish nothing but anger a lot of people who rely upon us.  I do not want to see educational researchers on the chart now making the rounds that shows approval of Congress at 11%, lower than Paris Hilton and BP during the oil spill.

Let me go back to Sandy.  If there is anything I have learned in six years of directing CIRP, it is that you cannot go wrong by referring to Sandy Astin.  CIRP was started to answer big questions.  Big questions that could not be answered by a study over here that had one institution with its own questionnaire and a study over there at a very different institution that phrased things differently.  Big questions that cross-sectional design could not and would never answer.  Big questions that needed a lot of little questions to get at the big answers.  If Sandy Astin had waited until those questions were absolutely perfect and everyone agreed on how perfect they were, well, there would be a big hole in higher education research these days.  And Pat Terenzini and Ernie Pascarella’s book would have been only six inches thick instead of seven.

But, as science does, we build upon the past.

What concerns me about some of the current debate is that it reminds me of some of the debate in Congress, which has, as we know, accomplished nothing except making a lot of us really upset with them.  Telling schools not to do NSSE is not the answer.  We should recognize the limitations in NSSE.  And CIRP.  And every other study of higher education out there.  But we should also realize that we are in better shape because of how the research from these tools has informed higher education in general, as well as hundreds of individual institutions.

As scholars, we sometimes forget that, after all, the ultimate reason for this kind of work is to inform institutional change.  Not to get published, not to get grants, not to get tenure, not to get invited to conferences, but to actually have an impact on what our students get out of college.  National results from NSSE, and CIRP, and NCES, and Ernie and Pat’s work get institutions talking about why and how they do what they do.  Having a local version of those results, like CIRP and NSSE offer, provides a great service to institutions that they cannot accomplish themselves.  We should be looking at how to improve these tools, not tearing them down completely.

As to reliability and validity, I can tell you a few things about CIRP research.  I can tell you how we have looked at student self-report on things like GPA and SAT scores and found them highly accurate when compared to the actual scores.  I can tell you how we have found good correlations between self-report measures of academic self-efficacy and subsequent performance on some standardized tests, such as the California Critical Thinking Skills Test.  I could also tell you that I would love to do more of this kind of work.  If anyone has a few million out there to help me with that, please come see me after the panel.

For every paper you can show me about how invalid self-report surveys are, I can show you another that says they are valid.  The key is not to make a blanket statement that people cannot remember what they did last week and so we should stop asking, but to craft questions that people can answer with a certain degree of validity.  But remember, even in physics, there is uncertainty.

Let’s look at a typical CIRP question that asks for recall.  When asking incoming first-year students to reflect on the past year and indicate how often they, for instance, were bored in class, students are given only three response options: “frequently,” “occasionally,” or “not at all.”  These qualifiers are fairly straightforward in themselves, but the instructions also specify: “if you engaged in an activity one or more times, but not frequently, mark ‘occasionally’” and go on to tell students to mark ‘not at all’ “if you have not performed the activity during the past year.”  The wording of these questions gives respondents sufficient direction and not enough latitude to wander off into vagueness.

I have personally administered this questionnaire to thousands of students over many years, in the room with them, offering to answer questions.  They had questions, but they were more like “I just got my student ID and I don’t remember the number, what should I do?” and “why do they ask these questions?” and “can I go to the bathroom?”  Nobody ever asked me what this section of the questionnaire meant.

Even so, what level of specificity in results do we really need in order to provide useful information, and how should we interpret results?  I am sure that all of you were as eager as I was to get up yesterday morning and read about the new NSSE results.  There is important information in there about a number of things, but let’s take majors.  One of the findings the media picked up on was that, on average, engineering majors spent more hours studying than business or, pause for effect, social science majors.  To be more specific, engineering students studied an average of 19 hours a week and education majors an average of 15 hours a week.  Do I believe that engineering majors tend to spend more time studying than education majors?  Sure.  Do I for a minute think that, if we had a perfect way of recording hours spent studying (putting aside for a moment the Heisenberg Uncertainty Principle, which of course tells us that this is never going to happen), the final all-knowing population value would be 19 hours?  Not at all.

But the important piece of information here is the relationship between the groups.  That we can get without putting all the engineers in a box with Schroedinger’s cat.
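
To make that point concrete, here is a minimal sketch in Python.  It is not an analysis of NSSE or CIRP data: every number in it (the "true" group means, the size of the shared over-reporting bias, the recall noise) is a made-up assumption chosen only for illustration.  It simulates two hypothetical groups whose self-reports are both inflated and noisy, and shows that the reported levels drift away from the true levels while the gap between the groups stays close to the true gap.

```python
import random
import statistics


def simulate_group(true_mean, n, bias, noise_sd, rng):
    """Return simulated self-reported weekly study hours for one group.

    All parameters are hypothetical: `bias` is an over-reporting tendency
    shared by every respondent, `noise_sd` is random recall error.
    """
    reports = []
    for _ in range(n):
        true_hours = rng.gauss(true_mean, 5.0)   # person-to-person variation
        error = rng.gauss(bias, noise_sd)        # shared bias plus recall noise
        reports.append(max(0.0, true_hours + error))  # hours cannot be negative
    return reports


rng = random.Random(2011)  # fixed seed so the sketch is repeatable

# Assumed "true" means of 16 and 12 hours, i.e. a true gap of 4 hours.
engineering = simulate_group(true_mean=16.0, n=5000, bias=3.0, noise_sd=4.0, rng=rng)
education = simulate_group(true_mean=12.0, n=5000, bias=3.0, noise_sd=4.0, rng=rng)

reported_gap = statistics.mean(engineering) - statistics.mean(education)

print(f"reported engineering mean: {statistics.mean(engineering):.1f}")
print(f"reported education mean:   {statistics.mean(education):.1f}")
print(f"reported gap between them: {reported_gap:.1f}")
```

Under these assumptions the reported means overstate the true means by roughly the shared bias, yet the difference between the groups is recovered.  The sketch relies on the bias being similar in both groups; if one group over-reported more than the other, the gap would be distorted as well.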

Another important recognition here is that some questions just cannot be compared against outside standards.  Respondent opinions, perceptions, values, and beliefs about themselves are important aspects of their everyday experiences, and have value.  Certainly the questions should be crafted with care, by people familiar with all the potential sources of bias that can impact results.  But just because perceptions and values are not easily verified does not mean that they are not important or reliable in predicting student achievement.

There are scores of studies that examine the connection between perceived campus climate and outcome measures, such as graduation, that are backed up by triangulating with observations and interview studies.  This is also why it is a common practice in research to use multiple questions that examine the same trait from different perspectives to create constructs that attempt to describe a phenomenon.  More sophisticated methods, such as how we at HERI are using item response theory in creating constructs, have also moved the field forward in this regard.
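
As a rough illustration of what combining multiple questions into one construct buys you, the sketch below computes Cronbach's alpha, a classical internal-consistency statistic.  To be clear, this is not the item response theory approach mentioned above; it is a simpler, older technique swapped in because it fits in a few lines.  The three items and the eight respondents are invented for the example, with responses coded 1 = "not at all," 2 = "occasionally," 3 = "frequently."

```python
import statistics


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    k = len(responses[0])
    items = list(zip(*responses))                     # regroup scores by item
    item_variances = [statistics.variance(col) for col in items]
    total_variance = statistics.variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical answers to three items meant to tap the same trait.
hypothetical_responses = [
    [3, 3, 2],
    [1, 1, 1],
    [2, 2, 3],
    [3, 2, 3],
    [1, 2, 1],
    [2, 2, 2],
    [3, 3, 3],
    [1, 1, 2],
]

alpha = cronbach_alpha(hypothetical_responses)
print(f"Cronbach's alpha: {alpha:.2f}")
```

Values closer to 1 suggest the items move together and can reasonably be summed into a single score; what counts as "high enough" is a judgment call that depends on how the construct will be used.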

There is a very rich body of literature on survey questions.  I encourage those of you interested in this topic to attend the annual conference of the American Association for Public Opinion Research.  They are way ahead of us in the area of survey methodology.  Don Dillman’s work, in particular, is masterful.  These people live and breathe the impact of question wording, response options, and even horizontal or vertical formatting of the responses.  But they all believe that if we apply what we know about creating questionnaires, we can use them effectively to collect useful data.

So, in summary, what do I believe?

1)    There are important questions that only large-scale surveys can answer.

2)    As with any line of research, there is uncertainty.  We need to recognize it and acknowledge it in interpreting our results.

3)    We need to always move forward in making our measures better.

Some of you might have heard this quote from George Eliot.  We can perhaps let the sexism slide a bit, since George really was Mary Anne, and hope that were she writing today she would do so under her own name and without the male pronoun:

The important work of moving the world forward does not wait to be done by perfect men.

Of course, maybe she was thinking that it would be the perfect women that did it, right?

Let us by all means strive to be perfect. But let us not let our failings in that area mean that we do not continue to try ourselves and support those who battle beside us.