# The Concept of Reliability

Modules Chapter 5 wk2 p655

C H A P T E R  5

Reliability

In everyday conversation, reliability is a synonym for dependability or consistency. We speak of the train that is so reliable you can set your watch by it. If we’re lucky, we have a reliable friend who is always there for us in a time of need.

Broadly speaking, in the language of psychometrics reliability refers to consistency in measurement. And whereas in everyday conversation reliability always connotes something positive, in the psychometric sense it really only refers to something that is consistent—not necessarily consistently good or bad, but simply consistent.

It is important for us, as users of tests and consumers of information about tests, to know how reliable tests and other measurement procedures are. But reliability is not an all-or-none matter. A test may be reliable in one context and unreliable in another. There are different types and degrees of reliability. A  reliability coefficient  is an index of reliability, a proportion that indicates the ratio between the true score variance on a test and the total variance. In this chapter, we explore different kinds of reliability coefficients, including those for measuring test-retest reliability, alternate-forms reliability, split-half reliability, and inter-scorer reliability.

The Concept of Reliability

Recall from our discussion of classical test theory that a score on an ability test is presumed to reflect not only the testtaker’s true score on the ability being measured but also error.1 In its broadest sense, error refers to the component of the observed test score that does not have to do with the testtaker’s ability. If we use X to represent an observed score, T to represent a true score, and E to represent error, then the fact that an observed score equals the true score plus error may be expressed as follows:

A statistic useful in describing sources of test score variability is the  variance  (σ2)—the standard deviation squared. This statistic is useful because it can be broken into components. Page 142Variance from true differences is  true variance , and variance from irrelevant, random sources is  error variance . If σ2 represents the total variance, the true variance, and the error variance, then the relationship of the variances can be expressed as

In this equation, the total variance in an observed distribution of test scores (σ2) equals the sum of the true variance  plus the error variance . The term  reliability  refers to the proportion of the total variance attributed to true variance. The greater the proportion of the total variance attributed to true variance, the more reliable the test. Because true differences are assumed to be stable, they are presumed to yield consistent scores on repeated administrations of the same test as well as on equivalent forms of tests. Because error variance may increase or decrease a test score by varying amounts, consistency of the test score—and thus the reliability—can be affected.

In general, the term  measurement error  refers to, collectively, all of the factors associated with the process of measuring some variable, other than the variable being measured. To illustrate, consider an English-language test on the subject of 12th-grade algebra being administered, in English, to a sample of 12-grade students, newly arrived to the United States from China. The students in the sample are all known to be “whiz kids” in algebra. Yet for some reason, all of the students receive failing grades on the test. Do these failures indicate that these students really are not “whiz kids” at all? Possibly. But a researcher looking for answers regarding this outcome would do well to evaluate the English-language skills of the students. Perhaps this group of students did not do well on the algebra test because they could neither read nor understand what was required of them. In such an instance, the fact that the test was written and administered in English could have contributed in large part to the measurement error in this evaluation. Stated another way, although the test was designed to evaluate one variable (knowledge of algebra), scores on it may have been more reflective of another variable (knowledge of and proficiency in English language). This source of measurement error (the fact that the test was written and administered in English) could have been eliminated by translating the test and administering it in the language of the testtakers.

Measurement error, much like error in general, can be categorized as being either systematic or random.  Random error  is a source of error in measuring a targeted variable caused by unpredictable fluctuations and inconsistencies of other variables in the measurement process. Sometimes referred to as “noise,” this source of error fluctuates from one testing situation to another with no discernible pattern that would systematically raise or lower scores. Examples of random error that could conceivably affect test scores range from unanticipated events happening in the immediate vicinity of the test environment (such as a lightning strike or a spontaneous “occupy the university” rally), to unanticipated physical events happening within the testtaker (such as a sudden and unexpected surge in the testtaker’s blood sugar or blood pressure).

JUST THINK . . .

What might be a source of random error inherent in all the tests an assessor administers in his or her private office?

In contrast to random error,  systematic error  refers to a source of error in measuring a variable that is typically constant or proportionate to what is presumed to be the true value of the variable being measured. For example, a 12-inch ruler may be found to be, in actuality, a tenth of one inch longer than 12 inches. All of the 12-inch measurements previously taken with that ruler were systematically off by one-tenth of an inch; that is, anything measured to be exactly 12 inches with that ruler was, in reality, 12 and one-tenth inches. In this example, it is the measuring instrument itself that has been found to be a source of systematic error. Once a systematic error becomes known, it becomes predictable—as well as fixable. Note also that a systematic source of error does not affect score consistency. So, for example, suppose a measuring instrument such as the official weight scale used on The Biggest Loser television Page 143program consistently underweighed by 5 pounds everyone who stepped on it. Regardless of this (systematic) error, the relative standings of all of the contestants weighed on that scale would remain unchanged. A scale underweighing all contestants by 5 pounds simply amounts to a constant being subtracted from every “score.” Although weighing contestants on such a scale would not yield a true (or valid) weight, such a systematic error source would not change the variability of the distribution or affect the measured reliability of the instrument. In the end, the individual crowned “the biggest loser” would indeed be the contestant who lost the most weight—it’s just that he or she would actually weigh 5 pounds more than the weight measured by the show’s official scale. Now moving from the realm of reality television back to the realm of psychological testing and assessment, let’s take a closer look at the source of some error variance commonly encountered during testing and assessment.

JUST THINK . . .

What might be a source of systematic error inherent in all the tests an assessor administers in his or her private office?

Sources of Error Variance

Sources of error variance include test construction, administration, scoring, and/or interpretation.

Test construction

One source of variance during test construction is  item sampling  or  content sampling , terms that refer to variation among items within a test as well as to variation among items between tests. Consider two or more tests designed to measure a specific skill, personality attribute, or body of knowledge. Differences are sure to be found in the way the items are worded and in the exact content sampled. Each of us has probably walked into an achievement test setting thinking “I hope they ask this question” or “I hope they don’t ask that question.” If the only questions on the examination were the ones we hoped would be asked, we might achieve a higher score on that test than on another test purporting to measure the same thing. The higher score would be due to the specific content sampled, the way the items were worded, and so on. The extent to which a testtaker’s score is affected by the content sampled on a test and by the way the content is sampled (that is, the way in which the item is constructed) is a source of error variance. From the perspective of a test creator, a challenge in test development is to maximize the proportion of the total variance that is true variance and to minimize the proportion of the total variance that is error variance.

Sources of error variance that occur during test administration may influence the testtaker’s attention or motivation. The testtaker’s reactions to those influences are the source of one kind of error variance. Examples of untoward influences during administration of a test include factors related to the test environment: room temperature, level of lighting, and amount of ventilation and noise, for instance. A relentless fly may develop a tenacious attraction to an examinee’s face. A wad of gum on the seat of the chair may make itself known only after the testtaker sits down on it. Other environment-related variables include the instrument used to enter responses and even the writing surface on which responses are entered. A pencil with a dull or broken point can make it difficult to blacken the little grids. The writing surface on a school desk may be riddled with heart carvings, the legacy of past years’ students who felt compelled to express their eternal devotion to someone now long forgotten. External to the test environment in a global sense, the events of the day may also serve as a source of error. So, for example, test results may vary depending upon whether the testtaker’s country is at war or at peace (Gil et al., 2016). A variable of interest when evaluating a patient’s general level of suspiciousness or fear is the patient’s home neighborhood and lifestyle. Especially in patients who live in and must cope daily with an unsafe neighborhood, Page 144what is actually adaptive fear and suspiciousness can be misinterpreted by an interviewer as psychotic paranoia (Wilson et al., 2016).

Other potential sources of error variance during test administration are testtaker variables. Pressing emotional problems, physical discomfort, lack of sleep, and the effects of drugs or medication can all be sources of error variance. Formal learning experiences, casual life experiences, therapy, illness, and changes in mood or mental state are other potential sources of testtaker-related error variance. It is even conceivable that significant changes in the testtaker’s body weight could be a source of error variance. Weight gain and obesity are associated with a rise in fasting glucose level—which in turn is associated with cognitive impairment. In one study that measured performance on a cognitive task, subjects with high fasting glucose levels made nearly twice as many errors as subjects whose fasting glucose level was in the normal range (Hawkins et al., 2016).

Examiner-related variables are potential sources of error variance. The examiner’s physical appearance and demeanor—even the presence or absence of an examiner—are some factors for consideration here. Some examiners in some testing situations might knowingly or unwittingly depart from the procedure prescribed for a particular test. On an oral examination, some examiners may unwittingly provide clues by emphasizing key words as they pose questions. They might convey information about the correctness of a response through head nodding, eye movements, or other nonverbal gestures. In the course of an interview to evaluate a patient’s suicidal risk, highly religious clinicians may be more inclined than their moderately religious counterparts to conclude that such risk exists (Berman et al., 2015). Clearly, the level of professionalism exhibited by examiners is a source of error variance.

Test scoring and interpretation

In many tests, the advent of computer scoring and a growing reliance on objective, computer-scorable items have virtually eliminated error variance caused by scorer differences. However, not all tests can be scored from grids blackened by no. 2 pencils. Individually administered intelligence tests, some tests of personality, tests of creativity, various behavioral measures, essay tests, portfolio assessment, situational behavior tests, and countless other tools of assessment still require scoring by trained personnel.

Manuals for individual intelligence tests tend to be very explicit about scoring criteria, lest examinees’ measured intelligence vary as a function of who is doing the testing and scoring. In some tests of personality, examinees are asked to supply open-ended responses to stimuli such as pictures, words, sentences, and inkblots, and it is the examiner who must then quantify or qualitatively evaluate responses. In one test of creativity, examinees might be given the task of creating as many things as they can out of a set of blocks. Here, it is the examiner’s task to determine which block constructions will be awarded credit and which will not. For a behavioral measure of social skills in an inpatient psychiatric service, the scorers or raters might be asked to rate patients with respect to the variable “social relatedness.” Such a behavioral measure might require the rater to check yes or no to items like Patient says “Good morning” to at least two staff members.

JUST THINK . . .

Can you conceive of a test item on a rating scale requiring human judgment that all raters will score the same 100% of the time?

Scorers and scoring systems are potential sources of error variance. A test may employ objective-type items amenable to computer scoring of well-documented reliability. Yet even then, a technical glitch might contaminate the data. If subjectivity is involved in scoring, then the scorer (or rater) can be a source of error variance. Indeed, despite rigorous scoring criteria set forth in many of the better-known tests of intelligence, examiner/scorers occasionally still are confronted by situations where an examinee’s response lies in a gray area. The element of subjectivity in scoring may be much greater in the administration of certain nonobjective-type personality tests, tests of creativity (such as the block test just described), and certain academic tests (such as essay examinations). Subjectivity in scoring can even enter into Page 145behavioral assessment. Consider the case of two behavior observers given the task of rating one psychiatric inpatient on the variable of “social relatedness.” On an item that asks simply whether two staff members were greeted in the morning, one rater might judge the patient’s eye contact and mumbling of something to two staff members to qualify as a yes response. The other observer might feel strongly that a no response to the item is appropriate. Such problems in scoring agreement can be addressed through rigorous training designed to make the consistency—or reliability—of various scorers as nearly perfect as can be.

Other sources of error

Surveys and polls are two tools of assessment commonly used by researchers who study public opinion. In the political arena, for example, researchers trying to predict who will win an election may sample opinions from representative voters and then draw conclusions based on their data. However, in the “fine print” of those conclusions is usually a disclaimer that the conclusions may be off by plus or minus a certain percent. This fine print is a reference to the margin of error the researchers estimate to exist in their study. The error in such research may be a result of sampling error—the extent to which the population of voters in the study actually was representative of voters in the election. The researchers may not have gotten it right with respect to demographics, political party affiliation, or other factors related to the population of voters. Alternatively, the researchers may have gotten such factors right but simply did not include enough people in their sample to draw the conclusions that they did. This brings us to another type of error, called methodological error. So, for example, the interviewers may not have been trained properly, the wording in the questionnaire may have been ambiguous, or the items may have somehow been biased to favor one or another of the candidates.

Certain types of assessment situations lend themselves to particular varieties of systematic and nonsystematic error. For example, consider assessing the extent of agreement between partners regarding the quality and quantity of physical and psychological abuse in their relationship. As Moffitt et al. (1997) observed, “Because partner abuse usually occurs in private, there are only two persons who ‘really’ know what goes on behind closed doors: the two members of the couple” (p. 47). Potential sources of nonsystematic error in such an assessment situation include forgetting, failing to notice abusive behavior, and misunderstanding instructions regarding reporting. A number of studies (O’Leary & Arias, 1988; Riggs et al., 1989; Straus, 1979) have suggested that underreporting or overreporting of perpetration of abuse also may contribute to systematic error. Females, for example, may underreport abuse because of fear, shame, or social desirability factors and overreport abuse if they are seeking help. Males may underreport abuse because of embarrassment and social desirability factors and overreport abuse if they are attempting to justify the report.

Just as the amount of abuse one partner suffers at the hands of the other may never be known, so the amount of test variance that is true relative to error may never be known. A so-called true score, as Stanley (1971, p. 361) put it, is “not the ultimate fact in the book of the recording angel.” Further, the utility of the methods used for estimating true versus error variance is a hotly debated matter (see Collins, 1996; Humphreys, 1996; Williams & Zimmerman, 1996a, 1996b). Let’s take a closer look at such estimates and how they are derived.

Reliability Estimates

Test-Retest Reliability Estimates

A ruler made from the highest-quality steel can be a very reliable instrument of measurement. Every time you measure something that is exactly 12 inches long, for example, your ruler will tell you that what you are measuring is exactly 12 inches long. The reliability of this instrument Page 146of measurement may also be said to be stable over time. Whether you measure the 12 inches today, tomorrow, or next year, the ruler is still going to measure 12 inches as 12 inches. By contrast, a ruler constructed of putty might be a very unreliable instrument of measurement. One minute it could measure some known 12-inch standard as 12 inches, the next minute it could measure it as 14 inches, and a week later it could measure it as 18 inches. One way of estimating the reliability of a measuring instrument is by using the same instrument to measure the same thing at two points in time. In psychometric parlance, this approach to reliability evaluation is called the test-retest method, and the result of such an evaluation is an estimate of test-retest reliability.

Test-retest reliability  is an estimate of reliability obtained by correlating pairs of scores from the same people on two different administrations of the same test. The test-retest measure is appropriate when evaluating the reliability of a test that purports to measure something that is relatively stable over time, such as a personality trait. If the characteristic being measured is assumed to fluctuate over time, then there would be little sense in assessing the reliability of the test using the test-retest method.

As time passes, people change. For example, people may learn new things, forget some things, and acquire new skills. It is generally the case (although there are exceptions) that, as the time interval between administrations of the same test increases, the correlation between the scores obtained on each testing decreases. The passage of time can be a source of error variance. The longer the time that passes, the greater the likelihood that the reliability coefficient will be lower. When the interval between testing is greater than six months, the estimate of test-retest reliability is often referred to as the  coefficient of stability .

An estimate of test-retest reliability from a math test might be low if the testtakers took a math tutorial before the second test was administered. An estimate of test-retest reliability from a personality profile might be low if the testtaker suffered some emotional trauma or received counseling during the intervening period. A low estimate of test-retest reliability might be found even when the interval between testings is relatively brief. This may well be the case when the testings occur during a time of great developmental change with respect to the variables they are designed to assess. An evaluation of a test-retest reliability coefficient must therefore extend beyond the magnitude of the obtained coefficient. If we are to come to proper conclusions about the reliability of the measuring instrument, evaluation of a test-retest reliability estimate must extend to a consideration of possible intervening factors between test administrations.

An estimate of test-retest reliability may be most appropriate in gauging the reliability of tests that employ outcome measures such as reaction time or perceptual judgments (including discriminations of brightness, loudness, or taste). However, even in measuring variables such as these, and even when the time period between the two administrations of the test is relatively small, various factors (such as experience, practice, memory, fatigue, and motivation) may intervene and confound an obtained measure of reliability.2

Taking a broader perspective, psychological science, and science in general, demands that the measurements obtained by one experimenter be replicable by other experimenters using the same instruments of measurement and following the same procedures. However, as observed in this chapter’s  Close-Up , a replicability problem of epic proportions appears to be brewing.Page 147

CLOSE-UP

Psychology’s Replicability Crisis*

In the mid-2000s, academic scientists became concerned that science was not being performed rigorously enough to prevent spurious results from reaching consensus within the scientific community. In other words, they worried that scientific findings, although peer-reviewed and published, were not replicable by independent parties. Since that time, hundreds of researchers have endeavored to determine if there is really a problem, and if there is, how to curb it. In 2015, a group of researchers called the Open Science Collaboration attempted to redo 100 psychology studies that had already been peer-reviewed and published in leading journals (Open Science Collaboration, 2015). Their results, published in the journal Science, indicated that, depending on the criteria used, only 40–60% of replications found the same results as the original studies. This low replication rate helped confirm that science indeed had a problem with replicability, the seriousness of which is reflected in the term replicability crisis.

Why and how did this crisis of replicability emerge? Here it will be argued that the major causal factors are (1) a general lack of published replication attempts in the professional literature, (2) editorial preferences for positive over negative findings, and (3) questionable research practices on the part of authors of published studies. Let’s consider each of these factors.

Lack of Published Replication Attempts

Journals have long preferred to publish novel results instead of replications of previous work. In fact, a recent study found that only 1.07% of the published psychological scientific literature sought to directly replicate previous work (Makel et al., 2012). Academic scientists, who depend on publication in order to progress in their careers, respond to this bias by focusing their research on unexplored phenomena instead of replications. The implications for science are dire. Replication by independent parties provides for confidence in a finding, reducing the likelihood of experimenter bias and statistical anomaly. Indeed, had scientists been as focused on replication as they were on hunting down novel results, the field would likely not be in crisis now.

Editorial Preference for Positive over Negative Findings

Journals prefer positive over negative findings. “Positive” in this context does not refer to how upbeat, beneficial, or heart-warming the study is. Rather, positive refers to whether the study concluded that an experimental effect existed. Stated another way, and drawing on your recall from that class you took in experimental methods, positive findings typically entail a rejection of the null hypothesis. In essence, from the perspective of most journals, rejecting the null hypothesis as a result of a research study is a newsworthy event. By contrast, accepting the null hypothesis might just amount to “old news.”

The fact that journals are more apt to publish positive rather than negative studies has consequences in terms of the types of studies that even get submitted for publication. Studies submitted for publication typically report the existence of an effect rather than the absence of one. The vast majority of studies that actually get published also report the existence of an effect. Those studies designed to disconfirm reports of published effects are few-and-far-between to begin with, and may not be deemed publishable even when they are conducted and submitted to a journal for review. The net result is that scientists, policy-makers, judges, and anyone else who has occasion to rely on published research may have a difficult time determining the actual strength and robustness of a reported finding.

Questionable Research Practices (QRPs)

In this admittedly nonexhaustive review of factors contributing to the replicability crisis, the third factor is QRPs. Included here are questionable scientific practices that do not rise to the level of fraud but still introduce error into bodies of scientific evidence. For example, a recent survey of psychological scientists found that nearly 60% of the respondents reported that they decided to collect more data after peeking to see if their already-collected data had reached statistical significance (John et al., 2012). While this procedure may seem relatively benign, it is not. Imagine you are trying to determine if a nickel is fair, or weighted toward heads. Rather than establishing the number flips you plan on performing prior to your “test,” you just start flipping and from time-to-time check how many times the coin has come up heads. After a run of five heads, you notice that your weighted-coin hypothesis is looking strong and decide to stop flipping. The nonindependence between the decision to collect data and the data themselves introduces bias. Over the course of many studies, such practices can seriously undermine a body of research.

There are many other sorts of QRPs. For example, one variety entails the researcher failing to report all of the research undertaken in a research program, and then Page 148selectively only reporting the studies that confirm a particular hypothesis. With only the published study in hand, and without access to the researchers’ records, it would be difficult if not impossible for the research consumer to discern important milestones in the chronology of the research (such as what studies were conducted in what sequence, and what measurements were taken).

One proposed remedy for such QRPs is preregistration (Eich, 2014). Preregistration involves publicly committing to a set of procedures prior to carrying out a study. Using such a procedure, there can be no doubt as to the number of observations planned, and the number of measures anticipated. In fact, there are now several websites that allow researchers to preregister their research plans. It is also increasingly common for academic journals to demand preregistration (or at least a good explanation for why the study wasn’t preregistered). Alternatively, some journals award special recognition to studies that were preregistered so that readers can have more confidence in the replicability of the reported findings.

Lessons Learned from the Replicability Crisis

The replicability crisis represents an important learning opportunity for scientists and students. Prior to such replicability issues coming to light, it was typically assumed that science would simply self-correct over the long run. This means that at some point in time, the nonreplicable study would be exposed as such, and the scientific record would somehow be straightened out. Of course, while some self-correction does occur, it occurs neither fast enough nor often enough, nor in sufficient magnitude. The stark reality is that unreliable findings that reach general acceptance can stay in place for decades before they are eventually disconfirmed. And even when such long-standing findings are proven incorrect, there is no mechanism in place to alert other scientists and the public of this fact.

Traditionally, science has only been admitted into courtrooms if an expert attests that the science has reached “general acceptance” in the scientific community from which it comes. However, in the wake of science’s replicability crisis, it is not at all uncommon for findings to meet this general acceptance standard. Sadly, the standard may be met even if the findings from the subject study are questionable at best, or downright inaccurate at worst. Fortunately, another legal test has been created in recent years (Chin, 2014). In this test, judges are asked to play a gatekeeper role and only admit scientific evidence if it has been properly tested, has a sufficiently low error rate, and has been peer-reviewed and published. In this latter test, judges can ask more sensible questions, such as whether the study has been replicated and if the testing was done using a safeguard like preregistration.

Conclusion

Spurred by the recognition of a crisis of replicability, science is moving to right from both past and potential wrongs. As previously noted, there are now mechanisms in place for preregistration of experimental designs and growing acceptance of the importance of doing so. Further, organizations that provide for open science (e.g., easy and efficient preregistration) are receiving millions of dollars in funding to provide support for researchers seeking to perform more rigorous research. Moreover, replication efforts—beyond even that of the Open Science Collaboration—are becoming more common (Klein et al, 2013). Overall, it appears that most scientists now recognize replicability as a concern that needs to be addressed with meaningful changes to what has constituted “business-as-usual” for so many years.

Effectively addressing the replicability crisis is important for any profession that relies on scientific evidence. Within the field of law, for example, science is used every day in courtrooms throughout the world to prosecute criminal cases and adjudicate civil disputes. Everyone from a criminal defendant facing capital punishment to a major corporation arguing that its violent video games did not promote real-life violence may rely at some point in a trial on a study published in a psychology journal. Appeals are sometimes limited. Costs associated with legal proceedings are often prohibitive. With a momentous verdict in the offing, none of the litigants has the luxury of time—which might amount to decades, if at all—for the scholarly research system to self-correct.

When it comes to psychology’s replicability crisis, there is good and bad news. The bad news is that it is real, and that it has existed perhaps, since scientific studies were first published. The good news is that the problem has finally been recognized, and constructive steps are being taken to address it.

Used with permission of Jason Chin.

*This Close-Up was guest-authored by Jason Chin of the University of Toronto.

Page 149

Parallel-Forms and Alternate-Forms Reliability Estimates

If you have ever taken a makeup exam in which the questions were not all the same as on the test initially given, you have had experience with different forms of a test. And if you have ever wondered whether the two forms of the test were really equivalent, you have wondered about the alternate-forms or parallel-forms reliability of the test. The degree of the relationship between various forms of a test can be evaluated by means of an alternate-forms or parallel-forms coefficient of reliability, which is often termed the  coefficient of equivalence .

Although frequently used interchangeably, there is a difference between the terms alternate forms and parallel forms.  Parallel forms  of a test exist when, for each form of the test, the means and the variances of observed test scores are equal. In theory, the means of scores obtained on parallel forms correlate equally with the true score. More practically, scores obtained on parallel tests correlate equally with other measures. The term  parallel forms reliability  refers to an estimate of the extent to which item sampling and other errors have affected test scores on versions of the same test when, for each form of the test, the means and variances of observed test scores are equal.

Alternate forms  are simply different versions of a test that have been constructed so as to be parallel. Although they do not meet the requirements for the legitimate designation “parallel,” alternate forms of a test are typically designed to be equivalent with respect to variables such as content and level of difficulty. The term  alternate forms reliability  refers to an estimate of the extent to which these different forms of the same test have been affected by item sampling error, or other error.

JUST THINK . . .

You missed the midterm examination and have to take a makeup exam. Your classmates tell you that they found the midterm impossibly difficult. Your instructor tells you that you will be taking an alternate form, not a parallel form, of the original test. How do you feel about that?

Obtaining estimates of alternate-forms reliability and parallel-forms reliability is similar in two ways to obtaining an estimate of test-retest reliability: (1) Two test administrations with the same group are required, and (2) test scores may be affected by factors such as motivation, fatigue, or intervening events such as practice, learning, or therapy (although not as much as when the same test is administered twice). An additional source of error variance, item sampling, is inherent in the computation of an alternate- or parallel-forms reliability coefficient. Testtakers may do better or worse on a specific form of the test not as a function of their true ability but simply because of the particular items that were selected for inclusion in the test.3

Developing alternate forms of tests can be time-consuming and expensive. Imagine what might be involved in trying to create sets of equivalent items and then getting the same people to sit for repeated administrations of an experimental test! On the other hand, once an alternate or parallel form of a test has been developed, it is advantageous to the test user in several ways. For example, it minimizes the effect of memory for the content of a previously administered form of the test.

JUST THINK . . .

From the perspective of the test user, what are other possible advantages of having alternate or parallel forms of the same test?

Certain traits are presumed to be relatively stable in people over time, and we would expect tests measuring those traits—alternate forms, parallel forms, or otherwise—to reflect that stability. As an example, we expect that there will be, and in fact there is, a reasonable degree of stability in scores on intelligence tests. Conversely, we might expect relatively little stability in scores obtained on a measure of state anxiety (anxiety felt at the moment).Page 150

An estimate of the reliability of a test can be obtained without developing an alternate form of the test and without having to administer the test twice to the same people. Deriving this type of estimate entails an evaluation of the internal consistency of the test items. Logically enough, it is referred to as an  internal consistency estimate of reliability  or as an  estimate of inter-item consistency . There are different methods of obtaining internal consistency estimates of reliability. One such method is the split-half estimate.

Split-Half Reliability Estimates

An estimate of  split-half reliability  is obtained by correlating two pairs of scores obtained from equivalent halves of a single test administered once. It is a useful measure of reliability when it is impractical or undesirable to assess reliability with two tests or to administer a test twice (because of factors such as time or expense). The computation of a coefficient of split-half reliability generally entails three steps:

· Step 1. Divide the test into equivalent halves.

· Step 2. Calculate a Pearson r between scores on the two halves of the test.

· Step 3. Adjust the half-test reliability using the Spearman–Brown formula (discussed shortly).

When it comes to calculating split-half reliability coefficients, there’s more than one way to split a test—but there are some ways you should never split a test. Simply dividing the test in the middle is not recommended because it’s likely that this procedure would spuriously raise or lower the reliability coefficient. Different amounts of fatigue for the first as opposed to the second part of the test, different amounts of test anxiety, and differences in item difficulty as a function of placement in the test are all factors to consider.

One acceptable way to split a test is to randomly assign items to one or the other half of the test. Another acceptable way to split a test is to assign odd-numbered items to one half of the test and even-numbered items to the other half. This method yields an estimate of split-half reliability that is also referred to as  odd-even reliability . 4 Yet another way to split a test is to divide the test by content so that each half contains items equivalent with respect to content and difficulty. In general, a primary objective in splitting a test in half for the purpose of obtaining a split-half reliability estimate is to create what might be called “mini-parallel-forms,” with each half equal to the other—or as nearly equal as humanly possible—in format, stylistic, statistical, and related aspects.

Step 2 in the procedure entails the computation of a Pearson r, which requires little explanation at this point. However, the third step requires the use of the Spearman–Brown formula.

The Spearman–Brown formula

The  Spearman–Brown formula  allows a test developer or user to estimate internal consistency reliability from a correlation of two halves of a test. It is a specific application of a more general formula to estimate the reliability of a test that is lengthened or shortened by any number of items. Because the reliability of a test is affected by its length, a formula is necessary for estimating the reliability of a test that has been shortened or lengthened. The general Spearman–Brown (rSB) formula is

Page 151

where rSB is equal to the reliability adjusted by the Spearman–Brown formula, rxy is equal to the Pearson r in the original-length test, and n is equal to the number of items in the revised version divided by the number of items in the original version.

By determining the reliability of one half of a test, a test developer can use the Spearman–Brown formula to estimate the reliability of a whole test. Because a whole test is two times longer than half a test, n becomes 2 in the Spearman–Brown formula for the adjustment of split-half reliability. The symbol rhh stands for the Pearson r of scores in the two half tests:

Usually, but not always, reliability increases as test length increases. Ideally, the additional test items are equivalent with respect to the content and the range of difficulty of the original items. Estimates of reliability based on consideration of the entire test therefore tend to be higher than those based on half of a test. Table 5–1 shows half-test correlations presented alongside adjusted reliability estimates for the whole test. You can see that all the adjusted correlations are higher than the unadjusted correlations. This is so because Spearman–Brown estimates are based on a test that is twice as long as the original half test. For the data from the kindergarten pupils, for example, a half-test reliability of .718 is estimated to be equivalent to a whole-test reliability of .836.

 Grade Half-Test Correlation (unadjusted r ) Whole-Test  Estimate (rSB) K .718 .836 1 .807 .893 2 .777 .875 Table 5–1 Odd-Even Reliability Coefficients before and after the Spearman-Brown Adjustment*

*For scores on a test of mental ability

If test developers or users wish to shorten a test, the Spearman–Brown formula may be used to estimate the effect of the shortening on the test’s reliability. Reduction in test size for the purpose of reducing test administration time is a common practice in certain situations. For example, the test administrator may have only limited time with a particular testtaker or group of testtakers. Reduction in test size may be indicated in situations where boredom or fatigue could produce responses of questionable meaningfulness.

JUST THINK . . .

What are other situations in which a reduction in test size or the time it takes to administer a test might be desirable? What are the arguments against reducing test size?

A Spearman–Brown formula could also be used to determine the number of items needed to attain a desired level of reliability. In adding items to increase test reliability to a desired level, the rule is that the new items must be equivalent in content and difficulty so that the longer test still measures what the original test measured. If the reliability of the original test is relatively low, then it may be impractical to increase the number of items to reach an acceptable level of reliability. Another alternative would be to abandon this relatively unreliable instrument and locate—or develop—a suitable alternative. The reliability of the instrument could also be raised in some way. For example, the reliability of the instrument might be raised by creating new items, clarifying the test’s instructions, or simplifying the scoring rules.

Internal consistency estimates of reliability, such as that obtained by use of the Spearman–Brown formula, are inappropriate for measuring the reliability of heterogeneous tests and speed tests. The impact of test characteristics on reliability is discussed in detail later in this chapter.Page 152

Other Methods of Estimating Internal Consistency

In addition to the Spearman–Brown formula, other methods used to obtain estimates of internal consistency reliability include formulas developed by Kuder and Richardson (1937) and Cronbach (1951).  Inter-item consistency  refers to the degree of correlation among all the items on a scale. A measure of inter-item consistency is calculated from a single administration of a single form of a test. An index of inter-item consistency, in turn, is useful in assessing the  homogeneity  of the test. Tests are said to be homogeneous if they contain items that measure a single trait. As an adjective used to describe test items, homogeneity (derived from the Greek words homos, meaning “same,” and genos, meaning “kind”) is the degree to which a test measures a single factor. In other words, homogeneity is the extent to which items in a scale are unifactorial.

In contrast to test homogeneity,  heterogeneity  describes the degree to which a test measures different factors. A heterogeneous (or nonhomogeneous) test is composed of items that measure more than one trait. A test that assesses knowledge only of ultra high definition (UHD) television repair skills could be expected to be more homogeneous in content than a general electronics repair test. The former test assesses only one area whereas the latter assesses several, such as knowledge not only of UHD televisions but also of digital video recorders, Blu-Ray players, MP3 players, satellite radio receivers, and so forth.

The more homogeneous a test is, the more inter-item consistency it can be expected to have. Because a homogeneous test samples a relatively narrow content area, it is to be expected to contain more inter-item consistency than a heterogeneous test. Test homogeneity is desirable because it allows relatively straightforward test-score interpretation. Testtakers with the same score on a homogeneous test probably have similar abilities in the area tested. Testtakers with the same score on a more heterogeneous test may have quite different abilities.

Although a homogeneous test is desirable because it so readily lends itself to clear interpretation, it is often an insufficient tool for measuring multifaceted psychological variables such as intelligence or personality. One way to circumvent this potential source of difficulty has been to administer a series of homogeneous tests, each designed to measure some component of a heterogeneous variable.5

The Kuder–Richardson formulas

Dissatisfaction with existing split-half methods of estimating reliability compelled G. Frederic Kuder and M. W. Richardson (1937; Richardson & Kuder, 1939) to develop their own measures for estimating reliability. The most widely known of the many formulas they collaborated on is their  Kuder–Richardson formula 20 , or KR-20, so named because it was the 20th formula developed in a series. Where test items are highly homogeneous, KR-20 and split-half reliability estimates will be similar. However, KR-20 is the statistic of choice for determining the inter-item consistency of dichotomous items, primarily those items that can be scored right or wrong (such as multiple-choice items). If test items are more heterogeneous, KR-20 will yield lower reliability estimates than the split-half method. Table 5–2 summarizes items on a sample heterogeneous test (the HERT), and Table 5–3 summarizes HERT performance for 20 testtakers. Assuming the difficulty level of all the items on the test to be about the same, would you expect a split-half (odd-even) estimate of reliability to be fairly high or low? How would the KR-20 reliability estimate compare with the odd-even estimate of reliability—would it be higher or lower?

 Item Number Content Area 1 UHD television 2 UHD television 3 Digital video recorder (DVR) 4 Digital video recorder (DVR) 5 Blu-Ray player 6 Blu-Ray player 7 Smart phone 8 Smart phone 9 Computer 10 Computer 11 Compact disc player 12 Compact disc player 13 Satellite radio receiver 14 Satellite radio receiver 15 Video camera 16 Video camera 17 MP3 player 18 MP3 player Table 5–2 Content Areas Sampled for 18 Items of the Hypothetical Electronics Repair Test (HERT) Item Number Number of Testtakers Correct 1 14 2 12 3 9 4 18 5 8 6 5 7 6 8 9 9 10 10 10 11 8 12 6 13 15 14 9 15 12 16 12 17 14 18 7 Table 5–3 Performance on the 18-Item HERT by Item for 20 Testtakers

We might guess that, because the content areas sampled for the 18 items from this “Hypothetical Electronics Repair Test” are ordered in a manner whereby odd and even items Page 153tap the same content area, the odd-even reliability estimate will probably be quite high. Because of the great heterogeneity of content areas when taken as a whole, it could reasonably be predicted that the KR-20 estimate of reliability will be lower than the odd-even one. How is KR-20 computed? The following formula may be used:

where rKR20 stands for the Kuder–Richardson formula 20 reliability coefficient, k is the number of test items, σ2 is the variance of total test scores, p is the proportion of testtakers who pass the item, q is the proportion of people who fail the item, and Σ pq is the sum of the pq products over all items. For this particular example, k equals 18. Based on the data in Table 5–3, Σpq can be computed to be 3.975. The variance of total test scores is 5.26. Thus, rKR20 = .259.

An approximation of KR-20 can be obtained by the use of the 21st formula in the series developed by Kuder and Richardson, a formula known as—you guessed it—KR-21. The KR-21 formula may be used if there is reason to assume that all the test items have approximately Page 154the same degree of difficulty. Let’s add that this assumption is seldom justified. Formula KR-21 has become outdated in an era of calculators and computers. Way back when, KR-21 was sometimes used to estimate KR-20 only because it required many fewer calculations.

Numerous modifications of Kuder–Richardson formulas have been proposed through the years. The one variant of the KR-20 formula that has received the most acceptance and is in widest use today is a statistic called coefficient alpha. You may even hear it referred to as coefficient α−20. This expression incorporates both the Greek letter alpha (α) and the number 20, the latter a reference to KR-20.

Coefficient alpha

Developed by Cronbach (1951) and subsequently elaborated on by others (such as Kaiser & Michael, 1975; Novick & Lewis, 1967),  coefficient alpha  may be thought of as the mean of all possible split-half correlations, corrected by the Spearman–Brown formula. In contrast to KR-20, which is appropriately used only on tests with dichotomous items, coefficient alpha is appropriate for use on tests containing nondichotomous items. The formula for coefficient alpha is

where ra is coefficient alpha, k is the number of items, is the variance of one item, Σ is the sum of variances of each item, and σ2 is the variance of the total test scores.

Coefficient alpha is the preferred statistic for obtaining an estimate of internal consistency reliability. A variation of the formula has been developed for use in obtaining an estimate of test-retest reliability (Green, 2003). Essentially, this formula yields an estimate of the mean of all possible test-retest, split-half coefficients. Coefficient alpha is widely used as a measure of reliability, in part because it requires only one administration of the test.

Unlike a Pearson r, which may range in value from −1 to +1, coefficient alpha typically ranges in value from 0 to 1. The reason for this is that, conceptually, coefficient alpha (much like other coefficients of reliability) is calculated to help answer questions about how similar sets of data are. Here, similarity is gauged, in essence, on a scale from 0 (absolutely no similarity) to 1 (perfectly identical). It is possible, however, to conceive of data sets that would yield a negative value of alpha (Streiner, 2003b). Still, because negative values of alpha are theoretically impossible, it is recommended under such rare circumstances that the alpha coefficient be reported as zero (Henson, 2001). Also, a myth about alpha is that “bigger is always better.” As Streiner (2003b) pointed out, a value of alpha above .90 may be “too high” and indicate redundancy in the items.

In contrast to coefficient alpha, a Pearson r may be thought of as dealing conceptually with both dissimilarity and similarity. Accordingly, an r value of −1 may be thought of as indicating “perfect dissimilarity.” In practice, most reliability coefficients—regardless of the specific type of reliability they are measuring—range in value from 0 to 1. This is generally true, although it is possible to conceive of exceptional cases in which data sets yield an r with a negative value.

Average proportional distance (APD)

A relatively new measure for evaluating the internal consistency of a test is the average proportional distance (APD) method (Sturman et al., 2009). Rather than focusing on similarity between scores on items of a test (as do split-half methods and Cronbach’s alpha), the APD is a measure that focuses on the degree of difference that exists between item scores. Accordingly, we define the  average proportional distance method  as a measure used to evaluate the internal consistency of a test that focuses on the degree of difference that exists between item scores.

To illustrate how the APD is calculated, consider the (hypothetical) “3-Item Test of Extraversion” (3-ITE). As conveyed by the title of the 3-ITE, it is a test that has only three Page 155items. Each of the items is a sentence that somehow relates to extraversion. Testtakers are instructed to respond to each of the three items with reference to the following 7-point scale: 1 = Very strongly disagree, 2 = Strongly disagree, 3 = Disagree, 4 = Neither Agree nor Disagree, 5 = Agree, 6 = Strongly agree, and 7 = Very strongly agree.

Typically, in order to evaluate the inter-item consistency of a scale, the calculation of the APD would be calculated for a group of testtakers. However, for the purpose of illustrating the calculations of this measure, let’s look at how the APD would be calculated for one testtaker. Yolanda scores 4 on Item 1, 5 on Item 2, and 6 on Item 3. Based on Yolanda’s scores, the APD would be calculated as follows:

· Step 1: Calculate the absolute difference between scores for all of the items.

· Step 2: Average the difference between scores.

· Step 3: Obtain the APD by dividing the average difference between scores by the number of response options on the test, minus one.

So, for the 3-ITE, here is how the calculations would look using Yolanda’s test scores:

· Step 1: Absolute difference between Items 1 and 2 = 1

·    Absolute difference between Items 1 and 3 = 2

·    Absolute difference between Items 2 and 3 = 1

· Step 2: In order to obtain the average difference (AD), add up the absolute differences in Step 1 and divide by the number of items as follows:

· Step 3: To obtain the average proportional distance (APD), divide the average difference by 6 (the 7 response options in our ITE scale minus 1). Using Yolanda’s data, we would divide 1.33 by 6 to get .22. Thus, the APD for the ITE is .22. But what does this mean?

The general “rule of thumb” for interpreting an APD is that an obtained value of .2 or lower is indicative of excellent internal consistency, and that a value of .25 to .2 is in the acceptable range. A calculated APD of .25 is suggestive of problems with the internal consistency of the test. These guidelines are based on the assumption that items measuring a single construct such as extraversion should ideally be correlated with one another in the .6 to .7 range. Let’s add that the expected inter-item correlation may vary depending on the variables being measured, so the ideal correlation values are not set in stone. In the case of the 3-ITE, the data for our one subject suggests that the scale has acceptable internal consistency. Of course, in order to make any meaningful conclusions about the internal consistency of the 3-ITE, the instrument would have to be tested with a large sample of testtakers.

One potential advantage of the APD method over using Cronbach’s alpha is that the APD index is not connected to the number of items on a measure. Cronbach’s alpha will be higher when a measure has more than 25 items (Cortina, 1993). Perhaps the best course of action when evaluating the internal consistency of a given measure is to analyze and integrate the information using several indices, including Cronbach’s alpha, mean inter-item correlations, and the APD.

Before proceeding, let’s emphasize that all indices of reliability provide an index that is a characteristic of a particular group of test scores, not of the test itself (Caruso, 2000; Yin & Fan, 2000). Measures of reliability are estimates, and estimates are subject to error. The precise amount of error inherent in a reliability estimate will vary with various factors, such as the sample of testtakers from which the data were drawn. A reliability index published in a test manual might be very impressive. However, keep in mind that the reported reliability was achieved with a particular group of testtakers. If a new group of testtakers is sufficiently Page 156different from the group of testtakers on whom the reliability studies were done, the reliability coefficient may not be as impressive—and may even be unacceptable.

Measures of Inter-Scorer Reliability

When being evaluated, we usually would like to believe that the results would be the same no matter who is doing the evaluating.6 For example, if you take a road test for a driver’s license, you would like to believe that whether you pass or fail is solely a matter of your performance behind the wheel and not a function of who is sitting in the passenger’s seat. Unfortunately, in some types of tests under some conditions, the score may be more a function of the scorer than of anything else. This was demonstrated back in 1912, when researchers presented one pupil’s English composition to a convention of teachers and volunteers graded the papers. The grades ranged from a low of 50% to a high of 98% (Starch & Elliott, 1912). Concerns about inter-scorer reliability are as relevant today as they were back then (Chmielewski et al., 2015; Edens et al., 2015; Penney et al., 2016). With this as background, it can be appreciated that certain tests lend themselves to scoring in a way that is more consistent than with other tests. It is meaningful, therefore, to raise questions about the degree of consistency, or reliability, that exists between scorers of a particular test.

Variously referred to as scorer reliability, judge reliability, observer reliability, and inter-rater reliability,  inter-scorer reliability  is the degree of agreement or consistency between two or more scorers (or judges or raters) with regard to a particular measure. Reference to levels of inter-scorer reliability for a particular test may be published in the test’s manual or elsewhere. If the reliability coefficient is high, the prospective test user knows that test scores can be derived in a systematic, consistent way by various scorers with sufficient training. A responsible test developer who is unable to create a test that can be scored with a reasonable degree of consistency by trained scorers will go back to the drawing board to discover the reason for this problem. If, for example, the problem is a lack of clarity in scoring criteria, then the remedy might be to rewrite the scoring criteria section of the manual to include clearly written scoring rules. Inter-rater consistency may be promoted by providing raters with the opportunity for group discussion along with practice exercises and information on rater accuracy (Smith, 1986).

Inter-scorer reliability is often used when coding nonverbal behavior. For example, a researcher who wishes to quantify some aspect of nonverbal behavior, such as depressed mood, would start by composing a checklist of behaviors that constitute depressed mood (such as looking downward and moving slowly). Accordingly, each subject would be given a depressed mood score by a rater. Researchers try to guard against such ratings being products of the rater’s individual biases or idiosyncrasies in judgment. This can be accomplished by having at least one other individual observe and rate the same behaviors. If consensus can be demonstrated in the ratings, the researchers can be more confident regarding the accuracy of the ratings and their conformity with the established rating system.

JUST THINK . . .

Can you think of a measure in which it might be desirable for different judges, scorers, or raters to have different views on what is being judged, scored, or rated?

Perhaps the simplest way of determining the degree of consistency among scorers in the scoring of a test is to calculate a coefficient of correlation. This correlation coefficient is referred to as a  coefficient of inter-scorer reliability . In this chapter’s  Everyday Psychometrics  section, the nature of the relationship between the specific method used and the resulting estimate of diagnostic reliability is considered in greater detail.Page 157

EVERYDAY PSYCHOMETRICS

The Importance of the Method Used for Estimating Reliability*

As noted throughout this text, reliability is extremely important in its own right and is also a necessary, but not sufficient, condition for validity. However, researchers often fail to understand that the specific method used to obtain reliability estimates can lead to large differences in those estimates, even when other factors (such as subject sample, raters, and specific reliability statistic used) are held constant. A published study by Chmielewski et al. (2015) highlighted the substantial influence that differences in method can have on estimates of inter-rater reliability.

As one might expect, high levels of diagnostic (inter-rater) reliability are vital for the accurate diagnosis of psychiatric/psychological disorders. Diagnostic reliability must be acceptably high in order to accurately identify risk factors for a disorder that are common to subjects in a research study. Without satisfactory levels of diagnostic reliability, it becomes nearly impossible to accurately determine the effectiveness of treatments in clinical trials. Low diagnostic reliability can also lead to improper information regarding how a disorder changes over time. In applied clinical settings, unreliable diagnoses can result in ineffective patient care—or worse. The utility and validity of a particular diagnosis itself can be called into question if expert diagnosticians cannot, for whatever reason, consistently agree on who should and should not be so diagnosed. In sum, high levels of diagnostic reliability are essential for establishing diagnostic validity (Freedman, 2013; Nelson-Gray, 1991).

The official nomenclature of psychological/psychiatric diagnoses in the United States is the Diagnostic and Statistical Manual of Mental Disorders (DSM-5; American Psychiatric Association, 2013), which provides explicit diagnostic criteria for all mental disorders. A perceived strength of recent versions of the DSM is that disorders listed in the manual can be diagnosed with a high level of inter-rater reliability (Hyman, 2010; Nathan & Langenbucher, 1999), especially when trained professionals use semistructured interviews to assign those diagnoses. However, the field trials for the newest version of the manual, the DSM-5, demonstrated a mean kappa of only .44 (Regier et al., 2013), which is considered a “fair” level of agreement that is only moderately greater than chance (Cicchetti, 1994; Fleiss, 1981). Moreover, DSM-5 kappas were much lower than those from previous versions of the manual which had been in the “excellent” range. As one might expect, given the assumption that psychiatric diagnoses are reliable, the results of the DSM-5 field trials caused considerable controversy and led to numerous criticisms of the new manual (Frances, 2012; Jones, 2012). Interestingly, several diagnoses, which were unchanged from previous versions of the manual, also demonstrated low diagnostic reliability suggesting that the manual itself was not responsible for the apparent reduction in reliability. Instead, differences in the methods used to obtain estimates of inter-rater reliability in the DSM-5 Field Trials, compared to estimates for previous versions of the manual, may have led to the lower observed diagnostic reliability.

Prior to DSM-5, estimates of DSM inter-rater reliability were largely derived using the audio-recording method. In the audio-recording method, one clinician interviews a patient and assigns diagnoses. Then a second clinician, who does not know what diagnoses were assigned, listens to an audio-recording (or watches a video-recording) of the interview and independently assigns diagnoses. These two sets of ratings are then used to calculate inter-rater reliability coefficients (such as kappa). However, in recent years, several researchers have made the case that the audio-recording method might inflate estimates of diagnostic reliability for a variety of reasons (Chmielewski et al., 2015; Kraemer et al., 2012). First, if the interviewing clinician decides the patient they are interviewing does not meet diagnostic criteria for a disorder, they typically do not ask about any remaining symptoms of the disorder (this is a feature of semistructured interviews designed to reduce administration times). However, it also means that the clinician listening to the audio-tape, even if they believe the patient might meet diagnostic criteria for a disorder, does not have all the information necessary to assign a diagnosis and therefore is forced to agree that no diagnosis is present. Second, only the interviewing clinician can follow up patient responses with further questions or obtain clarification regarding symptoms to help them make a decision. Third, even when semistructured interviews are used it is possible that two highly trained clinicians might obtain different responses from a patient if they had each conducted their own interview. In other words, the patient may volunteer more or perhaps even different information to one of the clinicians for any number of reasons. All of the above result in the audio- or video-recording method artificially constraining the information provided to the clinicians to be identical, which is unlikely to occur in actual research or Page 158clinical settings. As such, this method does not allow for truly independent ratings and therefore likely results in overestimates of what would be obtained if separate interviews were conducted.

In the test-retest method, separate independent interviews are conducted by two different clinicians, with neither clinician knowing what occurred during the other interview. These interviews are conducted over a time frame short enough that true change in diagnostic status is highly unlikely, making this method similar to the dependability method of assessing reliability (Chmielewski & Watson, 2009). Because diagnostic reliability is intended to assess the extent to which a patient would receive the same diagnosis at different hospitals or clinics—or, alternatively, the extent to which different studies are recruiting similar patients—the test-retest method provides a more meaningful, realistic, and ecologically valid estimate of diagnostic reliability.

Chmielewski et al. (2015) examined the influence of method on estimates of reliability by using both the audio-recording and test-retest methods in a large sample of psychiatric patients. The authors’ analyzed DSM-5 diagnoses because of the long-standing claims in the literature that they were reliable and the fact that structured interviews had not yet been created for the DSM-5. They carefully selected a one-week test-retest interval, based on theory and research, to minimize the likelihood that true diagnostic change would occur while substantially reducing memory effects and patient fatigue which might exist if the interviews were conducted immediately after each other. Clinicians in the study were at least master’s level and underwent extensive training that far exceeded the training of clinicians in the vast majority of research studies. The same pool of clinicians and patients was used for the audio-recording and test-retest methods. Diagnoses were assigned using the Structured Clinical Interview for DSM-IV (SCID-I/P; First et al., 2002), which is widely considered the gold-standard diagnostic interview in the field. Finally, patients completed self-report measures which were examined to ensure patients’ symptoms did not change over the one-week interval.

Diagnostic (inter-rater) reliability using the audio-recording method was very high (mean kappa = .80) and would be considered “excellent” by traditional standards (Cicchetti, 1994; Fleiss, 1981). Moreover, estimates of diagnostic reliability were equivalent or superior to previously published values for the DSM-5. However, estimates of diagnostic reliability obtained from the test-retest method were substantially lower (mean kappa = .47) and would be considered only “fair” by traditional standards. Moreover, approximately 25% of the disorders demonstrated “poor” diagnostic reliability. Interestingly, this level of diagnostic reliability was very similar to that observed in the DSM-5 Field Trials (mean kappa = .44), which also used the test-retest method (Regier et al., 2013). It is important to note these large differences in estimates of diagnostic reliability emerged despite the fact that (1) the same highly trained master’s-level clinicians were used for both methods; (2) the SCID-I/P, which is considered the “gold standard” in diagnostic interviews, was used; (3) the same patient sample was used; and (4) patients’ self-report of their symptoms was very stable (or, patients were experiencing their symptoms the same way during both interviews) and any changes in self-report were unrelated to diagnostic disagreements between clinicians. These results suggest that the reliability of diagnoses is far lower than commonly believed. Moreover, the results demonstrate the substantial influence that method has on estimates of diagnostic reliability even when other factors are held constant.

Used with permission of Michael Chmielewski.

*This Everyday Psychometrics was guest-authored by Michael Chmielewski of Southern Methodist University and was based on an article by Chmielewski et al. (2015), published in the Journal of Abnormal Psychology (copyright © 2015 by the American Psychological Association). The use of this information does not imply endorsement by the publisher.

Using and Interpreting a Coefficient of Reliability

We have seen that, with respect to the test itself, there are basically three approaches to the estimation of reliability: (1) test-retest, (2) alternate or parallel forms, and (3) internal or inter-item consistency. The method or methods employed will depend on a number of factors, such as the purpose of obtaining a measure of reliability.

Another question that is linked in no trivial way to the purpose of the test is, “How high should the coefficient of reliability be?” Perhaps the best “short answer” to this question is: Page 159“On a continuum relative to the purpose and importance of the decisions to be made on the basis of scores on the test.” Reliability is a mandatory attribute in all tests we use. However, we need more of it in some tests, and we will admittedly allow for less of it in others. If a test score carries with it life-or-death implications, then we need to hold that test to some high standards—including relatively high standards with regard to coefficients of reliability. If a test score is routinely used in combination with many other test scores and typically accounts for only a small part of the decision process, that test will not be held to the highest standards of reliability. As a rule of thumb, it may be useful to think of reliability coefficients in a way that parallels many grading systems: In the .90s rates a grade of A (with a value of .95 higher for the most important types of decisions), in the .80s rates a B (with below .85 being a clear B−), and anywhere from .65 through the .70s rates a weak, “barely passing” grade that borders on failing (and unacceptable). Now, let’s get a bit more technical with regard to the purpose of the reliability coefficient.

The Purpose of the Reliability Coefficient

If a specific test of employee performance is designed for use at various times over the course of the employment period, it would be reasonable to expect the test to demonstrate reliability across time. It would thus be desirable to have an estimate of the instrument’s test-retest reliability. For a test designed for a single administration only, an estimate of internal consistency would be the reliability measure of choice. If the purpose of determining reliability is to break down the error variance into its parts, as shown in Figure 5–1, then a number of reliability coefficients would have to be calculated.

Figure 5–1  Sources of Variance in a Hypothetical Test  In this hypothetical situation, 5% of the variance has not been identified by the test user. It is possible, for example, that this portion of the variance could be accounted for by  transient error,  a source of error attributable to variations in the testtaker’s feelings, moods, or mental state over time. Then again, this 5% of the error may be due to other factors that are yet to be identified.

Note that the various reliability coefficients do not all reflect the same sources of error variance. Thus, an individual reliability coefficient may provide an index of error from test construction, test administration, or test scoring and interpretation. A coefficient of inter-rater reliability, for example, provides information about error as a result of test scoring. Specifically, it can be used to answer questions about how consistently two scorers score the same test items. Table 5–4 summarizes the different kinds of error variance that are reflected in different reliability coefficients.

Page 160

 Type of Reliability Purpose Typical uses Number of  Testing Sessions Sources of Error Variance Statistical Procedures · Test-retest · To evaluate the stabilityof a measure · When assessing the stability of various personality traits · 2 · Administration · Pearson r or Spearman rho · Alternate-forms · To evaluate the relationship between different forms of a measure · When there is a need for different forms of a test (e.g., makeup tests) · 1 or 2 · Test construction or administration · Pearson r or Spearman rho · Internal consistency · To evaluate the extent to which items on a scale relate to one another · When evaluating the homogeneity of a measure (or, all items are tapping a single construct) · 1 · Test construction · Pearson r between equivalent test halves with Spearman Brown correction or Kuder-R-ichardson for dichotomous items, or coefficient alpha for multipoint items or APD · Inter-scorer · To evaluate the level of agreement between raters on a measure · Interviews or coding of behavior. Used when researchers need to show that there is consensus in the way that different raters view a particular behavior pattern (and hence no observer bias). · 1 · Scoring and interpretation · Cohen’s kappa, Pearson r or Spearman rho Table 5–4 Summary of Reliability Types

The Nature of the Test

Closely related to considerations concerning the purpose and use of a reliability coefficient are those concerning the nature of the test itself. Included here are considerations such as whether (1) the test items are homogeneous or heterogeneous in nature; (2) the characteristic, ability, or trait being measured is presumed to be dynamic or static; (3) the range of test scores is or is not restricted; (4) the test is a speed or a power test; and (5) the test is or is not criterion-referenced.

Some tests present special problems regarding the measurement of their reliability. For example, a number of psychological tests have been developed for use with infants to help identify children who are developing slowly or who may profit from early intervention of some sort. Measuring the internal consistency reliability or the inter-scorer reliability of such tests is accomplished in much the same way as it is with other tests. However, measuring test-retest reliability presents a unique problem. The abilities of the very young children being tested are fast-changing. It is common knowledge that cognitive development during the first months and years of life is both rapid and uneven. Children often grow in spurts, sometimes changing dramatically in as little as days (Hetherington & Parke, 1993). The child tested just before and again just after a developmental advance may perform very differently on the two testings. In such cases, a marked change in test score might be attributed to error when in reality it reflects a genuine change in the testtaker’s skills. The challenge in gauging the test-retest reliability of such tests is to do so in such a way that it is not spuriously lowered by the testtaker’s actual Page 161developmental changes between testings. In attempting to accomplish this, developers of such tests may design test-retest reliability studies with very short intervals between testings, sometimes as little as four days.

Homogeneity versus heterogeneity of test items

Recall that a test is said to be homogeneous in items if it is functionally uniform throughout. Tests designed to measure one factor, such as one ability or one trait, are expected to be homogeneous in items. For such tests, it is reasonable to expect a high degree of internal consistency. By contrast, if the test is heterogeneous in items, an estimate of internal consistency might be low relative to a more appropriate estimate of test-retest reliability.

Dynamic versus static characteristics

Whether what is being measured by the test is dynamic or static is also a consideration in obtaining an estimate of reliability. A  dynamic characteristic  is a trait, state, or ability presumed to be ever-changing as a function of situational and cognitive experiences. If, for example, one were to take hourly measurements of the dynamic characteristic of anxiety as manifested by a stockbroker throughout a business day, one might find the measured level of this characteristic to change from hour to hour. Such changes might even be related to the magnitude of the Dow Jones average. Because the true amount of anxiety presumed to exist would vary with each assessment, a test-retest measure would be of little help in gauging the reliability of the measuring instrument. Therefore, the best estimate of reliability would be obtained from a measure of internal consistency. Contrast this situation to one in which hourly assessments of this same stockbroker are made on a trait, state, or ability presumed to be relatively unchanging (a  static characteristic ), such as intelligence. In this instance, obtained measurement would not be expected to vary significantly as a function of time, and either the test-retest or the alternate-forms method would be appropriate.

JUST THINK . . .

Provide another example of both a dynamic characteristic and a static characteristic that a psychological test could measure.

Restriction or inflation of range

In using and interpreting a coefficient of reliability, the issue variously referred to as  restriction of range  or  restriction of variance  (or, conversely,  inflation of range  or  inflation of variance ) is important. If the variance of either variable in a correlational analysis is restricted by the sampling procedure used, then the resulting correlation coefficient tends to be lower. If the variance of either variable in a correlational analysis is inflated by the sampling procedure, then the resulting correlation coefficient tends to be higher. Refer back to Figure 3–17 on page 111 (Two Scatterplots Illustrating Unrestricted and Restricted Ranges) for a graphic illustration.

Also of critical importance is whether the range of variances employed is appropriate to the objective of the correlational analysis. Consider, for example, a published educational test designed for use with children in grades 1 through 6. Ideally, the manual for this test should contain not one reliability value covering all the testtakers in grades 1 through 6 but instead reliability values for testtakers at each grade level. Here’s another example: A corporate personnel officer employs a certain screening test in the hiring process. For future testing and hiring purposes, this personnel officer maintains reliability data with respect to scores achieved by job applicants—as opposed to hired employees—in order to avoid restriction of range effects in the data. This is so because the people who were hired typically scored higher on the test than any comparable group of applicants.

Speed tests versus power tests

When a time limit is long enough to allow testtakers to attempt all items, and if some items are so difficult that no testtaker is able to obtain a perfect score, then the test is a  power test . By contrast, a  speed test  generally contains items of Page 162uniform level of difficulty (typically uniformly low) so that, when given generous time limits, all testtakers should be able to complete all the test items correctly. In practice, however, the time limit on a speed test is established so that few if any of the testtakers will be able to complete the entire test. Score differences on a speed test are therefore based on performance speed because items attempted tend to be correct.

A reliability estimate of a speed test should be based on performance from two independent testing periods using one of the following: (1) test-retest reliability, (2) alternate-forms reliability, or (3) split-half reliability from two separately timed half tests. If a split-half procedure is used, then the obtained reliability coefficient is for a half test and should be adjusted using the Spearman–Brown formula.

Because a measure of the reliability of a speed test should reflect the consistency of response speed, the reliability of a speed test should not be calculated from a single administration of the test with a single time limit. If a speed test is administered once and some measure of internal consistency, such as the Kuder–Richardson or a split-half correlation, is calculated, the result will be a spuriously high reliability coefficient. To understand why the KR-20 or split-half reliability coefficient will be spuriously high, consider the following example.

When a group of testtakers completes a speed test, almost all the items completed will be correct. If reliability is examined using an odd-even split, and if the testtakers completed the items in order, then testtakers will get close to the same number of odd as even items correct. A testtaker completing 82 items can be expected to get approximately 41 odd and 41 even items correct. A testtaker completing 61 items may get 31 odd and 30 even items correct. When the numbers of odd and even items correct are correlated across a group of testtakers, the correlation will be close to 1.00. Yet this impressive correlation coefficient actually tells us nothing about response consistency.

Under the same scenario, a Kuder–Richardson reliability coefficient would yield a similar coefficient that would also be, well, equally useless. Recall that KR-20 reliability is based on the proportion of testtakers correct (p) and the proportion of testtakers incorrect (q) on each item. In the case of a speed test, it is conceivable that p would equal 1.0 and q would equal 0 for many of the items. Toward the end of the test—when many items would not even be attempted because of the time limit—p might equal 0 and q might equal 1.0. For many, if not a majority, of the items, then, the product pq would equal or approximate 0. When 0 is substituted in the KR-20 formula for Σ pq, the reliability coefficient is 1.0 (a meaningless coefficient in this instance).

Criterion-referenced tests

criterion-referenced test  is designed to provide an indication of where a testtaker stands with respect to some variable or criterion, such as an educational or a vocational objective. Unlike norm-referenced tests, criterion-referenced tests tend to contain material that has been mastered in hierarchical fashion. For example, the would-be pilot masters on-ground skills before attempting to master in-flight skills. Scores on criterion-referenced tests tend to be interpreted in pass–fail (or, perhaps more accurately, “master-failed-to-master”) terms, and any scrutiny of performance on individual items tends to be for diagnostic and remedial purposes.

Traditional techniques of estimating reliability employ measures that take into account scores on the entire test. Recall that a test-retest reliability estimate is based on the correlation between the total scores on two administrations of the same test. In alternate-forms reliability, a reliability estimate is based on the correlation between the two total scores on the two forms. In split-half reliability, a reliability estimate is based on the correlation between scores on two halves of the test and is then adjusted using the Spearman–Brown formula to obtain a reliability estimate of the whole test. Although there are exceptions, such traditional procedures of Page 163estimating reliability are usually not appropriate for use with criterion-referenced tests. To understand why, recall that reliability is defined as the proportion of total variance (σ2) attributable to true variance (σ2th). Total variance in a test score distribution equals the sum of the true variance plus the error variance (σe2)

A measure of reliability, therefore, depends on the variability of the test scores: how different the scores are from one another. In criterion-referenced testing, and particularly in mastery testing, how different the scores are from one another is seldom a focus of interest. In fact, individual differences between examinees on total test scores may be minimal. The critical issue for the user of a mastery test is whether or not a certain criterion score has been achieved.

As individual differences (and the variability) decrease, a traditional measure of reliability would also decrease, regardless of the stability of individual performance. Therefore, traditional ways of estimating reliability are not always appropriate for criterion-referenced tests, though there may be instances in which traditional estimates can be adopted. An example might be a situation in which the same test is being used at different stages in some program—training, therapy, or the like—and so variability in scores could reasonably be expected. Statistical techniques useful in determining the reliability of criterion-referenced tests are discussed in great detail in many sources devoted to that subject (e.g., Hambleton & Jurgensen, 1990).

The True Score Model of Measurement and Alternatives to It

Thus far—and throughout this book, unless specifically stated otherwise—the model we have assumed to be operative is  classical test theory (CTT) , also referred to as the true score (or classical) model of measurement. CTT is the most widely used and accepted model in the psychometric literature today—rumors of its demise have been greatly exaggerated (Zickar & Broadfoot, 2009). One of the reasons it has remained the most widely used model has to do with its simplicity, especially when one considers the complexity of other proposed models of measurement. Comparing CTT to IRT, for example, Streiner (2010) mused, “CTT is much simpler to understand than IRT; there aren’t formidable-looking equations with exponentiations, Greek letters, and other arcane symbols” (p. 185). Additionally, the CTT notion that everyone has a “true score” on a test has had, and continues to have, great intuitive appeal. Of course, exactly how to define this elusive true score has been a matter of sometimes contentious debate. For our purposes, we will define  true score  as a value that according to classical test theory genuinely reflects an individual’s ability (or trait) level as measured by a particular test. Let’s emphasize here that this value is indeed very test dependent. A person’s “true score” on one intelligence test, for example, can vary greatly from that same person’s “true score” on another intelligence test. Similarly, if “Form D” of an ability test contains items that the testtaker finds to be much more difficult than those on “Form E” of that test, then there is a good chance that the testtaker’s true score on Form D will be lower than that on Form E. The same holds for true scores obtained on different tests of personality. One’s true score on one test of extraversion, for example, may not bear much resemblance to one’s true score on another test of extraversion. Comparing a testtaker’s scores on two different tests purporting to measure the same thing requires a sophisticated knowledge of the properties of each of the two tests, as well as some rather complicated statistical procedures designed to equate the scores.

Another aspect of the appeal of CTT is that its assumptions allow for its application in most situations (Hambleton & Swaminathan, 1985). The fact that CTT assumptions are rather easily met and therefore applicable to so many measurement situations can be Page 164advantageous, especially for the test developer in search of an appropriate model of measurement for a particular application. Still, in psychometric parlance, CTT assumptions are characterized as “weak”—this precisely because its assumptions are so readily met. By contrast, the assumptions in another model of measurement, item response theory (IRT), are more difficult to meet. As a consequence, you may read of IRT assumptions being characterized in terms such as “strong,” “hard,” “rigorous,” and “robust.” A final advantage of CTT over any other model of measurement has to do with its compatibility and ease of use with widely used statistical techniques (as well as most currently available data analysis software). Factor analytic techniques, whether exploratory or confirmatory, are all “based on the CTT measurement foundation” (Zickar & Broadfoot, 2009, p. 52).

For all of its appeal, measurement experts have also listed many problems with CTT. For starters, one problem with CTT has to do with its assumption concerning the equivalence of all items on a test; that is, all items are presumed to be contributing equally to the score total. This assumption is questionable in many cases, and particularly questionable when doubt exists as to whether the scaling of the instrument in question is genuinely interval level in nature. Another problem has to do with the length of tests that are developed using a CTT model. Whereas test developers favor shorter rather than longer tests (as do most testtakers), the assumptions inherent in CTT favor the development of longer rather than shorter tests. For these reasons, as well as others, alternative measurement models have been developed. Below we briefly describe domain sampling theory and generalizability theory. We will then describe in greater detail, item response theory (IRT), a measurement model that some believe is a worthy successor to CTT (Borsbroom, 2005; Harvey & Hammer, 1999).

Domain sampling theory and generalizability theory

The 1950s saw the development of a viable alternative to CTT. It was originally referred to as domain sampling theory and is better known today in one of its many modified forms as generalizability theory. As set forth by Tryon (1957), the theory of domain sampling rebels against the concept of a true score existing with respect to the measurement of psychological constructs. Whereas those who subscribe to CTT seek to estimate the portion of a test score that is attributable to error, proponents of  domain sampling theory  seek to estimate the extent to which specific sources of variation under defined conditions are contributing to the test score. In domain sampling theory, a test’s reliability is conceived of as an objective measure of how precisely the test score assesses the domain from which the test draws a sample (Thorndike, 1985). A domain of behavior, or the universe of items that could conceivably measure that behavior, can be thought of as a hypothetical construct: one that shares certain characteristics with (and is measured by) the sample of items that make up the test. In theory, the items in the domain are thought to have the same means and variances of those in the test that samples from the domain. Of the three types of estimates of reliability, measures of internal consistency are perhaps the most compatible with domain sampling theory.

In one modification of domain sampling theory called generalizability theory, a “universe score” replaces that of a “true score” (Shavelson et al., 1989). Developed by Lee J. Cronbach (1970) and his colleagues (Cronbach et al., 1972),  generalizability theory  is based on the idea that a person’s test scores vary from testing to testing because of variables in the testing situation. Instead of conceiving of all variability in a person’s scores as error, Cronbach encouraged test developers and researchers to describe the details of the particular test situation or  universe  leading to a specific test score. This universe is described in terms of its  facets , which include things like the number of items in the test, the amount of training the test scorers have had, and the purpose of the test administration. Page 165According to generalizability theory, given the exact same conditions of all the facets in the universe, the exact same test score should be obtained. This test score is the  universe score , and it is, as Cronbach noted, analogous to a true score in the true score model. Cronbach (1970) explained as follows:

“What is Mary’s typing ability?” This must be interpreted as “What would Mary’s word processing score on this be if a large number of measurements on the test were collected and averaged?” The particular test score Mary earned is just one out of a universe of possible observations. If one of these scores is as acceptable as the next, then the mean, called the universe score and symbolized here by Mp (mean for person p), would be the most appropriate statement of Mary’s performance in the type of situation the test represents.

The universe is a collection of possible measures “of the same kind,” but the limits of the collection are determined by the investigator’s purpose. If he needs to know Mary’s typing ability on May 5 (for example, so that he can plot a learning curve that includes one point for that day), the universe would include observations on that day and on that day only. He probably does want to generalize over passages, testers, and scorers—that is to say, he would like to know Mary’s ability on May 5 without reference to any particular passage, tester, or scorer… .

The person will ordinarily have a different universe score for each universe. Mary’s universe score covering tests on May 5 will not agree perfectly with her universe score for the whole month of May… . Some testers call the average over a large number of comparable observations a “true score”; e.g., “Mary’s true typing rate on 3-minute tests.” Instead, we speak of a “universe score” to emphasize that what score is desired depends on the universe being considered. For any measure there are many “true scores,” each corresponding to a different universe.