
Do Adjusted Subscores Lack Validity? Don’t Blame the Messenger

Sandip Sinharay1, Shelby J. Haberman1, and Howard Wainer2

1Educational Testing Service, Princeton, NJ, USA
2National Board of Medical Examiners, Philadelphia, PA, USA

Corresponding Author: Sandip Sinharay, Educational Testing Service, 12T Rosedale Road, Princeton, NJ 08541, USA. Email: [email protected]

Abstract

There are several techniques that increase the precision of subscores by borrowing information from other parts of the test. These techniques have been criticized on validity grounds in several recent publications. In this note, the authors question the argument used in these publications and suggest both inherent limits to the validity argument and empirical issues worth examining.

Keywords

subscores, validity, augmented subscore

Introduction: Subscores and Adjusted Subscores

There are several techniques that increase the precision of subscores by borrowing information from other parts of the test. These techniques have been criticized on validity grounds in several recent publications such as Skorupski and Carvajal (2010) and Stone, Ye, Zhu, and Lane (2010). In this note, we question the argument used in these publications and suggest both inherent limits to the validity argument and empirical issues worth examining. We begin with an introduction to the techniques that borrow information from other parts of the test as part of the subscore computation process and then evaluate the validity arguments advanced recently concerning these techniques.

Interest in subscores in educational testing reflects their potential remedial and instructional benefit. According to the National Research Council report "Knowing What Students Know" (2001), the target of assessment is to provide particular information about an examinee's knowledge, skills, and abilities. Subscores have the potential to provide such information; however, they are too often not reliable enough for their intended purposes. Several researchers have suggested methods that increase the precision of subscores by borrowing information from other related scores or subscores. For example,

• Wainer, Sheehan, and Wang (2000) and Wainer, Vevea, et al. (2001) suggested the augmented subscore, which is a function of an examinee's score on the subscale of interest and that examinee's scores on the remaining subscales.

• Yen (1987) suggested the objective performance index (OPI), which is a weighted average of the observed subscore and an estimate of the observed subscore obtained using a unidimensional item response theory (IRT) model for the entire test.

• Haberman (2008a) suggested a weighted average of a subscore and the total score. Sinharay (2010) found that this weighted average is typically very similar to the augmented subscore (Wainer et al., 2000). (A numerical sketch of this weighted-average idea appears after this list.)

• Several researchers (de la Torre & Patz, 2005; Haberman & Sinharay, 2010; Luecht, 2003; Yao & Boughton, 2007) suggested using estimated abilities, or their transformations, obtained from a multivariate IRT (MIRT) model as subscores. For background on MIRT models, see, for example, Reckase (1997).
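To make the weighted-average idea concrete, the following is a minimal simulation sketch in Python. It is not code from any of the publications cited above: the true-subscore correlation, the error variances, and the regression-based choice of weights are illustrative assumptions in the spirit of Haberman (2008a), where the weights are chosen to best predict the true subscore.

import numpy as np

# Illustrative sketch (not the authors' code): a weighted average of the
# observed subscore s1 and the total score x, with weights obtained by
# regressing the (simulated, hence known) true subscore on s1 and x.
rng = np.random.default_rng(42)
n = 10_000

# Correlated true subscores for two subscales (correlation 0.8, assumed).
cov = [[1.0, 0.8], [0.8, 1.0]]
t1, t2 = rng.multivariate_normal([0.0, 0.0], cov, size=n).T

s1 = t1 + rng.normal(0.0, 0.8, n)  # observed subscore 1 (noisy)
s2 = t2 + rng.normal(0.0, 0.8, n)  # observed subscore 2 (noisy)
x = s1 + s2                        # total score

# Least-squares weights for the augmented (adjusted) subscore.
design = np.column_stack([np.ones(n), s1, x])
beta, *_ = np.linalg.lstsq(design, t1, rcond=None)
augmented = design @ beta

print("corr(raw subscore, true subscore):     ",
      round(np.corrcoef(s1, t1)[0, 1], 3))
print("corr(adjusted subscore, true subscore):",
      round(np.corrcoef(augmented, t1)[0, 1], 3))

Under these assumptions the adjusted subscore tracks the true subscore more closely than the raw subscore does, which is the precision gain the next paragraph describes.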

The scores obtained from the above-mentioned approaches will be referred to as "adjusted subscores."1 Researchers have found that adjusted subscores are more reliable, often substantially so, than the subscores themselves (Dwyer, Boughton, Yao, Steffen, & Lewis, 2006; Sinharay, 2010; Skorupski & Carvajal, 2010; Stone, Ye, Zhu, & Lane, 2010).

Recent Criticisms of Adjusted Subscores

The validity of adjusted subscores has been questioned recently. Skorupski and Carvajal (2010) studied four subscores from a large statewide test and found that the corresponding OPIs and the augmented subscores (Wainer et al., 2000) were highly correlated among themselves. The correlations between augmented subscores were 0.97 or greater, and those between the OPIs were all 1.00. Skorupski and Carvajal (2010) commented that this phenomenon of high correlations among the adjusted subscores (which means that the rank orderings for the four adjusted subscores are very similar) leads to potential loss of meaning of the subscores and "reduces, if not eliminates, the utility of the subscores for the diagnostic purposes for which they are intended. This begs the question: Are the augmented subscores providing more useful information than the raw ones?" (p. 372). They went on to comment that "although augmentation dramatically improves the reliability of subscores, it may in fact negatively affect the validity of score interpretations" (p. 372). In the abstract of their article, they commented that the near-perfect correlations among the adjusted subscores "called into question the validity of the resultant subscores, and therefore the usefulness of the subscore augmentation process."
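The mechanism behind such near-perfect correlations can be illustrated with a small, self-contained simulation. This sketch is ours, not the analysis of Skorupski and Carvajal (2010), and every numeric value in it (the true-subscore correlation of 0.9, the error standard deviation of 0.8) is an illustrative assumption: when the true subscores are highly correlated, each adjusted subscore leans heavily on the total score, so the adjusted subscores correlate far more strongly with one another than the raw subscores do.

import numpy as np

# Self-contained sketch of the phenomenon reported by Skorupski and
# Carvajal (2010): with highly correlated true subscores, augmentation
# pulls every adjusted subscore toward the total score.
rng = np.random.default_rng(7)
n = 10_000
rho = 0.9  # high true-subscore correlation, assumed for illustration

t = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=n)
s = t + rng.normal(0.0, 0.8, size=(n, 2))  # noisy observed subscores
x = s.sum(axis=1)                          # total score

def augment(k):
    """Least-squares weighted average of subscore k and the total score."""
    design = np.column_stack([np.ones(n), s[:, k], x])
    beta, *_ = np.linalg.lstsq(design, t[:, k], rcond=None)
    return design @ beta

a1, a2 = augment(0), augment(1)
print("corr(raw subscore 1, raw subscore 2):          ",
      round(np.corrcoef(s[:, 0], s[:, 1])[0, 1], 3))
print("corr(adjusted subscore 1, adjusted subscore 2):",
      round(np.corrcoef(a1, a2)[0, 1], 3))

Under these assumed values, the correlation between the adjusted subscores is far closer to 1 than the correlation between the raw subscores, mirroring the pattern that prompted the criticism.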

Stone et al. (2010) studied the four subscores from the spring 2006 administration of the Delaware Student Testing Program 8th-grade mathematics assessment. They found the augmented subscores, the OPIs, and the MIRT-based subscores to be highly correlated among themselves and commented that "it may be that adjusted subscale scores represent the measurement of a construct that is different from the construct being measured by the unadjusted subscale scores" (p. 80). They commented that borrowing information from other subscales poses a "potential threat to validity" of the adjusted subscores (p. 80).