Do Adjusted Subscores Lack Validity? Don’t Blame the Messenger
Sandip Sinharay1, Shelby J. Haberman1, and Howard Wainer2

1 Educational Testing Service, Princeton, NJ, USA
2 National Board of Medical Examiners, Philadelphia, PA, USA

Corresponding Author: Sandip Sinharay, Educational Testing Service, 12T Rosedale Road, Princeton, NJ 08541, USA. Email: [email protected]

Educational and Psychological Measurement, 71(5), 789–797. © The Author(s) 2011. DOI: 10.1177/0013164410391782
Abstract
There are several techniques that increase the precision of subscores by borrowing information from other parts of the test. These techniques have been criticized on validity grounds in several recent publications. In this note, the authors question the argument used in these publications and suggest both inherent limits to the validity argument and empirical issues worth examining.
Keywords
subscores, validity, augmented subscore
Introduction: Subscores and Adjusted Subscores
There are several techniques that increase the precision of subscores by borrowing information from other parts of the test. These techniques have been criticized on validity grounds in several recent publications such as Skorupski and Carvajal (2010) and Stone, Ye, Zhu, and Lane (2010). In this note, we question the argument used in these publications and suggest both inherent limits to the validity argument and empirical issues worth examining. We begin with an introduction to the techniques that borrow information from other parts of the test as part of the subscore computation process and then evaluate the validity arguments advanced recently concerning these techniques.
Interest in subscores in educational testing reflects their potential remedial and instructional benefit. According to the National Research Council report "Knowing What Students Know" (2001), the target of assessment is to provide particular information about an examinee's knowledge, skills, and abilities. Subscores have the potential to provide such information; however, they are too often not reliable enough for their intended purposes. Several researchers have suggested methods that increase the precision of subscores by borrowing information from other related scores or subscores. For example,
• Wainer, Sheehan, and Wang (2000) and Wainer, Vevea, et al. (2001) suggested the augmented subscore, which is a function of an examinee's score on the subscale of interest and that examinee's scores on the remaining subscales.
• Yen (1987) suggested the objective performance index (OPI), which is a weighted average of the observed subscore and an estimate of the observed subscore obtained using a unidimensional item response theory (IRT) model for the entire test.
• Haberman (2008a) suggested a weighted average of a subscore and the total score (a brief sketch of this idea appears after this list). Sinharay (2010) found that this weighted average is typically very similar to the augmented subscore (Wainer et al., 2000).
• Several researchers (de la Torre & Patz, 2005; Haberman & Sinharay, 2010; Luecht, 2003; Yao & Boughton, 2007) suggested using estimated abilities, or their transformations, obtained from a multivariate IRT (MIRT) model as subscores. For background on MIRT models, see, for example, Reckase (1997).
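The common idea in these methods is to predict the true subscore from more than one observed quantity. The sketch below is our own minimal illustration of a Haberman-type weighted average, not code from any of the papers cited above; the function name and the assumption that the subscore reliability (rel_s) and the true-subscore/total-score correlation (r_tau_x) are known in advance are ours.

```python
import numpy as np

def adjusted_subscore(s, x, rel_s, r_tau_x):
    """Best linear predictor of the true subscore tau from (s, x).

    s, x    : arrays of observed subscores and total scores
    rel_s   : subscore reliability, so var(tau) = rel_s * var(s)
    r_tau_x : assumed correlation between tau and the total score
    """
    s, x = np.asarray(s, float), np.asarray(x, float)
    var_s, var_x = s.var(ddof=1), x.var(ddof=1)
    cov_sx = np.cov(s, x)[0, 1]
    var_tau = rel_s * var_s
    cov_tau_s = var_tau                # measurement error is uncorrelated with tau
    cov_tau_x = r_tau_x * np.sqrt(var_tau * var_x)
    # Solve the 2x2 normal equations for the weights on s and x
    a, b = np.linalg.solve([[var_s, cov_sx], [cov_sx, var_x]],
                           [cov_tau_s, cov_tau_x])
    # The intercept uses E[tau] = E[s], because errors have mean zero
    return s.mean() + a * (s - s.mean()) + b * (x - x.mean())
```

When the weight on the total score is zero, this reduces to classical Kelley regression toward the subscore mean; a nonzero weight is precisely the "borrowing" of information from the rest of the test that the criticisms discussed below are aimed at.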
The scores obtained from the above-mentioned approaches will be referred to as "adjusted subscores."1 Researchers have found that adjusted subscores are more reliable, often substantially so, than the subscores themselves (Dwyer, Boughton, Yao, Steffen, & Lewis, 2006; Sinharay, 2010; Skorupski & Carvajal, 2010; Stone, Ye, Zhu, & Lane, 2010).
Recent Criticisms of Adjusted Subscores
The validity of adjusted subscores has been questioned recently. Skorupski and Carvajal (2010) studied four subscores from a large statewide test and found that the corresponding OPIs and the augmented subscores (Wainer et al., 2000) were highly correlated among themselves. The correlations between augmented subscores were 0.97 or greater, and those between the OPIs were all 1.00. Skorupski and Carvajal (2010) commented that this phenomenon of high correlations among the adjusted subscores (which means that the rank orderings for the four adjusted subscores are very similar) leads to potential loss of meaning of the subscores and "reduces, if not eliminates, the utility of the subscores for the diagnostic purposes for which they are intended. This begs the question: Are the augmented subscores providing more useful information than the raw ones?" (p. 372). They went on to comment that "although augmentation dramatically improves the reliability of subscores, it may in fact negatively affect the validity of score interpretations" (p. 372). In the abstract of their article, they commented that the near-perfect correlations among the adjusted subscores "called into question the validity of the resultant subscores, and therefore the usefulness of the subscore augmentation process."
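The mechanics behind such high correlations are easy to reproduce. The toy simulation below is our own illustration with arbitrary parameter values; it uses a crude blend with the standardized total score as a stand-in for the actual OPI and augmentation formulas, so the numbers only illustrate the direction of the effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Two true subscores correlated .80; each is observed with enough noise
# to give a subscore reliability of about .60 (arbitrary values).
L = np.linalg.cholesky(np.array([[1.0, 0.8], [0.8, 1.0]]))
tau = rng.standard_normal((n, 2)) @ L.T                  # true subscores
obs = tau + rng.standard_normal((n, 2)) * np.sqrt(1 / 0.6 - 1)

# A crude stand-in for augmentation: blend each observed subscore with
# the standardized total score.
total = obs.sum(axis=1)
z_total = (total - total.mean()) / total.std()
adj = 0.5 * obs + 0.5 * z_total[:, None]

print(np.corrcoef(obs[:, 0], obs[:, 1])[0, 1])  # raw subscores: ~.48
print(np.corrcoef(adj[:, 0], adj[:, 1])[0, 1])  # adjusted: ~.82
```

Here the raw subscores correlate about .48 while the adjusted ones correlate about .82, and pushing more weight onto the total score pushes the intercorrelation further toward 1.00, which is the pattern Skorupski and Carvajal observed.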
Stone et al. (2010) studied the four subscores from the spring 2006 administration of the Delaware Student Testing Program Grade 8 mathematics assessment. They found the augmented subscores, the OPIs, and the MIRT-based subscores to be highly correlated among themselves and commented that "it may be that adjusted subscale scores represent the measurement of a construct that is different from the construct being measured by the unadjusted subscale scores" (p. 80). They commented that borrowing information from other subscales poses a "potential threat to validity" of the adjusted subscores (p. 80).
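The "different construct" concern can be made concrete with the same kind of toy setup. The sketch below (again our own illustration, not the Stone et al. analysis) compares how strongly a raw and an adjusted subscore correlate with the true score they target versus the other subscale's true score.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Same toy setup as above: true subscores correlated .80 and a
# subscore reliability of about .60 (arbitrary illustrative values).
L = np.linalg.cholesky(np.array([[1.0, 0.8], [0.8, 1.0]]))
tau = rng.standard_normal((n, 2)) @ L.T                  # true subscores
obs = tau + rng.standard_normal((n, 2)) * np.sqrt(1 / 0.6 - 1)

total = obs.sum(axis=1)
z_total = (total - total.mean()) / total.std()
adj = 0.5 * obs + 0.5 * z_total[:, None]                 # crude adjustment

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

# Correlation with the targeted true subscore vs. the other true subscore
print(corr(obs[:, 0], tau[:, 0]), corr(obs[:, 0], tau[:, 1]))  # ~.77, ~.62
print(corr(adj[:, 0], tau[:, 0]), corr(adj[:, 0], tau[:, 1]))  # ~.82, ~.73
```

In this toy case the adjusted score tracks its own true score better (about .82 versus .77) but also correlates more strongly with the other construct (about .73 versus .62); whether that trade-off degrades the score's meaning is exactly the empirical question at issue.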