Comparison of Computer Scoring Model Performance for Short Text Responses Across Undergraduate Institutional Types


Constructed response (CR) assessments allow students to demonstrate understanding of complex topics and provide teachers with deeper insight into student thinking. Computer scoring models (CSMs) remove the barrier of increased time and effort, making CR assessments more accessible. Because CSMs are commonly built using responses from research-intensive colleges and universities (RICUs), this pilot study examines the effectiveness of seven previously developed CSMs on diverse CRs from RICUs, two-year colleges (TYCs), and primarily undergraduate institutions (PUIs). We asked whether the accuracy of the CSMs was maintained with a new testing set of CRs and whether CSM accuracy differed among institutional types. A human scorer and the CSMs analytically categorized 444 CRs for the presence or absence of seven ideas relating to weight loss. Comparing human and CSM predictions revealed that five CSMs maintained high agreement (Cohen’s kappa > 0.80); however, two CSMs demonstrated reduced agreement (Cohen’s kappa < 0.65). Seventy-one percent of the miscodes from these two CSMs were false negatives. Across all seven CSMs, RICU responses were 1.4 times more likely to be miscoded than TYC (p = 0.038) or PUI (p = 0.047) responses. However, this increased frequency may result from the higher number of ideas in RICU responses compared with TYC (p = 0.082) and PUI (p = 0.013) responses. Accounting for this increased number of ideas removed the significant difference between RICUs and TYCs (p = 0.23) and between RICUs and PUIs (p = 0.54). Finally, qualitative examination of miscodes provides insight into the reduced CSM performance. Collectively, these data support the utility of these CSMs across institutional types and with novel CRs.
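The agreement statistic used in the abstract, Cohen’s kappa, corrects raw percent agreement for agreement expected by chance. A minimal sketch in Python for binary human-versus-model codings is shown below; the sample codings are hypothetical illustrations, not data from the study:

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters over the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is chance agreement from the raters' marginal frequencies.
    """
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    labels = set(rater_a) | set(rater_b)
    # observed proportion of items on which the raters agree
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # expected chance agreement from each rater's label frequencies
    p_e = sum((rater_a.count(l) / n) * (rater_b.count(l) / n) for l in labels)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical codings: 1 = idea present, 0 = idea absent
human = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
model = [1, 0, 1, 0, 0, 1, 0, 0, 1, 1]
print(round(cohens_kappa(human, model), 2))  # → 0.8
```

Here 9 of 10 codings agree (p_o = 0.9) but half that agreement is expected by chance (p_e = 0.5), giving kappa = 0.8, just at the study’s threshold for high agreement.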


Megan Shiroda, Juli Uhl, Mark Urban-Lurain, Kevin Haudek

Journal of Science Education and Technology
