Exploring the Effect of Assessment Construct Complexity on Machine Learning Scoring of Argumentation


Argumentation, a key scientific practice, requires students to construct and critique arguments. Timely, large-scale evaluation of their responses depends on automated text scoring systems that rely on machine learning algorithms. Recent work has demonstrated the utility of these automated systems and has proposed expanding the use of machine learning to high-complexity assessments. In this study, we therefore investigated whether the construct complexity of an assessment item affects machine learning model performance. Human experts scored student responses to 17 argumentation items aligned to 3 levels of a learning progression, and we randomly selected 361 responses to use as training sets to build machine learning scoring models for each item. The resulting scoring models showed a range of scoring agreement between computers and humans, as measured by Cohen’s kappa (M = .60; range .38–.89). Most models demonstrated good to almost perfect performance (kappa > .60). Scoring models for more complex constructs, such as those involving multiple dimensions of science learning or higher levels of a learning progression, had lower performance metrics than models for items at lower levels. These negative correlations were significant for the three construct characteristics we examined: complexity, diversity, and structure. Developing automated scoring models for more complex assessment items may therefore require larger training sets or additional model tuning.
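
For readers unfamiliar with the agreement statistic reported above, the following is a minimal sketch of how unweighted Cohen's kappa can be computed for one item from paired human and machine scores. The function name `cohen_kappa` and the score lists are hypothetical illustrations, not the study's code or data, and the abstract does not state whether a weighted variant was used.

```python
from collections import Counter


def cohen_kappa(human, machine):
    """Unweighted Cohen's kappa for two equal-length lists of categorical scores."""
    if len(human) != len(machine) or not human:
        raise ValueError("score lists must be non-empty and of equal length")
    n = len(human)
    # Observed agreement: fraction of responses that received the same score.
    p_o = sum(h == m for h, m in zip(human, machine)) / n
    # Chance agreement, from each rater's marginal score distribution.
    h_counts, m_counts = Counter(human), Counter(machine)
    p_e = sum(h_counts[c] * m_counts[c] for c in h_counts) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical scores for ten responses to a single item (not study data).
human_scores = [0, 1, 1, 0, 1, 0, 1, 1, 0, 0]
machine_scores = [0, 1, 0, 0, 1, 0, 1, 1, 1, 0]
kappa = cohen_kappa(human_scores, machine_scores)
print(f"kappa = {kappa:.2f}")
```

In this made-up example the raters agree on 8 of 10 responses, but half of that agreement is expected by chance, which is why kappa is lower than the raw agreement rate.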


Kevin Haudek, Xiaoming Zhai



Conference Name

NARST Annual Conference




National Association for Research on Science Teaching and Learning
