A three-stage approach to the automated scoring of spontaneous spoken responses
This paper presents a description and evaluation of SpeechRater℠, a system for automated scoring of non-native speakers’ spoken English proficiency, based on tasks which elicit spontaneous monologues on particular topics. This system builds on much previous work in the automated scoring of test responses, but differs from previous work in that the highly unpredictable nature of the responses to this task type makes the challenge of accurate scoring much more difficult.
SpeechRater uses a three-stage architecture. Responses are first processed by a filtering model to ensure that no exceptional conditions exist which might prevent them from being scored by SpeechRater. Responses not filtered out at this stage are then processed by the scoring model to estimate the proficiency rating which a human might assign to them, on the basis of features related to fluency, pronunciation, vocabulary diversity, and grammar. Finally, an aggregation model combines an examinee’s scores for multiple items to calculate a total score, as well as an interval in which the examinee’s score is predicted to reside with high confidence.
SpeechRater’s current level of accuracy and construct representation have been deemed sufficient for low-stakes practice exercises, and it has been used in a practice exam for the TOEFL since late 2006. In such a practice environment, it offers a number of advantages compared to human raters, including system load management, and the facilitation of immediate feedback to students. However, it must be acknowledged that SpeechRater presently fails to measure many important aspects of speaking proficiency (such as intonation and appropriateness of topic development), and its agreement with human ratings of proficiency does not yet approach the level of agreement between two human raters.
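The three-stage architecture described above (filtering, scoring, aggregation) can be pictured as a simple pipeline. The following Python sketch is illustrative only and is not SpeechRater's implementation: the `Response` class, the `is_audible` flag, the equal feature weights, the linear scoring rule, the 1 to 4 score scale, and the fixed interval half-width are all assumptions introduced here, since the abstract does not specify the form of any of the three models.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Hypothetical weights over the four feature areas named in the abstract.
# Equal weights are an assumption, not values reported for SpeechRater.
FEATURE_WEIGHTS = {
    "fluency": 0.25,
    "pronunciation": 0.25,
    "vocabulary_diversity": 0.25,
    "grammar": 0.25,
}
SCORE_MIN, SCORE_MAX = 1.0, 4.0   # assumed per-item score scale
INTERVAL_HALF_WIDTH = 0.5         # assumed per-item uncertainty margin


@dataclass
class Response:
    """One spoken response, reduced to feature values assumed to lie in [0, 1]."""
    features: Dict[str, float]
    is_audible: bool = True  # placeholder for whatever the filtering model checks


def passes_filter(response: Response) -> bool:
    """Stage 1: reject responses with an exceptional condition (placeholder rule)."""
    has_all_features = all(name in response.features for name in FEATURE_WEIGHTS)
    return response.is_audible and has_all_features


def score_response(response: Response) -> float:
    """Stage 2: map features to an estimated rating (assumed linear model)."""
    weighted = sum(w * response.features[name] for name, w in FEATURE_WEIGHTS.items())
    return SCORE_MIN + weighted * (SCORE_MAX - SCORE_MIN)


def aggregate(item_scores: List[float]) -> Tuple[float, Tuple[float, float]]:
    """Stage 3: total score plus an interval clipped to the possible score range."""
    n = len(item_scores)
    total = sum(item_scores)
    margin = INTERVAL_HALF_WIDTH * n
    low = max(total - margin, SCORE_MIN * n)
    high = min(total + margin, SCORE_MAX * n)
    return total, (low, high)


def rate_examinee(responses: List[Response]) -> Optional[Tuple[float, Tuple[float, float]]]:
    """Run all three stages; return None if no response survives filtering."""
    scores = [score_response(r) for r in responses if passes_filter(r)]
    if not scores:
        return None
    return aggregate(scores)
```

Calling `rate_examinee` with a list of `Response` objects returns the total score together with a `(low, high)` interval, or `None` when every response is filtered out. Keeping the stages as separate functions mirrors the separation in the text: the filter decides whether a response is scorable at all, so the scoring model never sees responses it was not designed for.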