
What automarking means for language test validity and integrity


Audience: Teachers
Category: Insight
Date Published: 13 May 2026

Advances in machine learning (ML) and artificial intelligence (AI) have driven rapid adoption of automarking in language assessments. 

As use of the technology expands, questions about what it means for test validity become harder to ignore. After all, scores are often used to make high-stakes decisions that affect access to education, employment and migration.

For institutions, test scores inform decisions about admissions, placement and academic support. They need confidence in results to ensure they’re accepting students with the right level of language proficiency to thrive on and off campus.

To ensure test integrity, it is therefore vital to understand how automarking affects test validity before introducing a new scoring system.

What is automarking in the context of language assessment?

Automarkers are algorithms – trained using ML techniques – that evaluate and mark open-ended spoken and written responses to tasks.

In high-stakes language assessments, automarking supports the scoring process by marking specific test components and flagging unusual responses for review. To be fit for that purpose, it must be at least as accurate as a highly trained human examiner and use a scoring model that can be understood and explained.
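To make this concrete, here is a minimal sketch of how a feature-based automarker might score a written response and flag unusual ones for human review. Everything here is illustrative: the feature names, weights and thresholds are invented assumptions, and a production system would learn its scoring model from large sets of examiner-marked responses.

```python
from dataclasses import dataclass

# Hypothetical feature vector for one written response. In a real system,
# these features would be produced by NLP components trained on large
# corpora of examiner-marked responses.
@dataclass
class ResponseFeatures:
    grammar_accuracy: float   # 0..1, share of clauses with no detected errors
    lexical_range: float      # 0..1, vocabulary diversity measure
    task_relevance: float     # 0..1, similarity between response and prompt
    length_adequacy: float    # 0..1, response length vs. task expectation

# Illustrative weights only; a production automarker learns these from
# human-marked training data rather than using fixed values.
WEIGHTS = {
    "grammar_accuracy": 2.5,
    "lexical_range": 2.0,
    "task_relevance": 3.0,
    "length_adequacy": 1.5,
}

def automark(f: ResponseFeatures, max_score: float = 9.0):
    """Return (score, flagged): a scaled score and whether the response
    looks unusual enough to route to a human examiner."""
    raw = (WEIGHTS["grammar_accuracy"] * f.grammar_accuracy
           + WEIGHTS["lexical_range"] * f.lexical_range
           + WEIGHTS["task_relevance"] * f.task_relevance
           + WEIGHTS["length_adequacy"] * f.length_adequacy)
    score = max_score * raw / sum(WEIGHTS.values())
    # Flag off-topic or truncated responses for human review rather than
    # trusting the model's score.
    flagged = f.task_relevance < 0.3 or f.length_adequacy < 0.2
    return round(score, 1), flagged

print(automark(ResponseFeatures(0.8, 0.7, 0.9, 1.0)))  # strong, on-topic response
print(automark(ResponseFeatures(0.9, 0.8, 0.1, 1.0)))  # off-topic: flagged for review
```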

Generally, automarking is more achievable for writing tasks than for speaking tasks. Capturing and analysing speech is more complex, particularly when learner speech is heavily accented and assessment depends on interaction and nuance.

Used well, automarking provides a range of benefits:

  • Efficiency and scalability
  • Standardisation
  • Time-to-results
  • Flexibility and availability
  • Confidence in results

A well-trained automarker also performs consistently over time, without the ongoing training and standardisation that human examiners require.

The relationship between scoring and test validity

Test validity is how well an assessment measures what it’s designed to measure and, therefore, whether you can use test results confidently to make important decisions. 

Validity relates to all aspects of testing, including:

  • Conceptualising the skills assessed
  • Designing, developing and implementing the test
  • Evaluating the consequences of test use
  • Gathering evidence to support proposed test uses

Reliable scores are key to validity. To be trusted, scoring systems must perform consistently across tasks, test conditions and proficiency levels. They must also closely agree with trained human examiners.

Importantly, validity isn’t a given when introducing a new scoring system. Before implementation, test providers must examine new methods and prove they’re at least as accurate and reliable as their current assessment standard.

How automarking impacts test validity

While automarkers excel in many areas, they fall short in others. 

For example, in speaking assessments, they can easily evaluate quantifiable factors like frequency of long pauses and speech rate. They also perform well with constrained tasks, like short responses, reading aloud or answering set questions. 
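As an illustration, features like these are straightforward to compute once a speech recogniser has produced word-level timestamps. The timestamp format, sample values and 0.5-second pause threshold in this sketch are assumptions, not taken from any real automarker.

```python
# Two quantifiable speaking features, assuming an upstream speech
# recogniser has produced word-level timestamps (start, end, word)
# in seconds.

def speech_rate_wpm(words):
    """Words per minute over the span of the response."""
    duration = words[-1][1] - words[0][0]
    return 60.0 * len(words) / duration

def long_pause_count(words, threshold=0.5):
    """Number of inter-word silences longer than `threshold` seconds."""
    return sum(
        1
        for (_, end, _), (start, _, _) in zip(words, words[1:])
        if start - end > threshold
    )

timestamps = [
    (0.0, 0.4, "the"), (0.5, 0.9, "library"), (2.1, 2.4, "is"),
    (2.5, 3.0, "open"), (3.1, 3.5, "on"), (4.8, 5.2, "weekends"),
]
print(f"{speech_rate_wpm(timestamps):.0f} words/min")  # ~69
print(long_pause_count(timestamps), "long pauses")     # 2
```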

However, it’s harder to use automarking to assess other aspects of oral communication reliably. This includes skills like turn management in a conversation, nuance, argumentation and implied meaning, which are more qualitative. 

In some instances, algorithm complexity makes it hard to determine how automarkers arrived at a specific result. This lack of transparency around scoring criteria is particularly concerning in high-stakes contexts where a small difference in scores can have a significant impact on test takers.

In addition, an automarker may not score all test takers accurately. If certain test-taker profiles are missing or underrepresented in the training data, the automarker may never properly learn to score their responses. If it were the only means of evaluating performance, this inconsistency in accuracy across test takers would introduce bias into test scores.

Finally, automarker accuracy may be affected by test takers cheating or using test-gaming behaviours to trick the scoring system rather than demonstrating their language abilities. Usually, separate systems are needed to identify unusual responses and flag them for review by trained human examiners (Gao et al., 2024).
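As a hedged sketch of what such flagging might look like, the heuristics below catch two of these behaviours: prompt-copying (high word overlap with the task prompt) and repetitive language (low vocabulary diversity). Real detection systems are considerably more sophisticated, and the thresholds here are purely illustrative.

```python
# Simple heuristics for flagging responses for human review.
# Thresholds are illustrative, not taken from any real system.

def prompt_overlap(response: str, prompt: str) -> float:
    """Share of the response's words that also appear in the prompt."""
    resp, prom = set(response.lower().split()), set(prompt.lower().split())
    return len(resp & prom) / max(len(resp), 1)

def type_token_ratio(response: str) -> float:
    """Unique words divided by total words; low values suggest repetition."""
    tokens = response.lower().split()
    return len(set(tokens)) / max(len(tokens), 1)

def flag_for_review(response: str, prompt: str) -> bool:
    return (prompt_overlap(response, prompt) > 0.8
            or type_token_ratio(response) < 0.3)

prompt = "Describe a place you like to visit and explain why"
print(flag_for_review("I like to visit a place and explain why", prompt))  # True
print(flag_for_review("My favourite spot is the old harbour because it is calm",
                      prompt))  # False
```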

How best practice overcomes some limitations of automarking

Automarking is improving all the time, so the technology has great potential. Applying best practices helps mitigate many of its risks and limitations.

  • Broad training data: The data the system relies on must represent all test takers, including different ability levels and language backgrounds. It should also be monitored continuously as that population changes.
  • Multiple scoring features: It’s essential to have a range of features that represent what the test aims to measure and what trained human examiners assess. For speaking, these must go beyond grammar, vocabulary, pronunciation and fluency.
  • Benchmarking against human judgement: Automarkers should be continuously compared against a 'gold standard' established through double-marking or even multiple marking by certified human examiners (see the sketch after this list).
  • Researching measures of automarker confidence: These indicate when automarked scores are less trustworthy, so that human marking can be relied on instead.
  • Cheating detection: Active mechanisms can flag malpractice such as irrelevant content, repetitive language or prompt-copying. Broad scoring features also reduce the opportunities for gaming the system.
  • Transparency and interpretability: Information about how automarking systems work, the criteria they use, and what scores mean must be accessible and explainable for institutions and test takers.
  • Fitness for purpose: Using automarkers in low-stakes practice contexts has very different implications from using them in a high-stakes university entry test. Those relying on test scores must take care to understand the marking system when choosing the right test for each situation.
  • Human involvement: Many experts advocate a hybrid approach where responses that fall below a confidence threshold are escalated to human examiners. This is in line with the UK’s Ofqual guidelines, which state AI cannot be the only means of determining results for high-stakes qualifications.  
  • Broad evaluation: Automarking shouldn’t only be judged on its results, but also on how it shapes learning, preparation and performance. For example, well-designed tests can foster positive, lasting study habits. Similarly, using automarked tests during preparation can positively or negatively impact learning activities and test-taker attitudes (Gong, 2023).
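The benchmarking sketch referenced in the list shows one simple way to compare an automarker against a human 'gold standard'. The scores below are invented for illustration; operational validation studies use much larger samples and agreement metrics such as quadratic weighted kappa alongside correlation.

```python
# Benchmarking an automarker against double-marked human scores,
# assuming paired scores on the same set of responses.
from statistics import mean

human_a = [6.0, 7.0, 5.5, 8.0, 6.5, 4.0]   # first certified examiner
human_b = [6.5, 7.0, 5.0, 8.0, 6.0, 4.5]   # second certified examiner
machine = [6.0, 7.5, 5.5, 7.5, 6.5, 4.0]   # automarker scores

# Gold standard: the average of the two human marks per response.
gold = [(a + b) / 2 for a, b in zip(human_a, human_b)]

def pearson(x, y):
    """Pearson correlation between two paired score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Human-human agreement sets the bar the automarker must meet or exceed.
print(f"human-human r:  {pearson(human_a, human_b):.3f}")
print(f"machine-gold r: {pearson(machine, gold):.3f}")
```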

These best practices mitigate risk, but don’t eliminate it. So, for the time being, a hybrid approach is essential to ensure scoring efficiency and reliability – and safeguard test validity.
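As a minimal sketch of that hybrid routing – assuming the automarker reports a confidence value alongside each score – responses below a threshold are escalated to a human examiner rather than accepted. The threshold and the example values are hypothetical.

```python
# Hybrid routing: accept high-confidence automarked scores, escalate
# the rest to human examiners. Threshold and data are hypothetical.
CONFIDENCE_THRESHOLD = 0.85

def route(response_id: str, score: float, confidence: float) -> dict:
    needs_human = confidence < CONFIDENCE_THRESHOLD
    return {
        "response": response_id,
        "provisional_score": score,
        "decision": ("escalate to human examiner" if needs_human
                     else "accept automarked score"),
    }

for rid, score, conf in [("r-001", 7.0, 0.93), ("r-002", 5.5, 0.61)]:
    print(route(rid, score, conf))
```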

Automarking: Best practices mitigate risks so we can realise the benefits  

Automarking is a rapidly evolving technology with huge potential benefits for test takers and stakeholders. But it has limitations, such as difficulty evaluating higher-order skills like organisation, idea development and the nuances of human communication.

Awarding inaccurate automarked scores can also have serious consequences for test takers in high-stakes contexts like university admissions.

That means ongoing research and careful implementation remain vital, recognising that automated systems will always have limitations. Applying best practices and effective processes helps mitigate those limitations and reduce the risk of inaccurate scores. This is essential for responsible, ethical AI use in high-stakes assessment.

The priority remains to use a hybrid model combining automation with a high degree of human oversight and control – so scores accurately reflect candidates’ true abilities.