Corpus-Based Evaluation Models for Quality Assurance of AI-Generated ESL Learning Materials
DOI: https://doi.org/10.63125/m33q0j38

Keywords: Corpus-Based Quality Assurance, AI-Generated ESL Materials, Cohesion and Readability Indices, Lexical Appropriacy, Regression-Based Validation

Abstract
This study addresses the problem that AI-generated ESL learning materials can appear fluent yet vary in accuracy, level appropriateness, and coherence, weakening quality assurance for large-scale cloud and enterprise deployment. The purpose was to develop and validate a corpus-based evaluation model that links corpus indicators to stakeholder quality judgments. Using a quantitative, cross-sectional, case-based design, N = 120 evaluators rated 80 AI-generated texts across four categories (reading passages, dialogues, grammar explanations, and practice prompts) on a five-point Likert instrument. Key dependent variables were overall QA and subscales for accuracy, clarity, coherence, level appropriateness, and pedagogical usefulness; key independent variables were readability control index, lexical appropriacy score, cohesion score, lexical diversity (HD-D), and grammar error rate (errors per 100 words). Analyses used descriptive statistics, Cronbach's alpha, Pearson correlations, and multiple regression with text-type stability checks. Overall perceived quality was acceptable (overall QA M = 3.84, SD = 0.53), with clarity rated highest (M = 3.96) and accuracy lowest (M = 3.72); reliability was strong (overall α = .91). Average corpus indicators were readability control 0.64, lexical appropriacy 0.71, cohesion 0.59, lexical diversity 0.82, and grammar error rate 2.40 errors per 100 words. Corpus-to-human alignment was substantial: readability control correlated with level appropriateness (r = .61), cohesion with coherence (r = .58), lexical appropriacy with clarity (r = .52) and usefulness (r = .49), and grammar error rate with accuracy (r = −.67), all p < .001. A five-predictor regression model predicted overall QA (F(5, 74) = 21.64, p < .001; R² = .59; adjusted R² = .56), with grammar error rate as the strongest predictor (β = −.41), followed by readability (β = .29), cohesion (β = .24), and lexical appropriacy (β = .21); model performance remained stable across text types (R² = .52–.61). The implications are that organizations can operationalize QA as automated gates for error density, readability bands, cohesion thresholds, and vocabulary-profile alignment, reserving human review for borderline cases to improve safety, consistency, and turnaround time in enterprise content workflows.
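As a hedged illustration of how such findings might be operationalized as automated gates ahead of human review, the Python sketch below combines the reported average indicators with threshold values and a weighted composite that mirror only the sign and ordering of the standardized betas. The thresholds, rescaling, field names, and function names are illustrative assumptions, not procedures specified by the study.

```python
# Illustrative sketch only: gate thresholds and weights below are assumptions
# loosely anchored to the averages and standardized betas reported in the
# abstract, not values prescribed by the study.

from dataclasses import dataclass


@dataclass
class CorpusIndicators:
    readability_control: float   # 0-1 alignment with the target readability band
    lexical_appropriacy: float   # 0-1 fit to the target vocabulary profile
    cohesion: float              # 0-1 cohesion index
    lexical_diversity: float     # HD-D estimate
    grammar_error_rate: float    # errors per 100 words


def passes_automated_gates(x: CorpusIndicators) -> bool:
    """Hard gates: texts failing any threshold are routed to human review."""
    return (
        x.grammar_error_rate <= 3.0       # assumed ceiling near the 2.40 average
        and x.readability_control >= 0.60
        and x.cohesion >= 0.55
        and x.lexical_appropriacy >= 0.65
    )


def predicted_quality(x: CorpusIndicators) -> float:
    """Weighted composite echoing the sign and ordering of the reported betas
    (grammar error rate negative and strongest); weights are illustrative."""
    return (
        -0.41 * x.grammar_error_rate / 10.0  # rescaled so all terms are roughly 0-1
        + 0.29 * x.readability_control
        + 0.24 * x.cohesion
        + 0.21 * x.lexical_appropriacy
    )


if __name__ == "__main__":
    # Sample text using the average indicator values reported in the abstract.
    sample = CorpusIndicators(0.64, 0.71, 0.59, 0.82, 2.40)
    print(passes_automated_gates(sample), round(predicted_quality(sample), 3))
```

In a workflow of this kind, texts that clear every gate would be published automatically, while borderline or failing texts would be queued for the human review described in the implications above.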
