International Journal of
Information and Education Technology

Editor-In-Chief: Prof. Jon-Chao Hong
Frequency: Monthly
ISSN: 2010-3689 (Online)
E-mail: editor@ijiet.org
Publisher: IACSIT Press
 

OPEN ACCESS | CiteScore: 3.2

IJIET 2025 Vol.15(12): 2718-2729
doi: 10.18178/ijiet.2025.15.12.2467

Leveraging Large Language Models for Arabic Short Answer Grading and Feedback Generation

Emad Nabil1,*, Mostafa Mohamed Saeed2, Rana Reda3, Safiullah Faizullah4, and Wael Hassan Gomaa5
1. Faculty of Computer and Information Systems, Islamic University of Madinah, Madinah, Saudi Arabia
2. Computational Approaches to Modeling Language (CAMeL) Lab, New York University, Abu Dhabi, United Arab Emirates
3. Digital Egypt for Investment Co., Cairo, Egypt
4. Faculty of Computer and Information Systems, Islamic University of Madinah, Madinah, Saudi Arabia
5. Faculty of Computers and Artificial Intelligence, Beni-Suef University, Beni-Suef, Egypt
Email: e.nabil@fci-cu.edu.eg (E.N.); mms10094@nyu.edu (M.M.S.); rana.reda@defi.com.eg (R.R.); safi@iu.edu.sa (S.F.); wael.goma@gmail.com (W.H.G.)
*Corresponding author

Manuscript received May 7, 2025; revised June 10, 2025; accepted July 25, 2025; published December 16, 2025

Abstract—This paper explores the potential of Large Language Models (LLMs) for automating the grading of short-answer responses in Arabic and for generating feedback on them. Arabic poses unique challenges due to its linguistic complexity and the relative scarcity of well-developed Natural Language Processing (NLP) resources compared to languages such as English and Chinese. The study evaluates both proprietary models (GPT-4) and open-source models (Llama 3-8B, Llama 3-70B, and DeepSeek-V3) on the Environmental Science Corpus, a custom-designed dataset tailored for Arabic short-answer assessment. Two core tasks are addressed: grading and feedback generation. In the grading task, DeepSeek-V3 achieved the best performance, with a Quadratic Weighted Kappa (QWK) score of 0.8273, a Pearson correlation of 86.09%, and a Root Mean Squared Error (RMSE) of 0.76, indicating near-perfect agreement with human evaluators. GPT-4 ranked second, followed by Llama 3-70B, while Llama 3-8B was the lowest-performing model. In feedback generation, DeepSeek-V3 again led, achieving a human evaluation score of 79.61% for generating accurate and constructive feedback. GPT-4 ranked second, followed by the Llama 3 models. Statistical analysis using the Wilcoxon test revealed significant performance differences among all models (p < 0.05), indicating that each LLM offers distinct capabilities in handling Arabic short-answer grading. Overall, the results underscore the effectiveness of LLMs in Arabic educational assessment and highlight the critical role of prompt engineering in enhancing model performance. The study demonstrates that LLMs can not only grade student responses with high accuracy but also generate meaningful feedback, thereby supporting the development of more effective automated learning tools. Practical recommendations and best practices are presented to help educators and developers optimize the use of LLMs in Arabic-language educational settings, laying the groundwork for future advancements in Arabic NLP.
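For readers unfamiliar with the agreement metric reported above, the following is a minimal, illustrative sketch of Quadratic Weighted Kappa (QWK) in pure Python. The function name and the toy ratings are hypothetical examples, not taken from the paper or its dataset; in practice one would typically use a library routine such as scikit-learn's `cohen_kappa_score` with `weights="quadratic"`.

```python
from collections import Counter

def quadratic_weighted_kappa(rater_a, rater_b, min_rating, max_rating):
    """QWK between two lists of integer ratings on the same scale.

    QWK = 1 - sum(w * O) / sum(w * E), where O is the observed
    rating-pair matrix, E is the expected matrix under independent
    marginals, and w_ij = (i - j)^2 / (n - 1)^2 penalizes larger
    disagreements quadratically.
    """
    n = max_rating - min_rating + 1
    total = len(rater_a)
    # Observed agreement matrix.
    observed = [[0] * n for _ in range(n)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_rating][b - min_rating] += 1
    # Marginal histograms for each rater.
    hist_a = Counter(a - min_rating for a in rater_a)
    hist_b = Counter(b - min_rating for b in rater_b)
    num = den = 0.0
    for i in range(n):
        for j in range(n):
            w = (i - j) ** 2 / (n - 1) ** 2
            expected = hist_a[i] * hist_b[j] / total
            num += w * observed[i][j]
            den += w * expected
    return 1.0 - num / den

# Perfect agreement yields kappa = 1; disagreement pulls it toward 0.
print(quadratic_weighted_kappa([0, 1, 2, 3], [0, 1, 2, 3], 0, 3))  # 1.0
```

A QWK around 0.83, as reported for DeepSeek-V3, falls in the range conventionally described as near-perfect agreement with the human graders.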

Keywords—Arabic short answer grading, Large Language Models (LLMs), prompt engineering, GPT-4, Llama-3, DeepSeek-V3



Cite: Emad Nabil, Mostafa Mohamed Saeed, Rana Reda, Safiullah Faizullah, and Wael Hassan Gomaa, "Leveraging Large Language Models for Arabic Short Answer Grading and Feedback Generation," International Journal of Information and Education Technology, vol. 15, no. 12, pp. 2718-2729, 2025.


Copyright © 2025 by the authors. This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited (CC BY 4.0).

