Using Fleiss’ Kappa Coefficient to Measure the Intra and Inter-Rater Reliability of Three AI Software Programs in the Assessment of EFL Learners’ Story Writing
DOI:
https://doi.org/10.59992/IJESA.2024.v3n1p4
Keywords:
Artificial intelligence, Computational assessment, Writing Assessment, EFL Learners, Intra-reliability, Inter-rater reliability, ELT
Abstract
Story writing is a valuable skill for EFL learners, as it allows them to express their creativity and practice their language proficiency. However, assessing story writing can be challenging and time-consuming for teachers, especially when they have to deal with large classes and multiple criteria. Some researchers have therefore explored the use of artificial intelligence (AI) tools to automate the assessment of story writing and provide feedback to learners, but the reliability of these tools remains questionable. This study aimed to compare the intra- and inter-rater reliability of three AI tools for assessing EFL learners' story writing: Poe.com, Bing, and Google Bard.
The study used a quantitative approach to answer the research questions: Fleiss' Kappa coefficients were calculated with the Datatab software program (available at datatab.com). The sample comprised 14 stories written by adult Libyan EFL learners in response to a prompt provided by the teacher. Each piece was assessed under two sets of criteria: one that included a measure of students' creativity, and one that focused only on the linguistic aspects of the students' writing.
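The Fleiss' Kappa calculation described above can be illustrated with a minimal Python sketch. This is not the Datatab implementation; it follows the standard textbook formula, kappa = (P_bar - P_e) / (1 - P_e), where P_bar is the mean observed agreement per piece and P_e is the agreement expected by chance. The function name `fleiss_kappa`, the score bands "A", "B", "C", and the sample data are all hypothetical and are not taken from the study.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a table of categorical ratings.

    ratings: list of rows, one row per written piece; each row holds the
    category assigned by each rater (or by each repeated attempt of one tool).
    Every row must contain the same number of ratings.
    """
    if not ratings:
        raise ValueError("at least one rated item is required")
    n_raters = len(ratings[0])
    if n_raters < 2:
        raise ValueError("at least two ratings per item are required")
    if any(len(row) != n_raters for row in ratings):
        raise ValueError("every item needs the same number of ratings")

    n_items = len(ratings)
    counts = [Counter(row) for row in ratings]
    categories = {label for row in ratings for label in row}

    # Observed agreement per item, then averaged over items (P_bar).
    per_item = [
        (sum(c * c for c in cnt.values()) - n_raters) / (n_raters * (n_raters - 1))
        for cnt in counts
    ]
    p_bar = sum(per_item) / n_items

    # Chance agreement (P_e) from the overall share of each category.
    total = n_items * n_raters
    p_e = sum((sum(cnt[cat] for cnt in counts) / total) ** 2 for cat in categories)

    if p_e == 1.0:
        # Only one category was ever used, so kappa is undefined.
        return float("nan")
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical data: 4 pieces, each scored three times by the same tool
# (an intra-rater layout). For inter-rater reliability the three columns
# would instead hold one score from each of the three tools.
sample = [
    ["A", "A", "B"],
    ["B", "B", "B"],
    ["C", "B", "C"],
    ["A", "C", "B"],
]
print(round(fleiss_kappa(sample), 3))
```

The same function serves both reliability questions; only the meaning of the columns changes between repeated attempts by one tool and single attempts by different tools.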
With the creativity criterion, Poe's intra-rater reliability was 0.01 (slight), while Bing's and Bard's were each 0.2 (fair), which makes Poe the least reliable assessment tool of the three. For the inter-rater reliability, the same 14 sampled pieces were assessed three times to check the consistency of the results. The inter-rater reliability was 0.04 (slight) on the first assessment, 0.01 (slight) on the second, and -0.03 (no agreement) on the third, so the consistency and reliability of the scores decreased over time.
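The verbal labels attached to these coefficients (slight, fair, no agreement) appear to follow the widely used Landis and Koch bands. The sketch below is an assumption about that mapping, not a description of how Datatab assigns labels. The function name `interpret_kappa` is hypothetical, and negative values are labelled "no agreement" to match the wording of this abstract (Landis and Koch call that band "poor"). A coefficient reported as 0.2 sits on the boundary between "slight" and "fair", so its label depends on the unrounded value; the sketch treats exactly 0.20 as "slight".

```python
def interpret_kappa(kappa):
    """Map a kappa coefficient to a Landis-and-Koch-style agreement label."""
    if kappa < 0:
        return "no agreement"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Inter-rater coefficients reported above for the three assessments.
for value in (0.04, 0.01, -0.03):
    print(value, interpret_kappa(value))
```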
Without the creativity criterion, Poe's intra-rater reliability was 0.05 (slight), while Bing's was -0.02 (no agreement) and Bard's was 0.01 (slight), which makes Bing the least reliable. For the inter-rater reliability, the assessments made by the three software applications were compared, and the same 14 sampled pieces were assessed three times to check the consistency of the results. The inter-rater reliability was 0 (slight) on the first assessment, -0.1 (no agreement) on the second, and -0.13 (no agreement) on the third. Again, the consistency and reliability of the scores decreased over time.
The three applications were reliable to a limited extent when the creativity criterion was included, which goes against the common belief that AI software cannot assess creativity. Still, the reliability measurements with the creativity criterion were not statistically significant, so there is a high probability that the observed agreement is due to random chance. Limitations of this study include the small sample size, the limited number of criteria, and the lack of human raters for comparison. Future research could involve more participants, more criteria, more AI tools, and human raters to provide a more comprehensive and reliable evaluation of AI tools for assessing EFL story writing.