Capturing fine-grained teacher performance from student evaluation of teaching via ChatGPT
Keywords:
Student Evaluation of Teaching, Teacher Performance, Automated Labeling, ChatGPT
Abstract
Student evaluation of teaching (SET) is a vital component of educational improvement, yet conventional assessment tools have inherent limitations. Open-ended questions give students a platform to convey authentic sentiments, but the absence of automated labeling tools makes them hard to use in large-scale applications. In response, this study explores the use of ChatGPT for capturing fine-grained teacher performance from SET. Based on a collected dataset and manual coding, the performance of ChatGPT under various strategies, including zero-shot and few-shot prompting, is evaluated and compared with that of supervised models, including CNN, LSTM and BERT. The results show that ChatGPT can achieve commendable performance with a small number of labeled samples. This reduces the dependency on extensive labeled data and offers an effective solution for large-scale applications. In terms of performance, however, a discernible margin persists relative to BERT, an advanced supervised model. Our study also acknowledges that various factors, such as task complexity and prompt clarity, influence ChatGPT’s performance and consistency. In summary, while integrating ChatGPT into practical SET applications holds significant promise, further exploration is needed to align its capabilities with the intricate demands of such applications.
Cited as:
Zhang, B., & Tian, X. (2024). Capturing fine-grained teacher performance from student evaluation of teaching via ChatGPT. Education and Lifelong Development Research, 1(4), 166-179. https://doi.org/10.46690/elder.2024.04.01