Analysing the Problem of Automatic Evaluation of Language Generation Systems

  1. Martínez-Murillo, Iván
  2. Moreda, Paloma
  3. Lloret, Elena
Journal: Procesamiento del Lenguaje Natural

ISSN: 1135-5948

Year of publication: 2024

Issue Title: Procesamiento del Lenguaje Natural, Issue 72, March 2024

Issue: 72

Pages: 123-136

Type: Article

Abstract

Automatic text evaluation metrics are widely used to measure the performance of Natural Language Generation (NLG) systems. However, these metrics have several limitations. This article empirically analyses the problems with current evaluation metrics, such as their inability to measure the semantic quality of a text and their strong dependence on the reference texts they are compared against. Additionally, traditional NLG systems are compared with more recent systems based on neural networks. Finally, an experiment with GPT-4 is proposed to determine whether it is a reliable source for evaluating the validity of a text. From the results obtained, it can be concluded that, according to current automatic metrics, the improvement of neural systems over traditional ones is not so significant. However, when the qualitative aspects of the generated texts are analysed, this improvement is reflected.

Bibliographic References

  • Aghahadi, Z. and A. Talebpour. 2022. Avicenna: a challenge dataset for natural language generation toward commonsense syllogistic reasoning. Journal of Applied Non-Classical Logics, 32(1):55–71.
  • Anderson, P., B. Fernando, M. Johnson, and S. Gould. 2016. SPICE: Semantic propositional image caption evaluation. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part V 14, pages 382–398. Springer.
  • Appelt, D. 1985. Planning English Sentences. Cambridge University Press.
  • Banerjee, S. and A. Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72.
  • Bateman, J. A. 1997. Enabling technology for multilingual natural language generation: the KPML development environment. Natural Language Engineering, 3(1):15–55.
  • Bhargava, P. and V. Ng. 2022. Commonsense knowledge reasoning and generation with pre-trained language models: A survey. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 12317–12325.
  • Braun, D., K. Klimt, D. Schneider, and F. Matthes. 2019. SimpleNLG-DE: Adapting SimpleNLG 4 to German. In Proceedings of the 12th International Conference on Natural Language Generation, pages 415–420, Tokyo, Japan, October–November. Association for Computational Linguistics.
  • Carlsson, F., J. Öhman, F. Liu, S. Verlinden, J. Nivre, and M. Sahlgren. 2022. Fine-grained controllable text generation using non-residual prompting. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6837–6857.
  • Cascallar-Fuentes, A., A. Ramos-Soto, and A. Bugarín Diz. 2018. Adapting SimpleNLG to Galician language. In Proceedings of the 11th International Conference on Natural Language Generation, pages 67–72, Tilburg University, The Netherlands, November. Association for Computational Linguistics.
  • Chen, G., K. van Deemter, and C. Lin. 2018. SimpleNLG-ZH: a linguistic realisation engine for Mandarin. In Proceedings of the 11th International Conference on Natural Language Generation, pages 57–66, Tilburg University, The Netherlands, November. Association for Computational Linguistics.
  • Dong, C., Y. Li, H. Gong, M. Chen, J. Li, Y. Shen, and M. Yang. 2023. A survey of natural language generation. ACM Computing Surveys, 55(8):1–38.
  • Fu, J., S.-K. Ng, Z. Jiang, and P. Liu. 2023. GPTScore: Evaluate as you desire. arXiv preprint arXiv:2302.04166.
  • Gatt, A. and E. Krahmer. 2018. Survey of the state of the art in natural language generation: Core tasks, applications and evaluation. Journal of Artificial Intelligence Research, 61:65–170.
  • Gatt, A. and E. Reiter. 2009. SimpleNLG: A realisation engine for practical applications. In Proceedings of the 12th European Workshop on Natural Language Generation (ENLG 2009), pages 90–93.
  • Guo, K. 2022. Testing and validating the cosine similarity measure for textual analysis. Available at SSRN 4258463.
  • Han, J., M. Kamber, and J. Pei. 2012. Chapter 2 - Getting to Know Your Data. In Data Mining (Third Edition), The Morgan Kaufmann Series in Data Management Systems. Morgan Kaufmann, Boston, pages 39–82.
  • He, X., Y. Gong, A.-L. Jin, W. Qi, H. Zhang, J. Jiao, B. Zhou, B. Cheng, S. Yiu, and N. Duan. 2022. Metric-guided distillation: Distilling knowledge from the metric to ranker and retriever for generative commonsense reasoning. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 839–852, Abu Dhabi, United Arab Emirates, December. Association for Computational Linguistics.
  • Hovy, E. 1987. Generating natural language under pragmatic constraints. Journal of Pragmatics, 11(6):689–719.
  • Ji, Z., N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. J. Bang, A. Madotto, and P. Fung. 2023. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38.
  • Khapra, M. M. and A. B. Sai. 2021. A tutorial on evaluation metrics used in natural language generation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Tutorials, pages 15–19.
  • Kincaid, J. P., R. P. Fishburne Jr, R. L. Rogers, and B. S. Chissom. 1975. Derivation of new readability formulas (automated readability index, fog count and flesch reading ease formula) for Navy enlisted personnel. Naval Technical Training Command, Millington, Tenn.
  • Koller, A. and M. Stone. 2007. Sentence generation as a planning problem. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 336–343, Prague, Czech Republic, June. Association for Computational Linguistics.
  • Kusner, M., Y. Sun, N. Kolkin, and K. Weinberger. 2015. From word embeddings to document distances. In International Conference on Machine Learning, pages 957–966. PMLR.
  • Lemon, O. 2011. Learning what to say and how to say it: Joint optimisation of spoken dialogue management and natural language generation. Computer Speech & Language, 25(2):210–221.
  • Levelt, W. 1989. Speaking: From Intention to Articulation. MIT Press, Cambridge, MA.
  • Lin, B. Y., W. Zhou, M. Shen, P. Zhou, C. Bhagavatula, Y. Choi, and X. Ren. 2020. CommonGen: A constrained text generation challenge for generative commonsense reasoning. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1823–1840, Online, November. Association for Computational Linguistics.
  • Lin, C.-Y. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81.
  • Lo, C.-k., A. K. Tumuluru, and D. Wu. 2012. Fully automatic semantic MT evaluation. In Proceedings of the Seventh Workshop on Statistical Machine Translation, pages 243–252.
  • Mann, W. C. and J. A. Moore. 1981. Computer generation of multiparagraph English text. American Journal of Computational Linguistics, 7(1):17–29.
  • McDonald, D. D. 2010. Natural language generation. Handbook of natural language processing, 2:121–144.
  • Mirza, M., B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, I. J. Goodfellow, and J. Pouget-Abadie. 2014. Generative adversarial nets. Advances in Neural Information Processing Systems, 27:2672–2680.
  • Nakatsu, C. and M. White. 2010. Generating with discourse combinatory categorial grammar. Linguistic Issues in Language Technology, 4.
  • Nirenburg, S., V. R. Lesser, and E. Nyberg. 1989. Controlling a language generation planner. In IJCAI, pages 1524–1530.
  • OpenAI. 2023. GPT-4 technical report.
  • Papineni, K., S. Roukos, T. Ward, and W.-J. Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.
  • Popović, M. 2015. chrF: character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 392–395.
  • Radford, A., J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.
  • Raffel, C., N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text Transformer. Journal of Machine Learning Research, 21(1), January.
  • Ramos-Soto, A., J. Janeiro-Gallardo, and A. Bugarín. 2017. Adapting SimpleNLG to Spanish. In Proceedings of the 10th International Conference on Natural Language Generation, pages 144–148. Association for Computational Linguistics.
  • Reiter, E. 1994. Has a consensus NL generation architecture appeared, and is it psycholinguistically plausible? In Proceedings of the Seventh International Workshop on Natural Language Generation.
  • Rieser, V. and O. Lemon. 2009. Natural language generation as planning under uncertainty for spoken dialogue systems. Empirical Methods in Natural Language Generation: Data-oriented Methods and Empirical Evaluation, pages 105–120.
  • Roos, Q. 2022. Fine-tuning pre-trained language models for CEFR-level and keyword conditioned text generation: A comparison between Google's T5 and OpenAI's GPT-2.
  • Sai, A. B., A. K. Mohankumar, and M. M. Khapra. 2022. A survey of evaluation metrics used for NLG systems. ACM Computing Surveys, 55(2), January.
  • Scarselli, F., M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. 2008. The graph neural network model. IEEE transactions on neural networks, 20(1):61–80.
  • Stanchev, P., W. Wang, and H. Ney. 2019. EED: Extended edit distance measure for machine translation. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 514–520.
  • Sukhbaatar, S., J. Weston, R. Fergus, et al. 2015. End-to-end memory networks. Advances in neural information processing systems, 28.
  • Sutskever, I., O. Vinyals, and Q. V. Le. 2014. Sequence to sequence learning with neural networks. Advances in neural information processing systems, 27.
  • Tang, T., H. Lu, Y. E. Jiang, H. Huang, D. Zhang, W. X. Zhao, and F. Wei. 2023. Not all metrics are guilty: Improving NLG evaluation with LLM paraphrasing. arXiv preprint arXiv:2305.15067.
  • Vaswani, A., N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems, 30.
  • Vedantam, R., C. Lawrence Zitnick, and D. Parikh. 2015. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4566–4575.
  • Wang, H., Y. Liu, C. Zhu, L. Shou, M. Gong, Y. Xu, and M. Zeng. 2021. Retrieval enhanced model for commonsense generation. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 3056–3062, Online, August. Association for Computational Linguistics.
  • Wang, J., Y. Liang, F. Meng, Z. Sun, H. Shi, Z. Li, J. Xu, J. Qu, and J. Zhou. 2023. Is ChatGPT a good NLG evaluator? a preliminary study. In Proceedings of the 4th New Frontiers in Summarization Workshop, pages 1–11, Singapore, December. Association for Computational Linguistics.
  • Yu, W., C. Zhu, L. Qin, Z. Zhang, T. Zhao, and M. Jiang. 2022. Diversifying content generation for commonsense reasoning with mixture of knowledge graph experts. In NAACL 2022 Workshop on Deep Learning on Graphs for Natural Language Processing.
  • Yuan, W., G. Neubig, and P. Liu. 2021. BARTScore: Evaluating generated text as text generation. Advances in Neural Information Processing Systems, 34:27263–27277.
  • Zhang, H., S. Si, H. Wu, and D. Song. 2023. Controllable text generation with residual memory transformer. arXiv preprint arXiv:2309.16231.
  • Zhang, T., V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi. 2020. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations.
  • Zhang, Y. and X. Wan. 2024. SituatedGen: Incorporating geographical and temporal contexts into generative commonsense reasoning. Advances in Neural Information Processing Systems, 36.
  • Zhu, W. and S. Bhat. 2020. GRUEN for evaluating linguistic quality of generated text. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 94–108, Online, November. Association for Computational Linguistics.