Analysing the Problem of Automatic Evaluation of Language Generation Systems

  1. Martínez-Murillo, Iván
  2. Moreda, Paloma
  3. Lloret, Elena
Journal:
Procesamiento del lenguaje natural

ISSN: 1135-5948

Year of publication: 2024

Issue title: Procesamiento del Lenguaje Natural, Revista nº 72, March 2024

Issue: 72

Pages: 123-136

Type: Article

Abstract

Automatic text evaluation metrics are widely used to measure the performance of Natural Language Generation (NLG) systems. However, these metrics have several limitations. This article presents an empirical study that analyses the problems of current evaluation metrics, such as their inability to measure the semantic quality of a text and their strong dependence on the reference texts against which system outputs are compared. In addition, traditional NLG systems are compared against more recent systems based on neural networks. Finally, an experiment with GPT-4 is proposed to determine whether it is a reliable source for evaluating the quality of a text. From the results obtained, it can be concluded that, under current automatic metrics, the improvement of neural systems over traditional ones is not especially significant; however, when the qualitative aspects of the generated texts are analysed, that improvement does become apparent.
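The limitation highlighted in the abstract, namely that n-gram metrics reward surface overlap with reference texts rather than meaning, can be illustrated in a few lines. The sketch below is not taken from the paper; it assumes the Python packages `sacrebleu` and `rouge-score` purely as stand-in implementations of BLEU and ROUGE, and scores a verbatim copy and a semantically equivalent paraphrase against the same single reference.

```python
# Minimal sketch of reference dependence in n-gram evaluation metrics.
# Assumes the third-party packages `sacrebleu` and `rouge-score`;
# neither is named in the record -- they are illustrative stand-ins.
import sacrebleu
from rouge_score import rouge_scorer

reference = "The cat is sitting on the mat."
candidates = {
    "verbatim": "The cat is sitting on the mat.",
    "paraphrase": "A cat sits upon the rug.",  # same meaning, little n-gram overlap
}

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
for name, hyp in candidates.items():
    bleu = sacrebleu.sentence_bleu(hyp, [reference]).score
    rouge_l = scorer.score(reference, hyp)["rougeL"].fmeasure
    print(f"{name:10s}  BLEU = {bleu:5.1f}  ROUGE-L = {rouge_l:.2f}")

# Expected behaviour: the verbatim copy scores near the metric maxima,
# while the equally acceptable paraphrase scores close to zero -- the
# semantic blind spot the study analyses.
```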

References

  • Aghahadi, Z. and A. Talebpour. 2022. Avicenna: a challenge dataset for natural language generation toward commonsense syllogistic reasoning. Journal of Applied Non-Classical Logics, 32(1):55–71.
  • Anderson, P., B. Fernando, M. Johnson, and S. Gould. 2016. SPICE: Semantic propositional image caption evaluation. In Computer Vision – ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part V, pages 382–398. Springer.
  • Appelt, D. 1985. Planning English sentences. Cambridge University Press.
  • Banerjee, S. and A. Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72.
  • Bateman, J. A. 1997. Enabling technology for multilingual natural language generation: the KPML development environment. Natural Language Engineering, 3(1):15–55.
  • Bhargava, P. and V. Ng. 2022. Commonsense knowledge reasoning and generation with pre-trained language models: A survey. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 12317–12325.
  • Braun, D., K. Klimt, D. Schneider, and F. Matthes. 2019. SimpleNLG-DE: Adapting SimpleNLG 4 to German. In Proceedings of the 12th International Conference on Natural Language Generation, pages 415–420, Tokyo, Japan, October–November. Association for Computational Linguistics.
  • Carlsson, F., J. Öhman, F. Liu, S. Verlinden, J. Nivre, and M. Sahlgren. 2022. Fine-grained controllable text generation using non-residual prompting. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6837–6857.
  • Cascallar-Fuentes, A., A. Ramos-Soto, and A. Bugarín Diz. 2018. Adapting SimpleNLG to Galician language. In Proceedings of the 11th International Conference on Natural Language Generation, pages 67–72, Tilburg University, The Netherlands, November. Association for Computational Linguistics.
  • Chen, G., K. van Deemter, and C. Lin. 2018. SimpleNLG-ZH: a linguistic realisation engine for Mandarin. In Proceedings of the 11th International Conference on Natural Language Generation, pages 57–66, Tilburg University, The Netherlands, November. Association for Computational Linguistics.
  • Dong, C., Y. Li, H. Gong, M. Chen, J. Li, Y. Shen, and M. Yang. 2023. A survey of natural language generation. ACM Computing Surveys, 55(8):1–38.
  • Fu, J., S.-K. Ng, Z. Jiang, and P. Liu. 2023. GPTScore: Evaluate as you desire. arXiv preprint arXiv:2302.04166.
  • Gatt, A. and E. Krahmer. 2018. Survey of the state of the art in natural language generation: Core tasks, applications and evaluation. Journal of Artificial Intelligence Research, 61:65–170.
  • Gatt, A. and E. Reiter. 2009. SimpleNLG: A realisation engine for practical applications. In Proceedings of the 12th European Workshop on Natural Language Generation (ENLG 2009), pages 90–93.
  • Guo, K. 2022. Testing and validating the cosine similarity measure for textual analysis. Available at SSRN 4258463.
  • Han, J., M. Kamber, and J. Pei. 2012. Chapter 2: Getting to know your data. In Data Mining, third edition, The Morgan Kaufmann Series in Data Management Systems, pages 39–82. Morgan Kaufmann, Boston.
  • He, X., Y. Gong, A.-L. Jin, W. Qi, H. Zhang, J. Jiao, B. Zhou, B. Cheng, S. Yiu, and N. Duan. 2022. Metric-guided distillation: Distilling knowledge from the metric to ranker and retriever for generative commonsense reasoning. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 839–852, Abu Dhabi, United Arab Emirates, December. Association for Computational Linguistics.
  • Hovy, E. 1987. Generating natural language under pragmatic constraints. Journal of Pragmatics, 11(6):689–719.
  • Ji, Z., N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. J. Bang, A. Madotto, and P. Fung. 2023. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38.
  • Khapra, M. M. and A. B. Sai. 2021. A tutorial on evaluation metrics used in natural language generation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Tutorials, pages 15–19.
  • Kincaid, J. P., R. P. Fishburne Jr, R. L. Rogers, and B. S. Chissom. 1975. Derivation of new readability formulas (automated readability index, fog count and Flesch reading ease formula) for Navy enlisted personnel. Naval Technical Training Command, Millington, TN.
  • Koller, A. and M. Stone. 2007. Sentence generation as a planning problem. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 336–343, Prague, Czech Republic, June. Association for Computational Linguistics.
  • Kusner, M., Y. Sun, N. Kolkin, and K. Weinberger. 2015. From word embeddings to document distances. In International Conference on Machine Learning, pages 957–966. PMLR.
  • Lemon, O. 2011. Learning what to say and how to say it: Joint optimisation of spoken dialogue management and natural language generation. Computer Speech & Language, 25(2):210–221.
  • Levelt, W. 1989. Speaking: From intention to articulation. MIT Press, Cambridge, MA.
  • Lin, B. Y., W. Zhou, M. Shen, P. Zhou, C. Bhagavatula, Y. Choi, and X. Ren. 2020. CommonGen: A constrained text generation challenge for generative commonsense reasoning. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1823–1840, Online, November. Association for Computational Linguistics.
  • Lin, C.-Y. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81.
  • Lo, C.-k., A. K. Tumuluru, and D. Wu. 2012. Fully automatic semantic MT evaluation. In Proceedings of the Seventh Workshop on Statistical Machine Translation, pages 243–252.
  • Mann, W. C. and J. A. Moore. 1981. Computer generation of multiparagraph English text. American Journal of Computational Linguistics, 7(1):17–29.
  • McDonald, D. D. 2010. Natural language generation. Handbook of Natural Language Processing, 2:121–144.
  • Goodfellow, I. J., J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. 2014. Generative adversarial nets. Advances in Neural Information Processing Systems, 27:2672–2680.
  • Nakatsu, C. and M. White. 2010. Generating with discourse combinatory categorial grammar. Linguistic Issues in Language Technology, 4.
  • Nirenburg, S., V. R. Lesser, and E. Nyberg. 1989. Controlling a language generation planner. In IJCAI, pages 1524–1530.
  • OpenAI. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774.
  • Papineni, K., S. Roukos, T. Ward, and W.-J. Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.
  • Popović, M. 2015. chrF: character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 392–395.
  • Radford, A., J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.
  • Raffel, C., N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text Transformer. Journal of Machine Learning Research, 21(140):1–67.
  • Ramos-Soto, A., J. Janeiro-Gallardo, and A. Bugarín. 2017. Adapting SimpleNLG to Spanish. In Proceedings of the 10th International Conference on Natural Language Generation, pages 144–148. Association for Computational Linguistics.
  • Reiter, E. 1994. Has a consensus NL generation architecture appeared, and is it psycholinguistically plausible? In Proceedings of the Seventh International Workshop on Natural Language Generation.
  • Rieser, V. and O. Lemon. 2009. Natural language generation as planning under uncertainty for spoken dialogue systems. Empirical Methods in Natural Language Generation: Data-oriented Methods and Empirical Evaluation, pages 105–120.
  • Roos, Q. 2022. Fine-tuning pre-trained language models for CEFR-level and keyword conditioned text generation: A comparison between Google's T5 and OpenAI's GPT-2.
  • Sai, A. B., A. K. Mohankumar, and M. M. Khapra. 2022. A survey of evaluation metrics used for NLG systems. ACM Computing Surveys, 55(2).
  • Scarselli, F., M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. 2008. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80.
  • Stanchev, P., W. Wang, and H. Ney. 2019. EED: Extended edit distance measure for machine translation. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 514–520.
  • Sukhbaatar, S., J. Weston, R. Fergus, et al. 2015. End-to-end memory networks. Advances in Neural Information Processing Systems, 28.
  • Sutskever, I., O. Vinyals, and Q. V. Le. 2014. Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems, 27.
  • Tang, T., H. Lu, Y. E. Jiang, H. Huang, D. Zhang, W. X. Zhao, and F. Wei. 2023. Not all metrics are guilty: Improving NLG evaluation with LLM paraphrasing. arXiv preprint arXiv:2305.15067.
  • Vaswani, A., N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30.
  • Vedantam, R., C. Lawrence Zitnick, and D. Parikh. 2015. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4566–4575.
  • Wang, H., Y. Liu, C. Zhu, L. Shou, M. Gong, Y. Xu, and M. Zeng. 2021. Retrieval enhanced model for commonsense generation. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 3056–3062, Online, August. Association for Computational Linguistics.
  • Wang, J., Y. Liang, F. Meng, Z. Sun, H. Shi, Z. Li, J. Xu, J. Qu, and J. Zhou. 2023. Is ChatGPT a good NLG evaluator? a preliminary study. In Proceedings of the 4th New Frontiers in Summarization Workshop, pages 1–11, Singapore, December. Association for Computational Linguistics.
  • Yu, W., C. Zhu, L. Qin, Z. Zhang, T. Zhao, and M. Jiang. 2022. Diversifying content generation for commonsense reasoning with mixture of knowledge graph experts. In NAACL 2022 Workshop on Deep Learning on Graphs for Natural Language Processing.
  • Yuan, W., G. Neubig, and P. Liu. 2021. BARTScore: Evaluating generated text as text generation. Advances in Neural Information Processing Systems, 34:27263–27277.
  • Zhang, H., S. Si, H. Wu, and D. Song. 2023. Controllable text generation with residual memory transformer. arXiv preprint arXiv:2309.16231.
  • Zhang, T., V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi. 2020. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations.
  • Zhang, Y. and X. Wan. 2024. SituatedGen: Incorporating geographical and temporal contexts into generative commonsense reasoning. Advances in Neural Information Processing Systems, 36.
  • Zhu, W. and S. Bhat. 2020. GRUEN for evaluating linguistic quality of generated text. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 94–108, Online, November. Association for Computational Linguistics.