Contribution to natural language generation for Spanish

García Méndez, Silvia

Supervised by:

Enrique Costa Montenegro Director
Milagros Fernández Gavilanes Co-director

Defence university: Universidade de Vigo

Defence date: 05 February 2021

Committee:

Patrice Bellot Chair
Juan Carlos Burguillo Rial Secretary
Elena Lloret Pastor Committee member

Type: Thesis

Teseo: 643241

Abstract

In this thesis, we present our research in the field of Natural Language Generation (NLG). Our work is an effort to bring NLG capabilities for the Spanish language to the research community. To this end, we present several contributions that aim to extend the state of the art in this area. We give a detailed description of the resources created and the architectures designed for NLG, taking into consideration the main stages of the traditional pipeline: content determination, text structuring, lexicalisation and, finally, realisation. For this purpose, we created several linguistic resources, paying special attention to coverage and accuracy: aLexiS (a Lexicon for Spanish), eLSA (Augmentative and Alternative Spanish Lexicon) and aLexiE (a Lexicon for English). They contain a wide range of linguistic data, namely morphological, syntactic and semantic information. This work is motivated by the lack of complete linguistic resources suitable for real NLG applications, especially for Spanish. Both aLexiS and aLexiE are intended for use cases such as report generation, whereas the eLSA lexicon aims to improve NLG systems that help people diagnosed with communication disorders. In terms of libraries developed for NLG, we present several contributions. Firstly, we introduce an adaptation of the popular SimpleNLG library to Spanish, together with an enhanced version that operates automatically and expands keywords into text. Both solutions can provide applications, such as web apps, with valuable NLG capabilities. Moreover, we present a modular, hybrid architecture for NLG that combines linguistic knowledge with statistical information (a language model to infer prepositions) to address the NLG task automatically.
In the end, our system is able to generate complete, coherent and grammatically and orthographically correct sentences in Spanish from keywords provided by the users (such as adjectives, nouns and verbs). The main strength of the architecture is its modularity: its constituents (lexicon, grammar and realiser) can be reused or substituted to address other generation challenges or to improve the performance of the system. Moreover, our NLG architecture was designed to be efficient in terms of the time required to generate the output and to be easily extended to other languages, even to languages that are not linguistically similar, as is the case for Spanish and English. We demonstrate this by extending our NLG system to English. Both NLG systems, for Spanish and English, have been evaluated using metrics that are popular in the state of the art as well as manual annotations. Finally, the results obtained are promising and encourage us to continue our research in the field of automatic NLG systems.