Contextual language models for the search and integration of tabular data (Modelos de lenguaje contextuales para la búsqueda e integración de datos tabulares)

Pilaluisa Quinatoa, José Ramiro

Supervised by:

David Tomás Díaz (Supervisor)

Defending university: Universitat d'Alacant / Universidad de Alicante

Defence date: 19 January 2023

Examination committee:

Pietro Manzoni (Chair)
Irene Garrigós Fernández (Secretary)
José María Cecilia Canales (Member)

Department:

LENGUAJES Y SISTEMAS INFORMATICOS

Type: Thesis

Teseo: 785433

Abstract

This thesis proposes an approach for searching and integrating data in tabular format. The novelty of the proposal lies in its use of contextual language models, which have revolutionised the field of natural language processing (NLP) in recent years. Few approaches, however, have applied these models to structured data such as tables. Although some approaches exist for the task of table retrieval, none currently uses these models across the whole process of search and integration with union and join operators. This thesis proposes adapting these language models, originally designed for unstructured data, so that they can be applied to structured data. In the process, the effectiveness of different existing models is evaluated and their input parameters are adjusted to determine the most effective configuration for the task. In addition, contextual models are contrasted with non-contextual models in order to analyse the role of context in the performance of the system. The work also studies how to improve the performance of these systems by removing content from the tables. To this end, it examines how reducing the number of rows in a table affects the vector representation (embedding) generated by the language model. The aim is to determine whether large tables can be reduced without losing representativeness in the semantic space generated by the language model. Finally, the thesis concludes with a proposal for the annotation of tabular data, in order to obtain a dataset that allows better training and evaluation of this type of system based on machine learning techniques. At present, there are no challenging and varied datasets for the integration task, especially for the join operation. A pilot annotation study is included, in which an initial corpus of tables is developed for the task of searching and integrating tabular data.
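
The row-reduction question raised in the abstract can be illustrated with a small sketch. It is not the method used in the thesis: the abstract does not specify the models, the pooling strategy, or how rows are selected. Everything here is an assumption made for illustration, including the hypothetical helpers `token_vector`, `embed_table`, and `reduce_rows`, the toy hashed embedding that stands in for a real language model, the averaging of token vectors, and the random sampling of rows. The sketch only shows the shape of the experiment: embed the full table, embed a reduced copy, and compare the two vectors by cosine similarity.

```python
import hashlib
import math
import random

DIM = 64  # size of the toy embedding space (arbitrary, for illustration only)


def token_vector(token):
    """Deterministic pseudo-random vector per token.

    ASSUMPTION: this is a stand-in for a language model, not a real one.
    """
    digest = hashlib.sha256(token.lower().encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return [rng.gauss(0.0, 1.0) for _ in range(DIM)]


def embed_table(header, rows):
    """Represent a table as the average of the vectors of all its tokens."""
    tokens = [tok for cell in header for tok in str(cell).split()]
    tokens += [tok for row in rows for cell in row for tok in str(cell).split()]
    if not tokens:
        return [0.0] * DIM
    total = [0.0] * DIM
    for tok in tokens:
        for i, value in enumerate(token_vector(tok)):
            total[i] += value
    return [value / len(tokens) for value in total]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def reduce_rows(rows, keep, seed=0):
    """Keep a random sample of `keep` rows, preserving their original order."""
    if keep >= len(rows):
        return list(rows)
    chosen = sorted(random.Random(seed).sample(range(len(rows)), keep))
    return [rows[i] for i in chosen]


# Made-up example table, used only to exercise the functions.
header = ["city", "country", "continent"]
rows = [
    ["Alicante", "Spain", "Europe"],
    ["Valencia", "Spain", "Europe"],
    ["Quito", "Ecuador", "South America"],
    ["Lima", "Peru", "South America"],
    ["Lyon", "France", "Europe"],
    ["Porto", "Portugal", "Europe"],
]

full = embed_table(header, rows)
for keep in (len(rows), 4, 2):
    reduced = embed_table(header, reduce_rows(rows, keep))
    print(f"rows kept: {keep}, cosine to full table: {cosine(full, reduced):.3f}")
```

With a real contextual model in place of `token_vector`, the same loop would show how far the table representation drifts as rows are removed, which is the kind of measurement the row-reduction study calls for.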