Multilingual data collection for multiple corpus-based approaches to translation and interpretation
- Pereira Gomes da Costa, Hernani
- Miriam Seghiri (Supervisor)
- Gloria Corpas Pastor (Supervisor)
- Ruslan Mitkov (Supervisor)
University of defence: Universidad de Málaga
Date of defence: 22 November 2019
- Esteban Tomás Montoro del Arco (Chair)
- María Rosario Bautista Zambrana (Secretary)
- Luis Meneses Lerín (Member)
Type: Thesis
Abstract
Corpora are playing an increasingly important role in our multilingual society. High-quality parallel corpora are a preferred resource in the language engineering and linguistics communities. Nevertheless, the lack of sufficient and up-to-date parallel corpora, especially for narrow domains and poorly resourced languages, is currently one of the major obstacles to further advancement in areas such as translation, language learning, and automatic and assisted translation. An alternative is the use of comparable corpora, which are easier and faster to compile. Corpora in general are extremely important for tasks such as translation, extraction, inter-linguistic comparison and discovery, and even the creation of lexicographical resources. Their objectivity, reusability, multiplicity and applicability of uses, ease of handling and quick access to large volumes of data are just some of their advantages over more limited resources such as thesauri or dictionaries. By way of example, new terms are coined on a daily basis, and dictionaries cannot keep up with the rate at which they emerge. Accordingly, this research work aims to exploit and develop new technologies and methods to better address the needs not only of translators and interpreters but also of other professionals and ordinary people in their daily tasks, such as corpus and terminology compilation and management. The main topics covered by this work relate to Computational Linguistics (CL), Natural Language Processing (NLP), Machine Translation (MT), Comparable Corpora, Distributional Similarity Measures (DSM), Terminology Extraction Tools (TET) and Terminology Management Tools (TMT). In particular, this work examines three main questions: 1) Is it possible to create a simpler and more user-friendly comparable corpora compilation tool? 2) How can the most suitable TMT and TET for a given translation or interpreting task be identified?
3) How can the internal degree of relatedness in comparable corpora be automatically assessed and measured? This work is composed of thirteen peer-reviewed scientific publications, which are included in Appendix A, while the methodology used and the results obtained in these studies are summarised in the main body of this document. The first task was approached by conducting an extensive analysis of the existing comparable corpora compilation tools on the market; their limitations and strengths were reported and taken into account in creating a new multilingual comparable corpora prototype, named iCompileCorpora. iCompileCorpora aimed not only to overcome the usability problems, limitations and performance issues identified in those tools but also to make the compilation process more flexible and robust. The second task of this research focused on addressing translators' and interpreters' needs and on suggesting new methodologies or tools to help them increase their productivity and ease their labour-intensive activities. To do so, a set of user requirements was carefully compiled from various user surveys in the literature. In parallel, a set of features offered by the best-known TMT and TET on the market was also identified. Then, by matching the software functionalities offered by these tools with the user requirements, two new standardised methodologies capable of evaluating the current TMT and TET on the market were proposed. Finally, new directions for improvement were suggested, motivated mostly by the current mismatch between users' needs and the software functionalities on offer. The third and last task of this research focused mainly on exploring various methods capable of helping users assess comparable corpora. In detail, a simple yet efficient methodology capable of assessing and ranking comparable documents according to their internal degree of similarity was proposed.
This method not only gives the user a better idea of the quality of the documents in the corpus but also helps in deciding which documents should be kept in it or removed from it. Along this journey, various programs and tools were created. Two of them resulted from the first research question: SCleaner, a web application that helps users to format text copied from a PDF file, and iCompileCorpora, a web interface that guides the user through the creation of multilingual comparable corpora. Regarding the third research question, three programs were created: PreProcessor, a program that helps users to annotate raw textual data; STSModule, a program that helps users compute the semantic similarity between sentences and documents in English; and, finally, DSMModule, a program that helps the user to assess and rank documents according to their internal degree of similarity.
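As an illustration of the document-ranking idea behind the third research question, the following minimal Python sketch scores each document by its mean similarity to the rest of the corpus and sorts the documents accordingly. It is only a sketch under stated assumptions: cosine similarity over bag-of-words term counts stands in for the distributional similarity measures studied in the thesis, and the function names (`tokenize`, `cosine`, `rank_documents`) and the three sample documents are hypothetical and are not taken from DSMModule.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(docs):
    """Return (doc_id, score) pairs, highest score first.

    The score of a document is its mean cosine similarity to every
    other document in the corpus (an assumed stand-in measure).
    """
    vectors = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    scores = {}
    for doc_id, vec in vectors.items():
        others = [cosine(vec, v) for other_id, v in vectors.items() if other_id != doc_id]
        scores[doc_id] = sum(others) / len(others) if others else 0.0
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy corpus: two related documents and one off-topic document.
corpus = {
    "doc_a": "Spa hotels offer wellness treatments and thermal baths.",
    "doc_b": "Thermal baths and wellness treatments are popular in spa resorts.",
    "doc_c": "The stock market closed lower after the interest rate decision.",
}
ranking = rank_documents(corpus)
for doc_id, score in ranking:
    print(f"{doc_id}: {score:.3f}")
```

In this toy corpus the off-topic document shares no vocabulary with the other two, so it ends up at the bottom of the ranking and would be a natural candidate for removal from the corpus.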