Diccionari electrònic bilingüe català>anglés de locucions referencials idiomàtiques de somatismes

Escolano Marín, Xènia

Supervised by:

Vicent Martines Peres (co-supervisor)
Elena Sánchez López (co-supervisor)

University of defence: Universitat d'Alacant / Universidad de Alicante

Defence date: 22 July 2021

Examining committee:

Subject areas:

Catalan Philology

Type: Doctoral thesis

Teseo: 673222

Abstract

The ultimate aim of this research is to design a Catalan>English bilingual electronic dictionary of somatic idioms. To do so, we undertake a semasiological characterisation of somatic idioms to determine their semantic value and morphosyntactic combinatorial possibilities, going from language to abstraction and identifying the differences in conceptualisation between Catalan and English and their equivalence relationship on the basis of their occurrences in corpora. The methodology used to describe these somatisms is based on lexicogrammar, applied in a bottom-up manner. The theory of lexicogrammar (M. Gross 1975, 1981) holds that every elementary sentence consists of at least one first-order predicate that introduces its arguments, which are represented by nouns or phrases. For example, in the sentence Luc admire le courage de Léa, we find the verbal predicate admirer with its arguments Luc and Léa. This sentence is morphologically equivalent to Luc est admiratif (pour + devant) le courage de Léa, with an adjectival predicate, and to Luc a de l'admiration pour le courage de Léa, with a nominal predicate, since the argument structure does not change (Le Pesant & Mathieu-Colas 1998: 8). This entails a systematic description of the syntactic and semantic properties of verbs, predicate nouns and adjectives in the form of tables, each representing a class of lexical elements –characterised in terms of syntactic-semantic features (such as human, animal, plant, concrete, abstract, locative and temporal)– which correspond to a certain syntactic category with a series of common distributional and transformational properties. However, this description is insufficient for the complete automatic treatment of language; this is why, in 1992, G. Gross added to lexicogrammar the notion of object classes: semantic groups defined by the syntactic relations they maintain with one or more classes of verbs, called appropriate predicates (Gross 2012: 101).
This exhaustive description of the language has made it possible to take certain linguistic phenomena, such as fixation, into account in the automatic treatment of language. For example, in the Spanish mixed idioms buscarle las cosquillas a N, dar esquinazo a N or enviar a N a freír espárragos (Gross 2012: 242), we observe that N corresponds to a "human". Likewise, in N to have a headache/backache/stomachache, the first complement corresponds to a "human" and the second, which is likewise variable, also belongs to the same object class: "human" + to have a + "body part" + ache (Le Pesant & Mathieu-Colas [1998: 26], regarding mixed idioms: "avoir mal à [Déterminant + NOM DE PARTIE DU CORPS]"). Considering that computerised corpora play a very prominent role in usage-based linguistics, a trend that considers grammar and use to be closely interrelated, lexicogrammar can serve as a theoretical model particularly suitable for the detection and analysis of phraseological units (PhUs) in corpora. In particular, we focus on the description of the 50 most frequent idioms in Catalan containing one of the five most common anthropomorphic somatic lexemes in all languages: hand, head, heart, eye and ear (Mellado 2004). To identify these 50 most frequent idioms (10 for each lexeme), we retrieve all the occurrences of each of the five lexemes from the Corpus Textual Informatitzat de la Llengua Catalana (CTILC) (~52M words, with texts from 1833-1988). Then, using the software Metaconcor, we carry out a semi-automatic extraction of the different combinations of the most frequent bigrams (candidates for idioms of the dictionary) for each somatic lexeme.
For example, for mà (hand), we obtain: a mà (at hand), a mà (by hand), entre mans (on [one’s] hands), de mà en mà (from hand to hand), mà d’obra (labour force), mà dura (firm hand), picar de mans (to clap [one’s hands]), lligar de peus i mans (to tie [sb’s] hands and feet), a mans plenes (liberally) and rentar-se’n les mans (to wash [one’s] hands [of]). We also consider the most frequent verbal and nominal co-occurrences of each bigram. Once we have translated the Catalan idioms into English, we apply the same step to the English equivalents, based on their occurrences in the British National Corpus (BNC) (~100M words, with texts from 1985-1994). In order to design the dictionary, we undertake a syntactic-semantic description of the Catalan idioms and their English equivalents –indicating their argument structures and semantic values according to the object class to which we ascribe them (the one referring to the values of "somatisms")–, based on the occurrences of these units offered by the CTILC for Catalan and the BNC for English. This analysis is expressed through a tagging scheme recognised by automatic language processing systems. It is based on the one used by the Laboratoire de Linguistique Informatique (LLI) –now LDI (Lexiques, Dictionnaires, Informatique), at Paris 13 University–, whose dictionaries follow the system of the Laboratoire d’Automatique Documentaire et Linguistique (LADL) (Paris 7) –founded by M. Gross in 1968– and incorporate G. Gross’s notion of object classes (1992). Having identified the most frequent idioms and their English equivalents, as well as their distributional (syntagmatic relations) and transformational possibilities (paradigmatic relations), we give an account of their semantic relations in both languages: conceptual variations (polysemy), intersynonymic variants (relations of synonymy), any relations of antonymy and hyperonymy, and their paradigmatic variants (according to their occurrences in the corpora used).
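The bigram-extraction step described above (retrieving the occurrences of a somatic lexeme and ranking the bigrams it takes part in) can be illustrated with a minimal Python sketch. This is not how Metaconcor works internally, which the abstract does not describe; the function name top_bigrams, the whitespace tokenisation and the toy sentence are all assumptions made for illustration, and the sentence is invented rather than taken from the CTILC.

```python
from collections import Counter

def top_bigrams(tokens, lexeme_forms, n=10):
    """Rank the bigrams in which one of the given word forms occurs."""
    counts = Counter()
    # Pair every token with its right-hand neighbour.
    for left, right in zip(tokens, tokens[1:]):
        if left in lexeme_forms or right in lexeme_forms:
            counts[(left, right)] += 1
    return counts.most_common(n)

# Invented toy text, tokenised on whitespace (a simplifying assumption).
tokens = "ho tinc a mà i ho faig a mà perquè passa de mà en mà".split()
ranked = top_bigrams(tokens, {"mà", "mans"})
print(ranked[0])
```

In a real workflow the ranked list would only supply candidates: as the abstract notes, the extraction is semi-automatic, so a linguist still has to decide which bigrams are idioms and which are free combinations.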
All this information is recorded in the dictionary in the form of four coordinated files with fields of different kinds (cf. 2): two files containing the arguments of the idioms (one for Catalan and one for English) and two files containing the predicates, i.e. the idioms as entries (again, one for Catalan and one for English). Considering the relevance object classes may have in fixed expressions, particularly in the above-mentioned mixed structures, our dictionary mostly contains idioms –predicates– that combine invariable (argument human nouns) and variable (anthropomorphic somatic lexemes) elements –arguments–: for example, to wash "one’s" hands (of)/C:"so-ma:elre"/G:v/N0:"hum"/N1:"so-ma:elre"/N3:"ina"/Ca:rentar-se les mans (d’) "alguna cosa". Here N0 refers to the subject (a human) and N1 to the first complement (direct object) in the argument structure of this idiom with the somatic lexeme (so) hand, which in this case has a semantic value of refusal of responsibility (elre). In other cases this lexeme appears in idioms that may evoke, among others, proximity or facility [at hand], manual labour [by hand], activity [on (one’s) hands], severity [firm hand], itinerancy [from hand to hand], human resources [labour force], approval, enthusiasm or attention [to clap (one’s hands)], immobility or repression [to tie (sb’s) hands and feet] and abundance [liberally]; these values are reflected in the argument file. N3 is the third complement, which here corresponds to a concrete inanimate object ("ina") (e.g. “I’m entirely opposed to the er (pause) the idea that they should wash their hands of their (pause) er, obligations er, the nineteen sixty eight act (pause) as” [BNC]). In this case there is no N2 complement, which usually refers to an indirect object. Ca indicates the Catalan equivalent of the idiom.
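To make the entry format above more concrete, the following Python sketch splits one such entry into the idiom and its tagged fields. It assumes only what the example shows: fields are separated by "/", each field is a key followed by ":" and a value, and some values are wrapped in quotation marks. The function name parse_entry is hypothetical, and the thesis's own files may be organised differently.

```python
def parse_entry(line):
    """Split a slash-delimited dictionary entry into its idiom and tagged fields."""
    idiom, *raw_fields = line.split("/")
    fields = {}
    for raw in raw_fields:
        # Split on the first colon only: values such as so-ma:elre contain one.
        key, value = raw.split(":", 1)
        # Drop quotation marks only when they wrap the whole value.
        if len(value) >= 2 and value[0] == value[-1] == '"':
            value = value[1:-1]
        fields[key] = value
    return idiom, fields

# Entry copied from the abstract's example.
entry = ('to wash "one’s" hands (of)/C:"so-ma:elre"/G:v/N0:"hum"'
         '/N1:"so-ma:elre"/N3:"ina"/Ca:rentar-se les mans (d’) "alguna cosa"')
idiom, fields = parse_entry(entry)
print(idiom)
print(fields["N0"], fields["G"], "N2" in fields)
```

Keeping the fields in a dictionary makes the absence of a complement explicit: for this entry there is simply no N2 key, which matches the remark that the idiom has no indirect object.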
Opting for an electronic phraseological dictionary with a semasiological approach makes, on the one hand, the PhUs it contains easier to find, since it starts from the semantic values of the units (idioms) grouped in a linguistically well-defined object class rather than from specific concepts. On the other hand, it offers the advantage of being suitable for Natural Language Processing (NLP), as it has a coding recognised by automatic language processing systems, incorporating numerous fields with information of a morphological, syntactic-semantic (the most common distributional and transformational properties) and diasystematic nature (if applicable). It also contains a specific field for translations into other languages (bilingual or multilingual). Therefore, these repertoires have a dual function: decoding and encoding information. By way of illustration, amb el cap alt (head high), one of the ten most frequent idioms in Catalan with the somatic lexeme head, would be tagged as follows in our dictionary in each of its five argument structures (identified based on its occurrences in the CTILC):
• amb el cap alt/C:"so-cap:or"/G:adv/N0:"hum"/W:anar, caminar, anar-se’n, restar/En:head high.
• amb el cap alt/C:"so-cap:or"/G:adv/N0:"hum"/N1:"ina"/W:dominar, exigir, dir/En:head high.
• amb el cap alt/C:"so-cap:or"/G:adv/N0:"hum"/N3:"loc"/W:sortir/En:head high.
• amb el cap alt/C:"so-cap:or"/G:adv/N0:"hum"/N2:"hum"/W:presentar-se/En:head high.
• amb el cap alt/C:"so-cap:or"/G:adv/N0:"hum"/N1:"hum"/W:acompanyar, casar-se amb/En:head high.
C refers to the object class of the idiom (somatisms with head that have a semantic value of ‘pride’ [or]), G indicates its grammatical function (adverbial), N0 its subject (a human), N1 its direct object (a concrete inanimate object in the second entry or a human in the fifth), N2 is the second complement in the structure (a human as an indirect object in the fourth entry), N3 is the third complement (a locative in the third entry), W is the verb with which the idiom combines, and En represents its English translation. This semasiological characterisation of the idioms takes into account their conceptualisation in Catalan and English and their equivalence relationship. In addition, our dictionary of somatisms includes pragmatic tags and examples of use (occurrences). This information does not usually appear in electronic dictionaries, which makes ours a more complete work. In fact, one of the main advantages offered by semasiological electronic dictionaries over more traditional ones is that they present each of the argument structures of a predicate in a different entry, which allows it to be monosemised. Thus, each usage of a predicate is conceived as a lexical unit to which a description is assigned, a sine qua non for machine translation. This is particularly relevant to hybrid systems, which combine statistics with syntax and semantics (e.g. Systran and Sadaw). However, the utility of this type of tagging goes beyond reducing the potential ambiguity caused by polysemy in machine translation. Being based on corpus linguistics, it may also be applied to Catalan and English language learning and teaching and be useful for translators. The methodology itself could likewise be applied to the training of translators of any language. This would be one way of promoting creative multilingualism by comparing languages in a bottom-up manner. The eventual product derived from this thesis, as well as the linguistic data (methodology, examples, tagging and files), can be considered unprecedented. Until now, lexicogrammar had been implemented based on linguistic intuition, which could make the creation of object classes somewhat biased by the linguist’s idiolect. Using a syntactic-semantic tagging based on the above-mentioned methodology enables precision and objectivity, starting from real language instances (occurrences of these idioms in corpora).
All in all, the usability and replicability of the linguistic data provided in this research may open up a wide range of possibilities, since this tagging could be implemented in linguistic analysis tools (e.g. in the form of tags in the “part of speech” label if idioms are used) and the model could be replicated for expressions from other lexico-phraseological fields, different types of PhUs and other languages.