Phrase table expansion for statistical machine translation with reduced parallel corpora: the Chinese-Spanish case

HAN, JINGYI

Supervised by:

Núria Bel Rafecas Supervisor

Defended at: Universitat Pompeu Fabra

Date of defense: 13 December 2017

Committee:

Pavel Pecina President
Marta Ruiz Costa-Jussà Secretary
Felipe Sánchez Martínez Rapporteur

Type: Thesis

Teseo: 514873

Abstract

The scarcity of parallel data is a major challenge for Statistical Machine Translation (SMT). This thesis aims to enrich an SMT system by adding morphological variants and new translation lexicon entries generated automatically from monolingual data. To induce a bilingual lexicon without relying on comparable corpora or parallel data, we propose a supervised classifier trained on monolingual features (e.g. word embedding vectors, plus Brown clustering or word frequency information) of only a small number of translation-equivalent word pairs. The classifier predicts whether a new word pair stands in a translation relation. Our SMT phrase table expansion experiments were conducted on Chinese and Spanish because, although these are two of the most widely spoken languages in the world, the pair suffers from data scarcity. Beyond the problems caused by words missing from the training corpus, the inflectional differences between the two languages make translation even more challenging when only limited parallel data are available. The results show that, on the one hand, morphology expansion improves the SMT system by up to +0.61 BLEU over a low-resource Chinese-Spanish phrase-based SMT baseline. On the other hand, our supervised classifier reaches an F1-score of 0.94, and the SMT experiments show an improvement of up to +0.70 BLEU with the resulting bilingual lexicon, demonstrating that the errors of the classifier are ultimately controlled by the SMT system.
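
The lexicon induction idea in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the 2-dimensional embeddings, the seed pairs, and the function names (pair_features, is_translation, induce_lexicon) are all hypothetical, and a 1-nearest-neighbour rule stands in for whatever classifier the thesis actually used. Each word pair is represented by concatenating the monolingual vectors of its two words, and a new pair receives the label of the closest seed pair.

```python
from math import dist

# Hypothetical 2-d monolingual embeddings (assumption for illustration);
# real vectors would be trained separately on Chinese and Spanish text.
ZH_VEC = {"猫": [0.9, 0.1], "狗": [0.8, 0.3], "书": [0.1, 0.9], "纸": [0.2, 0.8]}
ES_VEC = {"gato": [0.2, 0.7], "perro": [0.3, 0.8],
          "libro": [0.9, 0.2], "papel": [0.8, 0.1]}


def pair_features(zh, es):
    """Concatenate the two monolingual vectors into one feature vector."""
    return ZH_VEC[zh] + ES_VEC[es]


# Small seed set: 1 = translation equivalents, 0 = not equivalents.
SEED = [
    (("猫", "gato"), 1),
    (("书", "libro"), 1),
    (("猫", "libro"), 0),
    (("书", "gato"), 0),
]
TRAIN = [(pair_features(zh, es), label) for (zh, es), label in SEED]


def is_translation(zh, es):
    """Predict with 1-nearest neighbour (a stand-in classifier)."""
    query = pair_features(zh, es)
    _, label = min(TRAIN, key=lambda item: dist(query, item[0]))
    return label == 1


def induce_lexicon(zh_words, es_words):
    """Keep every candidate pair that the classifier accepts."""
    return [(zh, es) for zh in zh_words for es in es_words
            if is_translation(zh, es)]


# Candidate words that do not appear in the seed set.
new_entries = induce_lexicon(["狗", "纸"], ["perro", "papel"])
print(new_entries)
```

In a phrase table expansion setting, the accepted pairs would then be added to the SMT phrase table as new entries. How they are scored there is outside the scope of this sketch.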