Training part-of-speech taggers to build machine translation systems for less-resourced language pairs

Sánchez Martínez, Felipe; Armentano Oller, Carme; Pérez Ortiz, Juan Antonio; Forcada Zubizarreta, Mikel L.

Journal:

Procesamiento del lenguaje natural

ISSN: 1135-5948

Publication year: 2007

Issue: 39

Pages: 257-264

Type: Article

Abstract

In this paper we review an unsupervised method for training the hidden-Markov-model-based part-of-speech taggers used within the open-source shallow-transfer machine translation (MT) engine Apertium. The method uses the remaining modules of the MT engine and a target-language model to obtain part-of-speech taggers, which are then used within Apertium to produce translations. Experimental results on the Occitan-Catalan language pair (a case study of a less-resourced language pair) show that the amount of corpora this training method needs is small compared with the corpus sizes usually needed by the standard (unsupervised) Baum-Welch algorithm. This makes the method appropriate for training part-of-speech taggers to be used in MT for less-resourced language pairs. Moreover, the translation performance of the MT system embedding the resulting part-of-speech tagger is comparatively better.
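
The abstract does not spell out the training procedure, so the following is only a minimal sketch of one way a tagger could be trained with help from the rest of an MT pipeline and a target-language model. It assumes that each possible tag sequence of an ambiguous source segment is translated, that each translation is scored by the target-language model, and that the normalised scores serve as fractional counts for the HMM. The names `path_weights`, `accumulate_counts`, `toy_translate` and `toy_tl_logprob`, the tiny lexicon and the probabilities are all hypothetical; none of them come from the paper or from Apertium.

```python
import math
from collections import defaultdict
from itertools import product


def path_weights(words, ambiguity, translate, tl_logprob):
    """Weight every tag sequence by the target-language score of its translation."""
    paths = list(product(*(ambiguity[w] for w in words)))
    logps = [tl_logprob(translate(words, path)) for path in paths]
    top = max(logps)  # subtract the maximum before exponentiating, for stability
    unnorm = [math.exp(lp - top) for lp in logps]
    total = sum(unnorm)
    return {path: u / total for path, u in zip(paths, unnorm)}


def accumulate_counts(words, weights, trans, emit):
    """Add fractional transition and emission counts for one segment."""
    for path, weight in weights.items():
        for word, tag in zip(words, path):
            emit[(tag, word)] += weight
        for prev_tag, next_tag in zip(path, path[1:]):
            trans[(prev_tag, next_tag)] += weight


# Toy data (assumption): "la" is ambiguous between determiner and pronoun.
ambiguity = {"la": ["DET", "PRON"], "casa": ["NOUN"]}


def toy_translate(words, path):
    # Hypothetical stand-in for the remaining modules of the MT engine.
    table = {("la", "DET"): "the", ("la", "PRON"): "it", ("casa", "NOUN"): "house"}
    return tuple(table[(w, t)] for w, t in zip(words, path))


def toy_tl_logprob(tokens):
    # Hypothetical bigram target-language model with made-up probabilities.
    bigrams = {("the", "house"): math.log(0.9), ("it", "house"): math.log(0.1)}
    return sum(bigrams.get(b, math.log(1e-6)) for b in zip(tokens, tokens[1:]))


words = ["la", "casa"]
weights = path_weights(words, ambiguity, toy_translate, toy_tl_logprob)
trans, emit = defaultdict(float), defaultdict(float)
accumulate_counts(words, weights, trans, emit)
```

In this toy run the determiner reading receives most of the probability mass because its translation is the one the made-up target-language model prefers. Dividing the accumulated counts by their per-tag totals would give the HMM transition and emission probabilities; a real system would also need smoothing and a much larger corpus.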

Data source: Dialnet