Training part-of-speech taggers to build machine translation systems for less-resourced language pairs

  1. Sánchez Martínez, Felipe
  2. Armentano Oller, Carme
  3. Pérez Ortiz, Juan Antonio
  4. Forcada Zubizarreta, Mikel L.
Aldizkaria:
Procesamiento del lenguaje natural

ISSN: 1135-5948

Argitalpen urtea: 2007

Zenbakia: 39

Orrialdeak: 257-264

Mota: Artikulua

Beste argitalpen batzuk: Procesamiento del lenguaje natural

Laburpena

In this paper we review an unsupervised method that can be used to train the hidden-Markov-model-based part-of-speech taggers used within the opensource shallow-transfer machine translation (MT) engine Apertium. This method uses the remaining modules of the MT engine and a target language model to obtain part-of-speech taggers that are then used within the Apertium MT engine in order to produce translations. The experimental results on the Occitan-Catalan language pair (a case study of a less-resourced language pair) show that the amount of corpora needed by this training method is small compared with the usual corpus sizes needed by the standard (unsupervised) Baum-Welch algorithm. This makes the method appropriate to train part-of-speech taggers to be used in MT for less-resourced language pairs. Moreover, the translation performance of the MT system embedding the resulting part-of-speech tagger is comparatively better.