Training part-of-speech taggers to build machine translation systems for less-resourced language pairs

  1. Sánchez Martínez, Felipe
  2. Armentano Oller, Carme
  3. Pérez Ortiz, Juan Antonio
  4. Forcada Zubizarreta, Mikel L.
Journal: Procesamiento del lenguaje natural

ISSN: 1135-5948

Year of publication: 2007

Issue: 39

Pages: 257-264

Type: Article


Abstract

In this paper we review an unsupervised method that can be used to train the hidden-Markov-model-based part-of-speech taggers used within the open-source shallow-transfer machine translation (MT) engine Apertium. This method uses the remaining modules of the MT engine and a target-language model to obtain part-of-speech taggers that are then used within the Apertium MT engine to produce translations. The experimental results on the Occitan-Catalan language pair (a case study of a less-resourced language pair) show that the amount of training corpus needed by this method is small compared with the corpus sizes usually needed by the standard (unsupervised) Baum-Welch algorithm. This makes the method appropriate for training part-of-speech taggers to be used in MT for less-resourced language pairs. Moreover, the MT system embedding the resulting part-of-speech tagger achieves comparatively better translation performance.
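The abstract only outlines the training idea: score each possible part-of-speech disambiguation of a source sentence by translating it with the remaining MT modules and evaluating the result with a target-language model, then use those scores to estimate the tagger's HMM parameters. The sketch below is a minimal illustration of that idea under stated assumptions, not the authors' exact algorithm or the Apertium API: the corpus layout, the translate and tl_likelihood callables, and the fractional-count scheme are all assumptions made for the example.

import itertools
from collections import defaultdict

def tl_driven_counts(corpus, translate, tl_likelihood):
    """Accumulate fractional HMM counts by weighting every possible
    disambiguation of each source sentence with the target-language
    likelihood of its translation.

    corpus: iterable of (words, tag_options) pairs, where tag_options[i]
            lists the possible part-of-speech tags of words[i].
    translate: callable mapping a list of (word, tag) pairs to a
               target-language string (stands in for the remaining MT modules).
    tl_likelihood: callable scoring a target-language string (stands in
                   for the target-language model).
    """
    emit = defaultdict(float)   # fractional counts for P(word | tag)
    trans = defaultdict(float)  # fractional counts for P(tag_i | tag_{i-1})
    for words, tag_options in corpus:
        # Enumerate every way of disambiguating the sentence.
        paths = list(itertools.product(*tag_options))
        # Score each disambiguation by the TL likelihood of its translation.
        scores = [tl_likelihood(translate(list(zip(words, p)))) for p in paths]
        total = sum(scores) or 1.0
        for path, score in zip(paths, scores):
            w = score / total  # normalised weight of this disambiguation
            for i, tag in enumerate(path):
                emit[(tag, words[i])] += w
                if i > 0:
                    trans[(path[i - 1], tag)] += w
    return emit, trans

if __name__ == "__main__":
    # Toy usage with placeholder pipeline components (not the real Apertium modules).
    corpus = [(["la", "casa"], [["det"], ["noun", "verb"]])]
    translate = lambda tagged: " ".join(w for w, _ in tagged)  # placeholder
    tl_likelihood = lambda s: 1.0                              # placeholder
    emit, trans = tl_driven_counts(corpus, translate, tl_likelihood)
    print(dict(trans))

In this sketch the fractional counts would then be renormalised into HMM emission and transition probabilities; with real translate and tl_likelihood components, disambiguations whose translations the target-language model prefers receive higher weight.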