Training part-of-speech taggers to build machine translation systems for less-resourced language pairs
- Sánchez Martínez, Felipe
- Armentano Oller, Carme
- Pérez Ortiz, Juan Antonio
- Forcada Zubizarreta, Mikel L.
ISSN: 1135-5948
Year of publication: 2007
Issue: 39
Pages: 257-264
Type: Article
Other publications: Procesamiento del lenguaje natural
Abstract
In this paper we review an unsupervised method for training the hidden-Markov-model-based part-of-speech taggers used within the open-source shallow-transfer machine translation (MT) engine Apertium. The method uses the remaining modules of the MT engine, together with a target-language model, to train part-of-speech taggers that are then used within Apertium to produce translations. Experimental results on the Occitan-Catalan language pair (a case study of a less-resourced language pair) show that this training method needs only a small amount of corpus data compared with the corpus sizes usually required by the standard (unsupervised) Baum-Welch algorithm. This makes the method well suited to training part-of-speech taggers for MT between less-resourced language pairs. Moreover, the MT system embedding the resulting part-of-speech tagger achieves better translation performance.
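The abstract does not spell out how the target-language model drives training, so the following is only a minimal sketch of one plausible reading: each possible tag sequence for a source segment is translated, the translation is scored by a target-language model, and the normalized scores serve as fractional counts for the tagger's transition probabilities. The names `train_transitions`, `translate`, and `lm_score` are hypothetical stand-ins and are not part of Apertium; the sketch covers transition probabilities only and enumerates tag sequences exhaustively, which is practical only for short segments.

```python
from collections import defaultdict
from itertools import product


def train_transitions(segments, translate, lm_score):
    """Estimate tag-transition probabilities from target-language scores.

    segments:  list of segments; each segment is a list of
               (word, [candidate tags]) pairs.
    translate: hypothetical callable (words, tags) -> target-language string,
               standing in for the remaining modules of the MT engine.
    lm_score:  hypothetical callable (target string) -> non-negative score,
               standing in for the target-language model.
    Returns a dict mapping previous tag -> {tag: probability}.
    """
    counts = defaultdict(lambda: defaultdict(float))

    for segment in segments:
        words = [word for word, _ in segment]
        # Every way of choosing one tag per word (exponential in length).
        paths = list(product(*(tags for _, tags in segment)))
        scores = [lm_score(translate(words, path)) for path in paths]
        total = sum(scores)
        if total == 0:
            continue  # no usable evidence from this segment

        for path, score in zip(paths, scores):
            weight = score / total  # fractional count for this tag sequence
            prev = "<s>"  # assumed start-of-segment marker
            for tag in path:
                counts[prev][tag] += weight
                prev = tag

    # Normalize the fractional counts into probabilities per previous tag.
    probs = {}
    for prev, following in counts.items():
        row_total = sum(following.values())
        probs[prev] = {tag: c / row_total for tag, c in following.items()}
    return probs
```

As a toy check, a segment with the words "the" (only DET) and "run" (NOUN or VERB), scored by a made-up model that rates the noun reading three times as likely as the verb reading, yields a DET-to-NOUN transition probability of 0.75 and a DET-to-VERB probability of 0.25. No tagged source-language corpus is involved at any point, which is the property that makes this style of training attractive for less-resourced pairs.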