Automatic Acquisition of Machine Translation Resources in the Abu-MaTran Project

  1. Antonio Toral
  2. Tommi Pirinen
  3. Andy Way
  4. Raphaël Rubino
  5. Gema Ramírez-Sánchez
  6. Sergio Ortiz-Rojas
  7. Víctor Sánchez-Cartagena
  8. Jorge Fernández-Tordera
  9. Mikel Forcada
  10. Miquel Esplà-Gomis
  11. Nikola Ljubesic
  12. Filip Klubicka
  13. Prokopis Prokopidis
  14. Vassilis Papavassiliou
Procesamiento del lenguaje natural

ISSN: 1135-5948

Year of publication: 2015

Issue: 55

Pages: 185-188

Type: Article


This paper provides an overview of the research and development activities carried out within the Abu-MaTran project to alleviate the language-resource bottleneck in machine translation. We have developed a range of tools for acquiring the main resources required by the two most popular approaches to machine translation: statistical models (corpora) and rule-based models (dictionaries and rules). All of these tools have been released under open-source licenses and were developed with industrial exploitation in mind.

Bibliographic References

  • Esplà-Gomis, M. and M. L. Forcada. 2010. Combining content-based and URL-based heuristics to harvest aligned bitexts from multilingual sites with Bitextor. Prague Bull. Math. Linguistics, 93:77-86.
  • Esplà-Gomis, M., V. M. Sánchez-Cartagena, J. A. Pérez-Ortiz, F. Sánchez-Martínez, M. L. Forcada, and R. C. Carrasco. 2014. An efficient method to assist non-expert users in extending dictionaries by assigning stems and inflectional paradigms to unknown words. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation, pages 19-26, Dubrovnik, Croatia, June.
  • Esplà-Gomis, M., F. Klubicka, N. Ljubesic, S. Ortiz-Rojas, V. Papavassiliou, and P. Prokopidis. 2014. Comparing two acquisition systems for automatically building an English-Croatian parallel corpus from multilingual websites. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), Reykjavik, Iceland, May.
  • Forcada, M. L., S. Ortiz-Rojas, T. Pirinen, R. Rubino, and A. Toral. 2014a. Abu-MaTran deliverable D4.1b MT systems for the second development cycle.
  • Forcada, M. L., T. Pirinen, R. Rubino, and A. Toral. 2014b. Abu-MaTran deliverable D5.1b Evaluation of the MT systems deployed in the second development cycle.
  • Ljubesic, N., D. Fiser, and T. Erjavec. 2014. TweetCaT: a Tool for Building Twitter Corpora of Smaller Languages. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), Reykjavik, Iceland.
  • Ljubesic, N. and F. Klubicka. 2014. {bs,hr,sr}WaC-web corpora of Bosnian, Croatian and Serbian. In Proceedings of the 9th Web as Corpus Workshop (WaC-9), pages 29-35, Gothenburg, Sweden.
  • Ljubesic, N. and A. Toral. 2014. caWaC - a web corpus of Catalan and its application to language modeling and machine translation. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), Reykjavik, Iceland, May.
  • Papavassiliou, V., P. Prokopidis, M. Esplà-Gomis, and S. Ortiz. 2014. Abu-MaTran deliverable D3.2. Corpora Acquisition Software.
  • Papavassiliou, V., P. Prokopidis, and G. Thurmair. 2013. A modular open-source focused crawler for mining monolingual and bilingual corpora from the web. In Proceedings of the Sixth Workshop on Building and Using Comparable Corpora, pages 43-51, Sofia, Bulgaria, August.
  • Rehm, G. and H. Uszkoreit. 2013. META-NET Strategic Research Agenda for Multilingual Europe 2020. [Online; accessed 27 March 2015].
  • Rubino, R., M. Esplà-Gomis, A. Toral, V. Papavassiliou, and P. Prokopidis. 2015. DIY Domain Specific Parallel Corpora for Translators. To appear in Proceedings of the IV International Conference on Corpus Use and Learning to Translate (CULT), Alacant, Spain.
  • Sánchez-Cartagena, V. M., J. A. Pérez-Ortiz, and F. Sánchez-Martínez. 2015. A generalised alignment template formalism and its application to the inference of shallow-transfer machine translation rules from scarce bilingual corpora. Computer Speech and Language, 32(1):46-90.
  • Sánchez-Martínez, F. and M. L. Forcada. 2009. Inferring shallow-transfer machine translation rules from small parallel corpora. Journal of Artificial Intelligence Research, 34:605-635.