Caracterización de Niveles de Informalidad en Textos de la Web 2.0

  1. Mosquera López, Alejandro
  2. Moreda Pozo, Paloma
Journal: Procesamiento del lenguaje natural

ISSN: 1135-5948

Year of publication: 2011

Issue: 47

Pages: 171-177

Type: Article

Repositorio Institucional de la Universidad de Alicante: Open access

Abstract

The analysis of Web 2.0 texts is currently a relevant research topic. However, many problems arise when state-of-the-art tools are applied to this kind of text. To measure these difficulties, we first need to identify the different registers, or informality levels, that such texts exhibit. In this paper we therefore attempt to characterize the informality levels of English Web 2.0 texts using unsupervised machine learning techniques, obtaining an F1 score of 68%.
