Propuesta de un sistema de clasificación de entidades basado en perfiles e independiente del dominio (Proposal for a profile-based, domain-independent entity classification system)

  1. Isabel Moreno Agulló
  2. María Teresa Romá Ferri
  3. María Paloma Moreda Pozo
Journal:
Procesamiento del lenguaje natural

ISSN: 1135-5948

Year of publication: 2017

Issue: 59

Pages: 23-30

Type: Article

Abstract

Named Entity Recognition and Classification (NERC) is a prerequisite for other natural language processing applications. Nevertheless, adapting NERC systems is expensive, given that most of them only work appropriately on the domain for which they were created. With this in mind, a named entity classification system based on profiles and machine learning is evaluated to determine whether its results are maintained regardless of the domain of the training corpus. To that end, it is tested on 6 types of entities from two different domains in Spanish: general and medical. When techniques to balance the class distribution are applied, the difference in F1 between domains is 0.02 points (F1: 50.36 versus 50.38, respectively). These results support the domain independence of our profile-based system.
