Concit-corpus: context citation analysis to learn function, polarity and influence
- HERNÁNDEZ ALVAREZ, MYRIAM
- Patricio Martínez Barco, Supervisor
Defending university: Universitat d'Alacant / Universidad de Alicante
Date of defense: 21 September 2015
- María Teresa Martín Valdivia, President
- Fernando Llopis Pascual, Secretary
- Thamar Solorio, Examiner
Type: Thesis
Abstract
Introduction

Citation analysis is a method of evaluating the impact of an author, a published work, or a scientific medium. Sugiyama, Kumar, Kan and Tripathi (2010) distinguished two types of citation analysis: citation counts (Garfield, 1972) and citation context analysis. We adopt this categorization because it is close to the approach of our thesis. Citation counting evaluates all citations without making distinctions between them. Citation context analysis, on the other hand, not only classifies citations according to qualitative criteria, yielding information that can be used to improve impact assessment, but also has other applications such as summary generation and the development of better information retrieval techniques, among others. In the Introduction of this thesis we refer to many authors who have exposed the weaknesses of impact assessment based on citation counting alone; although such methods have proved very useful for evaluating the impact of authors and publishing media, they could be improved by taking into account the results of citation context analysis. Not all citations should have equal weight in the calculation of impact, which is why citation context analysis can be used to differentiate citations according to quality criteria such as purpose and polarity. Therefore, in our work we focus on citation context analysis to evaluate the function, polarity, and impact of bibliographic references.

Development

Some authors have commented on the distortions that can occur when only counting methods are used to evaluate impact. Marder et al. (2010) showed that controversial articles with incorrect or incomplete data obtain higher citation counts. This situation could generate unethical incentives to publish, given the considerable pressure on academics produced by the importance of impact assessment for researchers' careers and for the image and relevance of the various scientific media. It is therefore necessary to differentiate the nature of each citation using complete criteria derived from citation context analysis. Articles and other scientific documents are types of texts with specific characteristics: authors are not explicit about citation purpose; they are rarely clear about polarity, tending instead to avoid criticism in order to evade adverse reactions from colleagues, and to conceal negative judgments about the work of others through language resources called hedges; many citations are mentioned for reasons outside the research interest; and, finally, each knowledge area has its own specialized lexicon (Verlic et al., 2008). Citation context analysis must take these features into account, which makes it a task with specific and complex challenges. In Chapter 2, in the state of the art, we identified topics that remain unresolved, such as context size detection; implicit reference recognition; the definition of features to annotate; hedge detection to distinguish disguised negativity; and, above all, the need to overcome the absence of a common framework that would facilitate research progress under collaborative conditions. Such a framework should include a standard classification, an annotation scheme, and a corpus annotated according to that scheme.
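To make the notion of hedges concrete, the following is a minimal sketch of a lexicon-based hedge-cue detector of the kind that could surface disguised negativity in citation contexts. The cue list and function names are our own illustrative assumptions, not the lexicon or method of the thesis.

```python
# Minimal sketch of lexicon-based hedge-cue detection. The cue list is a
# small illustrative sample, not the lexicon used in the thesis.
HEDGE_CUES = {
    "may", "might", "could", "suggest", "suggests", "appear", "appears",
    "possibly", "perhaps", "somewhat", "relatively", "likely",
}

def find_hedges(sentence: str) -> list[str]:
    """Return the hedge cues present in a citation-context sentence."""
    tokens = [t.strip(".,;:!?") for t in sentence.lower().split()]
    return [t for t in tokens if t in HEDGE_CUES]

context = ("Their method may be effective in some settings, although "
           "results appear somewhat sensitive to parameter choices.")
print(find_hedges(context))  # ['may', 'appear', 'somewhat']
```

A realistic detector would of course need the specialized lexicon of each knowledge area mentioned above, rather than a fixed general-purpose cue list.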
In fact, in the state of the art of the present work, we concluded that the biggest problem facing researchers in this field is that there is no publicly available annotated corpus responding to a medium- or high-granularity scheme, containing enough information for context analysis that scholars can use on a shared basis. In this thesis, we addressed most of these unresolved issues. For instance, we defined a scheme for citation classification that takes into account function, polarity, and impact. The scheme was designed to maintain a simple structure, with six functions and three levels of polarity, which when combined with keywords and labels yields high granularity, comparable with complete ontologies such as CiTO. Ciancarini (2014) noted that ontologies of this kind are difficult to annotate because of their complexity; our proposed scheme and its structure, by contrast, facilitate understanding and application. The annotation task for high granularity remains challenging, but our scheme at least makes it easier for annotators to use and take advantage of all its possibilities. We applied the proposed scheme to annotate a citation corpus composed of 85 articles taken randomly from the ACL Anthology, with a total of 2195 bibliographic citations. Using it, we could evaluate the impact that a citation has in a document. For this purpose, we propose two methods. In the first, we directly map negative, neutral, or positive polarity to an impact category of Negative, Perfunctory, or Significant, respectively; we justify this decision by considering that authors have a more favorable disposition towards citations that have a greater impact on their work. The second method is an algorithm based on criteria from previous studies that link impact with citation frequency, location in the document, the number of sections in which citations appear, and certain functions and polarities. This procedure yields a classification using the same three impact categories. The algorithm's results are compared with data from a survey in which authors rated the impact of citations in their own articles. Our method produces good outcomes: when we compared 161 impact-rated citations obtained with our algorithm to the corresponding survey responses, the results were very similar. The algorithm captures and relates the most important criteria for evaluating impact: matching its results against the survey data gives a weighted average F-measure of 0.93, a very satisfactory value that demonstrates an excellent correlation between the authors' annotations and the results of the algorithm. To continue our experiments, and in order to use all the generated features, we trained an SVM classifier (with SMO) on entries that include every function, polarity, frequency, location, and number of sections in which each citation occurs. The results reach a weighted average F-measure of 0.98. Consequently, we could train a classifier that uses the features of our corpus to reproduce our algorithm and automatically classify impact into three levels: Negative, Perfunctory, and Significant. The resulting classification has good values of precision, recall, and F-measure.
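The two impact-assignment methods can be illustrated with a short sketch. The fields, weights, and threshold below are our own assumptions chosen for illustration; the thesis's actual algorithm combines the same kinds of criteria (frequency, location, section spread, function, and polarity) but its exact rules are described in the corresponding chapter.

```python
from dataclasses import dataclass

@dataclass
class Citation:
    polarity: str          # "negative" | "neutral" | "positive"
    frequency: int         # times the reference is cited in the paper
    n_sections: int        # number of sections in which it appears
    in_core_section: bool  # cited in e.g. Methods/Results, not only Related Work

def impact_direct(c: Citation) -> str:
    """Method 1: map polarity straight to an impact category."""
    return {"negative": "Negative", "neutral": "Perfunctory",
            "positive": "Significant"}[c.polarity]

def impact_heuristic(c: Citation) -> str:
    """Method 2 (illustrative): combine frequency, location, section spread
    and polarity. The scoring rule and threshold are assumptions, not the
    thesis's exact algorithm."""
    if c.polarity == "negative":
        return "Negative"
    score = c.frequency + c.n_sections + (2 if c.in_core_section else 0)
    return "Significant" if score >= 5 else "Perfunctory"

c = Citation(polarity="positive", frequency=3, n_sections=2, in_core_section=True)
print(impact_direct(c), impact_heuristic(c))  # Significant Significant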
We consider this a contribution that could incorporate quality assessment factors into impact evaluation, leading to holistic criteria in which not all citations count equally. To annotate the corpus, we classified citation function and polarity according to the suggested scheme, using an annotation methodology that includes a pre-annotation step in which keywords and tags are detected to clarify and standardize the internal representation that a coder, or annotator, forms of a citation context. With this method, each coder's mental model is more likely to coincide with those produced by the other coders, and consequently we obtain a good rate of inter-annotator agreement. This pre-annotation step dramatically improves the inter-annotator agreement index, which is indispensable for validating the rating reliability and the reproducibility of the annotation scheme. We validate this index in Chapter 6, where the values of Fleiss' kappa reach 0.862 for function and 0.912 for polarity. These values correspond to almost perfect agreement on the scale of Landis and Koch (1977). Using keywords and labels thus yields a notable improvement: without this stage, the same annotators obtained low values for this index, 0.386 for function and 0.259 for polarity, owing to the complexity of the annotation task for a high-granularity scheme. Regarding the context length used for classification, the annotation results show that the context chosen by coders largely corresponds to a single statement: the sentence containing the citation. Contexts of two or three sentences appear less frequently. It is probable that a context need not include more than three sentences to cover all the necessary information about the reference. In Chapter 7, we evaluate the annotated features as inputs to a classifier for citation function and polarity. The results rate highly for precision, recall, and F-measure, which demonstrates the suitability of the chosen features for those classifications. We chose algorithms following the recommendations of our initial state-of-the-art study, in which the most frequently suggested algorithms were SVM trained with SMO and Naïve Bayes. In our results, SVM with SMO achieves the best values: 0.833, 0.828, and 0.825 for precision, recall, and F-measure in function classification, which improve on the state of the art. For polarity categorization, the same algorithm again yields excellent values for precision, recall, and F-measure: 0.886, 0.882, and 0.88. In Chapter 7, we also present experiments to establish the best combination of training and test sets, which must be independent samples. For our corpus size, it is important that the test set be large enough, especially for the less frequent functions and polarities. For the independence of training and test samples, it is better to choose Weka's Percentage Split option, and for size considerations we found that the best proportion between training and test samples is 66% vs 34%. Given the reliability obtained in our corpus annotation, we suggest that the data continue to be annotated manually using our methodology, and we argue that current automatic annotation techniques must be improved before they can yield reliable results for an annotation scheme of medium or high granularity.
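For reference, the agreement index reported above can be computed as in the following Python sketch; the rating matrix in the example is invented for illustration and does not reproduce the thesis's data.

```python
import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    """Fleiss' kappa for a matrix of shape (items, categories), where
    counts[i, j] is how many annotators assigned item i to category j.
    Assumes the same number of raters per item."""
    n_raters = counts.sum(axis=1)[0]
    n_items = counts.shape[0]
    p_j = counts.sum(axis=0) / (n_items * n_raters)   # category proportions
    P_i = ((counts ** 2).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    P_bar, P_e = P_i.mean(), (p_j ** 2).sum()         # observed vs chance agreement
    return (P_bar - P_e) / (1 - P_e)

# Toy example: 4 citation contexts, 3 annotators, 3 polarity categories
# (negative, neutral, positive); values invented for illustration.
ratings = np.array([[0, 3, 0],
                    [0, 1, 2],
                    [3, 0, 0],
                    [0, 0, 3]])
print(round(fleiss_kappa(ratings), 3))  # 0.745, substantial on the Landis & Koch scale
```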
Mandya (2012) categorizes annotation schemes into two classes: those that use manual classification and those with automatic feature extraction and classification. In that study, we observe that manually classified schemes have medium or high granularity, while automatically processed schemes have low granularity. As noted, corpora annotated with medium or high granularity provide valuable information that is indispensable for citation context analysis, but their annotation is a complex task even for human annotators; the challenges for automatic annotation are therefore considerable. According to our state-of-the-art study, the medium- and high-granularity schemes shown in Table 1 were manually labeled by their authors, and in studies that attempted automatic labeling of this kind of corpus, the results were not good enough to yield reliable data.

Conclusions

The improvement of automatic corpus labeling according to fine-granularity schemes remains an important line for future work. The effort is justified because this type of annotated corpus provides essential information required for citation context analysis. In summary, our contributions in the present work include the following: a proposed annotation scheme, simple in structure but with high granularity; the annotation methodology, particularly the pre-annotation process that detects keywords and labels, which help create shared mental models and also serve as input features in classification tasks; a publicly available annotated corpus that contains those features and is accessible for collaborative work; a method to evaluate citation impact using criteria drawn from other works and algorithms developed in our thesis; and the experimental finding that the significant context around a citation usually takes no more than three sentences, including the one with the mention.