Extraction of medical knowledge from clinical reports and chest x-rays using machine learning techniques
- Bustos Moreno Moreno, Aurelia
- Antonio Pertusa Ibáñez (Director)
Defense university: Universitat d'Alacant / Universidad de Alicante
Defense date: 1 July 2019
- Nuria Oliver (Chair)
- Miguel Cazorla Quevedo (Secretary)
- María de los Desamparados de la Iglesia Vayá (Committee member)
Type: Thesis
Abstract
Introduction: This thesis addresses the extraction of medical knowledge from clinical text using deep learning techniques. In particular, the proposed methods focus on cancer clinical trial protocols and chest x-ray reports. Clinical trials provide the evidence needed to determine the safety and effectiveness of new medical treatments. These trials form the basis of clinical practice guidelines and greatly assist clinicians in their daily practice when making treatment decisions. However, the eligibility criteria used in oncology trials are too restrictive, and the results obtained in clinical trials cannot be extrapolated to patients whose clinical profiles were excluded from the trial protocols. The efficacy and safety of new treatments are therefore not always established for a given patient, and checking this requires the manual review of numerous eligibility criteria, which is impracticable for clinicians on a daily basis. The first task that this thesis addresses is to automatically discern, given the clinical characteristics of particular patients, their type of cancer and the intended treatment, whether or not they are represented in the corpus of available clinical trials.
The second main task addressed in this thesis is knowledge extraction from medical reports associated with radiographs. Conventional radiology remains the most frequently performed technique in radiodiagnosis services, with a share close to 75% (Radiología Médica, 2010). In particular, the chest x-ray is the most common medical imaging exam, with over 35 million taken every year in the US alone (Kamel et al., 2017). Chest x-rays allow for inexpensive screening of several pathologies, including masses, pulmonary nodules, effusions, cardiac abnormalities and pneumothorax.
Beyond implementing and obtaining results for both clinical trials and chest x-rays, this thesis studies the nature of health data, the novelty of applying deep learning methods to obtain large-scale labeled medical datasets, and the relevance of their applications in medical research. The thesis describes this journey, guiding the reader across multiple disciplines, from engineering to medicine and on to ethical considerations in artificial intelligence applied to medicine.
Development, methods and results: First, a large medical corpus comprising all cancer clinical trial protocols published by competent authorities in the last 18 years was used to extract medical knowledge in order to help automatically learn patient eligibility for these trials. For this, a model was built to automatically predict whether short clinical statements were considered inclusion or exclusion criteria. A method based on deep neural networks was trained on a dataset of 6 million short free-text statements to classify them as eligible or not eligible, using pretrained word embeddings as inputs. The semantic reasoning captured by the resulting word-embedding representations was also analyzed; it made it possible to identify equivalent treatments for one type of tumor by analogy with the drugs used to treat other tumors. The results demonstrate the capability of machine learning methods to discern which statements in short free-text clinical notes are regarded as inclusion or exclusion criteria.
For the second task, all the chest x-rays that had been interpreted and reported by radiologists at the Hospital Universitario de San Juan (Alicante) from January 2009 to December 2017 were used to build a novel large-scale dataset in which each high-resolution radiograph is labeled with its corresponding metadata, radiological findings and pathologies.
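The word-embedding analogy mentioned for the first task can be illustrated with a minimal sketch. It assumes the common vector-offset formulation ("a is to b as c is to d") scored with cosine similarity; the abstract does not state which formulation the thesis uses. The token names and the three-dimensional vectors below are invented placeholders, not embeddings or drug names from the thesis.

```python
import math

# Hypothetical toy embeddings: invented values for illustration only,
# not vectors learned from the clinical trial corpus.
EMBEDDINGS = {
    "tumor_a": [1.0, 0.0, 0.0],
    "drug_a": [1.0, 1.0, 0.0],
    "tumor_b": [0.0, 0.0, 1.0],
    "drug_b": [0.0, 1.0, 1.0],
    "unrelated": [-1.0, 0.0, 0.5],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def analogy(a, b, c, embeddings):
    """Return the token d that best completes 'a is to b as c is to d'."""
    # Vector-offset target: b - a + c.
    target = [
        vb - va + vc
        for va, vb, vc in zip(embeddings[a], embeddings[b], embeddings[c])
    ]
    # The three query tokens are excluded from the candidates.
    candidates = [t for t in embeddings if t not in {a, b, c}]
    return max(candidates, key=lambda t: cosine(embeddings[t], target))


# "tumor_a is treated with drug_a; what plays that role for tumor_b?"
print(analogy("tumor_a", "drug_a", "tumor_b", EMBEDDINGS))
```

With real pretrained embeddings the same query would be run over the full vocabulary, and the top-ranked candidates would still need clinical review.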
This dataset, named PadChest, includes more than 160,000 images obtained from 67,000 patients, covering six different position views and additional information on image acquisition and patient demographics. The free-text reports written in Spanish by radiologists were labeled with 174 different radiographic findings, 19 differential diagnoses and 104 anatomic locations, organized as a hierarchical taxonomy and mapped onto standard Unified Medical Language System (UMLS) terminology. For this, a subset of the reports (27%) was manually annotated by trained physicians, whereas the remaining set was automatically labeled from the text reports with deep supervised learning methods using attention mechanisms. The generated labels were then validated on an independent test set, achieving a Micro-F1 score of 0.93. To the best of our knowledge, this is one of the largest public chest x-ray databases suitable for training supervised models on radiographs, and also the first to contain radiographic reports in Spanish. The PadChest dataset can be downloaded on request from http://bimcv.cipf.es/bimcv-projects/padchest/.
Conclusions: First, this thesis shows that representation learning using deep neural networks can be successfully leveraged to extract medical knowledge from clinical trial protocols and potentially assist practitioners when prescribing treatments. Second, deep neural networks were successfully applied to build and release the largest public chest x-ray image dataset, in terms of number of patients, labeled with all radiological findings, diagnoses and anatomic locations as reported by radiologists. The relevance of its potential applications in medical imaging has contributed to its external dissemination and worldwide reach.
References:
- Radiología Médica, Sociedad Española de (2010). Radiología esencial. Ed. Médica Panamericana, pp. 1785–1810.
- Kamel, Sarah I et al. (2017).
“Utilization Trends in Noncardiac Thoracic Imaging, 2002-2014”. In: Journal of the American College of Radiology 14.3, pp. 337–342.
- Mullenbach, James et al. (2018). “Explainable Prediction of Medical Codes from Clinical Text”. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). Vol. 1, pp. 1101–1111.
The work in this thesis has been published in the following articles:
- Learning Eligibility in Cancer Clinical Trials Using Deep Neural Networks. Aurelia Bustos and Antonio Pertusa. Applied Sciences, 2018, 8(7), 1206. https://doi.org/10.3390/app8071206
- PadChest: A large chest x-ray image dataset with multi-label annotated reports. Aurelia Bustos, Antonio Pertusa, Jose-Maria Salinas, Maria de la Iglesia-Vayá. arXiv preprint: https://arxiv.org/abs/1901.07441. Currently under review in the journal Medical Image Analysis.
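As a note on the Micro-F1 score reported for the automatically generated PadChest labels, the following is a minimal sketch of how a micro-averaged F1 is computed in a multi-label setting: true positives, false positives and false negatives are pooled over all reports and labels before the score is taken. The function name and the example label strings are illustrative assumptions; they are not the thesis's evaluation code or the actual PadChest label set.

```python
def micro_f1(true_labels, predicted_labels):
    """Micro-averaged F1 over a list of per-report label sets."""
    tp = fp = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        truth, pred = set(truth), set(pred)
        tp += len(truth & pred)   # labels correctly predicted
        fp += len(pred - truth)   # labels predicted but not annotated
        fn += len(truth - pred)   # labels annotated but missed
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Hypothetical annotations for three reports (illustrative labels only).
truth = [{"effusion", "nodule"}, {"pneumothorax"}, {"normal"}]
pred = [{"effusion"}, {"pneumothorax", "nodule"}, {"normal"}]

# Pooled counts: tp = 3, fp = 1, fn = 1, so F1 = 6 / 8.
print(micro_f1(truth, pred))  # 0.75
```

Because counts are pooled before averaging, frequent labels weigh more heavily in this metric than rare ones.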