Automated Classification of Free-text Pathology Reports for Registration of Incident Cases of Cancer

V. Jouhet; G. Defossez; A. Burgun; P. le Beux; P. Levillain; P. Ingrand; V. Claveau

doi:10.3414/ME11-01-0005


Methods Inf Med 2012; 51(03): 242-251

Original Articles

Schattauer GmbH


Authors

V. Jouhet
¹Unité d’épidémiologie, biostatistique et registre des cancers de Poitou-Charentes, Faculté de médecine, Centre Hospitalier Universitaire de Poitiers, Université de Poitiers, Poitiers, France

G. Defossez
¹Unité d’épidémiologie, biostatistique et registre des cancers de Poitou-Charentes, Faculté de médecine, Centre Hospitalier Universitaire de Poitiers, Université de Poitiers, Poitiers, France

A. Burgun
²INSERM U936, Faculté de médecine, Université de Rennes 1, Rennes, France

P. le Beux
²INSERM U936, Faculté de médecine, Université de Rennes 1, Rennes, France

P. Levillain
³Anatomie et cytologie pathologiques, Centre Hospitalier Universitaire de Poitiers, Poitiers, France
⁴Centre de Regroupement Informatique et Statistique en Anatomo-Pathologie de Poitou-Charentes, Faculté de médecine, Université de Poitiers, Poitiers, France

P. Ingrand
¹Unité d’épidémiologie, biostatistique et registre des cancers de Poitou-Charentes, Faculté de médecine, Centre Hospitalier Universitaire de Poitiers, Université de Poitiers, Poitiers, France
⁵INSERM, CIC 802, Poitiers, France

V. Claveau
⁶IRISA – CNRS UMR 6074, Rennes, France

Abstract

Summary

Objective: Our study aimed to construct and evaluate functions called “classifiers”, produced by supervised machine learning techniques, to automatically categorize pathology reports using only their content.

Methods: Patients from the Poitou-Charentes Cancer Registry having at least one pathology report and a single non-metastatic invasive neoplasm were included. A descriptor weighting function accounting for the distribution of terms among targeted classes was developed and compared to classic methods based on inverse document frequencies. The classification was performed with support vector machine (SVM) and Naive Bayes classifiers. Two levels of granularity were tested for both the topographical and the morphological axes of the ICD-O3 code. The ability to correctly attribute a precise ICD-O3 code and the ability to attribute the broad category defined by the International Agency for Research on Cancer (IARC) for the multiple primary cancer registration rules were evaluated using F1-measures.
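For readers unfamiliar with the baseline mentioned in the Methods, the following is a minimal sketch of a classic weighting based on inverse document frequencies (term frequency multiplied by log(N/df)). It does not reproduce the class-weighted descriptor developed in the study, whose formula is not given in this abstract. The function name `tf_idf` and the toy reports are assumptions made purely for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document.

    Weight = raw term count * log(N / document frequency).
    This is the textbook baseline, not the study's own descriptor.
    """
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weighted = []
    for doc in documents:
        counts = Counter(doc)
        weighted.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weighted


# Toy tokenized reports, invented for illustration (not study data).
reports = [
    ["adenocarcinoma", "colon", "invasive"],
    ["carcinoma", "breast", "invasive"],
    ["adenocarcinoma", "prostate", "invasive"],
]
weights = tf_idf(reports)
```

In this toy corpus a term that appears in every report, such as "invasive", receives a weight of zero, while a term confined to a single report receives the largest weight. A class-aware weighting would additionally look at how terms are distributed across the target classes rather than across documents only.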

Results: 5121 pathology reports produced by 35 pathologists were selected. The best performance was achieved by our class-weighted descriptor combined with an SVM classifier. Using this method, the pathology reports were correctly classified into the IARC categories with F1-measures of 0.967 for both topography and morphology. Attribution of the precise ICD-O3 code performed less well, with F1-measures of 0.715 for topography and 0.854 for morphology.
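As a reminder of how an F1-measure is obtained, the sketch below computes per-class precision, recall and F1 and then takes their unweighted mean. The abstract does not state which averaging scheme was used in the study, so the macro average here is an assumption. The function name `f1_per_class` and the toy labels are likewise illustrative only.

```python
def f1_per_class(gold, predicted):
    """Return {label: F1} computed from paired gold and predicted labels."""
    labels = sorted(set(gold) | set(predicted))
    pairs = list(zip(gold, predicted))
    scores = {}
    for label in labels:
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        # F1 is the harmonic mean of precision and recall.
        scores[label] = 2 * precision * recall / denom if denom else 0.0
    return scores


# Toy topography labels, invented for illustration (not study data).
gold = ["C50", "C50", "C18", "C61"]
predicted = ["C50", "C18", "C18", "C61"]
scores = f1_per_class(gold, predicted)
# Unweighted (macro) mean over classes; the averaging choice is an assumption.
macro_f1 = sum(scores.values()) / len(scores)
```

With many fine-grained codes, each rare class has few examples and a single error moves its F1 substantially, which is one plausible reason why precise code attribution scores lower than attribution to broad categories.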

Conclusion: These results suggest that free-text pathology reports could serve as a data source for automated systems that identify and notify new cases of cancer. Future work is needed to evaluate how much natural language processing improves performance, to handle reports that describe multiple tumors, and to assess the incorporation of other medical documents such as surgical reports.

Keywords

Medical Informatics - neoplasm - pathology - free text - automated classification

