Automated Classification of Free-text Pathology Reports for Registration of Incident Cases of Cancer

V. Jouhet; G. Defossez; A. Burgun; P. le Beux; P. Levillain; P. Ingrand; V. Claveau

doi:10.3414/ME11-01-0005

Methods Inf Med 2012; 51(03): 242-251
DOI: 10.3414/ME11-01-0005

Original Articles

Schattauer GmbH

Authors

V. Jouhet
¹Unité d’épidémiologie, biostatistique et registre des cancers de Poitou-Charentes, Faculté de médecine, Centre Hospitalier Universitaire de Poitiers, Université de Poitiers, Poitiers, France

G. Defossez
¹Unité d’épidémiologie, biostatistique et registre des cancers de Poitou-Charentes, Faculté de médecine, Centre Hospitalier Universitaire de Poitiers, Université de Poitiers, Poitiers, France

A. Burgun
²INSERM U936, Faculté de médecine, Université de Rennes 1, Rennes, France

P. le Beux
²INSERM U936, Faculté de médecine, Université de Rennes 1, Rennes, France

P. Levillain
³Anatomie et cytologie pathologiques, Centre Hospitalier Universitaire de Poitiers, Poitiers, France
⁴Centre de Regroupement Informatique et Statistique en Anatomo-Pathologie de Poitou-Charentes, Faculté de médecine, Université de Poitiers, Poitiers, France

P. Ingrand
¹Unité d’épidémiologie, biostatistique et registre des cancers de Poitou-Charentes, Faculté de médecine, Centre Hospitalier Universitaire de Poitiers, Université de Poitiers, Poitiers, France
⁵INSERM, CIC 802, Poitiers, France

V. Claveau
⁶IRISA – CNRS UMR 6074, Rennes, France

Further Information

Publication History

Received: 14 January 2011

Accepted: 30 May 2011

Publication Date: 20 January 2018 (online)

Summary

Objective: Our study aimed to construct and evaluate functions called “classifiers”, produced by supervised machine learning techniques, to automatically categorize pathology reports using solely their content.

Methods: Patients from the Poitou-Charentes Cancer Registry with at least one pathology report and a single non-metastatic invasive neoplasm were included. A descriptor weighting function accounting for the distribution of terms among targeted classes was developed and compared to classic methods based on inverse document frequencies. The classification was performed with support vector machine (SVM) and Naive Bayes classifiers. Two levels of granularity were tested for both the topographical and the morphological axes of the ICD-O3 code. Two abilities were evaluated using F1-measures: correctly attributing a precise ICD-O3 code, and attributing the broad category defined by the International Agency for Research on Cancer (IARC) for the multiple primary cancer registration rules.
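As an illustration of the classic baseline mentioned in the Methods, the following is a minimal sketch of inverse-document-frequency term weighting. It is not the class-based weighting function developed in the study, whose formula is not given in this summary. The function name `tfidf_weights`, the `tf * log(N / df)` variant, and the toy token lists are assumptions made for this example only.

```python
import math
from collections import Counter

def tfidf_weights(documents):
    """Return one {term: weight} dict per tokenized document.

    Weight = term frequency * log(N / document frequency), one common
    form of the classic inverse-document-frequency scheme (assumed here;
    the summary does not say which variant the authors used).
    """
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        term_freq = Counter(doc)
        weights.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return weights

# Toy, made-up "reports" (already tokenized); not data from the study.
reports = [
    ["adenocarcinoma", "colon", "invasive"],
    ["adenocarcinoma", "breast", "invasive"],
    ["melanoma", "skin", "invasive"],
]
weights = tfidf_weights(reports)
```

In this toy corpus, "invasive" appears in every report and therefore receives a weight of zero, while terms confined to a single report receive the highest weight. A class-aware weighting, as described in the Methods, would instead look at how terms are distributed among the target classes.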

Results: A total of 5121 pathology reports produced by 35 pathologists were selected. The best performance was achieved by our class-weighted descriptor, associated with an SVM classifier. Using this method, the pathology reports were properly classified in the IARC categories with F1-measures of 0.967 for both topography and morphology. The ICD-O3 code attribution had lower performance, with an F1-measure of 0.715 for topography and 0.854 for morphology.
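For readers unfamiliar with the metric, the F1-measure is the harmonic mean of precision and recall, which for a single class equals 2TP / (2TP + FP + FN). The sketch below computes it per class from paired true and predicted labels. The summary does not state how the per-class values were averaged into the reported figures, so no averaging is shown. The function name `f1_per_class` and the example label sequences are illustrative assumptions, not data from the study.

```python
from collections import Counter

def f1_per_class(true_labels, predicted_labels):
    """Return {label: F1} computed as 2*TP / (2*TP + FP + FN)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == pred:
            tp[truth] += 1      # correct attribution
        else:
            fp[pred] += 1       # predicted class gains a false positive
            fn[truth] += 1      # true class gains a false negative
    scores = {}
    for label in set(true_labels) | set(predicted_labels):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores

# Illustrative topography-style labels; made up for this example.
truth = ["C18", "C18", "C50", "C44"]
predicted = ["C18", "C50", "C50", "C44"]
scores = f1_per_class(truth, predicted)
```

Here one "C18" report is misclassified as "C50", which lowers the F1 of both classes to 2/3, while "C44" keeps a perfect score of 1.0.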

Conclusion: These results suggest that free-text pathology reports could be a useful data source for automated systems that identify and notify new cases of cancer. Future work is needed to evaluate the performance gains obtainable from natural language processing, including the handling of reports describing multiple tumors and the possible incorporation of other medical documents such as surgical reports.

Keywords

Medical Informatics - neoplasm - pathology - free text - automated classification
