Deciphering Abbreviations in Malaysian Clinical Notes Using Machine Learning

Ismat Mohd Sulaiman; Awang Bulgiba; Sameem Abdul Kareem; Abdul Aziz Latip

doi:10.1055/a-2521-4372

Methods of Information in Medicine

CC BY 4.0 · Methods Inf Med 2024; 63(05/06): 195-202

Original Article

Authors

Ismat Mohd Sulaiman
¹Health Informatics Centre, Planning Division, Ministry of Health Malaysia, Putrajaya, Malaysia

Awang Bulgiba
²Academy of Sciences, Kuala Lumpur, Malaysia

Sameem Abdul Kareem
³Faculty of Computer Science and Information Technology, University of Malaya, Kuala Lumpur, Wilayah Persekutuan, Malaysia

Abdul Aziz Latip
⁴MIMOS Berhad, Kuala Lumpur, Malaysia

Abstract

Objective This is the first Malaysian machine learning model to detect and disambiguate abbreviations in clinical notes. The model is designed to be incorporated into MyHarmony, a natural language processing system that extracts clinical information for health care management. It uses word embeddings so that it remains feasible within the constraints of low-resource settings, where it is intended for secondary analysis rather than real-time use.

Methods A Malaysian clinical embedding, based on the Word2Vec model, was developed using 29,895 electronic discharge summaries. The embedding was compared against a conventional rule-based approach and a FastText embedding on two tasks: abbreviation detection and abbreviation disambiguation. Machine learning classifiers were applied to assess performance.
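As a rough illustration of how an embedding can support abbreviation disambiguation, the sketch below averages the vectors of the words around an abbreviation and picks the sense whose training contexts are most similar. It is a simplified nearest-centroid stand-in, not the classifiers evaluated in this study. The tiny three-dimensional `EMBEDDING`, the abbreviation "RA", its two senses, and every function name are assumptions invented for the example.

```python
import math

# Toy 3-dimensional vectors standing in for a trained clinical embedding.
# All words and values are invented for illustration only.
EMBEDDING = {
    "spo2": [0.9, 0.1, 0.0],
    "oxygen": [0.8, 0.2, 0.1],
    "saturation": [0.9, 0.0, 0.2],
    "joint": [0.1, 0.9, 0.0],
    "swelling": [0.0, 0.8, 0.2],
    "methotrexate": [0.2, 0.9, 0.1],
}


def context_vector(tokens):
    """Average the vectors of the tokens that exist in the embedding."""
    vectors = [EMBEDDING[t] for t in tokens if t in EMBEDDING]
    if not vectors:
        return None  # no known context word, so no prediction is possible
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def train_centroids(examples):
    """Map each sense to the mean context vector of its labelled examples."""
    grouped = {}
    for tokens, sense in examples:
        vector = context_vector(tokens)
        if vector is not None:
            grouped.setdefault(sense, []).append(vector)
    return {
        sense: [sum(column) / len(vectors) for column in zip(*vectors)]
        for sense, vectors in grouped.items()
    }


def disambiguate(tokens, centroids):
    """Return the sense whose centroid is closest to the context vector."""
    vector = context_vector(tokens)
    if vector is None:
        return None
    return max(centroids, key=lambda sense: cosine(vector, centroids[sense]))


# Hypothetical labelled contexts for the ambiguous abbreviation "RA".
TRAINING = [
    (["spo2", "saturation"], "room air"),
    (["oxygen", "spo2"], "room air"),
    (["joint", "swelling"], "rheumatoid arthritis"),
    (["methotrexate", "joint"], "rheumatoid arthritis"),
]
CENTROIDS = train_centroids(TRAINING)

print(disambiguate(["oxygen", "saturation", "on"], CENTROIDS))
```

In practice the context vectors would come from the trained embedding and would be passed to a learned classifier; the averaging step is only one plausible way to turn a context window into a feature vector.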

Results The Malaysian clinical word embedding comprised 7 million word tokens, 24,352 unique vocabulary words, and 100 dimensions. For abbreviation detection, the Decision Tree classifier augmented with the Malaysian clinical embedding showed the best performance (F-score of 0.9519). For abbreviation disambiguation, the classifier with the Malaysian clinical embedding had the best performance for most of the abbreviations (F-score of 0.9903).
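For readers unfamiliar with the metric, the sketch below shows how an F-score is commonly derived from prediction counts. The abstract does not state which variant or averaging scheme was used, so this assumes the balanced F1 score (the harmonic mean of precision and recall). The counts in the example call are made up and are not results from this study.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Compute precision, recall, and balanced F1 from raw counts."""
    predicted = true_pos + false_pos  # everything the model flagged
    actual = true_pos + false_neg     # everything that should be flagged
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Illustrative counts only: 90 abbreviations found correctly,
# 10 false alarms, and 10 abbreviations missed.
precision, recall, f1 = precision_recall_f1(90, 10, 10)
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```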

Conclusion Despite having a smaller vocabulary and fewer dimensions, our local clinical word embedding performed better than the larger nonclinical FastText embedding. Word embeddings combined with simple machine learning algorithms can decipher abbreviations well. This approach also requires fewer computational resources and is suitable for implementation in low-resource settings such as Malaysia. Integrating the model into MyHarmony will improve recognition of clinical terms, and thus the information generated for monitoring Malaysian health care services and policymaking.

Keywords

electronic health record - discharge summaries - word embedding - machine learning - natural language processing - health system management
