Keywords
electronic health record - discharge summaries - word embedding - machine learning
- natural language processing - health system management
Introduction
MyHarmony is a natural language processing (NLP) system built in 2016 to extract information, such as diagnoses, procedures, and drugs, from narrative text.[1] [2] [3] It was developed by the Ministry of Health Malaysia to reuse clinical information for health service monitoring and planning. MyHarmony extracts information by recognizing clinical entities, codifying them to a clinical terminology standard such as SNOMED CT, and storing them in a structured format for statistical analysis. The system applies simple tokenization, n-grams, and a set of negations for codification.
One of the challenges of extracting information from electronic clinical notes is the abundant use of abbreviations, which can be ambiguous because of their multiple possible meanings. For example, MR can mean “mister” when addressing a person, “mitral regurgitation” of the heart valve, or “modified release” for drugs. Misinterpreted abbreviations have been shown to cause patient harm and to mislead researchers and policymakers.[4] [5] [6] In Malaysia, 13 to 17% of physicians had experienced delays in giving medications, performing procedures, or making diagnoses for abbreviation-related reasons, whereas 9 to 12% had made wrong diagnoses and given wrong therapies because of misinterpreted abbreviations.[5] In MyHarmony, extracting ambiguous abbreviations would result in inaccurate analysis.
NLP techniques to address abbreviations in clinical notes have been modernized with the use of word embeddings,[7] [8] [9] transformers,[10] and, more recently, Large Language Models (LLMs).[11] [12] These newer techniques draw context from surrounding words and are therefore reliant on the writing style and language of the corpus. This aspect is relevant to clinical notes, which are known to differ from general-domain texts and formal biomedical literature.[13] An example of a medical-domain LLM is Med-PaLM 2, developed by Google, which scored 86.5% on U.S. Medical Licensing Examination (USMLE)-style questions.[14]
The evolution of these techniques has brought significant advances but increasingly demanding computational requirements. The choice of technique depends on the resources available for implementation, whether deployed in a live environment during clinical documentation or as backend processing after clinical data are stored. For example, LLMs such as GPT-3, BioGPT, and Med-PaLM 2 demand high-performance Graphics Processing Units or even Tensor Processing Units to handle their large numbers of parameters. Transformer-based models, like BERT[15] and its clinical derivatives, including ClinicalBERT[16] and BioBERT,[17] offer more context awareness than word embeddings but require specialized hardware and memory capacity.[18] [19] [20] Additionally, Med-PaLM 2 is currently restricted to specific countries and requires approval for use via Google Cloud.[14]
In contrast, word embeddings, such as Word2Vec,[21] GloVe,[22] and FastText,[23] are lightweight enough to run on a standard Central Processing Unit. In our use case, we chose to explore word embeddings to achieve our objective within the constraints of our low-resource setting. Additionally, a locally trained embedding can be tailored to the Malaysian clinical context and to the informal writing style of clinical notes written by non-native English-speaking clinicians in Malaysia.
The objective of our research was to develop and compare the performance of the Malaysian clinical embedding (MyCE) against the nonclinical and larger FastText embeddings (FTE) on the task of detecting and disambiguating abbreviations in Malaysian electronic clinical notes. To our knowledge, no previous study has compared the performance of a machine learning model using a general-domain word embedding against one trained on clinical notes; the closest comparison was against formal biomedical literature.[24] The aim is to include an abbreviation detection and disambiguation model in the MyHarmony NLP pipeline, thereby improving its information extraction.
Literature Review
Word embedding is a representation of a word and its relatedness to other words in a vector space.[8] Each word in the model is assigned a vector that encodes its context, such that distances between vectors reflect the relatedness of the corresponding words. This rests on the distributional hypothesis: words that occur close together tend to be semantically related.[25] Once trained, the word vectors can be used as features for a machine learning model, for example, to predict whether a word is an abbreviation and what the abbreviation could mean.
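As a brief illustration, the following toy sketch (not the study's actual configuration; the corpus and parameters are invented) shows how word vectors trained with gensim's Word2Vec can be queried and reused as classifier features:

```python
# A minimal sketch using gensim; the corpus and parameters are illustrative.
from gensim.models import Word2Vec

# Toy tokenized "clinical" sentences (hypothetical examples).
corpus = [
    ["patient", "on", "t", "aspirin", "75", "mg", "od"],
    ["severe", "mr", "on", "echo"],
    ["pr", "88", "bpm", "afebrile"],
]

# Skip-gram model; a real clinical embedding is trained on far more text.
model = Word2Vec(corpus, vector_size=100, window=5, min_count=1, sg=1)

vec = model.wv["mr"]                        # 100-dimensional feature vector
print(model.wv.most_similar("mr", topn=2))  # nearest words in vector space
```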
The type of corpus plays a pivotal role in creating a word embedding model, as it influences the vocabulary size, word relatedness, and dimension of the model. A larger corpus is more likely to yield an extensive vocabulary and support a higher dimension. General word embeddings, like GloVe and FastText, are often trained on corpora containing billions of words with more than 100 dimensions. However, researchers have shown that a domain-specific word embedding, albeit smaller, can outperform a general word embedding on domain-specific tasks.[24] For that reason, word embeddings from the biomedical domain, specifically from formal biomedical literature, have been created and made publicly available. For example, BioWordVec was created using PubMed literature and Medical Subject Headings (MeSH).[26] Several state-of-the-art embeddings using BERT have also been released, such as BioBERT, trained on PubMed abstracts and PMC full-text articles,[17] and SciBERT, trained on random samples of Semantic Scholar papers.[27] Specifically for clinical notes, ClinicalBERT was created using critical care notes from the Medical Information Mart for Intensive Care (MIMIC)-III database.[16]
There are several challenges in choosing a suitable word embedding for a local implementation of a specific task such as handling abbreviations. First, few word embeddings built from clinical notes are publicly available, owing to data privacy policies.[8] Second, embeddings trained on formal literature, such as BioWordVec (based on the word2vec model), may not reflect the language and writing style of clinical notes. Additionally, the publicly available clinical embeddings are usually trained on MIMIC-III critical care notes from the Beth Israel Deaconess Medical Center in Boston, Massachusetts, United States, and may not reflect the context of Malaysian clinical notes.
Methods
This research was conducted on an extract of Malaysian electronic discharge summaries
from MyHarmony. It was approved by the relevant ethics committee. [Fig. 1] illustrates the research framework for the detection and disambiguation of abbreviations.
Fig. 1 Design framework to detect and disambiguate abbreviations.
Data Preprocessing
A total of 29,895 anonymized clinical notes from January to December 2017 were extracted from MyHarmony. From these, 1,102 were randomly sampled: 939 Cardiology discharge summaries and 163 General Medicine discharge summaries. Eleven medical graduates annotated the samples over 3 months using the web-based open-source Brat Rapid Annotation Tool (BRAT).[28] They were required to (1) annotate the abbreviations and (2) give the sense (or expansion) of each abbreviation found. Misspellings and punctuation errors were excluded from the scope of this study. The lists of abbreviations and their senses were standardized to UK English spelling for consistency.
The gold standard annotated clinical notes were tokenized. Each word was labelled as an abbreviation or not, and those identified as abbreviations were assigned their respective senses. The tokenized clinical notes were then assigned three types of features for the comparative study: rule-based features as a baseline, FTE, and MyCE. [Fig. 2] illustrates the tokenized clinical notes with labels for abbreviations and their senses.
Fig. 2 A screenshot of the tokenized dataset with labels for abbreviation and its sense.
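A hypothetical sketch of how such a tokenized, labelled dataset might be represented (the records are invented for illustration, not drawn from the study data):

```python
# Hypothetical token records: each carries an abbreviation flag and,
# where applicable, the annotated sense.
dataset = [
    {"token": "continue", "is_abbrev": False, "sense": None},
    {"token": "t",        "is_abbrev": True,  "sense": "tablet"},
    {"token": "75",       "is_abbrev": False, "sense": None},
    {"token": "mg",       "is_abbrev": True,  "sense": "milligram"},
]

# Detection uses the flag as the target; disambiguation uses the sense.
y_detect = [rec["is_abbrev"] for rec in dataset]
y_sense = [rec["sense"] for rec in dataset if rec["is_abbrev"]]
```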
Features for Abbreviation Detection
There were 42 rule-based features for abbreviation detection, based on the morphologies of the abbreviation, the word before, and the word after. The morphologic features are (1) the number of characters (integer); (2) the presence of special characters (Boolean): slash, negative, positive, alias; (3) the presence of a period character at the beginning, in between, and at the end; (4) the presence of numbers at the beginning, in between, and at the end; and (5) the presence of the character “X” at the beginning, in between, and at the end.
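A sketch of how a few of these morphologic features could be computed; the function name and exact definitions are illustrative, not the study's implementation:

```python
def morph_features(token: str) -> dict:
    """Compute a subset of the rule-based morphologic features
    (illustrative; the study used 42 such features, including the
    same morphology for the preceding and following words)."""
    return {
        "n_chars": len(token),                               # (1) length
        "has_slash": "/" in token,                           # (2) special chars
        "period_start": token.startswith("."),               # (3) period position
        "period_mid": "." in token[1:-1],
        "period_end": token.endswith("."),
        "digit_start": token[:1].isdigit(),                  # (4) number position
        "digit_mid": any(c.isdigit() for c in token[1:-1]),
        "digit_end": token[-1:].isdigit(),
        "x_start": token[:1].lower() == "x",                 # (5) "X" position
        "x_mid": "x" in token[1:-1].lower(),
        "x_end": token[-1:].lower() == "x",
    }

print(morph_features("b.d."))
```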
The FTE was trained on 16 billion tokens from Wikipedia and web crawl data. It contains a 999,994-word vocabulary with 300 dimensions. For abbreviation detection, each token in the discharge summary was mapped to the FTE vocabulary, producing 300 word-vector features. Out-of-vocabulary tokens, i.e., tokens without word vectors, were replaced with the constant 0, as recommended by Bojanowski et al.[23]
The MyCE was created with the Word2Vec model[29] and trained on the entire set of discharge summaries extracted from MyHarmony. MyCE was trained on seven million tokens and has a 24,352-word vocabulary with 100 dimensions. As with FTE, out-of-vocabulary tokens were replaced with the constant 0.
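The sketch below shows how both embeddings could be loaded and used for token-level feature lookup, with zero vectors for out-of-vocabulary tokens; the file names are illustrative assumptions, not the study's artifacts:

```python
import numpy as np
from gensim.models import KeyedVectors, Word2Vec

# Illustrative file names; both models are assumed to exist on disk.
fte = KeyedVectors.load_word2vec_format("wiki-news-300d-1M.vec")  # 300-d
myce = Word2Vec.load("myce_discharge_summaries.model").wv         # 100-d

def embed(token, kv, dim):
    """Return the token's word vector, or the constant 0 for OOV tokens."""
    return kv[token] if token in kv.key_to_index else np.zeros(dim)

# Each token in a note becomes a 300-feature (FTE) or 100-feature (MyCE) row.
tokens = ["pt", "has", "severe", "mr"]
fte_rows = np.vstack([embed(t, fte, 300) for t in tokens])
myce_rows = np.vstack([embed(t, myce, 100) for t in tokens])
```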
Features for Abbreviation Disambiguation
For abbreviation disambiguation, the rule-based features were similar to those in the studies by Martínez and Jaber[9] and Wu et al.[7] A total of 42 features were used: the word itself, 10 from the words before and after, 10 from the word position, 10 from the word direction, and 11 from abbreviation morphologies.
The datasets with FTE and the Malaysian clinical word embedding had 11 features each, representing the averaged vectors of the abbreviation and of each word in a 5-word window on either side. Null values due to out-of-vocabulary tokens were given the constant 0. We averaged the vectors because averaging has been shown to represent the word vectors better than taking minimum or maximum values.[7] [9]
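A sketch of this feature construction under stated assumptions: `lookup` is a hypothetical helper returning a token's embedding (or zeros if out-of-vocabulary), and each of the 11 vectors is collapsed to its component average, as described above:

```python
import numpy as np

def disambiguation_features(tokens, i, lookup, window=5, dim=100):
    """Build the 11 features for the abbreviation at position i:
    the component-averaged vector of the abbreviation itself and of
    each word in the 5-word windows before and after (a sketch)."""
    feats = []
    for j in range(i - window, i + window + 1):
        if 0 <= j < len(tokens):
            vec = lookup(tokens[j])   # embedding, or zeros if OOV
        else:
            vec = np.zeros(dim)       # positions beyond the note boundary
        feats.append(vec.mean())      # average across vector components
    return feats                      # 11 scalar features
```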
Evaluation
We evaluated the embeddings on two classifiers, namely Decision Tree (DT) and linear Support Vector Machine (SVM). In previous studies, SVM consistently outperformed Naïve Bayes,[9] [30] [31] [32] whereas comparisons between SVM and DT were conflicting, possibly because of the different languages of the corpora used.[33] [34]
For abbreviation detection, the models were trained on the Cardiology dataset and internally validated with the 30% holdout method, since the dataset was relatively large and the holdout simulates application to unseen Cardiology discharge summaries. The General Medicine dataset was used for external validation.
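A sketch of this detection setup with scikit-learn, using placeholder data in place of the Cardiology feature matrices:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import LinearSVC

# Placeholder features and labels standing in for the Cardiology dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 100))    # e.g., 100-dimensional MyCE features
y = rng.integers(0, 2, size=1000)   # 1 = abbreviation, 0 = not

# 30% holdout for internal validation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

dt = DecisionTreeClassifier().fit(X_train, y_train)
svm = LinearSVC().fit(X_train, y_train)

# External validation would score the fitted models on the General
# Medicine feature matrix the same way, e.g., dt.score(X_gm, y_gm).
print(dt.score(X_test, y_test), svm.score(X_test, y_test))
```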
For abbreviation disambiguation, the five abbreviations listed in [Table 1] were identified; each has more than 200 instances. A similar study[7] found that the minimum number of instances should be 200, as abbreviations with fewer instances are too small and disproportionate to be trained and tested on. The three feature datasets were partitioned using 10-fold cross-validation.
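A sketch of the 10-fold cross-validation for one abbreviation, again with placeholder data:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import LinearSVC

# Placeholder instances for one abbreviation, e.g., "MR" with 3 senses.
rng = np.random.default_rng(0)
X = rng.normal(size=(247, 11))      # 11 averaged-vector features
y = rng.integers(0, 3, size=247)    # sense labels

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(LinearSVC(), X, y, cv=cv, scoring="f1_macro")
print(scores.mean())
```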
Table 1
Five ambiguous abbreviations and distribution of their senses

| Ambiguous abbreviation (instances) | Senses (instances) |
|---|---|
| CBG (202) | Capillary blood glucose (197, 97.5%); capillary blood gas (5, 2.5%) |
| MG (1,078) | Milligram (1,069, 99.2%); magnesium (9, 0.8%) |
| MR (247) | Modified release (132, 53.4%); mitral regurgitation (84, 34.0%); mister (31, 12.6%) |
| PR (340) | Pulse rate (314, 92.4%); per rectal (20, 5.9%); pulmonary regurgitation (6, 1.8%) |
| T (2,537) | Tablet (2,275, 89.7%); temperature (260, 10.2%); time (1, 0.04%); to (1, 0.04%) |
Table 2
Evaluation results for abbreviation detection

| Classifier | Feature | Precision | Recall | F-score | AUC [95% CI] |
|---|---|---|---|---|---|
| *Cardiology dataset* | | | | | |
| DT | RB | 0.7605 | 0.6749 | 0.7151 | 0.9554 [0.954, 0.957] |
| | FTE | 0.9177 | 0.9296 | 0.9236 | 0.9995 [0.958, 0.960] |
| | MyCE | 0.9445 | 0.9594 | 0.9519 | 0.9996 [0.980, 0.982] |
| SVM | RB | 0.6904 | 0.1885 | 0.2961 | 0.7633 [0.760, 0.767] |
| | FTE | 0.9346 | 0.9041 | 0.9191 | 0.9923 [0.992, 0.993] |
| | MyCE | 0.8837 | 0.8053 | 0.8427 | 0.9714 [0.970, 0.973] |
| *General Medicine (external) dataset* | | | | | |
| DT | RB | 0.8496 | 0.1050 | 0.1869 | 0.8428 [0.845, 0.855] |
| | FTE | 0.7735 | 0.8511 | 0.8105 | 0.9220 [0.892, 0.901] |
| | MyCE | 0.7928 | 0.9054 | 0.8454 | 0.9368 [0.912, 0.920] |
| SVM | RB | 0.6875 | 0.0040 | 0.0079 | 0.5125 [0.655, 0.668] |
| | FTE | 0.8243 | 0.7762 | 0.7995 | 0.9704 [0.928, 0.935] |
| | MyCE | 0.7328 | 0.6660 | 0.6981 | 0.9284 [0.867, 0.877] |
Abbreviations: AUC, area under the curve; CI, confidence interval; DT, Decision Tree;
FTE, FastText word embedding; MyCE, Malaysian clinical word embedding; RB, rule-based
features; SVM, support vector machine.
Due to the imbalanced classes of the senses, the Adaptive Synthetic (ADASYN) oversampling technique was applied during preprocessing.[35] To the best of the authors' knowledge, applying an oversampling technique to address imbalanced classes in abbreviation disambiguation has not been reported in previous studies. The technique generates synthetic instances based on the density distribution of the closest data points. ADASYN was applied to the datasets with the FTE and Malaysian clinical word embedding features, with the number of nearest neighbors set to k = 5 and the balance level to β = 0.5.
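A sketch of applying ADASYN with imbalanced-learn; the class counts are placeholders, `n_neighbors` corresponds to k = 5, and, for a binary task, a float `sampling_strategy` plays a role loosely analogous to the balance level β (the exact mapping is an assumption):

```python
from collections import Counter
import numpy as np
from imblearn.over_sampling import ADASYN

# Placeholder imbalanced senses (a frequent vs. a rare expansion).
rng = np.random.default_rng(0)
X = rng.normal(size=(340, 11))
y = np.array([0] * 320 + [1] * 20)

# k = 5 nearest neighbors; resample the minority toward a 0.5 ratio.
ada = ADASYN(n_neighbors=5, sampling_strategy=0.5, random_state=42)
X_res, y_res = ada.fit_resample(X, y)
print(Counter(y), Counter(y_res))  # minority class oversampled synthetically
```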
The performance of the different models was measured by precision, recall, F-score (the harmonic mean of precision and recall), area under the curve (AUC), and the receiver operating characteristic (ROC) curve.
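These measures correspond to standard scikit-learn calls, sketched below with placeholder predictions:

```python
import numpy as np
from sklearn.metrics import (precision_recall_fscore_support,
                             roc_auc_score, roc_curve)

# Placeholder ground truth, classifier scores, and thresholded predictions.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)
y_score = y_true + rng.normal(scale=0.8, size=500)  # noisy classifier scores
y_pred = (y_score > 0.5).astype(int)

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary")
auc = roc_auc_score(y_true, y_score)
fpr, tpr, _ = roc_curve(y_true, y_score)  # points for plotting the ROC curve
print(precision, recall, f1, auc)
```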
Results
A total of 1,102 clinical notes were annotated, containing 243,679 word tokens, of which 33,824 (19%) were abbreviations; 7,640 (22.6%) of those abbreviations were ambiguous. The ambiguous abbreviations comprised 94 distinct abbreviations with 220 senses.
Abbreviation Detection
The Cardiology dataset contained a total of 178,451 instances, where 70% (126,916)
of the instances were used for model training, and 30% (53,535) were held out for
internal validation. The classifier models were also externally validated on 65,228
instances from the General Medicine dataset.
[Table 2] compares the word embeddings with the DT and SVM classifier models, and [Fig. 3] shows the ROC curves. Overall, the DT model with the Malaysian clinical word embedding performed best at detecting abbreviations (F-score of 0.9519). Word embeddings outperformed the rule-based features on all evaluation measures. When tested against the General Medicine dataset, all evaluation measures declined, with the F-score of the best-performing model falling from 0.9519 to 0.8454.
Fig. 3 Receiver operating characteristic (ROC) curve for abbreviation detection. DSE, Malaysian
clinical word embedding; DT, Decision Tree; FTE, FastText embedding; RB, rule-based;
SVM, support vector machine.
Abbreviation Disambiguation
[Table 3] shows the comparative results of the disambiguation models with different features. Models with the Malaysian clinical word embedding as features outperformed FTE in most cases, except for the abbreviation “PR.” For “PR,” the rule-based features performed on par with FTE and only 1.26% better than the Malaysian clinical word embedding. However, the F-score of 0.6518 indicates that the model needs better samples.
Table 3
Evaluation results for disambiguating abbreviations

| Abbreviation | Feature | Precision | Recall | F-score | AUC [95% CI] |
|---|---|---|---|---|---|
| CBG | RB | 0.8034 | 0.8007 | 0.8021 | 0.8561 [0.809, 0.903] |
| | FTE | 0.8102 | 0.8102 | 0.8102 | 0.8705 [0.825, 0.916] |
| | MyCE | 0.8804 | 0.9162 | 0.8980 | 0.9289 [0.893, 0.965] |
| MG | RB | 0.9174 | 0.9450 | 0.9310 | 0.9624 [0.953, 0.972] |
| | FTE | 0.9243 | 0.9546 | 0.9392 | 0.9637 [0.954, 0.973] |
| | MyCE | 0.9814 | 0.9902 | 0.9858 | 0.9968 [0.994, 0.999] |
| MR | RB | 0.8808 | 0.8636 | 0.8721 | 0.8615 [0.796, 0.921] |
| | FTE | 0.8831 | 0.8676 | 0.8753 | 0.8596 [0.792, 0.919] |
| | MyCE | 0.9885 | 0.9856 | 0.9870 | 0.9959 [0.990, 1.000] |
| PR | RB | 0.6405 | 0.6635 | 0.6518 | 0.9980 [0.995, 1.000] |
| | FTE | 0.6405 | 0.6635 | 0.6518 | 0.9978 [0.995, 1.000] |
| | MyCE | 0.6241 | 0.6550 | 0.6392 | 0.9852 [0.973, 0.995] |
| T | RB | 0.7839 | 0.7877 | 0.7858 | 0.8766 [0.865, 0.888] |
| | FTE | 0.7833 | 0.7866 | 0.7849 | 0.8764 [0.865, 0.888] |
| | MyCE | 0.9232 | 0.9342 | 0.9286 | 0.9734 [0.968, 0.978] |
Abbreviations: AUC, area under the curve; CI, confidence interval; FTE, FastText embedding;
MyCE, Malaysian clinical word embedding; RB, rule-based.
Discussion
Local and Domain-Specific Word Embedding
For both abbreviation detection and disambiguation from clinical notes, this research shows that word embedding features performed better than rule-based features. Similar findings were reported by Xu et al[36] for detecting abbreviations and by Martínez and Jaber[9] for disambiguating abbreviations. Word embedding retains the semantic relatedness of abbreviations and their surrounding words, whereas rule-based features do not.[8] Additionally, word embeddings such as the word2vec model are built with unsupervised learning, saving time compared with rule-based features, which must be manually engineered.
Our study also suggests that domain-specific, locally built embeddings should be preferred over general embeddings, despite their smaller size. This concurs with the findings of Chen et al[24] and Charbonnier and Wartena,[37] as domain-specific embeddings represent word relationships better than general-domain corpora for domain-specific tasks. The results show that the Malaysian clinical word embedding represented the meanings of the abbreviations better than FTE, enabling the classifier models to perform better.
Implementation
The classification model is being incorporated into MyHarmony's NLP pipeline, as shown in [Fig. 4]. After the abbreviations are deciphered, an expanded clinical text is produced for the next task of recognizing and codifying clinical entities. This additional step in the NLP pipeline would reduce the information missed during extraction from clinical text, thus improving the accuracy of information extraction and of the statistical analyses it generates.[38]
Fig. 4 Framework to detect and disambiguate clinical abbreviations in the natural language
processing pipeline.
In managing the public health sector, monitoring and evaluating health care services requires accurate data and information. Retrieving information from the first point of contact in patient care is intended to reduce the workload of secondary data collection. Our approach may set a new norm for collecting data to answer ad hoc questions, rather than relying on the conventional methods of questionnaires, quality registries, and data collection systems. A future study is planned to investigate the cost of resources, time, data quality, and researchers' perceptions of our proposed approach compared with the traditional approach.
Furthermore, our model can also be applied in the local electronic health record (EHR). This is exemplified by the use case of real-time clinical abbreviation recognition and disambiguation (rCARD).[33] rCARD was prototyped to detect abbreviations while clinicians make their entries and then suggest possible meanings for the ambiguous abbreviations. The rCARD model uses an SVM classifier and was reported to take between 0.630 and 1.649 milliseconds to help clinicians resolve ambiguous abbreviations. The same approach is being proposed as the EHR is rolled out across public hospitals and clinics in Malaysia. As clinicians document, abbreviations would be automatically detected and highlighted, and clinicians could be offered the possible meanings for confirmation. Deciphering abbreviations at the point of entry would improve documentation quality, thus supporting safer continuity of care during follow-up visits. Unambiguous documentation simplifies and increases the accuracy of data retrieval for research, clinical program evaluation, and policymaking.
Additionally, the Malaysian clinical word embedding can be used as a feature to resolve other clinical NLP problems specific to the Malaysian context. One potential application is incorporating the abbreviation detection and disambiguation models into automated text summarization, which is particularly useful when drafting discharge summaries and referral letters. Another is incorporating the model into speech recognition software that transforms speech to text for clinical entries.[39] The software could then decipher abbreviations into their correct expansions, improving the quality of documentation and reducing misinterpreted abbreviations.
Limitations
This study is not without limitations: the Malaysian clinical word embedding and the classification models involved only the Cardiology and General Medicine disciplines, which restricts the generalizability of the models to other disciplines. The study showed a reduction in F-score from 0.9519 to 0.8454 when the abbreviation detection model was applied to the General Medicine clinical notes, likely due to unseen abbreviations and contexts. The same outcome can be expected for surgical or other disciplines. However, as more datasets from various disciplines are retrieved to train the models, this limitation is expected to be overcome.
It is also unclear how the Malaysian clinical word embedding and the models would
perform in clinical notes other than discharge summaries. The type of clinical notes
may be a factor in the extent of abbreviations used. Discharge summaries are thought
to be better structured and better written compared with inpatient notes and admission
notes.[13] Applying the model to EHR systems that contain admission notes, inpatient notes,
outpatient notes, and procedural reports may need further research and evaluation.
One reassurance from this research is that the method established here can be replicated
and extended with more robust word embedding and classification models by including
other types of clinical notes.
Conclusion
A domain-specific word embedding such as the Malaysian clinical word embedding detected and disambiguated abbreviations in Malaysian electronic clinical notes better than FTE and the rule-based model. The research also shows that word embeddings combined with traditional machine learning algorithms can decipher abbreviations well and are therefore suitable for implementation in low-resource settings such as Malaysia. This is the first clinical word embedding for the Malaysian context, and it can be applied to other NLP problems, such as machine translation, speech and text recognition, and text summarization. This research is also the first to develop machine learning models to detect and disambiguate abbreviations using Malaysian clinical notes. Future work involves developing a more robust Malaysian clinical word embedding by including discharge summaries from other clinical disciplines; implementing the model in EHRs to prevent misinterpretation of abbreviations during patient care and to improve data quality for health care management and financing; and comparing against large transformer-based language models and assessing their feasibility in low-resource environments.