Abstract
Objective This is the first Malaysian machine learning model to detect and disambiguate abbreviations
in clinical notes. The model is designed to be incorporated into MyHarmony,
a natural language processing system that extracts clinical information for health
care management. The model uses word embedding so that it remains feasible to run,
for secondary analysis rather than in real time, within the constraints of low-resource settings.
Methods A Malaysian clinical embedding, based on the Word2Vec model, was developed using 29,895
electronic discharge summaries. The embedding was compared against a conventional rule-based
approach and a FastText embedding on two tasks: abbreviation detection and abbreviation disambiguation.
Machine learning classifiers were applied to assess performance.
Results The Malaysian clinical word embedding comprised 7 million word tokens, a vocabulary
of 24,352 unique words, and 100 dimensions. For abbreviation detection, the Decision Tree classifier
augmented with the Malaysian clinical embedding showed the best performance (F-score
of 0.9519). For abbreviation disambiguation, the classifier with the Malaysian clinical
embedding had the best performance for most of the abbreviations (F-score of 0.9903).
Conclusion Despite having a smaller vocabulary and fewer dimensions, our local clinical word embedding
performed better than the larger nonclinical FastText embedding. Word embedding combined with
simple machine learning algorithms can decipher abbreviations well. This approach also requires
fewer computational resources and is suitable for implementation in low-resource settings
such as Malaysia. The integration of this model into MyHarmony will improve recognition
of clinical terms, thus improving the information generated for monitoring Malaysian
health care services and policymaking.
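As an illustration of the disambiguation task described in the Methods, the following is a minimal sketch, not the implementation used in this study. It assumes a toy three-dimensional embedding with made-up values (the embedding reported here has 100 dimensions), a hypothetical abbreviation with two senses labelled "cardiac" and "renal", and a nearest-centroid rule standing in for the machine learning classifiers that were evaluated. All function names are hypothetical.

```python
import math

# Toy embedding with invented values, for illustration only.
EMBEDDING = {
    "chest": [0.9, 0.1, 0.0],
    "pain": [0.8, 0.2, 0.1],
    "heart": [0.9, 0.0, 0.1],
    "urine": [0.1, 0.9, 0.0],
    "renal": [0.0, 0.8, 0.2],
    "kidney": [0.1, 0.9, 0.1],
}
DIM = 3

# Hypothetical labelled contexts surrounding one ambiguous abbreviation.
TRAINING = [
    (["chest", "pain"], "cardiac"),
    (["heart"], "cardiac"),
    (["urine", "renal"], "renal"),
    (["kidney"], "renal"),
]


def context_vector(tokens):
    """Average the embedding vectors of the context tokens found in the vocabulary."""
    vectors = [EMBEDDING[t] for t in tokens if t in EMBEDDING]
    if not vectors:
        return [0.0] * DIM  # no known context words
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def train_centroids(examples):
    """Compute one mean context vector (centroid) per sense."""
    grouped = {}
    for tokens, sense in examples:
        grouped.setdefault(sense, []).append(context_vector(tokens))
    return {
        sense: [sum(col) / len(vecs) for col in zip(*vecs)]
        for sense, vecs in grouped.items()
    }


def disambiguate(tokens, centroids):
    """Pick the sense whose centroid is most similar to the context vector."""
    vec = context_vector(tokens)
    return max(centroids, key=lambda sense: cosine(vec, centroids[sense]))


centroids = train_centroids(TRAINING)
print(disambiguate(["heart", "pain"], centroids))
print(disambiguate(["kidney", "urine"], centroids))
```

In this sketch the context around an abbreviation is reduced to a single averaged vector, which is one common way to turn word embeddings into classifier features; a Decision Tree or another classifier could be trained on the same vectors in place of the centroid rule.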
Keywords
electronic health record - discharge summaries - word embedding - machine learning
- natural language processing - health system management