Summary
Background: One of the barriers to the effective use of computerized health-care-related text
is the ambiguity of abbreviations. To date, the task of disambiguating abbreviations
has been treated as a classification task based on surrounding words. Applying
this framework to languages without explicit word boundaries requires pre-processing
to segment each sentence into a sequence of words. Although this segmentation step
is often a source of problems, it is unknown whether word information is actually necessary
for abbreviation expansion.
Objectives: As a preliminary study, we compared abbreviation expansion methods with and without
the incorporation of word information.
Methods: We implemented two abbreviation expansion methods: 1) a morpheme-based method that
relied on word information and therefore required pre-processing, and 2) a character-based
method that relied on simple character information. We compared the expansion accuracies
for these two methods using eight medical abbreviations. The experimental data were built
automatically as a pseudo-annotated corpus from text on the Internet.
Results: Accuracies for the character-based method ranged from 0.890 to 0.942,
whereas accuracies for the morpheme-based method ranged from 0.796 to 0.932.
The character-based method significantly outperformed the morpheme-based method for
three of the eight abbreviations (p < 0.05). For the remaining five abbreviations,
no significant differences were found between the two methods.
Conclusions: Character information may be a good, simpler alternative to morphological
information for expanding English medical abbreviations that appear in
Japanese texts on the Internet.
Keywords
Natural language processing - machine learning - abbreviation - information storage
and retrieval - algorithms