Methods Inf Med 2002; 41(05): 426-434
DOI: 10.1055/s-0038-1634373
Original Article
Schattauer GmbH

Heuristics for Identification of Acronym-Definition Patterns within Text: Towards an Automated Construction of Comprehensive Acronym-Definition Dictionaries

J. D. Wren
1   Program in Genetics and Development, Southwestern Graduate School of Biomedical Sciences, TX, USA
2   McDermott Center for Human Growth and Development, the Center for Biomedical Inventions and Dept. of Internal Medicine, University of Texas Southwestern Medical Center, Dallas, TX, USA
H. R. Garner
2   McDermott Center for Human Growth and Development, the Center for Biomedical Inventions and Dept. of Internal Medicine, University of Texas Southwestern Medical Center, Dallas, TX, USA

Publication History

Publication Date: 07 February 2018 (online)

Summary

Objectives: To develop an automated, accurate and scalable method for identifying acronym-definition pairs within text. The method's primary advantage is that it enables information processing methods to resolve author-defined acronyms, but it also allows a reference work on acronym definitions to be created automatically. Beyond saving time and effort, automated construction has several advantages over manual or semi-automated methods, such as identifying the relative frequencies of alternate acronyms and definitions, as well as spelling, phrasing and hyphenation variants of a unique acronym-definition pair. It also helps users identify acronym/definition variants present in the literature that are not necessarily in biomedical databases.

Methods: A set of heuristics to accurately locate acronym-definition pairs and identify their boundaries was developed and refined in terms of precision and recall on subsets of MEDLINE records. These training sets were gradually increased in size and the heuristics re-evaluated to ensure scalability.
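To make the idea of locating a pair and its boundaries concrete, the following is a minimal sketch of one simple heuristic of this general kind. It is not the ARGH rule set, which this summary does not specify. It assumes, purely for illustration, that an acronym appears in parentheses directly after its definition and that each acronym character is the first letter of one preceding word. The names PAREN, WORD and find_pairs are hypothetical.

```python
import re

# Assumption: an acronym is a parenthesised token of 2-10 characters
# that starts with an uppercase letter and contains no spaces.
PAREN = re.compile(r"\(([A-Z][A-Za-z0-9-]{1,9})\)")
WORD = re.compile(r"[A-Za-z0-9]+")


def find_pairs(text):
    """Return (acronym, definition) pairs found by a first-letter heuristic."""
    pairs = []
    for match in PAREN.finditer(text):
        acronym = match.group(1)
        letters = [c.lower() for c in acronym if c.isalnum()]
        # Words that occur before the opening parenthesis.
        words = list(WORD.finditer(text, 0, match.start()))
        if len(words) < len(letters):
            continue
        # Candidate definition: one preceding word per acronym character.
        candidate = words[-len(letters):]
        if all(w.group()[0].lower() == c for w, c in zip(candidate, letters)):
            # The left boundary is the start of the first matched word.
            definition = text[candidate[0].start():match.start()].strip()
            pairs.append((acronym, definition))
    return pairs


sentence = "Patients with chronic obstructive pulmonary disease (COPD) were enrolled."
print(find_pairs(sentence))
```

A rule this strict misses definitions that skip stopwords or draw several letters from one word, which is the kind of gap that iterative refinement against precision and recall on growing training sets is meant to expose.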

Results: Our final set of Acronym Resolving General Heuristics (ARGH) had a sample-based estimated rate of 96.5 ± 0.4% precision and 93.0 ± 2.7% recall when tested on over 12 million MEDLINE records, identifying more than 174,000 unique acronyms and their 737,000 associated definitions.

Conclusions: We estimate that as much as 36% of the acronyms in MEDLINE are associated with more than one definition and, conversely, up to 10% of definitions are associated with more than one acronym. The number of unique acronyms in MEDLINE is increasing at a rate of approximately 11,000 per year, while the number of definitions associated with them is growing at approximately four times that rate. Access to the ARGH database is available online at http://lethargy.swmed.edu/ARGH argh.asp. The heuristic module and database are available upon request.
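The two ambiguity estimates above are shares computed over a table of acronym-definition pairs. The sketch below shows one way such shares could be computed. It is an illustration under stated assumptions, not the authors' procedure: definitions are compared case-insensitively without any other variant merging, the function name ambiguity_rates is hypothetical, and the sample pairs are made-up toy data.

```python
from collections import defaultdict


def ambiguity_rates(pairs):
    """Return (share of acronyms with more than one definition,
    share of definitions with more than one acronym)."""
    defs_by_acronym = defaultdict(set)
    acronyms_by_def = defaultdict(set)
    for acronym, definition in pairs:
        key = definition.lower()  # assumption: case-insensitive comparison only
        defs_by_acronym[acronym].add(key)
        acronyms_by_def[key].add(acronym)
    if not defs_by_acronym:
        return 0.0, 0.0
    multi_def = sum(1 for d in defs_by_acronym.values() if len(d) > 1)
    multi_acr = sum(1 for a in acronyms_by_def.values() if len(a) > 1)
    return multi_def / len(defs_by_acronym), multi_acr / len(acronyms_by_def)


# Toy data for illustration only.
sample = [
    ("CT", "computed tomography"),
    ("CT", "calcitonin"),
    ("MRI", "magnetic resonance imaging"),
    ("MR", "magnetic resonance imaging"),
]
print(ambiguity_rates(sample))
```

How spelling, phrasing and hyphenation variants are merged before counting changes both shares, so the normalisation step is the main design choice in any calculation of this kind.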

 