A Complementary Graphical Method for Reducing and Analyzing Large Data Sets[*]: Case Studies Demonstrating Thresholds Setting and Selection
received: 28 June 2013
accepted: 26 January 2014
20 January 2018 (online)
Objectives: Graphical displays can make data more understandable; however, large graphs can challenge human comprehension. We have previously described a filtering method to provide high-level summary views of large data sets. In this paper we demonstrate our method for setting and selecting thresholds to limit graph size while retaining important information by applying it to large single and paired data sets, taken from patient and bibliographic databases.
Methods: Four case studies illustrate our method. The data are either patient discharge diagnoses (coded using the International Classification of Diseases, Ninth Revision, Clinical Modification [ICD9-CM]) or MEDLINE citations (coded using the Medical Subject Headings [MeSH]). We use combinations of different thresholds to obtain filtered graphs for detailed analysis. The case studies demonstrate in detail how thresholds are set and selected, including thresholds for node counts, class counts, ratio values, p values (for diff data sets), and percentiles of selected class count thresholds. The main steps are data preparation, data manipulation, computation, and threshold selection and visualization. We also describe the data models for different types of thresholds and the considerations for threshold selection.
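As an illustration only, the following minimal sketch shows how node count and class count thresholds might be combined to filter a hierarchy. It is not the published implementation. It assumes that a node count is the number of records coded directly with a term, that a class count is that number plus the counts of all descendant terms, that each term has a single parent, and that a term is kept when either threshold is met. The term names, counts, threshold values, and function names are hypothetical.

```python
from collections import defaultdict


def class_counts(parent_of, node_count):
    """Add each term's node count to the term itself and to every ancestor.

    Assumption: class count = own node count + node counts of all descendants.
    """
    totals = defaultdict(int)
    for term, count in node_count.items():
        current = term
        while current is not None:
            totals[current] += count
            current = parent_of.get(current)  # None once we pass the root
    return dict(totals)


def filter_terms(parent_of, node_count, nc_threshold, cc_threshold):
    """Keep terms whose node count OR class count reaches its threshold.

    The OR combination is an assumption made for this sketch.
    """
    cc = class_counts(parent_of, node_count)
    return {
        term
        for term in cc
        if node_count.get(term, 0) >= nc_threshold or cc[term] >= cc_threshold
    }


# Hypothetical single-parent hierarchy (child -> parent) and usage counts.
parent_of = {"root": None, "A": "root", "A1": "A", "A2": "A", "B": "root", "B1": "B"}
node_count = {"A": 2, "A1": 40, "A2": 3, "B": 1, "B1": 4}

cc = class_counts(parent_of, node_count)
kept = filter_terms(parent_of, node_count, nc_threshold=10, cc_threshold=20)
print(sorted(kept))
```

In this toy example the lightly used branch under "B" is dropped, while "A" is retained because of the heavy usage of its descendant "A1", even though its own node count is low. A polyhierarchy such as MeSH, where a term can have several parents, would need a different traversal.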
Results: The filtered graphs are 1%-3% of the size of the original graphs. For our case studies, the graphs provide 1) the most heavily used ICD9-CM codes, 2) the codes with the most patients in a research hospital in 2011, 3) a profile of publications on “heavily represented topics” in MEDLINE in 2011, and 4) validated knowledge about adverse effects of rosiglitazone and new, interesting areas in the ICD9-CM hierarchy associated with patients taking pioglitazone.
Conclusions: Our filtering method reduces large graphs to a manageable size by removing relatively unimportant nodes. The graphical method provides summary views based on computation of usage frequency and semantic context of hierarchical terminology. The method is applicable to large data sets (such as a hundred thousand records or more) and can be used to generate new hypotheses from data sets coded with hierarchical terminologies.
Keywords: Data mining method - data filtering method - threshold setting - threshold selection - data visualization - hierarchical terminology - data analysis - clinical data repository
* Supplementary material published on our web-site www.methods-online.com