Feasibility of Feature-based Indexing, Clustering, and Search of Clinical Trials

M. R. Boland; R. Miotto; J. Gao; C. Weng

doi:10.3414/ME12-01-0092


Methods Inf Med 2013; 52(05): 382-394
DOI: 10.3414/ME12-01-0092

Original Articles

Schattauer GmbH

Feasibility of Feature-based Indexing, Clustering, and Search of Clinical Trials^[*]

A Case Study of Breast Cancer Trials from ClinicalTrials.gov

Authors

M. R. Boland¹
R. Miotto¹
J. Gao¹
C. Weng¹,²

¹Department of Biomedical Informatics, Columbia University, New York, NY, USA
²The Irving Institute for Clinical and Translational Research, Columbia University, New York, NY, USA

Further Information

Publication History

received: 03 October 2012

accepted: 21 February 2013

Publication Date:
20 January 2018 (online)

Summary

Background: When standard therapies fail, clinical trials provide experimental treatment opportunities for patients with drug-resistant illnesses or terminal diseases. Clinical trials can also provide free treatment and education for individuals who otherwise may not have access to such care. To find relevant clinical trials, patients often search online; however, they frequently encounter a significant barrier: the large number of trials and the ineffective indexing methods available for reducing the trial search space.

Objectives: This study explores the feasibility of feature-based indexing, clustering, and search of clinical trials and informs designs to automate these processes.

Methods: We decomposed 80 randomly selected stage III breast cancer clinical trials into a vector of eligibility features, which were organized into a hierarchy. We clustered trials based on their eligibility feature similarities. In a simulated search process, manually selected features were used to generate specific eligibility questions to filter trials iteratively.
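The indexing, similarity, and iterative filtering steps described in the Methods can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in the study: the trial identifiers and feature names are hypothetical, Jaccard similarity is assumed as the similarity measure, and each trial is reduced to a flat set of required features rather than a hierarchy.

```python
from itertools import combinations

# Hypothetical toy index: each trial maps to the set of eligibility
# features it requires. Names and values are illustrative only.
trials = {
    "T1": {"female", "age>=18", "her2_positive"},
    "T2": {"female", "age>=18", "her2_negative"},
    "T3": {"female", "age>=18", "her2_positive", "prior_chemotherapy"},
}


def jaccard(a, b):
    """Jaccard similarity of two feature sets (assumed measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def pairwise_similarity(index):
    """Similarity for every unordered pair of trials in the index."""
    return {
        (p, q): jaccard(index[p], index[q])
        for p, q in combinations(sorted(index), 2)
    }


def filter_trials(index, answers):
    """Keep trials that no answer contradicts.

    `answers` maps a feature to True/False, as if the searcher had
    answered one eligibility question per feature. Features that have
    not been asked about yet are treated as unknown and do not
    exclude a trial.
    """
    return {
        name: feats
        for name, feats in index.items()
        if all(answers.get(f, True) for f in feats)
    }


if __name__ == "__main__":
    print(pairwise_similarity(trials))
    # One simulated question: "Is the tumor HER2-positive?" answered "no".
    print(sorted(filter_trials(trials, {"her2_positive": False})))
```

Each answered question shrinks the candidate set, so asking about features that split the remaining trials most evenly would reduce the search space fastest; choosing such features automatically is outside this sketch.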

Results: We extracted 1,437 distinct eligibility features and achieved an inter-rater agreement of 0.73 for feature extraction for 37 frequent features occurring in more than 20 trials. Using all 1,437 features, we stratified the 80 trials into six clusters containing trials recruiting similar patients by patient-characteristic features, five clusters by disease-characteristic features, and two clusters by mixed features. Most of the features were mapped to one or more Unified Medical Language System (UMLS) concepts, demonstrating the utility of named entity recognition prior to mapping with the UMLS for automatic feature extraction.
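To make the inter-rater agreement figure concrete, the sketch below computes a chance-corrected agreement score for two annotators. It is an assumption-laden illustration: the summary does not state which agreement statistic was used, so Cohen's kappa for two raters is assumed here, and the ratings are invented (1 means the rater marked a feature as present in a trial, 0 means absent).

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # Expected agreement if both raters labelled independently
    # according to their own label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(freq_a) | set(freq_b)
    expected = sum(freq_a[label] * freq_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Hypothetical presence/absence judgments for one feature in ten trials.
    rater_a = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
    rater_b = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
    print(cohens_kappa(rater_a, rater_b))
```

A kappa of 1 indicates perfect agreement and 0 indicates agreement no better than chance, which is the scale on which a value such as 0.73 would be read if kappa was the statistic used.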

Conclusions: It is feasible to develop feature-based indexing and clustering methods for clinical trials to identify trials with similar target populations and to improve trial search efficiency.

Keywords

Medical informatics - search engine - clinical trials - knowledge representation - eligibility determination

^* Supplementary material published on our website www.methods-online.com


