Chi-square-based Scoring Function for Categorization of MEDLINE Citations

A. Kastrin; B. Peterlin; D. Hristovski

doi:10.3414/ME09-01-0009

Methods Inf Med 2010; 49(04): 371-378

Original Articles

Schattauer GmbH

Authors

A. Kastrin¹, B. Peterlin¹, D. Hristovski²

¹Institute of Medical Genetics, University Medical Centre Ljubljana, Ljubljana, Slovenia
²Institute for Biostatistics and Medical Informatics, Faculty of Medicine, University of Ljubljana, Ljubljana, Slovenia

Further Information

Publication History

received: 04 February 2009

accepted: 22 January 2009

Publication Date: 17 January 2018 (online)

Summary

Objectives: Text categorization has been used in biomedical informatics to identify documents that cover topics of interest. We developed a simple method that uses a chi-square-based scoring function to determine the likelihood that a MEDLINE® citation covers a genetically relevant topic.

Methods: Our procedure requires the construction of a genetic and a nongenetic domain document corpus. We used the MeSH® descriptors assigned to MEDLINE citations for this categorization task. We compared the frequencies of MeSH descriptors between the two corpora using the chi-square test. A MeSH descriptor was considered a positive indicator if its relative observed frequency in the genetic domain corpus was greater than its relative observed frequency in the nongenetic domain corpus. The output of the proposed method is a list of scores for all citations, with the highest scores given to citations containing MeSH descriptors typical of the genetic domain.
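The procedure described in the Methods can be illustrated with a minimal Python sketch. The summary does not give the exact scoring formula, so the version below rests on assumptions: each descriptor gets a Pearson chi-square statistic from a 2x2 table (corpus by descriptor present/absent), the statistic is signed positive when the descriptor is relatively more frequent in the genetic corpus and negative otherwise, and a citation's score is the sum of the signed weights of its descriptors. The function names and the toy corpora are hypothetical and are not taken from the paper.

```python
from collections import Counter


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # A descriptor present in every document (or in none) carries no signal.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def descriptor_weights(genetic_docs, nongenetic_docs):
    """Signed chi-square weight per descriptor (ASSUMED scoring scheme).

    Each document is a list of MeSH descriptor strings. The weight is positive
    when the descriptor's relative frequency is higher in the genetic corpus.
    """
    n_gen, n_non = len(genetic_docs), len(nongenetic_docs)
    gen_counts = Counter(d for doc in genetic_docs for d in set(doc))
    non_counts = Counter(d for doc in nongenetic_docs for d in set(doc))
    weights = {}
    for desc in set(gen_counts) | set(non_counts):
        a = gen_counts[desc]  # genetic documents with the descriptor
        c = non_counts[desc]  # nongenetic documents with the descriptor
        chi2 = chi_square_2x2(a, n_gen - a, c, n_non - c)
        is_positive_indicator = a / n_gen > c / n_non
        weights[desc] = chi2 if is_positive_indicator else -chi2
    return weights


def score_citation(descriptors, weights):
    """Sum the signed weights of a citation's descriptors (unknown ones count 0)."""
    return sum(weights.get(d, 0.0) for d in set(descriptors))


# Hypothetical toy corpora, for illustration only.
genetic = [["Mutation", "Humans"], ["Mutation", "Genes"], ["Genes", "Humans"]]
nongenetic = [["Surgery", "Humans"], ["Surgery", "Nursing"], ["Humans", "Nursing"]]

weights = descriptor_weights(genetic, nongenetic)
for citation in (["Mutation", "Humans"], ["Surgery", "Nursing"]):
    print(citation, score_citation(citation, weights))
```

In this sketch a descriptor such as "Humans", which is equally frequent in both toy corpora, contributes nothing to the score, while descriptors concentrated in one corpus push the score up or down. Ranking citations by this score and choosing a cutoff would then yield a binary categorization.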

Results: Validation was performed on a set of 734 manually annotated MEDLINE citations. The method achieved a predictive accuracy of 0.87, with a recall of 0.69 and a precision of 0.64. We evaluated the method by comparing it with three machine-learning algorithms (support vector machines, decision trees, naïve Bayes). The differences were not statistically significant, indicating that our chi-square scoring performs as well as the compared machine-learning algorithms.
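For readers unfamiliar with the three reported measures, the following short Python sketch shows the standard definitions of accuracy, precision, and recall computed from a binary confusion matrix. The counts used here are hypothetical and are not the paper's validation data; the function name is likewise an illustrative choice.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, and recall from binary confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        # Share of all citations that were labeled correctly.
        "accuracy": (tp + tn) / total,
        # Share of citations predicted as genetic that really are genetic.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Share of truly genetic citations that were found.
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical counts for illustration only (not from the paper).
metrics = classification_metrics(tp=8, fp=2, fn=4, tn=86)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

As the toy counts suggest, accuracy can be noticeably higher than precision and recall when the negative class is much larger than the positive class, because correctly rejected negatives count toward accuracy only.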

Conclusions: We suggest that chi-square scoring is an effective aid for categorizing MEDLINE citations. The algorithm is implemented in the BITOLA literature-based discovery support system as a preprocessor for the gene symbol disambiguation process.

Keywords

Applied statistics - text mining - natural language processing - document categorization

