Methods Inf Med 2012; 51(01): 74-81
DOI: 10.3414/ME00-01-0052
Original Articles
Schattauer GmbH

Probability Machines[*]

Consistent Probability Estimation Using Nonparametric Learning Machines
J. D. Malley (1), J. Kruppa (2), A. Dasgupta (3), K. G. Malley (4), A. Ziegler (2)

1   Center for Computational Bioscience, Center for Information Technology, National Institutes of Health, Bethesda, USA
2   Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Germany
3   Clinical Sciences Section, National Institute of Arthritis and Musculoskeletal and Skin Diseases, National Institutes of Health, Bethesda, USA
4   Malley Research Programming, Rockville, USA

Publication History

received: 15 June 2011

accepted: 05 July 2011

Publication Date:
20 January 2018 (online)

Summary

Background: Most machine learning approaches only provide a classification for binary responses. However, probabilities are required for risk estimation using individual patient characteristics. It has been shown recently that every statistical learning machine known to be consistent for a nonparametric regression problem is a probability machine that is provably consistent for this estimation problem.

Objectives: The aim of this paper is to show how random forests and nearest neighbors can be used for consistent estimation of individual probabilities.

Methods: Two random forest algorithms and two nearest neighbor algorithms are described in detail for estimation of individual probabilities. We discuss the consistency of random forests, nearest neighbors and other learning machines in detail. We conduct a simulation study to illustrate the validity of the methods. We exemplify the algorithms by analyzing two well-known data sets on the diagnosis of appendicitis and the diagnosis of diabetes in Pima Indians.
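As a concrete illustration of the nearest neighbor approach (a minimal sketch with simulated data, not the paper's algorithms or sample code; the function name knn_prob is ours), the individual probability at a query point can be estimated as the average of the 0/1 responses of its k nearest neighbors, i.e. as a nonparametric regression estimate of P(Y = 1 | X = x):

## k-nearest neighbor probability estimation for a 0/1 coded binary response:
## the estimate at a query point is the mean response of its k nearest neighbors.
knn_prob <- function(x_train, y_train, x_new, k = 25) {
  d <- sqrt(rowSums(sweep(x_train, 2, x_new)^2))   # Euclidean distances to all training points
  mean(y_train[order(d)[seq_len(k)]])              # average the 0/1 responses of the k closest points
}

set.seed(1)
n <- 1000
x <- matrix(rnorm(2 * n), ncol = 2)
p <- plogis(x[, 1] - x[, 2])           # true conditional probability
y <- rbinom(n, 1, p)                   # observed binary response

knn_prob(x, y, c(0.5, -0.5), k = 25)   # estimated probability for a new observation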

Results: Simulations demonstrate the validity of the method. With the real data applications, we show the accuracy and practicality of this approach. We provide sample code from R packages in which probability estimation is already implemented, so that all calculations can be performed with existing software.
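As an example of how such a calculation looks in practice (a minimal sketch with simulated data, assuming the CRAN package randomForest; this is not the authors' published sample code), a random forest run in regression mode on a 0/1 coded response estimates E[Y | X] = P(Y = 1 | X) directly:

## Probability estimation with a regression-mode random forest.
library(randomForest)

set.seed(42)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
p  <- plogis(1.5 * x1 - x2)        # true conditional probability
y  <- rbinom(n, 1, p)              # binary response coded 0/1

## A numeric 0/1 response makes randomForest run in regression mode,
## so the forest estimates P(Y = 1 | X) rather than a class label.
fit   <- randomForest(x = data.frame(x1, x2), y = y, ntree = 500)
p_hat <- predict(fit)              # out-of-bag probability estimates
summary(p_hat)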

Conclusions: Random forest algorithms as well as nearest neighbor approaches are valid machine learning methods for estimating individual probabilities for binary responses. Freely available implementations exist in R and may readily be used in applications.

* Supplementary material published on our website www.methods-online.com


 