Class Based Variable Importance for Medical Decision Making

Abstract

In this paper we explore variable importance within tree-based modeling, discussing its strengths and weaknesses with regard to medical inference and actionability. While variable importance is useful in understanding how strongly a variable influences a tree, it does not convey how variables relate to different classes of the target variable. Given that in the medical setting both prediction and inference are important for successful machine learning, a new measure capturing variable importance with regard to classes is essential. A measure calculated from the paths of training instances through the tree is defined, and initial performance on benchmark datasets is explored.

Tree-based methods are commonly used with medical datasets, the goal being to create a predictive model of one variable based on several input variables. The basic algorithm consists of a single tree, whereby the input starts at the root node and follows a path down the tree, choosing a branch based on a splitting decision at each interior node [1]. The prediction is made from whatever leaf node the path ends in, either the majority class or the average of the node, depending on whether the problem is classification or regression respectively. Several implementations exist, such as ID3 [1,2], C4.5 [1,3] and CART (Classification and Regression Trees) [2], with CART being the implementation used in this analysis through Python's scikit-learn machine learning library. More sophisticated algorithms build on the simple tree by forming an ensemble of thousands of trees and pooling their predictions into a single final prediction. Prominent among these are Random Forests [3], Extra Trees [4-9], and Gradient Boosted Trees [6]. Tree-based modeling is popular because it is easy to use, readily supports multi-class prediction, and is better equipped to deal with small n, large p problems, where the number of observations is much smaller than the number of variables. The small n, large p issue is especially relevant in certain medical domains, such as genetic data [5], where hundreds or thousands of measurements can be taken on a handful of patients in a single study. Traditional modeling in this instance, while possible, will likely find a multiplicity of models with comparable error estimates [4].

One major drawback of tree-based learning is the lack of interpretability of model behavior. Machine learning can be used for two purposes: prediction and inference. Trees are excellent for prediction; for inference, however, they fall short. Building a single tree, we can examine the set of branching rules to gather insight, but typically a single tree is a poor predictor. Prediction can be improved by aggregating over hundreds of trees, but by doing so, the ability to infer disappears. Regression models, while more rigid in predictive power given that only a single model is made, are straightforward for inference and thus easy to convey to decision makers. The coefficient for a variable can be explained as the strength of its effect on the target variable: a positive coefficient represents a positive effect, and a negative coefficient represents a negative effect. When trying to determine a course of treatment designed to change an outcome, such as for treating a patient given a poor prognosis by a model, inference can be argued to be just as important as prediction for the medical practitioner.
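To make the single-tree versus ensemble trade-off concrete, the following minimal sketch fits a single CART tree and a Random Forest with scikit-learn. The breast-cancer dataset, depth limit, and tree count are illustrative assumptions, not settings taken from this paper.

```python
# Minimal sketch: a single CART tree versus a Random Forest in scikit-learn.
# The dataset and hyperparameters are illustrative choices only.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A single tree: weaker predictor, but its branching rules can be read directly.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print(export_text(tree, feature_names=list(X.columns)))

# An ensemble: stronger predictor, but the individual rules are no longer legible;
# interpretation falls back on aggregate variable (feature) importances.
forest = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_train, y_train)
print("single tree accuracy:", tree.score(X_test, y_test))
print("forest accuracy:     ", forest.score(X_test, y_test))
for name, imp in sorted(zip(X.columns, forest.feature_importances_),
                        key=lambda p: p[1], reverse=True)[:5]:
    print(f"{name}: {imp:.3f}")
```

The single tree's rules can be printed and inspected, whereas the ensemble's behavior is summarized only through aggregate importances; that gap is the interpretability problem discussed next.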
In this context, a model should not only be able to detect a disease, but it should also provide insight into why it detected the disease in order to treat it. This issue of inference has been overlooked in the quest for more accurate prediction. The main measure used, variable importance, provides some insight into how variables affect the overall model, but not into how variables interact with the target. Some work using variable importance moves in this direction, such as understanding the effects of correlated input variables [10-15], adjusting for imbalanced class sizes [10], measuring variable interactions [11], and variable selection [1,8], but it still does not fully answer the question of how the features affect a given outcome. In classification problems, this question is essential for improving the usability of trees in the medical setting. What we desire is a new measure that conveys how a variable is important with regard to the target variable. In this paper, we raise this question for consideration and offer an initial approach for bridging the gap between prediction and inference.

The paper is structured as follows: first, we outline the general approach for building a decision tree. Next, we explore the standard ways of interpreting a tree, both for a single tree and for an ensemble model. We then define a new measure, Class Variable Importance, to capture the strength of a variable's effect with regard to different classes. Next, we explore the calculation of this new measure on several benchmark datasets. The final section concludes and proposes further areas for research.
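The abstract states only that the proposed Class Variable Importance is calculated from the paths of training instances through the tree; the exact formula is not reproduced in this excerpt. The sketch below is therefore one plausible path-based construction, not the paper's definition: for each class, it credits the impurity decrease of every split that training instances of that class pass through. The dataset, depth limit, and normalization are assumptions made for illustration.

```python
# Illustrative sketch (an assumption, not the paper's formula): accumulate, per class,
# the impurity decrease of each split that training instances of that class traverse.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
t = clf.tree_

# Impurity decrease contributed by each internal node (the standard CART importance term).
node_gain = np.zeros(t.node_count)
for n in range(t.node_count):
    if t.children_left[n] != -1:  # internal (splitting) node
        node_gain[n] = (
            t.weighted_n_node_samples[n] * t.impurity[n]
            - t.weighted_n_node_samples[t.children_left[n]] * t.impurity[t.children_left[n]]
            - t.weighted_n_node_samples[t.children_right[n]] * t.impurity[t.children_right[n]]
        )

# Route every training instance through the tree and credit the gain of each split it
# passes to the pair (feature used at that split, class of the instance).
paths = clf.decision_path(X)  # sparse matrix: samples x nodes
class_importance = np.zeros((len(clf.classes_), X.shape[1]))
for i in range(X.shape[0]):
    c = y[i]
    for n in paths.indices[paths.indptr[i]:paths.indptr[i + 1]]:
        if t.children_left[n] != -1:
            class_importance[c, t.feature[n]] += node_gain[n]

# Normalize within each class so the rows are comparable.
class_importance /= class_importance.sum(axis=1, keepdims=True)
print(np.round(class_importance[:, :5], 3))
```

Each row of the resulting matrix is a per-class importance profile over the features; comparing rows indicates which features drive the tree's decisions for each class, which is the kind of class-level insight the paper argues standard variable importance does not provide.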

Authors and Affiliations

Danielle Baghernejad

Keywords


  • EP ID EP573635
  • DOI 10.26717/BJSTR.2017.01.000431

How To Cite

Danielle Baghernejad (2017). Class Based Variable Importance for Medical Decision Making. Biomedical Journal of Scientific & Technical Research (BJSTR), 1(5), 1328-1335. https://europub.co.uk/articles/-A-573635