Breast cancer diagnosis using feature extraction techniques with supervised and unsupervised classification algorithms
Journal Title: Applied Medical Informatics - Year 2019, Vol 41, Issue 1
Abstract
Background: Breast cancer is a serious disease that affects women around the globe. With the development of clinical technologies, many different tumor features have been collected for breast cancer diagnosis. Filtering all the pertinent feature information to support clinical diagnosis is a challenging and time-consuming task. The objective of this research was to diagnose breast cancer based on extracted tumor features. The main contribution of our study is the use of multivariate techniques, namely principal component analysis (PCA), discriminant analysis (DA) and logistic regression (LR), for feature reduction, combined with machine learning tools to classify and predict the tumor type. A hybrid DA-LR feature reduction is proposed, and models created with the reduced features are tested by performing classification using Support Vector Machine, Naive Bayes, Decision Tree, Logistic Regression and Artificial Neural Network.
Materials and Methods: Feature extraction and selection are critical to the quality of classifiers built through data mining methods. To diagnose tumors from a reduced set of features, a hybrid feature extraction is proposed. We tried to predict the disease based on the relevant features in the data. The Breast Cancer Wisconsin Diagnostic Dataset, obtained from the UCI Machine Learning Repository, was used in this study. After data pre-processing, a correlation matrix was generated, which suggested the presence of multicollinearity. Feature reduction techniques including principal component analysis, discriminant analysis and logistic regression were applied to extract features. Classification models, namely Support Vector Machine, Naive Bayes, Decision Tree, Logistic Regression and Artificial Neural Network, were created with the extracted features, and their performance was compared.
Results: The results not only illustrate the capability of the proposed approach for breast cancer diagnosis but also show time savings during the training phase. Physicians can also benefit from the mined abstract tumor features by better understanding the properties of different types of tumors.
Conclusion: Naive Bayes and Support Vector Machine classification outperform the other classification methods, and the model created with hybrid discriminant-logistic (DA-LR) feature selection performs best among all models.
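The multicollinearity check mentioned in the methods can be illustrated with a minimal sketch. It is not the authors' code and does not use the Wisconsin dataset: the data are synthetic stand-ins, the feature names are chosen for illustration, and the helper names `pearson` and `correlated_pairs` as well as the 0.9 threshold are assumptions. It only shows how pairwise Pearson correlations can flag features that carry nearly the same information, which is the usual motivation for feature reduction.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlated_pairs(features, threshold=0.9):
    """Return (name_a, name_b, r) for each feature pair with |r| >= threshold."""
    names = list(features)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(features[a], features[b])
            if abs(r) >= threshold:
                pairs.append((a, b, r))
    return pairs


# Synthetic stand-in data (NOT the Wisconsin dataset): perimeter and area are
# derived from radius plus small noise, so they correlate by construction,
# while texture is drawn independently.
rng = random.Random(0)
radius = [rng.uniform(6.0, 28.0) for _ in range(200)]
features = {
    "radius": radius,
    "perimeter": [2 * math.pi * r + rng.gauss(0, 1.0) for r in radius],
    "area": [math.pi * r * r + rng.gauss(0, 5.0) for r in radius],
    "texture": [rng.uniform(9.0, 40.0) for _ in radius],
}

# Assumed cut-off of 0.9; the abstract does not state a threshold.
flagged = correlated_pairs(features, threshold=0.9)
for a, b, r in flagged:
    print(f"{a} ~ {b}: r = {r:.3f}")
```

In this toy setting the three size-related features are flagged against each other and texture is not, which is the kind of redundancy that a reduction step such as PCA, DA or LR is meant to remove.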
Authors and Affiliations
Maryam SOLTANPOUR GHARIBDOUSTI, Syed HAIDER, Dieudonne OUEDRAOGO, Susan LU