Predicting Protein Localization Sites Using an Ensemble Self-Labeled Framework

Journal Title: Biomedical Journal of Scientific & Technical Research (BJSTR) - Year 2018, Vol 11, Issue 2

Abstract

In recent years, machine learning has been used extensively in the bioinformatics and biomedical fields. Predicting the cellular localization of proteins is a very significant task in bioinformatics, since a wrong localization site can cause various diseases and infections in humans. Ensemble learning algorithms and semi-supervised algorithms have been developed independently to build efficient and robust classification models. In this paper we focus on predicting the protein localization site in the organisms Escherichia coli and Saccharomyces cerevisiae, utilizing a semi-supervised self-labeled algorithm based on ensemble methodologies. The experimental results showed the efficiency of our proposed algorithm compared against state-of-the-art self-labeled techniques.

Proteins are important molecules in our cells, made up of long sequences of amino acid residues [1]. Each protein in the body has a specific function, and it works normally when it is at its correct localization site. A protein's function can be affected by its cellular localization (the location the protein has in a cell), and incorrect localization contributes to many diseases, such as cardiovascular, metabolic and neurodegenerative diseases and cancer [2]. Protein localization is also of high interest in various research areas, such as therapeutic target discovery, drug design and biological research [3]. Therefore, predicting the cellular localization of proteins is a helpful and significant task in bioinformatics that has been studied extensively [4-6]. In general, a prediction tool takes as input some attributes of a protein, such as its amino acid sequence, and predicts the location where this protein resides in a cell, such as the nucleus or the endoplasmic reticulum. X-ray crystallography, electron crystallography and nuclear magnetic resonance are some of the traditional biochemical experimental methods adopted [7] for determining protein cellular location.
These methods are generally accurate and precise, but they are inefficient and impractical because they are expensive and time-consuming. Therefore, over the last two decades, computational methods, especially those based on machine learning, have been developed to make such predictions [5,8-17]. Escherichia coli (E. coli) and Saccharomyces cerevisiae (Yeast) are two well-characterized unicellular organisms that have been studied exhaustively [18]. Each organism has different proteins distributed across its cell, and these proteins must be at their correct positions. A wrong localization site of these proteins in the cell can cause various diseases and infections in humans, such as bloody diarrhea [19]. There have been significant past efforts to predict the localization sites of proteins [18-28]. Anastasiadis and Magoulas [18] investigated the performance of k-nearest neighbours, feed-forward neural networks with and without cross-validation, and ensemble-based techniques for the prediction of protein localization sites in E. coli and Yeast. Their results showed that the ensemble-based techniques had the highest average classification accuracy per class, achieving 91.7% and 66.2% for E. coli and Yeast respectively. Chen [22] implemented three different machine learning techniques (decision trees, perceptrons and a two-layer feed-forward neural network) for predicting proteins' cellular localization on the E. coli and Yeast datasets. All three techniques achieved similar prediction accuracy: 65%-70% on the E. coli dataset and 46%-50% on the Yeast dataset. Sengur [23] investigated the performance of an artificial immune system based on the fuzzy k-NN algorithm. The highest average classification accuracy was 97.29% for E. coli and 76.4% for Yeast. Bouziane et al. [21] utilized four supervised machine learning algorithms for the prediction of cellular localization sites of proteins.
For their experiments, they used Naïve Bayesian, k-Nearest Neighbour and feed-forward neural network classifiers. The highest classification accuracy they achieved was 95.8% for the E. coli dataset and 73.4% for the Yeast dataset. Very recently, Priya and Chhabra [19] proposed a hybrid model of a Support Vector Machine and the LogitBoost technique for predicting the protein localization site in E. coli bacteria. The maximum classification accuracy achieved was 95.23%. Motivated by previous work, Satu et al. [20] utilized the E. coli and Yeast datasets for the problem of protein localization prediction. For their experiments they used several data mining classification algorithms: lazy classifiers (kNN, KStar), meta classifiers (Iterative Classifier Optimizer, LogitBoost, Random Committee, Rotation Forest), function classifiers (Logistic, Simple Logistic), tree classifiers (LMT, Random Forest, Random Tree) and artificial neural networks. The maximum classification accuracy was 87.50% for E. coli, with Rotation Forest, and 60.53% for Yeast, with Random Forest.
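
To make the general idea of a self-labeled scheme built on an ensemble more concrete, the following is a minimal sketch and not the algorithm proposed in the paper. It assumes a generic self-training loop in which three simple base learners (1-NN, 3-NN and a nearest-centroid rule, all chosen here only for illustration) must agree unanimously before an unlabeled example receives a pseudo-label. The function names, the labels "site_A" and "site_B", and the toy two-dimensional feature vectors are hypothetical and are not taken from the E. coli or Yeast datasets.

```python
import math
from collections import Counter

def knn_predict(train, x, k):
    """Majority label among the k nearest labeled points (Euclidean distance)."""
    nearest = sorted(train, key=lambda p: math.dist(p[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def centroid_predict(train, x):
    """Label of the class whose centroid is closest to x."""
    groups = {}
    for feats, label in train:
        groups.setdefault(label, []).append(feats)
    centroids = {
        label: [sum(col) / len(pts) for col in zip(*pts)]
        for label, pts in groups.items()
    }
    return min(centroids, key=lambda label: math.dist(centroids[label], x))

def ensemble_votes(train, x):
    """Votes of the three illustrative base learners for one example."""
    return [knn_predict(train, x, 1), knn_predict(train, x, 3), centroid_predict(train, x)]

def self_label(labeled, unlabeled, max_rounds=10):
    """Self-training: move unanimously voted examples into the labeled set."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        confident, remaining = [], []
        for x in pool:
            votes = ensemble_votes(labeled, x)
            if len(set(votes)) == 1:          # all base learners agree
                confident.append((x, votes[0]))
            else:
                remaining.append(x)
        if not confident:                      # nothing new was labeled; stop
            break
        labeled.extend(confident)
        pool = remaining
    return labeled, pool

def predict(labeled, x):
    """Final prediction by majority vote of the ensemble."""
    return Counter(ensemble_votes(labeled, x)).most_common(1)[0][0]

# Hypothetical toy data: 2-D feature vectors and made-up localization labels.
labeled = [((0.0, 0.0), "site_A"), ((0.2, 0.1), "site_A"),
           ((1.0, 1.0), "site_B"), ((0.9, 1.1), "site_B")]
unlabeled = [(0.1, 0.2), (0.95, 0.9), (0.3, 0.0), (0.8, 1.2)]

enlarged, leftover = self_label(labeled, unlabeled)
print(len(enlarged), len(leftover))
print(predict(enlarged, (0.05, 0.05)))
```

Requiring unanimous agreement is only one possible confidence criterion. A real self-labeled method would typically use stronger base learners and a probability-based or weighted voting threshold.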

Authors and Affiliations

Emmanuel G Pintelas, Panagiotis Pintelas

  • EP ID EP588432
  • DOI 10.26717/BJSTR.2018.11.002066

How To Cite

Emmanuel G Pintelas, Panagiotis Pintelas (2018). Predicting Protein Localization Sites Using an Ensemble Self-Labeled Framework. Biomedical Journal of Scientific & Technical Research (BJSTR), 11(2), 8364-8370. https://europub.co.uk/articles/-A-588432