Exploiting Document Level Semantics in Document Clustering
Journal Title: International Journal of Advanced Computer Science & Applications - Year 2016, Vol 7, Issue 6
Abstract
Document clustering is an unsupervised machine learning method that separates a large, subject-heterogeneous collection (corpus) into smaller, more manageable, subject-homogeneous collections (clusters). Traditional document clustering methods extract textual features such as terms, sequences, and phrases from documents. These features are treated as independent of each other and do not capture the meaning behind the words during clustering. To perform semantically viable clustering, we believe the problem of document clustering has two main components: (1) representing the document in a form that inherently captures the semantics of the text, which may also help reduce the dimensionality of the document; and (2) defining a similarity measure based on lexical, syntactic, and semantic features that assigns higher numerical values to document pairs with stronger syntactic and semantic relationships. In this paper, we propose a document representation built by extracting three types of features from a given document: lexical, syntactic, and semantic. A meta-descriptor for each document is constructed from these three features, in that order. A document-to-document similarity matrix is then produced in which each entry contains a three-value vector, with one value each for lexical, syntactic, and semantic similarity. The main contributions of this research are: (i) a document-level descriptor using three different text features (lexical, syntactic, and semantic); (ii) a similarity function that uses these three features; and (iii) a new candidate clustering algorithm that uses the three components of the similarity measure to guide the clustering process toward more semantically rich clusters.
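To illustrate the shape of a document-to-document similarity matrix whose entries are three-value vectors, here is a minimal Python sketch. The abstract does not say how each component is computed, so everything concrete here is an assumption: Jaccard overlap is used for all three components, adjacent word pairs stand in for syntactic features, and `concept_of` is a hypothetical hand-built word-to-concept lookup standing in for a semantic resource. It is not the authors' method.

```python
from itertools import combinations

FEATURES = ("lexical", "syntactic", "semantic")


def jaccard(a, b):
    """Jaccard overlap of two sets; defined as 1.0 when both are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def describe(text, concept_of):
    """Build a hypothetical three-part descriptor for one document."""
    tokens = text.lower().split()
    return {
        # Assumption: lexical features are the unique terms.
        "lexical": set(tokens),
        # Assumption: adjacent word pairs stand in for syntactic features.
        "syntactic": set(zip(tokens, tokens[1:])),
        # Assumption: concept_of maps a word to a concept label.
        "semantic": {concept_of[t] for t in tokens if t in concept_of},
    }


def similarity_matrix(docs, concept_of):
    """Return an n x n matrix whose entries are (lexical, syntactic, semantic)."""
    descriptors = [describe(d, concept_of) for d in docs]
    n = len(docs)
    # Every document is fully similar to itself on all three components.
    matrix = [[(1.0, 1.0, 1.0)] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        triple = tuple(
            jaccard(descriptors[i][k], descriptors[j][k]) for k in FEATURES
        )
        matrix[i][j] = matrix[j][i] = triple
    return matrix


docs = ["the car is fast", "the automobile is fast", "cats sleep"]
concept_of = {"car": "vehicle", "automobile": "vehicle", "cats": "animal"}
matrix = similarity_matrix(docs, concept_of)
```

In this toy example the first two documents share only some words and word pairs but map to the same concept, so their semantic component is higher than their lexical and syntactic components. Keeping the three values separate, rather than collapsing them into one score, is what lets a clustering algorithm weigh them differently.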
We performed an extensive series of experiments on standard text mining data sets, using external clustering evaluation measures such as F-measure and purity, and obtained encouraging results.
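For readers unfamiliar with the two external evaluation measures named above, the following Python sketch computes both from a list of cluster assignments and a list of true class labels. Purity follows its standard definition. For F-measure, the sketch uses one common class-weighted formulation (for each class, take the best F1 over all clusters, then weight by class size); the abstract does not state which variant the authors used, so treat that choice as an assumption. The function names and sample labels are illustrative only.

```python
from collections import Counter


def purity(clusters, classes):
    """Fraction of documents that belong to the majority class of their cluster."""
    n = len(classes)
    total = 0
    for c in set(clusters):
        members = [classes[i] for i in range(n) if clusters[i] == c]
        # Count of the most frequent true class inside this cluster.
        total += Counter(members).most_common(1)[0][1]
    return total / n


def f_measure(clusters, classes):
    """Class-weighted F-measure: best F1 per class, weighted by class size."""
    n = len(classes)
    class_sizes = Counter(classes)
    cluster_sizes = Counter(clusters)
    joint = Counter(zip(classes, clusters))
    score = 0.0
    for cls, n_cls in class_sizes.items():
        best = 0.0
        for clu, n_clu in cluster_sizes.items():
            n_both = joint[(cls, clu)]
            if n_both == 0:
                continue  # F1 is zero when class and cluster do not overlap
            precision = n_both / n_clu
            recall = n_both / n_cls
            best = max(best, 2 * precision * recall / (precision + recall))
        score += (n_cls / n) * best
    return score


# Illustrative labels only, not data from the paper.
clusters = [0, 0, 0, 1, 1, 1]
classes = ["a", "a", "b", "b", "b", "b"]
p = purity(clusters, classes)
f = f_measure(clusters, classes)
```

Both measures range from 0 to 1 and reach 1 only when every cluster matches a true class exactly. Purity alone rewards producing many small clusters, which is one reason it is usually reported alongside F-measure.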
Authors and Affiliations
Muhammad Rafi, Waleed Arshad, Habibullah Rafay