Exploiting Document Level Semantics in Document Clustering

Abstract

Document clustering is an unsupervised machine learning method that separates a large, subject-heterogeneous collection (corpus) into smaller, more manageable, subject-homogeneous collections (clusters). Traditional document clustering methods work by extracting textual features such as terms, sequences, and phrases from documents. These features are treated as independent of each other and do not capture the meaning behind the words in the clustering process. To perform semantically viable clustering, we believe that the problem of document clustering has two main components: (1) representing the document in a form that inherently captures the semantics of the text, which may also help to reduce the dimensionality of the document, and (2) defining a similarity measure based on lexical, syntactic, and semantic features such that it assigns higher numerical values to document pairs that have a stronger syntactic and semantic relationship. In this paper, we propose a document representation that extracts three types of features from a given document: lexical, syntactic, and semantic. A meta-descriptor for each document is proposed using these three features: first lexical, then syntactic, and finally semantic. A document-to-document similarity matrix is produced in which each entry contains a three-value vector, with one value each for lexical, syntactic, and semantic similarity. The main contributions of this research are: (i) a document-level descriptor using three different features of text: lexical, syntactic, and semantic; (ii) a similarity function using these three features; and (iii) a new candidate clustering algorithm that uses the three components of the similarity measure to guide the clustering process in a direction that produces more semantically rich clusters.
We performed an extensive series of experiments on standard text mining data sets with external clustering evaluation measures such as F-Measure and Purity, and obtained encouraging results.
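
As a rough illustration of the document-to-document similarity matrix described in the abstract, the following minimal Python sketch computes a three-value vector (lexical, syntactic, semantic) for each document pair. The abstract does not specify how each component is computed, so everything here is an assumption made for illustration: lexical similarity is taken as Jaccard overlap of word sets, syntactic similarity as Jaccard overlap of word bigrams, and semantic similarity as Jaccard overlap of concepts looked up in a hypothetical toy map (`CONCEPTS`). None of these function names or choices come from the paper.

```python
# Hypothetical toy concept map standing in for a real semantic resource
# (an assumption for illustration; not taken from the paper).
CONCEPTS = {
    "car": "vehicle", "automobile": "vehicle", "truck": "vehicle",
    "dog": "animal", "cat": "animal",
}


def tokens(text):
    """Lowercase whitespace tokenization (a deliberately crude choice)."""
    return text.lower().split()


def jaccard(a, b):
    """Jaccard overlap of two iterables; 0.0 when both are empty."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def lexical_sim(d1, d2):
    # Assumed lexical component: overlap of word sets.
    return jaccard(tokens(d1), tokens(d2))


def syntactic_sim(d1, d2):
    # Assumed syntactic component: overlap of adjacent word pairs (bigrams),
    # a crude proxy for word order.
    t1, t2 = tokens(d1), tokens(d2)
    return jaccard(zip(t1, t1[1:]), zip(t2, t2[1:]))


def semantic_sim(d1, d2):
    # Assumed semantic component: overlap of concepts from the toy map.
    c1 = [CONCEPTS[t] for t in tokens(d1) if t in CONCEPTS]
    c2 = [CONCEPTS[t] for t in tokens(d2) if t in CONCEPTS]
    return jaccard(c1, c2)


def similarity_vector(d1, d2):
    """Three-value (lexical, syntactic, semantic) similarity for one pair."""
    return (lexical_sim(d1, d2), syntactic_sim(d1, d2), semantic_sim(d1, d2))


def similarity_matrix(docs):
    """Matrix whose entry [i][j] is the three-value vector for docs i and j."""
    return [[similarity_vector(a, b) for b in docs] for a in docs]


if __name__ == "__main__":
    docs = ["the car is fast", "the automobile is fast", "a dog barks"]
    for row in similarity_matrix(docs):
        print(row)
```

In this toy setup, "the car is fast" and "the automobile is fast" share only some words and bigrams but map to the same concept, so the semantic component is the one that ranks the pair as closely related. A clustering algorithm guided by all three components, as the abstract proposes, could exploit that signal where a purely lexical measure would not.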

Authors and Affiliations

Muhammad Rafi, Waleed Arshad, Habibullah Rafay

  • DOI 10.14569/IJACSA.2016.070660

How To Cite

Muhammad Rafi, Waleed Arshad, Habibullah Rafay (2016). Exploiting Document Level Semantics in Document Clustering. International Journal of Advanced Computer Science & Applications, 7(6), 462-469. https://europub.co.uk/articles/-A-101623