Exploiting Document Level Semantics in Document Clustering

Abstract

Document clustering is an unsupervised machine learning method that separates a large, subject-heterogeneous collection (corpus) into smaller, more manageable, subject-homogeneous collections (clusters). Traditional document clustering methods work by extracting textual features such as terms, sequences, and phrases from documents. These features are treated as independent of each other and do not capture the meaning behind the words in the clustering process. To perform semantically viable clustering, we believe that the problem of document clustering has two main components: (1) representing the document in a form that inherently captures the semantics of the text, which may also help to reduce the dimensionality of the document; and (2) defining a similarity measure based on lexical, syntactic, and semantic features that assigns higher numerical values to document pairs with stronger syntactic and semantic relationships. In this paper, we propose a document representation that extracts three types of features from a given document: lexical, syntactic, and semantic. A meta-descriptor for each document is built from these three features, in that order: first lexical, then syntactic, and finally semantic. A document-to-document similarity matrix is then produced in which each entry is a three-value vector holding the lexical, syntactic, and semantic similarity. The main contributions of this research are: (i) a document-level descriptor using three kinds of text features (lexical, syntactic, and semantic); (ii) a similarity function based on these three features; and (iii) a new candidate clustering algorithm that uses the three components of the similarity measure to guide the clustering process toward semantically richer clusters.
We performed an extensive series of experiments on standard text mining data sets, using external clustering evaluation measures such as F-measure and purity, and obtained encouraging results.
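
The idea of a document-to-document similarity matrix whose entries are three-value vectors can be illustrated with a minimal sketch. The abstract does not specify how the features are extracted, so everything below is an assumption made for illustration only: word sets stand in for lexical features, word bigrams for syntactic features, and a tiny hand-made lexicon (the hypothetical `CONCEPTS` table) for semantic features, with Jaccard overlap used as the per-feature similarity. It is not the authors' method.

```python
from itertools import combinations

# Hypothetical toy lexicon mapping words to concepts; it stands in for
# whatever semantic resource a real system would use (an assumption).
CONCEPTS = {
    "car": "vehicle", "automobile": "vehicle", "truck": "vehicle",
    "dog": "animal", "cat": "animal", "puppy": "animal",
}

FEATURES = ("lexical", "syntactic", "semantic")


def jaccard(a, b):
    """Jaccard overlap of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def describe(text):
    """Build an illustrative three-part descriptor for one document."""
    tokens = text.lower().split()
    return {
        "lexical": set(tokens),                      # word set
        "syntactic": set(zip(tokens, tokens[1:])),   # adjacent word pairs
        "semantic": {CONCEPTS[t] for t in tokens if t in CONCEPTS},
    }


def similarity_vector(d1, d2):
    """Return (lexical, syntactic, semantic) similarity for two descriptors."""
    return tuple(jaccard(d1[k], d2[k]) for k in FEATURES)


def similarity_matrix(texts):
    """Symmetric matrix whose entries are three-value similarity vectors."""
    descriptors = [describe(t) for t in texts]
    n = len(descriptors)
    # The diagonal is set to full similarity by convention.
    matrix = [[(1.0, 1.0, 1.0)] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        matrix[i][j] = matrix[j][i] = similarity_vector(
            descriptors[i], descriptors[j]
        )
    return matrix


if __name__ == "__main__":
    docs = ["the car is fast", "the automobile is fast", "a dog barks"]
    for row in similarity_matrix(docs):
        print(row)
```

In this toy input, the first two documents share few bigrams but map to the same concept, so their semantic component is high while the syntactic one is low. Keeping the three values separate, rather than collapsing them into one score, is what would let a clustering procedure weigh them differently.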

Authors and Affiliations

Muhammad Rafi, Waleed Arshad, Habibullah Rafay

  • EP ID EP101623
  • DOI 10.14569/IJACSA.2016.070660

How To Cite

Muhammad Rafi, Waleed Arshad, Habibullah Rafay (2016). Exploiting Document Level Semantics in Document Clustering. International Journal of Advanced Computer Science & Applications, 7(6), 462-469. https://europub.co.uk/articles/-A-101623