Exploiting Document Level Semantics in Document Clustering
Journal Title: International Journal of Advanced Computer Science & Applications - Year 2016, Vol 7, Issue 6
Abstract
Document clustering is an unsupervised machine learning method that separates a large, subject-heterogeneous collection (corpus) into smaller, more manageable, subject-homogeneous collections (clusters). Traditional document clustering methods rely on extracting textual features such as terms, sequences, and phrases from documents. These features are treated as independent of each other and do not account for the meaning behind the words in the clustering process. To perform semantically viable clustering, we believe that the problem of document clustering has two main components: (1) representing the document in a form that inherently captures the semantics of the text, which may also help to reduce the dimensionality of the document, and (2) defining a similarity measure based on lexical, syntactic, and semantic features such that it assigns higher numerical values to document pairs that have a stronger syntactic and semantic relationship. In this paper, we propose a document representation built by extracting three different types of features from a given document: lexical, syntactic, and semantic. A meta-descriptor for each document is proposed using these three features: first lexical, then syntactic, and finally semantic. A document-to-document similarity matrix is produced in which each entry contains a three-valued vector, with one value each for lexical, syntactic, and semantic similarity. The main contributions of this research are: (i) a document-level descriptor using three different types of text features (lexical, syntactic, and semantic); (ii) a similarity function based on these three features; and (iii) a new candidate clustering algorithm that uses the three components of the similarity measure to guide the clustering process in a direction that produces more semantically rich clusters.
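The document-to-document similarity matrix described above can be illustrated with a minimal Python sketch. The abstract does not specify how the lexical, syntactic, and semantic features are extracted, so everything here is an assumption made for illustration: the lexical feature is the set of tokens, the syntactic feature is approximated by adjacent word pairs, the semantic feature maps tokens through a small hand-made concept table (the hypothetical `CONCEPTS`), and Jaccard overlap stands in for the paper's similarity function. The names `describe`, `jaccard`, and `similarity_matrix` are likewise hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical concept table standing in for a real semantic resource.
CONCEPTS: Dict[str, str] = {"car": "vehicle", "automobile": "vehicle"}

# Order of the three components in every similarity vector.
KEYS: Tuple[str, str, str] = ("lexical", "syntactic", "semantic")

Descriptor = Dict[str, Set[str]]


def describe(text: str) -> Descriptor:
    """Build a three-part descriptor for one document (illustrative features only)."""
    tokens = text.lower().split()
    return {
        # Lexical: the set of surface tokens.
        "lexical": set(tokens),
        # Syntactic (assumed proxy): adjacent word pairs.
        "syntactic": {f"{a} {b}" for a, b in zip(tokens, tokens[1:])},
        # Semantic (assumed proxy): tokens replaced by a concept label when one is known.
        "semantic": {CONCEPTS.get(t, t) for t in tokens},
    }


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_matrix(docs: List[str]) -> List[List[Tuple[float, ...]]]:
    """Return an n x n matrix whose entries are (lexical, syntactic, semantic) vectors."""
    descs = [describe(d) for d in docs]
    return [
        [tuple(jaccard(di[k], dj[k]) for k in KEYS) for dj in descs]
        for di in descs
    ]
```

For the two documents "the car is fast" and "the automobile is fast", the lexical and syntactic components are lowered by the differing word, while the semantic component treats the two as equivalent under the assumed concept table. A clustering algorithm could then consult the three components separately rather than collapsing them into a single score.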
We performed an extensive series of experiments on standard text mining data sets using external clustering evaluation measures such as F-measure and purity, and obtained encouraging results.
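Of the two external measures mentioned, purity has a simple standard definition: each cluster is credited with the count of its most frequent true class, and the credits are summed and divided by the number of documents. The sketch below computes it under that standard definition; the function name `purity` and its argument names are assumptions for illustration, and it is not taken from the paper's evaluation code.

```python
from collections import Counter, defaultdict
from typing import Hashable, Sequence


def purity(cluster_ids: Sequence[Hashable], class_labels: Sequence[Hashable]) -> float:
    """Fraction of documents that belong to the majority class of their cluster."""
    if len(cluster_ids) != len(class_labels):
        raise ValueError("cluster_ids and class_labels must have the same length")
    if not cluster_ids:
        raise ValueError("at least one document is required")

    # Count how often each true class occurs inside each cluster.
    class_counts = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, class_labels):
        class_counts[cluster][label] += 1

    # Credit each cluster with the size of its largest class, then normalize.
    majority_total = sum(max(counts.values()) for counts in class_counts.values())
    return majority_total / len(cluster_ids)
```

A purity of 1.0 means every cluster contains documents of a single class. Purity alone rewards many small clusters, which is one reason it is usually reported alongside a measure such as F-measure.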
Authors and Affiliations
Muhammad Rafi, Waleed Arshad, Habibullah Rafay