A Semi-supervised approach to Document Clustering with Sequence Constraints

Journal Title: Journal of Independent Studies and Research - Computing - Year 2015, Vol 13, Issue 1

Abstract

Document clustering is usually performed as an unsupervised task. It attempts to separate a document collection into groups of documents (clusters) by implicitly identifying the common patterns present in these documents. A semi-supervised approach to this problem has recently reported promising results. In a semi-supervised approach, explicit background knowledge (for example, Must-link or Cannot-link information for a pair of documents) is used in the form of constraints to drive the clustering process in the right direction. In this paper, a semi-supervised approach to document clustering is proposed. The paper makes three main contributions: (i) a document is first transformed into a graph representation based on the Graph-of-Word approach, and word sequences of size 3 are extracted from this graph and used as features for the semi-supervised clustering; (ii) a similarity function based on common word sequences is proposed; and (iii) a constraint-based algorithm is designed to perform the actual clustering process through active learning. The proposed algorithm is implemented and extensively tested on three standard text mining datasets. The method clearly outperforms recently proposed algorithms for document clustering in terms of standard evaluation measures for the document clustering task.
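The abstract does not give the exact graph construction, similarity formula, or constraint handling, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes a directed Graph-of-Word with a sliding window of 2 tokens, takes directed paths over three distinct words as the size-3 sequences, uses Jaccard overlap as a stand-in for the common-word-sequence similarity, and checks Must-link and Cannot-link pairs in the style of COP-KMeans. All function names and parameters are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def graph_of_words(tokens, window=2):
    """Directed graph: edge u -> v if v follows u within `window` tokens.

    Assumption: window=2 links each word only to its immediate successor.
    """
    graph = defaultdict(set)
    for i, u in enumerate(tokens):
        for v in tokens[i + 1 : i + window]:
            if u != v:  # skip self-loops
                graph[u].add(v)
    return graph


def three_word_sequences(graph):
    """All directed paths u -> v -> w over three distinct words."""
    seqs = set()
    for u, successors in list(graph.items()):
        for v in successors:
            for w in graph.get(v, ()):
                if w != u:
                    seqs.add((u, v, w))
    return seqs


def sequence_similarity(doc_a, doc_b):
    """Jaccard overlap of the two documents' 3-word sequence sets.

    Assumption: Jaccard is a placeholder for the paper's similarity function.
    """
    a = three_word_sequences(graph_of_words(doc_a.lower().split()))
    b = three_word_sequences(graph_of_words(doc_b.lower().split()))
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def violates_constraints(doc, cluster, assignment, must_link, cannot_link):
    """True if placing `doc` in `cluster` breaks a pairwise constraint.

    `assignment` maps already-clustered document ids to cluster ids.
    """
    for a, b in must_link:
        if doc in (a, b):
            other = b if doc == a else a
            if other in assignment and assignment[other] != cluster:
                return True
    for a, b in cannot_link:
        if doc in (a, b):
            other = b if doc == a else a
            if other in assignment and assignment[other] == cluster:
                return True
    return False
```

In a sketch like this, a clustering loop would assign each document to its most similar cluster for which `violates_constraints` returns False; the active-learning step that selects which document pairs to query for constraints is not shown here.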

  • EP ID EP643243
  • DOI 10.31645/jisrc/(2015).13.1.0009

How To Cite

(2015). A Semi-supervised approach to Document Clustering with Sequence Constraints. Journal of Independent Studies and Research - Computing, 13(1), 65-73. https://europub.co.uk/articles/-A-643243