A Semi-supervised approach to Document Clustering with Sequence Constraints

Journal Title: Journal of Independent Studies and Research - Computing - Year 2015, Vol 13, Issue 1

Abstract

Document clustering is usually performed as an unsupervised task. It attempts to separate a document collection into groups of documents (clusters) by implicitly identifying the common patterns present in those documents. Semi-supervised approaches to this problem have recently reported promising results. In a semi-supervised approach, explicit background knowledge (for example, Must-link or Cannot-link information for a pair of documents) is used in the form of constraints to drive the clustering process in the right direction. In this paper, a semi-supervised approach to document clustering is proposed. The paper makes three main contributions: (i) each document is first transformed into a graph representation based on the Graph-of-Word approach, and word sequences of size 3 are extracted from this graph and used as features for the semi-supervised clustering; (ii) a similarity function based on common word sequences is proposed; and (iii) a constraint-based algorithm is designed to perform the actual clustering through active learning. The proposed algorithm is implemented and extensively tested on three standard text mining datasets. The method clearly outperforms recently proposed document clustering algorithms in terms of the standard evaluation measures for the document clustering task.
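The abstract does not give the construction details, so the following is only a minimal sketch of the three ideas it names, under stated assumptions. It assumes a sliding-window graph-of-words (the window size of 2 is an assumption), treats a "word sequence of size 3" as a path of three distinct words in that graph, and substitutes Jaccard overlap of the sequence sets for the paper's similarity function, which is not specified here. All function names are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def graph_of_words(tokens, window=2):
    """Directed graph: each word points to the words that follow it
    within `window` positions (the window size is an assumption)."""
    graph = defaultdict(set)
    for i, word in enumerate(tokens):
        for nxt in tokens[i + 1 : i + 1 + window]:
            if nxt != word:
                graph[word].add(nxt)
    return graph


def word_sequences(graph, size=3):
    """Enumerate every path of `size` distinct words in the graph."""
    sequences = set()

    def extend(path):
        if len(path) == size:
            sequences.add(tuple(path))
            return
        for nxt in graph.get(path[-1], ()):
            if nxt not in path:
                extend(path + [nxt])

    for start in list(graph):
        extend([start])
    return sequences


def sequence_similarity(doc_a, doc_b):
    """Jaccard overlap of the two documents' word-sequence sets.
    Jaccard is a stand-in, not the paper's similarity function."""
    seqs_a = word_sequences(graph_of_words(tokenize(doc_a)))
    seqs_b = word_sequences(graph_of_words(tokenize(doc_b)))
    union = seqs_a | seqs_b
    return len(seqs_a & seqs_b) / len(union) if union else 0.0


def violates_constraints(labels, must_link, cannot_link):
    """Return True if a cluster assignment breaks any pairwise constraint."""
    if any(labels[i] != labels[j] for i, j in must_link):
        return True
    return any(labels[i] == labels[j] for i, j in cannot_link)
```

A constrained clustering loop could use `sequence_similarity` to compare documents and reject any assignment for which `violates_constraints` returns `True`. How the paper selects constraint pairs through active learning is not described in the abstract and is left out of this sketch.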

  • EP ID EP643243
  • DOI 10.31645/jisrc/(2015).13.1.0009

How To Cite

(2015). A Semi-supervised approach to Document Clustering with Sequence Constraints. Journal of Independent Studies and Research - Computing, 13(1), 65-73. https://europub.co.uk/articles/-A-643243