A Semi-supervised approach to Document Clustering with Sequence Constraints
Journal Title: Journal of Independent Studies and Research - Computing - Year 2015, Vol 13, Issue 1
Abstract
Document clustering is usually performed as an unsupervised task. It attempts to separate a document collection into groups of documents (clusters) by implicitly identifying the common patterns present in these documents. Semi-supervised approaches to this problem have recently reported promising results. In a semi-supervised approach, explicit background knowledge (for example, Must-link or Cannot-link information for a pair of documents) is used in the form of constraints to drive the clustering process in the right direction. In this paper, a semi-supervised approach to document clustering is proposed. The paper makes three main contributions: (i) a document is first transformed into a graph representation based on the Graph-of-Word approach, and word sequences of size 3 are extracted from this graph and used as features for the semi-supervised clustering; (ii) a similarity function based on common word sequences is proposed; and (iii) a constraint-based algorithm is designed to perform the actual clustering process through active learning. The proposed algorithm is implemented and extensively tested on three standard text mining datasets. The method clearly outperforms recently proposed document clustering algorithms in terms of standard evaluation measures for the document clustering task.
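The first two contributions can be illustrated with a minimal sketch. The abstract does not give the exact construction, so everything below is an assumption made for illustration: the graph is directed, an edge links a word to the word that follows it within a sliding window, a size-3 word sequence is a directed path over three distinct words, and similarity is the Jaccard overlap of two documents' sequence sets. The function names are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def graph_of_word(tokens, window=2):
    """Build a directed graph-of-word (assumed construction).

    An edge u -> v is added when v follows u within `window` tokens.
    With window=2 only directly adjacent words are linked.
    """
    graph = defaultdict(set)
    for i, u in enumerate(tokens):
        for v in tokens[i + 1 : i + window]:
            if u != v:
                graph[u].add(v)
    return graph


def word_sequences(graph, size=3):
    """Return every directed path over `size` distinct words as a tuple."""
    paths = set()

    def extend(path):
        if len(path) == size:
            paths.add(tuple(path))
            return
        # .get avoids creating new keys in the defaultdict while iterating
        for nxt in graph.get(path[-1], ()):
            if nxt not in path:
                extend(path + [nxt])

    for node in list(graph):
        extend([node])
    return paths


def sequence_similarity(seqs_a, seqs_b):
    """Jaccard overlap of two sequence sets (an assumed similarity measure)."""
    if not seqs_a and not seqs_b:
        return 0.0
    return len(seqs_a & seqs_b) / len(seqs_a | seqs_b)


doc_a = "semi supervised document clustering".split()
doc_b = "supervised document clustering works".split()
seqs_a = word_sequences(graph_of_word(doc_a))
seqs_b = word_sequences(graph_of_word(doc_b))
similarity = sequence_similarity(seqs_a, seqs_b)
```

In this toy example the two documents share one size-3 sequence out of three distinct sequences overall. A constraint-based clustering step could then use such pairwise similarities together with Must-link and Cannot-link pairs; that step is not sketched here because the abstract does not describe it in detail.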
Prediction of Suicide Causes in India using Machine Learning
Worldwide, the suicide rate is considered one of the most significant issues. With each passing year, the number of suicides increases dramatically, and for this reason this research is carried out to predict...
Extracting patterns from Global Terrorist Dataset (GTD) Using Co-Clustering approach
Global Terrorist Dataset (GTD) is a vast collection of terrorist activities reported around the globe. The terrorism database incorporates more than 27,000 terrorism incidents from 1968 to 2014. Every record has spatial...
Prospects of 5G Communications
The next generation of wireless communication is going to meet human demands beyond today’s trend. This study sets the frame on the future of wireless communication that requires real-time responses which pushes this tec...
Implementation of Adaptive Control Algorithm to Overcome the Traffic Congestion Problems of Karachi
Traffic control and management is a severe issue in cities as well as on highways in developing countries, such as those of South Asia, and particularly in Pakistan. The traffic congestion problem is becomi...
Detection of Duplicate and Near-Duplicate Content for Web Crawlers
There is an abundance of duplicated web documents on the internet. For example, two documents online could be very similar to each other except for a very small portion, such as URLs and advertisements. While such differ...