A Semi-supervised approach to Document Clustering with Sequence Constraints
Journal Title: Journal of Independent Studies and Research - Computing - Year 2015, Vol 13, Issue 1
Abstract
Document clustering is usually performed as an unsupervised task. It attempts to separate a document collection into groups of documents (clusters) by implicitly identifying the common patterns present in those documents. Semi-supervised approaches to this problem have recently reported promising results. In a semi-supervised approach, explicit background knowledge (for example, Must-link or Cannot-link information for a pair of documents) is supplied in the form of constraints that steer the clustering process in the right direction. In this paper, a semi-supervised approach to document clustering is proposed. The paper makes three main contributions: (i) each document is first transformed into a graph representation based on the Graph-of-Word approach, and word sequences of length 3 are extracted from this graph and used as features for the semi-supervised clustering; (ii) a similarity function based on common word sequences is proposed; and (iii) a constraint-based algorithm is designed to perform the actual clustering process through active learning. The proposed algorithm is implemented and extensively tested on three standard text mining datasets. The method clearly outperforms recently proposed document clustering algorithms in terms of the standard evaluation measures for the document clustering task.
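The abstract does not give the details of the three components, so the following is only a minimal sketch of how they might fit together, under stated assumptions. It assumes the Graph-of-Word is built with a sliding window over the token stream, that a word sequence of length 3 is a directed path over three distinct words in that graph, that the similarity is a Jaccard overlap of the two sequence sets, and that constraints are checked pairwise in the style of constrained k-means. All function names are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def graph_of_word(tokens, window=2):
    """Directed graph with an edge u -> v when v follows u within `window` tokens.

    The window size is an assumption; window=2 links only adjacent words.
    """
    graph = defaultdict(set)
    for i, u in enumerate(tokens):
        for v in tokens[i + 1 : i + window]:
            if u != v:
                graph[u].add(v)
    return graph


def word_sequences(graph):
    """Collect every directed path u -> v -> w over three distinct words."""
    seqs = set()
    for u, successors in graph.items():
        for v in successors:
            # .get avoids inserting new keys into the defaultdict mid-iteration
            for w in graph.get(v, ()):
                if w != u:
                    seqs.add((u, v, w))
    return seqs


def sequence_similarity(seqs_a, seqs_b):
    """Jaccard overlap of two sequence sets (an assumed stand-in measure)."""
    if not seqs_a and not seqs_b:
        return 0.0
    return len(seqs_a & seqs_b) / len(seqs_a | seqs_b)


def violates_constraints(doc, cluster, assignment, must_link, cannot_link):
    """Return True if placing `doc` in `cluster` breaks a pairwise constraint."""
    for a, b in must_link:
        if doc in (a, b):
            other = b if doc == a else a
            # A must-link partner already sits in a different cluster.
            if other in assignment and assignment[other] != cluster:
                return True
    for a, b in cannot_link:
        if doc in (a, b):
            other = b if doc == a else a
            # A cannot-link partner already sits in this cluster.
            if assignment.get(other) == cluster:
                return True
    return False


doc_a = "semi supervised document clustering".split()
doc_b = "supervised document clustering works".split()
seqs_a = word_sequences(graph_of_word(doc_a))
seqs_b = word_sequences(graph_of_word(doc_b))
similarity = sequence_similarity(seqs_a, seqs_b)
```

In this toy example the two documents share one of three distinct word sequences. In an active-learning setting, the must-link and cannot-link pairs would be obtained by querying an oracle about selected document pairs; that query strategy is left out of the sketch.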