Concept-Based Document Similarity Based on Suffix Tree Document

Abstract

Document clustering has been studied as a post-retrieval document visualization technique that provides intuitive navigation and browsing by organizing documents into groups, each representing a different topic. Clustering techniques rest on four components: the data representation model, the similarity measure, the clustering model, and the clustering algorithm. Previous work has treated the phrase as an informative feature term for improving the effectiveness of document clustering. In this paper, we propose a concept-based document similarity that computes the similarity of documents based on the Suffix Tree Document (STD) model. By mapping each node in the suffix tree of the STD model to a unique feature term in the Vector Space Document (VSD) model, the concept-based document similarity inherits the ctf (conceptual term frequency), tf (term frequency), and df (document frequency) weighting scheme when computing document similarity with concepts. The concept-based document similarity is applied to the Hierarchical Agglomerative Clustering (HAC) algorithm to develop a new document clustering approach. The new concept-based model analyzes terms at the sentence, document, and corpus levels. The similarity between documents is calculated with a new concept-based similarity measure (a Euclidean distance measure), which takes full advantage of the concept analysis measures.
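The pipeline the abstract outlines (suffix-tree nodes mapped to vector-space features, frequency-based weighting, Euclidean distance, agglomerative clustering) might be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: instead of building a real suffix tree it enumerates word n-grams from every sentence suffix, treating each distinct phrase as a stand-in for one tree node; the weight `(1 + log tf) * log(1 + N / df)` is an assumed tf-idf variant; the ctf component is omitted because the abstract does not define how it is computed; and the names `suffix_phrases`, `phrase_vectors`, `euclidean`, `hac`, and the `max_len` parameter are hypothetical.

```python
import math
from collections import Counter


def suffix_phrases(doc, max_len=3):
    """Yield every word n-gram (up to max_len words) from each sentence suffix.

    Each distinct phrase stands in for one node of a word-level suffix tree.
    """
    for sentence in doc.lower().split("."):
        words = sentence.split()
        for start in range(len(words)):
            stop = min(start + max_len, len(words))
            for end in range(start + 1, stop + 1):
                yield " ".join(words[start:end])


def phrase_vectors(docs, max_len=3):
    """Map each document to a weighted vector over all observed phrases."""
    tfs = [Counter(suffix_phrases(d, max_len)) for d in docs]
    df = Counter(p for tf in tfs for p in tf)  # documents containing each phrase
    n = len(docs)
    vocab = sorted(df)
    vectors = [
        [
            # Assumed weighting, not taken from the paper.
            (1 + math.log(tf[p])) * math.log(1 + n / df[p]) if tf[p] else 0.0
            for p in vocab
        ]
        for tf in tfs
    ]
    return vocab, vectors


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def hac(vectors, k):
    """Merge the two closest clusters (average linkage) until k remain."""
    clusters = [[i] for i in range(len(vectors))]

    def linkage(a, b):
        total = sum(euclidean(vectors[i], vectors[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > k:
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


docs = [
    "cat ate cheese. mouse ate cheese too.",
    "mouse ate cheese too.",
    "stock prices fell sharply.",
]
vocab, vecs = phrase_vectors(docs)
clusters = hac(vecs, 2)
```

On this toy corpus the first two documents share every phrase of the second, so they lie closer together than either does to the third and are merged first. Capping phrases at `max_len` words keeps the feature space small; an actual suffix tree would represent phrases of any length without that cap.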

Authors and Affiliations

P. Perumal, R. Nedunchezhian, M. Indra Priya

How To Cite

P. Perumal, R. Nedunchezhian, M. Indra Priya (2012). Concept-Based Document Similarity Based on Suffix Tree Document. International Journal of Computer Science & Engineering Technology, 3(10), 470-475. https://europub.co.uk/articles/-A-108927