Concept-Based Document Similarity Based on Suffix Tree Document

Abstract

Document clustering has been studied as a post-retrieval document visualization technique that provides intuitive navigation and browsing by organizing documents into groups, each representing a different topic. Clustering techniques rest on four components: the data representation model, the similarity measure, the clustering model, and the clustering algorithm. Previous work has treated the phrase as an informative feature term for improving the effectiveness of document clustering. In this paper, we propose a concept-based document similarity that computes the similarity of documents based on the Suffix Tree Document (STD) model. By mapping each node in the suffix tree of the STD model to a unique feature term in the Vector Space Document (VSD) model, the concept-based document similarity inherits the ctf (conceptual term frequency), tf (term frequency), and df (document frequency) weighting scheme when computing document similarity with concepts. We apply the concept-based document similarity to the Hierarchical Agglomerative Clustering (HAC) algorithm to develop a new document clustering approach. The new concept-based model analyzes terms at the sentence, document, and corpus levels. The similarity between documents is calculated with a new concept-based similarity measure (a Euclidean distance measure). The proposed similarity measure takes full advantage of the concept analysis measures.
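The idea of mapping suffix tree nodes to vector space feature terms can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the paper: instead of building an actual suffix tree, it enumerates every contiguous word sequence of up to max_len words in each sentence (roughly the phrases that the nodes of a word-level suffix trie would represent); it uses a common tf-df weight, (1 + log tf) * log(1 + N/df), as a stand-in for the paper's weighting; and it leaves out the ctf component entirely. The function names and the toy documents are hypothetical.

```python
import math
from collections import Counter


def phrase_features(document, max_len=3):
    """Count every contiguous word sequence (up to max_len words) in each sentence.

    Assumption: these phrases stand in for suffix tree nodes.
    """
    counts = Counter()
    for sentence in document.lower().split("."):
        words = sentence.split()
        for start in range(len(words)):
            stop = min(start + max_len, len(words))
            for end in range(start + 1, stop + 1):
                counts[" ".join(words[start:end])] += 1
    return counts


def weighted_vectors(documents, max_len=3):
    """Build one sparse vector per document, weighted by tf and df.

    Assumption: the weight formula is a common tf-df variant, not the paper's.
    """
    tfs = [phrase_features(doc, max_len) for doc in documents]
    df = Counter()
    for tf in tfs:
        df.update(tf.keys())  # each phrase counted once per document
    n = len(documents)
    return [
        {p: (1 + math.log(c)) * math.log(1 + n / df[p]) for p, c in tf.items()}
        for tf in tfs
    ]


def euclidean_distance(u, v):
    """Euclidean distance between two sparse vectors stored as dicts."""
    keys = set(u) | set(v)
    return math.sqrt(sum((u.get(k, 0.0) - v.get(k, 0.0)) ** 2 for k in keys))


docs = [
    "cat ate cheese. mouse ate cheese too.",
    "cat ate mouse too.",
    "stock markets fell sharply.",
]
vectors = weighted_vectors(docs)
d01 = euclidean_distance(vectors[0], vectors[1])
d02 = euclidean_distance(vectors[0], vectors[2])
```

In this toy corpus the first two documents share several phrases, so d01 comes out smaller than d02. A matrix of such pairwise distances is the kind of input that an agglomerative clustering procedure such as HAC would merge on.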

Authors and Affiliations

P. Perumal, R. Nedunchezhian, M. Indra Priya

How To Cite

P. Perumal, R. Nedunchezhian, M. Indra Priya (2012). Concept-Based Document Similarity Based on Suffix Tree Document. International Journal of Computer Science & Engineering Technology, 3(10), 470-475. https://europub.co.uk/articles/-A-108927