A Novel Approach for Web Document Classification

Abstract

The web is a huge repository of information, and web documents need to be categorized to facilitate search and retrieval. Web document classification therefore plays an important role in information organization and retrieval. This paper presents a fuzzy set based approach for automatically assigning a web document to one of the classes represented by a set of labelled training documents. Using the same word for more than one meaning, and many words for a single meaning, leads to ambiguity, especially in the web environment, where the number of users is very large. This problem is tackled using fuzzy association, in which each pair of words has a value associated with it. This value distinguishes the pair from other pairs of words and thus helps in resolving ambiguities. The approach presented in this paper does not require any parameter to be supplied by the user and hence is free of any bias that user input might introduce. The model is trained on a training set, and a test set is then given as input to be classified. We used the Gensim package to implement the approach because of its simplicity and robustness. The experimental results show that our approach efficiently classifies web documents by tackling ambiguities among the words.
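
The abstract does not give the formula used for the fuzzy association between word pairs, so the following is only a minimal sketch of the general idea, not the paper's method. It assumes a co-occurrence based association value, n_ab / (n_a + n_b - n_ab), computed over the training documents of each class, and it assumes that a test document is assigned to the class whose word pairs give the highest summed association. The function names (`fuzzy_associations`, `class_score`, `classify`) and the toy data are hypothetical, and plain Python is used here instead of Gensim.

```python
from collections import defaultdict
from itertools import combinations


def fuzzy_associations(docs):
    """Map each co-occurring word pair (a, b), a < b, to a value in (0, 1].

    Assumed measure: documents containing both words divided by
    documents containing at least one of them.
    """
    doc_freq = defaultdict(int)   # number of documents containing a word
    pair_freq = defaultdict(int)  # number of documents containing both words
    for tokens in docs:
        vocab = sorted(set(tokens))
        for word in vocab:
            doc_freq[word] += 1
        for a, b in combinations(vocab, 2):
            pair_freq[(a, b)] += 1
    return {
        (a, b): n / (doc_freq[a] + doc_freq[b] - n)
        for (a, b), n in pair_freq.items()
    }


def class_score(tokens, assoc):
    """Sum the association values of all word pairs found in the document."""
    vocab = sorted(set(tokens))
    return sum(assoc.get(pair, 0.0) for pair in combinations(vocab, 2))


def classify(tokens, training):
    """Return the label whose pair associations best match the document.

    `training` maps a class label to a list of tokenized documents.
    No user-supplied parameter is needed, in the spirit of the abstract.
    """
    models = {label: fuzzy_associations(docs) for label, docs in training.items()}
    scores = {label: class_score(tokens, model) for label, model in models.items()}
    return max(scores, key=scores.get)


# Hypothetical toy training set, for illustration only.
training = {
    "sports": [["football", "match", "goal"], ["match", "goal", "team"]],
    "finance": [["bank", "loan", "interest"], ["bank", "interest", "rate"]],
}
predicted = classify(["goal", "match", "result"], training)
```

In this sketch the pair, not the single word, carries the evidence: a word that appears in several classes contributes only through the company it keeps, which is one way to read the abstract's claim about handling ambiguity.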

Authors and Affiliations

Rajendra Kumar Roul

How To Cite

Rajendra Kumar Roul (2013). A Novel Approach for Web Document Classification. International Journal of Computer Science & Engineering Technology, 4(8), 1118-1125. https://europub.co.uk/articles/-A-146491