AN EFFICIENT APPROACH FOR TEMPLATE EXTRACTION
Journal Title: International Journal of Computer Science & Engineering Technology - Year 2012, Vol 3, Issue 8
Abstract
The World Wide Web is a vast and rapidly growing source of information, widely used to publish and access content on the Internet. Web pages are commonly generated from templates filled with content, which gives readers consistent and easy access. For a search engine, however, detecting the template and presenting only the content to users is a major task in retrieving web pages. Templates are considered harmful because they compromise the performance of clustering and classification of web pages. In this paper, we present a novel algorithm for extracting templates from web documents that are generated from heterogeneous template structures. In the proposed approach, we cluster the web documents based on the similarity of their template structure, so that the template for each cluster is extracted simultaneously. The resulting clusters are given as input to the RoadRunner system, which is used to extract information from template-based web pages.
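The clustering step described above can be illustrated with a minimal Python sketch. It is not the algorithm from the paper, whose details the abstract does not give. It assumes that a page's template structure can be approximated by its set of root-to-node tag paths, that similarity is the Jaccard coefficient of those sets, and that a greedy single-pass grouping with a threshold of 0.7 is adequate. The names `PathCollector`, `tag_paths`, `jaccard`, `cluster_by_structure`, and the threshold value are all hypothetical.

```python
from html.parser import HTMLParser


class PathCollector(HTMLParser):
    """Collects the set of root-to-node tag paths seen in one HTML page."""

    # Elements that never have a closing tag, so they are not pushed.
    VOID = {"br", "hr", "img", "input", "link", "meta"}

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating sloppy markup.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass


def tag_paths(html):
    """Return the structural signature (set of tag paths) of one page."""
    collector = PathCollector()
    collector.feed(html)
    return frozenset(collector.paths)


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_by_structure(pages, threshold=0.7):
    """Greedy single-pass clustering (an assumed, simplified scheme).

    A page joins the first cluster whose representative (its first member)
    is at least `threshold` similar; otherwise it starts a new cluster.
    Returns a list of clusters, each a list of page indices.
    """
    clusters = []  # pairs of (representative paths, member indices)
    for index, html in enumerate(pages):
        paths = tag_paths(html)
        for representative, members in clusters:
            if jaccard(paths, representative) >= threshold:
                members.append(index)
                break
        else:
            clusters.append((paths, [index]))
    return [members for _, members in clusters]
```

Under these assumptions, each returned group of page indices would correspond to one candidate template, and the pages of a group could then be passed together to a wrapper-induction system such as RoadRunner.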
Authors and Affiliations
Pravallika. CH, Swapna Goud. N, Vishnu Murthy. G