AN EFFICIENT APPROACH FOR TEMPLATE EXTRACTION

Abstract

The World Wide Web is a vast and rapidly growing source of useful information and the main medium for publishing and accessing information on the Internet. Web sites commonly embed their content in templates to give readers easy access. For a search engine, however, detecting the template and displaying only the content to users is a major task in web page retrieval. Templates are considered harmful because they compromise the performance of web page clustering and classification. In this paper, we present a novel algorithm for extracting templates from web documents that are generated from heterogeneous template structures. The proposed approach clusters the web documents by the similarity of their template structure, so that the template for each cluster is extracted simultaneously. The resulting clusters are given as input to the Roadrunner system, which extracts information from template-based web pages.
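
The clustering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not state the similarity measure, so the sketch assumes each page is represented by its set of root-to-node tag paths, pages are compared with Jaccard similarity, and a page joins the best-matching cluster above a threshold. The names `tag_paths`, `jaccard`, `cluster_by_template` and the threshold value 0.6 are hypothetical, chosen for illustration only. The hand-off to Roadrunner is not shown.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class PathCollector(HTMLParser):
    """Collects the set of root-to-node tag paths of an HTML document."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed elements.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def tag_paths(html):
    """Return the structural fingerprint (set of tag paths) of one page."""
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return frozenset(collector.paths)


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_by_template(pages, threshold=0.6):
    """Greedy one-pass clustering (an assumed strategy, not the paper's).

    Each cluster keeps the indices of its member pages and a template,
    here approximated as the tag paths shared by every member so far.
    """
    clusters = []
    for index, html in enumerate(pages):
        paths = tag_paths(html)
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = jaccard(paths, cluster["template"])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append({"members": [index], "template": set(paths)})
        else:
            best["members"].append(index)
            best["template"] &= paths  # keep only the shared structure
    return clusters
```

Because the template of a cluster is refined as each page is added, cluster assignment and template extraction happen in the same pass, which mirrors the "simultaneous" extraction mentioned above only loosely. The result depends on page order and on the assumed threshold.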

Authors and Affiliations

Pravallika. CH, Swapna Goud. N, Vishnu Murthy. G

How To Cite

Pravallika. CH, Swapna Goud. N, Vishnu Murthy. G (2012). AN EFFICIENT APPROACH FOR TEMPLATE EXTRACTION. International Journal of Computer Science & Engineering Technology, 3(8), 348-352. https://europub.co.uk/articles/-A-108839