Automatic Template Extraction using Hyper Graph Technique from Heterogeneous Web Pages  

Abstract

The World Wide Web is one of the most useful sources of information. To achieve high publishing productivity, many websites populate their pages automatically by filling common templates with content. Templates give readers easy access to the content through a consistent structure. For machines, however, templates are considered harmful: the irrelevant terms they contain degrade the accuracy and performance of web applications. Template detection techniques have therefore received considerable attention recently as a way to improve the performance of search engines and of the clustering and classification of web documents. In this paper, we present novel algorithms for extracting templates from a large number of web documents generated from heterogeneous templates. We cluster the web documents by the similarity of their underlying template structures, so that the template for each cluster is extracted simultaneously. We develop a novel goodness measure for clustering, together with a fast approximation of it, and provide a comprehensive analysis of our algorithm. Our experimental results on real-life data sets confirm the effectiveness and robustness of our algorithm compared with state-of-the-art template detection algorithms.
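
The idea of clustering pages by template structure and extracting one template per cluster can be illustrated with a minimal Python sketch. It is not the algorithm of the paper: the abstract does not define the goodness measure or the hypergraph construction, so the sketch substitutes plain Jaccard similarity over root-to-element tag paths and a greedy single-pass clustering. The names `tag_paths`, `cluster_pages` and the `threshold` value are assumptions made for illustration only.

```python
from html.parser import HTMLParser


class PathCollector(HTMLParser):
    """Collects the set of root-to-element tag paths seen in one page."""

    # Elements that never have a closing tag, so they are not pushed.
    VOID = {"br", "img", "hr", "meta", "link", "input"}

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating sloppy markup.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def tag_paths(html):
    """Return the structural fingerprint of a page as a set of tag paths."""
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return frozenset(collector.paths)


def jaccard(a, b):
    """Jaccard similarity of two sets (assumed stand-in similarity)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_pages(pages, threshold=0.6):
    """Greedy clustering: a page joins the first cluster whose current
    template is similar enough, otherwise it starts a new cluster.
    The template of a cluster is the set of paths shared by all members."""
    clusters = []
    for index, html in enumerate(pages):
        paths = tag_paths(html)
        for cluster in clusters:
            if jaccard(paths, cluster["template"]) >= threshold:
                cluster["members"].append(index)
                cluster["template"] = cluster["template"] & paths
                break
        else:
            clusters.append({"template": paths, "members": [index]})
    return clusters
```

Because each cluster's template is updated as pages are assigned to it, clustering and template extraction happen in the same pass, which mirrors the "extracted simultaneously" idea only loosely. The result of a greedy pass depends on page order and on the chosen threshold, which is the kind of sensitivity a principled goodness measure is meant to avoid.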

Authors and Affiliations

D. Kanagalatchumy, Dr. S. Pushpa

How To Cite

D. Kanagalatchumy, Dr. S. Pushpa (2013). Automatic Template Extraction using Hyper Graph Technique from Heterogeneous Web Pages. International Journal of Advanced Research in Computer Engineering & Technology (IJARCET), 2(4), 1460-1466. https://europub.co.uk/articles/-A-98874