A Methodology for Template Extraction from Heterogeneous Web Pages

Journal Title: Indian Journal of Computer Science and Engineering - Year 2012, Vol 3, Issue 3

Abstract

The World Wide Web is a vast and highly useful collection of information. To achieve high productivity in publishing, web pages are automatically populated using common templates with contents. Templates are considered harmful because they compromise the relevance judgement of many web information retrieval and web mining methods, such as clustering and classification, and degrade the performance and resource usage of tools that process web pages. Template detection techniques have therefore received a lot of attention as a way to improve the performance of search engines and the clustering and classification of web documents. In this paper, we present an approach to detect and extract templates from heterogeneous web documents and cluster the documents into different groups, such that the pages belonging to each group share the same structure. This saves the time needed to find the best templates from a large number of web documents, and also saves the memory required to find the best template structure.
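The grouping idea in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm, which the abstract does not specify. It assumes that a page's structure can be approximated by its set of root-to-element tag paths, that two pages belong together when the Jaccard similarity of those sets reaches a threshold, and that the paths shared by every page in a group serve as a crude template. The names `tag_paths`, `cluster_pages`, `template_of`, and the `threshold` value are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class PathCollector(HTMLParser):
    """Collects the set of root-to-element tag paths found in one page."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def tag_paths(html):
    """Return the structural signature (set of tag paths) of an HTML string."""
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return frozenset(collector.paths)


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_pages(pages, threshold=0.6):
    """Greedy one-pass clustering (an assumption, not the paper's method).

    Each page joins the first cluster whose first member is similar enough,
    otherwise it starts a new cluster.
    """
    clusters = []  # each cluster is a list of (name, paths) pairs
    for name, html in pages.items():
        paths = tag_paths(html)
        for cluster in clusters:
            if jaccard(paths, cluster[0][1]) >= threshold:
                cluster.append((name, paths))
                break
        else:
            clusters.append([(name, paths)])
    return clusters


def template_of(cluster):
    """Tag paths shared by every page in the cluster."""
    return frozenset.intersection(*(paths for _, paths in cluster))
```

Comparing each page only against the first member of a cluster keeps the sketch short; a real system would need a more robust similarity measure and would also have to separate template text from page-specific content.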

Authors and Affiliations

Vidya Kadam, Prakash. R. Devale

How To Cite

Vidya Kadam, Prakash. R. Devale (2012). A Methodology for Template Extraction from Heterogeneous Web Pages. Indian Journal of Computer Science and Engineering, 3(3), 449-452. https://europub.co.uk/articles/-A-108755