Query Based Duplicate Data Detection on WWW

Journal Title: International Journal on Computer Science and Engineering - Year 2010, Vol 2, Issue 4

Abstract

The problem of finding relevant documents has become much more prominent due to the presence of duplicate data on the WWW. This redundancy increases the time users spend seeking the desired information within the search results, whereas most users simply want to scan through tens of result pages to find new or different results. The identification of similar or near-duplicate pairs in a large collection is a significant problem with widespread applications. A contemporary instance of the problem is the efficient identification of near-duplicate Web pages, which is challenging at web scale because of the volume of data. A mechanism is therefore needed for detecting duplicate data so that relevant search results can be provided to the user. This paper proposes an architecture whose methods run both online and offline, on the basis of favored and disfavored user queries, to detect duplicates and near duplicates.
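
As a rough illustration of what identifying near-duplicate pairs in a collection can involve, the sketch below compares documents by word shingles and Jaccard similarity. This is a generic textbook technique chosen for illustration and is not the query-based architecture proposed in the paper; the function names, the shingle size `k`, and the `threshold` value are all assumptions.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (word n-grams) in text."""
    words = text.lower().split()
    if len(words) < k:
        # Short texts yield a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicate_pairs(docs, threshold=0.5, k=3):
    """Return pairs of document ids whose shingle sets are similar.

    docs maps a document id to its text. The threshold is an
    illustrative assumption, not a value taken from the paper.
    """
    sets = {doc_id: shingles(text, k) for doc_id, text in docs.items()}
    ids = sorted(sets)
    pairs = []
    # Brute-force comparison of all pairs; fine for a small example,
    # far too slow for a web-scale collection.
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if jaccard(sets[a], sets[b]) >= threshold:
                pairs.append((a, b))
    return pairs


docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumps over the lazy cat",
    "c": "completely unrelated text about quantum cryptography protocols",
}
pairs = near_duplicate_pairs(docs)
```

The all-pairs loop grows quadratically with the number of documents, which is one reason the problem is considered hard at web scale.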

Authors and Affiliations

Ranjna Gupta , Neelam Duhan , A. K. Sharma , Neha Aggarwal

How To Cite

Ranjna Gupta, Neelam Duhan, A. K. Sharma, Neha Aggarwal (2010). Query Based Duplicate Data Detection on WWW. International Journal on Computer Science and Engineering, 2(4), 1395-1400. https://europub.co.uk/articles/-A-113538