A Scalable approach to detect the duplicate data using Iterative parallel sorted neighbourhood method

Abstract

Detecting redundant data on data servers is an open research problem in data-intensive applications. Traditional progressive duplicate detection algorithms include the progressive sorted neighbourhood method (PSNM), together with its scalable variant, the parallel sorted neighbourhood method, which performs best on small and almost clean datasets, and progressive blocking (PB), which performs best on large and very dirty datasets. Both enhance the efficiency of duplicate detection even on very large datasets. In this paper, we propose the Iterative Progressive Sorted Neighbourhood method, a progressive duplicate record detection approach intended to detect duplicate records in any kind of dataset. In comparison to traditional duplicate detection, progressive duplicate record detection satisfies two conditions through improved early quality. The iterative algorithms built on PSNM and PB dynamically adjust their behaviour by automatically choosing optimal parameters, e.g., window sizes, block sizes, and sorting keys, rendering their manual specification superfluous. In this way, we significantly ease the parameterization complexity of duplicate detection in general and contribute to the development of more user-interactive applications: we can offer fast feedback and alleviate the often difficult parameterization of the algorithms. The contributions of this work are as follows. We propose three dynamic progressive duplicate detection algorithms, PSNM, Iterative parallel PSNM, and PB, which expose different strengths and outperform current approaches. We define a novel quality measure for progressive duplicate detection to objectively rank the performance of different approaches. We evaluate our own and previous algorithms on several real-world datasets. The duplicate detection workflow comprises three steps: pair selection, pair-wise comparison, and clustering. For a progressive workflow, only the first and last steps need to be modified. The experimental results show that the proposed system outperforms state-of-the-art approaches in accuracy and efficiency.
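
The three-step workflow named in the abstract (pair selection, pair-wise comparison, clustering) can be illustrated with a minimal Python sketch of a progressive sorted neighbourhood pass. This is an illustration under stated assumptions, not the authors' implementation: the function names, the `difflib`-based similarity test, the 0.8 threshold, and the fixed `max_window` are all hypothetical, and the sketch omits the iterative parameter selection and the parallel execution that the paper proposes. The progressive element is the pair order: records are sorted by a key, and pairs at rank distance 1 are compared before pairs at rank distance 2, and so on, so the most likely duplicates are reported first.

```python
from difflib import SequenceMatcher


def progressive_pairs(records, key, max_window):
    """Pair selection: yield index pairs in order of increasing rank distance."""
    # Sort record indices by the (assumed) sorting key.
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    # Closest neighbours in the sorted order are emitted first.
    for dist in range(1, max_window):
        for pos in range(len(order) - dist):
            yield order[pos], order[pos + dist]


def similar(a, b, threshold=0.8):
    """Pair-wise comparison: a stand-in string-similarity test (assumption)."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


def cluster(n, duplicate_pairs):
    """Clustering: transitive closure of duplicate pairs via union-find."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in duplicate_pairs:
        parent[find(a)] = find(b)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    # Keep only groups that actually contain duplicates.
    return [g for g in groups.values() if len(g) > 1]


def detect_duplicates(records, key=str.lower, max_window=3, threshold=0.8):
    """Run the three steps; `found` lists duplicate pairs in discovery order."""
    found = []
    for i, j in progressive_pairs(records, key, max_window):
        if similar(key(records[i]), key(records[j]), threshold):
            found.append((i, j))
    return found, cluster(len(records), found)


if __name__ == "__main__":
    names = ["John Smith", "Jon Smith", "Alice Brown", "alice brown", "Zed Young"]
    pairs, groups = detect_duplicates(names)
    print(pairs)
    print(groups)
```

Because `progressive_pairs` is a generator, a caller can stop consuming it at any time and still keep the duplicates found so far, which is the "early quality" property that progressive detection aims for. Only pair selection and clustering are specific to the progressive setting here; the comparison function is unchanged, consistent with the abstract's remark that only the first and last steps need to be modified.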

Authors and Affiliations

Dr. R. Priya, Ms. Jiji. R

How To Cite

Dr. R. Priya, Ms. Jiji. R (2016). A Scalable approach to detect the duplicate data using Iterative parallel sorted neighbourhood method. International Journal for Research in Applied Science and Engineering Technology (IJRASET), 4(11), -. https://europub.co.uk/articles/-A-22778