A Scalable Approach to Detect Duplicate Data Using the Iterative Parallel Sorted Neighbourhood Method
Journal Title: International Journal for Research in Applied Science and Engineering Technology (IJRASET) - Year 2016, Vol 4, Issue 11
Abstract
Detecting redundant records on a data server is an open research problem in data-intensive applications. Among traditional progressive duplicate detection algorithms, the progressive sorted neighbourhood method (PSNM), together with its scalable variant, the parallel sorted neighbourhood method, performs best on small and almost clean datasets, while progressive blocking (PB) performs best on large and very dirty datasets. Both improve the efficiency of duplicate detection even on very large datasets. In this paper, we propose the iterative progressive sorted neighbourhood method, a progressive duplicate record detection technique that detects duplicate records in any kind of dataset. In comparison to traditional duplicate detection, progressive duplicate record detection satisfies two conditions through improved early quality. Iterative variants of PSNM and PB dynamically adjust their behaviour by automatically choosing optimal parameters, e.g., window sizes, block sizes, and sorting keys, rendering their manual specification superfluous. In this way, we significantly ease the parameterization complexity of duplicate detection in general and contribute to the development of more interactive applications: we can offer fast feedback and alleviate the often difficult parameterization of the algorithms. The contributions of this work are as follows: we propose three dynamic progressive duplicate detection algorithms, PSNM, iterative parallel PSNM, and PB, which expose different strengths and outperform current approaches; we define a novel quality measure for progressive duplicate detection to objectively rank the performance of different approaches; and we evaluate the duplicate detection algorithms on several real-world datasets, testing our own and previous algorithms. The duplicate detection workflow comprises three steps: pair selection, pair-wise comparison, and clustering.
For a progressive workflow, only the first and last steps need to be modified. Experimental results show that the proposed system outperforms state-of-the-art approaches in both accuracy and efficiency.
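As background, the sorted neighbourhood idea underlying PSNM can be sketched as follows: records are sorted by a key, and candidate pairs are compared at increasing rank distance, so pairs most likely to be duplicates (adjacent in sort order) are examined first. The minimal Python sketch below illustrates only this general principle; the function names, the difflib-based similarity measure, and the 0.8 threshold are assumptions for illustration, not the paper's actual implementation:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """String similarity in [0, 1]; difflib's ratio stands in for a real matcher."""
    return SequenceMatcher(None, a, b).ratio()

def sorted_neighbourhood(records, key, max_window, threshold=0.8):
    """Sort records by key, then compare pairs at increasing rank distance,
    so likely duplicates (distance 1) are reported before unlikely ones."""
    ordered = sorted(records, key=key)
    duplicates = []
    for distance in range(1, max_window):          # progressive: closest pairs first
        for i in range(len(ordered) - distance):
            a, b = ordered[i], ordered[i + distance]
            if similarity(key(a), key(b)) >= threshold:
                duplicates.append((a, b))
    return duplicates

people = ["john smith", "jon smith", "mary jones", "marry jones", "alice wong"]
dups = sorted_neighbourhood(people, key=lambda r: r, max_window=3)
print(dups)  # the two near-identical name pairs are flagged
```

A progressive variant of this loop would emit each matching pair as soon as it is found (e.g., via a generator) rather than collecting them, which is what allows early feedback to the user.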
Authors and Affiliations
Dr. R. Priya, Ms. Jiji. R