A Scalable approach to detect the duplicate data using Iterative parallel sorted neighbourhood method

Abstract

Detecting redundant data on data servers is an open research problem in data-intensive applications. Traditional progressive duplicate detection algorithms include the progressive sorted neighbourhood method (PSNM), together with its scalable variant, the parallel sorted neighbourhood method, which performs best on small and almost clean datasets, and progressive blocking (PB), which performs best on large and very dirty datasets. Both enhance the efficiency of duplicate detection even on very large datasets. In this paper, we propose the Iterative Progressive Sorted Neighbourhood method, a progressive duplicate record detection technique intended to detect duplicate records in any kind of dataset. Compared with traditional duplicate detection, progressive duplicate record detection satisfies two conditions through improved early quality. The iterative algorithms built on PSNM and PB dynamically adjust their behaviour by automatically choosing optimal parameters, e.g., window sizes, block sizes, and sorting keys, rendering their manual specification superfluous. In this way, we significantly ease the parameterization complexity of duplicate detection in general and contribute to the development of more user-interactive applications: we can offer fast feedback and alleviate the often difficult parameterization of the algorithms. The contributions of this work are as follows. We propose three dynamic progressive duplicate detection algorithms, PSNM, iterative parallel PSNM, and PB, which expose different strengths and outperform current approaches. We define a novel quality measure for progressive duplicate detection to objectively rank the performance of different approaches. The duplicate detection algorithms are evaluated on several real-world datasets, testing both our own and previous algorithms. The duplicate detection workflow comprises three steps: pair selection, pair-wise comparison, and clustering. For a progressive workflow, only the first and last steps need to be modified. The experimental results show that the proposed system outperforms state-of-the-art approaches in accuracy and efficiency.
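
The three-step workflow named in the abstract (pair selection, pair-wise comparison, clustering) can be illustrated with a minimal sketch of a progressive sorted neighbourhood pass. This is not the authors' iterative parallel algorithm: the function names `progressive_snm` and `cluster`, the fixed `window`, the `threshold` of 0.8, and the use of Python's `difflib.SequenceMatcher` as the similarity measure are all assumptions chosen for illustration. The sketch only shows the progressive idea that records closest in sort order are compared first, so likely duplicates are reported early.

```python
from difflib import SequenceMatcher


def progressive_snm(records, key, window, threshold=0.8):
    """Yield index pairs of likely duplicates, nearest sort-neighbours first.

    Hypothetical helper: `key`, `window` and `threshold` are illustrative
    parameters, not values taken from the paper.
    """
    # Pair selection: sort record indices by the sorting key.
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    # Progressive part: rank distance 1 first, then 2, ... up to window - 1.
    for dist in range(1, window):
        for pos in range(len(order) - dist):
            a, b = order[pos], order[pos + dist]
            # Pair-wise comparison with a generic string similarity (assumed).
            sim = SequenceMatcher(None, records[a], records[b]).ratio()
            if sim >= threshold:
                yield (min(a, b), max(a, b))


def cluster(n, pairs):
    """Clustering: transitive closure of duplicate pairs via union-find."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    # Keep only groups that actually contain duplicates.
    return sorted(g for g in groups.values() if len(g) > 1)


# Made-up sample records, for illustration only.
records = ["john smith", "jon smith", "mary jones",
           "john smyth", "marie jones", "alan turing"]
pairs = list(progressive_snm(records, key=lambda r: r, window=3))
clusters = cluster(len(records), pairs)
print(pairs)
print(clusters)
```

Because `progressive_snm` is a generator, a caller can stop consuming pairs at any time and still keep the duplicates found so far, which is the behaviour a progressive workflow relies on.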

Authors and Affiliations

Dr. R. Priya, Ms. Jiji. R

How To Cite

Dr. R. Priya, Ms. Jiji. R (2016). A Scalable approach to detect the duplicate data using Iterative parallel sorted neighbourhood method. International Journal for Research in Applied Science and Engineering Technology (IJRASET), 4(11), -. https://europub.co.uk/articles/-A-22778