e-ISSN : 0975-3397
Print ISSN : 2229-5631

ABSTRACT

Title : Detecting Duplicates and near Duplicates Records in Large Datasets
Authors : Shailesh Singh, Syed Imtiyaz Hassan
Keywords : Big Data; Trigrams; Similarity; Levenshtein Edit Distance; Database data mining; Scholarships.
Issue Date : May 2017.
Abstract :
The rapid growth in data volumes and the need to integrate data from heterogeneous sources bring to the fore the challenge of efficiently detecting duplicate records in databases. Because the data sources are autonomous and mutually inconsistent, each may adopt its own conventions, and integrating data from different sources can introduce erroneous redundancy. To ensure high-quality data, the database must validate and filter incoming data from external sources. In this regard, data normalization has become a necessity for ensuring the quality of the data stored in these databases. The process of identifying record pairs that represent the same entity is commonly known as duplicate record detection, and it is one of the most important tasks in data cleansing. The proposed work suggests an approach to improve the accuracy of duplicate record detection which, when used in combination with two other concepts, text similarity and edit distance, leads to well-filtered data. The concepts were trialled on Scholarship Portal data developed for various organizations. Because the portal applies various kinds of reservation/quota mechanisms, there was a dire need to find and identify such records to the greatest possible extent while ensuring that genuine students are not debarred from receiving scholarships.
Page(s) : 178-185
ISSN : 0975–3397
Source : Vol. 9, Issue.05
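
The abstract names trigram-based text similarity and Levenshtein edit distance but does not say how the two are combined. The following is only a minimal Python sketch of one plausible combination, not the authors' implementation. The function names, the Jaccard measure over character trigrams, the two threshold values, and the sample names are all assumptions made for illustration.

```python
def trigrams(text):
    """Return the set of character trigrams of a normalized, padded string."""
    # Assumed normalization: lowercase and collapse runs of whitespace.
    normalized = " ".join(text.lower().split())
    padded = f"  {normalized} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a, b):
    """Jaccard similarity of the two trigram sets, in the range 0.0 to 1.0."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def levenshtein(a, b):
    """Classic dynamic-programming edit distance, keeping two rows in memory."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def is_probable_duplicate(a, b, min_similarity=0.5, max_distance=3):
    """Flag a pair as a likely duplicate (thresholds are hypothetical).

    The trigram check acts as a filter; edit distance is computed only
    for pairs that pass it.
    """
    if trigram_similarity(a, b) < min_similarity:
        return False
    norm_a = " ".join(a.lower().split())
    norm_b = " ".join(b.lower().split())
    return levenshtein(norm_a, norm_b) <= max_distance


if __name__ == "__main__":
    # Made-up sample values, not data from the paper.
    print(is_probable_duplicate("Rahul Kumar", "rahul  kumar"))
    print(is_probable_duplicate("Rahul Kumar", "Priya Sharma"))
```

In this sketch the trigram comparison runs first and edit distance is computed only for pairs that pass it. On a real dataset both thresholds would need tuning against records already known to be duplicates.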

All Rights Reserved © 2009-2024 Engg Journals Publications