Share Email Print

Proceedings Paper

Customer and household matching: resolving entity identity in data warehouses
Author(s): Donald J. Berndt; Ronald K. Satterfield
Format Member Price Non-Member Price
PDF $14.40 $18.00
cover GOOD NEWS! Your organization subscribes to the SPIE Digital Library. You may be able to download this paper for free. Check Access

Paper Abstract

The data preparation and cleansing tasks necessary to ensure high quality data are among the most difficult challenges faced in data warehousing and data mining projects. The extraction of source data, transformation into new forms, and loading into a data warehouse environment are all time consuming tasks that can be supported by methodologies and tools. This paper focuses on the problem of record linkage or entity matching, tasks that can be very important in providing high quality data. Merging two or more large databases into a single integrated system is a difficult problem in many industries, especially in the wake of acquisitions. For example, managing customer lists can be challenging when duplicate entries, data entry problems, and changing information conspire to make data quality an elusive target. Common tasks with regard to customer lists include customer matching to reduce duplicate entries and household matching to group customers. These often O(n2) problems can consume significant resources, both in computing infrastructure and human oversight, and the goal of high accuracy in the final integrated database can be difficult to assure. This paper distinguishes between attribute corruption and entity corruption, discussing the various impacts on quality. A metajoin operator is proposed and used to organize past and current entity matching techniques. Finally, a logistic regression approach to implementing the metajoin operator is discussed and illustrated with an example. The metajoin can be used to determine whether two records match, don't match, or require further evaluation by human experts. Properly implemented, the metajoin operator could allow the integration of individual databases with greater accuracy and lower cost.

Paper Details

Date Published: 6 April 2000
PDF: 8 pages
Proc. SPIE 4057, Data Mining and Knowledge Discovery: Theory, Tools, and Technology II, (6 April 2000); doi: 10.1117/12.381731
Show Author Affiliations
Donald J. Berndt, Univ. of South Florida (United States)
Ronald K. Satterfield, Univ. of South Florida (United States)

Published in SPIE Proceedings Vol. 4057:
Data Mining and Knowledge Discovery: Theory, Tools, and Technology II
Belur V. Dasarathy, Editor(s)

© SPIE. Terms of Use
Back to Top