Record linkage
Record linkage (RL) refers to the task of finding entries that refer to the same entity across different data sources (e.g., files, books, websites, or databases). Record linkage is an appropriate technique when you have to join data sets that do not already share a common identifier (such as a database key, URI, or national identification number), as may be the case due to differences in record shape, storage location, or curator style and preference. A data set that has undergone RL-oriented reconciliation may be referred to as cross-linked (which is not necessarily the same as the data set containing Linked Data).
Record linkage is a useful tool when performing data mining tasks where the data originate from different sources or different organizations. Most commonly, performing RL on data sets involves joining records of persons based on name, because no national identification number or similar identifier is recorded in the data. In mathematical graph theory, record linkage can be seen as a technique for resolving bipartite graphs.
Naming conventions
Record linkage is the term used by statisticians, epidemiologists, and historians, among others. Commercial mail and database applications refer to it as "merge/purge processing" or "list washing". Computer scientists often refer to it as "data matching" or as the "object identity problem". Other names used to describe the same concept include "entity resolution", "identity resolution", "entity disambiguation", "duplicate detection", "record matching", "instance identification", "deduplication", "coreference resolution", "reference reconciliation", and "database hardening". This profusion of terminology has led to few cross-references between these research communities.[1][2]
Methods
There are several approaches to record linkage. The most straightforward is a rules-based approach, in which reasonable rules are developed and then refined as common exceptions are found. The advantage of this approach is that good accuracy can be achieved without needing a lot of labeled data to train or test the rules on. The disadvantage is that to obtain very high accuracy, more and more exceptions and special cases must be handled, and eventually the list of rules becomes too complex to build by hand.
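As a minimal sketch of the rules-based approach, the following Python fragment hand-codes two matching rules over person records; the field names and the specific rules are illustrative assumptions rather than any standard rule set:

```python
def norm(s: str) -> str:
    """Crude normalization: lower-case and drop non-alphanumeric characters."""
    return "".join(ch for ch in s.lower() if ch.isalnum())

def rules_match(a: dict, b: dict) -> bool:
    """Apply hand-written rules; refine by adding rules as exceptions are found."""
    # Rule 1: identical normalized names and identical date of birth.
    if (norm(a["given"]) == norm(b["given"])
            and norm(a["surname"]) == norm(b["surname"])
            and a["dob"] == b["dob"]):
        return True
    # Rule 2 (added after reviewing exceptions): the given names share the same
    # first letter (a crude check for initials/abbreviations such as "Wm." vs
    # "William"), while surname, date of birth, and ZIP code all agree.
    if (norm(a["surname"]) == norm(b["surname"])
            and a["dob"] == b["dob"] and a["zip"] == b["zip"]
            and (norm(a["given"]).startswith(norm(b["given"])[:1])
                 or norm(b["given"]).startswith(norm(a["given"])[:1]))):
        return True
    return False

# Example: these two records match under Rule 2.
print(rules_match(
    {"given": "Wm.", "surname": "Smith", "dob": "1973-01-02", "zip": "94704"},
    {"given": "William", "surname": "Smith", "dob": "1973-01-02", "zip": "94704"}))
```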
A very popular approach has been Probabilistic Record Linkage (PRL). In this approach, a large set of record pairs is human-labeled as matching or differing pairs. Statistics are then calculated from the agreement of fields on matching and differing records to determine a weight for each field. During execution, the agreement or disagreement weight for each field is added to get a combined score that represents the probability that the records refer to the same entity. Often there is one threshold above which a pair is considered a match, and another threshold below which it is considered not to be a match. Between the two thresholds a pair is considered to be "possibly a match" and dealt with accordingly (e.g., human-reviewed, linked, or not linked, depending on the application).
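The following is a hedged sketch of this scheme in Python: per-field agreement and disagreement weights are estimated as log-likelihood ratios from labeled pairs and then summed into a pair score that is compared against two thresholds. The field names, the smoothing, and the threshold values are illustrative assumptions:

```python
import math

FIELDS = ["surname", "given", "dob", "zip"]   # illustrative field names

def estimate_weights(labeled_pairs):
    """labeled_pairs: iterable of (rec_a, rec_b, is_match).
    Returns {field: (agreement_weight, disagreement_weight)} as log2 likelihood ratios."""
    # Start counts at 1/2 (add-one style smoothing) so no probability is exactly 0 or 1.
    counts = {f: {"m_agree": 1, "m_total": 2, "u_agree": 1, "u_total": 2} for f in FIELDS}
    for a, b, is_match in labeled_pairs:
        for f in FIELDS:
            key = "m" if is_match else "u"
            counts[f][key + "_total"] += 1
            counts[f][key + "_agree"] += int(a[f] == b[f])
    weights = {}
    for f, c in counts.items():
        m = c["m_agree"] / c["m_total"]   # P(field agrees | records match)
        u = c["u_agree"] / c["u_total"]   # P(field agrees | records differ)
        weights[f] = (math.log2(m / u), math.log2((1 - m) / (1 - u)))
    return weights

def score(a, b, weights):
    """Sum the agreement or disagreement weight for each field."""
    return sum(w_agree if a[f] == b[f] else w_disagree
               for f, (w_agree, w_disagree) in weights.items())

def classify(pair_score, upper=6.0, lower=0.0):
    """Two thresholds: match, non-match, or 'possibly a match' for review."""
    if pair_score >= upper:
        return "match"
    if pair_score <= lower:
        return "non-match"
    return "possibly a match"
```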
In recent years, a variety of machine learning techniques have been used in record linkage. It has been recognized that Probabilistic Record Linkage is equivalent to the "Naive Bayes" algorithm in the field of machine learning, and it suffers from the same assumption that its features are independent, which is typically not true. Higher accuracy can often be achieved by using various other machine learning techniques, including a single-layer perceptron.
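As an illustration (assuming scikit-learn is available; the field names are again illustrative), each candidate pair can be reduced to a vector of per-field agreement indicators and fed to a single-layer perceptron instead of using fixed PRL weights:

```python
from sklearn.linear_model import Perceptron

FIELDS = ["surname", "given", "dob", "zip"]

def comparison_vector(a, b):
    """1 where a field agrees between the two records, 0 where it disagrees."""
    return [int(a[f] == b[f]) for f in FIELDS]

def train_classifier(labeled_pairs):
    """labeled_pairs: iterable of (rec_a, rec_b, is_match)."""
    X = [comparison_vector(a, b) for a, b, _ in labeled_pairs]
    y = [int(is_match) for _, _, is_match in labeled_pairs]
    return Perceptron(max_iter=1000).fit(X, y)

# clf = train_classifier(labeled_pairs)
# clf.predict([comparison_vector(rec_a, rec_b)])  # 1 = predicted match
```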
Regardless of whether rule-based, PRL, or machine learning techniques are used, normalization of the data is very important. Names are often spelled differently in different sources (e.g., "Wm. Smith", "William Smith", "William J. Smith", "Bill Smith", etc.); dates can be recorded various ways ("1/2/73", "1973.1.2", "Jan 2, 1973"); and places can be recorded differently as well ("Berkeley, CA", "Berkeley, Alameda, California, USA", etc.). By normalizing these into a common format and using comparison techniques that handle additional variation, much more consistency can be achieved, resulting in higher accuracy in any record linkage technique.
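A minimal sketch of such normalization in Python follows; the nickname table and the handful of date formats handled here are illustrative assumptions, and real systems use much larger dictionaries and locale-aware parsing:

```python
import re
from datetime import datetime

NICKNAMES = {"wm": "william", "bill": "william", "bob": "robert"}  # tiny illustrative table

def normalize_name(name: str) -> str:
    """Lower-case, strip punctuation, and expand known nicknames/abbreviations."""
    tokens = re.sub(r"[^a-z ]", "", name.lower()).split()
    return " ".join(NICKNAMES.get(t, t) for t in tokens)

def normalize_date(text: str) -> str:
    """Try several common layouts and emit ISO 8601 (YYYY-MM-DD)."""
    for fmt in ("%m/%d/%y", "%Y.%m.%d", "%b %d, %Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(text, fmt).strftime("%Y-%m-%d")
        except ValueError:
            pass
    return text  # leave unparseable values untouched

print(normalize_name("Wm. Smith"))    # william smith
print(normalize_date("1/2/73"))       # 1973-01-02
print(normalize_date("Jan 2, 1973"))  # 1973-01-02
```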
Record linkage typically involves two main steps: blocking and scoring. The blocking phase attempts to find groups of records that are similar in some way. For example, a query might select all individuals with the same surname and ZIP code. Of course, many of these individuals would not be the same real person (false positives), and some records for the same individual might have different values for the surname (due to typos, married name, etc.) or ZIP code (e.g., because they moved). Thus robust systems often use multiple blocking passes to group data in various ways in order to come up with groups of records that should be compared to each other.
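A sketch of multi-pass blocking in Python (the record fields and blocking keys are illustrative assumptions): each pass groups records by a different key, so a typo in one field does not prevent a true match from being compared in another pass:

```python
from collections import defaultdict
from itertools import combinations

def block(records, key_fn):
    """Group records by a blocking key; yield candidate id pairs within each group."""
    groups = defaultdict(list)
    for rec in records:
        groups[key_fn(rec)].append(rec["id"])
    for members in groups.values():
        yield from combinations(sorted(members), 2)

def candidate_pairs(records):
    """Union of candidate pairs from several independent blocking passes."""
    passes = [
        lambda r: (r["surname"].lower(), r["zip"]),       # pass 1: surname + ZIP code
        lambda r: (r["surname"].lower()[:4], r["dob"]),   # pass 2: surname prefix + date of birth
    ]
    pairs = set()
    for key_fn in passes:
        pairs.update(block(records, key_fn))
    return pairs
```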
Once a set of potential matches is identified, a scoring algorithm is used to calculate a score indicating how likely the two records are to really represent the same real entity. By using labeled test data, score thresholds can be selected that yield a desired trade-off between recall and precision. (Recall is the percent of true matching pairs that get a score above a given threshold, and precision is the percent of pairs with a score above a given threshold that are really a match).
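For example, given scored and labeled pairs, recall and precision at a candidate threshold can be computed as follows (a sketch; the threshold values in the comment are illustrative):

```python
def recall_precision(scored_pairs, threshold):
    """scored_pairs: iterable of (score, is_true_match). Returns (recall, precision)."""
    tp = fp = fn = 0
    for pair_score, is_true_match in scored_pairs:
        predicted_match = pair_score >= threshold
        if predicted_match and is_true_match:
            tp += 1
        elif predicted_match and not is_true_match:
            fp += 1
        elif is_true_match:
            fn += 1
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision

# Sweep thresholds over labeled data to pick the recall/precision trade-off:
# for t in (2.0, 4.0, 6.0): print(t, recall_precision(scored_pairs, t))
```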
History of RL theory
The initial idea goes back to Halbert L. Dunn in 1946.[3] In the 1950s, Howard Borden Newcombe laid the probabilistic foundations of modern record linkage theory.
In 1969, Ivan Fellegi and Alan Sunter formalized these ideas and proved that the probabilistic decision rule they described was optimal when the comparison attributes are conditionally independent. Their pioneering work "A Theory for Record Linkage"[4] remains the mathematical foundation for many record linkage applications to this day.
Since the late 1990s, various machine learning techniques have been developed that can, under favorable conditions, be used to estimate the conditional probabilities required by the Fellegi-Sunter (FS) theory. Several researchers have reported that the conditional independence assumption of the FS algorithm is often violated in practice; however, published efforts to explicitly model the conditional dependencies among the comparison attributes have not resulted in an improvement in record linkage quality.[citation needed]
Mathematical model
In an application with two files, A and B, denote the rows (records) by $\alpha(a)$ in file A and $\beta(b)$ in file B. Assign $K$ characteristics to each record. The set of records that represent identical entities is defined by

$M = \{ (a,b) : a = b,\ a \in A,\ b \in B \}$

and the complement of set $M$, namely set $U$ representing different entities, is defined as

$U = \{ (a,b) : a \neq b,\ a \in A,\ b \in B \}$.

A vector $\gamma$ is defined that contains the coded agreements and disagreements on each characteristic:

$\gamma[\alpha(a), \beta(b)] = \{ \gamma^{1}[\alpha(a), \beta(b)], \ldots, \gamma^{K}[\alpha(a), \beta(b)] \}$

where $K$ is a subscript for the characteristics (sex, age, marital status, etc.) in the files. The conditional probabilities of observing a specific vector $\gamma$ given $(a,b) \in M$ and $(a,b) \in U$ are defined as

$m(\gamma) = P\{ \gamma[\alpha(a), \beta(b)] \mid (a,b) \in M \} = \sum_{(a,b) \in M} P\{ \gamma[\alpha(a), \beta(b)] \} \cdot P[(a,b) \mid M]$

and

$u(\gamma) = P\{ \gamma[\alpha(a), \beta(b)] \mid (a,b) \in U \} = \sum_{(a,b) \in U} P\{ \gamma[\alpha(a), \beta(b)] \} \cdot P[(a,b) \mid U]$,

respectively.
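These two probabilities drive the two-threshold decision rule described under Methods. In the standard Fellegi-Sunter usage (a brief sketch using the notation above, not spelled out further in this section), the weight attached to an observed comparison vector is the log-likelihood ratio

$w(\gamma) = \log_2 \dfrac{m(\gamma)}{u(\gamma)}$,

and a pair is declared a link when $w(\gamma)$ exceeds the upper threshold, a non-link when it falls below the lower threshold, and is set aside for clerical review otherwise.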
Applications
Applications in historical research
Record linkage is important to social history research since most data sets, such as census records and parish registers, were recorded long before the invention of national identification numbers. When old sources are digitized, linking of data sets is a prerequisite for longitudinal study. This process is often further complicated by the lack of standard spelling of names, family names that change according to place of dwelling, changing administrative boundaries, and problems of checking the data against other sources. Record linkage was among the most prominent themes in the history and computing field in the 1980s, but has since received less research attention.
Applications in medical practice and research
Record linkage is an important tool in creating data required for examining the health of the public and of the health care system itself. It can be used to improve data holdings, data collection, quality assessment, and the dissemination of information. Data sources can be examined to eliminate duplicate records, to identify under-reporting and missing cases (e.g., census population counts), to create person-oriented health statistics, and to generate disease registries and health surveillance systems. Some cancer registries link various data sources (e.g., hospital admissions, pathology and clinical reports, and death registrations) to generate their registries. Record linkage is also used to create health indicators. For example, fetal and infant mortality is a general indicator of a country's socioeconomic development, public health, and maternal and child services. If infant death records are matched to birth records, it is possible to use birth variables, such as birth weight and gestational age, along with mortality data, such as cause of death, in analyzing the data. Linkages can help in follow-up studies of cohorts or other groups to determine factors such as vital status, residential status, or health outcomes. Tracing is often needed for follow-up of industrial cohorts, clinical trials, and longitudinal surveys to obtain the cause of death and/or cancer. An example of a successful and long-standing record linkage system allowing for population-based medical research is the Rochester Epidemiology Project based in Rochester, Minnesota.
See also
- Identity resolution
- Linked Data
- Entity-attribute-value model
- Open Data
- Delta encoding
- Data deduplication
- Capacity optimization
- Single-instance storage
- Content-addressable storage
References
- ^ Christen, P & T: Febrl - Freely extensible biomedical record linkage (Manual, release 0.3), p. 9.
- ^ Elmagarmid, Ahmed; Ipeirotis, Panagiotis G.; Verykios, Vassilios (January 2007). "Duplicate Record Detection: A Survey" (PDF). IEEE Transactions on Knowledge and Data Engineering 19 (1): 1–16. doi:10.1109/TKDE.2007.9. http://www.cs.purdue.edu/homes/ake/pub/TKDE-0240-0605-1.pdf. Retrieved 2009-03-30.
- ^ Dunn, Halbert L. (December 1946). "Record Linkage" (PDF). American Journal of Public Health 36 (12): 1412–1416. doi:10.2105/AJPH.36.12.1412. http://www.ajph.org/cgi/reprint/36/12/1412. Retrieved 2008-05-31.
- ^ Fellegi, Ivan; Sunter, Alan (December 1969). "A Theory for Record Linkage". Journal of the American Statistical Association 64 (328): 1183–1210. doi:10.2307/2286061. JSTOR 2286061.
External links
- Discussion of record linkage strategy and tactics
- Statistics New Zealand Data Integration Manual
- Data Linkage Project at Penn State, USA
- Match Engine Theory and Configuration
Software implementations
- Duke - small and fast Java deduplication and record linkage engine (open source)
- Data Deduplication and Integration - developed by the CS department at University of Maryland
- DataCleaner - by eobjects.org
- Daurum - Deduplication and Integration tool & services - by Sparsity Technologies
- Febrl - Open source software application for RL - by the Australian National University.
- FRIL - Fine-Grained Record Integration and Linkage Tool - developed at Emory University and the US Centers for Disease Control and Prevention
- Link Plus - probabilistic record linkage program - developed at the US Centers for Disease Control and Prevention
- MTB - Merge ToolBox - by the University of Duisburg-Essen
- Open Source ChoiceMaker Technology - released by ChoiceMaker Technologies
- OpenDQ - massively scalable record linkage / data matching - by Infosolve Technologies
- Silk - open source RDF-based record linkage framework from Freie Universität Berlin
- SimMetrics - open source library of string similarity techniques - by Sam Chapman at the University of Sheffield
- DataQualityTools - by DataQualityApps
- The Link King - SAS System application - developed at Washington State's Division of Alcohol and Substance Abuse (DASA) by Kevin M. Campbell
- MatchMetrix EMPI - Enterprise Master Patient Index - developed by NextGate
- Data Match - DataMatch 2011 for record linkage - developed by Dataladder
- Oyster Entity Resolution - developed at the University of Arkansas at Little Rock by John Talburt, Ph.D.