Reference Linking Methods – Part 3
By John Talburt, PhD, CDMP, Director, UALR Laboratory for Advanced Research in Entity Resolution and Information Quality (ERIQ)
This is the third in a series of four posts that discuss four methods for linking references. These methods are:
- Direct matching
- Transitive linking
- Linking by association
- Asserted Linking
In the last post I discussed transitive linking, and why it is essential for producing a unique and deterministic outcome of an ER process. In this post I will discuss the third method, linking by association.
Recall that in transitive linking, each intermediate reference must be equivalent to the next reference in the chain. However, it is possible to discover links by exploring associations among entity references that don’t rise to the level of equivalence. These explorations are often done using techniques borrowed from graph theory and network analysis.
A simple example shows how association analysis might work. Consider the following four references:
- Ref#1, Mary Smith 123 Oak St
- Ref#2, Mary Smith 456 Elm St
- Ref#3, John Smith 123 Oak St
- Ref#4, John Smith 456 Elm St
Note that none of the six possible pairings of these four references agree on both name and address. Therefore under typical identity rules, none of six pairs would be considered equivalent. The diagram below is a graphical representation of these four references and their relationships.
However, given that this association is unlikely to occur by chance, it is reasonable to infer that these are the same John Smith and Mary Smith at both addresses and that their references should be linked (are equivalent). The decision to link is even more compelling if supported by other evidence, such as uncommon names, or the lack of conflicting evidence, such as different dates-of-birth.
Unlike direct matching and transitive linking where decisions are made pair-by-pair, association analysis allows multiple relationships to be considered at the same time. As in this example, the decision to link the pairs (R1, R2) and (R3, R4) is justified only when the entire configuration of relationships is considered as a whole, not by examining each pair independently. The application of graph theory and network analysis of entity relationships is a rapidly growing area of ER research.
Typically these analyses are intended to produce an algorithm for decomposing (partitioning) a large graph reference into smaller sub-graphs where each sub-graph represents a group (cluster) of equivalent references. These algorithms assess the relative cohesiveness of the sub-graphs in terms of the number and types of connections that each node (reference) has with its adjacent nodes. For example the SCAN (Structural Clustering Algorithm for Networks) developed by Xu, Yuruk, Feng, and Schweiger that uses a measure called graph modularity (http://www.citeulike.org/user/socwangnan/article/1918601).
In the next post we will discuss asserted linking.
[Editor's Note: You can hear Dr. Talburt discuss entity resolution in a recently posted video on Infoglide Software's corporate web site.]





