Reference Linking Methods – Part 1
By John Talburt, PhD, CDMP, Director, UALR Laboratory for Advanced Research in Entity Resolution and Information Quality (ERIQ)
In the last few posts, we reviewed the basic architectures used to implement entity resolution (ER) systems. Although this gives us the big picture at the systems level, ER really takes place at the reference (record) level where the system must ultimately decide whether two references are for the same or for different real-world objects, i.e. to link or not to link. In this series I’ll discuss some of the most common methods for making these linking decisions.
I classify these methods into four categories:
- Direct matching
- Transitive matching
- Association analysis
- Assertion
In this post, we’ll consider the first and most familiar category, direct matching. Here we decide to link based on the degree of similarity between the values of corresponding identity attributes. For example, if the identity attributes are first name, last name, and date-of-birth in a certain context, then in direct matching we would compare the values for these attributes between two references. In its simplest form, deterministic matching, the decision is yes if and only if all three values match exactly, i.e. two references should be linked (are equivalent) only when the first names are the same, the last names are the same, and the dates-of-birth are the same. Otherwise they are judged to be references to different persons.
Deterministic matching is very easy to implement, but it is not usually very effective. Its lack of effectiveness stems from the pervasiveness of information quality (IQ) issues. If the reference values are inaccurate, inconsistent, or missing, then direct matching creates too many false negatives, i.e. references to the same entity that really should be linked but don’t satisfy the deterministic matching criteria. When names are misspelled, nicknames are used, values are missing, or date formats are inconsistent, the direct match between references can fail even though the references were intended to reference the same person. It should be clear that IQ is closely related to ER.
To address these issues, most systems rely upon some of level of probabilistic matching. In this form of matching, we link records even if some attributes’ values are different as long as the values of certain other attributes are the same. Using the previous example, we might decide that the context in which we are working only requires an exact match on last name and date-of-birth in order to link the records. Generally this incurs a certain amount of risk of creating a false positive link, i.e. references to the different entities that match on certain attributes but should not be linked. This risk is expressed as the probability that this might happen, hence the term “probabilistic” matching.
Probabilistic matching has been the subject of extensive research. Most modern practice in probabilistic matching is based in the work of two Canadian statisticians, I.P. Fellegi and A. B. Sunter, who published A Theory for Record Linkage in 1969. Their model, the Fellegi-Sunter Model, provides a systematic way of creating a probabilistic matching scheme that is optimal with respect to a given level (tolerance) of false positive and false negative risk.
Probabilistic matching is binary in the sense that attribute values either match or don’t match. In our example, we could represent the case where we only require the last name and date-of-birth to match by the binary string “011” where the zero in the first position means that the first name doesn’t match, while the ones in the second and third positions mean that the last name and date-of-birth must match exactly. When there are three attributes there would be 8 possible binary combinations to consider. The problem with the binary model is that it doesn’t account for similarity. Intuitively we would feel much more confident that the references “John, Doe, 1989-08-13” and “Jon, Doe, 1989-08-13” should be linked than we would the references “John, Doe, 1989-08-13” and “Mary, Doe, 1989-08-13”.
Therefore, a common extension of probabilistic matching is to allow for intermediate levels of similarity between values, i.e. accounting for the fact that attributes values may not be the same but are similar. For example, if the name values differ only by one character, we could say that the names are similar, or if the dates-of-birth differ by less than 10 days, we could say the dates are similar. We have now moved from a binary model to a tertiary (base 3) model so that in our previous example, the first pair of references would fit the pattern “122” and the second pair the pattern “022” where 1 represents similar values and 2 epresents the same value. The downside is that there are now more patterns to analyze and evaluate. For 3 attributes there are now 27 cases to consider instead of 8.
Probabilistic matching that allows for intermediate levels of similarity is sometimes called fuzzy matching. Although the term fuzzy implies that there is some leeway, in practice we must always set a discrete threshold that limits the amount dissimilarity we are willing to tolerate. Fuzzy matching also introduces a plethora of schemes for measuring similarity between two values. In the cases where the values are character strings, such as for names, the schemes are called approximate string matching (ASM) algorithms.
One of the most often used is the Levenshtein Edit Distance that counts the minimum number of character transformations (usually insertion, deletion, and substitution) that will transform one string into another. For example the edit distance between “Smythe” and “Smith” is 2 because in the first string you can substitute “i” for “y” and delete the “e” to create the second string in 2 transformations.
Typically ASM outputs are normalized to a scale from 0 to 1. To normalize edit distance, divide the edit distance by the number of characters is the longest string. In this example, the normalized edit distance would be 2/6 or 0.33. Many other ASM algorithms have been developed such as Jaro, Jaro-Winkler, q-grams, Soundex, Smith-Waterman, and Ukkonen, just to mention a few.
In the next post we will discuss transitive matching.




