Home » Reference Linking Methods – Part 2

Reference Linking Methods – Part 2

By John Talburt, PhD, CDMP, Director, UALR Laboratory for Advanced Research in Entity Resolution and Information Quality (ERIQ)

In the last post we discussed methods for linking references that I classified into the four categories of:

  1. Direct matching
  2. Transitive linking
  3. Linking by association
  4. Asserted Linking

In the last post the focus was on direct matching, which is what most people think of when the topic of entity resolution (ER) comes up. For some people, reference linking and reference matching mean the same thing, and I hope you will see from this series that they are different.  I use these terms independently because there at least three other methods for reference linking used in ER systems today that are not direct matching:  transitive linking, linking by association, and asserted linking.

Transitive linking, sometimes called transitive closure, is a way to link two references that do not directly match by building a chain of links through intermediate references.  If A is linked to B, and B is linked to C, then it must follow that A is linked to C.  Again the operative word here is “linked” not “matched.”  Direct matching itself is not transitive.

For example, suppose an ER process using a probabilistic direct matching scheme requires at least two of three identity attributes to agree between two references in order to link them.  Consider the case where References A and B agree on attributes 1 and 2, and References B and C agree on attributes 2 and 3.  We know that References A and C will have to agree on attribute 2, but A and C will not necessarily agree on attributes 1 and 3, and if not, there will not be a direct match between A and C.  However, we can, and in fact, must link A and C.  Merge-purge processes routinely perform transitive linking by allowing an input reference to be linked to (i.e., merged into) a set of previously linked (i.e., equivalent) references even if it only matches one of the references.  In other words, at the end of an ER process, not all equivalent references will necessarily be a direct match with one another.

The reason is that transitive linking is necessary to make ER outcomes unique and deterministic.  In the Stanford Entity Resolution Framework (SERF), ER processes that lead to a unique outcome are called “Consistent ER” and must satisfy certain conditions.  One of these conditions is the existence of a merge function that complements the match function in a way that assures transitive linking.  If links could only be created using direct matching, then we would be unable to partition the references into equivalence classes of references associated with only one entity identity.  In the previous example, Reference B would have two identities, the one associated with the identity of Reference A and one associated with the identity of Reference C.  If we are forced to choose only one, then it would be an arbitrary decision, and the process would not have a unique outcome.

On the other hand, transitive linking has its dark side.  It can create over-consolidation of identities (leading to false positives) when the direct matching rules are not carefully crafted.  Take a case where linking is by agreement of both name and address or solely based on the agreement of an identification number.  What can potentially happen is that two well-formed equivalence classes of references clustered around two different names and addresses can unexpectedly and erroneously be merged into a single identity because one record in one cluster has a single transposition in its identification number that matches the identification number in one record of the other cluster.  The point is that, with transitive linking, it only takes one link between two large clusters to cause both clusters to be joined together.

In the next post we will take up linking by association and asserted linking.

Comments are closed.