Friday, March 23, 2012

Fuzzy lookup transform row scores 'inconsistent' with individual column scores

I am trying to interpret some of the results I get when matching similar records with a Fuzzy Lookup transform, but it is not clear how the overall row similarity score is calculated. In particular, a row with lower individual column similarity scores will sometimes achieve higher similarity and confidence scores than a matching row with higher individual column scores.

The transform is configured with 6 text fields set to fuzzy mapping and a minimum similarity of 0, and 3 additional numeric fields with an exact mapping. It is set to return a maximum of 2 matches per lookup and to do an exhaustive search of the reference table.

For example, of the following pair of matching records, Match 1 is picked over Match 2 even though its individual column scores are lower.

                        Match 1          Match 2
                        -------          -------
_similarity_author      1.0              1.0
_similarity_title       0.85344648       1.0
_similarity_headline    0.0125           0.0125
_similarity_summary     0.0125           0.0125
_similarity_picture     1.0              1.0
_similarity_caption     1.0              1.0

_similarity             7.8429267E-2     7.3196657E-2
_confidence             0.55728668       0.44271332
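The mismatch in the table above can be checked numerically. The Python sketch below is only an illustration and is not part of the SSIS package: the three aggregates it tries (mean, product, minimum) are assumed guesses at how column scores might be combined, not the transform's documented formula, and the helper name `naive_aggregates` is hypothetical.

```python
from math import prod

# Column similarity scores copied from the table above, in the order:
# author, title, headline, summary, picture, caption.
match1 = [1.0, 0.85344648, 0.0125, 0.0125, 1.0, 1.0]
match2 = [1.0, 1.0, 0.0125, 0.0125, 1.0, 1.0]

# Row-level scores reported by the transform for the same two matches.
reported = {
    "match1": {"similarity": 7.8429267e-2, "confidence": 0.55728668},
    "match2": {"similarity": 7.3196657e-2, "confidence": 0.44271332},
}


def naive_aggregates(scores):
    """Combine column scores in three simple ways (assumed, not documented)."""
    return {
        "mean": sum(scores) / len(scores),
        "product": prod(scores),
        "min": min(scores),
    }


agg1 = naive_aggregates(match1)
agg2 = naive_aggregates(match2)

# Print each guessed aggregate next to the reported row similarity.
for name in ("mean", "product", "min"):
    print(f"{name:8s} match1={agg1[name]:.6g}  match2={agg2[name]:.6g}")
print("reported match1={:.6g}  match2={:.6g}".format(
    reported["match1"]["similarity"], reported["match2"]["similarity"]))

# The two confidence values can also be summed to see how they relate.
confidence_sum = sum(m["confidence"] for m in reported.values())
print("confidence sum =", confidence_sum)
```

Each of these guessed aggregates ranks Match 2 at or above Match 1, while the reported _similarity ranks Match 1 higher, and none of them comes close to the reported values. The two _confidence values add up to 1, which suggests, without proving it, that confidence is expressed relative to the set of matches returned for the lookup.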

In another case both matching records have *identical* scores for every mapped column and yet their similarity and confidence scores are different.

Clearly there are other factors involved in calculating the overall row score. Anybody know what these are?


Fernando Tubio

Can't even begin to describe it in my own words. This article describes the Fuzzy Math real well. Don't know if you've seen it.

http://msdn.microsoft.com/msdnmag/issues/05/09/SQLServer2005/

|||

Thank you Martin.

I've read the article and it explains the lookup process well. Unfortunately, it doesn't answer my question. Specifically, having found two matches, why does the matching algorithm discard what appears to be the better match, at least judging from the individual column similarity scores?

I am trying to understand the mechanism to determine if there is anything I can tweak in order to force the algorithm to make a better choice.
