global code: Fuzzy lookup match issue

Friday, March 23, 2012

Fuzzy lookup match issue

Hello,

I have a peculiar problem in my project. The project design is shown below.

The numbers in parentheses are record counts.

File feed (1000)
      |
Fuzzy Lookup against Table2
      |
Split Fz Lookup results
(_Similarity >= 0.60 && _Confidence >= 0.85)
      |                        |
Remaining rows (750)     Write matches to Table1 (250)
      |
Fuzzy Group
      |
Split Fz Group results
      |                        |
Write Canonicals         Write Dupes
to Table2 (300)          to Table1 (450)

This is basically a customer de-duplication project.

Table2 holds the canonicals and Table1 holds the dupes (of the canonicals).

I already have some data in these tables. New data is matched against that existing data and classified as either new customers or duplicate customers.

In the process above, the rows that the Fuzzy Lookup identifies as dupes of already existing canonicals are written to the dupes table (Table1) and should not be processed any further down the line in the project.
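To make the intended routing concrete, here is a minimal Python sketch of the split condition, not the SSIS package itself. The thresholds and the _Similarity and _Confidence column names come from the design above; the function name, the "id" field, and the sample rows are hypothetical and only there for illustration.

```python
# Thresholds taken from the split condition in the design above.
SIMILARITY_MIN = 0.60
CONFIDENCE_MIN = 0.85


def split_lookup_results(rows):
    """Route each row to exactly one branch, like a conditional split.

    Rows that meet both thresholds are treated as matches (dupes of
    existing canonicals); everything else goes on to grouping.
    """
    matches, remaining = [], []
    for row in rows:
        is_match = (
            row["_Similarity"] >= SIMILARITY_MIN
            and row["_Confidence"] >= CONFIDENCE_MIN
        )
        (matches if is_match else remaining).append(row)
    return matches, remaining


# Hypothetical sample rows; "id" stands in for whatever key the feed carries.
sample = [
    {"id": 1, "_Similarity": 0.92, "_Confidence": 0.97},
    {"id": 2, "_Similarity": 0.75, "_Confidence": 0.60},
    {"id": 3, "_Similarity": 0.40, "_Confidence": 0.90},
]

matches, remaining = split_lookup_results(sample)
match_ids = [r["id"] for r in matches]
remaining_ids = [r["id"] for r in remaining]
print(match_ids)      # [1]
print(remaining_ids)  # [2, 3]
```

In this sketch a row can land in only one of the two lists, which is the behaviour I expect from the split in the package.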

But in my case, I see that the matches identified by the Fuzzy Lookup are also being included in the Fuzzy Grouping.

When I run this in debug mode in BIDS, it shows the correct row counts, as depicted in the illustration above. But after execution, when I query the tables, the data shows that all 1000 rows went through Fuzzy Grouping.
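One way to confirm this after a run is to compare the keys written by each branch. The sketch below assumes every feed row carries a unique key that survives into both destinations; the function name and the sample key lists are hypothetical, and in practice the lists would come from queries against the tables.

```python
def rows_in_both_branches(lookup_match_ids, grouped_ids):
    """Return keys that appear in both the lookup-match output and the
    grouping output. With a correct split this should be empty."""
    return sorted(set(lookup_match_ids) & set(grouped_ids))


# Hypothetical keys: rows written by the lookup-match branch, and rows
# that carry Fuzzy Grouping output after the run.
lookup_match_ids = [101, 102, 103]
grouped_ids = [103, 104, 105]

overlap = rows_in_both_branches(lookup_match_ids, grouped_ids)
print(overlap)  # [103]
```

A non-empty result would mean that some rows were handled by both branches, which is what the table contents suggest is happening here.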

Any thoughts?

Btw, is there any way to upload attachments to the postings here?

I also tried introducing a Derived Column between 'Split Fz Lookup results' and 'Write matches to Table1' to write some string into one of the table columns. It did not write the string.
