Hello,
I have a peculiar problem in my project. The project design looks like this
(the numbers in parentheses are record counts):
        File feed (1000)
              |
              |
        Fuzzy Lookup
        against Table2
              |
              |
        Split Fz Lookup results
        (_Similarity >= 0.60 && _Confidence >= 0.85)
          |                 |
          |                 |
          |         Write matches to Table1 (250)
          |
        Fuzzy Group
        Remaining rows (750)
          |
          |
        Split Fz Group results
          |                 |
          |                 |
        Write Canonicals    Write Dupes
        to Table2           to Table1
        (300)               (450)
This is basically a customer de-duplication project.
Table2 holds the canonicals and Table1 holds the dupes (of the canonicals).
I already have some data in these tables. New data is matched against the existing data
in these tables and classified as either new customers or duplicate customers.
In the above process, the rows identified by the Fuzzy Lookup task as dupes of already existing canonicals
are written into the dupes table (Table1) and should not be processed any further down
the line in the project.
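To make the intended routing concrete, here is a minimal sketch in plain Python (not SSIS). The thresholds come from the split condition in the diagram; the function name `split_lookup_results`, the `id` field, and the sample rows are hypothetical and only illustrate the expected behaviour: a row that passes the condition goes to Table1 and must never reach the Fuzzy Grouping input.

```python
# Thresholds taken from the conditional split in the diagram.
SIMILARITY_MIN = 0.60
CONFIDENCE_MIN = 0.85


def split_lookup_results(rows):
    """Partition Fuzzy Lookup output into (matches, remaining).

    matches   -> written to Table1 as dupes of existing canonicals
    remaining -> the only rows that should be fed to Fuzzy Grouping
    """
    matches, remaining = [], []
    for row in rows:
        is_match = (
            row["_Similarity"] >= SIMILARITY_MIN
            and row["_Confidence"] >= CONFIDENCE_MIN
        )
        (matches if is_match else remaining).append(row)
    return matches, remaining


# Hypothetical sample rows, for illustration only.
rows = [
    {"id": 1, "_Similarity": 0.92, "_Confidence": 0.90},
    {"id": 2, "_Similarity": 0.75, "_Confidence": 0.50},
    {"id": 3, "_Similarity": 0.60, "_Confidence": 0.85},
    {"id": 4, "_Similarity": 0.40, "_Confidence": 0.95},
]

matches, remaining = split_lookup_results(rows)
match_ids = [r["id"] for r in matches]
remaining_ids = [r["id"] for r in remaining]
```

Every input row lands in exactly one of the two lists, so the two counts should always add up to the input count (250 + 750 = 1000 in the diagram).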
But in my case, I see that the matches identified by the Fuzzy Lookup are also being included in the
Fuzzy Grouping.
When I run this in debug mode in BIDS, it shows the correct row counts, as depicted in the
illustration above. But after execution, when I query the tables, the data shows that all 1000 rows
went through Fuzzy Grouping.
Any thoughts?
Btw, is there any way to upload attachments to posts here?
I also tried introducing a Derived Column between 'Split Fz Lookup results' and 'Write matches to Table1' to write a marker string into one of the table columns. The string did not show up in the table.