Hello,
I have created a project to de-duplicate addresses.
I understand that Fuzzy Grouping takes less time when it has a smaller data volume to process.
My source feed file is sometimes huge, so I am splitting the input into multiple branches based on
the first letter of the city. There are 7 branches in the process.
Source File Feed
|
Split data into 7 groups
|
| | | | | | |
FzGrpg FzGrpg FzGrpg FzGrpg FzGrpg FzGrpg FzGrpg
| | | | | | |
Split Split Split Split Split Split Split
| | | | | | |
- -- -- -- -- -- --
| | | | | | | | | | | | | |
<- - - - - - - Write the Canonicals and Dupes from each of these splits into database - - - - - - - - ->
When I designed this, I was hoping that each of the Fuzzy Grouping transforms would execute in parallel.
But in reality they process one after the other.
Is there any way to make them execute in parallel?
Appreciate your help.
Thanks
KM
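The split described above is done in SSIS with a Conditional Split on the city's first letter. As a rough sketch of the same bucketing outside the designer, here is a minimal Python analogue (the `branch_key` helper and the 7-way modulo mapping are illustrative assumptions, not the actual Conditional Split expressions):

```python
from collections import defaultdict

def branch_key(city, branches=7):
    """Bucket a city by its first letter, folded into `branches` groups."""
    if not city:
        return 0
    return (ord(city[0].upper()) - ord("A")) % branches

def split_by_city(rows, branches=7):
    """Mimic the Conditional Split: route each address row to one branch."""
    groups = defaultdict(list)
    for row in rows:
        groups[branch_key(row["city"], branches)].append(row)
    return groups

rows = [
    {"addr": "1 Main St", "city": "Boston"},
    {"addr": "2 Oak Ave", "city": "Houston"},
    {"addr": "3 Elm Rd",  "city": "boston"},
]
groups = split_by_city(rows)  # "Boston"/"boston" share a branch; "Houston" goes elsewhere
```

Each bucket would then feed its own Fuzzy Grouping transform downstream.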
How do you know they are not running in parallel, and what kind of machine are you using?
I haven't used Fuzzy Grouping before, but parallel processing depends on the hardware you are using. If the machine has a single processor, I doubt you will see any parallelism in the process.
|||It's a machine with 4 dual-core CPUs and 8 GB of RAM.
When I run in debug mode, I can see the data flowing through the pipelines at runtime.
|||Which component are you using right before the Fuzzy Grouping transforms? I *think* you should be using a Multicast to get the parallelism you want, as it would generate 7 identical data sets to be consumed by each Fuzzy Grouping component. See if this article gives you some tips (Parallelism section):
http://www.microsoft.com/technet/prodtechnol/sql/2005/ssisperf.mspx
[Microsoft follow-up] Perhaps somebody at MSFT can give you a better explanation.
|||Yes, I believe the current pipeline engine will not do a great job of optimizing this one. It will most probably end up using a single thread, because Conditional Split is synchronous. You could try to artificially break the synchronicity of the Fuzzy Grouping branches by adding a fake asynchronous transform (a Union All with one input and one output should work).
The next version of the pipeline scheduler should be able to better optimize distribution of threads.
Thanks.
|||Bob and Rafael, thanks a lot for your input. So you are saying it should be like this?
Source File Feed
|
Split data into 7 groups
|
| | | | | | |
UnionAll UnionAll UnionAll UnionAll UnionAll UnionAll UnionAll
| | | | | | |
FzGrpg FzGrpg FzGrpg FzGrpg FzGrpg FzGrpg FzGrpg
| | | | | | |
Split Split Split Split Split Split Split
| | | | | | |
- -- -- -- -- -- --
| | | | | | | | | | | | | |
<- - - - - - - Write the Canonicals and Dupes from each of these splits into database - - - - - - - - ->
|||
As an alternative, could you do the split in one data flow, dumping each branch to a raw file, and then use a separate data flow (or 7 data flows) to read in the raw files and do the Fuzzy Grouping?
The engine seems to optimize better with multiple data flows than with multiple paths in the same data flow.
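Outside SSIS, this raw-file pattern is plain divide and conquer: one pass writes each branch to its own file, then independent workers consume the files concurrently. A minimal Python sketch under stated assumptions (the file names, the branch data, and the exact-match de-dup standing in for Fuzzy Grouping are all illustrative, since real Fuzzy Grouping matches approximately):

```python
import multiprocessing as mp
import os
import tempfile

def process_branch(path):
    """Stand-in for one per-branch data flow: read a raw file and
    collapse exact duplicates (real Fuzzy Grouping matches approximately)."""
    with open(path) as f:
        rows = [line.strip() for line in f if line.strip()]
    canonicals = sorted(set(rows))
    return os.path.basename(path), len(rows), len(canonicals)

if __name__ == "__main__":
    tmp = tempfile.mkdtemp()
    # Stage 1: a single "data flow" dumps each branch to its own raw file.
    branches = {0: ["1 main st", "1 main st", "2 oak ave"], 1: ["3 elm rd"]}
    paths = []
    for k, rows in branches.items():
        p = os.path.join(tmp, f"branch_{k}.txt")
        with open(p, "w") as f:
            f.write("\n".join(rows))
        paths.append(p)
    # Stage 2: separate processes consume the raw files concurrently,
    # like running one data flow per branch.
    with mp.Pool(processes=len(paths)) as pool:
        for name, total, canon in pool.map(process_branch, paths):
            print(f"{name}: {total} rows -> {canon} canonicals")
```

Writing the intermediate files costs some I/O, but it decouples the branches so the engine (or the OS scheduler, in this sketch) can run them in parallel.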
|||Yes, that is how I meant to use the Union Alls.
The idea of using raw files sounds good too. I would try both and see which one works better for your scenario.
Thanks.
|||Thanks a lot guys.
I tried the Union All approach and it works perfectly. I can see that all the threads run in parallel.
A run that used to take around 10 hours now completes in 2.5 hours, which is a huge saving.
And CPU utilization was 100%.
As I tweak this further, I will keep you all posted.
Thanks again.
|||
KM68 wrote:
Thanks a lot guys.
I tried the Union All approach and it works perfectly. I can see that all the threads run in parallel.
A run that used to take around 10 hours now completes in 2.5 hours, which is a huge saving.
And CPU utilization was 100%.
As I tweak this further, I will keep you all posted.
Thanks again.
[Microsoft follow-up] This sounds like positive feedback.
KM68,
We all will appreciate your updates.
|||I am glad it worked.
Hopefully, with the next version of the data flow engine you will not need this workaround, and the execution time can be trimmed even further.
Thanks.
|||Thanks.
Any idea when the next version is scheduled for release?