I have tried to process more than 3 million records through Fuzzy Grouping on two different servers with no success. 3 million works, but anything above 4 million doesn't. Some background:
- We are trying to de-dup our customer table on: name (.5 min), address1 (.5 min), city (.5 min), state (exact), with a .8 overall record minimum score.
- Output includes additional fields: customerid, sourceid, address2, country, phonenumber.
- Without SP1 installed I couldn't even get a few hundred thousand records to process.
- Two different servers - same problems. Note that SSIS and SQL Server are running locally on both.
- The higher-end server has 4 GB RAM, the other 2.5 GB RAM. Plenty of free disk space on both.
- SQL Server is configured to use 2 GB of RAM max.
- The page file is currently at 15 GB.
After running a number of tests on both servers, trying different batch sizes etc., the one thing I noticed is that it always seems to error out when SSIS takes over and starts chewing up all the available RAM. This happens after the index is created and SSIS starts "warming caches". On both servers SQL Server uses about 1.6 GB of RAM at this point, while SSIS keeps taking RAM until all physical RAM is used up.
Some questions:
- Has anyone been able to process more than 3 million records, and if so, what is your hardware configuration?
- Should we try running SSIS from a different server so it has access to the full amount of physical RAM (and doesn't have to fight SQL Server for RAM)?
- Should we install Win 2003 Enterprise Server so we can add more RAM?
- Any ideas why switching to the page file might be causing errors?
Thanks!!
Keith Doyle
It's not so much the number of records as the size of the records that may be the issue. What size of record are you matching?
Do you have any standard (non-fuzzy) lookups? What else is the package doing?
|||Thanks for the info, Donald. I am still having problems, and there appears to be another user in the following thread who is having the same issues (this is another thread I started):
http://forums.microsoft.com/MSDN/ShowPost.aspx?PostID=412813&SiteID=1&mode=1.
I asked them to join this thread so we have all the info in one place. Below are my latest test results.
I decided to concentrate on getting this to work on one server at a time. Here is the detailed configuration of the server I am starting with:
Single P4 2.4 GHz CPU|||I'll have someone investigate at this end and we'll get back to these threads. Thanks for the great info.
One thing to try would be executing the package using DTExec, outside the debug environment. Or try "Start without Debugging." I doubt this will solve the problem entirely, but it will reduce any effect of the IDE on the problem.
Donald
|||All testing was done using DTExec. I even went back and tried removing all the "pass through" fields and that didn't help either.
Hopefully you are able to find something out - thanks!!
Keith
|||Donald, is it true that a 64-bit box would resolve this issue?|||
Indeed - 64-bit systems with sufficient memory available (physical or virtual) should not see these issues. I do expect that over the next few years we'll see 64-bit become the standard platform for data integration - but meanwhile I still want to fix this problem!
Donald
|||Hi,
Fuzzy Grouping uses Fuzzy Lookup underneath, so the following comments apply to both.
FL/FG maintain multiple caches which use up memory in the client process, e.g. DTExec.exe. These caches contain recently used portions of the Match Index and also the Reference Table, and they reduce the number of queries that must be issued against the SQL Server. The Warm Caches phase pre-loads the caches by doing a sequential load of the Match Index/Reference Table instead of bringing over rows one-by-one as needed (which results in more random-access I/O).
You should be able to control the size (in MB) of these caches by setting the MaxMemoryUsage property on the transform. If this property is not specified, FL/FG will continue to consume memory until only a small percentage of physical memory is free; it should reduce its cache usage if free physical memory drops below a threshold.
One issue is that both SQL Server and FL/FG contend for physical memory, and it is possible that their caches will end up containing the same rows. One way to reduce this contention is to set the min & max memory usage for SQL Server as well. This can be done as follows:
sp_configure 'show advanced options', 1
reconfigure
sp_configure 'max server memory', 768
reconfigure
sp_configure 'min server memory', 100
reconfigure
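To confirm the new limits have taken effect, a quick check against the standard configuration catalog view can be run:
SELECT name, value, value_in_use
FROM sys.configurations
WHERE name IN ('min server memory (MB)', 'max server memory (MB)')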
If you still see physical memory being used up after the DTExec.exe process has exited, this is likely because SQL Server is holding the memory. Unfortunately, sqlserver.exe does not fully report its memory usage in TaskMgr. You must issue the commands above to get it to release memory, short of restarting the SQL process or the machine.
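One way to see how much memory the engine has actually committed (since TaskMgr under-reports it) is to look at its own performance counters, for example:
SELECT counter_name, cntr_value   -- values are reported in KB
FROM sys.dm_os_performance_counters
WHERE counter_name LIKE 'Total Server Memory%'
   OR counter_name LIKE 'Target Server Memory%'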
Note that SP1 contained a fix to a major memory leak in FL/FG. It appears that there could still be memory leaks that we did not find, or the problem could be the contention with SQL Server for memory. Try setting the MaxMemoryUsage property and see if DTExec.exe memory usage levels off after running for a few minutes in the Process Input phase. If the memory usage continues to grow dramatically, we may need to work with you to isolate a new memory leak.
One more thing to try would be to change the Exact Match join columns to Fuzzy Match with a column-level MinSimilarity of 1.0. FL/FG has a different code path for Exact Match columns, which could be the culprit.
A workaround, which isn't ideal for sure, would be to partition your input into smaller batches based upon an attribute such as State, if you believe it is unlikely that two records with different values for State will be duplicates of each other.
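For example (just a sketch - the table and column names below are assumed from the customer table described earlier), the source query for each batch could be restricted to one slice of the data, with the package run once per State value:
-- dbo.Customer and its columns are assumed names, for illustration only
SELECT customerid, sourceid, name, address1, address2, city, state, country, phonenumber
FROM dbo.Customer
WHERE state = ?   -- parameter supplied per run, e.g. from a Foreach Loop variable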
Please let us know if you still see problems after trying the above.
-Kris
|||Hi - just ran another test with these changes:
SQL Server min memory = 100|||Hi:
I'm the guy in the other thread that Keith mentioned. My scenario is almost the same i.e. Name and Address De-dup. I'm on
Win2003R2Enterprise,
SQL2005SPI Enterprise,
X64 (Single Proc 3.0 GHz HT, with 2 GB RAM). (Do you mean Itanium or X64, Donald?)
12 GB on two partitions to Paging.
My input data is already scrubbed, so my FG just has a basic Data Source, FG, and Destination. All I have is a Client ID (pass-through), Name (Fuzzy) and Address (Fuzzy). I cannot partition the data state-wise or any other way for business reasons. I also don't have any standard lookups.
I used an assembly and executed it from DTexecUI, so no debugger there.
I got past 1.5 million records once, but have had failures again later. I always manage to get past the "finding similar records" phase, "Post Execute" and "Cleanup".
Picking Canonical Records is where I get Buffer Manager errors. I also saw the spike in page faults that Keith mentioned, but didn't observe in which phase it happened. During this "canonical" phase, my CPU use is very low compared to before this phase. PF usage never crossed 2 GB. Available physical memory stayed at around 40,000 K.
I had an error on the SP1 install for SSIS (message: Service Failed to Start). My SSIS version is 9.0.2047, which seems OK. Does the install error mean anything? Does it make a difference which account owns the SSIS service? I have Network as the owner by default, and the service does not even start up on boot; it gives a timeout error. Will making it System-owned help by reducing LSASS activity? Starting the service manually works.
I once also ran into tempdb growth issues, and pre-setting the file sizes to a big number (like 8 GB) seemed to help a bit.
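The pre-sizing I mean is just something along these lines (assuming the default tempdb logical file names, tempdev and templog; adjust the sizes as needed):
ALTER DATABASE tempdb MODIFY FILE (NAME = tempdev, SIZE = 8192MB)
ALTER DATABASE tempdb MODIFY FILE (NAME = templog, SIZE = 2048MB)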
This combination of low PF usage, low CPU and maxed-out RAM seems to indicate that FG/FL is not managing RAM/paging efficiently. I will also try out the memory settings that Kris suggested.
Does indexing help at all? I saw all that material on MSDN about FTI pre-creation for FL, but don't know how to do that in FG, if it is possible at all.
Thanks, Donald, hope you nail this one!
|||Have changed over to a two-server config. One runs SSIS and DTexecUI or DTexec, and the other runs the Source and Destination DB.
The DTEXECUI server seems to be fine, really, with over a GB of RAM free, and with hardly a GB of paging used. CPU Utilization stayed at around 50 % all through, which is expected for a HT proc.
The database server had a CPU utilization of hardly 5% all through, but the page file kept growing (max of 2 GB), and tempdb kept growing and went up to almost 50 GB before my disk maxed out. This was immediately after the "Finding Similar Records" phase was 100% completed. I had hardly 40,000 K of physical memory left at this point (from an initial value of almost 1.4 GB).
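A simple way to see whether it is the tempdb data file or the log that is ballooning during a run is to poll the file sizes (the catalog reports size in 8 KB pages, hence the conversion):
SELECT name, type_desc, size * 8 / 1024 AS size_mb
FROM tempdb.sys.database_files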
What exactly is the relation between tempdb and the package? From Kris's post it sounds like FG works in memory. If that is the case, why is my tempdb server disk getting maxed out? And I assume this is related only to the destination DB?
Rows Per Batch and Maximum Insert Commit Size - how exactly do these work? Are they equivalent to partitioning the data, i.e. is grouping carried out only within the batch? Books Online has nothing substantial about these settings :-(
The strange part is that I got this to work on a single server with Client ID + Name + Address, but it has failed thrice on a two server setup.
Thanks a lot for your help.
Kar
|||
Hi,
I have observed another run, a little more carefully this time. I set a Rows per Batch of 100,000 and a Commit Size of 200,000. I ran on 1.5 million rows with just Client ID and Address. The size of my input data table (Client ID, Name, Address) is around 0.4 GB.
DTExec and the source DB were on server 1, which ran fine until about 50% completion of "Finding Similar Records": almost 0.6 GB of RAM free, and almost no network activity. "Finding..." was moving along at about 1 percentage point every 8 minutes, and the Page Fault Delta was almost nil. Then all of a sudden, "Finding" stalled, the Page Fault Delta for DTExec shot up into the thousands, and network activity shot up to almost 50%.
The destination DB on server 2 was also normal initially, though available RAM dropped to around 100 MB. Then the effects were felt on this server too: network traffic shot up, and the tempdb log started exploding. The data files remained stable.
What kind of growth should I expect? Could 0.4 GB of input lead to almost 50 GB of tempdb logs?
Why is page fault delta shooting up at a certain point? And what exactly is DTExec pushing to tempdb when the page faults explode?
Thanks in anticipation!
Kar
|||
Firstly, perhaps you should narrow down the issues. I suggest the following technique:
Measure the perf of the full package - source, transform logic, destination (S+T+D).
Now replace the SQL Server destination with a Row Count component that will simply count the rows into a variable. You do not need to do anything with those rows. You now effectively have a package that is just S+T.
The difference in performance between (S+T+D) and (S+T) will then give you the impact your destination is having on your package.
If the destination is having a great impact, you may want to consider the following possibilities ...
Do you have indexes on your target table? The OLEDB client for SQL Server uses effectively the same interfaces as Bulk Load and there can be issues with heavy use of tempdb when bulk loading to an indexed table. See http://www.microsoft.com/technet/prodtechnol/sql/2000/maintain/incbulkload.mspx#E2D for more information - for SQL 2000 admittedly, but I believe the same applies for 2005.
As I often say, SSIS is a client of its adapters - the performance of SSIS is often largely an effect of its source and destination. In the case of the OLEDB destination, the Rows per Batch and Commit Size settings are effectively the same as those properties of SQL Server itself for Bulk Import. So have a look at http://msdn2.microsoft.com/en-us/library/ms188267.aspx which describes the optimization of Bulk Import - bearing in mind that SSIS is just another client. I suggest reading that information and trying some different batch sizes.
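To get a feel for what those settings map to on the server side, here is the same idea expressed as a plain T-SQL bulk import (only a sketch - the table and file names are made up - but BATCHSIZE and ROWS_PER_BATCH are the server-side counterparts of the Maximum insert commit size and Rows per batch properties):
BULK INSERT dbo.YourDestinationTable   -- placeholder table name
FROM 'C:\data\customers.txt'           -- placeholder data file
WITH (
    BATCHSIZE = 200000,       -- rows committed per transaction (~ Maximum insert commit size)
    ROWS_PER_BATCH = 1500000, -- hint for the total number of rows (~ Rows per batch)
    TABLOCK                   -- bulk-update lock, helps enable minimal logging
)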
When using BOL, please remember that you can add comments (and vote) using the controls at the foot of the BOL page - this will help us to improve the documentation for future use.
hth
Donald
|||Thanks a lot, Donald. Am looking up the two articles, though my destination table has no index. So I just hope to optimize the Rows per batch and Commit Size.
I want to try out your S+T+D suggestion, but will have to find a time slot; this can run for 16-plus hours! Meanwhile, I have also ordered an extra disk!
Anyway, I had already physically isolated the three parts, because S+T was on one machine and D on the other. And I could see that S was hardly a factor. So does it still make sense to try your experiment?
I would also like to try and see if I can get past this without transactions, so that everything gets committed instead of being batched. Will the Fuzzy transform allow that? Or does it need "Transactions Supported"? I have a feeling that there is a very large transaction here. Please see my previous post: network traffic suddenly surged, tempdb.log grew suddenly, and so on. It was not a gradual growth over the entire processing period.
On the BOL part, I think the gaps are too numerous for me to go and make individual comments. I started sending comments in the September CTP, but found that almost every page was inadequate. I am only referring to SSAS here, because that is what I regularly need. Take a look at the help on THIS, for example. In many cases even a basic code sample is missing, and the text description is very sketchy.
|||This thread almost died a natural death :-)
I have succeeded at doing my stuff, but there are still performance and resource issues that are not quite clear to me.
Especially when I see the performance stats in the Fuzzy article on MSDN (http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnsql90/html/FzDTSSQL05.asp).
This article shows great performance for Fuzzy Lookup with 2.5 million rows. Fuzzy Grouping is supposed to involve creation of an ETI (as in Lookup) before it starts grouping. Somehow, the performance I see does not match up to what is in the article. Makes me wonder what I am doing wrong.
Help, anyone?