global code: fuzzyness...

Friday, March 23, 2012

fuzzyness...

Greetings
I'm working on a large and growing database of security alarms. I'm looking at ways to mine the data that allow me to group sets of records with similar values. For example
"device-xyz", "high cpu load from device xyz 95% > threshold of 90%"
"device-xyz", "high cpu load from device xyz 96% > threshold of 90%"
if I do a plain sql search
SELECT device-name, Count(device-name) AS total1, Summary
FROM dataset
GROUP BY devicename, Summary;
I end up with two records since summary is differentt. The desire is to have a single record that combines the two using some form of fuzzyness.
I'm not able to use wildcards to solve this problem because I have a large volume of data all with a few values that could be different. I'm really looking I guess at some form of percentage of variation on the description.
I have started to look at a few tools such as dataflux that do look promising.
Has anyone addressed this problem before and can recommend an approach?
Many thanks in advance
Jdjg,
You may want to check out the sql function DIFFERENCE
and/or PATINDEX.
An approach I would recommed is to cut up each of the strings to words.
Then look for the best DIFFERENCE score of each word in the first string with the
words in the 2nd. To aggregate for the whole string use something like standard deviation.
You could encapsulate this in a user defined funtion - it wouldn't be performant though.
Regards
AJ
"djg" <anonymous@.discussions.microsoft.com> wrote in message
news:C618A23A-E59A-4DAC-91C8-883A6C4E3BE1@.microsoft.com...
> Greetings
> I'm working on a large and growing database of security alarms. I'm looking at ways to mine the data that
allow me to group sets of records with similar values. For example
> "device-xyz", "high cpu load from device xyz 95% > threshold of 90%"
> "device-xyz", "high cpu load from device xyz 96% > threshold of 90%"
> if I do a plain sql search
> SELECT device-name, Count(device-name) AS total1, Summary
> FROM dataset
> GROUP BY devicename, Summary;
> I end up with two records since summary is differentt. The desire is to have a single record that combines
the two using some form of fuzzyness.
> I'm not able to use wildcards to solve this problem because I have a large volume of data all with a few
values that could be different. I'm really looking I guess at some form of percentage of variation on the
description.
> I have started to look at a few tools such as dataflux that do look promising.
> Has anyone addressed this problem before and can recommend an approach?
> Many thanks in advance
> J|||On Sun, 26 Oct 2003 11:21:05 -0800, djg
<anonymous@.discussions.microsoft.com> wrote:
>I'm working on a large and growing database of security alarms.
>I'm looking at ways to mine the data that allow me to group
>sets of records with similar values. For example
>"device-xyz", "high cpu load from device xyz 95% > threshold of 90%"
>"device-xyz", "high cpu load from device xyz 96% > threshold of 90%"
>if I do a plain sql search
>SELECT device-name, Count(device-name) AS total1, Summary
>FROM dataset
>GROUP BY devicename, Summary;
>I end up with two records since summary is differentt.
>The desire is to have a single record that combines the two
>using some form of fuzzyness.
How about exactness -- try "with rollup"
Put the decision rules in the front end. :)
J.

Friday, March 23, 2012

fuzzyness...

No comments:

Post a Comment

global code

Blog Archive

About Me