Friday, March 23, 2012

Fuzzy Street Address Search Code

Anybody have any good code for this case...

Given @.address1 as a parameter.

Look in the providers table for providers that match that Address.

But it needs to be a fuzzy lookup. Meaning it has to find these to be matched

123 Fake St.

123 Fake Street

Even possible some misspellings like

123 Fke St would be nice as well.

I know it will possibly involve a Split function, and/or Soundex and Difference functions. But what is best?

I can't find any code when I google this subject.

Help?!?!?

You may be best served to establish FULL TEXT SEARCH for the address column.

Refer to Books Online, Topic: Full Text Search

|||

I just set up the full-text search... This does not seem to be helpful. I can't do a search like:

SELECT *

FROM providers

WHERE CONTAINS (sv_addr, '"123 Fake St"')

Because it will pull up nothing... However I would need to break up the values with some sort of Split function. Then even still it would not pull up ST as matching Street... or get me results spelled wrong like 123 Fke St.

Help?!?! Is there some functionality I'm missing for using a Full-text Search? Or does someone have code already. It would seem that this functionality is needed in all kinds of projects, and that my inquery would be answered completely immediately.

-Robert

|||

This underscores the need to have a good address validation process in place BEFORE bad and/or mis-spelled addressed are put into your server. You may wish to investigate using the US Postal Service web service for addresses and address correction.

I have assisted in incorporating the USPS web service in applications and the results have been excellent. For example, I can misspell my street name and it will still come back with the correct address. In addition, the Zip+4 is provided, there is cross-check between city and zipcode, abbreviations are standardized, etc.

https://secure.shippingapis.com/Registration/

http://www.codeproject.com/useritems/USPS_Web_Tools_Wrapper.asp

|||

umm... no. I have a legacy address database... and I'm hired to write code to search it. I'm not hired to correct the database. The code I'm looking for has to be out there, I just have to look hard enough... or someone will help me and post it.

Thank you for your suggestions.

|||

I've been there, spent quite a bit of time researching Name and Address issues (on a client's nickel of course.)

There are excellent third party 'add-ins' -very expensive though ($100k+) But not too much that really solves the problems.

Deconstruction can help, storing the address components in separate columns -but still a groaner...

Good luck, and keep us posted. Perhaps there is a new approach out there that would be wonderful to share.

|||bump|||There is no "good" way to do what you want with one field called "Address", with any number of values in it.

What I have done in the past is use an exact search for the value and if that does not return any values, use the SOUNDEX() function to try to match something. On large string lengths, this will have a very large number of hits.

Good luck
|||

Here is a beginning to what I want

Code Snippet

CREATE Function [dbo].[AddressFuzzy] (

@.var1 as Varchar(200),

@.var2 as Varchar(200))

RETURNS int AS

--

-- Function Address --

-- ? Matches street name

--

BEGIN

DECLARE

@.a varchar(100), -- street name in var1

@.b varchar(100), -- street name in var2

@.i int

select @.a = Value from dbo.Split(@.var1, ' ') where TokenID = 2

select @.b = Value from dbo.Split(@.var2, ' ') where TokenID = 2

set @.i = difference(@.a, @.b)

RETURN (@.i)

END

The Split function is available as well, if you don't have or understand. This will do a fuzzy match on whatever the second word in the street address is.

so 1) 123 2) Fake 3) Street

It'll fuzzy match 2. I think I can work with this till I get somewhere...

|||

Code Snippet

ALTER Function [dbo].[AddressFuzzy] (

@.var1 as Varchar(200),

@.var2 as Varchar(200))

RETURNS int AS

--

-- Function Address --

-- ? Matches street name

--

BEGIN

DECLARE

@.a varchar(100), -- street name in var1

@.b varchar(100), -- street name in var2

@.c varchar(100), -- street name in var1

@.d varchar(100), -- street name in var2

@.e varchar(100), -- street name in var1

@.f varchar(100), -- street name in var2

@.i char(1),

@.j char(1),

@.k char(1),

@.var1Tokens int,

@.var2Tokens int,

@.var1Plus int,

@.var2Plus int

set @.var1Plus = 0

set @.var2Plus = 0

select @.var1tokens = TokenID from dbo.Split(@.var1, ' ')

select @.var2tokens = TokenID from dbo.Split(@.var2, ' ')

set @.k = '0'

select @.a = Value from dbo.Split(@.var1, ' ') where TokenID = 1

select @.b = Value from dbo.Split(@.var2, ' ') where TokenID = 1

if @.a = @.b set @.k = '1'

select @.c = value from dbo.Split(@.var1, ' ') where TokenID = 2

select @.d = Value from dbo.Split(@.var2, ' ') where TokenID = 2

set @.i = Convert(varchar, difference(@.c, @.d))

if Convert(int, @.i) < 3 begin

if @.var1tokens > 3 begin

-- select @.c = @.c + ' ' + value from dbo.Split(@.var1, ' ') where TokenID = 3

select @.c = value from dbo.Split(@.var1, ' ') where TokenID = 3

set @.var1Plus = 1

end

if @.var2tokens > 3 begin

-- select @.d = @.d + ' ' + Value from dbo.Split(@.var2, ' ') where TokenID = 3

select @.d = value from dbo.Split(@.var1, ' ') where TokenID = 3

set @.var2Plus = 1

end

set @.i = Convert(varchar, difference(@.c, @.d))

end

select @.e = Value from dbo.Split(@.var1, ' ') where TokenID = 3 + @.var1Plus

select @.f = Value from dbo.Split(@.var2, ' ') where TokenID = 3 + @.var2Plus

if @.e is null or @.f is null or @.e = '' or @.f = ''

set @.j = '4'

else

set @.j = Convert(varchar, difference(@.e, @.f))

RETURN Convert(int, (@.k + @.i + @.j))

END

It works fine...

|||

And then, without address standardization, there will be:

123 N Fake St

123 N E Fake St

123 North East Fake St

123 Fake N

123 N Fake St S

As you will discover, there is no 'easy or simple' way to handle addresses unless you follow the USPS rules of standardization -and even then it will still be a major headache. (That's partially why the third party tools are so expensive.) Almost all situations where addressing is a mission critical part of the database structure, addresses are deconstructed into constuitent parts, Number, Direction, StreetName, StreetType, etc. -Think Fire, Police, 911 services. Fuzzy doesn't handle it.

A great many of us have had to tackle this problem. If you 'blow off' our suggestions, you most likely won't find a workable solution.

You would do your client a better service by opening the conversation about why and how to standardize the existing data. It is not a complex process.

|||

The above solution I have found does work. I complained of it's slow-ness but once you narrow down the field using Zip &/or city,state it becomes manageable and an acceptable slowness.

I understand your concerns about the USPS address. Our company has had clients that give us raw addresses from there database, and I've been on the task of turning those into distinct locations.

A lot of the time I have to eye-ball address after address and assign them the same location-id because of the above...
123 N fake st = 123 fake st = .... and so on.

In the future your suggestion does make sense... But for now a fuzzy search on the current database is acceptable.

Now you bring up another point however, a bit unrelated to the question.

The USPS would standardize the database and make it...

123 N Fake St...

And i have a guy search for 123 Fake St... and I would send that to get standarized...

How would the USPS system resolve the two above to be equal?

Would I then take the searched address and see if USPS has it on file? Because it would not match my now standardized database. Or would I take each search and get USPS to tell me the correct way to represent the search and then match it to my database?

-Robert

|||

If there was only a N Fake Street, the return is a standardized address.

If there is both a N and S Fake St, usually 123 would be one North (or South), and 125 would be the other. Rarely would exact same numbering scheme be both North and South. Or the ZIP will be used to determine which is correct.

Again, if address searching is mission critical, you MUST deconstruct

|||I agree with Arnie.

The way the USPS standardization service will return a standard address in the form:
House Number, Direction, Street Name, Modifier (DR, ST, AVE, etc), and city/st/zip, etc

The only way to do any kind of search is you have to know if you are looking for N or S or ST or DR or Ave or whatever. Then you prompt the user for 4 fields, which can have blanks representing "any". Then you can use SOUNDEX on the street name to find something similar.

|||

I spent quite a bit of time for several clients working out the name/address matching possibilities for very large databases. Here is a summary of the super-condensed executive briefing abstract. Wink

None of the available 'free' options can hold a candle to some of the third party products (a couple of which are 'Homeland Securty certified' and often paid for by Homeland Security for governmental requirements). But the cost is high. Obviously, some of these folks have made a large investment in working out name search algorythms, and they rightfully expect a 'good' ROI -and with HS's backing, they are getting it.

Soundex was created to solve a Census Bureau problem at the 20th century, and is biased toward northern european names (primary immigrant influx at the time.) That bias still exists and soundex does NOT handle asian, eastern european, and arabic names worth a crap.

Double Metaphone and NYSIIS are a 'recent' alternatives to soundex (there is documentation and SQL functions available on the 'interweb'.) Double Metaphone and NYSIIS are more robust than soundex(), doing a 'plausible' job handling eastern european, asian, and arabic names.

If you are attempting to 'roll you own' name/address search algorythms, DON"T! Explore using Double Metaphone instead. Don't use soundex() -you will be disappointed.

The Double Metaphone code is publically available, expand on it and publish your enhancements to the larger SQL Community. For addresses, use the USPS web service to standardize the address, and deconstruct the address, storing the 'street' name in its own column.

No comments:

Post a Comment