Sunday, February 19, 2012

Full-Text Search - How to determine word offset in CONTAINS query.

Does anyone know if it is possible to determine the relative word offset (the Occ) from a simple-term query such as:-

SELECT Comments
FROM Production.ProductReview
WHERE CONTAINS(Comments, ' "mountain biking" ');

So, given the text:-

"Maybe it's just because I'm new to mountain biking, but I had a terrible time getting used to these pedals."

I would like the query would return both the text and the word offset of 8. To me, it seems like this would be quite useful as I want to highlight the found text for the user to see. Obviously I can do a post-SELECT scan of the string to find the values but this would seem unnecessary.

If anyone can give me any pointers I'd be very grateful.

Thanks

Mark,
While the word offset (8 in your example) is not directly available via a SQL FTS query, you can use PATINDEX to achieve the same effect:

The following SQL FTS query on the pubs table pub_info will return rows that match the FTS search word (books) and the near by words from 20 characters before the searched keyword(books) for a total lenght of 100 characters.

SELECT pub_id, SubString(pr_info,PatIndex ('%books%',pr_info)-20,100)
FROM pub_info
WHERE Contains(pr_info, 'books')

Highlighting the word or sometimes referred to as Hit-Highlighting is a very important psychological help to searchers looking for the keyword or phrase "in context" with the words near (before and following) the keyword/phrase. This is a FAQ in the public fulltext newsgroup that doesn't have a clean implemention yet...

Regards,
John
SQL Full Text Search Blog
http://spaces.msn.com/members/jtkane/

|||John

I suppose there must be a reason why the offset is not available - I just can't think of it at the moment. It would be good to ask the SQL Server 2005 development team for their response as to why it's not available. Probably something to do with inflectional forms and stemming, as well as the more advanced query features available through the FREETEXT predicate.

I agree the hit-highlighting is an important aid to the user - I think I'll have to implement something along the lines that you've suggested, and maybe if Microsoft provide with a mechanism in the future then retro-fit it.

Many thanks.

Regards
Mark|||

Mark,
I've waited until a SQL Server 2005 development team member would reply to this posting, but they are most likely busy getting Yukon ready to ship... In the meantime, I can only speculate why the offsets are not available... Most likely this is due to either disk space considerations (related to performance) or functionality as offsets are normally used to help determine the "nearness" of one token (word or term) to another. The "nearness" in SQL Server 2000 and SQL Server 2005 is hardcoded to approx. 50 words / tokens / terms. Nearness is also affected by noise (stop) words as well as end-of-sentence and end-of-paragraph characters. I know for a fact (direct experiene) that at least one potential SQL Server (2000 and 2005) FTS customer decided to go with a non-Microsoft solution solely based upon the lack of functionality and lack of customization of the NEAR functionality, I'm sad to say... Note, they didn't implement ORACLE or IBM DB2, but stayed with SQL Server as their appliation was written to SQL Server, it was just the FTS functionality that was not sufficient Sad.

As for hit-highlighting (or highlighting of search keywords or phrases), I too agree that this is an important functionality that is missing in SQL Server 2000 and 2005 Full-text Search (SQL FTS) and hopefully this will be implemented in a post-Yukon version of SQL Server. Note, this feature would have to be newly implemented and most likely Microsoft will not be able to retro-fit it back to SQL Server 2000 or SQL Server 2005, I'm sad to say...

In retrospect, both of these issues are well known to the Microsoft SQL Server 2005 FTS development team, but the overall performance of SQL FTS (greatly improved!) had to take priority and in the long term, there is only so much that can be done with a software server product as large as SQL Server 2005 and still keep it stable, reliable, performant and meet the RTM shipping deadline! My hat is off to the SQL FTS Dev team!

Thanks,
John
SQL Full Text Search Blog
http://spaces.msn.com/members/jtkane/

|||

John

We've come to the same conclusion that they have other priorities - and we can't blame them. FTS is so close to what we need/want, but falls at the final hurdle for our particular requirements. We will continue to use SQL Server 2005 for all other data management because it's a superb platform, and we will use it as the basis for engineering our own specific FTS tool.

We will, of course, keep abreast of all new FTS features in future releases with much anticipation.

Many thanks for your feedback.

Regards
Mark

Feedback-Ferret?
www.metriq.co.uk

No comments:

Post a Comment