- SQL Server Full Text Search
SQL Server Full Text Search is an inexact
string matching technology forSQL Server . It is a powerful and fast way of referencing the contents of almost any character-based column onSQL Server 2000 ,SQL Server 2005 , andSQL Server 2008 .Full text indexes must be populated and are stored inside full-text catalogues.Searching is installed as a separate service called MSSearch, an optional component of SQL Server 7.0/2000/2005.
Types of Searches
* Proximity based searches – cf. Section
* Inflectional based searches – cf. Section
* Weight based searches – cf. Section
* Prefix term searches – cf. Section
* Binary document search – cf. SectionA Common Example
The following article contains SQL code and simple worked examples. These examples refer to an imaginary table called "Athletes". This table contains information about various athletes their events and results in the Olympic Games. Inside this table you will find the following column names:
* id
* name
* surname
* country
* notesThere exists a prior full-text index on the column "notes". The primary key of the table is on the column "id"
Full Text Query Syntax
T-SQL supports two predicates and two functions that can be used to perform full-text searches: CONTAINS, CONTAINSTABLE, FREETEXT, and FREETEXTTABLE. The functionality of these predicates and functions are listed in the following table:
The rank is a number from 0 to 1000 that indicates the relevance of the search result. The higher the value the more relevant the match. Although rank is always calculated internally when a full-text search is performed, it is only exposed through the use of the full-text search functions. A variety of factors are considered in the various ranking algorithms used, including:
* Document length - longer documents are generally considered more relevant
* Number of occurrences of search words/phrases - the more times the search words/phrases occur in the document, the higher the ranking
* Proximity of search words/phrases in proximity search - proximity is factored into rank when the NEAR proximity search operator is used
* User-defined weights - user-defined weights are factored into rank when the ISABOUT weighted search operator is usedThe algorithms used to calculate rank include Jaccard for weighted searches and OKAPI BM25 for FREETEXT searches.
Proximity Searches
Full-text searching has the ability to find words and phrases that are located in close proximity to one another. This is achieved by using the keyword "NEAR" (or the operator "~") between two given search words or phrases.
In SQL Server 2008 words are considered to be "near" when they are within 50 words of one another; in prior versions of SQL Server search words or phrases were considered "near" one another with slightly over 1,100 intervening words. Note that paragraph breaks and other non-space whitespace can affect the calculation of the number of intervening words, though the specifics have not been published by Microsoft. The number of intervening words between phrases is factored into the rank of documents containing both of the search words or phrases. The more intervening words between the two search terms, the lower the total rank. There is currently no method for specifying a user-defined proximity (i.e., "10 words apart", "2 paragraphs apart", etc.)
SQL Server uses a vector-space search algorithm to perform proximity searches. The order in which the phrases appear within the search document are irrelevant to a proximity search. In a SQL Server 2008 proximity search predicate (CONTAINS, FREETEXT), the NEAR operator is mapped internally to the AND operator. The results are the same as if you simply used the AND operator. To take advantage of the proximity search you have to use the CONTAINSTABLE or FREETEXTTABLE functions, both of which allow you to access the internally-generated RANK.
Proximity Search Simple Example #1 using the "NEAR" keyword
SELECT e.name, e.surnameFROM CONTAINSTABLE(Athletes, *, '("Gold" NEAR "Finalists")') As ctJOIN Athletes As eON ct. [KEY] = e.idWHERE ct. [RANK] > 0ORDER BY ct. [RANK] ;
Proximity Search Simple Example #2 using the "~" operator
SELECT e.name, e.surnameFROM CONTAINSTABLE(Athletes, *, '("Gold" ~ "First") OR ("Silver" ~ "Second") OR ("Bronze" ~ "Third")') As ctJOIN Athletes As eON ct. [KEY] = e.idWHERE ct. [RANK] > 0ORDER BY ct. [RANK] ;
Inflectional Searches
Full-text searching has the ability to search for word inflections, that is, for different forms of the same word. Inflectional forms of words are generated by a language-specific components known as "stemmers" at query time. The U.S. English language stemmer (LCID 1033) is a different component from the Spanish language stemmer (LCID 3082), for instance.
The stemmer relies on language-specific rules and dictionary lookups to generate inflectional word forms including gender and neutral verb conjugations, plural and singular nouns, adjective forms, and verb conjugations. As an example, when given the verb "swim", SQL Server generates the inflectional forms "swim", "swims", "swimming", "swam", and "swum" when using the U.S. English stemmer.
The FREETEXT predicate and FREETEXTTABLE function automatically generate inflectional forms for search words. The CONTAINS predicate and CONTAINSTABLE functions, however, require you to use the FORMSOF(INFLECTIONAL, ...) operator in your query string to generate inflectional forms.
Prefix Term Searches
Full-text searching has the ability to search for word beginning with a certain term, or prefix. This is achieved by following your search term with a wild-card asterisk (*) symbol. In order to indicate to SQL Server that your search term is in fact a prefix search term, you must enclose it in double quotes. In a multi-word quoted prefix search phrase all of the words are treated as prefix terms, even if the wildcard * character follows only one word in the phrase. Note that when prefix searches are used SQL Server does not generate inflectional forms or perform thesaurus replacements and expansions of the words in the search phrase.
Prefix Term Simple Example #1
-- This sample matches all documents containing a word that begins with the prefix "al" SELECT name, surnameFROM Athletes aWHERE CONTAINS(*, '"al*"')Prefix Term Simple Example #2
-- This sample matches all documents containing two words, one beginning with the prefix "al" -- and another beginning with the prefix "anon". Matches would include "al anon", -- "alcoholics anonymous", "alan anondez", etc. Notice that the trailing * is only needed-- for one prefix in the search phrase.SELECT name, surnameFROM Athletes aWHERE CONTAINS(*, '"al* anon"')Non-Plain-Text Document Search
Full-text searching has the ability to search through binary data. Some of the supported types are:
* HTML (*.htm or *.html)
* XML (*.xml)
* MS Word (*.doc)
* MS Excel (*.xls)
* MS PowerPoint (*.ppt)
* Adobe Acrobat (*.pdf)Some of these formats may require installing additional ifilter software on the machine and then issuing these commands to enable them to get loaded.
sp_fulltext_service 'verify_signature', 0sp_fulltext_service 'load_os_resources',1.More formats can be added by installing additional ifilters and new ifilters can be developed to handle others.
Weighted Searches
Full-text searching has the ability to associate a higher preference for certain terms over others.This is achieved by using the function: “ISABOUT” and the “WEIGHT” keyword. The keyword is associated with any decimal value between 0.0 and 1.0. The higher the weight the greater preference the match will be given
Weighted Searches Simple Example #1
Noise Words
These are common words that MS Search feels should be ignored when considering matches.
For example the article words like “the”, “an” and the
conjunctions like: “and”, “but” and “or” occur so frequently that to include them in a search would result in too many false positives.Hardware Optimization Considerations
By default, MS SQL Server reserves all the
virtual memory in its process space, minus 1GB. This leaves little for MSSearch. Consider readjusting to a ratio of 8(SQL Server):2(MSSearch).By default the Windows 2000/NTvirtual memory is 1.5 timesphysical memory . Consider increasing this to 3 times physical memory.
Wikimedia Foundation. 2010.