SQL Server Full Text Search

SQL Server Full Text Search

SQL Server Full Text Search is an inexact string matching technology for SQL Server. It is a powerful and fast way of referencing the contents of almost any character-based column on SQL Server 2000, SQL Server 2005, and SQL Server 2008 .

Full text indexes must be populated and are stored inside full-text catalogues.Searching is installed as a separate service called MSSearch, an optional component of SQL Server 7.0/2000/2005.

Types of Searches

* Proximity based searches – cf. Section
* Inflectional based searches – cf. Section
* Weight based searches – cf. Section
* Prefix term searches – cf. Section
* Binary document search – cf. Section

A Common Example

The following article contains SQL code and simple worked examples. These examples refer to an imaginary table called "Athletes". This table contains information about various athletes their events and results in the Olympic Games. Inside this table you will find the following column names:

* id
* name
* surname
* country
* notes

There exists a prior full-text index on the column "notes". The primary key of the table is on the column "id"

Full Text Query Syntax

T-SQL supports two predicates and two functions that can be used to perform full-text searches: CONTAINS, CONTAINSTABLE, FREETEXT, and FREETEXTTABLE. The functionality of these predicates and functions are listed in the following table:

The rank is a number from 0 to 1000 that indicates the relevance of the search result. The higher the value the more relevant the match. Although rank is always calculated internally when a full-text search is performed, it is only exposed through the use of the full-text search functions. A variety of factors are considered in the various ranking algorithms used, including:

* Document length - longer documents are generally considered more relevant
* Number of occurrences of search words/phrases - the more times the search words/phrases occur in the document, the higher the ranking
* Proximity of search words/phrases in proximity search - proximity is factored into rank when the NEAR proximity search operator is used
* User-defined weights - user-defined weights are factored into rank when the ISABOUT weighted search operator is used

The algorithms used to calculate rank include Jaccard for weighted searches and OKAPI BM25 for FREETEXT searches.

Proximity Searches

Full-text searching has the ability to find words and phrases that are located in close proximity to one another. This is achieved by using the keyword "NEAR" (or the operator "~") between two given search words or phrases.

In SQL Server 2008 words are considered to be "near" when they are within 50 words of one another; in prior versions of SQL Server search words or phrases were considered "near" one another with slightly over 1,100 intervening words. Note that paragraph breaks and other non-space whitespace can affect the calculation of the number of intervening words, though the specifics have not been published by Microsoft. The number of intervening words between phrases is factored into the rank of documents containing both of the search words or phrases. The more intervening words between the two search terms, the lower the total rank. There is currently no method for specifying a user-defined proximity (i.e., "10 words apart", "2 paragraphs apart", etc.)

SQL Server uses a vector-space search algorithm to perform proximity searches. The order in which the phrases appear within the search document are irrelevant to a proximity search. In a SQL Server 2008 proximity search predicate (CONTAINS, FREETEXT), the NEAR operator is mapped internally to the AND operator. The results are the same as if you simply used the AND operator. To take advantage of the proximity search you have to use the CONTAINSTABLE or FREETEXTTABLE functions, both of which allow you to access the internally-generated RANK.

Proximity Search Simple Example #1 using the "NEAR" keyword

SELECT e.name, e.surnameFROM CONTAINSTABLE(Athletes, *, '("Gold" NEAR "Finalists")') As ctJOIN Athletes As eON ct. [KEY] = e.idWHERE ct. [RANK] > 0ORDER BY ct. [RANK] ;

Proximity Search Simple Example #2 using the "~" operator

SELECT e.name, e.surnameFROM CONTAINSTABLE(Athletes, *, '("Gold" ~ "First") OR ("Silver" ~ "Second") OR ("Bronze" ~ "Third")') As ctJOIN Athletes As eON ct. [KEY] = e.idWHERE ct. [RANK] > 0ORDER BY ct. [RANK] ;

Inflectional Searches

Full-text searching has the ability to search for word inflections, that is, for different forms of the same word. Inflectional forms of words are generated by a language-specific components known as "stemmers" at query time. The U.S. English language stemmer (LCID 1033) is a different component from the Spanish language stemmer (LCID 3082), for instance.

The stemmer relies on language-specific rules and dictionary lookups to generate inflectional word forms including gender and neutral verb conjugations, plural and singular nouns, adjective forms, and verb conjugations. As an example, when given the verb "swim", SQL Server generates the inflectional forms "swim", "swims", "swimming", "swam", and "swum" when using the U.S. English stemmer.

The FREETEXT predicate and FREETEXTTABLE function automatically generate inflectional forms for search words. The CONTAINS predicate and CONTAINSTABLE functions, however, require you to use the FORMSOF(INFLECTIONAL, ...) operator in your query string to generate inflectional forms.

Prefix Term Searches

Full-text searching has the ability to search for word beginning with a certain term, or prefix. This is achieved by following your search term with a wild-card asterisk (*) symbol. In order to indicate to SQL Server that your search term is in fact a prefix search term, you must enclose it in double quotes. In a multi-word quoted prefix search phrase all of the words are treated as prefix terms, even if the wildcard * character follows only one word in the phrase. Note that when prefix searches are used SQL Server does not generate inflectional forms or perform thesaurus replacements and expansions of the words in the search phrase.

Prefix Term Simple Example #1

-- This sample matches all documents containing a word that begins with the prefix "al" SELECT name, surnameFROM Athletes aWHERE CONTAINS(*, '"al*"')

Prefix Term Simple Example #2

-- This sample matches all documents containing two words, one beginning with the prefix "al" -- and another beginning with the prefix "anon". Matches would include "al anon", -- "alcoholics anonymous", "alan anondez", etc. Notice that the trailing * is only needed-- for one prefix in the search phrase.SELECT name, surnameFROM Athletes aWHERE CONTAINS(*, '"al* anon"')

Non-Plain-Text Document Search

Full-text searching has the ability to search through binary data. Some of the supported types are:

* HTML (*.htm or *.html)
* XML (*.xml)
* MS Word (*.doc)
* MS Excel (*.xls)
* MS PowerPoint (*.ppt)
* Adobe Acrobat (*.pdf)

Some of these formats may require installing additional ifilter software on the machine and then issuing these commands to enable them to get loaded.

sp_fulltext_service 'verify_signature', 0sp_fulltext_service 'load_os_resources',1.

More formats can be added by installing additional ifilters and new ifilters can be developed to handle others.

Weighted Searches

Full-text searching has the ability to associate a higher preference for certain terms over others.This is achieved by using the function: “ISABOUT” and the “WEIGHT” keyword. The keyword is associated with any decimal value between 0.0 and 1.0. The higher the weight the greater preference the match will be given

Weighted Searches Simple Example #1

SELECT name, surnameFROM Athletes aJOIN CONTAINSTABLE(Athletes, notes, 'ISABOUT(Bronze WEIGHT(.2), Silver WEIGHT(.4), Gold WEIGHT(.8))') ctON a.id = ct. [KEY] ORDER BY Rank desc

Noise Words

These are common words that MS Search feels should be ignored when considering matches.

For example the article words like “the”, “an” and the conjunctions like: “and”, “but” and “or” occur so frequently that to include them in a search would result in too many false positives.

Hardware Optimization Considerations

By default, MS SQL Server reserves all the virtual memory in its process space, minus 1GB. This leaves little for MSSearch. Consider readjusting to a ratio of 8(SQL Server):2(MSSearch).By default the Windows 2000/NT virtual memory is 1.5 times physical memory. Consider increasing this to 3 times physical memory.


Wikimedia Foundation. 2010.

Игры ⚽ Нужно сделать НИР?

Look at other dictionaries:

  • Microsoft SQL Server — Developer(s) Microsoft Stable release SQL Server 2008 R2 (10.50.2500.0 Service Pack 1) / July 11, 2011; 4 months ago …   Wikipedia

  • SQL Server — Microsoft SQL Server SQL Server Développeur Microsoft Dernière …   Wikipédia en Français

  • SQL Server 2000 — Microsoft SQL Server SQL Server Développeur Microsoft Dernière …   Wikipédia en Français

  • SQL Server Express — Microsoft SQL Server Express, a freely downloadable and distributable version of Microsoft s SQL Server relational database management system, comprises a database specifically targeted for embedded and smaller scale applications. Contents 1… …   Wikipedia

  • Microsoft SQL Server — SQL Server Développeur Microsoft Dernière version 2008 ( …   Wikipédia en Français

  • SQL Anywhere — is an RDBMS product from Sybase iAnywhere. iAnywhere is a subsidiary of Sybase. Features * Database files are operating system independent. This means that they can be copied between supported platforms. * Can be run on Windows, Windows CE,… …   Wikipedia

  • SQL injection — A SQL injection is often used to attack the security of a website by inputting SQL statements in a web form to get a badly designed website in order to dump the database content to the attacker. SQL injection is a code injection technique that… …   Wikipedia

  • Windows Server 2003 — Part of the Microsoft Windows family …   Wikipedia

  • Microsoft Search Server — (MSS) is an enterprise search platform from Microsoft, based on the search capabilities of Microsoft Office SharePoint Server.[1] MSS shares its architectural underpinnings with the Windows Search platform for both the querying engine as well as… …   Wikipedia

  • Adaptive Server Enterprise — (ASE) es el motor de bases de datos (RDBMS) insignia de la compañía Sybase. ASE es un sistema de gestión de datos, altamente escalable, de alto rendimiento, con soporte a grandes volúmenes de datos, transacciones y usuarios, y de bajo costo, que… …   Wikipedia Español

Share the article and excerpts

Direct link
Do a right-click on the link above
and select “Copy Link”