Proximity search (text)

In text processing, a proximity search looks for documents in which two or more separately matching term occurrences lie within a specified distance of one another, where distance is measured as the number of intermediate words or characters. Some implementations also constrain word order, requiring the terms to appear in the searched text in the same order as in the query. Because it adds a proximity constraint to simple word matching, proximity searching is generally regarded as a form of advanced search.

For example, a proximity search for "red brick house" could also match phrases such as "red house of brick" or "house made of red brick". Limiting the proximity allows these phrases to match while excluding documents in which the words are scattered across a page or appear in unrelated articles of an anthology.
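The definition above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular search engine works: the names `tokenize` and `within_proximity`, the regex tokenizer, and the `max_gap` and `ordered` parameters are all assumptions made for this example. Distance is counted here as the number of words inside the matching window that are not themselves one of the matched terms.

```python
import re
from itertools import product

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def within_proximity(text, terms, max_gap, ordered=False):
    """Return True if every term occurs inside one window that contains
    at most `max_gap` words other than the matched occurrences."""
    tokens = tokenize(text)
    terms = [t.lower() for t in terms]
    # Positions at which each query term occurs in the text.
    positions = [[i for i, tok in enumerate(tokens) if tok == t] for t in terms]
    if any(not p for p in positions):
        return False  # some term does not occur at all
    # Brute force over one occurrence per term; acceptable for short texts.
    for combo in product(*positions):
        if len(set(combo)) < len(combo):
            continue  # the same token cannot serve two query terms
        if ordered and list(combo) != sorted(combo):
            continue  # occurrences are not in query order
        intermediate = max(combo) - min(combo) + 1 - len(combo)
        if intermediate <= max_gap:
            return True
    return False

query = ["red", "brick", "house"]
print(within_proximity("a red house of brick", query, max_gap=2))
print(within_proximity("house made of red brick", query, max_gap=2))
print(within_proximity("house made of red brick", query, max_gap=2, ordered=True))
```

With `max_gap=2` both example phrases match in the unordered case, while the ordered variant rejects "house made of red brick" because the words do not appear in query order.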

Rationale

The basic linguistic assumption is that the proximity of words in a document implies a relationship between them. Because authors try to formulate sentences that contain a single idea, and to cluster related ideas in neighboring sentences or paragraphs, there is an inherently high probability that words used close together are related. Conversely, when two words are at opposite ends of a book, the probability of a relationship between them is relatively low. By limiting results to matches in which the words fall within a specified maximum distance, the results are assumed to be more relevant than matches in which the words are scattered.

Commercial Internet search engines tend to return too many matches for the average search query (high recall at the cost of precision). Proximity searching is one way to reduce the number of matched pages and to improve the relevance of those pages by using word proximity to assist in ranking. As an added benefit, proximity searching helps combat spamdexing, because it avoids webpages that contain dictionary lists or shotgun lists of thousands of words, which would otherwise rank highly in search engines that rely heavily on word frequency for ranking.
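One way to picture proximity-assisted ranking is to score each document by the shortest window of text that contains all query terms and to list tighter windows first. The sketch below is only an illustration under that assumption; `min_window` and `rank_by_proximity` are hypothetical names, and real engines combine many more signals.

```python
import re

def min_window(tokens, terms):
    """Length in tokens of the shortest window containing every term,
    or None if some term is missing."""
    needed = set(terms)
    counts = {}   # occurrences of needed terms inside the current window
    best = None
    left = 0
    for right, tok in enumerate(tokens):
        if tok in needed:
            counts[tok] = counts.get(tok, 0) + 1
        # Shrink from the left while the window still covers all terms.
        while len(counts) == len(needed):
            width = right - left + 1
            if best is None or width < best:
                best = width
            drop = tokens[left]
            if drop in counts:
                counts[drop] -= 1
                if counts[drop] == 0:
                    del counts[drop]
            left += 1
    return best

def rank_by_proximity(docs, terms):
    """Return the documents containing all terms, tightest window first."""
    terms = [t.lower() for t in terms]
    scored = []
    for doc in docs:
        width = min_window(re.findall(r"\w+", doc.lower()), terms)
        if width is not None:
            scored.append((width, doc))
    scored.sort(key=lambda pair: pair[0])
    return [doc for _, doc in scored]

docs = [
    "brick wall beside an old farm house",
    "a red brick house",
    "brick only, nothing else",
]
print(rank_by_proximity(docs, ["brick", "house"]))
```

A page that merely lists thousands of unrelated words would tend to produce wide windows under this scoring, so it would sink in the ranking rather than rise.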

Boolean Syntax and Operators

A proximity search can require that only some of the keywords lie within a specified distance. It can also be combined with other search syntax and controls to allow more precise queries. Query operators such as NEAR, NOT NEAR, FOLLOWED BY, NOT FOLLOWED BY, SENTENCE or FAR are sometimes used to indicate a proximity limit between specified keywords, as in "brick NEAR house".

Usage in Commercial Search Engines

Google allows ordered proximity searching, using one asterisk (*) to span each 2 intervening words: "brick *** house" OR "house *** brick" matches up to 6 intervening words in either order (as of October 2006).

Implicit (automatic) versus explicit proximity search: as of November 2006, most Internet search engines, with the exception of Exalead and Yahoo!, implement only implicit proximity search. That is, they automatically rank results higher when the user's keywords have a good "overall proximity score" in them. If the query contains only two keywords, this is no different from an explicit proximity search that places a NEAR operator between the two keywords. With three or more keywords, however, it is often important for the user to specify which subsets of the keywords should occur close together. This is useful in a prior art search, for example when looking for an existing approach to a specific task, or for a document that discloses a system whose procedural behavior is carried out collaboratively by several components and the links between them.

For example, in a search query in the form of: (keyword1 NEAR keyword2) (keyword1 NEAR keyword3), the query specifies that keyword1 and keyword2 must co-occur closely somewhere in a document, and so must keyword1 and keyword3. However, keyword2 and keyword3 need not occur closely anywhere in the document.
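A query of this shape can be evaluated as a conjunction of pairwise NEAR checks. The following is a minimal sketch under assumptions of its own: the helper names `near` and `matches_all`, the default `max_gap` of 3 words, and the sample text are all hypothetical and are not taken from any search engine.

```python
import re

def near(tokens, a, b, max_gap):
    """True if some occurrence of `a` and some occurrence of `b` have
    at most `max_gap` words between them, in either order."""
    pos_a = [i for i, t in enumerate(tokens) if t == a]
    pos_b = [i for i, t in enumerate(tokens) if t == b]
    return any(abs(i - j) - 1 <= max_gap
               for i in pos_a for j in pos_b if i != j)

def matches_all(text, clauses, max_gap=3):
    """True if every (a, b) clause is satisfied somewhere in the text."""
    tokens = re.findall(r"\w+", text.lower())
    return all(near(tokens, a.lower(), b.lower(), max_gap) for a, b in clauses)

text = ("The pump drives the valve. Later chapters cover unrelated "
        "topics at length. A sensor monitors the pump.")

# (pump NEAR valve) (pump NEAR sensor): each pair is close somewhere.
print(matches_all(text, [("pump", "valve"), ("pump", "sensor")]))
# Additionally requiring (valve NEAR sensor) fails for this text.
print(matches_all(text, [("pump", "valve"), ("pump", "sensor"),
                         ("valve", "sensor")]))
```

Each clause is checked independently, so the two clauses may be satisfied by different occurrences of keyword1, which is exactly what lets keyword2 and keyword3 stay far apart.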

Exalead allows the user to specify the required proximity, as the maximum number of words between keywords. The syntax is (keyword1 NEAR/n keyword2) where n is the number of words. When using the Walhello search-engine, the proximity can be defined by the number of characters between the keywords.
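To make the two distance units concrete, the sketch below parses a query written in the (keyword1 NEAR/n keyword2) form and, separately, checks a character-based distance. It is a hypothetical illustration only: `parse_near` and `char_gap_match` are invented names, and nothing here reflects how Exalead or Walhello actually implement their operators.

```python
import re

# Accepts "a NEAR/n b" with optional surrounding parentheses.
NEAR_PATTERN = re.compile(r"^\(?\s*(\w+)\s+NEAR/(\d+)\s+(\w+)\s*\)?$")

def parse_near(query):
    """Split '(a NEAR/n b)' into (a, n, b); raise ValueError otherwise."""
    m = NEAR_PATTERN.match(query.strip())
    if m is None:
        raise ValueError(f"not a NEAR/n query: {query!r}")
    return m.group(1), int(m.group(2)), m.group(3)

def word_spans(text, word):
    """(start, end) character offsets of each whole-word occurrence."""
    pattern = rf"\b{re.escape(word)}\b"
    return [m.span() for m in re.finditer(pattern, text, re.IGNORECASE)]

def char_gap_match(text, a, b, max_chars):
    """True if `a` and `b` occur with at most `max_chars` characters
    between the end of one and the start of the other."""
    for start_a, end_a in word_spans(text, a):
        for start_b, end_b in word_spans(text, b):
            gap = start_b - end_a if start_b >= end_a else start_a - end_b
            if 0 <= gap <= max_chars:
                return True
    return False

print(parse_near("(brick NEAR/3 house)"))
print(char_gap_match("house of red brick", "brick", "house", max_chars=8))
```

A character-based limit treats "house of red brick" differently from a word-based one: the two intervening words occupy eight characters including spaces, so the outcome depends on word length as well as word count.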

Proximity search within the Google and Yahoo! search engines is possible using full-word wildcards: the wildcard is an asterisk "*" in Google, and an "a" in Yahoo! Search.

Google asterisk: using Google's asterisk-in-quotations approach to emulate a NEAR operator is a little cumbersome but does work (as of October 2006). For example, to require a close co-occurrence of "house" and "dog" (at most 2 words apart), the following search expression could be used:

    "house * dog" OR "dog * house"    (search for house/dog up to 2 words apart)

Note the operator "OR" must be in capital letters. One asterisk allows a proximity of at most two words' distance between two search-words. To span 6 intervening words, use 3 asterisks:

    "house *** dog" OR "dog *** house"    (search for house/dog up to 6 words apart)

To span up to 8 intervening words in a Google search, use 4 asterisks, etc.
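The rule of one asterisk per two intervening words can be turned into a small query builder. This is a sketch that only formats the search string according to the behavior described above (reported as of October 2006, and not guaranteed to hold today); the function name `asterisk_query` is an assumption made for this example.

```python
import math

def asterisk_query(word_a, word_b, max_words):
    """Build an either-order quoted query using one asterisk for every
    two intervening words, rounding up."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    stars = "*" * math.ceil(max_words / 2)
    return f'"{word_a} {stars} {word_b}" OR "{word_b} {stars} {word_a}"'

print(asterisk_query("house", "dog", 2))
print(asterisk_query("house", "dog", 6))
```

Because each asterisk covers up to two words, an odd limit such as 5 is rounded up to 3 asterisks and therefore also admits matches 6 words apart.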

See also

* Edit distance
* Semantic proximity
* Information retrieval
* Search engine
* Indexing - how texts are indexed to support proximity search


Wikimedia Foundation. 2010.
