Boyer–Moore–Horspool algorithm

Boyer–Moore–Horspool algorithm

In computer science, the Boyer–Moore–Horspool algorithm or Horspool's algorithm is an algorithm for finding substrings in strings.

It is a simplification of the Boyer-Moore algorithm which is related to the Knuth-Morris-Pratt algorithm. The algorithm trades space for time in order to obtain an average-case complexity of O(N) on random text, although it has O(MN) in the worst case. The length of the pattern is M and the length of the search string is N.

Example implementation

Here is an example implementation of the Boyer-Moore-Horspool algorithm, written in C.

#include
#include

/* The constant UCHAR_MAX is assumed to contain the maximum * value of the input character type. Typically it's 255. * size_t is an unsigned type for representing sizes. * If your system doesn't have it, substitute with * unsigned int. */

/* Returns a pointer to the first occurrence of "needle" * within "haystack", or NULL if not found. Works like * memmem(). */const unsigned char *boyermoore_horspool_memmem(const unsigned char* haystack, size_t hlen, const unsigned char* needle, size_t nlen){ size_t scan = 0; size_t bad_char_skip [UCHAR_MAX + 1] ; /* Officially called: * bad character shift */ /* Sanity checks on the parameters */ if (nlen <= 0 || !haystack || !needle) return NULL;

/* ---- Preprocess ---- */ /* Initialize the table to default value */ /* When a character is encountered that does not occur * in the needle, we can safely skip ahead for the whole * length of the needle. */ for (scan = 0; scan <= UCHAR_MAX; scan = scan + 1) bad_char_skip [scan] = nlen;

/* C arrays have the first byte at [0] , therefore: * [nlen - 1] is the last byte of the array. */ size_t last = nlen - 1;

/* Then populate it with the analysis of the needle */ for (scan = 0; scan < last; scan = scan + 1) bad_char_skip [needle [scan] = last - scan;

/* ---- Do the matching ---- */

/* Search the haystack, while the needle can still be within it. */ while (hlen >= nlen) { /* scan from the end of the needle */ for (scan = last; haystack [scan] = needle [scan] ; scan = scan - 1) if (scan = 0) /* If the first byte matches, we've found it. */ return haystack;

/* otherwise, we need to skip some bytes and start again. Note that here we are getting the skip value based on the last byte of needle, no matter where we didn't match. So if needle is: "abcd" then we are skipping based on 'd' and that value will be 4, and for "abcdd" we again skip on 'd' but the value will be only 1. The alternative of pretending that the mismatched character was the last character is slower in the normal case (Eg. finding "abcd" in "...azcd..." gives 4 by using 'd' but only 4-2=2 using 'z'. */ hlen -= bad_char_skip [haystack [last] ; haystack += bad_char_skip [haystack [last] ; }

return NULL;}

Performance

The algorithm performs best with long needle strings, when it consistently hits a non-matching character at or near the final byte of the current position in the haystack and the final byte of the needle doesn't occur elsewhere within the needle. For instance a 32 byte needle ending in "z" searching through a 255 byte haystack which doesn't have a 'z' byte in it would take up to 224 byte comparisons.

The best case is the same as for the Boyer-Moore algorithm in big O notation, although the constant overhead of initialization and for each loop is less.

The worst case behavior happens when the bad character skip is consistently low (with the lower limit of 1 byte movement) and a large portion of the needle matches the haystack. The bad character skip is only low, on a partial match, when the final character of the needle also occurs elsewhere within the needle. With 1 byte movement happening when the same byte is in both of the last two positions.

The canonical degenerate case similar to the above "best" case is a needle of an 'a' byte followed by 31 'z' bytes in a haystack consisting of 255 'z' bytes. This will do 31 successful byte comparisons, a 1 byte comparison that fails and then move forward 1 byte. This process will repeat 223 more times (255 - 32), bringing the total byte comparisons to 7,168 (32 * 224).

The worst case is significantly higher than for the Boyer-Moore algorithm, although obviously this is hard to achieve in normal use cases. It is also worth noting that this worst case is also the worst case for the naive (but usual) memmem() algorithm, although the implementation of that tends to be significantly optimized (and is more cache friendly).

ee also

* Boyer–Moore string search algorithm

References

* [http://www-igm.univ-mlv.fr/%7Elecroq/string/node18.html Description of the algorithm]

External links

* FASMLIB contains implementation in x86 32bit assembly (procedure "mem.find")
* [http://movsd.com/bm.htm "BMH.ASM"] is an implementation of the Boyer–Moore–Horspool algorithm in x86 32bit assembly. ("BM.ASM" implements the Boyer–Moore string search algorithm).
* [http://johannburkard.de/software/stringsearch/ StringSearch – high-performance pattern matching algorithms in Java] – Implementations of many String-Matching-Algorithms in Java (BNDM, Boyer-Moore-Horspool, Boyer-Moore-Horspool-Raita, Shift-Or)
* [http://angusj.com/delphi/ Boyer-Moore-Horspool implementation for Delphi] - Non-visual components (TSearch and TFileSearch) which enable very fast data searches using the Boyer-Moore-Horspool search algorithm.


Wikimedia Foundation. 2010.

Игры ⚽ Поможем сделать НИР

Look at other dictionaries:

  • Boyer-Moore-Algorithmus — Der Boyer Moore Algorithmus ist ein String Matching Algorithmus. Der Algorithmus wird dazu genutzt, um in einem Text T einen bestimmten Teiltext (Muster M) zu finden und wurde 1977 von Robert S. Boyer und J Strother Moore entwickelt.… …   Deutsch Wikipedia

  • Algoritmo de búsqueda de cadenas Boyer-Moore — El algoritmo de búsqueda de cadenas Boyer Moore es un particularmente eficiente algoritmo de búsqueda de cadenas, y ha sido el punto de referencia estándar para la literatura de búsqueda de cadenas práctica.[1] Fue desarrollado por Bob Boyer y J… …   Wikipedia Español

  • Nigel Horspool — R. Nigel Horspool[1] is a professor of computer science at the University of Victoria. He invented the Boyer–Moore–Horspool algorithm, a fast string search algorithm adapted from the Boyer–Moore string search algorithm. Horspool is co inventor of …   Wikipedia

  • String searching algorithm — String searching algorithms, sometimes called string matching algorithms, are an important class of string algorithms that try to find a place where one or several strings (also called patterns) are found within a larger string or text. Let Σ be… …   Wikipedia

  • Algorithmic efficiency — In computer science, efficiency is used to describe properties of an algorithm relating to how much of various types of resources it consumes. Algorithmic efficiency can be thought of as analogous to engineering productivity for a repeating or… …   Wikipedia

  • List of algorithms — The following is a list of the algorithms described in Wikipedia. See also the list of data structures, list of algorithm general topics and list of terms relating to algorithms and data structures.If you intend to describe a new algorithm,… …   Wikipedia

  • Nigel — For other uses, see Nigel (disambiguation). Nigel Gender Male Origin Word/Name Niall > Njáll > Neel, Niel, Nihel > Nigellus > Nigel Region of origin Normandy, England …   Wikipedia

  • List of terms relating to algorithms and data structures — The [http://www.nist.gov/dads/ NIST Dictionary of Algorithms and Data Structures] is a reference work maintained by the U.S. National Institute of Standards and Technology. It defines a large number of terms relating to algorithms and data… …   Wikipedia

  • Список терминов, относящихся к алгоритмам и структурам данных —   Это служебный список статей, созданный для координации работ по развитию темы.   Данное предупреждение не устанавливается на информационные списки и глоссарии …   Википедия

  • Список терминов — Список терминов, относящихся к алгоритмам и структурам данных   Это сл …   Википедия

Share the article and excerpts

Direct link
Do a right-click on the link above
and select “Copy Link”