CRM114 (program)

CRM114 (full name: "The CRM114 Discriminator") is a program that classifies data using a statistical approach, and it is used especially for filtering email spam. Whereas other statistical Bayesian filters rely on the frequency of single words in an email, CRM114 achieves a higher rate of spam recognition by generating hits from phrases up to five words in length. These phrases are used to form a Markov random field representing the incoming texts. With this additional contextual recognition, it is one of the more accurate spam filters available. The author claims recognition rates as high as 99.87% in some situations; Holden [http://web.archive.org/web/20050307062526/http://sam.holden.id.au/writings/spam2/ "Spam Filtering II"] and the TREC 2005 [http://trec.nist.gov/pubs/trec14/papers/SPAM.OVERVIEW.pdf "Spam Track Overview" (2005)] and TREC 2006 [http://trec.nist.gov/pubs/trec15/papers/SPAM06.OVERVIEW.pdf "Spam Track Overview" (2006)] evaluations gave results of better than 99%, with significant variation depending on the particular corpus. CRM114's classifier can also be switched to use Littlestone's Winnow algorithm, character-by-character correlation, a variant of k-nearest neighbor (KNN) classification called Hyperspace, a bit-entropic classifier that uses entropy encoding to determine similarity, and other more experimental classifiers.
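
The contrast between single-word and phrase-based filtering can be illustrated with a short Python sketch. It is a simplified illustration, not CRM114's actual feature generator or classifier: it assumes that features are plain contiguous phrases of one to five words and scores them with an ordinary naive Bayes rule, Laplace smoothing, and equal class priors. The names `phrase_features` and `PhraseBayes` are hypothetical.

```python
import math
from collections import Counter


def phrase_features(text, max_len=5):
    """Return every contiguous phrase of 1 to max_len words in text."""
    words = text.lower().split()
    feats = []
    for end in range(1, len(words) + 1):
        for n in range(1, min(max_len, end) + 1):
            feats.append(" ".join(words[end - n:end]))
    return feats


class PhraseBayes:
    """Toy two-class naive Bayes filter over phrase features (illustrative only)."""

    def __init__(self, max_len=5):
        self.max_len = max_len
        self.counts = {"spam": Counter(), "good": Counter()}
        self.totals = {"spam": 0, "good": 0}

    def learn(self, label, text):
        """Add the phrase counts of one labelled message to the model."""
        feats = phrase_features(text, self.max_len)
        self.counts[label].update(feats)
        self.totals[label] += len(feats)

    def classify(self, text):
        """Return the label with the higher smoothed log-likelihood."""
        vocab = len(set(self.counts["spam"]) | set(self.counts["good"])) + 1
        scores = {}
        for label in self.counts:
            score = 0.0
            for feat in phrase_features(text, self.max_len):
                # Laplace smoothing so unseen phrases do not zero the score
                p = (self.counts[label][feat] + 1) / (self.totals[label] + vocab)
                score += math.log(p)
            scores[label] = score
        return max(scores, key=scores.get)
```

With `max_len=1` the sketch reduces to a single-word filter; larger values let a phrase such as "cheap pills" count as evidence in its own right, in addition to the words "cheap" and "pills" taken separately.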

As pattern recognition software, CRM114 is a good example of machine learning accomplished with a reasonably simple algorithm. Its source code, written in C and licensed under the GPL, is available from the home page listed under External links.

At a deeper level, CRM114 is also a string pattern matching language, similar to grep or even Perl. Although it is Turing complete, it is highly tuned for matching text; even a simple recursive definition of the factorial takes almost ten lines and can look confusing to the uninitiated. Part of the reason is that the CRM114 language syntax is not positional but declensional. As a programming language, it can be used for many applications other than detecting spam. CRM114 uses the TRE approximate-match regex engine, so it is possible to write programs that do not depend on strings matching exactly in order to function correctly.
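
The idea of approximate matching can be sketched in Python as follows. This is neither CRM114 syntax nor the TRE API; it is a minimal illustration of matching a literal pattern within a bounded number of edits, using Sellers' dynamic-programming variant of Levenshtein distance. The helper name `approx_find` and its parameters are hypothetical.

```python
def approx_find(pattern, text, max_errors):
    """Return True if some substring of text is within max_errors edits of pattern.

    An edit is one inserted, deleted, or substituted character. A match
    may start at any position in text at no cost.
    """
    m = len(pattern)
    # prev[i] = fewest edits to match pattern[:i] against a substring
    # ending at the current text position (initially: before any text)
    prev = list(range(m + 1))
    if prev[m] <= max_errors:
        return True
    for ch in text:
        curr = [0]  # an empty pattern prefix matches anywhere at zero cost
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            curr.append(min(prev[i - 1] + cost,  # match or substitution
                            prev[i] + 1,         # extra character in text
                            curr[i - 1] + 1))    # pattern character missing from text
        if curr[m] <= max_errors:
            return True
        prev = curr
    return False
```

For example, `approx_find("discriminator", "the CRM-114 discrimnator unit", 1)` succeeds because the misspelled word is one missing character away from the pattern, whereas the same call with `max_errors=0` fails.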

Origin of the name

The name CRM114 is taken from the Stanley Kubrick movie "Dr. Strangelove" (1964). The piece of radio equipment on board the B-52, which is designed not to receive messages lacking a specific code prefix, is designated the "CRM-114 Discriminator." Kubrick used this designation in other films as well.

External links

* [http://crm114.sourceforge.net The CRM114 home page on SourceForge]
* [http://laurikari.net/tre The TRE approximate regex matcher homepage]


Wikimedia Foundation. 2010.
