Word salad (computer science)

Word salad is a mixture of seemingly meaningful words that together signify nothing [Lavergne 2006:384]; the phrase takes its name from word salad, the common name for a symptom of schizophrenia. When applied to a physical theory, "word salad" is a derogatory description that labels the theory as senseless or utterly devoid of meaning.

In the context of computer science and linguistics, explicitly constructed word salad is a tool for demonstrating the difference between random utterance and coherent expression of thought. Software such as Dissociated Press in Emacs demonstrates how interesting-but-meaningless word salad can be built from large samples of coherent language: it constructs new, random documents that share some of the word or letter clustering properties of the language sample. These word salads pass for natural language to the inattentive eye or ear, but are clearly meaningless when read or listened to with full attention. In the 21st century, spammers have begun using word salad construction as a way to elude e-mail filtering and to attract web page indexing to spam. [Gyöngyi 2005] [Lavergne 2006]
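The clustering idea can be illustrated with a small word-level Markov chain. This is only a sketch in the spirit of Dissociated Press, not the Emacs implementation; the function names `build_chain` and `generate` and the `order` parameter are assumptions made for this example.

```python
import random
from collections import defaultdict

def build_chain(text, order=2):
    """Map each run of `order` consecutive words to the words seen right after it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        key = tuple(words[i:i + order])
        chain[key].append(words[i + order])
    return chain

def generate(chain, length=30, seed=None):
    """Walk the chain at random to produce `length` words of word salad."""
    if not chain:
        raise ValueError("sample text is too short for the chosen order")
    rng = random.Random(seed)
    keys = sorted(chain)  # sorted so a fixed seed gives a repeatable result
    state = rng.choice(keys)
    out = list(state)
    while len(out) < length:
        followers = chain.get(state)
        if not followers:
            # Dead end (the sample ended here): jump to another known word run.
            state = rng.choice(keys)
            out.extend(state)
            continue
        out.append(rng.choice(followers))
        state = tuple(out[-len(state):])
    return " ".join(out[:length])
```

Every short run of words in the output also occurs in the sample, which is why the result looks locally plausible while carrying no overall meaning. A larger `order` copies longer stretches of the sample verbatim.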


Word salad with spam e-mail

In response to the growing problem of spam e-mail, filtering tools that implemented the naive Bayes classifier became widely available starting around 2002. This method uses the probability of various words appearing in spam e-mails to classify messages automatically as spam. For a short time, it worked fairly well. In response, spammers developed word salad to fool programs that use this method of classification. [For examples see (Lavergne 2006:285) Figure 1.] By adding large amounts of random text somewhere in the message, spammers hope to confuse Bayesian classifiers into labeling it "ham e-mail" (non-spam e-mail). Typically, this text consists of random words from a dictionary.
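The effect of padding on a word-probability filter can be seen in a toy naive Bayes scorer. This is a minimal sketch under stated assumptions, not the code of any real filter: the function names, the add-one smoothing, and the choice to skip words absent from the training data are all choices made for this example.

```python
import math
from collections import Counter

def train(messages):
    """messages: iterable of (text, label) pairs, label being 'spam' or 'ham'."""
    counts = {"spam": Counter(), "ham": Counter()}
    docs = Counter()
    for text, label in messages:
        counts[label].update(text.lower().split())
        docs[label] += 1
    return counts, docs

def spam_log_odds(text, counts, docs):
    """Return log P(spam | text) - log P(ham | text); positive means 'spam'."""
    vocab = set(counts["spam"]) | set(counts["ham"])
    totals = {label: sum(c.values()) for label, c in counts.items()}
    score = math.log(docs["spam"] / docs["ham"])  # prior odds
    for word in text.lower().split():
        if word not in vocab:
            continue  # assumption: words never seen in training are ignored
        # Add-one (Laplace) smoothing keeps every probability above zero.
        p_spam = (counts["spam"][word] + 1) / (totals["spam"] + len(vocab))
        p_ham = (counts["ham"][word] + 1) / (totals["ham"] + len(vocab))
        score += math.log(p_spam) - math.log(p_ham)
    return score
```

With a model trained on a few spam and ham messages, appending words the model associates with legitimate mail lowers the score of a spam message. In this sketch, padding words the filter has never seen have no effect at all, so the padding only helps the spammer when it happens to include vocabulary from real correspondence.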

Word salad for web page spam

Gyöngyi and Garcia-Molina state this problem clearly:

"As more and more people rely on the wealth of information available online, increased exposure on the World Wide Web may yield significant financial gains for individuals or organizations. Most frequently, search engines are the entryways to the Web; that is why some people try to mislead search engines, so that their pages would rank high in search results, and thus, capture user attention." [Gyöngyi 2005]

Sentence and paragraph salad

Paragraph salad will reduce the effectiveness of any of the algorithms mentioned above and will lead to higher scores with any Bayesian filter. The only algorithms that might thwart sentence and paragraph salad would be very high-level, expensive natural language processing, some kind of artificial intelligence algorithm involving a search engine, or an exhaustive listing of spam e-mails. All of these techniques would be exceptionally expensive, and would likely not be very successful at filtering spam despite their high cost.

In a related technique, actual text from some large corpus of legitimate English (the plays of Shakespeare, other e-texts distributed by Project Gutenberg, random World Wide Web pages, Wikipedia, or the like) is added to the e-mail. This approach attempts to get around algorithms that could be devised to detect the more primitive form of word salad. [A New Breed of Spam, http://googlesystem.blogspot.com/2006/08/new-breed-of-spam.html]

Letter salad

On an even smaller scale than word salad, spammers use misspellings of words to try to thwart Bayesian filters. Misspelling Viagra as Via6ra, /|/Gr/, or in any of a number of other ways (see Leet), or even using characters from international character sets, is an attempt to avoid the high efficiency with which a Bayesian filter would classify any e-mail containing certain words as spam. A simple spell checker might significantly reduce the effectiveness of letter salad approaches, yet most present spam filters do not use one.
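A rough idea of how a spell-checker-like step could undo simple letter salad is sketched below. The substitution table `LEET`, the function name `normalize`, the watchlist, and the 0.8 similarity cutoff are all assumptions chosen for this illustration; `difflib.get_close_matches` is the Python standard library's fuzzy matcher.

```python
import difflib

# Hypothetical table of common digit and symbol stand-ins for letters.
LEET = str.maketrans({
    "0": "o", "1": "l", "3": "e", "4": "a", "5": "s",
    "6": "g", "7": "t", "@": "a", "$": "s",
})

def normalize(token, watchlist, cutoff=0.8):
    """Fold a token to lower case, undo simple substitutions, then snap it
    to the closest watchlist word if one is similar enough."""
    folded = token.lower().translate(LEET)
    match = difflib.get_close_matches(folded, watchlist, n=1, cutoff=cutoff)
    return match[0] if match else folded
```

Running each token through `normalize` before it reaches the classifier would let a filter count "Via6ra" as the word it stands for. Heavier obfuscation such as /|/Gr/ is outside what this sketch handles, and a low cutoff risks folding innocent words onto the watchlist.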

The lengths to which some spammers have gone with letter salad have often produced illegible, almost laughable messages. Reading such e-mail has become akin to deciphering complex custom license plates.

Word salad filtering

Algorithms for detecting word salad are clearly possible and not particularly difficult to implement. For the most part, they would be more computationally intensive than most rules used by spam filters as of 2006. A statistical approach based on Zipf's law of word frequency has potential for detecting simple word salad, as do grammar checking and the use of natural language processing. [Lavergne 2006:386] Statistical Markovian analysis, which tests whether short phrases are likely to occur in normal English sentences, is another statistical approach that would be effective against completely random phrasing [Lavergne 2006:386] but might be fooled by Dissociated Press techniques.
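The Markovian idea can be reduced to a very small sketch: count which adjacent word pairs occur in a reference corpus, then measure how many of a message's pairs have been seen before. The function names and the "fraction of seen pairs" score are assumptions for this illustration, not the method of the cited paper.

```python
from collections import Counter

def bigrams(words):
    """Adjacent word pairs, in order."""
    return list(zip(words, words[1:]))

def train_bigrams(corpus):
    """Count adjacent word pairs in a reference corpus of normal text."""
    return Counter(bigrams(corpus.lower().split()))

def seen_bigram_fraction(text, model):
    """Fraction of the text's adjacent word pairs that occur in the model."""
    pairs = bigrams(text.lower().split())
    if not pairs:
        return 0.0
    return sum(1 for pair in pairs if pair in model) / len(pairs)
```

Random dictionary words rarely form pairs found in real text, so they score near zero. Text generated by chaining pairs taken from real text scores high by construction, which is how Dissociated Press output could slip past a check of this kind.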

Future

As spam filters get better at detecting simple word and letter salad, spammers will likely migrate towards sentence and paragraph salad techniques. In the process of obscuring their message from improving spam filters, they will also obscure it from the potential targets of their advertising, virus distribution, or phishing. Eventually, the profitability of spam may fall so far that its volume is substantially reduced.

References

Gyöngyi, Zoltán; Garcia-Molina, Hector (2005). "Web spam taxonomy". Proceedings of the First International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), 2005, in the 14th International World Wide Web Conference (WWW 2005), May 10-14, 2005, Nippon Convention Center (Makuhari Messe), Chiba, Japan. New York, N.Y.: ACM Press. ISBN 1-59593-046-9. http://airweb.cse.lehigh.edu/2005/gyongyi.pdf

Lavergne, Thomas (2006). "Unnatural language detection". RJCRI'06: Young Scientists' conference on Information Retrieval. pp. 383-388. http://www.irit.fr/ARIA/2006/383.pdf. Accessed 2007-10-02.


Wikimedia Foundation. 2010.
