Spider trap

A spider trap (or crawler trap) is a set of web pages that, intentionally or unintentionally, causes a web crawler or search bot to make an infinite number of requests, or causes a poorly constructed crawler to crash. Web crawlers are also called web spiders, from which the name is derived. Spider traps may be created to "catch" spambots or other crawlers that waste a website's bandwidth. They may also be created unintentionally, for example by calendars that use dynamic pages with links that continually point to the next day or year.

Common techniques used are:

Creation of indefinitely deep directory structures like
http://foo.com/bar/foo/bar/foo/bar/foo/bar/.....

Dynamic pages, such as calendars, that produce an infinite number of pages for a web crawler to follow.

Pages filled with a very large number of characters, which crash the lexical analyzer parsing the page.

Pages with session IDs based on required cookies.
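
To illustrate why a dynamic calendar never runs out of pages, the following minimal sketch models a page that always links to the next day. The URL layout under http://foo.com/calendar/ and the function names are assumptions made for illustration, not taken from any real site.

```python
from datetime import date, timedelta


def next_day_url(page_date):
    """Return the link a hypothetical calendar page for page_date would emit."""
    following = page_date + timedelta(days=1)
    # Assumed URL scheme: one dynamic page per ISO-formatted date.
    return f"http://foo.com/calendar/{following.isoformat()}"


def follow(start, steps):
    """Follow the 'next day' link `steps` times and collect the URLs visited."""
    urls = []
    current = start
    for _ in range(steps):
        urls.append(next_day_url(current))
        current += timedelta(days=1)
    return urls


urls = follow(date(2024, 1, 1), 1000)
```

Every URL collected is distinct, so a crawler that only remembers visited URLs never sees a repeat and has no natural stopping point.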

There is no algorithm to detect all spider traps. Some classes of traps can be detected automatically, but new, unrecognized traps arise quickly.
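
One class of trap that can be caught automatically is the indefinitely deep directory structure. The sketch below is a simple heuristic, not a complete detector: it flags URLs whose path is very deep or repeats the same segment many times. The function name and both thresholds are assumptions chosen for illustration, and a real crawler would need to tune them.

```python
from collections import Counter
from urllib.parse import urlsplit

MAX_DEPTH = 16    # assumed limit on the number of path segments
MAX_REPEATS = 3   # assumed limit on how often one segment may recur


def looks_like_trap(url, max_depth=MAX_DEPTH, max_repeats=MAX_REPEATS):
    """Heuristically flag URLs with very deep or repetitive paths."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if len(segments) > max_depth:
        return True
    # A segment such as "foo" or "bar" recurring many times suggests a loop.
    counts = Counter(segments)
    return any(n > max_repeats for n in counts.values())


deep = looks_like_trap("http://foo.com/bar/foo/bar/foo/bar/foo/bar/foo")
normal = looks_like_trap("http://foo.com/docs/api/index.html")
```

A heuristic like this can reject legitimate URLs and will miss traps that vary their path segments, which is consistent with the point that no single check detects every trap.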

Politeness

A spider trap causes a web crawler to enter something like an infinite loop, which wastes the spider's resources, lowers its productivity and, in the case of a poorly written crawler, can crash the program. Polite spiders alternate requests between different hosts and do not request documents from the same server more than once every several seconds, so a "polite" web crawler is affected to a much lesser degree than an "impolite" one.
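
The per-host delay described above can be sketched as a small scheduler that records when each host may next be contacted. The class name, the 5-second default (standing in for "several seconds") and the injectable clock are assumptions for illustration.

```python
import time
from urllib.parse import urlsplit


class PoliteScheduler:
    """Track, per host, the earliest time the next request is allowed."""

    def __init__(self, min_delay=5.0, clock=time.monotonic):
        self.min_delay = min_delay   # assumed gap between requests to one host
        self.clock = clock           # injectable so the logic can be tested
        self.next_allowed = {}       # host -> earliest permitted request time

    def wait_time(self, url):
        """Return how many seconds to wait before requesting url."""
        host = urlsplit(url).hostname or ""
        now = self.clock()
        return max(0.0, self.next_allowed.get(host, now) - now)

    def record_request(self, url):
        """Note that url was just requested, delaying its host's next turn."""
        host = urlsplit(url).hostname or ""
        self.next_allowed[host] = self.clock() + self.min_delay
```

A crawler would call wait_time before each fetch and, if the result is non-zero, move on to a URL on another host. A trap on one server then consumes at most one request per delay interval rather than the crawler's full capacity.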

In addition, sites with spider traps usually have a robots.txt file telling bots not to enter the trap, so a legitimate "polite" bot would not fall into it, whereas an "impolite" bot that disregards the robots.txt settings would be affected.
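
As a sketch of how a polite bot honors such a file, the example below uses Python's standard urllib.robotparser module. The robots.txt contents, the /trap/ path and the "ExampleBot" user agent are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a site that keeps its trap under /trap/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /trap/
"""


def build_parser(robots_text):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
trap_allowed = parser.can_fetch("ExampleBot", "http://foo.com/trap/page1")
home_allowed = parser.can_fetch("ExampleBot", "http://foo.com/index.html")
```

A crawler that checks can_fetch before every request skips the disallowed path and never enters the trap, while one that ignores the file follows the links into it.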

See also

Robots exclusion standard

Web crawler

