Web scraping

Web scraping (sometimes called harvesting) describes any of various means of extracting content from a website over HTTP in order to transform that content into another format suitable for use in another context. Those who scrape websites may wish to store the information in their own databases or manipulate the data within a spreadsheet (although spreadsheets can often hold only a fraction of the data scraped). Others may use data extraction techniques to obtain the most recent data possible, particularly when working with information that changes frequently. Investors analyzing stock prices, realtors researching home listings, meteorologists studying weather, and insurance salespeople following insurance prices are a few of the people who might rely on frequently updated data in this way.

Access to certain information may also give users a strategic advantage in business. Attorneys might wish to scrape arrest records from county courthouses in search of potential clients. Businesses that know the locations of competitors can make better decisions about where to focus further growth. Another common but controversial use of information taken from websites is reposting scraped data to other sites.

Scraper sites

A typical example application for web scraping is a web crawler that copies content from one or more existing websites in order to generate a scraper site. The result can range from fair use excerpts or reproduction of text and content, to plagiarized content. In some instances, plagiarized content may be used as an illicit means to increase traffic and advertising revenue. The typical scraper website generates revenue using Google AdSense, hence the term 'Made for AdSense' or MFA website.

Web scraping differs from screen scraping in that a website is not really a visual screen but live HTML/JavaScript-based content with a graphical interface in front of it. Web scraping therefore does not involve working at the visual interface, as screen scraping does, but rather working on the underlying object structure (the Document Object Model) of the HTML and JavaScript.
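As a rough illustration of working on the markup structure rather than on rendered pixels, the following sketch pulls a page title and its link targets out of an HTML string. It uses the event-based `html.parser` module from the Python standard library as a stand-in for a full Document Object Model; the class name `TitleAndLinkParser`, the helper `scrape`, and the sample markup are all hypothetical.

```python
from html.parser import HTMLParser


class TitleAndLinkParser(HTMLParser):
    """Collects the page title and every hyperlink target from HTML markup."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            # attrs is a list of (name, value) pairs
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only text that appears inside <title>...</title> is kept
        if self._in_title:
            self.title += data


# Hypothetical page content; a real scraper would fetch this over HTTP.
SAMPLE_HTML = """
<html>
  <head><title>Home listings</title></head>
  <body>
    <a href="/listing/1">First listing</a>
    <a href="/listing/2">Second listing</a>
  </body>
</html>
"""


def scrape(html):
    """Return the title and link targets found in the given markup."""
    parser = TitleAndLinkParser()
    parser.feed(html)
    parser.close()
    return {"title": parser.title.strip(), "links": parser.links}


print(scrape(SAMPLE_HTML))
```

The extracted dictionary could then be written to a database or spreadsheet, which is the transformation step described at the start of the article.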

Web scraping also differs from screen scraping in that screen scraping typically occurs many times from the same dynamic screen "page", whereas web scraping occurs only once per web page over many different static web pages. Recursive web scraping, which follows links to other pages across many websites, is called "web harvesting". Web harvesting is necessarily performed by software called a bot, "webbot", "crawler", "harvester" or "spider", with similar arachnological analogies used to refer to other creepy-crawly aspects of their functions. Web harvesters are typically demonised, while "webbots" are often typecast as benevolent.
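The recursive link-following described above can be sketched as a breadth-first traversal. This is only a minimal illustration under stated assumptions: the pages live in a hypothetical in-memory dictionary (`FAKE_WEB`) instead of being fetched over HTTP, and the names `LinkParser`, `harvest`, and `max_pages` are invented for the example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


# Hypothetical in-memory "web": URL -> HTML body.
FAKE_WEB = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/b">B</a> <a href="/">home</a>',
    "http://example.com/b": '<a href="/c">C</a>',
    "http://example.com/c": "no links here",
}


def harvest(start_url, fetch, max_pages=100):
    """Visit each reachable page once, following links breadth-first.

    `fetch` maps a URL to its HTML, or returns None if the page is missing.
    """
    visited = []
    seen = {start_url}
    queue = deque([start_url])
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:       # scrape each page only once
                seen.add(absolute)
                queue.append(absolute)
    return visited


print(harvest("http://example.com/", FAKE_WEB.get))
```

The `seen` set is what keeps the harvester to one visit per page, and `max_pages` is a simple guard against unbounded crawling.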

There are legal web scraping sites that provide free content and are commonly used by webmasters looking to populate a hastily made site with web content, often to profit by some means from the traffic the articles will hopefully bring. This content does not help the ranking of the site in search engine results because the content is not original to that page. [http://www.webmasterworld.com/forum44/1639.htm] Original content is a priority of search engines. [http://search.yahoo.com/search?p=%22original+content%22+%22SEO%22&fr=yfp-t-501&toggle=1&cop=mss&ei=UTF-8] Use of free articles usually requires one to link back to the free article site, as well as to any links provided by the author. A link back is not always sufficient, however: some sites that provide free articles also have a clause in their terms of service that does not allow copying content, link back or not. Wikipedia.org (particularly the English Wikipedia) is a common target for web scraping. [Hepp, M., D. Bachlechner, and K. Siorpaes: "Harvesting Wiki Consensus - Using Wikipedia Entries as Ontology Elements", in: "Proceedings of the Workshop on Semantic Wikis at the ESWC2006 (ESWC2006)", Budva, Montenegro, 2006.]

Legal issues

Although scraping is against the terms of use of some websites, the enforceability of these terms is unclear. [http://www.chillingeffects.org/linking/faq.cgi#QID596 FAQ about linking - Are website terms of use binding contracts?] (www.chillingeffects.org, 2007-08-20; accessed 2007-08-20) While outright duplication of original expression will in many cases be illegal, the U.S. Supreme Court ruled in Feist Publications v. Rural Telephone Service that duplication of facts is allowable. Also, in a February 2006 ruling, the Danish Maritime and Commercial Court (Copenhagen) found that systematic crawling, indexing and deep linking by the portal site ofir.dk of the real estate site Home.dk did not conflict with Danish law or the database directive of the European Union. [http://www.bvhd.dk/uploads/tx_mocarticles/S_-_og_Handelsrettens_afg_relse_i_Ofir-sagen.pdf UDSKRIFT AF SØ- & HANDELSRETTENS DOMBOG] (bvhd.dk, 2006-02-24; accessed 2007-05-30)

U.S. courts have acknowledged that users of "scrapers" or "robots" may be held liable for committing trespass to chattels, [http://www.tomwbell.com/NetLaw/Ch06.html Internet Law, Ch. 06: Trespass to Chattels] (www.tomwbell.com, 2007-08-20; accessed 2007-08-20) [http://www.chillingeffects.org/linking/faq.cgi#QID460 What are the "trespass to chattels" claims some companies or website owners have brought?] (www.chillingeffects.org, 2007-08-20; accessed 2007-08-20) which involves a computer system itself being considered personal property upon which the user of a scraper is trespassing. However, to succeed on a claim of trespass to chattels, the plaintiff must demonstrate that the defendant intentionally and without authorization interfered with the plaintiff's possessory interest in the computer system and that the defendant's unauthorized use caused damage to the plaintiff. Not all cases of web spidering brought before the courts have been considered trespass to chattels. [http://www.tomwbell.com/NetLaw/Ch07/Ticketmaster.html Ticketmaster Corp. v. Tickets.com, Inc.] (2007-08-20; accessed 2007-08-20)

In Australia, the 2003 Spam Act outlaws some forms of web harvesting. [http://www.dcita.gov.au/__data/assets/pdf_file/21726/DCITA-Spam-O4Bus-Web.pdf Spam Act (Overview for Businesses)] (p. 3; 2007-08-20; accessed 2007-08-20) [http://www.dcita.gov.au/__data/assets/pdf_file/21725/DCITA-Spam-4Bus-Web.pdf Spam Act (Guide for Businesses)] (p. 10; 2007-08-20; accessed 2007-08-20)

Technical measures to stop bots

A webmaster can use various measures to stop or slow a bot. Some techniques include:
* Blocking an IP address. This also blocks all other browsing from that address.
* Adding entries to robots.txt. Well-behaved applications adhere to them, so Google and other well-behaved bots can be stopped this way.
* Blocking bots by their declared identity. Well-behaved bots declare who they are (for example 'googlebot') and can be blocked on that basis. Unfortunately, malicious bots may declare that they are a normal browser.
* Monitoring for excess traffic and blocking the clients that generate it.
* Using tools that verify a real person is accessing the site, such as a CAPTCHA.
* Using carefully crafted JavaScript, which can sometimes block bots.
* Using a honeypot or other method to identify the IP addresses of automated crawlers.
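The robots.txt measure from the list above can be illustrated from the bot's side. The sketch below uses `urllib.robotparser` from the Python standard library to show how a well-behaved crawler would interpret a robots.txt file; the file contents, the bot names `BadBot` and `GoodBot`, and the helper `build_rules` are hypothetical. A malicious bot can simply skip this check, which is why the other measures exist.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: ban one named bot entirely and keep
# every other bot out of /private/.
ROBOTS_TXT = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_rules(robots_txt):
    """Parse robots.txt text into a rule set a crawler can query."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = build_rules(ROBOTS_TXT)

# A well-behaved bot asks before every request.
print(rules.can_fetch("BadBot", "http://example.com/index.html"))
print(rules.can_fetch("GoodBot", "http://example.com/index.html"))
print(rules.can_fetch("GoodBot", "http://example.com/private/data.html"))
```

Under these rules the first and third checks are refused and the second is allowed.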

Software

* Web-scraping software comparison
* [http://www.tethyssolutions.com/screen-scrape.htm Automation Anywhere]
* [http://softbytelabs.com/us/products.html Black Widow]
* [http://www.newprosoft.com/web-content-extractor.htm Content Extractor]
* Dapper
* Data Extractor
* Data Ferret
* [http://www.irobotsoft.com/ IRobotSoft]
* [http://www.kapowtech.com/ Kapow RoboMaker]
* List Extractor Pro
* [http://www.gooseeker.com/en/ MetaSeeker]
* [http://www.mozenda.com/ Mozenda]
* [http://www.netveille.com/ Net-Veille]
* [http://www.screen-scraper.com/ screen-scraper]
* [http://www.sitescraper.co.uk/ Site Scraper]
* Visual Web Spider
* Web data extractor
* [http://www.websundew.com/ WebSundew]
* Web Scraper Lite
* [http://www.velocityscape.com/ Web Scraper Plus+]

See also

* Web-scraping software comparison
* Screen scraping
* iMacros Firefox extension for web scraping
* Mashup (web application hybrid)
* Text corpus
* Corpus linguistics

External links

* [http://www.scrapingpages.com Web Scraper Tutorials]
* [http://www.readwriteweb.com/archives/web_30_when_web_sites_become_web_services.php The Future of Web Sites = Web Services] - (with a section on web scraping)
* [http://www.Sitescraper.co.uk/ Site Scraper - Web Scraping] - Web scraping services.
* [http://www.techreform.com/web-scraping/ Techreform - Web Scraping] - A commercial provider of web scraping services based in the United Kingdom
* [http://www.johncow.com/how-to-deal-with-people-who-are-stealing-your-content/ How to deal with people who scrape content] - A comprehensive guide detailing how to deal with people who scrape your content


Wikimedia Foundation. 2010.
