Web scraping
Web scraping (sometimes called harvesting) generically describes any of various means to extract content from a website over HTTP for the purpose of transforming that content into another format suitable for use in another context. Those who scrape websites may wish to store the information in their own databases or manipulate the data within a spreadsheet (often, spreadsheets are only able to hold a fraction of the data scraped). Others may use data extraction techniques as a means of obtaining the most recent data possible, particularly when working with information subject to frequent change. Investors analyzing stock prices, realtors researching home listings, meteorologists studying weather, or insurance salespeople following insurance prices are a few individuals who might fit this category of users of frequently updated data.

Access to certain information may also provide users with a strategic advantage in business. Attorneys might wish to scrape arrest records from county courthouses in search of potential clients. Businesses that know the locations of competitors can make better decisions about where to focus further growth. Another common, but controversial, use of information taken from websites is reposting scraped data to other sites.
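As a concrete, deliberately minimal illustration of the process described above, the following Python sketch fetches a page over HTTP and re-purposes part of its content as a CSV file that a spreadsheet can open. It uses only the standard library; the target URL and the choice of extracted data (link text and targets) are placeholders for this example, not part of any particular scraper.

```python
# Minimal sketch: fetch a page over HTTP, pull out some content,
# and write it to a CSV file that a spreadsheet can open.
# The URL and the decision to extract links are illustrative only.
import csv
import urllib.request
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects (text, href) pairs for every <a> tag that has an href."""
    def __init__(self):
        super().__init__()
        self.links = []            # list of (text, href) tuples
        self._current_href = None
        self._current_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current_href = dict(attrs).get("href")
            self._current_text = []

    def handle_data(self, data):
        if self._current_href is not None:
            self._current_text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            self.links.append(("".join(self._current_text).strip(), self._current_href))
            self._current_href = None

if __name__ == "__main__":
    url = "https://example.com/"   # placeholder target
    with urllib.request.urlopen(url) as response:
        html = response.read().decode("utf-8", errors="replace")

    parser = LinkExtractor()
    parser.feed(html)

    # Re-purpose the scraped content as CSV, e.g. for analysis in a spreadsheet.
    with open("links.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["text", "href"])
        writer.writerows(parser.links)
```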
Scraper sites
A typical example application for web scraping is a web crawler that copies content from one or more existing websites in order to generate a scraper site. The result can range from fair use excerpts or reproduction of text and content to plagiarized content. In some instances, plagiarized content may be used as an illicit means to increase traffic and advertising revenue. The typical scraper website generates revenue using Google AdSense, hence the term 'Made for AdSense' or MFA website.

Web scraping differs from screen scraping in the sense that a website is not really a visual screen, but live HTML/JavaScript-based content with a graphical interface in front of it. Therefore, web scraping does not involve working at the visual interface, as screen scraping does, but rather working on the underlying object structure (the Document Object Model) of the HTML and JavaScript, as illustrated in the sketch at the end of this section. Web scraping also differs from screen scraping in that screen scraping typically occurs many times from the same dynamic screen "page", whereas web scraping occurs only once per web page over many different static web pages. Recursive web scraping, by following links to other pages over many websites, is called "web harvesting". Web harvesting is necessarily performed by software called a bot, or a "webbot", "crawler", "harvester" or "spider", with similar arachnological analogies used to refer to other aspects of their functions. Web harvesters are typically demonised, while "webbots" are often typecast as benevolent.

There are legal web scraping sites that provide free content and are commonly used by webmasters looking to populate a hastily made site with web content, often to profit by some means from the traffic the articles hopefully bring. This content does not help the ranking of the site in search engine results, because the content is not original to that page. [http://www.webmasterworld.com/forum44/1639.htm] Original content is a priority of search engines. [http://search.yahoo.com/search?p=%22original+content%22+%22SEO%22&fr=yfp-t-501&toggle=1&cop=mss&ei=UTF-8] Use of free articles usually requires one to link back to the free article site, as well as to any link(s) provided by the author. Even this, however, may not be permitted: some sites that provide free articles also have a clause in their terms of service that does not allow copying of the content, link back or not. The site Wikipedia.org (particularly the English Wikipedia) is a common target for web scraping. [Hepp, M., D. Bachlechner, and K. Siorpaes: "Harvesting Wiki Consensus - Using Wikipedia Entries as Ontology Elements", in: Proceedings of the Workshop on Semantic Wikis at the ESWC2006, Budva, Montenegro, 2006.]
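To make the contrast with screen scraping concrete, here is a short, self-contained sketch of working on the document's object structure rather than on a rendered screen. It uses Python's xml.dom.minidom, which exposes a W3C-style DOM; the XHTML fragment and the "prices" table are invented for the example, and real-world HTML usually needs a more lenient parser.

```python
# Sketch of the distinction drawn above: rather than reading pixels from a
# rendered screen, a web scraper walks the document's object structure.
# Assumes well-formed XHTML input; the table contents are made up.
from xml.dom.minidom import parseString

xhtml = """<html>
  <body>
    <table id="prices">
      <tr><td>ACME</td><td>12.50</td></tr>
      <tr><td>Globex</td><td>7.80</td></tr>
    </table>
  </body>
</html>"""

document = parseString(xhtml)

# Navigate the object structure instead of the visual layout:
# find the table by its id, then read each row's cells by tag name.
for table in document.getElementsByTagName("table"):
    if table.getAttribute("id") == "prices":
        for row in table.getElementsByTagName("tr"):
            cells = [cell.firstChild.data for cell in row.getElementsByTagName("td")]
            print(cells)   # e.g. ['ACME', '12.50']
```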
Legal issues
Although scraping is against the terms of use of some websites, the enforceability of these terms is unclear. [http://www.chillingeffects.org/linking/faq.cgi#QID596 FAQ about linking - Are website terms of use binding contracts?, chillingeffects.org] While outright duplication of original expression will in many cases be illegal, the courts ruled in Feist Publications v. Rural Telephone Service that duplication of facts is allowable. Also, in a February 2006 ruling, the Danish Maritime and Commercial Court (Copenhagen) found that systematic crawling, indexing and deep linking of the real estate site Home.dk by the portal site ofir.dk did not conflict with Danish law or the database directive of the European Union. [http://www.bvhd.dk/uploads/tx_mocarticles/S_-_og_Handelsrettens_afg_relse_i_Ofir-sagen.pdf UDSKRIFT AF SØ- & HANDELSRETTENS DOMBOG, bvhd.dk, 2006-02-24]

U.S. courts have acknowledged that users of "scrapers" or "robots" may be held liable for committing trespass to chattels, [http://www.tomwbell.com/NetLaw/Ch06.html Internet Law, Ch. 06: Trespass to Chattels, tomwbell.com] [http://www.chillingeffects.org/linking/faq.cgi#QID460 What are the "trespass to chattels" claims some companies or website owners have brought?, chillingeffects.org] which involves a computer system itself being considered personal property upon which the user of a scraper is trespassing. However, to succeed on a claim of trespass to chattels, the plaintiff must demonstrate that the defendant intentionally and without authorization interfered with the plaintiff's possessory interest in the computer system, and that the defendant's unauthorized use caused damage to the plaintiff. Not all cases of web spidering brought before the courts have been considered trespass to chattels. [http://www.tomwbell.com/NetLaw/Ch07/Ticketmaster.html Ticketmaster Corp. v. Tickets.com, Inc.]

In Australia, the Spam Act 2003 outlaws some forms of web harvesting. [http://www.dcita.gov.au/__data/assets/pdf_file/21726/DCITA-Spam-O4Bus-Web.pdf Spam Act (Overview for Businesses), p. 3] [http://www.dcita.gov.au/__data/assets/pdf_file/21725/DCITA-Spam-4Bus-Web.pdf Spam Act (Guide for Businesses), p. 10]

Technical measures to stop bots
A webmaster can use various measures to stop or slow a bot. Some techniques include:
* Blocking an IP address. This will also block all browsing from that address.
* Adding entries to robots.txt; well-behaved applications adhere to it, so Google and other well-behaved bots can be stopped this way.
* Blocking bots on the basis of the identity they declare. Well-behaved bots do declare themselves (for example, 'googlebot'); unfortunately, malicious bots may claim to be a normal browser.
* Monitoring for excess traffic and blocking the bots that generate it (a minimal sketch follows this list).
* Using tools that verify a real person is accessing the site, such as the CAPTCHA project.
* Blocking bots with carefully crafted JavaScript.
* Locating bots with a honeypot or other method of identifying the IP addresses of automated crawlers.
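As a sketch of the excess-traffic-monitoring approach mentioned in the list above, the following Python class keeps a sliding window of request timestamps per IP address and flags any address that exceeds a threshold. The class name, threshold and window size are arbitrary choices for illustration; in practice such a check would sit inside the web server or a firewall rule.

```python
# Minimal per-IP rate limiter: an address that makes more than
# `max_requests` requests within `window_seconds` is treated as a bot.
# Thresholds and names here are illustrative, not recommendations.
import time
from collections import defaultdict, deque

class ExcessTrafficMonitor:
    def __init__(self, max_requests=100, window_seconds=60):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)   # ip -> timestamps of recent requests

    def should_block(self, ip, now=None):
        """Record one request from `ip` and return True if it exceeds the limit."""
        now = time.monotonic() if now is None else now
        hits = self._hits[ip]
        hits.append(now)
        # Drop timestamps that have fallen outside the sliding window.
        while hits and now - hits[0] > self.window_seconds:
            hits.popleft()
        return len(hits) > self.max_requests

# Example: a web server would call should_block() for each incoming request
# and answer with an error page (or a CAPTCHA) when it returns True.
monitor = ExcessTrafficMonitor(max_requests=5, window_seconds=1)
for _ in range(7):
    print(monitor.should_block("198.51.100.7"))   # the last two prints are True
```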
Software
* Web-scraping software comparison
* [http://www.tethyssolutions.com/screen-scrape.htm Automation Anywhere]
* [http://softbytelabs.com/us/products.html Black Widow]
* [http://www.newprosoft.com/web-content-extractor.htm Content Extractor]
* Dapper
* Data Extractor
* Data Ferret
* [http://www.irobotsoft.com/ IRobotSoft]
* [http://www.kapowtech.com/ Kapow RoboMaker]
* List Extractor Pro
* [http://www.gooseeker.com/en/ MetaSeeker]
* [http://www.mozenda.com/ Mozenda]
* [http://www.netveille.com/ Net-Veille]
* [http://www.screen-scraper.com/ screen-scraper]
* [http://www.sitescraper.co.uk/ Site Scraper]
* Visual Web Spider
*Web data extractor
* [http://www.websundew.com/ WebSundew]
* Web Scraper Lite
* [http://www.velocityscape.com/ Web Scraper Plus+]
See also
* Web-scraping software comparison
*Screen scraping
*iMacros Firefox extension for web scraping
*Mashup (web application hybrid)
*Text corpus
*Corpus linguistics

External links
* [http://www.scrapingpages.com Web Scraper Tutorials]
* [http://www.readwriteweb.com/archives/web_30_when_web_sites_become_web_services.php The Future of Web Sites = Web Services] - (with a section on web scraping)
* [http://www.Sitescraper.co.uk/ Site Scraper - Web Scraping] - Web scraping services.
* [http://www.techreform.com/web-scraping/ Techreform - Web Scraping] - A commercial provider of web scraping services based in the United Kingdom
* [http://www.johncow.com/how-to-deal-with-people-who-are-stealing-your-content/ How to deal with people who scrape content] - A comprehensive guide detailing how to deal with people who scrape your content