Talk:Data scraping

This article is prone to spam. Please monitor the References and External links sections.

Implementations

Pappa, I noticed that you wrote about some Perl modules. Are these modules for screen scraping? If so, we could include the better ones in the article. JesseHogan 01:44, 9 Dec 2004 (UTC)

Web scraping

I dispute the notion that screen scraping is relegated to just reading HTML. I've done work on the BlackBerry (J2ME) that required screen scraping solutions... DoomBringer 02:21, 28 May 2005 (UTC)

Doesn't the article's first paragraph make that clear? JesseHogan 07:04, 30 May 2005 (UTC)

It just seemed to me that the article (wrongly) gives the impression that screen scraping is relegated to just HTML parsing. DoomBringer 01:15, 1 Jun 2005 (UTC)

That's just because the editors of most of the article were probably most familiar with the HTML aspect. And it is an important aspect of modern screen scraping. The article isn't wrong; it just needs more information concerning non-HTML screen scraping techniques. If you can write about these, then please do. JesseHogan 19:53, 1 Jun 2005 (UTC)

I just did some major editing, intended to expand on the earlier history of screen scraping with terminals (WRT DoomBringer's comments), and to expand on what screen scraping means in general (WRT Nick Douglas's comment). ~~Web scraping examples are readily available in the links in the reference section. Examples of "classic" screen scraping are harder; they would have to be historical anecdotes, I think.~~ Comments here, improvements on the article, are, as always, welcome. --DragonHawk 23:33, 20 November 2005 (UTC)

Examples

As a layman, I'm still confused. Are there examples that could be linked? -- Nick Douglas 05:24, 18 September 2005 (UTC)

I just did some major editing, intended to expand on the earlier history of screen scraping with terminals (WRT DoomBringer's comments), and to expand on what screen scraping means in general (WRT Nick Douglas's comment). Web scraping examples are readily available in the links in the reference section. Examples of "classic" screen scraping are harder; they would have to be historical anecdotes, I think. Comments here, improvements on the article, are, as always, welcome. --DragonHawk 23:33, 20 November 2005 (UTC)

Links to external examples mostly removed (see #External links). In order to make up for the removal, I've written some generic example descriptions. I hope said examples help clarify what scraping is about, without turning this page into a link farm. Comments, clarifications, etc., are welcome. --DragonHawk 13:42, 6 January 2006 (UTC)

I too see Screen Scraping and Web Scraping as separate topics. I've just completed two applications that use Screen Scraping techniques to interface with a legacy system that was not web-based. —Preceding unsigned comment added by 64.90.21.3 (talk • contribs) 15:55, 28 December 2006

External links

The article was starting to collect link spam. There were several links to implementations which didn't really add information about scraping (they were just another implementation), or were outright commercial products. In order to avoid POV problems regarding external links, I have removed all external links to implementations which do not also include substantial information on how scraping works in general and how the implementation works in particular. I have also placed HTML comments in the article about this. Others can, of course, add what they want, but I've requested that people here explain their reasoning on why a link should be included. --DragonHawk 13:42, 6 January 2006 (UTC)

Thanks for your contributions, Dragon. I was wondering, though: is it necessarily bad to have external links to commercial sites? In some circumstances I could see how a user could benefit from a list of commercial (or free) products related to this article (or any other). I don't think these links are too much of a nuisance for non-interested people. JesseHogan 23:16, 6 January 2006 (UTC)

It'd be nice to have a way to find screen scrapers for those who want them; perhaps link to a page which lists screen scraping products (preferably one which lists whether the product is free)? --AySz88^-^ 02:28, 7 January 2006 (UTC)

While I agree that a simple list of web scrapers might well be useful, it is explicitly outside the mission of Wikipedia. Wikipedia is not a web directory. See also External link guide. If someone needs a web scraper, or wants one, Google, Yahoo, and other searchers and directories already exist and can do a better job than Wikipedia. Now, if a website has lots of content explaining the "how and why" behind an implementation, that's useful content *about* scraping. But if it's just a link to Yet Another Web Scraper, what does that add to the article? --DragonHawk 00:51, 24 January 2006 (UTC)

There was an external link that had been added to a purported example with code. But the example showed no code, just a harvested page. If you want to re-instate that link, make sure it shows the scraping code, as the link suggested. Also, log in to show your name and provide a way for this feedback to be given. peterl 11:03, 27 February 2007 (UTC)

Box-A-Web

On 20:43, 21 January 2006, an anonymous user added a link to Box-A-Web. No edit summary was given, but the contributor included the link description "Not an article on how to do it in Ruby, but rather a technology demonstrator for drag and drop web scraping using Ruby on Rails Framework".

Investigation: I visited the website in question to check it out. Adverts down the left side. Account required to use. Free registration (no fees). Guest accounts published. Tutorial explains how to use it, but little about how it works -- does make an analogy of XML and RSS to HTML and this tool. Text on tutorial page "as the service is free (currently !)" implies it may or will become commercial in the future.

Conclusion: Reverted. Contains no information on web scraping. Adds nothing to the substance of the article. Anonymous contribution makes discussion with contributor impossible.

--DragonHawk 01:00, 24 January 2006 (UTC)

Hi. Original contributor here. I am also the author of the site. Acknowledged, there is not a lot of information about how it works, but my original point about including the link to the page is to show that it is possible to do web scraping visually using web technology, compared to other methods, which require either a rich client or a command line script with lots of configuration parameters. Not sure if there is an easier way to demonstrate the point. There is no plan to ever charge for the service, hence the ads on the left hand side. Drop me a mail using the webmaster address and I will reply on a private channel if you need any additional information. —Preceding unsigned comment added by 205.228.74.11 (talk • contribs)

It's good to know that you were working in good faith. However, please understand that Wikipedia is not a web directory, and that adding links to one's own site is strictly against the external link guide. This is important, because popular topics like screen scraping will otherwise eventually consist mainly of huge lists of links. Feel free to add your site to open directories such as DMoz, where it is perfectly appropriate. Thanks! --DragonHawk 02:22, 20 June 2006 (UTC)

Merge web scraping into screen scraping

I just discovered that Web scraping has its own article, separate from Screen scraping. I propose merging the content from the Web scraping article into the Web scraping section of the Screen scraping article. The term "web scraping" is derived from "screen scraping", and the two are closely related in operation, so it makes sense to me. --DragonHawk 13:57, 27 June 2006 (UTC)

Web Scraping and Screen Scraping are not the same thing

Please see my clarification in the Web scraping content. —Preceding unsigned comment added by Stefanandr (talk • contribs) 16:56, 30 June 2006

Bunyip responds... This subject area is actually bigger than "Ben Hur". In a nutshell...

"Screen Scraping" is a form of "Harvesting", a term which is not defined in Wikipedia in the computer sense.

We need to start with a description of "Harvesting" and/or "Web Harvesting":

"Web Harvesting" is any software technique in which a software "robot" ("webbot", "crawler", etc.) "trawls" (i.e. recursively downloads a page and all the pages linked from it, to a nominated depth) any number of possibly targeted web sites for a variety of reasons, whether legitimate or not. "Web Harvesting" can be done to index web pages for search engines, to hunt for email addresses, phone/account numbers or passwords, to collect metadata, or to build an HTTP-based archive (e.g. http://www.archive.org).

We can then describe Screen Scraping somewhat thusly:

When a human downloads a web page, it is called "browsing". When a computer program records an electronic copy of the textual data on a computer screen, it is called "screen scraping". A "screen scrape" is an electronic copy of the text that a human would have seen on the screen at the time, usually retaining top-bottom, left-right sequence, but it is not an image of the screen. "Screen scraping" includes only expressly textual information, and excludes text appearing in image data. The computer program that performs the "screen scrape" is called a "robot". "Screen Scraping" can be used on web sites to collect the HTML text of the web page. "Screen Scraping" is still very common in high-security mainframe-internet interfaces as a robust and impenetrable (albeit crude) way of sending data from a secure server directly to public and insecure clients. Because the data from the server is static and mostly one-way, this prevents opportunities for injected code, buffer overflow conditions, or hacking attempts from rogue clients. "Screen Scraping" typically occurs multiple times on the same communication interface.

We can now describe the association between the two as follows:

"Web Scraping" differs from "Screen Scraping" in that the former occurs only once per web page over many different web pages. Recursively "web scraping" by following links to other pages over many web sites is "web harvesting". "web harvesting" is necessarily performed by "robots", often called "webbots", "crawlers", "harvesters" or "spiders" with similar arachnological analogies used to refer to other creepy-crawly aspects of their functions. Rightly or wrongly "web harvesters" are typically demonised as being for malicious purposes, while "webbots" are typecast as having benevolent purposes. In Australia, The Spam Act 2003 outlaws some forms of "web harvesting".

--Abunyip 15:59, 5 September 2006 (UTC)
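
To make the "nominated depth" idea in the harvesting description above a little more concrete, here is a rough Python sketch. It is only an illustration under stated assumptions: the names harvest, LinkCollector and FAKE_SITE are made up for this example, and the fetch argument stands in for whatever download routine a real robot would use, so nothing here touches the network.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag found in one page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def harvest(start_url, fetch, max_depth):
    """Scrape start_url, then follow links up to max_depth hops away.

    fetch is a caller-supplied function (hypothetical here) mapping a
    URL to its HTML text, or None if the page is unavailable.
    Returns a dict of URL -> HTML for every page reached.
    """
    pages = {}
    frontier = [(start_url, 0)]  # breadth-first queue of (url, depth)
    while frontier:
        url, depth = frontier.pop(0)
        if url in pages:
            continue  # each page is scraped only once
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        if depth < max_depth:
            collector = LinkCollector()
            collector.feed(html)
            for href in collector.links:
                # Resolve relative links against the current page.
                frontier.append((urljoin(url, href), depth + 1))
    return pages


# A made-up three-page site held in memory, so nothing is downloaded.
FAKE_SITE = {
    "http://example.test/": '<a href="/a">A</a>',
    "http://example.test/a": '<a href="/b">B</a>',
    "http://example.test/b": "<p>end</p>",
}
```

Calling harvest("http://example.test/", FAKE_SITE.get, 0) scrapes a single page, which corresponds to the "once per web page" case above; raising max_depth turns the same loop into harvesting across linked pages.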

This definition of "screen scraping" raises questions for the "web" usage of the term. The article describes "screen scraping" on very old systems where it's done by reading a terminal's memory through an auxiliary port, and then it describes HTML parsing which has nothing whatsoever to do with screens, and calls it "screen" scraping. On the modern PC, a "screen scraping" program would have to be one that somehow reads the contents of a program's graphical window to get at the data. For example, a Web robot is not a screen scraper, but a program that somehow reads the data from an open Firefox window is. 75.186.36.20 (talk) 04:55, 9 February 2008 (UTC)

IMHO, most of this article should be moved to the 'Data scraping' article (which is currently only a redirect), and screen scraping and web scraping mentioned as particular implementations of the former. Uker (talk) 15:07, 8 June 2009 (UTC)

Blocking information wrong

The page currently says: "These include the blocking of individual and ranges of IP addresses, which stops the majority of "cookie cutter" screen scraping applications." Added section to web scraping on stopping bots. peterl 04:41, 12 February 2007 (UTC)

Legal Issues

This page should point out that screen scraping is against the Terms of Use of many -- perhaps most -- commercial websites, which leads to legal liability for the scraper. Indeed, the Digital Millennium Copyright Act in the USA and European Union Copyright Directive specifically address "Circumvention of Copyright Protection Schemes", which would impact anyone scraping commercial sites -- whether for commercial gain or not -- especially when the scraped data is then redistributed.

Commercial sites will aggressively protect their intellectual property, and often have little tolerance for screen scraping, especially where it impacts their commerce. As most legal force is exerted out of the public eye (and also outside of any official lawsuit) it may not be readily apparent just how vigorously commercial websites can act to protect their IP. Those considering screen scraping a commercial site should study its Terms of Use, and also consider the consequences should the site become aware that the scraping is occurring.

I propose language similar to the above, adapted for entry use. Comments? Dracogen 16:55, 21 March 2007 (UTC)

I'm too busy right now to comment more, but check out web scraping if you haven't already. --DragonHawk (talk) 17:03, 21 March 2007 (UTC)

Persistent target of spammers

Data_extraction) has been a persistent target of spammers. The commercial link was typically labeled "Know About Screen Scraping" in the See Also section. It has been removed and replaced several times. history Mrnatural (talk) 19:07, 5 August 2009 (UTC)

Lynx stdout scrapers

Somewhat intermediate between true "screen scrapers" in the modern sense (which interface with the GUIs of applications, and consequently need either OCR to read the bitmapped screen directly and convert it to text, or access to the underlying data objects) and HTML parsers would be something that takes already-rendered textual output from lynx and tries to figure out what it is seeing, basically trying to infer the underlying HTML to some degree.

Some wiki expert, please find an appropriate place to put this topic in this wiki page, either as a new section between "2 Screen scrapers" and "3 Web scrapers", or as a sub-part of one of those two sections. IMO renaming section 2 to read "Application-UI scrapers", with sub-sections "2.1 General", "2.2 GUI scrapers" and "2.3 Standard output scrapers", would make the most sense. Section 2.3 would include scraping any of: PTY output, sub-process pipe stdout, true pipe output, Unix command-line vertical-bar piping, TELNET output, etc.
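
As a rough illustration of the lynx case described above, here is a short Python sketch of a "standard output scraper". It rests on assumptions that should be checked against a real installation: that a lynx binary is on the PATH, and that its default -dump layout marks links as [N] in the body and lists them under a trailing "References" heading. The names dump_page, parse_dump, links_in_order and SAMPLE are invented for this example, and SAMPLE is a hand-written imitation of that layout, not captured output.

```python
import re
import subprocess


def dump_page(url):
    """Return the rendered text of url as printed by `lynx -dump`.

    Assumes lynx is installed; raises CalledProcessError on failure.
    """
    result = subprocess.run(
        ["lynx", "-dump", url],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_dump(text):
    """Split a lynx-style dump into body text and a {number: url} map.

    Assumed layout: link anchors appear in the body as [N], and a
    trailing "References" block lists each target as "N. url".
    """
    body, _, refs = text.partition("\nReferences\n")
    links = {
        int(num): url
        for num, url in re.findall(r"^\s*(\d+)\.\s+(\S+)\s*$", refs, re.M)
    }
    return body.strip(), links


def links_in_order(body, links):
    """List link targets in the order their [N] markers appear."""
    numbers = (int(n) for n in re.findall(r"\[(\d+)\]", body))
    return [links[n] for n in numbers if n in links]


# Hand-written sample imitating the assumed dump layout.
SAMPLE = """\
   Welcome to the [1]home page and the [2]archive.

References

   1. http://example.test/
   2. http://example.test/archive
"""
```

Note how much of the underlying HTML is already lost at this point: the dump shows where a link starts but not where its text ends, so the scraper can recover link targets and reading order but only guess at the original markup.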

Serious but COI addition thoughts, LOL

Scraping is often viewed as a way to get around web site attempts to "protect" data. I've gotten into these battles with a number of people and encourage anyone working on this to cite some of these pieces or reliable refs therein:

OTS needs API

cygwin api comments

outdated summary

bio discussion on API

request for SEC API

These topics sometimes come up on the iText mailing list, as PDF authors seem to be the most prone to creating "protected" documents that are difficult to use with computers.

Thanks. Nerdseeksblonde (talk) 17:26, 24 August 2009 (UTC)

I guess I'd be thinking about reliable sources that discuss reasons why data scraping is even needed (it sounds silly), and that would naturally lead to issues with commercial sites that are supported by ads that have no value if no one is exposed to them, and to things like concern for slowing public awareness of the contents of required public filings (everything from building permits to SEC filings could be an issue here, but the SEC is at least making machine-readable documents available, if not a complete automated API). I'm not sure, however, if these topics get much beyond forums and blogs. I'm also not entirely sure it is an encyclopedic issue. Nerdseeksblonde (talk) 23:57, 24 August 2009 (UTC)

Merger

I merged Report mining into this article. Reyk _YO! 21:42, 3 April 2013 (UTC)

External links modified

Hello fellow Wikipedians,

I have just modified one external link on Data scraping. Please take a moment to review my edit. If you have any questions, or need the bot to ignore the links, or the page altogether, please visit this simple FaQ for additional information. I made the following changes:

Added archive https://web.archive.org/web/20150511050542/http://www.wired.com/2014/03/kimono to https://www.wired.com/2014/03/kimono/

Cheers. --InternetArchiveBot (Report bug) 06:28, 5 September 2017 (UTC)