Web Scraping is the use of automated software (also known as bots) to extract content and data from a website. It is also classified by OWASP as an automated threat (OAT-011). Web Scraping differs from Screen Scraping in that it extracts the underlying HTML code and, with it, data stored in databases, whereas Screen Scraping only copies the pixels displayed on screen. But where is the line drawn between extracting data for legitimate business purposes and malicious data extraction that hurts business? The line seems to be getting blurrier by the day, as efforts to depict Web Scraping as a legitimate business grow stronger, while legal action against Web Scraping is slow-moving and varies by country.
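To make the distinction concrete, here is a minimal, hedged sketch of what Web Scraping looks like in practice: the bot requests a page and parses the underlying HTML markup directly, with no rendering or pixel capture involved. The URL and CSS selectors below are illustrative assumptions, not any real site's markup.

```python
# Minimal Web Scraping sketch: pull structured data straight out of the
# page's underlying HTML (no screen rendering involved).
# NOTE: the URL and CSS selectors are hypothetical placeholders.
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com/products", timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

# Data that originated in the site's database sits right in the markup.
for item in soup.select("div.product"):
    name = item.select_one("h2.name")
    price = item.select_one("span.price")
    if name and price:
        print(name.get_text(strip=True), price.get_text(strip=True))
```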
What are valid uses for Web Scraping?
In order to understand the problem, let's first look at a few of the legitimate use cases for Web Scraping. The first examples are search engine crawlers like Googlebot or Bingbot. These are deployed with three main functions that help create and maintain a searchable index of web pages: crawl, index and rank. Other examples include market research companies pulling data from online forums and social media, as well as price comparison websites pulling prices and product descriptions from various online retailers.
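As an illustration of the "crawl" step, here is a hedged sketch of how a well-behaved crawler operates: it checks robots.txt before fetching, identifies itself honestly via its User-Agent, and collects links to feed its index. The seed URL and crawler name are assumptions made for the example.

```python
# Simplified sketch of a polite crawler's "crawl" stage: honor robots.txt,
# identify itself, and gather links for later indexing.
# NOTE: the seed URL and bot identity are hypothetical.
import urllib.robotparser
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

SEED = "https://example.com/"
BOT_NAME = "ExampleCrawler/1.0"  # hypothetical crawler identity

robots = urllib.robotparser.RobotFileParser()
robots.set_url(urljoin(SEED, "/robots.txt"))
robots.read()

if robots.can_fetch(BOT_NAME, SEED):
    page = requests.get(SEED, headers={"User-Agent": BOT_NAME}, timeout=10)
    soup = BeautifulSoup(page.text, "html.parser")
    # Discovered links would feed the crawl frontier for indexing.
    links = [urljoin(SEED, a["href"]) for a in soup.find_all("a", href=True)]
    print(f"Found {len(links)} links to index")
```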
Malicious use cases for Web Scraping
What are some illegal use cases? The easiest way to define illegal Web Scraping is "the extraction of data from a website without the permission of its owner". Price Scraping and Content Scraping are two of the most common malicious use cases. Price Scraping usually involves competing businesses scraping your prices in order to undercut you and win in the marketplace. This hurts businesses, as consistently undercut prices pull price-sensitive customers, and their search traffic, toward the competition. But you don't have to be selling goods or services to be targeted by scraping bots; stealing your proprietary content can be just as damaging. Content Scraping is outright content theft at scale, and if your content appears elsewhere on the web, your SEO rankings are bound to take a direct hit, since search engines demote duplicate content.
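To show what defenders are up against, here is a hedged sketch of how a Price Scraping bot might sweep a competitor's catalog: fetch each product page, lift the price out of the markup, and compare it against its operator's own price list. Every URL, selector, and price in it is a made-up assumption for illustration.

```python
# Hedged sketch of a Price Scraping bot sweeping a competitor's catalog.
# NOTE: the URL scheme, CSS selector, and prices are all hypothetical.
import requests
from bs4 import BeautifulSoup

OWN_PRICES = {"widget-a": 19.99, "widget-b": 34.50}  # assumed own catalog

for sku, own_price in OWN_PRICES.items():
    url = f"https://competitor.example/products/{sku}"  # hypothetical URL scheme
    html = requests.get(url, timeout=10).text
    tag = BeautifulSoup(html, "html.parser").select_one("span.price")
    if tag is None:
        continue
    their_price = float(tag.get_text(strip=True).lstrip("$"))
    if their_price < own_price:
        print(f"{sku}: competitor undercuts us ({their_price} < {own_price})")
```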
A legitimate business?
In 2020, we discussed the portrayal of "bad bots as a service". These alleged businesses offer business intelligence services dubbed pricing intelligence, alternative data for finance, or competitive insights. In addition, there is increasing pressure within these industries to purchase scraped data. I recently came across a blog post discussing why it is important for an organization to employ web scraping bots in order to remain competitive in its market. No organization wants to lose business because the competition has access to data that is available for purchase. The writer even went the extra mile, explaining recent methods for staying "under the radar" while masquerading as a legitimate user, such as routing traffic through residential proxies. Another sign of the attempt to legitimize Web Scraping is the growing number of job postings for positions with titles like Web Data Extraction Specialist or Data Scraping Specialist.
The legal stance against Web Scraping
Perhaps the most relevant legal ruling regarding Web Scraping is the hiQ Labs vs. LinkedIn case. In its efforts to stop Web Scraping, LinkedIn served hiQ with a cease-and-desist letter; in response, hiQ filed a lawsuit against LinkedIn. The Ninth Circuit Court of Appeals ruled in favor of allowing bots to scrape publicly available content.
Following the decision, LinkedIn filed a petition requesting Supreme Court review in March 2020, to which hiQ responded. hiQ argued that it is debatable whether a company can use the Computer Fraud and Abuse Act to prevent access to information that the website's users have shared on their public profiles, information that is available for viewing by anyone with a web browser.
LinkedIn isn't alone in the fight against Web Scraping. In October 2020, Facebook filed a lawsuit in the U.S. against two companies that were engaged in an international data scraping operation spanning multiple websites. And while no decisive legal action has yet been taken against Web Scraping operations, their business remains shady at best.
What’s next for Web Scraping?
This situation poses a moral dilemma for organizations. As more of them realize that not leveraging certain techniques may place them at a disadvantage, the probability of their turning to those techniques is high, especially considering that no firm legal action is being taken to halt Web Scraping operations. In an environment where constant efforts are made to legitimize Web Scraping, it is difficult to see this particular bot problem going away any time soon.
Take defensive action
As Web Scraping remains an issue that is tricky to solve through legal means, an increasing number of organizations are adopting preventative measures. They understand the need to protect their proprietary data while maintaining the legitimate flow of traffic to their website.
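As one simple example of such a measure, here is a minimal sketch of per-IP sliding-window rate limiting, which throttles high-volume scrapers while leaving ordinary browsing untouched. The framework (Flask), request ceiling, and window size are assumptions made for the sketch; a dedicated bot-management solution layers many more signals on top of this.

```python
# Minimal preventative-measure sketch: per-IP sliding-window rate limiting.
# NOTE: the threshold and window below are illustrative assumptions.
import time
from collections import defaultdict, deque

from flask import Flask, abort, request

app = Flask(__name__)

WINDOW_SECONDS = 60
MAX_REQUESTS = 100  # assumed ceiling for a human browsing session
hits: dict[str, deque] = defaultdict(deque)

@app.before_request
def throttle_scrapers():
    now = time.time()
    window = hits[request.remote_addr]
    # Drop timestamps that have aged out of the sliding window.
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    window.append(now)
    if len(window) > MAX_REQUESTS:
        abort(429)  # Too Many Requests

@app.route("/")
def index():
    return "OK"

if __name__ == "__main__":
    app.run()
```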
Imperva offers a best-in-class Advanced Bot Protection solution, able to mitigate the most sophisticated automated threats, including all OWASP automated threats. It leverages superior technology to protect all potential access points, including websites, mobile applications and APIs. And it does so without affecting the experience of legitimate users.
Advanced Bot Protection is a part of Imperva’s Application Security platform. Start your Application Security Free Trial today to protect your assets from Grinch bots and other automated threats.