Web scraping has existed for a long time, and depending on who you ask, it can be loved or hated. But where is the line drawn between extracting data for legitimate business purposes and malicious data extraction that hurts business? That line is getting blurrier by the day, and the introduction of generative artificial intelligence (AI) and large language models (LLMs) has complicated things even further. Legal actions against web scraping are slow and vary by country, leaving organizations to fend for themselves.
What is Web Scraping?
Web scraping is a technique for swiftly pulling large amounts of data from websites using automated software (bots). OWASP classifies it as an automated threat (OAT-011). Web scraping differs from screen scraping in that it can extract the underlying HTML code and data stored in databases, while screen scraping only copies the pixels displayed on screen.
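To make "extracting the underlying HTML" concrete, here is a minimal sketch using only Python's standard library. The sample markup, the class names `name` and `price`, and the `scrape_prices` helper are all assumptions made up for illustration; a real scraper would fetch pages over the network and deal with far messier markup.

```python
from html.parser import HTMLParser

# Hypothetical page markup; the class names are assumptions for illustration.
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">Widget</span><span class="price">9.99</span></li>
  <li class="product"><span class="name">Gadget</span><span class="price">24.50</span></li>
</ul>
"""


class PriceScraper(HTMLParser):
    """Collects (name, price) pairs from spans with class 'name' or 'price'."""

    def __init__(self):
        super().__init__()
        self.current_field = None  # which span we are inside, if any
        self.pending_name = None   # last product name seen
        self.products = []

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "span" and css_class in ("name", "price"):
            self.current_field = css_class

    def handle_data(self, data):
        if self.current_field == "name":
            self.pending_name = data.strip()
        elif self.current_field == "price":
            self.products.append((self.pending_name, float(data.strip())))

    def handle_endtag(self, tag):
        if tag == "span":
            self.current_field = None


def scrape_prices(html):
    """Return a list of (name, price) tuples found in the given HTML."""
    parser = PriceScraper()
    parser.feed(html)
    return parser.products


if __name__ == "__main__":
    print(scrape_prices(SAMPLE_HTML))
```

The point of the sketch is that the bot reads structured markup rather than rendered pixels, which is why a few dozen lines are enough to lift an entire price list.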
Web scraping is not a new phenomenon; it has been around since the dawn of the internet. Early web scraping was manual and involved individuals copying and pasting data from web pages. However, as the internet grew more complex, so did the data extraction methods. Developers started writing code to automate the process, and with the advent of machine learning and AI, web scraping has become more sophisticated and efficient. In the age of AI, web scraping has become a critical tool for businesses to gather data for machine learning models, market research, competitor analysis, and more.
Web Scraping Uses: The Good, the Bad, and the Shady
Not all web scraping is bad – the difference is rooted in how it is conducted and how that data is being used. In its positive form, web scraping is a vital underpinning of the internet that is helpful for organizations and consumers alike. Good bots that perform web scraping enable search engines to help create and maintain a searchable index of web pages, price comparison services to save consumers money, and market researchers to gauge sentiment on social media.
In contrast, bad bots fetch content from a website for purposes outside the site owner’s control, often violating the terms of service. For example, competitors may scrape your pricing information to gain a competitive edge and disrupt your business. Worse, they may steal your proprietary content. Content scraping is outright theft at a large scale, and if your content appears elsewhere on the web, your SEO rankings are likely to suffer. In addition, unethical practitioners may scrape personal or sensitive information without consent, leading to privacy violations and potential identity theft.
Alarmingly, bad bots make up 30% of all web traffic today, and web scraping remains one of the most prominent use cases.
Are Web Scraping Services a Legitimate Business?
In recent years, organizations engaged in web scraping have invested heavily in positioning it as a legitimate business. These attempts to rebrand “bad bots as a service” show up in several ways. First, these organizations adopt professional-looking websites offering business intelligence services under names such as pricing intelligence, alternative data for finance, or competitive insights. Typically, they provide data products focused on specific industries. Second, there is increased pressure to purchase scraped data within your industry. No company wants to lose in the marketplace because the competition bought data that was available for sale. Finally, there is a growing number of job postings for positions with titles like Web Data Extraction Specialist or Data Scraping Specialist.
A quick look at the website or LinkedIn page of these dubious organizations reveals numerous articles justifying the use of bots to scrape data. I have seen multiple blog posts arguing that an organization must employ web scraping bots to remain competitive in its market. Some blog posts even boast about their bots’ ability to stay “under the radar” while masquerading as legitimate human users, for example by using residential ISPs as proxies. This raises the question: Why are these bots trying to evade security measures if such a business is legitimate?
Is Web Scraping Legal?
While web scraping is not inherently illegal, how it is conducted and the data’s subsequent use can raise legal and ethical concerns. Actions such as scraping copyrighted content and personal information without consent or engaging in activities that disrupt the normal functioning of a website may be deemed illegal.
The legality of web scraping largely depends on the jurisdiction and specific circumstances. In the United States, for instance, web scraping can be considered legal as long as it does not violate the Computer Fraud and Abuse Act (CFAA), the Digital Millennium Copyright Act (DMCA), or any terms of service agreements.
Can Legal Action Be Taken to Prevent Web Scraping?
Yes, it is possible to take legal action against web scraping, but it largely depends on the context. If a website can prove that scraping has harmed its operations or has violated terms of service, intellectual property, or privacy rights, a court may rule against the scraping activity. However, without a comprehensive law against web scraping, each case is evaluated individually, leading to varying outcomes. Several landmark lawsuits have shaped the legal landscape of web scraping:
- In the case of eBay vs. Bidder’s Edge in 2000, eBay successfully sued Bidder’s Edge for scraping its auction data, arguing that the scraping activity burdened its systems and could cause further harm if left unchecked.
- In Facebook vs. Power Ventures in 2009, the court sided with Facebook, and an appeals court later held that Power Ventures violated the Computer Fraud and Abuse Act by continuing to access Facebook user data after being told to stop.
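- One of the most recent and significant cases is LinkedIn vs. hiQ Labs. In 2019, the U.S. Court of Appeals for the Ninth Circuit ruled that scraping data publicly accessible on the internet likely does not violate the CFAA, a position it reaffirmed in 2022 after the Supreme Court sent the case back for reconsideration. The ruling has implications for future web scraping activities.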
Enforcement of web scraping laws can be challenging due to the global nature of the internet and differing regulations. Some entities actively enforce their terms of service through technological measures or legal action, especially if the scraping leads to tangible harm, such as data breaches, privacy violations, or financial losses. However, the extent of enforcement often depends on the severity of the violation and the resources available to the affected parties or relevant authorities.
This situation poses a moral dilemma for organizations. As the need to leverage specific techniques to avoid being disadvantaged grows, so does the probability of turning to web scraping. In an environment where constant efforts are made to legitimize web scraping, it is difficult to see the bot problem going away soon.
The Legality of Web Scraping in the Age of Artificial Intelligence
The rise of artificial intelligence (AI) and large language models (LLMs) has brought the discussion about the legality and ethics of web scraping back to center stage. Web scraping has become a crucial component in training AI systems and LLMs. These models, such as OpenAI’s GPT-4, rely on vast amounts of data to learn and generate coherent outputs.
By scraping data from the internet, these models can be trained on diverse and extensive data corpora, improving their ability to understand and respond to a wide range of inputs. However, this practice has also raised complex legal and ethical questions that businesses must navigate.
Recently, OpenAI faced lawsuits alleging that it unlawfully copied text from books without obtaining consent from copyright holders. These lawsuits have sparked a debate about the boundaries of data collection for AI training. While some argue that this data is necessary for advancing AI technologies, others contend it infringes copyright laws and privacy rights.
The ethical implications of web scraping extend beyond legality. As AI systems and LLMs are trained on scraped data, they may inadvertently amplify and proliferate private information, posing potential risks to individuals and society. Moreover, the lack of transparency in how this data is used and the difficulty of removing data once it has been incorporated into a model raise additional ethical concerns.
Conclusion
The legality of using bots to grab information from public websites remains unclear. It is indeed a grey area, as many applicable laws were written well before the widespread use of the internet or the development of generative AI, and which laws take priority hasn’t been resolved yet.
In an environment where constant efforts are made to legitimize web scraping, it is difficult to see this bot problem going away soon. Do the existing laws need to be updated to deal with the problem? Should new legislation be introduced to provide more clarity? Certainly, but as courts try to decide the legality of further scraping, companies still have their data stolen and the business logic of their websites abused.
How to Stop Web Scraping?
Instead of looking to the law to eventually solve this technology problem, solve it by adopting a bot management solution that can stop web scraping entirely. Investing in such a solution prevents bot operators, attackers, unsavory competitors, and fraudsters from abusing, misusing, and attacking your applications.
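One signal that bot detection commonly weighs is how fast a single client sends requests. The sketch below is a deliberately simple sliding-window counter and does not represent how any particular product works. The `RateMonitor` class, its thresholds, and the client identifiers are hypothetical. As noted earlier, scrapers that rotate through residential proxies can stay under a per-client limit, which is why a single heuristic like this is rarely sufficient on its own.

```python
from collections import defaultdict, deque


class RateMonitor:
    """Flags clients that exceed max_requests within a sliding time window.

    Illustrative only: thresholds and the notion of a 'client_id'
    (an IP address, session token, etc.) are assumptions.
    """

    def __init__(self, max_requests=5, window_seconds=10.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client_id -> recent request timestamps

    def is_suspicious(self, client_id, timestamp):
        """Record a request and report whether the client is over the limit."""
        hits = self._hits[client_id]
        hits.append(timestamp)
        # Drop timestamps that have fallen out of the window.
        while hits and timestamp - hits[0] > self.window_seconds:
            hits.popleft()
        return len(hits) > self.max_requests


if __name__ == "__main__":
    monitor = RateMonitor(max_requests=3, window_seconds=10.0)
    for t in [0.0, 1.0, 2.0, 3.0]:
        print(t, monitor.is_suspicious("203.0.113.7", t))
```

In this usage example the fourth request inside the ten-second window is the first one flagged; a slower client, or one spreading requests across many addresses, would not be.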
Imperva Advanced Bot Protection is a market-leading bot management solution that safeguards businesses from today’s most sophisticated bot attacks. It protects all entry points – websites, mobile apps, and APIs against every OWASP automated threat, including web scraping. As part of its multilayered approach to bot detection, it includes machine-learning models explicitly tailored to detect web scraping.
Advanced Bot Protection embraces a holistic approach, combining the vigilant service, superior technology, and industry expertise needed to give customers complete visibility and control over the human, good bot, and bad bot traffic, offering multiple response options for each. And most importantly, it does so without imposing unnecessary friction on legitimate users, maintaining the flow of business-critical traffic to your applications.