Web scraping is the process of using bots to extract content and data from a website. Web scraping, also called site scraping or screen scraping, can undermine victims’ revenues and profits by siphoning off customers and reducing competitiveness. Scraping activity ranges from harmless data collection for personal research to calculated, repeated data harvesting used to undercut competitors’ prices or to publish valuable information illicitly.
In the last couple of years, web scraping has changed from a minor annoyance into one of the biggest threats to business applications. Persistent scraping bots consume the application’s computational resources, bring in zero value, and, in the more typical cases, actually damage the business by stealing content from web pages and extracting prices for competitors. Indeed, the web scraping business is flourishing, creating a new market of services. Despite differences in their sophistication and persistence, scraping campaigns share essentially the same fundamental goal: extracting valuable data from the application.
We present two web scraping attacks that originated from Tor. In the first case, we track the attack patterns from the perspective of the scraping campaign organizer who is targeting a well-known reseller website. In the second instance, we track the attack from the viewpoint of the victim—a luxury retail site.
A Glimpse into a Scraping Attack
For over a month, we recorded Tor clients using the same Tor exit node (the same IP) to scrape the well-known reseller’s website. The intensity of the attack, together with the multi-origin nature of many scraping attacks, leads us to believe that the attack was carried out from other Tor clients concurrently. It is likely that during the inactivity periods seen in Figure 1, the attack disappeared from our radar but continued from other sources. In the peak marked by the blue circle in Figure 1, we detected 4,616 requests in six minutes. Ten days earlier, at the peak marked by the green circle, we detected a second peak of 2,929 requests within ten minutes, which is consistent with the default lifetime of a Tor circuit. At that point, we believe the attacker switched to a different Tor circuit and continued the scraping, which the victim’s website saw as traffic from another Tor IP.
To tell whether the traffic from this IP came from legitimate users browsing the site anonymously through Tor or from actual web scrapers, we analyzed the behavior of the Tor-originating traffic and the content of the visited pages. The traffic patterns, as seen in Figure 1, were jagged, with silent days interrupted by sudden peaks, which is typical of automated web attacks such as scraping. Furthermore, we were not surprised to find that the visited pages contained product lists and pricing, the most scraping-prone web content. Such pages are rarely scraped by merely curious users; typically, they are scraped by competitors who adjust their own prices to gain an unfair advantage. Another characteristic of the suspicious traffic was the lack of requests for image resources, which we attribute to the scraper crawling selectively rather than loading full pages as a browser would.
A closer look at the user-agent headers of the suspicious traffic, shown in Figure 3, reveals that, as in many web attacks, the attacker’s client impersonated different browsers to foil mechanisms that block traffic from automated tools.
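The peaks described above are short bursts of requests from a single IP. A minimal sketch of how such bursts could be surfaced from request timestamps follows; the function names, the ten-minute window, and the threshold are assumptions chosen for illustration, not the tooling used in this analysis.

```python
from collections import Counter
from datetime import datetime, timedelta

# Arbitrary reference point used only to align window boundaries.
REFERENCE = datetime(2000, 1, 1)

def requests_per_window(timestamps, window=timedelta(minutes=10)):
    """Count requests in fixed-size windows, keyed by window start time."""
    counts = Counter()
    for ts in timestamps:
        # Floor each timestamp to the start of the window that contains it.
        start = REFERENCE + ((ts - REFERENCE) // window) * window
        counts[start] += 1
    return counts

def peak_windows(counts, threshold):
    """Return (window_start, request_count) pairs at or above the threshold."""
    return sorted((start, n) for start, n in counts.items() if n >= threshold)
```

Fixed windows keep the sketch short; a sliding window would catch bursts that straddle a window boundary, at the cost of more bookkeeping.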
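One of the signals mentioned above, the absence of image requests, can be approximated from access logs. The sketch below assumes each log record has already been reduced to a (client IP, URL) pair and that images can be recognized by file extension; both are simplifying assumptions, and the names are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Assumed list of image extensions; a real site may serve images differently.
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".webp", ".svg", ".ico")

def image_request_ratio(requests):
    """Return {client_ip: fraction of that client's requests that fetched an image}."""
    totals = defaultdict(int)
    images = defaultdict(int)
    for ip, url in requests:
        totals[ip] += 1
        # Look only at the path so query strings do not hide the extension.
        path = urlparse(url).path.lower()
        if path.endswith(IMAGE_EXTENSIONS):
            images[ip] += 1
    return {ip: images[ip] / totals[ip] for ip in totals}
```

A client with many page requests and a ratio at or near zero behaves unlike a typical browser, which is a hint rather than proof, since caching and text-only browsing can produce the same pattern.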
The next section reviews another scraping campaign originating from the Tor network, this time from an application perspective to provide a complete picture of the scraping attack.
Scraping Campaign
The victim of the attack was a luxury retail application, with almost two million suspicious requests recorded during a single week, presumably all part of the same attack. Ninety-nine percent of the requests targeted only three URLs, a concentration that is typical of scraping attacks. A closer look at these URLs (Figure 2) shows a search page and two product shopping pages listing products and their prices, clearly the target of a price scraping attack. The attackers used the search function to find products and then followed the search results, staying within the product pages.
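The concentration of requests on a handful of URLs is easy to measure. The following sketch, with a hypothetical function name and an assumed input of one URL per request, computes the share of traffic that goes to the k most requested URLs.

```python
from collections import Counter

def top_url_share(urls, k=3):
    """Return the k most requested URLs and the fraction of all requests they received."""
    counts = Counter(urls)
    top = counts.most_common(k)
    total = sum(counts.values())
    share = sum(n for _, n in top) / total if total else 0.0
    return top, share
```

Applied to suspicious traffic only, a share close to 1.0 for a very small k would match the pattern described above.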
The attackers used 777 Tor IPs, producing an average of 2,563 requests per IP during the week. Further investigation of the traffic showed that 1,086 session identifiers (jsessionid) were in use during the attack, with each of these session IDs used 1,117 times on average, in many cases from different IPs. For example, Figure 4 shows the usage of a specific session ID from as many as 25 different IPs.
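Session identifiers that appear from many different IPs are a useful signal, since a legitimate browser session rarely hops across dozens of addresses. A minimal sketch of that check follows, assuming log records reduced to (session ID, client IP) pairs; the function names and the threshold are illustrative assumptions.

```python
from collections import defaultdict

def ips_per_session(records):
    """Map each session ID to the set of client IPs that presented it."""
    seen = defaultdict(set)
    for session_id, ip in records:
        seen[session_id].add(ip)
    return seen

def shared_sessions(records, min_ips=2):
    """Return only the sessions that were used from at least min_ips distinct IPs."""
    return {
        sid: ips
        for sid, ips in ips_per_session(records).items()
        if len(ips) >= min_ips
    }
```

As a sanity check on the figures above, 777 IPs at an average of 2,563 requests each gives 1,991,451 requests, which agrees with the "almost two million" requests recorded during the week.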
As in the previous attack, user-agent strings were faked. The attack used 453,547 unique user-agent strings, which indicates that the strings were randomized. After cleaning some noise from these strings, we identified three main user agents that were used interchangeably within the same session, with different permutations or versions:
- Mozilla User-Agent: Mozilla/5.0 (Windows; Windows NT 5.1; en-US; +MrTiger) Firefox/10.0 (66% of the requests)
- MSN-BOT User-Agent: msnbot-Products/1.0 (+http://search.msn.com/msnbot.htm) (17% of the requests)
- Opera User-Agent: Opera/9.80 (Windows NT 6.1; U; es-ES) Presto/2.9.181 Version/12.00 x:0) (17% of the requests)
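Collapsing noisy user-agent strings into a few families, and then checking how many families appear within one session, can be sketched as below. The substring markers and family labels are assumptions derived from the three strings listed above, not the cleaning rules used in the original analysis.

```python
from collections import defaultdict

# Hypothetical (marker, family) pairs; the first marker found in the string wins.
FAMILY_MARKERS = (
    ("msnbot", "msnbot"),
    ("opera", "opera"),
    ("firefox", "mozilla-firefox"),
)

def ua_family(user_agent):
    """Reduce a raw user-agent string to a coarse family label."""
    ua = user_agent.lower()
    for marker, family in FAMILY_MARKERS:
        if marker in ua:
            return family
    return "other"

def families_per_session(records):
    """Map each session ID to the set of user-agent families seen in it."""
    families = defaultdict(set)
    for session_id, user_agent in records:
        families[session_id].add(ua_family(user_agent))
    return families
```

A single session that presents itself as Firefox, Opera, and a search engine bot in turn is very unlikely to be a real browser.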
Finally, we analyzed the IP distribution of the attack (Figure 5). We identified three dominant Tor source IPs from which most of the requests originated. We were not surprised to find that these three IPs belong to three hosting services in the U.S., a typical home for Tor exit nodes. The uneven distribution of requests among IPs is curious. It caught our attention and will be covered in our next blog post. Spoiler alert: Think bandwidth.
We saw two scraping attacks originating from Tor. The attack characteristics match the profile of a typical organized scraping campaign: distribution across many IPs, recycling of session identifiers, multiple user-agent strings, and, of course, anonymity. The attackers used the anonymity provided by Tor to hide the trail of evidence leading back to them.
Mitigation
Web application firewalls are commonly used to protect against most web application attacks. Today’s advanced web application firewalls can detect such complex business logic attacks regardless of whether Tor is used. Web application firewalls with integrated threat intelligence can take IP reputation data into account and correlate attacks to stop them effectively.
Find out more about blocking scraping attacks by downloading the “Detecting and Blocking Site Scraping” white paper.