Drupal.org keeps spammers out of its community and bad content off its websites with Imperva Bot Management (formerly Distil Networks) browser fingerprinting technology
Overview
Drupal.org has been around for 13 years in support of the Drupal development project. Drupal is open source content management software that’s used to make many of the websites and applications people use every day. The Drupal community is one of the largest open source development communities in the world, consisting of more than a million passionate developers, designers, trainers, strategists, coordinators, editors, and sponsors working together. Collectively this community builds the Drupal software, provides support, creates documentation, shares networking opportunities, and more. Members’ shared commitment to the open source spirit pushes the Drupal project forward, and new members are always welcome. The Drupal.org website is the primary gathering point for the members of this exciting development effort.
Challenges
Spammers create bogus accounts to post their junk content
Drupal.org has millions of pages on its website, and registered members can post user-generated content such as forum questions and answers, developer modules, blog posts, job postings and more. The Drupal.org website has a highly coveted Google PageRank of 9, which makes it a very high-value target for SEO and spammers who want to put their backlinks and other junk content on Drupal.org’s site. It’s damaging to the Drupal brand to have spam on its site. Not only do legitimate members hate encountering the nuisance content, but Drupal.org is at risk of having its hard-earned high PageRank value lowered for hosting spam. Only registered members can post content to the Drupal.org website, so there’s a continuous onslaught of people creating accounts for the purpose of inserting link spam and other bad content onto the site. Ryan Aslett, Backend Developer Services Engineer, says, “We have implemented every strategy that we could possibly think of to mitigate the spam. We’ve done content analysis. We’ve used our Honeypot module for bot behavioral analysis. We’ve tried timing issues. Even with all of that, we were still getting an onslaught of spam.”
Aslett says these are actual people, not bots, that are creating and using the accounts for spamming. “These are real people sitting in front of a tool that they’ve developed. They aren’t automating anything. They have just been paid by somebody to spam link something, and they’ll come in and post something once a day and then leave. We know it’s not a bot. It’s a real person driving a browser and posting all this junk content.”
It’s too time-consuming to remove spam content
Drupal.org staff members and community volunteers had to spend considerable time manually identifying and removing spam. Brendan Blaine, Technology Manager, says he would spend at least half of his workday every day – and sometimes up to 12 hours in a day – doing nothing but deleting spam.
Some of the community volunteers helped with this effort as well. Much of the spam gets posted in the Drupal support forums. “There will be five good questions about Drupal and then five spam posts,” says Blaine.
It’s up to the volunteer forum webmasters to delete the junk content. If they don’t, members will avoid the forums.
“The volunteers don’t pull any punches when they are dissatisfied with the kind of spam they have to deal with,” says Blaine. “They want to spend their time helping people with Drupal technical questions, and instead they’re deleting spam. Then they tell us we’re doing a bad job in dealing with spam and we need to fix the problem. We really don’t want them wasting their time managing spam. We want them adding value to the project instead of helping clean up somebody else’s mess.”
Fake accounts and spam pollute the community engagement metrics
Aslett says there are 1.9 million user accounts in the organization’s database, but those metrics are skewed by the number of spammer accounts that have been registered over the years. “We know we have extra accounts that aren’t real members,” says Aslett. “It’s hard to gauge what’s actually happening with our community with unclear metrics. We can’t tell if our legitimate growth is increasing or slowing down. It’s hard to make decisions when we don’t have clean analytics and clean site data. We want to increase community engagement to increase the value of the community, but the picture is unclear when there’s activity on the site that’s not legitimate. Plus, unwanted accounts take up space in our database and the backups.”
The Solution
Drupal.org needed a way to stop spending staff time cleaning up the messes the spammers were creating. At first they approached it as a spam problem, so they looked at content filtering using Mollom. According to Aslett, “Mollom looks at what people are posting, who’s posting it, and how they’re posting it. All of this can reveal patterns that indicate this is bad stuff. So basically Mollom is able to tell us if what these accounts are posting matches a pattern of ‘spamminess,’ but it doesn’t keep the content from being posted in the first place.”
They also tried Honeypot, a Drupal module intended to deter spam bots from completing forms on websites built with Drupal. Honeypot proved effective in stopping bots, but the entities creating the bogus accounts are actual people, and Honeypot doesn’t defeat them.
“As we researched ways to prevent the spam, we discovered that all of these bad actors we wanted to keep out had one thing in common: they were hiding their identities behind proxies,” says Aslett. “This allowed them to avoid having us block their IP address. So we started to hone in on how to unmask the people behind the proxies and block them before they could ever create an account to post their spam. That’s when we started to look into browser fingerprinting technologies.”
Given that Drupal.org is a community of software developers, Aslett says their first inclination was to develop a solution themselves. “When we started to dig into it, we realized this is pretty advanced stuff. We decided to partner with a provider that has this technology already. That’s when we found Imperva Bot Management (formerly Distil Networks).”
Now the module to register for a Drupal.org account runs through the Imperva Cloud CDN. “If someone wants to post content, they have to have an account,” says Aslett. “We run that account creation process through the Imperva cloud service and gather device fingerprints on every new account. This process has revealed some striking information that has really helped us whittle away the spammers. We’re at the point where they are removing us from their list of websites to spam, and that’s a big win for us.”
What the Imperva Bot Management data shows
At the time of this writing, Drupal.org has had the Imperva Bot Management solution in place for about nine months. In that time, about 20,000 new accounts have been created, and all of them have had their devices fingerprinted by Imperva. About 10 percent of them have shown indications of being accounts tied to spammers. By looking at the device fingerprints in detail, Aslett says they learned there are only about 200 to 300 bad actors that are creating all the bad accounts.
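The figures above imply a rough ratio of spam accounts per bad actor. The short calculation below is only a back-of-the-envelope sketch: the inputs are the approximate numbers quoted in this section, and the variable names are illustrative.

```python
# Approximate figures quoted in the case study ("about" values, not exact counts).
total_accounts = 20_000            # new accounts fingerprinted in about nine months
spam_share = 0.10                  # share showing indications of spammer activity
actors_low, actors_high = 200, 300 # estimated number of distinct bad actors

# Estimated number of accounts tied to spammers.
spam_accounts = round(total_accounts * spam_share)

# Average accounts per bad actor, at both ends of the estimated range.
per_actor_if_few_actors = spam_accounts / actors_low
per_actor_if_many_actors = spam_accounts / actors_high

print(spam_accounts, per_actor_if_few_actors, round(per_actor_if_many_actors, 1))
```

Under these assumptions, roughly 2,000 suspicious accounts trace back to a few hundred people, which works out to somewhere between about 7 and 10 accounts per bad actor on average.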
“When someone creates an account, we capture a browser fingerprint that uniquely identifies them as an individual. Since we have never seen this person’s fingerprint before, we can’t really make a good/bad decision on them yet,” explains Aslett. “We have learned that spammers create new accounts all the time, so as soon as a person creates a second account that has the same fingerprint, we can make the assumption that it’s the same person making multiple accounts. That violates our policy about members having only one account.”
“We’ve seen lots of spammers that create 10, 15, 25 accounts, and they create what we call a ‘super generic fingerprint’,” says Aslett. “For example, they’re using the most recent version of Google Chrome, they have no plug-ins installed, and they have default settings. There’s nothing about their system that makes them look unique at all because they just never installed anything, and so they look just like other users. Their fingerprints are so non-unique that it’s like they are hiding in blandness. This doesn’t necessarily tell us they are spammers but it does tell us we need to take a closer look at them.”
Even though these accounts with the bland fingerprints are a curiosity, Drupal.org can’t assume they are bad. The same type of fingerprints can happen when a university has a class of 30 students and all the students have an installation that is flashed from a golden master daily. Every single day the students’ devices get reset to the same default state. In effect, the students are all using browsers that look identical to Drupal.org.
For this reason, Aslett’s team can’t automate the lockout of accounts with duplicate fingerprints. “We added a manual step to evaluate these accounts more closely, but the rate at which we are having to look at these cases is dropping,” Aslett explains. “It’s amazing how much less time we spend as time goes on, from looking at the spam, to looking at the fingerprints, to not having to look at it at all.”
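The review workflow described above, where a repeated fingerprint triggers a manual check rather than an automatic lockout, could be sketched roughly as follows. This is a minimal illustration, not Drupal.org's or Imperva's actual implementation: the `FingerprintRegistry` class, its method names, and the string fingerprints are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class FingerprintRegistry:
    """Hypothetical tracker of which accounts registered from each fingerprint."""

    accounts_by_fingerprint: dict = field(default_factory=lambda: defaultdict(list))
    review_queue: list = field(default_factory=list)

    def register(self, account_id: str, fingerprint: str) -> bool:
        """Record a new account; return True if it was queued for manual review."""
        seen = self.accounts_by_fingerprint[fingerprint]
        seen.append(account_id)
        if len(seen) > 1:
            # A repeated fingerprint may be one person with several accounts,
            # or a classroom of identically imaged machines, so a human
            # reviews it instead of the account being blocked automatically.
            self.review_queue.append((fingerprint, list(seen)))
            return True
        # First sighting of this fingerprint: no good/bad decision yet.
        return False


registry = FingerprintRegistry()
first = registry.register("account-1", "fp-a")   # new fingerprint
other = registry.register("account-2", "fp-b")   # different fingerprint
repeat = registry.register("account-3", "fp-a")  # same fingerprint as account-1
```

Keeping the decision in a review queue rather than blocking on the first duplicate reflects the university lab case: identical fingerprints are a signal worth a closer look, not proof of spam.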
Results
Fewer spammer accounts, less spam being posted
As spammers learn they are going to get blocked from Drupal.org, they stop trying to create accounts. Without accounts, they can’t post spam on the website.
With less spam being posted, volunteers can devote their time to helping members in the forum instead of removing spam comments. It’s a big boost to the morale of volunteers when they are making meaningful contributions to the greater community rather than wasting their time deleting spam. Even the Drupal.org staffers spend far less time dealing with spam than they did before the Imperva Bot Management solution was deployed.
“If we block them in the registration process, the whole chain of negative activity never happens,” says Aslett.
Better metrics and community engagement
The big reduction in spammers and spam helps Drupal.org fix their metrics. “We can get a more accurate picture of the health of our site and the health of the community,” says Aslett. “Without the pollution of the spam, we can use different strategies to increase community engagement, increase the value of the community, and see if we can get more developers working on the site. We can actually see that impact without having to question whether or not our numbers really mean what we think they mean.”
Conclusion
The Imperva Cloud CDN has become a critical component of Drupal.org’s member registration process. Imperva’s fingerprinting process gives Drupal.org staffers the detailed information they need to weed out bogus accounts, which in turn prevents spam links and content from polluting the website. “This is exactly what we were hoping we were going to get out of implementing the Imperva Bot Management filtering solution,” says Aslett.