
Fight WordPress comment spam with .htaccess

Posted by T. Greg Doucette on Aug 4, 2010 in Technology

Spambots really frost my Wheaties… :mad:

Given the prevalence of Google indexing and the role links to a given site play in search rankings, “spamdexing” is something every blog author is going to face at some point or another. Basically spammers write scripts to leave fake comments stuffed with links on a sh*tload of blogs, in an effort to boost the search engine rank for their own site.

I had taken a fairly laissez-faire attitude toward spammers since law:/dev/null started back in August, but after getting slammed with spam last month I decided that needed to change. So part of my delay in getting things posted last week (aside from just having a lot to edit) was the product of me dusting off some of my old Computer Science notes and getting intimate with some old spam-fighting techniques.

I’m not sure I’ve got it completely re-mastered, but I figure I’ve got things down enough that I can share some of that insight with y’all. Besides, it took me 6 years to finish a 4-year degree — I might as well put what I learned to some use :beatup:

A large share of websites across the globe run on the Apache HTTP server, a truly excellent, scalable and secure open-source web server. Odds are good your own blog is running on Apache right now [1], and that means you have an effective anti-spam tool built in: the .htaccess file.

Disclaimer: .htaccess and regular expressions are both powerful tools for web development, especially when they’re combined. Be über-careful as you work on this file (and make backup copies) because mistakes or typos can make your blog totally inaccessible to everyone. I’m also assuming you have at least some familiarity with your own webserver; since I don’t know the specifics of your own setup, proceed at your own risk, caveat emptor, etc etc etc. Basically #dontsuemeplzkthxu ;)

.htaccess is a plain-text, per-directory configuration file that the Apache web server reads to process commands called directives. To create one, all you have to do is create a new plain text file (e.g. in TextEdit on my Mac, after opening a new file I go to Format > Make Plain Text), save it, upload it to your server via FTP or however you directly upload files, then rename it “.htaccess” (without the quotes). If your blog folder already has an .htaccess file (WordPress creates one when you turn on pretty permalinks), add to that file instead of replacing it.

There are all sorts of cool things you can do with .htaccess… but I’m only going to show you a small subset, so feel free to Google for the rest ;)

====================
1) FIRST LOCK DOWN YOUR SERVER…
====================

Certain files on your blog get accessed on the backend by the web server itself or by you via a command-line interface. They’re not the type of thing that should ever be viewable or accessible to the public through a web browser.

For example, you don’t want everyone being able to read your .htaccess file because they’ll know what you’re defending against… and, by implication, what you’re not defending against ;)

Here’s a quick code snippet to block access to these files:

############ PROTECT FILES ############
# This snippet prevents unauthorized access to certain
# core files like .htaccess as well as logs, scripts,
# and other things that can be exploited by spammers
#######################################
<FilesMatch "\.(htaccess|htpasswd|ini|phps|fla|psd|log|sh)$">
order allow,deny
deny from all
</FilesMatch>

The “#” at the start of a line marks a comment, so the web server ignores everything on that line.

The <FilesMatch> line opens a block that applies only to files whose names match the regular expression in quotes, and the “|” works as an OR logical operator. So here it’s telling the web server to run the inner segment of code if the requested file name ends in .htaccess or .htpasswd or .ini or .phps or .fla or .psd or .log or .sh.

That inner segment of code just says to deny all access to the file requested. Someone trying to access this file will get a “403 – Forbidden” error message.

Then the </FilesMatch> line tells the server the block is done.
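
If your host has moved to Apache 2.4, the order/deny lines above only keep working while the mod_access_compat module is loaded. Here's a minimal sketch of the same lockdown in the newer syntax, assuming Apache 2.4 or later and the same extension list as above:

```apache
############ PROTECT FILES (APACHE 2.4 STYLE) ############
# Same idea as the snippet above, written with the newer
# "Require" syntax. Assumes Apache 2.4 or later.
##########################################################
<FilesMatch "\.(htaccess|htpasswd|ini|phps|fla|psd|log|sh)$">
    Require all denied
</FilesMatch>
```

Pick one style or the other for a given block; mixing the old and new access directives gets confusing fast.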

====================
2) …THEN REDUCE SERVER OVERSHARE
====================

Turns out overshare isn’t just a people problem: computer servers sometimes needlessly share too much information themselves.

On many installations, for example, whenever the Apache web server generates a document (e.g. a “403 – Forbidden” error message or a “404 – Not Found” error) it includes a line at the bottom listing the version of the web server and what modules are running. This Server Signature is designed to help folks accessing websites through proxy servers who might not be able to tell which site generated a given error. But it also lets spammers know what you’re running, and if for some reason you have out-of-date software — more common than you’d think — spammers will then know which security exploits they can use against your server.

This information is still relatively easy to figure out, but there’s no point in letting your server just offer it up willy-nilly ;)

The ServerSignature is usually off by default, but just in case you can use this code:

############ DISABLE SERVER SIGNATURE ############
# This snippet disables the server signature so the server
# is not volunteering data about itself that could be useful
# to spammers in determining what attacks would work best
##################################################

ServerSignature Off

This just tells the Apache web server to shut off its ServerSignature. Very simple. :)
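
ServerSignature only covers the footer on server-generated pages; the "Server:" response header is controlled by a separate directive, ServerTokens. Here's a rough sketch, assuming you (or your host) can edit the main server configuration, since ServerTokens is not allowed in .htaccess:

```apache
# Goes in the main server configuration file, NOT in .htaccess.
# "Prod" trims the Server response header down to just "Apache".
ServerTokens Prod
ServerSignature Off
```

On shared hosting you probably can't touch that file, in which case the ServerSignature line above is about as far as you can go.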

====================
3) BAN REMOTE COMMENTS
====================

In WordPress, leaving a comment accesses the wp-comments-post.php file. Some spammers will try to access this file without ever actually visiting your site.

You can stop this kind of non-local comment with the following code snippet:

############ NON-LOCAL COMMENT BAN ############
# This snippet prevents spammers from directly accessing
# the wp-comments-post.php file. In order to leave a comment
# a spammer must be “in” your domain by visiting your site.
###############################################
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteCond %{REQUEST_METHOD} POST
RewriteCond %{REQUEST_URI} ^/?wp-comments-post\.php
RewriteCond %{HTTP_REFERER} !^http://www\.yourdomaingoeshere\.com [NC,OR]
# RewriteCond %{HTTP_REFERER} ^-?$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^-?$
RewriteRule .* - [F,L]
</IfModule>

The 1st line opens a conditional block that checks whether the mod_rewrite Apache module is loaded; odds are it is, but it’s good to check just in case. The 2nd line tells Apache to turn on the URL rewrite engine.

The next 4 lines cover the conditions that must be met for the rewrite rule to fire: (1) the spammer must be trying to POST data [2], (2) the POST data must be going directly to the wp-comments-post.php file, and either (3) the POST attempt is not coming from your domain itself (the “NC” in brackets makes the match case-insensitive) or (4) the commenter is using a browser that sends an empty User-Agent string (or just a “-”) [3].

Assuming that batch of conditions is met, meaning 1 and 2 and (3 or 4), the RewriteRule line is executed. In this case the poster gets a 403 Forbidden error when the comment is submitted (the “F” in the brackets) and the rewrite engine stops processing further rules (the “L”, for “last”, in the brackets).

I also included a commented-out line that explicitly blocks posts with an empty HTTP_REFERER field. Be aware that a blank Referer already fails the domain check in (3), so posts without one get blocked either way. Some security programs intentionally send blank referrer info so you don’t know what website someone is coming from, so keep in mind the risk of blocking those folks.
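
The Referer check above only matches addresses that start with http://www., so a blog served over https or without the www would end up blocking its own commenters. Here's a sketch of a more forgiving version, assuming the same placeholder domain (yourdomaingoeshere.com) and that WordPress lives at the top level of your site:

```apache
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteCond %{REQUEST_METHOD} POST
RewriteCond %{REQUEST_URI} ^/?wp-comments-post\.php
# Accept http or https, with or without the "www." prefix; the
# trailing (/|$) keeps look-alike hosts such as
# yourdomaingoeshere.com.example.net from slipping through
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?yourdomaingoeshere\.com(/|$) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^-?$
RewriteRule .* - [F,L]
</IfModule>
```

Swap in your own domain and escape each dot in it with a backslash, same as in the original snippet.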

====================
4) BAN SPAMMERS
====================

This is the real “meat and potatoes” of the .htaccess file as far as WordPress spam goes, and in my tests over the past couple weeks it’s been highly effective.

Although you can find tutorials online using mod_rewrite for this, similar to the non-local comment ban in #3 above, I’m personally a fan of using Apache’s environment variables. Since the objective of spamdexing is to increase rankings in search engines, spammers usually leave referrer code in your logs that you can use to ferret them out and stop them from ever coming back.

Here’s the code snippet:

############ SPAMMER BAN ############
# This snippet uses environment variables to ban spambots
# that come to your site with certain characteristics, such
# as Referer code from a spam-y site
#####################################

SetEnvIfNoCase Via badproxy spambot
SetEnvIfNoCase Referer badspammer1.com spambot
SetEnvIfNoCase Referer badspammer2.ru spambot
# […add as many of these lines as you have bad referrers…]
SetEnvIfNoCase User-Agent ^Bad.Spammer.Browser1 spambot
# […add as many of these lines as you have bad User-Agents…]

order allow,deny
deny from env=spambot
deny from 0.0.0.0
deny from 255.255.255.255
# […add as many of these lines as you have bad IP address not blocked by referrer bans…]
allow from all

So here’s the way this works. If you see a comment from a spam website or you notice a spamming User-Agent in your logs, you create an entry for it like the ones in the first block of the snippet.

SetEnvIfNoCase tells Apache to set an environment variable if the given request header matches the pattern (ignoring case). So, in this example, if a spammer is coming from badspammer1.com Apache will set an environment variable called “spambot” [4].

Down in the second block, it will deny access to your site from that referrer since the “spambot” variable is set.
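
Because the pattern in a SetEnvIfNoCase line is a regular expression, an unescaped dot matches any character, and the pattern can match anywhere in the header. Here's a sketch of tighter entries, assuming the same made-up spammer domains as above:

```apache
# Escape the dots and anchor to the host part of the Referer, so
# "badspammer1.com" is matched as a domain (subdomains included)
SetEnvIfNoCase Referer "^https?://([^/]+\.)?badspammer1\.com(/|$)" spambot
# Same idea, with the variable's value spelled out explicitly
SetEnvIfNoCase Referer "^https?://([^/]+\.)?badspammer2\.ru(/|$)" spambot=1
```

The looser patterns in the main snippet still work; these just cut down the odds of catching an innocent site whose address happens to contain the same text.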

Also in this section, you can deny access from specific IP addresses as well if you notice the same IP producing the same spam over and over. For example, earlier this week I had a handful of compromised PCs leaving me spam comments with fake URLs (meaning the Referrer info was useless) and no common User-Agent I could ferret out of my logs. So I just blocked their IP addresses.

Blocking IPs is a bit extreme since they can be dynamically assigned and may end up belonging to a legitimate commenter days later, so if you do block an IP address I’d suggest commenting it out with a “#” after a couple weeks just in case. You can always un-comment it if the spamming picks up again. :)
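
Here's a sketch of what real IP entries might look like, assuming made-up addresses from the ranges reserved for documentation (192.0.2.x and 198.51.100.x). The dated comments are just a suggested habit to make the couple-weeks cleanup easier:

```apache
order allow,deny
allow from all
# Added YYYY-MM-DD: a single compromised PC
deny from 192.0.2.15
# Added YYYY-MM-DD: a whole block of addresses, in CIDR notation
deny from 198.51.100.0/24
```

Keep each comment on its own line, since Apache does not treat a "#" at the end of a directive line as a comment.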

====================
5) BAN HOTLINKERS
====================

Hotlinking is the process of taking a URL of where an image is hosted and pasting it into your own page. This is particularly common on message boards where folks post images they see around the web. When you hear people talk about “bandwidth theft”, hotlinking is the action that leads to it. Basically people are loading the image from your own server without ever visiting your site.

I’ve always taken a fairly permissive view toward hotlinking, mostly because I generate a lot of tables and graphs that I’m perfectly fine with other people using — and if they use them, I’d like to see in my logs where they’re using them ;)

But sometimes you get someone hotlinking an image that is loaded so many times (like on a super-busy forum) that your server chokes or you use all your bandwidth for a given month or you get a nastygram from a server administrator for hogging system resources. That’s what happened to me earlier this month :( So using the same environment variables approach for banning spammers I wrote up a blacklist for banning certain excessive hotlinkers.

Here’s the code snippet:

############ HOTLINK BAN ############
# This snippet prevents hotlinks to files in your local domain
# to prevent others from stealing your bandwidth (almost always
# used for picture files).
#####################################
SetEnvIfNoCase Referer badhotlinker1.com hotlinkers
SetEnvIfNoCase Referer badhotlinker2.ru hotlinkers
#[…add as many of these lines as you have hotlinkers…]
<FilesMatch "\.(png|jpg|jpeg|gif|bmp|swf|flv|pdf)$">
order allow,deny
deny from env=hotlinkers
# ErrorDocument 403 /somedirectory/nohotlinking.gif
allow from all
</FilesMatch>

My current anti-hotlinking pic. It needs work.

We create the environment variable “hotlinkers” if someone is coming from a recognized domain where the image is getting hotlinked. We then use the FilesMatch directive (the same type we used in #1 up at the top) to see if they’re trying to load certain image files like .png, .jpg, .gif, and so on.

If they’re accessing those filetypes from the hotlinked domain, they’ll get a 403 Forbidden error instead.

And if you’re in an artistic mood, the commented line sends them to a custom 403 Forbidden error page — just uncomment it and in place of the hotlinked image they’ll instead see whatever you choose to put in its place. In my case I went with advertising for the blog :beatup:
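
One thing worth checking if you go the custom-image route: the replacement picture has an image extension too, so the same ban can end up applying to it. Here's a sketch of one way around that, assuming the /somedirectory/nohotlinking.gif path from the snippet above and that your host lets SetEnvIf run from .htaccess:

```apache
SetEnvIfNoCase Referer badhotlinker1\.com hotlinkers
# Clear the flag for the replacement image itself, so the ban
# does not also apply to the picture you want them to see
SetEnvIf Request_URI "/somedirectory/nohotlinking\.gif$" !hotlinkers
<FilesMatch "\.(png|jpg|jpeg|gif|bmp|swf|flv|pdf)$">
order allow,deny
allow from all
deny from env=hotlinkers
ErrorDocument 403 /somedirectory/nohotlinking.gif
</FilesMatch>
```

Test it by loading one of your images from a page on a banned domain (or by faking the Referer header) and make sure the replacement shows up rather than a plain 403.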

—===—

Hope this helps any of you fellow blawgers who are tired of dealing with spam comments! If you have any questions let me know in the comments, and if you’ve somehow been banned from commenting, send me an email [5] ;)

And if you happen to be one of my CSC colleagues from NC State, please feel free to double-check my syntax and make sure I’ve got everything right :D

Have a great night y’all! :)

  1. If you’re not sure what webserver you’re on, check with your web administrator.
  2. This is usually what happens when you submit a form online, contrasted with a GET submission where the data being submitted is embedded within the resulting URL itself.
  3. This might, on very rare occasions, block a legitimate commenter. I’m not sure if it will ever happen but consider yourself forewarned :)
  4. The default value for these is 1, but you can also type in “spambot=1” if you’re a stickler for proper coding techniques.
  5. My email address is located at the bottom of our About page ;)
