Fighting Web Scraping with Apache Rewrite

Fortunately, a little bit of Apache rewrite-fu applied to the problem makes for a fun and exciting solution. Here is what I did.

First, look through your web server logs to find the IP address of the scraper. In my case, the stolen content was being published at http://www.austinpm.com/blog/. The address of that site is:

$ host www.austinpm.com
www.austinpm.com is an alias for austinpm.com.
austinpm.com has address 72.232.98.98

So I searched the log for hosts starting with 72.232.98:

$ grep '^72\.232\.98\.' access_log
    .
    .
    .
72.232.98.98 - - [14/Nov/2006:20:00:05 -0600] "GET /blog/metablog.xml HTTP/1.0" 200 1084 "-" "-"
72.232.98.98 - - [14/Nov/2006:20:30:08 -0600] "GET /blog/metablog.xml HTTP/1.0" 200 1084 "-" "-"
72.232.98.98 - - [14/Nov/2006:21:00:04 -0600] "GET /blog/metablog.xml HTTP/1.0" 200 1084 "-" "-"
72.232.98.98 - - [14/Nov/2006:21:30:08 -0600] "GET /blog/metablog.xml HTTP/1.0" 200 1084 "-" "-"
    .
    .
    .

This shows that 72.232.98.98 was pulling my RSS feed every half hour. I had found my scraper. Armed with the knowledge that they were scraping from a fixed, known host address, I was ready for the next step.
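
When the scraper's address is not known in advance, tallying feed requests per client can surface it. The following is a minimal sketch in Python, assuming the log is in Apache's common or combined format; the function name count_feed_requests and the LOG_LINE pattern are illustrative, not part of my original setup.

```python
import re
from collections import Counter

# Common/combined log format: client address first, request line in quotes.
LOG_LINE = re.compile(r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+)')


def count_feed_requests(lines, feed_path="/blog/metablog.xml"):
    """Count requests for feed_path per client address (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group("path") == feed_path:
            counts[match.group("ip")] += 1
    return counts


# The four log entries shown above.
sample = [
    '72.232.98.98 - - [14/Nov/2006:20:00:05 -0600] "GET /blog/metablog.xml HTTP/1.0" 200 1084 "-" "-"',
    '72.232.98.98 - - [14/Nov/2006:20:30:08 -0600] "GET /blog/metablog.xml HTTP/1.0" 200 1084 "-" "-"',
    '72.232.98.98 - - [14/Nov/2006:21:00:04 -0600] "GET /blog/metablog.xml HTTP/1.0" 200 1084 "-" "-"',
    '72.232.98.98 - - [14/Nov/2006:21:30:08 -0600] "GET /blog/metablog.xml HTTP/1.0" 200 1084 "-" "-"',
]

counts = count_feed_requests(sample)
for ip, hits in counts.most_common():
    print(ip, hits)  # prints: 72.232.98.98 4
```

To run it against a real log, pass an open file instead of the sample list, since the function only needs an iterable of lines. A client that fetches the feed on a fixed schedule and sends no user agent, like the one above, is a likely candidate.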

Next, I created an alternate RSS file that I could serve to the scraper. I created a file called metablog-stolen.xml and placed in it an RSS feed with the message:

The web site you currently are reading is stealing its content from Austin Bloggers, the original community meta-blog for Austin. We'd like to invite you to visit the Austin Bloggers site directly, and see what people are blogging about Austin.

You can view this file directly at http://www.austinbloggers.org/metablog-stolen.xml.
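
A file like this can be written by hand, but generating it keeps the XML well-formed. Here is a rough sketch using Python's standard xml.etree.ElementTree module. The channel and item titles, the channel description, and the function name build_notice_feed are assumptions made for illustration; only the message text is taken from the feed described above.

```python
import xml.etree.ElementTree as ET

MESSAGE = (
    "The web site you currently are reading is stealing its content from "
    "Austin Bloggers, the original community meta-blog for Austin. We'd like "
    "to invite you to visit the Austin Bloggers site directly, and see what "
    "people are blogging about Austin."
)


def build_notice_feed(message, site_url="http://www.austinbloggers.org/"):
    """Build a one-item RSS 2.0 feed carrying the notice (hypothetical helper)."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # RSS 2.0 requires a title, link, and description on the channel.
    ET.SubElement(channel, "title").text = "Austin Bloggers"        # assumed title
    ET.SubElement(channel, "link").text = site_url
    ET.SubElement(channel, "description").text = "Notice to readers"  # assumed text
    item = ET.SubElement(channel, "item")
    ET.SubElement(item, "title").text = "This content was scraped"  # assumed title
    ET.SubElement(item, "link").text = site_url
    ET.SubElement(item, "description").text = message
    return ET.tostring(rss, encoding="unicode")


feed_xml = build_notice_feed(MESSAGE)
print(feed_xml)
```

Saving the returned string as metablog-stolen.xml in the document root would give a file the web server can hand to the scraper. Pointing the link elements back at the original site means anyone reading the scraped copy has somewhere to go.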

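The final step is to configure the Apache web server to serve this file whenever the scraper requests the feed. This is an internal rewrite rather than an HTTP redirect, so the scraper still sees a normal response at the URL it asked for. I added the following lines to my site configuration to do this: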

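RewriteEngine on
# Apply the rule only to requests from the scraper's address
RewriteCond %{REMOTE_ADDR} ^72\.232\.98\.98$
# Serve the alternate feed in place of the real one
RewriteRule ^/blog/metablog\.xml$ /metablog-stolen.xml [L]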
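
Before reloading Apache, it can help to check which requests a rule like this would catch. The sketch below models the condition and the rule with Python's re module. That is not Apache's regex engine, but for patterns this simple the two should agree. The patterns here escape the dot in the filename and anchor the end of the path. The function name resolve_path, the address 192.0.2.10 (a reserved documentation address), and the path /blog/index.html are placeholders for illustration.

```python
import re

SCRAPER_ADDR = re.compile(r"^72\.232\.98\.98$")   # models the RewriteCond pattern
FEED_PATH = re.compile(r"^/blog/metablog\.xml$")  # models the RewriteRule pattern


def resolve_path(remote_addr, path):
    """Return the path that would be served for this client and request.

    Hypothetical helper: a model of the rule, not how Apache evaluates it.
    """
    if SCRAPER_ADDR.search(remote_addr) and FEED_PATH.search(path):
        return "/metablog-stolen.xml"
    return path


# The scraper asking for the feed gets the alternate file.
print(resolve_path("72.232.98.98", "/blog/metablog.xml"))
# Any other client still gets the real feed.
print(resolve_path("192.0.2.10", "/blog/metablog.xml"))
# The scraper asking for some other page is left alone.
print(resolve_path("72.232.98.98", "/blog/index.html"))
```

Only the first call should come back as /metablog-stolen.xml. Everything else passes through unchanged, so regular readers of the feed never notice the rule.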

The next time the scraper pulled my RSS feed, the "stolen content" message appeared on their site. Ha ha, take that, scraper!

The fun, however, was short-lived. Within a day the scraper switched over to stealing content from a local TV station's web site.

Comments

re: Fighting Web Scraping with Apache Rewrite

This is a really brilliant solution. The scrapers are getting out of control and I think webmasters who actually produce unique content are kind of giving up the fight because they haven't found a tech solution. This is really clever!