Fighting Web Scraping with Apache Rewrite

Fortunately, a little bit of Apache rewrite-fu applied to the problem makes for a fun and exciting solution. Here is what I did.

First, look through your web server logs to find the IP address of the scraper. The stolen content was being published on a site called http://www.austinpm.com/blog/. The address of that site is:

$ host www.austinpm.com
www.austinpm.com is an alias for austinpm.com.
austinpm.com has address 72.232.98.98

So I searched the log for hosts starting with 72.232.98:

$ grep '^72\.232\.98\.' access_log
    .
    .
    .
72.232.98.98 - - [14/Nov/2006:20:00:05 -0600] "GET /blog/metablog.xml HTTP/1.0" 200 1084 "-" "-"
72.232.98.98 - - [14/Nov/2006:20:30:08 -0600] "GET /blog/metablog.xml HTTP/1.0" 200 1084 "-" "-"
72.232.98.98 - - [14/Nov/2006:21:00:04 -0600] "GET /blog/metablog.xml HTTP/1.0" 200 1084 "-" "-"
72.232.98.98 - - [14/Nov/2006:21:30:08 -0600] "GET /blog/metablog.xml HTTP/1.0" 200 1084 "-" "-"
    .
    .
    .

This shows that 72.232.98.98 was fetching my RSS feed every half hour. I had found my scraper. Knowing that the scraping came from a fixed, known host address, I was ready for the next step.
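
If the scraper's address is not known in advance, the same log can be summarized by client instead. Here is a minimal Python sketch of that idea, assuming the log uses the Apache common or combined format shown above. The function name count_clients and the 192.0.2.10 entry are made up for illustration; the other two sample lines come from the log excerpt.

```python
from collections import Counter

def count_clients(log_lines, path):
    """Count requests per client address for one URL path.

    Assumes Apache common/combined log format: the client address is the
    first whitespace-separated field, and the request line is the first
    double-quoted field.
    """
    counts = Counter()
    for line in log_lines:
        parts = line.split('"')
        if len(parts) < 2:
            continue  # no quoted request line, so skip the entry
        request = parts[1].split()  # e.g. ['GET', '/blog/metablog.xml', 'HTTP/1.0']
        if len(request) >= 2 and request[1] == path:
            counts[line.split()[0]] += 1
    return counts

sample = [
    '72.232.98.98 - - [14/Nov/2006:20:00:05 -0600] "GET /blog/metablog.xml HTTP/1.0" 200 1084 "-" "-"',
    '72.232.98.98 - - [14/Nov/2006:20:30:08 -0600] "GET /blog/metablog.xml HTTP/1.0" 200 1084 "-" "-"',
    # Hypothetical entry from an ordinary reader, not from the real log.
    '192.0.2.10 - - [14/Nov/2006:20:31:12 -0600] "GET /blog/ HTTP/1.1" 200 5120 "-" "-"',
]

# List the busiest clients for the feed, most frequent first.
for addr, hits in count_clients(sample, "/blog/metablog.xml").most_common(5):
    print(addr, hits)
```

A client that fetches the feed on a fixed schedule and sends no referrer or user agent, like the one above, is a good candidate for a closer look.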

Next, I created an alternate RSS file to serve to the scraper. The file, metablog-stolen.xml, contains an RSS feed with the message:

The web site you currently are reading is stealing its content from Austin Bloggers, the original community meta-blog for Austin. We'd like to invite you to visit the Austin Bloggers site directly, and see what people are blogging about Austin.

You can view this file directly at http://www.austinbloggers.org/metablog-stolen.xml.
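
For anyone who wants to build a similar file, here is a rough Python sketch that generates a one-item RSS 2.0 feed carrying a notice. This is not necessarily how the real metablog-stolen.xml is laid out. The function name build_notice_feed, the choice of elements, and the link to the site's front page are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

def build_notice_feed(title, link, message):
    """Return a minimal RSS 2.0 document with a single notice item."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # RSS 2.0 expects a title, link, and description on the channel.
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = message
    # Repeat the notice as the only item so it shows up as a post.
    item = ET.SubElement(channel, "item")
    ET.SubElement(item, "title").text = title
    ET.SubElement(item, "link").text = link
    ET.SubElement(item, "description").text = message
    return ET.tostring(rss, encoding="unicode")

notice = (
    "The web site you currently are reading is stealing its content from "
    "Austin Bloggers, the original community meta-blog for Austin."
)
# The link below is an assumed front-page URL, used here for illustration.
print(build_notice_feed("Austin Bloggers", "http://www.austinbloggers.org/", notice))
```

Building the document with an XML library rather than by hand means characters such as & and < in the message are escaped automatically.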

The final step was to configure the Apache web server to serve this file to the scraper in place of the real feed. I added the following lines to my site configuration:

RewriteEngine on
RewriteCond %{REMOTE_ADDR} ^72\.232\.98\.98$
RewriteRule ^/blog/metablog\.xml$ /metablog-stolen.xml [L]

The next time the scraper pulled my RSS, the "stolen content" message appeared. Ha ha, take that, scraper!
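
To check which requests a condition and rule like these would catch, the matching logic can be mirrored in a few lines of Python. This is only a sketch of the pattern matching, not of how Apache processes requests. The function name rewrite and the test address 192.0.2.10 are made up, and the path pattern here is written with the dot escaped and an end anchor.

```python
import re

# Mirrors: RewriteCond %{REMOTE_ADDR} ^72\.232\.98\.98$
SCRAPER_ADDR = re.compile(r"^72\.232\.98\.98$")
# Path pattern with the dot escaped and the end anchored.
FEED_PATH = re.compile(r"^/blog/metablog\.xml$")

def rewrite(remote_addr, path):
    """Return the path that would be served for this client and request."""
    if SCRAPER_ADDR.search(remote_addr) and FEED_PATH.search(path):
        return "/metablog-stolen.xml"
    return path  # everyone else gets what they asked for

# The scraper asking for the feed gets the notice file.
print(rewrite("72.232.98.98", "/blog/metablog.xml"))
# Any other client asking for the feed gets the real thing.
print(rewrite("192.0.2.10", "/blog/metablog.xml"))
# The scraper asking for a different URL is left alone.
print(rewrite("72.232.98.98", "/blog/"))

# An unescaped, unanchored pattern is looser than it looks: the dot
# matches any character and nothing stops extra text at the end.
loose = re.compile(r"^/blog/metablog.xml")
print(bool(loose.search("/blog/metablogXxml.bak")))
```

Because only one address and one URL are matched, ordinary readers and feed subscribers never see the notice.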

The fun, however, was short-lived. Within a day, the scraper switched over to stealing content from a local TV station web site.
