Cyveillance Dirty Tricks


Has your web site been visited by Cyveillance recently? It's quite possible, but you probably wouldn't know it. Cyveillance crawls the net spying on web sites. If you say something they don't like about one of their clients, they'll tattle on you.

Cyveillance uses a couple of dirty tricks when they crawl the web. First, they ignore the robots exclusion protocol. This standard lets you declare, in a robots.txt file, which portions of a web site are off limits to robots and other automatic agents. Cyveillance fails to honor the exclusions you may have declared for your web site. Their 'bot crawls places that 'bots are not supposed to go, in spite of your explicit instructions.
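To make the exclusion check concrete, here is a minimal Python sketch of what a well-behaved 'bot is expected to do before fetching a page. The robots.txt content, the /trap/ path, and the example.com URLs are assumptions made up for illustration; they are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the /trap/ path is an assumed example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /trap/
"""

def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

if __name__ == "__main__":
    print(allowed("Googlebot", "http://example.com/index.html"))
    print(allowed("Googlebot", "http://example.com/trap/page1"))
```

A polite crawler runs a check like this and skips anything that comes back disallowed. Nothing enforces it, which is why a 'bot that chooses to ignore the file can walk right in.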

This can be a problem for web sites that present deep, dynamic content. For example, I have a spam robot trap on my web site. When a 'bot crawling for email addresses to spam hits that page, the trap is sprung. If the 'bot moves beyond that page, it ends up in a never-ending maze of bogus, generated email addresses. The trap keeps the 'bot tied up, and it fills its database with bogus data.
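A maze like this can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual trap code: the function name, the /trap/ URL prefix, and the page layout are all assumptions. Each page is derived from its own path, so the maze never ends and nothing needs to be stored.

```python
import hashlib

def trap_page(path: str, n_addresses: int = 5, n_links: int = 3) -> str:
    """Return HTML for one maze page, derived deterministically from its path."""
    lines = ["<html><body>"]
    for i in range(n_addresses):
        digest = hashlib.sha256(f"{path}|addr|{i}".encode()).hexdigest()
        # The .invalid top-level domain is reserved, so these bogus
        # addresses can never belong to a real mailbox.
        address = f"{digest[:8]}@{digest[8:14]}.example.invalid"
        lines.append(f'<a href="mailto:{address}">{address}</a><br>')
    for i in range(n_links):
        digest = hashlib.sha256(f"{path}|link|{i}".encode()).hexdigest()
        # Every link leads to another generated page under the assumed /trap/ prefix.
        lines.append(f'<a href="/trap/{digest[:12]}">more</a><br>')
    lines.append("</body></html>")
    return "\n".join(lines)

if __name__ == "__main__":
    print(trap_page("/trap/start"))
```

Because the output depends only on the path, a 'bot that revisits a page sees the same bogus addresses, which makes the garbage look a little more like real content.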

I don't want to trap well-behaved 'bots, such as those used by Google to spider web pages. Therefore, I post an exclusion for this area. This protects the well-behaved 'bot from garbage data, and it protects my web site from unnecessary load.

Cyveillance ignores these instructions. Their 'bot gets caught in the trap, crawling places I'm specifically trying to keep 'bots away from.

Another problem with the way Cyveillance crawls is that they provide fraudulent header information in the HTTP request. Rather than admitting they are a spy 'bot, they pretend they are a web surfer running Microsoft Internet Explorer. When they submit a request to a web site, they declare:

User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)

For comparison, when Google crawls a web site, they declare:

User-Agent: Googlebot/2.1 (+http://www.googlebot.com/bot.html)
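Since the header alone can't be trusted, one rough heuristic is to combine it with where the request went: a client that claims to be a browser yet shows up in an area excluded for 'bots deserves a second look. The Python sketch below assumes a hypothetical /trap/ prefix for the excluded area and a simple substring test for self-identified 'bots; both are illustrative assumptions, and a human who follows a stray link would trip it too.

```python
def looks_like_disguised_bot(user_agent: str, path: str,
                             excluded_prefixes=("/trap/",)) -> bool:
    """Flag requests that claim to be a browser but land in an excluded area."""
    # Browsers identify themselves with a "Mozilla/" prefix; honest 'bots
    # usually put "bot" somewhere in the string. Both tests are heuristics.
    claims_browser = (user_agent.startswith("Mozilla/")
                      and "bot" not in user_agent.lower())
    in_excluded_area = any(path.startswith(p) for p in excluded_prefixes)
    return claims_browser and in_excluded_area

if __name__ == "__main__":
    msie = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
    google = "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
    print(looks_like_disguised_bot(msie, "/trap/abc"))
    print(looks_like_disguised_bot(google, "/trap/abc"))
```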

You could try to keep Cyveillance out of your web site by blocking their network. The problem is that if enough people do this, they may try to hide their origin to get around the blocks. That would be a pretty sleazy thing to do, but no more sleazy than what they do already.

Comments


re: Cyveillance Dirty Tricks

While doing a bit more research on Cyveillance, I found this interesting discussion that hits on many of the points in my posting.

re: Cyveillance Dirty Tricks

Would the spambait page be a good place to set a tarbaby?

pro: slow access for harvest-bots and spy-bots alike

con: maybe you want the harvest-bots to have speedy access?

some random thoughts from a random thinker...
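Assuming "tarbaby" here means a tarpit-style delay, the idea could look roughly like the Python sketch below: the page is handed out in small pieces with a pause before each one. The function name, chunk size, and delay are hypothetical values chosen for illustration, and the sleep function is passed in so the behavior can be checked without actually waiting.

```python
import time

def tarpit(body: bytes, chunk_size: int = 16,
           delay_seconds: float = 2.0, sleep=time.sleep):
    """Yield body in small chunks, pausing before each one."""
    for start in range(0, len(body), chunk_size):
        sleep(delay_seconds)  # stall the client before sending the next piece
        yield body[start:start + chunk_size]

if __name__ == "__main__":
    for chunk in tarpit(b"bogus page content for a slow 'bot", delay_seconds=0.1):
        print(chunk)
```

The con above still applies: every stalled connection also ties up a connection on the server, so the delay costs both sides.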

re: Cyveillance Dirty Tricks

Cyveillance also has the 65.118.41.192/27 CIDR block.
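For anyone who wants to act on that block, a membership test is short in Python. This is a sketch only: the 65.118.41.192/27 range comes from the comment above and has not been verified here, and the BLOCKED_NETWORKS list and is_blocked name are made up for illustration.

```python
import ipaddress

# The range reported in the comment above; add further networks as needed.
BLOCKED_NETWORKS = [ipaddress.ip_network("65.118.41.192/27")]

def is_blocked(remote_addr: str) -> bool:
    """Return True if remote_addr falls inside any blocked network."""
    addr = ipaddress.ip_address(remote_addr)
    return any(addr in net for net in BLOCKED_NETWORKS)

if __name__ == "__main__":
    print(is_blocked("65.118.41.200"))
    print(is_blocked("65.118.41.224"))
```

A /27 covers 32 addresses, here 65.118.41.192 through 65.118.41.223, so the block is narrow and easy to sidestep if the crawler moves to another network.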