Popular Spam Protection Technique Doesn't Work

The idea behind entity encoding is that you replace characters of the email address with codes that a browser will understand but a spammer's address extractor may not recognize. At one time, address extractors were pretty naive tools and could be fooled this way. I ran some tests to see if this still was true, and, unfortunately, it no longer is.

The term entity encoding is actually a misnomer. It is typically used to mean one of two things. In one case, it means replacing characters with their equivalent numeric character references. For instance, a web browser will render the numeric character reference &#64; as a @ character. These encodings can be used anywhere in an HTML document, including a mailto: link. A naive address extractor may fail to decode the address correctly, thus keeping it away from the spammer.
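To see how little work the decoding takes, here is a minimal sketch in Python (my choice of language for illustration, not something used in the tests). It relies on the standard library's html.unescape, and the address is a made-up example.

```python
import html

# A made-up address with the @ written as a decimal numeric character reference.
encoded = "user&#64;example.com"

# html.unescape resolves character references much as a browser's HTML parser does.
decoded = html.unescape(encoded)
print(decoded)  # user@example.com
```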

Sometimes people say entity encoding when they mean URI escaped encoding, as described in RFC 2396 section 2.4.1. For instance, the sequence %40 is equivalent to a @ in a URL. These encodings can't be used just anywhere in an HTML document; they work only in links, including mailto: links. Again, a naive address extractor may fail to recognize them.
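URI escapes are just as easy to undo. The following sketch assumes Python and its standard urllib.parse.unquote function; the address is again a made-up example.

```python
from urllib.parse import unquote

# A made-up mailto: URL with the @ written as the URI escape %40.
escaped = "mailto:user%40example.com"

# unquote turns each %XX sequence back into the character it stands for.
unescaped = unquote(escaped)
print(unescaped)  # mailto:user@example.com
```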

I happened to use the @ character in these examples, but any part of the mailto: URL could be encoded in this fashion. All you need to do is convert the character to the appropriate numeric code and rewrite it in encoded form. Just be sure you use the decimal value when composing a numeric character reference and the hexadecimal value when composing URI escaped encoding. Or, better yet, there are online tools that will do this for you.
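The conversion described above can be sketched in a few lines of Python. The function names to_numeric_refs and to_uri_escapes are hypothetical helpers invented for this illustration, and the sketch encodes every character, which is more than you would normally need.

```python
def to_numeric_refs(text: str) -> str:
    """Write each character as a decimal numeric character reference."""
    return "".join(f"&#{ord(ch)};" for ch in text)


def to_uri_escapes(text: str) -> str:
    """Write each byte of the UTF-8 form as a % followed by two hex digits."""
    return "".join(f"%{byte:02X}" for byte in text.encode("utf-8"))


print(to_numeric_refs("@"))  # &#64;
print(to_uri_escapes("@"))   # %40
```

The @ character has the decimal value 64 and the hexadecimal value 40, which is why the two encodings look different for the same character.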

Entity encoding is a simple technique. It is documented in many places as a possible method to protect your email address. Unfortunately, although it may have worked at one time, it's not useful with current address extraction tools.

To test address extraction capability, I downloaded a product called Web Data Extractor v4.0. I set up a target web page with a number of email addresses, employing a variety of entity encoding techniques. The results are shown in the table below.

Trial  Document source                                What extractor saw
1      user01@example.com                             user01@example.com
2      user02&#64;example.com                         user02@example.com
3      <a href="mailto:user03&#64;example.com">       user03@example.com
4      <a href="&#109;ailto:user04&#64;example.com">  user04@example.com
5      <a href="mailto:user05%40example.com">         user05@example.com
6      <a href="%6Dailto:user06%40example.com">       (not seen)
7      <a href="%6dailto:user07%40example.com">       (not seen)

In trial one, the address was in the clear. The remaining trials used some form of entity encoding. Only in trials 6 and 7, where the "m" in mailto: is URI escape encoded, did the extractor fail. Still, I can't recommend the method: the extractor's author could easily close this gap in the next release.
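To illustrate how small that remedy could be, here is a hedged sketch of a harvester in Python. It is not how Web Data Extractor works (I have no knowledge of its internals); the harvest function and the deliberately simple ADDRESS pattern are assumptions made for this example. It decodes both kinds of encoding across the whole page before searching, so it does not matter which part of the link was encoded.

```python
import html
import re
from urllib.parse import unquote

# A deliberately simple pattern for address-shaped strings (an assumption for
# this sketch, not a complete email address grammar).
ADDRESS = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def harvest(page: str) -> list:
    """Return address-shaped strings found after undoing both encodings."""
    # Resolve character references first, then URI escapes, then search.
    decoded = unquote(html.unescape(page))
    return ADDRESS.findall(decoded)


# The link markup from trials 4 and 6, with made-up link text.
page = """
<a href="&#109;ailto:user04&#64;example.com">four</a>
<a href="%6Dailto:user06%40example.com">six</a>
"""
print(harvest(page))  # ['user04@example.com', 'user06@example.com']
```

Because the decoding is applied to the entire page rather than only to recognized mailto: links, the encoded "m" that defeated the extractor in trials 6 and 7 offers no protection against a tool built this way.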

Entity encoding is attractive because it's simple, portable and does not hamper functionality. At one time it may have worked, but it no longer is a useful method to hide your email address from web page harvesting.

Dec 5 update: I've posted an article that responds to some of the ideas proposed in the comments below.
