W3C home > Mailing lists > Public > html-tidy@w3.org > January to March 2002

RE: Hide the @ from email-harvesting robots

From: Richard A. O'Keefe <ok@cs.otago.ac.nz>
Date: Tue, 5 Mar 2002 10:53:28 +1300 (NZDT)
Message-Id: <200203042153.KAA127299@atlas.otago.ac.nz>
To: html-tidy@w3.org, rick.parsons@eds.com
"Fred" said
	Why not let Tidy allow me to code the "at" sign as &#64;...
and Rick Parsons replied
	I would support a request for feature to that effect.

But (a) we don't need it, and (b) there isn't the slightest reason to
suspect that it would accomplish its intended goal of defeating spammers.

Let's take (b) first.  The aim is to write
    name&#54;host.subdomain.domain
and thereby defeat spammers looking for name@host...

But all a spammer has to do is pick up any of the free programs out there
that handle HTML, such as tidy, and have it spit out attribute values and
plain text, and search the result.  I'm aware of HTML parsers in C, Erlang,
Mercury, Smalltalk, and a few other languages, carefully written to hack
the kind of garbage that's out there.  (One of them uses no fewer than 16
stacks, because lots of pages don't nest their tags properly.)  Indeed, if
the spammer just cleans the page up using Tidy (even if it DOES use this
kluge), then any of the free SGML parsers around could be used.  It'd take
me about 30 minutes to whip up a program to work around this kind of kluge
AND test it enough.

(a) We DO NOT NEED ANY CHANGE TO Tidy to accomplish this hack.
Assuming that the output contains no CDATA sections, then a trivial
    sed -e 's/@/\&#64;/g'
will do the job, because an @ sign can occur in
 - plain text			(change wanted)
 - an attribute value		(change wanted)
 - a comment			(change wanted or harmless)
 - a processing instruction	(potentially harmful)
 - a CDATA section		(definitely harmful)
and processing instructions are extremely rare in HTML, and tend not to
contain @ signs when they do occur.  Using AWK or Perl instead of sed,
we could even cope with processing instructions and CDATA sections.

We don't need any feeping creaturism.  Tidy is complicated enough;
things that can be done _outside_ Tidy using trivial scripts should
be done outside Tidy.

But should everyone be left to invent their own wheels?
No, there should be a scripts/contrib/ directory with examples
of problems that can be solved this way and how they can be solved.
Received on Monday, 4 March 2002 16:53:32 GMT

This archive was generated by hypermail 2.2.0+W3C-0.50 : Tuesday, 3 April 2012 06:13:51 GMT