Re: Bug 7034

Sam Ruby wrote:
> I happen to believe that [...some other thing...]
> would be far more useful to authors of 
> content intended to be served as text/html than flagging the use of 
> unescaped ampersands in URIs is.

I think one of the criteria for determining conformance rules is that it 
should be possible to give an exact definition of how to write 
conforming HTML documents, and the definition should be possible to 
understand and follow (e.g. it shouldn't be necessary to 
reverse-engineer the parsing algorithm).

Another criterion is that markup which very likely indicates an authoring 
mistake and will result in unexpected behaviour should be flagged as an 
error by syntax-checking tools, in order to help authors write markup 
that works (and therefore it should be a document conformance error, 
since that's the mechanism the spec uses to specify the behaviour of 
syntax-checking tools).

Because of the second criterion, markup like
   <a href="create-file.php?name=a.txt&copy=b.txt">
should be a conformance error (the author probably didn't intend 
"name=a.txt©=b.txt", and if they really did then they could have used 
"&copy;" instead).

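To illustrate the "&copy" problem, here is a minimal sketch using Python's 
html.unescape. It is not a real HTML parser: it applies the rules for 
ordinary text rather than any attribute-specific rules, so treat it only 
as an approximation of what a browser does with the attribute value.

```python
import html

# The href from the example, with an unescaped ampersand.
href = "create-file.php?name=a.txt&copy=b.txt"

# "&copy" is a named reference that is recognised without a trailing ';',
# so it is decoded even though the author meant a query parameter.
decoded = html.unescape(href)
print(decoded)

# Escaping the ampersand as "&amp;" preserves the intended URL.
escaped = "create-file.php?name=a.txt&amp;copy=b.txt"
print(html.unescape(escaped))
```

The first value comes out with a copyright sign in place of "&copy", and 
the second round-trips to the URL the author actually wanted.
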
Currently the spec says (http://whatwg.org/html#syntax-attribute-value):

"Attribute values are a mixture of text and character references, except 
with the additional restriction that the text cannot contain an 
ambiguous ampersand."

"An ambiguous ampersand is a U+0026 AMPERSAND character (&) that is 
followed by some text other than a space character, a U+003C LESS-THAN 
SIGN character (<), or another U+0026 AMPERSAND character (&)."
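
As a rough sketch of how a syntax checker might apply that definition 
(the function name and the choice to treat only ';'-terminated references 
as character references are my own assumptions, and the table is Python's 
copy of the named references, not necessarily the one in the spec today):

```python
import re
from html.entities import html5

# HTML "space characters": space, tab, line feed, form feed, carriage return.
SPACE_CHARS = " \t\n\f\r"

# A ';'-terminated numeric or named character reference.
CHAR_REF = re.compile(r"&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")


def ambiguous_ampersands(value):
    """Return the indexes of ampersands the quoted definition would flag."""
    flagged = []
    i = 0
    while i < len(value):
        if value[i] != "&":
            i += 1
            continue
        m = CHAR_REF.match(value, i)
        if m and (m.group(1).startswith("#") or m.group(1) + ";" in html5):
            i = m.end()  # a well-formed character reference, not plain text
            continue
        following = value[i + 1 : i + 2]
        # Flag the ampersand if it is followed by anything other than a
        # space character, '<', or another '&'.
        if following and following not in SPACE_CHARS + "<&":
            flagged.append(i)
        i += 1
    return flagged


print(ambiguous_ampersands("dfclick?db=sina&bid=8"))
print(ambiguous_ampersands("name=a.txt&amp;copy=b.txt"))
```

The point is that the rule fits in a few lines and needs no knowledge of 
which names happen to be in the table, apart from recognising real 
character references.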

If you do want to allow "...dfclick?db=sina&bid=8...", but don't want to 
allow "...&copy=...", then this description would need to be changed to 
something like:

"An ambiguous ampersand is a U+0026 AMPERSAND character (&) that is 
followed by one of the thousands of named character references in this 
table, unless it is one of the hundreds that don't end with ';' and it 
is subsequently followed by an alphanumeric character (unless it is 
"not" and it is followed by "in;")."

which is much harder for authors to follow because they'll have to 
remember the list of thousands of strings to avoid.
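
To give a feel for what authors would have to check, here is a small 
sketch using the table shipped in Python's html.entities module. That 
table comes from a later revision of the spec, so the exact contents may 
differ, and params_to_avoid is a hypothetical helper that only handles 
the simplest case, where a query parameter name exactly matches a 
reference that is recognised without ';'.

```python
from html.entities import html5

# Names that are recognised without a trailing ';' (e.g. "copy", "not").
LEGACY_NAMES = {name for name in html5 if not name.endswith(";")}


def params_to_avoid(param_names):
    """Return the parameter names that would collide when written as &name=."""
    return [name for name in param_names if name in LEGACY_NAMES]


print(len(html5), "named references in Python's table")
print(len(LEGACY_NAMES), "of them are recognised without ';'")
print(params_to_avoid(["db", "bid", "copy", "name"]))
```

Even this simplified check depends on having the whole table to hand, 
which is exactly what authors writing markup by hand won't have.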

-- 
Philip Taylor
pjt47@cam.ac.uk

Received on Monday, 22 March 2010 18:37:39 UTC