Re: Bug 7034 from Maciej Stachowiak on 2010-03-22 (public-html@w3.org from March 2010)

From: Maciej Stachowiak <mjs@apple.com>
Date: Mon, 22 Mar 2010 15:39:57 -0700
To: Philip Taylor <pjt47@cam.ac.uk>
Cc: Sam Ruby <rubys@intertwingly.net>, HTMLwg WG <public-html@w3.org>
Message-id: <13E705A2-57E3-45DD-8750-03C10E6400FF@apple.com>

On Mar 22, 2010, at 11:37 AM, Philip Taylor wrote:

> Sam Ruby wrote:
>> I happen to believe that [...some other thing...]
>> would be far more useful to authors of content intended to be  
>> served as text/html than flagging the use of unescaped ampersands  
>> in URIs is.
>
> I think one of the criteria for determining conformance rules is  
> that it should be possible to give an exact definition of how to  
> write conforming HTML documents, and the definition should be  
> possible to understand and follow (e.g. it shouldn't be necessary to  
> reverse-engineer the parsing algorithm).
>
> Another criterion is that markup which very likely indicates an  
> authoring mistake and will result in unexpected behaviour, should be  
> flagged as an error by syntax-checking tools in order to help  
> authors write markup that works (and therefore it should be a  
> document conformance error since that's the mechanism the spec uses  
> to specify the behaviour of syntax-checking tools).
>
> Because of the second one, markup like
>  <a href="create-file.php?name=a.txt&copy=b.txt">
> should be a conformance error (the author probably didn't intend  
> "name=a.txt©=b.txt", and if they really did then they could have  
> used "&copy;" instead).
>
> Currently the spec says
> (http://whatwg.org/html#syntax-attribute-value):
>
> "Attribute values are a mixture of text and character references,  
> except with the additional restriction that the text cannot contain  
> an ambiguous ampersand."
>
> "An ambiguous ampersand is a U+0026 AMPERSAND character (&) that is  
> followed by some text other than a space character, a U+003C  
> LESS-THAN SIGN character (<), or another U+0026 AMPERSAND character (&)."
>
> If you do want to allow "...dfclick?db=sina&bid=8...", but don't  
> want to allow "...&copy=...", then this description would need to be  
> changed to something like:
>
> "An ambiguous ampersand is a U+0026 AMPERSAND character (&) that is  
> followed by one of the thousands of named character references in  
> this table, unless it is one of the hundreds that don't end with ';'  
> and it is subsequently followed by an alphanumeric character (unless  
> it is "not" and it is followed by "in;")."
>
> which is much harder for authors to follow because they'll have to  
> remember the list of thousands of strings to avoid.

I think this is a reasonable argument, but it appears that many  
content authors use unescaped & while successfully avoiding this hazard.
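
To make the rule quoted above concrete, here is a minimal Python sketch of a
checker for it. The function name ambiguous_ampersands is hypothetical, the
set of space characters is an assumption, and Python's standard
html.entities.html5 table stands in for the spec's list of named character
references. Only references that end in ";" are treated as character
references here, which is one reading of the quoted text, not a definitive one.

```python
import re
from html.entities import html5  # keys look like 'copy;' (and legacy 'copy')

SPACE_CHARS = " \t\n\f\r"  # assumed set of "space characters"

# A numeric reference, or an alphanumeric name terminated by ';'.
_REF = re.compile(r"&(#[0-9]+;|#[xX][0-9a-fA-F]+;|[A-Za-z][A-Za-z0-9]*;)")


def ambiguous_ampersands(value):
    """Return the offsets of '&' characters flagged under the quoted rule."""
    hits = []
    i = 0
    while i < len(value):
        if value[i] != "&":
            i += 1
            continue
        m = _REF.match(value, i)
        if m and (m.group(1)[0] == "#" or m.group(1) in html5):
            i = m.end()  # a complete character reference, not text
            continue
        following = value[i + 1:i + 2]
        # Flag '&' followed by anything other than a space, '<' or '&'.
        if following and following not in SPACE_CHARS + "<&":
            hits.append(i)
        i += 1
    return hits


print(ambiguous_ampersands("dfclick?db=sina&bid=8"))                  # flagged
print(ambiguous_ampersands("create-file.php?name=a.txt&copy=b.txt"))  # flagged
print(ambiguous_ampersands("create-file.php?name=a.txt&amp;copy=b.txt"))  # clean
```

Under this reading both unescaped URLs are flagged, including the harmless
"&bid=8" one, which is the cost of keeping the rule short enough to remember.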

Another possibility is to change parsing such that "...&copy=..." is  
not a hazard. I believe you had some evidence that this would fix more  
sites than it would break. It seems like it would also have the  
benefit of allowing us to make authoring rules more lenient in a  
beneficial way, without at the same time introducing undue complexity.
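
As a sketch of what such a parsing change could look like (an assumption on
my part, not a worked-out proposal): treat "=" after a semicolon-less named
reference in an attribute value the same way an alphanumeric character is
treated, and leave the text alone. The function decode_attr and its lenient
flag are hypothetical names, numeric references are ignored, and Python's
html.entities.html5 table stands in for the spec's table.

```python
import string
from html.entities import html5  # 'copy;' and legacy 'copy' are both keys

ALNUM = string.ascii_letters + string.digits


def _longest_name(value, start):
    """Longest named reference beginning at value[start], or None."""
    end = start
    while end < len(value) and value[end] in ALNUM:
        end += 1
    if end < len(value) and value[end] == ";" and value[start:end + 1] in html5:
        return value[start:end + 1]
    for stop in range(end, start, -1):  # fall back to semicolon-less names
        if value[start:stop] in html5:
            return value[start:stop]
    return None


def decode_attr(value, lenient=False):
    """Decode named references in an attribute value (simplified model)."""
    out, i = [], 0
    while i < len(value):
        name = _longest_name(value, i + 1) if value[i] == "&" else None
        if name is None:
            out.append(value[i])
            i += 1
            continue
        after = value[i + 1 + len(name):i + 2 + len(name)]
        # Semicolon-less name followed by an alphanumeric stays literal;
        # with lenient=True, so does one followed by '='.
        literal = not name.endswith(";") and (
            (after != "" and after in ALNUM) or (lenient and after == "=")
        )
        out.append("&" + name if literal else html5[name])
        i += 1 + len(name)
    return "".join(out)


url = "create-file.php?name=a.txt&copy=b.txt"
print(decode_attr(url))                # "&copy" becomes the copyright sign
print(decode_attr(url, lenient=True))  # URL is left as the author wrote it
```

With lenient=True an explicit "&copy;" is still decoded, so authors who do
want the character keep a way to ask for it.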

Regards,
Maciej

Received on Monday, 22 March 2010 22:40:33 UTC