Re: Change proposal for issue 103, was: ISSUE-103 change proposal from Philip Taylor on 2010-03-24 (public-html@w3.org from March 2010)

From: Philip Taylor <pjt47@cam.ac.uk>
Date: Wed, 24 Mar 2010 09:29:41 +0000
To: Maciej Stachowiak <mjs@apple.com>
CC: public-html@w3.org
Message-ID: <4BA9DB85.40402@cam.ac.uk>

Maciej Stachowiak wrote:
> 
> On Mar 22, 2010, at 5:14 PM, Ian Hickson wrote:
> 
>> On Thu, 18 Mar 2010, Philip Taylor wrote:
>>> Anne van Kesteren wrote:
>>>> On Thu, 18 Mar 2010 11:26:48 +0100, Julian Reschke 
>>>> <julian.reschke@gmx.de>
>>>> wrote:
>>>>> Replace the last sentence by:
>>>>>
>>>>> "Note: Due to restrictions of the XML syntax, in XML the U+003C 
>>>>> LESS-THAN
>>>>> SIGN (<) needs be escaped as well."
>>>>
>>>> That seems incomplete. The sequence ]]> comes to mind.
>>>
>>> That's not an issue in attribute values, as far as I'm aware.
>>>
>>> But in attribute values, U+000D and U+000A and U+0009 must be escaped 
>>> too.
>>> (Depending on DTD you might also need to escape any leading or 
>>> trailing U+0020
>>> and at least one of any adjacent pair of U+0020s, I think.)
>>
>> This discussion is exactly the reason why including this in the spec is a
>> bad idea.
> 
> Julian & Philip, how confident are you that the full set of characters 
> that need escaping is U+003C, U+000D, U+000A, U+0009 and U+0020? Does & 
> need to be escaped?

Those characters need escaping "as well" as the ones already mentioned in 
the previous paragraph of the spec (quotes and &s), so yes, & needs 
escaping too.

I can't think of any other characters that have particularly special 
behaviour, but what is the purpose of this note? If it is aimed at 
people writing software that emits XML syntax fragments given an 
arbitrary string of Unicode codepoints, attempting to tell them 
everything they need to know in order to serialise safely (i.e. without 
allowing the content to break their entire page), then it would probably 
also have to say that U+FFFE, U+FFFF and the characters in 
U+0000..U+001F other than U+0009, U+000A and U+000D aren't allowed, and 
that the fragments must be encoded in the same character encoding as the 
rest of the document, etc.

It seems silly to duplicate the XML spec in that much detail here - if 
someone's correctly implementing XML then they should already have an 
XML serialiser that deals with all these issues, and repeating the 
information here will be a source of bugs and a waste of time.
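
For illustration only, here is a rough Python sketch of what such a 
serialiser has to do for a double-quoted attribute value. The function 
name escape_attr and the exact set of rejected characters are my own 
assumptions, based on my reading of the Char production in XML 1.0; it 
is not text proposed for the spec. It does not handle the U+0020 case, 
which only arises for non-CDATA attribute types.

```python
import re

# Characters XML 1.0 does not allow anywhere in a document (assumed set):
# C0 controls other than TAB, LF and CR, surrogates, U+FFFE and U+FFFF.
_FORBIDDEN = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ud800-\udfff\ufffe\uffff]")

# Characters that must be written as references inside a double-quoted
# attribute value so that they survive parsing unchanged.
_ESCAPES = {
    "&": "&amp;",
    "<": "&lt;",
    '"': "&quot;",
    "\t": "&#9;",
    "\n": "&#10;",
    "\r": "&#13;",
}


def escape_attr(value: str) -> str:
    """Escape a string for use inside a double-quoted XML attribute value."""
    if _FORBIDDEN.search(value):
        # These code points cannot be represented at all, even as references.
        raise ValueError("value contains a character not allowed in XML 1.0")
    return re.sub(r'[&<"\t\n\r]', lambda m: _ESCAPES[m.group()], value)


# Example: markup destined for <iframe srcdoc="...">.
print('<iframe srcdoc="%s"/>' % escape_attr('<p title="x">a & b</p>\n'))
```

The whitespace characters are written as numeric references because a 
literal tab, LF or CR in an attribute value is turned into a space when 
the parser normalises the value.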

If it's aimed at people writing XHTML by hand, telling them about common 
things to be careful of, it probably doesn't need to bother mentioning 
U+0020 because (as far as I can see) that only matters in obscure cases 
when the DTD has set srcdoc to be non-CDATA. But <iframe srcdoc> is not 
a useful feature when writing markup by hand - the use cases were things 
like sandboxing untrusted user comments, and the whole point is that 
people will write software to serialise these values, so it's not useful 
to give advice intended for hand-authoring.

-- 
Philip Taylor
pjt47@cam.ac.uk

Received on Wednesday, 24 March 2010 09:30:12 UTC