Re: Change proposal for issue 103, was: ISSUE-103 change proposal from Maciej Stachowiak on 2010-03-24 (public-html@w3.org from March 2010)

From: Maciej Stachowiak <mjs@apple.com>
Date: Wed, 24 Mar 2010 02:54:03 -0700
To: Philip Taylor <pjt47@cam.ac.uk>
Cc: public-html@w3.org
Message-id: <0021B553-49F9-4654-9C31-CA464E2F3EA4@apple.com>

On Mar 24, 2010, at 2:29 AM, Philip Taylor wrote:

> Maciej Stachowiak wrote:
>> On Mar 22, 2010, at 5:14 PM, Ian Hickson wrote:
>>> On Thu, 18 Mar 2010, Philip Taylor wrote:
>>>> Anne van Kesteren wrote:
>>>>> On Thu, 18 Mar 2010 11:26:48 +0100, Julian Reschke <julian.reschke@gmx.de>
>>>>> wrote:
>>>>>> Replace the last sentence by:
>>>>>>
>>>>>> "Note: Due to restrictions of the XML syntax, in XML the U+003C
>>>>>> LESS-THAN SIGN (<) needs to be escaped as well."
>>>>>
>>>>> That seems incomplete. The sequence ]]> comes to mind.
>>>>
>>>> That's not an issue in attribute values, as far as I'm aware.
>>>>
>>>> But in attribute values, U+000D and U+000A and U+0009 must be escaped too.
>>>> (Depending on DTD you might also need to escape any leading or trailing
>>>> U+0020 and at least one of any adjacent pair of U+0020s, I think.)
>>>
>>> This discussion is exactly the reason why including this in the spec is a
>>> bad idea.
>> Julian & Philip, how confident are you that the full set of  
>> characters that need escaping is U+003C, U+000D, U+000A, U+0009 and  
>> U+0020? Does & need to be escaped?
>
> It needs these characters "as well" as the ones already mentioned in  
> the previous paragraph in the spec (quotes and &s).
>
> I can't think of any other characters that have particularly special  
> behaviour, but what is the purpose of this note? If it is aimed at  
> people writing software that emits XML syntax fragments given an  
> arbitrary string of Unicode codepoints, attempting to tell them  
> everything they need to know in order to serialise safely (i.e.  
> without allowing the content to break their entire page), then it  
> would probably also have to say that U+FFFE and U+FFFF and other  
> characters in U+0000..U+001F aren't allowed, and that they must be  
> encoded in the same character encoding as the rest of the document,  
> etc.
>
> It seems silly to duplicate the XML spec in that much detail here -  
> if someone's correctly implementing XML then they should already  
> have an XML serialiser that deals with all these issues, and  
> repeating the information here will be a source of bugs and a waste  
> of time.
>
> If it's aimed at people writing XHTML by hand, telling them about  
> common things to be careful of, it probably doesn't need to bother  
> mentioning U+0020 because (as far as I can see) that only matters in  
> obscure cases when the DTD has set srcdoc to be non-CDATA. But  
> <iframe srcdoc> is not a useful feature when writing markup by hand  
> - the use cases were things like sandboxing untrusted user comments,  
> and the whole point is that people will write software to serialise  
> these values, so it's not useful to give advice intended for
> hand-authoring.

What spec change (if any) would you recommend on this issue? (I'm not  
sure from the above if you are arguing for a detailed note, a shorter  
but partially incomplete note, no note, or something else.)
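
For reference, here is a minimal sketch of what the "detailed" reading of
the note would ask of a serialiser. It is an illustration only, not
proposed spec text. It assumes a double-quoted attribute of type CDATA
(so U+0020 needs no escaping) and XML 1.0 character rules; the function
name escape_xml_attr is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Anything outside the XML 1.0 "Char" production cannot appear in a
# document at all, not even as a numeric character reference.
_INVALID_XML_CHAR = re.compile(
    r"[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

# Characters discussed in this thread: quotes and ampersands, U+003C,
# plus U+0009, U+000A and U+000D, which attribute-value normalisation
# would otherwise turn into spaces.
_ESCAPES = {
    "&": "&amp;",
    '"': "&quot;",
    "<": "&lt;",
    "\t": "&#9;",
    "\n": "&#10;",
    "\r": "&#13;",
}


def escape_xml_attr(value):
    """Escape value for use inside a double-quoted XML attribute."""
    bad = _INVALID_XML_CHAR.search(value)
    if bad:
        raise ValueError("U+%04X is not allowed in XML 1.0" % ord(bad.group()))
    return "".join(_ESCAPES.get(ch, ch) for ch in value)


# Round trip through an XML parser to check that nothing is lost.
original = '<p>Tom & "Jerry"</p>\n\tsecond line'
markup = '<iframe srcdoc="%s"/>' % escape_xml_attr(original)
assert ET.fromstring(markup).get("srcdoc") == original
print(markup)
```

The sketch deliberately ignores the document's character encoding and any
DTD-declared attribute type, which are the other concerns raised above.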

Regards,
Maciej

Received on Wednesday, 24 March 2010 09:54:36 UTC