Re: the document character set for text/html serialization from Robert Burns on 2007-09-09 (public-html@w3.org from September 2007)

From: Robert Burns <rob@robburns.com>
Date: Sun, 9 Sep 2007 15:40:13 -0500
To: Anne van Kesteren <annevk@opera.com>
Cc: "HTML WG" <public-html@w3.org>
Message-Id: <4C26CCEE-474F-48A5-9F4A-78F379A789EA@robburns.com>

Hi Anne,

On Sep 9, 2007, at 1:26 PM, Anne van Kesteren wrote:

>
> On Sun, 09 Sep 2007 18:20:03 +0200, Julian Reschke  
> <julian.reschke@gmx.de> wrote:
>> Anne van Kesteren wrote:
>>> On Sun, 09 Sep 2007 16:11:34 +0200, Julian Reschke  
>>> <julian.reschke@gmx.de> wrote:
>>>> We really should answer the question we asked before: why would  
>>>> it be conforming to include those characters in the first place?
>>> I can see a good reason to prohibit U+0000 (and that's done),
>>> but what is the reason for making these other characters
>>> non-conforming? They are not posing any interoperability problem and
>>> are also supported by the DOM. I'm not sure why we should limit
>>> the HTML serialization here.
>>
>> So what's the semantics of these characters when they occur inside  
>> HTML? What is a recipient supposed to do with them, for instance,  
>> when they appear inside <p> or a <pre> element?
>
> They should do the same as whenever someone inserts them through
> the DOM. Seems that browsers display some type of placeholder
> character:
> http://software.hixie.ch/utilities/js/live-dom-viewer/?%3C!DOCTYPE%20html%3E%3Cscript%3Ew(%22%01%22%20%3D%3D%20%22%5C1%22)%3C%2Fscript%3E
>
> It's not entirely clear to me whether that's in scope of HTML  
> though. We just need to define the "byte stream -> tree" mapping.  
> Although maybe it could be part of the rendering chapter, dunno.

I think Julian's question is not limited to serialization. The issue
is what meaning these characters have, whether they are inserted
through the DOM, through XML, or through the text/html serialization.
That in itself is an interoperability problem. If neither HTML nor
Unicode specifies this, is there any specification we can point to
that would tell UAs what to do and authors what to expect?

So we can't just say that the serialization should support these
characters because the DOM does: we're in the process of specifying
the HTML5 DOM and one of the HTML5 serializations. Incidentally,
I've also added this issue to the serialization differences wiki
page [1]. I included XML 1.1 in that table because, though Julian says
it's a failure, the only requirement changes, as far as I can see,
relate to these C0 and C1 control characters and their meaning and
serialization.
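
To make the XML 1.0 versus XML 1.1 difference concrete, here is a minimal Python sketch of how each version treats the C0 (U+0000 to U+001F) and C1 (U+0080 to U+009F) control characters. The ranges follow the Char production of XML 1.0 and the Char and RestrictedChar productions of XML 1.1. The function names and the three status labels are my own, chosen for illustration only, and the sketch says nothing about what text/html should do.

```python
# Sketch: status of a code point in XML 1.0 and XML 1.1 documents.
# Names and status labels are illustrative, not taken from any spec or library.

# XML 1.0 "Char" production.
XML10_CHAR = [(0x9, 0x9), (0xA, 0xA), (0xD, 0xD),
              (0x20, 0xD7FF), (0xE000, 0xFFFD), (0x10000, 0x10FFFF)]

# XML 1.1 "Char" production (everything except U+0000, surrogates, FFFE/FFFF).
XML11_CHAR = [(0x1, 0xD7FF), (0xE000, 0xFFFD), (0x10000, 0x10FFFF)]

# XML 1.1 "RestrictedChar": allowed only as character references.
XML11_RESTRICTED = [(0x1, 0x8), (0xB, 0xC), (0xE, 0x1F),
                    (0x7F, 0x84), (0x86, 0x9F)]


def in_ranges(cp, ranges):
    """Return True if code point cp falls in any inclusive (lo, hi) range."""
    return any(lo <= cp <= hi for lo, hi in ranges)


def xml10_status(cp):
    # In XML 1.0 a character reference must also match Char, so anything
    # outside Char cannot appear at all.
    return "literal" if in_ranges(cp, XML10_CHAR) else "forbidden"


def xml11_status(cp):
    if not in_ranges(cp, XML11_CHAR):
        return "forbidden"
    if in_ranges(cp, XML11_RESTRICTED):
        return "char-ref only"
    return "literal"


if __name__ == "__main__":
    # A few C0 and C1 samples: NUL, SOH, TAB, first C1 control, NEL.
    for cp in (0x00, 0x01, 0x09, 0x80, 0x85):
        print(f"U+{cp:04X}  XML 1.0: {xml10_status(cp):9}  "
              f"XML 1.1: {xml11_status(cp)}")
```

The point of the comparison is that XML 1.1 admits most C0 controls (as character references only) where XML 1.0 forbids them, and restricts the C1 controls that XML 1.0 accepts literally. U+0000 is excluded by both.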

Take care,
Rob

[1]: <http://esw.w3.org/topic/HTML/SerializationDependentProcessingDifferences#head-325bab981d9fb34bc566af12b58e423352491705>

Received on Sunday, 9 September 2007 20:40:28 UTC