Re: the document character set for text/html serialization from Anne van Kesteren on 2007-09-09 (public-html@w3.org from September 2007)

From: Anne van Kesteren <annevk@opera.com>
Date: Sun, 09 Sep 2007 20:26:08 +0200
To: "HTML WG" <public-html@w3.org>
Message-ID: <op.tyef5uxb64w2qv@annevk-t60.oslo.opera.com>

On Sun, 09 Sep 2007 18:20:03 +0200, Julian Reschke <julian.reschke@gmx.de>  
wrote:
> Anne van Kesteren wrote:
>> On Sun, 09 Sep 2007 16:11:34 +0200, Julian Reschke  
>> <julian.reschke@gmx.de> wrote:
>>> We really should answer the question we asked before: why would it be  
>>> conforming to include those characters in the first place?
>>  I can see a good reason to prohibit U+0000 (and that's done), but what  
>> is the reason for making these other characters non-conforming? They  
>> are not posing any interoperability problem and are also supported by  
>> the DOM. I'm not sure why we should limit the HTML serialization here.
>
> So what's the semantics of these characters when they occur inside HTML?  
> What is a recipient supposed to do with them, for instance, when they  
> appear inside <p> or a <pre> element?

They should do the same as whenever someone inserts them through the DOM.  
It seems that browsers display some kind of placeholder character:  
http://software.hixie.ch/utilities/js/live-dom-viewer/?%3C!DOCTYPE%20html%3E%3Cscript%3Ew(%22%01%22%20%3D%3D%20%22%5C1%22)%3C%2Fscript%3E
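For anyone who'd rather not read the percent-encoding by eye, here is a minimal sketch that decodes the test case carried in the query string of the link above. The helper name decode_test_case is mine, not part of any tool. As far as I can tell, the decoded script compares a literal U+0001 character in the source with the escape "\1", so it checks whether the parser hands U+0001 through to the script unchanged.

```python
from urllib.parse import urlsplit, unquote

# The live DOM viewer link from the message, split only for line length.
URL = (
    "http://software.hixie.ch/utilities/js/live-dom-viewer/"
    "?%3C!DOCTYPE%20html%3E%3Cscript%3Ew(%22%01%22%20%3D%3D%20%22%5C1%22)"
    "%3C%2Fscript%3E"
)


def decode_test_case(url: str) -> str:
    """Return the markup carried in the URL's query string (hypothetical helper)."""
    return unquote(urlsplit(url).query)


markup = decode_test_case(URL)

# repr() makes the raw U+0001 visible as \x01 instead of printing it directly.
print(repr(markup))
```

The first string literal in the decoded script holds the raw control character; the second holds a backslash followed by the digit 1, which script engines read as an escape for the same code point.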

It's not entirely clear to me whether that's in scope for HTML, though. We  
just need to define the "byte stream -> tree" mapping. Although maybe it  
could be part of the rendering chapter, dunno.

-- 
Anne van Kesteren
<http://annevankesteren.nl/>
<http://www.opera.com/>

Received on Sunday, 9 September 2007 18:26:24 UTC