the document character set for text/html serialization from Robert Burns on 2007-09-08 (public-html@w3.org from September 2007)

From: Robert Burns <rob@robburns.com>
Date: Fri, 7 Sep 2007 19:02:53 -0500
To: "HTML Working Group <public-html@w3.org>" <public-html@w3.org>
Message-Id: <AB799A39-3280-45FA-87DA-E5FC27252DD6@robburns.com>

Hello all,

I've started some wiki pages to track parsing issues[1] and parsing
errors[2] for the text/html serialization. I've also started a page to
track issues in the DOM/document tree that might persist when
deserializing from the two different serializations for HTML5[3].

I just added a section on the document character set to list the  
differences in characters allowed in the text/html serialization that  
are not allowed in the XML serialization.  A few issues arise with  
these 28 control characters and the surrogate characters.
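
As a rough illustration of where that count could come from, here is a
minimal Python sketch. It is not taken from the draft or the wiki page.
It assumes the 28 characters are the C0 controls U+0001 through U+001F
other than tab, line feed, and carriage return, and it checks them
against the Char production of XML 1.0. The names XML10_CHAR_RANGES and
is_xml10_char are made up for this example.

```python
# Code point ranges permitted by the XML 1.0 "Char" production.
XML10_CHAR_RANGES = [
    (0x9, 0x9), (0xA, 0xA), (0xD, 0xD),
    (0x20, 0xD7FF), (0xE000, 0xFFFD), (0x10000, 0x10FFFF),
]


def is_xml10_char(cp: int) -> bool:
    """Return True if the code point may appear in an XML 1.0 document."""
    return any(lo <= cp <= hi for lo, hi in XML10_CHAR_RANGES)


# Assumption: the controls under discussion are the C0 controls except NUL.
c0_not_in_xml = [cp for cp in range(0x01, 0x20) if not is_xml10_char(cp)]

# Surrogate code points (U+D800..U+DFFF) are outside Char as well.
surrogates_in_xml = any(is_xml10_char(cp) for cp in range(0xD800, 0xE000))

print(len(c0_not_in_xml))   # 28
print(surrogates_in_xml)    # False
```

Under that assumption the list includes U+000C FORM FEED, which matters
for the whitespace question raised further down.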

First, is it a good idea to include characters in the document
character set (as a part of the document conformance criteria)
without providing some definition of the meaning of these
characters? Unicode no longer provides any definition for these
control characters, so it's hard to see how this can help with
interoperability. Certainly the private use characters might have
similar interoperability issues, but since these really just stand in
for particular glyphs rather than having extra control capabilities,
it seems to me to be a different level of interoperability problem
(the author might just distribute a glyphlette or fontlette for the
private use character case and the interoperability issue is solved).

Second, since this introduces incompatibility and conversion issues
with XML, shouldn't we at least try to find use cases for these
characters before including them in the document conformance
criteria? We can still process these characters as errors or peculiar
whitespace characters in the parsing algorithm without including them
as a part of the official document character set.

Third, some of these control characters appear to be similar to  
whitespace characters and the text/html parsing algorithm appears to  
treat some of them as whitespace characters. Therefore, I think we
should either explicitly exclude them from the document conformance
criteria and treat them as errors (though errors handled gracefully),
or we should include them explicitly in the list of whitespace  
characters so authors know that's how they will be treated.

Fourth, the current draft makes no mention of surrogate characters,
implying by omission that they are permitted. However, it seems to me
that we should explicitly exclude surrogates from the document
conformance criteria. If a surrogate is used in isolation and cannot
be resolved to a valid character outside Unicode's Basic Multilingual
Plane, then I think it should be an error and not valid. If this is
already in the draft, then I missed it. Of course, this wouldn't affect
the use of surrogates in the UTF-16 encoding; however, for the
document character set we should be focussed on the characters as
they are finally resolved (after character references and surrogate
pairs have been resolved).
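
To make the distinction concrete, here is a minimal Python sketch of
resolving surrogate pairs in a sequence of UTF-16 code units and
flagging any surrogate left in isolation. It only illustrates the idea
and is not the parsing algorithm from the draft. The function name
resolve_utf16 and its return shape are assumptions of this example.

```python
def is_high(unit: int) -> bool:
    """High (leading) surrogate: U+D800..U+DBFF."""
    return 0xD800 <= unit <= 0xDBFF


def is_low(unit: int) -> bool:
    """Low (trailing) surrogate: U+DC00..U+DFFF."""
    return 0xDC00 <= unit <= 0xDFFF


def resolve_utf16(units):
    """Return (code_points, lone_indexes) for a list of 16-bit code units.

    Well-formed pairs become one code point above U+FFFF. Isolated
    surrogates are left out of code_points and their positions are
    reported in lone_indexes instead.
    """
    code_points, lone_indexes = [], []
    i = 0
    while i < len(units):
        unit = units[i]
        if is_high(unit) and i + 1 < len(units) and is_low(units[i + 1]):
            # Combine the pair into a supplementary-plane code point.
            low = units[i + 1]
            code_points.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
            continue
        if is_high(unit) or is_low(unit):
            lone_indexes.append(i)  # surrogate with no partner
        else:
            code_points.append(unit)
        i += 1
    return code_points, lone_indexes


print(resolve_utf16([0xD83D, 0xDE00]))    # ([128512], [])
print(resolve_utf16([0x41, 0xD800, 0x42]))  # ([65, 66], [1])
```

In the first call the pair resolves to U+1F600, so nothing is flagged.
In the second, the lone U+D800 at index 1 is the case I think should be
an error.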

Any thoughts?

Take care,
Rob

[1]: <http://esw.w3.org/topic/HTML/ParseIssues>
[2]: <http://esw.w3.org/topic/HTML/ParseErrors>
[3]: <http://esw.w3.org/topic/HTML/SerializationDependentProcessingDifferences>
[4]: <http://esw.w3.org/topic/HTML/ParseIssues#head-1849fcdaf415814b598cf6e6f1d18119e85282e1>

Received on Saturday, 8 September 2007 00:03:14 UTC