- From: Robert Burns <rob@robburns.com>
- Date: Fri, 7 Sep 2007 19:02:53 -0500
- To: "HTML Working Group <public-html@w3.org>" <public-html@w3.org>
Hello all,

I've started some wiki pages to track parsing issues[1] and parsing errors[2] for the text/html serialization. I've also started a page to track issues in the DOM/document tree that might persist when deserializing from the two different serializations for HTML5[3]. I just added a section on the document character set to list the characters that are allowed in the text/html serialization but not in the XML serialization.

A few issues arise with these 28 control characters and the surrogate characters.

First, is it a good idea to include characters in the document character set (as a part of the document conformance criteria) without providing some definition of their meaning? Unicode no longer provides any definition for these control characters, so it's hard to see how this can help with interoperability. Certainly the private use characters might have similar interoperability issues, but since these really just stand in for particular glyphs rather than having extra control capabilities, it seems to me to be a different level of interoperability problem (in the private use case the author might just distribute a glyphlette or fontlette and the interoperability issue is solved).

Second, since this introduces incompatibility and conversion issues with XML, shouldn't we at least try to find use cases for these characters before including them in the document conformance criteria? We can still process these characters as errors or peculiar whitespace characters in the parsing algorithm without including them as a part of the official document character set.

Third, some of these control characters appear to be similar to whitespace characters, and the text/html parsing algorithm appears to treat some of them as whitespace characters.
Therefore I think we should either explicitly exclude them from the document conformance criteria and treat them as errors (though errors handled gracefully), or include them explicitly in the list of whitespace characters so authors know that's how they will be treated.

Fourth, the current draft makes no mention of surrogate characters, implying by omission that they are permitted. However, it seems to me that we should explicitly exclude surrogates from the document conformance criteria. If a surrogate is used in isolation and cannot be resolved to a valid character outside Unicode's Basic Multilingual Plane, then I think it should be an error and not valid. If this is already in the draft then I missed it. Of course this wouldn't affect the use of surrogates in the UTF-16 encoding. However, for the document character set we should be focussed on the characters as they are finally resolved (after character references and surrogate pairs have been resolved).

Any thoughts?

Take care,
Rob

[1]: <http://esw.w3.org/topic/HTML/ParseIssues>
[2]: <http://esw.w3.org/topic/HTML/ParseErrors>
[3]: <http://esw.w3.org/topic/HTML/SerializationDependentProcessingDifferences>
[4]: <http://esw.w3.org/topic/HTML/ParseIssues#head-1849fcdaf415814b598cf6e6f1d18119e85282e1>
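
For anyone who wants to experiment with the two checks discussed above, here is a rough Python sketch. It is not taken from the draft or the wiki pages. The first part tests a code point against the XML 1.0 `Char` production and lists the C0 control characters that production excludes; leaving U+0000 out of the count is an assumption made here about how the figure of 28 is reached. The second part flags surrogate code units that are not part of a valid high/low pair in a sequence of UTF-16 code units. All function names are made up for illustration.

```python
from typing import List


def is_xml10_char(cp: int) -> bool:
    """True if cp matches the XML 1.0 Char production:
    #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    """
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def c0_controls_excluded_by_xml10() -> List[int]:
    """C0 controls U+0001..U+001F that XML 1.0 does not allow.

    Assumption: U+0000 is deliberately left out of the range scanned here.
    """
    return [cp for cp in range(0x01, 0x20) if not is_xml10_char(cp)]


def lone_surrogates(units: List[int]) -> List[int]:
    """Indexes of UTF-16 code units that are surrogates outside a valid pair."""
    bad = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:  # high (lead) surrogate
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2  # well-formed pair, skip both units
                continue
            bad.append(i)
        elif 0xDC00 <= u <= 0xDFFF:  # low (trail) surrogate with no lead
            bad.append(i)
        i += 1
    return bad
```

Under these assumptions, `len(c0_controls_excluded_by_xml10())` comes to 28, and `lone_surrogates([0x41, 0xD83D, 0xDE00, 0xD800, 0x42, 0xDC00])` reports indexes 3 and 5 while leaving the well-formed pair at indexes 1 and 2 alone.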
Received on Saturday, 8 September 2007 00:03:14 UTC