Re: the document character set for text/html serialization

On Sep 7, 2007, at 7:02 PM, Robert Burns wrote:

> I've started some wiki pages to track parsing issues[1] and parsing
> errors[2] for the text/html serialization.  I've also started a page
> to track issues in the DOM/document tree that might persist when
> deserializing from the two different serializations for HTML5[3].
>
> I just added a section on the document character set to list the  
> differences in characters allowed in the text/html serialization  
> that are not allowed in the XML serialization.  A few issues arise  
> with these 28 control characters and the surrogate characters.
>
> First, is it a good idea to include characters in the document
> character set (as a part of the document conformance criteria)
> without providing some definition of the meaning for these
> characters? Unicode no longer provides any definition for these
> control characters, so it's hard to see how this can help with
> interoperability. Certainly the private use characters might have  
> similar interoperability issues, but since these really just stand  
> in for particular glyphs rather than having extra control  
> capabilities, it seems to me to be a different level of  
> interoperability problem (the author might just distribute a  
> glyphlette or fontlette for the private use character case and the  
> interoperability issue is solved).
>
> Second, since this introduces incompatibility and conversion issues
> with XML, shouldn't we at least try to find use cases for these
> characters before including them in the document conformance
> criteria? We can still process these characters as errors or
> peculiar whitespace characters in the parsing algorithm without
> including them as a part of the official document character set.
>
> Third, some of these control characters appear to be similar to  
> whitespace characters and the text/html parsing algorithm appears  
> to treat some of them as whitespace characters. Therefore I think  
> we should either explicitly exclude them from the document
> conformance criteria and treat them as errors (though errors  
> handled gracefully), or we should include them explicitly in the  
> list of whitespace characters so authors know that's how they will  
> be treated.
>
> Fourth, the current draft makes no mention of surrogate characters,
> implying by omission that they are permitted. However, it seems to  
> me that we should explicitly exclude surrogates from the document  
> conformance criteria. If a surrogate is used in isolation and  
> cannot be resolved to a valid character outside Unicode's Basic  
> Multilingual Plane, then I think it should be an error and not  
> valid. If this is already in the draft then I missed it. Of course
> this wouldn't affect the use of surrogates in the UTF-16 encoding;
> however, for the document character set we should be focussed on
> the characters as they are finally resolved (after character
> references and surrogate pairs have been resolved).
>
> Any thoughts?
>
> [1]: <http://esw.w3.org/topic/HTML/ParseIssues>
> [2]: <http://esw.w3.org/topic/HTML/ParseErrors>
> [3]: <http://esw.w3.org/topic/HTML/SerializationDependentProcessingDifferences>
> [4]: <http://esw.w3.org/topic/HTML/ParseIssues#head-1849fcdaf415814b598cf6e6f1d18119e85282e1>

I see now that XML 1.1 permits all of these control characters as
part of the document character set; however, all of these ASCII
control characters must be included only as character references in
XML 1.1. That leaves only the issues of surrogates and of whitespace
handling for these characters (if any: e.g., U+000B, U+000C, and
U+0085). Though I think our WG's practice of finding use cases for a
feature before including it is apt here too. Is being compatible with
XML 1.1 enough of a use case? How would authors use these characters?
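
As a rough illustration of the XML 1.0 versus XML 1.1 differences
discussed above, here is a minimal Python sketch. The ranges are
transcribed from the Char productions of XML 1.0 and XML 1.1 and from
RestrictedChar in XML 1.1. The function names are made up for this
example, and nothing here is taken from the HTML5 draft. It also assumes
the input has already been decoded, so that any surrogate code point
still present in the string is an unpaired one.

```python
# Inclusive code point ranges, transcribed from the XML 1.0 and XML 1.1
# grammars (Char and RestrictedChar). Names are illustrative only.
XML10_CHAR = [(0x9, 0xA), (0xD, 0xD), (0x20, 0xD7FF),
              (0xE000, 0xFFFD), (0x10000, 0x10FFFF)]
XML11_CHAR = [(0x1, 0xD7FF), (0xE000, 0xFFFD), (0x10000, 0x10FFFF)]
XML11_RESTRICTED = [(0x1, 0x8), (0xB, 0xC), (0xE, 0x1F),
                    (0x7F, 0x84), (0x86, 0x9F)]


def in_ranges(cp, ranges):
    """Return True if code point cp falls inside any inclusive range."""
    return any(lo <= cp <= hi for lo, hi in ranges)


def only_in_xml11():
    """Code points that XML 1.1 Char allows but XML 1.0 Char does not."""
    return [cp for cp in range(0x110000)
            if in_ranges(cp, XML11_CHAR) and not in_ranges(cp, XML10_CHAR)]


def needs_reference_in_xml11(cp):
    """True if XML 1.1 allows cp only as a character reference."""
    return in_ranges(cp, XML11_RESTRICTED)


def unpaired_surrogates(text):
    """Indexes of surrogate code points left in an already-decoded string."""
    return [i for i, ch in enumerate(text) if 0xD800 <= ord(ch) <= 0xDFFF]


if __name__ == "__main__":
    extra = only_in_xml11()
    print(len(extra), [f"U+{cp:04X}" for cp in extra])
    for cp in (0x0B, 0x0C, 0x85):
        print(f"U+{cp:04X}", needs_reference_in_xml11(cp))
    print(unpaired_surrogates("a\ud800b"))
```

With these ranges, U+000B and U+000C fall under RestrictedChar while
U+0085 does not, so the character-reference requirement would not apply
to U+0085.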

Take care,
Rob

Received on Saturday, 8 September 2007 00:40:46 UTC