Proposal for document conformance criteria on characters (was: Re: the document character set for text/html serialization)

Hello WG,

Based on this thread, I've put together some proposed language and
sections to guide document authors and implementors on using
characters [1]. Since Unicode has nearly 100,000 assigned characters
and well over 100,000 additional private use code points, I thought
it would be a good idea to give a little extra scrutiny to the
handful of characters that aren't like the others.

There are a few categories of characters that aren't graphical
characters, but serve some other purpose. The current draft excludes
U+0000 and code points above U+10FFFF, which is a good start. However, I
think that since there are only a few dozen characters with special
properties and a few broad categories of characters that might
confuse authors and editing UA implementors, it's worth devoting a
section or small chapter to the topic.
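
To make the categories above concrete, here is a rough Python sketch that labels a code point by its Unicode general category. The `describe` function and its label strings are my own illustration, not anything from the draft or the wiki page; only the two exclusions (U+0000 and anything above U+10FFFF) come from the draft as described above.

```python
import unicodedata

# General categories that are not graphical characters.
# The label wording is illustrative only.
LABELS = {
    "Cc": "control",
    "Cf": "format",
    "Co": "private use",
    "Cs": "surrogate",
    "Cn": "unassigned or noncharacter",
}

def describe(cp):
    """Return a rough label for the code point number cp."""
    # The draft excludes U+0000 and anything beyond the Unicode range.
    if cp == 0 or cp > 0x10FFFF:
        return "excluded by the draft"
    return LABELS.get(unicodedata.category(chr(cp)), "graphic or separator")
```

For example, `describe(0x200D)` (ZERO WIDTH JOINER) falls under "format", while `describe(0xE000)` falls under "private use".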

I also expect to conduct some browser tests of how these characters
are handled, to round out what the draft already says about some of
them. To help authors and implementors understand the intricacies of
Unicode characters, the page proposes four tiers of character usage:
1) follow Unicode norms and guidance; 2) follow HTML5-provided norms
and guidance (this covers whitespace and other HTML5-specific
character issues); 3) discourage use of characters in favor of other
facilities (this includes the C0 and C1 control characters and the
compatibility characters); 4) avoid any use of characters: these
characters would be basically deprecated for use in HTML5 (this
strongly discourages characters where markup exists instead).
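
As a rough illustration of how the four tiers might sort individual characters, here is a Python sketch. It rests on several assumptions of mine rather than on the wiki page: tier 2 is reduced to the five HTML whitespace characters, compatibility characters are detected by a tagged decomposition mapping, and the tier 4 set (`MARKUP_PREFERRED`) is a hypothetical example using the bidi embedding and override controls, for which HTML has the dir attribute and the bdo element.

```python
import unicodedata

# Assumption: HTML whitespace is space, tab, LF, FF, and CR.
HTML_WHITESPACE = " \t\n\f\r"

# Hypothetical tier 4 set: U+202A..U+202E (bidi embedding and override
# controls). Chosen for illustration only; not taken from the proposal.
MARKUP_PREFERRED = set(range(0x202A, 0x202F))

def is_compatibility(ch):
    # Compatibility decompositions carry a tag such as <compat> or <font>;
    # canonical decompositions have no tag.
    return unicodedata.decomposition(ch).startswith("<")

def usage_tier(ch):
    """Return the tier (1-4) for a single character under the assumptions above."""
    if ord(ch) in MARKUP_PREFERRED:
        return 4  # avoid: markup exists instead
    if ch in HTML_WHITESPACE:
        return 2  # HTML5-provided norms apply
    if unicodedata.category(ch) == "Cc" or is_compatibility(ch):
        return 3  # C0/C1 controls and compatibility characters
    return 1      # follow Unicode norms
```

Under these assumptions the "fi" ligature U+FB01 lands in tier 3, while a precomposed "é" stays in tier 1 because its decomposition is canonical rather than a compatibility one.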

The page also lists facilities beyond the scope of HTML that would be
needed to eliminate the need for the compatibility characters. Within
the CSS3 timeframe this looks quite possible, with one exception: I
don't think CSS3 has proposed anything to turn on ligatures. This
seems like a very basic feature that CSS should provide. In some
languages, ligatures are quite important or even indispensable.

It's still in a preliminary stage, but feedback is welcome. Richard
Ishida, if you could take a look to make sure there aren't any
inaccuracies, that would be appreciated.

Take care,
Rob

[1]: <http://esw.w3.org/topic/HTML/HTMLCharacterUsage#preview>




original message
-------------------------------
On Sep 10, 2007, at 3:44 AM, Julian Reschke wrote:

>
> Robert Burns wrote:
>> I think Julian's question is not limited to serialization. The  
>> issue is what meaning these characters have whether inserted into  
>> the DOM, or inserted through XML, or inserted through the
>> text/html serialization?
>
> Correct. As a matter of fact, the fact that it's possible to add  
> illegal characters through XML DOM level 1, and then XML  
> serializers either create broken XML or throw exceptions later on  
> also has been a source of frustration for many programmers.
>
>> That in itself is an interoperability problem. If HTML doesn't  
>> specify this and Unicode doesn't specify this then is there any  
>> specification we can point to that would tell UAs what to do and  
>> authors what to expect?
>
> Right.
>
>> So we can't just say that the DOM supports it so the serialization  
>> should support it because we're in the process of specifying the  
>> HTML5 DOM and one of the HTML5 serializations. Incidentally I've  
>> also added this issue to the serialization differences wiki page.  
>> I included  XML 1.1 in that table because, though Julian says it's  
>> a failure, the only requirement changes as far as I can see,  
>> relate to these C0 and C1 control characters and their meaning and
>> serialization.
> > ...
>
> The failure is largely about interop. There are almost no benefits of
> XML 1.1 over 1.0, but the transition is so expensive that as far as  
> I can tell, it just hasn't occurred.
>
> Best regards, Julian
>
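
Julian's point in the quoted message, that a DOM will accept characters a serializer cannot legally write out, can be reproduced with a short Python sketch. It uses the standard library's `xml.dom.minidom` as a stand-in for a DOM implementation; the `round_trips` helper is a name made up for this illustration, and other DOM implementations may behave differently (for example by throwing at serialization time).

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

def round_trips(text):
    """Insert text into a DOM, serialize it, and report whether it re-parses."""
    doc = minidom.parseString("<root/>")
    # The DOM accepts the text node without checking its characters.
    doc.documentElement.appendChild(doc.createTextNode(text))
    serialized = doc.toxml()
    try:
        minidom.parseString(serialized)
    except ExpatError:
        # The serializer wrote output that is not well-formed XML 1.0.
        return False
    return True
```

Here `round_trips("plain text")` succeeds, while `round_trips("bad \x01 char")` fails on re-parse because U+0001 is not a legal character in XML 1.0.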

Received on Tuesday, 11 September 2007 06:59:48 UTC