- From: Robert Burns <rob@robburns.com>
- Date: Tue, 11 Sep 2007 01:59:32 -0500
- To: Julian Reschke <julian.reschke@gmx.de>
- Cc: Anne van Kesteren <annevk@opera.com>, HTML WG <public-html@w3.org>, Richard Ishida <ishida@w3.org>
Hello WG,

Based on this thread, I've put together some proposed language and sections to guide document authors and implementors on using characters [1]. Since Unicode has nearly 100,000 assigned characters and well over 100,000 additional private-use code points, I thought it would be a good idea to give a little extra scrutiny to the handful of characters that aren't like the others. There are a few categories of characters that aren't graphical characters but serve some other purpose. The current draft excludes U+0000 and code points above U+10FFFF, which is a good start. However, since only a few dozen characters have special properties, and only a few broad categories of characters might confuse authors and editing-UA implementors, I think it's worth devoting a section or small chapter to the topic. I also expect to run some browser tests on the handling of some of these characters, to round out what the draft already says about them.

To help authors and implementors understand the intricacies of Unicode characters, the page proposes four tiers of character usage:

1) Follow Unicode norms and guidance.
2) Follow HTML5-provided norms and guidance (this tier discusses whitespace and other HTML5-specific character issues).
3) Discourage use of characters in favor of other facilities (this includes the C0 and C1 control characters and the compatibility characters).
4) Avoid any use of characters: these characters would be basically deprecated for use in HTML5 (this tier strongly discourages characters where markup exists instead).

The page also lists facilities beyond the scope of HTML that are needed to eliminate the need for the compatibility characters. Within the CSS3 timeframe, providing them looks quite possible, with one exception: I don't think CSS3 has proposed anything to turn on ligatures. This seems like a very basic feature that CSS should provide. In some languages ligatures are quite important or even indispensable.
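To make the four tiers a little more concrete, here is a rough sketch of how a checking tool might sort characters into them. It uses Python's standard `unicodedata` module. The function name `classify`, the tier labels, and in particular the `MARKUP_ALTERNATIVES` set (the bidi embedding controls, chosen only as an example of characters for which markup exists) are my own assumptions for illustration and are not taken from the wiki page. Code points above U+10FFFF cannot be represented in a Python string at all, so they are not handled here.

```python
import unicodedata

# HTML5 "space characters": space, tab, LF, FF, CR.
HTML_WHITESPACE = set(" \t\n\f\r")

# Hypothetical example of tier 4: bidi embedding/override controls
# (U+202A..U+202E). This set is an assumption for illustration only.
MARKUP_ALTERNATIVES = set(range(0x202A, 0x202F))


def classify(ch: str) -> str:
    """Return a rough, illustrative usage tier for one character."""
    cp = ord(ch)
    if cp == 0:
        return "excluded"          # U+0000 is excluded by the current draft
    if ch in HTML_WHITESPACE:
        return "html5-guidance"    # tier 2: whitespace handled by HTML5 rules
    if cp in MARKUP_ALTERNATIVES:
        return "avoid"             # tier 4: markup exists instead (assumed set)
    if unicodedata.category(ch) == "Cc":
        return "discouraged"       # tier 3: remaining C0 and C1 controls
    if unicodedata.decomposition(ch).startswith("<"):
        return "discouraged"       # tier 3: has a compatibility decomposition
    return "unicode-norms"         # tier 1: ordinary character


if __name__ == "__main__":
    for sample in ["a", "\u00e9", "\n", "\x01", "\x85", "\ufb01", "\u202a"]:
        print(f"U+{ord(sample):04X} -> {classify(sample)}")
```

Note that the compatibility test is deliberately crude: any character whose decomposition carries a tag such as `<compat>` or `<noBreak>` lands in tier 3, which is broader than what the page would probably want.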
It's still at a preliminary stage, but feedback is welcome. Richard Ishida, if you could take a look to make sure there aren't any inaccuracies, that would be appreciated.

Take care,
Rob

[1]: <http://esw.w3.org/topic/HTML/HTMLCharacterUsage#preview>

original message -------------------------------

On Sep 10, 2007, at 3:44 AM, Julian Reschke wrote:

>
> Robert Burns wrote:
>> I think Julian's question is not limited to serialization. The
>> issue is what meaning these characters have, whether inserted into
>> the DOM, inserted through XML, or inserted through the text/html
>> serialization.
>
> Correct. As a matter of fact, the fact that it's possible to add
> illegal characters through XML DOM level 1, and that XML
> serializers then either create broken XML or throw exceptions later
> on, has also been a source of frustration for many programmers.
>
>> That in itself is an interoperability problem. If HTML doesn't
>> specify this and Unicode doesn't specify this, then is there any
>> specification we can point to that would tell UAs what to do and
>> authors what to expect?
>
> Right.
>
>> So we can't just say that the DOM supports it so the serialization
>> should support it, because we're in the process of specifying the
>> HTML5 DOM and one of the HTML5 serializations. Incidentally, I've
>> also added this issue to the serialization differences wiki page.
>> I included XML 1.1 in that table because, though Julian says it's
>> a failure, the only requirement changes, as far as I can see,
>> relate to these C0 and C1 control characters and their meaning and
>> serialization.
>
> ...
>
> The failure is largely about interop. There are almost no benefits
> of XML 1.1 over 1.0, but the transition is so expensive that, as far
> as I can tell, it just hasn't occurred.
>
> Best regards, Julian
>
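The DOM frustration Julian describes above is easy to reproduce. A minimal sketch, assuming Python's standard `xml.dom.minidom` as a stand-in for a DOM implementation that does not validate characters on insertion (the helper name `roundtrip_fails` is hypothetical): the DOM accepts a C0 control character in a text node, the serializer writes it out unchanged, and an XML 1.0 parser then rejects the result.

```python
from xml.dom.minidom import Document, parseString
from xml.parsers.expat import ExpatError


def roundtrip_fails(text: str) -> bool:
    """Build <p>text</p> through the DOM, serialize it, and report
    whether parsing the serialized form back raises an error."""
    doc = Document()
    root = doc.createElement("p")
    # minidom performs no character validation here, so a control
    # character such as U+0001 is accepted silently.
    root.appendChild(doc.createTextNode(text))
    doc.appendChild(root)

    serialized = doc.toxml()  # the control character is written as-is
    try:
        parseString(serialized)
    except ExpatError:
        return True   # serializer produced XML that is not well-formed
    return False


if __name__ == "__main__":
    print(roundtrip_fails("plain text"))
    print(roundtrip_fails("a\x01b"))
```

Other DOM implementations may instead throw at serialization time, which is the second behavior mentioned in the quoted message.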
Received on Tuesday, 11 September 2007 06:59:48 UTC