- From: François Yergeau <francois@yergeau.com>
- Date: Tue, 27 May 2003 14:15:34 -0400
- To: Tex Texin <tex@i18nguy.com>
- Cc: GEO <public-i18n-geo@w3.org>, Bjoern Hoehrmann <derhoermi@gmx.net>
Reordering a bit:

Tex Texin wrote:
> 4) I do know that IE maps the c1 range into 1252 values.

Not so fast! With a test page in XHTML (which supposedly kicks IE into strict parsing mode), I see only the euro displayed; all the others are white rectangles. Surprisingly, it's much the same in NS7 (s/white rectangle/replacement char/)!

> 5) C0 in html- the spec says yes. I agree there are no glyphs, when I tried
> it, although  was a different box than - for some reason.
> However, the validator rejects these characters. I want to get Martin's
> comment on why dtd makes it unused.

OK, let's dive. Looking more closely, the spec doesn't really say yes. Section 5.1 obliquely says that the document character set is ISO 10646. And 10646 in turn doesn't define the C0 and C1 controls; it just says (Clause 15):

"This coded character set provides for use of control functions encoded according to ISO/IEC 6429 or similarly structured standards for control functions, and standards derived from these."

And if you look at the code charts, the C0 and C1 areas are conspicuously empty (grayed out).

The HTML spec contains an SGML declaration (section 20.1) which formally declares what the document character set is:

    CHARSET
      BASESET  "ISO Registration Number 177//CHARSET
                ISO/IEC 10646-1:1993 UCS-4 with
                implementation level 3//ESC 2/5 2/15 4/6"
      DESCSET  0       9       UNUSED
               9       2       9
               11      2       UNUSED
               13      1       13
               14      18      UNUSED
               32      95      32
               127     1       UNUSED
               128     32      UNUSED
               160     55136   160
               55296   2048    UNUSED  -- SURROGATES --
               57344   1056768 57344

So it explicitly excludes all of C0, C1 and #x7F, except for TAB, CR and LF. We were both wrong.

> 3) NULL- ok, what to say about it? I don't want to doc browsers' random
> behavior. It would be nice to say its illegal and be done with it.

Well, there you have it now. It's illegal in HTML, XML 1.0 and XML 1.1.

> 7) character entity references. Maybe this is a terminology problem on my
> part.

Well, HTML does have this terminology (in 5.3.2).
It's not defined clearly, but it seems to cover all the predefined entities that HTML offers "to give authors a more intuitive way of referring to characters in the document character set". The purported intuitiveness is, of course, lost on non-English-speakers.

> The title of http://www.w3.org/TR/2002/CR-xml11-20021015/#sec4.1 is character
> and entity references.
> I presumed the former was ncr and the latter was CER. It is possible to give a
> name to a character in xml.

It is possible to give a name to an entity, which may or may not contain a single character. The five entities that XML predefines (lt, gt, amp, quot and apos) do contain only one character.

> 9) AHA! ok. I see the production changed in
> http://www.w3.org/TR/2002/CR-xml11-20021015/#sec2.2
> so that 7f-9f except 85 is excluded. Seems to me an odd thing to do, although

It makes things quite symmetrical. Apart from the few useful ones (CR, LF, TAB and NEL), all controls must be represented as NCRs; NULL is forbidden altogether.

Regards,

-- 
François
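
A footnote on the DESCSET in the HTML SGML declaration discussed above: the triples can be checked mechanically. The following is only a minimal Python sketch, assuming each triple reads as (first code point, count, base code point or UNUSED). The names `DESCSET`, `is_in_document_charset` and `allowed_controls` are made up for illustration and do not come from any spec or library.

```python
# DESCSET triples transcribed from the SGML declaration quoted in this message:
# (first code point, number of code points, base code point or None for UNUSED).
DESCSET = [
    (0, 9, None),
    (9, 2, 9),
    (11, 2, None),
    (13, 1, 13),
    (14, 18, None),
    (32, 95, 32),
    (127, 1, None),
    (128, 32, None),
    (160, 55136, 160),
    (55296, 2048, None),      # surrogates
    (57344, 1056768, 57344),
]


def is_in_document_charset(code_point):
    """Return True if the declaration maps this code point (i.e. not UNUSED)."""
    for start, count, base in DESCSET:
        if start <= code_point < start + count:
            return base is not None
    return False  # outside every declared range


def allowed_controls():
    """List the C0, DEL and C1 code points that the declaration does map."""
    controls = list(range(0x00, 0x20)) + list(range(0x7F, 0xA0))
    return [cp for cp in controls if is_in_document_charset(cp)]


if __name__ == "__main__":
    print(allowed_controls())  # [9, 10, 13], i.e. TAB, LF and CR
```

Running it lists only 9, 10 and 13, which matches the reading that everything in C0, C1 and #x7F apart from TAB, LF and CR is excluded.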
Received on Tuesday, 27 May 2003 14:15:43 UTC