- From: Tex Texin <tex@i18nguy.com>
- Date: Wed, 28 May 2003 02:09:09 -0400
- To: François Yergeau <francois@yergeau.com>
- CC: GEO <public-i18n-geo@w3.org>, Bjoern Hoehrmann <derhoermi@gmx.net>
François, great stuff. 4) could it be font on your end? I didn't try xhtml, but for an html example, I see several of the characters in the 0x80-90 range in both IE and NS7. Perhaps the stricter parsing corrects the other characters, but it is odd the euro shows thru then. 5) Unicode 3, section 13.1 says more or less it is up to the application but falls back to 6429. The chart shows the control names. My simplistic interpretation is both standards (Unicode, 10646) are saying it is up to the application to define their use, and html defines them as not to be used. 3) Nulls- ok! 7) CER- ok. I did mean to say that you can name an entity, but I wrote character... 9) OK. What seemed odd to me was recognizing the controls were needed enough to add NCRs to support them and then ruling out the c1 range. And why allow them as ncr's but not as encoded values? thanks François. I really appreciate your taking the time to dig thru this. I think we have way too much detail for the Q&A so, I'll have to figure out how to handle it. Great stuff, and much appreciated. tex François Yergeau wrote: > > Reordering a bit: > > Tex Texin a écrit: > > 4) I do know that IE maps the c1 range into 1252 values. > > Not so fast! With a test page in XHTML (supposedly kicks IE in strict > parsing mode), I see only the euro displayed, all the others are white > rectangles. Surprisingly, it's much the same in NS7 (s/white > rectangle/replacement char/)! > > > 5) C0 in html- the spec says yes. I agree there are no glyphs, when I tried > > it, although  was a different box than - for some reason. > > However, the validator rejects these characters. I want to get Martin's > > comment on why dtd makes it unused. > > OK, let's dive. Looking more closely, the spec doesn't really say yes. > Section 5.1 obliquely says that the document character set is > ISO10646. And 10646 in turn doesn't define the C0 and C1 controls it > just says (Clause 15): "This coded character set provides for use of > control functions encoded according to ISO/IEC 6429 or similarly > structured standards for control functions, and standards derived from > these." And if you look at the code charts, the C0 and C1 areas are > conspicuously empty (grayed out). > > The HTML spec contains an SGML declaration (section 20.1) which formally > declares what the document character set is: > > CHARSET > BASESET "ISO Registration Number 177//CHARSET > ISO/IEC 10646-1:1993 UCS-4 with > implementation level 3//ESC 2/5 2/15 4/6" > DESCSET 0 9 UNUSED > 9 2 9 > 11 2 UNUSED > 13 1 13 > 14 18 UNUSED > 32 95 32 > 127 1 UNUSED > 128 32 UNUSED > 160 55136 160 > 55296 2048 UNUSED -- SURROGATES -- > 57344 1056768 57344 > > So it explicitly excludes all of C0, C1 and #x7F, except for TAB, CR and > LF. We were both wrong. > > > 3) NULL- ok, what to say about it? I don't want to doc browsers' random > > behavior. It would be nice to say its illegal and be done with it. > > Well, there you have it now. It's illegal in HTML, XML 1.0 and XML 1.1. > > > 7) character entity references. Maybe this is a terminology problem on my > > part. > > Well, HTML does have this terminology (in 5.3.2). It's not defined > clearly, but it seems to cover all the predefined entities that HTML > offers "to give authors a more intuitive way of referring to characters > in the document character set". The purported intuitiveness is, of > course, lost on non-English-speakers. > > > The title of http://www.w3.org/TR/2002/CR-xml11-20021015/#sec4.1 is character > > and entity references. > > I presumed the former was ncr and the latter was CER. It is possible to give a > > name to a character in xml. > > It is possible to give a name to an entity, which may or may not contain > a single character. The five entities that XML predefines (lt, gt, amp, > quot and apos) do contain only one character. > > > 9) AHA! ok. I see the production changed in > > http://www.w3.org/TR/2002/CR-xml11-20021015/#sec2.2 > > so that 7f-9f except 85 is excluded. Seems to me an odd thing to do, although > > It makes things quite symmetrical. Apart from the few useful ones (CR, > LF, TAB and NEL), all controls must be represented as NCRs; NULL is > forbidden altogether. > > Regards, > > -- > François -- ------------------------------------------------------------- Tex Texin cell: +1 781 789 1898 mailto:Tex@XenCraft.com Xen Master http://www.i18nGuy.com XenCraft http://www.XenCraft.com Making e-Business Work Around the World -------------------------------------------------------------
Received on Wednesday, 28 May 2003 02:11:31 UTC