Re: Q&A: control codes from François Yergeau on 2003-05-27 (public-i18n-geo@w3.org from May 2003)

From: François Yergeau <francois@yergeau.com>
Date: Tue, 27 May 2003 14:15:34 -0400
To: Tex Texin <tex@i18nguy.com>
Cc: GEO <public-i18n-geo@w3.org>, Bjoern Hoehrmann <derhoermi@gmx.net>
Message-id: <3ED3AB46.70509@yergeau.com>

Reordering a bit:

Tex Texin wrote:
> 4) I do know that IE maps the c1 range into 1252 values.

Not so fast!  With a test page in XHTML (which supposedly kicks IE into
strict parsing mode), I see only the euro displayed; all the others are
white rectangles.  Surprisingly, it's much the same in NS7 (s/white
rectangle/replacement char/)!
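
For reference, here is a minimal sketch of what a full "C1 to 1252" remapping would produce, so the observed output can be compared against it.  It uses Python's built-in cp1252 codec as a stand-in for the browser's table, which is an assumption; the function name c1_as_windows_1252 is hypothetical.

```python
def c1_as_windows_1252(code_point):
    """Return the character a Windows-1252 remap would give for a C1
    code point, or None where the codec leaves the byte unassigned."""
    if not 0x80 <= code_point <= 0x9F:
        raise ValueError("not in the C1 range (0x80-0x9F)")
    try:
        # Reinterpret the code point as a single Windows-1252 byte.
        return bytes([code_point]).decode("cp1252")
    except UnicodeDecodeError:
        # A few bytes (e.g. 0x81) have no mapping in Python's cp1252 codec.
        return None


if __name__ == "__main__":
    for cp in range(0x80, 0xA0):
        print(f"&#{cp}; -> {c1_as_windows_1252(cp)!r}")
```

Under such a remapping &#128; comes out as the euro sign (U+20AC), which matches the one character that did display in the test.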


> 5) C0 in html- the spec says yes. I agree there are no glyphs, when I tried
> it, although &#2; was a different box than &#x3;-&#x8; for some reason.
> However, the validator rejects these characters. I want to get Martin's
> comment on why dtd makes it unused.

OK, let's dive.  Looking more closely, the spec doesn't really say yes.
Section 5.1 obliquely says that the document character set is
ISO 10646.  And 10646 in turn doesn't define the C0 and C1 controls; it
just says (Clause 15): "This coded character set provides for use of 
control functions encoded according to ISO/IEC 6429 or similarly 
structured standards for control functions, and standards derived from 
these."  And if you look at the code charts, the C0 and C1 areas are 
conspicuously empty (grayed out).

The HTML spec contains an SGML declaration (section 20.1) which formally 
declares what the document character set is:

     CHARSET
           BASESET  "ISO Registration Number 177//CHARSET
                     ISO/IEC 10646-1:1993 UCS-4 with
                     implementation level 3//ESC 2/5 2/15 4/6"
          DESCSET 0       9       UNUSED
                  9       2       9
                  11      2       UNUSED
                  13      1       13
                  14      18      UNUSED
                  32      95      32
                  127     1       UNUSED
                  128     32      UNUSED
                  160     55136   160
                  55296   2048    UNUSED  -- SURROGATES --
                  57344   1056768 57344

So it explicitly excludes all of C0, C1 and #x7F, except for TAB, CR and 
LF.  We were both wrong.
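
The DESCSET table can be checked mechanically.  The following is a minimal sketch that copies the rows quoted above, reads each one as (start, count, target), and reports whether a code point is in the document character set.  The names parse_descset and in_document_charset are hypothetical helpers, not anything defined by the HTML spec.

```python
# Rows copied from the DESCSET in the SGML declaration quoted above
# (the "-- SURROGATES --" comment is dropped).
DESCSET = """
0       9       UNUSED
9       2       9
11      2       UNUSED
13      1       13
14      18      UNUSED
32      95      32
127     1       UNUSED
128     32      UNUSED
160     55136   160
55296   2048    UNUSED
57344   1056768 57344
"""


def parse_descset(text):
    """Turn each 'start count target' row into (start, count, used)."""
    ranges = []
    for line in text.split("\n"):
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank lines
        start, count, target = fields
        ranges.append((int(start), int(count), target != "UNUSED"))
    return ranges


RANGES = parse_descset(DESCSET)


def in_document_charset(code_point):
    """True if the code point falls in a range that is not UNUSED."""
    for start, count, used in RANGES:
        if start <= code_point < start + count:
            return used
    return False  # outside every declared range


if __name__ == "__main__":
    controls = list(range(0x00, 0x20)) + list(range(0x7F, 0xA0))
    print([cp for cp in controls if in_document_charset(cp)])
```

Run over C0, #x7F and C1, the only survivors are 9, 10 and 13, i.e. TAB, LF and CR.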


> 3) NULL- ok, what to say about it? I don't want to doc browsers' random
> behavior. It would be nice to say its illegal and be done with it.

Well, there you have it now.  It's illegal in HTML, XML 1.0 and XML 1.1.


> 7) character entity references. Maybe this is a terminology problem on my
> part.

Well, HTML does have this terminology (in 5.3.2).  It's not defined 
clearly, but it seems to cover all the predefined entities that HTML 
offers "to give authors a more intuitive way of referring to characters 
in the document character set".  The purported intuitiveness is, of 
course, lost on non-English-speakers.

> The title of http://www.w3.org/TR/2002/CR-xml11-20021015/#sec4.1 is character
> and entity references.
> I presumed the former was ncr and the latter was CER. It is possible to give a
> name to a character in xml.

It is possible to give a name to an entity, which may or may not contain 
a single character.  The five entities that XML predefines (lt, gt, amp, 
quot and apos) do contain only one character.
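
To illustrate the distinction, here is a small sketch using Python's standard xml.etree.ElementTree parser: the five predefined entities each expand to one character, while an entity declared in the internal subset can expand to any replacement text.  The entity name w3c and its value are made up for the example.

```python
import xml.etree.ElementTree as ET

# The five entities XML predefines: each stands for a single character.
predefined = ET.fromstring("<doc>&lt;&gt;&amp;&quot;&apos;</doc>")

# A named entity declared in the internal DTD subset (hypothetical name
# and value): the name refers to an entity, not to a character.
declared = ET.fromstring(
    '<!DOCTYPE doc [<!ENTITY w3c "World Wide Web Consortium">]>'
    "<doc>&w3c;</doc>"
)

if __name__ == "__main__":
    print(repr(predefined.text))
    print(repr(declared.text))
```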


> 9) AHA! ok. I see the production changed in
> http://www.w3.org/TR/2002/CR-xml11-20021015/#sec2.2
> so that 7f-9f except 85 is excluded. Seems to me an odd thing to do, although

It makes things quite symmetrical.  Apart from the few useful ones (CR, 
LF, TAB and NEL), all controls must be represented as NCRs; NULL is 
forbidden altogether.
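
As a rough sketch of that three-way split, the code below classifies a code point as usable literally, usable only as an NCR, or forbidden.  The ranges follow the Char and RestrictedChar productions as they appear in the XML 1.1 Recommendation, on the assumption that the CR draft cited above does not differ; the function names are hypothetical.

```python
def is_char(cp):
    """XML 1.1 Char: everything except NULL, surrogates, #xFFFE, #xFFFF."""
    return (0x1 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)


def is_restricted(cp):
    """XML 1.1 RestrictedChar: C0, #x7F and C1 minus TAB, LF, CR, NEL."""
    return (0x1 <= cp <= 0x8
            or 0xB <= cp <= 0xC
            or 0xE <= cp <= 0x1F
            or 0x7F <= cp <= 0x84
            or 0x86 <= cp <= 0x9F)


def classify(cp):
    """Return 'forbidden', 'ncr-only' or 'literal' for a code point."""
    if not is_char(cp):
        return "forbidden"   # e.g. NULL: not allowed even as &#0;
    if is_restricted(cp):
        return "ncr-only"    # must be written as a character reference
    return "literal"         # may appear directly in the document


if __name__ == "__main__":
    for cp in (0x0, 0x1, 0x9, 0xA, 0xD, 0x7F, 0x85, 0x9F):
        print(f"#x{cp:X}: {classify(cp)}")
```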

Regards,

-- 
François
Received on Tuesday, 27 May 2003 14:15:43 UTC