
Re: Q&A: control codes

From: Tex Texin <tex@i18nguy.com>
Date: Mon, 26 May 2003 17:17:09 -0400
Message-ID: <3ED28455.2DD1674C@I18nGuy.com>
To: François Yergeau <francois@yergeau.com>
CC: GEO <public-i18n-geo@w3.org>, Bjoern Hoehrmann <derhoermi@gmx.net>

François,

thanks as always.
Mostly agreement here and some new understanding on my part. I numbered your
comments in the text and just reply here:

1) control code characters - I think of these as characters that represent
control codes, so I don't really see a problem.
I'll look at this after we deal with the more substantial comments. I do see
your point, though, about codes not being characters.

2) Originally, I really did mean to discuss just C0... Actually C0
without null, since I didn't think there was an issue with null, but you
undermined all of this... ;-(  I'll come back to the scope after we review the
rest of your points.

3) NULL - ok, what to say about it? I don't want to document browsers' random
behavior. It would be nice to say it's illegal and be done with it.

4) I do know that IE maps the C1 range into Windows-1252 values. It encourages
people to presume that those values represent legitimate characters rather than
forcing them to use correct NCRs. On the other hand, I am sure it is a
compatibility issue for many, many pages. If we discuss C1, I can add this as a
caution.

5) C0 in HTML - the spec says yes. I agree there were no glyphs when I tried
it, although &#2; was a different box than &#x3;-&#x8; for some reason.
However, the validator rejects these characters. I want to get Martin's
comment on why the DTD makes them unused.

6) C1 is allowed in XML. Yes, I agree. That is why I restricted myself to C0; I
didn't see an issue. Since it is supported, legacy apps don't have a problem
migrating to *ML and using it. However, I now understand it is restricted in
1.1, so it needs to be covered.

7) character entity references. Maybe this is a terminology problem on my
part.
The title of http://www.w3.org/TR/2002/CR-xml11-20021015/#sec4.1 is character
and entity references.
I presumed the former was NCR and the latter was CER. It is possible to give a
name to a character in XML.
Is that not called a CER? I guess HTML uses CER for the predefined ones. I
realize XML doesn't have these definitions predefined.
Shall I just call it an entity reference?

8) NCR doesn't work for null. ok, I'll document this.

9) AHA! ok. I see the production changed in
http://www.w3.org/TR/2002/CR-xml11-20021015/#sec2.2
so that 7f-9f except 85 is excluded. Seems to me an odd thing to do, although
I appreciate there is a lot of ambiguity around controls.
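
To make points 4, 8 and 9 concrete, here is a rough Python sketch. It assumes
the restricted ranges are the ones in the RestrictedChar production of XML 1.1
as later published (the 2002 Candidate Recommendation may differ in detail),
and the function names are made up for illustration. It only rewrites
restricted controls as NCRs; it does not escape & or <.

```python
# Assumption: ranges follow the RestrictedChar production of the XML 1.1
# Recommendation; the 2002 Candidate Recommendation may differ in detail.
RESTRICTED_RANGES = (
    (0x01, 0x08), (0x0B, 0x0C), (0x0E, 0x1F),  # C0 minus TAB, LF, CR
    (0x7F, 0x84), (0x86, 0x9F),                # DEL and C1 minus NEL (U+0085)
)


def is_restricted(cp: int) -> bool:
    """True if the code point may appear in XML 1.1 only as an NCR."""
    return any(lo <= cp <= hi for lo, hi in RESTRICTED_RANGES)


def escape_xml11_controls(text: str) -> str:
    """Hypothetical helper: rewrite restricted controls as hexadecimal NCRs."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp == 0:
            # Point 8: there is no NCR form for U+0000 either.
            raise ValueError("U+0000 NULL cannot appear in XML, even as an NCR")
        out.append(f"&#x{cp:X};" if is_restricted(cp) else ch)
    return "".join(out)


def c1_as_windows_1252(cp: int) -> str:
    """Hypothetical helper for point 4: reinterpret a C1 code point as a
    Windows-1252 byte. Raises UnicodeDecodeError for bytes that Python's
    cp1252 codec leaves undefined (0x81, for example)."""
    return bytes([cp]).decode("cp1252")
```

With this, escape_xml11_controls("a\x1bb") gives "a&#x1B;b", U+0085 passes
through untouched, and c1_as_windows_1252(0x80) returns the euro sign, which
matches what IE displays for &#x0080;.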


ok, thanks very much for your careful review! 

What I thought was going to be a cheap, easy, one-paragraph Q&A has become a
much larger effort... oh well. Too bad it is in such a marginal area, although
I know a couple of programmers who were having trouble with this...

tex


François Yergeau wrote:
> 
> Comments embedded below.
> 
> Tex Texin wrote:
> >   Questions & Answers: HTML, XHTML, XML and control code characters
> 

1
> "control code characters" looks wrong to me.  Should be "control codes"
> or "control characters".
> 
> >     Question...
> >
> > Are there any differences in the way HTML, XML, and XHTML support the
> > "control code" characters (U+0001-U+001F)?
> 

2
> This shouldn't be restricted to the range U+0001-U+001F.  As the title
> implies, it should also cover the C1 range.  The issues and solutions
> are pretty much the same for C0 and C1, although details differ.  See
> also my comment on Note 1 below.
> 
> >     * HTML supports the C0 range.


3
> 
> Really?  Hmmm, yes, the spec seems to say that.  Your mileage may vary
> though, especially with U+0000 NULL.  If I put &#x0000; in an HTML file,
> IE6 displays it as "&#x0000;", literally.  It doesn't interpret it as
> an NCR.  

5
> Other NCRs in the C0 and C1 ranges variously give a
> missing-glyph rectangle or nothing at all.

4
> &#x0080; yields a euro sign!
> 
> >     * XML 1.0 and XHTML do not allow these characters.
> 

6
> The C1 range is allowed in XML 1.0.
> 
> >     * XML 1.1 allows the characters to be represented by Numeric
> >       Character References (NCR) or Character Entity References.
> 

7
> There are no such things as Character Entity References in XML.  The
> controls are allowed *only* as NCRs.
> 
> >       Solutions
> > In XML 1.1, the simplest alternative is to represent any occurrence of a
> > C0 character with an NCR. For example, the character "ESCAPE" U+001B
> > would be represented by either &#x1B; (hexadecimal) or &#27; (decimal).
> 

8
> Doesn't work for U+0000 NULL.
> 
> > XML 1.1, according to Section 4.1 Character and Entity References
> > <http://www.w3.org/TR/2002/CR-xml11-20021015/#sec4.1>, extended XML to
> > allow the Unicode characters in the C0 range to be represented as
> > Numeric Character References or Character Entity References.

noted
 
> Not the latter, see above.
> 
> >       NOTES:
> >
> > 1 There is a similar set of control codes in the range U+007F-U+009F,
> > known as the C1 range. Since these are not excluded by any of the markup
> > languages, they are not discussed here.

9
> 
> XML 1.1 does exclude the C1 controls (except U+0085 NEL) as literal
> characters, just like the C0.  They are allowed as NCRs.
> 

-- 
-------------------------------------------------------------
Tex Texin   cell: +1 781 789 1898   mailto:Tex@XenCraft.com
Xen Master                          http://www.i18nGuy.com
                         
XenCraft		            http://www.XenCraft.com
Making e-Business Work Around the World
-------------------------------------------------------------
Received on Monday, 26 May 2003 17:18:56 UTC