Re: Q&A: control codes from François Yergeau on 2003-05-26 (public-i18n-geo@w3.org from May 2003)

From: François Yergeau <francois@yergeau.com>
Date: Mon, 26 May 2003 10:29:12 -0400
To: Tex Texin <tex@i18nguy.com>
Cc: GEO <public-i18n-geo@w3.org>, Bjoern Hoehrmann <derhoermi@gmx.net>
Message-id: <3ED224B8.1050905@yergeau.com>
Comments embedded below.

Tex Texin a écrit:
>   Questions & Answers: HTML, XHTML, XML and control code characters

"control code characters" looks wrong to me.  Should be "control codes" 
or "control characters".


>     Question...
> 
> Are there any differences in the way HTML, XML, and XHTML support the 
> "control code" characters (U+0001-U+001F)?

This shouldn't be restricted to the range U+0001-U+001F.  As the title 
implies, it should also cover the C1 range.  The issues and solutions 
are pretty much the same for C0 and C1, although details differ.  See 
also my comment on Note 1 below.

>     Answer...
> 
> Yes, there are differences. The control characters in the range 
> U+0001-U+001F are also known as the "C0" range.1, <#note1> 2 <#note2> 
> The differences between HTML, XML, and XHTML, in supporting the C0 range 
> can be important if you have existing data that includes characters in 
> the C0 range, and you want to represent those characters within one of 
> the markup languages.
> 
> Note, that when control characters are used for formatting text, for 
> example Form Feed, U+000C, it is better to replace the characters with 
> appropriate markup3 <#note3>. If the data is not really textual, but 
> binary, then it may be more practical to encode it, for example using 
> base64.
> 
> When C0 characters represent other kinds of text data, it can be 
> important to maintain the character values in context. The display of 
> most of the C0 characters by browsers is behavior that is unspecified. 
> Maintenance of C0 range characters in text is generally more important 
> for data interchange. Programmers working with legacy applications that 
> may have data in the C0 range should be aware of which markup languages 
> support the range.
> 
> A brief statement of the situation is:
> 
>     * HTML supports the C0 range.

Really?  Hmmm, yes, the spec seems to say that.  Your mileage may vary 
though, especially with U+0000 NULL.  If I put &#x0000; in an HTML file, 
IE6 displays it as "&#x0000;", litterally.  It doesn't interpret it as 
an NCR.  Other NCRs in the C0 and C1 ranges variously give a 
missing-glyph rectangle or nothing at all.  &#x0080; yields a euro sign!

>     * XML 1.0 and XHTML do not allow these characters.

The C1 range is allowed in XML 1.0.

>     * XML 1.1 allows the characters to be represented by Numeric
>       Character References (NCR) or Character Entity References.

There are no such things as Character Entity References in XML.  The 
controls are allowed *only* as NCRs.


>       Solutions
> 
> If you need to represent these characters in XML 1.0 or XHTML, you can 
> create a convention to represent them and replace every occurence with 
> that convention. An alternative is to encode the data. For example, 
> encode the data as base64 or as hexadecimal values, to ensure only 
> supported characters are used in the markup language text. (And of 
> course, decoding the text when reading the files.) Note that XML Schema 
> provides data types for these encodings.
> 
> Another alternative is to store the data in an external document and 
> reference it from the XML document.
> 
> In XML 1.1, the simplest alternative is to represent any occurence of a 
> C0 character with an NCR. For example, the character "ESCAPE" U+001B 
> would be represented by either &#x1B; (hexadecimal) or &#27; (decimal).

Doesn't work for U+0000 NULL.

> 
> 
>       Additional Details
> 
> The HTML 4 specification, Section 5.1 The Document Character Set 
> <http://www.w3.org/TR/REC-html40/charset.html#h-5.1>, simply declares 
> that HTML supports ISO 10646 and its first 5 amendments 
> <http://www.w3.org/TR/REC-html40/references.html#ref-ISO10646>. ISO 
> 10646 is equivalent to the Unicode Character Set. The implication is 
> that control code characters in the range U+0001-U+001F are supported by 
> HTML.
> 
> XML 1.0 on the other hand declares in Section 2.2 Charsets 
> <http://www.w3.org/TR/2000/REC-xml-20001006#charsets> that supported 
> characters include: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | 
> [#x10000-#x10FFFF]. Notably, except for tab (U+0009), line feed 
> (U+000A), and carriage return (U+000D), characters in the range 
> U+0001-U+001F are NOT supported.
> 
> XHTML, being the intersection of HTML 4 and XML 1.0, has the same 
> limitations as XML 1.0. Although not stated explicitly, it is referred 
> to indirectly in C.15. White Space Characters in HTML vs. XML 
> <http://www.w3.org/TR/xhtml1/#C_15>.
> 
> XML 1.1, according to Section 4.1 Character and Entity References 
> <http://www.w3.org/TR/2002/CR-xml11-20021015/#sec4.1>, extended XML to 
> allow the Unicode characters in the C0 range to be represented as 
> Numeric Character References or Character Entity References.

Not the latter, see above.


>       NOTES:
> 
> 1 There is a similar set of control codes in the range U+007F-U+009F, 
> known as the C1 range. Since these are not excluded by any of the markup 
> languages, they are not discussed here.

XML 1.1 does exclude the C1 controls (except U+0085 NEL) as litteral 
characters, just like the C0.  They are allowed as NCRs.


> 
> 2 More details on the C0 range are available in the Unicode Code Chart: 
> C0 Controls and Basic Latin <http://www.unicode.org/charts/PDF/U0000.pdf>.
> 
> 3 The document Unicode in XML and other Markup Languages 
> <http://www.w3.org/TR/unicode-xml/> contains guidelines on the use of 
> the Unicode Standard in conjunction with markup languages such as XML.

-- 
François
Received on Monday, 26 May 2003 13:24:22 UTC