
Questions & Answers: HTML, XHTML, XML and control code characters

Question...

Are there any differences in the way HTML, XML, and XHTML support the "control code" characters (U+0001-U+001F)?

Answer...

Yes, there are differences. The control characters in the range U+0001-U+001F are also known as the "C0" range (see Notes 1 and 2). The differences in how HTML, XML, and XHTML support the C0 range matter if you have existing data that includes C0 characters and you want to represent those characters in one of these markup languages.

Note that when control characters are used to format text, for example Form Feed (U+000C), it is better to replace them with appropriate markup (see Note 3). If the data is binary rather than textual, it may be more practical to encode it, for example using base64.

When C0 characters represent other kinds of text data, it can be important to preserve the character values in context. How browsers display most C0 characters is unspecified, so preserving C0 characters in text generally matters more for data interchange. Programmers working with legacy applications whose data may include C0 characters should know which markup languages support the range.

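A brief statement of the situation is:

HTML 4: C0 characters are supported, by implication.
XML 1.0 and XHTML: of the C0 characters, only tab (U+0009), line feed (U+000A), and carriage return (U+000D) are supported.
XML 1.1: C0 characters are supported when written as numeric character references.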

Solutions

If you need to represent these characters in XML 1.0 or XHTML, you can define a convention to represent them and replace every occurrence with that convention. An alternative is to encode the data, for example as base64 or as hexadecimal values, so that only supported characters appear in the markup text, and to decode it again when reading the files. Note that XML Schema provides data types for these encodings (base64Binary and hexBinary).
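The base64 approach can be sketched in Python as follows. This is only an illustration under stated assumptions: the element name `data` and the attribute `encoding="base64"` are hypothetical conventions invented for the example, not part of any specification, and the text is assumed to be encoded as UTF-8 before base64 is applied.

```python
import base64
import xml.etree.ElementTree as ET


def wrap_as_base64(text: str) -> bytes:
    """Serialize text as an XML 1.0 element holding base64 of its UTF-8 bytes."""
    payload = base64.b64encode(text.encode("utf-8")).decode("ascii")
    # "data" and "encoding" are hypothetical names chosen for this sketch.
    elem = ET.Element("data", {"encoding": "base64"})
    elem.text = payload
    return ET.tostring(elem, encoding="utf-8")


def unwrap_base64(xml_bytes: bytes) -> str:
    """Parse the element back and decode the original text."""
    elem = ET.fromstring(xml_bytes)
    return base64.b64decode(elem.text).decode("utf-8")


# Legacy text containing ESCAPE (U+001B) and START OF HEADING (U+0001).
original = "abc\x1b[0mdef\x01"
xml_doc = wrap_as_base64(original)
restored = unwrap_base64(xml_doc)
```

Because the base64 alphabet contains only letters, digits, "+", "/", and "=", the serialized document contains no C0 characters, and the reader recovers the original values by decoding.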

Another alternative is to store the data in an external document and reference it from the XML document.

In XML 1.1, the simplest alternative is to represent each occurrence of a C0 character with a numeric character reference (NCR). For example, the character "ESCAPE" (U+001B) would be represented by either &#x1B; (hexadecimal) or &#27; (decimal).
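A minimal Python sketch of this replacement is shown below. The function name `escape_c0_as_ncr` is hypothetical. The sketch assumes the result will be written into a document whose XML declaration says version="1.1", and it leaves tab, line feed, and carriage return untouched because those may appear literally. U+0000 is rejected, since it lies outside the U+0001-U+001F range discussed here.

```python
from xml.sax.saxutils import escape


def escape_c0_as_ncr(text: str) -> str:
    """Escape markup characters, then write C0 characters as hexadecimal NCRs."""
    if "\x00" in text:
        raise ValueError("U+0000 cannot be represented")
    out = []
    # escape() handles "&", "<", and ">" first, so the "&" of each NCR
    # added below is not escaped a second time.
    for ch in escape(text):
        cp = ord(ch)
        if 0x01 <= cp <= 0x1F and cp not in (0x09, 0x0A, 0x0D):
            out.append(f"&#x{cp:X};")
        else:
            out.append(ch)
    return "".join(out)


print(escape_c0_as_ncr("\x1b[0m"))  # prints "&#x1B;[0m"
```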

Additional Details

The HTML 4 specification, Section 5.1 The Document Character Set, simply declares that HTML supports ISO 10646 and its first 5 amendments. ISO 10646 is equivalent to the Unicode Character Set. The implication is that control code characters in the range U+0001-U+001F are supported by HTML.

XML 1.0, on the other hand, declares in Section 2.2 Characters that supported characters include: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]. Notably, except for tab (U+0009), line feed (U+000A), and carriage return (U+000D), characters in the range U+0001-U+001F are NOT supported.
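The character ranges above translate directly into a check that legacy data can be run through before it is placed in an XML 1.0 or XHTML document. The following Python sketch is one way to do that; the function names `is_xml10_char` and `find_invalid_xml10_chars` are hypothetical.

```python
def is_xml10_char(ch: str) -> bool:
    """Return True if ch falls in one of the XML 1.0 character ranges."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def find_invalid_xml10_chars(text: str) -> list[tuple[int, str]]:
    """List (index, code point) pairs for characters XML 1.0 does not allow."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if not is_xml10_char(ch)
    ]


# Form Feed (U+000C) and ESCAPE (U+001B) are reported; tab is not.
print(find_invalid_xml10_chars("ok\x0cpage\t\x1b"))
# prints [(2, 'U+000C'), (8, 'U+001B')]
```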

XHTML, being the intersection of HTML 4 and XML 1.0, has the same limitations as XML 1.0. The XHTML specification does not state this explicitly, but refers to it indirectly in Section C.15, White Space Characters in HTML vs. XML.

XML 1.1, according to Section 4.1 Character and Entity References, extended XML to allow the Unicode characters in the C0 range to be represented as Numeric Character References or Character Entity References.

NOTES:

1 There is a similar set of control codes in the range U+007F-U+009F, known as the C1 range. Since these are not excluded by any of the markup languages, they are not discussed here.

2 More details on the C0 range are available in the Unicode Code Chart: C0 Controls and Basic Latin.

3 The document Unicode in XML and other Markup Languages contains guidelines on the use of the Unicode Standard in conjunction with markup languages such as XML.


Authored by Tex Texin

Version: $Id: qa-forms-utf-8.html,v 1.15 2003/05/12 11:12:20 duerst Exp $