Illegal Unicode characters in XML

A change to production [2] that would have made C0 control characters
legal in XML was removed from the latest XML 1.1 draft published on
April 25th, 2002.

C0 control characters may be unimportant, or not used at all, when
creating documents from scratch. However, since these characters
are legal in Unicode, there are many documents and snippets of text
in the world that contain them. When XML documents are created as
data containers for such text, proprietary markup must be invented
in each XML vocabulary, or the data must be encoded in Base64 or a
similar scheme. It would be very convenient if there were a standard
way to represent these characters in XML documents.

I propose that XML 1.1 include a recommendation for the markup of
characters that are legal in Unicode but not in XML. The markup
should look like this:

<xml:orphanedChar value="#x000c" />

The markup represents a single character. In valid documents, the
element must be declared. Use of the markup for characters that are
legal in XML should be discouraged.
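
As a rough illustration of how a producer might apply this markup, here
is a minimal Python sketch. The function names are hypothetical, and
the four-digit lowercase hex form of the value attribute is an
assumption based on the single example above. The legality test
follows the Char production [2] of XML 1.0.

```python
from xml.sax.saxutils import escape


def is_legal_xml_char(ch: str) -> bool:
    """Return True if ch matches the XML 1.0 Char production [2]."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def encode_orphaned_chars(text: str) -> str:
    """Serialize text as XML content, wrapping illegal characters.

    Legal characters are emitted as ordinary character data (with
    &, < and > escaped). Every other character becomes one
    xml:orphanedChar element, as in the proposal (assumed format).
    """
    parts = []
    for ch in text:
        if is_legal_xml_char(ch):
            parts.append(escape(ch))
        else:
            # Assumption: value is "#x" plus at least four hex digits.
            parts.append('<xml:orphanedChar value="#x%04x" />' % ord(ch))
    return "".join(parts)
```

Only characters that fail the legality test are wrapped, which matches
the suggestion that the markup be discouraged for legal XML characters.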

The xml: prefix is not a requirement, but it would be nice if the
element got one.

-------------------
Shigemichi Yazawa
yazawa@globalsight.com

Received on Tuesday, 7 May 2002 17:09:47 UTC