A Rational Approach to XML Robustness Requirements
Rick Jelliffe
ricko@topologi.com
24/02/2002
XML 1.0's success and widespread adoption are due in large part to the fact
that it solves a problem no other technology has solved: it makes the character
set problem go away.
XML 1.0 solves the character set problem by adopting three measures:
- It adopts Unicode as its document character set.
- It allows any encoding, and requires the encoding be labelled (or defaulted).
- It provides error-detection mechanisms for mislabelled encodings,
as part of well-formedness.
The last of these measures is not well-recognized. Yet it provides a
check which gives XML 1.0 a fundamental robustness. The measures
are integrated: use a character set other than Unicode, and coverage
is compromised; allow guessing of encodings, and fragility increases; provide
no error-checking, and fragility also increases. The XML 1.1 discussions
have brought this issue more to light.
The error-detection mechanisms in XML are three:
- The behaviour on encountering an infeasible code point, or an infeasible
transition, in the incoming bytes is specified.
- The Unicode characters allowed in a document are restricted.
- The Unicode characters allowed in XML names are restricted.
The particular policies in place for XML 1.0 for using these mechanisms
are:
- A transcoding error is fatal. However, in many important cases (for
example, between the ISO 8859-n family) this will catch no errors. Furthermore,
many, and perhaps most, existing transcoder libraries silently strip out infeasible
code sequences. Moreover, the requirement to fail on encoding errors
was only clarified as part of XML 1.0 Second Edition, so even where transcoding
libraries could flag an error, deployed implementations could have chosen
to continue processing.
- Only a small selection of document characters are restricted: in particular
the C0 range.
- A large selection of characters are restricted from use as names.
In this note, I will not deal with other important considerations relevant
to characters in XML: that there are accessibility considerations which legitimize
banning symbol characters that have no pronunciation (i.e. in any particular
locale); that the character U+0000 will cause problems in zero-terminated
strings; that MIME requirements for "textual" content mean that control
characters are inappropriate for use in text/* documents; etc.
Instead, I want to provide additional information to help gauge the effectiveness
of XML's current error-detection mechanisms, and to see if this information
allows us to come up with simpler rules which give nearly as useful coverage.
Probability
Before starting, it is useful to consider that statistical methods are the
basis of much quality assurance and quality control. Even in data communications,
probability plays a role: for example, the Cyclic Redundancy Check on Internet
protocols and other checksums are not 100% reliable.
However, they do not need to be: if the probability of detecting
an error in one sample is e/t, the probability that n samples will all fail
to detect the error is (t-e)^n/t^n.
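To make the arithmetic concrete, here is a minimal sketch in Python (the
function name and the example figures are mine, not from any specification)
of this escape-probability calculation:

    def escape_probability(e, t, n):
        # Probability that n independent samples all fail to reveal an error
        # which a single sample detects with probability e/t.
        return ((t - e) / t) ** n

    # Even a modest 1-in-5 per-sample detection rate compounds quickly:
    print(escape_probability(1, 5, 80))   # approx 1.8e-08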
There are two rational approaches to error-detection policy in XML:
- Restrict characters to the smallest possible number, based on Unicode
properties. This gives the maximum possible number of redundant characters.
For names, this is the current policy in XML 1.0.
- Enumerate and analyze the most common possible classes of transcoding
errors, and determine whether the natures of the codes themselves allow effective
rules to be formulated to detect errors. That is the approach in this note.
Simply removing the checks for encoding errors, in the absence of any other
layer or method to perform error detection, is bad engineering.
Encoding Errors
The causes of encoding errors include:
- human error or ignorance: a user may not know that the encoding they
are using is not correct;
- webserver error: web servers will send data as 8859-1 or ASCII by default;
if the server is set up with a different default, particular files may
still be sent out with an incorrect encoding;
- programmer error: most programming language IO methods output data
using the locale's encoding by default;
- proxy error: a transcoding proxy recodes the document without changing
the XML header; if the document is saved as-is, it will be in error even
if the proxy sent the correct MIME header information.
Let us take these as a working set of classes of errors we should consider.
- UTF-16 mislabelled as UTF-8, and vice versa.
- Windows code pages mislabelled as ISO 8859-n, and vice versa.
- Mislabelling as 8859-1, from webserver defaults.
UTF-8 labelled as UTF-16
There is no problem here. Read as UTF-16, each pair of incoming bytes
coalesces into a single 16-bit code unit, so no delimiters will be detected,
and the document cannot be WF.
UTF-16 labelled as UTF-8
There is no problem here. U+0000 is not allowed in the document character
set, and since the high byte of every ASCII code unit in UTF-16 arrives as
a 0x00 byte, an error will be detected in every case of an incoming XML
entity that contains any markup.
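A small sketch (the byte strings are my own example) of why this check is
so reliable: the 0x00 byte is itself valid UTF-8, so the decoder raises no
error, but the forbidden U+0000 appears in the result:

    data = "<doc/>".encode("utf-16-le")   # genuinely UTF-16, about to be mislabelled
    text = data.decode("utf-8")           # the decoder itself raises no error...
    print("\u0000" in text)               # ...but U+0000 appears: fatal in XML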
UTF-8 mislabelled as ISO 8859-n
UTF-8 uses the range 0x80 to 0x9F. The probability that a random two-byte
character has a byte in this range is 1:2. The probability that a random
three-byte character has such a code is 1:3. The probability that a
character above U+FFFF has such a code is 1:4.
So restricting the document character set to disallow C1 will be effective
at catching UTF-8 mislabelling, except for documents with very few (in repertoire,
not in frequency) non-ASCII characters. It would be interesting to
check whether the Euro is caught.
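A sketch of the check (assuming the C1 restriction proposed here), which
also answers the Euro question: the UTF-8 form of U+20AC is 0xE2 0x82 0xAC,
and 0x82 falls in the C1 range when read as ISO 8859-1, so the Euro is caught:

    def has_c1(text):
        # True if any character falls in the C1 range U+0080..U+009F.
        return any("\u0080" <= ch <= "\u009f" for ch in text)

    utf8_bytes = "price \u20ac9".encode("utf-8")  # contains the Euro, U+20AC
    as_latin1 = utf8_bytes.decode("iso-8859-1")   # parser trusts a wrong label
    print(has_c1(as_latin1))                      # True: 0x82 decodes into C1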
ISO 8859-m mislabelled as ISO 8859-n
The ISO 8859-n encodings are mutually feasible: no errors will be detected
by either a transcoder or by checking the document character set for unallocated
or deprecated characters. The only method available to detect encoding errors
is by restricting the name rules.
An important case here is when a document is deemed incorrectly to be ISO
8859-1. ISO 8859-1 has the useful property that it does not have XML 1.0
name characters in the A0 to BF range.
Character encodings which do have name characters (ref http://www.kostis.net/charsets/)
in that range include:
- ISO 8859-2
- ISO 8859-3
- ISO 8859-4
- ISO 8859-5
- ISO 8859-7
In the cases of ISO 8859-2, -3 and -4, the characters that would be detected
will be, to a great extent, language-dependent. In the case of Greek, it
comes down to whether diacritical or tone marks are used: if native-language
markup is used with tonos marks, then restricting the range U+00A0 to U+00BF
will reliably detect encoding errors, given the high incidence of the use
of tonos in Greek words.
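A sketch of the name-rule check (the example word and the function are
mine): in ISO 8859-7 the capital tonos vowels occupy 0xB6 to 0xBF, so a
Greek name containing one decodes, under a wrong ISO 8859-1 label, to
punctuation that the restricted name rules reject:

    RESTRICTED = set(range(0x00A0, 0x00C0))

    def name_ok(name):
        # Reject names containing the restricted Latin-1 non-name range.
        return all(ord(ch) not in RESTRICTED for ch in name)

    greek = "Όνομα".encode("iso-8859-7")   # "Onoma"; the tonos capital is 0xBC
    as_latin1 = greek.decode("iso-8859-1") # mislabelled as ISO 8859-1
    print(name_ok(as_latin1))              # False: error detected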
Each of the ISO 8859-n character sets has holes with non-name characters.
These provide additional potential error-detection points.
For ISO 8859-1, the characters 0xD7 and 0xF7 are examples. For the
non-Latin ISO 8859 character sets (Greek, Cyrillic, Arabic, Hebrew), one or
both of these two codes are used for common name characters. Restricting
these characters should be effective in catching errors in those non-Latin
scripts. (Russian KOI8 may be in this class too.)
Most of the Latin character sets share these same characters, so again,
mislabelling among the Latin sets will not be detected by this check.
Windows code pages labelled as ISO 8859-n
The Windows code pages allow characters in the 0x80 to 0x9F range. When
labelled as ISO 8859-1, these occupy the currently ambiguous C1 range in Unicode:
this range is for privately defined control characters, unless the higher-level
protocol specifies a particular control character set. One C1 control
character that has a special significance is NEL.
The introduction of the Euro as 0x80 in CP1252 means that the previously
harmless practice where documents created by "ANSI" tools could (if they
only used the Latin 1 characters) be labelled ISO 8859-1 is no longer appropriate.
All mislabelling of ANSI as ISO 8859-1 would be caught by disallowing
the C1 controls from the document character set.
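A sketch of the "ANSI" case (the sample text is mine): CP1252 places the
Euro at 0x80, so under a wrong ISO 8859-1 label it decodes straight into
the C1 range:

    cp1252 = "€50".encode("cp1252")          # the Euro sign becomes byte 0x80
    as_latin1 = cp1252.decode("iso-8859-1")  # mislabelled as ISO 8859-1
    print(any("\u0080" <= c <= "\u009f" for c in as_latin1))  # True: caught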
Big 5 mislabelled as ISO 8859-1
Big 5 is the character set used in Taiwan and Hong Kong, and increasingly
in Mainland China due to trade.
Assume character frequency is random in the Big 5 file. A Big 5 character
has a 1/5 probability of containing a code point in its first byte in the
A0 to BF range (this is actually slightly less, because A0 to A4 are not
used for Han characters; however, the second byte may also contain these
characters, so we will let them cancel each other out). Assume a document
using native-language markup has 20 elements and 20 attributes, each with
a two-character name: without duplicates, that gives 80 characters.
For all possible DTDs with these qualities, the chances that restricting
the range of name characters by disallowing U+00A0 to U+00BF will not detect
an encoding problem are therefore 4^80/5^80 =
approx 1.8e-8: very low.
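A one-line check of this figure, under the same independence assumption
(80 name characters, each escaping detection with probability 4/5):

    print((4 / 5) ** 80)   # approx 1.77e-08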
In the case of the Big 5 superset Big5 Plus, detecting C1 characters will
also detect the encoding errors. However, because the characters involved
are rarer ones in Big5 Plus, I doubt that this has much effect in practice.
Shift JIS mislabelled as ISO 8859-1
Shift JIS is the character set used for external text on Japanese PCs.
It uses the code points 0x81 to 0x9F. If the C1 range in Unicode
is disallowed as document characters, encoding errors should be detected
even for small documents. The probability of not detecting
a problem is around 5^n/24^n, where n is the number of Japanese characters
used in names in some typical document type. This check is very reliable.
Other Encodings
Looking through the code tables in Lunde's CJKV Information Processing and
the website http://czyborra.com/charsets/, it seems that, for many other
character sets, mislabelling as ISO 8859-1 can also be detected reliably.
Discussion
Restricted document and naming rules provide an effective method of catching
encoding errors in significant cases.
Restrictions to the document character set are better than restrictions to
the naming rules, because a document may not be using native language markup.
Examining various encodings reveals some critical ranges. Dealing with these
ranges appropriately would maintain XML's current robustness while allowing
well-formedness to be decoupled from specific versions of Unicode (See below
for recommendations to achieve this). Indeed, the robustness of XML
would probably be increased, while the implementation complexity significantly
decreased.
Errors detected by these methods should come under the category "encoding
errors" if detected immediately after transcoding, and "bad document character"
and "bad name character" otherwise.
Finally, I note that errors relating to encoding belong to the well-formedness
of a document, while errors relating to which characters are allowed in a
well-formed document relate to validity. Therefore detailed prescription
or proscription of name characters as a matter of policy should be moved to
being a validity issue, not a WF issue, though sometimes encoding errors may
be the cause of an invalid name. An important consideration for validation is
that only the element and attribute names in a schema need to be validated
for name-rule consistency, and not every element in an instance (in the absence
of ANY content types): so it is possible for instances to be parsed fast
while still getting the benefit of strict name rules--errors are detected
when no schema rule can be found for an element or attribute.
Recommendations
- The character U+0000 NUL should not be an allowed character in XML
documents.
- The C1 characters U+0080 to U+009F should not be allowed characters
in XML documents, with the exception of NEL if need be.
- The Latin-1 non-name characters U+00A0 to U+00BF, and U+00D7 and U+00F7,
should not be allowed in XML Names (a combined sketch of these first three
checks follows this list).
- Other restrictions to characters may be useful for particular circumstances,
but these will tend to be specific to the encoding confusion involved.
- Users of the ISO 8859 character sets for Latin, other than ISO 8859-1,
should be warned to pay particular attention to encoding issues, as the chances
that an encoding error will be detected will depend on the language they
use and may even depend on whether they use rarer characters.
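As a closing illustration, here is a minimal sketch (the names and structure
are mine, not a proposed API) combining the first three recommendations into
a single character-level filter:

    # NUL and the C1 controls are barred from the document entirely;
    # NEL (U+0085) could be spared here if need be.
    DOC_FORBIDDEN = {0x0000} | set(range(0x0080, 0x00A0))

    # The Latin-1 non-name characters are barred from names only.
    NAME_FORBIDDEN = set(range(0x00A0, 0x00C0)) | {0x00D7, 0x00F7}

    def check_chars(text, is_name=False):
        for ch in text:
            cp = ord(ch)
            if cp in DOC_FORBIDDEN:
                return False   # likely an encoding error: fatal
            if is_name and cp in NAME_FORBIDDEN:
                return False   # bad name character
        return True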