A Rational Approach to XML Robustness Requirements
Rick Jelliffe
ricko@topologi.com
24/02/2002
XML 1.0's success and widespread adoption are due in large part to the fact
that it solves a problem no other technology has solved: it makes the character
set problem go away.
XML 1.0 solves the character set problem by adopting three measures:
- It adopts Unicode as its document character set.
- It allows any encoding, and requires the encoding be labelled (or defaulted).
- It provides error-detection mechanisms for mislabelled encodings,
as part of well-formedness.
The last of these measures is not well-recognized. Yet it provides a
check which gives XML 1.0 a fundamental robustness. The measures
are integrated: use a character set other than Unicode, and coverage
is compromised; allow guessing of encodings, and fragility increases; provide
no error-checking, and fragility also increases. The XML 1.1 discussions
have brought this issue more to light.
The error-detection mechanisms in XML are three:
- The behaviour on encountering an infeasible code point, or an infeasible
transition, in the incoming bytes is specified.
- The Unicode characters allowed in a document are restricted.
- The Unicode characters allowed in XML names are restricted.
The particular policies in place for XML 1.0 for using these mechanisms
are:
- A transcoding error is fatal. However, in many important cases (for
example, between the ISO 8859-n family) this will catch no errors. Furthermore,
many, and perhaps most, existing transcoder libraries silently strip out infeasible
code sequences. Moreover, the requirement to fail on encoding errors
was only clarified as part of XML 1.0 Second Edition, so even where transcoding
libraries could flag an error, deployed implementations could have chosen
to continue processing.
- Only a small selection of document characters are restricted: in particular
the C0 range.
- A large selection of characters are restricted from use as names.
In this note, I will not deal with other important considerations relevant
to characters in XML: that there are accessibility considerations which legitimize
banning symbol characters that have no pronunciation (i.e. in any particular
locale); that the character U+0000 will cause problems in zero-terminated
strings; that MIME requirements for "textual" content mean that control
characters are inappropriate for use in text/* documents; etc.
Instead, I want to provide additional information to help gauge the effectiveness
of XML's current error-detection mechanisms, and to see if this information
allows us to come up with simpler rules which give nearly as useful coverage.
Probability
Before starting, it is useful to consider that statistical methods are the
basis of much quality assurance and quality control. Even in data communications,
probability plays a role: for example, the Cyclic Redundancy Check on Internet
protocols and other checksums are not 100% reliable.
However, they do not need to be: if the probability of detecting
an error in one sample is e/t, the probability that n samples will all fail
to detect the error is (t-e)^n/t^n.
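To make the arithmetic concrete, here is a minimal sketch in Python (the
function name and the example figures are mine, not from any specification)
of this escape-probability calculation:

    def escape_probability(e, t, n):
        # Probability that n independent samples all fail to reveal an error
        # which a single sample detects with probability e/t.
        return ((t - e) / t) ** n

    # Even a modest 1-in-5 per-sample detection rate compounds quickly:
    print(escape_probability(1, 5, 80))   # approx 1.8e-08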
There are two rational approaches to error-detection policy in XML:
- Restrict characters to the smallest possible number, based on Unicode
properties. This gives the maximum possible number of redundant characters.
For names, this is the current policy in XML 1.0.
- Enumerate and analyze the most common possible classes of transcoding
errors, and determine whether the natures of the codes themselves allow effective
rules to be formulated to detect errors. That is the approach in this note.
Simply removing the checks for encoding errors, in the absence of any other
layer or method to perform error detection, is bad engineering.
Encoding Errors
The causes of encoding errors include:
- human error or ignorance: a user may not know that the encoding they
are using is not correct;
- webserver error: web servers will send data as 8859-1 or ASCII by default;
if the server is set up with a different default, particular files may
still be sent out with an incorrect encoding;
- programmer error: most programming language IO methods output data
using the locale's encoding by default;
- proxy error: a transcoding proxy recodes the document without changing
the XML header; if the document is saved as-is, it will be in error even
if the proxy sent the correct MIME header information.
Let us take these as a working set of classes of errors we should consider.
- UTF-16 mislabelled as UTF-8, and vice versa.
- Windows code pages mislabelled as ISO 8859-n, and vice versa.
- Mislabelling as 8859-1, from webserver defaults.
UTF-8 labelled as UTF-16
There is no problem here. Read as UTF-16, each pair of incoming bytes
coalesces into a single 16-bit code unit, so no delimiters will be detected,
and the document cannot be WF.
UTF-16 labelled as UTF-8
There is no problem here. U+0000 is not allowed in the document character
set, and since the high byte of every ASCII code unit in UTF-16 arrives as
a 0x00 byte, an error will be detected in every case of an incoming XML
entity that contains any markup.
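A small sketch (the byte strings are my own example) of why this check is
so reliable: the 0x00 byte is itself valid UTF-8, so the decoder raises no
error, but the forbidden U+0000 appears in the result:

    data = "<doc/>".encode("utf-16-le")   # genuinely UTF-16, about to be mislabelled
    text = data.decode("utf-8")           # the decoder itself raises no error...
    print("\u0000" in text)               # ...but U+0000 appears: fatal in XML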
UTF-8 mislabelled as ISO 8859-n
UTF-8 uses the range 0x80 to 0x9F. The probability that a random two-byte
character has a byte in this range is 1:2. The probability that a random
three-byte character has such a code is 1:3. The probability that a
character above U+FFFF has such a code is 1:4.
So restricting the document character set to disallow C1 will be effective
at catching UTF-8 mislabelling, except for documents with very few (in repertoire,
not in frequency) non-ASCII characters. It would be interesting to
check whether the Euro is caught.
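A sketch of the check (assuming the C1 restriction proposed here), which
also answers the Euro question: the UTF-8 form of U+20AC is 0xE2 0x82 0xAC,
and 0x82 falls in the C1 range when read as ISO 8859-1, so the Euro is caught:

    def has_c1(text):
        # True if any character falls in the C1 range U+0080..U+009F.
        return any("\u0080" <= ch <= "\u009f" for ch in text)

    utf8_bytes = "price \u20ac9".encode("utf-8")  # contains the Euro, U+20AC
    as_latin1 = utf8_bytes.decode("iso-8859-1")   # parser trusts a wrong label
    print(has_c1(as_latin1))                      # True: 0x82 decodes into C1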
ISO 8859-m mislabelled as ISO 8859-n
The ISO 8859-n encodings are mutually feasible: no errors will be detected
by either a transcoder or by checking the document character set for unallocated
or deprecated characters. The only method available to detect encoding errors
is by restricting the name rules.
An important case here is when a document is deemed incorrectly to be ISO
8859-1. ISO 8859-1 has the useful property that it does not have XML 1.0
name characters in the A0 to BF range.
Character encodings which do have name characters (ref http://www.kostis.net/charsets/)
in that range include:
- ISO 8859-2
- ISO 8859-3
- ISO 8859-4
- ISO 8859-5
- ISO 8859-7
In the cases of ISO 8859-2, -3 and -4, the characters that would be detected
will be, to a great extent, language-dependent. In the case of Greek, it
comes down to whether diacritical or tone marks are used: if native-language
markup is used with tonos marks, then restricting the range U+00A0 to U+00BF
will reliably detect encoding errors, given the high incidence of the use
of tonos in Greek words.
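A sketch of the name-rule check (the example word and the function are
mine): in ISO 8859-7 the capital tonos vowels occupy 0xB6 to 0xBF, so a
Greek name containing one decodes, under a wrong ISO 8859-1 label, to
punctuation that the restricted name rules reject:

    RESTRICTED = set(range(0x00A0, 0x00C0))

    def name_ok(name):
        # Reject names containing the restricted Latin-1 non-name range.
        return all(ord(ch) not in RESTRICTED for ch in name)

    greek = "Όνομα".encode("iso-8859-7")   # "Onoma"; the tonos capital is 0xBC
    as_latin1 = greek.decode("iso-8859-1") # mislabelled as ISO 8859-1
    print(name_ok(as_latin1))              # False: error detected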
Each of the ISO 8859-n character sets has holes with non-name characters.
These provide additional potential error-detection points.
For ISO 8859-1, the characters 0xD7 and 0xF7 are examples. For the
non-Latin ISO 8859 character sets (Greek, Cyrillic, Arabic, Hebrew), one or
both of these two codes are used for common name characters. Restricting
these characters should be effective in catching errors in those non-Latin
scripts. (Russian KOI8 may be in this class too.)
Most of the Latin character sets share these same characters, so again,
mislabelling among the Latin sets will not be detected by this check.
Windows code pages labelled as ISO 8859-n
The Windows code pages allow characters in the 0x80 to 0x9F range. When
labelled as ISO 8859-1, these occupy the currently ambiguous C1 range in Unicode:
this range is for privately defined control characters, unless the higher-level
protocol specifies a particular control character set. One C1 control
character that has a special significance is NEL.
The introduction of the Euro as 0x80 in CP1252 means that the previously
harmless practice where documents created by "ANSI" tools could (if they
only used the Latin 1 characters) be labelled ISO 8859-1 is no longer appropriate.
All mislabelling of ANSI as ISO 8859-1 would be caught by disallowing
the C1 controls from the document character set.
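A sketch of the "ANSI" case (the sample text is mine): CP1252 places the
Euro at 0x80, so under a wrong ISO 8859-1 label it decodes straight into
the C1 range:

    cp1252 = "€50".encode("cp1252")          # the Euro sign becomes byte 0x80
    as_latin1 = cp1252.decode("iso-8859-1")  # mislabelled as ISO 8859-1
    print(any("\u0080" <= c <= "\u009f" for c in as_latin1))  # True: caught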
Big 5 mislabelled as ISO 8859-1
Big 5 is the character set used in Taiwan and Hong Kong, and increasingly
in Mainland China due to trade.
Assume character frequency is random in the Big 5 file. A Big 5 character
has a 1/5 probability of containing a code point in its first byte in the
A0 to BF range (this is actually slightly less, because A0 to A4 are not
used for Han characters; however, the second byte may also contain these
characters, so we will let them cancel each other out). Assume a document
using native-language markup has 20 elements and 20 attributes, each with
a two-character name: without duplicates, that gives 80 characters.
For all possible DTDs with these qualities, the chances that restricting
the range of name characters by disallowing U+00A0 to U+00BF will not detect
an encoding problem are therefore 4^80/5^80 =
approx 1.8e-8: very low.
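A one-line check of this figure, under the same independence assumption
(80 name characters, each escaping detection with probability 4/5):

    print((4 / 5) ** 80)   # approx 1.77e-08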
In the case of the Big 5 superset Big5 Plus, detecting C1 characters will
also detect the encoding errors. However, because the characters involved
are rarer ones in Big5 Plus, I doubt that this has much effect in practice.
Shift JIS mislabelled as ISO 8859-1
Shift JIS is the character set used for external text on Japanese PCs.
It uses the code points 0x81 to 0x9F. If the C1 range in Unicode
is disallowed as document characters, encoding errors should be detected
even for small documents. The probability of not detecting
a problem is around 5^n/24^n, where n is the number of Japanese characters
used in names in some typical document type. This check is very reliable.
Other Encodings
Looking through the code tables in Lunde's CJKV Information Processing and
the website http://czyborra.com/charsets/, it seems that, for many other
character sets, mislabelling as ISO 8859-1 can also be detected reliably.
Discussion
Restricted document and naming rules provide an effective method of catching
encoding errors in significant cases.
Restrictions to the document character set are better than restrictions to
the naming rules, because a document may not be using native language markup.
Examining various encodings reveals some critical ranges. Dealing with these
ranges appropriately would maintain XML's current robustness while allowing
well-formedness to be decoupled from specific versions of Unicode (See below
for recommendations to achieve this). Indeed, the robustness of XML
would probably be increased, while the implementation complexity significantly
decreased.
Errors detected by these methods should come under the category "encoding
errors" if detected immediately after transcoding, and "bad document character"
and "bad name character" otherwise.
Finally, I note that errors relating to encoding belong to the well-formedness
of a document, while errors relating to which characters are allowed in a
well-formed document relate to validity. Therefore detailed prescription
or proscription of name characters as a matter of policy should be moved to
being a validity issue, not a WF issue, though sometimes encoding errors may
be the cause of an invalid name. An important consideration for validation is
that only the element and attribute names in a schema need to be validated
for name-rule consistency, and not every element in an instance (in the absence
of ANY content types): so it is possible for instances to be parsed fast
while still getting the benefit of strict name rules--errors are detected
when no schema rule can be found for an element or attribute.
Recommendations
- The character U+0000 NUL should not be an allowed character in XML
documents.
- The C1 characters U+0080 to U+009F should not be allowed characters
in XML documents, with the exception of NEL if need be.
- The Latin-1 non-name characters U+00A0 to U+00BF, and U+00D7 and U+00F7,
should not be allowed in XML Names (a combined sketch of these first three
checks follows this list).
- Other restrictions to characters may be useful for particular circumstances,
but these will tend to be specific to the encoding confusion involved.
- Users of the ISO 8859 character sets for Latin, other than ISO 8859-1,
should be warned to pay particular attention to encoding issues, as the chances
that an encoding error will be detected will depend on the language they
use and may even depend on whether they use rarer characters.
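As a closing illustration, here is a minimal sketch (the names and structure
are mine, not a proposed API) combining the first three recommendations into
a single character-level filter:

    # NUL and the C1 controls are barred from the document entirely;
    # NEL (U+0085) could be spared here if need be.
    DOC_FORBIDDEN = {0x0000} | set(range(0x0080, 0x00A0))

    # The Latin-1 non-name characters are barred from names only.
    NAME_FORBIDDEN = set(range(0x00A0, 0x00C0)) | {0x00D7, 0x00F7}

    def check_chars(text, is_name=False):
        for ch in text:
            cp = ord(ch)
            if cp in DOC_FORBIDDEN:
                return False   # likely an encoding error: fatal
            if is_name and cp in NAME_FORBIDDEN:
                return False   # bad name character
        return True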