XML 1.1 CR comments and implementation report

This describes the issues I encountered in adding XML 1.1 support to RXP.

RXP now accepts both XML 1.0 and XML 1.1 documents, applying the
appropriate parsing rules to documents of each kind.  The main changes
required were:

 - Recording for each parser instance and entity its version.

 - Checking that entities did not have a version number later than the
   document entity.

 - Adding a new character type table.  I accepted the overhead of
   an extra 64k here though I could have used more bits from the
   1.0 table at the expense of slower lookup.

 - Using the appropriate table for checking character legality in the
   input functions, depending on the version of the document.

 - Using the appropriate table for checking name character legality in the
   name parsing functions, depending on the version of the document.
   Since characters are represented internally in UTF-16, and XML 1.1
   allows characters above 0xffff in names, names can now contain
   surrogates.  Fortunately the legal name characters above 0xffff
   can be distinguished by the high surrogate only, so it is not necessary
   to look at pairs when parsing names.

 - Changing the way input errors (e.g. bad UTF-8) are recorded: previously
   a SUB character was placed in the input stream, but SUB can now occur
   in the replacement text of internal entities (but see spec issues below).

 - Changes to line-end normalization on input, depending on the version
   of the document.

 - Changing character reference processing to allow references to
   control codes for 1.1 documents.

 - The structure of the parser makes it hard to handle NEL and LSEP
   characters in the XML declaration, because the encoding is not
   yet known.  I do not allow them; see spec issues below.

 - I did not attempt to implement Unicode normalization checking.
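
Several of the changes above can be sketched in a few lines of C.  This
is only an illustration under stated assumptions, not RXP's actual code:
all the names are hypothetical, the buffer is assumed to hold UTF-16 code
units, and the line-end rules are the ones I understand the XML 1.1
specification to give (CR LF, CR NEL, NEL, LSEP and lone CR all become
LF), which may differ in detail from the CR draft.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout; a sketch, not RXP's actual code. */
enum xml_version { XML_1_0 = 10, XML_1_1 = 11 };

/* An entity may not declare a later version than the document entity. */
int entity_version_ok(enum xml_version doc, enum xml_version ent)
{
    return ent <= doc;
}

/* XML 1.1 only: name characters above 0xFFFF are U+10000..U+EFFFF,
   which in UTF-16 are the pairs whose high surrogate lies in
   0xD800..0xDB7F, so the low surrogate need not be inspected. */
int is_name_high_surrogate(uint16_t u)
{
    return u >= 0xD800 && u <= 0xDB7F;
}

/* Normalize line ends; out must have room for n units.
   Returns the number of units written. */
size_t normalize_line_ends(const uint16_t *in, size_t n,
                           uint16_t *out, enum xml_version v)
{
    size_t i = 0, o = 0;

    while (i < n) {
        uint16_t c = in[i++];

        if (c == 0x000D) {
            /* CR LF (and, in 1.1, CR NEL) collapse to a single LF */
            if (i < n && (in[i] == 0x000A ||
                          (v == XML_1_1 && in[i] == 0x0085)))
                i++;
            c = 0x000A;
        } else if (v == XML_1_1 && (c == 0x0085 || c == 0x2028)) {
            /* lone NEL or LSEP becomes LF in 1.1 only */
            c = 0x000A;
        }
        out[o++] = c;
    }
    return o;
}
```

The high-surrogate bound follows from the arithmetic: U+EFFFF minus
0x10000 is 0xDFFFF, whose top ten bits are 0x37F, giving 0xD800 + 0x37F
= 0xDB7F.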

XML 1.1 is not appreciably more complex than XML 1.0, but some
complexity arises from supporting both within the same parser.  The
need to apply different checks to characters in the input inner loop
depending on the document version adds some overhead.  If necessary
this could be overcome by having 1.0 and 1.1 versions of the input
code, but there are already several versions for different encodings
and this would double the number.  On the other hand, I estimate
that the overhead is only around 3%.
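
One way to keep the per-character cost small, sketched here in C with
hypothetical names (it is not necessarily how RXP does it), is to choose
the 64k table once per buffer and index it in the inner loop.  The ranges
are an assumption based on the final XML 1.1 specification as I
understand it, where DEL and the C1 controls other than NEL may not
appear literally; the CR draft may differ in detail.

```c
#include <stddef.h>
#include <stdint.h>

#define LEGAL 0x01                      /* hypothetical flag bit */

static unsigned char table10[65536];    /* one byte per UTF-16 code unit */
static unsigned char table11[65536];

/* Fill both tables; call once at start-up. */
void init_char_tables(void)
{
    for (uint32_t c = 0; c <= 0xFFFF; c++) {
        /* surrogate units count as legal here; pairing is not checked */
        int bmp_ok = (c >= 0x20 && c <= 0xFFFD);
        int ws = (c == 0x9 || c == 0xA || c == 0xD);
        /* 1.1 literal input: no DEL or C1 controls, except NEL */
        int restricted = (c >= 0x7F && c <= 0x9F && c != 0x85);

        table10[c] = (bmp_ok || ws) ? LEGAL : 0;
        table11[c] = ((bmp_ok && !restricted) || ws) ? LEGAL : 0;
    }
}

/* Returns the index of the first illegal unit, or n if all are legal. */
size_t first_illegal(const uint16_t *buf, size_t n, int is_xml11)
{
    /* the version is tested once, not once per character */
    const unsigned char *table = is_xml11 ? table11 : table10;

    for (size_t i = 0; i < n; i++)
        if (!(table[buf[i]] & LEGAL))
            return i;
    return n;
}
```

This removes the version test from the loop body, though not the extra
work of 1.1 line-end handling.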

Issues with the spec:

XML 1.1 is not a strict superset of XML 1.0 (in the sense that a
well-formed or valid 1.0 document does not necessarily remain
well-formed if it is relabelled as 1.1) because of the exclusion
of the C1 controls and DEL from the Char production.  The main gain
from this will be the detection of documents in Microsoft encodings
such as cp1252 which are mislabelled as Latin-1.  Whether this
is worth the loss of superset-ness and the implementation overhead
I am not sure.
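
To make the cp1252 case concrete, here is a minimal C sketch (the
function name is hypothetical).  Bytes 0x80 to 0x9F in a document
labelled Latin-1 decode to C1 controls, whereas in cp1252 most of them
are printable characters such as the curly quotes at 0x93 and 0x94.  The
sketch assumes NEL (0x85) remains legal, so a cp1252 ellipsis, which is
also 0x85, would go undetected.

```c
#include <stddef.h>

/* Returns the offset of the first byte that, read as ISO-8859-1,
   decodes to a C1 control other than NEL, or -1 if there is none. */
long find_c1_in_latin1(const unsigned char *bytes, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (bytes[i] >= 0x80 && bytes[i] <= 0x9F && bytes[i] != 0x85)
            return (long)i;
    return -1;
}
```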

XML 1.0 was careful to only allow ASCII characters in the XML
declaration.  XML 1.1 should continue this, explicitly prohibiting NEL
and LSEP as whitespace in the XML declaration.

As it stands, the spec requires double escaping of control characters
if they appear in internal entities.  This is because internal entity
replacement text must match the content production, and therefore can
only contain characters matching Char.  This is pointless and would be
expensive to implement, so it should be changed.

There should be a production for Char + the escapable controls, so that
other specs can refer to it as the "internal" character set of XML
documents.

The exclusion of NUL is important for existing APIs and the spec is
right to exclude it.

Canonical XML (and the unofficial canonical XML used in the test suite)
will need to be changed to output controls as character references.
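
A rough C sketch of what a canonicalizer's output routine might do, with
hypothetical names.  Which characters to escape is my assumption: the
control codes other than tab and LF, plus NEL and LSEP, since those two
would otherwise be normalized to LF when the output is parsed again as
XML 1.1.  The uppercase hexadecimal form follows the existing "&#xD;"
convention of Canonical XML.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Nonzero if cp should be written as a character reference
   (assumed set; see the note above). */
int needs_char_ref(uint32_t cp)
{
    if (cp == 0x9 || cp == 0xA)
        return 0;
    return (cp >= 0x1 && cp <= 0x1F) ||
           (cp >= 0x7F && cp <= 0x9F) ||
           cp == 0x2028;
}

/* Writes "&#xHEX;" into buf.  Returns the length written,
   or -1 if buf is too small. */
int format_char_ref(uint32_t cp, char *buf, size_t size)
{
    int len = snprintf(buf, size, "&#x%X;", (unsigned)cp);

    return (len < 0 || (size_t)len >= size) ? -1 : len;
}
```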


-- Richard

Received on Friday, 14 February 2003 14:17:35 UTC