Ambiguity in XML 1.1 CR needs correction

The 1.1 spec has an ambiguity: it is not clear whether the various productions
for characters apply to the characters that can appear in the infoset or the
characters that can appear in the text of a document.

Until 1.1, I don't think this was a problem, because no such distinction existed.
In 1.1, the control characters fall on one side of it: they can appear in the
infoset, but not literally in the text of a document.

I believe the correct way out is:

1) To clarify that the productions relate to the characters that can appear in
the infoset

2) To make control character rejection a part of input conditioning. This may
be best done by renaming s2.11 to "End-of-Line and Control Code Handling"
and adding the following text:

"It is a well-formedness error for control characters (characters in the
range 0x00 to 0x1F and 0x7F to 0x9F) to appear in an external
parsed entity, with the exception of the whitespace characters in the
previous paragraph.  Control characters, except 0x00, must be marked
up using numeric character references."
 
3) To revise production 2 to be
[2]     Char    ::=    [#x01-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
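
To make points 2 and 3 concrete, here is a rough Python sketch, not a proposed
implementation. The function names are made up for illustration, and it assumes
the exempt whitespace controls are TAB, LF, CR and NEL (the set used in the
EPE_Char production in the P.S.).

```python
import re

# Assumption: the "whitespace characters in the previous paragraph" are
# TAB (0x09), LF (0x0A), CR (0x0D) and NEL (0x85). Every other character in
# 0x00-0x1F and 0x7F-0x9F is rejected when it appears literally.
_FORBIDDEN_LITERAL = re.compile(
    r"[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x84\x86-\x9F]"
)


def literal_control_offsets(entity_text):
    """Input conditioning: offsets of control characters that appear
    literally in the text of an external parsed entity."""
    return [m.start() for m in _FORBIDDEN_LITERAL.finditer(entity_text)]


def is_char(cp):
    """Revised production 2: code points allowed in the infoset."""
    return (
        0x01 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )
```

With this split, literal_control_offsets("a\x01b\tc") reports only offset 1,
while is_char(0x01) is still true, because the character is legal in the
infoset once it arrives through a numeric character reference.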

I believe the WG could fairly claim that this is a change in expression, not in
intent, and so perhaps would not require a new CR?

It would also remove the horrible problem with the current formulation:
control characters cannot appear in entities without causing a WF error,
so they have to be escaped (and re-escaped for each level of depth of the
entity reference). It is clearly bogus to expect the expression of a character
to depend on its use in references! Furthermore, it is against ISO 8879,
and it is unusable and confusing for people.


Cheers
Rick Jelliffe

P.S. Another approach, which may fit in with some implementations better, would be
to revise production 2 to be

[2]     Char    ::=    [#x01-#x10FFFF]
then add a disconnected production for the allowed values of an external
parsed entity
[x]     EPE_Char    ::=    #x9 | #xA | #xD | [#x20-#x7E] | #x85 | [#xA0-#xD7FF]
                      | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
and then add a disconnected production for the allowed values for the value of
an NCR
[x]     NCR_Char    ::=    [#x01-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

In other words, make it clear that you don't need to do multiple checks: just one
when you bring a character in, and one when you dereference an NCR.
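
As a rough illustration of those two check points, here is a Python sketch
under my own assumptions. The names is_epe_char, is_ncr_char and deref_ncr are
hypothetical; only the ranges come from the productions above.

```python
import re


def is_epe_char(cp):
    """EPE_Char: code points allowed literally in an external parsed entity.
    Checked once, when the character is read in."""
    return (
        cp in (0x9, 0xA, 0xD, 0x85)
        or 0x20 <= cp <= 0x7E
        or 0xA0 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def is_ncr_char(cp):
    """NCR_Char: code points a numeric character reference may denote."""
    return (
        0x01 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


# Matches one complete reference such as "&#x1;" or "&#65;".
_NCR = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def deref_ncr(ref):
    """Dereference one NCR, checking NCR_Char at that moment only."""
    m = _NCR.fullmatch(ref)
    if m is None:
        raise ValueError("not a numeric character reference: %r" % ref)
    cp = int(m.group(1), 16) if m.group(1) else int(m.group(2))
    if not is_ncr_char(cp):
        raise ValueError("NCR denotes a disallowed code point: %#x" % cp)
    return chr(cp)
```

Here is_epe_char(0x01) is false but deref_ncr("&#x1;") succeeds, so a control
character is rejected on input and accepted through a reference, with no
further check needed afterwards.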

Received on Monday, 13 January 2003 08:55:21 UTC