RE: Bare surrogates in XML - must halt and catch fire?

I don't believe you did (as you say you meant to) cc the
XML Core WG, so I am doing that now.

paul

p.s.  I had already forwarded Richard Tobin's response at
http://lists.w3.org/Archives/Public/public-xml-core-wg/2007Mar/0004
to the CG list, though you apparently replied before seeing that.
 

-----Original Message-----
From: w3c-xml-cg-request@w3.org [mailto:w3c-xml-cg-request@w3.org] On
Behalf Of C.M.Sperberg-McQueen
Sent: Wednesday, 2007 March 21 15:35
To: Chris Lilley
Cc: C. M. Sperberg-McQueen; XML CG; Richard Ishida; Felix Sasaki; W3C
SVG Working Group
Subject: Re: Bare surrogates in XML - must halt and catch fire?


[sorry for the slow response.  i just saw this query.]

On 7 Mar 2007, at 16:37 , Chris Lilley wrote:

> ...
> What is not clear is that XML specifically forbids bare surrogates
> (ie, half of a surrogate pair).

I'm not sure whether you mean (a) It's not clear whether XML
forbids bare surrogates or not, or (b) while it's a fact
that XML forbids bare surrogates, that fact is not clear to
all readers.

> This came up in recent SVG WG
> discussions.  Is the XML parser required to reject an xml document
> containing a bare surrogate? Would that be a well formedness error, or
> some other sort of error?

I believe the answers are yes, and unspecified (but most processors
are likely to treat it as a WF error).

But it's been a while since I worked through all of the details,
so I'm cc'ing the XML Core WG as well, in the hopes that Francois
or someone else in Core who has been deeply inside the XML spec
lately can check my analysis.

To be more precise, let us consider an octet stream we receive,
with respect to which we wish to ask "is it a well-formed XML
document?"  Let us suppose we recognize the octet stream as UTF-16,
either by following the rules in the XML spec or on account of an
external label, or because an omniscient being, or just a being
with particular knowledge of the case (such as the creator of
the data stream, in this case me), has whispered "UTF-16" in our ear.

If we ask a UTF-16-savvy dump utility to show us the data, we might
see this:

   003C 003F 0078 006D < ? x m
   006C 0020 0076 0065 l   v e
   0072 0073 0069 006F r s i o
   006E 003D 0022 0031 n = " 1
   002E 0030 0022 003F . 0 " ?
   003E 000A 003C 0078 >   < x
   003E 0048 0069 002C > H i ,
   D801 0020 004D 006F .   M o
   006D 002E 003C 002F m . < /
   0078 003E 000A      x >

So what we've got looks a lot like

   <?xml version="1.0"?>
   <x>Hi,* Mom.</x>

except that where the * appears in the lines above, we have
the 16-bit value D801, which in a normal UTF-16 encoding
would be half of a surrogate pair.  We can ask several
questions:

Q. Is this a well-formed XML document?
A. What do you mean by 'this'?
Q. I mean the octet stream.
A. Octet streams are streams of bits.  XML documents are
    sequences of characters.  The question seems to embody
    a category error.
Q. Who are you, Spock?  Does this octet stream encode a
    well-formed XML document?
A. Now the question is conceptually well-formed.
Q. Pedant.
A. Hey, you ask a language-lawyer question, you get a
    language-lawyer answer.  No, the octet stream doesn't
    represent a well-formed XML document.
Q. Why not?
A. Because the octet stream does not represent a sequence
    of characters in the UTF-16 encoding.  To represent a
    well-formed XML document, an octet stream must encode
    a sequence of characters which is that well-formed
    XML document.  The 16-bit value D801, followed as it is
    here by 0020, does not encode a character.  The
    octet sequence is not UTF-16.
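
The point about D801 followed by 0020 can be checked mechanically.  What
follows is only a sketch, assuming Python's built-in "utf-16-be" codec in
its default strict mode; the names CODE_UNITS, to_octets and decode_utf16
are made up for illustration and come from no spec or library.

```python
import struct

# The 16-bit code units from the dump above, in order (no byte order mark).
CODE_UNITS = [
    0x003C, 0x003F, 0x0078, 0x006D, 0x006C, 0x0020, 0x0076, 0x0065,
    0x0072, 0x0073, 0x0069, 0x006F, 0x006E, 0x003D, 0x0022, 0x0031,
    0x002E, 0x0030, 0x0022, 0x003F, 0x003E, 0x000A, 0x003C, 0x0078,
    0x003E, 0x0048, 0x0069, 0x002C, 0xD801, 0x0020, 0x004D, 0x006F,
    0x006D, 0x002E, 0x003C, 0x002F, 0x0078, 0x003E, 0x000A,
]


def to_octets(code_units):
    """Serialize the code units as big-endian octets, as in the dump."""
    return struct.pack(">%dH" % len(code_units), *code_units)


def decode_utf16(octets):
    """Return (text, None) on success, or (None, message) if the octets
    are not legal UTF-16."""
    try:
        return octets.decode("utf-16-be"), None
    except UnicodeDecodeError as err:
        return None, "illegal in UTF-16 at octet %d: %s" % (err.start, err.reason)


text, problem = decode_utf16(to_octets(CODE_UNITS))
print(text)     # no character sequence comes out
print(problem)  # the decoder's complaint about the lone surrogate
```

The decoder never gets as far as producing characters, so there is nothing
yet to hold up against the 'document' production.
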
Q. What if I said it was encoded not in UTF-16 (which has
    defined the surrogate characters) but in UCS-2 (which
    doesn't define surrogate characters)?
A. I'd have to check the Unicode specs to be sure.  Hold
    on ...
Q. Wait, don't bother.  Suppose I invented an encoding and
    called it x-myencoding and said this is a legitimate
    encoding of a sequence of Unicode 1.0 characters, and
    D801 represents U+D801, or equivalently the Unicode 1.0
    character whose integer value is 55297.
A. I don't think Unicode defines a character at that point.
    In fact, I'm pretty sure they say explicitly that
    there isn't one and can never be one.
Q. Not in Unicode 1.0.  Surrogates weren't defined until later.
    Is it well-formed then?
A. No.
Q. Why not?
A. Two reasons.  First, by not including an encoding
    declaration, you implicitly claimed that the encoding
    was either UTF-8 or UTF-16, or else reliably given by an
    external authority.  (You will have to read up on the
    current state of the various RFCs to get a chapter and
    verse account of when and where and how and why for
    all of this.)  The external authority who whispered in
    my ear clearly said "UTF-16", not "x-myencoding".
Q. So if I added an encoding declaration would it be
    well-formed?
A. No.  You told me that the octets in the relevant bit
    of the data stream represent the Unicode 1.0 characters
    whose integers are (in hex, I can't do decimal conversions
    on the fly) ... 002C, D801, 0020, ...
    I'm taking your word for it that the octet stream
    correctly represents those characters.  But production [2]
    of XML says clearly that the second of those characters,
    the one whose number is D801, is not a legal XML
    character.  So if the octet stream is correctly recognized
    as being encoded in x-myencoding, then we have a
    sequence of characters but not a well-formed XML
    document.
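
The production [2] check in the answer above is simple enough to sketch.
The ranges are those of the Char production in XML 1.0; the function names
is_xml_char and first_illegal_char are invented here for illustration.

```python
# Code point ranges allowed by production [2] (Char) of XML 1.0.
XML_CHAR_RANGES = [
    (0x9, 0x9), (0xA, 0xA), (0xD, 0xD),
    (0x20, 0xD7FF), (0xE000, 0xFFFD), (0x10000, 0x10FFFF),
]


def is_xml_char(code_point):
    """True if the code point matches the Char production."""
    return any(lo <= code_point <= hi for lo, hi in XML_CHAR_RANGES)


def first_illegal_char(code_points):
    """Return (index, code point) of the first non-Char, or None."""
    for index, code_point in enumerate(code_points):
        if not is_xml_char(code_point):
            return index, code_point
    return None


# The three characters named in the answer above: 002C, D801, 0020.
print(first_illegal_char([0x002C, 0xD801, 0x0020]))  # (1, 55297)
```

Note that this check applies to characters, so it only comes into play once
some decoder has agreed to hand over a character sequence at all.
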
Q. What if I told you that I was wrong, earlier, when I
    said that x-myencoding treats D801 as a representation
    of the Unicode 1.0 character whose integer is xD801?
A. I wouldn't be the least bit surprised.
Q. What if I told you that D801 is recognized as a
    valid representation of the character whose number is
    33, i.e. x21?
A. That would be exclamation point.
Q. So is the octet stream a well-formed XML docu-- I mean,
    does the octet stream now repre-- er, encode, a well-formed
    XML document?
A. You're telling me it encodes the sequence of characters whose
    conventional representation is

    <?xml version="1.0"?>
    <x>Hi,! Mom.</x>

    That sequence of characters is indeed a well-formed XML
    document.  I have to grant that, even if I deplore your choice
    of character encodings.  And your English punctuation isn't
    too hot, either.
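
For the curious, the x-myencoding case can be sketched as follows.
Everything about x-myencoding here is hypothetical: I am assuming it reads
big-endian 16-bit units, maps D801 to the character numbered x21, and maps
every other unit to the character with the same number.  The parser is
Python's xml.etree.ElementTree, fed the decoded characters rather than the
octets.

```python
import struct
import xml.etree.ElementTree as ET


def decode_x_myencoding(octets):
    """Hypothetical decoder: D801 means '!', any other unit means chr(unit)."""
    units = struct.unpack(">%dH" % (len(octets) // 2), octets)
    return "".join("!" if unit == 0xD801 else chr(unit) for unit in units)


# Rebuild the octet stream from the dump: ordinary text around a bare D801.
PREFIX = '<?xml version="1.0"?>\n<x>Hi,'
SUFFIX = " Mom.</x>\n"
octets = PREFIX.encode("utf-16-be") + b"\xD8\x01" + SUFFIX.encode("utf-16-be")

text = decode_x_myencoding(octets)
root = ET.fromstring(text)  # parses the character sequence, not the octets
print(root.tag, repr(root.text))  # x 'Hi,! Mom.'
```

The same octets that are not UTF-16 at all yield a well-formed document
once the decoding rules are changed, which is the whole point of the
exchange above.
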
Q. So going back to the earlier examples, when we assumed a UTF-16
    encoding.  The octet stream wasn't a--I mean, didn't encode
    a well-formed XML document.  So did it have a well-formedness
    error?  And crucially, is a processor required to detect
    encoding errors?
A. Most readers of the spec seem to agree that a sequence of
    characters which fails to match the 'document' production
    or violates some WF constraint in the spec, has a well-formedness
    error.  (They are taking the term "textual object" to mean
    "sequence of characters", which may or may not be a perfect
    interpretation.)  It's less clear whether something which
    is not a sequence of characters, or not a textual object,
    can rise to the status of having a well-formedness error.
    The coffee cup in my hand does not match the 'document'
    production of the XML spec.  It is not, and does not encode,
    any well-formed XML document.  At least, not using any encoding
    in common use.  I could invent one tomorrow in which my
    coffee cup encodes the character sequence <x/>, just to be
    able to say that my coffee cup encodes an XML document.
    But today I'm busy.  So today, my coffee cup encodes no WF
    XML document.  Can we infer, then, that it has a well-formedness
    error?  The spec neither requires us to say so, nor forbids
    it.
Q. Is a processor required to detect encoding errors?
A. If I hand you a document encoded in ISO 8859-7 and tell you
    it's in ISO 8859-1, do you guarantee that you will detect
    the error?
Q. That could be hard.
A. Yes. Impossible in principle.  On the other hand, if you tell me
    the data stream is encoded in encoding E, and it turns out,
    when decoded using the rules of encoding E, not to produce
    a well-formed XML document, it's probably worth reporting,
    right?
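
The ISO 8859 exchange above can be sketched like this, assuming Python's
"iso8859_7" and "iso8859_1" codecs.  The sample document is made up.
Because ISO 8859-1 assigns a character to every one of the 256 octet
values, decoding under the wrong label cannot fail; it just produces
different characters.

```python
import xml.etree.ElementTree as ET

# A small made-up document with Greek content.
ORIGINAL = "<x>Καλημέρα</x>"

# What the author wrote to disk: ISO 8859-7 octets.
octets = ORIGINAL.encode("iso8859_7")

# What a processor sees when told the octets are ISO 8859-1.
# This never raises, since every octet is a legal ISO 8859-1 character.
misread = octets.decode("iso8859_1")

# The misread character sequence is still a well-formed document.
root = ET.fromstring(misread)
print(repr(root.text))  # accented Latin gibberish, not the Greek original
```

Nothing in the octets themselves tells the processor that it has the wrong
document in hand.
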
Q. Right.  But aren't I the one supposed to be asking the questions?
    For the third time of asking, is a processor required to detect
    and report the issue with the D801 character in the example
    we started with?
A. Yes.  The XML spec says, in section 4.3.3:

       It is a fatal error if an XML entity is determined ...
       to be in a certain encoding but contains byte sequences
       that are not legal in that encoding.

    Conforming processors are required to detect fatal errors
    and report them.  The one in the example thus may or may not
    be a well-formedness error, but it's definitely a fatal
    error.
Q. Didn't you just tell me that it was impossible in principle
    to detect all cases in which the encoding declaration is
    inaccurate?
A. I did.
Q. So why does the XML spec require the detection of an error you
    say is impossible to detect in principle, in the general
    case?  In section 4.3.3 the Rec also says

       In the absence of information provided by an external
       transport protocol (e.g. HTTP or MIME), it is a fatal
       error for an entity including an encoding declaration to
       be presented to the XML processor in an encoding other
       than that named in the declaration,

    That seems to mean that if I told you the document was in
    ISO 8859-1 when it was really in 8859-7, you would be
    obligated to detect it.
A. Yeah, you know, I was wondering about that myself.  I thought
    at first that maybe the Core WG had snuck that in later,
    after the first edition, but no, it's been there all along.
    I think I must just have lost the argument with the rest of
    the WG on that one.  Fortunately, there's a metaphysical
    defense.  It's true that the octet stream you gave me encoded
    a document in ISO 8859-7, and that may have been the one you
    wanted me to validate and process.  But it also encodes a
    document encoded in ISO 8859-1, which is the one I actually
    did validate and process.  That document didn't make much
    sense -- a number of passages just looked like gibberish,
    in fact -- but when I'm playing the role of well-formedness
    checker I try to avoid making stylistic comments on my
    users' prose.  It alarms them.  And they find most of the
    suggestions pedantic.
Q. Why do you think that would be?

I hope this helps.

--CMSMcQ

Received on Friday, 23 March 2007 22:57:28 UTC