- From: Grosso, Paul <pgrosso@ptc.com>
- Date: Fri, 23 Mar 2007 18:53:56 -0400
- To: "XML CG" <w3c-xml-cg@w3.org>, <public-xml-core-wg@w3.org>
- Cc: "Richard Ishida" <ishida@w3.org>, "Felix Sasaki" <fsasaki@w3.org>, "W3C SVG Working Group" <w3c-svg-wg@w3.org>
I don't believe you did (as you say you meant to) cc the XML Core WG, so I am doing that now.

paul

p.s. I had already forwarded Richard Tobin's response at http://lists.w3.org/Archives/Public/public-xml-core-wg/2007Mar/0004 to the CG list, though you apparently replied before seeing that.

-----Original Message-----
From: w3c-xml-cg-request@w3.org [mailto:w3c-xml-cg-request@w3.org] On Behalf Of C.M.Sperberg-McQueen
Sent: Wednesday, 2007 March 21 15:35
To: Chris Lilley
Cc: C. M. Sperberg-McQueen; XML CG; Richard Ishida; Felix Sasaki; W3C SVG Working Group
Subject: Re: Bare surrogates in XML - must halt and catch fire?

[sorry for the slow response. i just saw this query.]

On 7 Mar 2007, at 16:37 , Chris Lilley wrote:

> ...
> What is not clear is that XML specifically forbids bare surrogates
> (ie, half of a surrogate pair).

I'm not sure whether you mean

(a) It's not clear whether XML forbids bare surrogates or not, or
(b) while it's a fact that XML forbids bare surrogates, that fact is not clear to all readers.

> This came up in recent SVG WG
> discussions. Is the XML parser required to reject an xml document
> containing a bare surrogate? Would that be a well formedness error, or
> some other sort of error?

I believe the answers are yes, and unspecified (but most processors are likely to treat it as a WF error). But it's been a while since I worked through all of the details, so I'm cc'ing the XML Core WG as well, in the hopes that Francois or someone else in Core who has been deeply inside the XML spec lately can check my analysis.

To be more precise, let us consider an octet stream we receive, with respect to which we wish to ask "is it a well-formed XML document?"
Let us suppose we recognize the octet stream as UTF16, either by following the rules in the XML spec or on account of an external label, or because an omniscient being, or just a being with particular knowledge of the case (such as the creator of the data stream, in this case me), has whispered "UTF16" in our ear. If we ask a UTF16-savvy dump utility to show us the data, we might see this:

    003C 003F 0078 006D   < ? x m
    006C 0020 0076 0065   l   v e
    0072 0073 0069 006F   r s i o
    006E 003D 0022 0031   n = " 1
    002E 0030 0022 003F   . 0 " ?
    003E 000A 003C 0078   >   < x
    003E 0048 0069 002C   > H i ,
    D801 0020 004D 006F   .   M o
    006D 002E 003C 002F   m . < /
    0078 003E 000A        x >

So what we've got looks a lot like

    <?xml version="1.0"?>
    <x>Hi,* Mom.</x>

except that where the * appears in the lines above, we have the 16-bit value D801, which in a normal UTF16 encoding would be half of a surrogate character.

We can ask several questions:

Q. Is this a well-formed XML document?

A. What do you mean by 'this'?

Q. I mean the octet stream.

A. Octet streams are streams of bits. XML documents are sequences of characters. The question seems to embody a category error.

Q. Who are you, Spock? Does this octet stream encode a well-formed XML document?

A. Now the question is conceptually well-formed.

Q. Pedant.

A. Hey, you ask a language-lawyer question, you get a language-lawyer answer. No, the octet stream doesn't represent a well-formed XML document.

Q. Why not?

A. Because the octet stream does not represent a sequence of characters in the UTF-16 encoding. To represent a well-formed XML document, an octet stream must encode a sequence of characters which is that well-formed XML document. The 16-bit value D801, followed as it is here by 0020, does not encode a character. The octet sequence is not UTF-16.

Q. What if I said it was encoded not in UTF-16 (which has defined the surrogate characters) but in UCS-2 (which doesn't define surrogate characters)?

A. I'd have to check the Unicode specs to be sure.
Hold on ...

Q. Wait, don't bother. Suppose I invented an encoding and called it x-myencoding and said this is a legitimate encoding of a sequence of Unicode 1.0 characters, and D801 represents U+D801, or equivalently the Unicode 1.0 character whose integer value is 55297.

A. I don't think Unicode defines a character at that point. In fact, I'm pretty sure they say explicitly that there isn't one and can never be one.

Q. Not in Unicode 1.0. Surrogates weren't defined until later. Is it well-formed then?

A. No.

Q. Why not?

A. Two reasons. First, by not including an encoding declaration, you implicitly claimed that the encoding was either UTF-8 or UTF-16, or else reliably given by an external authority. (You will have to read up on the current state of the various RFCs to get a chapter and verse account of when and where and how and why for all of this.) The external authority who whispered in my ear clearly said "UTF-16", not "x-myencoding".

Q. So if I added an encoding declaration would it be well-formed?

A. No. You told me that the octets in the relevant bit of the data stream represent the Unicode 1.0 characters whose integers are (in hex, I can't do decimal conversions on the fly) ... 002C, D801, 0020, ... I'm taking your word for it that the octet stream correctly represents those characters. But production [2] of XML says clearly that the second of those characters, the one whose number is D801, is not a legal XML character. So if the octet stream is correctly recognized as being encoded in x-myencoding, then we have a sequence of characters but not a well-formed XML document.

Q. What if I told you that I was wrong, earlier, when I said that x-myencoding treats D801 as a representation of the Unicode 1.0 character whose integer is xD801?

A. I wouldn't be the least bit surprised.

Q. What if I told you that D801 is recognized as a valid representation of the character whose number is 33, i.e. x21?

A. That would be exclamation point.

Q.
So is the octet stream a well-formed XML docu-- I mean, does the octet stream now repre-- er, encode, a well-formed XML document?

A. You're telling me it encodes the sequence of characters whose conventional representation is

    <?xml version="1.0"?>
    <x>Hi,! Mom.</x>

That sequence of characters is indeed a well-formed XML document. I have to grant that, even if I deplore your choice of character encodings. And your English punctuation isn't too hot, either.

Q. So going back to the earlier examples, when we assumed a UTF-16 encoding. The octet stream wasn't a--I mean, didn't encode a well-formed XML document. So did it have a well-formedness error? And crucially, is a processor required to detect encoding errors?

A. Most readers of the spec seem to agree that a sequence of characters which fails to match the 'document' production or violates some WF constraint in the spec, has a well-formedness error. (They are taking the term "textual object" to mean "sequence of characters", which may or may not be a perfect interpretation.)

It's less clear whether something which is not a sequence of characters, or not a textual object, can rise to the status of having a well-formedness error. The coffee cup in my hand does not match the 'document' production of the XML spec. It is not, and does not encode, any well-formed XML document. At least, not using any encoding in common use. I could invent one tomorrow in which my coffee cup encodes the character sequence <x/>, just to be able to say that my coffee cup encodes an XML document. But today I'm busy. So today, my coffee cup encodes no WF XML document. Can we infer, then, that it has a well-formedness error? The spec neither requires us to say so, nor forbids it.

Q. Is a processor required to detect encoding errors?

A. If I hand you a document encoded in ISO 8859-7 and tell you it's in ISO 8859-1, do you guarantee that you will detect the error?

Q. That could be hard.

A. Yes. Impossible in principle.
On the other hand, if you tell me the data stream is encoded in encoding E, and it turns out, when decoded using the rules of encoding E, not to produce a well-formed XML document, it's probably worth reporting, right?

Q. Right. But aren't I the one supposed to be asking the questions? For the third time of asking, is a processor required to detect and report the issue with the D801 character in the example we started with?

A. Yes. The XML spec says, in section 4.3.3:

    It is a fatal error if an XML entity is determined ... to be in a certain encoding but contains byte sequences that are not legal in that encoding.

Conforming processors are required to detect fatal errors and report them. The one in the example thus may or may not be a well-formedness error, but it's definitely a fatal error.

Q. Didn't you just tell me that it was impossible in principle to detect all cases in which the encoding declaration is inaccurate?

A. I did.

Q. So why does the XML spec require the detection of an error you say is impossible to detect in principle, in the general case? In section 4.3.3 the Rec also says

    In the absence of information provided by an external transport protocol (e.g. HTTP or MIME), it is a fatal error for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration,

That seems to mean that if I told you the document was in ISO 8859-1 when it was really in 8859-7, you would be obligated to detect it.

A. Yeah, you know, I was wondering about that myself. I thought at first that maybe the Core WG had snuck that in later, after the first edition, but no, it's been there all along. I think I must just have lost the argument with the rest of the WG on that one.

Fortunately, there's a metaphysical defense. It's true that the octet stream you gave me encoded a document in ISO 8859-7, and that may have been the one you wanted me to validate and process.
But it also encodes a document encoded in ISO 8859-1, which is the one I actually did validate and process. That document didn't make much sense -- a number of passages just looked like gibberish, in fact -- but when I'm playing the role of well-formedness checker I try to avoid making stylistic comments on my users' prose. It alarms them. And they find most of the suggestions pedantic.

Q. Why do you think that would be?

I hope this helps.

--CMSMcQ
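
The two checks in the dialogue above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dump is taken to be big-endian UTF-16 with no byte order mark, Python's strict utf-16-be codec stands in for an XML processor's decoder, and the hypothetical x-myencoding is modelled as a one-to-one mapping from 16-bit values to code points. The names CODE_UNITS, OCTETS, utf16_error, is_xml_char and illegal_chars are invented for the example.

```python
import struct

# The 16-bit values from the dump in the message, in order.
CODE_UNITS = [
    0x003C, 0x003F, 0x0078, 0x006D, 0x006C, 0x0020, 0x0076, 0x0065,
    0x0072, 0x0073, 0x0069, 0x006F, 0x006E, 0x003D, 0x0022, 0x0031,
    0x002E, 0x0030, 0x0022, 0x003F, 0x003E, 0x000A, 0x003C, 0x0078,
    0x003E, 0x0048, 0x0069, 0x002C, 0xD801, 0x0020, 0x004D, 0x006F,
    0x006D, 0x002E, 0x003C, 0x002F, 0x0078, 0x003E, 0x000A,
]

# Assumption: big-endian octets, no byte order mark.
OCTETS = struct.pack(f">{len(CODE_UNITS)}H", *CODE_UNITS)


def utf16_error(octets):
    """Return the decoder's complaint if the octets are not legal
    UTF-16 (the section 4.3.3 situation), or None if they decode."""
    try:
        octets.decode("utf-16-be")  # error handling is strict by default
    except UnicodeDecodeError as err:
        return str(err)
    return None


def is_xml_char(cp):
    """Production [2] (Char) of XML 1.0, applied to a code point."""
    return (cp in (0x9, 0xA, 0xD)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)


def illegal_chars(code_points):
    """Code points that production [2] rules out."""
    return [cp for cp in code_points if not is_xml_char(cp)]


if __name__ == "__main__":
    # Read as UTF-16: the octets do not encode a character sequence at all.
    print("UTF-16 decoding problem:", utf16_error(OCTETS))
    # Read as x-myencoding with D801 meaning U+D801: characters, but not Char.
    print("not Char:", [hex(cp) for cp in illegal_chars(CODE_UNITS)])
    # Read as x-myencoding with D801 meaning x21: nothing is flagged.
    remapped = [0x21 if unit == 0xD801 else unit for unit in CODE_UNITS]
    print("not Char after remapping:", illegal_chars(remapped))
```

Under the UTF-16 reading the decoder refuses the octets, which corresponds to the fatal error of section 4.3.3. Under the first x-myencoding reading the decoding succeeds by fiat and only D801 fails production [2]. Under the second reading, where D801 stands for x21, every character passes. The sketch checks only the Char production, not the 'document' production.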
Received on Friday, 23 March 2007 22:57:28 UTC