Re: FW: Bare surrogates in XML - must halt and catch fire?

From: Richard Tobin <richard@inf.ed.ac.uk>
Date: Thu, 8 Mar 2007 15:56:00 +0000 (GMT)
To: "Grosso, Paul" <pgrosso@ptc.com>, <public-xml-core-wg@w3.org>
Message-Id: <20070308155600.7AB971BD8F4@macpro.inf.ed.ac.uk>


> What is not clear is that XML specifically forbids bare surrogates
> (ie, half of a surrogate pair). This came up in recent SVG WG
> discussions.  Is the XML parser required to reject an xml document
> containing a bare surrogate? Would that be a well formedness error, or
> some other sort of error?

I'm not sure what the question means.  Here are two possibilities:

(a) Does XML allow unpaired surrogates in a UTF-16 (etc) document?

    No, unpaired surrogates are not legal in UTF-16 (they are
    "ill-formed" according to D35 in section 3.9 of Unicode 4.0), so
    by section 4.3.3 of the XML spec it is a fatal error: the entity
    is "determined ... to be in a certain encoding and contains byte
    sequences that are not legal in that encoding".  Presumably the
    wording in that section about irregular UTF-8 code unit sequences
    is no longer required, since recent versions of Unicode make it
    clear that these are ill-formed.

(b) Does XML allow characters whose code point is that of a surrogate?

    No, because it would violate production 2 (Char), whose ranges
    skip the surrogate block #xD800-#xDFFF.
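
To make the two readings concrete, here is a rough sketch in Python.
It is only an illustration, not taken from any parser: the function
names find_unpaired_surrogates and is_xml_char are made up for this
example, and the ranges in is_xml_char are those of the Char
production in XML 1.0.  The first function covers case (a), working
on raw UTF-16 code units; the second covers case (b), working on
code points.

```python
def find_unpaired_surrogates(units):
    """Return the indices of unpaired surrogates in a sequence of
    UTF-16 code units (integers in the range 0..0xFFFF).

    Hypothetical helper for case (a): any index returned marks a code
    unit sequence that is ill-formed in UTF-16.
    """
    bad = []
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: legal only if a low surrogate follows.
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            bad.append(i)
        elif 0xDC00 <= u <= 0xDFFF:
            # Low surrogate with no preceding high surrogate.
            bad.append(i)
        i += 1
    return bad


def is_xml_char(cp):
    """Return True if code point cp matches production 2 (Char) of
    XML 1.0.

    Hypothetical helper for case (b): the surrogate block
    0xD800-0xDFFF falls in the gap between the second and third
    ranges, so it never matches.
    """
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )
```

A parser working on decoded code points would only ever need the
second check; the first matters at the point where bytes are decoded.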

-- Richard
Received on Thursday, 8 March 2007 15:56:17 GMT
