FW: Bare surrogates in XML - must halt and catch fire?

From: Grosso, Paul <pgrosso@ptc.com>
Date: Thu, 8 Mar 2007 09:30:46 -0500
Message-ID: <CF83BAA719FD2C439D25CBB1C9D1D302069F3E4D@HQ-MAIL4.ptcnet.ptc.com>
To: <public-xml-core-wg@w3.org>

Comments?

paul

-----Original Message-----
From: w3c-xml-cg-request@w3.org  On Behalf Of Chris Lilley
Sent: Wednesday, 2007 March 07 17:37
To: XML CG
Cc: Richard Ishida; Felix Sasaki; W3C SVG Working Group
Subject: Bare surrogates in XML - must halt and catch fire?


Hello XML CG, Richard, Felix,

In XML 4th edition:

   [Definition: A parsed entity contains text, a sequence of
   characters, which may represent markup or character data.]
   [Definition: A character is an atomic unit of text as specified by
   ISO/IEC 10646:2000 [ISO/IEC 10646]. Legal characters are tab,
   carriage return, line feed, and the legal characters of Unicode and
   ISO/IEC 10646. The versions of these standards cited in A.1
   Normative References were current at the time this document was
   prepared. New characters may be added to these standards by
   amendments or new editions. Consequently, XML processors MUST
   accept any character in the range specified for Char. ]
   http://www.w3.org/TR/xml/#charsets

This makes it clear that potentially valid characters must be
accepted. The character range is also clear:

  [2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
  [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate
  blocks, FFFE, and FFFF. */
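
As a minimal sketch (not part of the quoted specification), production [2]
can be written as a Python predicate over integer code points. The function
name is_xml_char is an assumption made for illustration only.

```python
def is_xml_char(cp: int) -> bool:
    """Return True if code point cp matches production [2] Char as quoted above."""
    return (
        cp in (0x9, 0xA, 0xD)           # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF         # up to just below the surrogate blocks
        or 0xE000 <= cp <= 0xFFFD       # after the surrogates, excluding FFFE and FFFF
        or 0x10000 <= cp <= 0x10FFFF    # supplementary planes
    )
```

Under this reading, every code point in the surrogate blocks
(#xD800-#xDFFF) falls outside Char, as do #xFFFE and #xFFFF.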

Charmod is clear about bare surrogates:

  Unicode contains some code points for internal use (such as
  noncharacters) or special functions (such as surrogate code points).

  C079 [S] Specifications SHOULD NOT allow the use of codepoints
  reserved by Unicode for internal use.
  http://www.w3.org/TR/charmod/#C079

  C078 [S] Specifications MUST NOT allow the use of surrogate
  code points.
  http://www.w3.org/TR/charmod/#C078

What is not clear is that XML specifically forbids bare surrogates
(i.e., half of a surrogate pair). This came up in recent SVG WG
discussions. Is an XML parser required to reject an XML document
containing a bare surrogate? Would that be a well-formedness error, or
some other sort of error?
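
To make "bare surrogate" concrete, here is a rough sketch that scans a
sequence of UTF-16 code units and reports the positions of surrogates that
are not part of a well-formed high/low pair. The function name
find_bare_surrogates and the list-of-integers input are assumptions for
illustration. The sketch only locates such code units; it does not say how
an XML processor must treat them.

```python
def find_bare_surrogates(units):
    """Return the indices of UTF-16 code units that are unpaired surrogates."""
    bare = []
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: well-formed only if a low surrogate follows.
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2  # skip the complete pair
                continue
            bare.append(i)
        elif 0xDC00 <= u <= 0xDFFF:
            # Low surrogate with no preceding high surrogate.
            bare.append(i)
        i += 1
    return bare
```

For example, find_bare_surrogates([0x41, 0xD83D, 0x42]) reports index 1,
while the complete pair in [0x41, 0xD83D, 0xDE00, 0x42] yields an empty list.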

-- 
 Chris Lilley                    mailto:chris@w3.org
 Interaction Domain Leader
 Co-Chair, W3C SVG Working Group
 W3C Graphics Activity Lead
 Co-Chair, W3C Hypertext CG
Received on Thursday, 8 March 2007 14:31:55 GMT
