[Bug 12105] New: [XDM 3.0] Allow any Unicode character in a string


           Summary: [XDM 3.0] Allow any Unicode character in a string
           Product: XPath / XQuery / XSLT
           Version: Working drafts
          Platform: PC
        OS/Version: Windows NT
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: Data Model 3.0
        AssignedTo: ndw@nwalsh.com
        ReportedBy: mike@saxonica.com
         QAContact: public-qt-comments@w3.org

This is a request to enhance the data model so that any Unicode character is
allowed in a string. It is raised in response to an action from the XSL
Working Group.

In practice the proposed change means (a) all XML 1.1 characters are allowed by
all processors, and (b) the Unicode NUL character (x0) is allowed by all
processors.

Serialization would fail if a string contains a character not permitted in the
version of XML that is the target of serialization. Tree construction, however,
will not reject any characters as invalid.
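
As a rough illustration of the character sets involved, the following Python
sketch tests codepoints against the Char productions of XML 1.0 and XML 1.1 and
against the proposed XDM set, read here as "XML 1.1 Char plus NUL" per the
paragraph above. The function names are hypothetical and are not taken from any
specification or processor. The serialization check only models the "fail on a
character not permitted in the target version" rule; it ignores the XML 1.1
requirement that some control characters be written as character references.

```python
def is_xml10_char(cp: int) -> bool:
    """True if cp matches the Char production of XML 1.0."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def is_xml11_char(cp: int) -> bool:
    """True if cp matches the Char production of XML 1.1."""
    return (
        0x1 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def is_proposed_xdm_char(cp: int) -> bool:
    """Assumed reading of the proposal: XML 1.1 Char plus NUL (x0)."""
    return cp == 0 or is_xml11_char(cp)


def check_serializable(text: str, xml_version: str = "1.0") -> None:
    """Raise ValueError if text holds a character the target XML version forbids."""
    allowed = is_xml10_char if xml_version == "1.0" else is_xml11_char
    for ch in text:
        if not allowed(ord(ch)):
            raise ValueError(
                f"U+{ord(ch):04X} is not permitted in XML {xml_version}"
            )


# A string containing x1 would be a legal XDM value under the proposal and
# could be serialized as XML 1.1, but serializing it as XML 1.0 would fail.
value = "a\x01b"
check_serializable(value, "1.1")
```

Under this reading, a value such as "a\x00b" could be held in the data model but
could not be serialized to either version of XML.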

Parsing of lexical XML is still free to use XML 1.0 or XML 1.1 rules at the
implementor's discretion.

Justification: we allow input from sources that are not constrained by the XML
rules, notably by using unparsed-text() or codepoints-to-string(), or by
calling external functions. Restricting the character set that can be returned
by these functions creates work for implementors, imposes a performance
penalty, and restricts what users can do with the language, all quite

We want to allow import of JSON data, with full round-tripping. This is
hampered by the fact that JSON strings allow characters that are not legal in
XDM. The alternative is to hold such strings in escaped form, which is very
inconvenient for users.
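
To make the JSON point concrete, here is a small sketch using Python's standard
json module as a stand-in for a JSON import function (it is not an XDM
implementation). It shows that a JSON string may legally contain NUL via the
\u0000 escape, and what the "escaped form" alternative looks like to a user.

```python
import json

# JSON source text for a string whose middle character is NUL.
source = '"a\\u0000b"'

# Decoded form: three characters, the second being U+0000. This is the value
# that cannot currently be represented in XDM.
decoded = json.loads(source)

# Escaped form: the same data held as eight characters, with the escape
# sequence left uninterpreted. Lengths, comparisons and substring operations
# on this form no longer reflect the real string.
escaped = source[1:-1]

# Round-tripping works only if the decoded value can be stored as-is.
round_tripped = json.dumps(decoded)
```

Here len(decoded) is 3 while len(escaped) is 8, which is the kind of
inconvenience the escaped representation imposes on users.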

Casting to string will not reject characters disallowed by XML. For validation
of XDM nodes (e.g. using [xsl:]validation or XQuery validate{}) it will be
implementation-defined whether the character set allowed in xs:string values is
XML 1.0, XML 1.1, or the full XDM set. This preserves the freedom of
implementations to use an off-the-shelf validation engine.

[For the avoidance of doubt, "any character" does not include unpaired
surrogates. It is of course possible that some external data sources will
supply pseudo-strings containing unpaired surrogates. This is analogous to
supplying a string that is supposed to be encoded in UTF-8 but contains bytes
that cannot be decoded: it is not possible to interpret what is returned as a
sequence of characters. An interface that wishes to handle octet streams
containing such oddities must handle it as a sequence of integers, or as


Received on Thursday, 17 February 2011 09:11:59 UTC