RE: Forbidden character issue not addressed in XSLT 1.1 WD from Mike Brown on 2001-01-08 (xsl-editors@w3.org from January to March 2001)

From: Mike Brown <mbrown@corp.webb.net>
Date: Mon, 8 Jan 2001 12:16:56 -0700
To: "'xsl-editors@w3.org'" <xsl-editors@w3.org>
Message-ID: <8D96EDA0AC04D31197B400A0C96C1480F70952@OSSEX1>

> This is dealt with in the paragraph at the end of section 14.2.

I'm not convinced that this paragraph is sufficient. Your approach seems to
be to say that non-XML characters cannot make their way into a tree in a
transformation, so there's no need to discuss translation of the characters
in output. Yet by leaving the mapping implementation-dependent without
explicitly forbidding non-XML characters in the result tree, you create a
need to address the issue in the section on output.

Regardless of how an implementation might go about mapping
difficult-to-map characters into the XPath/XSLT tree that it acts upon, is
it or is it not possible to have non-XML characters in a source, stylesheet,
or result tree?

I have found no explicit statement one way or the other on this subject in
the XPath or XSLT specs. The closest I have found is in the introduction of
each spec, where these statements are made:

"A transformation in the XSLT language is expressed as a well-formed XML
document [XML]" ... this seems to imply that a stylesheet tree cannot
contain non-XML characters.

"The primary purpose of XPath is to address parts of an XML [XML] document"
... this seems to imply that any tree in the XPath model cannot contain
non-XML characters.

However, XSLT section 3.1 acknowledges the possibility of the source tree
being derived from a non-well-formed document, and in such a situation
deviates from XPath and allows the tree to not resemble one that was derived
from a well-formed XML document. This seems to imply that source trees could
therefore contain non-XML characters as well.

> It can't be dealt with purely by the XML output method, because there
> may be illegal surrogate pairs that would make the behavior of the
> string functions undefined. 

I understand what you are saying, but I disagree with your premise. It is
impossible for a conforming XPath implementation to produce string objects
that contain non-XML characters.

Section 3.6 of XPath is very clear on these points:

1. String objects must consist of 0 or more XML characters;

2. Implementations must differentiate between code values and characters;
surrogate pairs are code value sequences that must be mapped to characters.

So it is the string-returning function's responsibility to return a string
that conforms to these rules; XPath operates on strings at the character
level, not the code value level. How the implementation goes about mapping
illegal code values to characters, regardless of whether they are from the
surrogate range, is up to the implementation.
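
To make the code value versus character distinction concrete, here is a
minimal sketch that maps a sequence of UTF-16 code values to characters
(code points). The function name code_units_to_chars is hypothetical, and
raising an error on an unpaired surrogate is just one possible policy chosen
for illustration; the specs leave that choice to the implementation.

```python
def code_units_to_chars(units):
    """Map UTF-16 code values to a list of code points (characters).

    Assumption for this sketch: an unpaired surrogate is rejected with
    ValueError. An implementation could equally substitute or drop it.
    """
    chars = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: must be followed by a low surrogate.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                low = units[i + 1]
                chars.append(0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00))
                i += 2
                continue
            raise ValueError("unpaired high surrogate at index %d" % i)
        if 0xDC00 <= u <= 0xDFFF:
            # Low surrogate with no preceding high surrogate.
            raise ValueError("unpaired low surrogate at index %d" % i)
        chars.append(u)
        i += 1
    return chars


# Three code values, but only two characters.
chars = code_units_to_chars([0x41, 0xD835, 0xDD04])
print(len(chars), [hex(c) for c in chars])
```

Under this reading, a string function such as string-length would count the
two characters, not the three code values.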

> (Also HTML does not allow any character.)

Do you have a reference?

When reviewing section 5 of the HTML 4.01 spec and the comments in the HTML
4.0 strict DTD, I saw no indication that any particular ISO/IEC 10646
characters are forbidden in HTML documents in general, setting aside
syntactic restrictions on where exactly certain characters can appear.
XML, on the other hand, explicitly disallows certain ranges.
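
For reference, the ranges XML 1.0 allows are given by its Char production
(#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]).
A minimal sketch of that test follows; the names is_xml_char and
first_non_xml_char are hypothetical helpers, not part of any spec or library.

```python
def is_xml_char(cp):
    """Return True if code point cp matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def first_non_xml_char(text):
    """Return the index of the first non-XML character in text, or None."""
    for index, ch in enumerate(text):
        if not is_xml_char(ord(ch)):
            return index
    return None


# U+000C (form feed) is outside the Char production.
print(first_non_xml_char("a\x0cb"))
```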

   - Mike
____________________________________________________________________
Mike J. Brown, software engineer at            My XML/XSL resources:
webb.net in Denver, Colorado, USA              http://skew.org/xml/

Received on Monday, 8 January 2001 14:16:23 UTC