p:unescape-markup from Richard Tobin on 2008-05-13 (public-xml-processing-model-wg@w3.org from May 2008)

From: Richard Tobin <richard@inf.ed.ac.uk>
Date: Tue, 13 May 2008 14:02:52 +0100 (BST)
To: public-xml-processing-model-wg@w3.org
Message-Id: <20080513130252.7EA8F393D5D@macpro.inf.ed.ac.uk>

>    [NEW] ACTION: Richard to attempt to clarify the prose of the
>    unescape-markup with respect to the XML Declaration, document types, XML
>    version, etc. [recorded in
>    http://www.w3.org/2008/05/08-xproc-minutes.html#action03]

I looked at the existing description of p:unescape-markup and was
surprised to see that it says:

  When the string value is parsed, the original document element is
  preserved so that the result will be well-formed XML even if the
  content consists of multiple, sibling elements.

That is, the text is parsed as an external entity rather than an XML
document.  This implies that it can't have a DOCTYPE - we don't want
to invent a new kind of document that's effectively an XML document
with multiple top-level elements allowed.
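
To make the distinction concrete, here is a minimal sketch in Python using the standard xml.etree.ElementTree module. It is not how any XProc implementation works. Wrapping the text in an element (the name "wrapper" is hypothetical) only approximates parsing it as an external parsed entity, but it shows that sibling elements are acceptable as content though not as a document, and that a DOCTYPE is acceptable in a document though not in content.

```python
import xml.etree.ElementTree as ET


def parses_as_document(text: str) -> bool:
    """True if text is a well-formed XML document (one top-level element)."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


def parse_as_content(text: str) -> ET.Element:
    """Parse text as element content by placing it inside a wrapper element.

    Assumption: the wrapper stands in for the preserved document element;
    this approximates, but is not, external parsed entity processing.
    """
    return ET.fromstring("<wrapper>" + text + "</wrapper>")


siblings = "<a>one</a><b>two</b>"
with_doctype = "<!DOCTYPE a><a>one</a>"

# Multiple top-level elements: rejected as a document, accepted as content.
print(parses_as_document(siblings))
print([child.tag for child in parse_as_content(siblings)])

# A DOCTYPE: accepted in a document, rejected inside element content.
print(parses_as_document(with_doctype))
try:
    parse_as_content(with_doctype)
except ET.ParseError as err:
    print("rejected as content:", err)
```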

Why do we allow this?  Is it just because p:escape-markup can produce
such things (because it serializes the children of the document
element)?  Do we really want it?
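
As a rough illustration of why p:escape-markup can produce such things, the sketch below serializes only the children of a document element. It uses Python's ElementTree as a stand-in and ignores the step's real serialization options; the function name serialize_children is hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def serialize_children(document: str) -> str:
    """Serialize the content of the document element, not the element itself."""
    root = ET.fromstring(document)
    # Leading text of the document element, re-escaped for serialization.
    parts = [escape(root.text or "")]
    for child in root:
        # tostring() also emits the text that follows each child (its tail).
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)


escaped = serialize_children("<doc><a>one</a><b>two</b></doc>")
print(escaped)
```

The result has two top-level elements, so it can only be read back as content, not as an XML document.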

If p:escape-markup produces and p:unescape-markup consumes external
entities rather than XML documents, this raises various issues about
the XML declaration.  For a non-document entity, it's actually a
text declaration, so "standalone" is not allowed and "encoding" is
required (even though the encoding is irrelevant, since we're
dealing with characters not bytes).  Do we say it's up to the
user to ensure that the serialization options produce a legal
serialised result?
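
The difference between the two declarations can be sketched with regular expressions transcribed from the XMLDecl and TextDecl productions of XML 1.0. This is an illustration only, assuming the Fifth Edition form of VersionNum ("1." followed by digits); the names XML_DECL and TEXT_DECL are mine.

```python
import re

S = r"[ \t\r\n]+"
EQ = r"[ \t\r\n]*=[ \t\r\n]*"


def quoted(inner: str) -> str:
    """Pattern for inner wrapped in matching single or double quotes."""
    return "(?:'" + inner + "'|\"" + inner + "\")"


VERSION = S + "version" + EQ + quoted(r"1\.[0-9]+")
ENCODING = S + "encoding" + EQ + quoted(r"[A-Za-z][A-Za-z0-9._-]*")
STANDALONE = S + "standalone" + EQ + quoted(r"(?:yes|no)")

# XMLDecl: version required, encoding and standalone optional.
XML_DECL = re.compile(
    r"<\?xml" + VERSION + "(?:" + ENCODING + ")?(?:" + STANDALONE + ")?"
    + r"[ \t\r\n]*\?>"
)
# TextDecl: version optional, encoding required, standalone not allowed.
TEXT_DECL = re.compile(
    r"<\?xml(?:" + VERSION + ")?" + ENCODING + r"[ \t\r\n]*\?>"
)

samples = [
    '<?xml version="1.0"?>',
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<?xml encoding="UTF-8"?>',
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>',
]
for decl in samples:
    is_xml = XML_DECL.fullmatch(decl) is not None
    is_text = TEXT_DECL.fullmatch(decl) is not None
    print(decl, "XMLDecl:", is_xml, "TextDecl:", is_text)
```

Only a declaration with a version and an encoding but no standalone is legal in both roles, which is why the choice of serialization options matters here.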

-- Richard



Received on Tuesday, 13 May 2008 13:03:37 UTC