[Bug 29479] New: [XSLT30] Streaming and non-well-formed documents

From: <bugzilla@jessica.w3.org>
Date: Thu, 18 Feb 2016 12:27:05 +0000
To: public-qt-comments@w3.org
Message-ID: <bug-29479-523@http.www.w3.org/Bugs/Public/>
https://www.w3.org/Bugs/Public/show_bug.cgi?id=29479

            Bug ID: 29479
           Summary: [XSLT30] Streaming and non-well-formed documents
           Product: XPath / XQuery / XSLT
           Version: Candidate Recommendation
          Hardware: PC
                OS: Windows NT
            Status: NEW
          Severity: normal
          Priority: P2
         Component: XSLT 3.0
          Assignee: mike@saxonica.com
          Reporter: abel.braaksma@xs4all.nl
        QA Contact: public-qt-comments@w3.org
  Target Milestone: ---

Martin Honnen brought this to my attention in a bug report on Exselt (ECS-12).
He quoted a part of the spec:

"A streamed transformation that only accesses part of the input
document (for example, a header at the start of a document) is not
required to continue reading once the data it needs has been read.
This means that XML well-formedness or validity errors occurring in
the unread part of the input stream may go undetected."

and gave this example:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  version="3.0"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  exclude-result-prefixes="xs">

<xsl:param name="input-uri" as="xs:string" select="'test201602170101.xml'"/>

<xsl:param name="items-to-copy" as="xs:integer" select="4"/>
<xsl:variable name="children-to-copy" as="xs:integer" select="$items-to-copy + 1"/>

<xsl:mode streamable="yes"/>

<xsl:output indent="yes"/>

<xsl:template name="xsl:initial-template">
  <xsl:stream href="{$input-uri}">
    <xsl:apply-templates/>
  </xsl:stream>
</xsl:template>

<xsl:template match="/*">
  <xsl:copy>
    <xsl:iterate select="*">
      <xsl:copy-of select="."/>
      <xsl:if test="position() eq $children-to-copy">
        <xsl:break/>
      </xsl:if>
    </xsl:iterate>
  </xsl:copy>
</xsl:template>

</xsl:stylesheet>

with the following input:

<root>
  <header>...</header>
  <item name="1">...</item>
  <item name="2">...</item>
  <item name="3">...</item>
  <item name="4">...</item>
  <item>
</root>

This input is deliberately not well-formed.

He ran the example with Saxon as well, which threw no error. My product threw a
rather unclear internal error which is clearly a bug. 

However, this shows a peculiar situation that may arise with non-well-formed
documents. I would dispute that the error can be ignored in this case,
because the xsl:copy is shallow-copying the <root> element. To complete that
copy, the processor needs to read through to the end.

If the template were written differently, this error might not need to be raised:

<xsl:template match="/*">
   <xsl:element name="{name()}">
      <xsl:iterate....>
   </xsl:element>
</xsl:template>

But even then, whether or not an error is raised will be entirely
implementation-dependent.

I am wondering if we can make this more interoperable, for instance by
requiring an option to at least read through to the end. This will not always be
feasible, hence it must be a user option, but one that a processor *must*
support.
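
To illustrate the difference between stopping early and reading through to the
end, here is a minimal sketch in Python rather than XSLT, using the standard
library's xml.etree.ElementTree.XMLPullParser. The function copy_children and
its read_to_end flag are hypothetical names made up for this illustration; they
do not correspond to anything in the XSLT 3.0 spec, Saxon or Exselt. The input
is fed to the parser line by line to imitate streaming.

```python
import xml.etree.ElementTree as ET

# The non-well-formed input from the report: the last <item> is never closed.
DOC = """<root>
  <header>...</header>
  <item name="1">...</item>
  <item name="2">...</item>
  <item name="3">...</item>
  <item name="4">...</item>
  <item>
</root>
"""


def copy_children(text, limit, read_to_end=False):
    """Return the names of the first `limit` children of the root element.

    With read_to_end=False, reading stops once enough children were seen.
    With read_to_end=True (a hypothetical user option), the rest of the
    input is parsed too, so a later well-formedness error surfaces as
    xml.etree.ElementTree.ParseError.
    """
    parser = ET.XMLPullParser(events=("start", "end"))
    copied = []
    depth = 0
    for line in text.splitlines(keepends=True):
        if len(copied) >= limit and not read_to_end:
            break  # early exit: the remainder is never handed to the parser
        parser.feed(line)
        # read_events() re-raises any parse error queued by feed()
        for event, elem in parser.read_events():
            if event == "start":
                depth += 1
            else:
                depth -= 1
                # depth == 1 after an end tag: a child of the root has closed
                if depth == 1 and len(copied) < limit:
                    copied.append(elem.tag)
    if read_to_end:
        parser.close()  # also reports input that ends too early
    return copied
```

Under these assumptions, copy_children(DOC, 5) returns the five child names
without complaint, while copy_children(DOC, 5, read_to_end=True) raises
ParseError when it reaches the mismatched </root>. The sketch says nothing
about how an XSLT processor should decide when it may stop; it only shows that
the same input gives different outcomes depending on that decision.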

Conversely, how much a processor looks ahead before it "breaks" further
processing (recall that <xsl:break> is not a real break, it just skips over the
next items, it doesn't mean that these items should not be processed) is
implementation defined, but I wonder if we could be more prescriptive about
where and when a processor is really allowed to skip further processing of a
document.

The main use case for adding the line quoted above is a user who is interested
only in a certain leaf node, or in the existence of one, so that further
processing is not needed. The problem is: can we define when "further
processing is not needed"?

Received on Thursday, 18 February 2016 12:27:08 UTC
