[Bug 29752] New: [XSLT30]two accumulator examples using count(tokenize(., '\s+')) respectively count(tokenize(., '\W+')) to count words give odd results

https://www.w3.org/Bugs/Public/show_bug.cgi?id=29752

            Bug ID: 29752
           Summary: [XSLT30]two accumulator examples using
                    count(tokenize(., '\s+')) respectively
                    count(tokenize(., '\W+')) to count words give odd
                    results
           Product: XPath / XQuery / XSLT
           Version: Candidate Recommendation
          Hardware: PC
                OS: Windows NT
            Status: NEW
          Severity: normal
          Priority: P2
         Component: XSLT 3.0
          Assignee: mike@saxonica.com
          Reporter: martin.honnen@gmx.de
        QA Contact: public-qt-comments@w3.org
  Target Milestone: ---

The section about accumulators gives two examples that are said to count the
words in each section and in the whole document, respectively. The first is in
https://www.w3.org/XML/Group/qtspecs/specifications/xslt-30/html/#func-accumulator-after
and defines 

<xsl:accumulator name="w" initial-value="0" streamable="true" as="xs:integer">
   <xsl:accumulator-rule match="text()" 
                         select="$value + count(tokenize(., '\s+'))"/>
</xsl:accumulator>

and 

<xsl:template match="section">
   <xsl:apply-templates/>
   (words: <xsl:value-of select="accumulator-after('w') - accumulator-before('w')"/>)
</xsl:template>

the other is in the section
https://www.w3.org/XML/Group/qtspecs/specifications/xslt-30/html/#accumulator-examples
and defines 


  <xsl:accumulator name="word-count" 
                   as="xs:integer" 
                   initial-value="0">
    <xsl:accumulator-rule match="text()" 
         select="$value + count(tokenize(string(.), '\W+'))"/>
  </xsl:accumulator>

and

   <xsl:template match="/">
     <xsl:apply-templates/>
     <p>Word count: <xsl:value-of select="accumulator-after('word-count')"/></p>
   </xsl:template>


I realize the examples are meant to be short and to illustrate the use of
accumulators rather than to provide a good word count implementation. However,
when I test them on documents containing any whitespace text nodes, both
approaches give odd results that are too high compared to what a human reader
would count.

For instance, for the input

<?xml version="1.0" encoding="UTF-8"?>
<doc>
        <section id="sec1">This is a quick test.</section>
        <section id="sec2">
                <p>The quick <b>brown</b> fox jumped over the lazy dog.</p>
        </section>
</doc>

the complete stylesheet using tokenize(., '\W+') is as follows

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:xs="http://www.w3.org/2001/XMLSchema"
        xmlns:math="http://www.w3.org/2005/xpath-functions/math"
        exclude-result-prefixes="xs math"
        version="3.0">

        <xsl:mode on-no-match="shallow-copy" streamable="yes"/>
        <xsl:global-context-item use-accumulators="w" streamable="yes"/>

        <xsl:accumulator name="w" initial-value="0" streamable="true" as="xs:integer">
                <xsl:accumulator-rule match="text()"
                        select="$value + count(tokenize(., '\W+'))"/>
        </xsl:accumulator>

        <xsl:template match="/*">
                <xsl:copy>
                        <xsl:apply-templates/>
                        <p>Total count of words in document : <xsl:value-of select="accumulator-after('w')"/></p>
                </xsl:copy>
        </xsl:template>

        <xsl:template match="section">
                <xsl:copy>
                        <xsl:apply-templates select="@*"/>
                        <xsl:apply-templates/>
                        <p>(words: <xsl:value-of select="accumulator-after('w') - accumulator-before('w')"/>)</p>
                </xsl:copy>
        </xsl:template>

</xsl:stylesheet>


and, when run with Saxon 9.7 EE, it outputs

<?xml version="1.0" encoding="UTF-8"?><doc>
        <section id="sec1">This is a quick test.<p>(words: 6)</p></section>
        <section id="sec2">
                <p>The quick <b>brown</b> fox jumped over the lazy dog.</p>
        <p>(words: 16)</p></section>
<p>Total count of words in document : 28</p></doc>
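
These numbers can be traced to how tokenize() treats a separator at either end
of its input: a match at the very start or very end of the string yields a
zero-length token. A whitespace-only text node therefore counts as 2, and a
sentence ending in "." gains one extra token. That gives 6 for sec1,
2 + 3 + 1 + 8 + 2 = 16 for sec2, and 6 + 16 + 3 * 2 = 28 for the document.
A minimal sketch to check the individual counts, assuming a processor that
accepts xsl:initial-template as the entry point (the string literals are
simply text nodes taken from the sample input, with the whitespace-only node
shortened to three spaces):

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        version="3.0">

        <xsl:output method="text"/>

        <xsl:template name="xsl:initial-template">
                <!-- trailing "." is a separator, so an empty token follows "test": 6 -->
                <xsl:value-of select="count(tokenize('This is a quick test.', '\W+'))"/>
                <xsl:text>&#10;</xsl:text>
                <!-- whitespace only: one separator with an empty token on each side: 2 -->
                <xsl:value-of select="count(tokenize('   ', '\W+'))"/>
                <xsl:text>&#10;</xsl:text>
                <!-- trailing space adds an empty token after "quick": 3 -->
                <xsl:value-of select="count(tokenize('The quick ', '\W+'))"/>
                <xsl:text>&#10;</xsl:text>
                <!-- leading space and trailing "." each add an empty token: 8 -->
                <xsl:value-of select="count(tokenize(' fox jumped over the lazy dog.', '\W+'))"/>
        </xsl:template>

</xsl:stylesheet>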


Since the accumulators in both spec examples are meant to do the same thing,
namely count the words in text nodes, I wonder whether it would be possible to
include a slightly longer but more precise accumulator definition of the
form

        <xsl:accumulator name="w" initial-value="0" streamable="true" as="xs:integer">
                <xsl:accumulator-rule match="text()">
                        <xsl:variable name="words" as="xs:string*">
                                <xsl:analyze-string select="." regex="\w+">
                                        <xsl:matching-substring>
                                                <xsl:sequence select="regex-group(0)"/>
                                        </xsl:matching-substring>
                                </xsl:analyze-string>
                        </xsl:variable>
                        <xsl:sequence select="$value + count($words)"/>
                </xsl:accumulator-rule>
        </xsl:accumulator>

once in the spec and have both spec examples reference/use that accumulator
definition.
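
A shorter variant that should give the same counts keeps the select attribute
and simply discards the zero-length tokens, since the non-empty tokens between
\W+ separators are exactly the \w+ runs. This is only a sketch and an
assumption rather than something tested in this report, in particular with
respect to the streamability rules; it is meant to replace the accumulator in
a stylesheet that already declares the xs prefix:

        <xsl:accumulator name="w" initial-value="0" streamable="true" as="xs:integer">
                <!-- [. ne ''] drops the zero-length tokens that tokenize() returns
                     when the text starts or ends with a separator -->
                <xsl:accumulator-rule match="text()"
                        select="$value + count(tokenize(., '\W+')[. ne ''])"/>
        </xsl:accumulator>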

The word count results with a full stylesheet

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:xs="http://www.w3.org/2001/XMLSchema"
        xmlns:math="http://www.w3.org/2005/xpath-functions/math"
        exclude-result-prefixes="xs math"
        version="3.0">

        <xsl:mode on-no-match="shallow-copy" streamable="yes"/>
        <xsl:global-context-item use-accumulators="w" streamable="yes"/>

        <xsl:accumulator name="w" initial-value="0" streamable="true" as="xs:integer">
                <xsl:accumulator-rule match="text()">
                        <xsl:variable name="words" as="xs:string*">
                                <xsl:analyze-string select="." regex="\w+">
                                        <xsl:matching-substring>
                                                <xsl:sequence select="regex-group(0)"/>
                                        </xsl:matching-substring>
                                </xsl:analyze-string>
                        </xsl:variable>
                        <xsl:sequence select="$value + count($words)"/>
                </xsl:accumulator-rule>
        </xsl:accumulator>

        <xsl:template match="/*">
                <xsl:copy>
                        <xsl:apply-templates/>
                        <p>Total count of words in document : <xsl:value-of select="accumulator-after('w')"/></p>
                </xsl:copy>
        </xsl:template>

        <xsl:template match="section">
                <xsl:copy>
                        <xsl:apply-templates select="@*"/>
                        <xsl:apply-templates/>
                        <p>(words: <xsl:value-of select="accumulator-after('w') - accumulator-before('w')"/>)</p>
                </xsl:copy>
        </xsl:template>

</xsl:stylesheet>


are then

<?xml version="1.0" encoding="UTF-8"?><doc>
        <section id="sec1">This is a quick test.<p>(words: 5)</p></section>
        <section id="sec2">
                <p>The quick <b>brown</b> fox jumped over the lazy dog.</p>
        <p>(words: 9)</p></section>
<p>Total count of words in document : 14</p></doc>

which seems more natural and correct as a word count.

Received on Monday, 25 July 2016 10:26:58 UTC