- From: <bugzilla@jessica.w3.org>
- Date: Mon, 25 Jul 2016 10:26:49 +0000
- To: public-qt-comments@w3.org
https://www.w3.org/Bugs/Public/show_bug.cgi?id=29752
Bug ID: 29752
Summary: [XSLT30]two accumulator examples using
count(tokenize(., '\s+')) respectively
count(tokenize(., '\W+')) to count words give odd
results
Product: XPath / XQuery / XSLT
Version: Candidate Recommendation
Hardware: PC
OS: Windows NT
Status: NEW
Severity: normal
Priority: P2
Component: XSLT 3.0
Assignee: mike@saxonica.com
Reporter: martin.honnen@gmx.de
QA Contact: public-qt-comments@w3.org
Target Milestone: ---
The specification gives two accumulator examples that are said to count words,
one per section and one for the whole document. The first is in
https://www.w3.org/XML/Group/qtspecs/specifications/xslt-30/html/#func-accumulator-after
and defines
<xsl:accumulator name="w" initial-value="0" streamable="true" as="xs:integer">
  <xsl:accumulator-rule match="text()"
      select="$value + count(tokenize(., '\s+'))"/>
</xsl:accumulator>
and
<xsl:template match="section">
  <xsl:apply-templates/>
  (words: <xsl:value-of select="accumulator-after('w') -
      accumulator-before('w')"/>)
</xsl:template>
The second is in the section
https://www.w3.org/XML/Group/qtspecs/specifications/xslt-30/html/#accumulator-examples
and defines
<xsl:accumulator name="word-count"
    as="xs:integer"
    initial-value="0">
  <xsl:accumulator-rule match="text()"
      select="$value + count(tokenize(string(.), '\W+'))"/>
</xsl:accumulator>
and
<xsl:template match="/">
  <xsl:apply-templates/>
  <p>Word count: <xsl:value-of
      select="accumulator-after('word-count')"/></p>
</xsl:template>
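I realize the examples are meant to be short and to illustrate the use of
accumulators rather than to provide a good word count implementation. But when
I test them on documents containing whitespace-only text nodes, both approaches
give word counts that are clearly higher than what a human reader would count.
The reason is that tokenize() returns a zero-length string whenever the
separator matches at the very start or very end of its input, so a
whitespace-only text node adds 2 to the count under either pattern, and with
'\W+' a text node ending in punctuation adds one more.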
For instance, for the input
<?xml version="1.0" encoding="UTF-8"?>
<doc>
<section id="sec1">This is a quick test.</section>
<section id="sec2">
<p>The quick <b>brown</b> fox jumped over the lazy dog.</p>
</section>
</doc>
the complete stylesheet using tokenize(., '\W+') looks like this
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:math="http://www.w3.org/2005/xpath-functions/math"
    exclude-result-prefixes="xs math"
    version="3.0">
  <xsl:mode on-no-match="shallow-copy" streamable="yes"/>
  <xsl:global-context-item use-accumulators="w" streamable="yes"/>
  <xsl:accumulator name="w" initial-value="0" streamable="true"
      as="xs:integer">
    <xsl:accumulator-rule match="text()"
        select="$value + count(tokenize(., '\W+'))"/>
  </xsl:accumulator>
  <xsl:template match="/*">
    <xsl:copy>
      <xsl:apply-templates/>
      <p>Total count of words in document : <xsl:value-of
          select="accumulator-after('w')"/></p>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="section">
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <xsl:apply-templates/>
      <p>(words: <xsl:value-of
          select="accumulator-after('w') - accumulator-before('w')"/>)</p>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
and, when run with Saxon 9.7 EE, it outputs
<?xml version="1.0" encoding="UTF-8"?><doc>
<section id="sec1">This is a quick test.<p>(words: 6)</p></section>
<section id="sec2">
<p>The quick <b>brown</b> fox jumped over the lazy dog.</p>
<p>(words: 16)</p></section>
<p>Total count of words in document : 28</p></doc>
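The inflated numbers above can be reproduced in isolation. The following is
only a sketch: it applies tokenize(., '\W+') to strings copied from the text
nodes of the sample input, and the whitespace-only string (a newline plus two
spaces) is an assumption about how the input was indented. It is meant to be
run through the default initial template xsl:initial-template.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="3.0">
  <xsl:output method="text"/>
  <xsl:template name="xsl:initial-template">
    <!-- One string per text node: the sec1 text, the three text nodes
         inside p, and a whitespace-only node (assumed indentation). -->
    <xsl:for-each select="('This is a quick test.',
                           'The quick ',
                           'brown',
                           ' fox jumped over the lazy dog.',
                           '&#10;  ')">
      <!-- A separator matching at the start or end of the string
           yields a zero-length token, which count() includes. -->
      <xsl:value-of select="count(tokenize(., '\W+'))"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

This should print 6, 3, 1, 8 and 2. For section sec2, the three text nodes
inside p contribute 3 + 1 + 8 = 12, and the whitespace-only text nodes before
and after p contribute 2 each, which gives the reported 16.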
Since the accumulators in both spec examples are meant to do the same thing,
namely count the words in text nodes, I wonder whether the spec could include
a slightly longer but more precise accumulator definition
in the form of
<xsl:accumulator name="w" initial-value="0" streamable="true"
    as="xs:integer">
  <xsl:accumulator-rule match="text()">
    <xsl:variable name="words" as="xs:string*">
      <xsl:analyze-string select="." regex="\w+">
        <xsl:matching-substring>
          <xsl:sequence select="regex-group(0)"/>
        </xsl:matching-substring>
      </xsl:analyze-string>
    </xsl:variable>
    <xsl:sequence select="$value + count($words)"/>
  </xsl:accumulator-rule>
</xsl:accumulator>
once in the spec, and have both spec examples use that accumulator
definition.
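If the length of the definition is a concern, a shorter variant might keep the
one-line select attribute and simply discard the zero-length tokens. This is
only a sketch and not the definition proposed above. It is written without
streaming because its streamability has not been checked, and the template is
a hypothetical minimal one that just reports the total.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs"
    version="3.0">
  <xsl:accumulator name="w" initial-value="0" as="xs:integer">
    <!-- The predicate drops the zero-length strings that tokenize()
         returns for leading or trailing separators, so only maximal
         runs of word characters are counted. -->
    <xsl:accumulator-rule match="text()"
        select="$value + count(tokenize(., '\W+')[. ne ''])"/>
  </xsl:accumulator>
  <!-- Hypothetical minimal template: report the total for the document. -->
  <xsl:template match="/">
    <p>Word count: <xsl:value-of select="accumulator-after('w')"/></p>
  </xsl:template>
</xsl:stylesheet>
```

The non-empty tokens produced by '\W+' are the same substrings that the regex
\w+ matches, so on the sample input this variant should count the same words
as the xsl:analyze-string definition.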
The word count results with a full stylesheet
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:math="http://www.w3.org/2005/xpath-functions/math"
    exclude-result-prefixes="xs math"
    version="3.0">
  <xsl:mode on-no-match="shallow-copy" streamable="yes"/>
  <xsl:global-context-item use-accumulators="w" streamable="yes"/>
  <xsl:accumulator name="w" initial-value="0" streamable="true"
      as="xs:integer">
    <xsl:accumulator-rule match="text()">
      <xsl:variable name="words" as="xs:string*">
        <xsl:analyze-string select="." regex="\w+">
          <xsl:matching-substring>
            <xsl:sequence select="regex-group(0)"/>
          </xsl:matching-substring>
        </xsl:analyze-string>
      </xsl:variable>
      <xsl:sequence select="$value + count($words)"/>
    </xsl:accumulator-rule>
  </xsl:accumulator>
  <xsl:template match="/*">
    <xsl:copy>
      <xsl:apply-templates/>
      <p>Total count of words in document : <xsl:value-of
          select="accumulator-after('w')"/></p>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="section">
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <xsl:apply-templates/>
      <p>(words: <xsl:value-of
          select="accumulator-after('w') - accumulator-before('w')"/>)</p>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
are then
<?xml version="1.0" encoding="UTF-8"?><doc>
<section id="sec1">This is a quick test.<p>(words: 5)</p></section>
<section id="sec2">
<p>The quick <b>brown</b> fox jumped over the lazy dog.</p>
<p>(words: 9)</p></section>
<p>Total count of words in document : 14</p></doc>
which seems more natural and correct as a word count.
Received on Monday, 25 July 2016 10:26:58 UTC