- From: <bugzilla@jessica.w3.org>
- Date: Mon, 25 Jul 2016 10:26:49 +0000
- To: public-qt-comments@w3.org
https://www.w3.org/Bugs/Public/show_bug.cgi?id=29752

          Bug ID: 29752
         Summary: [XSLT30]two accumulator examples using count(tokenize(., '\s+')) respectively count(tokenize(., '\W+')) to count words give odd results
         Product: XPath / XQuery / XSLT
         Version: Candidate Recommendation
        Hardware: PC
              OS: Windows NT
          Status: NEW
        Severity: normal
        Priority: P2
       Component: XSLT 3.0
        Assignee: mike@saxonica.com
        Reporter: martin.honnen@gmx.de
      QA Contact: public-qt-comments@w3.org
Target Milestone: ---

The section about accumulators gives two examples that are said to count words, one per section and one for the whole document. The first is in
https://www.w3.org/XML/Group/qtspecs/specifications/xslt-30/html/#func-accumulator-after
and defines

    <xsl:accumulator name="w" initial-value="0" streamable="true" as="xs:integer">
      <xsl:accumulator-rule match="text()" select="$value + count(tokenize(., '\s+'))"/>
    </xsl:accumulator>

and

    <xsl:template match="section">
      <xsl:apply-templates/>
      (words: <xsl:value-of select="accumulator-after('w') - accumulator-before('w')"/>)
    </xsl:template>

The second is in the section
https://www.w3.org/XML/Group/qtspecs/specifications/xslt-30/html/#accumulator-examples
and defines

    <xsl:accumulator name="word-count" as="xs:integer" initial-value="0">
      <xsl:accumulator-rule match="text()" select="$value + count(tokenize(string(.), '\W+'))"/>
    </xsl:accumulator>

and

    <xsl:template match="/">
      <xsl:apply-templates/>
      <p>Word count: <xsl:value-of select="accumulator-after('word-count')"/></p>
    </xsl:template>

I realize the examples are supposed to be short and to illustrate the use of accumulators rather than to provide a good word count implementation. However, when I test them on documents containing any white space text nodes, both approaches give odd results: the word count is much higher than what a human reader would count.
For instance, for the input

    <?xml version="1.0" encoding="UTF-8"?>
    <doc>
      <section id="sec1">This is a quick test.</section>
      <section id="sec2">
        <p>The quick <b>brown</b> fox jumped over the lazy dog.</p>
      </section>
    </doc>

the complete stylesheet using tokenize(., '\W+') is

    <?xml version="1.0" encoding="UTF-8"?>
    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:xs="http://www.w3.org/2001/XMLSchema"
        xmlns:math="http://www.w3.org/2005/xpath-functions/math"
        exclude-result-prefixes="xs math"
        version="3.0">

      <xsl:mode on-no-match="shallow-copy" streamable="yes"/>

      <xsl:global-context-item use-accumulators="w" streamable="yes"/>

      <xsl:accumulator name="w" initial-value="0" streamable="true" as="xs:integer">
        <xsl:accumulator-rule match="text()" select="$value + count(tokenize(., '\W+'))"/>
      </xsl:accumulator>

      <xsl:template match="/*">
        <xsl:copy>
          <xsl:apply-templates/>
          <p>Total count of words in document : <xsl:value-of select="accumulator-after('w')"/></p>
        </xsl:copy>
      </xsl:template>

      <xsl:template match="section">
        <xsl:copy>
          <xsl:apply-templates select="@*"/>
          <xsl:apply-templates/>
          <p>(words: <xsl:value-of select="accumulator-after('w') - accumulator-before('w')"/>)</p>
        </xsl:copy>
      </xsl:template>

    </xsl:stylesheet>

and when run with Saxon 9.7 EE it outputs

    <?xml version="1.0" encoding="UTF-8"?><doc>
      <section id="sec1">This is a quick test.<p>(words: 6)</p></section>
      <section id="sec2">
        <p>The quick <b>brown</b> fox jumped over the lazy dog.</p>
      <p>(words: 16)</p></section>
    <p>Total count of words in document : 28</p></doc>

Both examples in the spec want their accumulator to do the same thing, namely count the words in text nodes. I therefore wonder whether it would be possible to include a slightly longer but more precise accumulator definition of the form

    <xsl:accumulator name="w" initial-value="0" streamable="true" as="xs:integer">
      <xsl:accumulator-rule match="text()">
        <xsl:variable name="words" as="xs:string*">
          <xsl:analyze-string select="." regex="\w+">
            <xsl:matching-substring>
              <xsl:sequence select="regex-group(0)"/>
            </xsl:matching-substring>
          </xsl:analyze-string>
        </xsl:variable>
        <xsl:sequence select="$value + count($words)"/>
      </xsl:accumulator-rule>
    </xsl:accumulator>

once in the spec and have both spec examples use that accumulator definition.

The word count results with the full stylesheet

    <?xml version="1.0" encoding="UTF-8"?>
    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:xs="http://www.w3.org/2001/XMLSchema"
        xmlns:math="http://www.w3.org/2005/xpath-functions/math"
        exclude-result-prefixes="xs math"
        version="3.0">

      <xsl:mode on-no-match="shallow-copy" streamable="yes"/>

      <xsl:global-context-item use-accumulators="w" streamable="yes"/>

      <xsl:accumulator name="w" initial-value="0" streamable="true" as="xs:integer">
        <xsl:accumulator-rule match="text()">
          <xsl:variable name="words" as="xs:string*">
            <xsl:analyze-string select="." regex="\w+">
              <xsl:matching-substring>
                <xsl:sequence select="regex-group(0)"/>
              </xsl:matching-substring>
            </xsl:analyze-string>
          </xsl:variable>
          <xsl:sequence select="$value + count($words)"/>
        </xsl:accumulator-rule>
      </xsl:accumulator>

      <xsl:template match="/*">
        <xsl:copy>
          <xsl:apply-templates/>
          <p>Total count of words in document : <xsl:value-of select="accumulator-after('w')"/></p>
        </xsl:copy>
      </xsl:template>

      <xsl:template match="section">
        <xsl:copy>
          <xsl:apply-templates select="@*"/>
          <xsl:apply-templates/>
          <p>(words: <xsl:value-of select="accumulator-after('w') - accumulator-before('w')"/>)</p>
        </xsl:copy>
      </xsl:template>

    </xsl:stylesheet>

are then

    <?xml version="1.0" encoding="UTF-8"?><doc>
      <section id="sec1">This is a quick test.<p>(words: 5)</p></section>
      <section id="sec2">
        <p>The quick <b>brown</b> fox jumped over the lazy dog.</p>
      <p>(words: 9)</p></section>
    <p>Total count of words in document : 14</p></doc>

which seems more natural and correct as a word count.
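The inflated numbers can be traced to empty tokens. Splitting on a separator regex yields a zero-length string wherever the separator sits at the start or end of a text node, and a white-space-only text node yields two of them. The following Python sketch is not XSLT: it uses re.split and re.findall as stand-ins for tokenize() and for matching \w+, on the assumption that they behave alike for this plain ASCII input. The names WS, SEC1, SEC2, DOC and both helper functions are made up for illustration, and the exact white space between the elements is assumed.

```python
import re

# Text nodes of the sample input in document order. The white space between
# elements is an assumption (newline plus indentation); any non-empty
# white-space-only string gives the same counts.
WS = "\n  "
SEC1 = ["This is a quick test."]
SEC2 = [WS, "The quick ", "brown", " fox jumped over the lazy dog.", WS]
DOC = [WS] + SEC1 + [WS] + SEC2 + [WS]


def count_by_splitting(text_nodes):
    """Mirror count(tokenize(., '\\W+')) summed over the text nodes."""
    # Empty strings at the start or end of each split result are counted too.
    return sum(len(re.split(r"\W+", t)) for t in text_nodes)


def count_by_matching(text_nodes):
    """Mirror counting the \\w+ matches found in each text node."""
    return sum(len(re.findall(r"\w+", t)) for t in text_nodes)


if __name__ == "__main__":
    for label, nodes in (("sec1", SEC1), ("sec2", SEC2), ("doc", DOC)):
        print(label, count_by_splitting(nodes), count_by_matching(nodes))
```

Under these assumptions the split-based counts come out as 6, 16 and 28 and the match-based counts as 5, 9 and 14, which agrees with the two Saxon outputs above.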
-- You are receiving this mail because: You are the QA Contact for the bug.
Received on Monday, 25 July 2016 10:26:58 UTC