- From: <bugzilla@jessica.w3.org>
- Date: Mon, 14 Feb 2011 09:50:37 +0000
- To: public-qt-comments@w3.org
http://www.w3.org/Bugs/Public/show_bug.cgi?id=12057
Summary: [FT] Sentence breaks
Product: XPath / XQuery / XSLT
Version: Proposed Recommendation
Platform: PC
OS/Version: Windows NT
Status: NEW
Severity: normal
Priority: P2
Component: Full Text 1.0
AssignedTo: jim.melton@acm.org
ReportedBy: tim@cbcl.co.uk
QAContact: public-qt-comments@w3.org
In Section 3, Full-Text Selections, the text
"This sample tokenization uses white space, punctuation and XML tags as
word-breakers and <p> for paragraph boundaries. The results may be different
for other tokenizations."
fails to state what rule has been used to identify sentence boundaries. The
guidelines for running the test suite give the rule as:
"sentences are separated by a period (a/k/a "full stop") followed immediately
by white space,"
The example in 3 Full-Text Selections uses the following XML.
<books>
<book number="1">
<title shortTitle="Improving Web Site Usability">Improving
the Usability of a Web Site Through Expert Reviews and
Usability Testing</title>
<author>Millicent Marigold</author>
<author>Montana Marigold</author>
<editor>Véra Tudor-Medina</editor>
<content>
<p>The usability of a Web site is how well the
site supports the users in achieving specified
goals. A Web site should facilitate learning,
and enable efficient and effective task
completion, while propagating few errors.
</p>
<note>This book has been approved by the Web Site
Users Association.
</note>
</content>
</book>
</books>
Following the rule for sentence breaking from the test suite guidelines, test
examples-364-2, derived from section 3.6.4 of the specification, appears to be
incorrect. The specification says:
The following expression returns true, because the tokens "usability" and
"Marigold" are contained within different sentences:
//book contains text "usability" ftand "Marigold" different sentence
However, the first sentence break appears after the word "goals", so the two
words only ever appear in the same sentence.
There is no suggestion in the text that the beginning (end) of a paragraph
necessarily starts (ends) a sentence.
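To make the point concrete, here is a minimal sketch that applies the test
suite rule (a period followed immediately by white space ends a sentence) to
the text content of the sample <book>. It assumes that attribute values are
not tokenized and that the text nodes are separated by white space; the names
BOOK_TEXT, sentence_numbers and different_sentence are hypothetical and come
from neither the specification nor the test suite.

```python
import re

# Text content of the sample <book> in document order. Assumption: attributes
# such as shortTitle are excluded and text nodes are joined by white space.
BOOK_TEXT = " ".join([
    "Improving the Usability of a Web Site Through Expert Reviews and "
    "Usability Testing",
    "Millicent Marigold",
    "Montana Marigold",
    "Véra Tudor-Medina",
    "The usability of a Web site is how well the site supports the users in "
    "achieving specified goals. A Web site should facilitate learning, "
    "and enable efficient and effective task completion, while propagating "
    "few errors.",
    "This book has been approved by the Web Site Users Association.",
])


def sentence_numbers(text):
    """Map each lower-cased token to the set of sentence numbers it occurs in."""
    positions = {}
    # Test suite rule: a sentence ends at a period immediately followed by
    # white space. Nothing else (tags, paragraphs) is treated as a break.
    sentences = re.split(r"(?<=\.)\s+", text)
    for number, sentence in enumerate(sentences, start=1):
        for token in re.findall(r"\w+", sentence.lower()):
            positions.setdefault(token, set()).add(number)
    return positions


def different_sentence(positions, first, second):
    """True if some occurrence of each token lies in a different sentence."""
    return any(a != b
               for a in positions.get(first, ())
               for b in positions.get(second, ()))


positions = sentence_numbers(BOOK_TEXT)
print(positions["usability"], positions["marigold"])
print(different_sentence(positions, "usability", "marigold"))
```

Under these assumptions every occurrence of "usability" and "Marigold" falls
in the first sentence, so the check reports that the two tokens never occur
in different sentences.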
It is also unclear how paragraph boundaries are identified. Consider the
following input:
<root>
A <p>B</p> C
</root>
I can see three possibilities:
1. There are three paragraphs: one containing A, one containing B and one
containing C.
2. There are two paragraphs: one containing A, one containing B C.
3. There are two paragraphs: one containing A B, one containing C.
It is not clear from the specification which interpretation is correct.
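The three readings can be written out as data. The following is only an
illustrative sketch: it parses the input with Python's standard
xml.etree.ElementTree module and builds each candidate paragraph grouping by
hand. The variable names are hypothetical, and nothing here is meant to
suggest which reading the specification intends.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<root>\nA <p>B</p> C\n</root>")
p = root.find("p")

# Tokens before, inside and after the <p> element.
before = (root.text or "").split()   # text preceding <p>
inside = (p.text or "").split()      # text within <p>
after = (p.tail or "").split()       # text following </p>

# Each interpretation is a list of paragraphs; each paragraph a list of tokens.
interpretations = {
    1: [before, inside, after],      # <p> and </p> both mark a boundary
    2: [before, inside + after],     # only <p> marks a boundary
    3: [before + inside, after],     # only </p> marks a boundary
}

for number, paragraphs in interpretations.items():
    print(number, paragraphs)
```

A query such as "A" ftand "B" same paragraph would hold only under the third
grouping, which is why the choice matters for the test suite.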
Received on Monday, 14 February 2011 09:50:39 UTC