- From: <bugzilla@jessica.w3.org>
- Date: Mon, 14 Feb 2011 09:50:37 +0000
- To: public-qt-comments@w3.org
http://www.w3.org/Bugs/Public/show_bug.cgi?id=12057

           Summary: [FT] Sentence breaks
           Product: XPath / XQuery / XSLT
           Version: Proposed Recommendation
          Platform: PC
        OS/Version: Windows NT
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Full Text 1.0
        AssignedTo: jim.melton@acm.org
        ReportedBy: tim@cbcl.co.uk
         QAContact: public-qt-comments@w3.org

In Section 3 (Full-Text Selections), the text

"This sample tokenization uses white space, punctuation and XML tags as word-breakers and <p> for paragraph boundaries. The results may be different for other tokenizations."

fails to state what rule has been used to identify sentence boundaries.

The guidelines for running the test suite give the rule as:

"sentences are separated by a period (a/k/a "full stop") followed immediately by white space,"

The example in 3 Full-Text Selections uses the following XML.

<books>
  <book number="1">
    <title shortTitle="Improving Web Site Usability">Improving the Usability of a Web Site Through Expert Reviews and Usability Testing</title>
    <author>Millicent Marigold</author>
    <author>Montana Marigold</author>
    <editor>Véra Tudor-Medina</editor>
    <content>
      <p>The usability of a Web site is how well the site supports the users in achieving specified goals. A Web site should facilitate learning, and enable efficient and effective task completion, while propagating few errors. </p>
      <note>This book has been approved by the Web Site Users Association. </note>
    </content>
  </book>
</books>

Following the rule for sentence breaking from the test suite guidelines, test examples-364-2, derived from section 3.6.4 of the specification, appears to be incorrect.

The specification says:

The following expression returns true, because the tokens "usability" and "Marigold" are contained within different sentences:

//book contains text "usability" ftand "Marigold" different sentence

However, under that rule the first sentence break appears after the word "goals", so every occurrence of the two words falls within the same (first) sentence.
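The reading above can be checked mechanically. The following is a minimal Python sketch, not the test suite's tokenizer. It assumes the test-suite rule (a period followed immediately by white space ends a sentence), treats tags as word breakers by joining text nodes with a space, and compares tokens case-insensitively. The helper names `sentences`, `sentence_ids` and `in_different_sentences` are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

SAMPLE = """<books> <book number="1">
<title shortTitle="Improving Web Site Usability">Improving the Usability of a Web Site Through Expert Reviews and Usability Testing</title>
<author>Millicent Marigold</author> <author>Montana Marigold</author>
<editor>Véra Tudor-Medina</editor>
<content>
<p>The usability of a Web site is how well the site supports the users in achieving specified goals. A Web site should facilitate learning, and enable efficient and effective task completion, while propagating few errors. </p>
<note>This book has been approved by the Web Site Users Association. </note>
</content> </book> </books>"""


def sentences(text):
    # Assumed rule: a sentence ends at a period followed immediately by white space.
    return [s for s in re.split(r"(?<=\.)\s+", text) if s.strip()]


def sentence_ids(element, word):
    # Assumption: tags act as word breakers, so text nodes are joined with a space.
    text = " ".join(element.itertext())
    ids = set()
    for index, sentence in enumerate(sentences(text)):
        tokens = re.findall(r"\w+", sentence.lower())
        if word.lower() in tokens:
            ids.add(index)
    return ids


def in_different_sentences(element, first, second):
    # True if some occurrence of `first` and some occurrence of `second`
    # lie in sentences with different indexes.
    return any(i != j
               for i in sentence_ids(element, first)
               for j in sentence_ids(element, second))


book = ET.fromstring(SAMPLE).find("book")
print(sentence_ids(book, "usability"))                        # {0}
print(sentence_ids(book, "Marigold"))                         # {0}
print(in_different_sentences(book, "usability", "Marigold"))  # False
```

Under these assumptions both words occur only in sentence 0, so the "different sentence" condition cannot be satisfied.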
There is no suggestion in the text that the beginning (end) of a paragraph necessarily starts (ends) a sentence.

It is also unclear how paragraph boundaries are identified. Consider the following input:

<root>
  A
  <p>B</p>
  C
</root>

I can see three possibilities:

1. There are three paragraphs: one containing A, one containing B and one containing C.
2. There are two paragraphs: one containing A, one containing B C.
3. There are two paragraphs: one containing A B, one containing C.

It is not clear from the specification which interpretation is correct.
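The three readings can be made concrete with a small Python sketch. It is illustrative only and takes no position on what the specification intends. The hypothetical function `paragraphs` takes two flags saying whether a `<p>` start tag and a `</p>` end tag each act as a paragraph boundary, and it handles only attribute-free tags like those in the input above.

```python
import re

# Matches either a simple tag such as <p> or </p>, or a word.
TOKEN = re.compile(r"<(/?)(\w+)>|(\w+)")


def paragraphs(xml, break_on_open, break_on_close):
    groups, current = [], []
    for match in TOKEN.finditer(xml):
        closing, tag, word = match.groups()
        if word:
            current.append(word)
        elif tag == "p":
            # Decide whether this particular <p> or </p> is a boundary.
            is_break = break_on_close if closing else break_on_open
            if is_break and current:
                groups.append(current)
                current = []
    if current:
        groups.append(current)
    return groups


doc = "<root> A <p>B</p> C </root>"
print(paragraphs(doc, True, True))    # [['A'], ['B'], ['C']]
print(paragraphs(doc, True, False))   # [['A'], ['B', 'C']]
print(paragraphs(doc, False, True))   # [['A', 'B'], ['C']]
```

Each flag combination reproduces one of the three possibilities, which shows that the outcome depends entirely on a choice the specification text does not appear to make.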
Received on Monday, 14 February 2011 09:50:39 UTC