[Bug 12057] New: [FT] Sentence breaks

http://www.w3.org/Bugs/Public/show_bug.cgi?id=12057

           Summary: [FT] Sentence breaks
           Product: XPath / XQuery / XSLT
           Version: Proposed Recommendation
          Platform: PC
        OS/Version: Windows NT
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Full Text 1.0
        AssignedTo: jim.melton@acm.org
        ReportedBy: tim@cbcl.co.uk
         QAContact: public-qt-comments@w3.org


In Section 3, Full-Text Selections, the text

"This sample tokenization uses white space, punctuation and XML tags as
word-breakers and <p> for paragraph boundaries. The results may be different
for other tokenizations."

fails to state what rule has been used to identify sentence boundaries.  The
guidelines for running the test suite give the rule as:

"sentences are separated by a period (a/k/a "full stop") followed immediately
by white space,"

The example in Section 3, Full-Text Selections, uses the following XML:

<books>
  <book number="1">
    <title shortTitle="Improving Web Site Usability">Improving  
        the Usability of a Web Site Through Expert Reviews and
        Usability Testing</title>
    <author>Millicent Marigold</author>
    <author>Montana Marigold</author>
    <editor>Véra Tudor-Medina</editor>
    <content>
      <p>The usability of a Web site is how well the  
          site supports the users in achieving specified  
          goals. A Web site should facilitate learning,  
          and enable efficient and effective task  
          completion, while propagating few errors.
      </p>
      <note>This book has been approved by the Web Site  
          Users Association.
      </note>
    </content>
  </book>
</books>

Following the rule for sentence breaking from the test suite guidelines, test
examples-364-2, derived from section 3.6.4 of the specification, appears to be
incorrect.  The specification says:

The following expression returns true, because the tokens "usability" and
"Marigold" are contained within different sentences:

//book contains text "usability" ftand "Marigold" different sentence

However, the first sentence break appears after the word "goals", so the two
words only ever appear in the same sentence.
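
The reasoning above can be checked with a rough sketch in Python.  This is
only an illustration under stated assumptions, not the specification's
tokenizer: the helper names (sentences, tokens, sentence_ids,
different_sentence) are hypothetical, sentences are split on a period
followed by white space as in the test suite guidelines, and "different
sentence" is simplified to "some pair of occurrences lies in two different
sentences".

```python
import re
import xml.etree.ElementTree as ET

# The <book> element from the example in the specification.
BOOK_XML = """<book number="1">
  <title shortTitle="Improving Web Site Usability">Improving
      the Usability of a Web Site Through Expert Reviews and
      Usability Testing</title>
  <author>Millicent Marigold</author>
  <author>Montana Marigold</author>
  <editor>Véra Tudor-Medina</editor>
  <content>
    <p>The usability of a Web site is how well the
        site supports the users in achieving specified
        goals. A Web site should facilitate learning,
        and enable efficient and effective task
        completion, while propagating few errors.
    </p>
    <note>This book has been approved by the Web Site
        Users Association.
    </note>
  </content>
</book>"""


def sentences(text):
    # Assumed rule (test suite guidelines): a sentence ends at a period
    # that is immediately followed by white space.
    return [s for s in re.split(r"\.\s+", text) if s.strip()]


def tokens(sentence):
    # Assumed tokenization: runs of word characters, compared case-insensitively.
    return [t.lower() for t in re.findall(r"\w+", sentence)]


def sentence_ids(word, sents):
    # Indexes of the sentences that contain the given token.
    return {i for i, toks in enumerate(sents) if word in toks}


def different_sentence(a, b, sents):
    # Simplified reading: true if some occurrence of a and some occurrence
    # of b fall in two different sentences.
    return any(i != j
               for i in sentence_ids(a, sents)
               for j in sentence_ids(b, sents))


text = "".join(ET.fromstring(BOOK_XML).itertext())
sents = [tokens(s) for s in sentences(text)]

print(len(sents))                                         # 3
print(sentence_ids("usability", sents))                   # {0}
print(sentence_ids("marigold", sents))                    # {0}
print(different_sentence("usability", "marigold", sents)) # False
```

Under these assumptions every occurrence of both words falls in the first
sentence, which runs from the start of the title to "goals".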

There is no suggestion in the text that the beginning (end) of a paragraph
necessarily starts (ends) a sentence.

It is also unclear how paragraph boundaries are identified.  Consider the
following input:

<root>
  A <p>B</p> C
</root>

I can see three possibilities:

1.  There are three paragraphs: one containing A, one containing B and one
containing C.
2.  There are two paragraphs: one containing A, one containing B C.
3.  There are two paragraphs: one containing A B, one containing C.

It is not clear from the specification which interpretation is correct.
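
To make the three readings concrete, here is a small sketch in Python.  It
is hypothetical and not taken from the specification: the function
paragraphs and its flags break_on_start and break_on_end are names
introduced only for illustration, and they model whether the start tag
and/or the end tag of <p> is treated as a paragraph boundary.

```python
import xml.etree.ElementTree as ET

INPUT = "<root>\n  A <p>B</p> C\n</root>"


def paragraphs(xml, break_on_start, break_on_end):
    # Walk the tree in document order, opening a new paragraph at the
    # start and/or end tag of each <p>, depending on the flags.
    root = ET.fromstring(xml)
    paras = [[]]

    def visit(elem):
        is_p = elem.tag == "p"
        if is_p and break_on_start:
            paras.append([])
        paras[-1].extend((elem.text or "").split())
        for child in elem:
            visit(child)
            # Text that follows the child's end tag.
            paras[-1].extend((child.tail or "").split())
        if is_p and break_on_end:
            paras.append([])

    visit(root)
    return [p for p in paras if p]  # drop empty paragraphs


# Possibility 1: both tags are boundaries.
print(paragraphs(INPUT, True, True))    # [['A'], ['B'], ['C']]
# Possibility 2: only the start tag is a boundary.
print(paragraphs(INPUT, True, False))   # [['A'], ['B', 'C']]
# Possibility 3: only the end tag is a boundary.
print(paragraphs(INPUT, False, True))   # [['A', 'B'], ['C']]
```

Each reading is a plausible implementation of "<p> for paragraph
boundaries", and each gives a different answer for paragraph-scoped
queries over this input.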


Received on Monday, 14 February 2011 09:50:39 UTC