[Bug 12057] [FT] Sentence breaks from bugzilla@jessica.w3.org on 2011-02-21 (public-qt-comments@w3.org from February 2011)

From: <bugzilla@jessica.w3.org>
Date: Mon, 21 Feb 2011 21:18:08 +0000
To: public-qt-comments@w3.org
Message-Id: <E1Prd8y-0006aY-Rl@jessica.w3.org>

http://www.w3.org/Bugs/Public/show_bug.cgi?id=12057

Michael Dyck <jmdyck@ibiblio.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jmdyck@ibiblio.org

--- Comment #1 from Michael Dyck <jmdyck@ibiblio.org> 2011-02-21 21:18:08 UTC ---
(personal response:)

(In reply to comment #0)
> In Section 3. Full-Text, the text
> 
> "This sample tokenization uses white space, punctuation and XML tags as
> word-breakers and <p> for paragraph boundaries. The results may be different
> for other tokenizations."
> 
> fails to state what rule has been used to identify sentence boundaries.

Hm, right. Since we have some examples involving sentences (in section 3.6.4),
we should probably copy the text you quoted from the test suite's guidelines.

> There is no suggestion in the text that the beginning (end) of a paragraph
> necessarily starts (ends) a sentence.

We should probably add that to the description of the sample tokenization.

> It is also unclear how paragraph boundaries are identified.  Consider the
> following input:
> 
> <root>
>   A <p>B</p> C
> </root>
> 
> I can see three possibilities:
> 
> 1.  There are three paragraphs: one containing A, one containing B and one
> containing C.
> 2.  There are two paragraphs: one containing A, one containing B C.
> 3.  There are two paragraphs: one containing A B, one containing C.
> 
> It is not clear from the specification which interpretation is correct.

I think they're all conformant (and there are perhaps other possibilities).
It's up to each implementation to indicate how it identifies paragraph
boundaries (if it supports paragraphs).

In the sample tokenization, I'd say it's clear that A and B are not in the same
paragraph (because <p> is a "paragraph boundary"), so #3 is out. I'm not sure
it's necessary to describe the sample tokenization precisely enough to
distinguish between #1 and #2 -- do you know of any examples or tests where it
makes a difference to the result?
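
To make the difference between #1 and #2 concrete, here is a rough Python
sketch. It is not taken from the spec or the test suite: the function name
`paragraphs` and the flag `break_on_end_tag` are made up for illustration, and
it assumes that words are split on white space only (punctuation is ignored
for brevity) and that only <p> affects paragraph boundaries.

```python
import xml.etree.ElementTree as ET


def paragraphs(xml_text, break_on_end_tag):
    """Group word tokens into paragraphs (illustrative only).

    A <p> start tag always begins a new paragraph. If break_on_end_tag is
    true, a </p> end tag begins one as well (interpretation #1); otherwise
    the text after </p> stays in the paragraph that <p> opened (#2).
    """
    root = ET.fromstring(xml_text)
    result = [[]]

    def new_paragraph():
        # Only open a new paragraph if the current one has any tokens.
        if result[-1]:
            result.append([])

    def walk(elem):
        if elem.tag == "p":
            new_paragraph()
        result[-1].extend((elem.text or "").split())
        for child in elem:
            walk(child)
            if child.tag == "p" and break_on_end_tag:
                new_paragraph()
            # Text following the child's end tag belongs to the parent.
            result[-1].extend((child.tail or "").split())

    walk(root)
    return [p for p in result if p]


doc = "<root>\n  A <p>B</p> C\n</root>"
print(paragraphs(doc, break_on_end_tag=True))   # interpretation #1
print(paragraphs(doc, break_on_end_tag=False))  # interpretation #2
```

Under these assumptions the first call puts A, B and C in three separate
paragraphs, and the second puts B and C together. A query would only tell the
two apart if it compared the paragraph of B with that of C, for example
something that requires both words to be in the same paragraph.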


Received on Monday, 21 February 2011 21:18:10 UTC