- From: <bugzilla@wiggum.w3.org>
- Date: Sat, 23 Jun 2007 09:51:13 +0000
- To: public-qt-comments@w3.org
- CC:
http://www.w3.org/Bugs/Public/show_bug.cgi?id=4697
Summary: [FT] editorial: 1.1 Full-Text Search and XML
Product: XPath / XQuery / XSLT
Version: Last Call drafts
Platform: All
OS/Version: All
Status: NEW
Severity: minor
Priority: P2
Component: Full Text
AssignedTo: jim.melton@acm.org
ReportedBy: jmdyck@ibiblio.org
QAContact: public-qt-comments@w3.org
1.1 Full-Text Search and XML
[1]
"The following definitions apply to full-text search:"
Note that points 4 and 5 aren't actually definitions.
(Point 5 belongs with a definition of tokenization.)
[2]
"A token is defined as a character, n-gram, or sequence of characters"
[2a]
Usually, definitions don't say "is defined as". Change to just "is"?
[2b]
There is no definition of "n-gram", and no other use of it in the
document. Delete?
[3]
"Each instance of a token consists of one or more consecutive characters."
[3a]
The phrase "Each instance of a token" is undefined, and suggests that
a token is an abstract thing that has to be instantiated. Just say
"Each token".
[3b]
I think you'll get a better definition if you combine the two
sentences, e.g.:
    A token is a sequence of one or more consecutive characters,
    returned by a tokenizer as a basic unit to be searched.
[3c]
You might need to clarify what you mean by "consecutive". Point 5
appears to give implementations the freedom to treat
    i<i>tal</i>ic
as a single (6-character) token, but its characters are not
consecutive characters in the XML document.
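To make the ambiguity concrete, here is a minimal sketch in Python (not
anything the draft specifies). The wrapping <p> element and the \w+
tokenization rule are assumptions chosen only for illustration. It
contrasts tokenizing the string value of the element with tokenizing
each text node separately:

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical fragment: the example above wrapped in a <p> element.
markup = "<p>i<i>tal</i>ic</p>"
root = ET.fromstring(markup)

# Reading 1: tokenize the string value, ignoring markup.
# The six characters are consecutive here, so they form one token.
string_value = "".join(root.itertext())
tokens_from_string_value = re.findall(r"\w+", string_value)

# Reading 2: treat element boundaries as token boundaries by
# tokenizing each text node on its own.
tokens_per_text_node = [
    tok
    for text in root.itertext()
    for tok in re.findall(r"\w+", text)
]

print(tokens_from_string_value)  # ['italic']
print(tokens_per_text_node)      # ['i', 'tal', 'ic']
```

Under the first reading "consecutive" would have to mean consecutive in
the string value, not in the serialized document.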
[4]
"a phrase/sentence/paragraph is an ordered sequence of any number of
tokens."
Must/should the order of the sequence reflect document order (e.g.,
tokens ordered according to the document order of their first
character)? Either way, it would be good to say so.
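As a sketch of the ordering suggested in the parenthetical, the
following Python orders tokens by the offset of their first character.
The Token type and the offsets are hypothetical; the draft defines no
such structure.

```python
from typing import NamedTuple


class Token(NamedTuple):
    start: int  # assumed: offset of the token's first character
    text: str


def in_document_order(tokens):
    """Order tokens by the document position of their first character."""
    return sorted(tokens, key=lambda t: t.start)


# Tokens of "full-text search", supplied out of order.
phrase = [Token(10, "search"), Token(0, "full"), Token(5, "text")]
ordered = in_document_order(phrase)
print([t.text for t in ordered])  # ['full', 'text', 'search']
```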
[5]
"... the containment hierarchy: paragraphs contain sentences, which
contain tokens"
[5a]
Phrases don't (aren't required to) participate in the containment
hierarchy? (Can a phrase match across a sentence boundary?)
[5b]
It's not clear what it means for one ordered sequence of tokens (A) to
"contain" another (B). Presumably all the tokens in B must be in A,
and I'm guessing they have to be in the same order. Do the tokens of
B also have to be consecutive in A?
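The three possible readings of "contain" can be sketched in Python as
follows. This is only an illustration of the question, not a proposal
for the draft's wording. Tokens are represented here by hypothetical
integer position identifiers, so that each token is distinct.

```python
def contains_as_set(a, b):
    """Every token of B occurs somewhere in A; order is ignored."""
    return all(tok in a for tok in b)


def contains_in_order(a, b):
    """B is a subsequence of A: same order, gaps allowed."""
    remaining = iter(a)
    # Each membership test consumes the iterator up to the match,
    # so later tokens of B must appear later in A.
    return all(tok in remaining for tok in b)


def contains_consecutively(a, b):
    """B is a contiguous run of A: same order, no gaps."""
    n = len(b)
    return any(a[i:i + n] == b for i in range(len(a) - n + 1))


sentence = [1, 2, 3, 4, 5]

print(contains_as_set(sentence, [3, 1]))         # True
print(contains_in_order(sentence, [3, 1]))       # False
print(contains_in_order(sentence, [1, 3]))       # True
print(contains_consecutively(sentence, [1, 3]))  # False
print(contains_consecutively(sentence, [2, 3]))  # True
```

Each reading is strictly stronger than the one before it, so the text
needs to say which one is intended.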
[6]
"The tokenizer has to process two codepoint equal strings in the same way,
i.e., it should identify the same tokens."
[6a]
Change "has to" to "must"?
[6b]
Change "should" to "must"?
[6c]
These constraints on tokenization are stated in four places (1.1,
2.1, 4.1, and Appendix I) in slightly different ways. Surely we can
delete/merge a few of them.
Received on Saturday, 23 June 2007 09:51:15 UTC