- From: <bugzilla@wiggum.w3.org>
- Date: Sat, 23 Jun 2007 09:51:13 +0000
- To: public-qt-comments@w3.org
- CC:
http://www.w3.org/Bugs/Public/show_bug.cgi?id=4697

           Summary: [FT] editorial: 1.1 Full-Text Search and XML
           Product: XPath / XQuery / XSLT
           Version: Last Call drafts
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: minor
          Priority: P2
         Component: Full Text
        AssignedTo: jim.melton@acm.org
        ReportedBy: jmdyck@ibiblio.org
         QAContact: public-qt-comments@w3.org


1.1 Full-Text Search and XML

[1]
"The following definitions apply to full-text search:"
    Note that 4 and 5 aren't actually definitions.
    (5 belongs with a definition of tokenization.)

[2]
"A token is defined as a character, n-gram, or sequence of characters"

    [2a]
    Usually, definitions don't say "is defined as". Change to just "is"?

    [2b]
    There is no definition of "n-gram", and no other use of it in the
    document. Delete?

[3]
"Each instance of a token consists of one or more consecutive characters."

    [3a]
    The phrase "Each instance of a token" is undefined, and suggests that
    a token is an abstract thing that has to be instantiated. Just say
    "Each token".

    [3b]
    I think you'll get a better definition if you combine the two
    sentences, e.g.:
        A token is a sequence of one or more consecutive characters,
        returned by a tokenizer as a basic unit to be searched.

    [3c]
    You might need to clarify what you mean by "consecutive". Point 5
    appears to give implementations the freedom to treat
        i<i>tal</i>ic
    as a single (6-character) token, but its characters are not
    consecutive characters in the XML document.

[4]
"a phrase/sentence/paragraph is an ordered sequence of any number of tokens."
    Must/should the order of the sequence reflect document order?
    (e.g., tokens are ordered according to the document order of their
    first character)
    Either way, it might be good to say.

[5]
"... the containment hierarchy: paragraphs contain sentences, which
contain tokens"

    [5a]
    Phrases don't (aren't required to) participate in the containment
    hierarchy? (Can a phrase match across a sentence boundary?)

    [5b]
    It's not clear what it means for one ordered sequence of tokens (A)
    to "contain" another (B). Presumably all the tokens in B must be in
    A, and I'm guessing they have to be in the same order. Do the tokens
    of B also have to be consecutive in A?

[6]
"The tokenizer has to process two codepoint equal strings in the same
way, i.e., it should identify the same tokens."

    [6a]
    Change "has to" to "must"?

    [6b]
    Change "should" to "must"?

    [6c]
    These constraints on tokenization are stated in four places (1.1,
    2.1, 4.1, and appx I) in slightly different ways. Surely we can
    delete/merge a few of them.
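
To make point [3c] concrete, here is a minimal sketch in Python. It assumes a tokenizer that ignores element boundaries and works on the concatenated text nodes. The <p> wrapper element and the helper name string_value are hypothetical, chosen only for illustration, and nothing here is taken from the Full-Text draft itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document built around the i<i>tal</i>ic example in [3c].
SOURCE = "<p>i<i>tal</i>ic</p>"


def string_value(xml_text: str) -> str:
    """Concatenate all text nodes in document order, ignoring markup."""
    root = ET.fromstring(xml_text)
    return "".join(root.itertext())


text = string_value(SOURCE)

# The characters a markup-ignoring tokenizer would see as one candidate token.
print(text)

# Whether those same characters are adjacent in the serialized XML document.
print("italic" in SOURCE)
```

The six characters are adjacent in the concatenated text but not in the serialized document, which is the ambiguity in "consecutive" that [3c] asks the draft to resolve.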
Received on Saturday, 23 June 2007 09:51:15 UTC