- From: <bugzilla@wiggum.w3.org>
- Date: Wed, 25 Jan 2006 01:21:33 +0000
- To: public-qt-comments@w3.org
- Cc:
http://www.w3.org/Bugs/Public/show_bug.cgi?id=2766
Summary: Word or Token (need clarification)
Product: XPath / XQuery / XSLT
Version: Working drafts
Platform: All
OS/Version: Windows XP
Status: NEW
Severity: normal
Priority: P2
Component: Full Text
AssignedTo: sihem@research.att.com
ReportedBy: joaquin.delgado@oracle.com
QAContact: public-qt-comments@w3.org
According to the last published draft:
"A word is defined as a character, n-gram, or sequence of characters returned
by a tokenizer as a basic unit to be searched. Each instance of a word
consists of one or more consecutive characters. Beyond that, words are
implementation-defined. Note that consecutive words need not be separated by
either punctuation or space, and words may overlap. A phrase is a sequence of
ordered words which may contain any number of words."
I'm not convinced we should use "word" in the above definition. In plain
English, "word" has its own semantics: it is associated with a concept, so
readers may confuse the two meanings. Notice that an n-gram or an arbitrary
sequence of characters carries no such connotation. I think the definition
above relates more to "token". In fact, we later refer to words as
tokens: "Whatever a tokenizer for a particular language chooses to do, it must
preserve the containment hierarchy: paragraphs contain sentences which contain
words. The tokenizer has to evaluate two equal strings in the same way, i.e.,
it should identify the same tokens." We also use a data structure called
TokenInfo. I think it's better to use "token" throughout the document, or to
state clearly that words and tokens mean the same thing.
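
To illustrate the concern, here is a minimal sketch of a character n-gram
tokenizer in Python. It is only an assumption about what one conforming
tokenizer might do; the name char_ngrams and the choice of n=3 are
hypothetical and do not come from the draft. The units it returns overlap,
are not separated by space or punctuation, and are not "words" in the plain
English sense, yet they fit the quoted definition.

```python
def char_ngrams(text, n=3):
    """Return (start_offset, token) pairs for every character n-gram in text.

    Hypothetical helper for illustration only; it is not defined by the draft.
    """
    # Slide a window of n characters over the text, one position at a time,
    # so consecutive tokens share n - 1 characters (they overlap).
    return [(i, text[i:i + n]) for i in range(len(text) - n + 1)]


tokens = char_ngrams("full text", n=3)
for offset, token in tokens:
    print(offset, repr(token))
```

For the input "full text" this yields seven tokens, including 'll ' and
' te', neither of which is a word in the everyday sense. Because the function
depends only on its input, two equal strings always produce the same tokens,
which is the one behaviour the draft does require of a tokenizer.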
Received on Wednesday, 25 January 2006 01:21:35 UTC