- From: <bugzilla@wiggum.w3.org>
- Date: Wed, 25 Jan 2006 01:21:33 +0000
- To: public-qt-comments@w3.org
- Cc:
http://www.w3.org/Bugs/Public/show_bug.cgi?id=2766

Summary: Word or Token (need clarification)
Product: XPath / XQuery / XSLT
Version: Working drafts
Platform: All
OS/Version: Windows XP
Status: NEW
Severity: normal
Priority: P2
Component: Full Text
AssignedTo: sihem@research.att.com
ReportedBy: joaquin.delgado@oracle.com
QAContact: public-qt-comments@w3.org

According to the last published draft:

"A word is defined as a character, n-gram, or sequence of characters returned by a tokenizer as a basic unit to be searched. Each instance of a word consists of one or more consecutive characters. Beyond that, words are implementation-defined. Note that consecutive words need not be separated by either punctuation or space, and words may overlap. A phrase is a sequence of ordered words which may contain any number of words."

I'm not convinced we should use "word" in the above definition. The problem is that it may be confused with the plain-English meaning of "word", which is associated with a concept. An n-gram or an arbitrary sequence of characters has no such connotation.

I think the definition above relates more to "token". In fact, we later refer to words as tokens:

"Whatever a tokenizer for a particular language chooses to do, it must preserve the containment hierarchy: paragraphs contain sentences which contain words. The tokenizer has to evaluate two equal strings in the same way, i.e., it should identify the same tokens."

and we also use the data structure called TokenInfo.

I think it's better to use "token" throughout the document, or to state clearly that "word" and "token" mean the same thing.
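
To make the distinction concrete, here is a minimal sketch of a character n-gram tokenizer of the kind the quoted definition allows. The function name char_ngrams and the choice of n = 3 are hypothetical, chosen only for illustration; nothing here is taken from the draft or from any implementation.

```python
def char_ngrams(text, n=3):
    """Return (start_offset, token) pairs for every character n-gram in text.

    Hypothetical illustration only: the draft leaves tokenization
    implementation-defined and does not prescribe this behaviour.
    """
    return [(i, text[i:i + n]) for i in range(len(text) - n + 1)]


tokens = char_ngrams("full text")

# Consecutive tokens start one character apart, so they overlap,
# and some of them span the space between the two English words.
for offset, token in tokens:
    print(offset, repr(token))

# Two equal strings are evaluated the same way and yield the same tokens.
same = char_ngrams("full text") == char_ngrams("full text")
```

None of the units this sketch returns is a "word" in the plain-English sense, yet each satisfies the quoted definition, which is why "token" may be the less confusing term.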
Received on Wednesday, 25 January 2006 01:21:35 UTC