[Bug 2766] Word or Token (need clarification) from bugzilla@wiggum.w3.org on 2006-01-25 (public-qt-comments@w3.org from January 2006)

From: <bugzilla@wiggum.w3.org>
Date: Wed, 25 Jan 2006 01:21:33 +0000
To: public-qt-comments@w3.org
Cc:
Message-Id: <E1F1ZLp-0004qe-EZ@wiggum.w3.org>

http://www.w3.org/Bugs/Public/show_bug.cgi?id=2766

           Summary: Word or Token (need clarification)
           Product: XPath / XQuery / XSLT
           Version: Working drafts
          Platform: All
        OS/Version: Windows XP
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Full Text
        AssignedTo: sihem@research.att.com
        ReportedBy: joaquin.delgado@oracle.com
         QAContact: public-qt-comments@w3.org


According to the last published draft:

"A word is defined as a character, n-gram, or sequence of characters returned 
by a tokenizer as a basic unit to be searched. Each instance of a word 
consists of one or more consecutive characters. Beyond that, words are 
implementation-defined. Note that consecutive words need not be separated by 
either punctuation or space, and words may overlap. A phrase is a sequence of 
ordered words which may contain any number of words."

I'm not convinced we should use "word" in the above definition. The problem 
I have with "word" is that it may be confused with its plain-English meaning, 
in which a word is associated with a concept. Notice that an n-gram or an 
arbitrary sequence of characters has no such connotation. I think the 
definition above relates more to "token". In fact, the draft later refers to 
words as tokens: "Whatever a tokenizer for a particular language chooses to 
do, it must preserve the containment hierarchy: paragraphs contain sentences 
which contain words. The tokenizer has to evaluate two equal strings in the 
same way, i.e., it should identify the same tokens." It also uses a data 
structure called TokenInfo. I think it's better to use "token" throughout the 
document, or to state clearly that words and tokens mean the same thing.
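
As an illustration only (this sketch is not taken from the draft, and the 
function name char_ngrams is hypothetical), a character trigram tokenizer 
fits the quoted definition: each unit is a run of consecutive characters, 
units overlap, and no punctuation or space separates them. None of the units 
is a word in the plain-English sense, which is why "token" reads more 
naturally:

```python
def char_ngrams(text, n):
    """Return (start_offset, unit) pairs for every run of n consecutive characters.

    Hypothetical helper for illustration; the draft leaves tokenization
    implementation-defined and does not prescribe this scheme.
    """
    return [(i, text[i:i + n]) for i in range(len(text) - n + 1)]


# Each unit starts one character after the previous one, so neighbouring
# units share two characters (they overlap), and some units span the space.
tokens = char_ngrams("full text", 3)
for start, unit in tokens:
    print(start, repr(unit))
```

Under this scheme the "basic units to be searched" are strings such as 
"ull" or "l t", which satisfy the definition but carry no meaning as words.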

Received on Wednesday, 25 January 2006 01:21:35 UTC