[Bug 4697] [FT] editorial: 1.1 Full-Text Search and XML

From: <bugzilla@wiggum.w3.org>
Date: Sat, 23 Jun 2007 09:51:13 +0000
CC:
To: public-qt-comments@w3.org
Message-Id: <E1I22Gv-00053v-Dd@wiggum.w3.org>

http://www.w3.org/Bugs/Public/show_bug.cgi?id=4697

           Summary: [FT] editorial: 1.1 Full-Text Search and XML
           Product: XPath / XQuery / XSLT
           Version: Last Call drafts
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: minor
          Priority: P2
         Component: Full Text
        AssignedTo: jim.melton@acm.org
        ReportedBy: jmdyck@ibiblio.org
         QAContact: public-qt-comments@w3.org


1.1 Full-Text Search and XML

[1]
"The following definitions apply to full-text search:"
    Note that items 4 and 5 in that list aren't actually definitions.
    (Item 5 belongs with a definition of tokenization.)

[2]
"A token is defined as a character, n-gram, or sequence of characters"
    [2a]
    Usually, definitions don't say "is defined as". Change to just "is"?

    [2b]
    There is no definition of "n-gram", and no other use of it in the
    document. Delete?

[3]
"Each instance of a token consists of one or more consecutive characters."
    [3a]
    The phrase "Each instance of a token" is undefined, and suggests that
    a token is an abstract thing that has to be instantiated. Just say
    "Each token".

    [3b]
    I think you'll get a better definition if you combine the two
    sentences, e.g.:
        A token is a sequence of one or more consecutive characters,
        returned by a tokenizer as a basic unit to be searched.

    [3c]
    You might need to clarify what you mean by "consecutive".  Point 5
    appears to give implementations the freedom to treat
        i<i>tal</i>ic
    as a single (6-character) token, but its characters are not
    consecutive characters in the XML document.
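
    To make the two readings concrete, here is a minimal Python sketch.
    It is not taken from the spec: the <p> wrapper element, the function
    names, and the word-character regex are assumptions made only for
    this illustration. One tokenizer works on the concatenated string
    value (markup is invisible), the other treats each text node
    separately (tags act as token boundaries).

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical wrapper element around the example from [3c].
fragment = ET.fromstring("<p>i<i>tal</i>ic</p>")


def tokens_ignoring_markup(elem):
    """Tokenize the concatenated string value, so tags are invisible."""
    string_value = "".join(elem.itertext())
    return re.findall(r"\w+", string_value)


def tokens_per_text_node(elem):
    """Tokenize each text node on its own, so tags end a token."""
    return [tok
            for run in elem.itertext()
            for tok in re.findall(r"\w+", run)]


print(tokens_ignoring_markup(fragment))
print(tokens_per_text_node(fragment))
```

    Under the first reading the six characters form one token even though
    they are not adjacent in the serialized document; under the second
    they form three tokens.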

[4]
"a phrase/sentence/paragraph is an ordered sequence of any number of
tokens."
    Must/should the order of the sequence reflect document order? (e.g.,
    tokens are ordered according to the document order of their first
    character) Either way, it might be good to say.
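
    One possible reading of "ordered according to the document order of
    their first character" is sketched below in Python. The Token type
    and its start offset are hypothetical; the spec does not prescribe
    any such representation.

```python
from typing import NamedTuple


class Token(NamedTuple):
    start: int  # assumed: offset of the token's first character
    text: str


def in_document_order(tokens):
    """Return the tokens sorted by the position of their first character."""
    return sorted(tokens, key=lambda t: t.start)


def is_document_ordered(tokens):
    """True if the sequence is already ordered by first-character position."""
    starts = [t.start for t in tokens]
    return starts == sorted(starts)


phrase = [Token(10, "search"), Token(0, "full"), Token(5, "text")]
print([t.text for t in in_document_order(phrase)])
```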

[5]
"... the containment hierarchy: paragraphs contain sentences, which
contain tokens"
    [5a]
    Phrases don't (aren't required to) participate in the containment
    hierarchy? (Can a phrase match across a sentence boundary?)

    [5b]
    It's not clear what it means for one ordered sequence of tokens (A) to
    "contain" another (B). Presumably all the tokens in B must be in A,
    and I'm guessing they have to be in the same order. Do the tokens of
    B also have to be consecutive in A?
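
    The difference between the two candidate meanings of "contain" can
    be sketched in Python as follows. Both function names are
    hypothetical, and neither is claimed to be what the spec intends;
    the point is only that the two readings give different answers.

```python
def is_subsequence(b, a):
    """True if every token of b occurs in a, in the same order (gaps allowed)."""
    remaining = iter(a)
    # 'tok in remaining' consumes the iterator up to the first match,
    # so later tokens of b must be found after earlier ones.
    return all(tok in remaining for tok in b)


def is_contiguous_run(b, a):
    """True if the tokens of b occur in a in order and with no gaps."""
    n = len(b)
    return any(a[i:i + n] == b for i in range(len(a) - n + 1))


sentence = ["the", "quick", "brown", "fox"]
print(is_subsequence(["the", "fox"], sentence))
print(is_contiguous_run(["the", "fox"], sentence))
```

    Here ["the", "fox"] is "contained" under the first reading but not
    under the second.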

[6]
"The tokenizer has to process two codepoint equal strings in the same way,
i.e., it should identify the same tokens."
    [6a]
    Change "has to" to "must"?

    [6b]
    Change "should" to "must"?

    [6c]
    These constraints on tokenization are stated in four places (1.1,
    2.1, 4.1, and Appendix I) in slightly different ways. Surely we can
    delete/merge a few of them.
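
    For reference, the constraint quoted in [6] amounts to requiring that
    the tokenizer be a pure function of the string's codepoints. A
    minimal Python sketch, in which the tokenizer itself and both helper
    names are assumptions made for the example:

```python
import re


def tokenize(text):
    """Hypothetical tokenizer: the result depends only on the input string."""
    return re.findall(r"\w+", text)


def codepoint_equal(a, b):
    """True if both strings consist of the same sequence of codepoints."""
    return [ord(c) for c in a] == [ord(c) for c in b]


# Two strings built in different ways but equal codepoint for codepoint.
s1 = "Full-Text Search"
s2 = "".join(["Full", "-", "Text", " ", "Search"])

if codepoint_equal(s1, s2):
    # The constraint: the same tokens must be identified for both.
    print(tokenize(s1) == tokenize(s2))
```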
Received on Saturday, 23 June 2007 09:51:15 UTC
