[Bug 4697] [FT] editorial: 1.1 Full-Text Search and XML

From: <bugzilla@wiggum.w3.org>
Date: Sat, 23 Jun 2007 09:51:13 +0000
To: public-qt-comments@w3.org
Message-Id: <E1I22Gv-00053v-Dd@wiggum.w3.org>


           Summary: [FT] editorial: 1.1 Full-Text Search and XML
           Product: XPath / XQuery / XSLT
           Version: Last Call drafts
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: minor
          Priority: P2
         Component: Full Text
        AssignedTo: jim.melton@acm.org
        ReportedBy: jmdyck@ibiblio.org
         QAContact: public-qt-comments@w3.org

1.1 Full-Text Search and XML

"The following definitions apply to full-text search:"
    Note that 4 and 5 aren't actually definitions.
    (5 belongs with a definition of tokenization.)

"A token is defined as a character, n-gram, or sequence of characters"
    Usually, definitions don't say "is defined as". Change to just "is"?

    There is no definition of "n-gram", and no other use of it in the
    document. Delete?

"Each instance of a token consists of one or more consecutive characters."
    The phrase "Each instance of a token" is undefined, and suggests that
    a token is an abstract thing that has to be instantiated. Just say
    "Each token".

    I think you'll get a better definition if you combine the two
    sentences, e.g.:
        A token is a sequence of one or more consecutive characters,
        returned by a tokenizer as a basic unit to be searched.

    You might need to clarify what you mean by "consecutive".  Point 5
    appears to give implementations the freedom to treat
    as a single (6-character) token, but its characters are not
    consecutive characters in the XML document.

"a phrase/sentence/paragraph is an ordered sequence of any number of
tokens"
    Must/should the order of the sequence reflect document order? (e.g.,
    tokens are ordered according to the document order of their first
    character) Either way, it might be good to say.
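
    To make the parenthetical concrete, here is a minimal sketch of
    "ordered according to the document order of their first character".
    The Token type and its start offset are hypothetical illustrations,
    not anything the draft defines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    text: str
    start: int  # hypothetical: offset of the token's first character in the document


def in_document_order(tokens):
    """True if each token starts strictly after the one before it."""
    return all(a.start < b.start for a, b in zip(tokens, tokens[1:]))


phrase = [Token("full", 0), Token("text", 5), Token("search", 10)]
shuffled = [phrase[1], phrase[0], phrase[2]]
```

    Under this reading, phrase satisfies the constraint and shuffled
    does not; the draft could say whether the second is still a phrase.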

"... the containment hierarchy: paragraphs contain sentences, which
contain tokens"
    Phrases don't (aren't required to) participate in the containment
    hierarchy? (Can a phrase match across a sentence boundary?)

    It's not clear what it means for one ordered sequence of tokens (A) to
    "contain" another (B). Presumably all the tokens in B must be in A,
    and I'm guessing they have to be in the same order. Do the tokens of
    B also have to be consecutive in A?
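
    The two readings of "contain" can give different answers for the
    same input. A small sketch, assuming tokens are plain strings and
    using hypothetical function names:

```python
def is_subsequence(b, a):
    """True if every token of b appears in a, in the same relative order."""
    remaining = iter(a)
    # Each membership test consumes the iterator up to the match,
    # so later tokens of b must be found after earlier ones.
    return all(tok in remaining for tok in b)


def is_contiguous_run(b, a):
    """True if b appears in a as an unbroken run of consecutive tokens."""
    n = len(b)
    return any(a[i:i + n] == b for i in range(len(a) - n + 1))


sentence = ["the", "quick", "brown", "fox"]
candidate = ["the", "brown", "fox"]

loose = is_subsequence(candidate, sentence)      # True
strict = is_contiguous_run(candidate, sentence)  # False
```

    Here candidate is "contained" in sentence only under the looser
    (in-order, gaps allowed) reading, which is why the draft should
    state which one it intends.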

"The tokenizer has to process two codepoint equal strings in the same way,
i.e., it should identify the same tokens."
    Change "has to" to "must"?

    Change "should" to "must"?

    These constraints on tokenization are stated in four places (1.1,
    2.1, 4.1, and appx I) in slightly different ways. Surely we can
    delete/merge a few of them.
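
    For reference, the constraint as I read it is simple enough to state
    as a checkable property. The sketch below uses a toy regex-based
    tokenizer that is purely an assumption for illustration; the draft
    leaves the actual tokenizer implementation-defined.

```python
import re


def tokenize(text):
    """Toy tokenizer (an assumption): maximal runs of word characters."""
    return re.findall(r"\w+", text)


def same_tokens_for_equal_strings(s1, s2):
    """If s1 and s2 are codepoint-equal, they must yield the same tokens."""
    codepoint_equal = [ord(c) for c in s1] == [ord(c) for c in s2]
    return (not codepoint_equal) or tokenize(s1) == tokenize(s2)


a = "full-text search"
b = "full-" + "text search"  # built differently, same codepoints

# Canonically equivalent but NOT codepoint-equal, so the
# constraint as worded places no requirement on this pair.
composed = "caf\u00e9"
decomposed = "cafe\u0301"
```

    Any tokenizer that is a pure function of its input string satisfies
    the property trivially, which suggests a single normative statement
    would be enough.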
Received on Saturday, 23 June 2007 09:51:15 UTC
