[Bug 4698] [FT] editorial: 2.1 Processing Model from bugzilla@wiggum.w3.org on 2007-06-23 (public-qt-comments@w3.org from June 2007)

From: <bugzilla@wiggum.w3.org>
Date: Sat, 23 Jun 2007 09:55:36 +0000
To: public-qt-comments@w3.org
CC:
Message-Id: <E1I22LA-0005FD-C7@wiggum.w3.org>
http://www.w3.org/Bugs/Public/show_bug.cgi?id=4698

           Summary: [FT] editorial: 2.1 Processing Model
           Product: XPath / XQuery / XSLT
           Version: Last Call drafts
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: minor
          Priority: P2
         Component: Full Text
        AssignedTo: jim.melton@acm.org
        ReportedBy: jmdyck@ibiblio.org
         QAContact: public-qt-comments@w3.org


2.1 Processing Model

[1]
section
    I think the spec might be better off with the contents of this section
    put elsewhere. E.g., the stuff on tokenization can be merged into 4.1;
    pretty much everything else is specific to full-text contains
    expressions, so can be merged into 2.2.1.

[2]
para 1
"As part of the External Processing that is described in the XQuery
Processing Model, when an XML document is parsed into an Infoset/PSVI and
ultimately into a XQuery Data Model instance, a full-text process called
tokenization is usually executed."
    With respect to the Processing Model, tokenization is *not* part of
    external processing, because:
    (a) there's no allowance for tokens in the Data Model, and
    (b) the only place/time where the thing-to-be-tokenized and the
        options-by-which-to-tokenize-it are guaranteed to come together
        is within the query at evaluation time.
    (Implementations may be able to statically determine [or guess] some
    combinations, and so do pre-tokenization, but that's not something
    that is [or should be] captured in the Processing Model.) Replace the
    para with something like:
        "At various points in full-text processing, the processor is
        called upon to 'tokenize' a string."

[3]
para 3
'including the definition of the term "words"'
    Delete. (Avoid using the term "words".)

[4]
"interprete"
    Change to "interpret".

[5]
list 1
"2. ... the containment hierarchy (e.g., paragraphs contain sentences,
which contain words)"
    I think you mean "i.e.", not "e.g.". (If that's just an *example* of a
    containment hierarchy, then who gets to define the actual hierarchy
    that the tokenizer must preserve?)

[6]
para 5
"evaluated within the normal Query Processing (XQuery Processing Model),"
    Odd. Delete "the"? De-capitalize "Query Processing"?
    Is the parenthesized text supposed to be a link?
    Could just delete the whole quoted phrase; it doesn't seem relevant.

[7]
list 2
"3. ... which contents may be ignored"
    [7a]
    s/which contents/whose contents/

    [7b]
    s/may/must/

[8]
para 8 (2nd after diagram)
"Tokenization normally occurs at the time of parsing of the original XML
documents, for example, during the Data Model Generation process"
    That may be true in the real world, but not in the Processing Model.
    See my comment for para 1 above.

[9]
para 9, 11, ...
"Full Text expression"
    When this section refers to a "Full Text expression", it specifically
    means a full-text contains expression. Might as well be specific.

[10]
list 3
"1. ... the set of search context items"
    s/set/sequence/

[11]
"2. Evaluate the (optional) ignore expression, resulting in the set of
ignored nodes and virtually delete the ignore nodes from the search
context nodes tree."
    [11a]
    The ignore option must be evaluated for each search context item, so
    2 should be the new 4a.

    [11b]
    s/ignore expression/ignore option/

    [11c]
    s/nodes and virtually/nodes, and virtually/ (or "nodes. Virtually")

    [11d]
    s/ignore nodes/ignored nodes/

    [11e]
    s/the search context nodes tree/the search context item/
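
    A minimal sketch of the ordering argued for in [10] and [11a],
    assuming a toy model in which each search context item is just a
    list of node labels. The names evaluate_ftcontains, ignored_for and
    evaluate_selection are hypothetical and do not come from the spec;
    the point is only that the items form an ordered sequence and that
    the ignore option is applied once per item, inside the loop.

```python
from typing import Callable, List, Set

def evaluate_ftcontains(
    items: List[List[str]],
    ignored_for: Callable[[List[str]], Set[str]],
    evaluate_selection: Callable[[List[str]], bool],
) -> List[bool]:
    """Toy model: evaluate the selection against each search context item in order."""
    results = []
    for item in items:  # a sequence, so order and duplicates are preserved
        ignored = ignored_for(item)  # ignore option evaluated per item
        visible = [node for node in item if node not in ignored]  # "virtual" deletion
        results.append(evaluate_selection(visible))  # one result per item
    return results

# Hypothetical data: two search context items made of node labels.
items = [["a", "note", "b"], ["note", "c"]]
results = evaluate_ftcontains(
    items,
    ignored_for=lambda item: {n for n in item if n == "note"},
    evaluate_selection=lambda nodes: "b" in nodes,
)
```

    Here the ignored nodes are computed from the item at hand rather
    than once up front, which is why the step belongs inside the
    per-item loop.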

[12]
"4a. Apply the tokenization algorithm"
    In terms of the processing model, you can't do tokenization at this
    level. Each different FTPrimaryWithOptions within the FTSelection
    is allowed to have different FTMatchOptions, some of which affect
    tokenization. So theoretically, each FTWords causes its own
    tokenization of the search context item.
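
    To illustrate, here is a rough sketch assuming a toy tokenizer with
    a single hypothetical option, split_on_hyphen. MatchOptions stands
    in for FTMatchOptions and is not taken from the spec. It only shows
    that the same search context item can yield different token
    sequences under different options, so one tokenization per item is
    not enough.

```python
import re
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class MatchOptions:
    # Hypothetical stand-in for FTMatchOptions; this field is illustrative only.
    split_on_hyphen: bool = True

def tokenize(text: str, opts: MatchOptions) -> List[str]:
    """Toy tokenizer whose output depends on the match options in force."""
    if opts.split_on_hyphen:
        pattern = r"[A-Za-z0-9]+"
    else:
        # Keep hyphen-joined runs together as a single token.
        pattern = r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*"
    return re.findall(pattern, text)

# One search context item, tokenized for two FTWords with different options.
item = "full-text search"
tokens_a = tokenize(item, MatchOptions(split_on_hyphen=True))
tokens_b = tokenize(item, MatchOptions(split_on_hyphen=False))
```

    Under these assumptions tokens_a has three tokens and tokens_b has
    two, so token positions computed for one FTWords would not be valid
    for the other.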

[13]
'4b. Evaluate the simple "FTWord" operators'
    s/FTWord/FTWords/

[14]
'against the tokenized input'
    s/input/context item/
    ("input" suggests an external document)

[15]
"4c. ... in a bottom up fashion"
    s/bottom up/bottom-up/

[16]
"At each step the AllMatches instance produced by the previous steps"
    s/instance/instances/

[17]
"and a new instance of the AllMatches"
    s/instance of the AllMatches/AllMatches instance/

[18]
"the FTMatchOptions are controlling the semantics"
    s/are controlling/control/

[19]
"5. Convert the AllMatches instance"
    s/the AllMatches instance/the topmost AllMatches instances/
    (since each search context item results in one topmost AllMatches
    instance)
Received on Saturday, 23 June 2007 09:55:39 UTC