[Bug 4697] [FT] editorial: 1.1 Full-Text Search and XML

http://www.w3.org/Bugs/Public/show_bug.cgi?id=4697

------- Comment #1 from jmdyck@ibiblio.org  2007-08-27 06:15 -------
A specific suggestion for point [6c], which also takes care of [6a] and [6b]:

In section 1.1, second list, item 3:

--- Extract the definitions of "sentence" and "paragraph" and put them between
items 2 and 3. (Append them to item 2, or make a new item, whichever you
prefer.)

--- Delete the three sentences at the end of the item:
        Whatever a tokenizer for a particular language chooses to do,
        it must preserve the containment hierarchy: paragraphs contain
        sentences, which contain tokens.

        The tokenizer has to process two codepoint equal strings in the
        same way, i.e., it should identify the same tokens. Everything
        else about the behavior of the tokenizer is implementation-defined.

--- Move the definition of tokenization (and the subsequent constraints, and
the Note re overlapping tokens) from 4.1 to replace the sentences deleted
above.

    But instead of the 4.1 phrasing:
        paragraphs contain sentences contain words
    use the 1.1 phrasing:
        paragraphs contain sentences, which contain tokens

--- As for the three sentences at the start of the item, delete, reposition,
or leave them, as you please. (It might be more stylistically consistent to put
them after the definition.)
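
For illustration only, here is a minimal sketch of a tokenizer that satisfies
the two constraints quoted above: it preserves the containment hierarchy
(paragraphs contain sentences, which contain tokens), and it identifies the
same tokens for codepoint-equal strings. The function name `tokenize` and the
splitting rules (blank lines, sentence-final punctuation, `\w+` runs) are
assumptions made for this example; the draft leaves all such behavior
implementation-defined.

```python
import re


def tokenize(text):
    """Hypothetical toy tokenizer (not taken from the spec).

    Returns a list of paragraphs; each paragraph is a list of sentences;
    each sentence is a list of tokens. The nesting itself encodes the
    containment hierarchy.
    """
    paragraphs = []
    # Assumption: paragraphs are separated by one or more blank lines.
    for para in re.split(r"\n\s*\n", text.strip()):
        sentences = []
        # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
        for sent in re.split(r"(?<=[.!?])\s+", para.strip()):
            # Assumption: a token is a maximal run of word characters.
            tokens = re.findall(r"\w+", sent)
            if tokens:
                sentences.append(tokens)
        if sentences:
            paragraphs.append(sentences)
    return paragraphs


if __name__ == "__main__":
    sample = "One two. Three!\n\nFour five six."
    for p, para in enumerate(tokenize(sample)):
        for s, sent in enumerate(para):
            print(p, s, sent)
```

Because the function depends only on its input string and keeps no state, two
codepoint-equal inputs necessarily yield the same tokens.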

In section 2.1, delete the repeated paragraph and list:
   "Tokenization, including .. same tokens in each."

Received on Monday, 27 August 2007 06:15:19 UTC