[Bug 3904] [FT] Tokenization: implementation-defined or implementation-dependent

http://www.w3.org/Bugs/Public/show_bug.cgi?id=3904





------- Comment #2 from pcase@crs.loc.gov  2006-11-17 14:42 -------
I surveyed existing text and have copied out all the instances where we have
called tokenization implementation-defined or implementation-dependent. I
propose the following changes to reflect the FTTF decision that tokenization
SHOULD be implementation-defined. I was not sure whether the SHOULD statement
belonged in 2.1 Processing Model or in 4.1 Tokenization, so I have placed it in
both; however, one could be truncated to refer to the details in the other.

>>In 2.1 Processing Model

As part of the External Processing that is described in the XQuery Processing
Model, when an XML document is parsed into an Infoset/PSVI and ultimately into
a XQuery Data Model instance, an implementation-defined full-text process
called tokenization is usually executed.

>>>>Replace "an implementation-defined full-text process" with "a full-text process"



The tokenization process is implementation-dependent. For example, the
tokenization may differ from domain to domain and from language to language.
This specification will only impose a very small number of constraints on the
semantics of a correct tokenizer. As a consequence, all the examples in this
document are only given for explanatory purposes but they are not mandatory,
i.e. the result of such full-text queries will of course depend on the
tokenizer that is being used.

>>>>Replace with: 
Tokenization, including the definition of the term "words", SHOULD be
implementation-defined. Implementations SHOULD expose the rules and sample
results of tokenization as much as possible to enable users to predict and
interpret the results of tokenization. Tokenization MUST conform only to these
constraints:

a. Each word MUST consist of one or more consecutive characters;

b. The tokenizer MUST preserve the containment hierarchy (paragraphs contain
sentences contain words); and

c. The tokenizer MUST, when tokenizing two equal strings, identify the same
tokens in each. 

A sample tokenization is used for the examples in this document. The results
might be different for other tokenizations.
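The three constraints above might be illustrated with a minimal sketch (hypothetical Python; the blank-line paragraph rule, sentence-ending punctuation, and the `Token` record are illustrative assumptions, not part of the proposed wording — they stand in for one possible implementation-defined tokenization):

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Token:
    word: str   # one or more consecutive characters (constraint a)
    para: int   # paragraph index            (containment hierarchy,
    sent: int   # sentence index in paragraph constraint b)
    pos: int    # word position in sentence

def tokenize(text: str) -> list[Token]:
    """One sample tokenization: paragraphs are blank-line separated,
    sentences end at '.', '!' or '?', and words are maximal runs of
    word characters. Other tokenizers may differ on all of these."""
    tokens = []
    for p, para in enumerate(re.split(r"\n\s*\n", text)):
        for s, sent in enumerate(re.split(r"(?<=[.!?])\s+", para)):
            for w, word in enumerate(re.findall(r"\w+", sent)):
                tokens.append(Token(word, p, s, w))
    return tokens

# Constraint c: tokenizing two equal strings identifies the same tokens.
assert tokenize("Hello world. Bye.") == tokenize("Hello world. Bye.")
```

Because the tokenizer is a pure function of the input string, constraint c holds by construction; constraints a and b are visible in the `Token` fields.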



3. Apply the tokenization algorithm to query string(s). (FT2.1 -- this is
implementation-dependent)

>>>>Delete (FT2.1 -- this is implementation-dependent)



4. For each search context item:

a. Apply the tokenization algorithm in order to extract potentially matching
terms together with their positional information. This step results in a
sequence of token occurrences. (FT2.2 -- this is implementation-dependent)

>>>>Delete (FT2.2 -- this is implementation-dependent)



>>In 4.1 Tokenization 

[Definition: Formally, tokenization is the process of converting the string
value of a node to a sequence of token occurrences, taking the structural
information of the node into account to identify token, sentence, and paragraph
boundaries.]

Tokenization is subject to the following constraint:

Attribute values are not tokenized.

4.1.1 Examples

The following document fragment is the source document for examples in this
section. Tokenization is implementation-defined. A sample tokenization is used
for the examples in this section. The results might be different for other
tokenizations.

>>>>Replace with: 
[Definition: Formally, tokenization is the process of converting the string
value of a node to a sequence of token occurrences, taking the structural
information of the node into account to identify token, sentence, and paragraph
boundaries.]

Tokenization, including the definition of the term "words", SHOULD be
implementation-defined. Implementations SHOULD expose the rules and sample
results of tokenization as much as possible to enable users to predict and
interpret the results of tokenization. Tokenization MUST conform only to these
constraints:

a. Each word MUST consist of one or more consecutive characters;

b. The tokenizer MUST preserve the containment hierarchy (paragraphs contain
sentences contain words); and

c. The tokenizer MUST, when tokenizing two equal strings, identify the same
tokens in each. 

4.1.1 Examples

The following document fragment is the source document for examples in this
section. A sample tokenization is used for the examples in this section. The
results might be different for other tokenizations.

>>>>>>Please notice that I removed the "Attribute values are not tokenized" constraint, because we do allow attribute values to be tokenized and queried explicitly.



Note:

While this matching function assumes a tokenized representation of the search
strings, it does not assume a tokenized representation of the input items in
$searchContext, i.e. the texts in which the search happens. Hence, the
tokenization of the search context is implicit in this function and coupled to
the retrieval of matches. Of course, this does not imply that tokenization of
the search context cannot be done a priori. Because tokenization is
implementation-defined, the tokenization of each item in $searchContext does
not necessarily take into account the match options in $matchOptions or the
search tokens in $searchTokens. This allows implementations to tokenize and
index input data without the knowledge of particular match options used in
full-text queries.

>>>>Replace with:
Note:

While this matching function assumes a tokenized representation of the search
strings, it does not assume a tokenized representation of the input items in
$searchContext, i.e. the texts in which the search happens. Hence, the
tokenization of the search context is implicit in this function and coupled to
the retrieval of matches. Of course, this does not imply that tokenization of
the search context cannot be done a priori. The tokenization of each item in
$searchContext does not necessarily take into account the match options in
$matchOptions or the search tokens in $searchTokens. This allows
implementations to tokenize and index input data without the knowledge of
particular match options used in full-text queries.
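The design point in the Note — that input data can be tokenized and indexed a priori, without knowledge of the match options later used in queries — might be sketched like this (hypothetical Python; the whitespace tokenization, the index shape, and the "case insensitive" option are illustrative assumptions):

```python
# Tokenize and index the input data once, before any query is known.
def tokenize(text: str) -> list[str]:
    return text.split()  # a sample implementation-defined tokenization

def build_index(text: str) -> dict[str, list[int]]:
    """Map each token to the positions at which it occurs."""
    index: dict[str, list[int]] = {}
    for pos, tok in enumerate(tokenize(text)):
        index.setdefault(tok, []).append(pos)
    return index

# Match options are applied at query time against the pre-built index;
# the index itself never needed to know them.
def lookup(index: dict[str, list[int]], search_token: str,
           case_insensitive: bool = False) -> list[int]:
    if not case_insensitive:
        return index.get(search_token, [])
    return sorted(pos for tok, positions in index.items()
                  if tok.lower() == search_token.lower()
                  for pos in positions)

idx = build_index("Usability usability testing")
```

Here `lookup(idx, "usability")` consults only the exact entry, while `lookup(idx, "usability", case_insensitive=True)` also matches "Usability" — the same index serves both queries.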



>>In Appendix I Checklist of Implementation-Defined Features (Non-Normative)

1. Everything about tokenization, including the definition of the term "words",
is implementation-defined, except that 

a. each word consists of one or more consecutive characters;

b. the tokenizer must preserve the containment hierarchy (paragraphs contain
sentences contain words); and

c. the tokenizer must, when tokenizing two equal strings, identify the same
tokens in each. 

>>>>Replace with: 
Tokenization, including the definition of the term "words", SHOULD be
implementation-defined. Implementations SHOULD expose the rules and sample
results of tokenization as much as possible to enable users to predict and
interpret the results of tokenization. Tokenization MUST conform only to these
constraints:

a. Each word MUST consist of one or more consecutive characters;

b. The tokenizer MUST preserve the containment hierarchy (paragraphs contain
sentences contain words); and

c. The tokenizer MUST, when tokenizing two equal strings, identify the same
tokens in each. 



This completes ACTION FTTF-128-04 on Pat: To provide the wording for having
Tokenization as implementation-defined be a "should". 

Received on Friday, 17 November 2006 14:42:59 UTC