[Bug 4667] [Full text LC draft sec. 2.1] Status of text (nodes)

http://www.w3.org/Bugs/Public/show_bug.cgi?id=4667


jim.melton@acm.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED




------- Comment #1 from jim.melton@acm.org  2007-06-26 16:59 -------
If we ignore questions about the effect of element boundaries on tokenization,
it is correct to say that the tokenization algorithm is applied to the string
value of the search context.  See section 4.1, definition of tokenization. 
Therefore, the number of tokens (token occurrences) does not vary depending on
embedded markup. 
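To illustrate (this is not part of the specification), a small Python sketch of the point above, assuming a hypothetical word tokenizer that splits on runs of non-letters: because tokenization operates on the string value of the search context, markup whose tags fall at existing token boundaries does not change the token count.

```python
import re
import xml.etree.ElementTree as ET

def string_value(xml):
    """Concatenate all descendant text nodes, like the XPath string value."""
    return "".join(ET.fromstring(xml).itertext())

def tokenize(text):
    """A stand-in word tokenizer: one token per run of letters."""
    return re.findall(r"[A-Za-z]+", text)

plain  = tokenize("a word within text")
marked = tokenize(string_value("<p>a <em>word</em> within text</p>"))
assert plain == marked == ["a", "word", "within", "text"]
```

Here the <em> tags sit next to spaces, which are already token boundaries for this tokenizer, so the marked-up and plain texts tokenize identically.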

The specification currently says, in Section 1.1, second list, item 5, "Some
formatting markup serves well as token boundaries, for example, paragraphs are
most commonly delimited by formatting markup. Other formatting markup may not
serve well as token boundaries.  Implementations are free to provide
implementation-defined ways to differentiate between the markup's effect on
token boundaries during tokenization."  If all element tags in the search
context fall adjacent to locations that would be tokenization boundaries in any
case (e.g., space characters in our default tokenizer), then the answer does
not change based on whether an implementation chooses to use or to ignore the
presence of certain elements to determine token boundaries. 

When an element tag occurs at a location other than an "ordinary" token
boundary, then the answer might change if the tokenizer chooses to create a
token boundary based on the presence and position of the element tag. 

We believe that this answers your question.  However, it occurs to us that an
example might help readers. Do you think that an example such as this one would
satisfy your comment? "<p>Emphasize a <em>syl</em>lable, as well as a
<em>word</em> within text.</p>".  There might be ten or eleven tokens in that
text, depending on whether the implementation chooses to create a token
boundary at the first </em> closing tag.  The second <em>...</em> element would
not (in our default tokenizer) affect the number of tokens in the search
context. 
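The two counts in the example above can be sketched in Python (again, purely illustrative and not part of the specification), assuming two tokenizer policies: one that ignores element boundaries entirely, and one that treats every element tag as a token boundary.

```python
import re
import xml.etree.ElementTree as ET

SAMPLE = ("<p>Emphasize a <em>syl</em>lable, as well as a "
          "<em>word</em> within text.</p>")

def tokens_ignoring_markup(xml):
    """Tokenize the whole string value: element tags contribute nothing."""
    text = "".join(ET.fromstring(xml).itertext())
    return re.findall(r"[A-Za-z]+", text)

def tokens_breaking_at_tags(xml):
    """Tokenize each text node separately: every tag is a boundary."""
    return [tok for node in ET.fromstring(xml).itertext()
                for tok in re.findall(r"[A-Za-z]+", node)]

assert len(tokens_ignoring_markup(SAMPLE)) == 10   # "syllable" is one token
assert len(tokens_breaking_at_tags(SAMPLE)) == 11  # "syl" and "lable"
```

Note that the second <em>word</em> element yields "word" as a single token under both policies, since its tags abut spaces that are token boundaries in any case.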

Regarding the use of the FTIgnoreOption, the document currently states (see
section 4.3.1) that the process of ignoring is dependent on implementation
decisions about whether various element boundaries are or are not ignored
during tokenization.  We anticipate a proposal (from Michael Rys) that would
further relax the limitations to allow the behavior to be even more
implementation-defined.

I have marked this bug FIXED in the belief that we have answered your question
and with our agreement that we will happily add an example such as the one we
suggested, at your request.  If you agree that this resolves your concern,
please mark this bug CLOSED. 

Received on Tuesday, 26 June 2007 16:59:46 UTC