- From: <bugzilla@wiggum.w3.org>
- Date: Tue, 26 Jun 2007 16:59:43 +0000
- To: public-qt-comments@w3.org
http://www.w3.org/Bugs/Public/show_bug.cgi?id=4667

jim.melton@acm.org changed:

           What      |Removed  |Added
----------------------------------------------------------------------------
           Status    |NEW      |RESOLVED
           Resolution|         |FIXED

------- Comment #1 from jim.melton@acm.org 2007-06-26 16:59 -------

If we ignore questions about the effect of element boundaries on tokenization, it is correct to say that the tokenization algorithm is applied to the string value of the search context; see Section 4.1, definition of tokenization. Therefore, the number of tokens (token occurrences) does not vary depending on embedded markup.

The specification currently says, in Section 1.1, second list, item 5: "Some formatting markup serves well as token boundaries, for example, paragraphs are most commonly delimited by formatting markup. Other formatting markup may not serve well as token boundaries. Implementations are free to provide implementation-defined ways to differentiate between the markup's effect on token boundaries during tokenization."

If all element tags in the search context fall adjacent to locations that would be tokenization boundaries in any case (e.g., space characters in our default tokenizer), then the answer does not change based on whether an implementation chooses to use or to ignore the presence of certain elements to determine token boundaries. When an element tag occurs at a location other than an "ordinary" token boundary, the answer might change if the tokenizer chooses to create a token boundary based on the presence and position of the element tag.

We believe that this answers your question. However, it occurs to us that an example might help readers. Do you think that an example such as this one would satisfy your comment? "<p>Emphasize a <em>syl</em>lable, as well as a <em>word</em> within text.</p>". There might be ten or eleven tokens in that text, depending on whether the implementation chooses to create a token boundary at the first </em> closing tag (a short sketch at the end of this message illustrates the count). The second <em>...</em> element would not (in our default tokenizer) affect the number of tokens in the search context.

Regarding the use of the FTIgnoreOption, the document currently states (see Section 4.3.1) that the process of ignoring depends on implementation decisions about whether various element boundaries are or are not ignored during tokenization. We anticipate a proposal (from Michael Rys) that would further relax the limitations and allow the behavior to be even more implementation-defined.

I have marked this bug FIXED in the belief that we have answered your question and with our agreement that we will happily add an example such as the one we suggested, at your request. If you agree that this resolves your concern, please mark this bug CLOSED.
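To make the ten-versus-eleven count concrete, here is a minimal Python sketch. It is not the specification's tokenization algorithm and not our default tokenizer; it simply splits the string value on non-letter characters. The "|" marker, the tokenize function, and the break_at_element_boundary flag are illustrative stand-ins for an element tag that happens to fall inside a word.

```python
import re

def tokenize(text, break_at_element_boundary=False):
    # "|" stands in for an element tag occurring inside a word, such as the
    # first </em> in the example; it is a stand-in, not real markup.
    if break_at_element_boundary:
        text = text.replace("|", " ")   # element boundary creates a token boundary
    else:
        text = text.replace("|", "")    # element boundary is ignored
    # Treat maximal runs of letters as tokens (a simplistic tokenizer).
    return re.findall(r"[A-Za-z]+", text)

# String value of
#   <p>Emphasize a <em>syl</em>lable, as well as a <em>word</em> within text.</p>
# with "|" marking where the first </em> falls.
string_value = "Emphasize a syl|lable, as well as a word within text."

print(len(tokenize(string_value)))                                  # 10 tokens
print(len(tokenize(string_value, break_at_element_boundary=True)))  # 11 tokens
```

Run as-is, the first call reports ten tokens and the second eleven, matching the counts discussed above; the second <em>...</em> element is surrounded by spaces and so never affects the count.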
Received on Tuesday, 26 June 2007 16:59:46 UTC