- From: <bugzilla@wiggum.w3.org>
- Date: Mon, 02 Oct 2006 18:14:55 +0000
- To: public-qt-comments@w3.org
http://www.w3.org/Bugs/Public/show_bug.cgi?id=3783
Summary: Tokenization: When to flow-through/flow-around markup?
Product: XPath / XQuery / XSLT
Version: Working drafts
Platform: PC
OS/Version: Windows XP
Status: NEW
Severity: normal
Priority: P2
Component: Full Text
AssignedTo: jim.melton@acm.org
ReportedBy: joaquin.delgado@oracle.com
QAContact: public-qt-comments@w3.org
>Issue: paragraphs and sentences (Test, mostly)
>Sentence boundary detection is highly language-dependent and
>relies on specific language and perhaps even vocabulary knowledge.
>Paragraph boundaries likewise, although in practice folks
>put paragraph structure into their markup, so then the issue is
>which markup counts as breaking paragraphs and which doesn't?
>
>Issue: flow-through/flow-around markup (Test, mostly)
>Similarly: which markup indicates word breaks and which doesn't?
>Which markup is flowed-around (e.g. footnotes) for phrase and
>proximity matching?
>
>I call these two spec issues also only because it is weird that
>we have query options for ignoring some nodes, but not for
>specifying any of these other important facts. For the record,
>I think it is correct not to have them in the query, but I also
>think putting ignored nodes into the query is a big mistake as
>well. I also think we need to acknowledge them in some way in
>testing and the spec.
>
>
Now, here we do have a testing issue as well as a spec problem, and we should
discuss this in the taskforce right away. I would categorize these two issues
under the same umbrella: when to flow-through/flow-around markup. In other
words, some nodes should be considered or ignored for tokenization and
querying, and that choice might alter the semantics of some of the operators
defined in the spec. You have a valid point about FTIgnoreOption. For example,
can bold markup, which is not a word breaker and is therefore ignored by the
tokenizer, be considered part of the search context (i.e. allowing the
search to be restricted to bolded nodes only)?
I propose adding the capability to:
* Ignore tags in a particular namespace (e.g. the XHTML namespace)
* Declare tags as delimiters for words, sentences, and paragraphs.
Received on Monday, 2 October 2006 18:15:05 UTC