- From: <bugzilla@wiggum.w3.org>
- Date: Mon, 02 Oct 2006 18:14:55 +0000
- To: public-qt-comments@w3.org
- CC:
http://www.w3.org/Bugs/Public/show_bug.cgi?id=3783

           Summary: Tokenization: When to flow-through/flow-around markup?
           Product: XPath / XQuery / XSLT
           Version: Working drafts
          Platform: PC
        OS/Version: Windows XP
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Full Text
        AssignedTo: jim.melton@acm.org
        ReportedBy: joaquin.delgado@oracle.com
         QAContact: public-qt-comments@w3.org

>Issue: paragraphs and sentences (Test, mostly)
>Sentence boundary detection is highly language-dependent and
>relies on specific language and perhaps even vocabulary knowledge.
>Paragraph boundaries ditto likewise, although in practice folks
>put paragraph structure into their markup, so then the issue is
>which markup counts as breaking paragraphs and which doesn't?
>
>Issue: flow-through/flow-around markup (Test, mostly)
>Similarly: which markup indicates word breaks and which doesn't?
>Which markup is flowed-around (e.g. footnotes) for phrase and
>proximity matching?
>
>I call these two spec issues also only because it is weird that
>we have query options for ignoring some nodes, but not for
>specifying any of these other important facts. For the record,
>I think it is correct not to have them in the query, but I also
>think putting ignored nodes into the query is a big mistake as
>well. I also think we need to acknowledge them in some way in
>testing and the spec.
>
>

Here we have a testing issue as well as a spec problem, and we
should discuss it in the task force right away. I would categorize
these two issues under the same umbrella: when to flow through or
flow around markup. In other words, some nodes should be considered
or ignored for tokenization and querying, and that choice might alter
the semantics of some of the operators defined in the spec.

You have a valid point about FTIgnoreOption. For example, can bold
markup, which is not a word breaker and is therefore ignored by the
tokenizer, be considered part of the search context (i.e. allowing
the search to be restricted to bolded nodes only)?

I propose to have the capability to:

* Ignore tags in a particular namespace (e.g. the XHTML namespace)
* Declare tags as delimiters for words, sentences and paragraphs.
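
To make the two proposed capabilities concrete, here is a minimal
sketch in Python of a tokenizer that flows through tags in an ignored
namespace, flows around footnote-like tags, and treats declared tags
as word or paragraph delimiters. It is only an illustration under
stated assumptions, not anything defined by the Full Text drafts: the
configuration names, the sample vocabulary (doc, para, note, sep) and
the rule that unlisted tags are flowed through are all hypothetical,
and sentence delimiters are left out for brevity.

```python
import re
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical configuration for this sketch (not part of any spec).
IGNORED_NAMESPACES = {XHTML_NS}        # tags here are flowed through
FLOW_AROUND_TAGS = {"note"}            # subtree is skipped (e.g. footnotes)
WORD_DELIMITER_TAGS = {"sep"}          # tag boundaries force a word break
PARAGRAPH_DELIMITER_TAGS = {"para"}    # tag boundaries force a paragraph break

PARA_BREAK = "\u2029"  # sentinel assumed not to occur in the source text


def split_tag(tag):
    """Split ElementTree's '{namespace}local' form into (namespace, local)."""
    if tag.startswith("{"):
        ns, local = tag[1:].split("}", 1)
        return ns, local
    return "", tag


def flatten(elem, out):
    """Append the text of elem to out, inserting break markers as configured."""
    ns, local = split_tag(elem.tag)
    if ns not in IGNORED_NAMESPACES and local in FLOW_AROUND_TAGS:
        return  # flowed around: content dropped, caller still adds the tail
    if ns in IGNORED_NAMESPACES:
        boundary = ""            # flow through: no break at the tag boundary
    elif local in PARAGRAPH_DELIMITER_TAGS:
        boundary = PARA_BREAK
    elif local in WORD_DELIMITER_TAGS:
        boundary = " "
    else:
        boundary = ""            # assumption: unlisted tags are flowed through
    out.append(boundary)
    out.append(elem.text or "")
    for child in elem:
        flatten(child, out)
        out.append(child.tail or "")
    out.append(boundary)


def tokenize(xml_text):
    """Return a list of paragraphs, each a list of word tokens."""
    out = []
    flatten(ET.fromstring(xml_text), out)
    paragraphs = []
    for chunk in "".join(out).split(PARA_BREAK):
        words = re.findall(r"\w+", chunk)
        if words:
            paragraphs.append(words)
    return paragraphs


SAMPLE = (
    '<doc xmlns:h="http://www.w3.org/1999/xhtml">'
    '<para>The <h:b>quick</h:b>est fox<note>see appendix</note> jumps</para>'
    '<para>over<sep/>the dog</para>'
    '</doc>'
)

print(tokenize(SAMPLE))
# [['The', 'quickest', 'fox', 'jumps'], ['over', 'the', 'dog']]
```

In the sample, the bold tag does not split "quickest", the note is
skipped so that "fox" and "jumps" stay adjacent for phrase matching,
and the sep tag separates "over" from "the" even though no whitespace
appears between them. Whether the bolded text should still be
addressable as a search context is a separate question that this
sketch does not try to answer.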
Received on Monday, 2 October 2006 18:15:05 UTC