[Bug 3783] Tokenization: When to flow-through/flow-around markup?

http://www.w3.org/Bugs/Public/show_bug.cgi?id=3783

           Summary: Tokenization: When to flow-through/flow-around markup?
           Product: XPath / XQuery / XSLT
           Version: Working drafts
          Platform: PC
        OS/Version: Windows XP
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Full Text
        AssignedTo: jim.melton@acm.org
        ReportedBy: joaquin.delgado@oracle.com
         QAContact: public-qt-comments@w3.org


>Issue: paragraphs and sentences (Test, mostly)
>Sentence boundary detection is highly language-dependent and
>relies on specific language and perhaps even vocabulary knowledge.
>Paragraph boundaries ditto likewise, although in practice folks
>put paragraph structure into their markup, so then the issue is
>which markup counts as breaking paragraphs and which doesn't?
>
>Issue: flow-through/flow-around markup (Test, mostly)
>Similarly: which markup indicates word breaks and which doesn't?
>Which markup is flowed-around (e.g. footnotes) for phrase and
>proximity matching?
>
>I call these two spec issues also only because it is weird that
>we have query options for ignoring some nodes, but not for
>specifying any of these other important facts.  For the record,
>I think it is correct not to have them in the query, but I also
>think putting ignored nodes into the query is a big mistake as
>well.  I also think we need to acknowledge them in some way in
>testing and the spec.
Now, here we do have a testing issue as well as a spec problem, and we should
discuss this in the task force right away. I would categorize these two issues
under the same umbrella: when to flow through or flow around markup. In other
words, some nodes should be considered or ignored for tokenization and
querying, and that choice might alter the semantics of some of the operators
defined in the spec. You have a valid point about FTIgnoreOption. For example,
can bold markup, which is not a word breaker and is therefore ignored by the
tokenizer, still be considered part of the search context (i.e., allow the
search to be restricted to bolded nodes only)?

I propose adding the capability to:

    * Ignore tags in a particular namespace (e.g. the XHTML namespace)
    * Declare tags as delimiters for words, sentences and paragraphs.
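
To make the two capabilities concrete, here is a rough sketch in Python of a
tokenizer driven by such declarations. It is not anything the Full Text draft
defines: the function name, its parameters, the sample vocabulary (doc, para,
footnote) and the reading of "ignore" as "transparent to tokenization" are all
assumptions made for illustration. Tags in an ignored namespace flow through
(no word break), listed tags are flowed around (content skipped), listed tags
delimit paragraphs, and any other tag is treated as a word break.

```python
import re
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"
PARA = object()  # sentinel marking a paragraph boundary


def split_tag(tag):
    """Split an ElementTree tag of the form '{ns}local' into (ns, local)."""
    if tag.startswith("{"):
        ns, local = tag[1:].split("}", 1)
        return ns, local
    return "", tag


def tokenize(root, transparent_ns=frozenset(), skipped=frozenset(),
             paragraph_tags=frozenset()):
    """Return (word, paragraph_index) pairs under hypothetical markup rules."""
    pieces = []

    def walk(elem):
        ns, local = split_tag(elem.tag)
        if local in skipped:
            return  # flow-around: drop the content, keep the text after it
        is_para = local in paragraph_tags
        # flow-through: the tag contributes no break at all
        transparent = ns in transparent_ns and not is_para
        boundary = PARA if is_para else (None if transparent else " ")
        if boundary is not None:
            pieces.append(boundary)
        if elem.text:
            pieces.append(elem.text)
        for child in elem:
            walk(child)
            if child.tail:
                pieces.append(child.tail)
        if boundary is not None:
            pieces.append(boundary)

    walk(root)
    pieces.append(PARA)  # flush whatever text remains

    words, buf, para = [], [], 0
    for piece in pieces:
        if piece is PARA:
            found = re.findall(r"\w+", "".join(buf))
            buf.clear()
            if found:
                words.extend((w, para) for w in found)
                para += 1
        else:
            buf.append(piece)
    return words


SAMPLE = (
    '<doc xmlns:h="http://www.w3.org/1999/xhtml">'
    "<para>un<h:b>believ</h:b>able<footnote>see appendix</footnote> claims</para>"
    "<para>second<br/>paragraph</para>"
    "</doc>"
)

root = ET.fromstring(SAMPLE)
tokens = tokenize(root, transparent_ns={XHTML}, skipped={"footnote"},
                  paragraph_tags={"para"})
naive = tokenize(root)  # every tag is a word break, nothing is skipped
```

With the declarations, "un<h:b>believ</h:b>able" is one token and the footnote
text does not sit between "unbelievable" and "claims" for phrase matching;
without them the same input yields "un", "believ", "able" plus the footnote
words. Sentence delimiters are left out of the sketch, since they would need
language-specific rules as noted in the quoted text.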

Received on Monday, 2 October 2006 18:15:05 UTC