[Bug 11272] New: [FT] Tokenization and wildcards

From: <bugzilla@jessica.w3.org>
Date: Tue, 09 Nov 2010 11:32:20 +0000
To: public-qt-comments@w3.org
Message-ID: <bug-11272-523@http.www.w3.org/Bugs/Public/>

           Summary: [FT] Tokenization and wildcards
           Product: XPath / XQuery / XSLT
           Version: Candidate Recommendation
          Platform: PC
        OS/Version: Windows NT
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Full Text 1.0
        AssignedTo: jim.melton@acm.org
        ReportedBy: tim@cbcl.co.uk
         QAContact: public-qt-comments@w3.org

It is also unclear whether query tokenization and search context tokenization
are necessarily the same function, and how matching and implementation-defined
tokenization

Section 4.1 Tokenization seems to address only the requirements of search
context tokenization (identification of tokens with position, sentence and
paragraph), and suggests a function of the form

declare function fts:tokenize( $searchContext as item(),
                               $language as xs:string? ) 
  as element(fts:tokenInfo)* external;

$language is an argument because Section 3.4.1 Language Option states that the
language option can affect tokenization.
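
As a rough illustration of what a search context tokenizer of this shape has
to produce, here is a minimal Python sketch.  It is not taken from the
specification: the names TokenInfo and tokenize, the blank-line paragraph
rule, the ". ! ?" sentence rule and the \w+ word rule are all assumptions made
for the example, and the language argument is accepted but ignored.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class TokenInfo:
    """Hypothetical stand-in for fts:tokenInfo (names are assumptions)."""
    word: str
    position: int   # 1-based token position in the search context
    sentence: int   # 1-based sentence number, counted across paragraphs
    paragraph: int  # 1-based paragraph number


def tokenize(search_context: str, language: Optional[str] = None) -> List[TokenInfo]:
    """Split text into tokens, recording position, sentence and paragraph.

    Assumed rules: paragraphs are separated by blank lines, sentences end
    at '.', '!' or '?', and a token is a run of word characters.
    `language` is ignored here; a real tokenizer might switch rules on it.
    """
    tokens: List[TokenInfo] = []
    position = 0
    sentence = 0
    paragraphs = re.split(r"\n\s*\n", search_context.strip())
    for para_no, para in enumerate(paragraphs, start=1):
        # Split after sentence-ending punctuation followed by whitespace.
        for sent in re.split(r"(?<=[.!?])\s+", para):
            words = re.findall(r"\w+", sent)
            if not words:
                continue
            sentence += 1
            for word in words:
                position += 1
                tokens.append(TokenInfo(word, position, sentence, para_no))
    return tokens
```

Under these assumptions, tokenize("One two. Three!\n\nFour") yields four
tokens, with "Three" at position 3 in sentence 2 of paragraph 1 and "Four" at
position 4 in sentence 3 of paragraph 2.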

Section 3.2 states:

"Otherwise, each of those strings is tokenized into a sequence of tokens as
described in Section 4.1 Tokenization. "

However, tokenization of the search tokens must use a different process: it
must vary depending on the wildcard option, it doesn't attempt to identify
sentence and paragraph boundaries, and it returns fts:queryToken values rather
than fts:tokenInfo values.  This suggests a function of the form:

declare function fts:tokenizeQuery( $ftWordsValue as xs:string*,
                                    $language as xs:string?,
                                    $wildcardOptionEnabled as xs:boolean ) 
  as element(fts:queryToken)* external;

The $wildcardOptionEnabled argument specifies how the query tokenizer should
handle wildcard indicators.
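
To make the difference concrete, here is a minimal Python sketch of a query
tokenizer whose behaviour depends on the wildcard flag.  The names QueryToken,
tokenize_query and matches are hypothetical, the \w+ word rule is an
assumption, and the language argument is ignored.  The sketch recognises the
wildcard indicators ".", ".?", ".*", ".+" and ".{n,m}", which happen to have
the same meaning in Python regular expressions, so a wildcard token can be
used directly as a pattern.  It does not handle escaped periods.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Wildcard indicators: ".", ".?", ".*", ".+", ".{n,m}"
_WILDCARD = r"\.(?:[?*+]|\{\d+,\d+\})?"
_TOKEN_WITH_WILDCARDS = re.compile(rf"(?:\w|{_WILDCARD})+")
_TOKEN_PLAIN = re.compile(r"\w+")


@dataclass(frozen=True)
class QueryToken:
    """Hypothetical stand-in for fts:queryToken (names are assumptions)."""
    word: str
    is_pattern: bool  # True if the token contains a wildcard indicator


def tokenize_query(ft_words_value: List[str],
                   language: Optional[str] = None,
                   wildcard_option_enabled: bool = False) -> List[QueryToken]:
    """Tokenize query strings; keep wildcard indicators only when enabled."""
    token_re = _TOKEN_WITH_WILDCARDS if wildcard_option_enabled else _TOKEN_PLAIN
    tokens: List[QueryToken] = []
    for value in ft_words_value:
        for word in token_re.findall(value):
            is_pattern = wildcard_option_enabled and "." in word
            tokens.append(QueryToken(word, is_pattern))
    return tokens


def matches(query_token: QueryToken, candidate: str) -> bool:
    """Compare a query token with one search-context token (case-sensitive)."""
    if query_token.is_pattern:
        return re.fullmatch(query_token.word, candidate) is not None
    return query_token.word == candidate
```

With the flag enabled, tokenize_query(["wild.* card"], None, True) keeps
"wild.*" as a single pattern token that matches "wildcard".  With the flag
disabled, the period is treated as a separator, the same string yields the
plain tokens "wild" and "card", and neither matches "wildcard".  This is why
the same tokenizer cannot simply be reused for both purposes.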

Is my understanding correct?

Received on Tuesday, 9 November 2010 11:32:24 UTC
