FTS comments from Kai Großjohann on 2003-03-22 (public-qt-comments@w3.org from March 2003)

From: Kai Großjohann <kai.grossjohann@uni-duisburg.de>
Date: Sat, 22 Mar 2003 22:18:57 +0100
To: public-qt-comments@w3.org
Message-ID: <8465qb5cce.fsf@lucy.is.informatik.uni-duisburg.de>
I have read the FTS requirements document and the use cases
(http://www.w3.org/TR/xmlquery-full-text-requirements/ and
http://www.w3.org/TR/xmlquery-full-text-use-cases/), and would like
to make some comments.

First of all, I'm happy that work is proceeding in the general
direction of providing more IR functionality in XQuery.  It is dear
to my heart :-)

I have two comments:

* Application of SCORE to non-text conditions.

  I believe that vagueness and uncertainty, the central issues of
  Information Retrieval, are vital features for systems even outside
  the domain of full text.  Consider the infamous used-car database
  example: say the user searches for a white Lincoln Continental from
  2001 with a given mileage (is that the right word? number of miles
  run by that car is what I mean) and price.  There are no full text
  conditions in this example.  Yet, what happens if there are no cars
  fulfilling the exact condition but only cars that are "close
  matches"?  One approach would be to interpret the conditions
  vaguely.  Another approach requires the user to specify another
  query to find those "close matches".  However, the latter approach
  requires the user to know which of the query conditions to relax to
  find that close match, and thus requires knowledge of the contents
  of the database.  Surely this is not desirable: if the user knew
  what's in the database, why search?

  Therefore, the "best match" approach is important also for non-text
  conditions.

  In the requirements document it says that the SCORE language should
  be either equal to the FTS language or a superset thereof.  I
  couldn't find a use case where a vague interpretation is given to a
  non-text condition.

  I'm saying that it should be possible for the user to specify a
  vague interpretation for *every* query condition: wherever XQuery
  allows strict equality, allow vague equality, too.  Wherever XQuery
  allows strict less-than, allow vague less-than, too.  Wherever
  XQuery allows Boolean and, allow a vague version, too.  And so on.

  (XQuery does not seem to allow vague conditions on the XML
  structure, either...)


* Higher-level, semantic, search predicates.

  The use cases document talks a lot about proximity search and that
  the user should be able to specify various special cases: word
  order required or not required, number of stopwords or
  non-stopwords allowed between the matching terms, whether or not
  an element boundary is allowed, and other things.

  I think that the user really wishes to do phrase search.

  All the above specifications are just (poor) approximations on that
  goal.  I don't think that the user wishes to think about the word
  order or the number of intervening stopwords that are allowed.  The
  user just wants to search for "information retrieval" and find
  "... retrieval of information ..." but not "... retrieval.
  Information about...".

  The situation is similar to stemming: in the old days the systems
  had wildcards, and then it was up to the user to emulate stemming
  with wildcards.  Now the FTS use cases talk about stemming,
  carefully sidestepping the problem of actual implementation.

  In the same vein, I suggest to talk about phrase search, and leave
  the implementation up to the, err, implementors.  (Actually, you
  offer wildcards in addition to stemming, so I guess it's okay to
  offer proximity search in addition to phrase search.  But phrase
  search is more important than proxmity search IMHO.)

  (I think the use cases document uses "phrase" to describe a
  sequence of words.  I use "phrase" in a linguistic sense of, say, a
  noun phrase.)
  

Kai
-- 
A preposition is not a good thing to end a sentence with.
Received on Saturday, 22 March 2003 16:21:59 UTC