Re: FTS comments from Pat Case on 2003-03-24 (public-qt-comments@w3.org from March 2003)

From: Pat Case <PCASE@crs.loc.gov>
Date: Mon, 24 Mar 2003 09:27:14 -0500
To: <kai.grossjohann@uni-duisburg.de>, <public-qt-comments@w3.org>
Cc: <member-query-fttf@w3.org>
Message-Id: <se7ecf88.014@crs.loc.gov>

Kai,

These are all personal responses. I can't speak for the Working Group.
See inline.


Pat Case, Librarian, LIS Interface Team
Congressional Research Service
Library of Congress
101 Independence Ave., SE, LM-223
Washington, DC 20540-7000
202-707-9104 
202-252-3370 (Fax)
pcase@crs.loc.gov

>>> Kai Großjohann <kai.grossjohann@uni-duisburg.de> 03/22/03 04:18PM >>>

I have read the FTS requirements document and the use cases
(http://www.w3.org/TR/xmlquery-full-text-requirements/ and
http://www.w3.org/TR/xmlquery-full-text-use-cases/), and would like
to make some comments.

First of all, I'm happy that work is proceeding in the general
direction of providing more IR functionality in XQuery.  It is dear
to my heart :-)

I have two comments:

* Application of SCORE to non-text conditions.

  I believe that vagueness and uncertainty, the central issues of
  Information Retrieval, are vital features for systems even outside
  the domain of full text.  Consider the infamous used-car database
  example: say the user searches for a white Lincoln Continental from
  2001 with a given mileage (is that the right word? number of miles
  run by that car is what I mean) and price.  There are no full text
  conditions in this example.  Yet, what happens if there are no cars
  fulfilling the exact condition but only cars that are "close
  matches"?  One approach would be to interpret the conditions
  vaguely.  Another approach requires the user to specify another
  query to find those "close matches".  However, the latter approach
  requires the user to know which of the query conditions to relax to
  find that close match, and thus requires knowledge of the contents
  of the database.  Surely this is not desirable: if the user knew
  what's in the database, why search?

  Therefore, the "best match" approach is important also for non-text
  conditions.

  In the requirements document it says that the SCORE language should
  be either equal to the FTS language or a superset thereof.  I
  couldn't find a use case where a vague interpretation is given to a
  non-text condition.

  I'm saying that it should be possible for the user to specify a
  vague interpretation for *every* query condition: wherever XQuery
  allows strict equality, allow vague equality, too.  Wherever XQuery
  allows strict less-than, allow vague less-than, too.  Wherever
  XQuery allows Boolean and, allow a vague version, too.  And so on.

  (XQuery does not seem to allow vague conditions on the XML
  structure, either...)
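
A minimal sketch of the used-car example, assuming one possible way to
make equality, less-than and "and" vague. Nothing here comes from the
requirements or use cases documents: the names `vague_eq`, `vague_le`,
`vague_and`, the linear fall-off, the tolerances and the sample cars are
all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Car:
    model: str
    color: str
    year: int
    mileage: int
    price: int


def vague_eq(actual, wanted, tolerance):
    """1.0 on an exact match, falling linearly to 0.0 at `tolerance` away."""
    return max(0.0, 1.0 - abs(actual - wanted) / tolerance)


def vague_le(actual, limit, tolerance):
    """1.0 when actual <= limit, falling linearly to 0.0 at limit + tolerance."""
    if actual <= limit:
        return 1.0
    return max(0.0, 1.0 - (actual - limit) / tolerance)


def vague_and(*scores):
    """Product of the scores: one possible vague version of Boolean 'and'."""
    result = 1.0
    for s in scores:
        result *= s
    return result


def score(car):
    # Assumed query: white Lincoln Continental from 2001,
    # at most 30000 miles, at most 20000 in price.
    return vague_and(
        1.0 if car.model == "Lincoln Continental" else 0.0,
        1.0 if car.color == "white" else 0.5,  # arbitrary partial credit
        vague_eq(car.year, 2001, tolerance=4),
        vague_le(car.mileage, 30000, tolerance=20000),
        vague_le(car.price, 20000, tolerance=5000),
    )


cars = [
    Car("Lincoln Continental", "white", 2000, 35000, 21000),
    Car("Lincoln Continental", "black", 2001, 28000, 19500),
    Car("Ford Taurus", "white", 2001, 20000, 15000),
]

# The strict interpretation finds nothing in this sample data.
strict = [
    c for c in cars
    if c.model == "Lincoln Continental" and c.color == "white"
    and c.year == 2001 and c.mileage <= 30000 and c.price <= 20000
]

# The vague interpretation still ranks the close matches first.
ranked = sorted(cars, key=score, reverse=True)
```

With this sample data the strict filter is empty, while the ranking
puts both Lincolns ahead of the Ford without the user having to guess
which condition to relax.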

[Pat Case: I too would like to see score applicable to all XQuery.]

* Higher-level, semantic, search predicates.

  The use cases document talks a lot about proximity search and that
  the user should be able to specify various special cases: word
  order required or not required, number of stopwords or
  non-stopwords allowed between the matching terms, whether or not
  an element boundary is allowed, and other things.

  I think that the user really wishes to do phrase search.

  All the above specifications are just (poor) approximations of that
  goal.  I don't think that the user wishes to think about the word
  order or the number of intervening stopwords that are allowed.  The
  user just wants to search for "information retrieval" and find
  "... retrieval of information ..." but not "... retrieval.
  Information about...".

  The situation is similar to stemming: in the old days the systems
  had wildcards, and then it was up to the user to emulate stemming
  with wildcards.  Now the FTS use cases talk about stemming,
  carefully sidestepping the problem of actual implementation.

  In the same vein, I suggest talking about phrase search, and leaving
  the implementation up to the, err, implementors.  (Actually, you
  offer wildcards in addition to stemming, so I guess it's okay to
  offer proximity search in addition to phrase search.  But phrase
  search is more important than proximity search IMHO.)

  (I think the use cases document uses "phrase" to describe a
  sequence of words.  I use "phrase" in a linguistic sense of, say, a
  noun phrase.)
  
[Pat Case: Please remember that a phrase query is a proximity query
(ordered, allowing no intervening words).

Also remember we are defining the functionalities which will be
available for implementors. We don't expect most end users to define the
parameters for a proximity query, but we do expect them to profit from
proximity querying. 

We expect implementors to build GUIs which utilize proximity queries.
For example, a system may take any search terms in the Words search box
and return them in any order within 9 words of each other, then offer a
More button which might use an "and" operator. Or the implementors might
build queries under buttons or links. The functionality has to be there
so we can develop GUIs for end users. 

And yes, I do want the functionalities to surface for the small number
of expert users who can use them. 

I do not emphasize phrase querying because I think it is as dangerous
as "or" querying is useless. I advise end users to do wider unordered
proximity queries instead. In a system which supports phrase query I
would build a More button that runs a wider unordered proximity query to
pick up the missed results. My favorite example is in the internal
system I work on for congressional documents. Folks search on
"elementary education" and find very little. It is a reasonable query
but it fails because congressional bills almost exclusively carry the
phrase "elementary and secondary education". Allow a few intervening
characters and hundreds of bills are returned.]
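
A rough sketch of the point that a phrase query is a proximity query
with word order enforced and no intervening words, and of how widening
it picks up "elementary and secondary education". The function
`proximity_match`, its parameters `max_gap` and `ordered`, the simple
tokenizer and the sample sentence are hypothetical; this is not the
syntax or semantics of any XQuery full-text draft.

```python
import re
from itertools import product


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def proximity_match(text, terms, max_gap=0, ordered=True):
    """True if all terms occur with at most `max_gap` other words
    inside the span they cover; `ordered` enforces the given order."""
    wanted = [t.lower() for t in terms]
    if not wanted:
        return False
    tokens = tokenize(text)
    # Positions at which each term occurs in the text.
    positions = [[i for i, tok in enumerate(tokens) if tok == t]
                 for t in wanted]
    for combo in product(*positions):
        if len(set(combo)) != len(combo):
            continue  # the same token cannot serve two terms
        if ordered and list(combo) != sorted(combo):
            continue  # terms appear in a different order
        gap = max(combo) - min(combo) + 1 - len(combo)
        if gap <= max_gap:
            return True
    return False


bill = "A bill to improve elementary and secondary education"

# Phrase query: ordered, no intervening words. Misses the bill.
phrase_hit = proximity_match(bill, ["elementary", "education"])

# Wider unordered proximity query: allows two intervening words.
wide_hit = proximity_match(bill, ["elementary", "education"],
                           max_gap=2, ordered=False)

# Unordered proximity also finds "retrieval of information".
reversed_hit = proximity_match("retrieval of information",
                               ["information", "retrieval"],
                               max_gap=1, ordered=False)
```

Note that this sketch ignores sentence boundaries, so an unordered
query would also match "... retrieval. Information about...", which is
the kind of false hit Kai describes above.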



Kai
-- 
A preposition is not a good thing to end a sentence with.
Received on Monday, 24 March 2003 09:30:11 UTC