Re: Terms and display terms in scan from Ray Larson on 2004-02-20 (www-zig@w3.org from February 2004)

From: Ray Larson <ray@SIMS.Berkeley.EDU>
Date: Thu, 19 Feb 2004 16:08:58 -0800 (PST)
To: ajk@mds.rmit.edu.au
Cc: www-zig@w3.org
Message-Id: <200402200008.i1K08wes012822@sherlock.sims.berkeley.edu>

>>>>> "Alan" == Alan Kent <ajk@mds.rmit.edu.au> writes:

    Alan> On Thu, Feb 19, 2004 at 11:35:53AM +0000, Robert Sanderson
    Alan> wrote:
    >> We return the stemmed term, eg 'happi' for happy, happily,
    >> happiness.

This isn't completely true (and Rob knows better): what scan returns
is whatever was put into the index after normalization. If stemming
was requested during normalization, stemmed terms are returned; if
not, they are not stemmed.
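
A rough sketch of that point, under stated assumptions: the index is a
plain dictionary, and build_index, scan, lowercase_only and toy_stem
are made-up names for illustration, not code from any real server. The
toy stemmer only handles the "happy" examples from this thread.

```python
from bisect import bisect_left

def lowercase_only(token):
    """Hypothetical normalizer without stemming."""
    return token.lower()

def toy_stem(token):
    """Hypothetical normalizer with a toy stemming step (illustration only)."""
    w = token.lower()
    for suffix in ("ness", "ly"):
        if w.endswith(suffix):
            w = w[:-len(suffix)]
            break
    if w.endswith("y"):
        w = w[:-1] + "i"
    return w

def build_index(records, normalize):
    """Map each normalized token to the set of record ids containing it."""
    index = {}
    for rec_id, text in records:
        for token in text.split():
            index.setdefault(normalize(token), set()).add(rec_id)
    return index

def scan(index, start, count=5):
    """Return (term, record count) pairs for index terms at or after start."""
    keys = sorted(index)
    pos = bisect_left(keys, start)
    return [(k, len(index[k])) for k in keys[pos:pos + count]]

records = [(1, "Happy"), (2, "happily"), (3, "happiness")]
plain = build_index(records, lowercase_only)
stemmed = build_index(records, toy_stem)
```

Here scan(plain, "happ") lists 'happily', 'happiness' and 'happy'
separately, while scan(stemmed, "happ") lists the single term 'happi'.
Scan only reports what the chosen normalization put into the index.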

    Alan> So returned 'term' values may be munged, but are used for
    Alan> searching.

    Alan> This implies you have to guarantee any output of your
    Alan> stemmer can be fed back into the stemmer and have the same
    Alan> value output again.  Otherwise the term from the scan could
    Alan> not be used for searching.

That is the case for our stemmer: a stemmed word submitted to the
stemmer returns the same word.

    Alan> In the case of soundex, this could be achieved by looking at
    Alan> the term and saying "ooh, that looks like the output of the
    Alan> soundex algorithm - I will just leave that alone".

Right, and I have seen cases where the term in TermInfo is an
incomprehensible mangle of stuff, and displayTerm may or may not be
filled in.

    Alan> This is also consistent with what Ashley does - if it has
    Alan> spaces, munge it. If it does not have spaces, maybe its a
    Alan> scan term so don't do anything to it.

    Alan> Thanks Alan


Cheers,
Ray Larson

Received on Thursday, 19 February 2004 19:09:02 UTC