Terms and display terms in scan

Hi,

I am wondering what the normal practice is for various servers, and what
clients assume, with respect to scanning and terms/display terms.
I am more interested in concrete implementations than in high-level
semantics.

Is it assumed that all terms from a scan list can be used in queries?
If simple rules are used to extract terms from documents (e.g., fold to
upper case), then there are no problems. But what if soundex or a stemmer
is used? The internal value for "SMITH" may be "S123". Should "S123"
be returned as the scan term? Searching for "S123" will not necessarily
match any records (the soundex algorithm may munge "S123" into a
different value, so it won't match).
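
To make the round-trip problem concrete, here is a minimal sketch,
assuming the server applies classic American Soundex to every query
term. The `soundex` function below is my own illustrative
implementation, not code from any particular server. Note that real
Soundex yields "S530" for "SMITH"; "S123" in this message is just a
placeholder value.

```python
# Letter-to-digit table for classic (American) Soundex.
CODES = {}
for digit, letters in enumerate(["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1):
    for letter in letters:
        CODES[letter] = str(digit)


def soundex(word: str) -> str:
    """Return the 4-character Soundex key of `word`, ignoring non-letters."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    key = letters[0]
    prev = CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "HW":
            continue  # H and W do not separate letters that share a code
        digit = CODES.get(c, "")  # vowels map to "" and reset `prev`
        if digit and digit != prev:
            key += digit
        prev = digit
    return (key + "000")[:4]  # pad or truncate to one letter plus three digits


internal = soundex("SMITH")
print(internal)           # S530
print(soundex(internal))  # S000: the digits are dropped, so the key changes
```

If the scan response returned the internal key as the term, a client
that searched on it would have the key run through the same function
again and end up looking for "S000", which is a different index entry.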

One possible interpretation is to use the display term for the original
term. For example, put "S123" into the term field, but "SMITH" into the
display term.

This leads on to my question regarding what clients assume:

For clients that support scanning with a 'click to search' for the
returned terms, do clients in practice use the 'term' value or the
'displayTerm' value when formulating the search (if both are available)?
The Bath Profile says to display the 'displayTerm' if available, but is
that used in searches too?

Another thread: since many different values will map to the same soundex
root (that is the whole idea after all), do you end up with multiple
identical terms in a scan response, but with different display terms
associated with them? For example:

    S123	SMITH
    S123	SMITHYE
    S123	SMYTH
    S123	SCHMIDT

Or should the terms be unique? Then we could return an "indicative" or
"sample" term as the display term (in practice it might be the first
term encountered during indexing):

    S000	SANDWITCH
    S123	SMITH
    S423	SLEEP

Or maybe we could return multiple display terms (in a single string)
that map to the same 'term' value:

    S000	SANDWITCH
    S123	SMITH, SMITHYE, SMYTH, SCHMIDT
    S423	SLEEP

If clients use the display term to query on (when available), the latter
case seems an undesirable thing to do.
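
The three response shapes above can be sketched as follows. This is only
an illustration under assumptions: `ENTRIES` is a hypothetical list of
(internal key, original word) pairs in indexing order, using the same
placeholder keys as the examples, and the function names are mine.

```python
# Hypothetical index entries in indexing order: (internal key, original word).
ENTRIES = [
    ("S000", "SANDWITCH"),
    ("S123", "SMITH"),
    ("S123", "SMITHYE"),
    ("S123", "SMYTH"),
    ("S123", "SCHMIDT"),
    ("S423", "SLEEP"),
]


def group_by_key(entries):
    """Collect the original words under each internal key, keeping index order."""
    grouped = {}
    for key, word in entries:
        grouped.setdefault(key, []).append(word)
    return grouped


def scan_repeated(entries):
    """Option 1: one (term, displayTerm) row per original word; terms repeat."""
    return sorted(entries, key=lambda row: row[0])


def scan_sample(entries):
    """Option 2: unique terms, with the first word seen as the display term."""
    return [(key, words[0]) for key, words in sorted(group_by_key(entries).items())]


def scan_joined(entries):
    """Option 3: unique terms, with all words joined into one display string."""
    return [(key, ", ".join(words)) for key, words in sorted(group_by_key(entries).items())]


for term, display in scan_joined(ENTRIES):
    print(term, display, sep="\t")
```

Only the second shape gives a display term that is itself a single
word; a client that queried on the display term from the third shape
would be sending the whole comma-separated string.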

This also leads me on to ask how clients normally paginate through a
scan list. Do they normally take the last term returned from the
previous scan as the first term for the next scan request? Or do they
specify the same starting term every time, and change the start
position?
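
To pin down what I mean by the two strategies, here is a rough sketch
against an in-memory sorted term list. The `scan` function, its
`position` parameter (the 1-based slot the start term should occupy in
the response, allowed to be less than 1), and the sample terms are all
assumptions made for illustration; this is not the API of any real
client or server.

```python
from bisect import bisect_left

# Hypothetical sorted scan list with unique terms.
TERMS = sorted(["SALT", "SAND", "SLEEP", "SMITH", "SNOW", "SOAP", "STONE"])


def scan(terms, start_term, position=1, count=3):
    """Return up to `count` terms with `start_term` at 1-based `position`.

    A `position` below 1 places the start term before the returned window.
    """
    anchor = bisect_left(terms, start_term)
    first = max(anchor - (position - 1), 0)
    return terms[first:first + count]


def page_by_last_term(terms, start_term, count):
    """Strategy A: restart each scan from the last term of the previous page."""
    pages = []
    page = scan(terms, start_term, 1, count)
    while page:
        pages.append(page)
        # Position 0 puts the start term just before the window,
        # so the last term of the previous page is not repeated.
        page = scan(terms, page[-1], 0, count)
    return pages


def page_by_position(terms, start_term, count):
    """Strategy B: keep the same start term and move the position each time."""
    pages = []
    offset = 0
    while True:
        page = scan(terms, start_term, 1 - offset, count)
        if not page:
            break
        pages.append(page)
        offset += count
    return pages


print(page_by_last_term(TERMS, "SAND", 3))
print(page_by_position(TERMS, "SAND", 3))
```

With unique terms both strategies walk the same pages. In this sketch,
strategy A depends on each term identifying a single place in the list:
if the same term were repeated across rows, restarting from the last
term could return the same window again.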


What I think is that the 'term' returned for a scan should be the
term that can be queried on. How a system implements something
internally is not relevant. So in the example above, the internal data
structure might use "S123", but the 'term' returned in the scan list
should be something that can be searched on.

But I am less clear whether the display term is meant to also be searchable,
or whether it is more like a human-friendly view, never to be used for
any purpose other than showing to a human. I.e., don't use it for queries.


Thanks for any wisdom!
Alan

Received on Thursday, 19 February 2004 00:57:12 UTC