- From: Alan Kent <ajk@mds.rmit.edu.au>
- Date: Thu, 19 Feb 2004 16:57:08 +1100
- To: ZIG <www-zig@w3.org>
Hi,

I was wondering what normal practice is for various servers, and what clients assume, with respect to scanning and terms/display terms. I am more interested in concrete implementations than in high-level semantics.

Is it assumed that all terms from a scan list can be used in queries? If simple rules are used to extract terms from documents (e.g. fold to upper case), then there are no problems. But what if soundex or a stemmer is used? The internal value for "SMITH" may be "S123". Should "S123" be returned as the scan term? Searching for "S123" will not necessarily match any records (the soundex algorithm may munge "S123" into a different value, so it won't match).

One possible interpretation is to use the display term for the original term. For example, put "S123" into the term field, but "SMITH" into the display term. This leads on to my question regarding what clients assume: for clients that support scanning with a 'click to search' for the returned terms, do clients in practice use the 'term' value or the 'displayTerm' value when formulating the search (if both are available)? Bath says to display the 'displayTerm' if available, but is that used in searches too?

Another thread: since many different values will map to the same soundex root (that is the whole idea, after all), do you end up with multiple identical terms in a scan response, but with different display terms associated with them? For example:

    S123  SMITH
    S123  SMITHYE
    S123  SMYTH
    S123  SCHMIDT

Or should the terms be unique? Then we could return an "indicative" or "sample" term as the display term (in practice it might be the first term that was come across during indexing):

    S000  SANDWITCH
    S123  SMITH
    S423  SLEEP

Or maybe we could return multiple display terms (in a single string) that map to the same 'term' value:

    S000  SANDWITCH
    S123  SMITH, SMITHYE, SMYTH, SCHMIDT
    S423  SLEEP

If clients use the display term to query on (when available), the latter case seems an undesirable thing to do.
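To make the three candidate scan layouts above concrete, here is a minimal Python sketch. Everything in it is hypothetical: the `INDEXED` data reuses the illustrative keys from this message (they are not real Soundex output), a scan entry is modelled as a plain `(term, display_term)` tuple, and `click_to_search_term` only stands in for whatever a client might do. It is not taken from any Z39.50 toolkit.

```python
# Illustrative (word, internal key) pairs in indexing order.
# The keys are the made-up values from the message, NOT real Soundex codes.
INDEXED = [
    ("SANDWITCH", "S000"),
    ("SMITH", "S123"),
    ("SMITHYE", "S123"),
    ("SMYTH", "S123"),
    ("SCHMIDT", "S123"),
    ("SLEEP", "S423"),
]


def group_by_key(indexed):
    """Map each internal key to its original words, keeping indexing order."""
    groups = {}
    for word, key in indexed:
        groups.setdefault(key, []).append(word)
    return groups


def scan_duplicated(indexed):
    """Layout 1: one entry per word, so the same term can repeat."""
    # sorted() is stable, so words sharing a key keep their indexing order.
    return sorted(((key, word) for word, key in indexed), key=lambda e: e[0])


def scan_sample(indexed):
    """Layout 2: unique terms, first word seen as the display term."""
    return [(key, words[0]) for key, words in sorted(group_by_key(indexed).items())]


def scan_joined(indexed):
    """Layout 3: unique terms, all words joined into one display string."""
    return [(key, ", ".join(words)) for key, words in sorted(group_by_key(indexed).items())]


def click_to_search_term(entry, prefer_display):
    """Hypothetical client choice: query on the display term or on the term."""
    term, display = entry
    return display if (prefer_display and display) else term
```

Under these assumptions, a client that prefers the display term would, for the joined layout, build its query from the string "SMITH, SMITHYE, SMYTH, SCHMIDT", while a client that uses the term would send "S123". Neither is obviously something the server can search on, which is the problem in question.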
This also leads me to ask how clients normally paginate through a scan list. Do they normally take the last term returned from the previous scan as the first term for the next scan request? Or do they specify the same starting term every time, and change the start position?

What I think is that the 'term' returned for a scan should be the term that can be queried on. How a system internally implements something is not relevant. So above, an internal data structure might use "S123", but the 'term' returned in the scan list should be something that can be searched on. But I am less clear whether the display term is meant to also be searchable, or whether it is more like a human-friendly view, never to be used for any purpose other than showing to a human. I.e. don't use it for queries.

Thanks for any wisdom!

Alan
Received on Thursday, 19 February 2004 00:57:12 UTC