Re: pre- and post- coordinate indexing from Leonard Will on 2005-10-20 (public-esw-thes@w3.org from October 2005)

From: Leonard Will <L.Will@willpowerinfo.co.uk>
Date: Thu, 20 Oct 2005 17:14:40 +0100
To: public-esw-thes@w3.org
Message-ID: <RK8T5gHwJ8VDFANP@willpowerinfo.co.uk>
In message
<677CE4DD24B12C4B9FA138534E29FB1D64C773@exchange11.fed.cclrc.ac.uk> on
Wed, 19 Oct 2005, "Miles, AJ (Alistair)" <A.J.Miles@rl.ac.uk> wrote

>Could you explain how the indexing/search systems work under the two
>scenarios (pre- and post- coordinate indexing)?  You mentioned an
>'indexing string' in another email, I'm assuming that this is a string of
>descriptors, composed by the indexer, and then entered into a database
>field?  What do indexing strings look like under the two scenarios (i.e. what
>can and can't you write)?  What do the search strings look like under the
>two scenarios (i.e. what can and can't you write), and how is the search
>operation usually implemented?

Alistair

I see that Aida has got there first by responding to your queries. I
agree with what she says, so shall not repeat her points here but just
add a few comments.

>I'm a bit confused about a couple of things ...
>
>Firstly, a thesaurus directive such as:
>
>cut flower production USE cut flowers + crop production
>
>... is that for the searcher or for the indexer?

For both. Most thesauri, in the narrower sense of that word, are
designed primarily for post-coordinate indexing, i.e. they do not
normally have rules for combining terms into strings to denote compound
concepts. (If they do, they are adding elements of classification or
subject heading systems such as citation order, as discussed by Aida.)

An entry in a thesaurus saying

        cut flower production USE cut flowers + crop production

is therefore interpreted by an indexer as meaning

"assign the two descriptors 'cut flowers' and 'crop production'
to a document which deals with 'cut flower production'." These
descriptors are normally assigned independently without showing any
relationship between them.

A searcher will interpret this entry in the thesaurus to mean that a
search for the specific subject of 'cut flower production' should be
formulated as the Boolean statement ('cut flowers' AND 'crop
production').
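
As a rough illustration only, the two readings of that entry can be
sketched in Python. The lookup table and the function names here are
made up for the example; they are not taken from any particular
thesaurus software.

```python
# Hypothetical thesaurus table: non-preferred compound term -> descriptors to use.
USE = {"cut flower production": ["cut flowers", "crop production"]}


def index_terms(concept):
    """Indexer's reading: the descriptors to assign, independently, to a document."""
    return USE.get(concept, [concept])


def search_statement(concept):
    """Searcher's reading: the same descriptors combined with Boolean AND."""
    return " AND ".join(f"'{d}'" for d in index_terms(concept))


print(index_terms("cut flower production"))
print(search_statement("cut flower production"))
```

The same entry drives both functions, which is the sense in which the
directive is "for both".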

>Is there a fundamental difference between thesauri intended for pre-
>coordinate use, and thesauri intended for post-coordinate use?

Inasmuch as they are both lists of uniquely labelled concepts and
relationships between these concepts, no.

The main difference is that a pre-coordinate system needs to have rules
for creating the pre-coordinated strings, and these may be quite
complex, depending on the nature and role of each concept in the string
in relation to the other concepts in the string. The system may
enumerate many of these compounds, to ensure that different users build
them in the same way, or it may rely on users interpreting the rules
consistently. It may specify that some concepts can be used only as
"subheadings", i.e. that they cannot be the first-cited concept in a
string. As I said above, once you start pre-coordinating concepts you
are moving from a simple thesaurus into the field of classification.

>Secondly, I'm *guessing* that under pre-coordinate indexing, an indexer
>could make the following two types of indexing assignment (inventing my
>own syntax):
>
>doc | subject
>----------------------------------
>1   | cut flowers, crop production
>2   | cut flowers + crop production
>
>In the first assignment, the indexer wishes to state that the subjects of
>document 1 are cut flowers, and crop production, although not necessarily
>the production of cut flowers.  In the second assignment, the indexer
>explicitly wishes to state that the subject of document 2 is (cut flowers +
>crop production) i.e. cut flower production.

Yes. The syntax varies, but in general your example 1 would be treated
as two occurrences of the "subject" metadata field, whereas your example
2 would be a single occurrence.

>How does the searcher then distinguish between these two statements?
>I'm guessing that under traditional search systems, a boolean search string
>such as 'cut flowers AND crop production' will not be able to distinguish
>between the two statements (because it's implemented via some sort of
>sub-string comparison), and will return both documents, is that correct?

Yes, though if the searcher knows the syntax (perhaps unlikely), he/she
will be able to search for a complete string in example 2 rather than
for a combination of two substrings.
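
A minimal sketch of the difference, assuming a repeatable "subject"
field held as a list of strings and a naive substring match (both are
assumptions for illustration, not a description of any real system):

```python
# Hypothetical records: document number -> occurrences of the "subject" field.
docs = {
    1: ["cut flowers", "crop production"],   # two occurrences
    2: ["cut flowers + crop production"],    # one pre-coordinated occurrence
}


def substring_and(terms):
    """Boolean AND where each term may match as a substring of any occurrence."""
    return sorted(
        doc for doc, subjects in docs.items()
        if all(any(term in s for s in subjects) for term in terms)
    )


def complete_string(string):
    """Match only a whole occurrence of the subject field."""
    return sorted(doc for doc, subjects in docs.items() if string in subjects)


print(substring_and(["cut flowers", "crop production"]))   # both documents
print(complete_string("cut flowers + crop production"))    # document 2 only
```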

The main advantage of pre-coordination is that it allows a listing of
documents to be produced in a logical and helpful order for browsing,
giving a view of related topics which would otherwise be separated: a
classified catalogue, in fact. In that case documents might be listed
under the headings:

cut flowers -- crop production
cut flowers -- marketing
cut flowers -- arrangement

and so on. If, as in this case, the "logical" sequence is chronological
(production, then marketing, then arrangement), alphabetical order is not
adequate and a symbolic notation is required to maintain the order, which
is why classification schemes use notations.
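
To illustrate why a notation is needed, here is a small sketch. The
single-digit notation is invented for the example and does not come
from any real classification scheme.

```python
# Hypothetical notation: digits chosen so that filing order follows the
# sequence production -> marketing -> arrangement.
NOTATION = {"crop production": "1", "marketing": "2", "arrangement": "3"}

headings = [
    ("cut flowers", "arrangement"),
    ("cut flowers", "crop production"),
    ("cut flowers", "marketing"),
]

# Filing by the words themselves puts "arrangement" first.
alphabetical = sorted(f"{crop} -- {process}" for crop, process in headings)

# Filing by the notation of the second concept restores the intended order.
by_notation = [
    f"{crop} -- {process}"
    for crop, process in sorted(headings, key=lambda h: (h[0], NOTATION[h[1]]))
]

print(alphabetical)
print(by_notation)
```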

It is a matter of judgement in developing the scheme whether it is
better to cite the crop first, as in this case, or whether to cite the
process first, giving something like

crop production -- cut flowers
crop production -- fruit
crop production -- pot plants
marketing -- cut flowers
marketing -- fruit

and so on.
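
The effect of the choice of citation order on the resulting listing can
be sketched as follows. The (crop, process) pairs and the function name
are assumptions made for the example.

```python
# Hypothetical compound subjects, each held as a (crop, process) pair.
subjects = [
    ("cut flowers", "crop production"),
    ("fruit", "crop production"),
    ("pot plants", "crop production"),
    ("cut flowers", "marketing"),
    ("fruit", "marketing"),
]


def headings(pairs, crop_first):
    """Build heading strings in the chosen citation order and file them."""
    strings = [
        f"{crop} -- {process}" if crop_first else f"{process} -- {crop}"
        for crop, process in pairs
    ]
    return sorted(strings)


print("\n".join(headings(subjects, crop_first=True)))
print()
print("\n".join(headings(subjects, crop_first=False)))
```

Citing the crop first brings everything about one crop together and
scatters the processes; citing the process first does the reverse.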

>Is this something like the problem of 'false hits' that you mentioned
>previously, Leonard?  If not, can you describe the problem of 'false hits' that
>you mentioned?

>And finally, am I right to assume that under post-coordinate indexing, the
>indexer does not have the ability to make the kind of distinction described
>above?

>Or is the problem of 'false hits' that if you have an indexing assignment e.g.
>...
>
>doc | subject
>---------------------------------------------------------------
>3   | calcimycin + standards, aspirin + administration & dosage
>
>... then a searcher querying for 'calcimycin AND administration & dosage'
>meaning to find documents about the administration and dosage of
>calcimycin, would erroneously receive document 3 in the result set?

Yes, except that "false hits" most often occur with post-coordinate
systems where there are no links between terms, i.e. where your example
above has to be entered as

3       calcimycin
        standards
        aspirin
        administration & dosage

If some pre-coordination is used, as in your example 3, then there are
only two indexing strings and a knowledgeable searcher would be able to
avoid false hits by searching on complete strings rather than on
fragments.
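
A small sketch of the two cases, assuming each index is simply a set of
strings per document (the data structures and function names are
illustrative assumptions only):

```python
# Document 3 indexed post-coordinately: four unlinked descriptors.
post_index = {3: {"calcimycin", "standards", "aspirin", "administration & dosage"}}

# Document 3 with some pre-coordination: two indexing strings.
pre_index = {3: {"calcimycin + standards", "aspirin + administration & dosage"}}


def and_search(index, terms):
    """Boolean AND over independently assigned descriptors."""
    return sorted(doc for doc, descs in index.items() if set(terms) <= descs)


def string_search(index, string):
    """Search on a complete indexing string rather than on fragments."""
    return sorted(doc for doc, strings in index.items() if string in strings)


# False hit: document 3 is not about the dosage of calcimycin.
print(and_search(post_index, ["calcimycin", "administration & dosage"]))

# No false hit when the complete string is required.
print(string_search(pre_index, "calcimycin + administration & dosage"))
print(string_search(pre_index, "aspirin + administration & dosage"))
```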

Leonard

-- 
Willpower Information       (Partners: Dr Leonard D Will, Sheena E Will)
Information Management Consultants              Tel: +44 (0)20 8372 0092
27 Calshot Way, Enfield, Middlesex EN2 7BQ, UK. Fax: +44 (0)870 051 7276
L.Will@Willpowerinfo.co.uk               Sheena.Will@Willpowerinfo.co.uk
---------------- <URL:http://www.willpowerinfo.co.uk/> -----------------
Received on Thursday, 20 October 2005 16:15:17 UTC