- From: Leonard Will <L.Will@willpowerinfo.co.uk>
- Date: Thu, 20 Oct 2005 17:14:40 +0100
- To: public-esw-thes@w3.org
In message
<677CE4DD24B12C4B9FA138534E29FB1D64C773@exchange11.fed.cclrc.ac.uk> on
Wed, 19 Oct 2005, "Miles, AJ (Alistair)" <A.J.Miles@rl.ac.uk> wrote

>Could you explain how the indexing/search systems work under the two
>scenarios (pre- and post- coordinate indexing)? You mentioned an
>'indexing string' in another email, I'm assuming that this is a string of
>descriptors, composed by the indexer, and then entered into a database
>field? What do indexing strings look like under the two scenarios (i.e. what
>can and can't you write)? What do the search strings look like under the
>two scenarios (i.e. what can and can't you write), and how is the search
>operation usually implemented?

Alistair

I see that Aida has got there first by responding to your queries. I
agree with what she says, so shall not repeat her points here but just
add a few comments.

>I'm a bit confused about a couple of things ...
>
>Firstly, a thesaurus directive such as:
>
>cut flower production USE cut flowers + crop production
>
>... is that for the searcher or for the indexer?

For both. Most thesauri, in the narrower sense of that word, are
designed primarily for post-coordinate indexing, i.e. they do not
normally have rules for combining terms into strings to denote compound
concepts. (If they do, they are adding elements of classification or
subject heading systems such as citation order, as discussed by Aida.)

An entry in a thesaurus saying

    cut flower production USE cut flowers + crop production

is therefore interpreted by an indexer as meaning "assign the two
descriptors 'cut flowers' and 'crop production' to a document which
deals with 'cut flower production'." These descriptors are normally
assigned independently without showing any relationship between them.

A searcher will interpret this entry in the thesaurus to mean that a
search for the specific subject of 'cut flower production' should be
formulated as the Boolean statement ('cut flowers' AND 'crop
production').
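
As a rough illustration of the two readings of that entry, here is a minimal Python sketch. It assumes a toy in-memory index; the names `USE`, `index_terms`, `search` and the document identifiers are invented for the example and do not describe any particular retrieval system.

```python
# Hypothetical thesaurus fragment: a non-preferred compound term mapped to
# the descriptors that should be used in its place.
USE = {"cut flower production": ["cut flowers", "crop production"]}


def index_terms(subject):
    """Indexer's reading: return the independent descriptors to assign."""
    return set(USE.get(subject, [subject]))


def search(index, subject):
    """Searcher's reading: Boolean AND over the descriptors for the subject."""
    wanted = index_terms(subject)
    return sorted(doc for doc, descriptors in index.items()
                  if wanted <= descriptors)


# Toy post-coordinate index: each document carries an unordered set of
# descriptors, with no links recorded between them.
index = {
    "doc1": index_terms("cut flower production"),
    "doc2": {"cut flowers"},
    "doc3": {"crop production", "marketing"},
}

hits = search(index, "cut flower production")
```

Here `hits` contains only `doc1`. Because the descriptors are stored independently, the same AND would equally match any document that happens to carry both descriptors for unrelated reasons.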

>Is there a fundamental difference between thesauri intended for pre-
>coordinate use, and thesauri intended for post-coordinate use?

Inasmuch as they are both lists of uniquely labelled concepts and
relationships between these concepts, no. The main difference is that a
pre-coordinate system needs to have rules for creating the
pre-coordinated strings, and these may be quite complex, depending on
the nature and role of each concept in the string in relation to the
other concepts in the string. The system may enumerate many of these
compounds, to ensure that different users build them in the same way, or
it may rely on users interpreting the rules consistently. It may specify
that some concepts can be used only as "subheadings", i.e. that they
cannot be the first-cited concept in a string.

As I said above, once you start pre-coordinating concepts you are moving
from a simple thesaurus into the field of classification.

>Secondly, I'm *guessing* that under pre-coordinate indexing, an indexer
>could make the following two types of indexing assignment (inventing my
>own syntax):
>
>doc | subject
>----------------------------------
>1   | cut flowers, crop production
>2   | cut flowers + crop production
>
>In the first assignment, the indexer wishes to state that the subjects of
>document 1 are cut flowers, and crop production, although not necessarily
>the production of cut flowers. In the second assignment, the indexer
>explicitly wishes to state that the subject of document 2 is (cut flowers +
>crop production) i.e. cut flower production.

Yes. The syntax varies, but in general your example 1 would be treated
as two occurrences of the "subject" metadata field, whereas your example
2 would be a single occurrence.

>How does the searcher then distinguish between these two statements?
>I'm guessing that under traditional search systems, a boolean search string
>such as 'cut flowers AND crop production' will not be able to distinguish
>between the two statements (because it's implemented via some sort of
>sub-string comparison), and will return both documents, is that correct?

Yes, though if the searcher knows the syntax (perhaps unlikely), he/she
will be able to search for a complete string in example 2 rather than
for a combination of two substrings.

The main advantage of pre-coordination is that it allows a listing of
documents to be produced in a logical and helpful order for browsing,
giving a view of related topics which would otherwise be separated: a
classified catalogue, in fact. In that case documents might be listed
under the headings:

    cut flowers -- crop production
    cut flowers -- marketing
    cut flowers -- arrangement

and so on. If, as in this case, the "logical" sequence is in order of
time sequence, then alphabetical order is not adequate and a symbolic
notation is required to maintain the order, which is why classification
schemes use notations.

It is a matter of judgement in developing the scheme whether it is
better to cite the crop first, as in this case, or whether to cite the
process first, giving something like

    crop production -- cut flowers
    crop production -- fruit
    crop production -- pot plants
    marketing -- cut flowers
    marketing -- fruit

and so on.

> Is this something like the problem of 'false hits' that you mentioned
>previously Leonard? If not, can you describe the problem of 'false hits' that
>you mentioned?

>And finally, am I right to assume that under post-coordinate indexing, the
>indexer does not have the ability to make the kind of distinction described
>above?

>Or is the problem of 'false hits' that if you have an indexing assignment e.g.
>...
>
>doc | subject
>---------------------------------------------------------------
>3   | calcimycin + standards, aspirin + administration & dosage
>
>...
>then a searcher querying for 'calcimycin AND administration & dosage'
>meaning to find documents about the administration and dosage of
>calcimycin, would erroneously receive document 3 in the result set?

Yes, except that "false hits" most often occur with post-coordinate
systems where there are no links between terms, i.e. where your example
above has to be entered as

    3    calcimycin
         standards
         aspirin
         administration & dosage

If some pre-coordination is used, as in your example 3, then there are
only two indexing strings and a knowledgeable searcher would be able to
avoid false hits by searching on complete strings rather than on
fragments.

Leonard

-- 
Willpower Information       (Partners: Dr Leonard D Will, Sheena E Will)
Information Management Consultants              Tel: +44 (0)20 8372 0092
27 Calshot Way, Enfield, Middlesex EN2 7BQ, UK. Fax: +44 (0)870 051 7276
L.Will@Willpowerinfo.co.uk               Sheena.Will@Willpowerinfo.co.uk
---------------- <URL:http://www.willpowerinfo.co.uk/> -----------------
Received on Thursday, 20 October 2005 16:15:17 UTC