- From: Jonathan Rowell <bigrat18@hotmail.com>
- Date: Tue, 26 Aug 2003 17:29:36 +0200
- To: mike@indexdata.com
- Cc: www-zig@w3.org
>From: Mike Taylor <mike@indexdata.com>
>To: bigrat18@hotmail.com
>Subject: Re: Attribute Architecture -- new type?
>Date: Tue, 26 Aug 2003 14:18:54 +0100
>
> > Date: Tue, 26 Aug 2003 14:24:26 +0200
> > From: "Jonathan Rowell" <bigrat18@hotmail.com>
> >
> >> The _sole_ purpose of the anyWords/allWords attributes is so that
> >> the client can remain in this state of blissful ignorance -- so it
> >> can say to the server, "Here's a bunch of words, pick them apart
> >> exactly as you would do if they were part of a record contributing
> >> to the index I'm searching". That's a valuable thing to be able to
> >> do: it means that the client can submit "child's book-case" without
> >> knowing or caring what the server will do with it, beyond that it
> >> will Do The Right Thing.
> >
> > whatever "The Right Thing" might be.
>
>Right -- which is something only the server can know, since the
>definition of The Right Thing for parsing query terms is "whatever
>rules the server used when parsing the indexed record into terms".
>
> > This is precisely the problem of interoperability. Without escaping
> > into regular expressions to cover "childs book-case" "child's
> > bookcase" or "childs bookcase" (Let alone practical problems like
> > Heinrich Boell, Heinrich Böll and even Heinrich Boll) because the
> > people who data entry these things don't get them right.
>
>That's OK -- none of that matters (which is kind of the point). The
>deal is that when you feed a record containing the phrase "child's
>book-case" into the server, it breaks it down into terms for
>indexing. It might use:
>    2 terms: "child's", "book-case"
>    3 terms: "child's", "book", "case"
>    3 terms: "child", "s", "book-case"
>    4 terms: "child", "s", "book", "case"
>or something else again. The point is that only the server knows this
>detail of its own implementation, and only it needs to. Because when
>I submit a search-term "child's book-case" with the allWords
>attribute, it will parse my search term _in the same way_ as the
>record -- whatever that way was. So I'll find the record, which is
>what we want.

But it won't split "bookcase". Is it "car wash" or "car-wash" or
"carwash"? Your counter-examples contain characters (like apostrophe
and hyphen) which one can use to split the "phrase" into "words".
Which character is used is probably irrelevant: "child-s book'case"?
Or am I to expect that those characters are to match as well?

When it comes down to it: regular expressions!

> > And then comes character set problems. I got a record out of the
> > Danish National catalog where the apostrophe in "horses' hooves"
> > was a Unicode backward modifing comma on the latin letter small
> > s. Typographically correct of course. but you match it!
>
>Well, that's just plain broken. I don't think this is a problem with
>the specifications being value, I think this is an implementation bug.

Maybe. But this is the sort of thing which we come up against daily.

> > It would be *nice* if the semantics were specified to an extent that
> > one could be reasonably certain of getting an intelligent reponse.
>
>That's what allWords/anyWords -- correctly understood -- give you.

Providing you've established what a word is.

This is the scenario I'm currently confronted with: I ask a MAB Z3950
server to search on Author Heinrich Böll (I mean Heinrich B[ö|oe]ll, of
course). The first problem I'm confronted with is what other MAB
categories they've thrown into the "Author-Index", so I often get back
stuff where he wrote the introduction or made the tea. So I must study
the Bib.Profile for the library - not a layman's issue. Then I ask for
all words - what does that mean? How can I be sure that what comes back
is correct? Or worse still - how can I be sure that I have found
everything which is to be found?
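
To make the disagreement concrete, here is a minimal sketch in Python. It is not taken from any Z39.50 server: the four tokenizers and the helper all_words_match are hypothetical, written only to mirror the four splitting schemes listed in the quoted message. Record and query are parsed by the same tokenizer, and then every query term must occur in the record.

```python
import re

# Four hypothetical server-side tokenizers, one per splitting scheme
# listed in the quoted message. None is taken from a real server.
TOKENIZERS = {
    "2 terms": lambda text: text.lower().split(),
    "3 terms (split on hyphen)": lambda text: re.findall(r"[^\s-]+", text.lower()),
    "3 terms (split on apostrophe)": lambda text: re.findall(r"[^\s']+", text.lower()),
    "4 terms": lambda text: re.findall(r"[^\s'-]+", text.lower()),
}


def all_words_match(tokenize, record, query):
    """Return True if every query term occurs among the record's terms,
    with both sides parsed by the same tokenizer."""
    record_terms = set(tokenize(record))
    return all(term in record_terms for term in tokenize(query))


RECORD = "child's book-case"

for name, tokenize in TOKENIZERS.items():
    same = all_words_match(tokenize, RECORD, "child's book-case")
    variant = all_words_match(tokenize, RECORD, "childs bookcase")
    print(f"{name}: same spelling -> {same}, 'childs bookcase' -> {variant}")
```

Under these assumptions every scheme finds the record when the query is spelled exactly like the record, which is the point made in the quoted message, and none of them finds it for "childs bookcase", which is the case I am worried about.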

Jonathan
Received on Tuesday, 26 August 2003 11:29:41 UTC