
Re: Attribute Architecture -- new type?

From: Jonathan Rowell <bigrat18@hotmail.com>
Date: Tue, 26 Aug 2003 17:29:36 +0200
To: mike@indexdata.com
Cc: www-zig@w3.org
Message-ID: <Law12-F89wG8nlsjH1d00001a4a@hotmail.com>

>From: Mike Taylor <mike@indexdata.com>
>To: bigrat18@hotmail.com
>Subject: Re: Attribute Architecture -- new type?
>Date: Tue, 26 Aug 2003 14:18:54 +0100
>
> > Date: Tue, 26 Aug 2003 14:24:26 +0200
> > From: "Jonathan Rowell" <bigrat18@hotmail.com>
> >
> >> The _sole_ purpose of the anyWords/allWords attributes is so that
> >> the client can remain in this state of blissful ignorance -- so it
> >> can say to the server, "Here's a bunch of words, pick them apart
> >> exactly as you would do if they were part of a record contributing
> >> to the index I'm searching".  That's a valuable thing to be able to
> >> do: it means that the client can submit "child's book-case" without
> >> knowing or caring what the server will do with it, beyond that it
> >> will Do The Right Thing.
> >
> > whatever "The Right Thing" might be.
>
>Right -- which is something only the server can know, since the
>definition of The Right Thing for parsing query terms is "whatever
>rules the server used when parsing the indexed record into terms".
>
> > This is precisely the problem of interoperability. Without escaping
> > into regular expressions, how do you cover "childs book-case", "child's
> > bookcase" or "childs bookcase" (let alone practical problems like
> > Heinrich Boell, Heinrich Böll and even Heinrich Boll)? The people
> > who enter this data don't get it right.
>
>That's OK -- none of that matters (which is kind of the point).  The
>deal is that when you feed a record containing the phrase "child's
>book-case" into the server, it breaks it down into terms for
>indexing.  It might use:
>	   2 terms: "child's", "book-case"
>	   3 terms: "child's", "book", "case"
>	   3 terms: "child", "s", "book-case"
>	   4 terms: "child", "s", "book", "case"
>or something else again.  The point is that only the server knows this
>detail of its own implementation, and only it needs to.  Because when
>I submit a search-term "child's book-case" with the allWords
>attribute, it will parse my search term _in the same way_ as the
>record -- whatever that way was.  So I'll find the record, which is
>what we want.
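
The symmetry described above can be sketched roughly as follows. This is only an illustration, not how any particular Z39.50 server works: the two tokenizers, the `Index` class and the `all_words` method are hypothetical names standing in for a server's private indexing rules.

```python
import re


def split_on_whitespace(text):
    """Hypothetical rule 1: terms are whitespace-separated, punctuation kept."""
    return text.lower().split()


def split_on_non_letters(text):
    """Hypothetical rule 2: anything that is not a-z separates terms."""
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]


class Index:
    """Toy inverted index that tokenizes records and queries identically."""

    def __init__(self, tokenize):
        self.tokenize = tokenize
        self.postings = {}  # term -> set of record ids

    def add(self, record_id, text):
        for term in self.tokenize(text):
            self.postings.setdefault(term, set()).add(record_id)

    def all_words(self, query):
        # The query goes through the same tokenizer as the records did,
        # so the client never needs to know which rule is in use.
        terms = self.tokenize(query)
        if not terms:
            return set()
        return set.intersection(*(self.postings.get(t, set()) for t in terms))


if __name__ == "__main__":
    for rule in (split_on_whitespace, split_on_non_letters):
        index = Index(rule)
        index.add(1, "A child's book-case")
        print(rule.__name__, index.all_words("child's book-case"),
              index.all_words("childs bookcase"))
```

Under either rule the query "child's book-case" finds record 1, because query and record are split the same way. Neither rule finds it for "childs bookcase", which is the case the reply below is concerned with.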

But it won't split "bookcase". Is it "car wash" or "car-wash" or "carwash"?
Your counter-examples contain characters (like the apostrophe and hyphen)
which one can use to split the "phrase" into "words". Which character is
used is probably irrelevant: "child-s book'case"? Or am I to expect that
those characters must match as well?

When it comes down to it: regular expressions!
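
As a rough sketch of what that would mean on the client side, the hand-written patterns below cover the spelling variants mentioned in this thread. The pattern names are made up for illustration, and the choice of which separators are optional is an assumption about which variants matter; nothing here is part of Z39.50.

```python
import re

# "car wash", "car-wash" or "carwash": the separator is optional.
CAR_WASH = re.compile(r"\bcar[\s-]?wash\b", re.IGNORECASE)

# "child's book-case" and friends: optional apostrophe, optional separator.
BOOKCASE = re.compile(r"\bchild'?s\s+book[\s-]?case\b", re.IGNORECASE)

# Böll, Boell or Boll. Assumes the o-umlaut arrives as a single precomposed
# character; a decomposed form (o plus combining diaeresis) would need
# Unicode normalization first.
BOELL = re.compile(r"\bB(?:ö|oe|o)ll\b")


def matches(pattern, text):
    """Return True if the pattern occurs anywhere in the text."""
    return pattern.search(text) is not None
```

Note that `BOOKCASE` as written does not match "child-s book'case": whether those characters should be interchangeable is exactly the kind of decision each pattern author has to make separately.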

> > And then come the character set problems. I got a record out of the
> > Danish national catalogue where the apostrophe in "horses' hooves"
> > was a Unicode backward modifying comma on the Latin small letter
> > s. Typographically correct, of course, but you try to match it!
>
>Well, that's just plain broken.  I don't think this is a problem with
>the specifications being vague, I think this is an implementation bug.

Maybe. But this is the sort of thing which we come up against daily.

> > It would be *nice* if the semantics were specified to an extent that
> > one could be reasonably certain of getting an intelligent response.
>
>That's what allWords/anyWords -- correctly understood -- give you.

Providing you've established what a word is.

This is the scenario I'm currently confronted with: I ask a MAB Z39.50 server
to search on Author Heinrich Böll (I mean Heinrich B(ö|oe)ll, of course). The
first problem I'm confronted with is what other MAB categories
they've thrown into the "Author-Index", so I often get back stuff where he
wrote the introduction or made the tea. So I must study the Bib.Profile for
the library - not a layman's issue. Then I ask for all words - what does that
mean? How can I be sure that what comes back is correct? Or worse still - how
can I be sure that I have found everything which is to be found?

Jonathan

Received on Tuesday, 26 August 2003 11:29:41 GMT
