Re: DASL Spec from Jim Davis on 1999-10-26 (www-webdav-dasl@w3.org from October to December 1999)

From: Jim Davis <jrd3@alum.mit.edu>
Date: Tue, 26 Oct 1999 16:05:26 +0200
To: Niket Patwardhan <niket@verity.com>, www-webdav-dasl@w3.org
Message-Id: <4.1.19991026131057.00ae4ab0@pop.xs4all.nl>

At 09:37 AM 9/16/99 -0700, Niket Patwardhan wrote:
>I have been reading the spec, and the discussion since. Here are a few
>things that need to be taken care of:-

Thanks for your comments. Sorry to be so slow to address them.

>1) I think the spec should say something about the language (like English,
>French, German, - SQL, VerityQL) of the query, at least for servers that
>support content-based retrieval(CBR?). The only thing I can find is 5.6.1
>and 6. 

Are you asking for further editorial material (better explanations,
examples) or additional functionality in the protocol itself?

If the latter, are you asking for this for queries against property values
or contents (5.13)?
I would suppose you mean contents, since you cited 5.6.1 yourself.  In this
case, it would be very interesting if you made a concrete proposal for what
you think the protocol should do.  

If you do make such a proposal it would be good to address two points:

1)  What happens if the client specifies a language, and the server does
not know the language of the content document?  Does the query fail?

2) It would also be useful to know if any existing implementation is
actually able to use this information, and how.  For example, if I have a
document in Microsoft Word, I can certainly tag words with a language, so
that Word could (in theory) distinguish French "chat" from English "chat".
But if I then index this document with, e.g., Verity, is the information
preserved?  As you probably know, the IETF requires there be working
implementations of all draft protocols, so there's no use defining a
behavior that no one has actually implemented.

It would also be helpful to get some guidance on the following question,
which has plagued us sorely:

How can we meaningfully compare strings in two very closely related
languages, for example en-US and en-GB (or whatever one calls English
in the British Isles)?  While a "lift" in the UK may have a different
meaning than a "lift" in the USA, it still seems bad to fail the query.
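To make the question concrete, here is one possible policy, purely as a
sketch and not anything the spec defines: treat two language tags as
compatible when their primary subtags agree, and treat an unknown
language as compatible with everything so the query never fails on that
account.  The function names are hypothetical, and I am assuming tags
shaped like "en-US" or "EN_us".

```python
def primary_subtag(tag):
    """Return the lowercased primary language subtag, e.g. 'en' for 'en-US'."""
    # Accept both "en-US" and "EN_us" spellings of a tag.
    return tag.replace("_", "-").split("-")[0].lower()


def languages_compatible(query_lang, doc_lang):
    """Hypothetical policy: compatible when the primary subtags agree.

    A missing (None) language on either side counts as compatible, so the
    query degrades gracefully instead of failing.
    """
    if query_lang is None or doc_lang is None:
        return True
    return primary_subtag(query_lang) == primary_subtag(doc_lang)
```

Under this policy a query tagged en-US would still match a document
tagged en-GB, at the cost of ignoring regional differences in meaning
such as "lift".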

> I think that in 5.13 we
>should at least require that if they invoke the "CONTAINS" operator, they
>must have specified the natural language the client is using in his query.

So if no language is specified, the query *fails*?  What is the advantage
of this?

>Much of the useful value add of a text based search comes from
>things like:-
>
>Case In/sensitivity
>Phonemic In/sensitivity
>Stemming
>Thesaurii
>Part of speech analysis

I fully agree that much value comes from these features.  Let me ask you
though, if DASL 1.0 did not have them, would it be useless?  We
deliberately did not define them because:

1)  No one on the design team had the expertise to create a specification
that was simple, clear, and interoperable (remember it has to work the
same way on two distinct implementations).

2) DASL must be easy to implement.

We knew that the very top of the line systems (such as Verity) supported
such features, but we reasoned that it was better to leave room for
expansion and/or vendor-specific extensions.  We think features like
stemming and thesauri are better treated in this way.

>2) To prevent buffer overflows don't you want to say "The server MUST honor
>the limit" in either 5.14 or 5.15?

How would this help?  For those clients with fixed-size buffers, the amount
of data in even a single record is difficult to predict (a property value
could be arbitrarily large).  As far as I know, all reasonable clients
are able to handle arbitrarily large replies.  And if they can't, they can
always close the connection.

If we say the server MUST honor the limit, it just raises the barrier to
implementation.
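As a rough illustration of what I mean by the client protecting itself,
here is a minimal sketch.  It assumes the reply has already been split
into records arriving as byte strings; the function name and both caps
are made up for the example.

```python
def take_within_budget(records, max_records, max_bytes):
    """Collect records until either cap would be exceeded, then stop reading.

    `records` is any iterable of byte strings (hypothetical parsed reply).
    """
    kept = []
    used = 0
    for record in records:
        size = len(record)
        if len(kept) >= max_records or used + size > max_bytes:
            # A real client would close the connection at this point.
            break
        kept.append(record)
        used += size
    return kept
```

The byte cap matters as much as the record cap, since a single record
can be arbitrarily large no matter what limit the server honors.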

>3) 5.16 is too limiting in that it allows only two values for case
>sensitivity (0 or 1). In real life, things are much more complicated - one
>could choose to ignore accents also for example.

It would be excellent if someone with expertise (such as yourself)
could provide a list of all the sensible values.  Or perhaps you can tell
us whether the definition, as it is, is worse than no definition at all.
None of the editors have the expertise to decide clearly.
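To show the kind of thing a longer list of values might cover, here is a
small sketch with case and accents as two independent switches.  This is
only an illustration, not a proposal for 5.16; the function and
parameter names are invented, and it uses Unicode normalization from the
Python standard library.

```python
import unicodedata


def strip_accents(text):
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def comparison_key(text, ignore_case=False, ignore_accents=False):
    """Build a key so that equal keys mean 'equal under these options'."""
    if ignore_accents:
        text = strip_accents(text)
    if ignore_case:
        text = text.casefold()
    # Normalize so composed and decomposed spellings compare equal.
    return unicodedata.normalize("NFC", text)


def strings_match(a, b, ignore_case=False, ignore_accents=False):
    """Compare two strings under the chosen sensitivity options."""
    return (comparison_key(a, ignore_case, ignore_accents)
            == comparison_key(b, ignore_case, ignore_accents))
```

Even two switches already give four behaviors, which a single 0/1 value
cannot express.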

>4) The note in 5.17 should refer to query results, not queries, since that
>is what the score is associated with. 

Thanks, I'll make this change in the next version.

>Also, I hope there is some indication
>somewhere about where the result came from, so that scores can be compared
>if it is valid to do so.

I don't understand.  The client knows where it sent the query to.  What
other information did you have in mind?

best regards

Jim Davis
jrd3@alum.mit.edu
http://users.lanminds.com/contact/jdavis
Received on Tuesday, 26 October 1999 12:08:26 UTC