RE: Order By from Kevin Wiggen on 2000-06-10 (www-webdav-dasl@w3.org from April to June 2000)

From: Kevin Wiggen <wiggs@xythos.com>
Date: Fri, 09 Jun 2000 22:24:00 -0700
To: "Babich, Alan" <ABabich@filenet.com>, "'Tapas Nayak'" <tapasnay@Exchange.Microsoft.com>, www-webdav-dasl@w3.org
Message-id: <ONEOJMKKAIDAGPLOPJEDEEHECEAA.wiggs@xythos.com>
RE: Order By
I would like to take a quick step back and state my original question again
with some more info.  If I do the following dasl to a my webdav server:

<?xml version=\"1.0\"?>
<d:searchrequest xmlns:d="DAV:" >
    <d:basicsearch>
          <d:select>
                <d:prop><d:getcontentlength/><d:displayname/></d:prop>
           </d:select>
           <d:from>
                 <d:scope>
                     <d:href>http://www.sharemation.com/~kwiggen</d:href>
                     <d:depth> infinity</d:depth>
                  </d:scope>
            </d:from>
            <d:orderby>
                  <d:order><d:prop><d:getcontentlength/></d:prop></d:order>
            </d:orderby>
       </d:basicsearch>
</d:searchrequest>

I get (sorted correctly in my mind):

  <?xml version="1.0" encoding="utf-8" ?>
   <D:multistatus xmlns:D="DAV:">  <D:response>
   <D:href>http://www.sharemation.com/~kwiggen/winzip.log</D:href>
       <D:propstat> <D:prop>
            <D:getcontentlength><![CDATA[ 241 ]]> </D:getcontentlength>
            <D:displayname> <![CDATA[ winzip.log]]></D:displayname>
  </D:prop>
  <D:status>HTTP/1.1 200 OK</D:status>
  </D:propstat></D:response>
<D:response>
 <D:href>http://www.sharemation.com/~kwiggen/</D:href>
<D:propstat><D:prop>
    <D:displayname><![CDATA[ ~kwiggen]]></D:displayname>
  </D:prop>
  <D:status>HTTP/1.1 200 OK</D:status>
  </D:propstat>
   <D:propstat><D:prop>
         <D:getcontentlength />
  </D:prop>
  <D:status>HTTP/1.1 404 Not Found</D:status>
  </D:propstat>
  </D:response>
  </D:multistatus>


Notice that there IS a distinction between null values and 0 length values
in Webdav (for which this DASL Server was written).  In this case, the
property content length does NOT exist on a directory thus it is Not Found.
Since I am sorting by content length, I would expect to see this value
sorted LAST as I probably don't care about it.

In Webdav there is no way of setting a property value to NULL!  Either the
property exists on the object (in which case I would say the value is a
empty string that will sort first) or it doesn't exist on the resource in
which case I say its a NULL and sorts last.  Since DASL returns a Webdav
Propfind Response, we are giving the client the difference between null and
a 0 length string, and thus can sort correctly.

If a dead property exists on a resource, then someone had to set it, thus
the value is a least an empty string as it will now return in the 200 OK
section of a request.  Thus the value is not null and sorts first.  My
question revolves around the resources where that property does not exist
(Null in my mind), and thus they are separated out into a separate 404
section.  These I think should sort last, as they are not found.

Thus I would like to argue that nulls (when a property doesn't exist on a
resource) sort last.

Of course there is also the problem where there is no way to state the data
type of a dead property in Webdav, thus sorting might be wrong anyway.  But
that is another email :)



Kevin


  -----Original Message-----
  From: www-webdav-dasl-request@w3.org
[mailto:www-webdav-dasl-request@w3.org]On Behalf Of Babich, Alan
  Sent: Friday, June 09, 2000 4:42 PM
  To: 'Tapas Nayak'; www-webdav-dasl@w3.org
  Subject: RE: Order By


  Collaters do not interpret values (or lack of a value, i.e., null). They
just collate.
  Nor should collaters do more than that.

  The end user, not collaters, will interpret what null means. As you say,
there you can
  split the semantic hair about what null means in multiple ways.

  For example, at my company we sometimes use null as a data compaction
  technique. When there is no value for the status, the status is "OK".
  Other times, it means there is simply no value, and it doesn't mean
  anything special.

  There are a large number of other things null could mean in a particular
application.
  You have presented a couple of examples below

  Neither SQL or DASL know what a null value means to an application.
  The application itself knows what null means, however, unless it is a
generic
  query application, because that knowledge is hard coded into the
application.
  It seems that current DASL implementations are of the hard coded type,
  rather than being generic query applications. A non-generic query
  application can make whatever hard coded transformation it wants when
  displaying results to the user.

  I strongly believe we should keep DASL conceptually simple on release 1,
  and keep its implementation simple as well. Let's not guild the lily by
providing for
  specification to repositories of how null sorts on release 1. Why burden
release
  1 implementations with the extra baggage that most implementations
probably don't
  care about?

  The simpler DASL is, the more likely it is to get implemented, and the
more likely
  it is to be implemented correctly. Our approach to features is the usual:
Make
  feature cuts until it's not useful. Is DASL useful without allowing client
applications
  to tell DASL whether  nulls sort first or last on a particular query? I
think so.

  Once DASL has seen a certain amount of trial use, it will become more
clear
  what people with actual applications want. If we determine that telling
DASL whether
  nulls sort first or last is desirable after adequate trial use, that
feature could be added
  later. I can think of (to me) conceptually more important features that
are not going
  to be in the first release (e.g., query schema discovery, cross repository
searching,
  searching structured properties), but, in my opinion, DASL is still very
useful without
  these features.

  If DASL succeeds, I'm certain it will be enhanced significantly
  in the future, the same as any other software product or specification. If
we overload
  DASL with features for the first release, we are taking risks to it's ever
succeeding
  at all that we shouldn't be taking.

  Alan Babich

  BTW: An approach that may be of use to some repositories is to add a real
or
  computed property that can be used to put nulls where you want them in
  the collating sequence, by using multiple properties in the sort by
clause.
  For example, suppose a date property is "Closed_Date", and null means its
not
  closed, and you wanted open documents (i.e., not closed) to sort last.
  Add a real or computed integer property, "open", that
  has value 1 for open documents (i.e., null Closed_Date) and 0 for closed
  documents. Then "sort by open, Closed_Date" instead of just
  "sort by Closed_Date" will put open documents last, even though they
  have no value for Closed_Date. If expressions were allowed as the
  sort by elements, you could do it with expressions in the query sort by
  clause in the query by computing 1 or 0 depending on whether
  Closed_Date was null. Adding expressions as sort by elements would
  be a more general feature that provides more than a narrow feature that
  just tells repositories to put nulls first or last in the collating
sequence.
    -----Original Message-----
    From: Tapas Nayak [mailto:tapasnay@Exchange.Microsoft.com]
    Sent: Friday, June 09, 2000 2:00 PM
    To: Babich, Alan; www-webdav-dasl@w3.org
    Subject: RE: Order By


    The merging problem you have discussed is very fundamental. At the same
time restricting one way or the other
    also may not be right considering the fact that NULLs may be interpreted
in different ways. For example, consider a date field.
    If the date field is, say Close_Date (the date in which a specification
is closed), there a NULL means the spec is not closed yet and then
    if my sorting by close-date ascending puts NULLs at the front it may be
counter-intuitive. For same date field, one can have a completely
    different argument when a NULL would mean it exists but unfortunately
the value is available. That will be the case
    if the date field is Opened_Date. Thus I think we need option 3 (leave
to implementer) but considering the fact
    that collation is really key issue here there should be a way by which a
client can specify the intent. so that the same field
    is returned the same way by all providers. Interpretation of NULL value
for the same field when fetched from multiple
    repositories will have to be the same to the collator anyway.

    Regards
    Tapas
      -----Original Message-----
      From: Babich, Alan [mailto:ABabich@filenet.com]
      Sent: Friday, June 09, 2000 1:11 PM
      To: Tapas Nayak; www-webdav-dasl@w3.org
      Subject: RE: Order By


      Thanks for reminding me about merging results across multiple
repositories. That's a very key issue.

      On the first release of DASL, the DASL spec. does not specify enough
to accomplish merging
      the results across multiple repositories. You want to make searching
across multiple
      repositories transparent, so you want one metadata space across all
the repositories.
      This leaves only two choices for merging the metadata of the multiple
repositories involved:
      (1) intersect the metadata of all the repositories, or
      (2) take the set union of the metadata of all the repositories.
      The client would have to specify which behavior he wanted.
      DMA has already addressed this issue in excruciating detail, and great
success has
      resulted.  (Trust me, there was a lot of detail involved in specifying
it for DMA.)
      In DMA, you can query seamlessly across multiple different
repositories that have
      different metadata. Users absolutely love this.

      My strong opinion is that, IF THERE EVER IS A DASL, THEN SOMEDAY DASL
WILL QUERY ACROSS
      MULTIPLE REPOSITORIES. I feel that is inevitable. (The caps. are just
for visual emphasis in
      as ASCII-only e-mail world, not "shouting".)

      Based on my personal strong belief about that, I believe option (3)
(i.e., let the implementation
      decide) is not a viable choice. You would have to invent a way to
advertise which way nulls
      are sorted, and, probably, which way the client wants them sorted.
That's bad enough by itself.
      But when you consider the performance impact (e.g., if one of the
repositories
      involved in the query sorts nulls first, and one of the repositories
sorts nulls last),
      then you see that you have to wait for all the results from the one of
the repositories or the other
      (depending upon whether the merged stream of results sorts nulls first
or last) before you can
      return the FIRST result to the end user -- you can NOT return sorted
results to the end user
      incrementally. That absolutely kills the idea of alternative (3). What
you must do to serve
      the end users adequately is to merge the results incrementally as they
come in from the
      repositories, and present the results to the end user as soon as
possible. This requires that all
      the repositories sort nulls the same way.

      Therefore, only alternative (1) and (2) are viable in a world where
cross repository searches are
      performed.

      Alternative (2) is illogical, since it violates the principle of least
surprise. Putting the zero length
      strings at the front and the nulls at the end would confuse even me
for a at least a moment.

      This only leaves us with alternative (1), which is what's currently in
the DASL spec.

      Alan Babich

       -----Original Message-----
      From: Tapas Nayak [mailto:tapasnay@Exchange.Microsoft.com]
      Sent: Friday, June 09, 2000 12:56 AM
      To: Babich, Alan; Kevin Wiggen; www-webdav-dasl@w3.org
      Subject: RE: Order By


        It may be restrictive to prescribe one way or the other by
        interpreting nulls with one interpretation. NULLs have
        different interpretations in different semantic contexts.
        While for strings a NULL may be considered as a NULL string, in
        many cases NULL represent values not known or dont care. For an
integer
        field, say age, it may be incorrect to consider NULL as 0, for
example.
        It is this lack of a unique concrete interpretation of NULL in all
situations
        that left its interpretation to the implementation in SQL spec, I
think.
        And probably that is the appropriate thing to do here too. Thus
option 3 seems to be
        the appropriate thing. I understand, however, that they will pose a
problem in collation
        of results from more than one data sources, if the participating
data sources chose
        different implementations for the same property. That would indicate
possibly that
        even though it is the same property, different providers have taken
slightly different
        semantics for it. In that case either collator will have to do some
reconciliation
        of these different interpretations or enforce one semantics over the
other.

        Regards
        Tapas
        -----Original Message-----
        From: Babich, Alan [mailto:ABabich@filenet.com]
        Sent: Thursday, June 08, 2000 6:43 PM
        To: 'Kevin Wiggen'; www-webdav-dasl@w3.org
        Subject: RE: Order By



        Does anybody else out there have an opinion?

        The choices are:

        (1) Nulls sort first
        (2) Nulls sort last
        (3) Nulls sort first or last at the implementation's choice

        These three choices are compatible with the SQL spec., and
        make logical sense. Only choice (1) is compatible with DMA.

        Consider how zero length strings sort:
        "A" sorts before "AA", because it is a shorter string.
        (This is not only specified in the SQL spec.,
        it is the way things are listed in the phone book.)
        Therefore, the zero length string, "", sorts before "A"
        and before all other strings with length greater than zero.

        A null string is similar to a zero length string since
        neither contains any characters, so it seems odd to me
        to put null strings and zero length strings at opposite ends
        of the sort. I would think doing that could confuse some end users
        who are not sophisticated in the subtleties of zero length
        strings versus null strings. It also forces the implementation
        to split the semantic hair of whether zero length strings are
        null or not. I would prefer not to risk confusing users or
        implementers. In other words I prefer choice (1), and I
        claim that choices (2) and (3) have some risk of confusing
        some end users and implementers because of the similarity of
        null strings and zero length strings.

        (BTW, I agree with you, Kevin, that postgresql seems to be
        violating the SQL spec. Nulls must always sort first or last,
        so DESC must put nulls at the opposite end. In postgresql,
        as you have shown, DESC doesn't do that.)

        Alan

        -----Original Message-----
        From: Kevin Wiggen [mailto:wiggs@xythos.com]
        Sent: Thursday, June 08, 2000 4:42 PM
        To: Kevin Wiggen; Babich, Alan; Jim Davis; www-webdav-dasl@w3.org
        Subject: RE: Order By




        OK I will clarify (not that anyone cares)

        Oracle

        order by column1 => nulls sort last
        order by column1 DESC => nulls sort first

        Postgresql

        order by column1 => nulls sort last
        order by column1 DESC => nulls sort last

        I could argue that either one makes sense to me (although postgresql
seems
        to be violating the spec)....

        Kevin



        -----Original Message-----
        From: www-webdav-dasl-request@w3.org
        [mailto:www-webdav-dasl-request@w3.org]On Behalf Of Kevin Wiggen
        Sent: Thursday, June 08, 2000 4:03 PM
        To: Babich, Alan; Jim Davis; www-webdav-dasl@w3.org
        Subject: RE: Order By





        Well Oracle and Postgresql (the only db's I have around, I called a
friend
        to get Informix but he's not around), both sort the nulls last.
This makes
        sense to me, as if it doesn't exist, then you probably don't care
about it,
        thus it sorts last.

        So I guess I object to the spec.  I would like to see it changed so
that
        nulls sort last.

        Kevin

        -----Original Message-----
        From: Babich, Alan [mailto:ABabich@filenet.com]
        Sent: Wednesday, June 07, 2000 8:50 PM
        To: 'Kevin Wiggen'; Jim Davis; www-webdav-dasl@w3.org
        Subject: RE: Order By



        I just looked it up. I thought the SQL spec. required nulls to sort
first.
        Here's what the version of the SQL spec. that I have actually says:

        ANSI X3.135-1992 (Database Language SQL) in section 13.1, General
Rule 3)b),
        page 309, says "Whether a sort key value that is null is considered
greater
        or
        less than a non-null value is implementation-defined, but all sort
key
        values that
        are null shall either be considered greater than all non-null values
or be
        considered
        less than all non-null values."

        Last time I used Oracle (which was a while ago) it sorted nulls
first.
        To me, logically, null is even less significant than zero, so it
seems right
        to me
        that nulls should come first. I seem to remember that DMA query
requires
        nulls to sort first.

        I personally think it's better for everybody to do things the same
way
        (that's why we have specs.,
        isn't it?).

        We made it definite in DASL, and made nulls sort first.
        Any significant objections, or are there just comments but not a
true
        objection?



        Alan Babich

        -----Original Message-----
        From: Kevin Wiggen [mailto:wiggs@xythos.com]
        Sent: Wednesday, June 07, 2000 3:26 PM
        To: Jim Davis; www-webdav-dasl@w3.org
        Subject: RE: Order By




        After further review :)

        The fact that nulls sort first????, if the value is null it is
probably
        because the property isn't defined and therefore you could care
less, but
        these are the ones sorted first???  The spec acts as though this is
common
        in ANSI standard SQL, but I have only seen the opposite from every
DB I have
        ever worked with (which maybe means none of them follow the
standard)

        Kevin



        -----Original Message-----
        From: Jim Davis [mailto:jrd3@alum.mit.edu]
        Sent: Wednesday, June 07, 2000 7:47 AM
        To: Kevin Wiggen; www-webdav-dasl@w3.org
        Subject: Re: Order By



        At 10:16 PM 6/6/00 -0700, Kevin Wiggen wrote:
        >
        >
        >Does adding "Order By Content Length" (or any other file specific
property)
        >automatically add "AND Resource Type = File" to the Where clause of
a Dasl
        >Query?
        >
        >Or in English, does adding a file specific property to the orderby
clause
        in
        >a dasl query implicitly make all resources that are returned files?
or do
        >the directories and null resources simply sort last?

        The latter.   Nowhere does  the spec say that anything is added
        (implicitly) to the where clause, and I would never agree to that.
But it
        does say  "In the context of the DAV:orderby element, null values
are
        considered to collate before any actual (i.e., non null) value,
including
        strings of zero length".  (5.6 in the most recent draft)

        Is this a problem either technically or editorially?

        regards

        Jim
Received on Saturday, 10 June 2000 01:28:03 UTC