Order of null values in sorting

This is item 10 in Jim's issues list.

(1) Just thinking, off the top of your head, you
would think you have two possibilities to choose
from -- put nulls first in the collation, or put
nulls last. That is because you can't think of any 
logical reason to put null's between, say, strings 
that start with "A", and strings that start with "B", 
or between any other two consecutive characters. 
(You are obviously forced to pick to consecutive 
characters, because of the ambiguity of where
strings starting with an intervening character
would go in the ordering.) Anyway, at least I can't.

So, you have to boiled it down to two possibilities
right off the bat.

Now, consider the following two strings:

   AA
   AAA

This is the order you expect to see the strings
in a phone book -- the acronym for Alcoholics Anonymous 
comes before the acronym for American Automobile Association, 
right? Well, what is the third character of the first
string? It is null, of course. So you can view the
comparison as the comparison between two strings
of length three. That implies null comes before "A".
Similarly, "B" comes before "BB", etc., etc.

So, now you have thought of an excellent intuitive and 
logical reason to decide between putting nulls first or 
last. You have logically concluded that nulls should be 
first.

(2) Guess what? That is the way the SQL standard does it.
In other words, nulls collate before any actual
value for all datatypes. (For strings, nulls come
before zero length strings too.) This simple and well
accepted practice completely specifies the situation.

Now you are certain you have a winner. You not only
derived the answer from first principles, you agree 
with the most widely accepted practice and the most
widely accepted relevant standard.

(3) Making the collating sequence definite is important.
For one thing, many people are used to SQL, and they
expect it. For another, consider DASL middleware (i.e.,
software that performs queries across two or more
disjoint collections). The middleware will ask each
collection to order the results, and perform a
multi-list merge on their answer sets to form the
final answer set. If the middleware can't depend
upon nulls being first in all the answer sets, it
can't do a merge -- It has to do a doggone sort. This
has potentially serious performance implications.
Merging ordered lists requires no secondary
storage and is accomplished quickly -- you just scan 
each list sequentially in parallel, compare the
current element in each list, and output
the smallest of the current elements and advance
the current pointer for that list. Sorting requires 
N * log2(N) time, and, usually, temporary storage as 
well. Furthermore, writing a sort routine is a needless
barrier of entry to writing the middleware in the 
first place.

(4) Furthermore, there is no downside to adopting the SQL 
standard approach. 

(5) Putting nulls first is simple: It is independent of any 
internationalization issues. 

(6) Putting nulls first is simple: It is independent of
data type.

(7) Putting nulls first is easy to implement. If you
use SQL in your server, you get it free.

(8) I am not aware of any competing standards for putting
nulls other than first. I would be surprised if there
were any. (Of course, I admit I don't have infinite 
knowledge of all standards.)

Therefore, in view of (1)-(8), I propose we put nulls first 
in DASL sortby result orderings.

Alan Babich

Received on Friday, 17 July 1998 19:18:26 UTC