RE: JW24a (i18n sort ordering) - Unicode 3.0 from Babich, Alan on 2000-05-04 (www-webdav-dasl@w3.org from April to June 2000)

From: Babich, Alan <ABabich@filenet.com>
Date: Thu, 4 May 2000 11:39:14 -0700
To: www-webdav-dasl@w3.org
Message-ID: <C3AF5E329E21D2119C4C00805F6FF58F0398E915@hq-expo2.filenet.com>

Reference them in the DASL spec.
There is really no other choice.

Alan Babich

-----Original Message-----
From: Jim Whitehead [mailto:ejw@ics.uci.edu]
Sent: Wednesday, May 03, 2000 4:56 PM
To: infonuovo@email.com; www-webdav-dasl@w3.org; duerst@w3.org
Subject: RE: JW24a (i18n sort ordering) - Unicode 3.0


Hmm, just had a brief brainstorm. Since it's extremely unlikely that the DASL
WG is the first group ever to encounter this issue, I decided to do a little
more sleuthing on the Web.

Invoking the awesome powers of Google, I ran across a reference to ISO/IEC
14651, which apparently addresses internationalized sort orderings.  The
discussion is in: http://www.stri.is/TC304/EOR/report.html.

This led me to the ISO/IEC working group developing 14651:
http://anubis.dkuug.dk/JTC1/SC22/WG20/
Officially known as "JTC1/SC22/WG20 - Internationalization" (say that 10
times fast :-)

This page has a link to the current working draft, which looks like it might
be exactly what we need (well, except for the fact that it doesn't appear to
be approved yet). Quoting from the Introduction:

This International Standard defines:

- A reference comparison method applicable to two character strings in
order to determine their respective order in a sorted list. The method can
be applied on strings exploiting the full repertoire of ISO/IEC 10646-1.
This method is also applicable to subsets of that repertoire, such as, for
example, those of the different ISO/IEC 8-bit standard character sets, or
any other character set, standardized or private, to produce ordering
results valid (after tailoring) for a given set of languages for each
script. This method uses transformation tables derived either from the
Common Template Table defined in this International Standard or from one of
its tailorings.

- A reference format, using the Backus-Naur Form (BNF) to describe the
Common Template Table used normatively in this International Standard.

- A specific Common Template Table used by the reference comparison method.
This table describes a basic order for all characters encoded in the first
edition of ISO/IEC 10646-1 up to Amendment 7. It allows for a further
specification of a fully deterministic ordering. The table is a starting
point for enabling the specification of an international string ordering
adapted to different cultures, without requiring an implementor to have a
knowledge of all the different scripts already encoded in the UCS.

NOTE 1: This Common Template Table may be modified with a minimum of effort
to suit the needs of a local environment. The main benefit, worldwide, is
that for other scripts, no modification should be required and that the
order will remain as consistent as possible and predictable from an
international point of view.

NOTE 2: The character repertoire described in this International Standard is
equivalent to that of the Unicode Standard Version 2.1.
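
As a rough illustration of the table-driven approach the quoted Introduction
describes (a template table, plus a tailoring for a particular language), here
is a minimal Python sketch. It is not the ISO/IEC 14651 algorithm or its
Common Template Table: the weights, the three-level key layout, and the names
TEMPLATE, tailor, sort_key and SWEDISH are all invented for this example. The
only real-world fact it relies on is that Swedish sorts "ä" after "z".

```python
# Toy stand-in for a template table: each character maps to invented
# (primary, secondary, tertiary) weights. A real table covers far more.
TEMPLATE = {}
for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz", start=1):
    TEMPLATE[ch] = (i, 0, 0)          # base letter
    TEMPLATE[ch.upper()] = (i, 0, 1)  # case differs only at the third level
TEMPLATE["\u00e4"] = (1, 1, 0)        # a-umlaut: an "a" with an accent by default
TEMPLATE["\u00c4"] = (1, 1, 1)        # A-umlaut


def tailor(table, overrides):
    """Return a copy of the table with language-specific weights applied."""
    merged = dict(table)
    merged.update(overrides)
    return merged


def sort_key(text, table):
    """Build a key that compares all primary weights first, then all
    secondary weights, then all tertiary weights."""
    # Assumption: characters missing from the table sort after known ones.
    weights = [table.get(ch, (10_000 + ord(ch), 0, 0)) for ch in text]
    return tuple(tuple(w[level] for w in weights) for level in range(3))


# Hypothetical Swedish tailoring: a-umlaut becomes a letter after "z".
SWEDISH = tailor(TEMPLATE, {"\u00e4": (27, 0, 0), "\u00c4": (27, 0, 1)})

words = ["zon", "\u00e4ra", "arm", "Arm"]
default_order = sorted(words, key=lambda w: sort_key(w, TEMPLATE))
swedish_order = sorted(words, key=lambda w: sort_key(w, SWEDISH))
print(default_order)
print(swedish_order)
```

With the untailored table the a-umlaut word sorts among the "a" words; with
the tailoring it moves to the end, while every other entry keeps its place.
That is the property NOTE 1 points at: a local environment changes a few
entries and leaves the rest of the ordering alone.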

I took a quick look through this document, and it has the nice quality that
it is intended to be normative, and deals with all of ISO 10646.  Since it
isn't approved yet, we probably shouldn't make following it a MUST
requirement, but it'll go a long way just to point out that this resource is
available.

In my Web searches, I also ran across:
Unicode Collation Algorithm, Unicode TR#10
http://www.unicode.org/unicode/reports/tr10/

This TR has the nice quality that it has been "Approved", though it is not
considered to be part of Unicode 3.0.  According to the Unicode site
http://www.unicode.org/unicode/reports/:

"APPROVED: A technical report that is approved, but not considered part of
the Unicode Standard, Version 3.0, must be separately referenced if it is
cited. Approved technical reports can be normative. This means that
implementations can claim conformance to them. At the current time, the
specifications in approved technical reports are provided as information and
guidance to implementers of the Unicode Standard, but do not form part of
the Standard itself. The Unicode Technical Committee may decide to
incorporate all or part of the material of such technical reports into a
future version of the Unicode Standard, either as informative or as
normative specification."

So, it appears we have two resources to draw upon for sort ordering.  One
question that still remains is whether we should normatively reference
either, or just make them recommended reading when implementors start
running into these problems.  Thoughts?

- Jim
Received on Thursday, 4 May 2000 14:42:11 UTC