RE: JW24a (i18n sort ordering) - Unicode 3.0 from Martin J. Duerst on 2000-05-09 (www-webdav-dasl@w3.org from April to June 2000)

From: Martin J. Duerst <duerst@w3.org>
Date: Tue, 09 May 2000 20:15:02 +0900
To: Jim Whitehead <ejw@ics.uci.edu>, infonuovo@email.com, www-webdav-dasl@w3.org
Message-Id: <4.2.0.58.J.20000509201157.02fe5770@sh.w3.mag.keio.ac.jp>
At 00/05/03 16:55 -0700, Jim Whitehead wrote:
>Hmm, just had a brief brainstorm. Since it's extremely unlikely that the DASL
>WG members are the first people ever to encounter this issue, I decided to do
>a little more sleuthing on the Web.

Hello Jim,

You found the right resources. One thing I can also tell you
is that both of them (14651 and UTR#10) are supposed to be
identical. They were developed in close coordination.

For the future (still maybe a year or two away), the plans
I heard about are that updates will extend the repertoire
to Unicode 3.0 at least, and that there are proposals to
change the definition format to be based on XML.

Regards,   Martin.


>Invoking the awesome powers of Google, I ran across a reference to ISO/IEC
>14651, which apparently addresses internationalized sort orderings.  The
>discussion is in: http://www.stri.is/TC304/EOR/report.html.
>
>This led me to the ISO/IEC working group developing 14651:
>http://anubis.dkuug.dk/JTC1/SC22/WG20/
>Officially known as "JTC1/SC22/WG20 - Internationalization" (say that 10
>times fast :-)
>
>This page has a link to the current working draft, which looks like it might
>be exactly what we need (well, except for the fact that it doesn't appear to
>be approved yet). Quoting from the Introduction:
>
>This International Standard defines:
>
>- A reference comparison method applicable to two characters strings in
>order to determine their respective order in a sorted list. The method can
>be applied on strings exploiting the full repertoire of ISO/IEC 10646-1.
>This method is also applicable to subsets of that repertoire, such as, for
>example, those of the different ISO/IEC 8-bit standard character sets, or
>any other character set, standardized or private, to produce ordering
>results valid (after tailoring) for a given set of languages for each
>script. This method uses transformation tables derived either from the
>Common Template Table defined in this International Standard or from one of
>its tailorings.
>
>- A reference format, using the Backus-Naur Form (BNF) to describe the
>Common Template Table used normatively in this International Standard.
>
>- A specific Common Template Table used by the reference comparison method.
>This table describes a basic order for all characters encoded in the first
>edition of ISO/IEC 10646-1 up to Amendment 7. It allows for a further
>specification of a fully deterministic ordering. The table is a starting
>point for enabling the specification of an international string ordering
>adapted to different cultures, without requiring an implementor to have a
>knowledge of all the different scripts already encoded in the UCS.
>
>NOTE 1: This Common Template Table may be modified with a minimum of effort
>to suit the needs of a local environment. The main benefit, worldwide, is
>that for other scripts, no modification should be required and that the
>order will remain as consistent as possible and predictable from an
>international point of view.
>
>NOTE 2: The character repertoire described in this International Standard is
>equivalent to that of the Unicode Standard Version 2.1.
>
>I took a quick look through this document, and it has the nice quality that
>it is intended to be normative, and deals with all of ISO 10646.  Since it
>isn't approved yet, we probably shouldn't make following it a MUST
>requirement, but it'll go a long way just to point out that this resource is
>available.
>
>In my Web searches, I also ran across:
>Unicode Collation Algorithm, Unicode TR#10
>http://www.unicode.org/unicode/reports/tr10/
>
>This TR has the nice quality that it has been "Approved", though it is not
>considered to be part of Unicode 3.0.  According to the Unicode site
>http://www.unicode.org/unicode/reports/:
>
>"APPROVED: A technical report that is approved, but not considered part of
>the Unicode Standard, Version 3.0, must be separately referenced if it is
>cited. Approved technical reports can be normative. This means that
>implementations can claim conformance to them. At the current time, the
>specifications in approved technical reports are provided as information and
>guidance to implementers of the Unicode Standard, but do not form part of
>the Standard itself. The Unicode Technical Committee may decide to
>incorporate all or part of the material of such technical reports into a
>future version of the Unicode Standard, either as informative or as
>normative specification. "
>
>So, it appears we have two resources to draw upon for sort ordering.  One
>question that still remains is whether we should normatively reference
>either, or just make them recommended reading when implementors start
>running into these problems.  Thoughts?
>
>- Jim
Received on Tuesday, 9 May 2000 07:42:38 UTC