Re: [IndexedDB] Closing on bug 9903 (collations) from Aryeh Gregor on 2011-05-06 (public-webapps@w3.org from April to June 2011)

From: Aryeh Gregor <Simetrical+w3c@gmail.com>
Date: Fri, 6 May 2011 13:05:25 -0400
To: Jonas Sicking <jonas@sicking.cc>
Cc: Keean Schupke <keean@fry-it.com>, Pablo Castro <Pablo.Castro@microsoft.com>, "public-webapps@w3.org" <public-webapps@w3.org>
Message-ID: <BANLkTi=VW8e8yj4Zsdb_mLTZaEntfi+OLQ@mail.gmail.com>
On Thu, May 5, 2011 at 10:00 PM, Jonas Sicking <jonas@sicking.cc> wrote:
> We have already decided that we don't want to take on the complexity
> that comes with supporting changing collations on existing data. In
> particular it becomes very unclear what to do with data that is no
> longer unique under the new collation.

This is only an issue for unique indexes.  In MySQL, if you alter a
table such that a uniqueness constraint is violated, it will abort
with an error as soon as it detects the problem, not changing the
table.  But if you're using a non-binary collation function, you
rarely want a unique index anyway.
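
To make the uniqueness problem concrete, here is a minimal sketch in Python. It is not MySQL or IndexedDB behavior; `find_collisions` is a hypothetical helper, and `str.casefold` is only a crude stand-in for a real non-binary collation. It shows keys that are distinct under binary comparison colliding once the collation ignores case.

```python
def find_collisions(keys, sort_key):
    """Group keys that compare equal under the given collation sort key."""
    groups = {}
    for key in keys:
        groups.setdefault(sort_key(key), []).append(key)
    # Any group with more than one member would violate a unique index.
    return [group for group in groups.values() if len(group) > 1]


keys = ["Smith", "smith", "Jones"]

# Binary collation: compare the raw encoded bytes.
binary_collisions = find_collisions(keys, lambda s: s.encode("utf-8"))

# Stand-in for a case-insensitive collation (assumption: casefold, not UCA).
caseless_collisions = find_collisions(keys, str.casefold)

print(binary_collisions)
print(caseless_collisions)
```

An implementation that allowed changing the collation of a unique index would have to run a check like this and refuse the change when any group has more than one key.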

Still, I don't think changing the collation of existing data is needed
for a first implementation of collations.  It's something to support
at some future date.

> I think ultimately we simply seem to disagree here. I think that
> supporting a standard set of collations is going to solve more than
> 80% of the use cases (which is a good rule of thumb for these things)
> for version 1 as well as is easier on users and so something we'll
> ultimately will want to add anyway. Thus adding it now won't be
> painting us in a corner and it solves the majority of use cases.
>
> If I understand you correctly you don't think that it solves the
> majority of use cases and you think that it adds API which is bad and
> that we should never add.
>
> Is this a correct assessment?

For my part, I agree that supporting a high-quality, comprehensive,
standard set of collations, such as UCA with CLDR tailoring, is going
to solve much more than 80% of the use-cases.  However,

1) Versioning is a possible issue if we want full interop, since CLDR
changes often.  If browsers can't update the collation of existing
indexes, they'll be forced to either stick to one version of CLDR
forever, or carry around multiple CLDR version implementations to
account for both old and new indexes.  Moreover, if browsers do ever
update their CLDR version, we'll have different collations going by
the same name in different browsers.  One way to work around this is
to specify for a first pass that browsers must implement some specific
CLDR version, like the latest at the time the standard is published,
and then just not update it for some indefinite period.

2) If there's going to be collation support in any version, it should
be full-fledged UCA, not anything less.  Better to push off collation
support entirely to a future version than to have some simplified or
undefined collation support that will have to be maintained forever.
So if possible, support for all CLDR locales would be great; failing
that, support for just untailored UCA; failing that, binary collation
only.  Much better to allow binary collation only than to not define
the collation behavior.

3) Allowing users to specify a collation function is not needed in a
first or second draft, but could be a useful feature for the future,
so it would be worthwhile to at least keep that in mind when defining
the API.  As long as the API could be later extended to support custom
functions without too much trouble, that should be enough for now IMO.
I'm sure there are more important things to worry about.

(Custom collation functions can be useful for things other than
natural language.  For instance,
http://en.wikipedia.org/wiki/Special:LinkSearch lets you search
external links on Wikipedia by prefix.  It supports searching for
things like "*wikipedia.org", which will actually match a domain of
^.*wikipedia.org$ with any path.  This works by having an extra field
in the externallinks table containing the URL with domain names
reversed, like http://org.wikipedia.en./wiki/ instead of
http://en.wikipedia.org/wiki/, and this extra field is then indexed.
This is a waste of space, since we store the URLs twice.  In
PostgreSQL we could instead define an index based on a function
without having to create an extra column.  But as this example
illustrates, it's not essential functionality -- you can always add a
redundant column.)
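
The reversed-domain trick can be sketched as follows. This is only an illustration in Python, not MediaWiki's actual code: `reversed_host_key` and `search_by_domain_suffix` are hypothetical names, the key drops any query string, and the search matches a domain and its subdomains only, which is narrower than the pattern described above.

```python
import bisect
from urllib.parse import urlsplit


def reversed_host_key(url):
    """Build an index key with the host labels reversed."""
    parts = urlsplit(url)
    labels = parts.hostname.split(".")
    reversed_host = ".".join(reversed(labels)) + "."
    return f"{parts.scheme}://{reversed_host}{parts.path}"


def search_by_domain_suffix(sorted_keys, scheme, domain):
    """Find keys for a domain and its subdomains via a prefix scan."""
    prefix = f"{scheme}://" + ".".join(reversed(domain.split("."))) + "."
    start = bisect.bisect_left(sorted_keys, prefix)
    results = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # keys are sorted, so no later key can match
        results.append(key)
    return results


urls = [
    "http://en.wikipedia.org/wiki/",
    "http://example.com/",
    "http://de.wikipedia.org/wiki/Foo",
]
index_keys = sorted(reversed_host_key(u) for u in urls)
matches = search_by_domain_suffix(index_keys, "http", "wikipedia.org")
print(matches)
```

With a function-based index or a custom sort key, `reversed_host_key` would be applied when the index is built, and the reversed URL would not need to be stored as a second column.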

On Fri, May 6, 2011 at 5:18 AM, Jonas Sicking <jonas@sicking.cc> wrote:
> Based on that, my conclusion is that we should go with what Pablo is
> proposing. And I think we should do it for v1.

If I understand correctly, Pablo's proposal is that the author be
allowed to specify a locale, and the browser can collate in some
undefined way based on that locale.  That sounds like a really bad
idea for interop.  If non-binary collation is supported in a first
version, it should be either

1) Two choices, binary or UCA 6.0.0.  (AFAIK, UCA gives fairly good
results for most languages even without tailoring, so it might be just
fine for v1.  It's vastly better than binary, for sure.)

2) In addition to binary and UCA 6.0.0, allow UCA 6.0.0 tailored by
any of the locales defined by CLDR 1.9.1.
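
As a rough illustration of why binary collation alone is a poor fit for natural-language text, consider this Python sketch. It does not implement UCA; `str.casefold` is used purely as a simplified stand-in, and the last line shows where such a stand-in falls short, which is the reason to require the full algorithm rather than something less.

```python
words = ["apple", "Zebra", "banana"]

# Binary (code point) order: every uppercase ASCII letter sorts
# before every lowercase one, so "Zebra" comes first.
binary_order = sorted(words)

# Simplified case-insensitive order (assumption: casefold, NOT UCA).
caseless_order = sorted(words, key=str.casefold)

# The stand-in still sorts the accented word after "zebra", because
# U+00E9 is a higher code point than "z". A real UCA implementation
# would be expected to place it among the other "e" words.
accented_order = sorted(["zebra", "éclair"], key=str.casefold)

print(binary_order)
print(caseless_order)
print(accented_order)
```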

There also needs to be some thought put into how to handle version
updates, since browsers cannot update their UCA or CLDR implementation
without rebuilding all existing indexes that used it (unless they keep
the old implementation forever).  It might be that browsers should
just stick to a fixed version for the time being (like 6.0.0 and
1.9.1), and we might decide that no further APIs are needed now to
accommodate possible future switches, but at least some thought needs
to be given to it.
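
One way to think about the versioning problem is to store the exact collation version alongside each index, so an implementation can tell when an index was built under rules it no longer ships. The Python sketch below is only a thought experiment: `CollationId`, `Index`, `needs_rebuild` and `rebuild` are hypothetical names, not part of any proposed API, and the "6.1.0" version string is just an example of a later release.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CollationId:
    name: str          # e.g. "uca" or a CLDR locale name
    uca_version: str   # e.g. "6.0.0"
    cldr_version: str  # e.g. "1.9.1"


@dataclass
class Index:
    collation: CollationId
    entries: list      # (sort_key, value) pairs, sorted by sort_key


def needs_rebuild(index, current):
    """An index is stale if it was built under a different collation version."""
    return index.collation != current


def rebuild(index, current, sort_key):
    """Recompute every sort key under the current collation and re-sort."""
    values = [value for _, value in index.entries]
    return Index(current, sorted((sort_key(v), v) for v in values))


old = CollationId("uca", "6.0.0", "1.9.1")
new = CollationId("uca", "6.1.0", "1.9.1")   # hypothetical browser update
index = Index(old, [(b"a", "a"), (b"b", "b")])

if needs_rebuild(index, new):
    # Assumption: encoded bytes stand in for the new version's sort keys.
    index = rebuild(index, new, lambda s: s.encode("utf-8"))
```

Pinning everything to one fixed version, as suggested above, amounts to guaranteeing that `needs_rebuild` never returns true.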

On consideration, I don't think user-specified sortkey functions are
necessary at this stage.  If collations are to be identified by
strings for now, we could always overload the value to accept a
function at some later date if we wanted to support that.  So I
wouldn't worry about that further.
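
To show that a string-valued parameter leaves room for functions later, here is a small Python sketch. The names `NAMED_COLLATIONS` and `resolve_sort_key` are hypothetical, and only a "binary" collation is defined; the point is that the same argument can be dispatched on its type without changing the call shape.

```python
from typing import Callable, Union

# A collation is either a well-known name or a user-supplied sort key function.
Collation = Union[str, Callable[[str], bytes]]

# Assumption: only "binary" is defined here, as encoded-byte order.
NAMED_COLLATIONS = {
    "binary": lambda s: s.encode("utf-8"),
}


def resolve_sort_key(collation: Collation) -> Callable[[str], bytes]:
    """Accept a collation name today and a function at some later date."""
    if callable(collation):
        return collation
    try:
        return NAMED_COLLATIONS[collation]
    except KeyError:
        raise ValueError(f"unknown collation: {collation!r}") from None


by_name = resolve_sort_key("binary")
by_function = resolve_sort_key(lambda s: s.casefold().encode("utf-8"))

print(by_name("Abc"))
print(by_function("Abc"))
```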

I would very much worry about not defining the exact collation
algorithms to be used.
Received on Friday, 6 May 2011 17:06:14 UTC