- From: Aryeh Gregor <Simetrical+w3c@gmail.com>
- Date: Fri, 6 May 2011 13:05:25 -0400
- To: Jonas Sicking <jonas@sicking.cc>
- Cc: Keean Schupke <keean@fry-it.com>, Pablo Castro <Pablo.Castro@microsoft.com>, "public-webapps@w3.org" <public-webapps@w3.org>
On Thu, May 5, 2011 at 10:00 PM, Jonas Sicking <jonas@sicking.cc> wrote:
> We have already decided that we don't want to take on the complexity
> that comes with supporting changing collations on existing data. In
> particular it becomes very unclear what to do with data that is no
> longer unique under the new collation.

This is only an issue for unique indexes. In MySQL, if you alter a table such that a uniqueness constraint is violated, it will abort with an error as soon as it detects the problem, not changing the table. But if you're using a non-binary collation function, you rarely want a unique index anyway.

Still, I don't think this is needed for a first implementation of collations. It's something to support at some future date.

> I think ultimately we simply seem to disagree here. I think that
> supporting a standard set of collations is going to solve more than
> 80% of the use cases (which is a good rule of thumb for these things)
> for version 1 as well as is easier on users and so something we'll
> ultimately will want to add anyway. Thus adding it now won't be
> painting us in a corner and it solves the majority of use cases.
>
> If I understand you correctly you don't think that it solves the
> majority of use cases and you think that it adds API which is bad and
> that we should never add.
>
> Is this a correct assessment?

For my part, I agree that supporting a high-quality, comprehensive, standard set of collations, such as UCA with CLDR tailoring, is going to solve much more than 80% of the use-cases. However,

1) Versioning is a possible issue if we want full interop, since CLDR changes often. If browsers can't update the collation of existing indexes, they'll be forced to either stick to one version of CLDR forever, or carry around multiple CLDR version implementations to account for both old and new indexes. Moreover, if browsers do ever update their CLDR version, we'll have different collations going by the same name in different browsers.
One way to work around this is to specify for a first pass that browsers must implement some specific CLDR version, like the latest at the time the standard is published, and then just not update it for some indefinite period.

2) If there's going to be collation support in any version, it should be full-fledged UCA, not anything less. Better to push off collation support entirely to a future version than to have some simplified or undefined collation support that will have to be maintained forever. So if possible, support for all CLDR locales would be great; failing that, support for just untailored UCA; failing that, binary collation only. Much better to allow binary collation only than to not define the collation behavior.

3) Allowing users to specify a collation function is not needed in a first or second draft, but could be a useful feature for the future, so it would be worthwhile to at least keep that in mind when defining the API. As long as the API could be later extended to support custom functions without too much trouble, that should be enough for now IMO. I'm sure there are more important things to worry about.

(Custom collation functions can be useful for things other than natural language. For instance, http://en.wikipedia.org/wiki/Special:LinkSearch lets you search external links on Wikipedia by prefix. It supports searching for things like "*wikipedia.org", which will actually match a domain of ^.*wikipedia.org$ with any path. This works by having an extra field in the externallinks table containing the URL with domain names reversed, like http://org.wikipedia.en./wiki/ instead of http://en.wikipedia.org/wiki/, and this extra field is then indexed. This is a waste of space, since we store the URLs twice. In PostgreSQL we could instead define an index based on a function without having to create an extra column. But as this example illustrates, it's not essential functionality -- you can always add a redundant column.)
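
To make the reversed-domain trick concrete, here is a rough Python sketch of the redundant-key approach. It is not MediaWiki's actual code: the function names and the in-memory sorted list standing in for an index are made up for illustration, and the search is limited to whole domain labels (the domain itself plus its subdomains), which is narrower than the pattern described above.

```python
from bisect import bisect_left
from urllib.parse import urlsplit, urlunsplit


def reversed_host_key(url):
    """Return the URL with its host labels reversed and a trailing dot.

    Assumes an absolute URL with a host. Any port or userinfo is dropped,
    and the host is lowercased (behaviour of urlsplit().hostname).
    """
    parts = urlsplit(url)
    labels = parts.hostname.split(".")
    reversed_host = ".".join(reversed(labels)) + "."
    return urlunsplit(
        (parts.scheme, reversed_host, parts.path, parts.query, parts.fragment)
    )


def build_index(urls):
    """Hypothetical stand-in for an index: (key, url) pairs sorted by key."""
    return sorted((reversed_host_key(u), u) for u in urls)


def search_domain(index, scheme, domain):
    """Return URLs whose host is `domain` or one of its subdomains."""
    prefix = scheme + "://" + ".".join(reversed(domain.split("."))) + "."
    # A 1-tuple sorts before every (key, url) pair whose key equals prefix,
    # so bisect_left lands on the first candidate key >= prefix.
    i = bisect_left(index, (prefix,))
    matches = []
    while i < len(index) and index[i][0].startswith(prefix):
        matches.append(index[i][1])
        i += 1
    return matches


urls = [
    "http://en.wikipedia.org/wiki/",
    "http://example.com/",
    "http://wikipedia.org/",
    "http://wikipediafoo.org/x",
    "http://de.wikipedia.org/wiki/Haus",
]
index = build_index(urls)
print(reversed_host_key("http://en.wikipedia.org/wiki/"))
print(search_domain(index, "http", "wikipedia.org"))
```

The point is only that the "collation" here is a key-derivation function: with a function-based index the derived key would not need to be stored as a second column.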

On Fri, May 6, 2011 at 5:18 AM, Jonas Sicking <jonas@sicking.cc> wrote:
> Based on that, my conclusion is that we should go with what Pablo is
> proposing. And I think we should do it for v1.

If I understand correctly, Pablo's proposal is that the author be allowed to specify a locale, and the browser can collate in some undefined way based on that locale. That sounds like a really bad idea for interop. If non-binary collation is supported in a first version, it should be either

1) Two choices, binary or UCA 6.0.0. (AFAIK, UCA gives fairly good results for most languages even without tailoring, so it might be just fine for v1. It's vastly better than binary, for sure.)

2) In addition to binary and UCA 6.0.0, allow UCA 6.0.0 tailored by any of the locales defined by CLDR 1.9.1.

There also needs to be some thought put into how to handle version updates, since browsers cannot update their UCA or CLDR implementation without rebuilding all existing indexes that used it (unless they keep the old implementation forever). It might be that browsers should just stick to a fixed version for the time being (like 6.0.0 and 1.9.1), and we might decide that no further APIs are needed now to accommodate possible future switches, but at least some thought needs to be given to it.

On consideration, I don't think user-specified sortkey functions are necessary at this stage. If collations are to be identified by strings for now, we could always overload the value to accept a function at some later date if we wanted to support that. So I wouldn't worry about that further. I would very much worry about not defining the exact collation algorithms to be used.
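
As a sketch of what version-pinned string identifiers could look like, and how an implementation might tell whether an existing index has to be rebuilt after an update, consider the following Python. Everything here is an assumption for illustration: the identifier syntax ("binary", "uca-6.0.0", "uca-6.0.0/cldr-1.9.1/de") and the function names are invented, and nothing like this has been proposed in the spec.

```python
from typing import NamedTuple, Optional


class Collation(NamedTuple):
    uca: Optional[str]     # UCA version, or None for binary collation
    cldr: Optional[str]    # CLDR version, or None if untailored
    locale: Optional[str]  # CLDR locale, or None if untailored


def parse_collation(identifier):
    """Parse a hypothetical collation identifier string."""
    if identifier == "binary":
        return Collation(None, None, None)
    parts = identifier.split("/")
    if not parts[0].startswith("uca-"):
        raise ValueError("unknown collation: " + identifier)
    uca = parts[0][len("uca-"):]
    if len(parts) == 1:
        return Collation(uca, None, None)
    if len(parts) == 3 and parts[1].startswith("cldr-"):
        return Collation(uca, parts[1][len("cldr-"):], parts[2])
    raise ValueError("unknown collation: " + identifier)


def needs_rebuild(stored_identifier, shipped_uca, shipped_cldr):
    """Would an index built under stored_identifier be stale if the
    implementation only ships the given UCA and CLDR versions?"""
    collation = parse_collation(stored_identifier)
    if collation.uca is None:
        return False  # binary comparison does not depend on UCA or CLDR data
    if collation.uca != shipped_uca:
        return True
    return collation.cldr is not None and collation.cldr != shipped_cldr


print(needs_rebuild("binary", "6.0.0", "1.9.1"))
print(needs_rebuild("uca-6.0.0/cldr-1.9.1/sv", "6.0.0", "2.0"))
```

Because the versions are part of the name, two browsers can never mean different orderings by the same identifier; the open question is still what a browser should do when the check above says an index is stale.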
Received on Friday, 6 May 2011 17:06:14 UTC