- From: <bugzilla@jessica.w3.org>
- Date: Sun, 25 Sep 2016 19:25:24 +0000
- To: public-qt-comments@w3.org
https://www.w3.org/Bugs/Public/show_bug.cgi?id=29865

--- Comment #2 from Michael Kay <mike@saxonica.com> ---
I'd like to extend the scope of this issue: we have a number of examples of how functions such as contains() handle ignorable characters such as punctuation and spaces, and this area seems to have evolved in recent drafts of UCA.

As always with UCA, it's very difficult to understand the full complexity of the specs, but here's an attempt.

We define three values for the collation parameter "alternate": non-ignorable, shifted, and blanked. Although no-one would guess it from the name, "alternate" is about how "noise" characters like punctuation and whitespace are handled. In the spec we're very coy about saying what these values mean, and the reason for that is that it's difficult and dangerous to paraphrase something so complex. But here's an attempt at a summary:

"non-ignorable" - noise characters are treated like ordinary primary characters, generally with a sort order lower than other characters. So for example "data type" sorts before "database".

"shifted" - noise characters are less significant than other differences, for example they are less significant than accents (secondary differences) or case (tertiary differences) - that is, they are treated as quaternary differences. In turn this means (I think) that if the strength of the collation is less than quaternary, then noise characters are ignored entirely.

"blanked" - I think this means that if the strength of the collation is less than "identical", noise characters are ignored entirely.

The other question is, what characters are treated as noise (which is my term, not a UCA or LDML term)? The UCA/LDML term for these is (completely unintuitively) "variable" characters. In older versions of LDML and UCA the set of variable characters is defined by something called variableTop; in the most recent versions it is instead defined by maxVariable.
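To make the summary of "alternate" above concrete, here is a toy model in Python. It is only a sketch of the behaviour as paraphrased in this comment, not the UCA algorithm: the function name toy_key, the VARIABLE set (ASCII whitespace and punctuation only) and the weights are all assumptions made for illustration, and the secondary (accent) level is left out entirely.

```python
import string

# Toy stand-in for the "variable" (noise) characters: ASCII whitespace and
# punctuation only. This is an assumption, not the UCA weight tables.
VARIABLE = set(string.whitespace) | set(string.punctuation)

# Quaternary weight for non-variable characters: above any code point.
HIGH = 0x110000

STRENGTHS = {"primary": 1, "secondary": 2, "tertiary": 3,
             "quaternary": 4, "identical": 5}


def toy_key(text, alternate="non-ignorable", strength="tertiary"):
    """Build a simplified multi-level sort key (hypothetical helper).

    The secondary (accent) level is not modelled, so "secondary"
    behaves like "primary" here.
    """
    if alternate not in ("non-ignorable", "shifted", "blanked"):
        raise ValueError("unknown alternate value: %r" % alternate)
    level = STRENGTHS[strength]
    primary, tertiary, quaternary = [], [], []
    for ch in text:
        if ch in VARIABLE:
            if alternate == "non-ignorable":
                # An ordinary primary character that sorts below any letter.
                primary.append((0, ch))
                tertiary.append(0)
                quaternary.append(HIGH)
            elif alternate == "shifted":
                # Only visible at the quaternary level.
                quaternary.append(ord(ch))
            # "blanked": contributes nothing below the identical level.
        else:
            primary.append((1, ch.casefold()))
            tertiary.append(1 if ch.isupper() else 0)
            quaternary.append(HIGH)
    key = [tuple(primary)]
    if level >= 3:
        key.append(tuple(tertiary))
    if level >= 4:
        key.append(tuple(quaternary))
    if level >= 5:
        key.append(text)  # identical level: fall back to the raw string
    return tuple(key)
```

Under these assumptions, sorting "database" and "data type" with key=toy_key puts "data type" first (non-ignorable), whereas with alternate="shifted" at tertiary strength the space is ignored, so "data type" compares equal to "datatype" and sorts after "database".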
I think maxVariable is a useful parameter and we should add it as follows:

maxVariable=space|punct|symbol|currency - indicates that all characters in the specified group and earlier groups are "noise" characters, to be treated as defined by the "alternate" parameter. For example, maxVariable=punct indicates that characters classified as whitespace or punctuation get this treatment.

alternate=non-ignorable|shifted|blanked - indicates how "noise" characters (as defined by the maxVariable parameter) are to be treated: non-ignorable indicates that they are significant characters in their own right; shifted indicates that they affect the comparison of strings only at the quaternary level; blanked indicates that they affect the comparison of strings only at the identical level.

In addition, I think that interoperability demands that we define some defaults, especially as the defaults are in some cases different between UCA and LDML:

strength=tertiary, alternate=non-ignorable, maxVariable=punct, backwards=no, caseLevel=no, normalization=no, numeric=no, caseFirst unspecified (default then depends on other parameters e.g. lang).

Question: should lang default to the default language from the dynamic context? There are usability arguments in favour of this, but on the whole I think not: in XQuery "order by" the collation is defined statically, and I think it's assumed that if a collation is specified as a literal, we know statically what collation is being used and can optimize accordingly, e.g. by selecting database indexes. Leave it implementation-defined, and an implementation can take it from the dynamic context if it chooses.
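The proposed parameters and defaults can be sketched as a small Python helper. This is an illustration only: the names PROPOSED_DEFAULTS, variable_groups, effective_params and collation_uri are hypothetical, the defaults are simply the ones listed in this comment, and the base URI with semicolon-separated parameters is assumed to follow the usual form of the UCA collation URI.

```python
# Assumed base URI for UCA collations; parameters are joined with ";".
UCA_BASE = "http://www.w3.org/2013/collation/UCA"

# Defaults as proposed in this comment. caseFirst and lang are left out
# because the comment leaves them unspecified / implementation-defined.
PROPOSED_DEFAULTS = {
    "strength": "tertiary",
    "alternate": "non-ignorable",
    "maxVariable": "punct",
    "backwards": "no",
    "caseLevel": "no",
    "normalization": "no",
    "numeric": "no",
}

# Groups in order: a maxVariable value covers itself and all earlier groups.
VARIABLE_GROUPS = ["space", "punct", "symbol", "currency"]
ALTERNATE_VALUES = {"non-ignorable", "shifted", "blanked"}


def variable_groups(max_variable):
    """Return the groups treated as "noise" for a given maxVariable."""
    if max_variable not in VARIABLE_GROUPS:
        raise ValueError("unknown maxVariable value: %r" % max_variable)
    return VARIABLE_GROUPS[: VARIABLE_GROUPS.index(max_variable) + 1]


def effective_params(**overrides):
    """Merge explicit parameters over the proposed defaults and validate."""
    params = dict(PROPOSED_DEFAULTS)
    params.update(overrides)
    if params["alternate"] not in ALTERNATE_VALUES:
        raise ValueError("unknown alternate value: %r" % params["alternate"])
    variable_groups(params["maxVariable"])  # raises on an unknown group
    return params


def collation_uri(**overrides):
    """Spell out every effective parameter in a collation URI."""
    params = effective_params(**overrides)
    query = ";".join("%s=%s" % (k, v) for k, v in sorted(params.items()))
    return "%s?%s" % (UCA_BASE, query)
```

For example, variable_groups("punct") yields space and punct, matching the statement that maxVariable=punct covers whitespace and punctuation, and collation_uri(alternate="shifted") keeps maxVariable=punct and strength=tertiary from the defaults.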
Received on Sunday, 25 September 2016 19:25:33 UTC