[Bug 29865] [FO31] UCA collation in substring matching

https://www.w3.org/Bugs/Public/show_bug.cgi?id=29865

--- Comment #2 from Michael Kay <mike@saxonica.com> ---
I'd like to extend the scope of this issue: we have a number of examples of how
functions such as contains() handle ignorable characters such as punctuation
and spaces, and this area seems to have evolved in recent drafts of UCA. As
always with UCA, it's very difficult to understand the full complexity of the
specs, but here's an attempt.

We define three values for the collation parameter "alternate": non-ignorable,
shifted, and blanked. Although no-one would guess it from the name, "alternate"
is about how "noise" characters like punctuation and whitespace are handled.

In the spec we're very coy about saying what these mean, and the reason for
that is that it's difficult and dangerous to paraphrase something so complex.
But here's an attempt at a summary:

"non-ignorable" - noise characters are treated like ordinary primary
characters, generally with a sort order lower than other characters. So for
example "data type" sorts before "database".

"shifted" - noise characters are less significant than other differences, for
example they are less significant than accents (secondary differences) or case
(tertiary differences) - that is, they are treated as quaternary differences.
In turn this means (I think) that if the strength of the collation is less than
quaternary, then noise characters are ignored entirely.

"blanked" - I think this means that if the strength of the collation is less
than "identical", noise characters are ignored entirely.
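
To make the three settings concrete, here is a toy model in Python. It is not
the UCA and should not be read as a paraphrase of it. It assumes that "noise"
means ASCII space and punctuation, uses lowercased code points as stand-in
primary weights, and leaves out the secondary and tertiary levels altogether.
The names NOISE and toy_key are hypothetical.

```python
import string

# Assumption: a crude stand-in for the UCA "variable" characters.
NOISE = set(string.punctuation) | {" "}

def toy_key(s, alternate="non-ignorable", strength=3):
    """Toy sort key. strength: 1=primary .. 4=quaternary, 5=identical.

    Only levels 1, 4 and 5 are modelled; accents and case are not.
    """
    if alternate not in ("non-ignorable", "shifted", "blanked"):
        raise ValueError(alternate)
    keep_noise = alternate == "non-ignorable"
    # Level 1: noise carries a weight only under non-ignorable.
    key = [tuple(ord(c) for c in s.lower() if keep_noise or c not in NOISE)]
    if alternate == "shifted" and strength >= 4:
        # Level 4: noise keeps a weight, everything else gets a high constant.
        key.append(tuple(ord(c) if c in NOISE else 0xFFFF for c in s))
    if strength >= 5:
        # Identical level: fall back to the raw code points.
        key.append(tuple(ord(c) for c in s))
    return tuple(key)
```

In this toy, sorting "database" and "data type" puts "data type" first under
non-ignorable and "database" first under shifted. "data type" and "datatype"
get equal keys under shifted at strength 3 but different keys at strength 4,
and under blanked they differ only at strength 5.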

The other question is, what characters are treated as noise (which is my term,
not a UCA or LDML term)? The UCA/LDML term for these (completely
unintuitively) is "variable" characters. In older versions of LDML and UCA this
set is defined by something called variableTop; in the most recent versions it
is instead defined by maxVariable. I think this is a useful parameter and we
should add it as follows:

maxVariable=space|punct|symbol|currency - indicates that all characters in the
specified group and earlier groups are "noise" characters, to be handled as
defined by the "alternate" parameter. For example, maxVariable=punct indicates
that characters classified as whitespace or punctuation get this treatment.

alternate=non-ignorable|shifted|blanked - indicates how "noise" characters (as
defined by the maxVariable parameter) are to be treated: non-ignorable
indicates that they are significant characters in their own right; shifted
indicates that they affect the comparison of strings only at the quaternary
level; blanked indicates that they affect the comparison of strings only at the
identical level.
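
The cumulative nature of maxVariable can be sketched as follows. This is only
an approximation: it guesses a character's group from its Unicode general
category, which is an assumption and will not match the real UCA/CLDR group
assignments in every case. The names GROUPS, toy_group and is_noise are
hypothetical.

```python
import unicodedata

# Group order as listed in the proposed maxVariable parameter.
GROUPS = ["space", "punct", "symbol", "currency"]

def toy_group(ch):
    """Approximate a character's group from its general category (assumption)."""
    cat = unicodedata.category(ch)
    if ch.isspace() or cat.startswith("Z"):
        return "space"
    if cat.startswith("P"):
        return "punct"
    if cat == "Sc":
        return "currency"
    if cat.startswith("S"):
        return "symbol"
    return None  # letters, digits etc. are never "noise" in this sketch

def is_noise(ch, max_variable="punct"):
    """True if ch falls in the named group or any earlier group."""
    group = toy_group(ch)
    return group is not None and GROUPS.index(group) <= GROUPS.index(max_variable)
```

With maxVariable=punct this sketch treats " " and "-" as noise but not "+" or
"$"; raising it to symbol adds "+", and raising it to currency adds "$".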

In addition, I think that interoperability demands that we define some
defaults, especially as the defaults are in some cases different between UCA
and LDML. Suggested defaults: strength=tertiary, alternate=non-ignorable,
maxVariable=punct, backwards=no, caseLevel=no, normalization=no, numeric=no,
caseFirst unspecified (default then depends on other parameters, e.g. lang).
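
As a sketch of what fixed defaults buy us, the following Python fills in the
values listed above for any parameter the user omits. The base URI and the
semicolon-separated query form are how I understand the UCA collation URI in
F&O 3.1, so treat them as an assumption; the names PROPOSED_DEFAULTS,
effective_params and collation_uri are hypothetical. caseFirst and lang are
deliberately left out of the table because no fixed default is proposed for
them.

```python
# Assumption: UCA collation URI form as understood from F&O 3.1.
UCA_BASE = "http://www.w3.org/2013/collation/UCA"

# The defaults proposed in this comment.
PROPOSED_DEFAULTS = {
    "strength": "tertiary",
    "alternate": "non-ignorable",
    "maxVariable": "punct",
    "backwards": "no",
    "caseLevel": "no",
    "normalization": "no",
    "numeric": "no",
}

def effective_params(explicit):
    """Merge explicitly supplied parameters over the proposed defaults."""
    params = dict(PROPOSED_DEFAULTS)
    params.update(explicit)
    return params

def collation_uri(explicit):
    """Spell out every effective parameter in a single collation URI."""
    params = effective_params(explicit)
    query = ";".join(f"{k}={v}" for k, v in sorted(params.items()))
    return f"{UCA_BASE}?{query}"
```

Two processors that apply the same table would resolve a short URI such as
one giving only alternate=shifted to the same effective collation.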

Question: should lang default to the default language from the dynamic context?
There are usability arguments in favour of this, but on the whole I think not:
in XQuery "order by" the collation is defined statically and I think it's
assumed that if a collation is specified as a literal, we know statically what
collation is being used and can optimize accordingly, e.g. by selecting
database indexes. Leave it implementation-defined, and an implementation can
take it from the dynamic context if it chooses.

Received on Sunday, 25 September 2016 19:25:33 UTC