RE: F&O function fn:normalize-unicode from Addison Phillips [wM] on 2003-10-31 (public-qt-comments@w3.org from October 2003)

From: Addison Phillips [wM] <aphillips@webmethods.com>
Date: Fri, 31 Oct 2003 13:07:06 -0800
To: "Ashok Malhotra" <ashokma@microsoft.com>, <public-qt-comments@w3.org>
Cc: <w3c-xml-schema-ig@w3.org>, <w3c-i18n-ig@w3.org>
Message-ID: <PNEHIBAMBMLHDMJDDFLHGEJHHCAA.aphillips@webmethods.com>
Hi Ashok,

The following is an entirely personal (not official) response, offering a
couple of clarifications. As the new chair of the I18N Working Group, I want
to assure you of a swift response. Since many of our members are traveling
or on vacation at the moment, it may yet be a few days before you receive
something official. As the new guy, I'm a little hesitant to leap into this
discussion. Nonetheless, I have some comments below which might help.

On the question of the timing of CharMod's finalization, it's very clear
that this is a key deliverable that a lot of WGs are waiting on. Certainly
I've seen the requests from the I18N WG asking people to reference the
CharMod or aspects thereof. It's unfair of us to do that if we can't get a
reasonable schedule for release. I hope to clarify that shortly and to
everyone's satisfaction.

If you have concerns or questions about the I18N WG's methods, deliverables,
or comments, I would urge you to contact me privately to discuss them. I'm
very concerned that we address such issues.

Following my sig are my comments on the technical questions raised in your
note.

Best Regards,

Addison

Addison P. Phillips
Director, Globalization Architecture
webMethods | Delivering Global Business Visibility
http://www.webMethods.com
Chair, W3C Internationalization (I18N) Working Group
Chair, W3C-I18N-WG, Web Services Task Force
http://www.w3.org/International

Internationalization is an architecture.
It is not a feature.

--------------

When CharMod talks about "normalization" it refers to Unicode normalization
forms as defined in the Unicode Standard (currently version 4.0). Every
character has well-defined combining characteristics, so your comment:

<quot>
(Most combining characters are accents and it is arguably rare for such
characters to start a sentence but aren't 'h' and 'l' combining characters
in Spanish for the forms 'ch' and 'll'?  If so, many sentences will start
with these characters)
</quot>

... while insightful about the nature of some languages, is not applicable
in this case. 'c', 'h', 'l' and other characters are always fully spacing.
Some languages, as you note, use digraph characters: other examples include
the Dutch 'ij' digraph and many such letters in Central European languages.
None of these cases are at issue here.

A Unicode combining character is defined as a character with a combining
class > 0 in the Unicode character database. In all cases, a character
either *is* a combining mark at all times and in all languages or it is
*not* a combining mark. The language of the material does not affect this,
allowing algorithmic normalization.
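
As a rough illustration of the point above, here is a minimal Python sketch
that looks up the canonical combining class using the standard library's
unicodedata module. The helper name is_combining is my own invention for this
example, and it follows the "combining class > 0" definition given in this
note rather than any wording from CharMod.

```python
import unicodedata


def is_combining(ch: str) -> bool:
    """Return True if ch has a non-zero canonical combining class.

    Hypothetical helper: it mirrors the definition used in this note.
    The answer depends only on the character, never on the language
    of the surrounding text.
    """
    return unicodedata.combining(ch) > 0


# 'c', 'h' and 'l' are ordinary spacing letters; U+0301 is COMBINING ACUTE ACCENT.
for ch in ("c", "h", "l", "\u0301"):
    print(f"U+{ord(ch):04X} class={unicodedata.combining(ch)} "
          f"combining={is_combining(ch)}")
```

Because the lookup is a pure table query, the same test works for Spanish,
Dutch, or any other language.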

It seems to me that you are correct about the nature of the fn:substring
function being able to create strings that are not fully-normalized. If
fn:substring were modified to require its output to be fully-normalized, then I
think the right thing to do would *not* be to delete the starting
(offending) character, but to insert a space character at the start to
"carry" the combining mark (I think this is actually covered, but not very
explicitly in comment [61]).

This has the side effect of:

  substring(mystr, 0, 10) + substring(mystr, 10, length(mystr)) != mystr

Therefore I would say that the fn:substring operation should not be required
to produce fully-normalized text, because that would seem to fly in the face
of the purpose of such an operator.
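
The side effect can be sketched in Python. This is only an illustration under
my own assumptions: carried_substring is a hypothetical stand-in for a
substring operation that prepends a space to carry a leading combining mark,
and Python slicing (0-based, end-exclusive) stands in for fn:substring. The
sample string uses 'q' plus U+0301, which has no precomposed form and so
remains two characters under NFC.

```python
import unicodedata


def starts_with_combining_mark(s: str) -> bool:
    """True if s is non-empty and its first character has combining class > 0."""
    return bool(s) and unicodedata.combining(s[0]) > 0


def carried_substring(s: str, start: int, end: int) -> str:
    """Hypothetical substring that inserts a space to carry a leading mark."""
    piece = s[start:end]
    if starts_with_combining_mark(piece):
        piece = " " + piece
    return piece


text = "aq\u0301b"                            # a, q, COMBINING ACUTE ACCENT, b
head = carried_substring(text, 0, 2)          # "aq"
tail = carried_substring(text, 2, len(text))  # space + U+0301 + "b"

print(head + tail == text)        # the carried pieces do not round-trip
print(text[:2] + text[2:] == text)  # plain slicing does
```

The inserted space is what breaks the round trip: the two pieces together are
one character longer than the original string.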

Again, speaking strictly for myself, I can see how you have arrived at the
conclusions you have made. There are functions which can produce
"fully-normalized" strings and check strings for "fully-normalized-ness":
they are exactly the same as the function that produces Unicode
Normalization Form C, plus a check that the first character in the string
being normalized is a base character (i.e. has a combining class of 0). If
the last test fails, a space is inserted at the start of the data to carry
the combining mark.
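
A minimal Python sketch of the two functions just described, assuming the
simplified reading given in this note (NFC plus a leading-base-character
check) rather than the complete CharMod definition. The names
to_fully_normalized and is_fully_normalized are hypothetical, chosen for this
example only.

```python
import unicodedata


def is_base(ch: str) -> bool:
    """True if ch has canonical combining class 0."""
    return unicodedata.combining(ch) == 0


def to_fully_normalized(s: str) -> str:
    """Apply NFC, then prepend a space if the result starts with a combining mark.

    Hypothetical sketch of the approach in this note: no character is
    deleted, so the original content is still present in the output.
    """
    nfc = unicodedata.normalize("NFC", s)
    if nfc and not is_base(nfc[0]):
        nfc = " " + nfc  # the space "carries" the leading combining mark
    return nfc


def is_fully_normalized(s: str) -> bool:
    """Check that s is already in NFC and does not start with a combining mark."""
    in_nfc = unicodedata.normalize("NFC", s) == s
    return in_nfc and (not s or is_base(s[0]))


sample = "\u0301b"  # starts with COMBINING ACUTE ACCENT
print(is_fully_normalized(sample))
print(is_fully_normalized(to_fully_normalized(sample)))
```

The check is just the producing function's two steps turned into tests, which
is why the pair is easy to keep consistent.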

This loses no information. A function such as fn:normalize-unicode can
easily be written. Whether other XML Query operations should be defined as
producing fully-normalized output (or not) I think is a separate question.
In fact, I think there are cases where it might be better if they were
defined as NOT producing fully-normalized text, but rather only
include-normalized text (such as the substring example above). Then a
separate operation such as fn:normalize-unicode becomes very important (in
the event that a segment of content produced by a function that does not
guarantee fully-normalized output must be emitted as fully-normalized).


-----Original Message-----
From: w3c-i18n-ig-request@w3.org [mailto:w3c-i18n-ig-request@w3.org]On
Behalf Of Ashok Malhotra
Sent: mardi 28 octobre 2003 14:30
To: public-qt-comments@w3.org
Cc: w3c-xml-schema-ig@w3.org; w3c-i18n-ig@w3.org
Subject: F&O function fn:normalize-unicode


Both Schema and the I18N WGs requested that the fn:normalize-unicode
function should make it a requirement to support the
W3C normalization form, now called 'fully-normalized'.  See XML Schema
comments item [2.6]
http://lists.w3.org/Archives/Public/public-qt-comments/2003Aug/0003.html
and I18N WG comments item [62]
http://lists.w3.org/Archives/Public/public-qt-comments/2003Jul/0105.html.
This was discussed at length at the joint WG meeting in Toronto in
mid-September.  It was felt that 'fully-normalized' defined a property
without a well defined algorithm to achieve it.  The character model says
that for text to be "fully-normalized" constructs (sentences in our case)
should not start with a combining character so that appending a character to
a string never creates a non-normalized string.  But suppose a string
operation such as fn:substring results in such a string i.e. one that starts
with a combining character.  What should be done, should the initial
combining character just be removed?
(Most combining characters are accents and it is arguably rare for such
characters to start a sentence but aren't 'h' and 'l' combining characters
in Spanish for the forms 'ch' and 'll'?  If so, many sentences will start
with these characters)
Michael Kay made this point in
http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2003Oct/0051.html
saying: "It's not at all clear to me that supporting "fully-normalized" form
makes any sense at all. Whereas the Unicode normalization forms all describe
an algorithm for normalizing data, the "fully-normalized" form is described
only as a property of a string. There is no algorithm provided for making a
string fully-normalized, and the only algorithms that one might come up with
involve losing information. In my view throwing away characters in order to
make the characters that remain normalized is not a useful thing to do. I
think that someone has completely misunderstood the requirement here."
Thus, the WGs felt that it was unclear how to implement 'fully-normalized'
and so did not agree to making it a normalization form that must be
supported by fn:normalize-unicode.
They recommended that we wait until the Character Model becomes a
Recommendation at which time, perhaps, it will become clear what this form
means and how it should be implemented.


All the best, Ashok
Received on Friday, 31 October 2003 16:12:40 UTC