RE: F&O function fn:normalize-unicode from Ashok Malhotra on 2003-10-31 (public-qt-comments@w3.org from October 2003)

From: Ashok Malhotra <ashokma@microsoft.com>
Date: Fri, 31 Oct 2003 14:04:54 -0800
To: <aphillips@webmethods.com>, <public-qt-comments@w3.org>
Cc: <w3c-xml-schema-ig@w3.org>, <w3c-i18n-ig@w3.org>
Message-ID: <EDB607C8AC991F40BE646533A1A673E87CB128@RED-MSG-42.redmond.corp.microsoft.com>
Addison:
Thank you for your reply.  I'll try and get it on the agenda for
discussion next week.

You have supplied an essential piece of information about the workings
of the "fully-normalized" form which, at least I, was lacking.

"... a check that the first character in the string being normalized is
a base character (e.g. has a combining class of 0). If the last test
fails, a space is inserted at the start of the data to carry the
combining mark."

All the best, Ashok

-----Original Message-----
From: Addison Phillips [wM] [mailto:aphillips@webmethods.com] 
Sent: Friday, October 31, 2003 1:07 PM
To: Ashok Malhotra; public-qt-comments@w3.org
Cc: w3c-xml-schema-ig@w3.org; w3c-i18n-ig@w3.org
Subject: RE: F&O function fn:normalize-unicode

Hi Ashok,

The following is an entirely personal (not official) response, offering
a
couple of clarifications. As the new chair of the I18N Working Group, I
want
to assure you of a swift response. Since many of our members are
traveling
or on vacation at the moment, it may yet be a few days before you
receive
something official. As the new guy, I'm a little hesitant to leap into
this
discussion. Nonetheless, I have some comments below which might help.

On the question of the timing of CharMod's finalization, it's very clear
that this is a key deliverable that a lot of WGs are waiting on.
Certainly
I've seen the requests from the I18N WG asking people to reference the
CharMod or aspects thereof. It's unfair of us to do that if we can't get
a
reasonable schedule for release. I hope to clarify that shortly and to
everyone's satisfaction.

If you have concerns or questions about the I18N WG's methods,
deliverables,
or comments, I would urge you to contact me privately to discuss them.
I'm
very concerned that we address such issues.

Following my sig are my comments on the technical questions raised in
your
note.

Best Regards,

Addison

Addison P. Phillips
Director, Globalization Architecture
webMethods | Delivering Global Business Visibility
http://www.webMethods.com
Chair, W3C Internationalization (I18N) Working Group
Chair, W3C-I18N-WG, Web Services Task Force
http://www.w3.org/International

Internationalization is an architecture.
It is not a feature.

--------------

When CharMod talks about "normalization" it refers to Unicode
normalization
forms as defined in the Unicode Standard (currently version 4.0). Every
character has well-defined combining characteristics, so your comment:

<quot>
(Most combining characters are accents and it is arguably rare for such
characters to start a sentence but aren't 'h' and 'l' combining
characters
is Spanish for the forms 'ch' and 'll'?  If so, many sentences will
start
with these characters)
</quot>

... while insightful about the nature of some languages, is not
applicable
in this case. 'c', 'h', 'l' and other characters are always fully
spacing.
Some languages, as you note, use digraph characters: other examples
include
the Dutch 'ij' digraph and many such letters in Central European
langauges.
None of these cases are at issue here.

A Unicode combining character is defined as a character with a combining
class > 0 in the Unicode character database. In all cases, a character
either *is* a combining mark at all times and in all languages or it is
*not* a combining mark. The language of the material does not affect
this,
allowing algorithmic normalization.

It seems to me that you are correct about the nature of the fn:substring
function being able to create strings that are not fully-normalized. If
fn:substring were modified require its output to be fully-normalized,
then I
think the right thing to do would *not* be to delete the starting
(offending) character, but to insert a space character at the start to
"carry" the combining mark (I think this is actually covered, but not
very
explicitly in comment [61]).

This has the side effect of:

  substring(mystr, 0,10) + substring(mystr, 10, length(mystr) != mystr

Therefore I would say that the fn:substring operation cannot produce
fully-normalized text, because that would seem to fly in the face of the
purpose of such an operator.

Again, speaking strictly for myself, I can see how you have arrived at
the
conclusions you have made. There are functions which can produce
"fully-normalized" strings and check strings for
"fully-normalized-ness":
they are exactly the same as the function that produces Unicode
Normalization Form C, plus a check that the first character in the
string
being normalized is a base character (e.g. has a combining class of 0).
If
the last test fails, a space is inserted at the start of the data to
carry
the combining mark.

This loses no information. A function such as fn:normalize-unicode can
easily be written. Whether other XML Query operations should be defined
as
producing fully-normalized output (or not) I think is a separate
question.
In fact, I think there are cases where it might be better if they were
defined as NOT producing fully-normalized and rather only produce
include-normalized text (such as the substring example above). Then a
separate operation such as fn:normalize-unicode becomes very important
(in
the event that a segment of content produced by a function that does not
guarantee fully-normalized output must be emitted as fully-normalized).


-----Original Message-----
From: w3c-i18n-ig-request@w3.org [mailto:w3c-i18n-ig-request@w3.org]On
Behalf Of Ashok Malhotra
Sent: mardi 28 octobre 2003 14:30
To: public-qt-comments@w3.org
Cc: w3c-xml-schema-ig@w3.org; w3c-i18n-ig@w3.org
Subject: F&O function fn:normalize-unicode


Both Schema and the I18N WGs requested that the fn:normalize-unicode
function should make it a requirement to support the
W3C normalization form, now called 'fully-normalized'.  See XML Schema
comments item [2.6]
http://lists.w3.org/Archives/Public/public-qt-comments/2003Aug/0003.html
and I18N WG comments item [62]
http://lists.w3.org/Archives/Public/public-qt-comments/2003Jul/0105.html
.
This was discussed at length at the joint WG meeting in Toronto in
mid-September.  It was felt that 'fully-normalized' defined a property
without
a well defined algorithm to achieve it.  The character model says that
for
text to be "fully-normalized" constructs (sentences in our case) should
not
start
with a combining character so that appending a character to a string
never
creates a non-normalized string.  But suppose a string operation such as
fn:substring
results in such a string i.e. one that starts with a combining
character.
What should be done, should the initial combining character just be
removed?
(Most combining characters are accents and it is arguably rare for such
characters to start a sentence but aren't 'h' and 'l' combining
characters
is Spanish for the forms 'ch' and 'll'?  If so, many sentences will
start
with these characters)
Michael Kay made this point in
http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2003Oct/0051.html
saying: "It's not at all clear to me that supporting "fully-normalized"
form
makes any sense at all. Whereas the Unicode normalization forms all
describe
an algorithm for normalizing data, the "fully-normalized" form is
described
only as a property of a string. There is no algorithm provided for
making a
string fully-normalized, and the only algorithms that one might come up
with
involve losing information. In my view throwing away characters in order
to
make the characters that remain normalized is not a useful thing to do.
I
think that someone has completely misunderstood the requirement here."
Thus, the WGs felt that it was unclear how to implement
'fully-normalized'
and so did not agree to making it a normalization form that must be
supported by fn:normalize-unicode.
They recommended that we wait until the Character Model becomes at
Recommendation at which time, perhaps, it will become clear what this
form
means and how it should be implemented.


All the best, Ashok
Received on Friday, 31 October 2003 17:04:58 UTC