F&O function fn:normalize-unicode from Ashok Malhotra on 2003-10-28 (public-qt-comments@w3.org from October 2003)

From: Ashok Malhotra <ashokma@microsoft.com>
Date: Tue, 28 Oct 2003 14:30:16 -0800
To: <public-qt-comments@w3.org>
Cc: <w3c-xml-schema-ig@w3.org>, <w3c-i18n-ig@w3.org>
Message-ID: <EDB607C8AC991F40BE646533A1A673E877B1A0@RED-MSG-42.redmond.corp.microsoft.com>

Both the Schema and I18N WGs requested that support for the W3C
normalization form, now called 'fully-normalized', be made a
requirement of the fn:normalize-unicode function.  See XML Schema
comments item [2.6]
http://lists.w3.org/Archives/Public/public-qt-comments/2003Aug/0003.html

and I18N WG comments item [62]
http://lists.w3.org/Archives/Public/public-qt-comments/2003Jul/0105.html

This was discussed at length at the joint WG meeting in Toronto in
mid-September.  It was felt that 'fully-normalized' defines a property
without a well-defined algorithm to achieve it.  The Character Model
says that for text to be "fully-normalized", constructs (sentences in
our case) should not start with a combining character, so that
appending a character to a string never creates a non-normalized
string.  But suppose a string operation such as fn:substring results in
such a string, i.e. one that starts with a combining character.  What
should be done?  Should the initial combining character just be
removed?
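
As a rough illustration of the substring case (a Python sketch, not
anything taken from the F&O draft), the snippet below slices a
decomposed string the way fn:substring($s, 2) would and shows that the
result starts with a combining character.  It assumes that "combining
character" can be approximated by a nonzero canonical combining class;
the Character Model's own definition may be broader.  The helper name
starts_with_combining is hypothetical.

```python
import unicodedata


def starts_with_combining(s: str) -> bool:
    """Assumption: a 'combining character' is any code point whose
    canonical combining class is nonzero."""
    return bool(s) and unicodedata.combining(s[0]) != 0


# 'e' + COMBINING ACUTE ACCENT + 't' + precomposed e-acute
decomposed = "e\u0301t\u00e9"

# Roughly what fn:substring($s, 2) would return: drop the first code point.
tail = decomposed[1:]

# The tail is unchanged by NFC, yet it starts with a combining character.
tail_is_nfc = unicodedata.normalize("NFC", tail) == tail

# Concatenating two strings that are each in NFC can give a non-NFC result.
left = "a"
joined = left + tail
joined_nfc = unicodedata.normalize("NFC", joined)

print(starts_with_combining(tail))  # True
print(tail_is_nfc)                  # True
print(joined == joined_nfc)         # False
```

Here both "a" and the tail are individually in NFC, but their
concatenation is not, which is the situation the "should not start with
a combining character" rule is meant to exclude.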

(Most combining characters are accents, and it is arguably rare for
such characters to start a sentence, but aren't 'h' and 'l' combining
characters in Spanish for the forms 'ch' and 'll'?  If so, many
sentences will start with these characters.)

Michael Kay made this point in
http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2003Oct/0051.html
saying: "It's not at all clear to me that supporting "fully-normalized"
form makes any sense at all. Whereas the Unicode normalization forms all
describe an algorithm for normalizing data, the "fully-normalized" form
is described only as a property of a string. There is no algorithm
provided for making a string fully-normalized, and the only algorithms
that one might come up with involve losing information. In my view
throwing away characters in order to make the characters that remain
normalized is not a useful thing to do. I think that someone has
completely misunderstood the requirement here."
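
The contrast can be sketched in Python (an illustration only; the
function names are hypothetical, and a nonzero canonical combining
class is again assumed as the test for a combining character).  NFC is
an algorithm that preserves canonical equivalence, whereas the obvious
way to force a string to stop starting with a combining character is to
drop that character, which changes the text.

```python
import unicodedata


def nfc(s: str) -> str:
    return unicodedata.normalize("NFC", s)


def nfd(s: str) -> str:
    return unicodedata.normalize("NFD", s)


def strip_leading_combining(s: str) -> str:
    """Hypothetical 'fix': remove leading code points whose canonical
    combining class is nonzero.  This discards information."""
    i = 0
    while i < len(s) and unicodedata.combining(s[i]) != 0:
        i += 1
    return s[i:]


# NFC changes the encoding but not the canonical content.
x = "e\u0301"
nfc_preserves_content = nfd(nfc(x)) == nfd(x)

# A string starting with COMBINING ACUTE ACCENT is left unchanged by NFC,
# so NFC alone does not remove the leading combining character.
s = "\u0301abc"
nfc_leaves_it = nfc(s) == s

# Stripping removes it, but the result is no longer equivalent to the input.
stripped = strip_leading_combining(s)
stripping_loses_content = nfd(stripped) != nfd(s)

print(nfc_preserves_content)    # True
print(nfc_leaves_it)            # True
print(stripped)                 # abc
print(stripping_loses_content)  # True
```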

Thus, the WGs felt that it was unclear how to implement
'fully-normalized' and so did not agree to make it a normalization
form that must be supported by fn:normalize-unicode.
They recommended that we wait until the Character Model becomes a
Recommendation, at which time, perhaps, it will become clear what this
form means and how it should be implemented.


All the best, Ashok

Received on Tuesday, 28 October 2003 17:30:20 UTC