- From: Addison Phillips [wM] <aphillips@webmethods.com>
- Date: Fri, 31 Oct 2003 13:07:06 -0800
- To: "Ashok Malhotra" <ashokma@microsoft.com>, <public-qt-comments@w3.org>
- Cc: <w3c-xml-schema-ig@w3.org>, <w3c-i18n-ig@w3.org>
Hi Ashok,

The following is an entirely personal (not official) response, offering a couple of clarifications. As the new chair of the I18N Working Group, I want to assure you of a swift response. Since many of our members are traveling or on vacation at the moment, it may yet be a few days before you receive something official.

As the new guy, I'm a little hesitant to leap into this discussion. Nonetheless, I have some comments below which might help.

On the question of the timing of CharMod's finalization, it's very clear that this is a key deliverable that a lot of WGs are waiting on. Certainly I've seen the requests from the I18N WG asking people to reference the CharMod or aspects thereof. It's unfair of us to do that if we can't get a reasonable schedule for release. I hope to clarify that shortly and to everyone's satisfaction.

If you have concerns or questions about the I18N WG's methods, deliverables, or comments, I would urge you to contact me privately to discuss them. I'm very concerned that we address such issues.

Following my sig are my comments on the technical questions raised in your note.

Best Regards,

Addison

Addison P. Phillips
Director, Globalization Architecture
webMethods | Delivering Global Business Visibility
http://www.webMethods.com

Chair, W3C Internationalization (I18N) Working Group
Chair, W3C-I18N-WG, Web Services Task Force
http://www.w3.org/International

Internationalization is an architecture. It is not a feature.

--------------

When CharMod talks about "normalization" it refers to Unicode normalization forms as defined in the Unicode Standard (currently version 4.0). Every character has well-defined combining characteristics, so your comment:

<quot>
(Most combining characters are accents and it is arguably rare for such characters to start a sentence but aren't 'h' and 'l' combining characters in Spanish for the forms 'ch' and 'll'? If so, many sentences will start with these characters)
</quot>

...
while insightful about the nature of some languages, is not applicable in this case. 'c', 'h', 'l' and other characters are always fully spacing. Some languages, as you note, use digraph characters: other examples include the Dutch 'ij' digraph and many such letters in Central European languages. None of these cases are at issue here.

A Unicode combining character is defined as a character with a combining class > 0 in the Unicode character database. In all cases, a character either *is* a combining mark at all times and in all languages or it is *not* a combining mark. The language of the material does not affect this, allowing algorithmic normalization.

It seems to me that you are correct about the nature of the fn:substring function being able to create strings that are not fully-normalized. If fn:substring were modified to require its output to be fully-normalized, then I think the right thing to do would *not* be to delete the starting (offending) character, but to insert a space character at the start to "carry" the combining mark (I think this is actually covered, but not very explicitly, in comment [61]). This has the side effect of:

substring(mystr, 0, 10) + substring(mystr, 10, length(mystr)) != mystr

Therefore I would say that the fn:substring operation cannot produce fully-normalized text, because that would seem to fly in the face of the purpose of such an operator.

Again, speaking strictly for myself, I can see how you have arrived at the conclusions you have made. There are functions which can produce "fully-normalized" strings and check strings for "fully-normalized-ness": they are exactly the same as the function that produces Unicode Normalization Form C, plus a check that the first character in the string being normalized is a base character (i.e. has a combining class of 0). If the last test fails, a space is inserted at the start of the data to carry the combining mark. This loses no information.
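The procedure described above can be sketched in a few lines. This is only an illustration, written in Python with the standard `unicodedata` module; the function names (`starts_with_combining_mark`, `to_fully_normalized`, `is_fully_normalized`) and the sample string are assumptions made for the example, not anything defined by CharMod or F&O. It uses the definition given in this message (combining class > 0) and covers only the "NFC plus leading base character" check described here.

```python
import unicodedata


def starts_with_combining_mark(text: str) -> bool:
    """True if the first character has a canonical combining class > 0."""
    return bool(text) and unicodedata.combining(text[0]) > 0


def to_fully_normalized(text: str) -> str:
    """Apply NFC, then prepend a space if the result starts with a combining mark.

    The space "carries" the mark, so no character is thrown away.
    """
    nfc = unicodedata.normalize("NFC", text)
    return " " + nfc if starts_with_combining_mark(nfc) else nfc


def is_fully_normalized(text: str) -> bool:
    """Check the two conditions named above: NFC, and a leading base character."""
    return (unicodedata.normalize("NFC", text) == text
            and not starts_with_combining_mark(text))


# Hypothetical sample: "cafes" with a decomposed acute accent (U+0301) on the "e".
mystr = "cafe\u0301s"
head, tail = mystr[:4], mystr[4:]  # the split falls between "e" and its accent

# 'h' and 'l' are ordinary spacing letters, whatever the language.
print(unicodedata.combining("h"), unicodedata.combining("l"))

# The second piece starts with a combining mark, so it gets a carrier space,
# and the pieces no longer concatenate back to the original string.
print(repr(to_fully_normalized(tail)))
print(head + to_fully_normalized(tail) == mystr)
```

Prepending a space rather than deleting the mark keeps all of the original characters, at the cost of the round-trip inequality noted above.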
A function such as fn:normalize-unicode can easily be written.

Whether other XML Query operations should be defined as producing fully-normalized output (or not) I think is a separate question. In fact, I think there are cases where it might be better if they were defined as NOT producing fully-normalized text and rather as producing only include-normalized text (such as the substring example above). Then a separate operation such as fn:normalize-unicode becomes very important (in the event that a segment of content produced by a function that does not guarantee fully-normalized output must be emitted as fully-normalized).

-----Original Message-----
From: w3c-i18n-ig-request@w3.org [mailto:w3c-i18n-ig-request@w3.org] On Behalf Of Ashok Malhotra
Sent: Tuesday, 28 October 2003 14:30
To: public-qt-comments@w3.org
Cc: w3c-xml-schema-ig@w3.org; w3c-i18n-ig@w3.org
Subject: F&O function fn:normalize-unicode

Both Schema and the I18N WGs requested that the fn:normalize-unicode function should make it a requirement to support the W3C normalization form, now called 'fully-normalized'. See XML Schema comments item [2.6] http://lists.w3.org/Archives/Public/public-qt-comments/2003Aug/0003.html and I18N WG comments item [62] http://lists.w3.org/Archives/Public/public-qt-comments/2003Jul/0105.html.

This was discussed at length at the joint WG meeting in Toronto in mid-September. It was felt that 'fully-normalized' defined a property without a well defined algorithm to achieve it.

The character model says that for text to be "fully-normalized", constructs (sentences in our case) should not start with a combining character, so that appending a character to a string never creates a non-normalized string. But suppose a string operation such as fn:substring results in such a string, i.e. one that starts with a combining character. What should be done? Should the initial combining character just be removed?
(Most combining characters are accents and it is arguably rare for such characters to start a sentence but aren't 'h' and 'l' combining characters in Spanish for the forms 'ch' and 'll'? If so, many sentences will start with these characters.)

Michael Kay made this point in http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2003Oct/0051.html saying:

"It's not at all clear to me that supporting "fully-normalized" form makes any sense at all. Whereas the Unicode normalization forms all describe an algorithm for normalizing data, the "fully-normalized" form is described only as a property of a string. There is no algorithm provided for making a string fully-normalized, and the only algorithms that one might come up with involve losing information. In my view throwing away characters in order to make the characters that remain normalized is not a useful thing to do. I think that someone has completely misunderstood the requirement here."

Thus, the WGs felt that it was unclear how to implement 'fully-normalized' and so did not agree to making it a normalization form that must be supported by fn:normalize-unicode. They recommended that we wait until the Character Model becomes a Recommendation, at which time, perhaps, it will become clear what this form means and how it should be implemented.

All the best,
Ashok
Received on Friday, 31 October 2003 16:12:40 UTC