Re: [CSS21][css3-namespace][css3-page][css3-selectors][css3-content] Unicode Normalization from Jonathan Kew on 2009-02-02 (www-style@w3.org from February 2009)

From: Jonathan Kew <jonathan@jfkew.plus.com>
Date: Mon, 2 Feb 2009 23:33:24 +0000
To: fantasai <fantasai.lists@inkedblade.net>
Cc: "Phillips, Addison" <addison@amazon.com>, Boris Zbarsky <bzbarsky@MIT.EDU>, Mark Davis <mark.davis@icu-project.org>, "public-i18n-core@w3.org" <public-i18n-core@w3.org>, "www-style@w3.org" <www-style@w3.org>, andrewc@vicnet.net.au
Message-Id: <EF626147-BE5A-4CA3-AE2A-B1BF9C028E2B@jfkew.plus.com>

On 2 Feb 2009, at 22:36, fantasai wrote:

>
> Phillips, Addison wrote:
>> ... Both are semantically equivalent and normalize to U+00E9. I can send
>> either to the server in my request and get the appropriate (normalized)
>> value in return. Conversely, I should be able to select:
>> <p>&#x65;&#x300;</p>
>> ... using either form. I might be returned the original (non-normalized)
>> sequence in the result. The point is that processes that are normalization
>> sensitive must behave as if the data were normalized. Why is that a
>> contradiction?
>
> I think Boris's point is that we have a message from Andrew Cunningham
>  http://lists.w3.org/Archives/Public/www-style/2009Feb/0033.html
> saying that form input data must not be normalized. This is incompatible
> with the idea that the browser can internally adopt NFC.

I confess that I didn't really understand that message at the time. So  
I've just re-read it, and also looked up some MARC21-related  
materials. Now I'm ready to say that I disagree with this position. To  
quote from that message:

> the normalisation of form fields should be determined by the web developer.
> Normalisation in some contexts may violate standards in some industries.
> One that comes to mind is libraries. Many of the newer integrated library
> management systems will use a web browser as a client for the cataloguing
> modules. Normalising form fields would result in violating the MARC21
> character model.

A library cataloguing module (for example) is a specialized system  
that will in any case have to perform special validation/filtering on  
its input, if that input is provided in Unicode by the browser but  
must comply with the MARC21 character model when stored in the  
database. I don't believe, therefore, that normalization makes a  
significant difference to the situation. The cataloguing module can  
easily apply whichever form of normalization it requires, or a custom  
normalization-like transformation, if that helps it to process the  
text appropriately.

> If I were working on content in some languages like Igbo, and wanted to
> include tone markers to use as an alternative display of data, it's easier
> to work with NFD data and filter tone marks out when applying standard
> orthographic views.

True, but it is easy for the process that wants to provide alternative  
views of the data to pass that data through a normalization filter at  
that time. Again, this is a specialized application that already has  
detailed knowledge of the particular kind of data it is interested in,  
and how that is encoded; if it wants to rely on NFD representation in  
order to do a tone-mark-filtering operation, it should explicitly  
apply NFD to the data. I don't think this has any bearing on how a  
general-purpose web browser may or should present text to the server.
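
To illustrate how little work this is for the application, here is a
minimal sketch in Python using the standard unicodedata module. The
function name and the particular set of tone marks (grave, acute,
macron) are my own assumptions for the example, not anything taken from
an Igbo orthography specification; a real application would choose its
own set.

```python
import unicodedata

# Hypothetical set of combining tone marks to filter out (an assumption
# for illustration): grave, acute, macron.
TONE_MARKS = frozenset({"\u0300", "\u0301", "\u0304"})

def strip_tone_marks(text, tone_marks=TONE_MARKS):
    """Return text with the given combining tone marks removed.

    The input may arrive in any normalization form; we explicitly
    decompose it to NFD first, so the result does not depend on what
    the browser happened to send.
    """
    decomposed = unicodedata.normalize("NFD", text)
    filtered = "".join(ch for ch in decomposed if ch not in tone_marks)
    # Recompose so that non-tone diacritics (e.g. dot below) end up in a
    # predictable form for storage or display.
    return unicodedata.normalize("NFC", filtered)
```

Because the function applies NFD itself, it behaves the same whether
the browser submitted precomposed or decomposed text.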

> To have a browser normalise to NFC and then have a web developer have to
> renormalise data to NFD or, in the case of MARC21, build a completely new
> normalisation routine that matches the MARC21 character model, which is
> nearly but not quite NFD, is creating a burden for the web developer in
> question.

The web developer who is developing processes that depend on a  
particular normalization form, whether NFC, NFD, or some other custom  
transformation, must face that burden anyway. Otherwise the process  
will never be robustly interoperable with the wider world of encoded  
text.

We may wish this burden didn't exist at all, but it does (and won't be  
going away any time soon -- Unicode is here to stay). And software  
developers -- rather than web page and stylesheet authors -- are the  
right people to carry that burden. For operations that the browser  
carries out, such as matching CSS selectors, the browser developer  
must handle it, whether up-front or on-the-fly. For operations that  
some back-end process carries out, such as perhaps MARC21 data  
validation, the developer of that process has to deal with it.
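
As a rough sketch of the "on-the-fly" option for a comparison such as
selector matching, the following Python compares two strings as if both
had been normalized, without changing the stored data. The function
name is hypothetical, and the choice of NFC for the comparison is an
assumption; NFD would give the same answer for canonical equivalence.

```python
import unicodedata

def canonically_equal(a, b):
    """Compare two strings as if both were normalized.

    Neither input is modified; normalization is applied only for the
    purpose of the comparison.
    """
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# A precomposed e-acute and the sequence e + combining acute differ as
# code point sequences but match under this comparison.
precomposed = "\u00e9"
decomposed = "e\u0301"
assert precomposed != decomposed
assert canonically_equal(precomposed, decomposed)
```

The "up-front" alternative would normalize once when the text is parsed
and then use plain string comparison afterwards.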

JK

Received on Monday, 2 February 2009 23:35:10 UTC