Re: Unicode Normalization thread should slow down; summary needed from Henri Sivonen on 2009-02-11 (www-style@w3.org from February 2009)

From: Henri Sivonen <hsivonen@iki.fi>
Date: Wed, 11 Feb 2009 09:43:10 +0200
To: Robert J Burns <rob@robburns.com>
Cc: public-i18n-core@w3.org, W3C Style List <www-style@w3.org>
Message-Id: <CF8E5B86-CB4C-401C-A532-A43F04409EF7@iki.fi>
On Feb 10, 2009, at 19:00, Robert J Burns wrote:

> Having the example back helps dramatically. However, you've taken  
> the issue and boiled it down to the solved portion, ignoring what  
> the thrust of the thread was about.

What was the thrust of the i18n core comments then? Except for your  
remarks, as far as I can tell, the thread has revolved around keyboard  
input order or differences in input methods between operating systems  
causing different code point sequences for the same visual appearance.
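
To make that concern concrete, here is a minimal Python sketch (not taken from the thread) of two code point sequences that render alike but compare unequal. The class name and the variable names are hypothetical, and I'm assuming that ID and class name matching amounts to a plain code point comparison.

```python
import unicodedata

# Hypothetical class name typed two ways (both render as "café").
precomposed = "caf\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "cafe\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT

# A plain code point comparison treats them as different strings.
same_code_points = precomposed == decomposed

# After NFC normalization of both sides they compare equal.
same_after_nfc = (unicodedata.normalize("NFC", precomposed)
                  == unicodedata.normalize("NFC", decomposed))

print(same_code_points, same_after_nfc)
```

Whether sequences like the second one actually reach authors through real input methods is exactly the open question.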

If that's a solved problem, great!

I realize that the CSS WG doesn't work by the HTML Design Principles,  
but since decisions on the CSS side would leak to HTML IDs and class  
names, I'm looking at this from the point of view of the impact on  
HTML IDs and class names in the light of the HTML Design Principles.  
The Support World Languages and Solve Real Problems principles would  
apply.

Unicode Normalization in the abstract is not a Real Problem. It *may*  
be a *solution* to Real Problems. The i18n principle is Support World  
Languages--not Use All Unicode Algorithms. Thus, we should see if  
there are world languages whose users *actually* face problems with  
the way IDs and class names are matched. If we find none, there's  
nothing to fix and we can stop even if Unicode allows such cases to be  
constructed. If there are some, then we should consider how those  
actual problems are best addressed. (It's not a given that the best  
solution is changing the consumer side.)

In particular, we shouldn't assume a late-normalization hammer and  
start constructing (perhaps using Character Palette on OS X) nails  
that we can hit with it.

Furthermore, we should very carefully distinguish changes to systems  
that remove discrimination against languages or level the playing  
field between languages and changes that are new features that were  
previously unsupported even for the language most considered by the  
designers of the system (English in this case). The first kind of  
changes and the second kind of changes have a different moral status,  
and it's bogus to try to advance the changes of the second kind by  
portraying them as changes of the first kind. New features shouldn't  
be admitted on any moral principle based upon non-discrimination of  
languages. Instead, new features should be subjected to a cost/benefit  
analysis in their own right.

> I'm happy that keyboard input is normalized. However, I knew that  
> all along,

Well, Jonathan Kew's latest email to this thread says it's not always  
so. Furthermore, this exercise is about sharing knowledge among  
multiple people, so it doesn't help if you knew it all along.

> so that wasn't even the discussion I thought I was having.

What discussion were you having and was it the same discussion that  
i18n core participants were having?

>>> Even before these latest changes the page needed more  
>>> clarification added. For example, some of the solutions are  
>>> difficult to differentiate: for example number #1 and #4  
>>> (originally #3).
>>
>> What's difficult to differentiate between #1 and #4? Solution #1 is  
>> requiring input methods to normalize. Solution #4 is creating a new  
>> encoding name utf-8-nfc for allowing authors to opt in to consumer- 
>> side normalization on the encoding decoder layer.
>
> Well now they're #2 and #5 (since you've inserted #1 before the  
> previous list). I gave them their summary descriptions, from L.  
> David Baron's original email, so it won't necessarily help me to  
> repeat them here, but just in case ("Require authoring applications  
> to normalize text" and "Require all text in a consistent  
> normalization on the level of document conformance").

I see. Those two are indeed almost the same thing but from different  
points of view. The former prescribes a solution and the latter  
prescribes a way of detecting the problem. Perhaps they should be  
merged.
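
As a rough illustration of why the two look like one idea seen from two sides, here is a small Python sketch. The function names are hypothetical, and NFC is only my assumption for what "a consistent normalization" would mean; neither proposal is specified at this level of detail.

```python
import unicodedata

def normalize_on_save(text: str) -> str:
    """Authoring-tool view: emit NFC however the text was entered."""
    return unicodedata.normalize("NFC", text)

def is_conforming(text: str) -> bool:
    """Conformance-checker view: report whether the text is already NFC."""
    return unicodedata.normalize("NFC", text) == text

# Hypothetical class name entered with a combining accent.
typed = "cafe\u0301"
print(is_conforming(typed))                     # the checker flags it
print(is_conforming(normalize_on_save(typed)))  # the tool's output passes
```

Both functions rest on the same normalization call. One applies it and the other only tests for it.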

> But if someone unilaterally changes the entire page to a completely  
> separate and already solved issue then that gets in the way of that  
> approach.

I think the issue in this thread mainly is that there is no consensus  
that the issue is already solved. If it turns out that it is, that's a  
happy outcome. And in that case, we shouldn't try to construct another  
issue that can be hit with the normalization hammer.

> While most keyboards might be able to be designed to limit the input  
> of identifiers to canonically ordered character sequences, the  
> problem is that characters might be input by all sorts of means (not  
> just keyboards): including pasting, character palette, and keyboard  
> input. An identifier might begin its life from an innocent copy and  
> paste from the document content by the initial author of the  
> identifier. Other subsequent authors may try to match the identifier  
> through keyboard input or character palette input (perhaps  
> unsuccessfully due to differing compositions and orderings). So this  
> is in particular a canonical normalization problem (though Henri has  
> attempted, but I'm afraid unsuccessfully, to restate in some terms  
> of only keyboard input).

Has i18n core (or anyone else) identified copying and pasting as  
something that in workflows occurring in practice doesn't preserve  
identifier identity under the kinds of comparisons that are currently  
performed in the Open Web platform in general and in Selector  
implementations in particular?
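
For reference, a Python sketch of the distinction I'm asking about. It assumes the clipboard passes code points through unchanged, which is the very thing that would need to be shown false in practice. The identifier is a hypothetical one built from combining marks that have no precomposed form.

```python
import unicodedata

# Hypothetical identifier as it appears in the original document.
original = "q\u0307\u0323"   # q, COMBINING DOT ABOVE, COMBINING DOT BELOW

# Copy and paste, assuming the code points pass through unchanged.
pasted = "".join(original)

# The same visual result re-entered with the marks in the other order.
retyped = "q\u0323\u0307"

copy_matches = pasted == original
retyped_matches = retyped == original
retyped_matches_after_nfc = (unicodedata.normalize("NFC", retyped)
                             == unicodedata.normalize("NFC", original))

print(copy_matches, retyped_matches, retyped_matches_after_nfc)
```

Under that assumption the pasted copy matches exactly. Only the re-entered sequence needs canonical reordering in order to match.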

If the problem can be cornered to having to construct it with  
Character Palette to experience it, I'd be happy to invoke Solve Real  
Problems and declare such construction as not enough of a Real Problem  
to need Solving.

(Seriously, if you get an HTML file from someone else and it has a  
class name with characters that are foreign to your usual input  
method and you are tasked with writing a selector for the class names,  
do you copy and paste the string from the file you got or do you open  
the character palette and try to locate those characters visually  
there one by one?)

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/
Received on Wednesday, 11 February 2009 07:43:54 UTC