
RE: Matching vs. normalization, CSS matching

From: Larry Masinter <masinter@adobe.com>
Date: Tue, 23 Aug 2011 12:59:52 -0700
To: "Phillips, Addison" <addison@lab126.com>, "ashok.malhotra@oracle.com" <ashok.malhotra@oracle.com>
CC: "www-tag@w3.org" <www-tag@w3.org>, "member-i18n-core@w3.org" <member-i18n-core@w3.org>
Message-ID: <C68CB012D9182D408CED7B884F441D4D05D41A86AD@nambxv01a.corp.adobe.com>
> Agreed, although most applications try to limit the opportunities for false positives/negatives. How we limit those chances is what's at the heart of this conversation.

My reasoning is that false positives (considering things equivalent when they should not be) cannot be avoided by changing the inputs, while false negatives (things that should be considered equivalent but are not detected as such) can be avoided by normalizing the input at an earlier stage.

So in lieu of other considerations, choose the comparison which avoids possible false positives.

Which leads to exact equivalence.
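The trade-off can be illustrated with the composed and decomposed forms of "é" in Python's `unicodedata` module (a sketch; the particular strings are only an illustration of the general point):

```python
import unicodedata

composed = "\u00e9"     # "é" as a single precomposed code point
decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT

# Exact (code point) comparison: no false positives are possible,
# but this pair is reported as different -- a false negative if the
# two were "meant" to match.
print(composed == decomposed)  # False

# NFC-normalizing comparison: this pair now matches, at the cost of
# declaring equivalent some inputs the author may have intended to
# keep distinct.
nfc = unicodedata.normalize
print(nfc("NFC", composed) == nfc("NFC", decomposed))  # True
```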

> If CSS Selectors were limited to applying stylesheets, this might be true (although I don't agree: while styling is "optional", can you think of an important website that uses the default stylesheet?). Besides, CSS Selectors are only one convenient example of the normalization problem in identifier matching in W3C specs.

> For that matter, I think a CSS author would consider it a serious problem if, for an entirely invisible reason, their stylesheet didn't work with apparently matching content. With many stylesheets authored separately from the content they style--and with many documents assembled dynamically--it may be very difficult to debug "why my style sometimes doesn't work". 

I was careful to lay out general considerations and apply the reasoning to this case in particular. But please note that most of the time the task is to choose the "least bad" among several unpleasant alternatives. Every failure is difficult to debug. Sometimes you can fix the problem by making the "bug" disappear, but often the solution is to make sure that the "bug" appears early, often, and reliably. In this case, if MOST systems do not normalize on comparison, then making sure that NO system normalizes on comparison will actually improve the situation, because failures will be more predictable.

> Normalizing all textual content is considered harmful, since it can affect the appearance and, occasionally, the functionality of the content. This is one reason why our proposal seeks to limit normalization to identifier matching.

Isn't it the case that eliminating normalization from ALL matching would make things MORE predictable, given that many systems already do not normalize?

>> An editor, authoring tool, etc., might warn if it comes across unnormalized strings.

> But they don't and they may not offer recognizable options to users if they did.

Something has to give: some component has to detect attempts to compare unnormalized strings. Putting that burden on author-time development tools makes more sense than putting it on runtime matching in every web page presentation.
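An author-time check of this kind is straightforward; here is a sketch of what such a lint-style warning might look like (the function name and message are hypothetical, not a real tool; `unicodedata.is_normalized` requires Python 3.8+):

```python
import unicodedata

def warn_if_unnormalized(s: str, where: str = "selector") -> bool:
    """Author-time lint check: flag strings that are not in NFC.
    Returns True if a warning was issued."""
    if not unicodedata.is_normalized("NFC", s):
        print(f"warning: {where} {s!r} is not NFC-normalized; "
              f"it may fail to match NFC-normalized content at runtime")
        return True
    return False

warn_if_unnormalized("e\u0301lan")  # decomposed "élan": warns
warn_if_unnormalized("\u00e9lan")   # precomposed "élan": silent
```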

>  Would your mother-in-law know if she wanted her document in Form NFC?

Those who write CSS and raw markup are more likely to be technical users.

>  Further, note that the problem affects both the selector and the content being selected against. Either may be denormalized in whole or in part (if assembled dynamically). Finally, as noted above, some documents use denormalized *content* on purpose. Normalizing the whole document would damage this.

This seems like a straw man. If either is unnormalized, that is likely a validation error.

>> If there are input methods (as you point out, for example, for 
>> Vietnamese) which often result in "unnormalized" strings, perhaps this 
>> might suggest either a different normalization algorithm or just 
>> avoiding use of strings for which input methods vary so widely.

> I don't understand "a different normalization algorithm" in this context. What would a different algorithm be? I can understand strict codepoint comparison as the alternative, but the choice of a normalization means, well, the choice of *some* normalization.

I'll drop that line of argument about "alternative normalization algorithms".

> The latter suggestion is the whole point: do I understand that you are suggesting that (for example) Vietnamese cannot be used in attribute values because that breaks selection? Ditto for any language whose input methods sometimes produce denormalized text? This argument is pretty close to "just use ASCII 'cause it always works for me" :-). How is the user supposed to know which sequences are "bad"? Should some words be permitted while others are not? You can't tell from how the keyboard works.

We are faced with choosing the least bad among several unpleasant alternatives.  Just use ASCII, European languages, Japanese or Chinese, but do not use characters for which input produces unnormalized text, if you want your selectors to work reliably. This is just how to deal with the reality of the deployed infrastructure. This is not something W3C can fix by writing new specifications. Don't blame us.

> We've gone to all this trouble to allow any user's language to be used as values in attributes (and, for that matter, as the name of an element or attribute in XML), 

Well, "many more languages than before" is still quite a milestone, even if you haven't reached "any user's language".

And I'm sure there are some users who will complain they can't use their private name characters or symbol names. 

> only to turn back at the last second and deny certain languages full use of the features because of the user's keyboard or their tools or their operating environment?

It isn't "turning back" and certainly not "at the last second" to acknowledge that a desired result cannot be obtained because of the nature of the currently deployed software.

>  This is antithetical to our WG's mission, which is why we're here instead of just acquiescing quietly to the non-normalizing status quo.

I disagree fundamentally. If deployed keyboard software produces strings that cannot be reliably matched using the algorithm that is deployed, you don't have a "specification" problem, you have a "change the deployed software" problem.

There are protocol elements and presentation elements. Selector and element names are protocol elements.

http://tools.ietf.org/html/rfc2324#section-3 

shows what happens when you try to treat a protocol element as a presentational element. I think internationalizing protocol elements is well beyond the scope and mission of the I18N activity.
 
Received on Tuesday, 23 August 2011 20:00:37 GMT
