RE: Matching vs. normalization, CSS matching from Phillips, Addison on 2011-08-23 (www-tag@w3.org from August 2011)

From: Phillips, Addison <addison@lab126.com>
Date: Tue, 23 Aug 2011 09:14:37 -0700
To: Larry Masinter <masinter@adobe.com>, "ashok.malhotra@oracle.com" <ashok.malhotra@oracle.com>
CC: "www-tag@w3.org" <www-tag@w3.org>, "member-i18n-core@w3.org" <member-i18n-core@w3.org>
Message-ID: <131F80DEA635F044946897AFDA9AC3476A95603ADB@EX-SEA31-D.ant.amazon.com>

Larry wrote:
> 
> Even for a single application, the choices may vary depending on the
> application. For example, with respect to URI or path equivalence, a web cache
> might want to be conservative (err on the side of 'not equivalent') when
> deciding whether to use a previously cached value, while liberal (err on the
> side of 'equivalent') when deciding whether to invalidate a previously cached
> value after a modification.

Agreed, although most applications try to limit the opportunities for false positives/negatives. How we limit those chances is what's at the heart of this conversation.

> 
> In lieu of any other considerations (no particular cost to false negatives), I
> would think that 'conservative' should hold, i.e., use 'exact match of Unicode
> character sequence' as the equivalence relationship.

This is indeed conservative and makes sense in many cases, but not in all of them. It is also one potential outcome of this discussion.

> 
> In the particular case of CSS selector matching, the cost of false negatives is low
> (things aren't styled properly, when styling itself is optional), I'd say that CSS
> selector matching should use exact string matching, and urge against any use
> of normalization in the matching process.

If CSS Selectors were limited to applying stylesheets, this might be true (although I don't agree: while styling is "optional", can you think of an important website that uses the default stylesheet?). Besides, CSS Selectors are only one convenient example of the normalization problem in identifier matching in W3C specs.

For that matter, I think a CSS author would consider it a serious problem if, for an entirely invisible reason, their stylesheet didn't work with apparently matching content. With many stylesheets authored separately from the content they style, and with many documents assembled dynamically, it may be very difficult to debug "why my style sometimes doesn't work".

Normalizing all textual content is considered harmful, since it can affect the appearance and, occasionally, the functionality of the content. This is one reason why our proposal seeks to limit normalization to identifier matching.
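
To make the distinction concrete, here is a minimal sketch of normalizing only at the point of identifier comparison while leaving the content itself untouched. The element structure and the select_by_class helper are hypothetical, invented for illustration; they are not taken from any CSS implementation or from the proposal text. Only Python's standard unicodedata module is assumed.

```python
import unicodedata


def nfc(s):
    """Return the NFC form of s; used only for comparison, never stored."""
    return unicodedata.normalize("NFC", s)


def select_by_class(elements, selector_class):
    """Hypothetical matcher: compare class names under NFC, return elements as-is."""
    wanted = nfc(selector_class)
    return [el for el in elements if any(nfc(c) == wanted for c in el["classes"])]


# Illustrative data: the same visible class name in two encodings.
elements = [
    {"classes": ["caf\u00e9"], "text": "cafe\u0301 menu"},   # precomposed class name
    {"classes": ["cafe\u0301"], "text": "opening hours"},    # decomposed class name
    {"classes": ["tea"], "text": "tea list"},
]

matched = select_by_class(elements, "caf\u00e9")
exact = [el for el in elements if "caf\u00e9" in el["classes"]]

print(len(matched), len(exact))
```

With exact codepoint comparison only the first element is selected; with comparison-time normalization both are, and the decomposed text content of the first element is returned unchanged.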

> 
> An editor, authoring tool, etc., might warn if it comes across unnormalized
> strings.

But they don't, and even if they did, they might not offer options that users would recognize. Would your mother-in-law know whether she wanted her document in Form NFC? Further, note that the problem affects both the selector and the content being selected against. Either may be denormalized, or denormalized in part (if assembled dynamically). Finally, as noted above, some documents use denormalized *content* on purpose. Normalizing the whole document would damage this.
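
As an illustration of "denormalized in part", the following sketch shows three encodings of the Vietnamese word "tiếng" and how concatenating fragments can produce a partially decomposed string. The fragment boundaries are an assumption chosen for the example; real dynamically assembled documents may split text elsewhere.

```python
import unicodedata

# Three codepoint sequences that all render as the same word.
composed = "ti\u1ebfng"            # U+1EBF: e with circumflex and acute, one codepoint
partial = "ti\u00ea\u0301ng"       # U+00EA (e with circumflex) + combining acute
decomposed = "tie\u0302\u0301ng"   # e + combining circumflex + combining acute

spellings = [composed, partial, decomposed]

# Strict codepoint comparison sees three different strings.
distinct_raw = len(set(spellings))

# After NFC normalization they collapse to one.
distinct_nfc = len({unicodedata.normalize("NFC", s) for s in spellings})

# Hypothetical dynamic assembly: two fragments joined at an unlucky boundary.
assembled = "ti\u00ea" + "\u0301ng"

print(distinct_raw, distinct_nfc, assembled == partial)
```

None of the three differences is visible to an author looking at the rendered text, which is why a warning from an editing tool would be hard to act on.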

> 
> If there are input methods (as you point out, for example, for Vietnamese)
> which often result in "unnormalized" strings, perhaps this might suggest either
> a different normalization algorithm or just avoiding use of strings for which
> input methods vary so widely.

I don't understand "a different normalization algorithm" in this context. What would a different algorithm be? I can understand strict codepoint comparison as the alternative, but the choice of a normalization means, well, the choice of *some* normalization.
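
For reference, the choices on the table can be sketched as follows: strict codepoint comparison versus comparison under one of the Unicode normalization forms. The matches helper and the sample pairs are assumptions made for illustration. The canonical forms (NFC, NFD) agree with each other about equivalence, while the compatibility forms (NFKC, NFKD) treat more strings as equal.

```python
import unicodedata


def matches(a, b, form=None):
    """Compare two strings by codepoint (form=None) or under a normalization form."""
    if form is None:
        return a == b
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# Canonically equivalent pair: precomposed vs. decomposed e-acute.
canonical = ("caf\u00e9", "cafe\u0301")

# Compatibility-equivalent pair: the "fi" ligature U+FB01 vs. plain "fi".
compatibility = ("\ufb01le", "file")

results = {
    form: (matches(*canonical, form=form), matches(*compatibility, form=form))
    for form in (None, "NFC", "NFD", "NFKC", "NFKD")
}

for form, outcome in results.items():
    print(form, outcome)
```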

The latter suggestion is the whole point: do I understand that you are suggesting that (for example) Vietnamese cannot be used in attribute values because that breaks selection? Ditto for any language whose input methods sometimes produce denormalized text? This argument is pretty close to "just use ASCII 'cause it always works for me" :-). How is the user supposed to know which sequences are "bad"? Should some words be permitted while others are not? You can't tell from how the keyboard works.

We've gone to all this trouble to allow any user's language to be used as values in attributes (and, for that matter, as the name of an element or attribute in XML), only to turn back at the last second and deny certain languages full use of the features because of the user's keyboard or their tools or their operating environment? This is antithetical to our WG's mission, which is why we're here instead of just acquiescing quietly to the non-normalizing status quo.

Addison

Received on Tuesday, 23 August 2011 16:15:04 UTC