RE: Matching vs. normalization, CSS matching

In general, the choice of equivalence/comparison algorithm for a particular application depends on the relative cost of false positives (treating two strings as equivalent when, for that application, they are not) vs. the cost of false negatives (treating strings as different when they actually are equivalent).

Even within a single application, the choice may vary depending on the operation. For example, with respect to URI or path equivalence, a web cache might want to be conservative (err on the side of 'not equivalent') when deciding whether to use a previously cached value, while being liberal (err on the side of 'equivalent') when deciding whether to invalidate a previously cached value after a modification.
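
To make the asymmetry concrete, here is a rough sketch in Python (the helper names and the particular canonicalization steps are my own, not anything a spec prescribes):

    import unicodedata
    from urllib.parse import urlsplit

    def exact_equal(a, b):
        # Conservative: codepoint-for-codepoint identity.
        return a == b

    def liberal_equal(a, b):
        # Liberal: compare after NFC normalization, with scheme and host
        # lowercased (case-insensitive per RFC 3986). For simplicity this
        # sketch lowercases the whole netloc.
        def canon(u):
            parts = urlsplit(unicodedata.normalize('NFC', u))
            return parts._replace(scheme=parts.scheme.lower(),
                                  netloc=parts.netloc.lower()).geturl()
        return canon(a) == canon(b)

    # A cache might serve a stored response only when exact_equal(request,
    # key) holds, but invalidate every entry for which
    # liberal_equal(modified_uri, key) holds.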

In the absence of any other considerations (no particular cost to false negatives), I would think that 'conservative' should hold, i.e., use 'exact match of Unicode character sequence' as the equivalence relationship.

In the particular case of CSS selector matching, the cost of a false negative is low (things aren't styled as intended, and styling itself is optional), so I'd say that CSS selector matching should use exact string matching, and I'd urge against any use of normalization in the matching process.
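
For concreteness, here is the difference in Python (a sketch of the two policies, not a claim about what any particular CSS implementation does):

    import unicodedata

    composed   = '\u00e9cole'    # 'école' with precomposed e-acute (U+00E9)
    decomposed = 'e\u0301cole'   # same text as 'e' + combining acute (U+0301)

    # Exact matching: the two tokens do NOT match.
    print(composed == decomposed)                       # False

    # Normalization-based matching: they DO match.
    nfc = lambda s: unicodedata.normalize('NFC', s)
    print(nfc(composed) == nfc(decomposed))             # True

Whether 'match' means the first or the second is exactly the policy choice at issue.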

An editor, authoring tool, etc., might warn if it comes across unnormalized strings. 

If there are input methods (as you point out, for example, for Vietnamese) which often result in "unnormalized" strings, this might suggest either a different normalization algorithm or simply avoiding the use of strings for which input methods vary so widely.
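
A minimal sketch of such a warning, assuming NFC as the target form (the function name is hypothetical; the two calls pass a composed and a decomposed spelling of the same Vietnamese text):

    import unicodedata

    def warn_if_unnormalized(s, form='NFC'):
        # Flag strings that are not already in the chosen normal form.
        if unicodedata.normalize(form, s) != s:
            print('warning: string is not in %s: %r' % (form, s))

    warn_if_unnormalized('ti\u1ebfng Vi\u1ec7t')    # precomposed: no warning
    warn_if_unnormalized('tie\u0302\u0301ng Vie\u0323\u0302t')  # warns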


-----Original Message-----
From: Phillips, Addison [mailto:addison@lab126.com] 
Sent: Saturday, August 20, 2011 7:22 PM
To: ashok.malhotra@oracle.com
Cc: Larry Masinter; www-tag@w3.org; member-i18n-core@w3.org
Subject: RE: Matching vs. normalization

Hi Ashok,

Of course ordering and sorting are important for organizing information. And normalization has a role to play there. However, excepting binary comparison of strings, most sorting implementations already have to deal with text equivalences. And even badly organized data may still be functional.

Or, to go to Larry's point, just as it makes more sense to talk about "string matching" rather than normalization, it makes sense to talk about "collation" rather than normalization as well. As it happens, the base document of the Character Model has a whole section on collation (3.5) and this is non-controversial: I cannot remember a WG ever having objected to sorting things "correctly".

Identifier matching, however, is both fundamental and (in our view) flawed. 

Addison


> -----Original Message-----
> From: ashok malhotra [mailto:ashok.malhotra@oracle.com]
> Sent: Saturday, August 20, 2011 9:47 AM
> To: Phillips, Addison
> Cc: Larry Masinter; www-tag@w3.org; member-i18n-core@w3.org
> Subject: Re: Matching vs. normalization
> 
> Addison:
> Matching is clearly important but isn't ordering/sorting also important?
> All the best, Ashok
> 
> On 8/20/2011 9:19 AM, Phillips, Addison wrote:
> > I agree. Which is why the proposal at [1] is couched in terms of
> > string identity matching.
> >
> > Normalization doesn't provide a complete matching solution. Unicode
> > canonical equivalence deals with a specific problem inherent to
> > Unicode, but most programmers already know of other equivalences
> > that are interesting---case insensitivity, for example. A complete
> > matching algorithm will need to deal with all of the equivalences in
> > text. (Eszett vs. double-s is an example of this, since the uppercase
> > of the former is the two-letter sequence 'SS'.)
> >
> > A different example of what you're talking about might be the Unicode
> > Collation Algorithm (UTS#10 [2]), which allows for tailoring and
> > strengths, as well as contextual equivalence.
> >
> > I'm pretty sure I don't agree about having "fuzzy" matching for the
> > cases our WG has in mind for charmod (or for IRI), in which you want
> > an unambiguous comparison of identifiers. When I write an HTML
> > processor or a CSS stylesheet, I need to know that token 'abc' matches
> > (or does not match) token 'ABC'. There is no room for fuzz in these
> > cases. And document authors won't thank us for providing highly
> > promiscuous matching in these cases either.
> >
> > Addison
> >
> > [1] http://www.w3.org/International/wiki/NormalizationProposal
> > [2] http://www.unicode.org/reports/tr10/
> >
> >> -----Original Message-----
> >> From: Larry Masinter [mailto:masinter@adobe.com]
> >> Sent: Saturday, August 20, 2011 7:43 AM
> >> To: Larry Masinter; Phillips, Addison; www-tag@w3.org
> >> Cc: member-i18n-core@w3.org
> >> Subject: Matching vs. normalization
> >>
> >> One thing I have been wanting to do to the IRI comparison document,
> >> and that I think might help with charmod, is to move away from
> >> "normalization" in the standards and describe the algorithms in terms
> >> of "matching". I think this is important in many situations where the
> >> equivalence of two strings may be ambiguous or not completely
> >> determined or certain. Of course you can derive an equivalence
> >> relationship from a normalization algorithm by saying "two strings
> >> are equivalent if they have the same normal form". However, there are
> >> some situations where you can't quite know whether downstream tools
> >> will treat two strings equivalently, and what you need is a 'partial
> >> match' algorithm.
> >>
> >> For example, you might say that 'eszett' and 'double s' are partially
> >> equivalent -- some downstream tools will treat them the same, others
> >> not. Whether or not you want them to 'match' will depend on how
> >> conservative or liberal you need the matching algorithm to be.  If
> >> you are looking for, say, a node ID, you might use 'exact match'
> >> first, and only use 'fuzzy match' in an error recovery situation...
> >> only invoke the cost of normalization in the cases where there would
> >> otherwise be a failure.
> >>
> >> I don't think you can explore those alternatives for solutions if 
> >> the problem space is couched in terms of normalization rather than 
> >> comparison and equivalence.
> >>
> >> Larry
> >> --
> >> http://larry.masinter.net
> >>
> >> -----Original Message-----
> >> From: www-tag-request@w3.org [mailto:www-tag-request@w3.org] On 
> >> Behalf Of Larry Masinter
> >> Sent: Saturday, August 20, 2011 4:16 PM
> >> To: Phillips, Addison; www-tag@w3.org
> >> Cc: member-i18n-core@w3.org
> >> Subject: RE: Any update on TAG request?
> >>
> >> I'm afraid that to analyze the situation I'd want to go back to
> >> first principles, in terms of workflows and interoperability
> >> problems. It seems you've done most of this work, but perhaps not
> >> gathered it together in this way.
> >>
> >> a. What are the workflows and current deployed components and what 
> >> do they do?
> >> b. What are the interoperability problems in those workflows?
> >> c.  What should content authors and creators of tools that generate 
> >> content do in order to minimize downstream interoperability difficulties?
> >> d. What interoperability problems or applicability limitations still
> >> occur when one (or even everyone) follows the advice in c?
> >> e.  Enumerate possible recommendations we could make for downstream 
> >> tools, such that (presuming widespread adoption  eventually) the 
> >> interoperability problems of d would be minimized.
> >> f.  Get consensus among the various providers of downstream tools
> >> (like browsers and parsers) to agree to implement one of the choices
> >> identified for e. -- by providing clear arguments for how doing so
> >> would make the web better.
> >> g. Update advice of c. with information about how to evaluate tool 
> >> maker progress in implementing recommendations of f.
> >>
> >> From this view, we likely need some combination of "how should
> >> content authors cope with the existing mess" and "what could tool
> >> makers do such that, if all agree, the mess will get better".
> >> I don't see these two approaches as exclusive.  (I think this
> >> analysis applies not just to charmod but to many other parts of web
> >> architecture, so I've tried to write it more generally.)
> >>
> >>> 1. Do nothing. Do not require normalization by implementations or
> >>> in specs. Create educational materials to help content authors
> >>> understand the problem and try to avoid it.
> >>
> >> I don't see this as "do nothing" and it seems like a very important
> >> thing to provide in any case. So of course we should do this right
> >> away. I think writing this advice, and being clear about the
> >> interoperability problems it avoids, is an important first step
> >> toward improving anything.
> >>
> >>> 2. Adopt our proposal for identifier and token/string matching
> >>> normalization. Revise Charmod-Norm to embody this. Ensure that
> >>> specs address these requirements in the future.
> >>
> >> You have a proposal for e. which you believe, if widely implemented,
> >> will improve the interoperability situation -- a preferred
> >> alternative which you are trying to get consensus behind in f. But
> >> it sounds like there is some cost to implementors, and the cost is
> >> such that there is little or no benefit to anyone unless that choice
> >> is widely implemented.
> >>
> >> The art is then convincing implementors to accept the cost, with 
> >> only a promise of benefit.
> >>
> >> Can the costs/benefits be allocated? Could you say, for example, 
> >> that normalization MUST be applied to Vietnamese but SHOULD be 
> >> applied to other Unicode strings?
> >> Perhaps there are other hybrid alternatives that do not have the 
> >> same cost/benefit tradeoffs, in that the costs are limited to 
> >> situations where there are clear benefits over the current situation?
> >>
> >> Larry
> >> --
> >> http://larry.masinter.net