RE: Matching vs. normalization from Phillips, Addison on 2011-08-20 (www-tag@w3.org from August 2011)

From: Phillips, Addison <addison@lab126.com>
Date: Sat, 20 Aug 2011 09:19:07 -0700
To: Larry Masinter <masinter@adobe.com>, "www-tag@w3.org" <www-tag@w3.org>
CC: "member-i18n-core@w3.org" <member-i18n-core@w3.org>
Message-ID: <131F80DEA635F044946897AFDA9AC3476A954D08AD@EX-SEA31-D.ant.amazon.com>
I agree. Which is why the proposal at [1] is couched in terms of string identity matching. 

Normalization doesn't provide a complete matching solution. Unicode canonical equivalence deals with a specific problem inherent to Unicode, but most programmers already know of other interesting equivalences, such as case insensitivity. A complete matching algorithm will need to deal with all of the equivalences in text. (Eszett (ß) vs. double-s is an example of this, since the uppercase of the former is the two regular letters 'SS'.)
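To make the distinction concrete, here is a minimal sketch in Python using only the standard `unicodedata` module. The helper names `nfc_equal` and `caseless_equal` are illustrative assumptions, not anything defined in the proposal at [1]; the caseless variant is just one plausible way to combine case folding with normalization.

```python
import unicodedata


def nfc_equal(a: str, b: str) -> bool:
    """Match on canonical equivalence only: compare the NFC forms."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def caseless_equal(a: str, b: str) -> bool:
    """Hypothetical broader match: decompose, case-fold, then recompose."""
    def key(s: str) -> str:
        folded = unicodedata.normalize("NFD", s).casefold()
        return unicodedata.normalize("NFC", folded)
    return key(a) == key(b)


composed = "\u00e9"      # e-acute as a single code point
decomposed = "e\u0301"   # 'e' followed by a combining acute accent

print(composed == decomposed)            # False: different code points
print(nfc_equal(composed, decomposed))   # True: canonically equivalent
print(nfc_equal("abc", "ABC"))           # False: normalization ignores case
print("\u00df".upper())                  # SS: sharp s uppercases to two letters
print(caseless_equal("stra\u00dfe", "STRASSE"))  # True under case folding
```

Normalization alone resolves the first pair but says nothing about case or the sharp s, which is the sense in which it is not a complete matching solution.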

A different example of what you're talking about might be the Unicode Collation Algorithm (UTS#10 [2]), which allows for tailoring and strengths, as well as contextual equivalence. 

I'm pretty sure I don't agree about having "fuzzy" matching for the cases our WG has in mind for charmod (or for IRI), in which you want an unambiguous comparison of identifiers. When I write an HTML processor or a CSS stylesheet, I need to know that token 'abc' matches (or does not match) token 'ABC'. There is no room for fuzz in these cases. And document authors won't thank us for providing highly promiscuous matching in these cases either.
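A small sketch of why fuzz hurts here, written in Python. The functions `exact_lookup` and `caseless_lookup` and the sample identifier list are hypothetical, chosen only to illustrate the point; they are not taken from any spec.

```python
def exact_lookup(ids: list[str], token: str) -> list[str]:
    """Return the identifiers identical to token, code point for code point."""
    return [i for i in ids if i == token]


def caseless_lookup(ids: list[str], token: str) -> list[str]:
    """Return the identifiers that match token after case folding (a 'fuzzy' match)."""
    folded = token.casefold()
    return [i for i in ids if i.casefold() == folded]


# Two distinct identifiers, as a document author might legitimately define them.
ids = ["abc", "ABC"]

print(exact_lookup(ids, "abc"))     # ['abc']: one unambiguous answer
print(caseless_lookup(ids, "abc"))  # ['abc', 'ABC']: which one did the author mean?
```

With identity matching the answer is always zero or one identifier; with the looser rule, two names the author kept distinct collapse into one.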

Addison

[1] http://www.w3.org/International/wiki/NormalizationProposal

[2] http://www.unicode.org/reports/tr10/ 

> -----Original Message-----
> From: Larry Masinter [mailto:masinter@adobe.com]
> Sent: Saturday, August 20, 2011 7:43 AM
> To: Larry Masinter; Phillips, Addison; www-tag@w3.org
> Cc: member-i18n-core@w3.org
> Subject: Matching vs. normalization
> 
> One thing I have been wanting to do to the IRI comparison document and that I
> think might help with charmod is to move away from "normalization" (or
> normalization) in the standards and describe the algorithms in terms of
> "matching". I think this is important in many situations where equivalence of
> two strings may be ambiguous or not completely determined or certain. Of
> course you can derive an equivalence relationship from a normalization
> algorithm by saying "two strings are equivalent if they have the same normal
> form". However, there are some situations where you can't quite know whether
> downstream tools will treat two strings equivalently, and what you need is a
> 'partial match' algorithm.
> 
> For example, you might say that 'eszett' and 'double s' are partially equivalent --
> some downstream tools will treat them the same, others not. Whether or not
> you want them to 'match' will depend on how conservative or liberal you need
> the matching algorithm to be.  If you are looking for, say, a node ID, you might
> use 'exact match' first, and only use 'fuzzy match' in an error recovery
> situation... only invoke the cost of normalization in the cases where there would
> otherwise be a failure.
> 
> I don't think you can explore those alternatives for solutions if the problem
> space is couched in terms of normalization rather than comparison and
> equivalence.
> 
> Larry
> --
> http://larry.masinter.net

> 
> -----Original Message-----
> From: www-tag-request@w3.org [mailto:www-tag-request@w3.org] On Behalf
> Of Larry Masinter
> Sent: Saturday, August 20, 2011 4:16 PM
> To: Phillips, Addison; www-tag@w3.org
> Cc: member-i18n-core@w3.org
> Subject: RE: Any update on TAG request?
> 
> I'm afraid that to analyze the situation I'd want to go back to first principles, in
> terms of workflows and interoperability problems. It seems you’ve done most
> of this work, but perhaps not gathered it together in this way.
> 
> a. What are the workflows and current deployed components and what do they
> do?
> b. What are the interoperability problems in those workflows?
> c.  What should content authors and creators of tools that generate content do
> in order to minimize downstream interoperability difficulties?
> d. What interoperability problems or applicability limitations still occur when
> one (and even if everyone) follows the advice in c?
> e.  Enumerate possible recommendations we could make for downstream tools,
> such that (presuming widespread adoption  eventually) the interoperability
> problems of d would be minimized.
> f.  Get consensus among the various providers of downstream tools (like
> browsers and parsers) to agree to implement one of the choices identified for e.
> -- by providing clear arguments for how doing so would make the web better.
> g. Update advice of c. with information about how to evaluate tool maker
> progress in implementing recommendations of f.
> 
> From this view, we likely need some combination of "how should content authors
> cope with the existing mess" and "what could tool makers do such that, if all
> agree, the mess will get better".
> I don't see these two approaches as exclusive.  (I think this analysis applies for
> not just charmod but many other parts of web architecture, so I've tried to
> write it more generally).
> 
> > 1. Do nothing. Do not require normalization by implementations or in specs.
> > Create educational materials to help content authors understand the problem
> > and try to avoid it.
> 
> I don't see this as "do nothing" and it seems like a very important thing to
> provide in any case. So of course we should do this right away. I think writing
> this advice and being clear about the interoperability problems being avoided is an
> important first step to improving anything.
> 
> > 2. Adopt our proposal for identifier and token/string matching normalization.
> > Revise Charmod-Norm to embody this. Ensure that specs address these
> > requirements in the future.
> 
> You have a proposal for e. which you believe that, if widely implemented, will
> improve the interoperability situation -- a preferred alternative which you are
> trying to get consensus behind in f. But it sounds like there is some cost to
> implementors, and the cost is such that there is little or no benefit to anyone
> unless that choice is widely implemented.
> 
> The art is then convincing implementors to accept the cost, with only a promise
> of benefit.
> 
> Can the costs/benefits be allocated? Could you say, for example, that
> normalization MUST be applied to Vietnamese but SHOULD be applied to other
> Unicode strings?
> Perhaps there are other hybrid alternatives that do not have the same
> cost/benefit tradeoffs, in that the costs are limited to situations where there
> are clear benefits over the current situation?
> 
> Larry
> --
> http://larry.masinter.net
Received on Saturday, 20 August 2011 16:21:45 UTC