Matching vs. normalization

One thing I have been wanting to do to the IRI comparison document, and that I think might help with charmod, is to move away from "normalization" in the standards and describe the algorithms in terms of "matching". I think this is important in the many situations where the equivalence of two strings is ambiguous or not fully determined. Of course you can derive an equivalence relation from a normalization algorithm by saying "two strings are equivalent if they have the same normal form". However, there are some situations where you can't quite know whether downstream tools will treat two strings equivalently, and what you need is a 'partial match' algorithm.
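
As a rough illustration only (this is not taken from the IRI or Charmod drafts), the normalization-derived equivalence could be sketched in Python with the standard unicodedata module. The function name and the choice of NFC as the default form are my assumptions:

```python
import unicodedata

def equivalent_by_normal_form(a: str, b: str, form: str = "NFC") -> bool:
    """Hypothetical helper: equivalent if both strings share a normal form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

composed = "\u00e9"      # e-acute as a single code point
decomposed = "e\u0301"   # 'e' followed by a combining acute accent

print(composed == decomposed)                            # False
print(equivalent_by_normal_form(composed, decomposed))   # True
```

Note that this gives a yes/no answer only; it cannot express "some tools will treat these the same and others won't".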

For example, you might say that 'eszett' (ß) and 'double s' are partially equivalent -- some downstream tools will treat them the same, others will not. Whether or not you want them to 'match' will depend on how conservative or liberal you need the matching algorithm to be. If you are looking for, say, a node ID, you might use 'exact match' first, and only use 'fuzzy match' in an error recovery situation, so that you only incur the cost of normalization in the cases where there would otherwise be a failure.
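
A minimal sketch of that "exact first, fuzzy only on failure" lookup, again in Python. The names find_node and fuzzy_key are hypothetical, and using NFKC plus case folding as the liberal key is just one possible choice (case folding happens to map ß to "ss"):

```python
import unicodedata
from typing import Dict, Optional

def fuzzy_key(s: str) -> str:
    # Assumed "liberal" key: compatibility normalization plus case folding.
    return unicodedata.normalize("NFKC", s).casefold()

def find_node(node_id: str, nodes: Dict[str, str]) -> Optional[str]:
    # Conservative path: exact match, no normalization cost.
    if node_id in nodes:
        return nodes[node_id]
    # Error recovery: pay for normalization only after the exact match fails.
    wanted = fuzzy_key(node_id)
    candidates = [v for k, v in nodes.items() if fuzzy_key(k) == wanted]
    # Accept the fuzzy match only if it is unambiguous.
    return candidates[0] if len(candidates) == 1 else None

nodes = {"strasse": "node-1"}
print(find_node("strasse", nodes))   # exact match
print(find_node("stra\u00dfe", nodes))  # found only via the fuzzy fallback
print(find_node("weg", nodes))       # no match at all
```

How liberal the fallback is (and whether an ambiguous fuzzy match should be an error) is exactly the kind of policy choice that is hard to state in terms of normalization alone.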

I don't think you can explore those alternatives for solutions if the problem space is couched in terms of normalization rather than comparison and equivalence.

Larry
--
http://larry.masinter.net


-----Original Message-----
From: www-tag-request@w3.org [mailto:www-tag-request@w3.org] On Behalf Of Larry Masinter
Sent: Saturday, August 20, 2011 4:16 PM
To: Phillips, Addison; www-tag@w3.org
Cc: member-i18n-core@w3.org
Subject: RE: Any update on TAG request?

I'm afraid that to analyze the situation I'd want to go back to first principles, in terms of workflows and interoperability problems. It seems you’ve done most of this work, but perhaps not gathered it together in this way.

a. What are the workflows and current deployed components and what do they do?
b. What are the interoperability problems in those workflows?
c. What should content authors and creators of tools that generate content do in order to minimize downstream interoperability difficulties?
d. What interoperability problems or applicability limitations still occur when one (and even if everyone) follows the advice in c?
e. Enumerate possible recommendations we could make for downstream tools, such that (presuming widespread adoption eventually) the interoperability problems of d would be minimized.
f. Get the various providers of downstream tools (like browsers and parsers) to agree to implement one of the choices identified for e. -- by providing clear arguments for how doing so would make the web better.
g. Update advice of c. with information about how to evaluate tool maker progress in implementing recommendations of f.

From this view, we likely need some combination of "how should content authors cope with the existing mess" and "what could tool makers do such that, if all agree, the mess will get better".
I don't see these two approaches as exclusive. (I think this analysis applies not just to charmod but to many other parts of web architecture, so I've tried to write it more generally).

> 1. Do nothing. Do not require normalization by implementations or in specs. Create educational materials to help content authors understand the problem and try to avoid it.

I don't see this as "do nothing" and it seems like a very important thing to provide in any case. So of course we should do this right away. I think writing this advice and being clear about the interoperability problems it avoids is an important first step to improving anything.

> 2. Adopt our proposal for identifier and token/string matching normalization. Revise Charmod-Norm to embody this. Ensure that specs address these requirements in the future.

You have a proposal for e. which you believe will, if widely implemented, improve the interoperability situation -- a preferred alternative which you are trying to get consensus behind in f. But it sounds like there is some cost to implementors, and the cost is such that there is little or no benefit to anyone unless that choice is widely implemented.

The art is then convincing implementors to accept the cost, with only a promise of benefit.  

Can the costs/benefits be allocated? Could you say, for example, that normalization MUST be applied to Vietnamese but SHOULD be applied to other Unicode strings?
Perhaps there are other hybrid alternatives that do not have the same cost/benefit tradeoffs, in that the costs are limited to situations where there are clear benefits over the current situation?

Larry
--
http://larry.masinter.net

Received on Saturday, 20 August 2011 14:43:48 UTC