RE: Matching vs. normalization from Phillips, Addison on 2011-08-20 (www-tag@w3.org from August 2011)

From: Phillips, Addison <addison@lab126.com>
Date: Sat, 20 Aug 2011 10:21:56 -0700
To: "ashok.malhotra@oracle.com" <ashok.malhotra@oracle.com>
CC: Larry Masinter <masinter@adobe.com>, "www-tag@w3.org" <www-tag@w3.org>, "member-i18n-core@w3.org" <member-i18n-core@w3.org>
Message-ID: <131F80DEA635F044946897AFDA9AC3476A954D08BC@EX-SEA31-D.ant.amazon.com>

Hi Ashok,

Of course ordering and sorting are important for organizing information. And normalization has a role to play there. However, excepting binary comparison of strings, most sorting implementations already have to deal with text equivalences. And even badly organized data may still be functional.
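To illustrate the difference between binary comparison and a sort that accounts for text equivalences, here is a minimal Python sketch. The helper name `equivalence_key` and the sample words are assumptions made for illustration only. The key (case folding followed by NFC) is a deliberately crude stand-in and is not the Unicode Collation Algorithm; a real implementation would use a UCA-based collator.

```python
import unicodedata

def equivalence_key(s: str) -> str:
    """Illustrative sort key: fold case, then normalize to NFC.

    This is NOT the Unicode Collation Algorithm, only a sketch of a key
    that treats case variants and canonical equivalents alike.
    """
    return unicodedata.normalize("NFC", s.casefold())

# "e\u0301clair" uses a combining acute accent; "\u00e9clair" is precomposed.
words = ["zebra", "Zoo", "apple", "e\u0301clair", "\u00e9clair"]

# Binary (code point) order: uppercase sorts first, and the two canonically
# equivalent spellings of the same word end up far apart.
binary_order = sorted(words)

# Keyed order: the equivalent spellings get the same key and stay adjacent.
keyed_order = sorted(words, key=equivalence_key)
```

Even with this key, the accented word still sorts after "zebra", which is one reason language-aware collation is a separate topic from normalization.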

Or, to go to Larry's point, just as it makes more sense to talk about "string matching" rather than normalization, it makes sense to talk about "collation" rather than normalization as well. As it happens, the base document of the Character Model has a whole section on collation (3.5) and this is non-controversial: I cannot remember a WG ever having objected to sorting things "correctly".

Identifier matching, however, is both fundamental and (in our view) flawed. 
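As a rough sketch of why the choice of matching rule matters for identifiers, the Python below compares three levels of matching. The function names are hypothetical and are not taken from the proposal or from Charmod; the caseless variant follows the general shape of Unicode's canonical caseless matching (normalize to NFD, case fold, normalize again) and is an illustration rather than a recommendation.

```python
import unicodedata

def identity_match(a: str, b: str) -> bool:
    """Code point for code point comparison, with no equivalences applied."""
    return a == b

def canonical_match(a: str, b: str) -> bool:
    """Treat canonically equivalent sequences as equal by comparing NFC forms."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

def _fold(s: str) -> str:
    # NFD, then case fold, then NFD again so the folded result is normalized.
    return unicodedata.normalize("NFD", unicodedata.normalize("NFD", s).casefold())

def caseless_match(a: str, b: str) -> bool:
    """Canonical equivalence plus case insensitivity."""
    return _fold(a) == _fold(b)

# Precomposed vs. decomposed e-acute: only identity matching tells them apart.
composed, decomposed = "\u00e9", "e\u0301"
print(identity_match(composed, decomposed), canonical_match(composed, decomposed))

# 'abc' vs. 'ABC': a match only under the caseless rule.
print(canonical_match("abc", "ABC"), caseless_match("abc", "ABC"))

# Eszett vs. double-s: normalization alone does not equate them; case folding does.
print(canonical_match("stra\u00dfe", "strasse"), caseless_match("stra\u00dfe", "strasse"))
```

Each rule gives an unambiguous yes or no for a given pair of tokens; what differs is which equivalences a specification decides to fold in.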

Addison


> -----Original Message-----
> From: ashok malhotra [mailto:ashok.malhotra@oracle.com]
> Sent: Saturday, August 20, 2011 9:47 AM
> To: Phillips, Addison
> Cc: Larry Masinter; www-tag@w3.org; member-i18n-core@w3.org
> Subject: Re: Matching vs. normalization
> 
> Addison:
> Matching is clearly important but isn't ordering/sorting also important?
> All the best, Ashok
> 
> On 8/20/2011 9:19 AM, Phillips, Addison wrote:
> > I agree. Which is why the proposal at [1] is couched in terms of string
> > identity matching.
> >
> > Normalization doesn't provide a complete matching solution. Unicode
> > canonical equivalence deals with a specific problem inherent to
> > Unicode, but most programmers already know of other equivalences that
> > are interesting---case insensitivity, for example. A complete matching
> > algorithm will need to deal with all of the equivalences in text.
> > (eszett vs. double-s is an example of this, since the uppercase of the
> > former is two regular letters 's'.)
> >
> > A different example of what you're talking about might be the Unicode
> > Collation Algorithm (UTS#10 [2]), which allows for tailoring and strengths,
> > as well as contextual equivalence.
> >
> > I'm pretty sure I don't agree about having "fuzzy" matching for the cases
> > our WG has in mind for charmod (or for IRI), in which you want an
> > unambiguous comparison of identifiers. When I write an HTML processor or a
> > CSS stylesheet, I need to know that token 'abc' matches (or does not match)
> > token 'ABC'. There is no room for fuzz in these cases. And document authors
> > won't thank us for providing highly promiscuous matching in these cases
> > either.
> >
> > Addison
> >
> > [1] http://www.w3.org/International/wiki/NormalizationProposal
> > [2] http://www.unicode.org/reports/tr10/
> >
> >> -----Original Message-----
> >> From: Larry Masinter [mailto:masinter@adobe.com]
> >> Sent: Saturday, August 20, 2011 7:43 AM
> >> To: Larry Masinter; Phillips, Addison; www-tag@w3.org
> >> Cc: member-i18n-core@w3.org
> >> Subject: Matching vs. normalization
> >>
> >> One thing I have been wanting to do to the IRI comparison document
> >> and that I think might help with charmod is to move away from
> >> "normalization" (or
> >> normalization) in the standards and describe the algorithms in terms
> >> of "matching". I think this is important in many situations where
> >> equivalence of two strings may be ambiguous or not completely
> >> determined or certain. Of course you can derive an equivalence
> >> relationship from a normalization algorithm by saying "two strings
> >> are equivalent if they have the same normal form". However, there are
> >> some situations where you can't quite know whether downstream tools
> >> will treat two strings equivalently, and what you need is a 'partial
> >> match' algorithm.
> >>
> >> For example, you might say that 'eszett' and 'double s' are partially
> >> equivalent -- some downstream tools will treat them the same, others
> >> not. Whether or not you want them to 'match' will depend on how
> >> conservative or liberal you need the matching algorithm to be.  If
> >> you are looking for, say, a node ID, you might use 'exact match'
> >> first, and only use 'fuzzy match' in an error recovery situation...
> >> only invoke the cost of normalization in the cases where there would
> >> otherwise be a failure.
> >>
> >> I don't think you can explore those alternatives for solutions if the
> >> problem space is couched in terms of normalization rather than
> >> comparison and equivalence.
> >>
> >> Larry
> >> --
> >> http://larry.masinter.net
> >>
> >> -----Original Message-----
> >> From: www-tag-request@w3.org [mailto:www-tag-request@w3.org] On
> >> Behalf Of Larry Masinter
> >> Sent: Saturday, August 20, 2011 4:16 PM
> >> To: Phillips, Addison; www-tag@w3.org
> >> Cc: member-i18n-core@w3.org
> >> Subject: RE: Any update on TAG request?
> >>
> >> I'm afraid that to analyze the situation I'd want to go back to first
> >> principles, in terms of workflows and interoperability problems. It
> >> seems you’ve done most of this work, but perhaps not gathered it together
> >> in this way.
> >>
> >> a. What are the workflows and current deployed components and what do
> >> they do?
> >> b. What are the interoperability problems in those workflows?
> >> c.  What should content authors and creators of tools that generate
> >> content do in order to minimize downstream interoperability difficulties?
> >> d. What interoperability problems or applicability limitations still
> >> occur when one (and even if everyone) follows the advice in c?
> >> e.  Enumerate possible recommendations we could make for downstream
> >> tools, such that (presuming widespread adoption  eventually) the
> >> interoperability problems of d would be minimized.
> >> f.  Get consensus among the various providers of downstream tools
> >> (like browsers and parsers) to agree to implement one of the choices
> >> identified for e.
> >> -- by providing clear arguments for how doing so would make the web
> >> better.
> >> g. Update advice of c. with information about how to evaluate tool
> >> maker progress in implementing recommendations of f.
> >>
> >> From this view, we likely need some combination of "how should content
> >> authors cope with the existing mess" and "what could tool makers do
> >> such that, if all agree, the mess will get better".
> >> I don't see these two approaches as exclusive.  (I think this
> >> analysis applies for not just charmod but many other parts of web
> >> architecture, so I've tried to write it more generally).
> >>
> >>> 1. Do nothing. Do not require normalization by implementations or in
> >>> specs. Create educational materials to help content authors understand
> >>> the problem and try to avoid it.
> >>
> >> I don't see this as "do nothing" and it seems like a very important
> >> thing to provide in any case. So of course we should do this right
> >> away. I think writing this advice and being clear about the
> >> interoperability problems being avoided is an important first step to
> >> improving anything.
> >>
> >>> 2. Adopt our proposal for identifier and token/string matching
> >>> normalization. Revise Charmod-Norm to embody this. Ensure that specs
> >>> address these requirements in the future.
> >>
> >> You have a proposal for e. which you believe that, if widely
> >> implemented, will improve the interoperability situation -- a
> >> preferred alternative which you are trying to get consensus behind in
> >> f. But it sounds like there is some cost to implementors, and the
> >> cost is such that there is little or no benefit to anyone unless that
> >> choice is widely implemented.
> >>
> >> The art is then convincing implementors to accept the cost, with only
> >> a promise of benefit.
> >>
> >> Can the costs/benefits be allocated? Could you say, for example, that
> >> normalization MUST be applied to Vietnamese but SHOULD be applied to
> >> other Unicode strings?
> >> Perhaps there are other hybrid alternatives that do not have the same
> >> cost/benefit tradeoffs, in that the costs are limited to situations
> >> where there are clear benefits over the current situation?
> >>
> >> Larry
> >> --
> >> http://larry.masinter.net
Received on Saturday, 20 August 2011 17:22:35 UTC