Re: Unicode Normalization

On Tue, Feb 3, 2009 at 5:04 AM, Robert J Burns <rob@robburns.com> wrote:
> The problem with this is that there would have to be a prior agreement so
> that a Unicode processing application could count on everything received
> already as NFC and that's simply not the case. If a Unicode UA is incapable
> of processing NFD (which also implies it cannot process NFC characters that
> are combining characters) then it would be up to that application to convert
> internally to something it could handle (just what conversion it would do, I
> don't know).

Who's talking about a Unicode UA being unable to process NFD?  The
question on the table seems to be whether UAs should normalize all
input to NFC when they parse it.  This would permit them to process
NFC, NFD, or any other normalized or non-normalized input.  They would
then probably end up sending data such as form submissions in NFC even
if they received the original input in NFD.  If the server prefers to
use NFD internally, it's up to the server to then convert back to NFD
on its end.
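
As a rough illustration of that round trip, here is a minimal Python
sketch.  The function names (normalize_on_parse, server_to_nfd) are
made up for this example and do not correspond to any browser or
server API; unicodedata.normalize is from Python's standard library.

```python
import unicodedata

def normalize_on_parse(text: str) -> str:
    """Hypothetical UA hook: bring whatever was received to NFC."""
    return unicodedata.normalize("NFC", text)

def server_to_nfd(text: str) -> str:
    """Hypothetical server step: convert back to NFD for internal use."""
    return unicodedata.normalize("NFD", text)

nfd_input = "Cafe\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT
nfc_input = "Caf\u00e9"    # precomposed U+00E9

# The UA accepts either form and ends up holding NFC internally.
form_value = normalize_on_parse(nfd_input)
already_nfc = normalize_on_parse(nfc_input)

# A server that prefers NFD converts what it receives on its own end.
stored_value = server_to_nfd(form_value)
```

Under these assumptions the UA never needs to know which form the
server prefers; each side only converts at its own boundary.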

We aren't really talking about transmission formats here, AFAICT, or
at least that wasn't the original question.  The question is whether
it's acceptable for browsers to internally normalize all input somehow
(to NFC, NFD, whatever) as soon as it's received, so that they can
ensure that they make correct comparisons according to the Unicode
standard.  This is relevant to CSS because it seems to be the best way
of ensuring that CSS comparisons aren't normalization-sensitive.
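
To make the comparison issue concrete, here is a small Python sketch of
a class name written in precomposed form in a stylesheet and in
decomposed form in the markup.  The helper nfc_equal and the variable
names are assumptions for illustration only; this is not how any
particular CSS engine matches selectors.

```python
import unicodedata

def nfc_equal(a: str, b: str) -> bool:
    """Compare two strings after bringing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# Same visible text, different code point sequences.
selector_class = "r\u00e9sum\u00e9"      # precomposed, as in the stylesheet
markup_class = "re\u0301sume\u0301"      # decomposed, as in the document

# A code-point-by-code-point comparison treats them as different names.
raw_match = selector_class == markup_class

# Normalizing both sides first makes the comparison insensitive to form.
normalized_match = nfc_equal(selector_class, markup_class)
```

If the UA normalizes everything as it parses, the plain comparison
already sees identical strings and no special comparison function is
needed at match time.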

I'm not clear on what exactly the objections are to that, other than
possibly violating the XML standard (it would be surprising to me if
that did violate XML).  The only practical objection I can see is that
some sites might be broken and not do normalization themselves.  You
could have a scenario like this: a user registers with a name in NFD
(or entirely unnormalized) in a non-normalizing browser -> the site
saves it to its database -> the same user tries to log in later in a
normalizing browser -> login fails because the site thinks the names
are different.  I don't know whether this would be a problem in
practice.
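
The failure mode can be simulated in a few lines of Python.  Everything
here is hypothetical: the dict stands in for the site's database, and
the two "browser" functions only model whether input is normalized
before it is sent.

```python
import unicodedata

accounts = set()  # stands in for the site's user table

def register(name: str) -> None:
    accounts.add(name)          # the site stores the bytes it was given

def login(name: str) -> bool:
    return name in accounts     # exact code point comparison

def non_normalizing_browser(typed: str) -> str:
    return typed                # sends input unchanged

def normalizing_browser(typed: str) -> str:
    return unicodedata.normalize("NFC", typed)

typed_name = "Zoe\u0308"        # 'e' followed by U+0308 COMBINING DIAERESIS

register(non_normalizing_browser(typed_name))
same_browser_login = login(non_normalizing_browser(typed_name))
other_browser_login = login(normalizing_browser(typed_name))

# A site that normalizes on its own side would not be affected.
def login_normalizing_site(name: str) -> bool:
    wanted = unicodedata.normalize("NFC", name)
    return any(unicodedata.normalize("NFC", a) == wanted for a in accounts)

fixed_login = login_normalizing_site(normalizing_browser(typed_name))
```

In this sketch the second login fails only because the site compares
raw strings; a site that normalizes on both store and lookup sees the
same name either way.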

Received on Tuesday, 3 February 2009 14:54:12 UTC