Re: Normalization, was: RE: [Widget URI] Internationalization, widget IRI? from Marcos Caceres on 2009-09-07 (public-webapps@w3.org from July to September 2009)

From: Marcos Caceres <marcosc@opera.com>
Date: Mon, 07 Sep 2009 16:32:43 +0200
To: Marcin Hanclik <Marcin.Hanclik@access-company.com>
CC: Robin Berjon <robin@berjon.com>, public-webapps WG <public-webapps@w3.org>
Message-ID: <4AA5198B.1040205@opera.com>

Marcin Hanclik wrote:
> Hi Marcos,
>
>>> The spec just treats them as opaque strings.
> Yes. This is the reason for my email to I18N.
>
>>> Ok, so what you are saying is, given an XML document's encoding, any URI
>>> should be converted to a default encoding (say, UTF-8)?
>
> This is one of the proposed solutions.
> In the email to I18N I asked/suggested that moving everything to UTF8 could be studied, but I was not sure whether it is ok for the developers who could have non-UTF8 text editors at hand (assuming config.xml is developed in some basic text editor).

That's OK. Best practice for developers is to write XML in UTF-8. If a
developer writes XML in some obscure encoding, it is to their own
detriment. The same would happen on the Web; you can't stop that. And
if a new super format emerges that is better than UTF-8, developers
should be able to use it (and UAs should be able to support it).

> The main motivation for default encoding is to move from octets to characters.

Yep.
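
To make the octets-vs-characters point concrete, here is a minimal
sketch in Python (my choice of language for illustration; nothing in
the spec prescribes it). The same identifier produces different octets
in different encodings, but the same characters once decoded:

```python
# Same characters, different octets: compare after decoding, not before.
text = "mañana"

utf8_octets = text.encode("utf-8")          # b'ma\xc3\xb1ana'
latin1_octets = text.encode("iso-8859-1")   # b'ma\xf1ana'

# Compared as octets, the two serializations differ...
octets_equal = utf8_octets == latin1_octets

# ...but compared as characters, they are the same string.
chars_equal = utf8_octets.decode("utf-8") == latin1_octets.decode("iso-8859-1")
```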

> Opaque strings with pct encoding bring unnecessary encoding that should actually vanish if URI/IRI normalization were mandated.

This is why we treat them as opaque strings.
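
A small Python sketch of what "opaque" means in practice, assuming the
percent-encoded form uses UTF-8 octets (which is what
urllib.parse.quote does by default). The IRI and its pct-encoded URI
form name the same thing, yet compare as different strings:

```python
from urllib.parse import quote, unquote

iri = "foo:mañana"
uri = quote(iri, safe=":")   # percent-encode the UTF-8 octets of non-ASCII chars

# As opaque strings the two forms are simply not equal.
same_as_opaque = (iri == uri)

# Only after undoing the percent-encoding do they match.
same_after_decoding = (unquote(uri) == iri)
```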

>>> I can make this explicit.
> Perfect.
>
>>> <widget id="foo:mañana">  is a valid URI.
> This is a BTW comment: it seems to be an IRI, since "ñ" is non-ASCII.

Ah, crap. I meant "valid IRI" (if I say URI again, just pretend I meant
IRI :)).

>>> Right. That is an implementation detail - my implementation might be
>>> super internally optimized to run UTF-16. But, as you always know what
>>> the bytes are from the XML file, there should be no problem for comparison:
>>>
>>> XML file(utf-8 or ISO-XXXX-Y)<-->  UA (UTF-16)<-->  zip archive(CP437|UTF-8)
> Agreed.
>
> To sum up:
> The whole issue about IRI/URI normalization is about treatment of the IRI-valued attributes as a string of characters and not as a string of octets. Such normalization is currently not in P&C and my understanding is that the normalizations mentioned in RFC3987 must be explicitly mandated in specs using it to make them effective.

Ok, I was not aware that RFC3987 says we have to normalize IRIs to a 
canonical form. Grumble... guess I gotta read that spec again :(

Like I said, the way we designed this was that IRIs were just opaque 
strings. The internal representation of that string is irrelevant, but 
the following metadata is maintained:

   1. the string is treated as an IRI (hence, could be normalized, if
need be). So a = new IRI("someString");

   2. the encoding of the IRI is recorded for transcoding (as needed).

IRIs come in two flavors: encoded and normalized. Mandating one over the 
other to developers is a waste of time. UAs will just have to deal with 
that internally (I guess that's what RFC3987 is for).
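
Roughly, the design could be sketched like this in Python. The IRI
class, its field names and the normalized() method are all hypothetical
(none of them come from P&C or RFC3987), and the normalization is
deliberately naive: it undoes every percent-escape as UTF-8 and applies
Unicode NFC, whereas a real implementation would leave escaped reserved
characters alone.

```python
import unicodedata
from dataclasses import dataclass
from urllib.parse import unquote


@dataclass(frozen=True)
class IRI:
    """Hypothetical wrapper: an opaque string plus the metadata listed above."""
    value: str             # the characters, as written in the document
    source_encoding: str   # encoding of the document it came from

    @classmethod
    def from_octets(cls, octets: bytes, encoding: str) -> "IRI":
        # Transcode once, on the way in; keep the encoding as metadata.
        return cls(octets.decode(encoding), encoding)

    def normalized(self) -> str:
        # Naive sketch: decode percent-escapes (as UTF-8), then apply NFC.
        return unicodedata.normalize("NFC", unquote(self.value))


a = IRI.from_octets("foo:mañana".encode("iso-8859-1"), "iso-8859-1")
b = IRI("foo:ma%C3%B1ana", "utf-8")
c = IRI("foo:man\u0303ana", "utf-8")   # 'n' followed by a combining tilde
```

As opaque values a, b and c are all different; only if a UA chooses to
normalize do they collapse to the same string.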

> Character-set conversion is another issue.
> In [1] I wrote:
> "So by inclusion of [XML], it seems that other encodings than UTF-8 are implicitly mandated, or?
> I am not sure whether this is the understanding in WebApps."

Yes. This is certainly the understanding in Web Apps. A UA can support
whatever encodings it wants. A UA is only required to support UTF-8;
every other encoding is optional (though you would be pretty silly if
you didn't support a few common encodings, but we leave those to the
market).
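
In code terms, a UA's decoding step might look something like the
Python sketch below. The function name decode_config and the
particular set of optional encodings are assumptions made up for
illustration; only the "UTF-8 is required" part reflects what is said
above.

```python
import codecs

REQUIRED = {"utf-8"}
# Hypothetical: whatever extra encodings this particular UA chose to ship.
OPTIONAL = {"iso8859-1", "utf-16"}


def decode_config(octets: bytes, declared: str) -> str:
    """Turn the octets of a config document into characters, or refuse."""
    try:
        # Canonicalize aliases such as "UTF8" or "ISO-8859-1".
        name = codecs.lookup(declared).name
    except LookupError:
        raise ValueError(f"unknown encoding: {declared}") from None
    if name not in REQUIRED | OPTIONAL:
        raise ValueError(f"unsupported encoding: {declared}")
    return octets.decode(name)
```

A UA built this way always accepts UTF-8 and accepts anything else
only if it opted in.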

> And it seems that this is to be pending for discussion in I18N [2].

Ok, now that I'm starting to understand all this a bit better, I might 
be able to contribute to [2]. Thanks again for helping me understand the 
problem.

Kind regards,
Marcos

[2] http://lists.w3.org/Archives/Public/public-i18n-core/2009JulSep/0065.html

Received on Monday, 7 September 2009 14:39:47 UTC