RE: Normalization, was: RE: [Widget URI] Internationalization, widget IRI? from Marcin Hanclik on 2009-09-08 (public-webapps@w3.org from July to September 2009)

From: Marcin Hanclik <Marcin.Hanclik@access-company.com>
Date: Tue, 8 Sep 2009 10:51:36 +0200
To: Marcos Caceres <marcosc@opera.com>
CC: Robin Berjon <robin@berjon.com>, public-webapps WG <public-webapps@w3.org>
Message-ID: <FAA1D89C5BAF1142A74AF116630A9F2C2890C66091@OBEEX01.obe.access-company.com>

Hi Marcos,

Thanks for your comments.
It seems we are aligned.

>> UAs will just have to deal with that internally
I assume there could be an easy solution to the normalization:
The normalization / mandating some equivalence-determining algorithm (from RFC3987) could go into P&C.
Then maybe I18N would not have to be bothered further.
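
To make the idea concrete, here is a minimal sketch of what such an equivalence check could look like. It covers only two of the syntax-based steps from RFC 3987 section 5.3.2 (character normalization to NFC and percent-encoding normalization) and leaves out case normalization of the scheme and the other rungs of the comparison ladder. The names normalize_iri and iri_equivalent are hypothetical; nothing like them is defined in P&C.

```python
import re
import unicodedata

# Unreserved characters per RFC 3986, section 2.3.
_UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)
_PCT = re.compile(r"%([0-9A-Fa-f]{2})")


def _normalize_pct(match):
    """Decode a percent-encoded unreserved character, else uppercase the hex."""
    ch = chr(int(match.group(1), 16))
    if ch in _UNRESERVED:
        return ch
    return "%" + match.group(1).upper()


def normalize_iri(iri):
    """Hypothetical helper: NFC plus percent-encoding normalization only."""
    iri = unicodedata.normalize("NFC", iri)
    return _PCT.sub(_normalize_pct, iri)


def iri_equivalent(a, b):
    """Compare two IRIs as characters after the partial normalization above."""
    return normalize_iri(a) == normalize_iri(b)
```

With this, "foo:man\u0303ana" (n followed by a combining tilde) and "foo:ma\u00f1ana" compare as equivalent, while a plain string comparison would treat them as different.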

Thanks,
Marcin


-----Original Message-----
From: Marcos Caceres [mailto:marcosc@opera.com]
Sent: Monday, September 07, 2009 4:33 PM
To: Marcin Hanclik
Cc: Robin Berjon; public-webapps WG
Subject: Re: Normalization, was: RE: [Widget URI] Internationalization, widget IRI?



Marcin Hanclik wrote:
> Hi Marcos,
>
>>> The spec just treats them as opaque strings.
> Yes. This is the reason for my email to I18N.
>
>>> Ok, so what you are saying is, given an XML document's encoding, any URI
>>> should be converted to a default encoding (say, UTF-8)?
>
> This is one of the proposed solutions.
> In the email to I18N I asked/suggested that moving everything to UTF-8 could be studied, but I was not sure whether it is ok for the developers who could have non-UTF-8 text editors at hand (assuming config.xml is developed in some basic text editor).

That's ok. Best practice for developers is to write XML in UTF-8. If a
developer writes XML in some obscure encoding, then it is to their
detriment. The same would happen on the Web; you can't stop that. And
if a new super format emerges that is better than UTF-8, developers
should be able to use it (and UAs should support it).

> The main motivation for default encoding is to move from octets to characters.

Yep.

> Opaque strings with pct encoding bring unnecessary encoding that should actually vanish if URI/IRI normalization were mandated.

This is why we treat them as opaque strings.

>>> I can make this explicit.
> Perfect.
>
>>> <widget id="foo:mañana">  is a valid URI.
> This is a BTW comment: it seems to be an IRI, since "ñ" is non-ASCII.

Ah, crap. I meant "valid IRI" (if I say URI again, just pretend I meant
IRI :)).

>>> Right. That is an implementation detail - my implementation might be
>>> super internally optimized to run UTF-16. But, as you always know what
>>> the bytes are from the XML file, there should be no problem for comparison:
>>>
>>> XML file(utf-8 or ISO-XXXX-Y)<-->  UA (UTF-16)<-->  zip archive(CP437|UTF-8)
> Agreed.
>
> To sum up:
> The whole issue about IRI/URI normalization is about treatment of the IRI-valued attributes as a string of characters and not as a string of octets. Such normalization is currently not in P&C and my understanding is that the normalizations mentioned in RFC3987 must be explicitly mandated in specs using it to make them effective.

Ok, I was not aware that RFC3987 says we have to normalize IRIs to a
canonical form. Grumble... guess I gotta read that spec again :(

Like I said, the way we designed this was that IRIs were just opaque
strings. The internal representation of that string is irrelevant, but
the following metadata is maintained:

   1. The string is treated as an IRI (hence, it could be normalized if
need be). So a = new IRI("someString");

   2. The encoding of the IRI is recorded for transcoding (as needed).

IRIs come in two flavors: encoded and normalized. Mandating one over the
other to developers is a waste of time. UAs will just have to deal with
that internally (I guess that's what RFC3987 is for).
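
As a rough illustration of the two pieces of metadata listed above, a UA could keep something like the following. This is only a sketch under assumptions: the IRI class, its from_bytes constructor and its to_uri method are hypothetical names, and the mapping to a URI is a simplification of RFC 3987 section 3.1 in that it percent-encodes every non-ASCII character rather than checking the ucschar and iprivate ranges.

```python
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class IRI:
    """Hypothetical holder for an IRI-valued attribute from config.xml."""

    text: str             # the IRI as characters, independent of storage
    source_encoding: str  # encoding of the XML document it came from

    @classmethod
    def from_bytes(cls, raw, encoding):
        # Move from octets to characters using the declared XML encoding.
        return cls(raw.decode(encoding), encoding)

    def to_uri(self):
        # Simplified IRI-to-URI mapping: NFC, then UTF-8 percent-encoding
        # of every non-ASCII character.
        nfc = unicodedata.normalize("NFC", self.text)
        return "".join(
            ch if ord(ch) < 128
            else "".join(f"%{b:02X}" for b in ch.encode("utf-8"))
            for ch in nfc
        )
```

The point of the sketch is that the same id written in a Latin-1 document and in a UTF-8 document ends up as the same characters, and so maps to the same URI, even though the octets in the two files differ.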

> Character-set conversion is another issue.
> In [1] I wrote:
> "So by inclusion of [XML], it seems that other encodings than UTF-8 are implicitly mandated, or?
> I am not sure whether this is the understanding in WebApps."

Yes. This is certainly the understanding in Web Apps. A UA can support
whatever encodings it wants. A UA is only required to support UTF-8;
every other encoding is optional (though you would be pretty silly if you
didn't support a few common encoding formats, but we leave those to the
market).

> And it seems that this is to be pending for discussion in I18N [2].

Ok, now that I'm starting to understand all this a bit better, I might
be able to contribute to [2]. Thanks again for helping me understand the
problem.

Kind regards,
Marcos

[2]
http://lists.w3.org/Archives/Public/public-i18n-core/2009JulSep/0065.html


Received on Tuesday, 8 September 2009 08:52:51 UTC