- From: Marcos Caceres <marcosc@opera.com>
- Date: Mon, 07 Sep 2009 16:32:43 +0200
- To: Marcin Hanclik <Marcin.Hanclik@access-company.com>
- CC: Robin Berjon <robin@berjon.com>, public-webapps WG <public-webapps@w3.org>
Marcin Hanclik wrote:
> Hi Marcos,
>
>>> The spec just treats them as opaque strings.
>
> Yes. This is the reason for my email to I18N.
>
>>> Ok, so what you are saying is, given an XML document's encoding, any URI
>>> should be converted to a default encoding (say, UTF-8)?
>
> This is one of the proposed solutions.
> In the email to I18N I asked/suggested that moving everything to UTF-8 could
> be studied, but I was not sure whether it is OK for developers who may only
> have non-UTF-8 text editors at hand (assuming config.xml is written in some
> basic text editor).

That's OK. Best practice for developers is to write XML in UTF-8. If a developer writes XML in some obscure encoding, then it is to their detriment. The same would happen on the Web; you can't stop that. And if a new super format emerges that is better than UTF-8, developers should be able to use it (and UAs to support it).

> The main motivation for default encoding is to move from octets to characters.

Yep.

> Opaque strings with pct-encoding carry unnecessary encoding that would
> actually vanish if URI/IRI normalization were mandated.

This is why we treat them as opaque strings.

>>> I can make this explicit.
>
> Perfect.
>
>>> <widget id="foo:mañana"> is a valid URI.
>
> This is a BTW comment: it seems to be an IRI, since "ñ" is non-ASCII.

Ah crap. I meant "valid IRI" (if I say URI again, just pretend I meant IRI :)).

>>> Right. That is an implementation detail - my implementation might be
>>> super internally optimized to run UTF-16. But, as you always know what
>>> the bytes are from the XML file, there should be no problem for comparison:
>>>
>>> XML file (UTF-8 or ISO-XXXX-Y) <--> UA (UTF-16) <--> zip archive (CP437|UTF-8)
>
> Agreed.
>
> To sum up:
> The whole issue about IRI/URI normalization is about treating the IRI-valued
> attributes as a string of characters and not as a string of octets. Such
> normalization is currently not in P&C, and my understanding is that the
> normalizations mentioned in RFC 3987 must be explicitly mandated in the specs
> that use it for them to be effective.

Ok, I was not aware that RFC 3987 says we have to normalize IRIs to a canonical form. Grumble... guess I gotta read that spec again :(

Like I said, the way we designed this was that IRIs were just opaque strings. The internal representation of that string is irrelevant, but the following metadata is maintained:

1. The string is treated as an IRI (hence, it could be normalized, if need be). So a = new IRI("someString");
2. The encoding of the IRI is recorded for transcoding (as needed).

IRIs come in two flavors: encoded and normalized. Mandating one over the other to developers is a waste of time. UAs will just have to deal with that internally (I guess that's what RFC 3987 is for).

> Character-set conversion is another issue.
> In [1] I wrote:
> "So by inclusion of [XML], it seems that other encodings than UTF-8 are
> implicitly mandated, or?
> I am not sure whether this is the understanding in WebApps."

Yes. This is certainly the understanding in Web Apps. A UA can support whatever encodings it wants. A UA is only required to support UTF-8; every other encoding is optional (though you would be pretty silly if you didn't support a few common encodings, but we leave those to the market).

> And it seems that this is pending discussion in I18N [2].

Ok, now that I'm starting to understand all this a bit better, I might be able to contribute to [2]. Thanks again for helping me understand the problem.
Kind regards,
Marcos

[2] http://lists.w3.org/Archives/Public/public-i18n-core/2009JulSep/0065.html
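As a rough illustration of the "octets to characters" point discussed above (a minimal sketch, not the normative P&C processing model): the config.xml bytes are decoded into characters using the document's declared encoding, and an IRI-valued attribute such as id="foo:mañana" is then handled as a character string that can be NFC-normalized and, if an octet-level comparison is ever needed, percent-encoded into its URI form. The class and method names are illustrative assumptions; java.net.URI only approximates full RFC 3987 handling.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

public class WidgetIdSketch {

    // Decode the raw config.xml octets into characters using the encoding
    // named in the XML declaration; UTF-8 is the sensible default.
    static String decodeConfig(byte[] configXmlBytes, String declaredEncoding) {
        Charset cs = (declaredEncoding != null)
                ? Charset.forName(declaredEncoding)
                : StandardCharsets.UTF_8;
        return new String(configXmlBytes, cs);
    }

    // Treat the IRI-valued attribute as a string of characters:
    // NFC is one of the normalizations RFC 3987 discusses, and
    // toASCIIString() yields the UTF-8-based, percent-encoded URI form
    // should an octet-level comparison ever be required.
    static String canonicalIri(String attributeValue) throws URISyntaxException {
        String nfc = Normalizer.normalize(attributeValue, Normalizer.Form.NFC);
        return new URI(nfc).toASCIIString();
    }
}
```

With this, the precomposed and decomposed spellings of "foo:mañana" both come out as "foo:ma%C3%B1ana", i.e. the comparison happens on characters rather than on whatever bytes the author's text editor produced.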
Received on Monday, 7 September 2009 14:39:47 UTC
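On the zip end of the comparison chain quoted above (XML file <--> UA <--> zip archive), entry names are historically CP437 octets unless the entry's language-encoding flag marks them as UTF-8. A small sketch, assuming Java 7's ZipFile and that the IBM437 charset is available in the JRE; "widget.wgt" is a placeholder file name:

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class ZipNameSketch {
    public static void main(String[] args) throws IOException {
        // Fallback charset for entry names whose UTF-8 flag is not set.
        // CP437 is the historical default; "IBM437" is its canonical NIO name.
        Charset fallback = Charset.isSupported("IBM437")
                ? Charset.forName("IBM437")
                : StandardCharsets.US_ASCII;

        try (ZipFile zip = new ZipFile(new File("widget.wgt"), fallback)) {
            for (Enumeration<? extends ZipEntry> e = zip.entries(); e.hasMoreElements();) {
                // getName() already returns characters: entries flagged as UTF-8
                // are decoded as UTF-8, the rest with the fallback charset.
                System.out.println(e.nextElement().getName());
            }
        }
    }
}
```

Decoding at the zip boundary means the rest of the UA compares file references and entry names as characters, which matches the point above that the UA's internal representation (UTF-16 or otherwise) is irrelevant.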