Re: [Widget URI] Internationalization, widget IRI? from Marcos Caceres on 2009-09-07 (public-webapps@w3.org from July to September 2009)

From: Marcos Caceres <marcosc@opera.com>
Date: Mon, 07 Sep 2009 15:00:32 +0200
To: Marcin Hanclik <Marcin.Hanclik@access-company.com>
CC: Robin Berjon <robin@berjon.com>, public-webapps WG <public-webapps@w3.org>
Message-ID: <4AA503F0.4030309@opera.com>
Marcin Hanclik wrote:
> Hi Marcos,
>
> As a summary of the URI/IRI-related issues, we have currently the following as far as I can tell:
> 1. URI/IRI normalization in P&C [1], it is currently at I18N [2]
> 2. Widget URI issues related to internationalization [3]
>
> The URI/IRI normalization in P&C is mainly for attribute values that are to be IRIs. At present these are:
> a) @id in <widget>
> b) @href in <author>
> c) @href in <license>
> d) @name in <feature>

No normalization is done on any of those values (by design, and 
modeled explicitly after the behavior of XML namespaces, which are also 
not normalized). The spec just treats them as opaque strings.

Remember that the P&C UA does not do anything meaningful with any of the 
metadata it collects (it leaves that to other UAs). It merely validates 
that data: the UA just checks whether the value of the attribute is a 
"valid IRI", and that's it. It certainly does not need to do any 
normalization to check for validity.

For example, from the spec:

widget@id:
"If the id attribute is used, then let id be the result of applying the 
rule for getting a single attribute value to the id attribute. If id is 
a valid IRI, then let widget id be the value of the id..."

Where "A valid IRI is one that matches the IRI token of the [RFC3987] 
specification".

(i.e., read the @id value, if it is a valid IRI, save it.)

So:

<widget id="foo:bar 123"> is not valid.

<widget id="fooo:bar%20123"> is valid, and the value is "fooo:bar%20123".

<widget id="foo:mañana"> is a valid IRI (the non-ASCII "ñ" is allowed in 
an IRI, though not unescaped in a plain URI).
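
To make the "read it, check it, save it unchanged" behavior concrete, 
here is a minimal Python sketch. Note the assumptions: the regular 
expression is a deliberately simplified approximation of the RFC 3987 IRI 
token (scheme, then a loose set of allowed characters or %XX escapes), 
not the full grammar, and the function name widget_id is made up for 
illustration. It also does not model the "rule for getting a single 
attribute value".

```python
import re

# Simplified stand-in for the RFC 3987 "IRI" token. This is an assumption
# for illustration only: it checks a scheme, then a permissive character
# set, and it does not validate authority, path, or query structure.
_SCHEME = r"[A-Za-z][A-Za-z0-9+.\-]*"
_ASCII_OK = r"A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;="
_UCS_OK = "\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF"  # part of ucschar
_IRI_RE = re.compile(
    rf"{_SCHEME}:(?:[{_ASCII_OK}{_UCS_OK}]|%[0-9A-Fa-f]{{2}})*"
)


def widget_id(raw):
    """Return raw unchanged if it looks like an IRI, else None.

    No normalization is applied: the value is an opaque string.
    """
    return raw if _IRI_RE.fullmatch(raw) else None


for value in ("foo:bar 123", "fooo:bar%20123", "foo:mañana"):
    print(repr(value), "->", repr(widget_id(value)))
```

The point is that the accepted values come back byte-for-byte as they 
were written; "%20" is not decoded and "ñ" is not escaped.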

Also, license@href:
"If the href attribute is used, then let potential license href be the 
result of applying the rule for getting a single attribute value to the 
href attribute.
If potential license href is not a valid IRI or a valid path, then the 
href attribute is in error and the user agent must ignore the attribute.
If potential license href is a valid IRI, then let widget license href 
be the value of potential license href."

(i.e., read the @href value, if it is a valid IRI, save it.)

And so on... it's the same for the other attributes. We don't normalize 
any of these values, nor should they be normalized. They are just treated 
as opaque strings.

The only place where it changes is if license href is a "valid path", in 
which case a UA checks for the file internally in the package. I think 
this is where you see issues arising...

> Your use cases seem to be related to the above, since you quote non-ASCII character in the @src of <content>.

Yes, zip relative paths, which are either CP437 or UTF-8 internally in 
the widget package.

What is assumed is a layer of abstraction between the package and the 
config document:

  content@src (xml, any *supported* encoding)
       <-->  UTF-8 (mapping) <--> CP437 (zip archive)

> They are exactly the same with regard to the above issues 1. and 2.
> They differ on the CP437/UTF8 level.

Yes, but you know your input (the encoding of the XML document). If you 
know your input, then you can easily convert it to whatever you want 
(CP437, UTF-8, or whatever) so long as you support it.

The spec says:
"If doc is encoded in a format that is unsupported by the user agent, 
then the user agent must terminate this algorithm and treat this widget 
package as an invalid Zip archive."

Supported in this context means that the UA can convert from one 
encoding to another (e.g., BIG-5 to UTF-8).
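
As a rough illustration of what "supported" could mean in practice, here 
is a Python sketch that converts the config document to UTF-8 and treats 
an unknown encoding as a fatal error. The names InvalidZipArchive and 
config_to_utf8 are hypothetical, and using Python's codec registry as the 
list of "supported" encodings is an assumption, not something the spec 
says.

```python
import codecs


class InvalidZipArchive(Exception):
    """Hypothetical error: the package is treated as an invalid Zip archive."""


def config_to_utf8(doc: bytes, declared_encoding: str) -> bytes:
    """Transcode the config document bytes to UTF-8.

    "Supported" here simply means a codec for declared_encoding exists.
    """
    try:
        codec = codecs.lookup(declared_encoding)
    except LookupError:
        # Unsupported encoding: stop processing the package.
        raise InvalidZipArchive(declared_encoding) from None
    text, _consumed = codec.decode(doc)
    return text.encode("utf-8")


big5_bytes = "中文".encode("big5")
print(config_to_utf8(big5_bytes, "big5") == "中文".encode("utf-8"))
```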

> The widgets URI is on the character level and my point was about naming it URI (octet-level, whereas IRI operates clearly on character level).

Ok, so what you are saying is, given an XML document's encoding, any URI 
should be converted to a default encoding (say, UTF-8)?

> My comments to the details:
>
> P&C addresses the transcoding from CP437 to UTF8 [4] (however, only as SHOULD, so maybe it should also be SHALL? This was not raised yet and it is probably late now):
> "For the sake of comparison and matching, it is recommended that a user agent treat all Zip-relative paths as [UTF-8]."

Right. That is an implementation detail: my implementation might be 
heavily optimized to use UTF-16 internally. But, as you always know what 
the bytes are from the XML file, there should be no problem for comparison:

XML file (UTF-8 or ISO-XXXX-Y) <--> UA (UTF-16) <--> Zip archive (CP437|UTF-8)
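
A small Python sketch of that comparison, assuming the UA decodes both 
sides into its own internal string type before matching. Python's str 
stands in for the UA's internal representation here, and the function 
name same_path and its parameters are made up for illustration. No 
Unicode normalization is applied.

```python
def same_path(src_bytes: bytes, xml_encoding: str,
              zip_name: bytes, zip_is_utf8: bool) -> bool:
    """Compare a content@src value with a Zip entry name.

    Both byte sequences are decoded with their known encodings, then
    compared as plain strings.
    """
    src = src_bytes.decode(xml_encoding)
    name = zip_name.decode("utf-8" if zip_is_utf8 else "cp437")
    return src == name


# The same file name, as bytes from an ISO-8859-1 config document and
# from a CP437 Zip entry: the raw bytes differ, the characters do not.
from_xml = b"ma\xf1ana.html"
from_zip = b"ma\xa4ana.html"
print(from_xml == from_zip)
print(same_path(from_xml, "iso-8859-1", from_zip, zip_is_utf8=False))
```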

Also, it's not too late to change this if it is causing real issues in 
implementations (that is the point of CR: to gain implementation 
experience and fix bugs before proceeding to PR).

> The "problematic" character in your case is 'ñ', U+00F1.
> In CP437 it has the value 0xA4, in ISO-8859-1 it is 0xF1.

Yes.

> In UTF8 this character is encoded as the sequence of the following octets: 0xC3 0xB1.

Yes.
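
For anyone who wants to check those octets, a quick Python snippet using 
the standard codecs (the variable names are just for illustration):

```python
ch = "\u00f1"  # the character 'ñ'

as_cp437 = ch.encode("cp437")
as_latin1 = ch.encode("iso-8859-1")
as_utf8 = ch.encode("utf-8")

print(as_cp437.hex(), as_latin1.hex(), as_utf8.hex())  # prints: a4 f1 c3b1
```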

> The assumption of P&C seems to be that everything gets converted to UTF8.
> The only issue is that this is an assumption.

I can make this explicit.

In the "rule for finding a file within a widget package", step 2:

2.  If path is not encoded in UTF-8, convert path to UTF-8.
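
A minimal Python sketch of what that step could look like, assuming the 
UA decides between CP437 and UTF-8 from bit 11 (0x800) of the Zip entry's 
general purpose bit flag. The function name path_to_utf8 is hypothetical, 
and relying on that flag alone is an assumption of this sketch rather 
than something the rule spells out.

```python
UTF8_FLAG = 0x800  # bit 11 of the Zip general purpose bit flag


def path_to_utf8(raw: bytes, flag_bits: int) -> bytes:
    """Return the Zip entry name as UTF-8 bytes.

    If the entry is already flagged as UTF-8 it is returned as-is;
    otherwise it is assumed to be CP437 and transcoded.
    """
    if flag_bits & UTF8_FLAG:
        return raw
    return raw.decode("cp437").encode("utf-8")


print(path_to_utf8(b"ma\xa4ana.html", 0))
print(path_to_utf8(b"ma\xc3\xb1ana.html", UTF8_FLAG))
```

Both calls yield the same UTF-8 bytes, so later comparison against a path 
from the config document can be done on one encoding only.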

> My case of IRI and your cases with file name are similar with regard to this assumption.

Right.

> Specifically in case of IRI we have the issue of pct encoding, in your cases we have "just" character-set transcoding.
>
> I hope it is clearer now.

Yeah, I'm kinda getting it now.

Kind regards,
Marcos
Received on Monday, 7 September 2009 13:01:28 UTC