Normalization, was: RE: [Widget URI] Internationalization, widget IRI? from Marcin Hanclik on 2009-09-07 (public-webapps@w3.org from July to September 2009)

From: Marcin Hanclik <Marcin.Hanclik@access-company.com>
Date: Mon, 7 Sep 2009 15:58:31 +0200
To: Marcos Caceres <marcosc@opera.com>
CC: Robin Berjon <robin@berjon.com>, public-webapps WG <public-webapps@w3.org>
Message-ID: <FAA1D89C5BAF1142A74AF116630A9F2C2890C66055@OBEEX01.obe.access-company.com>
Hi Marcos,

>>The spec just treats them as opaque strings.
Yes. This is the reason for my email to I18N.

>>Ok, so what you are saying is, given an XML document's encoding, any URI
>>should be converted to a default encoding (say, UTF-8)?

This is one of the proposed solutions.
In the email to I18N I asked/suggested that moving everything to UTF-8 could be studied, but I was not sure whether that is OK for developers who may only have non-UTF-8 text editors at hand (assuming config.xml is written in some basic text editor).
The main motivation for default encoding is to move from octets to characters.
Opaque strings with percent-encoding carry an unnecessary layer of encoding that would vanish if URI/IRI normalization were mandated.
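
To illustrate the difference between comparing octets and comparing characters, here is a minimal Python sketch. It assumes that "normalization" means decoding percent-encoded UTF-8 sequences back to characters where that is safe, roughly in the spirit of RFC 3987. The function name normalize_iri is hypothetical, and the sketch ignores the characters that RFC 3987 says must stay percent-encoded (for example bidi controls), so it is not a conforming implementation.

```python
import re

# ASCII characters that never need percent-encoding (RFC 3986 "unreserved").
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)
# One or more consecutive %HH triplets.
PCT_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def _decode_run(match):
    run = match.group(0)
    raw = bytes.fromhex(run.replace("%", ""))
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: keep the octets, only normalize the hex case.
        return run.upper()
    out = []
    for ch in text:
        if ch in UNRESERVED or ord(ch) > 0x7F:
            out.append(ch)  # safe to show as a character
        else:
            # Reserved or otherwise special ASCII stays percent-encoded.
            out.append("".join("%%%02X" % b for b in ch.encode("utf-8")))
    return "".join(out)


def normalize_iri(value):
    """Hypothetical helper: simplified percent-encoding normalization."""
    return PCT_RUN.sub(_decode_run, value)


# As opaque strings these two identifiers differ ...
assert "foo:ma%C3%B1ana" != "foo:mañana"
# ... but after normalization they compare equal.
assert normalize_iri("foo:ma%C3%B1ana") == normalize_iri("foo:mañana")
```

With opaque-string comparison the two spellings name different widgets; with a mandated normalization step they would name the same one.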

>>I can make this explicit.
Perfect.

>><widget id="foo:mañana"> is a valid URI.
By the way, this seems to be an IRI rather than a URI, since "ñ" is non-ASCII.

>>Right. That is an implementation detail - my implementation might be
>>super internally optimized to run UTF-16. But, as you always know what
>>the bytes are from the XML file, there should be no problem for comparison:
>>
>>XML file(utf-8 or ISO-XXXX-Y) <--> UA (UTF-16) <--> zip archive(CP437|UTF-8)
Agreed.

To sum up:
The whole issue of IRI/URI normalization is about treating the IRI-valued attributes as strings of characters and not as strings of octets. Such normalization is currently not in P&C, and my understanding is that the normalizations mentioned in RFC3987 must be explicitly mandated by the specs using it in order to take effect.
Character-set conversion is another issue.
In [1] I wrote:
"So by inclusion of [XML], it seems that other encodings than UTF-8 are implicitly mandated, or?
I am not sure whether this is the understanding in WebApps."
And it seems that this is pending discussion in I18N [2].

Thanks,
Marcin

[1] http://lists.w3.org/Archives/Public/public-i18n-core/2009JulSep/0042.html

[2] http://lists.w3.org/Archives/Public/public-i18n-core/2009JulSep/0065.html


Marcin Hanclik
ACCESS Systems Germany GmbH
Tel: +49-208-8290-6452  |  Fax: +49-208-8290-6465
Mobile: +49-163-8290-646
E-Mail: marcin.hanclik@access-company.com

-----Original Message-----
From: Marcos Caceres [mailto:marcosc@opera.com]
Sent: Monday, September 07, 2009 3:01 PM
To: Marcin Hanclik
Cc: Robin Berjon; public-webapps WG
Subject: Re: [Widget URI] Internationalization, widget IRI?



Marcin Hanclik wrote:
> Hi Marcos,
>
> As a summary of the URI/IRI-related issues, we have currently the following as far as I can tell:
> 1. URI/IRI normalization in P&C [1], it is currently at I18N [2]
> 2. Widget URI issues related to internationalization [3]
>
> The URI/IRI normalization in P&C is mainly for attribute values that are to be IRIs. At present these are:
> a) @id in <widget>
> b) @href in <author>
> c) @href in <license>
> d) @name in <feature>

There is no normalization done on any of those values (by design, and
modeled explicitly after the behavior of XML namespaces, which are also
not normalized). The spec just treats them as opaque strings.

Remember that the P&C UA does not do anything meaningful with any of the
metadata it collects (leaves that to other UAs). It merely validates
that data (The UA just checks if the value of the attribute is a "valid
IRI", that's it! And it certainly does not need to do any normalization
to check for validity).

For example, from the spec:

widget@id:
"If the id attribute is used, then let id be the result of applying the
rule for getting a single attribute value to the id attribute. If id is
a valid IRI, then let widget id be the value of the id..."

Where "A valid IRI is one that matches the IRI token of the [RFC3987]
specification".

(i.e., read the @id value, if it is a valid IRI, save it.)

So:

<widget id="foo:bar 123"> is not valid.

<widget id="fooo:bar%20123"> is valid, and the value is "fooo:bar%20123".

<widget id="foo:mañana"> is a valid URI.
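
The three examples above can be reproduced with a rough Python sketch. Note that this is only an approximation written for illustration: it checks for a scheme, rejects a few characters that the RFC 3987 grammar never allows unescaped, and requires every "%" to be followed by two hex digits. It does not implement the full IRI token of RFC 3987, and the name looks_like_valid_iri is hypothetical.

```python
import re

# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ), followed by ":"
SCHEME = re.compile(r"[A-Za-z][A-Za-z0-9+.\-]*:")
# Printable ASCII characters that may not appear unescaped.
FORBIDDEN = set(' <>"{}|\\^`')
HEX_PAIR = re.compile(r"[0-9A-Fa-f]{2}")


def looks_like_valid_iri(value):
    """Approximate validity check; NOT the full RFC 3987 grammar."""
    match = SCHEME.match(value)
    if not match:
        return False
    rest = value[match.end():]
    i = 0
    while i < len(rest):
        ch = rest[i]
        if ch == "%":
            # A percent sign must start a %HH triplet.
            if not HEX_PAIR.fullmatch(rest[i + 1:i + 3]):
                return False
            i += 3
            continue
        # Reject spaces, control characters and the forbidden set.
        if ch in FORBIDDEN or ord(ch) < 0x21 or ord(ch) == 0x7F:
            return False
        i += 1
    return True


print(looks_like_valid_iri("foo:bar 123"))     # False
print(looks_like_valid_iri("fooo:bar%20123"))  # True
print(looks_like_valid_iri("foo:mañana"))      # True
```

The value is stored exactly as written when the check passes; nothing is decoded or rewritten.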

Also, license@href:
"If the href attribute is used, then let potential license href be the
result of applying the rule for getting a single attribute value to the
href attribute.
If potential license href is not a valid IRI or a valid path, then the
href attribute is in error and the user agent must ignore the attribute.
If potential license href is a valid IRI, then let widget license href
be the value of potential license href."

(i.e., read the @href value, if it is a valid IRI, save it.)

And so on... it's the same for the other attributes. We don't normalize
anything nor should they be normalized. They are just treated as opaque
strings.

The only place where it changes is if license href is a "valid path", in
which case a UA checks for the file internally in the package. I think
this is where you see issues arising...

> Your use cases seem to be related to the above, since you quote a non-ASCII character in the @src of <content>.

Yes, zip relative paths, which are either CP437 or UTF-8 internally in
the widget package.
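
As an illustration of how a reader can tell the two cases apart, the Zip format marks UTF-8 entry names with bit 11 (0x800) of the general purpose bit flag; names without that bit are conventionally CP437. The Python sketch below inspects that flag with the standard zipfile module. The helper names and the demo archive are made up for this example and are not part of P&C.

```python
import io
import zipfile

UTF8_FLAG = 0x800  # general purpose bit 11: entry name is UTF-8


def entry_name_encodings(zip_bytes):
    """Map each entry name to the encoding its flag bits imply."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return {
            info.filename: "utf-8" if info.flag_bits & UTF8_FLAG else "cp437"
            for info in archive.infolist()
        }


def build_demo_archive():
    """Build a tiny in-memory archive (hypothetical widget contents)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("index.html", "")
        archive.writestr("mañana.html", "")
    return buffer.getvalue()


for name, encoding in entry_name_encodings(build_demo_archive()).items():
    print(name, encoding)
```

Python's zipfile sets the UTF-8 flag itself when a name is not pure ASCII, which is why the second entry in the demo is reported as UTF-8.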

What is assumed is a layer of abstraction between the package and the
config document:

  content@src (xml, any *supported* encoding)
       <-->  UTF-8 (mapping) <--> CP437 (zip archive)

> They are exactly the same with regard to the above issues 1. and 2.
> They differ on the CP437/UTF8 level.

Yes, but you know your input (the encoding of the XML document). If you
know your input, then you can easily convert it to whatever you want
(CP437, UTF-8, or whatever) so long as you support it.

The spec says:
"If doc is encoded in a format that is unsupported by the user agent,
then the user agent must terminate this algorithm and treat this widget
package as an invalid Zip archive."

Supported in this context means that the UA can convert from one
encoding to another (e.g., BIG-5 to UTF-8).
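
A minimal Python sketch of that reading of "supported", assuming the declared encoding of the config document has already been determined by the XML layer. The exception class and function name are hypothetical; the sketch only shows the "can I convert this?" decision, not the rest of the processing algorithm.

```python
import codecs


class InvalidWidgetPackage(Exception):
    """Hypothetical error: treat the package as an invalid Zip archive."""


def config_to_utf8(raw, declared_encoding):
    """Convert the config document bytes to UTF-8, or give up."""
    try:
        codecs.lookup(declared_encoding)
    except LookupError:
        # Unsupported here means: this UA has no converter for it.
        raise InvalidWidgetPackage(
            "unsupported encoding: %s" % declared_encoding
        )
    # Decode to characters first, then re-encode as UTF-8.
    return raw.decode(declared_encoding).encode("utf-8")


big5_bytes = "中文".encode("big5")
assert config_to_utf8(big5_bytes, "big5") == "中文".encode("utf-8")
```

Malformed input in a supported encoding would raise UnicodeDecodeError in this sketch; how a UA should react to that is a separate question.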

> The widgets URI is on the character level and my point was about naming it URI (octet-level, whereas IRI operates clearly on character level).

Ok, so what you are saying is, given an XML document's encoding, any URI
should be converted to a default encoding (say, UTF-8)?

> My comments to the details:
>
> P&C addresses the transcoding from CP437 to UTF8 [4] (however, only as a SHOULD, so maybe it should also be a SHALL? This was not raised yet and it is probably late now):
> "For the sake of comparison and matching, it is recommended that a user agent treat all Zip-relative paths as [UTF-8]."

Right. That is an implementation detail - my implementation might be
super internally optimized to run UTF-16. But, as you always know what
the bytes are from the XML file, there should be no problem for comparison:

XML file(utf-8 or ISO-XXXX-Y) <--> UA (UTF-16) <--> zip archive(CP437|UTF-8)

Also, it's not too late to change this if it is causing real issues in
implementations (that is the point of CR: to gain implementation
experience and fix bugs before proceeding to PR).

> The "problematic" character in your case is 'ñ', U+00F1.
> In CP437 it has the value 0xA4, in ISO-8859-1 it is 0xF1.

Yes.

> In UTF8 this character is encoded as the sequence of the following octets: 0xC3 0xB1.

Yes.
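
The octet values quoted above can be checked with a short Python sketch, which also shows why comparison works once both sides are decoded to characters. The helper same_path is a hypothetical name used only for this illustration.

```python
CHAR = "\u00f1"  # 'ñ'

# The same character as octets in the three encodings discussed.
ENCODED = {
    "cp437": CHAR.encode("cp437"),
    "iso-8859-1": CHAR.encode("iso-8859-1"),
    "utf-8": CHAR.encode("utf-8"),
}

assert ENCODED["cp437"] == b"\xa4"
assert ENCODED["iso-8859-1"] == b"\xf1"
assert ENCODED["utf-8"] == b"\xc3\xb1"


def same_path(a_bytes, a_encoding, b_bytes, b_encoding):
    """Compare two paths as characters rather than as octets."""
    return a_bytes.decode(a_encoding) == b_bytes.decode(b_encoding)


# A path from an ISO-8859-1 config document and a CP437 zip entry name.
xml_path = "ma\u00f1ana.html".encode("iso-8859-1")
zip_path = "ma\u00f1ana.html".encode("cp437")

assert xml_path != zip_path  # the octets differ
assert same_path(xml_path, "iso-8859-1", zip_path, "cp437")
```

The internal representation the UA decodes to (UTF-8, UTF-16, or anything else) does not matter for the comparison, as long as both inputs are decoded from their known encodings first.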

> The assumption of P&C seems to be that everything gets converted to UTF8.
> The only issue is that this is an assumption.

I can make this explicit.

In the "rule for finding a file within a widget package", step 2:

2.  If path is not encoded in UTF-8, convert path to UTF-8.

> My case of IRI and your cases with file name are similar with regard to this assumption.

Right.

> Specifically in case of IRI we have the issue of pct encoding, in your cases we have "just" character-set transcoding.
>
> I hope it is clearer now.

Yeah, I'm kinda getting it now.

Kind regards,
Marcos

Received on Monday, 7 September 2009 13:59:40 UTC