RE: mediafragment track names and IRIs. from Phillips, Addison on 2010-06-23 (public-i18n-core@w3.org from April to June 2010)

From: Phillips, Addison <addison@lab126.com>
Date: Wed, 23 Jun 2010 11:12:03 -0400
To: Jack Jansen <Jack.Jansen@cwi.nl>
CC: Yves Lafon <ylafon@w3.org>, "public-i18n-core@w3.org" <public-i18n-core@w3.org>, "public-media-fragment@w3.org" <public-media-fragment@w3.org>
Message-ID: <C7A5719F1E562149BA9171F58BEE2CA4129E903208@EX-IAD6-B.ant.amazon.com>

Hi Jack,

> I am confused, maybe you can enlighten me.

Hopefully I can untangle any confusion I've caused!

> 
> We have based our media fragment URIs on rfc3987. For the one area
> where things make a difference (encoding track name and ID
> parameters), this document specifically states that percent-escapes
> should be interpreted as UTF-8 (last paragraph of section 3.2.2).

When I looked at the Media Fragments draft, I didn't see a reference to IRI (RFC 3987), but I did see a number of references to URI (RFC 3986). I may not have been looking in the right place, of course. I'm looking at:

   http://www.w3.org/TR/2010/WD-media-frags-20100413/

To me, the key change to the section on track names would be to put things "the other way around". That is, if you're using IRI, then a track name is a sequence of Unicode characters. That sequence is converted to a URI by percent-encoding its UTF-8 bytes according to the rules in IRI. Instead I see a definition in terms of URI, in which a "utf8string" is the percent-encoded representation of the underlying track name.
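
To make the direction concrete, here is a minimal sketch in Python (my choice of language, not anything from the draft). The helper names and the sample track name are made up for illustration. It starts from characters and only derives the percent-encoded form at the end:

```python
from urllib.parse import quote, unquote

def track_name_to_uri_component(name: str) -> str:
    """Hypothetical helper: characters -> UTF-8 bytes -> percent-encoded text."""
    # safe="" so that reserved characters such as "/" are escaped as well.
    return quote(name, safe="", encoding="utf-8")

def uri_component_to_track_name(component: str) -> str:
    """Hypothetical helper: percent-encoded text -> bytes -> UTF-8 characters."""
    # errors="strict" raises UnicodeDecodeError if the bytes are not valid UTF-8.
    return unquote(component, encoding="utf-8", errors="strict")

track = "Vidéo"                                   # made-up track name
encoded = track_name_to_uri_component(track)      # "Vid%C3%A9o"
decoded = uri_component_to_track_name(encoded)    # "Vidéo"
print(encoded, decoded)
```

The track name itself is the Unicode string; the "%C3%A9" form is only its URI serialization.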

> 
> But, the CharMod <http://www.w3.org/TR/CharMod-resid> reference you
> cite refers to the much older URI specification rfc2396 (and then
> adds stuff to it to say things should be utf-8 encoded). Rfc2396 is
> indeed "not good enough" for us, as it talks about byte values for
> percent encoding.

CharMod-Resid has the problem of having been published before either IRI or the most recent URI specification was final, so it couldn't reference them normatively.

Please note: URI (including RFC 3986) also talks about byte values for percent-encoding. Percent-encoding handles octets (bytes), not characters. Interpretation of the encoded bytes as characters happens at another level. Using IRI means using characters and transforming them to (percent-encoded) URIs using a fixed, well-known character encoding (UTF-8). But even IRIs can contain percent-encoded sequences that represent "random bytes", that is, arbitrary octets that need not be valid UTF-8.
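
As a rough illustration of the two levels, the following Python sketch first undoes the percent-encoding, which yields bytes, and only then tries to interpret those bytes as characters. The function name is hypothetical and the inputs are made-up examples:

```python
from typing import Optional
from urllib.parse import unquote_to_bytes

def decode_as_utf8(component: str) -> Optional[str]:
    """Hypothetical helper: return the characters, or None if the octets are not UTF-8."""
    raw = unquote_to_bytes(component)   # level 1: percent-encoding -> octets
    try:
        return raw.decode("utf-8")      # level 2: octets -> characters
    except UnicodeDecodeError:
        return None                     # "random bytes": no UTF-8 interpretation

# The same two octets read as different characters under different encodings.
two_octets = unquote_to_bytes("%C3%A9")
as_utf8 = two_octets.decode("utf-8")      # one character: é
as_latin1 = two_octets.decode("latin-1")  # two characters: Ã©

# A lone %E9 is a legal percent-escape but not a valid UTF-8 sequence.
stray = decode_as_utf8("%E9")
print(as_utf8, as_latin1, stray)
```

The escape syntax is identical in every case; only the agreed character encoding tells you which characters, if any, the octets stand for.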

I'm concerned that perhaps there is confusion about what an IRI is vs. what a URI is. Would it be useful for (selected members of) our WG to attend one of your teleconferences (or vice versa)?

Addison

Addison Phillips
Globalization Architect (Lab126)
Chair (W3C I18N, IETF IRI WGs)

Internationalization is not a feature.
It is an architecture.

Received on Wednesday, 23 June 2010 15:12:36 UTC