Does URL parsing result in Unicode strings or in ASCII strings?

This is from a conversation on public-ietf-w3c@w3.org (BCC'd)
Re: [url] Requests for Feedback (was Feedback from TPAC)

Martin:
> >> The URL spec, as far as I understand, allows Unicode as input, so in
> >> that respect, it isn't ghettoizing. But it converts all output to ASCII,
> >> and so essentially sends a message that Unicode is second-class.
> >>
> >> My understanding is that the reason for this is that current browser
> >> interfaces are working that way, and I'm not against documenting that,
> >> but I'd wish we could get away from that limitation for the general case
> >> (i.e. parser results are still Unicode).

We should not evaluate protocol choices by considering
whether something "sends a message" or what sounds like
an appeal not to make Unicode "second-class".
Bad, non-interoperable design "sends a message".

In fact, URLs were ASCII-only when invented, and the only
message we should send is that upgrading from ASCII
to Unicode isn't simply a matter of increasing the repertoire.
Yes, Unicode is second.

Sam:
> > There are a couple of conflicting requirements that make that difficult.
> > If you make an API for resource identifiers, you don't want it to change
> > behavior when new schemes are introduced; you probably also want an
> > input like `example:///ö` to be handled the same as `example:///%c3%b6`,
> > while also avoiding turning `data:image/png,...%xx...` into a mix of random
> > Unicode characters interspersed with %xx escapes that would not round-
> > trip if decoded. If you want Unicode output, and a data-like scheme is
> > introduced, you cannot satisfy all requirements.
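
(For concreteness, a minimal Python sketch of the round-trip failure Sam
describes; the payload bytes are made up, and this uses the stdlib's
urllib.parse rather than any particular URL-spec parser:)

    from urllib.parse import unquote, quote

    # A data:-style URL whose %xx escapes carry raw bytes, not UTF-8 text.
    # 0x89 and 0xff are not valid UTF-8, so decoding to Unicode has to
    # substitute U+FFFD replacement characters for them.
    raw = "data:image/png,%89PNG%0d%0a%1a%0a%ff"
    decoded = unquote(raw)                 # decodes as UTF-8, errors='replace'
    reencoded = quote(decoded, safe=":,/")
    print(reencoded == raw)                # False: the original bytes are gone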

Martin:
> This is indeed a theoretical problem, but one that in practice rarely
> shows up and is rather easily dealt with.
> 
> First, data:-like schemes are few and far between.

Either the use cases are in scope or they're not. If they're in scope, then
their requirements must be met. "Few and far between" by what measure? And how
do you "easily deal with" unpredictable new schemes in URL parsers that are
already very widely deployed?

(I think there's a clue here about why URL shouldn't be a Living Standard:
there are orders of magnitude more deployed URL implementations than there
are browsers.)

> Second, there's no reason to convert to Unicode sequences of %xx that
> can't be converted in full.

This leaves a complex and unstable question of which sequences should or
shouldn't be converted, or can't be converted "in full". It leaves many
possible results rather than a canonical one.
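
To make that concrete: a hedged Python sketch (stdlib only; the "decode only
full runs" policy is my reading of Martin's suggestion) of how different
reasonable policies give different results for the same component:

    import re
    from urllib.parse import unquote

    component = "%c3%b6/%ff"    # "ö", then a byte that is not valid UTF-8

    # Policy 1: decode everything, substituting U+FFFD for bad bytes.
    s1 = unquote(component)                       # 'ö/\ufffd'

    # Policy 2: decode nothing; keep all escapes as-is.
    s2 = component                                # '%c3%b6/%ff'

    # Policy 3: decode an escape run only if the whole run is valid UTF-8.
    def decode_full_runs(s):
        def repl(m):
            raw = bytes(int(h, 16) for h in m.group(0).split('%')[1:])
            try:
                return raw.decode('utf-8')        # the run decodes in full
            except UnicodeDecodeError:
                return m.group(0)                 # otherwise leave it escaped
        return re.sub(r'(?:%[0-9a-fA-F]{2})+', repl, s)

    s3 = decode_full_runs(component)              # 'ö/%ff'

    # Note: for "%c3%b6%ff" (no slash) policy 3 leaves everything escaped,
    # because the single run doesn't decode in full -- even what "in full"
    # means is itself a policy choice.

Three different outputs for one input, depending on policy.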

> Third, the equivalence between "ö" and "%c3%b6" might be provided at
> a higher level in the API, because "is handled the same" assumes a
> universal equivalence function for URIs and IRIs when the specs clearly
> explain that there is no such thing (see
> http://tools.ietf.org/html/rfc3986#section-6 and
> http://tools.ietf.org/html/rfc3987#section-5).

RFC 3986/3987 on equivalence have problems, which is why I split equivalence
out into a separate document. I don't think it is productive or useful to
provide an API that makes it easy to distinguish the input form of a URL
between "http:" and "HTTP:", or between "ö" and "%c3%b6".
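
If the API normalizes before handing anything back, callers can't observe
those distinctions. A sketch of one such normalization (an illustration, not
a universal equivalence -- RFC 3986 section 6 is clear there is no such
thing):

    from urllib.parse import urlsplit, unquote, quote

    def normalized(u):
        # Lowercase the host; re-encode the path as UTF-8 with uppercase
        # hex escapes. One possible policy among many, not *the* policy.
        p = urlsplit(u)               # urlsplit already lowercases the scheme
        path = quote(unquote(p.path), safe="/")
        return p._replace(netloc=p.netloc.lower(), path=path).geturl()

    print(normalized("HTTP://Example.COM/%c3%b6") ==
          normalized("http://example.com/ö"))     # True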

Since hostname punycode has to be handled at this level, encoding of other parts of the URL should be, too.
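
(Hostname handling already illustrates this; a sketch using Python's built-in
"idna" codec, which implements IDNA 2003 rather than the UTS #46 mapping
browsers use, so treat it as illustrative:)

    host = "bücher.example"
    ascii_host = host.encode("idna").decode("ascii")
    print(ascii_host)                                 # xn--bcher-kva.example
    print(ascii_host.encode("ascii").decode("idna"))  # bücher.example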

Received on Tuesday, 6 January 2015 22:00:00 UTC