RE: How browsers display IRI's with mixed encodings from Phillips, Addison on 2011-07-27 (public-iri@w3.org from July 2011)

From: Phillips, Addison <addison@lab126.com>
Date: Tue, 26 Jul 2011 18:13:30 -0700
To: Leif H Silli <xn--mlform-iua@xn--mlform-iua.no>, "duerst@it.aoyama.ac.jp" <duerst@it.aoyama.ac.jp>
CC: "chris@lookout.net" <chris@lookout.net>, "public-iri@w3.org" <public-iri@w3.org>
Message-ID: <131F80DEA635F044946897AFDA9AC3476A94986B60@EX-SEA31-D.ant.amazon.com>

> >
> >> It is one thing that %FC needs to work (in some sense - like
> >> quirks-mode pages also have to work even if it is not valid). But if
> >> there is no good necessary usecase for %FC, then we should help
> >> authors avoid problems by encouraging validators to warn against its use.
> >
> > There's nothing invalid with %FC.
> 
> My suggestion was that it should *become* invalid/get a warning in - let's say -
> HTML5 docs.

Making the literal sequence %FC invalid would be a Bad Thing. It would make it impossible to encode certain resources that are otherwise completely valid.
> 
> > A URI that contains %FC is perfectly valid (check RFC 3986). Because it's a
> > valid URI, it's also a valid IRI.
> 
> But an author who inserts %FC today is likely making a mistake, or at least
> making a bad choice, no?

An author who inserts u-umlaut and expects to get %FC is making a mistake. An author who inserts %FC and expects to see u-umlaut is making a mistake (or should be). But an author who inserts %FC because that's what her server expects? Valid. And an author who inserts u-umlaut and expects it to display as u-umlaut and to be sent as %C3%BC in URI form? Also valid, IMHO.
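
To make the two escaped forms concrete, here is a minimal sketch using Python's standard `urllib.parse.quote`. It only illustrates the byte-level difference between the two encodings; the variable names are illustrative, and nothing here claims to reproduce what any particular browser does.

```python
from urllib.parse import quote

# UTF-8 is quote()'s default encoding, and it is the encoding used when an
# IRI is mapped to a URI: u-umlaut (U+00FC) becomes two escaped bytes.
utf8_form = quote("ü")

# Under ISO-8859-1 the same character is the single byte 0xFC.
latin1_form = quote("ü", encoding="iso-8859-1")

print(utf8_form, latin1_form)  # %C3%BC %FC
```

The character is the same in both calls; only the assumed encoding differs, which is why an author who types u-umlaut should not expect %FC to come out the other end.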

> 
> > And it's useful in some circumstances. Imagine a server where all the
> > resource names are encoded in iso-8859-1 (or any other legacy (single-byte)
> > encoding). What you tell http (or whatever other scheme/protocol) by
> > using %FC is that you want the resource with the name with the <0xFC> byte in
> > it.
> 
> How common are such servers these days?

They should be really, really common, since that's what the URI spec *says* %FC means: the byte 0xFC.
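
As a rough illustration of "%FC means a byte", the sketch below uses Python's standard `urllib.parse.unquote_to_bytes`. The path `/f%FCr` is a made-up example, and treating the bytes as ISO-8859-1 is an assumption supplied from outside the URI, not something the URI itself carries.

```python
from urllib.parse import unquote_to_bytes

# Percent-decoding yields bytes, not characters.
raw = unquote_to_bytes("/f%FCr")          # b'/f\xfcr'

# Only an external assumption about the server's encoding turns 0xFC
# into u-umlaut.
as_latin1 = raw.decode("iso-8859-1")      # '/für'

# The same bytes are not well-formed UTF-8, so a UTF-8 reading fails.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

print(raw, as_latin1, valid_utf8)
```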

> 
> My focus is authors. And of course it could be that the author meant %FC. But might
> it not more often be simply the result of a bad %-encoder or of a misconception?
> 

The problem, as I see it, is not with the sequence %FC. It is with the character U+00FC appearing in an HTML document inside a URI path. 

I tend to think that interpreting %FC using the page encoding is bad because an IRI (or URI) lacks the necessary context to make that determination. I agree with Boris's earlier message on the list that showing %FC is a bad user experience. But shouldn't we be trying to converge on a well-defined set of behaviors that content authors (and others) can understand? I think such an approach would include the behavior described above, even at the expense of some usability. And who looks at those really long URIs full of percent gunk anyway? :-))
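
One way to picture such a well-defined display rule is the sketch below: escapes are shown as characters only when they decode as UTF-8, and anything else (such as a lone %FC) stays as percent gunk. This is a hypothetical illustration, not any browser's actual algorithm; `display_form` and `_ESCAPE_RUN` are names invented here, and the all-or-nothing handling of each run of escapes is a simplification.

```python
import re
from urllib.parse import unquote_to_bytes

# A run of one or more consecutive percent-escapes, e.g. "%C3%BC".
_ESCAPE_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")

def display_form(uri_path: str) -> str:
    """Hypothetical display rule: unescape a run only if it is UTF-8 text."""
    def show(match: re.Match) -> str:
        escaped = match.group(0)
        raw = unquote_to_bytes(escaped)
        try:
            text = raw.decode("utf-8")
        except UnicodeDecodeError:
            return escaped          # e.g. %FC: no encoding context, leave it
        # Leave ASCII escapes such as %20 or %2F alone; they may be significant.
        if all(ord(ch) > 0x7F for ch in text):
            return text
        return escaped
    return _ESCAPE_RUN.sub(show, uri_path)

print(display_form("/f%C3%BCr"))  # /für
print(display_form("/f%FCr"))     # /f%FCr
```

The rule never consults the page encoding, so the same URI displays the same way wherever it appears.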

Addison

Received on Wednesday, 27 July 2011 01:14:01 UTC