Re: location uri, ucs and the http scheme definition.

Robert Collins wrote:
> > > I'd be happy with a HTTP/1.1 errata that updates the http:// scheme to
> > > declare it as utf8 before the escape encoding is done.
> > 
> > Not reasonable.
> > 
> > There are a significant number of HTTP/1.1-compliant servers which
> > work with URLs that are derived from text in other encodings, and
> > there are servers where the encoding depends on the URL (because the
> > server's job is to pass along the URL unmodified to individual
> > resource handlers).
> 
> I put it to you that this has occurred because of the lack of guidance in
> rfc2616. Even though we can't retroactively change the standard, adding
> in the std66 recommendation as a wg recommendation would be a positive
> step IMO.

Those web servers _far_ predate RFC2616.  Whatever guidance goes into
an HTTP URI standard, it must remain backward compatible with what's
widely deployed, which is precisely why the RFCs don't mandate UTF-8 yet,
even as they suggest further work is needed on it.

> Anything compliant with any of the uri standards must continue to work
> with any % escape uri representation. Sure - but it would be nice to
> document what *should* work.

It is documented.  According to documentation, any %-escaped octet
sequence should work :)  Converting them to *characters for visual
presentation* is outside the scope of HTTP.
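
To illustrate the point, here is a minimal sketch in Python (the example
path "/caf%E9" is made up).  It shows that un-escaping gives octets, and
that turning those octets into characters depends on an encoding that
HTTP itself does not supply:

```python
from urllib.parse import unquote_to_bytes

# Hypothetical path whose escape encodes a single octet, 0xE9.
escaped = "/caf%E9"

# Un-escaping yields octets, not characters.
octets = unquote_to_bytes(escaped)

# Read as ISO-8859-1, the octets spell "/caf" followed by e-acute.
as_latin1 = octets.decode("latin-1")

# Read as UTF-8, the same octets are not valid text at all.
try:
    octets.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
```

A server can still serve this path perfectly well, since it only ever
needs the octets.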

> > In principle, the escape-encoding represents an application-specific
> > opaque octet stream, and it need not represent "characters" at all.
> 
> For URIs in general, yes, but std66 section 2.5 does provide guidance
> for this...
> >     - How non-ASCII characters in documents in places such as an
> >       "href" attribute are converted into proper URIs for HTTP.
> > 
> >     - How non-ASCII characters in forms are converted into proper
> >       URI query parts.  (This is covered somewhat already in HTML 4).
> > 
> >     - How non-ASCII characters in other parts of a typical client's
> >       user interface such as the "location bar", are converted into
> >       proper URLs for HTTP document retrieval.
> 
> Which, given we started this thread on the Location header in http,
> which sets the user interface location bar ... seems relevant to me.

The Location header only has that effect in _web browsers_.

There are lots of other programs which use HTTP for which the
"characters" encoded in a URL are irrelevant.

Increasingly, we may find that non-web-browser HTTP agents see
non-ASCII characters in parts of a document that claim to be URIs, and
must follow them.  Or, they see URIs containing %-encoded characters
and need to convert those to presentable text in documents.

Broadly, the UTF-8-ness affects programs which relate documents
containing non-ASCII characters with URLs.  For example, a spider
which indexes pages that happen to contain non-ASCII characters in the
URLs in "href" attributes... those are actually not valid URLs, but
the spider has to make a heuristic decision if it's to follow them.

Unfortunately, if we mandate that non-ASCII characters found in "href"
URL attributes should be %-escaped as UTF-8 to follow them, we'll find
that this *breaks* some existing deployed sites.  Maybe this is for
the best...
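
A rough sketch of the spider's dilemma, in Python.  The helper name
candidate_paths and the idea of also trying the document's declared
charset are assumptions for illustration, not anything the RFCs
specify:

```python
from urllib.parse import quote

def candidate_paths(path, doc_charset):
    """Return %-escaped forms of a non-ASCII path, UTF-8 first.

    Hypothetical heuristic: try UTF-8, then the charset the enclosing
    document declared, and skip any encoding that cannot represent
    the characters.
    """
    candidates = []
    for encoding in ("utf-8", doc_charset):
        try:
            escaped = quote(path, safe="/", encoding=encoding)
        except UnicodeEncodeError:
            continue  # this charset has no octets for these characters
        if escaped not in candidates:
            candidates.append(escaped)
    return candidates

# A path as it might appear, unescaped, in an "href" attribute.
cafe = candidate_paths("/caf\u00e9", "iso-8859-1")
```

Here cafe holds two different URLs, "/caf%C3%A9" and "/caf%E9".  A
deployed site may answer only the second, which is how mandating UTF-8
would break it.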

In the other direction, a program which is generating index pages of
links may wish to present the links visually as text, converting
%-escape sequences into good-looking text.  This may look nice, but
it's prone to causing security problems...
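
One conservative way to do that conversion, sketched in Python.  The
function name display_form and the specific rules (UTF-8 only,
printable non-ASCII only) are assumptions chosen for illustration; a
real application would need its own policy:

```python
import re
from urllib.parse import unquote_to_bytes

# A run of one or more consecutive %XX escapes.
_ESCAPE_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")

def _show(match):
    original = match.group(0)
    try:
        text = unquote_to_bytes(original).decode("utf-8")
    except UnicodeDecodeError:
        return original  # not UTF-8: we cannot know the characters
    # Leave escaped anything that decodes to ASCII (e.g. %2F would
    # become a path separator) or to non-printable characters such as
    # controls and bidirectional overrides.
    if all(ord(ch) > 0x7F and ch.isprintable() for ch in text):
        return text
    return original

def display_form(escaped_url):
    """Return a presentation form; never use it to make the request."""
    return _ESCAPE_RUN.sub(_show, escaped_url)
```

With these rules, display_form("/caf%C3%A9") shows the accented
character, while "/a%2Fb", "/caf%E9" and a name hiding U+202E
("%E2%80%AE") are all left in escaped form.  A run that mixes ASCII and
non-ASCII escapes is also left alone, which errs on the side of
ugliness rather than ambiguity.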

> Anyway, what I'd like to see is some reference suggesting a best
> practice for http uris, if that is able to be defined. Using whatever
> guidelines are present for the next http protocol would be ideal ;0.

It still has to be backward-compatible, if it's HTTP/1.2.

So guidance for future applications is sensible.  Making it a
requirement has to be done much more carefully.

-- Jamie

Received on Tuesday, 23 August 2005 15:55:03 UTC