- From: Chris Lilley <chris@w3.org>
- Date: Wed, 26 Jan 2005 18:08:03 +0100
- To: Paul Libbrecht <paul@activemath.org>
- Cc: Martin Duerst <duerst@w3.org>, www-tag@w3.org
On Wednesday, January 26, 2005, 9:58:28 AM, Paul wrote:

PL> Dare I ask how this will solve the lack-of-encoding issue of get
PL> request parameters ?

Hmm, interesting angle on the versioning/extensibility problem...

RFC 2396 defined a transformation from octets to percent escapes, but
gave no information on the transformation from characters to octets;
hence the encoding problem that you describe for the query part of a
URI. RFC 3986 supplies this information by specifying UTF-8 as the
encoding, so a fully escaped URI can be inserted into content. RFC 3987
allows the information to be stored more naturally as characters, so an
IRI can be inserted into content (or created on the fly) and converted
into an RFC 3986-compliant URI if the transport or URI scheme does not
allow IRIs to be used directly.

PL> Currently, I think, most server infrastructures consider the encoding
PL> of %xx sequences as being the ASCII, ISO-8859-1, or the platform
PL> encoding...

... or the encoding of 'the web page' from which it came; awkward if the
request does not originate from a web page, or was bookmarked, or was
generated on the fly, etc. In summary, it varies wildly and is based on
unsound assumptions about the demographics and settings of the clients.

PL> special care is needed to avoid this and consider them
PL> UTF-8-encoded (I think, this spec proposes that).

Yes. By clearly differentiating between octets and characters, it is
clear how to go from user-supplied non-ASCII form data, from an HTML
form in an arbitrary character encoding, to an IRI that encodes that
form data. It is also clear how to transform that IRI to a URI. On the
plus side, this URI is compatible with existing infrastructure.
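The IRI-to-URI step described above can be sketched in a few lines of
Python. This is a minimal illustration, not the full RFC 3987 mapping
algorithm; the `safe` set and the `iri_to_uri` name are simplifications
chosen for the example:

```python
from urllib.parse import quote

def iri_to_uri(iri: str) -> str:
    # Each non-ASCII character is encoded as UTF-8 octets, and each
    # octet is percent-escaped. Reserved ASCII characters are passed
    # through via `safe` (a simplification of the full RFC 3987 rules).
    return quote(iri, safe=":/?#[]@!$&'()*+,;=-._~%")

# Cyrillic capital letter Nje (U+040A) is UTF-8 0xD0 0x8A:
print(iri_to_uri("http://example.org/search?q=Њ"))
# http://example.org/search?q=%D0%8A
```

The result is an RFC 3986-compliant URI that any existing
infrastructure can carry unchanged.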
On the minus side, it's compatible with existing infrastructure :) so
the receiving server-side application has no indication that the URI has
its query part encoded in conformance with RFC 3986 and 3987 (hex
escapes represent characters encoded in UTF-8) rather than, say, RFC
2396 (escapes represent octets; the relation of octets to characters is
a wild guess). If the IRI is stored as an IRI in an XML body part, there
is no ambiguity; but that requires POST rather than GET.

PL> How are implementors expected to react ?

There seems to be an emerging trend to identify both the encoding used
to construct the URI and the encoding used for the original character
data (in the case of ambiguous round-tripping). For example, Google uses
an 'input encoding' (ie) and an 'output encoding' (oe) parameter - if I
search on a random Unicode character, Cyrillic capital letter Nje (Њ),
it gives me the following URI

http://www.google.fr/search?sourceid=mozclient&scoring=d&ie=utf-8&oe=utf-8&q=%D0%8A

This approach could be followed with RFC 3986 by always setting
oe=utf-8. With RFC 3987, the query part is a string of characters, so
there is no ambiguity and no need for an additional parameter to
identify the encoding.

Over time, server applications that accept URIs should migrate to
expecting an output encoding of utf-8 and RFC 3986 conformance. Server
applications that accept IRIs, or that use POST (eg, of an XML body part
whose encoding is described by an XML encoding declaration), do not
exhibit this problem.

PL> Le 26 janv. 05, à 03:26, Martin Duerst a écrit :

>> The URI spec is an IETF Standard! Thanks to all the people who
>> contributed over the long time this has been in the works!
>> [...]
>> > URL: ftp://ftp.rfc-editor.org/in-notes/rfc3986.txt

-- 
Chris Lilley                    mailto:chris@w3.org
Chair, W3C SVG Working Group
Member, W3C Technical Architecture Group
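The server-side ambiguity above is easy to demonstrate: the same
percent-escaped octets decode to different characters depending on
which encoding the application guesses. A short Python sketch (the
escape sequence is the one from the Google URI above):

```python
from urllib.parse import unquote

escaped = "%D0%8A"

# RFC 3986/3987 interpretation: the octets 0xD0 0x8A are UTF-8,
# yielding the single character the user typed.
print(unquote(escaped, encoding="utf-8"))        # Њ

# Legacy RFC 2396-era guess: the same octets read as ISO-8859-1
# become two unrelated characters.
print(unquote(escaped, encoding="iso-8859-1"))
```

With nothing in the URI itself to say which interpretation applies,
only out-of-band agreement (or a parameter like oe=utf-8) resolves it.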
Received on Wednesday, 26 January 2005 17:08:18 UTC