- From: Chris Lilley <chris@w3.org>
- Date: Wed, 26 Jan 2005 18:08:03 +0100
- To: Paul Libbrecht <paul@activemath.org>
- Cc: Martin Duerst <duerst@w3.org>, www-tag@w3.org
On Wednesday, January 26, 2005, 9:58:28 AM, Paul wrote:

PL> Dare I ask how this will solve the lack-of-encoding issue of get
PL> request parameters ?

Hmm, interesting angle on the versioning/extensibility problem...

RFC 2396 defined a transformation from octets to percent escapes, but
gave no information on the transformation from characters to octets;
hence the encoding problem that you describe for the query part of a
URI. RFC 3986 supplies this information by specifying UTF-8 as the
encoding, so a fully escaped URI can be inserted into content. RFC 3987
allows the information to be stored more naturally as characters, so an
IRI can be inserted into content (or created on the fly) and converted
into an RFC 3986-compliant URI if the transport or URI scheme does not
allow IRIs to be used directly.

PL> Currently, I think, most server infrastructures consider the encoding
PL> of %xx sequences as being the ASCII, ISO-8859-1, or the platform
PL> encoding...

... or the encoding of 'the web page' from which it came; awkward if the
request does not originate from a web page, or was bookmarked, or was
generated on the fly, etc. In summary, it varies wildly and is based on
unsound assumptions about the demographics and settings of the clients.

PL> special care is needed to avoid this and consider them
PL> UTF-8-encoded (I think, this spec proposes that).

Yes. By clearly differentiating between octets and characters, it is
clear how to go from user-supplied non-ASCII form data, from an HTML
form in an arbitrary character encoding, to an IRI that encodes that
form data. It is also clear how to transform that IRI to a URI. On the
plus side, this URI is compatible with existing infrastructure.
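The IRI-to-URI step described above can be sketched in a few lines of
Python. This is a minimal illustration, not the full RFC 3987 mapping
algorithm; the `safe` set and the `iri_to_uri` name are simplifications
chosen for the example:

```python
from urllib.parse import quote

def iri_to_uri(iri: str) -> str:
    # Each non-ASCII character is encoded as UTF-8 octets, and each
    # octet is percent-escaped. Reserved ASCII characters are passed
    # through via `safe` (a simplification of the full RFC 3987 rules).
    return quote(iri, safe=":/?#[]@!$&'()*+,;=-._~%")

# Cyrillic capital letter Nje (U+040A) is UTF-8 0xD0 0x8A:
print(iri_to_uri("http://example.org/search?q=Њ"))
# http://example.org/search?q=%D0%8A
```

The result is an RFC 3986-compliant URI that any existing
infrastructure can carry unchanged.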
On the minus side, it's compatible with existing infrastructure :) so
the receiving server-side application has no indication that the URI has
its query part encoded in conformance with RFC 3986 and 3987 (hex
escapes represent characters encoded in UTF-8) rather than, say, RFC
2396 (escapes represent octets; the relation of octets to characters is
a wild guess). If the IRI is stored as an IRI in an XML body part, there
is no ambiguity; but that requires POST rather than GET.

PL> How are implementors expected to react ?

There seems to be an emerging trend to identify both the encoding used
to construct the URI and the encoding used for the original character
data (in the case of ambiguous round-tripping). For example, Google uses
an 'input encoding' (ie) and an 'output encoding' (oe) parameter - if I
search on a random Unicode character, Cyrillic capital letter Nje (Њ),
it gives me the following URI

http://www.google.fr/search?sourceid=mozclient&scoring=d&ie=utf-8&oe=utf-8&q=%D0%8A

This approach could be followed with RFC 3986 by always setting
oe=utf-8. With RFC 3987, the query part is a string of characters, so
there is no ambiguity and no need for an additional parameter to
identify the encoding.

Over time, server applications that accept URIs should migrate to
expecting an output encoding of utf-8 and RFC 3986 conformance. Server
applications that accept IRIs, or that use POST (eg, of an XML body part
whose encoding is described by an XML encoding declaration), do not
exhibit this problem.

PL> Le 26 janv. 05, à 03:26, Martin Duerst a écrit :

>> The URI spec is an IETF Standard! Thanks to all the people who
>> contributed over the long time this has been in the works!
>> [...]
>> > URL: ftp://ftp.rfc-editor.org/in-notes/rfc3986.txt

-- 
Chris Lilley                    mailto:chris@w3.org
Chair, W3C SVG Working Group
Member, W3C Technical Architecture Group
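The server-side ambiguity above is easy to demonstrate: the same
percent-escaped octets decode to different characters depending on
which encoding the application guesses. A short Python sketch (the
escape sequence is the one from the Google URI above):

```python
from urllib.parse import unquote

escaped = "%D0%8A"

# RFC 3986/3987 interpretation: the octets 0xD0 0x8A are UTF-8,
# yielding the single character the user typed.
print(unquote(escaped, encoding="utf-8"))        # Њ

# Legacy RFC 2396-era guess: the same octets read as ISO-8859-1
# become two unrelated characters.
print(unquote(escaped, encoding="iso-8859-1"))
```

With nothing in the URI itself to say which interpretation applies,
only out-of-band agreement (or a parameter like oe=utf-8) resolves it.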
Received on Wednesday, 26 January 2005 17:08:18 UTC