W3C WSDL WG: 6b. Define encoding for non-ASCII characters in request URL

From: Jeffrey Schlimmer <jeffsch@windows.microsoft.com>
Date: Tue, 30 Apr 2002 15:13:11 -0700
Message-ID: <2E33960095B58E40A4D3345AB9F65EC106E14981@win-msg-01.wingroup.windeploy.ntdev.microsoft.com>
To: <www-ws-desc@w3.org>
This one is pretty simple. It proposes following the IRI Internet draft
and XML Inclusions Candidate Recommendation.

Comments?

Issue 6b: Define encoding for characters outside ASCII in a request URL

PROBLEM

The content of a message may include non-ASCII characters but these are
not allowed in URIs.

BACKGROUND

RFC 2396 [3] Section 2.4.3 states that non-ASCII characters must be
encoded into octets in some manner but does not define an encoding nor a
mechanism to declare the encoding used:

   For original character sequences that contain non-ASCII characters,
   however, the situation is more difficult. Internet protocols that
   transmit octet sequences intended to represent character sequences
   are expected to provide some way of identifying the charset used, if
   there might be more than one [RFC2277].  However, there is currently
   no provision within the generic URI syntax to accomplish this
   identification.
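
To illustrate the ambiguity described above, here is a minimal Python
sketch (not part of RFC 2396) that escapes the same character under two
different charsets. The choice of U+00E9 and of ISO-8859-1 as the second
charset is only an assumption made for the example.

```python
from urllib.parse import quote

# The same character, U+00E9 (e with acute accent), escaped under two charsets.
ch = "\u00e9"

latin1_form = quote(ch, encoding="iso-8859-1")  # one octet:  %E9
utf8_form = quote(ch, encoding="utf-8")         # two octets: %C3%A9

print(latin1_form, utf8_form)
```

A receiver that sees only "%E9" or "%C3%A9" cannot tell which charset
produced it unless the encoding is agreed on out of band.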

From HTML 4.01 [1], Section 17.13.1:

Note. The "get" method restricts form data set values to ASCII
characters. Only the "post" method (with enctype="multipart/form-data")
is specified to cover the entire [ISO10646] character set.

XML Inclusions [5], Section 4.1.1 defines a UTF-8 based recoding of
non-ASCII characters:

The disallowed characters include all non-ASCII characters, plus the
excluded characters listed in Section 2.4 of [IETF RFC 2396], except for
the number sign (#) and percent sign (%) characters and the square
bracket characters re-allowed in [IETF RFC 2732]. Disallowed characters
are escaped as follows:

1. Each disallowed character is converted to UTF-8 [IETF RFC 2279] as one
   or more bytes.

2. Any bytes corresponding to a disallowed character are escaped with the
   URI escaping mechanism (that is, converted to %HH, where HH is the
   hexadecimal notation of the byte value).

3. The original character is replaced by the resulting character sequence.
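
The escaping steps quoted above can be sketched in Python roughly as
follows. This is an illustration under stated assumptions, not normative
text: the function name escape_uri_reference and the constant
EXCLUDED_ASCII are hypothetical, the ASCII set is my reading of the
excluded characters in RFC 2396 Section 2.4.3 (controls, space, and the
"delims" and "unwise" characters) minus '#', '%', '[' and ']', and the
use of uppercase hex digits is a choice the quoted text does not mandate.

```python
# ASCII characters treated as disallowed (assumed reading of RFC 2396
# Section 2.4.3), with '#', '%', '[' and ']' left alone as the quoted
# text requires.
EXCLUDED_ASCII = (
    {chr(c) for c in range(0x21)}   # control characters 0x00-0x1F and space
    | {chr(0x7F)}                   # DEL
    | set('<>"{}|\\^`')             # remaining "delims" and "unwise" characters
)


def escape_uri_reference(value: str) -> str:
    """Escape disallowed characters as UTF-8 octets written as %HH."""
    out = []
    for ch in value:
        if ord(ch) > 0x7F or ch in EXCLUDED_ASCII:
            # Convert the character to UTF-8, then escape each byte as %HH;
            # the escaped sequence replaces the original character.
            out.append("".join(f"%{b:02X}" for b in ch.encode("utf-8")))
        else:
            out.append(ch)
    return "".join(out)


print(escape_uri_reference("http://example.org/caf\u00e9?q=a b#frag"))
# prints http://example.org/caf%C3%A9?q=a%20b#frag
```

Because '%' is left untouched, a string that already contains %HH
escapes passes through unchanged, so applying the function twice gives
the same result as applying it once.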

This recommendation is in agreement with RFC 2718 [4] Section 2.2.5:

      ... Unless there is some compelling reason for a
      particular scheme to do otherwise, translating character sequences
      into UTF-8 (RFC 2279) [3] and then subsequently using the %HH
      encoding for unsafe octets is recommended.

Internationalized Resource Identifiers (IRIs) [2] are analogous to URIs
but contain characters from the Universal Character Set. Section 2.3
makes the same recommendation as XML Inclusions, Section 4.1.1. Note
that IRIs are described in an active IETF draft but are not yet
standardized.

PROPOSAL

Until IRIs are standardized, follow the URI escaping procedure outlined
in XML Inclusions [5], Section 4.1.1.

REFERENCES

[1] http://www.w3.org/TR/html401/cover.html 
[2] http://www.ietf.org/internet-drafts/draft-duerst-iri-00.txt 
[3] ftp://ftp.isi.edu/in-notes/rfc2396.txt 
[4] http://www.ietf.org/rfc/rfc2718.txt?number=2718 
[5] http://www.w3.org/TR/xinclude/ 

EOF
Received on Tuesday, 30 April 2002 18:17:33 GMT