- From: Jeffrey Schlimmer <jeffsch@windows.microsoft.com>
- Date: Tue, 30 Apr 2002 15:13:11 -0700
- To: <www-ws-desc@w3.org>
This one is pretty simple. It proposes following the IRI Internet draft and the XML Inclusions Candidate Recommendation. Comments?

Issue 6b: Define encoding for characters outside ASCII in a request URL

PROBLEM

The content of a message may include non-ASCII characters, but these are not allowed in URIs.

BACKGROUND

RFC 2396 [3] Section 2.4.3 states that non-ASCII characters must be encoded into octets in some manner, but it defines neither an encoding nor a mechanism to declare the encoding used:

    For original character sequences that contain non-ASCII characters, however, the situation is more difficult. Internet protocols that transmit octet sequences intended to represent character sequences are expected to provide some way of identifying the charset used, if there might be more than one [RFC2277]. However, there is currently no provision within the generic URI syntax to accomplish this identification.

From HTML 4.01 [1], Section 17.13.1:

    Note. The "get" method restricts form data set values to ASCII characters. Only the "post" method (with enctype="multipart/form-data") is specified to cover the entire [ISO10646] character set.

XML Inclusions [5], Section 4.1.1 defines a UTF-8 based recoding of non-ASCII characters:

    The disallowed characters include all non-ASCII characters, plus the excluded characters listed in Section 2.4 of [IETF RFC 2396], except for the number sign (#) and percent sign (%) characters and the square bracket characters re-allowed in [IETF RFC 2732]. Disallowed characters are escaped as follows:

    1. Each disallowed character is converted to UTF-8 [IETF RFC 2279] as one or more bytes.
    2. Any bytes corresponding to a disallowed character are escaped with the URI escaping mechanism (that is, converted to %HH, where HH is the hexadecimal notation of the byte value).
    3. The original character is replaced by the resulting character sequence.

This recommendation is in agreement with RFC 2718 [4] Section 2.2.5:

    ... Unless there is some compelling reason for a particular scheme to do otherwise, translating character sequences into UTF-8 (RFC 2279) [3] and then subsequently using the %HH encoding for unsafe octets is recommended.

Internationalized Resource Identifiers (IRIs) [2] are analogous to URIs but contain characters from the Universal Character Set. Section 2.3 of the IRI draft makes the same recommendation as XML Inclusions, Section 4.1.1. Note that IRIs are described in an active IETF draft but are not yet standardized.

PROPOSAL

Until IRIs are standardized, follow URI escaping as outlined in XML Inclusions, Section 4.1.1.

REFERENCES

[1] http://www.w3.org/TR/html401/cover.html
[2] http://www.ietf.org/internet-drafts/draft-duerst-iri-00.txt
[3] ftp://ftp.isi.edu/in-notes/rfc2396.txt
[4] http://www.ietf.org/rfc/rfc2718.txt?number=2718
[5] http://www.w3.org/TR/xinclude/

EOF
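
ILLUSTRATION

For concreteness, the escaping rule quoted from XML Inclusions, Section 4.1.1 might be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not part of the proposal text: the function name escape_uri_reference is hypothetical, the set of excluded ASCII characters is my reading of RFC 2396 Section 2.4 (control characters, space, delimiters, and "unwise" characters, minus '#', '%', '[' and ']'), and upper-case hex digits are an arbitrary choice since the quoted text does not fix the case.

```python
# Assumed reading of RFC 2396 Section 2.4 "excluded" ASCII characters,
# with '#', '%', '[' and ']' left out because the quoted rule re-allows them.
_EXCLUDED_ASCII = (
    set('<>"{}|\\^` ')
    | {chr(c) for c in range(0x20)}  # control characters 0x00-0x1F
    | {"\x7f"}                       # DEL
)


def escape_uri_reference(text: str) -> str:
    """Hypothetical helper: escape disallowed characters as UTF-8 %HH."""
    out = []
    for ch in text:
        if ord(ch) > 0x7F or ch in _EXCLUDED_ASCII:
            # Step 1: convert the character to UTF-8 bytes.
            # Step 2: escape each byte as %HH.
            # Step 3: the escaped sequence replaces the original character.
            out.append("".join("%{:02X}".format(b) for b in ch.encode("utf-8")))
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(escape_uri_reference("http://example.org/caf\u00e9?q=a b"))
```

Note that an existing '%' is passed through untouched, so a string that is already escaped is not escaped a second time; the cost is that a literal percent sign in message content has to be escaped by the caller beforehand.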
Received on Tuesday, 30 April 2002 18:17:33 UTC