URI Escaping and IRIs (was Issues 6a, 6d, 41. Define encoding into a request URL) from Jonathan Marsh on 2002-04-29 (www-ws-desc@w3.org from April 2002)

From: Jonathan Marsh <jmarsh@microsoft.com>
Date: Mon, 29 Apr 2002 11:07:46 -0700
To: <www-ws-desc@w3.org>
Message-ID: <330564469BFEC046B84E591EB3D4D59C05C06408@red-msg-08.redmond.corp.microsoft.com>
Does the new Internationalized Resource Identifier (IRI) draft [1] have any
impact on your proposal?  It loosens the escaping requirements.  In other specs
like XLink, care is taken to perform escaping at the latest possible
stage (in the URI resolver component), so that one day the escaping step
can be removed altogether.

[1] http://www.ietf.org/internet-drafts/draft-duerst-iri-00.txt

> -----Original Message-----
> From: Jeffrey Schlimmer [mailto:jeffsch@windows.microsoft.com]
> Sent: Friday, April 26, 2002 7:10 PM
> To: www-ws-desc@w3.org
> Subject: Issues 6a, 6d, 41. Define encoding into a request URL
> 
> These three non-SOAP HTTP binding issues seem to go together. Does the
> proposal below sound reasonable?
> 
> --Jeff
> 
> Issue 6a: Define encoding of complex types into a request URL
> Issue 6d: Define encoding for URL-sensitive characters into a request
> URL
> Issue 41: Define encoding of attributes into a request URL
> 
> PROBLEM
> 
> It is not clear how to encode some of the data types and/or characters
> in WSDL-defined messages into an HTTP GET request URI.
> 
> BACKGROUND
> 
> RFC 2396 [3] Section 2.4 defines an overall guideline for including data
> characters in a URI. Quoting from the RFC:
> 
>    Data must be escaped if it does not have a representation using an
>    unreserved character; this includes data that does not correspond to
>    a printable character of the US-ASCII coded character set, or that
>    corresponds to any US-ASCII character that is disallowed ...
> 
> RFC 2396 Section 2.3 defines "unreserved" characters:
> 
>       unreserved  = alphanum | mark
> 
>       mark        = "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"
> 
> RFC 2396 Section 2.4.1 defines a means to escape characters:
> 
>    An escaped octet is encoded as a character triplet, consisting of the
>    percent character "%" followed by the two hexadecimal digits
>    representing the octet code. For example, "%20" is the escaped
>    encoding for the US-ASCII space character.
> 
>       escaped     = "%" hex hex
>       hex         = digit | "A" | "B" | "C" | "D" | "E" | "F" |
>                             "a" | "b" | "c" | "d" | "e" | "f"
> 
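A minimal sketch of the escaping rule quoted above, in Python. The function name escape_rfc2396 and the choice of UTF-8 for turning characters into octets are assumptions made for illustration; RFC 2396 itself does not fix a character encoding.

```python
import string

# "unreserved" per RFC 2396 Section 2.3: alphanum | mark
UNRESERVED = set(string.ascii_letters + string.digits + "-_.!~*'()")


def escape_rfc2396(text, encoding="utf-8"):
    """Escape every octet that is not an RFC 2396 unreserved character.

    The encoding argument is an assumption; the RFC leaves it open.
    """
    out = []
    for octet in text.encode(encoding):
        ch = chr(octet)
        if ch in UNRESERVED:
            out.append(ch)
        else:
            # "%" followed by two hexadecimal digits for the octet
            out.append("%{:02X}".format(octet))
    return "".join(out)
```

Because "%" is not in the unreserved set, it comes out as "%25", so calling the function twice on the same string double-escapes it. That is the pitfall Section 2.4.2 warns about.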
> RFC 2396 Section 2.4.2 notes that the percent character must itself be
> escaped if the intent is to include it literally in the URI:
> 
>    Because the percent "%" character always has the reserved purpose of
>    being the escape indicator, it must be escaped as "%25" in order to
>    be used as data within a URI.  Implementers should be careful not to
>    escape or unescape the same string more than once, since unescaping
>    an already unescaped string might lead to misinterpreting a percent
>    data character as another escaped character, or vice versa in the
>    case of escaping an already escaped string.
> 
> RFC 2616 [2] references RFC 2396, defines the "http:" URI scheme, and
> uses several of the "reserved" characters defined by RFC 2396. For
> example, ":", "/", and "?".
> 
> RFC 2616 Section 3.2.1 cautiously recommends processing URIs of
> unbounded length and defines an error code if the request URI is longer
> than the server can handle:
> 
>    The HTTP protocol does not place any a priori limit on the length of
>    a URI. Servers MUST be able to handle the URI of any resource they
>    serve, and SHOULD be able to handle URIs of unbounded length if they
>    provide GET-based forms that could generate such URIs. A server
>    SHOULD return 414 (Request-URI Too Long) status if a URI is longer
>    than the server can handle (see section 10.4.15).
> 
>       Note: Servers ought to be cautious about depending on URI lengths
>       above 255 bytes, because some older client or proxy
>       implementations might not properly support these lengths.
> 
> In addition to the above caution, it should be noted that accepting HTTP
> requests of unbounded length opens a server to a simple denial of
> service attack.
> 
> RFC 2616 Section 10.4.1 defines the 400 Bad Request status code as a
> response when the request syntax is malformed.
> 
> HTML 4.01 [1] Section 17.13 defines an encoding of HTML form data into a
> request URI for the GET method. A question mark ("?") separates the base
> URI from a representation of each control: control name, equals sign
> ("="), control value. Controls are separated by an ampersand ("&") or
> semicolon (";").
> 
> HTML 4.01 Section 17 specifies two additional character escape
> recommendations beyond RFC 2396. First, HTML 4.01 defines the plus sign
> ("+") as a replacement for the space character; both this and the RFC
> 2396 defined "%20" seem to be used in practice. Second, HTML 4.01
> defines line breaks as carriage return, line feed pairs, to be escaped
> per RFC 2396.
> 
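To make the two space conventions concrete, here is a small sketch using Python's standard urllib.parse module. The base URI http://example.com/myURL and the control names are hypothetical and chosen only for illustration.

```python
from urllib.parse import quote, urlencode

# Hypothetical form controls as (name, value) pairs.
controls = [("Region", "1"), ("Name", "John Doe")]
base = "http://example.com/myURL"  # hypothetical base URI

# HTML 4.01 form style: "?" after the base URI, name=value pairs
# joined by "&", and the space character replaced by "+".
form_style = base + "?" + urlencode(controls)

# Same layout, but with the space escaped as "%20" per RFC 2396.
rfc_style = base + "?" + urlencode(controls, quote_via=quote)

print(form_style)  # http://example.com/myURL?Region=1&Name=John+Doe
print(rfc_style)   # http://example.com/myURL?Region=1&Name=John%20Doe
```

A receiver that wants to accept both forms has to treat "+" in the query as a space before unescaping, which is one reason to settle on a single convention.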
> WSDL 1.1 [4] allows defining messages that have aggregate data, i.e.,
> of XML Schema complexType.
> 
> WSDL 1.1 defines a binding for HTTP 1.1's GET method. In this binding,
> the data in the input message is encoded as a request URI with an
> "http:" scheme.
> 
> WSDL 1.1 defines two binding extensions in the
> http://schemas.xmlsoap.org/wsdl/http/ namespace to indicate how message
> parts are to be encoded into an HTTP request URI.
> 
> urlEncoded indicates that all message parts are to be encoded using the
> HTML 4.01 form GET request URI syntax, treating part names (values) as
> control names (values), respectively.
> 
> urlReplacement indicates that values for message parts are substituted
> into a pattern supplied in the location attribute of the operation
> extension element.
> 
> The W3C Web Services Description WG MUST define a normative description
> of an HTTP GET and POST binding for the next version of WSDL.
> 
> PROPOSAL
> 
> Define a new binding extension element urlXML in the HTTP binding
> namespace.
> 
> The urlXML element indicates that the message part is encoded into the
> HTTP request URI as XML. All but "unreserved" characters are escaped per
> RFC 2396 Section 2.4. Line breaks are encoded per HTML 4.01 Section 17.
> (Spaces are encoded per RFC 2396.)
> 
> If a server un-escapes the request URI and the result is not well-formed
> XML, the server SHOULD respond with 400 Bad Request.
> 
> In the simplest variation of this proposal, only one style and encoding
> would be defined. However each of the existing variations could be
> defined. For instance, if the style is document, there is no wrapper
> element, and if the message has > 1 part, the resulting XML document
> will have > 1 root. If the style is rpc, the wrapper element is named
> after the operation, and the resulting XML document always has == 1
> root.
> 
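A rough sketch of the server-side check, in Python. Since document style can yield more than one root element, this sketch wraps the un-escaped text in a synthetic element before parsing. That wrapping, the element name "wrapper", and the function name status_for_escaped_xml are assumptions of the sketch and not part of the proposal.

```python
import xml.etree.ElementTree as ET
from urllib.parse import unquote


def status_for_escaped_xml(escaped):
    """Return 400 if the un-escaped text does not parse as XML, else 200."""
    # Un-escape exactly once; a second pass could misread "%25" data.
    xml_text = unquote(escaped)
    try:
        # Assumption: a synthetic wrapper tolerates multiple roots
        # (document style with more than one part).
        ET.fromstring("<wrapper>" + xml_text + "</wrapper>")
    except ET.ParseError:
        return 400  # Bad Request
    return 200
```

For the rpc style the wrapper would be unnecessary, because the operation-named element already provides a single root.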
> EXAMPLE
> 
> Consider the following hypothetical example of a Web Service Operation
> to register a new user from some regional office. Here is the data,
> message, operation, and binding definition in WSDL. (XML namespace
> declarations are omitted for the sake of brevity.)
> 
> <complexType name="myType">
>   <sequence>
>     <element name="Name" type="string"/>
>     <element name="BirthDate" type="date"/>
>   </sequence>
>   <attribute name="sex" type="string"/>
> </complexType>
> 
> <message name="myMessage">
>   <part name="Region" type="xsd:string"/>
>   <part name="NewUser" type="tns:myType"/>
> </message>
> <message name="m2">...</message>
> 
> <portType name="myPort">
>   <operation name="myOperation">
>     <input message="tns:myMessage"/>
>     <output message="tns:m2"/>
>   </operation>
> </portType>
> 
> <binding name="myBinding">
>   <http:binding verb="GET"
>       style="document"/> <!-- proposed -->
>   <operation name="myOperation">
>     <http:operation location="myURL"/>
>     <input>
>       <http:urlXML use="literal"/> <!-- proposed -->
>     </input>
>     <output>...</output>
>   </operation>
> </binding>
> 
> A request is encoded into a request URI in two stages. First, the
> message is encoded using the indicated style (document, no wrapper) into
> XML.
> 
> <Region>1</Region>
> <NewUser sex='male'>
>   <Name>John Doe</Name>
>   <BirthDate>1960-01-01</BirthDate>
> </NewUser>
> 
> Then all but "unreserved" characters are escaped in the request URI.
> (To make the example readable, white space and line breaks have not been
> escaped, but they would be in an actual request URI.)
> 
> %3CRegion%3E1%3C%2FRegion%3E
> %3CNewUser sex%3D'male'%3E
>   %3CName%3EJohn Doe%3C%2FName%3E
>   %3CBirthDate%3E1960-01-01%3C%2FBirthDate%3E
> %3C%2FNewUser%3E
> 
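The second stage can be reproduced with Python's standard urllib.parse.quote, as a sketch. The XML is written here without line breaks, the helper name encode_message_xml is hypothetical, and how the escaped text attaches to the location URI is left out because the proposal does not specify it.

```python
from urllib.parse import quote, unquote

# RFC 2396 "mark" characters; alphanumerics are always left alone by quote().
MARKS = "-_.!~*'()"


def encode_message_xml(xml_text):
    """Escape everything except RFC 2396 unreserved characters."""
    # Passing safe=MARKS replaces quote()'s default safe="/", so "/"
    # is escaped as %2F like in the example above.
    return quote(xml_text, safe=MARKS)


# Stage 1 output from the example, without line breaks.
message_xml = (
    "<Region>1</Region>"
    "<NewUser sex='male'>"
    "<Name>John Doe</Name>"
    "<BirthDate>1960-01-01</BirthDate>"
    "</NewUser>"
)

# Stage 2: escape for use in the request URI.
escaped = encode_message_xml(message_xml)
print(escaped)
print(len(escaped))  # worth comparing against the 255-byte caution
```

Un-escaping once with unquote(escaped) returns the original XML text, which is what the server-side well-formedness check would operate on.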
> REFERENCES
> 
> [1] http://www.w3.org/TR/html401/cover.html
> [2] ftp://ftp.isi.edu/in-notes/rfc2616.txt
> [3] ftp://ftp.isi.edu/in-notes/rfc2396.txt
> [4] http://www.w3.org/TR/wsdl
> 
> EOF
Received on Monday, 29 April 2002 14:07:58 UTC