Issues 6a, 6d, 41. Define encoding into a request URL

These three non-SOAP HTTP binding issues seem to go together. Does the
proposal below sound reasonable?

--Jeff

Issue 6a: Define encoding of complex types into a request URL
Issue 6d: Define encoding for URL-sensitive characters into a request
URL
Issue 41: Define encoding of attributes into a request URL

PROBLEM

It is not clear how to encode some of the data types and/or characters
in WSDL-defined messages into an HTTP GET request URI.

BACKGROUND

RFC 2396 [3] Section 2.4 defines an overall guideline for including data
characters in a URI. Quoting from the RFC:

   Data must be escaped if it does not have a representation using an
   unreserved character; this includes data that does not correspond to
   a printable character of the US-ASCII coded character set, or that
   corresponds to any US-ASCII character that is disallowed ...

RFC 2396 Section 2.3 defines "unreserved" characters:

      unreserved  = alphanum | mark

      mark        = "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"

RFC 2396 Section 2.4.1 defines a means to escape characters:

   An escaped octet is encoded as a character triplet, consisting of the
   percent character "%" followed by the two hexadecimal digits
   representing the octet code. For example, "%20" is the escaped
   encoding for the US-ASCII space character.

      escaped     = "%" hex hex
      hex         = digit | "A" | "B" | "C" | "D" | "E" | "F" |
                            "a" | "b" | "c" | "d" | "e" | "f"

RFC 2396 Section 2.4.2 notes that the percent character must itself be
escaped if the intent is to include it literally in the URI:

   Because the percent "%" character always has the reserved purpose of
   being the escape indicator, it must be escaped as "%25" in order to
   be used as data within a URI.  Implementers should be careful not to
   escape or unescape the same string more than once, since unescaping
   an already unescaped string might lead to misinterpreting a percent
   data character as another escaped character, or vice versa in the
   case of escaping an already escaped string.
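
As a rough illustration of the escaping rules quoted above, the following Python sketch escapes every character outside the RFC 2396 "unreserved" set and then shows the double-escaping hazard from Section 2.4.2. The function name escape_rfc2396 is made up for this example, and converting characters to UTF-8 octets before escaping is an assumption that RFC 2396 itself does not mandate.

```python
from urllib.parse import quote, unquote

# RFC 2396 "mark" characters; quote() never escapes ASCII letters or digits.
RFC2396_MARK = "-_.!~*'()"

def escape_rfc2396(text: str) -> str:
    """Escape every character that is not RFC 2396 'unreserved'.

    Assumption: non-ASCII characters are turned into UTF-8 octets first.
    """
    return quote(text, safe=RFC2396_MARK, encoding="utf-8")

print(escape_rfc2396("<Name>John Doe</Name>"))  # %3CName%3EJohn%20Doe%3C%2FName%3E

# Escaping an already escaped string changes the data it represents.
once = escape_rfc2396("50%")   # 50%25
twice = escape_rfc2396(once)   # 50%2525
print(unquote(once))           # 50%
print(unquote(twice))          # 50%25
```

A single unescape of the doubly escaped value yields "50%25" rather than the original "50%", which is the misinterpretation the RFC warns about.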

RFC 2616 [2] references RFC 2396, defines the "http:" URI scheme, and
uses several of the "reserved" characters defined by RFC 2396, for
example ":", "/", and "?".

RFC 2616 Section 3.2.1 cautiously recommends processing URIs of
unbounded length and defines an error code if the request URI is longer
than the server can handle:

   The HTTP protocol does not place any a priori limit on the length of
   a URI. Servers MUST be able to handle the URI of any resource they
   serve, and SHOULD be able to handle URIs of unbounded length if they
   provide GET-based forms that could generate such URIs. A server
   SHOULD return 414 (Request-URI Too Long) status if a URI is longer
   than the server can handle (see section 10.4.15).

      Note: Servers ought to be cautious about depending on URI lengths
      above 255 bytes, because some older client or proxy
      implementations might not properly support these lengths.

In addition to the above caution, it should be noted that accepting HTTP
requests of unbounded length opens a server to a simple denial of
service attack.

RFC 2616 Section 10.4.1 defines the 400 Bad Request status code as a
response when the request syntax is malformed.

HTML 4.01 [1] Section 17.13 defines an encoding of HTML form data into a
request URI for the GET method. A question mark ("?") separates the base
URI from a representation of each control: control name, equals sign
("="), control value. Controls are separated by an ampersand ("&") or
semicolon (";").

HTML 4.01 Section 17 specifies two additional character escape
recommendations beyond RFC 2396. First, HTML 4.01 defines the plus sign
("+") as a replacement for the space character; both this and the RFC
2396 defined "%20" seem to be used in practice. Second, HTML 4.01
defines line breaks as carriage return, line feed pairs, to be escaped
per RFC 2396.
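
The two space conventions and the line break rule can be compared with a short Python sketch. The control names and values here are invented for illustration; only the encoding behavior is the point.

```python
from urllib.parse import urlencode, quote

# Hypothetical form controls; "note" contains a CR LF line break.
controls = {"name": "John Doe", "note": "line1\r\nline2"}

# HTML 4.01 form style: space becomes "+".
form_style = urlencode(controls)

# RFC 2396 style: space becomes "%20".
rfc_style = urlencode(controls, quote_via=quote)

print(form_style)  # name=John+Doe&note=line1%0D%0Aline2
print(rfc_style)   # name=John%20Doe&note=line1%0D%0Aline2
```

In both styles the carriage return, line feed pair is escaped as "%0D%0A".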

WSDL 1.1 [4] allows defining messages that have aggregate data, i.e., of
XML Schema complexType.

WSDL 1.1 defines a binding for HTTP 1.1's GET method. In this binding,
the data in the input message is encoded as a request URI with an
"http:" scheme.

WSDL 1.1 defines two binding extensions in the
http://schemas.xmlsoap.org/wsdl/http/ namespace to indicate how message
parts are to be encoded into an HTTP request URI.

urlEncoded indicates that all message parts are to be encoded using the
HTML 4.01 form GET request URI syntax, treating each part name as a
control name and each part value as the corresponding control value.

urlReplacement indicates that values for message parts are substituted
into a pattern supplied in the location attribute of the operation
extension element.
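
The difference between the two existing extensions might be sketched in Python as follows. This is an illustration only: the function names, the simple "Name" part, and the location pattern are hypothetical, and the "(partName)" placeholder syntax is assumed to be how the location pattern marks substitution points.

```python
import re
from urllib.parse import urlencode, quote

# Hypothetical message with two simple-typed parts.
parts = {"Region": "1", "Name": "John Doe"}

def url_encoded(location: str, parts: dict) -> str:
    """urlEncoded: append name=value pairs in HTML form GET syntax."""
    return location + "?" + urlencode(parts)

def url_replacement(pattern: str, parts: dict) -> str:
    """urlReplacement: substitute each "(partName)" with the escaped value.

    Assumption: placeholders are part names wrapped in parentheses.
    """
    return re.sub(
        r"\(([^()]+)\)",
        lambda m: quote(parts[m.group(1)], safe=""),
        pattern,
    )

print(url_encoded("myURL", parts))                       # myURL?Region=1&Name=John+Doe
print(url_replacement("myURL/(Region)/(Name)", parts))   # myURL/1/John%20Doe
```

Neither approach says what to do when a part value is a complexType, which is the gap Issue 6a describes.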

The W3C Web Services Description WG MUST define a normative description
of an HTTP GET and POST binding for the next version of WSDL.

PROPOSAL

Define a new binding extension element urlXML in the HTTP binding
namespace. 

The urlXML element indicates that the message part is encoded into the
HTTP request URI as XML. All characters other than "unreserved"
characters are escaped per RFC 2396 Section 2.4. Line breaks are encoded
as carriage return, line feed pairs per HTML 4.01 Section 17. Spaces are
encoded as "%20" per RFC 2396, not as "+".

If a server un-escapes the request URI and the result is not well-formed
XML, the server SHOULD respond with 400 Bad Request.
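
A server-side check along these lines could look like the following Python sketch. The function name status_for_query is hypothetical, and the sketch assumes the unescaped text is expected to be a single-rooted XML document, as in the rpc-style variation; text with more than one root element would be rejected by this particular check.

```python
from urllib.parse import unquote
import xml.etree.ElementTree as ET

def status_for_query(escaped_xml: str) -> int:
    """Return 400 if the unescaped text is not well-formed XML, else 200.

    Assumption: the request carries exactly one root element.
    """
    try:
        ET.fromstring(unquote(escaped_xml))
    except ET.ParseError:
        return 400  # Bad Request, RFC 2616 Section 10.4.1
    return 200

print(status_for_query("%3Ca%3E1%3C%2Fa%3E"))  # 200
print(status_for_query("%3Ca%3E1"))            # 400
```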

In the simplest variation of this proposal, only one style and encoding
would be defined. However, each of the existing variations could be
defined. For instance, if the style is document, there is no wrapper
element, so a message with more than one part yields XML with more than
one root element. If the style is rpc, the wrapper element is named
after the operation, and the resulting XML always has exactly one root
element.

EXAMPLE

Consider the following hypothetical example of a Web Service Operation
to register a new user from some regional office. Here is the data,
message, operation, and binding definition in WSDL. (XML namespace
declarations are omitted for the sake of brevity.)

<complexType name="myType">
  <sequence>
    <element name="Name" type="string"/>
    <element name="BirthDate" type="date"/>
  </sequence>
  <attribute name="sex" type="string"/>
</complexType>

<message name="myMessage">
  <part name="Region" type="xsd:string"/>
  <part name="NewUser" type="tns:myType"/>
</message>
<message name="m2">...</message>

<portType name="myPort">
  <operation name="myOperation">
    <input message="tns:myMessage"/>
    <output message="tns:m2"/> 
  </operation>
</portType>

<binding name="myBinding">
  <http:binding verb="GET"
      style="document"/> <!-- proposed -->
  <operation name="myOperation">
    <http:operation location="myURL"/>
    <input>
      <http:urlXML use="literal"/> <!-- proposed -->
    </input>
    <output>...</output>
  </operation>
</binding>

A request is encoded into a request URI in two stages. First, the
message is encoded using the indicated style (document, no wrapper) into
XML.
 
<Region>1</Region>
<NewUser sex='male'>
  <Name>John Doe</Name>
  <BirthDate>1960-01-01</BirthDate>
</NewUser>

Then all but "unreserved" characters are escaped in the request URI. (To
make the example readable, white space and line breaks have not been
escaped, but they would be in an actual request URI.)

%3CRegion%3E1%3C%2FRegion%3E
%3CNewUser sex%3D'male'%3E
  %3CName%3EJohn Doe%3C%2FName%3E
  %3CBirthDate%3E1960-01-01%3C%2FBirthDate%3E
%3C%2FNewUser%3E
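
For completeness, here is a Python sketch of the second stage with white space and line breaks escaped as well. The function name encode_url_xml is hypothetical. How the escaped text is attached to the location (for example after a "?") is not stated in the proposal, so the sketch produces only the escaped text.

```python
from urllib.parse import quote, unquote

# The stage-one XML from the example above.
message_xml = (
    "<Region>1</Region>\n"
    "<NewUser sex='male'>\n"
    "  <Name>John Doe</Name>\n"
    "  <BirthDate>1960-01-01</BirthDate>\n"
    "</NewUser>"
)

def encode_url_xml(xml_text: str) -> str:
    """Escape all but RFC 2396 'unreserved' characters.

    Line breaks are normalized to CR LF pairs first (HTML 4.01 Section 17),
    and spaces become "%20" rather than "+".
    """
    with_crlf = "\r\n".join(xml_text.splitlines())
    return quote(with_crlf, safe="-_.!~*'()")

escaped = encode_url_xml(message_xml)
print(escaped)
# The output begins: %3CRegion%3E1%3C%2FRegion%3E%0D%0A%3CNewUser%20sex%3D'male'%3E
```

Unescaping the result once recovers the original XML with CR LF line breaks.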

REFERENCES

[1] http://www.w3.org/TR/html401/cover.html 
[2] ftp://ftp.isi.edu/in-notes/rfc2616.txt
[3] ftp://ftp.isi.edu/in-notes/rfc2396.txt
[4] http://www.w3.org/TR/wsdl

EOF

Received on Friday, 26 April 2002 22:14:01 UTC