RE: Issues with Packaging Application Payloads from Mark A. Jones on 2000-10-19 (xml-dist-app@w3.org from October 2000)

From: Mark A. Jones <jones@research.att.com>
Date: Thu, 19 Oct 2000 13:56:36 -0400
To: xml-dist-app@w3.org
Message-ID: <39EF35D3.50464256@research.att.com>
> RE: Issues with Packaging Application Payloads
>
> From: HUGHES,MARK (Non-HP-FtCollins,ex1) (mark_hughes@non.hp.com)
> Date: Wed, Oct 18 2000
>
>   Then there's #4, the *RIGHT WAY* to do this, which is:
> A) Before inserting your arbitrary text into your XML wrapper, run it
> through a filter that replaces & with &amp;, < with &lt;, and > with &gt;.
> B) Before handing arbitrary text back to the user, run it through a filter
> that replaces &lt; with <, &gt; with >, and &amp; with &.
>
>   Voila, the problem is solved.  You don't have the byte bloat of base64,
> you don't have the limitation of not including ]]> in CDATA, and you don't
> have to mess up validation.
>
>   XML is 8-bit clean (through UTF-8/16), so you can even send binary this
> way (though admittedly, at 50% bloat for 128-255, as compared to the 33%
> bloat of base64).
>
>   It's easy.  It's nigh-perfect.  Why would anyone NOT do this?
>
> --
>  <a href="http://kuoi.asui.uidaho.edu/~kamikaze/"> Mark Hughes </a>
>

Basically, the two approaches to packaging are delimiting and byte/character counting.

Some protocols send a byte count followed by that many bytes.  The drawback is that for dynamically
generated data the byte count is not known in advance.
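
As a rough sketch of the counting approach (the 4-byte big-endian length header and the function
names write_frame/read_frame are my own assumptions for illustration, not taken from any
particular protocol):

```python
import io
import struct

def write_frame(stream, payload: bytes) -> None:
    # The whole payload must already be in hand so that its length can be sent first.
    stream.write(struct.pack(">I", len(payload)))
    stream.write(payload)

def read_frame(stream) -> bytes:
    header = stream.read(4)
    if len(header) != 4:
        raise EOFError("truncated length header")
    (length,) = struct.unpack(">I", header)
    payload = stream.read(length)
    if len(payload) != length:
        raise EOFError("truncated payload")
    return payload

# Round trip through an in-memory stream; no escaping of the payload is needed.
buf = io.BytesIO()
write_frame(buf, b"text with ]]> and <tags> & more")
buf.seek(0)
print(read_frame(buf))
```

Note that write_frame needs len(payload) before the first byte goes out, which is exactly the
problem for content that is produced on the fly.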

XML CDATA, MIME's boundary-strings, and SMTP's dot termination are all delimiting examples.
Delimiting schemes typically allow you to explicitly escape embedded delimiters (e.g., dot-stuffing
in SMTP or the backslash character in many programming languages).  Others, like CDATA, have no
escape mechanism, so you must split the content across adjacent sections to break up the would-be
terminator.
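
To make the two delimiting styles concrete, here is a small sketch.  The helper names (dot_stuff,
dot_unstuff, wrap_cdata) are hypothetical; the dot-stuffing follows the usual SMTP convention of
doubling a leading dot, and the CDATA helper splits every "]]>" across two adjacent sections.

```python
import xml.etree.ElementTree as ET

def dot_stuff(lines):
    # Explicit escaping: a line starting with "." gets one extra leading ".".
    return ["." + line if line.startswith(".") else line for line in lines]

def dot_unstuff(lines):
    # Receiver side: drop one leading "." from any line that starts with one.
    return [line[1:] if line.startswith(".") else line for line in lines]

def wrap_cdata(text):
    # No escape exists inside CDATA, so end the section after "]]" and
    # start a new one that begins with ">".
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"

print(dot_stuff(["hello", ".", "..hidden"]))
wrapped = wrap_cdata("if (a[b[0]]>c) { ... }")
print(wrapped)
# An ordinary XML parser rejoins the adjacent sections into the original text.
print(ET.fromstring("<p>" + wrapped + "</p>").text)
```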

Mark's approach is the typical one taken with XML application payloads in SOAP.  It takes care of
embedded CDATA delimiters, which become "]]&gt;".  Encoding and decoding can be done on the fly for
dynamically generated content.  It does not solve the byte bloat issue for binary data, which
must still be encoded and decoded.
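
A minimal sketch of the filter pair, using escape() and unescape() from Python's
xml.sax.saxutils (the generator name escape_chunks and the sample payload are made up for
illustration).  Escaping can be applied chunk by chunk because each character is replaced
independently of its neighbors:

```python
from xml.sax.saxutils import escape, unescape

def escape_chunks(chunks):
    # Escape &, < and > in each chunk as it is produced; no lookahead is needed.
    for chunk in chunks:
        yield escape(chunk)

# Payload produced in two pieces, including an embedded CDATA terminator.
payload = ["<order>a & b", " ]]> </order>"]
wire = "".join(escape_chunks(payload))
print(wire)  # &lt;order&gt;a &amp; b ]]&gt; &lt;/order&gt;

# The receiving side reverses the replacements.
assert unescape(wire) == "".join(payload)
```

Decoding chunk by chunk is less simple, since an entity such as "&amp;" may straddle a chunk
boundary; in practice the XML parser on the receiving end does that decoding for you.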

There are two ways to handle such a payload with a SAX-style parser:

1) Use two applications (or a recursive application) -- one for the XML protocol processing and a
separate one for the application data after the replacements have taken place.  There may be
distinct advantages to this arrangement in terms of re-establishing an appropriate execution
environment.

2) Have the elements that contain the &lt;/&gt;/&amp; encodings directly flag the parser's CDATA
handler, tokenization and parsing routines to do the decoding and recursive parsing into the CDATA
section (as though it weren't quoted) while obeying the CDATA terminator (as though it were
quoted).  [essentially building in a meta-interpretation feature to the SAX parsing model]  With
this approach, interpretation could proceed incrementally without having to hit the end of CDATA,
do substitutions, and explicitly invoke the XML parser on the decoded content.  This approach also
allows a more fluid interaction of document features (lexical-scoping, id processing, etc.) between
the XP and application data environments, but it isn't clear whether this is good or bad.  It would
also mean revising or extending existing SAX parsers, which might be problematic.
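
Approach 1 can be sketched with Python's stock xml.sax module.  The element names (envelope,
body, order, item) and the handler classes are hypothetical; the point is only that the outer
parser hands back the payload with the &lt;/&gt;/&amp; replacements already undone, and a second,
separate parse then runs over that text.

```python
import xml.sax

class TextCollector(xml.sax.ContentHandler):
    """Outer (protocol) handler: gathers the character data of one element."""
    def __init__(self, target):
        super().__init__()
        self.target = target
        self.depth = 0
        self.parts = []
    def startElement(self, name, attrs):
        if name == self.target:
            self.depth += 1
    def endElement(self, name):
        if name == self.target:
            self.depth -= 1
    def characters(self, content):
        # The parser has already turned &lt; &gt; &amp; back into < > &.
        if self.depth:
            self.parts.append(content)

class NameCollector(xml.sax.ContentHandler):
    """Inner (application) handler: records element names in document order."""
    def __init__(self):
        super().__init__()
        self.names = []
    def startElement(self, name, attrs):
        self.names.append(name)

envelope = b"<envelope><body>&lt;order&gt;&lt;item/&gt;&lt;/order&gt;</body></envelope>"

outer = TextCollector("body")
xml.sax.parseString(envelope, outer)
inner_doc = "".join(outer.parts)          # "<order><item/></order>"

inner = NameCollector()
xml.sax.parseString(inner_doc.encode("utf-8"), inner)
print(inner.names)                        # ['order', 'item']
```

This version waits for the whole payload before the second parse starts.  Approach 2 is not
sketched here because it requires changes inside the parser itself.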

Mark A. Jones
AT&T Labs - Research
Shannon Laboratory
Room A201
180 Park Ave.
Florham Park, NJ  07932-0971

email: jones@research.att.com
phone: (973) 360-8326
  fax: (973) 360-8970
Received on Thursday, 19 October 2000 13:56:39 UTC