Potential New Issue: The Boundaries of Content Coding from Robin Berjon on 2006-09-02 (www-tag@w3.org from September 2006)

From: Robin Berjon <robin.berjon@expway.fr>
Date: Sat, 2 Sep 2006 18:15:41 +0200
To: www-tag@w3.org
Message-Id: <7FFED4F9-D957-486D-83FE-A3BDFCC34AF9@expway.fr>
Dear all,

this is an issue that I know has been discussed here and there, but I
don't recall it being brought up as a TAG issue proper, nor have I
seen any final resolution on the matter.

I'm thinking about it largely from an XML and efficient XML  
background, but I believe it applies equally to the variety of RDF  
syntaxes hanging around. For context, this is how it is discussed in  
the XBC Properties Document under Content Type Management[0]:

"""
The media type and encoding infrastructure provides for a common and  
simple way of identifying the contents of a document and the content  
coding with which it is transmitted. It is fundamental to the  
functioning of the Web and enables powerful features such as content  
negotiation. While required for the Web, these mechanisms are not  
specific to it and are typically reused in many other situations.
It is therefore desirable that formats meant to be used on the Web  
define (and preferably register) the media type and/or encoding that  
one is to use when transmitting them.

There are multiple ways in which an alternate XML format could define  
how media types and encodings are to be used with it. Several options  
of note and their associated trade-offs are:

     • The alternate XML serialization is considered to just be a  
content coding. In this case it may have a media type (as gzip does  
with 'application/gzip' in addition to the 'gzip' content coding) but  
the principal way of using it is to keep the original media type of  
the XML content and only change the content coding. The upside of  
this approach is that the existing content dispatching system is  
untouched, that the media type information is fully useful, and that  
the content coding infrastructure is put to good use. The downside is  
that there is philosophical and technical dissent as to whether an  
alternate XML serialization is an encoding in the way that gzip is —a  
discussion that needs to involve considerations concerning the 5.22  
Roundtrip Support[1], 5.5 Directly Readable and Writable[2], and 5.16  
Integratable into XML Stack[3] properties. With this approach content  
negotiation is fully possible. The behaviour of fragment identifiers  
does not need to be re-specified.

     • The alternate XML format is not a mere content coding but  
requires the definition of one or more media types. This case  
subdivides into two options:

       o There is only the alternate XML format's media type. Any  
content sent using that format must have that media type. The upside  
of this approach is that it is simple. The downside is that you lose  
all media type information of the original XML content so that you  
must then define another system to provide that information, or  
define new media types for all possible content
(application/binxhtml, image/binsvg, etc.). With this content negotiation is
entirely impossible (or rather, totally useless) unless new media  
types are defined for all things XML. The behaviour of fragment  
identifiers becomes impossible to specify, or has to be re-specified  
for all the new media types.

       o A new media type suffix is defined in the manner that it was  
done for XML content (e.g., "+bix") to be used for all content  
expressed using the alternate XML serialization. The upside of this  
approach is that it's simple and that the diversity of media types is  
maintained. The downside is that it requires much more intrusive  
modifications to systems that rely on existing media types. With this  
content negotiation is possible, but with lesser power. The behaviour  
of fragment identifiers has to be re-specified to map back to the one  
in +xml types.
"""
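
To make the three options quoted above a little more concrete, here is
a rough sketch of the HTTP headers each one might produce for an SVG
document. The "bix" content-coding token and the "application/bix"
media type are hypothetical names invented for this illustration; only
the "+bix" suffix appears in the quoted text, and none of these is
registered anywhere.

```python
def headers_for(option: str, xml_media_type: str = "image/svg+xml") -> dict:
    """Return illustrative response headers for one of the three options."""
    if not xml_media_type.endswith("+xml"):
        raise ValueError("expected a +xml media type")

    if option == "content-coding":
        # Media type untouched; only the (hypothetical) content coding changes.
        return {"Content-Type": xml_media_type, "Content-Encoding": "bix"}

    if option == "single-media-type":
        # One (hypothetical) type for everything: the original type is lost.
        return {"Content-Type": "application/bix"}

    if option == "suffix":
        # Swap the +xml suffix for the +bix suffix used as an example above.
        base = xml_media_type[: -len("+xml")]
        return {"Content-Type": base + "+bix"}

    raise ValueError("unknown option: " + option)


for name in ("content-coding", "single-media-type", "suffix"):
    print(name, headers_for(name))
```

Only the first option leaves Content-Type as it is today, which is
why content negotiation and fragment identifiers are unaffected in
that case.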


In short, I think that it boils down to defining what exactly may
constitute a content coding. The gzip case is simple: it reproduces the
original content byte for byte, which firmly places it in the content
coding basket.
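
The byte-for-byte property of gzip is easy to check. A minimal sketch
using Python's standard gzip module (the sample SVG string is made up
for the example):

```python
import gzip

original = b'<svg xmlns="http://www.w3.org/2000/svg"><rect width="10" height="5"/></svg>'

encoded = gzip.compress(original)    # what would travel with Content-Encoding: gzip
decoded = gzip.decompress(encoded)   # what the recipient gets back

# The decoded bytes are identical to the original, not merely equivalent.
print(decoded == original)
```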

And it's extremely tempting to just stop there: if an encoding loses
even one byte from the original physical representation of the
content, no matter how irrelevant that byte may be, then it's not a
content coding and actually defines a new media type. Temptingly
simple, but as described above, potentially extremely impractical too.

Taking the SVG case as an example, there's a lot of information that
can be lost with no impact. First there's everything that does not
typically matter in XML. By that I don't mean genuine data model
constructs such as comments, but the parts that are normally ignored:
e.g. attribute order, white space between attributes, the difference
between the two ways of writing an empty element (<a/> versus
<a></a>). And then there's a lot specific to SVG that can be modified
with no impact either, for instance the exact syntax of path data. Can
an encoding that optimises those away still honestly count itself as
an HTTP content coding?
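
As an illustration of the kind of differences meant here, the sketch
below compares two serializations that differ in attribute order,
white space between attributes, and empty-element form, and two
spellings of the same path data. The same_tree and path_tokens helpers
are made up for this example, and the path tokenizer is deliberately
simplified; it is not the full SVG path grammar.

```python
import re
import xml.etree.ElementTree as ET


def same_tree(a: ET.Element, b: ET.Element) -> bool:
    """Compare tag, attributes (order-insensitive), text, tail and children."""
    return (
        a.tag == b.tag
        and a.attrib == b.attrib
        and (a.text or "") == (b.text or "")
        and (a.tail or "") == (b.tail or "")
        and len(a) == len(b)
        and all(same_tree(x, y) for x, y in zip(a, b))
    )


def path_tokens(d: str) -> list:
    """Simplified tokenizer: command letters and numbers, separators dropped."""
    raw = re.findall(r"[A-Za-z]|[-+]?(?:\d+\.?\d*|\.\d+)", d)
    return [t if t.isalpha() else float(t) for t in raw]


doc_a = '<svg xmlns="http://www.w3.org/2000/svg"><rect width="10" height="5"/></svg>'
doc_b = '<svg xmlns="http://www.w3.org/2000/svg"><rect   height="5"  width="10"></rect></svg>'

bytes_equal = doc_a == doc_b
xml_equal = same_tree(ET.fromstring(doc_a), ET.fromstring(doc_b))
paths_equal = path_tokens("M 10 10 L 20 20") == path_tokens("M10,10L20,20")

print(bytes_equal, xml_equal, paths_equal)
```

The two documents differ as byte strings yet parse to the same tree,
and the two path strings tokenize to the same commands and
coordinates. An encoding that kept only the parsed form could not
reproduce either original byte for byte.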

If adding new media types were a zero-cost operation, the solution
would be simple: just add new ones, probably using some form of +exi
suffix, yielding image/svg+exi. But it is not: there are many XML
types, and given that one of the primary goals of EXI is to disrupt
the existing ecosystem as little as possible, incurring this cost
looks like a bad idea.
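
A rough sketch of where that cost shows up, assuming a recipient that
dispatches on media type. The handler table, the "exi" content-coding
token, and decode_compact are all hypothetical stand-ins for
illustration, not an actual EXI API.

```python
# Existing dispatch table, keyed on today's media types (handlers are stubs).
HANDLERS = {
    "image/svg+xml": lambda xml_bytes: "rendered as SVG",
    "application/xhtml+xml": lambda xml_bytes: "rendered as XHTML",
}


def decode_compact(body: bytes) -> bytes:
    """Hypothetical stand-in for an efficient-XML decoder; a no-op here."""
    return body


def handle_with_content_coding(media_type: str, content_coding: str, body: bytes) -> str:
    # The coding is undone first; the dispatch table is used unchanged.
    if content_coding == "exi":  # hypothetical content-coding token
        body = decode_compact(body)
    return HANDLERS[media_type](body)


def handle_with_suffix(media_type: str, body: bytes) -> str:
    # Every dispatcher has to learn to map the new suffix back to +xml.
    if media_type.endswith("+exi"):
        media_type = media_type[: -len("+exi")] + "+xml"
        body = decode_compact(body)
    return HANDLERS[media_type](body)


print(handle_with_content_coding("image/svg+xml", "exi", b""))
print(handle_with_suffix("image/svg+exi", b""))
print("image/svg+exi" in HANDLERS)
```

With the content-coding approach the table is consulted as before;
with the suffix approach, anything that matches on media types and
has not been updated simply does not recognise image/svg+exi.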

So the question is: if one does not go with the stringent
byte-preserving approach, where do we draw the line? Everything is an
encoding at some level, but one doesn't see image/raw documents being
shuttled around with content coding JPEG. It may be that this problem
is EXI-specific, and that all that is required is for the EXI WG to
draw up some guidelines if it does end up producing a format (or in
fact even if it doesn't, given that people will be using efficient XML
anyway). I would find such a response satisfactory, but I would
prefer that the TAG mull over the issue first and then tell them
that that is the way to go, rather than see the EXI WG produce such
guidelines only for us all to realise later that there is actually a
more general take to be had there, and risk that the general take
contradicts what the EXI folks come up with.

Have a nice week-end!


[0] http://www.w3.org/TR/xbc-properties/#content-type-management
[1] http://www.w3.org/TR/xbc-properties/#roundtrip-support
[2] http://www.w3.org/TR/xbc-properties/#directly-readable-writable
[3] http://www.w3.org/TR/xbc-properties/#integratable-into-xml-stack
-- 
Robin Berjon
    Senior Research Scientist
    Expway, http://expway.com/
Received on Saturday, 2 September 2006 16:16:40 UTC