Re: data URIs - filename and content-disposition from Alessandro Angeli on 2012-03-01 (uri@w3.org from March 2012)

From: Alessandro Angeli <uri.w3.org@riseoftheants.com>
Date: Thu, 01 Mar 2012 12:48:07 -0500
To: <uri@w3.org>
Message-ID: <BBB60F793AA442D4A4913E9853F301B0@kaneda>

I will revive this old thread with a proposal, in case this is ever
going to be implemented.

The original proposal is to add support for the FILENAME and
CONTENT-DISPOSITION params in the MEDIATYPE part of a "data:" URI.

It evolved into more generic support for a HEADERS param.

The former has the benefit of simplicity, the latter of flexibility.
But, at least judging from the discussion of the related proposal in the
Firefox bug-tracking system
(https://bugzilla.mozilla.org/show_bug.cgi?id=532230), neither is easily
implemented because of both parsing and handling limitations.

Moreover, both proposals require the definition of a new param (either
CONTENT-DISPOSITION or HEADERS) that can be applied to all MEDIATYPEs and
the repurposing of the FILENAME param, which originally applies only to
the Content-Disposition header field and not to the MEDIATYPE.

However, as far as I can tell, it is possible to achieve an even more
generic and flexible result than what would be accomplished by the
HEADERS param in a completely standard-compliant way by using the
message/* MEDIATYPE, so that the payload (DATA part) of the "data:" URI
would be a complete message/*, including its header fields.

For example, using message/http, one would have (all in one line):

{data:message/http,HTTP/1.1 200 OK|Content-Type:text/plain;charset=utf-8
|Content-Disposition:attachment;filename=%22hello world.txt%22||HELLO
WORLD}

I used {} to delimit the URI, and spaces and | for readability: in a
real URI each space must be escaped as %20, and each | stands for a line
break (CRLF), escaped as %0D%0A. I also left the reserved chars
unescaped, for the reasons given at the end of this message.
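
To make the escaping concrete, here is a minimal Python sketch that turns
the readable form into a real one-line URI. The helper name
build_data_uri is made up for this illustration, and the status line is
written as "HTTP/1.1 200 OK" on the assumption that a well-formed status
line carries the protocol version. Reserved chars and marks are left
unescaped, as argued at the end of this message.

```python
from urllib.parse import quote

CRLF = "\r\n"

# Characters left unescaped: RFC 2396 reserved chars plus the marks
# that quote() does not already treat as safe.
UNESCAPED = ";/?:@&=+$,!*'()"


def build_data_uri(media_type: str, lines: list[str]) -> str:
    """Join message lines with CRLF and percent-encode the result.

    Hypothetical helper: it only illustrates the escaping rules,
    it is not part of any specification.
    """
    payload = CRLF.join(lines)
    return "data:" + media_type + "," + quote(payload, safe=UNESCAPED)


example_uri = build_data_uri(
    "message/http",
    [
        "HTTP/1.1 200 OK",
        "Content-Type:text/plain;charset=utf-8",
        'Content-Disposition:attachment;filename="hello world.txt"',
        "",  # empty line separating header fields from body
        "HELLO WORLD",
    ],
)
print(example_uri)
```

Spaces come out as %20, line breaks as %0D%0A and the double quotes as
%22, while ":", ";", "=" and "/" stay as they are.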

Using message/rfc822:

{data:message/rfc822,Content-Type:text/plain;charset=utf-8
|Content-Disposition:attachment;filename=%22hello world.txt%22||HELLO
WORLD}
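
As a rough check that such a URI carries everything a consumer needs,
the following Python sketch unescapes the payload and hands it to the
standard email parser. The function name parse_rfc822_data_uri is
hypothetical, and the URI is the example above with spaces and line
breaks escaped.

```python
from email import message_from_bytes
from email.message import EmailMessage
from email.policy import default
from urllib.parse import unquote_to_bytes

PREFIX = "data:message/rfc822,"


def parse_rfc822_data_uri(uri: str) -> EmailMessage:
    """Unescape the DATA part and parse it as an RFC 822 style message.

    Hypothetical helper for illustration only.
    """
    if not uri.startswith(PREFIX):
        raise ValueError("not a data:message/rfc822 URI")
    raw = unquote_to_bytes(uri[len(PREFIX):])
    return message_from_bytes(raw, policy=default)


rfc822_uri = (
    "data:message/rfc822,Content-Type:text/plain;charset=utf-8%0D%0A"
    "Content-Disposition:attachment;filename=%22hello%20world.txt%22"
    "%0D%0A%0D%0AHELLO%20WORLD"
)
msg = parse_rfc822_data_uri(rfc822_uri)
print(msg.get_content_type())  # media type of the embedded resource
print(msg.get_filename())      # filename from Content-Disposition
print(msg.get_content())       # decoded body
```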

The benefits over the HEADERS param would be:

1) no need to define a new param

2) more flexible (you can even specify the HTTP response line)

3) to implement it, I believe it should be possible to simply unescape
the whole payload and pipe the result as an octet-stream into the
browser's HTTP response handler (if using message/http; if using
message/rfc822, a fake response line could be prefixed to the payload to
turn it into a message/http)

4) base64 encoding can be specified for the whole payload or only for
the message/* body, using the usual Content-Transfer-Encoding header
field

5) it is possible to use quoted-printable, which may be more compact
(after all, "=" does not need to be URI-escaped)

6) it is even possible to use gzip compression, which may mitigate the
bloating caused by base64
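
Point 4) can be illustrated with a small Python sketch in which only the
message body is base64-encoded and the header fields stay readable. Both
function names are invented for this example, and the standard email
parser stands in for whatever a browser would actually use.

```python
import base64
from email import message_from_bytes
from email.message import EmailMessage
from email.policy import default
from urllib.parse import quote, unquote_to_bytes

PREFIX = "data:message/rfc822,"


def rfc822_data_uri_with_base64_body(body: bytes, filename: str) -> str:
    """Build a URI whose body alone is base64-encoded (hypothetical helper)."""
    message = "\r\n".join([
        "Content-Type:application/octet-stream",
        "Content-Transfer-Encoding:base64",
        f'Content-Disposition:attachment;filename="{filename}"',
        "",
        base64.b64encode(body).decode("ascii"),
    ])
    # "+", "/" and "=" from the base64 alphabet are left unescaped.
    return PREFIX + quote(message, safe=";/?:@&=+$,")


def decode_rfc822_data_uri(uri: str) -> EmailMessage:
    """Unescape and parse; the parser undoes the transfer encoding."""
    raw = unquote_to_bytes(uri[len(PREFIX):])
    return message_from_bytes(raw, policy=default)


binary_body = bytes(range(256))
binary_uri = rfc822_data_uri_with_base64_body(binary_body, "blob.bin")
decoded = decode_rfc822_data_uri(binary_uri)
print(decoded.get_filename(), len(decoded.get_content()))
```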

The implementation suggested in 3) would be the full embodiment of the
stated purpose of the "data:" URI, which is an inline representation of
an external resource: the header metadata of an HTTP resource is part of
the resource, but the current widespread usage of the "data:" URL can
only represent a subset of the Content-Type header field.
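
The implementation suggested in 3) can be sketched in Python, with the
standard library's http.client.HTTPResponse standing in for a browser's
HTTP response handler. This is only an illustration under that
assumption: _BytesSocket and response_from_data_uri are names invented
here, and the fake status line "HTTP/1.1 200 OK" is my choice.

```python
import io
from http.client import HTTPResponse
from urllib.parse import unquote_to_bytes

FAKE_STATUS_LINE = b"HTTP/1.1 200 OK\r\n"  # assumed fake response line


class _BytesSocket:
    """Minimal socket stand-in: HTTPResponse only calls makefile()."""

    def __init__(self, data: bytes) -> None:
        self._file = io.BytesIO(data)

    def makefile(self, *args, **kwargs):
        return self._file


def response_from_data_uri(uri: str) -> HTTPResponse:
    """Unescape the payload and pipe it into an HTTP response parser."""
    head, _, payload = uri.partition(",")
    raw = unquote_to_bytes(payload)
    if head == "data:message/rfc822":
        # Prefix a fake response line to turn it into a message/http.
        raw = FAKE_STATUS_LINE + raw
    elif head != "data:message/http":
        raise ValueError("expected a message/http or message/rfc822 data URI")
    response = HTTPResponse(_BytesSocket(raw))
    response.begin()  # parses the status line and header fields
    return response


inline_uri = (
    "data:message/rfc822,Content-Type:text/plain;charset=utf-8%0D%0A"
    "Content-Disposition:attachment;filename=%22hello%20world.txt%22"
    "%0D%0A%0D%0AHELLO%20WORLD"
)
inline_response = response_from_data_uri(inline_uri)
inline_status = inline_response.status
inline_disposition = inline_response.getheader("Content-Disposition")
inline_body = inline_response.read()
print(inline_status, inline_disposition, inline_body)
```

From this point on, the code consuming the response does not need to
know whether it came from the network or from a "data:" URI.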

It should also perform no worse than fetching the resource externally
(assuming that unescaping the payload is not slower than transferring it
over a network).

About the unescaped chars, RFC2397:3 claims that URLCHAR is imported
from RFC2396. However, RFC2396 does not have a definition for URLCHAR.
Instead, it defines the following 3 char classes (the definitions are
equivalent to the ones in RFC2396:A, but rewritten in a more
human-readable way):

pchar         = escaped | alphanum | mark
              | ":" | "@" | "&" | "=" | "+" | "$" | ","
uric_no_slash = pchar | ";" | "?"
uric          = pchar | ";" | "?" | "/"

They are used in the following URI parts (again, partially rewritten and
keeping only the ABSOLUTEURI form of the URI-REFERENCE):

URI-reference = scheme ":" (opaque_part | hier_part) ["#" fragment]
opaque_part   = uric_no_slash *uric
hier_part     = ( ["//" authority] [abs_path] ) ["?" query]
abs_path      = "/"  segment *( "/" segment )
segment       = *pchar *( ";" *pchar )
query         = *uric
fragment      = *uric

I would think that a "data:" URI uses the OPAQUE_PART syntax, in which
case the unescaped chars are allowed. But they would also be allowed if
using the HIER_PART one (except maybe in some parts of the AUTHORITY,
which is not used in "data:" anyway).
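
The OPAQUE_PART reading can be checked mechanically. Below is a small
Python sketch that transcribes the char classes above into a regular
expression, with "escaped" taken as "%" followed by two hex digits and
"mark" as the RFC2396 set - _ . ! ~ * ' ( ). The function name
is_valid_opaque_part is made up for this illustration.

```python
import re

ESCAPED = r"%[0-9A-Fa-f]{2}"
ALPHANUM = "A-Za-z0-9"
MARK = r"\-_.!~*'()"

# pchar = escaped | alphanum | mark | ":" | "@" | "&" | "=" | "+" | "$" | ","
PCHAR = rf"(?:{ESCAPED}|[{ALPHANUM}{MARK}:@&=+$,])"
URIC_NO_SLASH = rf"(?:{PCHAR}|[;?])"
URIC = rf"(?:{PCHAR}|[;?/])"

# opaque_part = uric_no_slash *uric
OPAQUE_PART = re.compile(rf"{URIC_NO_SLASH}{URIC}*")


def is_valid_opaque_part(text: str) -> bool:
    """True if text (everything after "data:") matches opaque_part."""
    return OPAQUE_PART.fullmatch(text) is not None


escaped_form = (
    "message/rfc822,Content-Type:text/plain;charset=utf-8%0D%0A"
    "Content-Disposition:attachment;filename=%22hello%20world.txt%22"
    "%0D%0A%0D%0AHELLO%20WORLD"
)
print(is_valid_opaque_part(escaped_form))
print(is_valid_opaque_part("text/plain,hello world"))  # raw space
```

With spaces, line breaks and double quotes escaped, the example payload
matches; the unescaped ":", ";", "=", "," and "/" are all accepted by
the grammar.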

-- 
Alessandro
Received on Thursday, 1 March 2012 17:48:20 UTC