data protocol URIs (was CSS selectors and xml:id)

On Sunday, May 8, 2005, 8:30:41 PM, Ian wrote:

IH> On Sun, 8 May 2005, Bjoern Hoehrmann wrote:
>> 
>> The data:text/css,test{background:red}test#test{background:lime} is a
>> URI reference to the "test{background:lime}" fragment of the style 
>> sheet. I sure hope there is no WHATWG proposal that changes that as that
>> would be incompatible with a broad range of URI and RFC 2397 
>> implementations.

IH> Um, data: URIs have no fragment identifiers.

I believe this is correct, with a small amount of digging. See below.

IH> Why would "#" have any
IH> special meaning in data: URIs?

If they existed, they would have a clear meaning.

IH>  (Having said that, data: URIs maybe
IH> _should_ have fragment identifiers, but that's another story.)

:)

OK, now to the digging. From

RFC 2397 The "data" URL scheme
http://www.ietf.org/rfc/rfc2397.txt

the syntax is

       dataurl    := "data:" [ mediatype ] [ ";base64" ] "," data
       mediatype  := [ type "/" subtype ] *( ";" parameter )
       data       := *urlchar
       parameter  := attribute "=" value

   where "urlchar" is imported from [RFC2396], and "type", "subtype",
   "attribute" and "value" are the corresponding tokens from [RFC2045],
   represented using URL escaped encoding of [RFC2396] as necessary.
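
As a rough illustration of that grammar, here is a minimal Python sketch
that splits a data: URL into its three parts. The function name
split_data_url is my own invention, and the code assumes that the first
comma ends the mediatype (that is, it ignores the possibility of a comma
inside a quoted parameter value). It does no validation of the
individual tokens.

```python
def split_data_url(url):
    """Split a data: URL into (mediatype, is_base64, data).

    Follows the shape of the RFC 2397 production only:
        "data:" [ mediatype ] [ ";base64" ] "," data
    Assumption: the first comma terminates the mediatype.
    """
    prefix = "data:"
    if not url.lower().startswith(prefix):
        raise ValueError("not a data: URL")
    header, comma, data = url[len(prefix):].partition(",")
    if not comma:
        raise ValueError("missing ',' before the data part")
    is_base64 = header.endswith(";base64")
    if is_base64:
        header = header[: -len(";base64")]
    return header, is_base64, data
```

Note that this split, taken on its own, gives "#" no special role:
whatever follows the comma ends up in the data part.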

If there had been a fragment part, I believe that it would have been
called out in the syntax above. However, it is possible that a # is
allowed in *urlchar and, if it is, it is possible that it is interpreted
as a fragment (it's also possible that it is treated merely as another
character). Let's see.

Unfortunately, RFC 2396 does not define a urlchar token, nor does RFC
3986, which supersedes it, nor RFC 1738, which it in large part
superseded. I believe that the uric token in RFC 2396 is meant.

uric          = reserved | unreserved | escaped
reserved      = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" |
                      "$" | ","
unreserved    = alphanum | mark
mark          = "-" | "_" | "." | "!" | "~" | "*" | "'" |
                      "(" | ")"
escaped       = "%" hex hex

From that, I deduce that # is not allowed in uric and thus, if the
assumption that urlchar means uric is correct, not in urlchar either.
Thus, data URIs have no fragment part.
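
The deduction can be checked mechanically. The following is a minimal
Python sketch, assuming (as above) that urlchar means uric. The
character sets are transcribed from the RFC 2396 productions quoted
above; the names RESERVED, MARK, UNRESERVED, ESCAPED and is_uric_string
are mine, not from any RFC or library.

```python
import re
import string

# Character classes transcribed from the RFC 2396 grammar.
RESERVED = set(";/?:@&=+$,")
MARK = set("-_.!~*'()")
UNRESERVED = set(string.ascii_letters + string.digits) | MARK
ESCAPED = re.compile(r"%[0-9A-Fa-f]{2}")  # escaped = "%" hex hex


def is_uric_string(text):
    """Return True if text consists only of uric characters (*uric)."""
    i = 0
    while i < len(text):
        if ESCAPED.match(text, i):
            i += 3  # consume a complete "%" hex hex triplet
        elif text[i] in RESERVED or text[i] in UNRESERVED:
            i += 1
        else:
            return False
    return True
```

With these definitions, is_uric_string("a#b") is False while
is_uric_string("a%23b") is True, so under this reading a literal # would
have to be escaped to appear in the data part.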

Of course, in RFC 2396, URI did not have a fragment part anyway, only
URI-reference does ... perhaps if RFC 2397 only defines an absoluteURI
then the generic syntax *would* allow a fragment part?
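
For comparison, here is what one generic URI parser does with the
example from the start of this thread. This is a small sketch using
urldefrag from Python's standard urllib.parse module; the wrapper name
split_fragment is mine. It only shows the behaviour of that one
implementation and says nothing about what RFC 2397 itself requires.

```python
from urllib.parse import urldefrag


def split_fragment(uri_reference):
    """Split a URI reference at the first '#', as a generic parser does.

    Returns (uri_without_fragment, fragment).
    """
    result = urldefrag(uri_reference)
    return result.url, result.fragment


EXAMPLE = "data:text/css,test{background:red}test#test{background:lime}"
uri, fragment = split_fragment(EXAMPLE)
```

Here fragment comes out as "test{background:lime}", which matches the
reading in the quoted message: a scheme-agnostic parser treats the # as
the start of a fragment regardless of the data: grammar.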

In RFC 3986, in the table in Appendix D.2, uric is superseded by

unreserved / pct-encoded / ";" / "?" / ":" /
"@" / "&" / "=" / "+" / "$" / "," / "/"

Possibly a small revision to RFC 2397, using the ABNF from RFC 3986,
would be helpful.

-- 
 Chris Lilley                    mailto:chris@w3.org
 Chair, W3C SVG Working Group
 W3C Graphics Activity Lead

Received on Monday, 9 May 2005 11:34:42 UTC