[whatwg] Proposal for improved handling of '#' inside of data URIs from Nils Dagsson Moskopp on 2011-09-10 (public-whatwg-archive@w3.org from September 2011)

From: Nils Dagsson Moskopp <nils@dieweltistgarnichtso.net>
Date: Sun, 11 Sep 2011 01:53:50 +0200
Message-ID: <20110911015350.6316f657@desudesudesu>

Daniel Holbert <dholbert at mozilla.com> wrote on Sat, 10 Sep 2011
14:15:09 -0700:

> [...]

> Browsers handle the "#" character in data URIs very differently, and
> the arguably "correct" behavior is probably not what authors actually
> want in many cases.

Do you have any evidence for that assertion, e.g. author surveys,
occurrence in sites, number of duplicates in Mozilla Bugzilla (relative
to other common bugs)?

Anecdotally, my take: As an author, I would not think that the
semantics of "#" in URIs change depending on the scheme. Additionally,
people tend to become confused when stuff gets special-cased
arbitrarily, see the hashbang scenario.

> This could be more intuitive/do-what-I-mean if we restricted the
> cases under which "#" is treated as a fragment-ID delimiter inside of
> data URIs. In particular: when a "#" character is followed by ">" or
> "<" in a data URI, I propose that we *don't* treat the "#" as a
> delimiter, and instead just treat it as part of the encoded document.

This change would probably have to be communicated to other software
working with data URIs (Python's urlparse module comes to mind). Do you
intend to update the RFC on this point or leave that usage
non-conforming?
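
To illustrate the kind of software that would be affected, here is a
minimal sketch of how Python's generic URL splitter treats "#" in a
data URI today. It assumes Python 3, where the urlparse module is
named urllib.parse; the sample URI is made up for illustration.

```python
from urllib.parse import urlsplit

# Made-up sample: an unencoded "#" inside the document payload.
uri = 'data:text/html,<body background="#f00"></body>'

parts = urlsplit(uri)

# The generic splitter cuts at the first "#", whatever the scheme.
print(parts.scheme)    # data
print(parts.path)      # text/html,<body background="
print(parts.fragment)  # f00"></body>
```

Any library that splits generically like this would keep truncating the
document at the "#" unless it learned about the special case.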

> Now, a set of tests, to which I'll refer below:
>    http://people.mozilla.org/~dholbert/dataURIHashTests/tests_v1.xhtml

> [...]

> THE PROPOSAL & HOW IT HELPS:
> ============================
> We can help out the author by relaxing our fragment-ID-parsing rules
> a bit here.
> 
> Note that in cases where an author *accidentally* includes "#" inside 
> their data URI (e.g. <body background="#f00">), there almost
> certainly will be more content following it -- in particular, there
> will be an </html>, or an </svg>, or at least a ">" (if it's inside
> the final tag) still to come.

What's with the unencoded bracket (should be %3C) and space (should be
%20) beforehand? Why wouldn't parsing stop at those points?

If it doesn't, the given string isn't a URI anyway, or is it? If it
isn't, error recovery rules are pretty much arbitrary (looking it up in
Your Favourite Search Engine seems to be one popular way).
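
As a rough sketch of what I mean by encoding (Python 3 assumed, sample
markup made up): once the payload is fully percent-encoded, the "#"
becomes %23 along with everything else, and no fragment ambiguity is
left.

```python
from urllib.parse import quote, urlsplit

# Made-up sample markup containing "<", a space, '"', "#" and ">".
markup = '<body background="#f00">'

# safe="" percent-encodes every reserved character, "#" included.
encoded = quote(markup, safe="")
print(encoded)  # %3Cbody%20background%3D%22%23f00%22%3E

uri = "data:text/html," + encoded
print(urlsplit(uri).fragment == "")  # True: nothing is split off
```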

> So we can proactively check for >/< characters anywhere after the
> "#", and if we find them, then we can pretty safely assume that the
> author intended for the "#" to be part of the document, rather than a
> fragment-ID delimiter.

Is fragment use in data URIs possible at all? Also, my common sense
tingles: It seems to me that would be a category error. Discuss.
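
For the sake of discussion, this is how I read the quoted heuristic, as
a minimal Python sketch. The function name is hypothetical, and I am
assuming that only literal "<" and ">" count (not %3C / %3E) and that
the check starts at the first "#"; the proposal may intend otherwise.

```python
from typing import Optional, Tuple


def split_data_uri_fragment(uri: str) -> Tuple[str, Optional[str]]:
    """Hypothetical helper: return (uri_without_fragment, fragment).

    The first "#" is NOT treated as a delimiter if a literal "<" or ">"
    appears anywhere after it.
    """
    head, sep, tail = uri.partition("#")
    if not sep:
        return uri, None   # no "#" at all
    if "<" in tail or ">" in tail:
        return uri, None   # "#" is taken to be part of the document
    return head, tail      # ordinary fragment identifier


# "#" followed by markup: kept as part of the document.
print(split_data_uri_fragment('data:text/html,<body background="#f00"></body>'))

# "#" with no brackets after it: treated as a fragment delimiter.
print(split_data_uri_fragment("data:image/svg+xml,%3Csvg/%3E#foo"))
```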

> [...]
> 
> With my proposal here -- relaxing the situations under which "#"
> should be treated as a delimiter in a data URI -- I think we'd better
> match author expectations and improve the browser-compatibility
> picture.

The last point -- interoperability -- is satisfied by any widely
implemented outcome. The first point -- author expectations -- I
question. So, how often does this occur?

> Thoughts?

Interesting.

-- 
Nils Dagsson Moskopp // erlehmann
<http://dieweltistgarnichtso.net>

Received on Saturday, 10 September 2011 16:53:50 UTC