Re: [whatwg] Stripping newlines from URI attributes (fwd) from Ian Hickson on 2009-08-05 (public-html@w3.org from August 2009)

From: Ian Hickson <ian@hixie.ch>
Date: Wed, 5 Aug 2009 01:07:42 +0000 (UTC)
To: Larry Masinter <masinter@adobe.com>
Cc: public-iri@w3.org, public-html@w3.org
Message-ID: <Pine.LNX.4.62.0908050104190.6420@hixie.dreamhostps.com>

Larry,

Please find below some feedback that bears on the IRI draft you are 
working on. It looks like at least U+000A, U+000D, and U+0009 need to be 
stripped from the value entirely before parsing.
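
As a rough illustration of that stripping step, here is a minimal sketch. 
The function name stripTabsAndNewlines is hypothetical and not taken from 
any draft; it assumes the three characters are removed wherever they 
occur in the value, which is what the Firefox results below suggest.

```javascript
// Hypothetical helper: the name is an assumption, not from any spec.
function stripTabsAndNewlines(value) {
  // Remove every U+0009 (tab), U+000A (LF), and U+000D (CR),
  // wherever it occurs in the attribute value.
  return value.replace(/[\t\n\r]/g, "");
}

// The <img src> example from the forwarded thread, with a CRLF pair
// in the middle of the value.
const raw = "foo\r\nbar.jpg";
const cleaned = stripTabsAndNewlines(raw);
console.log(JSON.stringify(cleaned)); // "foobar.jpg"
```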

Here is a testcase that may help in fully reverse-engineering the 
behaviour that is actually necessary:
http://software.hixie.ch/utilities/js/live-dom-viewer/?%3C!DOCTYPE%20html%3EDo%20you%20see%20cats%3F%20%3Cimg%20src%3D%22ima%26%23x09%3B%26%23x0A%3B%26%23x0D%3Bge%22%20alt%3D%22no%22%3E
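
To read that testcase without loading it, the query string can be 
decoded as sketched here. The variable names are illustrative, and the 
character-reference step is deliberately simplified: it handles only 
hexadecimal numeric references and is not an HTML parser.

```javascript
const testcaseUrl =
  "http://software.hixie.ch/utilities/js/live-dom-viewer/?" +
  "%3C!DOCTYPE%20html%3EDo%20you%20see%20cats%3F%20%3Cimg%20src%3D%22" +
  "ima%26%23x09%3B%26%23x0A%3B%26%23x0D%3Bge%22%20alt%3D%22no%22%3E";

// Everything after the "?" is the percent-encoded markup.
const markup = decodeURIComponent(
  testcaseUrl.slice(testcaseUrl.indexOf("?") + 1)
);
console.log(markup);
// <!DOCTYPE html>Do you see cats? <img src="ima&#x09;&#x0A;&#x0D;ge" alt="no">

// Simplified assumption: decode only hex numeric character references.
const decoded = markup.replace(/&#x([0-9A-Fa-f]+);/g, (_, hex) =>
  String.fromCodePoint(parseInt(hex, 16))
);

// The src value then contains a tab, an LF, and a CR between "ima" and "ge".
const srcValue = decoded.match(/src="([^"]*)"/)[1];
console.log(JSON.stringify(srcValue)); // "ima\t\n\rge"
```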

Cheers,
-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

---------- Forwarded message ----------
Subject: Re: [whatwg] Stripping newlines from URI attributes
From: Ian Hickson <ian@hixie.ch>
To: Kartikaya Gupta <lists.whatwg@stakface.com>,
    Anne van Kesteren <annevk@opera.com>,
    Elliotte Rusty Harold <elharo@ibiblio.org>,
    Philip Taylor <excors+whatwg@gmail.com>,
    Alex Henrie <alexhenrie24@gmail.com>,
    Robert O'Callahan <robert@ocallahan.org>
Cc: whatwg@whatwg.org
Date: Wed, 5 Aug 2009 01:03:57 +0000 (UTC)

On Thu, 30 Jul 2009, Kartikaya Gupta wrote:
>
> It seems that most browsers do some sort of newline and tab removal from 
> URI attributes. For example, if you have
> 
> <img src="foo
> bar.jpg">
> 
> browsers will still render the image called "foobar.jpg" despite the 
> CRLF pair in the middle of the src attribute. The behavior actually 
> seems a bit more complex; quote from one of my co-workers who 
> investigated this:
> 
> > <img id='bar' width="288" height="48" foo="abc
> > def" src="http://m.theglobeandmail.com/image-
> > server/img//rO0ABXQAS2Z7aHR0cDovL2JldGEuaW1hZ2VzLnRoZWdsb2JlYW5kbWFpbC5jb20vaW1hZ2VzL21v
> > YmlsZS9nYW1fZmxhZy5wbmd9dDBmMjg4dA==.png" alt="img" />
> >  
> > <script type="text/javascript"> 
> > alert( document.getElementById('bar').getAttribute('src').indexOf('\n') );
> > alert( document.getElementById('bar').src.indexOf('\n') );
> > </script>
> >  
> > Firefox and Safari both generate two alerts, 36 and -1.
> > 
> > It seems Mozilla ignores 0x09, 0x0a, 0x0d anywhere in the URI,
> > whereas WebKit seems to ignore 0x09, 0x0a, 0x0d only in the path.
> > 
> > Try putting a CRLF inside the authority and
> > alert( document.getElementById('bar').src.indexOf('\n') );
> > 
> > will return a value other than -1 in Safari, but Safari will still fetch the image. Firefox seems to return -1 all the time.
> > 
> > Opera behaves like Firefox.
> 
> This behavior doesn't seem to be specced anywhere as far as I can tell. 
> Assuming the WEBADDRESSES spec referred to in HTML5 is the one at 
> http://www.w3.org/html/wg/href/draft.html that only says to trim 
> leading/trailing whitespace and url-encode the rest. This doesn't seem 
> to match existing behavior, so it should probably be updated.

I'll forward this e-mail to Larry, who is working on the relevant spec 
now.
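
To make the mismatch Kartikaya describes concrete, the sketch below 
contrasts the two approaches. Both function names are hypothetical, and 
the "trim and encode" variant is a simplification that models only tab, 
LF, and CR; the draft's real encoding rules cover much more.

```javascript
// Simplified model of "trim leading/trailing whitespace and
// url-encode the rest" (only tab, LF, and CR are encoded here).
function trimAndEncode(value) {
  return value.trim().replace(/[\t\n\r]/g, (c) =>
    "%" + c.charCodeAt(0).toString(16).toUpperCase().padStart(2, "0")
  );
}

// Simplified model of what browsers appear to do: drop tab, LF, and CR
// everywhere, then trim the remaining leading/trailing whitespace.
function stripThenTrim(value) {
  return value.replace(/[\t\n\r]/g, "").trim();
}

const attr = " foo\r\nbar.jpg ";
console.log(trimAndEncode(attr)); // foo%0D%0Abar.jpg
console.log(stripThenTrim(attr)); // foobar.jpg
```

Only the second result names the resource that browsers actually fetch 
in the example above.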


> On a related note, I was wondering if all these "spin-off" specs could 
> be listed somewhere easy to find; it took me a while to locate the web 
> addresses one and I had to use Google to find it. Putting a list at, 
> say, http://www.whatwg.org/specs/ would be handy; or even better, the 
> references section in the HTML5 spec could list them.

The references section will in due course; in the meantime, please feel 
free to construct such a list on the wiki if that would be of help.


On Thu, 30 Jul 2009, Anne van Kesteren wrote:
> 
> Any chance you could also check whether this applies to CSS, 
> XMLHttpRequest, HTTP Location, etc.? So far I've found that browsers use 
> the same URL processor everywhere (though sometimes the URL character 
> encoding flag is set to UTF-8 and cannot be changed). As such it would 
> be nice to know if that is still true here or whether this is a 
> pre-processing step specific to HTML attribute values.

Looks like yes, at least for CSS:

   <!DOCTYPE html><style>body { background: url("ima\Age"); }</style>X

...results in a background.
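
For anyone unfamiliar with the escape in that testcase: "\A" is a CSS 
hex escape for U+000A, and it ends at the "g" because "g" is not a hex 
digit. The sketch below shows the decoding; decodeCssHexEscapes is a 
hypothetical, simplified helper that handles hex escapes only and is not 
a full CSS tokenizer.

```javascript
// Simplified assumption: decode only hex escapes (1 to 6 hex digits,
// optionally followed by a single whitespace character).
function decodeCssHexEscapes(s) {
  return s.replace(/\\([0-9A-Fa-f]{1,6})[ \t\n\r\f]?/g, (_, hex) => {
    const cp = parseInt(hex, 16);
    // Zero and out-of-range values become U+FFFD in this sketch.
    return cp === 0 || cp > 0x10ffff ? "\uFFFD" : String.fromCodePoint(cp);
  });
}

// The CSS source text ima\Age, written as a JavaScript string literal.
const cssUrl = "ima\\Age";
const decodedUrl = decodeCssHexEscapes(cssUrl);
console.log(JSON.stringify(decodedUrl)); // "ima\nge"

// Removing tab, LF, and CR afterwards leaves the resource name.
const stripped = decodedUrl.replace(/[\t\n\r]/g, "");
console.log(stripped); // image
```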

On Thu, 30 Jul 2009, Philip Taylor wrote:
> 
> We should attempt to maintain compatibility with existing content, and 
> whitespace in URI attributes seems very common in existing content, 
> e.g.:
> 
> http://www.topdogphotos.com/photo-gallery/gallery11.html (newlines in
> <a href>, <img src>)
> 
> http://www.sprig.com/coyuchi_george_or_thor_hooded_baby_towel (tabs
> and &#xD;&#xA; in <img src>)
> 
> and loads more.

Thanks for looking into this.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
Received on Wednesday, 5 August 2009 01:08:20 UTC