Re: [whatwg] Stripping newlines from URI attributes (fwd) from Erik van der Poel on 2009-08-05 (public-html@w3.org from August 2009)

From: Erik van der Poel <erikv@google.com>
Date: Wed, 5 Aug 2009 14:44:10 -0700
To: Ian Hickson <ian@hixie.ch>
Cc: Larry Masinter <masinter@adobe.com>, public-iri@w3.org, public-html@w3.org
Message-ID: <c07a32650908051444l3b940e00s37d9e93e1becb164@mail.gmail.com>
Interesting. So, in addition to the HREF -> DNS/HTTP mapping, we see
differences between the browsers in their DOM behavior.

By the way, what does this WG intend to do about characters where the
browsers don't all behave the same way? The browsers are quite
consistent about TAB/CR/LF, but IE, Safari, Chrome and Opera convert \
to / in the host and path parts of the URI, while Firefox doesn't.
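
To make the two behaviours concrete, here is a rough sketch of the
pre-processing as I understand it from the reports in this thread. It
is not taken from any spec or browser source. The function names, the
convertBackslashes option, and the choice to stop converting at the
first "?" or "#" are all assumptions made for illustration.

```javascript
// Hypothetical helpers; names and option are illustrative only.

// Remove U+0009 (TAB), U+000A (LF) and U+000D (CR) anywhere in the value.
function stripTabAndNewlines(value) {
  return value.replace(/[\t\n\r]/g, "");
}

// Model of the "\" -> "/" conversion reported for IE, Safari, Chrome and
// Opera. Assumption: only the part before the first "?" or "#" is touched,
// as a stand-in for "host and path".
function backslashesToSlashes(value) {
  const cut = value.search(/[?#]/);
  const head = cut === -1 ? value : value.slice(0, cut);
  const tail = cut === -1 ? "" : value.slice(cut);
  return head.replace(/\\/g, "/") + tail;
}

// Strip first, then optionally convert backslashes.
function preprocess(value, { convertBackslashes = false } = {}) {
  const stripped = stripTabAndNewlines(value);
  return convertBackslashes ? backslashesToSlashes(stripped) : stripped;
}

console.log(preprocess("foo\r\nbar.jpg"));
console.log(
  preprocess("http://example.com\\a\\b?q=\\", { convertBackslashes: true })
);
```

With convertBackslashes left off, the sketch models the Firefox side of
the split; turning it on models the other four browsers.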

Erik

On Tue, Aug 4, 2009 at 6:07 PM, Ian Hickson<ian@hixie.ch> wrote:
>
> Larry,
>
> Please find below some feedback that bears on the IRI draft you are
> working on. It looks like at least U+000A, U+000D, and U+0009 need to be
> stripped from the value entirely before parsing.
>
> Here is a testcase that may help in fully reverse-engineering the
> behaviour that is actually necessary:
> http://software.hixie.ch/utilities/js/live-dom-viewer/?%3C!DOCTYPE%20html%3EDo%20you%20see%20cats%3F%20%3Cimg%20src%3D%22ima%26%23x09%3B%26%23x0A%3B%26%23x0D%3Bge%22%20alt%3D%22no%22%3E
>
> Cheers,
> --
> Ian Hickson               U+1047E                )\._.,--....,'``.    fL
> http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
> Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
>
> ---------- Forwarded message ----------
> Subject: Re: [whatwg] Stripping newlines from URI attributes
> From: Ian Hickson <ian@hixie.ch>
> To: Kartikaya Gupta <lists.whatwg@stakface.com>,
>    Anne van Kesteren <annevk@opera.com>,
>    Elliotte Rusty Harold <elharo@ibiblio.org>,
>    Philip Taylor <excors+whatwg@gmail.com>,
>    Alex Henrie <alexhenrie24@gmail.com>,
>    Robert O'Callahan <robert@ocallahan.org>
> Cc: whatwg@whatwg.org
> Date: Wed, 5 Aug 2009 01:03:57 +0000 (UTC)
>
> On Thu, 30 Jul 2009, Kartikaya Gupta wrote:
>>
>> It seems that most browsers do some sort of newline and tab removal from
>> URI attributes. For example, if you have
>>
>> <img src="foo
>> bar.jpg">
>>
>> browsers will still render the image called "foobar.jpg" despite the
>> CRLF pair in the middle of the src attribute. The behavior actually
>> seems a bit more complex; quote from one of my co-workers who
>> investigated this:
>>
>> > <img id='bar' width="288" height="48" foo="abc
>> > def" src="http://m.theglobeandmail.com/image-
>> > server/img//rO0ABXQAS2Z7aHR0cDovL2JldGEuaW1hZ2VzLnRoZWdsb2JlYW5kbWFpbC5jb20vaW1hZ2VzL21v
>> > YmlsZS9nYW1fZmxhZy5wbmd9dDBmMjg4dA==.png" alt="img" />
>> >
>> > <script type="text/javascript">
>> > alert( document.getElementById('bar').getAttribute('src').indexOf('\n') );
>> > alert( document.getElementById('bar').src.indexOf('\n') );
>> > </script>
>> >
>> > Firefox and Safari both generate two alerts, 36 and -1.
>> >
>> > It seems Mozilla ignores 0x09, 0x0a, 0x0d anywhere in the URI,
>> > whereas WebKit seems to ignore 0x09, 0x0a, 0x0d only in the path.
>> >
>> > Try putting a CRLF inside the authority and
>> > alert( document.getElementById('bar').src.indexOf('\n') );
>> >
>> > will return a value other than -1 in Safari, but the image is still fetched. Firefox seems to return -1 every time.
>> >
>> > Opera behaves like Firefox.
>>
>> This behavior doesn't seem to be specced anywhere as far as I can tell.
>> Assuming the WEBADDRESSES spec referred to in HTML5 is the one at
>> http://www.w3.org/html/wg/href/draft.html that only says to trim
>> leading/trailing whitespace and url-encode the rest. This doesn't seem
>> to match existing behavior, so it should probably be updated.
>
> I'll forward this e-mail to Larry, who is working on the relevant spec
> now.
>
>
>> On a related note, I was wondering if all these "spin-off" specs could
>> be listed somewhere easy to find; it took me a while to locate the web
>> addresses one and I had to use google to find it. Putting a list at,
>> say, http://www.whatwg.org/specs/ would be handy; or even better, the
>> references section in the HTML5 spec could list them.
>
> The references section will in due course; in the meantime, please feel
> free to construct such a list on the wiki if that would be of help.
>
>
> On Thu, 30 Jul 2009, Anne van Kesteren wrote:
>>
>> Any chance you could also check whether this applies to CSS,
>> XMLHttpRequest, HTTP Location, etc.? So far I've found that browsers use
>> the same URL processor everywhere (though sometimes the URL character
>> encoding flag is set to UTF-8 and cannot be changed). As such it would
>> be nice to know if that is still true here or whether this is a
>> pre-processing step specific to HTML attribute values.
>
> Looks like yes, at least for CSS:
>
>   <!DOCTYPE html><style>body { background: url("ima\Age"); }</style>X
>
> ...results in a background.
>
> On Thu, 30 Jul 2009, Philip Taylor wrote:
>>
>> We should attempt to maintain compatibility with existing content, and
>> whitespace in URI attributes seems very common in existing content,
>> e.g.:
>>
>> http://www.topdogphotos.com/photo-gallery/gallery11.html (newlines in
>> <a href>, <img src>)
>>
>> http://www.sprig.com/coyuchi_george_or_thor_hooded_baby_towel (tabs
>> and &#xD;&#xA; in <img src>)
>>
>> and loads more.
>
> Thanks for looking into this.
>
> --
> Ian Hickson               U+1047E                )\._.,--....,'``.    fL
> http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
> Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
>
>
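
The getAttribute('src') versus .src test quoted above can be
approximated outside a browser. The sketch below is only an
illustration: it uses the URL class available as a global in current
JavaScript runtimes, which strips TAB, LF and CR while parsing. That
reflects how such URL parsers behave now and says nothing certain
about the 2009 browsers discussed in this thread. The function name
and the example.com address are made up for the example.

```javascript
// Hypothetical helper: compare the raw attribute string with the parsed
// URL, mirroring getAttribute('src') versus the .src property.
function compareRawAndResolved(rawAttributeValue) {
  // What getAttribute() would hand back: the string exactly as written.
  const rawIndex = rawAttributeValue.indexOf("\n");

  // What a resolved URL looks like after parsing with the runtime's
  // URL class, which removes tabs and newlines from its input.
  const resolved = new URL(rawAttributeValue).href;
  const resolvedIndex = resolved.indexOf("\n");

  return { rawIndex, resolved, resolvedIndex };
}

// Made-up address with a newline in the middle of the path.
const result = compareRawAndResolved("http://example.com/image-\nserver/a.png");
console.log(result.rawIndex !== -1);  // the raw value still has the newline
console.log(result.resolvedIndex);    // the parsed value does not
console.log(result.resolved);
```

A raw index other than -1 next to a resolved index of -1 is the same
pattern as the two alerts reported for Firefox and Safari.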
Received on Wednesday, 5 August 2009 21:44:53 UTC