Re: FW: single percent from Martin J. Dürst on 2009-10-11 (public-iri@w3.org from October 2009)

From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
Date: Mon, 12 Oct 2009 03:05:02 +0900
To: Erik van der Poel <erikv@google.com>
CC: Larry Masinter <masinter@adobe.com>, "PUBLIC-IRI@W3.ORG" <PUBLIC-IRI@w3.org>
Message-ID: <4AD21E4E.4030904@it.aoyama.ac.jp>

Hello Erik,

Sorry to be late with my answer.

On 2009/09/29 23:39, Erik van der Poel wrote:
> On Tue, Sep 29, 2009 at 3:50 AM, "Martin J. Dürst"
> <duerst@it.aoyama.ac.jp>  wrote:
>> Hello Erik, others,
>>
>> On 2009/09/26 3:23, Erik van der Poel wrote:
>>> On Fri, Sep 25, 2009 at 10:42 AM, Larry Masinter<masinter@adobe.com>
>>>   wrote:
>>>> Some mail didn't get sent to public-IRI which should have been:
>>>>
>>>> On 2009/09/03 7:33, Larry Masinter wrote:
>>>>> Sorry to be rehashing what I think are old topics, but the discussion of
>>>>> these things seems to be scattered around on a zillion mailing lists:
>>>>>
>>>>>
>>>>>    *   I'm not sure why  http://example.com/%<http://example.com/%25>
>>>>>   should be illegal as an IRI. I remember some discussion of this, but not
>>>>> the resolution. Why not update IRI to allow it, since it seems to work in
>>>>> most systems?
>>> I think this got garbled along the way, but I assume you're talking
>>> about a percent sign (%) in the path part that is not followed by two
>>> hex digits. This does not "work in most systems". Our automated tests
>>> show that IE8 will not send the HTTP request, Safari4 escapes % as
>>> %25, while Firefox, Chrome and Opera leave the % as is.
>> Oh, interesting. I think Larry and I were assuming that there was some
>> uniform behavior at least for major browsers that we could document (instead
>> of HTML5). If there's such variation, my first proposal would be to go with
>> the most conservative variant (single percents are simply illegal ->  don't
>> send request,...). (My second proposal would be to mention more lenient
>> processing only as a MAY.)
>
> When there is a single percent in the query part, the browsers are a
> bit more consistent. MSIE, Firefox, Chrome and Opera leave it as %
> while Safari escapes it as %25.
>
> The browsers are also a bit more consistent when there is a single
> percent in the host part. MSIE, Firefox, Chrome and Opera don't emit
> any DNS or HTTP packets, while Safari sends DNS and HTTP packets with
> the % as is.

It could make sense to create a table for these. And "a bit more 
consistent" may help, or may not.
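
As a rough illustration of such a table, here is a minimal Python sketch that records only the behaviors reported in this thread for a '%' not followed by two hex digits. The names REPORTED, BROWSERS and render_table are made up for this example, the cell labels are paraphrases, and the path results were reported for IE8 and Safari4 specifically, so this is a starting point rather than a verified test result.

```python
# Behaviors as reported in this thread for a '%' that is not followed by
# two hex digits. Labels are paraphrases, not measured output.
# Note: the path results were reported for IE8 and Safari4 specifically.
REPORTED = {
    "path": {
        "MSIE": "no HTTP request sent",
        "Firefox": "left as %",
        "Chrome": "left as %",
        "Opera": "left as %",
        "Safari": "escaped as %25",
    },
    "query": {
        "MSIE": "left as %",
        "Firefox": "left as %",
        "Chrome": "left as %",
        "Opera": "left as %",
        "Safari": "escaped as %25",
    },
    "host": {
        "MSIE": "no DNS or HTTP packets",
        "Firefox": "no DNS or HTTP packets",
        "Chrome": "no DNS or HTTP packets",
        "Opera": "no DNS or HTTP packets",
        "Safari": "DNS and HTTP sent, % as is",
    },
}

BROWSERS = ["MSIE", "Firefox", "Chrome", "Opera", "Safari"]


def render_table(data=REPORTED):
    """Return a plain-text table: one row per browser, one column per URI part."""
    parts = list(data)
    rows = [["Browser"] + parts]
    for browser in BROWSERS:
        rows.append([browser] + [data[part][browser] for part in parts])
    # Pad every column to the width of its longest cell.
    widths = [max(len(row[i]) for row in rows) for i in range(len(rows[0]))]
    return "\n".join(
        "  ".join(cell.ljust(w) for cell, w in zip(row, widths)).rstrip()
        for row in rows
    )


print(render_table())
```

Keeping the data separate from the rendering would make it easy to add further test cases later.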

>>>> Martin:
>>>>
>>>> It's illegal in URIs, too. The URI and IRI syntaxes should be as
>>>> parallel as possible. In terms of implementations, it may be easy for
>>>> consumers, but for producers, it's not. It's much easier to just escape
>>>> than to go and check whether (one or) two hex digits are following
>>>> (which would change the meaning totally).
>>> Surely that depends on the type of producer. For HTML form
>>> submissions, % should be escaped as %25,
>> Yes, if you have a '%' which is simply data, you should convert it to '%25'.
>>
>>> but for HTML hrefs, the
>>> producer is also a consumer and should first check whether two hex
>>> digits follow.
>> I'm not sure what you mean here by "the producer is also a consumer". Can
>> you explain?
>
> A browser consumes HTML and produces DNS and HTTP packets (requests).

Okay. When it sends out a request to a proxy, it uses a full URI, so 
there it is a URI producer. When there's no proxy, only the path and 
query part are sent, so I'm not sure we can consider that a URI 
producer. But maybe the best way to see this is indeed to look at it as 
a relative URI.
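
To make the two cases concrete, here is a small Python sketch of what ends up on the HTTP request line with and without a proxy. The function name request_target and the via_proxy flag are hypothetical, and the sketch ignores everything else a real client does (normalization, escaping, CONNECT, and so on).

```python
from urllib.parse import urlsplit, urlunsplit


def request_target(url: str, via_proxy: bool) -> str:
    """Return the string a client would put on the HTTP request line.

    Assumption for this sketch: the fragment is never sent, and an empty
    path is sent as "/".
    """
    parts = urlsplit(url)
    path = parts.path or "/"
    if via_proxy:
        # To a proxy: the full URI (scheme, authority, path, query).
        return urlunsplit((parts.scheme, parts.netloc, path, parts.query, ""))
    # Directly to the origin server: only path and query.
    return path + ("?" + parts.query if parts.query else "")


print(request_target("http://example.com/a?b=1#frag", via_proxy=True))
print(request_target("http://example.com/a?b=1#frag", via_proxy=False))
```

In the second case the output is not a full URI, which is why it may be better viewed as a relative reference.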

>>> The big question is what to do about a % sign that is
>>> not followed by two hex digits. The major browsers currently handle
>>> this differently, so producers would be wise to avoid this,
>> Very much so indeed. Even if major browsers handled this all the same way,
>> there's much more than just major browsers that process URIs or IRIs.
>
> Yes, I agree that single percents should be strongly discouraged (but
> we need to document and try to standardize the behavior of producers
> that are also consumers).

"consumers that are also producers" and "browsers" may not exactly be 
the same. The first is potentially much wider.
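
The check discussed above (is a '%' followed by two hex digits or not?) can be sketched in Python as follows. The function names are invented for this example. The two policies shown correspond loosely to the behaviors reported in this thread (escape as %25, or refuse), and the sketch does not claim that either one is what a spec should require.

```python
import re

# A '%' that is NOT followed by two hex digits (a "single" or "lone" percent).
LONE_PERCENT = re.compile(r"%(?![0-9A-Fa-f]{2})")


def has_lone_percent(s: str) -> bool:
    """True if s contains a '%' that does not start a %XX escape."""
    return LONE_PERCENT.search(s) is not None


def escape_lone_percents(s: str) -> str:
    """Lenient policy: rewrite each lone '%' as '%25', keep valid escapes."""
    return LONE_PERCENT.sub("%25", s)


def reject_lone_percents(s: str) -> str:
    """Conservative policy: refuse input that contains a lone '%'."""
    if has_lone_percent(s):
        raise ValueError("'%' not followed by two hex digits: " + repr(s))
    return s


print(escape_lone_percents("/100%/a%20b"))
```

Note that a producer holding a '%' that is simply data cannot use this check, because "%41" as data would be left alone and later read as "A". Such a producer has to escape every '%' unconditionally.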

>>> but it is
>>> not clear to me what advice should be given to consumer/producer
>>> implementers. Is it better to be conservative like IE and reject it?
>>> Or is it better to be forgiving like Firefox and just send out the
>>> lone % sign? (Note: this particular case is interesting, because IE is
>>> usually the forgiving one, while Firefox is the conservative one.)
>> Well, there's always the hope for progress.
>
> Yes.
>
>>>> Martin:
>>>>
>>>> I think the purpose is to %-encode '[' and ']' except for the authority
>>>> part, where they are needed for IPv6. The encoding is done because '['
>>>> and ']' are not allowed elsewhere than in IP-literal.
>>> I don't see why [ and ] should be disallowed in the path and query
>>> parts,
>> Well, currently the specs say so (the URI spec says so, and the IRI spec
>> follows it).
>
> That may be true, but is there a good reason why those are disallowed?

I think it's historical. '[' and ']' were originally not allowed at all. 
The reason for this is that they are in positions that ISO 646 allows 
national and other variants to change. But I don't think the ISO 646 
argument is relevant anymore. Later, they were introduced to express 
IPv6 addresses only. I think we may look at whether to allow them 
formally in other parts.


>>> but the major browsers currently handle those characters
>>> differently in the path/query. (Some browsers %-encode, others don't.)
>> Can you give details?
>
> Path: MSIE, Chrome and Opera leave [ and ] as is, while Firefox and
> Safari %-encode them.
>
> Query: MSIE, Firefox, Chrome and Opera leave [ and ] as is, while
> Safari %-encodes them.

Okay, I think that's variation we should be able to handle, in 
particular if we choose to officially allow them in other parts. It's 
always possible to escape characters that are not yet escaped, except in 
places where the escaping serves as an explicit distinction between the 
reserved (syntactic delimiter) and unreserved (payload) function of a 
character.
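
As a minimal sketch of the escaping side of this, the following Python code %-encodes '[' and ']' in the path and query while leaving the authority alone, so that an IPv6 literal keeps its brackets. The function name escape_brackets is hypothetical, and the sketch assumes the input is otherwise well-formed.

```python
from urllib.parse import urlsplit, urlunsplit


def escape_brackets(url: str) -> str:
    """%-encode '[' and ']' in path and query, but not in the authority."""
    parts = urlsplit(url)

    def enc(component: str) -> str:
        return component.replace("[", "%5B").replace("]", "%5D")

    # The authority (netloc) is passed through untouched, so the brackets
    # around an IPv6 literal survive. The fragment is also left as is.
    return urlunsplit(
        (parts.scheme, parts.netloc, enc(parts.path), enc(parts.query), parts.fragment)
    )


print(escape_brackets("http://[::1]/a[1]?b[]=c"))
```

Going in the other direction (unescaping %5B and %5D) is the step that would need care if the brackets ever gained a delimiter role in those parts.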

>> Overall, I'm more and more wondering how we as editors, or a potential IETF
>> IRI WG, would deal with the kind of variability between browsers that Erik
>> is bringing up here. I thought we could just work from what HTML5 had,
>> because that reflected wide current practice among browsers, but that
>> doesn't seem to really be true.
>
> My proposal is to document the differences (between the major
> browsers). I can produce the differences for a number of test cases.

Documenting the differences is definitely a very good start.

> The hope is that the browsers will try to align.

I guess just documenting the differences won't bring them there.

> In some cases, it may
> be a good idea to discuss how to align. In particular, we are not only
> concerned about interoperability -- we are also concerned about
> security.

Yes, very much so.

> Where the browsers do not align, the documented differences serve as
> warnings to producers (to avoid those areas).

Please always remember that there are many other types of software out 
there besides browsers.

> The specs might give recommendations, and also good reasons for those
> recommendations. In particular, it would be nice to have good security
> considerations.

Fully agreed.

Regards,   Martin.

> Erik
>

-- 
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:duerst@it.aoyama.ac.jp
Received on Sunday, 11 October 2009 20:21:30 UTC