decoding during pre-processing (was Re: reviewing draft-weber-iri-guidelines-00) from Chris Weber on 2011-07-21 (public-iri@w3.org from July 2011)

From: Chris Weber <chris@lookout.net>
Date: Thu, 21 Jul 2011 12:30:22 -0700
To: "Phillips, Addison" <addison@lab126.com>, "public-iri@w3.org" <public-iri@w3.org>
Message-ID: <4E287E4E.9020008@lookout.net>

On 7/11/2011 4:35 PM, Phillips, Addison wrote:
>>
>> Should this step of unescaping be limited to the <iunreserved> set?
>>
>
> I don't see how it can be. Note that HTML processors should be removing the escapes at a higher level of processing.
>
> Just curious, why don't you mean unescaping of %-encoded values? I would think that one goal in processing an IRI would be to arrive at the "canonical" IRI.
>
> Addison

Is any unescaping or decoding necessary or safe at this point?  For IRIs, 
the percent-encoded values in the query may use a different encoding 
from those in the other components.  Unescaping percent-encoded values 
that correspond to iunreserved characters might be the one safe case, per 
<http://tools.ietf.org/html/rfc3986#section-2.4>:

    When a URI is dereferenced, the components and subcomponents
    significant to the scheme-specific dereferencing process (if any)
    must be parsed and separated before the percent-encoded octets within
    those components can be safely decoded, as otherwise the data may be
    mistaken for component delimiters.  The only exception is for
    percent-encoded octets corresponding to characters in the unreserved
    set, which can be decoded at any time.  For example, the octet
    corresponding to the tilde ("~") character is often encoded as "%7E"
    by older URI processing implementations; the "%7E" can be replaced by
    "~" without changing its interpretation.

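To make the exception quoted above concrete, here is a minimal sketch in 
Python of decoding only the percent-encoded octets that correspond to the 
ASCII unreserved set of RFC 3986 section 2.3.  The function name 
decode_unreserved is my own, and the sketch deliberately ignores the wider 
iunreserved set (ucschar), since that depends on knowing the encoding of 
the octets.

```python
import re

# RFC 3986 section 2.3: unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)

# One percent-encoded octet: "%" followed by two hex digits.
_PCT = re.compile(r"%([0-9A-Fa-f]{2})")


def decode_unreserved(ref: str) -> str:
    """Decode only the %XX triplets whose octet is an unreserved character.

    Everything else (reserved characters, "%" itself, non-ASCII octets)
    is left percent-encoded, so no new delimiters can appear.
    """
    def repl(match: re.Match) -> str:
        ch = chr(int(match.group(1), 16))
        return ch if ch in UNRESERVED else match.group(0)

    # Single left-to-right pass: "%2541" stays "%2541" and is never
    # decoded a second time into "A".
    return _PCT.sub(repl, ref)


print(decode_unreserved("http://example.com/%7Euser/a%2Fb?x=%41%26"))
```

Here "%7E" and "%41" are decoded while "%2F" and "%26" are kept, so the 
path and query still split the same way afterwards.
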
So then, to handle arbitrary encodings, pre-processing the entire string 
at this point would look similar to what's in section 3.7 of 3987bis 
<http://tools.ietf.org/html/draft-ietf-iri-3987bis-05#section-3.7>, correct?

And, taking into consideration that the string might be in a non-UTF-8 
encoding, then this would need to happen before the transcoding to UTF-8 
in step 3 at 
<http://tools.ietf.org/html/draft-weber-iri-guidelines-01#section-4>.

There could be a problem, though: 3987bis sections 3.5 and 7.2 describe 
how the query component in the http/https schemes could be handled in a 
potentially different encoding from the rest of the string.  So the 
steps referenced above for re-percent-encoding illegal UTF-8 encountered 
during decoding would need to be part of pre-processing as well... or 
would that cause problems for other components when applied to the 
entire reference string?

For example, where iso-8859-1 was the encoding used for the string:

http://www.example.com/%EF%BC%A1/?foo=%A1

Should not become:

http://www.example.com/%EF%BC%A1/?foo=%C2%A1
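
The example above can be reproduced with a rough Python sketch.  It is 
only loosely in the spirit of the re-percent-encoding step mentioned 
earlier, not a transcription of any draft text.  The helper names 
decode_valid_utf8_runs and naive_transcode are hypothetical, and treating 
a whole run of non-ASCII octets as all-or-nothing is a simplification.

```python
import re
from urllib.parse import quote, unquote, urlsplit

# A run of percent-encoded octets >= 0x80 (ASCII octets are left alone,
# so reserved characters such as "/" are never introduced here).
_NON_ASCII_RUN = re.compile(r"(?:%[89A-Fa-f][0-9A-Fa-f])+")


def decode_valid_utf8_runs(component: str) -> str:
    """Decode a run only if its octets are strictly valid UTF-8.

    Assumption for this sketch: a run that fails strict decoding is
    kept percent-encoded as a whole, rather than being salvaged in part.
    """
    def repl(match: re.Match) -> str:
        raw = bytes.fromhex(match.group(0).replace("%", ""))
        try:
            return raw.decode("utf-8")
        except UnicodeDecodeError:
            return match.group(0)

    return _NON_ASCII_RUN.sub(repl, component)


def naive_transcode(value: str, source_encoding: str = "iso-8859-1") -> str:
    """What should NOT happen: decode the octets as the document encoding,
    then re-encode the resulting characters as UTF-8."""
    return quote(unquote(value, encoding=source_encoding), safe="/?=&:")


parts = urlsplit("http://www.example.com/%EF%BC%A1/?foo=%A1")
path = decode_valid_utf8_runs(parts.path)     # valid UTF-8, becomes U+FF21
query = decode_valid_utf8_runs(parts.query)   # lone 0xA1 is not UTF-8
naive_query = naive_transcode(parts.query)

print(path, query, naive_query)
```

With this approach the query stays "foo=%A1", whereas the naive 
transcoding yields "foo=%C2%A1", which is the change the example says 
should not occur.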

Best regards,
-Chris

Received on Thursday, 21 July 2011 19:30:51 UTC