- From: Chris Weber <chris@lookout.net>
- Date: Thu, 21 Jul 2011 12:30:22 -0700
- To: "Phillips, Addison" <addison@lab126.com>, "public-iri@w3.org" <public-iri@w3.org>
On 7/11/2011 4:35 PM, Phillips, Addison wrote:
>>
>> Should this step of unescaping be limited to the <iunreserved> set?
>>
>
> I don't see how it can be. Note that HTML processors should be removing the escapes at a higher level of processing.
>
> Just curious, why don't you mean unescaping of %-encoded values? I would think that one goal in processing an IRI would be to arrive at the "canonical" IRI.
>
> Addison
Is any unescaping or decoding necessary, or even safe, at this point? An
IRI may mix encodings in its percent-encoded octets, with the query
encoded differently from the other components. Unescaping only the
percent-encoded octets that correspond to characters in iunreserved
could be acceptable, per
<http://tools.ietf.org/html/rfc3986#section-2.4>:
When a URI is dereferenced, the components and subcomponents
significant to the scheme-specific dereferencing process (if any)
must be parsed and separated before the percent-encoded octets within
those components can be safely decoded, as otherwise the data may be
mistaken for component delimiters. The only exception is for
percent-encoded octets corresponding to characters in the unreserved
set, which can be decoded at any time. For example, the octet
corresponding to the tilde ("~") character is often encoded as "%7E"
by older URI processing implementations; the "%7E" can be replaced by
"~" without changing its interpretation.
So then, to handle arbitrary encodings, pre-processing the entire string
at this point would look similar to what's in section 3.7 of 3987bis
<http://tools.ietf.org/html/draft-ietf-iri-3987bis-05#section-3.7>, correct?
And, taking into consideration that the string might be in a non-UTF-8
encoding, then this would need to happen before the transcoding to UTF-8
in step 3 at
<http://tools.ietf.org/html/draft-weber-iri-guidelines-01#section-4>.
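To make the RFC 3986 section 2.4 exception quoted above concrete, here is a minimal sketch of a pass that decodes only percent-encoded octets corresponding to the ASCII unreserved set and leaves every other escape alone. The function name decode_unreserved is made up for illustration, and the sketch deliberately does not cover the non-ASCII part of iunreserved, since that would require knowing that the escaped octets are UTF-8.

```python
import re

# RFC 3986 section 2.3: unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)

# One percent-encoded octet, e.g. "%7E" or "%7e".
_PCT = re.compile(r"%([0-9A-Fa-f]{2})")


def decode_unreserved(ref: str) -> str:
    """Hypothetical helper: decode only escapes of ASCII unreserved characters.

    Escapes of reserved characters, of "%" itself, and of octets >= 0x80
    are returned unchanged, so no component delimiter can be introduced.
    """
    def repl(match):
        ch = chr(int(match.group(1), 16))
        return ch if ch in UNRESERVED else match.group(0)

    return _PCT.sub(repl, ref)


print(decode_unreserved("http://www.example.com/%7Euser/a%2Fb?x=%41%26"))
# prints: http://www.example.com/~user/a%2Fb?x=A%26
```

Because the pass never touches octets outside the unreserved set, it can run on the whole reference before the components are separated, and it does not depend on the encoding of the document.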
There could be a problem, though: sections 3.5 and 7.2 of 3987
describe how the query component in the http and https schemes could be
handled in a different encoding from the rest of the string. So the
steps referenced above for re-percent-encoding illegal UTF-8 encountered
during decoding would need to be a part of pre-processing as well... or
would that cause problems for other components when applied to the
entire reference string?
For example, where ISO-8859-1 was the encoding used for the string:
    http://www.example.com/%EF%BC%A1/?foo=%A1
should not become:
    http://www.example.com/%EF%BC%A1/?foo=%C2%A1
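As a rough illustration of how the unwanted form can arise, the sketch below assumes a naive pre-processing step that unescapes a run of percent-encoded octets, interprets the octets in the document encoding, and re-escapes the result as UTF-8. The helper names naive_transcode and is_valid_utf8 are hypothetical and are not taken from either draft.

```python
from urllib.parse import quote, unquote_to_bytes


def naive_transcode(escaped: str, charset: str) -> str:
    """Hypothetical naive step: unescape, decode as charset, re-escape as UTF-8."""
    raw = unquote_to_bytes(escaped)   # "%A1" -> b"\xa1"
    text = raw.decode(charset)        # b"\xa1" -> U+00A1 under iso-8859-1
    return quote(text, safe="")       # quote() percent-encodes as UTF-8 by default


def is_valid_utf8(escaped: str) -> bool:
    """Report whether the unescaped octets form a valid UTF-8 sequence."""
    try:
        unquote_to_bytes(escaped).decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The query value from the example is rewritten:
print(naive_transcode("%A1", "iso-8859-1"))        # prints: %C2%A1

# Applied to the path segment, the same step mangles valid UTF-8:
print(naive_transcode("%EF%BC%A1", "iso-8859-1"))  # prints: %C3%AF%C2%BC%C2%A1

print(is_valid_utf8("%EF%BC%A1"))  # prints: True
print(is_valid_utf8("%A1"))        # prints: False
```

The path segment is already valid UTF-8 while the query value is not, so under these assumptions a single pass over the whole reference string changes at least one of the two components.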
Best regards,
-Chris
Received on Thursday, 21 July 2011 19:30:51 UTC