decoding during pre-processing (was Re: reviewing draft-weber-iri-guidelines-00)

From: Chris Weber <chris@lookout.net>
Date: Thu, 21 Jul 2011 12:30:22 -0700
Message-ID: <4E287E4E.9020008@lookout.net>
To: "Phillips, Addison" <addison@lab126.com>, "public-iri@w3.org" <public-iri@w3.org>

On 7/11/2011 4:35 PM, Phillips, Addison wrote:
>> Should this step of unescaping be limited to the <iunreserved> set?
> I don't see how it can be. Note that HTML processors should be removing the escapes at a higher level of processing.
> Just curious, why don't you mean unescaping of %-encoded values? I would think that one goal in processing an IRI would be to arrive at the "canonical" IRI.

Is any unescaping or decoding necessary, or even safe, at this point?
An IRI may mix encodings: the percent-encoded octets in the query can be
in a different encoding from those in the other components.  Unescaping
percent-encoded values that fall in iunreserved might be justified by
RFC 3986, section 2.4:

    When a URI is dereferenced, the components and subcomponents
    significant to the scheme-specific dereferencing process (if any)
    must be parsed and separated before the percent-encoded octets within
    those components can be safely decoded, as otherwise the data may be
    mistaken for component delimiters.  The only exception is for
    percent-encoded octets corresponding to characters in the unreserved
    set, which can be decoded at any time.  For example, the octet
    corresponding to the tilde ("~") character is often encoded as "%7E"
    by older URI processing implementations; the "%7E" can be replaced by
    "~" without changing its interpretation.

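To make the "unreserved only" exception concrete, here is a minimal
Python sketch.  It is only an illustration under stated assumptions:
the name decode_unreserved is hypothetical, and it covers the ASCII
unreserved set of RFC 3986 rather than the wider iunreserved set,
which also includes the ucschar ranges.

```python
import re

# ASCII unreserved set from RFC 3986: ALPHA / DIGIT / "-" / "." / "_" / "~".
# Assumption: the wider <iunreserved> set (ucschar) is out of scope here.
UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)

# One percent-encoded octet, e.g. "%7E" or "%7e".
PCT_OCTET = re.compile(r"%([0-9A-Fa-f]{2})")


def decode_unreserved(ref: str) -> str:
    """Decode only the escapes that stand for unreserved characters.

    Every other escape (delimiters, "%" itself, non-ASCII octets) is
    left exactly as written, so component boundaries cannot change.
    """
    def replace(match: re.Match) -> str:
        char = chr(int(match.group(1), 16))
        return char if char in UNRESERVED else match.group(0)

    # re.sub makes a single left-to-right pass, so "%2541" stays "%2541"
    # rather than being decoded twice into "A".
    return PCT_OCTET.sub(replace, ref)


print(decode_unreserved("/%7Euser/a%2Fb?q=%41%26"))
```

With this hypothetical input, "%7E" and "%41" are decoded while "%2F"
and "%26" stay escaped, which is why this step can run before the
reference is split into components.
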
So, to handle arbitrary encodings, pre-processing the entire string at
this point would look similar to what's in section 3.7 of 3987bis
<http://tools.ietf.org/html/draft-ietf-iri-3987bis-05#section-3.7>,
correct?

And, taking into consideration that the string might be in a non-UTF-8 
encoding, then this would need to happen before the transcoding to UTF-8 
in step 3 at 

There could be a problem, though, because 3987 sections 3.5 and 7.2
describe how the query component in the http/s schemes could be handled
in a potentially different encoding from the rest of the string.  So the
steps referenced above for re-percent-encoding illegal UTF-8 encountered
during decoding would need to be a part of pre-processing as well... or
would that cause problems for other components when applied to the
entire reference string?
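
A rough Python sketch of the "keep illegal UTF-8 percent-encoded"
idea, applied to a whole reference string.  Everything here is an
assumption made for illustration: the function name is hypothetical,
a run of non-ASCII escapes is treated as all-or-nothing (a fuller
version would re-encode only the invalid octets), and the other
checks in 3987, such as excluding some characters after decoding,
are left out.

```python
import re

# A run of consecutive escapes whose octets are all >= 0x80.
# ASCII escapes such as "%2F" never match, so they stay encoded.
NON_ASCII_RUN = re.compile(r"(?:%[89A-Fa-f][0-9A-Fa-f])+")


def decode_valid_utf8_runs(ref: str) -> str:
    """Decode non-ASCII escape runs only when they are valid UTF-8.

    Simplification (assumption): a run that fails strict UTF-8
    decoding is returned unchanged in full.
    """
    def replace(match: re.Match) -> str:
        octets = bytes.fromhex(match.group(0).replace("%", ""))
        try:
            return octets.decode("utf-8")  # strict decoding
        except UnicodeDecodeError:
            return match.group(0)  # keep the original escapes

    return NON_ASCII_RUN.sub(replace, ref)


# Hypothetical input: UTF-8 escapes in the path, one ISO-8859-1 octet
# in the query.
print(decode_valid_utf8_runs("/caf%C3%A9?q=caf%E9"))
```

In this hypothetical input the path escapes decode to "é", while the
lone "%E9" in the query is not valid UTF-8 and stays percent-encoded,
so the query octets are not reinterpreted.  Whether that holds for
every component is the open question above.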

For example, where iso-8859-1 was the encoding used for the string:


Should not become:


