- From: Chris Weber <chris@lookout.net>
- Date: Thu, 21 Jul 2011 12:30:22 -0700
- To: "Phillips, Addison" <addison@lab126.com>, "public-iri@w3.org" <public-iri@w3.org>
On 7/11/2011 4:35 PM, Phillips, Addison wrote:
>>
>> Should this step of unescaping be limited to the <iunreserved> set?
>>
>
> I don't see how it can be. Note that HTML processors should be removing the escapes at a higher level of processing.
>
> Just curious, why don't you mean unescaping of %-encoded values? I would think that one goal in processing an IRI would be to arrive at the "canonical" IRI.
>
> Addison

Is any unescaping or decoding necessary or safe at this point? In an IRI, the percent-encoded octets in the query may use a different encoding from those in the other components. Unescaping percent-encoded values from the iunreserved set could be safe, per <http://tools.ietf.org/html/rfc3986#section-2.4>:

   When a URI is dereferenced, the components and subcomponents
   significant to the scheme-specific dereferencing process (if any)
   must be parsed and separated before the percent-encoded octets within
   those components can be safely decoded, as otherwise the data may be
   mistaken for component delimiters.  The only exception is for
   percent-encoded octets corresponding to characters in the unreserved
   set, which can be decoded at any time.  For example, the octet
   corresponding to the tilde ("~") character is often encoded as "%7E"
   by older URI processing implementations; the "%7E" can be replaced by
   "~" without changing its interpretation.

So then, to handle arbitrary encodings, pre-processing the entire string at this point would look similar to what's in section 3.7 of 3987 <http://tools.ietf.org/html/draft-ietf-iri-3987bis-05#section-3.7>, correct? And, taking into consideration that the string might be in a non-UTF-8 encoding, this would need to happen before the transcoding to UTF-8 in step 3 at <http://tools.ietf.org/html/draft-weber-iri-guidelines-01#section-4>. There could be a problem, though, because 3987 sections 3.5 and 7.2 describe how the query component in the http/https schemes could be handled in a potentially different encoding from the rest of the string.

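To make the RFC 3986 section 2.4 exception quoted above concrete, here is a minimal Python sketch of decoding only the percent-encoded octets that correspond to ASCII unreserved characters. The function name decode_unreserved is hypothetical (it is not taken from either draft or from any library), and the sketch deliberately ignores the non-ASCII part of iunreserved, since decoding those octets would require knowing the encoding in use.

```python
import re

# Unreserved characters per RFC 3986 section 2.3:
# ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)

# One percent-encoded octet: "%" followed by two hex digits.
_PCT_OCTET = re.compile(r"%([0-9A-Fa-f]{2})")


def decode_unreserved(reference: str) -> str:
    """Decode only the %XX escapes that map to ASCII unreserved characters.

    Hypothetical helper for illustration. Every other escape, including
    reserved delimiters such as %2F and non-ASCII octets such as %A1,
    is left exactly as it appeared in the input.
    """
    def replace(match: re.Match) -> str:
        char = chr(int(match.group(1), 16))
        return char if char in UNRESERVED else match.group(0)

    return _PCT_OCTET.sub(replace, reference)


if __name__ == "__main__":
    print(decode_unreserved("http://www.example.com/%7Euser/%2Fa/?foo=%A1"))
```

Because this touches only octets whose meaning does not depend on the component or on the character encoding, it can be applied to the whole reference string before parsing, which is not true of a general unescaping pass.
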
So the steps referenced above for re-percent-encoding illegal UTF-8 encountered during decoding would need to be a part of pre-processing as well... or would that cause problems for other components when applied to the entire reference string?

For example, where iso-8859-1 was the encoding used for the string:

   http://www.example.com/%EF%BC%A1/?foo=%A1

should not become:

   http://www.example.com/%EF%BC%A1/?foo=%C2%A1

Best regards,
-Chris
Received on Thursday, 21 July 2011 19:30:51 UTC