- From: Giovanni Campagna <scampa.giovanni@gmail.com>
- Date: Sun, 29 Mar 2009 13:37:19 +0100
(In this email I will use URL5 as shorthand for Web Addresses, since that was previously the URL part of HTML5.)

As the subject says, this is the continuation of the thread about LEIRI vs URL5 archived at <http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2009-March/018929.html>, where the discussion diverged to "good" vs "bad" standards and the adoption of URL5 in other Internet-related technologies. In this email I want to talk only about the technical differences in the processing requirements of URL5 and LEIRI.

Ian Hickson has repeatedly said that URL5 and (LE)IRI differ in their processing model, most recently in <http://lists.w3.org/Archives/Public/public-html/2009Mar/0693.html>, adding that the URL5 model is the one used by current applications. I'm not sure about the last part of his sentence, but that is outside the scope of this thread.

The current status is:
- RFC3986 defines URIs, their validity and their processing
- RFC3987 defines IRIs, their validity, their processing and their conversion to URIs
- the IRI-bis draft at <http://www.w3.org/International/iri-edit/draft-duerst-iri-bis.html> defines LEIRIs and their conversion to IRIs
- the URL5 document defines Web Addresses and their conversion to URIs

Let's see if we can find differences in those documents that really need a different technology. Hypertext locations, whether URIs, IRIs or URL5s, are sequences of characters, so the difference must be in the handling of those characters. Reasons are taken from the IRI-bis draft and from the URI RFC. Note that invalidity for URL5 does not mean a parse error.

= U+0000 - U+001F: Unicode control C0
- in a URI: invalid, must be percent-encoded. Processing: stop
- in an IRI: invalid, must be percent-encoded. Processing: stop
  Reason: "There is no way to transmit these characters reliably except potentially in electronic form.
  Even when in electronic form, some software components might silently filter out some of these characters, or may stop processing alltogether when encountering some of them. These characters may affect text display in subtle, unnoticable ways or in drastic, global, and irreversible ways depending on the hardware and software involved."
- in a LEIRI: valid. Processing to IRI: percent-encode
- in a URL5: invalid. Processing to URI: percent-encode

= " " U+0020: Space
- in a URI: invalid, must be percent-encoded. Processing: stop
- in an IRI: invalid, must be percent-encoded. Processing: stop
  Reason: "Some formats and applications use space as a delimiter, e.g. for items in a list"
- in a LEIRI: valid. Processing to IRI: percent-encode
- in a URL5: invalid. Processing to URI: percent-encode

= "<" U+003C, ">" U+003E, '"' U+0022, "\" U+005C, "^" U+005E, "`" U+0060, "{" U+007B, "|" U+007C, and "}" U+007D: Delimiters and Unwise characters
- in a URI: invalid, must be percent-encoded. Processing: stop
- in an IRI: invalid, must be percent-encoded. Processing: stop
  Reason: "Appendix C of [RFC3986] suggests the use of double-quotes ("http://example.com/") and angle brackets (<http://example.com/>) as delimiters for URIs in plain text." and also "the fact that these characters are not used in URIs or IRIs has encouraged their use outside URIs or IRIs in contexts that may include URIs or IRIs."
- in a LEIRI: valid. Processing to IRI: percent-encode
- in a URL5: invalid. Processing to URI: percent-encode
Please note also that all references in this email are delimited by "<" and ">".

= "%" U+0025: Percent sign
- in a URI: valid if followed by two characters in range [A-Fa-f0-9] (hexadecimal digits). Processing: emit a percent-encoding token
- in an IRI: valid if followed by two characters in range [A-Fa-f0-9] (hexadecimal digits). Processing: percent-decode if the character is allowed without percent-encoding, else emit a percent-encoding token.
  Processing to URI: none
- in a LEIRI: valid if followed by two characters in range [A-Fa-f0-9] (hexadecimal digits). Processing to IRI: none
- in a URL5: valid if followed by two characters in range [A-Fa-f0-9] (hexadecimal digits). Processing to URI: percent-encode

= ":", "/", "?", "#", "[", "]", "@", "!", "$", "&", "'", "(", ")", "*", "+", ",", ";", "=": Delimiters allowed in URIs
- in a URI: valid but have special meaning, else must be percent-encoded. Processing: depends on scheme-specific syntax
- in an IRI: valid but have special meaning, else must be percent-encoded. Processing: depends on scheme-specific syntax
- in a LEIRI: valid but have special meaning, else must be percent-encoded. Processing to IRI: none
- in a URL5: valid but have special meaning, *cannot be percent-encoded*. Processing to URI: "]" and "[" are automatically percent-encoded after the host part, the rest are left as-is. "#" is automatically percent-encoded in the fragment identifier.

= U+00A0-U+D7FF, U+F900-FDCF, U+FDF0-FFEF: Non-ASCII Unicode
- in a URI: invalid, must be percent-encoded. Processing: stop
- in an IRI: valid. Processing: none
- in a LEIRI: valid. Processing to IRI: none
- in a URL5: valid. Processing to URI: percent-encode

= U+200E, U+200F, U+202A-202E, U+FFF0-FFFD, U+E000-F8FF, U+F0000-FFFFD, U+100000-10FFFD, U+E0000-E0FFF: Special, Bidi, non-characters, etc.
- in a URI: invalid, must be percent-encoded. Processing: stop
- in an IRI: invalid, must be percent-encoded. Processing: stop
  Reason: "These code points provide functionality beyond that useful in a (Legacy Extended) IRI"
- in a LEIRI: valid. Processing to IRI: percent-encode
- in a URL5: valid. Processing to URI: percent-encode

= U+D800-U+DFFF: Surrogate code units
- in a URI: invalid. Processing: stop
- in an IRI: invalid. Processing: stop
- in a LEIRI: invalid. Processing to IRI: stop
  Reason: "These do not represent Unicode codepoints"
- in a URL5: invalid.
  Processing to URI: percent-encode

Summing up, the differences between URL5 and LEIRI are only about the percent sign and its use for delimiters. The fact that "%" is automatically converted to "%25" means that authors can no longer use percent-encoding to transmit those characters as plain data.

Please note that, even if sub-delims are allowed unencoded, they may have special meaning in a scheme-specific syntax. One example is "&", which is allowed in URIs but has a special meaning in the query part of HTTP URIs. How can UAs send forms with "&" in a value without causing security problems on the receiving server? The same goes for "=", "/" and "?": how can I transmit those characters? Is it forbidden to ask questions in GET forms? Or, on the other side: do I need to percent-decode twice on the receiving server? What about backward compatibility with existing server-side applications that expect to percent-decode just once?

Giovanni
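
The double-decoding concern raised above can be made concrete with a minimal Python sketch. It models only the reading of URL5 given in this email (a literal "%" always becomes "%25") against the URI/IRI/LEIRI behaviour of leaving "%" alone. The function names and the sample query string are made up for illustration, and `urllib.parse` merely stands in for a server that percent-decodes once; none of this is taken from the specifications themselves.

```python
from urllib.parse import parse_qs, quote

# Delimiters left untouched in both models (a simplification of the
# "delimiters allowed in URIs" class discussed above).
DELIMS = "=&?/:"


def to_uri_escaping_percent(address: str) -> str:
    """Hypothetical model of URL5 as read in this email: "%" becomes "%25"."""
    return quote(address, safe=DELIMS)


def to_uri_keeping_percent(address: str) -> str:
    """Hypothetical model of URI/IRI/LEIRI: an existing "%" is left as-is.

    Simplification: this does not check that "%" is followed by two
    hexadecimal digits.
    """
    return quote(address, safe=DELIMS + "%")


# The author wants to send the value "Q&A", so "&" is written as "%26".
authored = "q=Q%26A&lang=en"

escaped = to_uri_escaping_percent(authored)
kept = to_uri_keeping_percent(authored)

# A server that percent-decodes exactly once:
print(escaped, parse_qs(escaped))
print(kept, parse_qs(kept))
```

Under the first model the server sees the literal text "Q%26A" and would have to decode a second time to recover "Q&A"; under the second it gets "Q&A" after a single decode.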
Received on Sunday, 29 March 2009 05:37:19 UTC