Hey Dan,

Not sure what the right list is for this, but anyway, attached are five
e-mails regarding Web Addresses that were sent to the WHATWG list.
Hopefully you'll be able to do something with them; I couldn't work out
what concrete suggestions were being made on a quick scan.

Cheers,
-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
attached mail follows:
(In this email I will use URL5 as shorthand for Web Addresses, as that
was previously the URL part of HTML5.)

As the subject says, this is the continuation of the thread about LEIRI
vs URL5 archived at
<http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2009-March/018929.html>,
where the discussion diverged to "good" vs "bad" standards and the
adoption of URL5 in other Internet-related technologies.
In this email I want to talk only about technical differences in the
processing requirements of URL5 and LEIRI.

Ian Hickson has repeatedly said that URL5 and (LE)IRI differ in their
processing model, most recently in
<http://lists.w3.org/Archives/Public/public-html/2009Mar/0693.html>,
adding that the URL5 model is the one used by current applications.
I'm not sure about the last part of his sentence, but this is outside
the scope of this thread.

The current status is:
- RFC3986 to define URIs, their validity and their processing
- RFC3987 to define IRIs, their validity, their processing and their
  conversion to URIs
- the IRI-bis draft at
  <http://www.w3.org/International/iri-edit/draft-duerst-iri-bis.html>
  to define LEIRIs and their conversion to IRIs
- the URL5 document, to define Web Addresses and their conversion to
  URIs

Let's see if we can find differences between those documents that
really call for a different technology.
Hypertext locations, whether URIs, IRIs or URL5s, are sequences of
characters, so any difference must lie in the handling of those
characters.
Reasons are taken from the IRI-bis draft and from the URI RFC. Note that
invalidity for URL5 does not mean parse error.

= U+0000 - U+001F: Unicode control C0:
- in a URI: invalid, must be percent-encoded. Processing: stop
- in an IRI: invalid, must be percent-encoded. Processing: stop
  Reason: "There is no way to transmit these characters reliably except
  potentially in electronic form. Even when in electronic form, some
  software components might silently filter out some of these
  characters, or may stop processing altogether when encountering some
  of them. These characters may affect text display in subtle,
  unnoticeable ways or in drastic, global, and irreversible ways
  depending on the hardware and software involved."
- in a LEIRI: valid. Processing to IRI: percent-encode
- in a URL5: invalid. Processing to URI: percent-encode

= " " U+0020: Space
- in a URI: invalid, must be percent-encoded. Processing: stop
- in an IRI: invalid, must be percent-encoded. Processing: stop
  Reason: "Some formats and applications use space as a delimiter, e.g.
  for items in a list"
- in a LEIRI: valid. Processing to IRI: percent-encode
- in a URL5: invalid. Processing to URI: percent-encode

= "<" U+003C, ">" U+003E, '"' U+0022, "\" U+005C, "^" (U+005E),
  "`" (U+0060), "{" (U+007B), "|" (U+007C), and "}" (U+007D):
  Delimiters and Unwise characters
- in a URI: invalid, must be percent-encoded. Processing: stop
- in an IRI: invalid, must be percent-encoded. Processing: stop
  Reason: "Appendix C of [RFC3986] suggests the use of double-quotes
  ("http://example.com/") and angle brackets (<http://example.com/>) as
  delimiters for URIs in plain text." and "Also, the fact that these
  characters are not used in URIs or IRIs has encouraged their use
  outside URIs or IRIs in contexts that may include URIs or IRIs."
- in a LEIRI: valid. Processing to IRI: percent-encode
- in a URL5: invalid. Processing to URI: percent-encode
Please note also that all references in this email are delimited by "<"
and ">".

= "%" U+0025: Percent sign:
- in a URI: valid if followed by two characters in range [A-Fa-f0-9]
  (hexadecimal digit). Processing: emit a percent-encoding token
- in an IRI: valid if followed by two characters in range [A-Fa-f0-9]
  (hexadecimal digit). Processing: percent-decode if the char is allowed
  without percent-encoding, else emit a percent-encoding token.
  Processing to URI: none
- in a LEIRI: valid if followed by two characters in range [A-Fa-f0-9]
  (hexadecimal digit). Processing to IRI: none.
- in a URL5: valid if followed by two characters in range [A-Fa-f0-9]
  (hexadecimal digit). Processing to URI: percent-encode

= ":" , "/" , "?" , "#" , "[" , "]" , "@", "!" , "$" , "&" , "'" , "(" ,
  ")" , "*" , "+" , "," , ";" , "=": Delimiters allowed in URIs
- in a URI: valid but have special meaning, else must be
  percent-encoded. Processing: depends on scheme-specific syntax.
- in an IRI: valid but have special meaning, else must be
  percent-encoded. Processing: depends on scheme-specific syntax.
- in a LEIRI: valid but have special meaning, else must be
  percent-encoded. Processing to IRI: none
- in a URL5: valid but have special meaning, *cannot be
  percent-encoded*. Processing to URI: "]" and "[" are automatically
  percent-encoded after the host part; the rest is left as-is. "#" is
  automatically percent-encoded in the fragment identifier.

= U+00A0-U+D7FF, U+F900-FDCF, U+FDF0-FFEF: Non-ASCII Unicode
- in a URI: invalid, must be percent-encoded. Processing: stop
- in an IRI: valid. Processing: none
- in a LEIRI: valid. Processing to IRI: none
- in a URL5: valid. Processing to URI: percent-encode

= U+200E, U+200F, U+202A-202E, U+FFF0-FFFD, U+E000-F8FF, U+F0000-FFFFD,
  U+100000-10FFFD, U+E0000-E0FFF: Special, Bidi, non chars, etc.
- in a URI: invalid, must be percent-encoded. Processing: stop
- in an IRI: invalid, must be percent-encoded. Processing: stop
  Reason: "These code points provide functionality beyond that useful in
  a (Legacy Extended) IRI"
- in a LEIRI: valid. Processing to IRI: percent-encode
- in a URL5: valid. Processing to URI: percent-encode

= U+D800-U+DFFF: Surrogate code units
- in a URI: invalid. Processing: stop
- in an IRI: invalid. Processing: stop
- in a LEIRI: invalid. Processing to IRI: stop
  Reason: "These do not represent Unicode codepoints"
- in a URL5: invalid. Processing to URI: percent-encode

Summing up, the differences between URL5 and LEIRI are only about the
percent sign and its use for delimiters.
The fact that "%" is automatically converted to "%25" means that authors
can no longer use percent-encoding to transmit those chars as plain
data.
Please note that, even if sub-delims are allowed unencoded, they may
have special meaning in a scheme-specific syntax.
One example is "&", which is allowed in URIs but has a special meaning
in the query part of HTTP URIs. How can UAs send forms with "&" in a
value without causing security problems on the receiving server?
The same goes for "=", "/" and "?": how can I transmit those chars? Is
it forbidden to ask questions in GET forms?
Or, on the other side: do I need to percent-decode twice on the
receiving server? What about backward compatibility with existing
server-side applications that expect to percent-decode just once?

Giovanni
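The summary in the message above can be sketched in a few lines of Python. This is only an illustration of the email's own table, not of either specification: the function names `leiri_to_iri` and `url5_as_described` are hypothetical, the character sets are copied from the list in the message, and the "%" to "%25" step reflects what the message claims URL5 does, which has not been checked against the Web Addresses draft.

```python
from urllib.parse import unquote

# ASCII characters the message lists as "valid in a LEIRI, percent-encode
# on conversion to IRI": C0 controls, space, delimiters and unwise chars.
ASCII_TO_ENCODE = set(' <>"\\^`{|}') | {chr(c) for c in range(0x00, 0x20)}

# Non-ASCII ranges from the "Special, Bidi, non chars, etc." item.
SPECIAL_RANGES = [
    (0x200E, 0x200F), (0x202A, 0x202E), (0xFFF0, 0xFFFD),
    (0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD),
    (0xE0000, 0xE0FFF),
]


def _needs_encoding(ch):
    cp = ord(ch)
    return ch in ASCII_TO_ENCODE or any(lo <= cp <= hi for lo, hi in SPECIAL_RANGES)


def _pct(ch):
    # Percent-encode the UTF-8 bytes of a single character.
    return "".join("%{:02X}".format(b) for b in ch.encode("utf-8"))


def leiri_to_iri(text):
    """Hypothetical helper: encode only the characters listed above.

    An existing "%" is left untouched ("Processing to IRI: none").
    """
    return "".join(_pct(ch) if _needs_encoding(ch) else ch for ch in text)


def url5_as_described(text):
    """Hypothetical helper: same, but "%" becomes "%25" first,
    which is the behaviour the message attributes to URL5."""
    return leiri_to_iri(text.replace("%", "%25"))


author_input = "q=a%26b"  # the author pre-encoded "&" as plain data
print(leiri_to_iri(author_input))                # escape kept as written
print(url5_as_described(author_input))           # escape itself is escaped
print(unquote(url5_as_described(author_input)))  # one decode is not enough
```

Under these assumptions, a server that decodes once gets the intended "&" back in the first case but only "%26" in the second, which is the double-decoding concern raised at the end of the message.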
attached mail follows:
On Sun, 29 Mar 2009 14:37:19 +0200, Giovanni Campagna
<scampa.giovanni@gmail.com> wrote:
> Summing up, the differences between URL5 and LEIRI are only about the
> percent sign and its uses for delimiters.

I'm not sure if you're correct about those differences, but even if you
are they are not the only differences. E.g. LEIRIs perform normalization
if the input encoding is non-Unicode. URLs do not. URLs can encode their
query component per the input encoding (and do so for HTML and some
APIs). LEIRIs cannot.

(Also, I'm not sure if the WHATWG list is the right place to discuss
this as the editor of the new draft might not read this list at all.)

-- 
Anne van Kesteren
http://annevankesteren.nl/
attached mail follows:
2009/3/29 Anne van Kesteren <annevk@opera.com>:
> On Sun, 29 Mar 2009 14:37:19 +0200, Giovanni Campagna
> <scampa.giovanni@gmail.com> wrote:
>>
>> Summing up, the differences between URL5 and LEIRI are only about the
>> percent sign and its uses for delimiters.
>
> I'm not sure if you're correct about those differences, but even if you are
> they are not the only differences. E.g. LEIRIs perform normalization if the
> input encoding is non-Unicode. URLs do not. URLs can encode their query
> component per the input encoding (and do so for HTML and some APIs). LEIRIs
> cannot.

What is the problem with normalization? Is there a standard for
conversion from non-Unicode to Unicode?
I guess not, so normalization (which should always be done) is perfectly
legal.
In addition, IRIs are defined as a sequence of Unicode codepoints. It
does not matter how those codepoints are stored (ASCII, ISO-8859-1,
UTF-8), only the Unicode version of them.
This is the same as URL5s, by the way, because neither of them is
defined on octets and both use the RFC3986 method for percent-encoding
(using UTF-8).

> (Also, I'm not sure if the WHATWG list is the right place to discuss this as
> the editor of the new draft might not read this list at all.)
>

Unfortunately, I cannot join the public-html list. I could cross-post
this to www-html or www-archive, but it would break the archives and
make the thread difficult to follow.

> --
> Anne van Kesteren
> http://annevankesteren.nl/
>

Giovanni
attached mail follows:
On Sun, 29 Mar 2009 15:01:51 +0200, Giovanni Campagna
<scampa.giovanni@gmail.com> wrote:
> 2009/3/29 Anne van Kesteren <annevk@opera.com>:
>> I'm not sure if you're correct about those differences, but even if you
>> are they are not the only differences. E.g. LEIRIs perform
>> normalization if the input encoding is non-Unicode. URLs do not. URLs
>> can encode their query
>> component per the input encoding (and do so for HTML and some APIs).
>> LEIRIs cannot.
>
> What is the problem with normalization? Is there a standard for
> conversion from non-Unicode to Unicode?
> I guess not, so normalization (which should always be done) is perfectly
> legal.

It's about Unicode Normalization. (And it should not always be done.)

> In addition, IRIs are defined as a sequence of Unicode codepoints. It
> does not matter how those codepoints are stored (ASCII, ISO-8859-1,
> UTF-8), only the Unicode version of them.

Please read the IRI specification again. Specifically section 3.1.

> This is the same as URL5s, by the way, because neither of them is defined
> on octets and both use the RFC3986 method for percent-encoding (using
> UTF-8)

No, it's not always using UTF-8.

-- 
Anne van Kesteren
http://annevankesteren.nl/
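The point that percent-encoding is "not always using UTF-8" can be illustrated with a small Python sketch. It uses the standard library's `urllib.parse.quote` purely as a stand-in for "encode the query per the input encoding"; it does not reproduce any browser's or specification's exact algorithm, and the sample value is made up for the illustration.

```python
from urllib.parse import quote

value = "\u00e9"  # "é", LATIN SMALL LETTER E WITH ACUTE

# Same character, percent-encoded from two different byte encodings.
utf8_form = quote(value, encoding="utf-8")         # bytes C3 A9
latin1_form = quote(value, encoding="iso-8859-1")  # byte E9

print(utf8_form)    # %C3%A9
print(latin1_form)  # %E9
```

A server that assumes UTF-8 will not recover "é" from the second form, so the encoding used for the query component is part of the processing model rather than a storage detail.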
attached mail follows:
2009/3/29 Anne van Kesteren <annevk@opera.com>:
> On Sun, 29 Mar 2009 15:01:51 +0200, Giovanni Campagna
> <scampa.giovanni@gmail.com> wrote:
>>
>> 2009/3/29 Anne van Kesteren <annevk@opera.com>:
>>>
>>> I'm not sure if you're correct about those differences, but even if you
>>> are they are not the only differences. E.g. LEIRIs perform normalization if
>>> the input encoding is non-Unicode. URLs do not. URLs can encode their query
>>> component per the input encoding (and do so for HTML and some APIs).
>>> LEIRIs cannot.
>>
>> What is the problem with normalization? Is there a standard for
>> conversion from non-Unicode to Unicode?
>> I guess not, so normalization (which should always be done) is perfectly
>> legal.
>
> It's about Unicode Normalization. (And it should not always be done.)

If I convert from ISO-8859-1 and find "À" (decimal 192), I can emit
"À" U+00C0 LATIN CAPITAL LETTER A WITH GRAVE
or
"A" U+0041 LATIN CAPITAL LETTER A followed by " ̀" U+0300 COMBINING GRAVE
ACCENT
One is NFC, the other is NFD, and both are legal and simple.

>> In addition, IRIs are defined as a sequence of Unicode codepoints. It
>> does not matter how those codepoints are stored (ASCII, ISO-8859-1,
>> UTF-8), only the Unicode version of them.
>
> Please read the IRI specification again. Specifically section 3.1.

The specification says that an IRI must be in normalized UCS when
created from user input; otherwise it must be converted to Unicode if it
is not already (and the conversion should be normalizing); otherwise it
must be converted from UTF-8/16/32 to UCS but not normalized.
I don't see a particular problem in this.

>> This is the same as URL5s, by the way, because neither of them is defined
>> on octets and both use the RFC3986 method for percent-encoding (using
>> UTF-8)
>
> No, it's not always using UTF-8.

RFC3986 never creates percent-encoding (percent-encoding is used for
unspecified binary data) but says that text components should be encoded
as UTF-8 and that rules are established by scheme-specific syntaxes.

> --
> Anne van Kesteren
> http://annevankesteren.nl/
>

Giovanni

Received on Thursday, 30 April 2009 23:51:04 UTC
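The "À" example from the last message can be checked with Python's standard `unicodedata` module. This is a minimal sketch of the NFC/NFD distinction only; whether and when a converter should normalize is the open question in the thread, and the code takes no position on it.

```python
import unicodedata
from urllib.parse import quote

# Byte 192 in ISO-8859-1 decodes to U+00C0.
decoded = bytes([192]).decode("iso-8859-1")

nfc = unicodedata.normalize("NFC", decoded)  # single precomposed code point
nfd = unicodedata.normalize("NFD", decoded)  # "A" followed by U+0300

print([hex(ord(c)) for c in nfc])  # ['0xc0']
print([hex(ord(c)) for c in nfd])  # ['0x41', '0x300']

# The two forms render alike but percent-encode (as UTF-8) differently.
print(quote(nfc))  # %C3%80
print(quote(nfd))  # A%CC%80
```

Because the two forms yield different percent-encoded strings, the choice of normalization form is visible to whatever receives the resulting URI.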