Web Addresses feedback from Ian Hickson on 2009-04-30 (www-archive@w3.org from April 2009)

From: Ian Hickson <ian@hixie.ch>
Date: Thu, 30 Apr 2009 23:50:25 +0000 (UTC)
To: danc@w3.org
Cc: www-archive@w3.org
Message-ID: <Pine.LNX.4.62.0904302349111.18851@hixie.dreamhostps.com>

Hey Dan,

Not sure what the right list is for this, but anyway, attached are five 
e-mails regarding Web Addresses that were sent to the WHATWG list. 
Hopefully you'll be able to do something with them; I couldn't work out 
what concrete suggestions were being made on a quick scan.

Cheers,
-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Forwarded message 1

From: Giovanni Campagna <scampa.giovanni@gmail.com>
Date: Sun, 29 Mar 2009 13:37:19 +0100
Subject: [whatwg] Web Addresses vs Legacy Extended IRI (again)
To: WHATWG List <whatwg@whatwg.org>
Message-ID: <65307430903290537h40f602e9t22c611c20055955f@mail.gmail.com>

(In this email I will use URL5 as shorthand for Web Addresses, as that
was previously the URL part of HTML5.)
As subject says, this is the continuation of the thread about LEIRI vs
URL5 archived at
<http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2009-March/018929.html>,
where discussion diverged to "good" vs "bad" standard and the adoption
of URL5 in other Internet-related technologies.
In this email I want to talk only about technical differences in the
processing requirements of URL5 and LEIRI.
Ian Hickson has repeatedly said that URL5 and (LE)IRI are different in
the processing model, last time in
<http://lists.w3.org/Archives/Public/public-html/2009Mar/0693.html>,
adding that the URL5 model is the one used by current applications.
I'm not sure about the last part of his sentence, but this is outside
the scope of this thread.

The current status is:
- RFC3986 to define URIs, their validity and their processing
- RFC3987 to define IRIs, their validity, their processing and their
conversion to URIs
- the IRI-bis draft at
<http://www.w3.org/International/iri-edit/draft-duerst-iri-bis.html>
to define LEIRIs and their conversion to IRIs
- the URL5 document, to define Web Addresses and their conversion to URIs

Let's see if we can find differences in those documents that
really require a different technology.
Well, hypertext locations, either URIs, IRIs or URL5s, are sequences
of characters, so the difference must be in the handling of those
characters.
Reasons are taken from the IRI-bis draft and from the URI RFC. Note
that invalidity for URL5 does not mean parse error.

= U+0000 - U+001F: Unicode control C0:
- in a URI: invalid, must be percent-encoded. Processing: stop
- in an IRI: invalid, must be percent-encoded. Processing: stop
Reason: "There is no way to transmit these characters reliably except
potentially in electronic form. Even when in electronic form, some
software components might silently filter out some of these
characters, or may stop processing alltogether when encountering some
of them. These characters may affect text display in subtle,
unnoticable ways or in drastic, global, and irreversible ways
depending on the hardware and software involved."
- in a LEIRI: valid. Processing to IRI: percent-encode
- in a URL5: invalid. Processing to URI: percent-encode

= " " U+0020: Space
- in a URI: invalid, must be percent-encoded. Processing: stop
- in an IRI: invalid, must be percent-encoded. Processing: stop
Reason: "Some formats and applications use space as a delimiter, e.g.
for items in a list"
- in a LEIRI: valid. Processing to IRI: percent-encode
- in a URL5: invalid. Processing to URI: percent-encode

= "<" U+003C, ">" U+003E, '"' U+0022, "\" U+005C, "^" (U+005E), "`"
(U+0060), "{" (U+007B), "|" (U+007C), and "}" (U+007D): Delimiters and
Unwise characters
- in a URI: invalid, must be percent-encoded. Processing: stop
- in an IRI: invalid, must be percent-encoded. Processing: stop
Reason: "Appendix C of [RFC3986] suggests the use of double-quotes
("http://example.com/") and angle brackets (<http://example.com/>) as
delimiters for URIs in plain text." and "Also, "the fact that these
characters are not used in URIs or IRIs has encouraged their use
outside URIs or IRIs in contexts that may include URIs or IRIs."
- in a LEIRI: valid. Processing to IRI: percent-encode
- in a URL5: invalid. Processing to URI: percent-encode
Please note also that all references in this email are delimited by "<" and ">"

= "%" U+0025: Percent sign:
- in a URI: valid if followed by two characters in range [A-Fa-f0-9]
(hexadecimal digit). Processing: emit a percent-encoding token
- in an IRI: valid if followed by two characters in range [A-Fa-f0-9]
(hexadecimal digit). Processing: percent-decode if the char is allowed
without percent-encoding, else emit a percent-encoding token.
Processing to URI: none
- in a LEIRI: valid if followed by two characters in range [A-Fa-f0-9]
(hexadecimal digit). Processing to IRI: none.
- in a URL5: valid if followed by two characters in range [A-Fa-f0-9]
(hexadecimal digit). Processing to URI: percent-encode

= ":" , "/" , "?" , "#" , "[" , "]" , "@", "!" , "$" , "&" , "'" , "("
, ")" , "*" , "+" , "," , ";" , "=": Delimiters allowed in URIs
- in a URI: valid but have special meaning, else must be
percent-encoded. Processing: depends on scheme-specific syntax.
- in an IRI: valid but have special meaning, else must be
percent-encoded. Processing: depends on scheme-specific syntax.
- in a LEIRI: valid but have special meaning, else must be
percent-encoded. Processing to IRI: none
- in a URL5: valid but have special meaning, *cannot be
percent-encoded*. Processing to URI: "]" , "[" are automatically
percent-encoded after the host part, the rest is left as-is. "#" is
automatically percent-encoded in the fragment identifier.

= U+00A0-U+D7FF , U+F900-FDCF , U+FDF0-FFEF : Non-ASCII Unicode
- in a URI: invalid, must be percent-encoded. Processing: stop
- in an IRI: valid. Processing: none
- in a LEIRI: valid. Processing to IRI: none
- in a URL5: valid. Processing to URI: percent-encode

= U+200E, U+200F, U+202A-202E, U+FFF0-FFFD, U+E000-F8FF,
U+F0000-FFFFD, U+100000-10FFFD, U+E0000-E0FFF: Special, Bidi, non
chars, etc.
- in a URI: invalid, must be percent-encoded. Processing: stop
- in an IRI: invalid, must be percent-encoded. Processing: stop
Reason: "These code points provide functionality beyond that useful in
a (Legacy Extended) IRI"
- in a LEIRI: valid. Processing to IRI: percent-encode
- in a URL5: valid. Processing to URI: percent-encode

= U+D800-U+DFFF: Surrogate code units
- in a URI: invalid. Processing: stop
- in an IRI: invalid. Processing: stop
- in a LEIRI: invalid. Processing to IRI: stop
Reason: "These do not represent Unicode codepoints"
- in a URL5: invalid. Processing to URI: percent-encode

Summing up, the differences between URL5 and LEIRI are only about the
percent sign and its uses for delimiters.
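
As an illustration of the step the lists above have in common, here is a
minimal Python sketch of percent-encoding the characters that a URI does not
allow. It is an assumption-laden simplification, not the algorithm of any of
the four documents: the helper names (to_uri, percent_encode_char) are
hypothetical, UTF-8 is assumed throughout, "%" is passed through untouched
(the LEIRI-style reading), and surrogates and the "[" / "]" special cases are
not handled.

```python
# ASCII characters from the lists above that get percent-encoded:
# C0 controls, space, and the delimiter / "unwise" characters.
ASCII_TO_ENCODE = set(map(chr, range(0x00, 0x21))) | set('<>"\\^`{|}')


def percent_encode_char(ch: str) -> str:
    """Percent-encode one character as its UTF-8 octets."""
    return "".join(f"%{b:02X}" for b in ch.encode("utf-8"))


def to_uri(text: str) -> str:
    """Hypothetical helper: encode disallowed ASCII and all non-ASCII."""
    out = []
    for ch in text:
        if ch in ASCII_TO_ENCODE or ord(ch) > 0x7F:
            out.append(percent_encode_char(ch))
        else:
            out.append(ch)  # "%" and the URI delimiters pass through as-is
    return "".join(out)


print(to_uri("http://example.com/a b<c>/é?x=%20"))
# http://example.com/a%20b%3Cc%3E/%C3%A9?x=%20
```

Under the URL5 reading described above, the only change to this sketch would
be that "%" is also encoded (to "%25"), which is the difference discussed
next.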

The fact that "%" is automatically converted to "%25" means that
authors can no longer use percent-encoding to allow transmission of
those chars as plain data. Please note that, even if sub-delims are
allowed unencoded, they may have special meaning in a
scheme-specific syntax.
One example is "&", which is allowed in URIs, but has a special
meaning in the query-part of HTTP URIs. How can UAs send forms with
"&" in value without causing security problems on the receiving
server? The same goes for "=", "/" , "?": how can I transmit those chars?
Is it forbidden to ask questions in GET forms?
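
For reference, a minimal Python sketch of how form values containing "&", "="
and "?" are usually carried in a query string today, using the standard
urllib.parse module. The field name "q" and the sample value are made up for
illustration; the point is only that this round trip depends on "%" being
read as the start of a percent-encoding.

```python
from urllib.parse import parse_qs, urlencode

# A form value that contains query delimiters and a question mark.
query = urlencode({"q": "why a&b=c?"})
print(query)            # q=why+a%26b%3Dc%3F

# The receiving side splits on "&" and "=" first, then decodes once.
print(parse_qs(query))  # {'q': ['why a&b=c?']}
```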

Or on the other side: do I need to percent-decode twice on the
receiving server? What about backward-compatibility with existing
server-side applications that expect to percent-decode just once?
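
The double-decoding concern can be sketched in Python as follows. This only
shows what would follow if "%" were always re-encoded as "%25", which is the
reading of URL5 given in this email; it is not a claim about what any
specification or browser actually does.

```python
from urllib.parse import quote, unquote

once = quote("a&b", safe="")   # author encodes "&" by hand: 'a%26b'
twice = quote(once, safe="")   # "%" is encoded again:       'a%2526b'

# A server that decodes once gets the encoded form back, not the data.
assert unquote(twice) == "a%26b"

# Only a second decoding pass recovers the original value.
assert unquote(unquote(twice)) == "a&b"
```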

Giovanni

Forwarded message 2

From: Anne van Kesteren <annevk@opera.com>
Date: Sun, 29 Mar 2009 14:46:38 +0200
Subject: Re: [whatwg] Web Addresses vs Legacy Extended IRI (again)
To: "Giovanni Campagna" <scampa.giovanni@gmail.com>, "WHATWG List" <whatwg@whatwg.org>
Message-ID: <op.urj0f0y164w2qv@annevk-t60.oslo.opera.com>

On Sun, 29 Mar 2009 14:37:19 +0200, Giovanni Campagna  
<scampa.giovanni@gmail.com> wrote:
> Summing up, the differences between URL5 and LEIRI are only about the
> percent sign and its uses for delimiters.

I'm not sure if you're correct about those differences, but even if you  
are they are not the only differences. E.g. LEIRIs perform normalization  
if the input encoding is non-Unicode. URLs do not. URLs can encode their  
query component per the input encoding (and do so for HTML and some APIs).  
LEIRIs cannot.

(Also, I'm not sure if the WHATWG list is the right place to discuss this  
as the editor of the new draft might not read this list at all.)

-- 
Anne van Kesteren
http://annevankesteren.nl/

Forwarded message 3

From: Giovanni Campagna <scampa.giovanni@gmail.com>
Date: Sun, 29 Mar 2009 14:01:51 +0100
Subject: Re: [whatwg] Web Addresses vs Legacy Extended IRI (again)
To: Anne van Kesteren <annevk@opera.com>
Cc: WHATWG List <whatwg@whatwg.org>
Message-ID: <65307430903290601u86caa00ueafd15a6c30789f2@mail.gmail.com>

2009/3/29 Anne van Kesteren <annevk@opera.com>:
> On Sun, 29 Mar 2009 14:37:19 +0200, Giovanni Campagna
> <scampa.giovanni@gmail.com> wrote:
>>
>> Summing up, the differences between URL5 and LEIRI are only about the
>> percent sign and its uses for delimiters.
>
> I'm not sure if you're correct about those differences, but even if you are
> they are not the only differences. E.g. LEIRIs perform normalization if the
> input encoding is non-Unicode. URLs do not. URLs can encode their query
> component per the input encoding (and do so for HTML and some APIs). LEIRIs
> cannot.

What is the problem with normalization? Is there a standard for
conversion from non-Unicode to Unicode?
I guess not, so normalization (which should always be done) is perfectly legal.

In addition, IRIs are defined as a sequence of Unicode codepoints. It
does not matter how those codepoints are stored (ASCII, ISO-8859-1,
UTF-8), only the Unicode version of them.
This is the same as URL5s, by the way, because none of them is defined
on octets and both use the RFC3986 method for percent-encoding (using
UTF-8)

> (Also, I'm not sure if the WHATWG list is the right place to discuss this as
> the editor of the new draft might not read this list at all.)
>

Unfortunately, I cannot join the public-html list. I could cross-post
this to www-html or www-archive but it would break the archives and
make it difficult to follow.

> --
> Anne van Kesteren
> http://annevankesteren.nl/
>

Giovanni

Forwarded message 4

From: Anne van Kesteren <annevk@opera.com>
Date: Sun, 29 Mar 2009 15:06:35 +0200
Subject: Re: [whatwg] Web Addresses vs Legacy Extended IRI (again)
To: "Giovanni Campagna" <scampa.giovanni@gmail.com>
Cc: WHATWG List <whatwg@whatwg.org>
Message-ID: <op.urj1c9pi64w2qv@annevk-t60.oslo.opera.com>

On Sun, 29 Mar 2009 15:01:51 +0200, Giovanni Campagna  
<scampa.giovanni@gmail.com> wrote:
> 2009/3/29 Anne van Kesteren <annevk@opera.com>:
>> I'm not sure if you're correct about those differences, but even if you  
>> are they are not the only differences. E.g. LEIRIs perform  
>> normalization if the input encoding is non-Unicode. URLs do not. URLs  
>> can encode their query
>> component per the input encoding (and do so for HTML and some APIs).  
>> LEIRIs cannot.
>
> What is the problem with normalization? Is there a standard for
> conversion from non-Unicode to Unicode?
> I guess not, so normalization (which should always be done) is perfectly  
> legal.

It's about Unicode Normalization. (And it should not always be done.)


> In addition, IRIs are defined as a sequence of Unicode codepoints. It
> does not matter how those codepoints are stored (ASCII, ISO-8859-1,
> UTF-8), only the Unicode version of them.

Please read the IRI specification again. Specifically section 3.1.


> This is the same as URL5s, by the way, because none of them is defined
> on octets and both use the RFC3986 method for percent-encoding (using
> UTF-8)

No, it's not always using UTF-8.
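
A small Python sketch of why the choice of encoding matters here: the same
character percent-encodes to different octets depending on the character
encoding used. The two encodings below are chosen only as examples; which one
applies to a given query component depends on the document or API, as noted
earlier in the thread.

```python
from urllib.parse import quote

utf8_form = quote("é", encoding="utf-8")
latin1_form = quote("é", encoding="iso-8859-1")

print(utf8_form)    # %C3%A9
print(latin1_form)  # %E9
```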


-- 
Anne van Kesteren
http://annevankesteren.nl/

Forwarded message 5

From: Giovanni Campagna <scampa.giovanni@gmail.com>
Date: Sun, 29 Mar 2009 14:31:38 +0100
Subject: Re: [whatwg] Web Addresses vs Legacy Extended IRI (again)
To: Anne van Kesteren <annevk@opera.com>
Cc: WHATWG List <whatwg@whatwg.org>
Message-ID: <65307430903290631s3c54f43dva0e6797a6f11fe4a@mail.gmail.com>

2009/3/29 Anne van Kesteren <annevk@opera.com>:
> On Sun, 29 Mar 2009 15:01:51 +0200, Giovanni Campagna
> <scampa.giovanni@gmail.com> wrote:
>>
>> 2009/3/29 Anne van Kesteren <annevk@opera.com>:
>>>
>>> I'm not sure if you're correct about those differences, but even if you
>>> are they are not the only differences. E.g. LEIRIs perform normalization if
>>> the input encoding is non-Unicode. URLs do not. URLs can encode their query
>>> component per the input encoding (and do so for HTML and some APIs).
>>> LEIRIs cannot.
>>
>> What is the problem with normalization? Is there a standard for
>> conversion from non-Unicode to Unicode?
>> I guess not, so normalization (which should always be done) is perfectly
>> legal.
>
> It's about Unicode Normalization. (And it should not always be done.)

If I convert from ISO-8859-1 and find "À" (decimal 192), I can emit
"À" U+00C0 LATIN CAPITAL A WITH GRAVE or "A" U+0041 LATIN CAPITAL
LETTER A followed by " ̀" U+0300 COMBINING GRAVE ACCENT
One is NFC, the other is NFD, and both are legal and simple.
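
The two forms can be reproduced with a short Python sketch using the standard
unicodedata module. The last two lines are an added observation, assuming
UTF-8 percent-encoding: the two normalization forms give different
percent-encoded strings, so the choice is visible on the wire.

```python
import unicodedata
from urllib.parse import quote

ch = bytes([192]).decode("iso-8859-1")  # octet 192 in ISO-8859-1

nfc = unicodedata.normalize("NFC", ch)  # single code point U+00C0
nfd = unicodedata.normalize("NFD", ch)  # U+0041 followed by U+0300

print([f"U+{ord(c):04X}" for c in nfc])  # ['U+00C0']
print([f"U+{ord(c):04X}" for c in nfd])  # ['U+0041', 'U+0300']

print(quote(nfc))  # %C3%80
print(quote(nfd))  # A%CC%80
```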

>> In addition, IRIs are defined as a sequence of Unicode codepoints. It
>> does not matter how those codepoints are stored (ASCII, ISO-8859-1,
>> UTF-8), only the Unicode version of them.
>
> Please read the IRI specification again. Specifically section 3.1.

The specification says that an IRI must be in normalized UCS when created
from user input; otherwise it must be converted to Unicode if it is not
already (and the conversion should be normalizing), or else converted
from UTF-8 / 16 / 32 to UCS but not normalized.
I don't see a particular problem in this.

>> This is the same as URL5s, by the way, because none of them is defined
>> on octets and both use the RFC3986 method for percent-encoding (using
>> UTF-8)
>
> No, it's not always using UTF-8.

RFC3986 never creates percent encoding (percent-encoding is used for
unspecified binary data) but says that text components should be
encoded as UTF-8 and that rules are established by scheme-specific
syntaxes.

> --
> Anne van Kesteren
> http://annevankesteren.nl/
>

Giovanni

Received on Thursday, 30 April 2009 23:51:04 UTC